Entity Resolution for SaaS: The Missing Layer

Your customer in Stripe is an email address. In your product database, they're an account ID. In HubSpot, they're a company name. In your support tool, they're a ticket requester.

They're all the same customer. But your systems don't know that. And this disconnect is the root cause of most data quality problems in SaaS.

The problem in numbers

The average SaaS company at the $5–20M ARR stage has 4–7 core data systems: billing (Stripe/Chargebee), product database (Postgres/MySQL), CRM (HubSpot/Salesforce), support (Zendesk/Intercom), analytics (Mixpanel/Amplitude), and sometimes a data warehouse.

Each system has its own identifier for "customer." None of them agree. The result:

**3–8% duplicate customers** inflating your MRR and customer count
**Incomplete customer profiles** — you can't see usage, billing, and support data in one view
**Wrong metric calculations** — churn rates computed on inflated customer counts are artificially low
**Missed signals** — usage decay in the product DB isn't connected to the billing record, so nobody notices until the cancellation

What entity resolution actually is

Entity resolution (also called record linkage, deduplication, or identity resolution) is the process of determining which records across different systems refer to the same real-world entity.

It's a well-studied problem in computer science, with applications in healthcare (matching patient records), finance (fraud detection), and government (census deduplication). But it's surprisingly underused in SaaS, where the stakes are lower but the ROI is immediate.

The process has four stages:

Stage 1: Blocking

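You can't compare every record to every other record: with n records that is n(n-1)/2 comparisons, which is O(n^2). Blocking reduces the search space by grouping records into "blocks" based on shared attributes, so that only records within the same block are compared. For SaaS data, email domain is the most effective blocking key: records with the same email domain are likely to be the same company. The exception is shared free-mail domains such as gmail.com, which lump unrelated customers together and need a different key.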

Other useful blocking keys: normalized company name, phone number prefix, and IP address subnet (for product data).
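
To make the blocking step concrete, here is a minimal sketch that blocks on email domain. The sample records, field names, and helper functions (`email_domain`, `build_blocks`, `candidate_pairs`) are illustrative assumptions, not part of any particular product or API.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records; IDs and field names are illustrative only.
records = [
    {"id": "stripe:1", "email": "ana@acme.com"},
    {"id": "pg:17", "email": "Bob@Acme.com"},
    {"id": "hubspot:9", "email": "carol@globex.io"},
    {"id": "zendesk:4", "email": "dave@globex.io"},
    {"id": "pg:22", "email": "erin@initech.dev"},
]


def email_domain(email: str) -> str:
    """Return the lowercased domain part of an email address."""
    return email.strip().lower().rpartition("@")[2]


def build_blocks(records, key_fn):
    """Group records by blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    return blocks


def candidate_pairs(blocks):
    """Yield record pairs that share a block; other pairs are never compared."""
    for members in blocks.values():
        yield from combinations(members, 2)


blocks = build_blocks(records, lambda r: email_domain(r["email"]))
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

With five records, comparing everything would take 10 comparisons; blocking on domain leaves only the pairs inside the acme.com and globex.io blocks.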

Stage 2: Scoring

Within each block, you compute pairwise similarity scores. For SaaS entity resolution, the most useful signals are:

**Email exact match**: if two records share an email, they're almost certainly the same entity. Weight: very high.
**Company name fuzzy match**: "Acme Inc" and "Acme Inc." and "ACME INCORPORATED" are the same company. After normalizing case and punctuation, we compare names with Jaro-Winkler similarity and treat a score of 0.85 or higher as a match.
**Domain match**: records from the same email domain, even with different local parts, are likely the same company.
**Behavioral signals**: same IP ranges, same usage patterns, similar creation timestamps.
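
As a rough illustration of how these signals could combine, the sketch below implements Jaro-Winkler similarity from its standard definition and folds it into a pairwise score. The 0.85 name threshold comes from the text above; the 0.6 and 0.4 weights, the field names, and the `score_pair` function are assumptions made up for this example, and behavioral signals are left out.

```python
import re


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def normalize_name(name: str) -> str:
    """Lowercase and drop punctuation so 'Acme Inc.' equals 'acme inc'."""
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()


def score_pair(a: dict, b: dict, name_threshold: float = 0.85) -> float:
    """Combine signals into one score; the weights are illustrative only."""
    email_a, email_b = a["email"].strip().lower(), b["email"].strip().lower()
    if email_a == email_b:
        return 1.0  # exact email match dominates everything else
    score = 0.0
    name_sim = jaro_winkler(normalize_name(a["company"]), normalize_name(b["company"]))
    if name_sim >= name_threshold:
        score += 0.6 * name_sim
    if email_a.rpartition("@")[2] == email_b.rpartition("@")[2]:
        score += 0.4
    return score


a = {"email": "ana@acme.com", "company": "Acme Inc."}
b = {"email": "billing@acme.com", "company": "ACME INCORPORATED"}
print(round(score_pair(a, b), 3))
```

In practice the weights would be tuned against a hand-labeled sample of known matches and non-matches rather than picked by hand.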

Stage 3: Clustering

Scoring produces pairs: "Record A and Record B are 92% likely to be the same entity." Clustering groups these pairs into connected components. If A matches B and B matches C, then A, B, and C are all the same entity — even if A and C don't match directly.

We use union-find (disjoint set) for clustering, which runs in near-linear time. The output is a set of clusters, each representing one real-world customer.
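
A minimal union-find sketch of this step is shown below, using path compression and union by size. The `UnionFind` class, the `cluster` helper, the record IDs, and the 0.85 cutoff applied to pair scores are assumptions for illustration; a production system would choose its own threshold.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path straight at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the smaller tree under the larger one
        self.size[ra] += self.size[rb]


def cluster(scored_pairs, threshold=0.85):
    """Merge records whose pair score clears the threshold."""
    uf = UnionFind()
    for a, b, score in scored_pairs:
        uf.find(a)  # register both records so singletons are kept
        uf.find(b)
        if score >= threshold:
            uf.union(a, b)
    groups = defaultdict(set)
    for rec in list(uf.parent):
        groups[uf.find(rec)].add(rec)
    return list(groups.values())


# Hypothetical scored pairs: (record A, record B, match score).
scored = [
    ("stripe:1", "pg:17", 0.92),
    ("pg:17", "hubspot:9", 0.90),
    ("stripe:1", "hubspot:9", 0.40),  # no direct match, still merged via pg:17
    ("zendesk:4", "pg:22", 0.30),
]
clusters = cluster(scored)
```

Here `stripe:1`, `pg:17`, and `hubspot:9` end up in one cluster through the transitive link, while `zendesk:4` and `pg:22` stay as separate single-record clusters.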

Stage 4: Canonical record

Each cluster needs a single "golden record" — the canonical representation of that customer. We build this by selecting the best value for each field across all source records:

**Name**: prefer the CRM record (manually entered, usually most accurate)
**Email**: prefer the billing record (verified by payment)
**Plan/MRR**: always from the billing system (source of truth for revenue)
**Usage data**: always from the product database
**Confidence score**: computed from match quality and source reliability
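
The field-by-field preference rules above can be sketched as a priority table. The first source listed for each field follows the rules in the list; the fallback order after it, the field names, and the `build_golden_record` function are assumptions for illustration. The confidence score is omitted because its formula isn't specified here.

```python
# Preferred source per field, most trusted first. Only the first entry of
# each list comes from the rules above; the fallbacks are assumptions.
SOURCE_PRIORITY = {
    "name": ["crm", "billing", "product"],
    "email": ["billing", "crm", "product"],
    "plan": ["billing"],
    "mrr": ["billing"],
    "last_active": ["product"],
}


def build_golden_record(cluster):
    """Pick the best available value per field across a cluster's records.

    Simplification: if a source contributes several records, the last one wins.
    """
    by_source = {rec["source"]: rec for rec in cluster}
    golden = {}
    for field, sources in SOURCE_PRIORITY.items():
        golden[field] = None
        for source in sources:
            value = by_source.get(source, {}).get(field)
            if value not in (None, ""):
                golden[field] = value
                break
    return golden


# Hypothetical cluster of three source records for one customer.
cluster_records = [
    {"source": "crm", "name": "Acme Inc.", "email": "sales@acme.com"},
    {"source": "billing", "name": "acme", "email": "billing@acme.com",
     "plan": "pro", "mrr": 499},
    {"source": "product", "email": "ana@acme.com", "last_active": "2024-05-01"},
]
golden = build_golden_record(cluster_records)
```

Keeping the rules in a table rather than in code makes it easy to review and change which system is trusted for which field.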

Why SaaS companies don't do this

Entity resolution sounds like a data warehouse problem. And the traditional solution is a data warehouse: ETL everything into Snowflake, build dbt models, run deduplication queries.

That works. It also costs $200K+ in tooling and 6+ months to set up. For a company at $8M ARR with 12 engineers, that's not a reasonable investment just to get an accurate customer count.

The alternative — which is what we built at Vesh AI — is to run entity resolution directly against your source systems via read-only connections. No ETL. No warehouse. No migration. You connect Stripe and Postgres, and within hours you have a deduplicated customer list with confidence scores.

The downstream impact

Once you have resolved entities, everything else gets better:

**MRR accuracy**: duplicate customers are merged, so your MRR reflects reality
**Churn calculation**: computed on real customer count, not inflated numbers
**Customer 360**: one view that combines billing, product, and CRM data
**Anomaly detection**: cross-source signals (usage drop + approaching renewal) become visible
**Lineage**: every metric can trace back to which source records contributed to it

Entity resolution isn't glamorous. It's plumbing. But it's the plumbing that makes everything above it trustworthy. Without it, your metrics are built on sand.

Getting started

If you want to try entity resolution without a full platform:

1. Export customers from Stripe and users from your product DB
2. Join on email address (exact match gets you 70–80%)
3. For the remainder, try fuzzy matching on company name
4. Count how many "customers" collapse into fewer entities
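
A rough sketch of the exact-match part of these steps, using only the standard library, might look like the following. The sample exports and field names are hypothetical, the fuzzy-matching step is left out, and "collapse" is read here as billing customers that merge into another billing customer once emails are normalized, which is one possible interpretation.

```python
from collections import defaultdict

# Hypothetical exports; in practice these would be loaded from CSV dumps.
stripe_customers = [
    {"id": "cus_1", "email": "ana@acme.com"},
    {"id": "cus_2", "email": "Ana@Acme.com "},  # same address, different casing
    {"id": "cus_3", "email": "bob@globex.io"},
    {"id": "cus_4", "email": "carol@initech.dev"},
]
product_users = [
    {"account_id": 101, "email": "ana@acme.com"},
    {"account_id": 102, "email": "bob@globex.io"},
    {"account_id": 103, "email": "dan@umbrella.co"},
]


def norm(email: str) -> str:
    """Normalize an email for exact matching."""
    return email.strip().lower()


product_by_email = {norm(u["email"]): u["account_id"] for u in product_users}

# Group billing customers by normalized email.
groups = defaultdict(list)
for customer in stripe_customers:
    groups[norm(customer["email"])].append(customer["id"])

# Exact-match join against the product DB.
matched = [email for email in groups if email in product_by_email]
unmatched = [email for email in groups if email not in product_by_email]

# Billing customers that merged into an existing entity.
collapsed = len(stripe_customers) - len(groups)
collapse_rate = collapsed / len(stripe_customers)
print(f"{collapsed} of {len(stripe_customers)} customers collapsed ({collapse_rate:.0%})")
print(f"left for fuzzy matching: {unmatched}")
```

The `unmatched` list is the remainder that would go on to fuzzy matching on company name.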

If the answer is more than 3%, you have a data quality problem worth fixing. And if you want to automate it across all your sources with confidence scores and lineage tracking, that's exactly what Vesh AI does.