Why Your CRM Has 47 Versions of 'Microsoft Corp' (And How to Fix It)
A practical guide to CRM deduplication — why it's harder than your vendor claims, and the strategies that actually work.
I’ve been in more CRM cleanup projects than I care to count. And they all start the same way: someone pulls a quick export, glances at it, and says “oh wow, we have a lot of duplicates.”
A lot. Like, embarrassing-for-the-sales-team a lot.
The record I always remember? A single client’s CRM with 23 variations of “AT&T.” AT&T, AT & T, At&t Inc, AT&T Corp, at&t, AT&T - Enterprise, ATT, A.T.&T.… you get the idea. Every sales rep who ever touched that account left their fingerprints in the data.
That’s deduplication in a nutshell: the quiet, unglamorous work of taking a database that reflects how humans actually type things and turning it into something a machine can reason about.
Why It Gets This Bad
Nobody wakes up and decides to create 47 versions of “Microsoft Corp.” It happens slowly, then all at once.
The usual culprits:
- Manual entry — three reps, three conventions for entering company names
- Imports without validation — CSV uploads that bypass your field rules
- Integrations that don’t normalize — your marketing tool feeds leads with “ACME, Inc.” while your form captures “ACME Inc” (no comma)
- Acquired databases — merging two CRMs is a data quality event, not a migration
- No ownership — if nobody’s responsible for data hygiene, nobody does it
The longer you go without addressing it, the worse it gets. Duplicates breed more duplicates. Sales reps see their target account missing, create a new one, and now you’ve got three.
The Deduplication Toolkit
There’s no silver bullet here — a good dedup strategy uses several methods in combination.
1. Exact Match (the easy part)
Start with the obvious: same email address, same phone number, same domain. These are high-confidence matches you can merge automatically.
import pandas as pd
df = pd.read_csv("crm_export.csv")
# Find exact email dupes
dupes = df[df.duplicated(subset=["email"], keep=False)]
print(f"Exact email duplicates: {len(dupes)}")
This usually gets you 10–20% of the problem. The rest is messier.
2. Fuzzy Matching on Company Names
This is where it gets interesting. You need to compare strings that mean the same thing but look different.
from rapidfuzz import fuzz, process
companies = df["company_name"].dropna().unique().tolist()
# Find close matches for "Microsoft Corporation"
matches = process.extract(
"Microsoft Corporation",
companies,
scorer=fuzz.token_sort_ratio,
score_cutoff=85
)
for match, score, _ in matches:
print(f"{match} — score: {score}")
token_sort_ratio handles word order ("Corp Microsoft" vs "Microsoft Corp"). token_set_ratio handles extra words ("Microsoft Corporation LLC" vs "Microsoft"). Use both and compare.
3. Normalization First
Before matching anything, normalize everything. Strip legal suffixes, lowercase, remove punctuation:
import re
def normalize_company(name: str) -> str:
if not name:
return ""
name = name.lower().strip()
# Remove legal suffixes
suffixes = r"\b(inc|llc|ltd|corp|co|company|incorporated|limited)\b\.?"
name = re.sub(suffixes, "", name)
# Remove punctuation
name = re.sub(r"[^\w\s]", "", name)
# Collapse whitespace
name = re.sub(r"\s+", " ", name).strip()
return name
df["company_normalized"] = df["company_name"].apply(normalize_company)
This alone collapses a huge chunk of your duplicates before you run a single match.
4. Domain Matching
If you have websites or emails, extract the root domain and use that as a match key:
from urllib.parse import urlparse
def extract_domain(url: str) -> str:
if not url:
return ""
if not url.startswith("http"):
url = "https://" + url
try:
domain = urlparse(url).netloc
# Strip www.
return domain.replace("www.", "").lower()
except:
return ""
df["domain"] = df["website"].apply(extract_domain)
microsoft.com will match across http://www.microsoft.com, https://microsoft.com/en-us/, etc.
The Hard Part: Merging
Finding duplicates is 30% of the work. Merging them is the other 70%.
A few principles that have saved me:
Keep the oldest record as the master. It usually has more history — emails, notes, activity. If you’re uncertain, default to oldest creation date.
Don’t auto-merge above a certain confidence threshold. I usually set a human review queue for anything between 70–90% match confidence. Auto-merge >90%, flag <70% as non-duplicates, review the middle.
Log everything. Before any merge operation, snapshot the records. You will get a call from a sales rep asking why their account history disappeared, and you’ll want to be able to restore it.
Don’t merge contacts across different companies. “John Smith at Microsoft” and “John Smith at Apple” are not duplicates, even if the fuzzy matcher thinks so. Always include company in your matching criteria.
A Realistic Timeline
Here’s what a typical CRM cleanup looks like when you’re doing it properly:
| Phase | Time | What Happens |
|---|---|---|
| Export & audit | 1 day | How bad is it, actually? |
| Normalization | 2–3 days | Clean input data |
| Match candidate generation | 1–2 days | Build the merge queue |
| Human review | 3–5 days | Sales/marketing validates |
| Merge execution | 1 day | Run it, monitor for errors |
| Validation | 1–2 days | Did it work? What broke? |
Budget two weeks for a medium-sized CRM (10k–50k accounts). Larger? Add a week per order of magnitude.
Preventing the Next Mess
Cleanup is a one-time fix. Prevention is the real win.
- Duplicate detection on entry — most modern CRMs have this; turn it on if you haven’t
- Domain-based matching at lead creation — if the domain already exists, suggest the existing account
- Field validation — standardize company name format rules in your import templates
- Quarterly hygiene checks — a quick fuzzy match pass every 90 days catches drift before it compounds
- Data ownership — someone on the team has to own this. Shared responsibility = nobody’s responsibility.
The Bottom Line
Every messy CRM can be fixed. I’ve seen worse than yours, I promise. But it takes a systematic approach: normalize first, match second, merge carefully, and build the guardrails so you don’t end up back here in 18 months.
The 47 versions of Microsoft Corp aren’t a judgment on your team. They’re just what happens when humans enter data over years without guardrails. Your job is to build the guardrails.
Need help untangling a particularly gnarly dataset? Let’s talk.
— Matthew