Matthew Kassel
Back to Blog Data analytics dashboard showing graphs and charts representing CRM data cleanup
datacrmtutorial

Why Your CRM Has 47 Versions of 'Microsoft Corp' (And How to Fix It)

A practical guide to CRM deduplication — why it's harder than your vendor claims, and the strategies that actually work.

I’ve been in more CRM cleanup projects than I care to count. And they all start the same way: someone pulls a quick export, glances at it, and says “oh wow, we have a lot of duplicates.”

A lot. Like, embarrassing-for-the-sales-team a lot.

The record I always remember? A single client’s CRM with 23 variations of “AT&T.” AT&T, AT & T, At&t Inc, AT&T Corp, at&t, AT&T - Enterprise, ATT, A.T.&T.… you get the idea. Every sales rep who ever touched that account left their fingerprints in the data.

That’s deduplication in a nutshell: the quiet, unglamorous work of taking a database that reflects how humans actually type things and turning it into something a machine can reason about.

Why It Gets This Bad

Nobody wakes up and decides to create 47 versions of “Microsoft Corp.” It happens slowly, then all at once.

The usual culprits:

  • Manual entry — three reps, three conventions for entering company names
  • Imports without validation — CSV uploads that bypass your field rules
  • Integrations that don’t normalize — your marketing tool feeds leads with “ACME, Inc.” while your form captures “ACME Inc” (no comma)
  • Acquired databases — merging two CRMs is a data quality event, not a migration
  • No ownership — if nobody’s responsible for data hygiene, nobody does it

The longer you go without addressing it, the worse it gets. Duplicates breed more duplicates. Sales reps see their target account missing, create a new one, and now you’ve got three.

The Deduplication Toolkit

There’s no silver bullet here — a good dedup strategy uses several methods in combination.

1. Exact Match (the easy part)

Start with the obvious: same email address, same phone number, same domain. These are high-confidence matches you can merge automatically.

import pandas as pd

df = pd.read_csv("crm_export.csv")

# Find exact email dupes
dupes = df[df.duplicated(subset=["email"], keep=False)]
print(f"Exact email duplicates: {len(dupes)}")

This usually gets you 10–20% of the problem. The rest is messier.

2. Fuzzy Matching on Company Names

This is where it gets interesting. You need to compare strings that mean the same thing but look different.

from rapidfuzz import fuzz, process

companies = df["company_name"].dropna().unique().tolist()

# Find close matches for "Microsoft Corporation"
matches = process.extract(
    "Microsoft Corporation",
    companies,
    scorer=fuzz.token_sort_ratio,
    score_cutoff=85
)

for match, score, _ in matches:
    print(f"{match} — score: {score}")

token_sort_ratio handles word order ("Corp Microsoft" vs "Microsoft Corp"). token_set_ratio handles extra words ("Microsoft Corporation LLC" vs "Microsoft"). Use both and compare.

3. Normalization First

Before matching anything, normalize everything. Strip legal suffixes, lowercase, remove punctuation:

import re

def normalize_company(name: str) -> str:
    if not name:
        return ""
    name = name.lower().strip()
    # Remove legal suffixes
    suffixes = r"\b(inc|llc|ltd|corp|co|company|incorporated|limited)\b\.?"
    name = re.sub(suffixes, "", name)
    # Remove punctuation
    name = re.sub(r"[^\w\s]", "", name)
    # Collapse whitespace
    name = re.sub(r"\s+", " ", name).strip()
    return name

df["company_normalized"] = df["company_name"].apply(normalize_company)

This alone collapses a huge chunk of your duplicates before you run a single match.

4. Domain Matching

If you have websites or emails, extract the root domain and use that as a match key:

from urllib.parse import urlparse

def extract_domain(url: str) -> str:
    if not url:
        return ""
    if not url.startswith("http"):
        url = "https://" + url
    try:
        domain = urlparse(url).netloc
        # Strip www.
        return domain.replace("www.", "").lower()
    except:
        return ""

df["domain"] = df["website"].apply(extract_domain)

microsoft.com will match across http://www.microsoft.com, https://microsoft.com/en-us/, etc.

The Hard Part: Merging

Finding duplicates is 30% of the work. Merging them is the other 70%.

A few principles that have saved me:

Keep the oldest record as the master. It usually has more history — emails, notes, activity. If you’re uncertain, default to oldest creation date.

Don’t auto-merge above a certain confidence threshold. I usually set a human review queue for anything between 70–90% match confidence. Auto-merge >90%, flag <70% as non-duplicates, review the middle.

Log everything. Before any merge operation, snapshot the records. You will get a call from a sales rep asking why their account history disappeared, and you’ll want to be able to restore it.

Don’t merge contacts across different companies. “John Smith at Microsoft” and “John Smith at Apple” are not duplicates, even if the fuzzy matcher thinks so. Always include company in your matching criteria.

A Realistic Timeline

Here’s what a typical CRM cleanup looks like when you’re doing it properly:

PhaseTimeWhat Happens
Export & audit1 dayHow bad is it, actually?
Normalization2–3 daysClean input data
Match candidate generation1–2 daysBuild the merge queue
Human review3–5 daysSales/marketing validates
Merge execution1 dayRun it, monitor for errors
Validation1–2 daysDid it work? What broke?

Budget two weeks for a medium-sized CRM (10k–50k accounts). Larger? Add a week per order of magnitude.

Preventing the Next Mess

Cleanup is a one-time fix. Prevention is the real win.

  • Duplicate detection on entry — most modern CRMs have this; turn it on if you haven’t
  • Domain-based matching at lead creation — if the domain already exists, suggest the existing account
  • Field validation — standardize company name format rules in your import templates
  • Quarterly hygiene checks — a quick fuzzy match pass every 90 days catches drift before it compounds
  • Data ownership — someone on the team has to own this. Shared responsibility = nobody’s responsibility.

The Bottom Line

Every messy CRM can be fixed. I’ve seen worse than yours, I promise. But it takes a systematic approach: normalize first, match second, merge carefully, and build the guardrails so you don’t end up back here in 18 months.

The 47 versions of Microsoft Corp aren’t a judgment on your team. They’re just what happens when humans enter data over years without guardrails. Your job is to build the guardrails.

Need help untangling a particularly gnarly dataset? Let’s talk.

— Matthew