How to Audit a Database You've Never Seen Before

Someone drops a database in your lap. Could be a client handoff. A new job. An acquisition. A spreadsheet that got out of control and graduated to Postgres. Whatever the story — you’re in it now, and you need to figure out what you’re actually dealing with.

This is the playbook I run every single time.

The goal isn’t to understand everything immediately. The goal is to get oriented without breaking anything, and build enough of a map to know what questions to ask next.

Phase 1: Look Before You Touch

Before writing a single query, gather context from humans.

Ask:

What does this database power? (App, reports, exports, all of the above?)
When was it last modified significantly?
Is there documentation anywhere? (Spoiler: there isn’t. Ask anyway.)
Who are the power users — the people who actually know what the data means?
Are there any tables or records I should absolutely not touch?

That last one matters. Every database has a landmine — a table that a billing system reads from live, a view that an executive’s dashboard depends on, a stored procedure that runs payroll. Find out where the landmines are before you start poking around.

Phase 2: Get the Lay of the Land

Now you can start querying. Start at the schema level — count tables, look at sizes, get a feel for the shape of the thing.

Table inventory

-- PostgreSQL
SELECT
    table_name,
    pg_size_pretty(pg_total_relation_size(quote_ident(table_name))) AS size
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY pg_total_relation_size(quote_ident(table_name)) DESC;

-- MySQL / MariaDB
SELECT
    table_name,
    table_rows,
    ROUND((data_length + index_length) / 1024 / 1024, 2) AS size_mb
FROM information_schema.tables
WHERE table_schema = DATABASE()
ORDER BY (data_length + index_length) DESC;

This tells you immediately what’s big and what’s probably important. A 50-million-row events table and a 200-row users table? You’re looking at a time-series system with a small user base. A 500k-row orders table and a 2-million-row order_line_items table? Ecommerce transactional data.

Column inspection

-- What's in the biggest tables?
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'your_table_here'
  AND table_schema = 'public'
ORDER BY ordinal_position;

Look for:

Nullable columns that probably shouldn’t be nullable — this is a data quality flag
VARCHAR columns with no length limit — free-text fields where anything might live
Generic names like data, value, info, misc — these are where the bodies are buried
Timestamp columns — created_at, updated_at, deleted_at tell you a lot about the data lifecycle

Phase 3: Profile the Data

Schema tells you what the database thinks it contains. Profiling tells you what it actually contains.

Null rates

-- How complete is each column?
SELECT
    COUNT(*) AS total_rows,
    COUNT(email) AS email_filled,
    COUNT(phone) AS phone_filled,
    COUNT(company) AS company_filled,
    ROUND(100.0 * COUNT(email) / COUNT(*), 1) AS email_pct,
    ROUND(100.0 * COUNT(phone) / COUNT(*), 1) AS phone_pct,
    ROUND(100.0 * COUNT(company) / COUNT(*), 1) AS company_pct
FROM contacts;

A 40% null rate on phone is a data collection problem. A 95% null rate on linkedin_url means that field was added later and mostly not backfilled. A 0% null rate on email either means the field is required at input or it means someone filled it with "N/A" — which leads to the next check.

Garbage value detection

-- Find placeholder/junk values
SELECT email, COUNT(*) AS cnt
FROM contacts
WHERE email IS NOT NULL
GROUP BY email
HAVING COUNT(*) > 10
ORDER BY cnt DESC
LIMIT 20;

If you see [email protected] with 4,000 rows, [email protected] with 800 rows, or unknown with 2,000 rows — those aren’t real records. They’re masked nulls. Add them to your cleaning list.

Date range sanity check

SELECT
    MIN(created_at) AS earliest,
    MAX(created_at) AS latest,
    COUNT(*) AS total,
    COUNT(*) FILTER (WHERE created_at > NOW()) AS future_dates
FROM orders;

Future timestamps, dates from 1970, dates from 2099 — these are red flags. Either a bug in the application, a bad import, or a default value that slipped through.

Cardinality check on categorical fields

SELECT status, COUNT(*) AS cnt
FROM orders
GROUP BY status
ORDER BY cnt DESC;

This reveals a lot. If status has values like "complete", "completed", "COMPLETE", "done", "finished" — you’ve got an enum that was never enforced. That’s a normalization job.

Phase 4: Follow the Foreign Keys

Or find out that they don’t exist, which is equally informative.

-- Find FK relationships in PostgreSQL
SELECT
    tc.table_name AS source_table,
    kcu.column_name AS source_column,
    ccu.table_name AS target_table,
    ccu.column_name AS target_column
FROM information_schema.table_constraints AS tc
JOIN information_schema.key_column_usage AS kcu
    ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage AS ccu
    ON ccu.constraint_name = tc.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY';

No foreign keys in a relational database? The relationships probably exist logically but aren’t enforced at the DB level. That means orphaned records are possible — and probably present.

-- Check for orphaned records (example: orders without a matching customer)
SELECT COUNT(*) AS orphaned_orders
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL;

Orphaned records are a common source of subtle bugs in reporting. Find them early.

Phase 5: Write It Down

After working through all of this, you should have:

A list of tables and their rough purpose
Data quality flags per table (null rates, garbage values, orphans)
A shortlist of the 3–5 most important tables for the business logic
A list of questions for the people who built this

Write it up. Even if it’s just a markdown doc in your notes. The audit is only useful if you capture what you found.

A good audit doc has:

Table inventory (name, rows, purpose)
Data quality summary (where are the problems?)
Relationships map (what connects to what)
Open questions (what do you still not understand?)
Risk flags (what should you be careful around?)

The Mindset

The best thing you can bring to an unfamiliar database is structured curiosity. Not assumptions, not judgment — just systematic questions.

Every database reflects the decisions, constraints, and mistakes of the people who built it. Your job isn’t to criticize those decisions. It’s to understand them well enough to work with what’s there.

Some of what you’ll find will be elegant. Most of what you’ll find will be pragmatic. Some of it will make you mutter under your breath. That’s fine. Every database can be understood. Some just take more coffee.

— Matthew