How to Document a Data Pipeline Someone Else Will Actually Maintain

Every data engineer inherits undocumented pipelines. It’s practically a rite of passage. You spend two weeks reverse-engineering what a script does, find three places where the logic contradicts itself, and silently vow that you will never do this to the next person.

Then six months later, under deadline pressure, you ship something with a one-line comment and a README that says “TODO.”

I’ve been on both sides of this. Here’s what actually works.

The Core Problem with Most Pipeline Documentation

Bad documentation comes in two flavors:

Too high-level: “This pipeline syncs data from the CRM to the warehouse.” Thanks. I could see that from the file name. What does it actually do? What are the edge cases? What breaks if I change the schedule?

Too low-level: line-by-line comments that restate what each function does, which anyone can learn by reading the code. They don’t tell you why the code does what it does.

The useful documentation lives in the gap between those two: the why, the gotchas, and how to operate the pipeline.

The Three Documents Every Pipeline Needs

I use a three-layer documentation approach. Each layer has a different audience and a different job.

1. The README

Audience: Anyone who needs to understand what this pipeline does and whether it’s relevant to them.

Format: Markdown, short, in the repo root.

What it contains:

# Pipeline Name

One sentence: what does this do?

## What It Does
- Source: [where data comes from]
- Destination: [where it goes]
- Schedule: [when it runs]
- Volume: [roughly how many records, how often]

## Dependencies
- External APIs or services it calls
- Other pipelines it depends on or that depend on it
- Libraries / Python version requirements

## Quick Start
How to run it locally in 3 commands or fewer.

## Owner
Who to ask if something breaks.

That’s it. The README does not need to be a novel. It needs to answer “what is this and who do I call?” in under two minutes.
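If you want a cheap guard against the “TODO” README, a small check can confirm the template sections are still present. This is a rough sketch, not a standard tool: the function name `missing_readme_sections` is hypothetical, and the heading list is simply copied from the template above, so adjust it to whatever your team actually uses.

```python
import re

# Section headings from the README template above; adjust to your own template.
REQUIRED_SECTIONS = ["What It Does", "Dependencies", "Quick Start", "Owner"]


def missing_readme_sections(readme_text: str) -> list[str]:
    """Return the required '## ' sections that do not appear in the README."""
    # Collect level-2 markdown headings, compared case-insensitively.
    found = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^##\s+(.+?)\s*$", readme_text, flags=re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in found]


if __name__ == "__main__":
    # Made-up README with two of the four sections filled in.
    sample = "# Lead Sync\n\n## What It Does\n...\n\n## Owner\nData team\n"
    print(missing_readme_sections(sample))  # ['Dependencies', 'Quick Start']
```

It only checks that the headings exist, not that what sits under them is useful. That part still needs a human.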

2. The Runbook

Audience: The person who gets paged at 3 AM when it breaks.

Format: Markdown, lives in a docs/ folder or a team wiki.

What it contains:

The runbook is an operational guide. It covers:

How to check if it’s healthy — what does a successful run look like? Where are the logs?
Common failure modes — what breaks, why, and how to fix each one
How to run a backfill — the exact command to re-run for a specific date range
Known quirks — “the API rate-limits at 100 requests/minute on weekdays, 50 on weekends, this is not documented anywhere”
Escalation path — if you can’t fix it in 30 minutes, who do you call?

This is the document that pays off most at 3 AM. Write it when the code is fresh in your head, not after the incident.

# Runbook: Lead Sync Pipeline

## Health Check
- Cron runs at 06:00 UTC daily
- Logs at: /var/log/pipelines/lead-sync.log
- Success indicator: "Sync complete: X records" in final log line
- Heartbeat table: SELECT * FROM pipeline_heartbeats WHERE name = 'lead_sync';

## Common Failures

### "API rate limit exceeded"
- Cause: too many requests in a short window
- Fix: wait 15 minutes, rerun manually
- Prevention: increase BATCH_DELAY_MS in config

### "Connection refused: postgres"
- Cause: database restart or maintenance window
- Fix: verify DB is up, rerun
- Check: pg_isready -h db-host -p 5432

### "KeyError: 'company_id'"
- Cause: upstream CRM schema changed (this has happened twice)
- Fix: check CRM API changelog, update field mapping in src/transform.py
- Contact: CRM vendor support if schema changed without notice

## Backfill
python run.py --start 2026-01-01 --end 2026-01-31 --dry-run
python run.py --start 2026-01-01 --end 2026-01-31

## Escalation
- First: check logs, try restart
- 30 min no fix: ping @matthew on Slack
- Business hours emergency: call [number]
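The health check in that runbook is simple enough to script. Here is a minimal sketch, assuming the final log line really does contain "Sync complete: X records" as the example runbook states. The function name `records_synced` and the sample log lines are made up for illustration; your log format will likely differ.

```python
import re
from typing import Optional

# Pattern for the success indicator described in the example runbook.
SUCCESS_PATTERN = re.compile(r"Sync complete: (\d+) records")


def records_synced(log_text: str) -> Optional[int]:
    """Return the record count from the final non-empty log line.

    Returns None if the log is empty or the last line is not the
    success indicator, which means the last run did not finish cleanly.
    """
    lines = [line for line in log_text.splitlines() if line.strip()]
    if not lines:
        return None
    match = SUCCESS_PATTERN.search(lines[-1])
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Hypothetical log contents, for illustration only.
    healthy = "06:00:01 Starting sync\n06:04:12 Sync complete: 1842 records\n"
    failed = "06:00:01 Starting sync\n06:00:09 API rate limit exceeded\n"
    print(records_synced(healthy))  # 1842
    print(records_synced(failed))   # None
```

Checking only the last line is deliberate: an old success message further up the log should not make a failed run look healthy.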

3. The Inline Comments (But Only for the Non-Obvious Bits)

Audience: The next engineer reading the code.

The rule: Comment the why, not the what.

# Bad: explains what the code does (readable from the code itself)
# Loop through records and filter out inactive ones
active_records = [r for r in records if r["status"] == "active"]

# Good: explains why this choice was made
# We filter here rather than in the API request because the CRM API
# doesn't support server-side status filtering, so everything comes
# back and we discard ~40% of records client-side. Known limitation.
active_records = [r for r in records if r["status"] == "active"]

The second comment tells the next person something they couldn’t learn by reading the code: that a seemingly obvious optimization (filter in the query) won’t work here, and why.

Other things worth a comment:

Magic numbers: BATCH_SIZE = 250 # CRM API hard limit per request
Workarounds for external limitations: # Retry 3x — vendor API returns 500 spuriously about 5% of the time
Non-obvious business logic: # Orders placed before 2024-03-01 use legacy pricing model
Deliberate debt: # TODO: this join is slow on large datasets; acceptable for now (<10k rows/day)
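To show how a couple of these read in context, here is a short sketch. The helper names `with_retries` and `batches` are hypothetical, and `RuntimeError` stands in for whatever exception your HTTP client raises on a 500; the constants and their comments come from the examples above.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")

BATCH_SIZE = 250   # CRM API hard limit per request
MAX_ATTEMPTS = 3   # Vendor API returns 500 spuriously about 5% of the time


def with_retries(call: Callable[[], T], attempts: int = MAX_ATTEMPTS,
                 delay_s: float = 1.0) -> T:
    """Call `call` up to `attempts` times, re-raising the last error if all fail."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except RuntimeError:
            # Only the transient error type is retried; anything else surfaces
            # immediately so real bugs are not hidden behind retries.
            if attempt == attempts:
                raise
            time.sleep(delay_s)
    raise ValueError("attempts must be at least 1")


def batches(records: list, size: int = BATCH_SIZE) -> Iterator[list]:
    """Yield successive slices no larger than the API's per-request limit."""
    for start in range(0, len(records), size):
        yield records[start:start + size]
```

Neither comment on the constants describes what the code does. Both record a fact about the outside world that the code cannot express on its own.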

Keeping It Alive

Documentation that doesn’t get updated is worse than no documentation. It’s actively misleading.

Three practices that help:

Update the runbook during incidents. When something breaks in a way you haven’t seen before, add it to the runbook before you close the incident. It takes five minutes. Future you will find it at 2 AM and feel genuine gratitude.

Add a “last verified” date. At the top of your README and runbook:

Last verified: 2026-03-12
Verified by: Matthew

When someone reads documentation from 2023 in 2026, they should know how much to trust it. A date creates useful skepticism.
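The date also makes staleness checkable by a script. A minimal sketch, assuming the "Last verified: YYYY-MM-DD" line format shown above; the function names and the 180-day threshold are my own assumptions, not a convention from anywhere.

```python
import re
from datetime import date
from typing import Optional

# Matches the "Last verified: 2026-03-12" line format shown above.
LAST_VERIFIED = re.compile(r"^Last verified:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE)


def days_since_verified(doc_text: str, today: date) -> Optional[int]:
    """Return days since the 'Last verified' date, or None if the line is missing."""
    match = LAST_VERIFIED.search(doc_text)
    if match is None:
        return None
    return (today - date.fromisoformat(match.group(1))).days


def is_stale(doc_text: str, today: date, max_age_days: int = 180) -> bool:
    """Treat a document as stale if it has no date or the date is too old."""
    age = days_since_verified(doc_text, today)
    return age is None or age > max_age_days


if __name__ == "__main__":
    doc = "Last verified: 2026-03-12\nVerified by: Matthew\n"
    print(days_since_verified(doc, date(2026, 3, 22)))  # 10
    print(is_stale(doc, date(2027, 3, 12)))             # True
```

A missing date counts as stale on purpose: documentation nobody has ever verified deserves the same skepticism as documentation verified long ago.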

Make documentation part of code review. If a PR changes significant pipeline logic — schedule, dependencies, data transformations — the review should include checking whether the documentation reflects the changes. Not “is there documentation” but “is it still accurate.”
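If reviewers tend to forget, a nudge can be automated. The sketch below flags a change set that touches pipeline code but no documentation. The directory layout is an assumption based on the examples in this post (`src/` for code, `docs/` and `README.md` for documentation), and `docs_need_review` is a hypothetical name; how you obtain the list of changed paths depends on your review tooling.

```python
from pathlib import PurePosixPath

CODE_DIRS = {"src"}         # assumption: pipeline logic lives under src/
DOC_DIRS = {"docs"}         # assumption: the runbook lives under docs/
DOC_FILES = {"README.md"}   # documentation files matched by name


def docs_need_review(changed_paths: list[str]) -> bool:
    """Return True if code changed but no documentation file changed with it."""
    code_changed = False
    docs_changed = False
    for raw in changed_paths:
        path = PurePosixPath(raw)
        top = path.parts[0] if path.parts else ""
        if top in CODE_DIRS:
            code_changed = True
        if top in DOC_DIRS or path.name in DOC_FILES:
            docs_changed = True
    return code_changed and not docs_changed


if __name__ == "__main__":
    print(docs_need_review(["src/transform.py"]))                     # True
    print(docs_need_review(["src/transform.py", "docs/runbook.md"]))  # False
```

This only prompts the question. Whether the documentation is still accurate is something the reviewer has to judge by reading it.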

The Test: The New Person Question

When I’m done documenting a pipeline, I ask myself: if someone who has never seen this codebase sat down with just this documentation and the code, could they:

Understand what the pipeline does in five minutes?
Diagnose and fix the three most common failures without calling me?
Run a backfill for last month?

If the answer to any of those is no, the documentation isn’t done yet.

That’s a higher bar than most teams hold themselves to. But it’s the bar that actually matters — because eventually, someone is going to need to maintain this at a moment when you’re not available to explain it.

Write for that person. You might be that person.

— Matthew