Documentation is Infrastructure, Not an Afterthought

There’s a Slack channel at most engineering orgs that costs about $1.25 million a year. Nobody calls it that. It’s usually named something like #help-platform or #eng-questions, and it’s where developers go when the docs are wrong, missing, or hidden somewhere only the original author knows how to find.

The math, if you’ve never run it: GitLab’s research on knowledge silos puts the average developer at roughly 6.5 hours per week spent hunting information or duplicating work that already exists somewhere. For a 50-engineer team at typical fully-loaded comp, that’s north of $1.25M annually walking out the door — not as a line item, just as a slow leak. DX’s research lands in a similar place, pinning the cost of poor documentation at $500K–$2M per year for a mid-sized engineering team.
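
If you want to rerun that arithmetic with your own numbers, here is a minimal sketch. The 6.5 hours and the 50 engineers come from the figures above; the 50 working weeks and the $80/hour fully loaded rate are my own assumptions, so swap in whatever your finance team actually uses.

```python
def annual_search_cost(engineers, hours_lost_per_week, working_weeks, hourly_rate):
    """Yearly cost of engineering time spent hunting for information."""
    return engineers * hours_lost_per_week * working_weeks * hourly_rate


cost = annual_search_cost(
    engineers=50,
    hours_lost_per_week=6.5,  # the GitLab figure cited above
    working_weeks=50,         # assumption: roughly two weeks of leave
    hourly_rate=80,           # assumption: fully loaded USD per hour
)
print(f"${cost:,.0f} per year")  # prints "$1,300,000 per year"
```

The result is very sensitive to the hourly rate, which is why the published estimates span such a wide range.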

If a service were burning $1.25M/year in downtime, we’d page someone. We’d write a postmortem. We’d assign a staff engineer and a quarter to fix it. When docs do the same thing, we shrug and say we’ll get to it next sprint.

Why we keep getting this wrong

The honest answer isn’t that engineers don’t care about documentation. It’s that the feedback loops are broken in a way the rest of our infrastructure has long since fixed.

When a service goes down, PagerDuty wakes someone up. When a test fails, the build goes red. When a security scanner finds a CVE, a ticket auto-files itself. There’s an immune system. The whole apparatus of modern engineering reliability is built on the premise that failures should be loud, attributable, and uncomfortable.

Docs don’t have that immune system. When a doc goes stale, the only person who notices is the junior engineer who burns three hours trying to follow it before pinging someone on Slack to ask if it’s still accurate. The cost is real but diffuse — paid in small denominations across many people, none of whom have the leverage to fix it. The original author has moved teams. The runbook says “ask Sarah,” and Sarah left in 2024.

That asymmetry — high cost, no alarm — is the actual root cause. Not laziness, not “engineers can’t write.” We built incident response for one kind of failure and ignored another.

Docs have the same failure modes as infrastructure

The framing that helped me stop arguing about documentation in abstract terms was realizing the failure modes map almost one-to-one to infrastructure failures we already know how to handle.

Drift. When a PR changes the shape of an API and the docs don’t update, that’s the same problem as config drift in your Terraform state. Two sources of truth, diverging. The fix in infra is reconciliation loops; the fix in docs is the same — automated detection that function X changed while its documentation didn’t.

Decay. DX’s research puts a hard timeline on this: documentation older than six months becomes suspect, and after a year it’s often actively misleading, which is worse than having no docs at all. This is the same shape as bit rot or unpatched dependencies. Time passes, the surrounding environment changes, and what was correct becomes wrong by neglect.
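
Those two thresholds are easy to turn into a report. The sketch below is one way to do it, not a standard tool: the file names and dates are made up, and in a real repo the last-updated date would come from somewhere like each file's most recent commit.

```python
from datetime import date, timedelta

SUSPECT_AFTER = timedelta(days=182)  # roughly six months
STALE_AFTER = timedelta(days=365)    # one year


def classify_doc(last_updated: date, today: date) -> str:
    """Bucket a doc by age using the six-month and one-year thresholds."""
    age = today - last_updated
    if age >= STALE_AFTER:
        return "stale"
    if age >= SUSPECT_AFTER:
        return "suspect"
    return "fresh"


# Hypothetical docs and last-updated dates, for illustration only.
docs = {
    "runbooks/payments.md": date(2024, 1, 10),
    "api/auth.md": date(2024, 11, 1),
    "onboarding.md": date(2025, 4, 15),
}
today = date(2025, 6, 1)  # fixed so the example is reproducible

for path, updated in docs.items():
    print(f"{classify_doc(updated, today):8} {path}")
```

Age is only a proxy. A doc about a frozen system can be old and correct, so treat "stale" as a prompt to review rather than a verdict.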

Discoverability outages. When a new hire can’t find the runbook for a service they’ve just been put on-call for, that’s a service discovery problem. The information exists; it’s just unreachable. Tribal knowledge (“ask whoever was here in 2022”) is the documentation equivalent of running production from someone’s laptop.

No ownership. Docs without a clear owner are docs that will rot. This is the on-call rotation problem. If everyone is responsible, no one is. The teams that have working docs have explicit owners; the teams that don’t, don’t.

Once you see docs through this lens, the practices that work stop feeling like personal virtue and start feeling like obvious engineering hygiene.

What treating docs like infrastructure actually looks like

Concrete things, not platitudes.

Docs live in the same repo as the code they describe, and they go through the same PR review. This is the docs-as-code idea, and it’s table stakes — but it’s also where most teams stop, which is why most teams still ship stale documentation. Storing markdown in Git is publishing infrastructure. It is not correctness infrastructure.

The next layer is CI checks. Vale for style. Link checkers that fail the build on broken references. Spell-check on technical terms. These are cheap to set up and they catch the mechanical issues that would otherwise occupy human reviewers. A 2025 benchmark study found static sites with automated link checking had 34% fewer broken links in production than those without — which is the kind of number that justifies the half-day it takes to wire up a GitHub Action.
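
To show how little code the link-checking piece needs, here is a rough sketch of a checker for relative links in Markdown files. It is a stand-in for a real link checker, not a replacement: it assumes inline `[text](target)` syntax, skips external URLs and in-page anchors, and the function name is my own.

```python
import re
from pathlib import Path

# Inline Markdown links and images: [text](target). Reference-style links
# and targets containing spaces are ignored to keep the sketch short.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def broken_relative_links(docs_root: Path) -> list[tuple[Path, str]]:
    """Return (file, target) pairs whose relative target does not exist on disk."""
    broken = []
    for md_file in sorted(docs_root.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        for target in LINK_RE.findall(text):
            if target.startswith(("http://", "https://", "mailto:", "#")):
                continue  # external links and in-page anchors need a different check
            path_part = target.split("#", 1)[0]  # drop any trailing #anchor
            if not (md_file.parent / path_part).exists():
                broken.append((md_file, target))
    return broken
```

A CI job would call this on the docs directory and exit non-zero when the returned list is non-empty, which is what turns a broken link into a red build.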

The harder layer — and this is where the industry is still figuring it out — is content drift detection. The “DocOps” tools emerging in the last two years (DeepDocs and similar) try to catch the case where function X changed but the doc for X didn’t. This is genuinely hard, but even crude heuristics (“this code file was modified in this PR; were any docs referencing it touched?”) catch a surprising amount.
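
That crude heuristic fits in a dozen lines. The sketch below assumes you already have two inputs: the set of files changed in a PR, and a mapping from each doc to the code files it describes. How you build that mapping (by hand, or by scanning docs for file paths) is up to you; the names and paths here are hypothetical.

```python
def docs_possibly_drifted(
    changed_files: set[str],
    doc_references: dict[str, set[str]],
) -> dict[str, list[str]]:
    """Map each changed code file to docs that reference it but were not touched."""
    flagged: dict[str, list[str]] = {}
    for doc, referenced_code in doc_references.items():
        if doc in changed_files:
            continue  # the doc was edited in this PR; assume the author reviewed it
        for code_file in sorted(referenced_code & changed_files):
            flagged.setdefault(code_file, []).append(doc)
    return flagged


# Hypothetical PR: two code files changed, but only one doc was updated.
changed_files = {"billing/api.py", "auth/tokens.py", "docs/auth.md"}
doc_references = {
    "docs/billing.md": {"billing/api.py"},
    "docs/auth.md": {"auth/tokens.py"},
    "docs/search.md": {"search/index.py"},
}

flagged = docs_possibly_drifted(changed_files, doc_references)
print(flagged)  # {'billing/api.py': ['docs/billing.md']}
```

This will produce false positives, since not every code change alters documented behavior. I would start by posting the result as a PR comment and only make it a failing check once the team trusts the signal.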

CODEOWNERS for documentation. If your runbooks don’t have explicit owners, they will rot. Make ownership a property of the file, not of someone’s memory.
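
One way to keep that honest is a check that every doc file is covered by some CODEOWNERS rule. The sketch below uses a deliberately simplified matcher (directory prefixes plus `fnmatch` globs, with later rules overriding earlier ones). It only approximates how a hosting platform resolves patterns, and the team handles and paths are invented for the example.

```python
from fnmatch import fnmatch


def parse_codeowners(text: str) -> list[tuple[str, list[str]]]:
    """Parse CODEOWNERS text into (pattern, owners) rules, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, *owners = line.split()
        rules.append((pattern.lstrip("/"), owners))
    return rules


def owners_for(path: str, rules: list[tuple[str, list[str]]]) -> list[str]:
    """Return the owners from the last rule that matches the path (simplified)."""
    owners: list[str] = []
    for pattern, rule_owners in rules:
        if pattern.endswith("/"):
            matched = path.startswith(pattern)  # directory rule
        else:
            matched = fnmatch(path, pattern)    # glob rule
        if matched:
            owners = rule_owners  # later rules override earlier ones
    return owners


# Hypothetical CODEOWNERS content and doc paths.
CODEOWNERS = """
# Runbooks belong to the team that carries the pager
/docs/runbooks/ @acme/sre
/docs/api/ @acme/platform
"""
doc_paths = ["docs/runbooks/payments.md", "docs/api/auth.md", "docs/onboarding.md"]

rules = parse_codeowners(CODEOWNERS)
unowned = [p for p in doc_paths if not owners_for(p, rules)]
print(unowned)  # ['docs/onboarding.md']
```

Anything in the unowned list is a doc that nobody is on the hook for, which is the rot risk this section describes.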

And finally: docs in the definition of done. The same gate as tests. Stripe famously bakes documentation quality into career ladders and performance reviews — not as a side metric but as part of how engineering excellence is defined. When a senior engineer ships a feature, “the docs are done” is part of “the feature is done.”

The cultural piece you can’t skip

Here’s where I’ll be honest about the limits of tooling. You can wire up every linter, every drift detector, every CODEOWNERS entry, and still end up with bad documentation if the org treats docs as overhead.

The companies that have famously good docs (Stripe is the canonical example, GitLab another) didn’t get there through tooling alone. Stripe runs its own press, publishes books, has put out an engineering magazine, and maintains one of the most-read engineering blogs in the industry. Writing is part of the culture in a way that makes good documentation feel native, not imposed. GitLab’s handbook-first model is similar: the default is that decisions and processes are written down, not communicated in meetings.

What this looks like in practice for a staff engineer or engineering manager: the signal you send when you write the docs yourself, rather than delegating them to whoever’s most junior on the team, is enormous. It tells everyone watching that documentation is real engineering work. Conversely, when leadership treats docs as the thing you do if you have time after the “real” work, the entire org calibrates to that signal within a quarter.

What to ship this quarter

Every engineering org already has documentation infrastructure. The only question is whether it’s intentional or accidental, instrumented or invisible.

If you want to start somewhere concrete: pick one practice. Wire up a link checker in CI. Add CODEOWNERS to your docs directory. Run a “documentation drift” exercise on your five most-touched services and see how stale the runbooks really are.

Ship it like you’d ship any other reliability improvement. Because that’s what it is.
