AI Edit Cost: HTML/docs-playbook vs Markdown Generators

How much does it cost an AI agent to make a documentation change? We ran a controlled study comparing our hand-authored-HTML pipeline (docs-playbook) against Markdown-based static site generators, measuring wall-clock time, token usage, and edit quality across three experiments.

Headline

A Markdown KB is roughly 1.5–1.6× faster and 1.4–1.5× cheaper for an agent to edit than the HTML/docs-playbook KB (1.51× fewer billed tokens, 1.41× fewer cost-weighted tokens). The gap comes from authoring ergonomics (page chrome and navigation surface), not from build speed or error-fixing. It holds across two editor models (Opus and Sonnet), across tasks we did not author, and across four Markdown generators, so it is a property of Markdown-based generation rather than of one tool or one model. Edit completeness was comparable between the stacks.

Metric (aggregate)	HTML / docs-playbook	MkDocs / Markdown	Verdict
Wall-clock time	43.8s	28.2s	1.56× faster
Total billed tokens	~244k / task	~162k / task	1.51× fewer
Cost-weighted tokens	—	—	1.41× cheaper
Edit completeness (0–5)	~4.7	~4.9	Comparable
Build time (warm)	~0.4s	~0.8s	Comparable

Time/token figures are the Opus-editor run (Exp 2); the gap reproduced at 1.55×/1.56× with a Sonnet editor (Exp 4, see Hardening).

Why This Matters

docs-playbook generates client knowledge bases as hand-authored static HTML, where each page carries its own chrome (topbar, sidebar, footer, search scripts) and navigation is single-sourced in nav.json and injected at build time. Increasingly, those KBs are edited by AI agents. If a Markdown pipeline is materially cheaper for agents to maintain, that changes the standard. This study quantifies the difference instead of guessing.

Method

Model: Claude Opus 4.8 (1M), identical for every agent in Exp 1–3; Exp 4 swaps in Sonnet 4.6 as the editor (see Hardening).
Identical prompts: each agent receives only a repo path and a task, then must discover the repo's conventions itself. The only variable is the stack.
Isolation: every trial runs on its own fresh copy of the repo, so parallel trials never collide.
Primary condition — author-only: make the edit and validate by inspection; do not run the production build. This isolates authoring cost. Build cost is measured separately.
Time: each agent stamps date +%s.%N as its first and last action (real wall-clock).
Tokens: parsed from each agent's API transcript — input + cache-write + cache-read + output per call, streaming records de-duplicated. Cost-weighted tokens apply Anthropic price ratios (output ×5, cache-write ×1.25, cache-read ×0.1).
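
The token accounting described above can be sketched as follows. This is a minimal illustration, not the study's actual parser: the `Usage` record, its field names, and the rule "keep the last record per call" are assumptions made for the example. Only the four weights come from the text.

```python
from dataclasses import dataclass

# Price-ratio weights quoted in the Method section.
WEIGHTS = {"input": 1.0, "cache_write": 1.25, "cache_read": 0.1, "output": 5.0}


@dataclass(frozen=True)
class Usage:
    """Token counts for one API call. Field names are hypothetical."""
    call_id: str
    input: int
    cache_write: int
    cache_read: int
    output: int


def dedupe(records: list[Usage]) -> list[Usage]:
    """Keep the last record per call_id.

    Assumes a streamed call may appear several times in the transcript
    and that its final record carries the complete counts.
    """
    latest: dict[str, Usage] = {}
    for rec in records:
        latest[rec.call_id] = rec
    return list(latest.values())


def billed_tokens(records: list[Usage]) -> int:
    """Unweighted sum: input + cache-write + cache-read + output."""
    return sum(r.input + r.cache_write + r.cache_read + r.output
               for r in dedupe(records))


def cost_weighted_tokens(records: list[Usage]) -> float:
    """Weighted sum using the price ratios above."""
    return sum(
        WEIGHTS["input"] * r.input
        + WEIGHTS["cache_write"] * r.cache_write
        + WEIGHTS["cache_read"] * r.cache_read
        + WEIGHTS["output"] * r.output
        for r in dedupe(records)
    )


if __name__ == "__main__":
    transcript = [
        Usage("a", 100, 40, 1000, 10),   # partial streaming record
        Usage("a", 100, 40, 1000, 20),   # final record for the same call
        Usage("b", 50, 0, 2000, 30),
    ]
    print(billed_tokens(transcript), cost_weighted_tokens(transcript))
```

Cache reads dominate the raw count but are weighted at 0.1, which is why the cost-weighted ratio can differ from the billed-token ratio for the same run.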

#	Design	Purpose
Exp 1	Pilot: 2 tasks × 2 real repos, n=1	First signal
Exp 2	Replication: 4 tasks × 2 real repos × 3 trials = 24 agents + blind judge	Confirm with variance + quality
Exp 3	Cross-framework: 5 clean sites × 3 tasks × 2 trials = 30 agents	Generalize + control repo maturity
Exp 4	Hardening: Sonnet editor + 2 Sonnet-authored tasks + Opus (independent) judge	Remove single-model & task-author bias

Core Result (Exp 2)

Four task types, three trials each, on the real client KB in both stacks. The HTML stack cost more on every metric in aggregate, and on every individual task except the sitewide rename (T4), where HTML was faster on wall-clock time but still used more tokens.

Task	Time (html/md)	Tokens (html/md)	Note
T1 — add a new page + nav	2.51× (55→22s)	1.72×	Biggest gap: page chrome
T2 — add a table to a page	1.42× (27→19s)	1.64×	Replicates Exp 1 (1.37×)
T3 — add a note to 3 pages	1.92× (56→29s)	1.56×	Bulk multi-file edit
T4 — rename a nav label sitewide	0.88× (37→42s)	1.21×	HTML faster — see Nuance
Aggregate	1.56× faster	1.51× fewer	Markdown wins overall
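
As a worked check of the time column, the sketch below recomputes the ratios from the rounded per-task seconds in the table. It assumes the aggregate is a ratio of mean times across tasks. The source does not state the aggregation, and the rounded inputs shift some per-task ratios in the last digit.

```python
from statistics import mean

# Rounded per-task mean times in seconds, copied from the Exp 2 table.
HTML_SECONDS = {"T1": 55, "T2": 27, "T3": 56, "T4": 37}
MD_SECONDS = {"T1": 22, "T2": 19, "T3": 29, "T4": 42}


def per_task_ratios(html: dict, md: dict) -> dict:
    """HTML time divided by Markdown time for each task."""
    return {task: html[task] / md[task] for task in html}


def ratio_of_means(html: dict, md: dict) -> float:
    """Mean HTML time over mean Markdown time (assumed aggregation)."""
    return mean(html.values()) / mean(md.values())


def mean_of_ratios(html: dict, md: dict) -> float:
    """Alternative aggregation, shown only for contrast."""
    return mean(per_task_ratios(html, md).values())


if __name__ == "__main__":
    for task, ratio in per_task_ratios(HTML_SECONDS, MD_SECONDS).items():
        print(f"{task}: {ratio:.2f}x")
    print(f"ratio of means: {ratio_of_means(HTML_SECONDS, MD_SECONDS):.2f}x")
    print(f"mean of ratios: {mean_of_ratios(HTML_SECONDS, MD_SECONDS):.2f}x")
```

On these inputs the ratio of means comes out near the reported 1.56×, while the mean of per-task ratios is noticeably higher, so the reported aggregate is consistent with the first reading.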

Why: a new HTML page is ~150 lines of reproduced chrome (head, topbar dropdowns, sidebar, footer, search scripts); the Markdown equivalent is ~20 lines plus one nav line. Agents also spend reads understanding the HTML build pipeline where the Markdown nav is one explicit list.

Build & Fix Cost

A natural question: is the HTML penalty really authoring, or is it the lint/build/error-fixing loop docs-playbook imposes? We measured the build directly.

Build (warm, full site)	Time
MkDocs `mkdocs build --strict`	~0.8s
docs-playbook `npm run build` (prebuild scripts + Pagefind)	~0.4s

Both build sub-second; the docs-playbook build is actually slightly faster (Pagefind is a quick Rust binary). The author-only experiments never ran a build and still showed the full gap.
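
One way to take a warm build measurement like the ones above is to run the command several times and discard the first, cold run. The sketch below is an assumption about procedure, since the source does not say how the builds were timed; the helper names are illustrative.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, cwd=None, runs=4):
    """Run cmd `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            cwd=cwd,
            check=True,  # fail loudly if the build itself fails
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return timings


def warm_median(timings):
    """Drop the first (cold) run and return the median of the rest."""
    if len(timings) < 2:
        raise ValueError("need at least two runs to discard the cold one")
    return statistics.median(timings[1:])


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere. Inside a repo, swap in
    # ["mkdocs", "build", "--strict"] or ["npm", "run", "build"].
    samples = time_command([sys.executable, "-c", "pass"])
    print(f"warm median: {warm_median(samples):.3f}s")
```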

Conclusion

The cost is authoring, not build time or error-fixing.

To check that "faster" was not "faster but wrong," an independent agent graded each output against the task spec and the repo's own conventions, hunting for corner-cutting. This surfaced the most important finding.

Task	HTML compl/corr/conv	MD compl/corr/conv	Defect found
T1 new page	4/5/5	5/5/5	HTML updated `nav.json` but never ran `inject-nav` → Glossary link missing on 26 of 27 pages
T2 edit table	5/5/5	5/5/5	Both flawless
T3 bulk note	5/5/5	5/5/3	MD minor style nit (token var vs opacity idiom)
T4 rename	4/5/5	5/5/5	HTML's fast win missed one in-text link
Mean completeness	4.5	5.0	Markdown more complete

Key insight

HTML's apparent advantage on the navigation tasks was partly incompleteness. Because nav lives in baked per-page topbars regenerated at build time, an author-only edit is not actually finished until you run the build — so the build step is not mere overhead, it is required for correctness. The Markdown stack's single explicit nav list was complete and correct immediately. (Caveat: this completeness gap appeared with an Opus editor here but did not replicate under a Sonnet editor — see Hardening. Treat it as a risk, not a reliable advantage.)
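
The un-propagated nav defect can be detected mechanically. The sketch below assumes a hypothetical layout: `nav.json` is a list of `{"label", "href"}` objects and every page is a flat `.html` file whose baked nav links use `href="..."`. The real docs-playbook schema may differ.

```python
import json
import tempfile
from pathlib import Path


def nav_hrefs(nav_path: Path) -> list:
    """Read hrefs from the nav file (the list-of-objects shape is assumed)."""
    items = json.loads(nav_path.read_text(encoding="utf-8"))
    return [item["href"] for item in items]


def stale_pages(site_dir: Path, nav_path: Path) -> dict:
    """Map each page to the nav hrefs its baked HTML does not link to."""
    hrefs = nav_hrefs(nav_path)
    stale = {}
    for page in sorted(site_dir.glob("*.html")):
        html = page.read_text(encoding="utf-8")
        missing = [h for h in hrefs if f'href="{h}"' not in html]
        if missing:
            stale[page.name] = missing
    return stale


def demo() -> dict:
    """Build a two-page site where nav.json was edited but not re-injected."""
    site = Path(tempfile.mkdtemp(prefix="kb-"))
    nav = [
        {"label": "Home", "href": "index.html"},
        {"label": "Glossary", "href": "glossary.html"},
    ]
    (site / "nav.json").write_text(json.dumps(nav), encoding="utf-8")
    old_nav = '<a href="index.html">Home</a>'
    new_nav = old_nav + '<a href="glossary.html">Glossary</a>'
    # index.html still carries the old baked nav; the new page has the new one.
    (site / "index.html").write_text(f"<nav>{old_nav}</nav>", encoding="utf-8")
    (site / "glossary.html").write_text(f"<nav>{new_nav}</nav>", encoding="utf-8")
    return stale_pages(site, site / "nav.json")


if __name__ == "__main__":
    print(demo())  # {'index.html': ['glossary.html']}
```

A check of this kind, run at the end of an author-only edit, would flag the "missing on 26 of 27 pages" case without running the full build.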

Cross-Framework (Exp 3)

To test whether the win is "Markdown generally" or "MkDocs specifically," and to control for the real repo's accumulated cruft, we built five clean minimal sites with identical content and ran the same edit tasks.

Stack	Mean time	Cost-weighted tokens	Time vs MkDocs
Starlight	12.7s	27,239	0.94×
VitePress	13.2s	29,085	0.98×
MkDocs	13.5s	32,315	1.00×
Docusaurus	17.2s	31,817	1.28×
HTML (static)	22.4s	39,160	1.67×

Two signals: the HTML baseline is the clear outlier (~1.7× slower), confirming the penalty generalizes beyond the big real repo; and the four Markdown generators cluster tightly — so the advantage is Markdown-based generation broadly. Docusaurus is the heaviest Markdown stack (MDX front-matter + _category_.json sidebars); Starlight and VitePress edge out MkDocs slightly.
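
The normalization against MkDocs can be reproduced from the table's own figures. The sketch below divides each stack's mean time and cost-weighted tokens by the MkDocs row. The inputs are the rounded values shown above, so the results can differ from the printed ratios in the last digit.

```python
# (mean seconds, cost-weighted tokens) copied from the Exp 3 table.
EXP3 = {
    "Starlight": (12.7, 27_239),
    "VitePress": (13.2, 29_085),
    "MkDocs": (13.5, 32_315),
    "Docusaurus": (17.2, 31_817),
    "HTML (static)": (22.4, 39_160),
}


def relative_to(baseline: str, table: dict) -> dict:
    """Return (time ratio, token ratio) of every stack against the baseline."""
    base_time, base_tokens = table[baseline]
    return {
        stack: (seconds / base_time, tokens / base_tokens)
        for stack, (seconds, tokens) in table.items()
    }


if __name__ == "__main__":
    for stack, (t_ratio, tok_ratio) in relative_to("MkDocs", EXP3).items():
        print(f"{stack:14s} time {t_ratio:.2f}x  tokens {tok_ratio:.2f}x")
```

On these inputs the time ratios land within about 0.01 of the table's last column, which suggests that column is time-based. The token ratios show the same ordering for the HTML baseline but a smaller penalty.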

Hardening (Exp 4): Cross-Model, Independent Tasks, Independent Judge

The first three experiments share two weaknesses: they used one model (Opus), and we authored the tasks. Exp 4 attacks both. A different model (Sonnet 4.6) is the editor; two of the tasks were generated by Sonnet, not us; and the judge is Opus grading Sonnet's work (so judge and editor are different models).

Task (editor = Sonnet)	Time (html/md)	Tokens (html/md)
T1 new page	1.59×	1.39×
T2 edit table	1.42×	1.41×
T3 bulk note	3.06×	3.73×
T4 rename	1.22×	1.18×
N1 — move a decision row (Sonnet-authored)	1.47×	1.76×
N2 — add a table column (Sonnet-authored)	1.13×	1.22×
Aggregate	1.55×	1.56×

What replicated

The speed and token gap is robust: 1.55× / 1.56× with Sonnet is essentially the same as 1.56× / 1.51× with Opus, although with n=2 per cell this is a comparison of point estimates rather than a formal significance test. It held on every task, including the two we did not write. Even T4 (the one task HTML won on time under Opus) flipped to Markdown-favorable under Sonnet, which suggests that win was model-specific rather than a real HTML advantage.

What did NOT replicate — a correction

Exp 2 found Markdown slightly more complete (mean completeness 5.0 vs 4.5 for HTML). The independent Opus judge of Sonnet's work scored completeness 4.83 vs 4.83, a tie. The two judge runs disagreed on whether HTML's un-propagated nav counts as a defect. So we retract the completeness claim: edit completeness is comparable. The nav-propagation incompleteness is a real failure mode for author-only HTML edits, but its severity is model- and judge-dependent. The headline speed/cost result stands; this secondary claim does not.
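
The completeness figures can be traced with a little arithmetic. The sketch below recomputes the Exp 2 means from the per-task judge scores and then averages them with the Exp 4 means. The unweighted average of the two experiments is an assumption: the source does not say how the headline ~4.7 / ~4.9 values were derived, and this is only one reading that matches them.

```python
from statistics import mean

# Completeness scores (0-5) for T1..T4, copied from the Exp 2 judge table.
EXP2_HTML = [4, 5, 5, 4]
EXP2_MD = [5, 5, 5, 5]

# Exp 4 mean completeness as reported for the independent judge.
EXP4_HTML = 4.83
EXP4_MD = 4.83


def pooled(exp2_scores: list, exp4_mean: float) -> float:
    """Unweighted average of the two experiment means (assumed pooling)."""
    return (mean(exp2_scores) + exp4_mean) / 2


if __name__ == "__main__":
    print(f"Exp 2 means: HTML {mean(EXP2_HTML):.2f}, MD {mean(EXP2_MD):.2f}")
    print(f"Pooled:      HTML {pooled(EXP2_HTML, EXP4_HTML):.2f}, "
          f"MD {pooled(EXP2_MD, EXP4_MD):.2f}")
```

Under that assumption the pooled values are about 4.67 and 4.92, consistent with the headline table and with a gap that is small next to the judge disagreement described above.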

The Honest Nuance

The result is not "Markdown always wins everything." The one place the HTML stack is competitive is the sitewide nav rename (Exp 2, T4), where docs-playbook's build-time inject-nav centralizes the label in a single nav.json. That same mechanism is a double-edged sword:

Helps centralized nav edits: change one file, the build propagates it everywhere.
Hurts correctness in author-only edits: until you run the build, the change is incomplete (the new page is unreachable from other pages).
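
Both edges of that mechanism show up in a toy injector. The sketch below is not docs-playbook's inject-nav: the `<!-- nav:start -->` / `<!-- nav:end -->` markers, the nav item shape, and the function names are all hypothetical, chosen only to illustrate single-sourced nav baked into each page at build time.

```python
import html
import re

# Hypothetical markers that delimit the baked nav block in every page.
NAV_BLOCK = re.compile(r"(<!-- nav:start -->).*?(<!-- nav:end -->)", re.DOTALL)


def render_nav(items: list) -> str:
    """Render the single-sourced nav list as an HTML fragment."""
    links = "".join(
        f'<a href="{html.escape(item["href"])}">{html.escape(item["label"])}</a>'
        for item in items
    )
    return f"<nav>{links}</nav>"


def inject_nav(page_html: str, items: list) -> str:
    """Replace whatever sits between the markers with the current nav."""
    nav_html = render_nav(items)
    return NAV_BLOCK.sub(lambda m: m.group(1) + nav_html + m.group(2), page_html)


def rebuild(pages: dict, items: list) -> dict:
    """The 'build' step: re-inject the nav into every page."""
    return {name: inject_nav(body, items) for name, body in pages.items()}


if __name__ == "__main__":
    nav = [{"label": "Guides", "href": "guides.html"}]
    shell = "<body><!-- nav:start --><!-- nav:end --><main>{}</main></body>"
    pages = rebuild({"a.html": shell.format("A"), "b.html": shell.format("B")}, nav)

    nav[0]["label"] = "How-to"  # the one-file rename
    print(all("Guides" in p for p in pages.values()))   # pages are stale until rebuilt
    pages = rebuild(pages, nav)
    print(all("How-to" in p for p in pages.values()))   # build propagates the rename
```

The rename touches one data structure, but every page keeps the old label until `rebuild` runs, which mirrors the author-only incompleteness described above.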

Note also that Exp 3's HTML baseline is plain static HTML with no inject-nav, so its rename task is harsher than real docs-playbook — which is exactly why Exp 2's T4 (real docs-playbook, with the injector) is the fairer read for that one task. The two experiments together tell the complete story.

Recommendation

For KBs that are edited by AI agents (most of ours), prefer a Markdown-based generator. It is ~1.5× faster and cheaper per edit with comparable edit completeness, and the advantage compounds with every new page. Any of MkDocs Material, VitePress, or Starlight is a strong default; MkDocs Material is the safest for table-heavy client KBs. Reserve hand-authored HTML for pages that genuinely need bespoke layout, and embed those as raw HTML inside a Markdown page rather than running the whole site that way.

Beyond Edit Cost: Choosing a Stack

Edit cost is one axis. A standardization decision should weigh several more. We measured the objective ones and researched the governance picture.

Objective metrics (measured)

Metric (lower is better)	HTML / docs-playbook	MkDocs	VitePress	Docusaurus	Starlight
Dependency footprint	~0¹	126 MB (pip)	91 MB	318 MB	183 MB
Install time (warm)	~0	6.5s	7.8s	39s	22s
JS shipped per page	0 KB	151 KB	155 KB	648 KB	98 KB
New-page diff (lines)	~61	~10	~10	~13	~12
AI edit time vs MkDocs (Exp 3)	1.67×	1.00×	0.98×	1.28×	0.94×

¹ docs-playbook adds Pagefind (~55 MB) for search. JS/page is the bytes the browser actually downloads on a page load.

Longevity / governance (researched, 2026)

The MkDocs ecosystem is fractured — weigh this

As of 2026, MkDocs core split apart. MkDocs 2.0 is an unlicensed, plugin-removing, Material-incompatible rewrite (YAML→TOML, no migration path); a March 2026 PyPI takeover dispute followed; the community forked into ProperDocs (drop-in 1.x) and Zensical / MaterialX (by the Material team). Material for MkDocs 1.x is stable and safe right now but entered maintenance mode in Nov 2025 — its build emits the 2.0 warning. Forward path: Zensical, which reads existing mkdocs.yml and builds ~5× faster. The JS-framework stacks face no such crisis: Docusaurus (Meta), VitePress (Vue team, official VuePress successor), and Starlight (Astro) are all healthily maintained.

Decision matrix

If you optimize for…	Pick
Lowest AI + human edit cost	any Markdown SSG (≈tie; HTML loses)
Cleanest diffs / review cost	Markdown
Smallest dependency / supply-chain surface	static HTML, then VitePress / MkDocs
Leanest reader payload (perf)	static HTML, then Starlight
Portability / no lock-in	plain Markdown > MDX/React > bespoke HTML
Richest interactive components	Docusaurus / Starlight (MDX)
Longevity confidence in 2026	VitePress, Starlight, Docusaurus > MkDocs (governance risk)
Table-heavy client KB today	MkDocs Material (eyes open re: Zensical)

Net recommendation

For AI-edited client KBs, choose a lean Markdown SSG — the edit-cost, diff, and portability wins are decisive and cluster tightly. Among them:

Starlight: leanest reader payload (98 KB), fast builds, healthy Astro backing. Strong default for new sites.
VitePress: similar profile, Vue ecosystem, smallest node_modules of the JS stacks.
MkDocs Material: best for table-heavy KBs and the existing registry-direct-2; technically excellent and inside the tight edit-cost cluster (1.00× against Starlight's 0.94×), but weigh the 2026 governance risk and keep Zensical in view.
Docusaurus: only if you need rich React components; heaviest footprint (318 MB) and JS payload (648 KB/page).
Bespoke HTML / docs-playbook: reserve for one-off bespoke layouts, embedded as raw HTML inside a Markdown page.

The earlier registry-direct-2 choice (MkDocs Material) remains sound for that table-heavy KB. For the next greenfield KB, Starlight or VitePress edge ahead once reader-perf and longevity are weighted alongside edit cost.

Limitations

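Author-only primary condition. Builds were measured separately, not inside the timed edit. Because an HTML edit that touches nav is not complete until the build runs, we expect that including build + fix would widen the gap rather than narrow it, which would make the reported numbers conservative toward HTML; this was not measured inside the timed edit.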
Single model — addressed. Exp 1–3 used Opus 4.8; Exp 4 re-ran the benchmark with Sonnet 4.6 as editor and reproduced the gap (1.55× / 1.56×).
Same-model judge — addressed. Exp 4's judge (Opus) is a different model from its editor (Sonnet), giving editor-independent grading.
We authored the tasks — addressed. Exp 4 added two tasks generated by a different model; the gap held on both.
Modest n. n=3 (Exp 2), n=2 (Exp 3 and Exp 4) per cell. Variance was moderate and the direction held on every aggregate.
A human judge would still be stronger. All judges are LLMs. Agreement across editor models on the speed and cost gap is reassuring, but the judge runs disagreed on the nav-propagation defect, and that kind of contested case needs human grading.
Exp 3 HTML baseline is plain static HTML (no inject-nav), harsher than real docs-playbook on the rename task specifically.

Reproducibility

Each agent self-reported time via date; tokens were recovered from API transcripts (sum of input + cache + output per call, streaming-de-duplicated). Cost-weighted tokens use fresh_in + 1.25×cache_write + 0.1×cache_read + 5×output. Tasks were applied to isolated repo copies so trials could run in parallel; outputs were graded by an independent agent against each stack's own conventions. Run date: 2026-06-02.
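
The isolation step can be sketched as one fresh copy of the repo per trial, with trials dispatched in parallel. This is an illustration under assumptions: the helper names, the temp-directory layout, and the `run_one` callback (which stands in for launching an agent) are hypothetical.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fresh_copy(repo: Path, trial_id: str) -> Path:
    """Copy the repo into a new temp directory reserved for one trial."""
    workdir = Path(tempfile.mkdtemp(prefix=f"trial-{trial_id}-"))
    return Path(shutil.copytree(repo, workdir / repo.name))


def run_trials(repo: Path, trial_ids: list, run_one, max_workers: int = 4) -> dict:
    """Run run_one(copy_path, trial_id) for each trial on its own copy.

    run_one is a placeholder for whatever launches the agent and returns
    its result; trials never share a working tree, so they cannot collide.
    """
    def job(trial_id):
        return trial_id, run_one(fresh_copy(repo, trial_id), trial_id)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(job, trial_ids))


if __name__ == "__main__":
    # Tiny stand-in repo and a fake "edit" so the sketch runs anywhere.
    src = Path(tempfile.mkdtemp(prefix="src-")) / "kb"
    src.mkdir()
    (src / "index.md").write_text("base", encoding="utf-8")

    def fake_edit(copy: Path, trial_id: str) -> str:
        (copy / "index.md").write_text(trial_id, encoding="utf-8")
        return (copy / "index.md").read_text(encoding="utf-8")

    print(run_trials(src, ["t1", "t2", "t3"], fake_edit))
    print((src / "index.md").read_text(encoding="utf-8"))  # source repo untouched
```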