Microsoft Study: AI Agents Lose 25% of Document Content Over 20 Steps
Every engineering lead who has greenlit an "agentic workflow" pilot in the last twelve months should clear an hour today and read this. Microsoft's own research team has put numbers on something a lot of platform engineers already suspected: hand a frontier model a long-running document task, walk away, and come back to garbage. Not a little garbage. A quarter of the file gone or wrong.
What Happened
On Monday evening, three Microsoft Research scientists, Philippe Laban, Tobias Schnabel, and Jennifer Neville, dropped a preprint with a title that does not bury the lede: "LLMs Corrupt Your Documents When You Delegate." As The Register reported, the team built a benchmark called DELEGATE-52 that simulates multistep workflows across 52 professional domains, ranging from writing code to crystallography to music notation.
The results are bleak. Frontier models, namely Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4, lose on average 25 percent of document content over 20 delegated interactions. Average degradation across all tested models is 50 percent. The researchers set the "ready" threshold at 98 percent or higher integrity after 20 interactions. Out of 52 domains, exactly one cleared the bar: Python programming.
The best performer, Google Gemini 3.1 Pro, was ready for 11 of 52 domains. Catastrophic corruption, defined as a benchmark score of 80 percent or less, occurred in more than 80 percent of model/domain combinations. The accounting domain test is illustrative: the seed document is the ledger of Hack Club, a nonprofit, and the task is to split it into category files and merge them back chronologically. Boring, real, the kind of thing a junior analyst does on a Tuesday. The models botched it.
The team also wired four GPT variants (5.4, 5.2, 5.1, and 4.1) into an agentic harness with file read, write, and code execution. Tools made things worse, not better, adding an average 6 percent degradation by simulation end.
Technical Anatomy
The interesting part is not that models fail. It is how they fail. Errors do not creep in linearly. They detonate. The researchers found that when corruption happens, it tends to wipe out 10 to 30 points of integrity in a single round-trip interaction. The stronger models are not avoiding small errors better than weak ones. They delay the critical failure to a later round, then experience it in one shot.
That distinction matters for anyone designing evaluation pipelines. If your acceptance test runs two iterations and looks at output quality, you will ship a model that looks production-ready and then craters in week three of real use. The paper makes this explicit: performance after two interactions does not predict performance after 20. Short-horizon evals are actively misleading. I have seen this exact pattern in production incidents where a model demoed beautifully in a sandbox and silently rotted state once it ran unattended over a weekend.
There is also a qualitative split in failure mode. Weaker models delete content. Frontier models corrupt it. From a data-integrity standpoint, corruption is worse. Deletion is loud. You notice a missing row. Corruption is silent: a transposed digit in a ledger, a swapped variable name, a chord written in the wrong key. The kind of bug that surfaces during an audit, not during QA.
Agentic harnesses making things worse is the punchline. Giving the model tools (file I/O, code execution) does not improve DELEGATE-52 scores. It degrades them by another 6 percent. This contradicts the entire marketing premise behind products like Claude Cowork, which Anthropic describes as handling tasks autonomously on computers, local files, and applications, and Microsoft 365 Copilot, pitched as able to tackle complex, multistep research across your work data and the web. The vendor pitch and the vendor's own research are now in open disagreement.
Who Gets Burned
According to Deloitte, organizations are spending an average of 36 percent of their digital budgets on AI automation. On a team running a 10 million euro digital budget, that is 3.6 million euro flowing toward systems that, per Microsoft's own scientists, corrupt documents in 80 percent of simulated long-running conditions. That is not a rounding error. That is the entire platform engineering line item at most mid-sized operators.
The teams most exposed are the ones who bought the agent narrative hardest. Fintech back-office automation. iGaming compliance workflows where a regulator expects an immutable audit trail. Ad-tech reconciliation jobs that run nightly and touch finance data. Anything where the LLM is producing an artifact that downstream systems trust without a human pass.
My take: the next 90 days will produce a quiet wave of post-mortems inside companies that pushed agents into accounting, contract review, and reporting. Teams I've worked with on payment reconciliation have a hard rule that any automated mutation of a ledger requires a deterministic reconciliation pass afterward. The shops that skipped that step to "let the agent handle it end-to-end" are the ones who will be paging their CFO this quarter.
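For reference, that reconciliation rule costs almost nothing to encode. A minimal sketch, assuming the ledger parses into rows with a numeric amount column; the field names here are mine, not from any production system:

```python
# A minimal version of the "deterministic reconciliation pass" rule:
# after any automated mutation of a ledger, recompute invariants from
# scratch and compare them to the pre-mutation values. The "amount"
# field name is hypothetical; substitute your own schema.
from decimal import Decimal

def ledger_invariants(rows: list[dict]) -> tuple[int, Decimal]:
    """Row count and signed total, two invariants a split-and-merge
    workflow like the paper's accounting task must preserve."""
    return len(rows), sum((Decimal(r["amount"]) for r in rows), Decimal("0"))

def reconcile(before: list[dict], after: list[dict]) -> None:
    """Raise (and trigger rollback) if the agent's edit broke an invariant."""
    if ledger_invariants(before) != ledger_invariants(after):
        raise RuntimeError("ledger failed reconciliation after agent mutation")
```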
The uncomfortable read: vendors are not going to slow the marketing down. OpenAI's GPT family went from 14.7 percent to 71.5 percent in benchmark performance over 16 months, and that curve is what sales decks point at. But DELEGATE-52 is measuring something different: not capability on a single prompt, but integrity across 20 chained ones. Capability is racing ahead. Reliability over time is not following.
Playbook for AI Development
If you are shipping anything agentic in the next quarter, here is what the paper forces onto your roadmap.
First, throw out two-shot evaluation. Anything you put in front of customers needs a long-horizon eval that runs at least 20 chained interactions on representative documents. If you do not have one, build it this sprint. The DELEGATE-52 methodology gives you the template.
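A minimal sketch of what that eval loop could look like, assuming you supply run_agent_step (one delegated round-trip) and integrity_score (your domain's 0-to-1 integrity metric); the names and structure are illustrative, not taken from the paper's harness:

```python
# Sketch of a 20-step chained eval. run_agent_step() and integrity_score()
# are stand-ins for your own agent call and your own domain metric.
N_STEPS = 20        # the paper's horizon
READY = 0.98        # the paper's readiness threshold
BURST_DROP = 0.10   # flag the 10-to-30-point single-step collapses

def long_horizon_eval(seed_doc, tasks, run_agent_step, integrity_score):
    doc, scores, bursts = seed_doc, [], []
    for step, task in enumerate(tasks[:N_STEPS], start=1):
        doc = run_agent_step(doc, task)      # one delegated round-trip
        score = integrity_score(doc)         # 0.0 to 1.0, domain-specific
        if scores and scores[-1] - score >= BURST_DROP:
            bursts.append(step)              # catastrophic single-step drop
        scores.append(score)
    return {
        "final": scores[-1],
        "ready": scores[-1] >= READY,
        "burst_steps": bursts,               # where integrity detonated
        "per_step": scores,
    }
```

The per-step scoring is the point: a final score alone hides exactly the bursty failure mode the paper describes.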
Second, scope agents to Python-shaped problems. The one domain that cleared the readiness bar was programming. That is not a coincidence. Code has a compiler. Code has tests. Code has deterministic verification. If your task lacks an oracle that can say "this output is structurally valid," you are flying blind. Build the oracle first, then let the agent operate inside it.
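To make "build the oracle first" concrete: for a ledger task like the Hack Club example, the oracle can be as small as a schema-and-balance check that every agent write must pass before anything downstream reads it. A sketch, with hypothetical column names:

```python
# Deterministic oracle for a CSV ledger: schema check plus balance check.
# Column names and the expected-total convention are illustrative.
import csv, io
from decimal import Decimal, InvalidOperation

REQUIRED = ("date", "category", "amount")

def ledger_oracle(csv_text: str, expected_total: Decimal) -> list[str]:
    """Return structural violations; an empty list means the output passes."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or not set(REQUIRED) <= set(reader.fieldnames):
        return [f"missing required columns {REQUIRED}"]
    errors, total = [], Decimal("0")
    for lineno, row in enumerate(reader, start=2):   # header is line 1
        try:
            total += Decimal(row["amount"])
        except (InvalidOperation, TypeError):
            errors.append(f"line {lineno}: bad amount {row['amount']!r}")
    if not errors and total != expected_total:
        errors.append(f"ledger sums to {total}, expected {expected_total}")
    return errors
```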
Third, version everything the agent touches. Treat agent outputs like untrusted user input. Snapshot the document before each interaction, diff after, and require human or rule-based approval on any change exceeding a threshold. A 10 to 30 point drop in one round-trip is detectable if you are watching for it.
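Here is one way to wire that guardrail, assuming plain-text documents on disk; the 10 percent threshold echoes the low end of the paper's observed single-step drops and should be tuned per domain:

```python
# Snapshot-and-diff guardrail: treat every agent edit as untrusted input.
# All names are illustrative, not from any particular framework.
import difflib, hashlib, shutil
from pathlib import Path

MAX_CHANGE_RATIO = 0.10   # hold edits that alter more than 10% of the doc

def guarded_apply(path: Path, new_text: str, snapshot_dir: Path) -> bool:
    old_text = path.read_text()
    # 1. Snapshot before mutation, keyed by content hash.
    digest = hashlib.sha256(old_text.encode()).hexdigest()[:12]
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(path, snapshot_dir / f"{path.name}.{digest}")
    # 2. Diff after: how much of the document did the agent change?
    ratio = 1.0 - difflib.SequenceMatcher(None, old_text, new_text).ratio()
    if ratio > MAX_CHANGE_RATIO:
        return False          # route to human or rule-based approval
    path.write_text(new_text)
    return True
```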
Fourth, be skeptical of tool-augmented harnesses. The default assumption in the industry is that giving the model code execution and file I/O makes it better. The data says the opposite for long workflows. If you are looking at MCP-based integrations or similar agent frameworks, treat tool access as additional surface area for failure, not a free reliability upgrade.
Fifth, write the kill switch before the launch press release. Verdict: no agentic workflow ships without a one-command rollback and an integrity check that runs on a schedule independent of the agent itself.
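The kill switch does not need to be elaborate; it needs to exist and to run outside the agent's own process. A sketch of a cron-driven checker that restores the most recent snapshot when the oracle fails, assuming the snapshot naming scheme above (names are illustrative):

```python
# Scheduled integrity check with one-command rollback. The oracle is any
# callable returning a list of violations, like ledger_oracle above.
import shutil, sys
from pathlib import Path

def latest_snapshot(snapshot_dir: Path, name: str) -> Path | None:
    """Most recent pre-mutation copy written by the guardrail above."""
    candidates = sorted(snapshot_dir.glob(f"{name}.*"),
                        key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None

def run_check(doc: Path, snapshot_dir: Path, oracle) -> int:
    """Exit code drives alerting: 0 clean, 1 rolled back, 2 unrecoverable."""
    violations = oracle(doc.read_text())
    if not violations:
        return 0
    snap = latest_snapshot(snapshot_dir, doc.name)
    if snap is None:
        print(f"CORRUPT with no snapshot to restore: {violations}", file=sys.stderr)
        return 2
    shutil.copy2(snap, doc)   # the one-command rollback
    print(f"rolled back {doc} to {snap}: {violations}", file=sys.stderr)
    return 1

# Schedule independently of the agent, e.g. every 15 minutes in cron,
# and alert on any nonzero exit.
```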
Key Takeaways
- Frontier models lose 25 percent of document content over 20 delegated interactions; only Python programming met the 98 percent readiness threshold across 52 tested domains.
- Failures are catastrophic and bursty, dropping 10 to 30 integrity points in a single round-trip, which makes short-horizon evaluation actively misleading.
- Agentic harnesses with file I/O and code execution made things worse by an additional 6 percent, contradicting the core pitch behind Copilot-style products.
- With organizations putting 36 percent of digital budgets into AI automation, the gap between vendor marketing and Microsoft's own research is now a procurement-level risk.
- Ship long-horizon evals, deterministic oracles, snapshot-and-diff guardrails, and a tested rollback before any agent touches a document that downstream systems trust.
Frequently Asked Questions
Q: What is the DELEGATE-52 benchmark?
DELEGATE-52 is a Microsoft Research benchmark that simulates multistep knowledge work across 52 professional domains, including coding, crystallography, accounting, and music notation. It measures how well an LLM preserves document integrity over 20 chained delegated interactions, rather than scoring a single prompt response.
Q: Why did agentic tool use make models perform worse?
When the four tested GPT variants were given file read, write, and code execution through a basic harness, they incurred an additional 6 percent average degradation by the end of the simulation. The paper suggests tool access expands the surface area for compounding errors rather than helping models self-correct on long-running tasks.
Q: Should teams stop building with AI agents based on these findings?
No, but they should narrow the scope. The one domain that hit readiness was Python programming, where deterministic verification exists. Teams should restrict agents to tasks with strong oracles (compilers, tests, schema validators) and add long-horizon evaluation, snapshot diffs, and rollback paths before letting agents mutate documents unsupervised.