Krafton Cut Incidents 77% Without AI Magic, Just Platform Work
Every on-call engineer who has watched a dashboard go red at 3am knows the truth nobody at a vendor keynote wants to say out loud: the difference between a five-minute blip and a national-news outage is almost never the tooling. Krafton, the Seoul studio behind PUBG: Battlegrounds, just shipped the proof. Incident volume fell from 107 in 2024 to 24 so far in 2026, and mean time to repair collapsed from 53.5 minutes to 10.3. That is the headline, and the AI agents had almost nothing to do with it.
The Numbers
Let's stay with Krafton for a moment, because the numbers are unusual. As TechTarget reported from Datadog's DASH breakout sessions on June 10, Junghun Kim's team also dropped time-to-detect from 8.8 minutes to 1.6 minutes. A 5.5x improvement on detection, a 5.2x improvement on repair, and a 4.5x cut in raw incident count. For a game with millions of concurrent users, every one of those minutes maps to revenue, refunds, and social-media damage.
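The multipliers above are simple before/after ratios, and they are easy to check. The sketch below recomputes them from the figures quoted in this article; the function names are mine, not anything from Krafton or Datadog. The incident count works out to roughly a 77.6% reduction, which the headline gives as 77%. Keep in mind that the 2026 incident count is a partial-year figure, so that ratio is less like-for-like than the two timing ratios.

```python
def improvement_factor(before: float, after: float) -> float:
    """How many times smaller the 'after' value is than the 'before' value."""
    return before / after


def percent_reduction(before: float, after: float) -> float:
    """Relative drop from 'before' to 'after', as a percentage."""
    return (before - after) / before * 100


# Figures as quoted in the article: (before, after).
metrics = {
    "incidents": (107, 24),
    "mttd_minutes": (8.8, 1.6),
    "mttr_minutes": (53.5, 10.3),
}

for name, (before, after) in metrics.items():
    factor = improvement_factor(before, after)
    drop = percent_reduction(before, after)
    print(f"{name}: {factor:.1f}x better, {drop:.1f}% lower")
```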
For context, the SRE teams I've worked with on European iGaming platforms would consider a sub-two-minute MTTD on a complex distributed system close to the practical floor. You can detect faster than that, but only by drowning responders in false positives. Krafton hitting 1.6 minutes while also reducing total incidents is the harder achievement. It means their signal quality went up at the same time as their response speed.
The MTTR number deserves a longer look. Going from 53.5 to 10.3 minutes is the kind of curve you usually only see when a team replaces a brittle deployment process. According to Kim, it took nine years of refinement, with the most recent step being a consolidation from five observability tools into one, Datadog. Five-into-one tool consolidations look clean on a slide. In the ones I've seen in production, they usually take twelve to eighteen months and break a few dashboards along the way. Nine years of compounding context is what made this work.
Then there's Getswish, the operator of Sweden's Swish payment app. A 2021 outage hit national press before the outsourced IT provider even knew about it. By late 2024, a comparable incident was detected and resolved in-house within five hours. Five hours is not a brag in iGaming terms, but for a payment rail running on an outsourced operations contract three years earlier, it's a structural change in who controls the blast radius. Jonas Cronholm-Lundin's team got there by rearchitecting code delivery on GitOps and rebuilding incident practice on Google's SRE book. No agents. No magic.
What's Actually New
Datadog shipped more than 100 updates at DASH. Two are worth reading the release notes for: the Runtime Prioritization Engine with auto-tagging for security vulnerabilities, and the Auto-Processing feature inside Observability Pipelines. Both are AI-assisted classifiers running over data you already have. Krafton is also using Datadog's MCP server and the newly released Pup CLI to give coding agents access to incident context across Datadog, Kubernetes, Jira and Slack.
That last detail is the genuinely new bit. MCP plus a CLI means agents can read your incident state from four canonical sources without a custom integration layer. For teams that have spent the last two years duct-taping Slack bots to PagerDuty webhooks, this is a meaningful primitive. It is also still strictly assistive. Kim was explicit about where agents are used today: cross-source debugging, postmortem drafting, runbook generation, on-call handoff documentation. None of those touch production.
Kim's quote on autonomy is the one to pin to the wall. "Today AI can still make wrong judgments during incidents, and if it takes critical action that's hard to roll back, the risk is too high for production reliability." That is a senior engineer at a games company explaining, in 2026, why his team has not handed the steering wheel to an agent. It maps almost word-for-word to what platform leads at fintechs I've worked with have been saying privately for eighteen months.
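One way to make that line between assistive and production-mutating concrete is a small policy gate in front of whatever tools an agent can call. The sketch below is purely illustrative: `AgentAction`, `Effect`, and `decide` are hypothetical names of mine, not part of Datadog's MCP server, the Pup CLI, or Krafton's platform. It assumes every tool can be labeled up front as read-only or production-mutating.

```python
from dataclasses import dataclass
from enum import Enum


class Effect(Enum):
    READ_ONLY = "read_only"                    # queries, summaries, drafts
    MUTATES_PRODUCTION = "mutates_production"  # restarts, rollbacks, config pushes


@dataclass(frozen=True)
class AgentAction:
    """A hypothetical description of something an agent wants to do."""
    name: str
    effect: Effect


def decide(action: AgentAction, human_approved: bool = False) -> str:
    """Allow read-only work; hold anything that changes production for a human."""
    if action.effect is Effect.READ_ONLY:
        return "allow"
    if human_approved:
        return "allow"
    return "needs_human_approval"


# Drafting a postmortem only reads data, so it passes straight through.
print(decide(AgentAction("draft_postmortem", Effect.READ_ONLY)))
# A restart changes production, so it waits for a person.
print(decide(AgentAction("restart_game_server", Effect.MUTATES_PRODUCTION)))
```

The design choice is that the default for anything mutating is to wait, so a mislabeled or brand-new tool fails safe only if it is classified conservatively in the first place.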
Getswish's contribution to the new playbook is structural. Their runbooks are now organized around the OODA loop, the decision framework imported from military aviation. Observe, orient, decide, act. The point of OODA in incident response is that it gives an agent, human or otherwise, a stable decomposition of what step you're on. If you ever do want to hand parts of the loop to automation, having the loop explicit is the prerequisite. Cronholm-Lundin's framing was correct: curated runbooks and post-mortems are training data for whatever agent you adopt next.
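To show what "having the loop explicit" could look like as data, here is a minimal sketch of a runbook whose steps are tagged with an OODA phase. It is not Getswish's actual format, which the talk did not detail; the class names and the example steps are invented for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Phase(Enum):
    OBSERVE = 1
    ORIENT = 2
    DECIDE = 3
    ACT = 4


@dataclass
class Step:
    phase: Phase
    instruction: str
    done: bool = False


@dataclass
class Runbook:
    title: str
    steps: List[Step] = field(default_factory=list)

    def current_phase(self) -> Optional[Phase]:
        """Return the earliest phase that still has an unfinished step."""
        for step in sorted(self.steps, key=lambda s: s.phase.value):
            if not step.done:
                return step.phase
        return None  # every step is complete


# Hypothetical example content, not taken from any real runbook.
runbook = Runbook(
    title="Elevated payment API latency",
    steps=[
        Step(Phase.OBSERVE, "Confirm the alert against the latency dashboard", done=True),
        Step(Phase.ORIENT, "Check recent deploys and upstream dependencies"),
        Step(Phase.DECIDE, "Choose between rollback and scaling out"),
        Step(Phase.ACT, "Execute the chosen action and record it"),
    ],
)
print(runbook.current_phase())
```

Because each step carries its phase, a human or an agent picking up the incident can ask the runbook where things stand instead of inferring it from a chat scrollback.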
What's Priced In for Engineering Teams
The market has already priced in the "AI assistant, not AI operator" framing. Every platform vendor is saying it. What's not priced in is how much foundational work has to happen before any of the assistance pays off.
Look at Krafton again: nine years of process refinement, a tool consolidation, ownership metadata wired into code paths, severity guardrails, and only then MCP plus coding agents on top. The agents are the last 10% of the stack. The reason they generate value is that the bottom 90% is clean. If your incidents still get triaged by whoever shouts loudest in Slack, sticking an agent on top will produce confidently wrong postmortems faster.
The other under-discussed item is data hygiene economics. Bryan Pierson at US Bank told DASH that not every log gets a first-class seat in Datadog's back end. They route the rest to S3 through Observability Pipelines for retention cost reasons. That is the boring, correct answer that every CFO will eventually demand. Observability bills have a way of going from "manageable" to "two engineers worth of budget" between one quarter and the next, and tiered routing to object storage is the standard escape hatch. OpenTelemetry pipelines make this portable if you want to keep the option open across vendors.
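The routing decision itself is small. The sketch below shows the idea in plain Python. It is not Observability Pipelines configuration and not US Bank's rule set; the `INDEXED_LEVELS` and `INDEXED_SERVICES` values are assumptions chosen for illustration. Logs that match go to the expensive indexed tier, and everything else goes to a cheap archive such as S3.

```python
from typing import Mapping

# Assumed policy: which logs earn a place in the indexed, queryable tier.
INDEXED_LEVELS = {"error", "critical"}
INDEXED_SERVICES = {"payments", "auth"}


def route(log: Mapping[str, str]) -> str:
    """Return 'index' for high-value logs and 'archive' for the rest."""
    level = log.get("level", "").lower()
    if level in INDEXED_LEVELS or log.get("service") in INDEXED_SERVICES:
        return "index"
    return "archive"


samples = [
    {"service": "matchmaking", "level": "debug"},
    {"service": "matchmaking", "level": "ERROR"},
    {"service": "payments", "level": "info"},
]
for log in samples:
    print(log["service"], log["level"], "->", route(log))
```

Keeping the policy in two small sets makes it something a platform team and a finance partner can review together, which is usually where the real argument about retention happens.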
My take: the agentic features will be table stakes in 18 months and nobody will remember which vendor shipped them first. What will still matter is whether your platform team built the substrate underneath.
Contrarian View
The contrarian read is that Krafton and Getswish are survivorship bias. Both companies had the budget, the engineering culture, and in Krafton's case nearly a decade of runway to get to this state. Most teams reading the DASH recap do not have nine years. They have a CFO asking why the observability bill is growing 40% a year and a CTO who saw an agent demo and wants to know when incidents will fix themselves.
For those teams, the honest answer might be that an AI assistant on top of a mediocre platform is still better than nothing. Eric Swanson, the SRE at MagicSchool AI in Denver, put the fear plainly: developers offloading log discipline to agents, and engineers blunting the skills they sharpened through critical thinking. He's right to worry. But the counter-position is that not every shop will build a Krafton-grade incident platform, and a competent agent floor might raise the median operator more than it lowers the ceiling on the top operators.
The uncomfortable read: most engineering orgs will skip the platform foundation, bolt on agents, and call it done. The vendors know this. It's why the marketing leads with autonomy and the case studies lead with platform engineering.
Key Takeaways
- The platform comes first. Krafton's 77% incident reduction and 5x MTTR improvement came from nine years of process work, not from agents bolted on at the end.
- MCP plus CLI is the new integration primitive. Datadog's MCP server and Pup CLI let agents read incident context across observability, Kubernetes, Jira and Slack without bespoke glue. Worth a proof-of-concept this quarter.
- Keep agents on the assistive side of the line. Postmortems, runbooks, handoff docs, cross-source debugging. Production-mutating actions stay human-gated until rollback is cheap.
- Structure runbooks for both humans and agents. Getswish's OODA-loop runbooks double as training material. Curate now, harvest later.
- Tier your observability data before the bill tiers you. US Bank routes lower-value logs to S3 via Observability Pipelines. Standard practice now, not an optimization.
Frequently Asked Questions
Q: Did AI agents drive Krafton's incident reduction?
No. Junghun Kim was explicit that the improvements came from the response platform Krafton built over nine years, not from AI. Agents currently assist with cross-source debugging, postmortem drafting, and on-call documentation, but humans still make production decisions.
Q: What did Datadog actually release at DASH 2026?
More than 100 updates, with the notable ones being a new Runtime Prioritization Engine with auto-tagging for security vulnerabilities, an Auto-Processing feature for log management inside Observability Pipelines, and the Pup CLI that pairs with Datadog's MCP server to give coding agents structured access to incident context.
Q: Should engineering teams adopt agentic incident management now?
Only on top of a clean platform foundation. Both Krafton and Getswish emphasized that curated runbooks, ownership metadata, severity guardrails and GitOps-based delivery have to exist first. Agents amplify whatever process is underneath them, including the broken parts.