
Krafton Cut Incidents 77% Without AI Magic, Just Platform Work

18 Jun 2026 · 7 min read · Alex Drover

Every on-call engineer who has watched a dashboard go red at 3am knows the truth nobody at a vendor keynote wants to say out loud: the difference between a five-minute blip and a national-news outage is almost never the tooling. Krafton, the Seoul studio behind PUBG: Battlegrounds, just shipped the proof. Incident volume fell from 107 in 2024 to 24 so far in 2026, and mean time to repair collapsed from 53.5 minutes to 10.3. That is the headline, and the AI agents had almost nothing to do with it.

The Numbers

Let's stay with Krafton for a moment, because the numbers are unusual. As TechTarget reported from Datadog's DASH breakout sessions on June 10, Junghun Kim's team also dropped time-to-detect from 8.8 minutes to 1.6 minutes. A 5.5x improvement on detection, a 5.2x improvement on repair, and a 4.5x cut in raw incident count. For a game with millions of concurrent users, every one of those minutes maps to revenue, refunds, and social-media damage.
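If you want to check those multiples yourself, the arithmetic is short. This is a minimal sketch using only the figures quoted above; the `improvement` helper and the dictionary layout are my own illustration, not anything Krafton or Datadog published. Note that the 2026 incident count is "so far", so the incident ratio compares a full year against a partial one.

```python
# Before/after figures as quoted in the article.
# The 2026 incident count covers only part of the year ("so far").
before_after = {
    "incidents": (107, 24),
    "mttd_minutes": (8.8, 1.6),
    "mttr_minutes": (53.5, 10.3),
}


def improvement(before: float, after: float) -> tuple[float, float]:
    """Return (ratio, percent reduction) for a metric where lower is better."""
    ratio = before / after
    reduction_pct = (before - after) / before * 100
    return ratio, reduction_pct


for name, (before, after) in before_after.items():
    ratio, pct = improvement(before, after)
    print(f"{name}: {ratio:.1f}x, {pct:.1f}% lower")
```

The ratio and the percentage describe the same change in two ways: a 4.5x cut in incident count is the same thing as a reduction of roughly 77%.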

For context, the SRE teams I've worked with on European iGaming platforms would consider a sub-two-minute MTTD on a complex distributed system close to the practical floor. You can detect faster than that, but only by drowning responders in false positives. Krafton hitting 1.6 minutes while also reducing total incidents is the harder achievement. It means their signal quality went up at the same time as their response speed.

The MTTR number deserves a longer look. Going from 53.5 to 10.3 minutes is the kind of curve you usually only see when a team replaces a brittle deployment process. According to Kim, it came from nine years of refinement, with the most recent step being a consolidation from five observability tools into one, Datadog. Five-into-one tool consolidations look clean on a slide. In the ones I've seen in production, they usually take twelve to eighteen months and break a few dashboards along the way. Nine years of compounding context is what made this work.
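For teams that do not yet track these two numbers, here is one way they might be derived from incident records. It is a sketch under stated assumptions: the `Incident` fields and the sample timestamps are hypothetical, and I measure both metrics from the moment the fault began. Some teams start the MTTR clock at detection instead, so agree on a definition before comparing your numbers with anyone else's.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime   # when the fault began (assumed to be known after the fact)
    detected: datetime  # when an alert or a human noticed it
    resolved: datetime  # when service was restored


def minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60


def mttd(incidents: list[Incident]) -> float:
    """Mean time to detect, in minutes, measured from fault start."""
    return mean(minutes(i.detected - i.started) for i in incidents)


def mttr(incidents: list[Incident]) -> float:
    """Mean time to repair, in minutes, measured from fault start (an assumption)."""
    return mean(minutes(i.resolved - i.started) for i in incidents)


# Hypothetical sample data, not Krafton's.
t0 = datetime(2026, 1, 1, 3, 0)
sample = [
    Incident(t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=12)),
    Incident(t0, t0 + timedelta(minutes=1), t0 + timedelta(minutes=8)),
]
print(mttd(sample), mttr(sample))
```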

Then there's Getswish, the operator of Sweden's Swish payment app. A 2021 outage hit national press before the outsourced IT provider even knew about it. By late 2024, a comparable incident was detected and resolved in-house within five hours. Five hours is not a brag in iGaming terms, but for a payment rail running on an outsourced operations contract three years earlier, it's a structural change in who controls the blast radius. Jonas Cronholm-Lundin's team got there by rearchitecting code delivery on GitOps and rebuilding incident practice on Google's SRE book. No agents. No magic.

What's Actually New

Datadog shipped more than 100 updates at DASH. The two that engineering leads should actually read the release notes for are the Runtime Prioritization Engine with auto-tagging for security vulnerabilities, and the Auto-Processing feature inside Observability Pipelines. Both are AI-assisted classifiers running over data you already have. Krafton is also using Datadog's MCP server and the newly released Pup CLI to give coding agents access to incident context across Datadog, Kubernetes, Jira and Slack.

That last detail is the actually-new bit. MCP plus a CLI means agents can read your incident state from four canonical sources without a custom integration layer. For teams that have spent the last two years duct-taping Slack bots to PagerDuty webhooks, this is a meaningful primitive. It is also still strictly assistive. Kim was explicit about what the agents do: cross-source debugging, postmortem drafting, runbook generation, on-call handoff documentation. None of those touch production.

Kim's quote on autonomy is the one to pin to the wall. "Today AI can still make wrong judgments during incidents, and if it takes critical action that's hard to roll back, the risk is too high for production reliability." That is a senior engineer at a games company explaining, in 2026, why his team has not handed the steering wheel to an agent. It maps almost word-for-word to what platform leads at fintechs I've worked with have been saying privately for eighteen months.
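The line Kim draws can be written down as a very small policy check. What follows is a hypothetical sketch of that idea, not Krafton's implementation: `Kind`, `ProposedAction` and `may_run` are names I made up, and a real gate would sit inside whatever approval workflow your team already uses.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    ASSISTIVE = "assistive"  # reads data or drafts documents
    MUTATING = "mutating"    # changes production state


@dataclass(frozen=True)
class ProposedAction:
    name: str
    kind: Kind


def may_run(action: ProposedAction, human_approved: bool) -> bool:
    """Assistive actions run freely; production-mutating ones need a human."""
    if action.kind is Kind.ASSISTIVE:
        return True
    return human_approved


draft = ProposedAction("draft postmortem", Kind.ASSISTIVE)
restart = ProposedAction("restart matchmaking pods", Kind.MUTATING)

print(may_run(draft, human_approved=False))    # allowed without sign-off
print(may_run(restart, human_approved=False))  # blocked until a human approves
```

The design choice worth copying is that the default for anything mutating is "no". The agent has to earn a yes from a person, instead of a person having to catch a bad action in time.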

Getswish's contribution to the new playbook is structural. Their runbooks are now organized around the OODA loop, the decision framework imported from military aviation. Observe, orient, decide, act. The point of OODA in incident response is that it gives a responder, human or otherwise, a stable way to know which step of the response they are on. If you ever do want to hand parts of the loop to automation, having the loop explicit is the prerequisite. Cronholm-Lundin's framing was correct: curated runbooks and postmortems are training data for whatever agent you adopt next.
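To make "explicit loop" concrete, here is one way a runbook might be structured so that every step is tagged with its OODA phase. This is my own illustrative data model, not Getswish's format; the `Runbook` class, its methods and the sample steps are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    OBSERVE = 1
    ORIENT = 2
    DECIDE = 3
    ACT = 4


@dataclass
class Step:
    phase: Phase
    instruction: str


@dataclass
class Runbook:
    title: str
    steps: list[Step] = field(default_factory=list)

    def by_phase(self, phase: Phase) -> list[str]:
        """Instructions belonging to one phase, in the order they were written."""
        return [s.instruction for s in self.steps if s.phase is phase]

    def missing_phases(self) -> set[Phase]:
        """Phases the runbook never covers; a quick completeness check."""
        return set(Phase) - {s.phase for s in self.steps}


# Hypothetical runbook content for illustration only.
runbook = Runbook(
    title="Payment API latency spike",
    steps=[
        Step(Phase.OBSERVE, "Check p99 latency and error-rate dashboards."),
        Step(Phase.ORIENT, "Compare against the last deploy and upstream status."),
        Step(Phase.DECIDE, "Choose between rollback and scaling out."),
    ],
)
print(runbook.missing_phases())
```

Here `missing_phases()` flags that nobody wrote down the Act step, which is exactly the kind of gap you want to find before an incident and before an agent reads the runbook.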

What's Priced In for Engineering Teams

The market has already priced in the "AI assistant, not AI operator" framing. Every platform vendor is saying it. What's not priced in is how much foundational work has to happen before any of the assistance pays off.

Look at Krafton again: nine years of process refinement, a tool consolidation, ownership metadata wired into code paths, severity guardrails, and only then MCP plus coding agents on top. The agents are the last 10% of the stack. The reason they generate value is that the bottom 90% is clean. If your incidents still get triaged by whoever shouts loudest in Slack, sticking an agent on top will produce confidently wrong postmortems faster.

The other under-discussed item is data hygiene economics. Bryan Pierson at US Bank told DASH that not every log gets a first-class seat in Datadog's back end. They route the rest to S3 through Observability Pipelines for retention cost reasons. That is the boring, correct answer that every CFO will eventually demand. Observability bills have a way of going from "manageable" to "two engineers worth of budget" between one quarter and the next, and tiered routing to object storage is the standard escape hatch. OpenTelemetry pipelines make this portable if you want to keep the option open across vendors.
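The routing decision itself is simple to express. The sketch below is a generic illustration of tiering, not Observability Pipelines configuration and not US Bank's policy: the level and service sets, the `route` function and the destination names are assumptions I picked for the example.

```python
from typing import Literal

Destination = Literal["indexed", "archive"]

# Hypothetical policy: which logs earn a seat in the indexed (expensive) tier.
INDEXED_LEVELS = {"ERROR", "CRITICAL"}
INDEXED_SERVICES = {"payments", "matchmaking"}


def route(log: dict) -> Destination:
    """Send high-value logs to the indexed tier and the rest to cheap object storage."""
    level = str(log.get("level", "")).upper()
    if level in INDEXED_LEVELS or log.get("service") in INDEXED_SERVICES:
        return "indexed"
    return "archive"


print(route({"level": "error", "service": "cdn"}))      # indexed
print(route({"level": "info", "service": "cdn"}))       # archive
print(route({"level": "info", "service": "payments"}))  # indexed
```

In practice this rule would live in the pipeline tool's own configuration. Writing it as a function first is a cheap way to get agreement on the policy before anyone touches the pipeline.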

My take: the agentic features will be table stakes in 18 months and nobody will remember which vendor shipped them first. What will still matter is whether your platform team built the substrate underneath.

Contrarian View

The contrarian read is that Krafton and Getswish are survivorship bias. Both companies had the budget, the engineering culture, and in Krafton's case nearly a decade of runway to get to this state. Most teams reading the DASH recap do not have nine years. They have a CFO asking why the observability bill is growing 40% a year and a CTO who saw an agent demo and wants to know when incidents will fix themselves.

For those teams, the honest answer might be that an AI assistant on top of a mediocre platform is still better than nothing. Eric Swanson, the SRE at MagicSchool AI in Denver, put the fear plainly: developers offloading log discipline to agents, and engineers blunting the skills they sharpened through critical thinking. He's right to worry. But the counter-position is that not every shop will build a Krafton-grade incident platform, and a competent agent floor might raise the median operator more than it lowers the ceiling on the top operators.

The uncomfortable read: most engineering orgs will skip the platform foundation, bolt on agents, and call it done. The vendors know this. It's why the marketing leads with autonomy and the case studies lead with platform engineering.

Key Takeaways

The platform comes first. Krafton's 77% incident reduction and 5x MTTR improvement came from nine years of process work, not from agents bolted on at the end.
MCP plus CLI is the new integration primitive. Datadog's MCP server and Pup CLI let agents read incident context across observability, Kubernetes, Jira and Slack without bespoke glue. Worth a proof-of-concept this quarter.
Keep agents on the assistive side of the line. Postmortems, runbooks, handoff docs, cross-source debugging. Production-mutating actions stay human-gated until rollback is cheap.
Structure runbooks for both humans and agents. Getswish's OODA-loop runbooks double as training material. Curate now, harvest later.
Tier your observability data before the bill tiers you. US Bank routes lower-value logs to S3 via Observability Pipelines. Standard practice now, not an optimization.

Frequently Asked Questions

Q: Did AI agents drive Krafton's incident reduction?

No. Junghun Kim was explicit that the improvements came from the response platform Krafton built over nine years, not from AI. Agents currently assist with cross-source debugging, postmortem drafting, and on-call documentation, but humans still make production decisions.

Q: What did Datadog actually release at DASH 2026?

More than 100 updates, with the notable ones being a new Runtime Prioritization Engine with auto-tagging for security vulnerabilities, an Auto-Processing feature for log management inside Observability Pipelines, and the Pup CLI that pairs with Datadog's MCP server to give coding agents structured access to incident context.

Q: Should engineering teams adopt agentic incident management now?

Only on top of a clean platform foundation. Both Krafton and Getswish emphasized that curated runbooks, ownership metadata, severity guardrails and GitOps-based delivery have to exist first. Agents amplify whatever process is underneath them, including the broken parts.

Alex Drover

RiverCore Analyst · Dublin, Ireland
