
DevZero Bets Checkpoint-Restore Can Rightsize K8s Without Restarts

13 Jun 2026 · 7 min read · James O'Brien


Think of Kubernetes capacity planning like a Dublin pub on a Friday night: the landlord always stocks too many kegs because running dry once costs more than the unsold stout. That's how most platform teams provision clusters. DevZero just rolled out a tool that claims it can quietly shift the kegs around while the band is still playing, no last orders called.

What Happened

On Thursday, as IT Brief UK reported, Seattle-based DevZero launched an autonomous infrastructure optimisation platform for Kubernetes workloads that rightsizes resources in real time without restarts. The pitch is straightforward: stop overprovisioning, stop paying for idle, and stop bouncing pods to do it.

The company was founded in 2022 by former Uber engineers Debo Ray and Rob Fletcher, originally around a cloud development platform aimed at improving software engineering productivity. The founders ran that service on Kubernetes themselves, hit the same inefficiencies everyone else does, and built tools to deal with them. Those tools have now become the main product.

This puts DevZero on a collision course with Cast.ai and ScaleOps, the two names most platform leads already have on a vendor shortlist when their CFO starts asking awkward questions about EC2 spend. The differentiator DevZero is leaning on is checkpoint-restore technology, which it says enables live migration of workloads during demand shifts or infrastructure disruptions.

The customer roster they're publicising includes DataBahn, Dentira, Starburst, OpenObserve and Outerbounds, a noticeably AI- and data-platform-heavy list. Backers are Anthos Capital, Foundation Capital and Madrona.

DevZero claims its average client had been overspending on compute by 53% before adopting the platform, and that users typically cut compute bills by 30% to 60%. The company itself flags that those reduction figures haven't been independently verified, which is at least honest. It thinks the market is ready for two reasons. A Cloud Native Computing Foundation survey found that 66% of organisations hosting generative AI models use Kubernetes to manage some or all of their inference workloads. And Datadog research found that 83% of container costs go to idle resources, with 54% coming from overprovisioned cluster infrastructure.

Technical Anatomy

The interesting bit is the mechanism, not the dashboard. DevZero operates at the cluster, node and workload levels, profiles resource demand, and adjusts CPU, memory and GPU allocation as usage moves. That part is table stakes. Cast.ai and ScaleOps do versions of it. The boring bit is the same across vendors: collect metrics, model demand, choose instance types, schedule.

The part where it gets spicy is checkpoint-restore. Anyone who has tried to vertically scale a stateful pod on vanilla Kubernetes knows the dance: in-place resource resize is still maturing, the VerticalPodAutoscaler historically wanted to evict the pod, and "evict the pod" is engineer-speak for "restart your workload and pray the connection pool reconnects". Checkpoint-restore (the same family of tech as CRIU, which CNCF projects have been circling for years) snapshots the running process state and resumes it elsewhere. No cold start. No JVM warm-up tax. No lost in-memory cache.

That matters in two specific places. First, AI inference. A large model loaded into GPU memory is expensive to evict. If you can migrate it live to a right-sized GPU node, you avoid both the cold-start latency and the temptation to overprovision GPU capacity "just in case". Second, availability zone failure. DataBahn's Mihir Nair, Head of Architecture, said: "During a recent availability zone outage, DevZero transparently migrated our workloads live without requiring a single restart or operational intervention from our team."

The platform also runs across AWS, Azure, GCP, OCI and OpenShift, and DevZero says it analyses more than 3,000 instance types, 69,000 price points, 23 GPU models and more than 80 regions to decide where things should land. The combinatorial space of cloud SKUs has been a quiet disaster for platform teams for years, and offloading that decision to a solver is, frankly, the only sane move at this point. The harder engineering question is whether the live migration claim holds under truly nasty workloads: things with TCP sessions, GPU memory-mapped buffers, kernel module dependencies. The marketing says yes. The 3am pager will be the judge.
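For a feel of what "offloading that decision to a solver" means at its very simplest, here is a minimal sketch: filter a catalogue down to offers that fit the demand, then take the cheapest. This is not DevZero's method, which the company has not published in detail. The `Offer` type, the `cheapest_fit` function, and every name, region and price in `CATALOGUE` are invented for illustration and do not correspond to real cloud SKUs.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Offer:
    """One (instance type, region) price point. All values here are made up."""
    name: str
    region: str
    vcpus: int
    memory_gib: int
    gpus: int
    hourly_price: float


def cheapest_fit(offers: Sequence[Offer], vcpus: int, memory_gib: int,
                 gpus: int = 0) -> Optional[Offer]:
    """Return the lowest-priced offer that covers the demand, or None."""
    candidates = [
        o for o in offers
        if o.vcpus >= vcpus and o.memory_gib >= memory_gib and o.gpus >= gpus
    ]
    return min(candidates, key=lambda o: o.hourly_price, default=None)


# Hypothetical catalogue: four price points instead of 69,000.
CATALOGUE = [
    Offer("general-4x16", "region-a", 4, 16, 0, 0.20),
    Offer("general-8x32", "region-a", 8, 32, 0, 0.38),
    Offer("general-4x16", "region-b", 4, 16, 0, 0.17),
    Offer("gpu-8x64-1", "region-a", 8, 64, 1, 1.10),
]

choice = cheapest_fit(CATALOGUE, vcpus=3, memory_gib=12)
print(choice)
```

A real solver also has to weigh spot interruption risk, data gravity, commitments already purchased and bin-packing across many pods, which is exactly why the search space gets ugly.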

Who Gets Burned

Cast.ai and ScaleOps are the obvious incumbents staring at this launch. Both have built strong stories around K8s cost optimisation, and both will now have to articulate why their rightsizing path, which generally involves pod replacement, is acceptable for AI inference and long-lived stateful services. Expect feature pages mentioning checkpoint-restore within a quarter. That's how this market moves.

The bigger group getting burned is the silent majority of platform teams who've spent the last two years writing internal Confluence pages titled "Kubernetes Cost Optimisation Roadmap" with zero shipped progress. Their CFO has now read a Datadog stat that says 83% of container costs go to idle resources. The grace period for "we're looking into Karpenter and HPA tuning" is closing fast.

The AI inference crowd is exposed in a different way. If you're running LLM inference on reserved GPU capacity sized for peak, your unit economics are quietly awful. The 66% Kubernetes-for-inference figure from the CNCF survey tells you the substrate is standardised enough that vendors like DevZero can target it directly. The teams who built bespoke inference stacks on raw EC2 are about to discover their cost story is worse than the K8s-native shops they used to mock.

Finally, the cloud providers themselves take a small nick. Every dollar a rightsizing platform saves a customer is a dollar that came off an AWS, Azure or GCP invoice. Hyperscalers tolerate this because the alternative (customers leaving) is worse, but you'll notice their own native tooling (Compute Optimizer, Azure Advisor) conveniently stops short of aggressive live rightsizing. Funny that.

Playbook for Engineering Teams

If you run a Kubernetes platform, there are three things to do this week. First, actually measure your idle ratio. Pull container CPU and memory requests against actual usage over a 14-day window. If you land anywhere near that 83% idle figure, you have a board-level cost story whether you like it or not. OpenTelemetry metrics piped to whatever you already use are enough; you do not need a new vendor to find out you have a problem.
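The idle-ratio measurement can be sketched in a few lines once the samples are exported from your metrics store. This is a minimal sketch, assuming one sample per container per scrape with requested and used CPU in cores; the `Sample` type and `idle_ratio` function are hypothetical names, and the three samples are invented. Usage is capped at the request so that one pod bursting above its request does not mask idle capacity elsewhere.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Sample:
    """One scrape of one container: what it requested vs. what it used."""
    requested_cpu_cores: float
    used_cpu_cores: float


def idle_ratio(samples: Iterable[Sample]) -> float:
    """Fraction of requested CPU that went unused across all samples."""
    samples = list(samples)
    requested = sum(s.requested_cpu_cores for s in samples)
    if requested == 0:
        return 0.0
    # Cap usage at the request: bursting above it is not "negative idle".
    used = sum(min(s.used_cpu_cores, s.requested_cpu_cores) for s in samples)
    return 1.0 - used / requested


# Invented samples standing in for a 14-day export.
window = [
    Sample(requested_cpu_cores=2.0, used_cpu_cores=0.5),
    Sample(requested_cpu_cores=1.0, used_cpu_cores=0.25),
    Sample(requested_cpu_cores=1.0, used_cpu_cores=1.5),  # bursting pod
]
print(f"idle ratio: {idle_ratio(window):.2%}")
```

The same function works for memory if you swap the fields. Note that this measures unused requests, not unused dollars; weighting by node price is a separate step.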

Second, separate the workloads that can tolerate eviction from the ones that cannot. Stateless HTTP services: fine, HPA and VPA will do most of the job. Stateful services, long-lived gRPC streams, GPU-loaded inference pods, anything with a meaningful warm-up: these are the workloads where checkpoint-restore actually earns its keep. Knowing your split lets you evaluate DevZero, Cast.ai, ScaleOps, or rolling your own on honest grounds.
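One way to make that split concrete is a small inventory script. The sketch below assumes you can tag each workload with a handful of attributes by hand or from labels; the `Workload` fields, the `eviction_tolerant` rule and the 30-second warm-up threshold are all assumptions chosen for illustration, not an industry standard.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Assumed cut-off: anything slower to warm up than this is treated as
# expensive to restart. Tune it to your own SLOs.
WARMUP_THRESHOLD_SECONDS = 30.0


@dataclass(frozen=True)
class Workload:
    name: str
    stateful: bool = False
    long_lived_connections: bool = False  # e.g. long-lived gRPC streams
    uses_gpu: bool = False
    warmup_seconds: float = 0.0


def eviction_tolerant(w: Workload) -> bool:
    """True if restarting the pod is cheap enough for HPA/VPA-style scaling."""
    return not (
        w.stateful
        or w.long_lived_connections
        or w.uses_gpu
        or w.warmup_seconds >= WARMUP_THRESHOLD_SECONDS
    )


def split(workloads: Iterable[Workload]) -> Tuple[List[str], List[str]]:
    """Return (tolerant, sensitive) workload names, preserving input order."""
    tolerant, sensitive = [], []
    for w in workloads:
        (tolerant if eviction_tolerant(w) else sensitive).append(w.name)
    return tolerant, sensitive


# Invented inventory for illustration.
inventory = [
    Workload("web-frontend"),
    Workload("orders-db", stateful=True),
    Workload("llm-inference", uses_gpu=True, warmup_seconds=240.0),
    Workload("jvm-batch-api", warmup_seconds=90.0),
]
tolerant, sensitive = split(inventory)
print("tolerant:", tolerant)
print("sensitive:", sensitive)
```

The share of spend sitting in the sensitive bucket tells you how much a live-migration feature is actually worth to you before any vendor call.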

Third, pilot on one non-critical cluster before believing any 30 to 60% savings number. DevZero itself flags that those figures weren't independently verified, and your workload mix is not the average. Run a four-week bake-off with real production-shaped traffic. Measure not just cost delta but p99 latency, restart counts, and operator interventions. Ray's own framing was "autonomous optimization they can trust at 3 am". Test it at 3am. On a Sunday. With a synthetic AZ failure. If it survives that, you have something. If it doesn't, you have a very expensive dashboard.
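The bake-off scorecard can be as simple as a diff between a baseline period and the pilot period. This is a minimal sketch, assuming you have already collected cost, request latencies, restart counts and operator interventions for both periods; `p99`, `compare` and the dictionary keys are hypothetical names, and the figures are invented. The percentile uses the nearest-rank method.

```python
from typing import Dict, Sequence


def p99(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty latency sample."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceil(0.99 * n), avoiding floating-point rounding surprises.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def compare(baseline: Dict, pilot: Dict) -> Dict[str, float]:
    """Deltas of pilot versus baseline; negative cost delta means savings."""
    return {
        "cost_delta_pct": (pilot["cost"] - baseline["cost"]) * 100 / baseline["cost"],
        "p99_delta_ms": p99(pilot["latencies_ms"]) - p99(baseline["latencies_ms"]),
        "restart_delta": pilot["restarts"] - baseline["restarts"],
        "intervention_delta": pilot["interventions"] - baseline["interventions"],
    }


# Invented numbers for illustration only.
baseline = {"cost": 1000.0, "latencies_ms": list(range(1, 101)),
            "restarts": 12, "interventions": 3}
pilot = {"cost": 700.0, "latencies_ms": list(range(1, 101)) + [250.0],
         "restarts": 2, "interventions": 4}

for metric, delta in compare(baseline, pilot).items():
    print(f"{metric}: {delta:+.1f}")
```

In this made-up run the bill drops and restarts fall, but p99 latency and operator interventions both get worse, which is the kind of trade-off a cost-only dashboard would hide.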

Key Takeaways

DevZero's launch puts checkpoint-restore at the centre of the K8s FinOps conversation, forcing Cast.ai and ScaleOps to answer the live-migration question.
The 83% idle container cost figure from Datadog plus 66% of GenAI inference running on K8s means rightsizing is now a CFO-visible line item, not a platform-team hobby.
AI inference and stateful workloads are where eviction-based autoscalers break down, and where live migration changes the economics.
DevZero's 30 to 60% savings claim is company-provided and unverified; treat it as a hypothesis to test, not a number to put in a board deck.
Back to the pub: the landlord who can move kegs mid-service without spilling a pint wins. Everyone else keeps overstocking. That's the bet DevZero is making, and it's a reasonable one.

Frequently Asked Questions

Q: What is checkpoint-restore and why does it matter for Kubernetes rightsizing?

Checkpoint-restore snapshots a running process's state (memory, open files, network sockets where supported) and resumes it elsewhere without a cold start. For Kubernetes, that means you can move a pod to a smaller or differently located node without restarting it, which is critical for AI inference workloads and long-lived stateful services where restarts cost real money and latency.

Q: How does DevZero compare to Cast.ai and ScaleOps?

All three target Kubernetes cost optimisation through rightsizing, autoscaling and instance selection. DevZero's differentiator is its use of checkpoint-restore for live workload migration during demand shifts or outages, which it argues lets it rightsize without the pod evictions that competing approaches typically require. Whether that holds up across complex production workloads is something each team should verify in a pilot.

Q: Are DevZero's claimed 30 to 60% compute savings credible?

They're plausible given Datadog's research that 83% of container costs go to idle resources, with 54% coming from overprovisioned cluster infrastructure, but DevZero itself notes the figures were company-provided and not independently verified. Your actual savings depend heavily on workload mix, current provisioning discipline and how aggressively you're willing to let autonomous tooling resize production.

James O'Brien

RiverCore Analyst · Dublin, Ireland
