
The Membrane Doctrine: Rethinking SRE Intake After an 83.9% TOIL Collapse

28 May 2026 · 6 min read · Marina Koval

The question every platform lead absorbing an acquisition this year should be putting to their CFO is not whether the integration timeline is realistic. It is whether the SRE team has been funded to re-tune its intake filters before the deal closes, or whether that work will be paid for later in burnout, attrition, and a 200-day cycle time. A new field report out of Trimble, written by the SRE director who lived through the collapse, makes the unit economics of that question uncomfortably visible.

What Happened

Andrea Valenti, Senior Director of SRE at Trimble, runs 38 engineers across multiple geographies. In 2023, by his own account, his organization stopped functioning. Not slowly. All at once, under a cascade of unbuffered change from several acquisitions absorbed in the same year, each importing its own definition of urgency, its own tribal knowledge, and its own undocumented manual processes.

The damage shows up cleanly in one number. As SD Times reported, TOIL, measured under Google's strict 5-point definition, climbed to 83.9%. For an SRE function whose health benchmark sits below 50%, that is not a degraded state. That is the engine seized.

What makes the case interesting is the trajectory before the breach. Each prior merger had been metabolized faster than the one before: two years, then one, then six months. The integration muscle was getting stronger, right up until the moment it tore. Recovery then ran through 2024 and into 2025. TOIL dropped to 59.7% in 2024 and to 44.7% in 2025, back below benchmark. P95 cycle time, which Valenti calls the true pulse of an agile organization, fell from 294 days in 2020 to 57 days in 2025.

The framework he credits is not a tool purchase or a vendor migration. He calls it the Membrane: a semi-permeable filter between engineers and the chaos of the outside world, calibrated through the intake board and triage criteria, drawing on Niklas Luhmann's systems theory and Adriano Olivetti's view of teams as communities rather than throughput resources.

Technical Anatomy

Strip the philosophy and what Valenti is describing is a queueing problem with explicit admission control. Most SRE orgs over-invest in what happens inside the boundary: observability stacks, automated runbooks, blameless postmortems, the reliability patterns Google codified a decade ago. That craft is mature. The boundary itself, what gets admitted into the work queue and in what shape, is treated as soft work, office politics, "people stuff." It rarely has an owner, a spec, or a test suite.

The Membrane reframes the intake board as the system's admission controller. Triage criteria are not policy documents. They are the mechanical settings for permeability: latency thresholds for what counts as urgent, rate limits on interrupt-driven work, dead-letter queues for requests that fail validation, escalation paths that act like circuit breakers. A team whose intake board looks like a parking lot of stalled cards has a filter that is too tight. A team whose board looks like a firehose has no filter at all. Both fail for the same reason: no engineer owns the calibration loop.
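
The mapping from triage criteria to mechanical settings can be sketched in a few lines of Python. This is a minimal illustration, not Trimble's implementation: the `Membrane` class, the `has_runbook` validation rule, and the threshold names are all hypothetical, chosen only to show a rate limit, a dead-letter queue, and a circuit breaker working together at the intake boundary.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ADMIT = "admit"
    DEFER = "defer"              # over the interrupt budget, wait for the next window
    DEAD_LETTER = "dead_letter"  # failed validation, returned to the requester
    ESCALATE = "escalate"        # circuit breaker tripped, route to the filter's owner


@dataclass
class Request:
    kind: str          # e.g. "deploy", "access", "incident"
    has_runbook: bool  # hypothetical validation rule: a documented process exists
    urgent: bool


@dataclass
class Membrane:
    known_kinds: set[str]
    interrupt_budget: int    # rate limit: max urgent admits per window
    breaker_threshold: int   # consecutive unknown kinds before the breaker trips
    admitted_urgent: int = 0
    unknown_streak: int = 0
    dead_letters: list[Request] = field(default_factory=list)

    def triage(self, req: Request) -> Decision:
        # Unknown request shapes never pass straight through.
        if req.kind not in self.known_kinds:
            self.unknown_streak += 1
            if self.unknown_streak >= self.breaker_threshold:
                return Decision.ESCALATE
            self.dead_letters.append(req)
            return Decision.DEAD_LETTER
        self.unknown_streak = 0

        # Known shape, but it fails validation.
        if not req.has_runbook:
            self.dead_letters.append(req)
            return Decision.DEAD_LETTER

        # Rate-limit interrupt-driven work.
        if req.urgent:
            if self.admitted_urgent >= self.interrupt_budget:
                return Decision.DEFER
            self.admitted_urgent += 1
        return Decision.ADMIT


if __name__ == "__main__":
    membrane = Membrane(known_kinds={"deploy"}, interrupt_budget=1, breaker_threshold=2)
    for req in (
        Request("deploy", True, True),
        Request("deploy", True, True),
        Request("legacy-batch", True, False),
        Request("legacy-batch", True, False),
    ):
        print(req.kind, membrane.triage(req).value)
```

The point of the sketch is that every setting is a named, testable value. A run of unfamiliar request kinds trips the breaker and lands on a human owner instead of leaking into the on-call rotation.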

The 2023 breach maps cleanly onto this model. Acquisitions injected new request types the existing filter had never seen. Without re-tuning, those requests passed straight through as if they were validated, dragging undocumented manual processes into the on-call rotation. TOIL is the lagging indicator of that failure. Valenti's recovery used the 83.9% figure as input data, not just a wound to lick: a signal to redesign the triage criteria. The drop from 59.7% to 44.7% across 2024 and 2025 is what calibrated admission control looks like in practice. The cycle-time collapse from 294 days to 57 days is the second-order effect: when interrupt load falls, engineers can hold context long enough to actually ship.

Who Gets Burned

The teams most exposed to this failure mode are the ones currently absorbing M&A or scaling through aggressive hiring. In fintech, that means series-B and series-C platforms acquiring smaller compliance or payments specialists. In iGaming, it is the operators rolling up regional licensees ahead of the next regulatory rewrite. In crypto infrastructure, it is exchanges and custody providers bolting on prime-brokerage or RWA tooling teams. Each of those deals imports an SRE liability that almost never shows up in the diligence model.

The unit economics are ugly when you write them out. Thirty-eight engineers at fully-loaded cost is a seven-figure annual line item before you count on-call differential. At 83.9% TOIL, more than four-fifths of that spend is producing repetitive interrupt work, not durable platform value. The CFO is paying senior staff-engineer rates for ticket churn. Worse, the opportunity cost compounds: a 294-day P95 cycle time, the level Trimble recorded in 2020, means features the business committed to in Q1 can ship after the fiscal year closes, which then distorts the next planning cycle and the next acquisition thesis built on top of it.
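
To make that arithmetic concrete, here is a small Python sketch. The headcount and TOIL percentages come from the Trimble report; the per-engineer cost is a purely hypothetical placeholder (the report gives no salary data), so substitute your own fully-loaded figure.

```python
# Figures reported in the article.
HEADCOUNT = 38
TOIL_2023 = 0.839
TOIL_2025 = 0.447

# ASSUMPTION: hypothetical fully-loaded annual cost per engineer, in dollars.
# This number is NOT from the Trimble report.
ASSUMED_COST_PER_ENGINEER = 200_000


def toil_spend(headcount: int, cost_per_engineer: float, toil_share: float) -> float:
    """Annual spend consumed by toil for a team with the given toil share."""
    return headcount * cost_per_engineer * toil_share


def engineer_equivalents(headcount: int, toil_share: float) -> float:
    """Full-time engineers' worth of capacity absorbed by toil."""
    return headcount * toil_share


if __name__ == "__main__":
    for year, share in (("2023", TOIL_2023), ("2025", TOIL_2025)):
        spend = toil_spend(HEADCOUNT, ASSUMED_COST_PER_ENGINEER, share)
        fte = engineer_equivalents(HEADCOUNT, share)
        print(f"{year}: {fte:.1f} engineer-equivalents, ${spend:,.0f} on toil")
```

Under the assumed $200,000 figure, the 2023 toil share works out to roughly $6.4 million a year, or about 32 of the 38 engineers. At the 2025 level of 44.7%, it is roughly $3.4 million, or about 17 engineers.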

The hiring market makes it worse. SREs who have lived through a TOIL spike above 80% leave, and they leave first. Replacing them in 2026 means competing with hyperscalers and the better-funded AI infrastructure shops for the exact skill set, boundary engineering and intake design, that the org most needs and least knows how to interview for. The General Counsel should also be paying attention here: in regulated verticals, sustained TOIL above benchmark correlates with missed control attestations, late incident disclosures, and the kind of audit findings that turn into consent orders.

Playbook for Engineering Teams

The actionable move this week is not to adopt a new framework. It is to instrument the boundary you already have. Pull the last 90 days of intake tickets and classify each one against your stated triage criteria. Count how many were admitted that should have been rejected, and how many were rejected that should have been escalated. Those two counts, taken as a share of all tickets reviewed, are your calibration error, and it is almost certainly the largest single source of unbooked technical debt on the platform.
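
The 90-day audit can be sketched as a short script. The `Ticket` fields and the definition of calibration error as misclassified tickets over total tickets are assumptions made for illustration; the retrospective `should_admit` and `should_escalate` labels have to come from a human review against your own criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    ticket_id: str
    admitted: bool                # what the intake board actually did
    should_admit: bool            # retrospective call against the stated criteria
    should_escalate: bool = False # rejected work that deserved an escalation path


def calibration_report(tickets: list[Ticket]) -> dict:
    """Summarize how often the intake filter disagreed with its own criteria."""
    total = len(tickets)
    wrongly_admitted = sum(1 for t in tickets if t.admitted and not t.should_admit)
    wrongly_rejected = sum(1 for t in tickets if not t.admitted and t.should_escalate)
    errors = wrongly_admitted + wrongly_rejected
    return {
        "total": total,
        "wrongly_admitted": wrongly_admitted,
        "wrongly_rejected": wrongly_rejected,
        # Assumed definition: misclassified share of all reviewed tickets.
        "calibration_error": errors / total if total else 0.0,
    }


if __name__ == "__main__":
    sample = [
        Ticket("T-1", admitted=True, should_admit=True),
        Ticket("T-2", admitted=True, should_admit=False),
        Ticket("T-3", admitted=False, should_admit=False, should_escalate=True),
        Ticket("T-4", admitted=False, should_admit=False),
    ]
    print(calibration_report(sample))
```

Tracking the two error counts separately matters: wrongly admitted work points to a filter that is too loose, while wrongly rejected work points to a missing escalation path.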

Second, name an owner. The intake board needs an engineer accountable for its mechanical settings, not a rotating duty roster. Treat the triage criteria as code: versioned, reviewed, and tested against historical incidents. If a Head of Platform cannot point to who tunes the filter, the filter is not tuned.
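
One way to read "triage criteria as code" is a versioned classification function plus a regression check that replays past requests through it. The rules, field names, and threshold below are hypothetical examples, not criteria taken from the Trimble report.

```python
# Hypothetical version tag; bump it whenever a rule changes.
CRITERIA_VERSION = "2026.05.1"

# ASSUMPTION: illustrative threshold for what counts as urgent.
URGENT_BURN_RATE = 2.0


def classify(request: dict) -> str:
    """Return 'urgent', 'standard', or 'reject' for an intake request."""
    if not request.get("has_owner"):
        return "reject"
    if request.get("customer_facing") and request.get("burn_rate", 0.0) >= URGENT_BURN_RATE:
        return "urgent"
    return "standard"


# Each entry pairs a past request with the outcome the postmortem judged correct.
HISTORY = [
    ({"has_owner": True, "customer_facing": True, "burn_rate": 4.0}, "urgent"),
    ({"has_owner": True, "customer_facing": False, "burn_rate": 4.0}, "standard"),
    ({"has_owner": False, "customer_facing": True, "burn_rate": 9.0}, "reject"),
]


def regressions(criteria, history):
    """List historical requests that the given criteria would misclassify."""
    return [
        (request, expected, criteria(request))
        for request, expected in history
        if criteria(request) != expected
    ]


if __name__ == "__main__":
    failures = regressions(classify, HISTORY)
    print(f"criteria {CRITERIA_VERSION}: {len(failures)} regression(s)")
```

Run in CI, a check like this turns a change to the filter into a reviewable diff with a pass or fail result, which is what gives the named owner something concrete to be accountable for.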

Third, build the M&A clause into your SRE budget now, before the next deal. Every acquisition should arrive with a funded boundary-recalibration sprint, scoped in engineer-weeks and signed off by the acquiring CTO. The VP of Engineering should be asking, this week, what the dollar cost of re-tuning the membrane would be for a hypothetical mid-size acquisition closing in Q3, and whether that number is sitting in the integration budget or hidden inside the SRE run-rate. If it is hidden, the 2023 Trimble scenario is a coin flip away.

Fourth, treat P95 cycle time as a board-level metric alongside availability. Throughput indices and feature counts do not capture the health of the system. Cycle time does.
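
For teams that do not yet report the metric, P95 cycle time can be computed from ticket open and close dates. The sketch below uses the nearest-rank percentile method, which is one common convention among several; the sample data is invented for illustration.

```python
from statistics import median


def p95(cycle_times_days: list[float]) -> float:
    """Nearest-rank 95th percentile: the value that 95% of items finish within."""
    if not cycle_times_days:
        raise ValueError("need at least one cycle time")
    ordered = sorted(cycle_times_days)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


if __name__ == "__main__":
    # Invented sample: 18 quick tickets and 2 long-running pieces of work.
    sample = [2.0] * 18 + [200.0, 200.0]
    print("tickets closed:", len(sample))
    print("median days:", median(sample))
    print("p95 days:", p95(sample))
```

In this invented sample, throughput and the median both look healthy, while P95 surfaces the work that has been sitting in the queue for months.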

Key Takeaways

Trimble's SRE org saw TOIL hit 83.9% in 2023 under Google's 5-point definition, then recovered to 44.7% by 2025 by treating intake calibration as a first-class engineering problem.
P95 cycle time fell from 294 days in 2020 to 57 days in 2025, the clearest signal that boundary engineering, not internal tooling, was the binding constraint.
Acquisitions inject unfamiliar request shapes that defeat uncalibrated intake filters. Integration budgets that ignore SRE boundary work are mispriced.
The intake board is admission control for the engineering org. It needs a named owner, versioned triage criteria, and a calibration loop tied to TOIL and cycle-time metrics.
Teams evaluating their SRE maturity should now be asking themselves not "how good is our observability" but "who owns the filter, and when was it last re-tuned against incident history."

Frequently Asked Questions

Q: What is TOIL under Google's 5-point definition?

Google's SRE practice defines TOIL as work that is manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly with service growth. A request must meet those criteria to count, which is why the 83.9% figure cited at Trimble is significant: it is measured strictly, not as a generic "busy work" tally.

Q: Why is P95 cycle time a better health metric than throughput?

Throughput counts tickets closed, which rewards interrupt-driven work and obscures whether meaningful features are shipping. P95 cycle time is the end-to-end duration that 95% of work items finish within, so it is set by the slowest 5% of work, which is where queueing, context-switching, and boundary failures show up. Trimble's drop from 294 days to 57 days reflects structural change, not faster typing.

Q: How should an acquiring company budget for SRE integration risk?

Treat boundary recalibration as a line item in every deal, scoped in engineer-weeks and owned by the acquiring platform lead. The 2023 Trimble breach shows that integration muscle built on prior deals does not automatically scale when multiple acquisitions land in the same year. The cost of re-tuning intake filters belongs in the integration budget, not hidden inside the SRE run-rate.

Marina Koval

RiverCore Analyst · Dublin, Ireland
