Subquadratic Launches with $29M and a 12M-Token Context Window
Picture a motorway built in the 1960s, three lanes each way, beautifully engineered for the traffic of its day. Now picture every commuter in the country trying to squeeze onto it at 8am on a Tuesday. That's been the story of the transformer for the last few years: a brilliant piece of road that simply cannot widen fast enough for the cars piling onto it. Subquadratic, out of stealth this week, reckons it has found a way to add ten more lanes without pouring any extra concrete.
What Happened
On May 5, 2026, a startup called Subquadratic came out of stealth with $29 million in seed funding and an LLM named SubQ, as SiliconANGLE reported. The headline number is the one that makes you blink twice: a context window of up to 12 million tokens, roughly 9 million words, or somewhere near 120 books loaded into a single prompt.
For reference, the industry standard sits at 128,000 tokens for most production models, and even the frontier cloud offerings like Claude Sonnet 4.7 and Gemini 3.1 Pro top out around 1 million. Subquadratic is claiming a 12x jump on the ceiling, and doing it cheaper.
The company is led by CEO Justin Dangel and CTO Alexander Whedon. Their architecture is a proprietary transformer variant built around sparse attention rather than the dense attention that has defined the field since 2017. The performance claims are aggressive: more than 50x faster and 50x cheaper than leading frontier models at 1 million tokens, with higher accuracy. At the full 12 million tokens, Subquadratic says SubQ cuts compute by almost 1,000x compared with frontier models.
The benchmark number is the one that will get screenshotted in group chats this week. On RULER 128K, the long-context evaluation, SubQ scored 95% accuracy at a cost of $8. Claude Opus scored 94% at roughly $2,600. That's about a 300x cost reduction for a one-point accuracy gain.
Three products are launching alongside the model: the SubQ API for developers and enterprise teams, SubQ Code (a CLI coding agent that loads entire codebases into a single context), and a search product that will initially be free. The model will not be open-weight or open-source in the near term, though Dangel says it will be trainable for customer-specific use cases. Backers include Javier Villamizar (formerly of SoftBank Vision Fund), Justin Mateen (Tinder co-founder, JAM fund), and early investors in Anthropic, OpenAI, Stripe and Brex.
Technical Anatomy
The whole pitch hinges on one bit of maths that anyone who has ever profiled a long-context inference job at 2am knows in their bones. Dense attention compares every token to every other token. Double the input, and you don't double the work, you quadruple it. That is the quadratic motorway, and it's why your $20 prompt becomes an $80 prompt the moment you paste in a second PDF.
"If you double the input size with quadratic scaling laws, you need four times to compute; with linear scaling laws, you need just twice," Whedon told SiliconANGLE. That single sentence is the entire commercial thesis.
Sparse attention, in Dangel's framing, is "an effort to say, hey, let's try to figure out how to not compare every token to every token." The boring bit, which the company isn't disclosing, is exactly which tokens get compared and which get skipped. That's the secret sauce, and it's the reason this isn't open-weight. Sparse attention isn't a new idea on a whiteboard. Longformer, BigBird, Mamba-style state-space hybrids and a dozen academic papers have all tried it. The hard part has always been keeping accuracy intact when you stop comparing everything to everything.
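Subquadratic won't say which pattern it uses, but the published variants give a feel for the mechanics. Here's a minimal sketch of the Longformer-style sliding-window mask, one known way to make the comparison count linear; to be clear, this is a textbook pattern, not Subquadratic's method:

```python
import numpy as np

def sliding_window_mask(n_tokens: int, window: int) -> np.ndarray:
    """True where token i is allowed to attend to token j.

    Dense attention fills the whole n x n grid. Restricting each token
    to a fixed local window keeps O(n * window) comparisons: linear in n.
    This is the Longformer-style local pattern, not SubQ's scheme.
    """
    idx = np.arange(n_tokens)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(n_tokens=8, window=2)
print(f"{mask.sum()} of {mask.size} pairs compared")  # 34 of 64
```

The open question for any such mask is exactly the one the article raises: which skipped comparisons turn out to matter, and at what context length.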
If the RULER 128K numbers hold up under independent testing, that's the part of the story that matters. A 95% score at $8 versus 94% at $2,600 isn't just cheaper, it changes which products are economically possible. The compute bank, as Subquadratic puts it, stops being the binding constraint.
The other technical wrinkle worth flagging: Whedon's complaint about manual prompt curation. "I used to manually curate prompts and retrieval systems and evals and conditional logic to chain together the workflows," he said, calling it "a waste of human intelligence and also limiting to the product quality." Translation: if your context window is genuinely 12 million tokens and the inference is cheap, you don't need RAG. You don't need an agentic retrieval pipeline. You just shove the whole thing in. That is a very large claim, and it's the part where it could all fall over if the accuracy degrades at length.
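For a sense of what "shove the whole thing in" means operationally, the first question is whether your corpus even fits. A rough sketch, using tiktoken's cl100k_base encoding as a stand-in since SubQ's tokenizer hasn't been published, and a hypothetical ./docs directory:

```python
# Count tokens for a whole corpus and check it against a context budget.
# cl100k_base is a stand-in; SubQ's actual tokenizer isn't public.
from pathlib import Path
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def corpus_tokens(root: str, pattern: str = "**/*.md") -> int:
    return sum(len(enc.encode(p.read_text(errors="ignore")))
               for p in Path(root).glob(pattern))

total = corpus_tokens("./docs")
for name, budget in [("typical 128K window", 128_000),
                     ("frontier 1M window", 1_000_000),
                     ("claimed 12M window", 12_000_000)]:
    verdict = "fits" if total <= budget else "needs chunking/RAG"
    print(f"{name}: {verdict} ({total:,} tokens)")
```

If the third line of output is the only one that says "fits", you're exactly the customer this launch is aimed at.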
Who Gets Burned
The most obvious losers, if SubQ delivers, are the RAG vendors. An entire ecosystem of vector databases, chunking strategies, hybrid retrieval systems and re-rankers exists because dense attention is too expensive at scale. Pinecone, Weaviate, the LangChain retrieval stack, the half-dozen consultancies billing six figures to tune embedding pipelines: all of them are selling solutions to a problem that gets smaller every time someone widens the motorway. They won't disappear in 90 days, but the strategic question on every board deck just got harder.
The frontier labs face a different kind of pressure. Anthropic and Google have built premium pricing tiers around long-context capability. If a seed-stage startup can credibly claim 300x cheaper inference at 128K, the pricing power on million-token tiers gets squeezed from below. I'd argue Anthropic in particular has the most to lose here, given how much of Claude's enterprise pitch leans on long-document analysis.
Coding tools are the other category in the firing line. SubQ Code's pitch is loading entire codebases into a single context. Cursor, Cognition's Devin, the GitHub Copilot Workspace crowd, all of them have spent eighteen months engineering elaborate agentic workflows to compensate for context limits. If SubQ Code works at the latencies implied, the agent-orchestration layer becomes a crutch rather than a feature.
For verticals adjacent to the RiverCore reader: fintech compliance teams running document review at scale, iGaming platforms chewing through transaction logs for fraud detection, ad-tech outfits processing campaign data, all of them have been writing brittle chunking logic for years. Over the next 90 days, those teams should be running the SubQ API against their hardest internal benchmarks. Not the marketing benchmarks. The ones that broke last quarter.
Playbook for AI Development
Three concrete moves for engineering leads this week.
First, get on the SubQ API waitlist and run your own evals. Vendor benchmarks are vendor benchmarks. RULER 128K is a respectable test, but it isn't your production traffic. Pull last month's hardest queries, the ones where your current RAG pipeline returned garbage, and see what happens when you stop curating and start dumping. Budget two engineer-weeks for honest evaluation.
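The harness for that doesn't need to be fancy. A sketch of the shape, with the SubQ client left as a placeholder since the API isn't publicly documented yet; the file name and field layout are assumptions about your own eval format:

```python
import json

def call_subq(full_corpus: str, question: str) -> str:
    # Placeholder: Subquadratic hasn't published API docs yet.
    # The point is the shape of the call: whole corpus, no retrieval step.
    return "<wire up SubQ API here>"

def call_current_pipeline(question: str) -> str:
    # Placeholder for your existing RAG stack: retrieve, rerank, generate.
    return "<wire up current pipeline here>"

def run_eval(path: str = "hard_queries.jsonl") -> None:
    """Compare full-context dumping against curated retrieval, case by case."""
    with open(path) as f:
        cases = [json.loads(line) for line in f]
    for case in cases:  # assumed fields: id, question, full_corpus, expected
        subq = call_subq(case["full_corpus"], case["question"])
        rag = call_current_pipeline(case["question"])
        print(f'{case["id"]}  expected: {case["expected"][:60]}')
        print(f'  subq: {subq[:60]}')
        print(f'  rag:  {rag[:60]}')

if __name__ == "__main__":
    run_eval()
```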
Second, audit your retrieval stack with an exit in mind. Not because you're ripping it out tomorrow. Because the architectural assumption that retrieval must exist is now contestable. Map which parts of your pipeline are there because of cost, which are there because of latency, and which are there because of genuine information-architecture needs (citations, access control, freshness). The first two categories are now negotiable.
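One way to make that map concrete is to tag each component with its reason for existing. The component names below are illustrative, not a prescription:

```python
# Audit sketch: which retrieval components exist for cost or latency
# (negotiable if long context gets cheap) versus genuine info-architecture.
RETRIEVAL_AUDIT = {
    "vector_db":       {"reason": "cost",      "negotiable": True},
    "chunking_logic":  {"reason": "cost",      "negotiable": True},
    "reranker":        {"reason": "latency",   "negotiable": True},
    "citation_spans":  {"reason": "info-arch", "negotiable": False},
    "acl_filtering":   {"reason": "info-arch", "negotiable": False},
    "freshness_index": {"reason": "info-arch", "negotiable": False},
}

negotiable = [k for k, v in RETRIEVAL_AUDIT.items() if v["negotiable"]]
print("Revisit if long context gets cheap:", ", ".join(negotiable))
```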
Third, watch the lock-in question. SubQ isn't open-weight and isn't planning to be. If you build product around a 12-million-token context, you're betting on a single vendor's roadmap, pricing, and uptime. That's a familiar trade for anyone using OpenAI's platform, but it's worth being honest about in the architecture review. The trainable-for-customer-use-cases hint suggests Subquadratic understands enterprise procurement, but understanding it and pricing it accessibly are different things.
For the contrarians: assume the benchmarks are slightly cooked, assume the accuracy degrades at the long tail of the 12-million-token window, and ask whether 1 million tokens at 50x cheaper is still a business-changing outcome. My take: yes, comfortably.
Key Takeaways
- Subquadratic launched May 5, 2026 with $29 million in seed funding and an LLM (SubQ) supporting up to 12 million tokens, against an industry standard of 128K and a frontier ceiling around 1 million.
- The architecture is a proprietary transformer with sparse attention, moving from quadratic to linear scaling. Doubling input doubles compute rather than quadrupling it.
- Headline benchmark: 95% on RULER 128K at $8, versus Claude Opus at 94% for around $2,600. Roughly a 300x cost reduction if it holds up under independent testing.
- RAG vendors, agent-orchestration tooling, and frontier-lab long-context pricing tiers are the most exposed if SubQ ships at the claimed quality.
- Engineering leads should run their own evals this month, audit which retrieval components exist purely for cost reasons, and weigh single-vendor lock-in against the economic upside.
Back to the motorway. Every few decades someone widens the road and everyone discovers that the traffic was never really the problem, the road was. Dangel put it more grandly: "The fundamental scaling laws imposed by the transformer architecture and dense attention have been broken through." That's a big claim from a company five hours into its public life. But if even half of it survives contact with production workloads, the lanes just got a lot wider, and a lot of carefully engineered workarounds suddenly look like cones in the middle of an empty road.
Frequently Asked Questions
Q: What makes Subquadratic's SubQ model different from Claude or Gemini?
SubQ uses a proprietary transformer architecture with sparse attention rather than dense attention, which scales linearly rather than quadratically with input size. That allows a context window up to 12 million tokens, compared with around 1 million for Claude Sonnet 4.7 and Gemini 3.1 Pro, while reportedly cutting cost and latency dramatically at long context.
Q: How credible is the 300x cost reduction claim against Claude Opus?
It comes from Subquadratic's own RULER 128K benchmark numbers: 95% accuracy at $8 versus 94% at roughly $2,600 for Claude Opus. RULER is a respected long-context benchmark, but until independent third parties reproduce the result on diverse workloads, treat the figure as a strong signal rather than settled fact.
Q: Does this kill RAG and vector databases?
Not immediately, but it weakens the core economic argument. Retrieval-augmented generation exists largely because dense attention is too expensive at scale. If long-context inference becomes 50x to 300x cheaper, many use cases that needed RAG for cost reasons can simply load full documents or codebases. Use cases that need RAG for citations, access control, or freshness are less affected.