Sakana Fugu Wraps a Multi-Agent Orchestrator Behind a Single API, Claims Frontier Parity With Fable and Mythos

Sakana Fugu Wraps a Multi-Agent Orchestrator Behind a Single API, Claims Frontier Parity With Fable and Mythos

lschvn

Sakana AI launched Sakana Fugu on June 22, 2026, a multi-agent orchestration system delivered as a single OpenAI-compatible model API. The product ships two tiers: Fugu, balanced for latency and everyday use, and Fugu Ultra, tuned for maximum quality on hard multi-step tasks. The headline claim: Fugu Ultra matches the performance of Anthropic's Fable 5 and Mythos Preview across coding, reasoning, and scientific benchmarks, without the export-control risk.

The launch tweet from hardmaru (David Ha), Sakana AI's co-founder, frames the product as a philosophical bet: "Human intelligence is fundamentally a collective intelligence. We solve complex problems by participating in a vast cultural network that builds upon ideas across generations. I believe the strongest AI systems will become a collective intelligence, too." The post has 1,056 likes and 131 retweets at time of writing. The announcement tweet from the official account has over 9,000 likes and 3.7M views.

The framing: "Orchestration Models are the Next Frontier"

Sakana's pitch is not "we built a bigger model." It is "we built a model that commands other models." The release blog calls this an "Orchestration Model," positioning it beyond the brute-force scaling paradigm. Fugu is itself a language model, trained to decide when to delegate, which agents to assemble, how they should communicate, and how to synthesize their outputs into a single answer.

The geopolitical angle is deliberate. The release blog cites Anthropic's recent export controls on Fable and Mythos as the motivating event: "access to top models can disappear overnight." Fugu's pitch: if a provider restricts access, the system routes around the disruption by swapping agents in its pool. hardmaru calls this "the resilient blueprint required for AI sovereignty."

The framing correction: this is sovereignty over orchestration, not sovereignty over the models themselves. Fugu's agent pool consists of closed-source API models that Sakana does not name. If the underlying providers restrict access, Fugu can swap, but it still depends on external closed-source models being available. The sovereignty claim is real at the routing layer but does not extend to model independence.

The architecture: TRINITY and Conductor

Fugu's orchestration rests on two ICLR 2026 papers:

TRINITY (arxiv 2512.04695) introduces a lightweight coordinator (~0.6B parameters plus a ~10K-parameter head) optimized with an evolutionary strategy (CMA-ES, Covariance Matrix Adaptation Evolution Strategy). The coordinator processes queries over multiple turns, assigning one of three roles at each turn: Thinker (reasoning), Worker (execution), or Verifier (checking). The key insight: under high dimensionality and strict budget constraints, CMA-ES outperforms reinforcement learning, imitation learning, and random search by exploiting block-epsilon-separability in the parameter space. The paper reports 86.2% on LiveCodeBench, outperforming individual frontier models.

Conductor (arxiv 2512.04388) is a 7B model trained with reinforcement learning to discover natural-language coordination strategies. Where TRINITY assigns fixed roles, Conductor learns to design agent communication topologies and focused prompts. The model is trained with randomized agent pools, so it generalizes to arbitrary sets of open- and closed-source agents at inference time. Allowing the Conductor to select itself as a worker creates recursive topologies, a form of dynamic test-time scaling through online iterative adaptation.

Together, these two papers provide the research foundation. Fugu the product wraps them into a system where the user calls one endpoint and the orchestration happens internally.

Two models, one API

Fugu ships as two models, both accessible through a single OpenAI-compatible API:

Fugu (the base model) balances performance with latency. It is designed for everyday work: coding assistance, code review, chatbot services. Sakana positions it as a drop-in replacement for single-model endpoints in tools like Codex. Teams with compliance requirements can opt specific agents out of its pool.

Fugu Ultra coordinates a deeper pool of expert agents for maximum answer quality. Early users report deploying it for Kaggle competitions, paper reproduction, cybersecurity analysis, and patent investigations. The key difference: Fugu Ultra can assemble multi-step workflows where different specialized models handle planning, execution, and verification.

The integration is straightforward. No multi-agent framework setup, no agent definitions, no workflow configuration. You send a request to one endpoint and the system handles the rest.

The benchmark table

Here are the numbers Sakana publishes, reproduced verbatim:

BenchmarkFuguFugu UltraOpus 4.8 †Gemini 3.1 Pro †GPT 5.5 †
SWE-bench Pro *59.073.769.254.258.6
TerminalBench 2.180.282.174.670.378.2
LiveCodeBench92.993.287.888.585.3
LiveCodeBench Pro87.890.884.882.988.4
Humanity's Last Exam47.250.049.844.441.4
CharXiv Reasoning85.186.684.283.384.1
GPQA-D95.595.592.094.393.6
SciCode60.158.753.558.956.1
τ³ Banking21.720.620.68.420.6
Long Context Reasoning74.773.367.772.774.3
MRCRv286.693.687.984.994.8

* Uses mini-swe-agent as scaffolding.† Provider-reported scores.

The footnotes matter. All baseline scores come from the model providers themselves, not from independent reproduction. Fugu's scores are Sakana's own measurements. The comparison is asymmetric: Fugu Ultra runs through an orchestration layer that spawns multiple model calls per task, while the baselines are single-model evaluations. Sakana does not report how many tokens Fugu Ultra consumes per benchmark task, what the per-task cost is, or how many agent turns each problem takes.

On SWE-bench Pro, Fugu Ultra's 73.7 beats Opus 4.8's 69.2 by 4.5 points. On TerminalBench 2.1, the gap is 7.5 points (82.1 vs. 74.6). On LiveCodeBench, it is 5.4 points (93.2 vs. 87.8). These are real gaps, but the cost-to-achieve comparison is missing. As ML engineer Elie Bakouch notes: "they are introducing a 'test time scaling' method with 'best of N' over models, and they literally NEVER REPORT the number of output tokens or cost to achieve a benchmark/task."

The qualitative demos

Beyond benchmarks, Sakana showcases six demo scenarios where Fugu Ultra outperforms frontier baselines:

  1. AutoResearch (Karpathy et al. framework): An AI agent autonomously improved a small GPT's training recipe over 123 experiments on a single H100 GPU in ~14 hours. Fugu Ultra achieved a mean BPB of 0.9774 ± 0.0019, ahead of three anonymized frontier baselines ("Model A/B/C"). Sakana does not name which models A, B, and C are.
  2. Kana letter reading order: Classical Japanese handwriting analysis on a 1610 letter. Fugu Ultra scored 0.80 NED (normalized edit distance) vs. Model A at 0.24.
  3. Rubik's Cube solver generation: Code generation for a physical puzzle.
  4. CAD mechanical iris: Mechanical design generation.
  5. Blindfold chess: One-shot chess game generation.
  6. Time-series trading: Financial prediction.

The demos are visually compelling (video walkthroughs are embedded on the product page), but the anonymized baselines make independent verification impossible. The AutoResearch experiment is particularly interesting because it tests sustained multi-step agentic work, which is Fugu Ultra's designed sweet spot.

Pricing and deployment

Sakana offers both subscription and pay-as-you-go pricing:

Subscription tiers (all include both Fugu and Fugu Ultra):

  • Standard: $20/month
  • Pro: $100/month
  • Max: $200/month

Pay-as-you-go (per 1M tokens):

  • Input: $5
  • Output: $30
  • Cached input: $0.50
  • Above 272K context: $10 input, $45 output, $1.00 cached

Sakana reports per-request token usage so users can track spend in real time. The API is OpenAI-compatible, so integration requires changing an endpoint URL and model name.

Not available in the EU/EEA. The product page states compliance with GDPR and EU-specific regulations is in progress. No timeline.

The missing cost data: how many tokens a typical Fugu Ultra task consumes. If Fugu Ultra spawns 5 agent calls per problem (the limit Bakouch identifies), and each call uses a frontier model, the effective cost per task is 5x the per-token rate. For SWE-bench Pro, where Fugu Ultra runs through mini-swe-agent scaffolding, the total token count per problem is unknown.

The critical take

The most detailed public critique comes from Elie Bakouch, an ML engineer who read the technical report. His analysis:

  1. Fugu (non-Ultra) is a router. It selects which model is most likely to answer correctly at each turn. This is a classifier, not an orchestrator. It scores 10 points below Opus on SWE-bench Pro (59.0 vs. 69.2).
  2. Fugu Ultra is "advanced plan mode." It outputs a plan with multiple workflows at t=0, before agents start working. Bakouch argues this is the wrong architecture: "you need to predict what to spawn at t+1 with the information you get at t, not with the info you get at t=0." The system is limited to 5 steps.
  3. Closed source on closed source. "If before you didn't control the models, now you don't even control which ones are used or how much."
  4. No cost transparency. The biggest issue: introducing a test-time scaling method with best-of-N over models while never reporting output token counts or cost.
  5. Anonymized baselines. The AutoResearch comparison uses "Model A, B, and C" without naming them. "This is really crazy to not be transparent about what models you compare against."
  6. Wrong comparison frame. The fair comparison is not Fugu Ultra vs. raw Opus, but Fugu Ultra vs. Opus with ultracode/workflows enabled. Similarly, the comparison should be against Kimi Swarm, not raw Kimi.

The criticism is substantive. The sovereignty narrative is compelling at the geopolitical level but thin at the technical level: Fugu depends on the same closed-source providers it claims to hedge against. The benchmark numbers are real but incomparable without cost data. The architecture is novel (learned orchestration beats hand-designed workflows) but the production system's constraints (closed pool, 5-step limit, no adaptation during execution) narrow its advantage.

What to watch

  1. Independent benchmark reproduction. The most important signal. Can third parties reproduce Fugu Ultra's SWE-bench Pro and TerminalBench scores with the public API? The mini-swe-agent scaffolding is open-source; anyone can run the evaluation.
  2. Token count disclosure. Will Sakana publish per-task token counts for benchmark problems? Without this, cost-efficiency claims are unverifiable.
  3. Agent pool transparency. Which models are in the pool? Sakana says "closed-source API models" but does not name them. If the pool is GPT 5.5 + Opus 4.8 + Gemini 3.1 Pro, the orchestration story is about routing, not about building frontier capability.
  4. EU/EEA availability. GDPR compliance is the blocker. Watch for Sakana's DPA (Data Processing Agreement) and EU data residency commitments.
  5. Open-weights models in the pool. Sakana plans to add open models and its own models. If Fugu's pool includes Llama 4, Qwen 3.7, or Sakana's own trained models, the sovereignty story strengthens materially.
  6. Recursive self-orchestration depth. The Conductor paper shows recursive topologies (the orchestrator calls itself). How deep does this go in production? Recursive orchestration is the most novel technical claim and the hardest to verify from the outside.
  7. Community adoption. 500 beta users is a meaningful signal. Watch for Kaggle competition results, cybersecurity audit reports, and code review quality comparisons using the public API.

The bottom line

Sakana Fugu is the first production system that packages learned multi-agent orchestration as a single API endpoint. The research foundation (TRINITY + Conductor, both ICLR 2026) is solid, and the benchmark numbers are competitive with the current frontier. The "Orchestration Model" framing is the right long-term bet: as models proliferate, the coordination layer becomes the differentiator.

The launch's weakness is transparency. No cost-per-benchmark, no agent pool disclosure, no independent reproduction, anonymized baselines in demos. The sovereignty narrative is emotionally resonant but technically incomplete: Fugu routes around vendor restrictions at the orchestration layer while depending on the same vendors at the model layer. For developers evaluating Fugu, the practical question is not "does it beat Opus?" but "does it beat Opus at the same or lower cost?" That question remains unanswered.

Frequently Asked Questions

Bun Integrates the React Compiler Directly Into Its Bundler, Roughly 20x Faster Than the Babel Plugin

PR #32504, merged into oven-sh/bun on June 20, 2026, turns the upstream React Compiler Rust port into a built-in `bun build` transform behind `--react-compiler` and `Bun.build({ reactCompiler: true })`. Bun ports the upstream `facebook/react` `compiler/crates/` workspace directly into a single `src/react_compiler/` crate (~62k LOC) instead of going through Babel, SWC, or Oxc, and on a large React codebase (around 860 components, 1400 memo slots) the compiler pass runs in 465 ms versus 9.15 s for the Babel plugin. The feature is experimental, off by default, and ships with `reactCompilerOutputMode` (client or ssr) and a `scripts/sync-react-compiler.sh` re-sync helper.

Oxc v0.137 Teaches the Minifier to Treeshake Pure Typed Arrays and Set/Map Literals, Lands an Incremental Scoping Refresh, and Fixes a React Compiler Edge Case

Oxc release crates_v0.137.0, published on 2026-06-18, ships two new minifier passes (treeshake pure typed arrays and Set/Map array literals via #23469, and inline const values for read-only vars via #22593), a long-running incremental scoping refresh that retires the LiveUsageCollector collector entirely (#23197), a friendly parser error for adjacent JSX elements (#23378), a React Compiler bug fix for computed-key imports (#23586), and two breaking changes to the ESTree config API (#23573, #23574). The minifier pass list also gets a Proxy-aware object-introspection fix (#23483) and a new Map/WeakSet/WeakMap preservation rule for string arguments (#23470). v0.137 is the first crates release since v0.135 on 2026-06-08 and the second since Bun's native React Compiler integration landed on 2026-06-20.

Related articles

More coverage with overlapping topics and tags.

Google Cloud's Open Knowledge Format Is a Standard, Not a Product: A Deep Dive Into OKF v0.1
ai

Google Cloud's Open Knowledge Format Is a Standard, Not a Product: A Deep Dive Into OKF v0.1

On June 12, 2026, Google Cloud published the Open Knowledge Format (OKF), an open specification that formalizes the LLM-wiki pattern into a portable, interoperable format: a directory of markdown files with YAML frontmatter, one required field (type), five recommended ones, and zero required tooling. The tweet from Google Cloud Tech on June 16 drove 117,000 views in 24 hours and made the spec the most-discussed knowledge-format launch of the year. This long read walks through the v0.1 spec section by section, the design choices that make it deliberately minimal, what Google is shipping alongside it (an enrichment agent for BigQuery, a static HTML visualizer, three sample bundles, and a native BigQuery Knowledge Catalog integration), and the open question every AI agent builder and data platform team should be tracking over the next six months.
SpaceX Buys Cursor for $60 Billion: A Deep Dive Into the Biggest AI Coding Deal of the Year
ai

SpaceX Buys Cursor for $60 Billion: A Deep Dive Into the Biggest AI Coding Deal of the Year

On June 16, 2026, four trading days after SpaceX's record $85.7bn IPO made Elon Musk the world's first trillionaire, the company confirmed it will acquire Anysphere, the parent of the Cursor AI coding editor, in an all-stock deal valued at $60 billion. The price is roughly 16x Cursor's late-2025 private valuation, twice the round it was about to close, and the deal closes the loop on a curious April arrangement in which SpaceX had the right to buy Cursor for $60bn or pay $10bn for the partnership instead. This long read walks through the deal mechanics, the IPO-as-acquisition-currency story, the technology bet on Composer + Colossus, what xAI's collapse had to do with the timing, and the questions every developer who uses Cursor, Claude Code, Copilot, or any other AI coding tool should be asking this week.
Staan, the First European Search API, Opens Self-Service: A Deep Dive
ai

Staan, the First European Search API, Opens Self-Service: A Deep Dive

On June 15, 2026, at VivaTech, the Staan search API opened self-service to any developer. It is the public face of European Search Perspective, the 50/50 Qwant and Ecosia joint venture that runs the European search index, and the launch lands in the gap left by Microsoft's August 2025 retirement of the Bing Search API and Google's constrained programmatic access. This long read walks through the pipeline, the three product tiers, the pricing, the sovereignty story and its limits, the 'American backup' fallback, and what it means for the AI agents and coding tools that depend on a fresh web index.

Comments

Log in Log in to join the conversation.

No comments yet. Be the first to share your thoughts.