Sakana AI launched Sakana Fugu on June 22, 2026, a multi-agent orchestration system delivered as a single OpenAI-compatible model API. The product ships two tiers: Fugu, balanced for latency and everyday use, and Fugu Ultra, tuned for maximum quality on hard multi-step tasks. The headline claim: Fugu Ultra matches the performance of Anthropic's Fable 5 and Mythos Preview across coding, reasoning, and scientific benchmarks, without the export-control risk.
The launch tweet from hardmaru (David Ha), Sakana AI's co-founder, frames the product as a philosophical bet: "Human intelligence is fundamentally a collective intelligence. We solve complex problems by participating in a vast cultural network that builds upon ideas across generations. I believe the strongest AI systems will become a collective intelligence, too." The post has 1,056 likes and 131 retweets at time of writing. The announcement tweet from the official account has over 9,000 likes and 3.7M views.
The framing: "Orchestration Models are the Next Frontier"
Sakana's pitch is not "we built a bigger model." It is "we built a model that commands other models." The release blog calls this an "Orchestration Model," positioning it beyond the brute-force scaling paradigm. Fugu is itself a language model, trained to decide when to delegate, which agents to assemble, how they should communicate, and how to synthesize their outputs into a single answer.
The geopolitical angle is deliberate. The release blog cites Anthropic's recent export controls on Fable and Mythos as the motivating event: "access to top models can disappear overnight." Fugu's pitch: if a provider restricts access, the system routes around the disruption by swapping agents in its pool. hardmaru calls this "the resilient blueprint required for AI sovereignty."
The framing correction: this is sovereignty over orchestration, not sovereignty over the models themselves. Fugu's agent pool consists of closed-source API models that Sakana does not name. If the underlying providers restrict access, Fugu can swap, but it still depends on external closed-source models being available. The sovereignty claim is real at the routing layer but does not extend to model independence.
The architecture: TRINITY and Conductor
Fugu's orchestration rests on two ICLR 2026 papers:
TRINITY (arxiv 2512.04695) introduces a lightweight coordinator (~0.6B parameters plus a ~10K-parameter head) optimized with an evolutionary strategy (CMA-ES, Covariance Matrix Adaptation Evolution Strategy). The coordinator processes queries over multiple turns, assigning one of three roles at each turn: Thinker (reasoning), Worker (execution), or Verifier (checking). The key insight: under high dimensionality and strict budget constraints, CMA-ES outperforms reinforcement learning, imitation learning, and random search by exploiting block-epsilon-separability in the parameter space. The paper reports 86.2% on LiveCodeBench, outperforming individual frontier models.
Conductor (arxiv 2512.04388) is a 7B model trained with reinforcement learning to discover natural-language coordination strategies. Where TRINITY assigns fixed roles, Conductor learns to design agent communication topologies and focused prompts. The model is trained with randomized agent pools, so it generalizes to arbitrary sets of open- and closed-source agents at inference time. Allowing the Conductor to select itself as a worker creates recursive topologies, a form of dynamic test-time scaling through online iterative adaptation.
Together, these two papers provide the research foundation. Fugu the product wraps them into a system where the user calls one endpoint and the orchestration happens internally.
Two models, one API
Fugu ships as two models, both accessible through a single OpenAI-compatible API:
Fugu (the base model) balances performance with latency. It is designed for everyday work: coding assistance, code review, chatbot services. Sakana positions it as a drop-in replacement for single-model endpoints in tools like Codex. Teams with compliance requirements can opt specific agents out of its pool.
Fugu Ultra coordinates a deeper pool of expert agents for maximum answer quality. Early users report deploying it for Kaggle competitions, paper reproduction, cybersecurity analysis, and patent investigations. The key difference: Fugu Ultra can assemble multi-step workflows where different specialized models handle planning, execution, and verification.
The integration is straightforward. No multi-agent framework setup, no agent definitions, no workflow configuration. You send a request to one endpoint and the system handles the rest.
The benchmark table
Here are the numbers Sakana publishes, reproduced verbatim:
| Benchmark | Fugu | Fugu Ultra | Opus 4.8 † | Gemini 3.1 Pro † | GPT 5.5 † |
|---|---|---|---|---|---|
| SWE-bench Pro * | 59.0 | 73.7 | 69.2 | 54.2 | 58.6 |
| TerminalBench 2.1 | 80.2 | 82.1 | 74.6 | 70.3 | 78.2 |
| LiveCodeBench | 92.9 | 93.2 | 87.8 | 88.5 | 85.3 |
| LiveCodeBench Pro | 87.8 | 90.8 | 84.8 | 82.9 | 88.4 |
| Humanity's Last Exam | 47.2 | 50.0 | 49.8 | 44.4 | 41.4 |
| CharXiv Reasoning | 85.1 | 86.6 | 84.2 | 83.3 | 84.1 |
| GPQA-D | 95.5 | 95.5 | 92.0 | 94.3 | 93.6 |
| SciCode | 60.1 | 58.7 | 53.5 | 58.9 | 56.1 |
| τ³ Banking | 21.7 | 20.6 | 20.6 | 8.4 | 20.6 |
| Long Context Reasoning | 74.7 | 73.3 | 67.7 | 72.7 | 74.3 |
| MRCRv2 | 86.6 | 93.6 | 87.9 | 84.9 | 94.8 |
* Uses mini-swe-agent as scaffolding.† Provider-reported scores.
The footnotes matter. All baseline scores come from the model providers themselves, not from independent reproduction. Fugu's scores are Sakana's own measurements. The comparison is asymmetric: Fugu Ultra runs through an orchestration layer that spawns multiple model calls per task, while the baselines are single-model evaluations. Sakana does not report how many tokens Fugu Ultra consumes per benchmark task, what the per-task cost is, or how many agent turns each problem takes.
On SWE-bench Pro, Fugu Ultra's 73.7 beats Opus 4.8's 69.2 by 4.5 points. On TerminalBench 2.1, the gap is 7.5 points (82.1 vs. 74.6). On LiveCodeBench, it is 5.4 points (93.2 vs. 87.8). These are real gaps, but the cost-to-achieve comparison is missing. As ML engineer Elie Bakouch notes: "they are introducing a 'test time scaling' method with 'best of N' over models, and they literally NEVER REPORT the number of output tokens or cost to achieve a benchmark/task."
The qualitative demos
Beyond benchmarks, Sakana showcases six demo scenarios where Fugu Ultra outperforms frontier baselines:
- AutoResearch (Karpathy et al. framework): An AI agent autonomously improved a small GPT's training recipe over 123 experiments on a single H100 GPU in ~14 hours. Fugu Ultra achieved a mean BPB of 0.9774 ± 0.0019, ahead of three anonymized frontier baselines ("Model A/B/C"). Sakana does not name which models A, B, and C are.
- Kana letter reading order: Classical Japanese handwriting analysis on a 1610 letter. Fugu Ultra scored 0.80 NED (normalized edit distance) vs. Model A at 0.24.
- Rubik's Cube solver generation: Code generation for a physical puzzle.
- CAD mechanical iris: Mechanical design generation.
- Blindfold chess: One-shot chess game generation.
- Time-series trading: Financial prediction.
The demos are visually compelling (video walkthroughs are embedded on the product page), but the anonymized baselines make independent verification impossible. The AutoResearch experiment is particularly interesting because it tests sustained multi-step agentic work, which is Fugu Ultra's designed sweet spot.
Pricing and deployment
Sakana offers both subscription and pay-as-you-go pricing:
Subscription tiers (all include both Fugu and Fugu Ultra):
- Standard: $20/month
- Pro: $100/month
- Max: $200/month
Pay-as-you-go (per 1M tokens):
- Input: $5
- Output: $30
- Cached input: $0.50
- Above 272K context: $10 input, $45 output, $1.00 cached
Sakana reports per-request token usage so users can track spend in real time. The API is OpenAI-compatible, so integration requires changing an endpoint URL and model name.
Not available in the EU/EEA. The product page states compliance with GDPR and EU-specific regulations is in progress. No timeline.
The missing cost data: how many tokens a typical Fugu Ultra task consumes. If Fugu Ultra spawns 5 agent calls per problem (the limit Bakouch identifies), and each call uses a frontier model, the effective cost per task is 5x the per-token rate. For SWE-bench Pro, where Fugu Ultra runs through mini-swe-agent scaffolding, the total token count per problem is unknown.
The critical take
The most detailed public critique comes from Elie Bakouch, an ML engineer who read the technical report. His analysis:
- Fugu (non-Ultra) is a router. It selects which model is most likely to answer correctly at each turn. This is a classifier, not an orchestrator. It scores 10 points below Opus on SWE-bench Pro (59.0 vs. 69.2).
- Fugu Ultra is "advanced plan mode." It outputs a plan with multiple workflows at t=0, before agents start working. Bakouch argues this is the wrong architecture: "you need to predict what to spawn at t+1 with the information you get at t, not with the info you get at t=0." The system is limited to 5 steps.
- Closed source on closed source. "If before you didn't control the models, now you don't even control which ones are used or how much."
- No cost transparency. The biggest issue: introducing a test-time scaling method with best-of-N over models while never reporting output token counts or cost.
- Anonymized baselines. The AutoResearch comparison uses "Model A, B, and C" without naming them. "This is really crazy to not be transparent about what models you compare against."
- Wrong comparison frame. The fair comparison is not Fugu Ultra vs. raw Opus, but Fugu Ultra vs. Opus with ultracode/workflows enabled. Similarly, the comparison should be against Kimi Swarm, not raw Kimi.
The criticism is substantive. The sovereignty narrative is compelling at the geopolitical level but thin at the technical level: Fugu depends on the same closed-source providers it claims to hedge against. The benchmark numbers are real but incomparable without cost data. The architecture is novel (learned orchestration beats hand-designed workflows) but the production system's constraints (closed pool, 5-step limit, no adaptation during execution) narrow its advantage.
What to watch
- Independent benchmark reproduction. The most important signal. Can third parties reproduce Fugu Ultra's SWE-bench Pro and TerminalBench scores with the public API? The mini-swe-agent scaffolding is open-source; anyone can run the evaluation.
- Token count disclosure. Will Sakana publish per-task token counts for benchmark problems? Without this, cost-efficiency claims are unverifiable.
- Agent pool transparency. Which models are in the pool? Sakana says "closed-source API models" but does not name them. If the pool is GPT 5.5 + Opus 4.8 + Gemini 3.1 Pro, the orchestration story is about routing, not about building frontier capability.
- EU/EEA availability. GDPR compliance is the blocker. Watch for Sakana's DPA (Data Processing Agreement) and EU data residency commitments.
- Open-weights models in the pool. Sakana plans to add open models and its own models. If Fugu's pool includes Llama 4, Qwen 3.7, or Sakana's own trained models, the sovereignty story strengthens materially.
- Recursive self-orchestration depth. The Conductor paper shows recursive topologies (the orchestrator calls itself). How deep does this go in production? Recursive orchestration is the most novel technical claim and the hardest to verify from the outside.
- Community adoption. 500 beta users is a meaningful signal. Watch for Kaggle competition results, cybersecurity audit reports, and code review quality comparisons using the public API.
The bottom line
Sakana Fugu is the first production system that packages learned multi-agent orchestration as a single API endpoint. The research foundation (TRINITY + Conductor, both ICLR 2026) is solid, and the benchmark numbers are competitive with the current frontier. The "Orchestration Model" framing is the right long-term bet: as models proliferate, the coordination layer becomes the differentiator.
The launch's weakness is transparency. No cost-per-benchmark, no agent pool disclosure, no independent reproduction, anonymized baselines in demos. The sovereignty narrative is emotionally resonant but technically incomplete: Fugu routes around vendor restrictions at the orchestration layer while depending on the same vendors at the model layer. For developers evaluating Fugu, the practical question is not "does it beat Opus?" but "does it beat Opus at the same or lower cost?" That question remains unanswered.



