
Naomie Halioua
Co-founder & CRO, AI Research

Cleo now runs on Claude Opus 4.8 — and we ran the eval to prove it matters.
MARIA, the engine behind Cleo, has moved to Anthropic’s Claude Opus 4.8. In product compliance, the failure mode that hurts is not clumsy prose. It is a confidently cited regulation that does not exist, or one that is real but does not apply to your product. So we did not take the upgrade on faith. We ran a head-to-head eval against the previous model, Opus 4.7, on five real product cases, and graded every cited regulation against the official source. Opus 4.8 came out ahead on all four metrics we tracked: verdict accuracy, structured output, citation errors and token use.
Why a model upgrade matters for compliance specifically
A legal-drafting assistant can be judged on tone and structure. A compliance engine is judged on a binary that carries money and liability: can this product be sold in this market, and on the authority of which regulation? A fabricated regulation number, or a real one applied to the wrong product category, is not a stylistic blemish — it is a wrong answer that a brand might act on.
This is the same risk we wrote about in our piece on AI fabricating regulations. Opus 4.8 is positioned by Anthropic as their more honest flagship, and it posts the highest score recorded on their Legal Agent Benchmark — the first model to break 10% on the strict all-pass standard. The question for us was narrower: does that translate into fewer wrong compliance answers, on our kind of task?
The eval: five products, two models, every citation verified
We picked five product-and-market cases across the categories Cleo covers — cosmetics, toys, electronics, food supplements, children’s textiles — chosen because the right answer hinges on precise identifiers (a regulation number, an EN standard) that a weaker model tends to invent or misapply. We sent each model the identical prompt, closed-book (no retrieval), asking for a verdict and the applicable regulations. Then we verified every single cited identifier against the official source: Légifrance, EUR-Lex, the CEN standards catalogue.
| Metric (n=5, closed-book) | Opus 4.7 | Opus 4.8 |
| --- | --- | --- |
| Correct verdict vs gold | 4 / 5 | 5 / 5 |
| Valid structured output | 4 / 5 | 5 / 5 |
| Citation error rate | 9.4% | 2.7% |
| Avg output tokens (max) | 626 (1100) | 484 (826) |
Opus 4.8 got every verdict right, returned a clean structured answer every time, and cited regulations with under a third of the error rate (2.7% versus 9.4%), all while using fewer tokens. On the hardest case (a mains-powered Bluetooth speaker, the densest regulatory stack), Opus 4.7 ran long and was cut off before it returned a usable verdict; Opus 4.8 returned a complete, correct answer well within budget.
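For readers who want to reproduce this kind of scoring, here is a minimal sketch of how the four table metrics could be aggregated from graded cases. The `CaseResult` schema, the field names and the pooled definition of the citation error rate (wrong citations divided by all citations across cases) are assumptions made for illustration, not a description of our internal harness, and the toy records are invented rather than taken from the eval.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CaseResult:
    """One model's graded answer on one product case (hypothetical schema)."""
    case_id: str
    gold_verdict: str
    model_verdict: Optional[str]  # None if no usable structured answer came back
    citation_ok: List[bool] = field(default_factory=list)  # one flag per cited identifier
    output_tokens: int = 0


def summarize(results: List[CaseResult]) -> dict:
    """Aggregate verdict accuracy, structured-output count, citation errors and token use."""
    if not results:
        raise ValueError("need at least one case")
    total_citations = sum(len(r.citation_ok) for r in results)
    wrong_citations = sum(1 for r in results for ok in r.citation_ok if not ok)
    return {
        "correct_verdicts": sum(r.model_verdict == r.gold_verdict for r in results),
        "valid_structured": sum(r.model_verdict is not None for r in results),
        # Assumption: errors are pooled across all cases, not averaged per case.
        "citation_error_rate": wrong_citations / total_citations if total_citations else 0.0,
        "avg_output_tokens": sum(r.output_tokens for r in results) / len(results),
        "max_output_tokens": max(r.output_tokens for r in results),
    }


# Toy records for illustration only; these are NOT the eval's data.
toy = [
    CaseResult("case-a", "allowed", "allowed", [True, True, False], 500),
    CaseResult("case-b", "blocked", None, [True], 1100),
]
summary = summarize(toy)
print(summary)
```

A truncated answer, like the one described above for the speaker case, would be recorded here with `model_verdict=None`, so it counts against both the verdict and the structured-output rows.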
Two errors 4.7 made that 4.8 did not
The interesting failures were not invented numbers — they were real regulations applied to the wrong product. This is exactly the kind of error that survives a quick human glance, because the citation looks authoritative.
EN 71-14
On a plastic pull-along toy, Opus 4.7 cited EN 71-14 — a real standard, but it governs domestic trampolines. It does not apply to a pull toy.
(EU) 2018/1513
On a lipstick, Opus 4.7 cited Regulation (EU) 2018/1513 — a real CMR restriction, but one that targets textiles and footwear, not cosmetics.
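The distinction between an invented citation and a real but misapplied one can be made explicit in a grader. The sketch below is illustrative only: the `SCOPE` table is a hypothetical stand-in that holds just the two scopes described above, and `grade_citation` and the category labels are names we made up for the example. A real check would consult the official source rather than a hard-coded dictionary.

```python
from enum import Enum
from typing import Dict, Set


class Grade(Enum):
    VALID = "valid"            # identifier is known and covers this product category
    MISAPPLIED = "misapplied"  # identifier is real but covers something else
    UNVERIFIED = "unverified"  # identifier not found in the scope table


# Hypothetical scope table with only the two examples from this article.
SCOPE: Dict[str, Set[str]] = {
    "EN 71-14": {"domestic trampolines"},
    "Regulation (EU) 2018/1513": {"textiles", "footwear"},
}


def grade_citation(identifier: str, product_category: str,
                   scope: Dict[str, Set[str]] = SCOPE) -> Grade:
    """Classify one cited identifier against the product category it was cited for."""
    covered = scope.get(identifier)
    if covered is None:
        return Grade.UNVERIFIED
    return Grade.VALID if product_category in covered else Grade.MISAPPLIED


print(grade_citation("EN 71-14", "pull-along toy"))
print(grade_citation("Regulation (EU) 2018/1513", "cosmetics"))
```

Keeping "misapplied" separate from "unverified" matters because the two call for different follow-ups: the first needs a scope check, the second an existence check.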
What this changes inside Cleo
The headline is fewer misapplied regulations reaching a user, but two operational gains matter just as much. Opus 4.8 is the strongest agentic model Anthropic has tested (84% on Online-Mind2Web), and its tool-calling is more efficient. That is exactly what MARIA relies on when it traverses Legal Atlas across 177 jurisdictions: more reliable multi-step retrieval and fewer wasted steps. Note also that this eval was closed-book, model alone. Inside Cleo the model is grounded on Legal Atlas, which closes most of the remaining citation gap.
The honest limits of this test
Five cases are illustrative, not a statistically powered benchmark: read the result as a directional signal, not a leaderboard. Each case was run once, closed-book, so the eval measures the model’s parametric knowledge rather than the full grounded system. And to be fair to 4.7: on the French supplement case it was actually more complete than 4.8 (it included the arrêté du 26 septembre 2016 that 4.8 omitted). The pattern across the set was still clear. Opus 4.8 was more accurate, more precise on citations, and more efficient, which is why it is now the default behind MARIA.
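To make "directional, not powered" concrete, the sketch below computes a 95% Wilson score interval for the two verdict results, 5 of 5 and 4 of 5. The Wilson interval is our choice for this illustration (it behaves sensibly at small n and at proportions of 0 or 1); it was not part of the eval itself, and `wilson_interval` is a helper written for this example.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


low_48, high_48 = wilson_interval(5, 5)  # Opus 4.8: 5 / 5 correct verdicts
low_47, high_47 = wilson_interval(4, 5)  # Opus 4.7: 4 / 5 correct verdicts
print(f"Opus 4.8: {low_48:.2f} to {high_48:.2f}")
print(f"Opus 4.7: {low_47:.2f} to {high_47:.2f}")
```

With five cases the two intervals come out at roughly 0.57 to 1.00 and 0.38 to 0.96. They overlap heavily, which is why the verdict row on its own cannot separate the models.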
“In compliance, a model that is more honest about what it does not know is worth more than a model that is merely more fluent. That is the upgrade we cared about — and the eval is why we shipped it.”
— Naomie Halioua, Co-founder & CRO, AI Research at Cleo Labs
See the grounded engine in action — scan a product, or explore the legal data underneath.
Explore Legal Atlas →
Methodology and Opus 4.8 figures: Anthropic (Introducing Claude Opus 4.8). Citations verified against Légifrance, EUR-Lex and the CEN standards catalogue.
Frequently asked questions
What changed — what model does Cleo use now?
MARIA, the engine behind Cleo, now runs on Anthropic's Claude Opus 4.8, upgraded from Opus 4.7. Opus 4.8 is Anthropic's more honest flagship and posts the highest score on their Legal Agent Benchmark.
How did you measure that 4.8 is better than 4.7?
We ran 5 product×market compliance cases through both models with an identical prompt, closed-book, then verified every cited regulation against the official source (Légifrance, EUR-Lex, CEN). Opus 4.8 scored 5/5 correct verdicts vs 4/5, a 2.7% citation-error rate vs 9.4%, and used fewer tokens. It is a directional eval (n=5), not a powered benchmark.
Does the model fabricate regulations?
The riskiest errors we saw were not invented numbers but real regulations applied to the wrong product (e.g. a textile CMR restriction cited on a lipstick). Opus 4.8 made far fewer of these. Inside Cleo the model is also grounded on Legal Atlas, which closes most of the remaining gap.