Cleo

AI-powered regulatory intelligence.

contact@cleolabs.co

Solutions

  • Due Diligence
  • Product Compliance

Company

  • About
  • Research
  • Blog
  • Compliance Guides

Jurisdictions

  • 🇪🇺 European Union
  • 🇫🇷 France
  • 🇩🇪 Germany
  • 🇬🇧 United Kingdom
  • 🇺🇸 United States

Legal

  • Privacy
  • Terms
  • Security

Events

  • VivaTech Paris, Jun 11–14, 2026

© 2026 Cleo Labs. All rights reserved.

GDPR · EU Data · SOC 2 Type II · ISO 27001
AI · 2026-03-31 · 5 min read
Naomie Halioua

Co-founder & CRO, AI Research

"Detective Work We Shouldn't Have to Do": Why Data Quality Is the Blind Spot of ML Compliance


Every week, I read dozens of research papers on AI and regulatory compliance to select just one — the most useful, the most actionable, the one that truly changes how you think about the subject. This week, I chose a paper that doesn't propose a new framework or tool. It does something more uncomfortable: it interviews the people actually doing compliance work, and documents exactly where things break down.

"Detective Work We Shouldn't Have to Do": Practitioner Challenges in Regulatory-Aligned Data Quality in Machine Learning Systems

Wang, Irion, Groth, Harmouch · arXiv:2602.05944, February 2026

What the paper reveals

The researchers interviewed EU-based data practitioners working on ML systems in regulated industries. Not consultants. Not academics. The people who wake up every morning trying to make their data pipelines GDPR-compliant while shipping features on deadline. What they found is a pattern so consistent it's almost a diagnosis.

5 gaps that keep appearing

01

Legal principles vs. engineering workflows

The GDPR says "data quality." Your data engineer hears "no nulls in production." These are not the same thing — and nobody in the organization is bridging the gap.

02

Pipeline fragmentation

Data quality is checked at ingestion but degrades silently through transformation, training, and inference. By the time a regulator asks, nobody can trace what happened.

03

Tool limitations

Existing tools were built for analytics, not for regulatory evidence. They measure completeness and consistency — not whether your data processing is lawful under Article 5.

04

Responsibility fog

Legal teams own compliance. Engineering owns data. Nobody owns regulatory-aligned data quality. Everyone assumes someone else is handling it.

05

Reactive culture

Most teams only think about data quality when an audit is coming. By then, it's detective work — reverse-engineering what happened months ago.
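Gap 02 in particular lends itself to a concrete illustration. The sketch below (all names hypothetical, not taken from the paper) shows one way a pipeline could record a quality snapshot at every stage, so that degradation between ingestion and training leaves a trace instead of vanishing silently:

```python
from dataclasses import dataclass, field

@dataclass
class QualitySnapshot:
    stage: str
    row_count: int
    null_fraction: float

@dataclass
class Dataset:
    rows: list
    lineage: list = field(default_factory=list)

    def snapshot(self, stage: str) -> None:
        # Record basic quality metrics for this stage alongside the data.
        nulls = sum(1 for r in self.rows for v in r.values() if v is None)
        total = sum(len(r) for r in self.rows) or 1
        self.lineage.append(QualitySnapshot(stage, len(self.rows), nulls / total))

# Quality is recorded at every stage, not only at ingestion.
ds = Dataset(rows=[{"age": 34, "income": 52000}, {"age": None, "income": 61000}])
ds.snapshot("ingestion")
ds.rows = [r for r in ds.rows if r["age"] is not None]  # a transformation step
ds.snapshot("transformation")

for s in ds.lineage:
    print(s.stage, s.row_count, round(s.null_fraction, 2))
```

When a regulator asks what happened to the data, the `lineage` list is the start of an answer: it shows that a transformation dropped a record and why the null rate changed.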

The detail that changes everything

Practitioners don't need more regulation. They need compliance-aware tooling — tools that understand both data engineering and legal requirements simultaneously, not tools that bolt legal checklists onto existing pipelines.

Why this matters to you


You are a DPO or compliance officer

Your biggest data quality risk isn't a technical failure. It's the gap between what you think your data pipeline does and what it actually does. This paper gives you the vocabulary to have that conversation with your engineering team.


You are building an AI product

Article 10 of the AI Act (data governance) is consistently ranked as the most challenging requirement to implement. This research tells you exactly where other teams get stuck — so you don't have to learn the hard way.


Your sector is regulated (healthcare, finance, energy, HR)

If you can't demonstrate data quality governance across your entire ML pipeline — from collection to inference — you're exposed. Not just to fines, but to decisions that can't be explained or reproduced.

How Cleo handles this

This is exactly why we built Cleo the way we did. Our pipeline doesn't just check compliance at one point in time — it traces regulatory alignment across the entire data lifecycle, producing evidence that both your legal team and your engineering team can understand. If you want to see what that looks like for your stack, reply to this email or take 20 minutes with us.

Reference: Wang, Irion, Groth, Harmouch (2026), "Detective Work We Shouldn't Have to Do": Practitioner Challenges in Regulatory-Aligned Data Quality in ML Systems, arXiv:2602.05944

Frequently asked questions

What is regulatory-aligned data quality?

Regulatory-aligned data quality goes beyond traditional data quality metrics (completeness, consistency, accuracy). It means ensuring that every stage of your ML data pipeline — from collection to inference — meets the legal requirements of regulations like GDPR Article 5 and AI Act Article 10, including lawfulness, purpose limitation, and data minimization.
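To make the distinction concrete, here is a minimal sketch (field names, lawful bases, and thresholds are illustrative assumptions, not a real API) of a check that pairs a technical dimension with two legal dimensions that analytics-oriented tools never see:

```python
# Hypothetical sketch: a "regulatory-aligned" check inspects legal metadata
# (lawful basis, declared purpose) alongside technical completeness.
LAWFUL_BASES = {"consent", "contract", "legitimate_interest"}

def regulatory_aligned_check(records, declared_purpose: str):
    issues = []
    for i, rec in enumerate(records):
        # Technical dimension: completeness.
        if any(v is None for v in rec["data"].values()):
            issues.append(f"record {i}: missing values")
        # Legal dimension: lawfulness (GDPR Art. 5(1)(a)).
        if rec.get("lawful_basis") not in LAWFUL_BASES:
            issues.append(f"record {i}: no valid lawful basis")
        # Legal dimension: purpose limitation (GDPR Art. 5(1)(b)).
        if rec.get("purpose") != declared_purpose:
            issues.append(f"record {i}: purpose mismatch")
    return issues

records = [
    {"data": {"age": 41}, "lawful_basis": "consent", "purpose": "credit_scoring"},
    {"data": {"age": None}, "lawful_basis": None, "purpose": "marketing"},
]
print(regulatory_aligned_check(records, declared_purpose="credit_scoring"))
```

A purely technical tool would flag only the missing value in the second record; the lawful-basis and purpose-limitation failures are invisible without the legal metadata.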

Why do existing data quality tools fail for compliance?

Existing data quality tools were designed for analytics and business intelligence, not for producing regulatory evidence. They measure technical metrics like null rates and format consistency, but they cannot assess whether data processing is lawful, whether consent was properly obtained, or whether the data minimization principle is respected throughout the ML pipeline.

What is Article 10 of the AI Act?

Article 10 of the EU AI Act sets data governance requirements for high-risk AI systems. It mandates that training, validation, and testing datasets meet specific quality criteria including relevance, representativeness, and freedom from errors. It is consistently ranked by practitioners as one of the most challenging requirements to implement in practice.
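One of those criteria, representativeness, can be approximated with a simple distribution comparison. The sketch below is an illustration only (the group names, reference shares, and tolerance are invented, not drawn from the AI Act): it flags groups whose share in the training data diverges from a declared reference population.

```python
from collections import Counter

# Hypothetical sketch of one Article 10 dimension: representativeness.
def representativeness_gaps(train_labels, reference_shares, tolerance=0.10):
    counts = Counter(train_labels)
    total = sum(counts.values())
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        # Flag groups whose observed share drifts beyond the tolerance.
        if abs(observed - expected) > tolerance:
            gaps[group] = (round(observed, 2), expected)
    return gaps

train = ["urban"] * 90 + ["rural"] * 10
reference = {"urban": 0.70, "rural": 0.30}
print(representativeness_gaps(train, reference))
```

A check like this does not prove compliance, but it turns "is our dataset representative?" from a debate into a measurable, documentable question.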

Sources & references

  1. Regulation (EU) 2016/679 — General Data Protection Regulation (GDPR)
  2. Regulation (EU) 2024/1689 — Artificial Intelligence Act

Try Cleo: free regulatory risk scan

See your regulatory landscape mapped in minutes. No signup, no credit card.

Scan for free
Book a Call