
Naomie Halioua
Co-founder & CRO, AI Research

"Detective Work We Shouldn't Have to Do"
Every week, I read dozens of research papers on AI and regulatory compliance to select just one — the most useful, the most actionable, the one that truly changes how you think about the subject. This week, I chose a paper that doesn't propose a new framework or tool. It does something more uncomfortable: it interviews the people actually doing compliance work, and documents exactly where things break down.
"Detective Work We Shouldn't Have to Do": Practitioner Challenges in Regulatory-Aligned Data Quality in Machine Learning Systems
Wang, Irion, Groth, Harmouch · arXiv:2602.05944, February 2026
What the paper reveals
The researchers interviewed EU-based data practitioners working on ML systems in regulated industries. Not consultants. Not academics. The people who wake up every morning trying to make their data pipelines GDPR-compliant while shipping features on deadline. What they found is a pattern so consistent it's almost a diagnosis.
5 gaps that keep appearing
Legal principles vs. engineering workflows
The GDPR speaks in principles: accuracy, minimization, purpose limitation. Your data engineer hears "no nulls in production." These are not the same thing, and nobody in the organization is bridging the gap.
Pipeline fragmentation
Data quality is checked at ingestion but degrades silently through transformation, training, and inference. By the time a regulator asks, nobody can trace what happened.
Tool limitations
Existing tools were built for analytics, not for regulatory evidence. They measure completeness and consistency — not whether your data processing is lawful under Article 5.
Responsibility fog
Legal teams own compliance. Engineering owns data. Nobody owns regulatory-aligned data quality. Everyone assumes someone else is handling it.
Reactive culture
Most teams only think about data quality when an audit is coming. By then, it's detective work — reverse-engineering what happened months ago.
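The first two gaps can be made concrete in a few lines. Below is a minimal sketch, assuming a hypothetical per-field metadata schema (`lawful_basis`, `purpose`); the field names and structure are illustrative, not taken from the paper or from any real tool.

```python
# Illustrative gap between a technical "no nulls" check and a
# regulatory-aligned check. The `field_metadata` schema is an
# assumption for this sketch, not a real tool's API.

records = [
    {"email": "a@example.com", "age": 34},
    {"email": "b@example.com", "age": 29},
]

def technical_check(rows):
    """What the pipeline typically verifies: no missing values."""
    return all(v is not None for row in rows for v in row.values())

# What a regulator asks about: documented per-field processing grounds.
field_metadata = {
    "email": {"lawful_basis": "consent", "purpose": "account_management"},
    "age":   {"lawful_basis": None,      "purpose": None},  # undocumented
}

def compliance_check(rows, metadata):
    """Flag fields processed without a documented lawful basis."""
    used_fields = {k for row in rows for k in row}
    return sorted(
        f for f in used_fields
        if metadata.get(f, {}).get("lawful_basis") is None
    )

print(technical_check(records))                   # True: pipeline looks "clean"
print(compliance_check(records, field_metadata))  # ['age']: compliance gap
```

Both checks run on the same data; only the second would surface anything in an audit. That is the vocabulary gap of the first finding in ten lines of code.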
The detail that changes everything
Practitioners don't need more regulation. They need compliance-aware tooling — tools that understand both data engineering and legal requirements simultaneously, not tools that bolt legal checklists onto existing pipelines.
Why this matters to you
You are a DPO or compliance officer
Your biggest data quality risk isn't a technical failure. It's the gap between what you think your data pipeline does and what it actually does. This paper gives you the vocabulary to have that conversation with your engineering team.
You are building an AI product
Article 10 of the AI Act (data governance) is consistently ranked among the most challenging requirements to implement. This research tells you exactly where other teams get stuck — so you don't have to learn the hard way.
Your sector is regulated (healthcare, finance, energy, HR)
If you can't demonstrate data quality governance across your entire ML pipeline — from collection to inference — you're exposed. Not just to fines, but to decisions that can't be explained or reproduced.
How Cleo handles this
This is exactly why we built Cleo the way we did. Our pipeline doesn't just check compliance at one point in time — it traces regulatory alignment across the entire data lifecycle, producing evidence that both your legal team and your engineering team can understand. If you want to see what that looks like for your stack, reply to this email or take 20 minutes with us.
Reference: Wang, Irion, Groth, Harmouch (2026), "Detective Work We Shouldn't Have to Do": Practitioner Challenges in Regulatory-Aligned Data Quality in ML Systems, arXiv:2602.05944
Frequently asked questions
What is regulatory-aligned data quality?
Regulatory-aligned data quality goes beyond traditional data quality metrics (completeness, consistency, accuracy). It means ensuring that every stage of your ML data pipeline — from collection to inference — meets the legal requirements of regulations like GDPR Article 5 and AI Act Article 10, including lawfulness, purpose limitation, and data minimization.
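The "silent degradation" problem this implies can be shown with a toy two-stage pipeline. The record format, stage names, and `purpose` tag below are hypothetical, invented for this sketch only.

```python
# Sketch of stage-by-stage auditing, assuming rows carry a GDPR
# `purpose` tag. All names here are illustrative assumptions.

def ingest():
    return [{"value": 42, "purpose": "credit_scoring"}]

def transform(rows):
    # A typical analytics transform: it keeps the value but silently
    # drops the purpose tag -- the degradation the paper describes.
    return [{"value": r["value"] * 2} for r in rows]

def audit(stage, rows):
    """Count rows at this stage that lost their regulatory metadata."""
    missing = [r for r in rows if "purpose" not in r]
    return (stage, len(missing))

stages = [("ingest", ingest())]
stages.append(("transform", transform(stages[-1][1])))

report = [audit(name, rows) for name, rows in stages]
print(report)  # [('ingest', 0), ('transform', 1)]
```

A check run only at ingestion passes; the same check run per stage catches the loss at the transform step. Tracing quality across the lifecycle, not at a single point, is what distinguishes regulatory-aligned data quality from a one-time validation.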
Why do existing data quality tools fail for compliance?
Existing data quality tools were designed for analytics and business intelligence, not for producing regulatory evidence. They measure technical metrics like null rates and format consistency, but they cannot assess whether data processing is lawful, whether consent was properly obtained, or whether the data minimization principle is respected throughout the ML pipeline.
What is Article 10 of the AI Act?
Article 10 of the EU AI Act sets data governance requirements for high-risk AI systems. It mandates that training, validation, and testing datasets meet specific quality criteria including relevance, representativeness, and freedom from errors. It is consistently ranked by practitioners as one of the most challenging requirements to implement in practice.
Try Cleo: free regulatory risk scan
See your regulatory landscape mapped in minutes. No signup, no credit card.