I wanted a practical public dataset for building production-grade RAG systems, preferably something useful for the European market. The EU AI Act is a good fit: it is important, public, multilingual, and structured enough to be useful, but still messy enough to teach real retrieval lessons.
So I created a Hugging Face dataset for it: jeroenherczeg/eu-ai-act.
The dataset turns the official EU AI Act into a structured, multilingual, retrieval-ready corpus. It currently contains about 2,600 rows, published as Parquet, with English, Dutch, and French included by default. Each row contains the text plus useful metadata such as article numbers, recital numbers, annex information, cross-references, defined terms, effective dates, and source information. The dataset is published under CC BY 4.0.
The goal is not to create another raw legal text dump. Raw text is easy. Retrieval-ready data is harder.
Most RAG systems do not fail because the vector database was wrong. They fail because the data was poorly prepared. If chunks do not know where they came from, what article they belong to, what language they are in, or when a provision becomes applicable, the retrieval layer has very little to work with.
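To make this concrete, here is a small sketch of metadata-aware filtering. The field names (`text`, `language`, `article`, `effective_date`) are illustrative assumptions about the schema, not a verbatim copy of the dataset's columns; the dates match the staggered application dates set out in Article 113.

```python
# Illustrative rows carrying the kind of metadata described above.
# Field names are assumptions, not the dataset's exact schema.
rows = [
    {"text": "Providers shall ...",  "language": "en", "article": 16, "effective_date": "2026-08-02"},
    {"text": "Aanbieders ...",       "language": "nl", "article": 16, "effective_date": "2026-08-02"},
    {"text": "AI practices ...",     "language": "en", "article": 5,  "effective_date": "2025-02-02"},
]

def applicable(rows, language, on_date):
    """Restrict retrieval candidates to one language and to provisions
    already applicable on a given date (ISO dates compare lexically)."""
    return [r for r in rows if r["language"] == language and r["effective_date"] <= on_date]

candidates = applicable(rows, "en", "2025-06-01")
print([r["article"] for r in candidates])  # [5]
```

Without the `effective_date` and `language` fields, a retriever has no way to make this distinction and would happily cite a provision that is not yet applicable.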
That is what this dataset tries to solve.
Each chunk keeps structural information from the regulation: articles, paragraphs, recitals, annexes, references between provisions, and effective dates derived from Article 113. Equivalent provisions across languages also share a stable structure_path, which makes the dataset useful as a small multilingual parallel corpus.
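A shared structure_path makes cross-language alignment a simple group-by. The sketch below assumes illustrative row contents and path values; only the idea of a stable per-provision key comes from the dataset.

```python
from collections import defaultdict

# Sketch: pairing equivalent provisions across languages via a shared
# structure_path key. Row contents and path format are illustrative.
rows = [
    {"structure_path": "article_5/paragraph_1", "language": "en", "text": "The following AI practices shall be prohibited ..."},
    {"structure_path": "article_5/paragraph_1", "language": "nl", "text": "De volgende AI-praktijken zijn verboden ..."},
    {"structure_path": "article_5/paragraph_1", "language": "fr", "text": "Les pratiques suivantes ... sont interdites ..."},
]

# Group rows by structure_path: each entry becomes a small
# parallel-translation record keyed by language.
by_path = defaultdict(dict)
for r in rows:
    by_path[r["structure_path"]][r["language"]] = r["text"]

parallel = by_path["article_5/paragraph_1"]
print(sorted(parallel))  # ['en', 'fr', 'nl']
```

The same grouping works for evaluation: retrieve in one language, then check whether the top hit's structure_path matches the gold provision regardless of which language the query was in.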
I also published the build pipeline here: jeroenherczeg/eu-ai-act-dataset.
The pipeline fetches the official EU AI Act Formex XML, parses it, chunks it, enriches it with metadata, validates the result, and publishes it to the Hugging Face Hub. The repository includes the parser, chunker, enrichment step, export logic, validation checks, and GitHub Actions workflow for continuous publishing.
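The validation step is the part I would encourage borrowing even for other corpora. The checks below are a sketch of the kind of invariants such a pipeline might enforce; they are my assumptions, not the repository's actual code.

```python
# Sketch of per-row validation checks a build pipeline might run before
# publishing. Field names and rules here are assumptions for illustration.
def validate_row(row: dict) -> list[str]:
    errors = []
    if not row.get("text", "").strip():
        errors.append("empty text")
    if row.get("language") not in {"en", "nl", "fr"}:
        errors.append("unexpected language")
    if not row.get("structure_path"):
        errors.append("missing structure_path")
    return errors

good = {"text": "Providers shall ...", "language": "en", "structure_path": "article_16/paragraph_1"}
bad  = {"text": "   ", "language": "de"}

print(validate_row(good))  # []
print(validate_row(bad))   # ['empty text', 'unexpected language', 'missing structure_path']
```

Failing the build on any non-empty error list is what keeps a regenerated dataset trustworthy.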
This makes the dataset reproducible instead of hand-curated. Every row includes snapshot and hash metadata, so future versions can be compared and rebuilt. The build pipeline can also be customized for additional EU languages, since the source material exists in all official EU languages.
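Per-row hashes make diffing two builds trivial. A minimal sketch, assuming each row stores a SHA-256 of its text (the field layout is an assumption, the technique is standard):

```python
import hashlib

def row_hash(text: str) -> str:
    """Stable content hash for one chunk of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Two hypothetical builds, keyed by structure_path.
old_build = {"article_5/paragraph_1": row_hash("The following practices shall be prohibited ...")}
new_build = {"article_5/paragraph_1": row_hash("The following AI practices shall be prohibited ...")}

# Rows whose content changed between builds.
changed = [path for path in new_build if old_build.get(path) != new_build[path]]
print(changed)  # ['article_5/paragraph_1']
```

This is also why a snapshot identifier matters: it ties every row back to the exact source document it was built from.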
I built this mainly as a practical RAG dataset. It should be useful for experiments around multilingual retrieval, metadata-filtered search, cross-reference traversal, and date-aware question answering.
It is not legal advice, and it should not be used to determine compliance without human review. But it is a useful corpus if you want to test how a RAG system behaves on real European regulatory text without starting from a messy PDF scrape.
Dataset: jeroenherczeg/eu-ai-act
Build pipeline: jeroenherczeg/eu-ai-act-dataset