Data Catalogue

Browse our enriched parallel corpora organized by EuroVoc domain classification.


Catalogue Summary

Metric             Value
Total domains      21 (EuroVoc classification)
Total segments     ~500,000
Languages          24 EU official languages
Source             EUR-Lex (Official Journal of the EU)
Enrichment layers  5 (E1-E5)

Enrichment Layers (E1-E5)

All corpora include these enrichment layers:

Layer  Name        Contents
E1     Linguistic  POS tagging, lemmatization, NER
E2     Semantic    Entity linking, word sense disambiguation
E3     Domain      EuroVoc classification, IATE terminology
E4     Quality     Alignment scores, fluency metrics
E5     Metadata    CELEX numbers, dates, document types

Available Domains

Domain  Name                                 Segments  Share
01      Politics                             ~25,000   5%
02      International Relations              ~30,000   6%
03      European Union                       ~45,000   9%
04      Economics                            ~95,000   19%
05      Trade                                ~70,000   14%
06      Finance                              ~50,000   10%
07      Social Questions                     ~40,000   8%
08      Education and Communications         ~35,000   7%
09      Science                              ~30,000   6%
10      Business and Competition             ~42,000   8%
11      Agriculture, Forestry and Fisheries  ~38,000   8%
12      Law                                  ~180,000  36%
13      Politics and Public Safety           ~28,000   6%
14      Employment                           ~35,000   7%
15      Transport                            ~32,000   6%
16      Environment                          ~48,000   10%
17      Energy                               ~25,000   5%
18      Industry                             ~30,000   6%
19      Regional Policy                      ~22,000   4%
20      Information Technology               ~35,000   7%
21      Agri-foodstuffs                      ~28,000   6%
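
As a quick sanity check on the table, the Python sketch below recomputes each Share from the listed segment counts. It assumes that Share is measured against the ~500,000 total from the Catalogue Summary, which reproduces every listed percentage. The names `SEGMENTS_K`, `CATALOGUE_TOTAL_K` and `share_percent` are illustrative and not part of any published tooling. The per-domain counts add up to roughly 963,000, so the shares sum to more than 100%. One possible explanation, not confirmed on this page, is that a segment can be classified under more than one domain.

```python
# Approximate segment counts in thousands, copied from the domain table.
SEGMENTS_K = {
    "01": 25, "02": 30, "03": 45, "04": 95, "05": 70, "06": 50, "07": 40,
    "08": 35, "09": 30, "10": 42, "11": 38, "12": 180, "13": 28, "14": 35,
    "15": 32, "16": 48, "17": 25, "18": 30, "19": 22, "20": 35, "21": 28,
}

# "Total segments ~500,000" from the Catalogue Summary (assumed denominator).
CATALOGUE_TOTAL_K = 500


def share_percent(segments_k: int, total_k: int = CATALOGUE_TOTAL_K) -> int:
    """Return a domain's share of the catalogue total, rounded to a whole percent."""
    return round(100 * segments_k / total_k)


# Sum of the per-domain counts; larger than CATALOGUE_TOTAL_K.
listed_sum_k = sum(SEGMENTS_K.values())

if __name__ == "__main__":
    for domain, count_k in SEGMENTS_K.items():
        print(f"{domain}: ~{count_k},000 segments, {share_percent(count_k)}%")
    print(f"Sum of per-domain counts: ~{listed_sum_k},000")
```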

Bundle Discounts

Save when purchasing multiple domains:

Bundle Size           Discount
3 domains             15% off
5+ domains            20% off
Full corpus (all 21)  30% off
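
The discount tiers can be sketched as a small Python lookup, for example to estimate a bundle total before ordering. This is a minimal sketch under stated assumptions, not the actual pricing logic: the table does not say what applies to exactly 4 domains, so the code assumes the 3-domain rate, and it assumes a single flat list price per domain (`price_per_domain`), which this page does not state. The function names are hypothetical.

```python
TOTAL_DOMAINS = 21  # full corpus size, from the Catalogue Summary


def bundle_discount(num_domains: int) -> float:
    """Return the discount rate (0.0 to 1.0) for a bundle of num_domains."""
    if not 1 <= num_domains <= TOTAL_DOMAINS:
        raise ValueError(f"num_domains must be between 1 and {TOTAL_DOMAINS}")
    if num_domains == TOTAL_DOMAINS:
        return 0.30  # full corpus (all 21)
    if num_domains >= 5:
        return 0.20  # 5+ domains
    if num_domains >= 3:
        # Assumption: 4 domains gets the 3-domain rate; the table lists only 3 and 5+.
        return 0.15
    return 0.0  # 1 or 2 domains: no bundle discount listed


def bundle_price(num_domains: int, price_per_domain: float) -> float:
    """Estimate the bundle total, assuming a flat (hypothetical) per-domain price."""
    return num_domains * price_per_domain * (1 - bundle_discount(num_domains))


if __name__ == "__main__":
    for n in (1, 3, 5, 21):
        print(f"{n} domains: {bundle_discount(n):.0%} off")
```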


How to Order

Research License

  1. Visit the Onboarding page
  2. Verify ORCID or institutional email
  3. Select domains
  4. Complete payment
  5. Download within 24-48 hours

Commercial License

  1. Contact lds@pauhu.ai
  2. Specify domains needed
  3. Sign LRT agreement
  4. Complete payment (invoice available)
  5. Immediate access
