Language Data Space Overview

Pauhu is an authorized participant in the European Language Data Space (LDS).


What is the Language Data Space?

The Language Data Space is a European initiative to create a secure marketplace for language data and technology.


Pauhu's Role in LDS

What We Sell (vs. What's Free)

Raw EU data is free. Our value is in enrichment, curation, format, and delivery.

We DON'T Sell (Free Sources)      | We DO Sell (Our Value-Add)
Raw IATE terminology              | Cross-linked + quality-filtered terminology
Raw EUR-Lex documents             | 4-star Preferred terms only, segment-aligned
Raw EuroVoc thesaurus             | AI-ready JSONL with 5 enrichment layers
Helsinki-NLP models (HuggingFace) | On-device ONNX inference + live updates

RICH DATA, OPEN MIND = Enrichment + Curation + Format + Delivery

What We Provide

Asset Type                | Description                            | Availability
Enriched Parallel Corpora | EUR-Lex + 5 enrichment layers (E1-E5)  | 21 EuroVoc domains
Translation Models        | Helsinki-NLP fine-tuned ONNX (FP32)    | 24 EU language pairs
Morphology Data           | UniMorph 4.0 grammar tables            | 163 languages
Terminology               | IATE + EuroVoc cross-linked            | All EU languages

How Data Transfer Works

  1. Browse catalogue - Find datasets you need
  2. Request access - Submit license application
  3. Contract negotiation - Automated via Dataspace Protocol
  4. Payment - Secure payment via Stripe
  5. P2P transfer - Data transferred directly from our EU servers
  6. Audit trail - Full transaction logging for compliance

Key principle: Your data never touches LDS central servers. Transfer is peer-to-peer.
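
The six steps above run strictly in sequence, which can be modelled as a small ordered state tracker. The following is a minimal sketch only: the names TransferStep and TransferSession are hypothetical and are not part of any Pauhu or Dataspace Protocol API.

```python
from enum import IntEnum


class TransferStep(IntEnum):
    """The six steps listed above, numbered as in the list."""
    BROWSE_CATALOGUE = 1
    REQUEST_ACCESS = 2
    CONTRACT_NEGOTIATION = 3
    PAYMENT = 4
    P2P_TRANSFER = 5
    AUDIT_TRAIL = 6


class TransferSession:
    """Hypothetical tracker that only accepts steps in their listed order."""

    def __init__(self) -> None:
        self.completed: list[TransferStep] = []

    @property
    def done(self) -> bool:
        return len(self.completed) == len(TransferStep)

    def advance(self, step: TransferStep) -> None:
        if self.done:
            raise ValueError("session already complete")
        # The next valid step is always one past the number completed so far.
        expected = TransferStep(len(self.completed) + 1)
        if step != expected:
            raise ValueError(f"expected {expected.name}, got {step.name}")
        self.completed.append(step)
```

Calling advance() with a step out of order (for example PAYMENT before REQUEST_ACCESS) raises ValueError, which mirrors the rule that the P2P transfer cannot start before the contract and payment steps have finished.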


Data Sovereignty Guarantee

Aspect           | Implementation
Storage location | Cloudflare R2 (EU jurisdiction)
Transfer method  | Direct HTTPS from EU edge nodes
Encryption       | TLS 1.3 in transit, AES-256 at rest
Audit retention  | 7 years (Finnish law)
GDPR compliance  | Full
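
On the receiving side, a client can refuse any connection that negotiates less than the TLS 1.3 stated above. This is a minimal sketch using only the Python standard library; the download() helper and any URL passed to it are illustrative assumptions, not a documented Pauhu endpoint.

```python
import shutil
import ssl
import urllib.request


def tls13_only_context() -> ssl.SSLContext:
    """Client context that verifies certificates and rejects TLS below 1.3."""
    context = ssl.create_default_context()  # certificate and hostname checks on
    context.minimum_version = ssl.TLSVersion.TLSv1_3
    return context


def download(url: str, dest: str) -> None:
    """Stream a file to disk over HTTPS (hypothetical helper, assumed URL)."""
    with urllib.request.urlopen(url, context=tls13_only_context()) as response:
        with open(dest, "wb") as out:
            shutil.copyfileobj(response, out)
```

If the server only offers TLS 1.2 or older, the handshake fails with an ssl.SSLError instead of silently downgrading.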

We never:


Compliance Standards

License Framework

Standard            | Implementation
CLARIN Categories   | ACA (Academic) and RES (Restricted; used here for commercial licenses)
Condition Modifiers | BY, NC, NORED, LRT (CLARIN "laundry tags")
SPDX Identifiers    | Where applicable

Metadata Schema

Standard       | Version | Purpose
ELRC-SHARE     | v3.1    | Language resource metadata
META-SHARE OWL | 3.1     | Ontology description
DCAT-AP        | 3.0     | EU data portal interoperability
EuroVoc        | 4.13    | Domain classification
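
To show how these standards fit together, a catalogue entry can be written as a DCAT dataset description in JSON-LD, with a EuroVoc concept as its theme. The sketch below is illustrative only: dataset_record() is a hypothetical helper, the record is far from a complete DCAT-AP 3.0 profile, and the actual fields Pauhu publishes are not specified in this document.

```python
import json


def dataset_record(title: str, theme_uri: str,
                   language_uris: list[str], access_url: str) -> dict:
    """Build a minimal DCAT-style dataset description as a JSON-LD dict."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
        "dct:title": title,
        # DCAT-AP expects controlled-vocabulary URIs here, not bare codes.
        "dct:language": [{"@id": uri} for uri in language_uris],
        # The EuroVoc concept URI that classifies the dataset's domain.
        "dcat:theme": {"@id": theme_uri},
        "dcat:distribution": {
            "@type": "dcat:Distribution",
            "dcat:accessURL": {"@id": access_url},
            "dct:format": "JSONL",
        },
    }


def to_jsonld(record: dict) -> str:
    """Serialize the record; ensure_ascii=False keeps non-ASCII titles readable."""
    return json.dumps(record, ensure_ascii=False, indent=2)
```

Keeping the record as a plain dict means it can be validated or extended with further DCAT-AP properties before serialization.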

Data Formats

Format    | Use Case
JSONL     | Segments with enrichment layers
TMX       | Translation Memory eXchange
XLIFF 2.1 | Localization industry standard
Moses     | Alignment format
JSON-LD   | Linked data
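
As an illustration of how two of these formats relate, JSONL segments can be flattened into Moses-style files, where line N of the source file aligns with line N of the target file. This is a sketch under stated assumptions: the field names "source" and "target" are hypothetical, since the actual JSONL schema is not given here.

```python
import io
import json
from typing import Iterable, TextIO


def jsonl_to_moses(lines: Iterable[str], src_out: TextIO, tgt_out: TextIO) -> int:
    """Write one source line and one target line per JSONL segment."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        segment = json.loads(line)
        # Moses files are line-aligned, so embedded newlines are collapsed.
        src_out.write(" ".join(segment["source"].split()) + "\n")
        tgt_out.write(" ".join(segment["target"].split()) + "\n")
        count += 1
    return count


# Usage with in-memory files; the two records are made-up sample data.
sample = io.StringIO(
    '{"source": "Hyvää huomenta", "target": "Good morning"}\n'
    '{"source": "Kiitos", "target": "Thank you"}\n'
)
fi_out, en_out = io.StringIO(), io.StringIO()
written = jsonl_to_moses(sample, fi_out, en_out)
```

Any enrichment layers carried in the JSONL record are dropped in this conversion, because the Moses format holds only the aligned text.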

Getting Started

For Researchers (Academic License)

  1. Verify eligibility - University/research institution, ORCID ID
  2. Browse catalogue - Review the available datasets
  3. Select domain - Choose from 21 EuroVoc domains
  4. Apply for license - Automated ORCID verification
  5. Download - Direct access within 24-48 hours

Pricing: €3,000 - €18,000 depending on languages

For Organizations (Commercial License)

  1. Contact us - lds@pauhu.ai
  2. Review offerings - Browse the catalogue
  3. Sign LRT agreement - Language Resource Terms
  4. Payment - Invoice or card payment
  5. Access - Immediate after payment confirmation

Pricing: €9,000 - €54,000 depending on languages


Contact

LDS Inquiries: lds@pauhu.ai
Technical Support: support@pauhu.ai
Partnership: partnership@pauhu.ai

