Language Data Space Overview
Pauhu is an authorized participant in the European Language Data Space (LDS).
What is the Language Data Space?
The Language Data Space is a European initiative to create a secure marketplace for language data and technology. It enables:
- Data sovereignty - Data stays on provider infrastructure
- Secure exchange - Encrypted peer-to-peer (P2P) transfers
- Fair pricing - Transparent, regulated pricing
- Interoperability - Standardized formats and APIs
Pauhu's Role in LDS
What We Sell (vs. What's Free)
Raw EU data is free. Our value is in enrichment, curation, format, and delivery.
| We DON'T Sell (Free Sources) | We DO Sell (Our Value-Add) |
|---|---|
| Raw IATE terminology | Cross-linked + quality-filtered terminology |
| Raw EUR-Lex documents | 4-star Preferred terms only, segment-aligned |
| Raw EuroVoc thesaurus | AI-ready JSONL with 5 enrichment layers |
| Helsinki-NLP models (HuggingFace) | On-device ONNX inference + live updates |
RICH DATA, OPEN MIND = Enrichment + Curation + Format + Delivery
What We Provide
| Asset Type | Description | Availability |
|---|---|---|
| Enriched Parallel Corpora | EUR-Lex + 5 enrichment layers (E1-E5) | 21 EuroVoc domains |
| Translation Models | Helsinki-NLP fine-tuned ONNX (FP32) | 24 EU language pairs |
| Morphology Data | UniMorph 4.0 grammar tables | 163 languages |
| Terminology | IATE + EuroVoc cross-linked | All EU languages |
How Data Transfer Works
- Browse catalogue - Find datasets you need
- Request access - Submit license application
- Contract negotiation - Automated via Dataspace Protocol
- Payment - Secure payment via Stripe
- P2P transfer - Data transferred directly from our EU servers
- Audit trail - Full transaction logging for compliance
Key principle: Your data never touches LDS central servers. Transfer is peer-to-peer.
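The ordering of the steps above can be sketched as a small state model. This is only an illustration of the sequence as listed on this page; the names `TransferStep`, `next_step`, and `can_transfer` are hypothetical and do not correspond to any Pauhu or Dataspace Protocol API.

```python
from __future__ import annotations

from enum import IntEnum


class TransferStep(IntEnum):
    """The six steps listed above, in the order given (hypothetical names)."""
    BROWSE_CATALOGUE = 1
    REQUEST_ACCESS = 2
    CONTRACT_NEGOTIATION = 3
    PAYMENT = 4
    P2P_TRANSFER = 5
    AUDIT_TRAIL = 6


def next_step(current: TransferStep) -> TransferStep | None:
    """Return the step that follows `current`, or None after the last one."""
    if current is TransferStep.AUDIT_TRAIL:
        return None
    return TransferStep(current + 1)


def can_transfer(completed: set[TransferStep]) -> bool:
    """Assume the P2P transfer may start only once every earlier step is done."""
    required = {step for step in TransferStep if step < TransferStep.P2P_TRANSFER}
    return required <= completed
```

Modelling the steps as an ordered enum makes the assumption explicit that payment and contract negotiation both precede any transfer of data.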
Data Sovereignty Guarantee
| Aspect | Implementation |
|---|---|
| Storage location | Cloudflare R2 (EU jurisdiction) |
| Transfer method | Direct HTTPS from EU edge nodes |
| Encryption | TLS 1.3 in transit, AES-256 at rest |
| Audit retention | 7 years (Finnish law) |
| GDPR compliance | Full |
We never:
- Store your data on third-party servers
- Transfer data outside EU/EEA
- Share your usage data with LDS or other participants
- Access your data after transfer completion
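On the receiving side, a client can insist on the TLS 1.3 transport listed in the table above. The following is a minimal sketch using only Python's standard `ssl` module; the function name `make_tls13_context` is an assumption for illustration, and nothing here describes how Pauhu's servers are actually configured.

```python
import ssl


def make_tls13_context() -> ssl.SSLContext:
    """Build a client context that refuses anything older than TLS 1.3."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_3
    return context
```

Such a context can be passed to `urllib.request.urlopen(url, context=...)`, in which case a download from a server that only offers older TLS versions fails during the handshake rather than silently downgrading.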
Compliance Standards
License Framework
| Standard | Implementation |
|---|---|
| CLARIN Categories | ACA (Academic) and RES (Restricted; used here for commercial licenses) |
| Condition Modifiers | BY, NC, NORED, LRT (laundry tags) |
| SPDX Identifiers | Where applicable |
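A license under this framework is a category plus zero or more condition modifiers. The sketch below models that combination and validates it against the values named in the table. The class name `LicenseLabel` and the rendered string format are assumptions for illustration, not an official CLARIN label syntax.

```python
from __future__ import annotations

from dataclasses import dataclass

# Values taken from the table above.
CATEGORIES = {"ACA", "RES"}
MODIFIERS = {"BY", "NC", "NORED", "LRT"}


@dataclass(frozen=True)
class LicenseLabel:
    """A category with optional condition modifiers (hypothetical model)."""
    category: str
    modifiers: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        unknown = set(self.modifiers) - MODIFIERS
        if unknown:
            raise ValueError(f"unknown modifiers: {sorted(unknown)}")

    def __str__(self) -> str:
        # Illustrative rendering only, e.g. "ACA +BY +NC".
        return " ".join([self.category, *(f"+{m}" for m in self.modifiers)])
```

For example, `str(LicenseLabel("ACA", ("BY", "NC")))` renders an academic license with attribution and non-commercial conditions.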
Metadata Schema
| Standard | Version | Purpose |
|---|---|---|
| ELRC-SHARE | 3.1 | Language resource metadata |
| META-SHARE OWL | 3.1 | Ontology description |
| DCAT-AP | 3.0 | EU data portal interoperability |
| EuroVoc | 4.13 | Domain classification |
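To give a rough idea of what a catalogue entry built on these vocabularies might look like, the sketch below assembles a small JSON-LD description using the DCAT and Dublin Core Terms namespaces. The function name, the choice of properties, and the example values are assumptions; this is not a validated DCAT-AP 3.0 record (which, for instance, expects languages as URIs from the EU authority tables) and not Pauhu's actual schema.

```python
from __future__ import annotations

import json


def build_dataset_record(title: str, languages: list[str], theme_uri: str) -> dict:
    """Return a minimal JSON-LD description of a dataset (illustrative only)."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:language": languages,        # simplified: plain language codes
        "dcat:theme": {"@id": theme_uri},  # e.g. a EuroVoc domain URI
    }


# Placeholder values; the theme URI is not a real EuroVoc identifier.
record = build_dataset_record(
    "Sample parallel corpus", ["fi", "en"], "https://example.org/theme/placeholder"
)
serialized = json.dumps(record, indent=2)
```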
Data Formats
| Format | Use Case |
|---|---|
| JSONL | Segments with enrichment layers |
| TMX | Translation Memory eXchange |
| XLIFF 2.1 | Localization industry standard |
| Moses | Alignment format |
| JSON-LD | Linked data |
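As an illustration of how the JSONL and Moses formats relate, the sketch below reads segment records line by line and splits them into two line-aligned lists, which is the shape of a Moses-style parallel text. The field names `source`, `target`, and `layers` are hypothetical, since the actual record schema is not documented here, and the sample records are made up for the example.

```python
from __future__ import annotations

import json
from typing import Iterable, Iterator


def read_segments(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def to_moses(segments: Iterable[dict]) -> tuple[list[str], list[str]]:
    """Split segments into two line-aligned lists (source side, target side)."""
    sources: list[str] = []
    targets: list[str] = []
    for segment in segments:
        sources.append(segment["source"])  # hypothetical field name
        targets.append(segment["target"])  # hypothetical field name
    return sources, targets


# Made-up sample records, one JSON object per line.
SAMPLE = """{"source": "Hello", "target": "Hei", "layers": {}}
{"source": "Thank you", "target": "Kiitos", "layers": {}}"""

sources, targets = to_moses(read_segments(SAMPLE.splitlines()))
```

Writing `sources` and `targets` to two files, one line per entry, keeps line N of one file aligned with line N of the other.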
Getting Started
For Researchers (Academic License)
- Verify eligibility - University/research institution, ORCID ID
- Browse catalogue - View the available datasets
- Select domain - Choose from 21 EuroVoc domains
- Apply for license - Automated ORCID verification
- Download - Direct access within 24-48 hours
Pricing: €3,000 to €18,000, depending on the languages selected
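Before applying, a researcher can check that their ORCID iD is at least well formed. ORCID iDs end in a check character computed with ISO 7064 MOD 11-2 over the first 15 digits. The sketch below implements that checksum; the function names are hypothetical, and it only validates the format. It does not confirm that the iD exists or belongs to the applicant, and it does not describe how the automated verification mentioned above is actually performed.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed_orcid(orcid: str) -> bool:
    """True if `orcid` has 16 characters (ignoring hyphens) and a valid checksum."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact.isascii() or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()
```

For instance, `is_well_formed_orcid("0000-0002-1825-0097")` passes the checksum, while changing the final digit makes it fail.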
For Organizations (Commercial License)
- Contact us - lds@pauhu.ai
- Review offerings - View the available datasets
- Sign LRT agreement - Language Resource Terms
- Payment - Invoice or card payment
- Access - Immediate after payment confirmation
Pricing: €9,000 to €54,000, depending on the languages selected
Contact
LDS Inquiries: lds@pauhu.ai
Technical Support: support@pauhu.ai
Partnership: partnership@pauhu.ai