About

Lingua Corpora

Language Data Coverage Tracker

A live measurement of how much publicly available data exists for every human language, scored by modality and task.

How it works

We continuously crawl public dataset repositories — HuggingFace, GitHub, OPUS, Wikipedia, OSCAR, and others — map each dataset to the languages it contains, and score every language across two axes:

7 Modalities: Text, speech, image, video, geospatial, tabular, time-series.
11 Tasks: Pre-training, instruction, conversation, QA, summarisation, translation, reasoning, code generation, classification, information extraction, generation.

The rubric supports two views. The Fixed view applies the same coverage expectations to every language. The Vitality-adjusted view reduces the weight of categories such as coding and mathematics for endangered and extinct languages, so they aren't penalised for lacking data that could not reasonably be expected to exist.
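
As a rough illustration of how the two views could differ, the sketch below computes a weighted task-coverage score for a single language. Everything specific in it is an assumption made for illustration: the function names, the vitality labels, the reduced weight of 0.25, and the choice of "code generation" and "reasoning" as the down-weighted tasks (standing in for "coding and mathematics"). The tracker's actual rubric may differ.

```python
from typing import Dict

# The 11 tasks listed above.
TASKS = [
    "pre-training", "instruction", "conversation", "qa", "summarisation",
    "translation", "reasoning", "code generation", "classification",
    "information extraction", "generation",
]

# Hypothetical: tasks whose weight is reduced for low-vitality languages.
DOWN_WEIGHTED = {"code generation", "reasoning"}
# Hypothetical vitality labels and reduction factor.
LOW_VITALITY = {"endangered", "extinct"}
REDUCED_WEIGHT = 0.25


def task_weights(vitality: str, view: str) -> Dict[str, float]:
    """Return a per-task weight for the chosen view ("fixed" or "vitality")."""
    weights = {task: 1.0 for task in TASKS}
    if view == "vitality" and vitality in LOW_VITALITY:
        for task in DOWN_WEIGHTED:
            weights[task] = REDUCED_WEIGHT
    return weights


def coverage_score(covered: Dict[str, float], vitality: str, view: str = "fixed") -> float:
    """Weighted mean of per-task coverage values, each between 0 and 1."""
    weights = task_weights(vitality, view)
    total = sum(weights.values())
    return sum(weights[t] * covered.get(t, 0.0) for t in TASKS) / total


# Example: full coverage everywhere except the two down-weighted tasks.
example = {t: 1.0 for t in TASKS if t not in DOWN_WEIGHTED}
fixed = coverage_score(example, "endangered", view="fixed")
adjusted = coverage_score(example, "endangered", view="vitality")
```

In this sketch the same endangered language scores higher under the vitality-adjusted view than under the fixed view, because the missing tasks count for less. A language outside the low-vitality set scores the same under both views.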

Taxonomy sources

The list of ~9,000 languages and their metadata is unified from three authoritative sources:

LinguaMeta — 7,511 languages with 9 metadata categories (Google Research, LREC-COLING 2024). Primary source.
ISO 639-3 — SIL International language code standard (~7,900 codes). Cross-reference + supplementary entries for codes not in LinguaMeta.
Glottolog — 4,853 language families, 27,000+ languoids, coordinates, and additional languages.

All crawlers route language identifiers (BCP-47, ISO 639-1/3, Glottolog codes, free-text names) through a centralised resolver so the same language is never double-counted under different codes.
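
A minimal sketch of what such a resolver might look like is shown below. The alias table, the choice of ISO 639-3 as the canonical form, and the `resolve` function are hypothetical and only illustrate the idea; a real table would be generated from the taxonomy sources rather than written by hand.

```python
from typing import Optional

# Hypothetical alias table: each known identifier maps to one canonical
# ISO 639-3 code. Entries cover ISO 639-1, ISO 639-3, Glottocodes, and names.
ALIASES = {
    "en": "eng", "eng": "eng", "stan1293": "eng", "english": "eng",
    "fr": "fra", "fra": "fra", "stan1290": "fra", "french": "fra",
}


def resolve(identifier: str) -> Optional[str]:
    """Map a BCP-47 tag, ISO code, Glottocode, or name to a canonical code.

    Returns None when the identifier is not recognised.
    """
    key = identifier.strip().lower().replace("_", "-")
    if key in ALIASES:
        return ALIASES[key]
    # Fall back to the primary language subtag of a BCP-47 tag, e.g. "en-GB".
    primary = key.split("-", 1)[0]
    return ALIASES.get(primary)


# Five different identifiers for the same language collapse to one entry.
tags = ["en", "en-GB", "eng", "English", "stan1293"]
distinct = {resolve(tag) for tag in tags}
```

Counting languages over the resolved codes rather than the raw tags is what prevents the same language from being tallied more than once.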

Scope & limitations

This is not an exhaustive enumeration of every dataset in existence. We scan public, online datasets only.

The tracker deliberately excludes:

Private datasets owned by individuals or companies
Copyrighted material whose license we can't verify
Offline or unpublished corpora
Datasets behind authentication walls we can't inspect

The picture you see is the floor of available coverage, not the ceiling. If you know of a public dataset we've missed, please get in touch.

About Beever AI

The Language Data Coverage Tracker is built and maintained by Beever AI. We work on multilingual language technology and care about every language being represented in the systems people use every day.

Contact

Have questions or corrections, or want to collaborate?

hello@beever.ai