Word databases, language tools and solutions

Language data, tools and solutions based on 1 trillion words in 100+ languages

We provide large, high-quality word databases, lexical data, word lists, and lexicons in many languages. Our data are generated from text corpora, which are large collections of authentic text. Our text data total 1 trillion words, and the largest corpus contains over 80 billion words. This scale allows us to generate databases of millions or even hundreds of millions of items while preserving accuracy and reliability. The corpora are also well suited as training data for large language models (LLMs). Our customers are software developers, publishers of dictionaries and language teaching materials, and anyone else who needs reliable language data.

The databases we supply can be enriched with related linguistic data such as synonyms, collocations, example sentences and morphological and statistical information.

We also provide solutions in the areas of full-text search, terminology extraction, document classification and categorization, data mining, and information retrieval.

Data samples

Word frequency lists: English, Spanish, French, Arabic, Russian, Portuguese, Hindi.

Bigram databases: English, Spanish, German, Russian.
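
As a rough illustration of what a word frequency list and a bigram database contain, the Python sketch below counts single words and adjacent word pairs in a tiny corpus. It is not the pipeline used to build the databases described here: the simple regex tokenizer, the function names `tokenize`, `frequency_list` and `bigram_counts`, and the two-sentence corpus are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters (hypothetical, simplified tokenizer)."""
    # [^\W\d_] matches any Unicode letter; an internal apostrophe (e.g. "don't") is kept.
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text.lower())


def frequency_list(texts):
    """Count how often each word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def bigram_counts(texts):
    """Count adjacent word pairs; pairs never span two separate texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Toy corpus invented for this example.
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

word_freq = frequency_list(corpus)
bigram_freq = bigram_counts(corpus)

print(word_freq.most_common(3))
print(bigram_freq.most_common(2))
```

Real corpora of billions of words typically also need language-specific tokenization, lemmatization and deduplication, none of which this sketch attempts.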

our products

corpus query and management system

online dictionary editor

term extraction

A Course in Lexicography and Lexical Computing