We provide large, high-quality word databases, lexical data, word lists, and lexicons in many languages. Our data are generated from text corpora, which are large databases of authentic text. Our text data total 1 trillion words, and the largest corpus contains over 80 billion words. This scale allows us to generate databases of millions or even hundreds of millions of items while preserving accuracy and reliability. The corpora are also ideal datasets for language modelling, including training large language models (LLMs). Our customers are software developers, publishers of dictionaries and language teaching materials, and anyone else who needs reliable language data.
The databases we supply can be enriched with related linguistic data such as synonyms, collocations, and example sentences, as well as morphological and statistical information.
We also provide solutions in the area of full-text search, terminology extraction, document classification and categorization, data mining and information retrieval.
Data samples
Word frequency lists: English, Spanish, French, Arabic, Russian, Portuguese, Hindi.
Bigram databases: English, Spanish, German, Russian.
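For readers unfamiliar with these two data types, the following is a minimal Python sketch of how a word frequency list and a bigram count table can be derived from raw text. It is an illustration only, not a description of our production pipeline: the function names, the simple regex tokenizer, and the sample sentence are all assumptions made for this example, and real corpora call for language-specific tokenization and sentence splitting.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters as tokens.

    Hypothetical, simplified tokenizer: it keeps Unicode letters and an
    optional internal apostrophe (e.g. "don't"), and drops digits and
    punctuation.
    """
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text.lower())


def word_frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def bigram_frequencies(tokens):
    """Count each pair of adjacent tokens.

    Note: this sketch pairs tokens across sentence boundaries, which a
    real pipeline would usually avoid.
    """
    return Counter(zip(tokens, tokens[1:]))


sample = "The cat sat on the mat. The cat slept."
tokens = tokenize(sample)
word_freq = word_frequencies(tokens)
bigram_freq = bigram_frequencies(tokens)

# Print the items ordered from most to least frequent.
for word, count in word_freq.most_common(3):
    print(word, count)
for (first, second), count in bigram_freq.most_common(3):
    print(first, second, count)
```

In this sample, "the" occurs three times and the bigram "the cat" occurs twice. At corpus scale the same counts are typically normalized, for example to occurrences per million words, so that figures from corpora of different sizes can be compared.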