Building large text corpora

Lexical Computing has long experience in building large, high-quality text corpora from the web. We have both the expertise and the tools needed to crawl large amounts of text data and process it into a clean text corpus.

We specialize in building and processing corpora of billions of words. Our largest corpora reach the 73-billion-word mark.

On-demand corpus building

We can develop language data and word databases for major languages as well as for languages whose resources are scarce or non-existent. In the past, we have produced dozens of language corpora to customers' specifications and processed them into lexical databases, word lists, n-gram lists and other types of data.

Web corpus building proceeds in several stages:

Web crawling

We have developed a dedicated tool for linguistic web crawling that collects text-rich web pages for inclusion in the corpus, yielding a truly representative sample of the language.
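One common way a crawler decides whether a page is "text-rich" is to compare the amount of visible text with the amount of markup. The sketch below is purely illustrative and does not reflect the internals of Lexical Computing's crawler; the `0.3` threshold and the regex-based tag stripping are assumptions chosen for the example.

```python
import re

# Crude tag stripper -- real crawlers use a proper HTML parser,
# but a regex is enough to illustrate the heuristic.
TAG_RE = re.compile(r"<[^>]+>")

def text_richness(html: str) -> float:
    """Ratio of visible text length to total HTML length (0.0 to 1.0)."""
    text = TAG_RE.sub(" ", html)
    text = re.sub(r"\s+", " ", text).strip()
    return len(text) / max(len(html), 1)

def is_text_rich(html: str, threshold: float = 0.3) -> bool:
    # Threshold is an illustrative default, not a tuned production value.
    return text_richness(html) >= threshold
```

A page consisting mostly of running prose scores high, while a link-heavy navigation page scores low and would be skipped by the crawler.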


Text cleaning

A specialized boilerplate removal tool is applied to the crawled texts to remove unwanted portions: navigation and menus, advertising, legal notices, tabular data and any other types of text unsuitable for linguistic analysis and therefore for inclusion in the corpus.
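Boilerplate removers typically score each block of a page and discard blocks that do not look like running text. The minimal sketch below uses a single length heuristic; it is not the actual tool described above, and the five-word cutoff is an assumption for the sake of the example (production tools combine many more signals, such as link density and markup context).

```python
def strip_boilerplate(lines, min_words=5):
    """Keep only blocks long enough to be running text.

    Short fragments (menu items, button labels, copyright lines)
    are dropped. The min_words cutoff is illustrative, not tuned.
    """
    kept = []
    for line in lines:
        if len(line.split()) >= min_words:
            kept.append(line.strip())
    return kept
```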

Deduplication – removing duplicated content

Much web content is copied and republished in many places, so web crawling collects duplicate instances of the same text as well as texts modified to varying extents. The whole corpus content therefore undergoes a deduplication procedure in which both perfect duplicates and near duplicates are removed, so that only one instance of each text is preserved. The parameters of the tool can be adjusted to the customer's specification to exclude or include different levels of similar content.
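A standard way to detect near duplicates is to break each text into overlapping word n-grams ("shingles") and compare the resulting sets with Jaccard similarity; the similarity threshold is the adjustable parameter mentioned above. This is a generic sketch of the technique, not the specific deduplication tool in our pipeline, and the shingle size and 0.8 threshold are illustrative assumptions.

```python
def shingles(text, n=5):
    """Set of overlapping word n-grams for a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n])
            for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of the two texts' shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def is_near_duplicate(a, b, threshold=0.8):
    # Raising the threshold keeps more similar content;
    # lowering it removes more aggressively.
    return jaccard(a, b) >= threshold
```

Identical texts score 1.0, a text with a few words appended still scores above the threshold, and unrelated texts score near 0.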

Tokenization, tagging, lemmatization

The text in the corpus is usually tokenized (divided into words) and can be processed further with linguistic tools that enrich the content with part-of-speech tags and assign a base form to each word form (lemmatization). These annotations make it possible to exploit all available features of our software, such as term extraction, collocation finding, definition finding, example sentence finding, generation of lemmatized word lists and others.
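To make the two steps concrete, the toy sketch below tokenizes a sentence with a regular expression and lemmatizes it with a lookup table. The tiny `LEMMAS` dictionary is a placeholder assumption; real pipelines use full morphological analysers and taggers trained per language.

```python
import re

# Words or single punctuation marks become tokens.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Toy lemma lookup -- a production pipeline would use a proper
# morphological analyser for the language in question.
LEMMAS = {"corpora": "corpus", "built": "build", "words": "word"}

def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(sentence)

def lemmatize(tokens):
    """Map each token to its base form, falling back to lowercase."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]
```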

Full linguistic processing may not be available for all languages, but we will develop the support on request.