The process of web corpus building consists of several stages:
Web crawling
We developed a special tool for linguistic web crawling that collects text-rich web pages for inclusion in the corpus, yielding a truly representative sample of language.
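To illustrate the idea, the sketch below shows one way a crawler might decide whether a page is text-rich enough to keep: compare the amount of visible text to the size of the raw HTML. This is a simplified, hypothetical heuristic in Python; the threshold and the decision rule of the actual crawler are not reproduced here.

# Hypothetical "text-rich page" filter: keep a page only if visible text
# makes up a large enough share of the raw HTML. The threshold is
# illustrative, not the value used by the real tool.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping script and style elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def is_text_rich(html: str, min_ratio: float = 0.25) -> bool:
    """Accept a page when visible text is a large share of the HTML."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(" ".join(parser.chunks).split())
    return len(html) > 0 and len(text) / len(html) >= min_ratio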
Text cleaning
A special boilerplate removal tool is applied to the crawled texts to remove any unwanted portions, namely navigation and menus, advertising, legal text, tabular data and other types of text unsuitable for linguistic analysis and therefore for inclusion in the corpus.
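As a rough illustration of how such filtering can work, the sketch below keeps only blocks that look like running text and drops short or link-heavy blocks, which are typical of menus, footers and advertising. The thresholds are invented for the example and do not reflect the actual tool's configuration.

# Simplified block-level boilerplate filter: drop short blocks and
# blocks dominated by link text (navigation, menus, footers).
def clean_blocks(blocks):
    """blocks: list of (text, link_char_count) tuples, one per HTML block."""
    kept = []
    for text, link_chars in blocks:
        text = text.strip()
        if len(text) < 80:                          # too short to be body text
            continue
        if link_chars / max(len(text), 1) > 0.3:    # mostly links: navigation
            continue
        kept.append(text)
    return kept

# Example: a menu-like block is dropped, a real paragraph survives.
print(clean_blocks([
    ("Home | About | Contact", 22),
    ("Corpus linguistics studies language through large collections "
     "of authentic text, called corpora.", 0),
]))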
Deduplication – removing duplicated content
A lot of web content is copied and republished in many places, so web crawling collects duplicate instances of the same text, as well as texts modified to various extents. The whole corpus content therefore undergoes a deduplication procedure in which both perfect duplicates and near duplicates are removed, so that only one instance of each text is preserved. The parameters of the tool can be adjusted to the customer's specification to exclude or include different levels of similar content.
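The sketch below illustrates the principle in plain Python: exact duplicates are caught by hashing, near duplicates by comparing word n-gram (shingle) sets, with a threshold parameter standing in for the adjustable similarity setting mentioned above. A production pipeline would use a scalable scheme such as MinHash; this version is only for illustration.

# Exact deduplication by content hash, near-deduplication by Jaccard
# similarity of word shingles. The threshold is the adjustable parameter.
import hashlib

def shingles(text: str, n: int = 5) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n])
            for i in range(max(len(words) - n + 1, 1))}

def deduplicate(texts, threshold: float = 0.8):
    seen_hashes = set()   # for perfect duplicates
    kept = []             # (shingle set, text) for near-duplicate checks
    for text in texts:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue                              # perfect duplicate
        sh = shingles(text)
        if any(len(sh & other) / len(sh | other) >= threshold
               for other, _ in kept):
            continue                              # near duplicate
        seen_hashes.add(digest)
        kept.append((sh, text))
    return [text for _, text in kept]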
Tokenization, tagging, lemmatization
The text in the corpus is usually tokenized (divided into words and other tokens) and can be processed further with linguistic tools that enrich the content with part-of-speech tags and assign a base form to each word form (lemmatization). These linguistic annotations make it possible to exploit all available features of our software, such as term extraction, collocation finding, definition finding, example sentence finding, generation of lemmatized word lists and others.
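As a concrete, hypothetical example of such a pipeline, the snippet below uses NLTK as a stand-in for whichever tools a given language supports, producing a word, tag and lemma for each token.

# Tokenize -> tag -> lemmatize with NLTK. Resource names may differ
# slightly across NLTK versions.
import nltk
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(pkg, quiet=True)    # one-off model downloads

_lemmatizer = WordNetLemmatizer()
# Map Penn Treebank tag prefixes to WordNet part-of-speech codes.
_POS_MAP = {"J": "a", "V": "v", "N": "n", "R": "r"}

def annotate(sentence: str):
    """Return (word, tag, lemma) triples for one sentence."""
    tokens = nltk.word_tokenize(sentence)          # tokenization
    tagged = nltk.pos_tag(tokens)                  # part-of-speech tagging
    return [(word, tag,
             _lemmatizer.lemmatize(word.lower(), _POS_MAP.get(tag[0], "n")))
            for word, tag in tagged]               # lemmatization

for row in annotate("The crawled pages were cleaned and deduplicated."):
    print(*row, sep="\t")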
The full linguistic processing may not be available for all languages, but we will develop the support upon request from the customer.