Open-source NLP tools
Lexical Computing has a long-standing cooperation with the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, Czech Republic. Together, we have developed a number of open-source NLP tools, which are available for download. They have also been integrated into the Sketch Engine corpus query and management system, where they are applied to data automatically, so even users without technical knowledge can benefit from them.
NLP tools for non-technical users
We have integrated all the necessary NLP tools into our flagship product, Sketch Engine. Users can build and analyse large text corpora without any technical knowledge and without installing or configuring any tools.
JusText
boilerplate removal
JusText is an HTML boilerplate removal tool. It produces clean text by stripping navigation links, headers, footers, etc. from HTML pages, leaving only the main text made up of complete sentences.
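The idea of boilerplate removal can be illustrated with a minimal Python sketch. It is not JusText's actual algorithm or API: it only assumes that boilerplate blocks tend to be short and dominated by link text. The names `BlockCollector` and `clean_text`, the tag lists and both thresholds are hypothetical choices made for this illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries in this sketch (an assumption, not JusText's list).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "nav", "header", "footer"}
# Tags whose content is never visible text.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and records how much of each is link text."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, number of characters inside links)
        self._parts = []        # text fragments of the current block
        self._link_parts = []   # fragments of the current block that sit inside <a>
        self._link_depth = 0
        self._skip_depth = 0

    def flush(self):
        """Close the current block and store it if it contains any text."""
        text = " ".join("".join(self._parts).split())
        link_text = " ".join("".join(self._link_parts).split())
        if text:
            self.blocks.append((text, len(link_text)))
        self._parts, self._link_parts = [], []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._link_depth:
            self._link_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._parts.append(data)
        if self._link_depth:
            self._link_parts.append(data)


def clean_text(html, min_length=40, max_link_density=0.3):
    """Return the blocks that are long enough and not dominated by links."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    kept = []
    for text, link_chars in collector.blocks:
        link_density = link_chars / len(text)
        if len(text) >= min_length and link_density <= max_link_density:
            kept.append(text)
    return kept
```

On a page with a navigation bar, one paragraph and a short footer, `clean_text` would keep only the paragraph. A production tool would need further signals, since length and link density alone misclassify short headings and captions.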
Chared
encoding detection
Chared is a tool for detecting the character encoding of a text in a known language. It contains models for a wide range of languages.
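Why knowing the language helps can be shown with a small Python sketch. It is a simplified illustration and not Chared's model or interface: it decodes the bytes with each candidate encoding and keeps the one whose output contains the highest share of characters expected in the language. The function name `detect_encoding`, the candidate list and the Czech character set are assumptions made for this example.

```python
def detect_encoding(data, candidates, expected_chars):
    """Pick the candidate encoding whose decoded text looks most like the language.

    data           -- raw bytes of the document
    candidates     -- encoding names to try, in order of preference
    expected_chars -- set of lowercase characters typical of the language
    """
    best_encoding, best_score = None, -1.0
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        if not text:
            continue
        # Share of characters that belong to the expected alphabet.
        score = sum(ch in expected_chars for ch in text.lower()) / len(text)
        if score > best_score:
            best_encoding, best_score = encoding, score
    return best_encoding


# Hypothetical character inventory for Czech: letters plus the space.
CZECH_CHARS = set("aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž ")

sample = "Příliš žluťoučký kůň úpěl ďábelské ódy"
guess = detect_encoding(sample.encode("cp1250"),
                        ["utf-8", "iso-8859-2", "cp1250"],
                        CZECH_CHARS)
```

Here the Windows-1250 bytes are not valid UTF-8, and decoding them as ISO-8859-2 turns a few accented letters into control characters, so `cp1250` scores highest. A real detector would rely on statistical models trained per language rather than a plain character set.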
Spiderling
web spider for linguistics
Spiderling is a web spider designed for linguistics applications which can crawl text-rich parts of the web and collect data that are suitable for inclusion into text corpora. It is a key tool for our corpus building projects.
Onion
text deduplicator
Onion (ONe Instance ONly) is designed for deduplicating large text collections (corpora) by measuring the similarity of paragraphs or whole documents. Duplicate texts are removed based on a threshold set by the user.
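Threshold-based deduplication can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Onion's implementation or command-line interface: similarity is taken to be the share of a paragraph's word n-grams that already occurred in earlier paragraphs. The function names, the n-gram length and the default threshold are hypothetical.

```python
def ngrams(tokens, n):
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=3, threshold=0.5):
    """Keep a paragraph unless too many of its n-grams were seen before.

    threshold -- maximum allowed share of already-seen n-grams (0.0 to 1.0)
    """
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = ngrams(paragraph.lower().split(), n)
        if not grams:
            kept.append(paragraph)  # too short to compare, keep it
            continue
        duplicate_ratio = len(grams & seen) / len(grams)
        if duplicate_ratio <= threshold:
            kept.append(paragraph)
        # In this sketch, n-grams of removed paragraphs are remembered too.
        seen |= grams
    return kept


paragraphs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "an entirely different sentence about corpus linguistics",
]
unique = deduplicate(paragraphs)
```

The second paragraph shares six of its seven trigrams with the first, so it exceeds the 0.5 threshold and is dropped. Lowering the threshold removes more near-duplicates at the risk of discarding texts that merely share common phrases.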
Unitok
text tokeniser
Unitok is a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (vertical format), while preserving metadata in XML-like tags.
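The vertical format can be illustrated with a short Python sketch. It is not Unitok's code and ignores its language-specific settings: a single regular expression, chosen for this example, keeps XML-like tags intact and splits the remaining text into words and punctuation marks. The names `TOKEN_RE` and `to_vertical` are hypothetical.

```python
import re

# Order matters: try a whole XML-like tag first, then a word (optionally with
# inner hyphens or apostrophes), then any single non-space symbol.
TOKEN_RE = re.compile(r"<[^<>]+>|\w+(?:[-']\w+)*|[^\w\s]")


def to_vertical(text):
    """Return the text as newline-separated tokens, one token or tag per line."""
    return "\n".join(TOKEN_RE.findall(text))


vertical = to_vertical('<doc id="1"><p>Hello, world!</p></doc>')
print(vertical)
```

Each structural tag ends up on its own line, so document and paragraph metadata survive tokenisation and remain available to later processing steps.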
wiki2corpus
Wikipedia download
wiki2corpus is a script that downloads Wikipedia articles for a given language and outputs them in prevertical format, which can be further processed by other corpus tools.
NoSketch Engine
corpus query system
NoSketch Engine is an open-source corpus query system based on Sketch Engine. NoSketch Engine does not feature any of the automated corpus building tools integrated in Sketch Engine.