Languages and corpora

Already available

Our data, tools and services are, in most cases, based on a large sample of language called a corpus. Word lists, n-grams, lexical databases and any other data we supply are generated from these corpora. We are constantly developing new corpora and increasing the coverage of languages. The languages and corpora currently available are listed below.
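As a rough illustration of how a word list and n-grams can be derived from a corpus, here is a minimal Python sketch. It is not the pipeline used to build the data described here: the sample text, the naive whitespace tokenizer and the function names `tokenize`, `word_list` and `ngrams` are all assumptions made for the example. Real corpora are tokenized, lemmatized and tagged with language-specific tools.

```python
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"

def tokenize(text):
    """Naive tokenizer (illustrative only): split on whitespace,
    strip surrounding punctuation and lowercase each token."""
    tokens = []
    for raw in text.split():
        word = raw.strip(PUNCTUATION).lower()
        if word:
            tokens.append(word)
    return tokens

def word_list(tokens):
    """Frequency list: maps each word form to its number of occurrences."""
    return Counter(tokens)

def ngrams(tokens, n):
    """Frequency list of contiguous n-token sequences.
    Sentence boundaries are ignored in this simplified version."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical two-sentence "corpus" standing in for a real one.
sample = "The cat sat on the mat. The cat slept."
tokens = tokenize(sample)
freq = word_list(tokens)
bigrams = ngrams(tokens, 2)

print(len(tokens))             # corpus size in tokens: 9
print(freq.most_common(2))     # [('the', 3), ('cat', 2)]
print(bigrams[("the", "cat")])  # 2
```

The "size in words" figures in the table below are counts of this kind, computed over the full corpus rather than a toy sample.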

Language support development

We have ample experience in developing support for new languages and building new text corpora. If your language is not currently supported or you need new data, please request that support be developed.

Languages and corpora already available

Name | Language | Access policy | Size in words
ACL Anthology Reference Corpus (ARC) English open 62,196,334
Afrikaans Web 2024 (afTenTen24) Afrikaans trial 142,303,550
Afrikaans Web 2024 (afTenTen24-stanza) Afrikaans trial 141,774,410
Afrikaans Wikipedia 2022 Afrikaans trial 22,227,137
Afrikaans Wikipedia corpus 2018 (afwiki) Afrikaans main 14,466,792
Albanian Web 2020 (sqTenTen20) Albanian trial 528,084,150
Alsatian Drama Corpus German main 276,204
American Spanish Web 2011 (esamTenTen11) Spanish main 7,475,579,365
Amharic Web 2013-17 (amWaC17) Amharic trial 25,975,846
ArabCC – Learner Corpus of English Essays English main 202,364
Arabic Learner Corpus (ALC) Arabic main 362,712
Arabic Trends (2014–today) Arabic trial 6,372,516,295
Arabic Web 2009 Arabic main 150,282,522
Arabic Web 2012 (arTenTen12) Arabic main 7,475,624,779
Arabic Web 2012 sample 115M (arTenTen12, Mada tagger) Arabic main 115,315,274
Arabic Web 2024 (arTenTen24) Arabic trial 6,572,150,262
Araneum Anglicum Africanum Maius [2015] English main 854,484,093
Araneum Anglicum Asiaticum Maius [2015] English main 867,259,037
Araneum Anglicum Maius [2015] English trial 888,466,066
Araneum Finnicum Maius [2014] Finnish main 817,453,523
Araneum Francogallicum Maius [2015] French main 933,688,995
Araneum Germanicum Maius [2013] German main 875,465,845
Araneum Hispanicum Maius [2013] Spanish main 892,299,770
Araneum Hungaricum Maius [2014] Hungarian trial 792,549,686
Araneum Italicum Maius (Italian, 14.12) 1,20 G Italian main 890,568,531
Araneum Nederlandicum Maius [2013] Dutch main 713,417,518
Araneum Polonicum Maius [2013] Polish main 595,768,667
Araneum Portugallicum Maius [2015] Portuguese main 862,134,902
Araneum Russicum Russicum Maius (Russia-only Russian, 15.03) 1,20 G Russian trial 859,319,823
Araneum Slovacum Maius [2013] Slovak trial 816,125,010
Armenian Wikipedia corpus 2020 (hywiki20) Armenian trial 51,349,694
Assamese Wikipedia 2023 (asWiki23) Assamese trial 2,581,684
Australian Legislative Corpus 2023 English ondemand 138,411,932
Bashkir Drama Corpus Bashkir main 18,723
Basque Web (BasqueWaC v2) Basque trial 99,719,584
Belarusian Web 2016 (beTenTen16) Belarusian trial 63,327,264
Belgian parliamentary debates (ParlaMint 2.1) French trial 30,865,918
Belgian parliamentary debates (ParlaMint 2.1, CoNLL format) French trial 30,864,767
Bengali Web (bnWaC) Bengali main 11,519,730
Bengali Web 2017 (bnTenTen17) Bengali main 812,606,941
Bengali Web 2021 (bnTenTen21) Bengali trial 470,732,738
BIBLE Polish-Swahili Polish main 138,216
BIBLE Swahili-Polish Swahili main 139,160
Boot Camp English English trial 85,683,246
Bosnian Web (bsWaC 1.2) Bosnian trial 248,478,730
Brexit corpus (English) English trial 108,452,923
Brexit corpus without retweets (English) English trial 4,789,571
British Academic Spoken English Corpus (BASE) English open 1,477,281
British Academic Written English Corpus (BAWE) English open 6,968,089
British Law Report Corpus English main 8,515,749
British National Corpus (BNC) English trial 96,132,981
British National Corpus (BNC), tagged by CLAWS English trial 96,052,598
British National Corpus 2014 (BNC2014, spoken part) English trial 10,495,185
British parliamentary debates (ParlaMint 2.1, CoNLL format) English trial 100,967,492
British Web 2007 (ukWaC) English main 1,313,058,436
Brown Corpus English open 1,007,299
Brown Family English main 6,963,778
Brown Family (CLAWS + TreeTagger tags) English main 6,975,474
Bulgarian National Corpus (BulgarianNC) Bulgarian main 20,975,703
Bulgarian National Corpus nonweb genres Bulgarian main 22,398,507
Bulgarian National Corpus with web Bulgarian main 419,512,059
Bulgarian parliamentary debates (ParlaMint 2.1) Bulgarian trial 19,099,991
Bulgarian parliamentary debates (ParlaMint 2.1, CoNLL format) Bulgarian trial 19,096,761
Bulgarian Web 2012 (bgTenTen12) Bulgarian main 705,156,683
Bulgarian Web 2021 (bgTenTen21) Bulgarian trial 4,695,125,771
Burmese Web 2021 (myTenTen21) Burmese trial 557,329,406
Cambridge Academic English English main 3,163,648
Cantonese Web (CantoneseWaC) Cantonese trial 30,898,663
Catalan Web 2014 (caTenTen14) Catalan trial 182,608,420
Cebuano Web 2018 (cebTenTen18) Cebuano trial 4,552,105
CELEN: Learner Corpus of Spanish in Japan Spanish open 658,467
CHILDES Afrikaans Corpus Afrikaans main 26,020
CHILDES Catalan Corpus Catalan main 209,525
CHILDES Croatian Corpus Croatian main 300,832
CHILDES Danish Corpus Danish main 285,231
CHILDES English Corpus English main 22,693,506
CHILDES Estonian Corpus Estonian main 313,457
CHILDES Farsi Corpus Persian main 120,527
CHILDES French Corpus French main 2,583,460
CHILDES Gaelic Corpus Irish main 16,848
CHILDES German Corpus German main 5,941,266
CHILDES Hebrew Corpus Hebrew main 807,657
CHILDES Hungarian Corpus Hungarian main 247,881
CHILDES Italian Corpus Italian main 459,881
CHILDES Japanese Corpus Japanese main 1,578,068
CHILDES Korean Corpus Korean main 36,056
CHILDES Norwegian Corpus Norwegian main 56,827
CHILDES Polish Corpus Polish main 1,041,300
CHILDES Portuguese Corpus Portuguese main 216,407
CHILDES Russian Corpus Russian main 48,791
CHILDES Spanish Corpus Spanish main 802,743
CHILDES Swedish Corpus Swedish main 520,478
CHILDES Tamil Corpus Tamil main 15,490
CHILDES Thai Corpus Thai main 243,939
CHILDES Turkish Corpus Turkish main 178,100
Chinese GigaWord 2 Corpus: Mainland, simplified Chinese Simplified main 205,031,379
Chinese GigaWord 2 Corpus: Taiwan, traditional Chinese Traditional main 382,600,557
Chinese Traditional Web (TaiwanWaC, Universal Sketch Grammar) Chinese Traditional main 259,156,002
Chinese Traditional Web 2011 (TaiwanWaC) Chinese Traditional main 259,156,002
Chinese Trends (2023–today) Chinese Simplified trial 24,995,980
Chinese Web 2005 (Internet-ZH, NEUCSP tagger) Chinese Simplified main 198,205,344
Chinese Web 2011 (zhTenTen11, sample 10M) Chinese Simplified main 9,012,125
Chinese Web 2011 (zhTenTen11, Stanford tagger) Chinese Simplified main 1,729,867,455
Chinese Web 2017 (zhTenTen17) Simplified Chinese Simplified trial 13,531,331,169
Chinese Web 2017 (zhTenTen17) Traditional Chinese Traditional trial 2,400,405,372
COLEM Spanish open 1,677,597
COMPAS 2015 English ondemand 114,967,191
COMPAS 2016 English ondemand 260,896,404
CoPEP - The Corpus of Portuguese from Academic Journals (v. 1.4) Portuguese main 40,423,011
Corpus Nko ߒߞߏ ߝߊ߬ߘߌ߬ߞߋ߬ߟߋ߲߬ߡߊ N'Ko open 4,102,593
Corpus of Academic Journal Articles (CAJA) English ondemand 78,970,299
Corpus of English Dialogues 1560–1760 English ondemand 1,151,171
Corpus of Estonian Web sentences 2020 Estonian main 280,961,465
Corpus of Estonian Web sentences 2021 Estonian main 473,455,876
Corpus of the MagyarOK teaching materials for Hungarian, levels A1 to B2 Hungarian open 259,200
COVID-19 Open Research Dataset (CORD-19) English open 1,443,530,655
Crimean Tatar National Monolingual & Parallel Corpora, Crimean Tatar Crimean Tatar open 2,958,868
Crimean Tatar National Monolingual & Parallel Corpora, English English open 92,947
Crimean Tatar National Monolingual & Parallel Corpora, Russian Russian open 538,135
Crimean Tatar National Monolingual & Parallel Corpora, Ukrainian Ukrainian open 344,454
Croatian parliamentary debates (ParlaMint 2.1) Croatian trial 20,337,753
Croatian parliamentary debates (ParlaMint 2.1, CoNLL format) Croatian trial 20,342,230
Croatian Web (hrWaC 2.2, ReLDI) Croatian trial 1,210,021,198
Croatian Web (hrWaC 2.2, RFTagger) Croatian trial 1,211,328,660
csSkELL v1 (whole documents) Czech main 1,717,516,129
csSkELL v2.2 (sentences with GDEX scores) Czech main 1,443,410,941
Cundeelee Wangka Stories (Cundeelee Wangka) Cundeelee Wangka ondemand 1,965
Cundeelee Wangka Stories (English) English ondemand 4,423
Czech Drama Corpus Czech main 135,105
Czech news and web 1995–2002 (czes2.2) Czech main 366,796,757
Czech parliamentary debates (ParlaMint 2.1) Czech trial 22,087,036
Czech parliamentary debates (ParlaMint 2.1, CoNLL format) Czech trial 22,104,199
Czech Trends (2014–today) Czech trial 2,035,300,469
Czech Web (csTenTen 12+17+19) Czech trial 11,722,066,502
Czech Web 2012 (csTenTen12 v9a) Czech main 4,175,089,441
Czech Web 2019 (csTenTen19) Czech main 6,280,217,621
Czech Web 2023 (csTenTen23) Czech trial 4,456,427,977
CzechParl 2012 (v2 with lempos) Czech main 37,184,025
Danish Gigaword (DAGW) Danish main 964,617,784
Danish parliamentary debates (ParlaMint 2.1) Danish trial 29,225,255
Danish parliamentary debates (ParlaMint 2.1, CoNLL format) Danish trial 29,205,018
Danish Trends Danish trial 99,376,976
Danish Web 2010 (DanishWaC) Danish main 288,272,967
Danish Web 2014 (daTenTen14) Danish main 2,040,976,501
Danish Web 2017 (daTenTen17) Danish main 1,956,590,663
Danish Web 2020 (daTenTen20) Danish trial 3,480,275,804
DGT-Translation Memory parallel – Bulgarian Bulgarian main 25,912,721
DGT-Translation Memory parallel – Croatian Croatian main 3,968,608
DGT-Translation Memory parallel – Czech Czech main 43,621,933
DGT-Translation Memory parallel – Danish Danish main 44,962,280
DGT-Translation Memory parallel – Dutch Dutch main 50,523,892
DGT-Translation Memory parallel – English English main 59,106,576
DGT-Translation Memory parallel – Estonian Estonian main 34,155,488
DGT-Translation Memory parallel – Finnish Finnish main 35,129,923
DGT-Translation Memory parallel – French French main 58,224,781
DGT-Translation Memory parallel – German German main 45,380,666
DGT-Translation Memory parallel – Greek Greek main 51,865,988
DGT-Translation Memory parallel – Hungarian Hungarian main 2,306,272
DGT-Translation Memory parallel – Irish Irish main 1,065,421
DGT-Translation Memory parallel – Italian Italian main 53,260,912
DGT-Translation Memory parallel – Latvian Latvian main 38,898,134
DGT-Translation Memory parallel – Lithuanian Lithuanian main 38,675,242
DGT-Translation Memory parallel – Maltese Maltese main 22,388,562
DGT-Translation Memory parallel – Polish Polish main 44,149,107
DGT-Translation Memory parallel – Portuguese Portuguese main 53,950,705
DGT-Translation Memory parallel – Romanian Romanian main 26,644,734
DGT-Translation Memory parallel – Slovak Slovak main 43,276,048
DGT-Translation Memory parallel – Slovenian Slovenian main 42,897,385
DGT-Translation Memory parallel – Spanish Spanish main 57,311,149
DGT-Translation Memory parallel – Swedish Swedish main 44,378,725
Directory of Open Access Journals (DOAJ) – English English trial 2,662,763,697
Dutch parliamentary debates (ParlaMint 2.1) Dutch trial 51,175,668
Dutch parliamentary debates (ParlaMint 2.1, CoNLL format) Dutch trial 51,156,406
Dutch Trends Dutch trial 275,705,181
Dutch Web 2014 (nlTenTen14) Dutch main 2,253,777,579
Dutch Web 2020 (nlTenTen20) Dutch trial 5,890,009,964
e-flux (International art English) English main 5,036,119
EcoLexicon English Corpus (EEC) English open 23,169,446
ELEXIS Bulgarian Web 2021 Bulgarian main 1,014,316,771
ELEXIS Bulgarian Web 2021 (bgTenTen21) WSD sample Bulgarian main 1,992,046
ELEXIS Croatian Web 2020 Croatian main 1,006,040,496
ELEXIS Croatian Web 2020 (hrTenTen20) WSD sample Croatian main 1,964,238
ELEXIS Czech Web 2019 Czech main 949,730,627
ELEXIS Czech Web 2019 (csTenTen19) WSD sample Czech main 1,970,054
ELEXIS Danish Web 2020 Danish main 989,769,308
ELEXIS Danish Web 2020 (daTenTen20) WSD sample Danish main 1,982,549
ELEXIS Dutch Web 2020 Dutch main 1,024,660,354
ELEXIS Dutch Web 2020 (nlTenTen20) WSD sample Dutch main 1,982,397
ELEXIS English Web 2020 English main 1,000,329,442
ELEXIS English Web 2020 (enTenTen20, no genres and topics) WSD sample English main 1,999,789
ELEXIS Estonian Web 2021 Estonian main 1,006,940,696
ELEXIS Estonian Web 2021 (etTenTen21) WSD sample Estonian main 1,995,380
ELEXIS Finnish Web 2019 Finnish main 1,011,352,644
ELEXIS Finnish Web 2019 (fiTenTen19) WSD sample Finnish main 1,993,821
ELEXIS French Web 2020 French main 1,069,392,783
ELEXIS French Web 2020 (frTenTen20) WSD sample French main 2,099,651
ELEXIS German Web 2020 German main 1,023,830,342
ELEXIS German Web 2020 (deTenTen20) WSD sample German main 1,998,166
ELEXIS Greek Web 2019 Greek main 1,003,265,093
ELEXIS Greek Web 2019 (elTenTen19) WSD sample Greek main 1,961,351
ELEXIS Hebrew Web 2021 Hebrew main 1,043,504,840
ELEXIS Hebrew Web 2021 (heTenTen21) WSD sample Hebrew main 2,017,821
ELEXIS Hungarian Web 2020 Hungarian main 994,806,145
ELEXIS Hungarian Web 2020 (huTenTen20) WSD sample Hungarian main 1,989,855
ELEXIS Irish Web 2021 Irish main 58,130,702
ELEXIS Irish Web 2021 (gaTenTen21) WSD sample Irish main 1,980,914
ELEXIS Italian Web 2020 Italian main 1,020,349,212
ELEXIS Italian Web 2020 (itTenTen20) WSD sample Italian main 1,996,623
ELEXIS Latvian Web 2021 Latvian main 1,029,262,793
ELEXIS Latvian Web 2021 (lvTenTen21) WSD sample Latvian main 2,006,576
ELEXIS Lithuanian Web 2021 Lithuanian main 846,563,251
ELEXIS Lithuanian Web 2021 (ltTenTen21) WSD sample Lithuanian main 2,004,075
ELEXIS Polish Web 2019 Polish main 987,945,132
ELEXIS Polish Web 2019 (plTenTen19) WSD sample Polish main 1,971,906
ELEXIS Portuguese Web 2020 Portuguese main 1,021,937,614
ELEXIS Portuguese Web 2020 (ptTenTen20) WSD sample Portuguese main 1,997,515
ELEXIS Romanian Web 2021 Romanian main 995,033,835
ELEXIS Romanian Web 2021 (roTenTen21) WSD sample Romanian main 1,968,801
ELEXIS Slovak Web 2021 Slovak main 1,008,238,227
ELEXIS Slovak Web 2021 (skTenTen21) WSD sample Slovak main 1,975,380
ELEXIS Slovene Web 2020 (slTenTen20) WSD sample Slovenian main 1,964,284
ELEXIS Slovenian Web 2020 Slovenian main 1,007,206,400
ELEXIS Spanish Web 2020 Spanish main 1,012,502,656
ELEXIS Spanish Web 2020 (esTenTen20) WSD sample Spanish main 1,988,999
ELEXIS Swedish Web 2020 Swedish main 1,006,477,461
ELEXIS Swedish Web 2020 (svTenTen20) WSD sample Swedish main 1,980,144
Elsevier OA CC-BY Corpus English main 187,615,459
English Broadsheet Newspapers 1993–2021 (SiBol) English main 858,566,374
English Corpus for SkELL 3.10 English main 1,038,200,313
English Corpus for SkELL 3.8 English main 1,041,772,774
English Corpus for SkELL 3.9 English main 1,041,138,575
English Drama Corpus English main 18,846,687
English Historical Book Collection (EEBO, ECCO, Evans) English main 826,296,048
English parliamentary debates (ParlaMint 2.1) English trial 100,616,051
English Preposition Corpus English trial 2,136,325
English Trends (2014–today) English trial 83,038,019,370
English Web 2008 (ententen08_tt31) English trial 3,083,193,293
English Web 2012 (enTenTen12) English main 11,191,860,036
English Web 2013 (enTenTen13) English main 19,685,733,337
English Web 2015 (enTenTen15) English main 13,190,556,334
English Web 2018 (enTenTen18) English main 21,926,740,748
English Web 2021 (enTenTen21) English trial 52,268,286,493
English Wikipedia English main 1,356,523,079
English Wikipedia sample with Error annotations English trial 951,824
Environment corpus English main 61,197,742
Estonian Corpus for Learners 2020 (etSkELL) Estonian main 280,572,215
Estonian coursebook corpus 2018 Estonian main 121,114
Estonian National Corpus 2021 (Estonian NC 2021) Estonian main 2,410,296,919
Estonian National Corpus 2021 (Estonian NC 2021, CoNLL format) Estonian main 2,410,296,919
Estonian National Corpus 2023 (Estonian NC 2023) Estonian main 3,080,721,728
Estonian Trends Estonian trial 209,213,109
Estonian Web 2017 (etTenTen17) Estonian main 658,558,136
Estonian Web 2019 (etTenTen19) Estonian main 508,447,009
Estonian Web 2021 (etTenTen21) Estonian trial 725,832,092
EUR-Lex 2/2016 parallel – Bulgarian Bulgarian trial 329,071,554
EUR-Lex 2/2016 parallel – Croatian Croatian trial 109,138,184
EUR-Lex 2/2016 parallel – Czech Czech trial 350,230,088
EUR-Lex 2/2016 parallel – Danish Danish trial 519,765,085
EUR-Lex 2/2016 parallel – Dutch Dutch trial 583,263,688
EUR-Lex 2/2016 parallel – English English trial 629,722,593
EUR-Lex 2/2016 parallel – Estonian Estonian trial 291,077,511
EUR-Lex 2/2016 parallel – Finnish Finnish trial 384,119,975
EUR-Lex 2/2016 parallel – French French trial 677,063,993
EUR-Lex 2/2016 parallel – German German trial 528,617,843
EUR-Lex 2/2016 parallel – Greek Greek trial 579,344,223
EUR-Lex 2/2016 parallel – Hungarian Hungarian trial 340,618,970
EUR-Lex 2/2016 parallel – Irish Irish trial 31,439,542
EUR-Lex 2/2016 parallel – Italian Italian trial 606,070,097
EUR-Lex 2/2016 parallel – Latvian Latvian trial 324,734,544
EUR-Lex 2/2016 parallel – Lithuanian Lithuanian trial 323,151,426
EUR-Lex 2/2016 parallel – Maltese Maltese trial 314,396,006
EUR-Lex 2/2016 parallel – Polish Polish trial 360,862,149
EUR-Lex 2/2016 parallel – Portuguese Portuguese trial 595,066,701
EUR-Lex 2/2016 parallel – Romanian Romanian trial 336,928,068
EUR-Lex 2/2016 parallel – Slovak Slovak trial 255,531,673
EUR-Lex 2/2016 parallel – Slovenian Slovenian trial 351,899,258
EUR-Lex 2/2016 parallel – Spanish Spanish trial 635,187,126
EUR-Lex 2/2016 parallel – Swedish Swedish trial 478,485,126
EUR-Lex judgments 12/2016 parallel – Bulgarian Bulgarian trial 17,071,495
EUR-Lex judgments 12/2016 parallel – Croatian Croatian trial 5,613,468
EUR-Lex judgments 12/2016 parallel – Czech Czech trial 18,226,505
EUR-Lex judgments 12/2016 parallel – Danish Danish trial 34,934,021
EUR-Lex judgments 12/2016 parallel – Dutch Dutch trial 40,534,071
EUR-Lex judgments 12/2016 parallel – English English trial 42,339,337
EUR-Lex judgments 12/2016 parallel – Estonian Estonian trial 15,029,608
EUR-Lex judgments 12/2016 parallel – Finnish Finnish trial 23,601,422
EUR-Lex judgments 12/2016 parallel – French French trial 48,023,524
EUR-Lex judgments 12/2016 parallel – German German trial 35,297,517
EUR-Lex judgments 12/2016 parallel – Greek Greek trial 35,815,108
EUR-Lex judgments 12/2016 parallel – Hungarian Hungarian trial 17,940,879
EUR-Lex judgments 12/2016 parallel – Italian Italian trial 42,053,315
EUR-Lex judgments 12/2016 parallel – Latvian Latvian trial 16,908,831
EUR-Lex judgments 12/2016 parallel – Lithuanian Lithuanian trial 16,252,111
EUR-Lex judgments 12/2016 parallel – Maltese Maltese trial 19,146,797
EUR-Lex judgments 12/2016 parallel – Polish Polish trial 18,799,551
EUR-Lex judgments 12/2016 parallel – Portuguese Portuguese trial 35,412,936
EUR-Lex judgments 12/2016 parallel – Romanian Romanian trial 17,592,388
EUR-Lex judgments 12/2016 parallel – Slovak Slovak trial 18,265,664
EUR-Lex judgments 12/2016 parallel – Slovenian Slovenian trial 18,439,766
EUR-Lex judgments 12/2016 parallel – Spanish Spanish trial 39,431,836
EUR-Lex judgments 12/2016 parallel – Swedish Swedish trial 30,666,764
Europarl spoken parallel – Bulgarian Bulgarian trial 9,215,233
Europarl spoken parallel – Czech Czech trial 13,013,774
Europarl spoken parallel – Danish Danish trial 48,343,860
Europarl spoken parallel – Dutch Dutch trial 54,007,722
Europarl spoken parallel – English English trial 53,837,625
Europarl spoken parallel – English English open 15,099,625
Europarl spoken parallel – Estonian Estonian trial 11,171,727
Europarl spoken parallel – Finnish Finnish trial 34,182,031
Europarl spoken parallel – French French trial 59,145,988
Europarl spoken parallel – French French open 16,815,290
Europarl spoken parallel – German German trial 47,805,055
Europarl spoken parallel – Greek Greek trial 38,868,863
Europarl spoken parallel – Hungarian Hungarian trial 12,421,715
Europarl spoken parallel – Italian Italian trial 52,871,060
Europarl spoken parallel – Latvian Latvian trial 11,920,085
Europarl spoken parallel – Lithuanian Lithuanian trial 11,424,032
Europarl spoken parallel – Polish Polish trial 13,034,164
Europarl spoken parallel – Polish Polish open 13,034,164
Europarl spoken parallel – Portuguese Portuguese trial 53,778,766
Europarl spoken parallel – Romanian Romanian trial 9,554,864
Europarl spoken parallel – Slovak Slovak trial 12,942,651
Europarl spoken parallel – Slovenian Slovenian trial 12,496,942
Europarl spoken parallel – Spanish Spanish trial 54,302,284
Europarl spoken parallel – Spanish Spanish open 15,513,307
Europarl spoken parallel – Swedish Swedish trial 46,303,799
European Spanish Web 2011 (eseuTenTen11) Spanish main 2,021,633,644
Film Corpus English main 21,661,806
Finnish Web 2014 (fiTenTen14) Finnish trial 1,404,083,812
Finnish Web 2014 (fiTenTen14, TreeTagger v2) Finnish main 1,404,100,049
Frantext (French literature of the 18th-20th century) French main 15,573,070
Frantext (French literature of the 18th-20th century), without trends French main 15,573,070
French corpus of 88,000 SMS (88milSMS) French trial 1,206,663
French Drama Corpus French main 12,822,260
French parliamentary debates (ParlaMint 2.1) French trial 32,214,147
French parliamentary debates (ParlaMint 2.1, CoNLL format) French trial 32,176,380
French Trends French trial 793,899,522
French Web 2008 (v2 with lempos) French main 104,705,211
French Web 2010 (frWaC) French main 1,330,564,200
French Web 2012 (frTenTen12) French main 9,889,689,889
French Web 2017 (frTenTen17) French main 5,752,261,039
French Web 2020 (frTenTen20) French main 15,115,914,647
French Web 2023 (frTenTen23) French trial 23,874,070,858
Georgian Web 2013 (kaWaC) Georgian trial 50,713,604
German Corpus for SkELL 1.0 German main 769,810,745
German Drama Corpus German main 9,374,314
German Political Speeches Corpus German trial 11,144,258
German Trends German trial 1,458,055,059
German Web 2010 German main 2,338,036,362
German Web 2010 (deWaC) German main 1,348,188,416
German Web 2013 (deTenTen13) German main 16,526,335,416
German Web 2018 (deTenTen18) German main 5,346,041,196
German Web 2020 (deTenTen20) German trial 17,512,733,172
GerManC (German Newspapers 1650-1800) German main 667,310
Gigafida v2.0 (referenčni) Slovenian main 1,109,441,592
Greek Drama Corpus Greek main 269,334
Greek Web (GkWaC with lempos) Greek main 124,285,612
Greek Web 2014 (elTenTen14) Greek main 1,671,692,845
Greek Web 2019 (elTenTen19) Greek trial 2,342,091,029
Guangwai - Lancaster Chinese Learner Corpus Chinese Simplified open 1,289,060
Gujarati Web (guWaC) Gujarati main 17,960,095
Gujarati Web 2021 (guTenTen21) Gujarati trial 88,574,710
Gutenberg Afrikaans 2020 Afrikaans main 315,010
Gutenberg Bulgarian 2020 Bulgarian main 33,352
Gutenberg Catalan 2020 Catalan main 1,320,242
Gutenberg Chinese Traditional 2020 Chinese Traditional main 27,136,782
Gutenberg Czech 2020 Czech main 364,683
Gutenberg Danish 2020 Danish main 3,959,344
Gutenberg Dutch 2020 Dutch main 87,390,658
Gutenberg English 2020 English main 2,903,177,585
Gutenberg Esperanto 2020 Esperanto trial 2,024,013
Gutenberg Finnish 2020 Finnish main 68,174,366
Gutenberg French 2020 French main 197,560,500
Gutenberg German 2020 German main 74,709,930
Gutenberg Greek 2020 Greek main 7,837,742
Gutenberg Hebrew 2020 Hebrew main 158,212
Gutenberg Hungarian 2020 Hungarian main 9,140,833
Gutenberg Icelandic 2020 Icelandic main 82,211
Gutenberg Italian 2020 Italian main 93,049,296
Gutenberg Japanese 2020 Japanese main 963,368
Gutenberg Latin 2020 Latin main 3,871,335
Gutenberg Norwegian Bokmål 2020 Norwegian Bokmål main 762,295
Gutenberg Polish 2020 Polish main 421,318
Gutenberg Portuguese 2020 Portuguese main 14,309,476
Gutenberg Russian 2020 Russian main 13,643
Gutenberg Serbian 2020 Serbian main 70,724
Gutenberg Spanish 2020 Spanish main 37,202,233
Gutenberg Swedish 2020 Swedish main 7,919,783
Gutenberg Tagalog 2020 Tagalog main 2,468,064
Gutenberg Telugu 2020 Telugu main 157,077
Gutenberg Welsh 2020 Welsh main 221,733
Hausa Web 2015 (hausaWaC15) Hausa (Boko) trial 5,304,300
Hebrew Drama Corpus Hebrew main 954,359
Hebrew General Corpus (web crawled, mostly newspapers) Hebrew main 157,947,728
Hebrew Translation Corpus Hebrew trial 1,180,003
Hebrew Web (HebWaC) Hebrew main 47,832,254
Hebrew Web 2014 (heTenTen14, Meni/Alon tagged + lempos) Hebrew ondemand 895,876,116
Hebrew Web 2014 (heTenTen14, no POS tagging) Hebrew main 890,282,843
Hebrew Web 2021 (heTenTen21) Hebrew trial 2,775,686,699
Hindi Web 2012 (HindiWaC v. 4) Hindi trial 107,960,109
Hindi Web 2013 (hiTenTen13) Hindi main 351,289,441
Hindi Web 2017 (hiTenTen17) Hindi main 1,228,379,747
Hindi Web 2021 (hiTenTen21) Hindi trial 792,395,313
Hungarian Drama Corpus Hungarian main 533,088
Hungarian parliamentary debates (ParlaMint 2.1) Hungarian trial 858,844
Hungarian parliamentary debates (ParlaMint 2.1, CoNLL format) Hungarian trial 856,543
Hungarian Web 2012 (huTenTen12) Hungarian main 2,572,620,694
Hungarian Web 2020 (huTenTen20) Hungarian main 5,164,717,029
Hungarian Web 2023 (huTenTen23) Hungarian trial 3,494,350,960
Icelandic Gigaword Corpus 2017 Icelandic main 532,028,866
Icelandic parliamentary debates (ParlaMint 2.1) Icelandic trial 23,468,157
Icelandic parliamentary debates (ParlaMint 2.1, CoNLL format) Icelandic trial 23,461,109
Icelandic texts [sample] Icelandic trial 5,436,035
Icelandic Web 2020 (isTenTen20) Icelandic trial 518,620,759
Igbo Web 2015 (IgboWaC15) Igbo main 331,042
Igbo Web 2017 (igTenTen17) Igbo trial 629,294
Indonesian Web (IndonesianWaC) Indonesian trial 90,120,046
Indonesian Web 2020 (idTenTen20) Indonesian main 3,687,192,045
Indonesian Web 2024 (idTenTen24) Indonesian trial 7,108,841,939
Irish Syllabic Poetry, circa 1200-1650 (BARDIC@TCD) Irish open 478,445
Irish Trends Irish trial 1,802,383
Irish Web 2022 (gaTenTen22) Irish trial 125,040,541
Italian Corpus for SkELL 1.0 Italian main 328,270,600
Italian Drama Corpus Italian main 1,669,717
Italian parliamentary debates (ParlaMint 2.1) Italian trial 26,549,927
Italian parliamentary debates (ParlaMint 2.1, CoNLL format) Italian trial 26,571,966
Italian Trends (2014–today) Italian trial 9,039,772,149
Italian Web 2006 (itWaC) Italian main 1,597,295,469
Italian Web 2010 (itTenTen) Italian main 2,588,873,046
Italian Web 2016 (itTenTen16) Italian main 4,989,729,171
Italian Web 2020 (itTenTen20) Italian trial 12,451,734,885
itWAC (reduced) Italian main 751,542,948
Japanese Web 2006 (jpWaC) Japanese main 336,867,039
Japanese Web 2011 (jaTenTen11) Japanese trial 8,432,294,787
Japanese Web 2011 (jaTenTen11, sample) Japanese main 301,407,652
Japanese Web 2011 sample (jaTenTen11, LUW) Japanese trial 163,837,764
Kannada Web 2012 (knWaC12) Kannada trial 11,056,526
KAS-Dipl (diplome) Slovenian main 568,188,810
KAS-Dr (doktorati) Slovenian main 30,244,519
KAS-Mag (magisteriji) Slovenian main 157,168,378
Khmer Web 2018 (kmTenTen18) Khmer main 16,500,379
Khmer Web 2021 (kmTenTen21) Khmer trial 103,066,083
Korean Web 2012 (koTenTen12) Korean main 461,196,240
Korean Web 2018 (koTenTen18) Korean trial 1,668,851,720
KSUCCA (Classical Arabic) Arabic trial 46,705,577
Lao Web 2018 (loTenTen18) Lao main 15,862,991
Lao Web 2019 (loTenTen19) Lao trial 105,018,584
LatinISE corpus Latin trial 11,202,216
Latvian parliamentary debates (ParlaMint 2.1) Latvian trial 6,318,701
Latvian parliamentary debates (ParlaMint 2.1, CoNLL format) Latvian trial 6,342,984
Latvian Web (LatvianWaC) Latvian main 57,666,024
Latvian Web 2014 (lvTenTen14) Latvian trial 530,367,474
Lektor (Learner corpus of proofreading and translations) Slovenian main 953,038
LEXMCI English main 1,448,180,339
Lithuanian parliamentary debates (ParlaMint 2.1) Lithuanian trial 14,573,624
Lithuanian parliamentary debates (ParlaMint 2.1, CoNLL format) Lithuanian trial 14,428,682
Lithuanian Web (LithuanianWaC v2) Lithuanian main 48,650,918
Lithuanian Web 2014 (ltTenTen14) Lithuanian trial 778,151,979
London English Corpus English main 2,391,040
MaCoCu Albanian Web v1 (2022) Albanian main 617,643,884
MaCoCu Bosnian Web v1 (2021-2022) Bosnian trial 715,708,157
MaCoCu Croatian Web v2 (2021–2022) Croatian main 2,299,750,788
MaCoCu Macedonian Web v2 (2021) Macedonian trial 512,171,886
MaCoCu Maltese Web v2 (2021) Maltese main 331,665,362
MaCoCu Montenegrin Web v1 (2021-2022) Montenegrin main 157,680,373
MaCoCu Serbian Web v1 (2021-2022) Serbian main 2,435,143,021
MaCoCu Slovene Web v2 (2021-2022) Slovenian main 1,863,942,989
MaCoCu Turkish Web v2 (2021) Turkish main 4,261,087,826
MaCoCu Ukrainian Web v1 (2021-2022) Ukrainian main 5,912,040,719
Magpie corpus English main 4,597,782
Malay Web 2020 (msTenTen20) Malay trial 296,419,465
Malayalam Web (malayalamWaC) Malayalam trial 15,950,663
Malaysian Web (MalaysianWaC) Malay trial 182,578,743
Maldivian Wikipedia corpus 2019 (dvwiki) Maldivian trial 548,211
Maltese MLRS Corpus Maltese trial 110,714,844
Maltese Trends Maltese trial 6,839,047
Maori Web 2013 and 2020 (miTenTen20) Maori trial 11,814,825
Medical Web Corpus English main 33,961,786
Merlin Written Learner Czech Czech main 75,526
Merlin Written Learner German German main 150,256
Merlin Written Learner Italian Italian main 107,797
METCLIL: Metaphor in EMI seminars English open 110,493
Mongolian Web Texts 2016 (mnWaC16) Mongolian trial 6,104,565
Mueller Report English trial 167,103
Nepalbhasa Online Media Corpus Newari open 7,750,050
Nepali National Corpus Nepali trial 13,440,835
Nepali Web (NepaliWaC) Nepali main 1,290,388
New corpus for English (NCI English) English main 216,618,095
New Model Corpus English main 95,276,958
Newspapers in Portuguese (CetemPúblico, CetenFolha) Portuguese main 56,768,822
Norwegian dictionary corpus (Nynorskkorpuset) Norwegian main 74,496,664
Norwegian Web 2012 Norwegian main 669,511,569
Norwegian Web 2017 (noTenTen17, Bokmål and Nynorsk) Norwegian trial 2,630,849,803
Norwegian Web 2017 (noTenTen17, Bokmål) Norwegian Bokmål trial 2,461,704,417
Norwegian Web 2017 (noTenTen17, Nynorsk) Norwegian Nynorsk trial 169,145,386
OEC English ondemand 2,073,319,589
Old French and Middle French (BFM 2022) French main 6,002,552
Open American National Corpus (spoken) English main 3,202,026
Open American National Corpus (written) English main 11,048,137
Open Cambridge Learner Corpus (Uncoded) English ondemand 2,975,701
Open Parallel Corpus (OPUS) – Afrikaans Afrikaans main 586,334
Open Parallel Corpus (OPUS) – Albanian Albanian main 46,304,346
Open Parallel Corpus (OPUS) – Arabic Arabic main 300,000,057
Open Parallel Corpus (OPUS) – Bosnian Bosnian main 43,582,516
Open Parallel Corpus (OPUS) – Bulgarian Bulgarian main 183,115,244
Open Parallel Corpus (OPUS) – Croatian Croatian main 121,369,625
Open Parallel Corpus (OPUS) – Czech Czech main 203,845,619
Open Parallel Corpus (OPUS) – Danish Danish main 120,107,271
Open Parallel Corpus (OPUS) – Dutch Dutch main 356,363,571
Open Parallel Corpus (OPUS) – English English main 1,139,515,048
Open Parallel Corpus (OPUS) – Estonian Estonian main 64,879,741
Open Parallel Corpus (OPUS) – Finnish Finnish main 131,985,872
Open Parallel Corpus (OPUS) – French French main 766,833,908
Open Parallel Corpus (OPUS) – German German main 125,229,773
Open Parallel Corpus (OPUS) – Greek Greek main 239,360,926
Open Parallel Corpus (OPUS) – Hebrew Hebrew main 130,972,343
Open Parallel Corpus (OPUS) – Hindi Hindi main 854,741
Open Parallel Corpus (OPUS) – Hungarian Hungarian main 157,495,018
Open Parallel Corpus (OPUS) – Italian Italian main 180,532,849
Open Parallel Corpus (OPUS) – Japanese Japanese main 5,455,106
Open Parallel Corpus (OPUS) – Korean Korean main 374,850
Open Parallel Corpus (OPUS) – Latvian Latvian main 24,499,516
Open Parallel Corpus (OPUS) – Lithuanian Lithuanian main 29,621,940
Open Parallel Corpus (OPUS) – Macedonian Macedonian main 40,348,792
Open Parallel Corpus (OPUS) – Persian Persian main 4,425,133
Open Parallel Corpus (OPUS) – Polish Polish main 208,008,636
Open Parallel Corpus (OPUS) – Portuguese Portuguese main 297,700,205
Open Parallel Corpus (OPUS) – Portuguese Portuguese main 272,300,927
Open Parallel Corpus (OPUS) – Romanian Romanian main 282,408,295
Open Parallel Corpus (OPUS) – Russian Russian main 307,709,872
Open Parallel Corpus (OPUS) – Serbian Serbian main 153,237,786
Open Parallel Corpus (OPUS) – Slovak Slovak main 62,451,407
Open Parallel Corpus (OPUS) – Slovenian Slovenian main 121,228,966
Open Parallel Corpus (OPUS) – Spanish Spanish main 701,944,027
Open Parallel Corpus (OPUS) – Swedish Swedish main 102,298,686
Open Parallel Corpus (OPUS) – Turkish Turkish main 151,342,424
Open Parallel Corpus (OPUS) – Ukrainian Ukrainian main 2,577,481
Open Parallel Corpus OPUS – Chinese Simplified Chinese Simplified main 243,427,123
Open Parallel Corpus OPUS – Chinese Traditional Chinese Traditional main 380,245
Open Parallel Corpus OPUS – Norwegian Bokmål Norwegian main 20,237,510
OpenSubtitles 2018 parallel – Afrikaans Afrikaans main 341,349
OpenSubtitles 2018 parallel – Albanian Albanian main 15,662,170
OpenSubtitles 2018 parallel – Arabic Arabic main 333,329,378
OpenSubtitles 2018 parallel – Armenian Armenian main 24,216
OpenSubtitles 2018 parallel – Basque Basque main 3,919,829
OpenSubtitles 2018 parallel – Bengali Bengali main 2,270,841
OpenSubtitles 2018 parallel – Bosnian Bosnian main 125,323,299
OpenSubtitles 2018 parallel – Brazilian Portuguese Portuguese main 545,598,189
OpenSubtitles 2018 parallel – Breton Breton trial 85,503
OpenSubtitles 2018 parallel – Bulgarian Bulgarian main 371,685,493
OpenSubtitles 2018 parallel – Catalan Catalan main 3,273,561
OpenSubtitles 2018 parallel – Chinese Simplified Chinese Simplified main 119,998,854
OpenSubtitles 2018 parallel – Chinese Traditional Chinese Traditional main 41,876,166
OpenSubtitles 2018 parallel – Croatian Croatian main 370,177,938
OpenSubtitles 2018 parallel – Czech Czech main 453,218,524
OpenSubtitles 2018 parallel – Danish Danish main 135,228,416
OpenSubtitles 2018 parallel – Dutch Dutch main 444,413,064
OpenSubtitles 2018 parallel – English English main 1,211,666,401
OpenSubtitles 2018 parallel – Esperanto Esperanto main 396,790
OpenSubtitles 2018 parallel – Estonian Estonian main 107,391,459
OpenSubtitles 2018 parallel – European Portuguese Portuguese main 466,021,603
OpenSubtitles 2018 parallel – Finnish Finnish main 175,247,181
OpenSubtitles 2018 parallel – French French main 462,749,061
OpenSubtitles 2018 parallel – Galician Galician trial 1,572,312
OpenSubtitles 2018 parallel – Georgian Georgian main 1,157,136
OpenSubtitles 2018 parallel – German German main 185,133,927
OpenSubtitles 2018 parallel – Greek Greek main 457,347,003
OpenSubtitles 2018 parallel – Hebrew Hebrew main 371,473,205
OpenSubtitles 2018 parallel – Hindi Hindi main 675,322
OpenSubtitles 2018 parallel – Hungarian Hungarian main 378,525,740
OpenSubtitles 2018 parallel – Icelandic Icelandic main 9,194,074
OpenSubtitles 2018 parallel – Indonesian Indonesian main 77,273,767
OpenSubtitles 2018 parallel – Italian Italian main 431,415,848
OpenSubtitles 2018 parallel – Japanese Japanese main 15,224,480
OpenSubtitles 2018 parallel – Kazakh Kazakh main 14,172
OpenSubtitles 2018 parallel – Korean Korean main 7,432,927
OpenSubtitles 2018 parallel – Latvian Latvian main 2,494,901
OpenSubtitles 2018 parallel – Lithuanian Lithuanian main 6,806,857
OpenSubtitles 2018 parallel – Macedonian Macedonian main 28,859,153
OpenSubtitles 2018 parallel – Malay Malay main 13,465,077
OpenSubtitles 2018 parallel – Malayalam Malayalam main 1,671,708
OpenSubtitles 2018 parallel – Norwegian (Mixed) Norwegian main 61,215,172
OpenSubtitles 2018 parallel – Persian Persian main 53,444,595
OpenSubtitles 2018 parallel – Polish Polish main 496,167,686
OpenSubtitles 2018 parallel – Romanian Romanian main 658,289,867
OpenSubtitles 2018 parallel – Russian Russian main 180,032,832
OpenSubtitles 2018 parallel – Serbian Serbian main 480,367,760
OpenSubtitles 2018 parallel – Sinhalese Sinhalese trial 3,430,727
OpenSubtitles 2018 parallel – Slovak Slovak main 66,455,056
OpenSubtitles 2018 parallel – Slovenian Slovenian main 198,366,873
OpenSubtitles 2018 parallel – Spanish Spanish main 753,235,853
OpenSubtitles 2018 parallel – Swedish Swedish main 153,717,474
OpenSubtitles 2018 parallel – Tagalog Tagalog main 96,291
OpenSubtitles 2018 parallel – Tamil Tamil main 132,055
OpenSubtitles 2018 parallel – Telugu Telugu main 109,730
OpenSubtitles 2018 parallel – Thai Thai main 33,223,171
OpenSubtitles 2018 parallel – Turkish Turkish main 461,809,489
OpenSubtitles 2018 parallel – Ukrainian Ukrainian main 5,054,963
OpenSubtitles 2018 parallel – Urdu Urdu main 229,947
OpenSubtitles 2018 parallel – Vietnamese Vietnamese main 31,848,385
OPUS MontenegrinSubs parallel – English English trial 468,337
OPUS MontenegrinSubs parallel – Montenegrin Montenegrin trial 365,698
Oromo Web 2016 (orWaC16) Oromo trial 4,249,953
Oxford Children's Corpus 2015 (PTag) English ondemand 210,322,185
Oxford Children's Corpus 2015 – Education (PTag) English ondemand 1,323,174
Oxford Children's Corpus 2015 – Reading (PTag) English ondemand 34,284,687
Oxford Children's Corpus 2015 – Writing (PTag) English ondemand 174,714,324
Oxford Children's Corpus 2016 (PTag) English ondemand 284,360,063
Oxford Children's Corpus 2016 – Reading (PTag) English ondemand 53,858,955
Oxford Children's Corpus 2016 – Writing (PTag) English ondemand 229,177,934
Oxford Corpus of Academic English (OCAE, April 2012) English ondemand 71,371,739
Paisa Italian main 221,989,288
ParlaTalk Austria parliamentary debates (lower house) German trial 7,675,413
ParlaTalk Austria parliamentary debates (upper house) German trial 3,101,421
ParlaTalk Belgium parliamentary debates (lower house) French trial 58,073,338
ParlaTalk Bulgaria parliamentary debates Bulgarian trial 15,221,455
ParlaTalk Czech Republic parliamentary debates (lower house) Czech trial 22,091,557
ParlaTalk Czech Republic parliamentary debates (upper house) Czech trial 11,737,338
ParlaTalk Czechia - parliamentary debates Czech main 23,090,562
ParlaTalk Denmark parliamentary debates Danish trial 80,017,714
ParlaTalk Estonia parliamentary debates Estonian trial 11,665,859
ParlaTalk Finland parliamentary debates Finnish trial 22,660,060
ParlaTalk France - parliamentary debates French main 72,411,399
ParlaTalk France parliamentary debates (lower house) French trial 61,116,819
ParlaTalk France parliamentary debates (upper house) French trial 181,508,579
ParlaTalk German parliamentary debates (lower house) German trial 130,988,058
ParlaTalk Greek parliamentary debates Greek trial 23,540,099
ParlaTalk Hungary parliamentary debates Hungarian trial 3,077,151
ParlaTalk Ireland parliamentary debates English trial 121,302,091
ParlaTalk Italy parliamentary debates (lower house) Italian trial 7,656,348
ParlaTalk Italy parliamentary debates (upper house) Italian trial 13,308,453
ParlaTalk Netherlands parliamentary debates (lower house) Dutch trial 82,035,039
ParlaTalk Netherlands parliamentary debates (upper house) Dutch trial 12,073,192
ParlaTalk Poland parliamentary debates (upper house) Polish trial 20,409,110
ParlaTalk Portugal - parliamentary debates Portuguese main 145,415,953
ParlaTalk Portugal parliamentary debates Portuguese trial 141,098,975
ParlaTalk Romania parliamentary debates (lower house) Romanian trial 15,772,145
ParlaTalk Romania parliamentary debates (upper house) Romanian trial 27,543,309
ParlaTalk Slovakia parliamentary debates Slovak trial 9,790,175
ParlaTalk Slovenia parliamentary debates (lower house) Slovenian trial 26,002,443
ParlaTalk Spain Republic parliamentary debates (lower house) Spanish trial 1,882,700
ParlaTalk Sweden parliamentary debates Swedish trial 131,739,759
Parsed German Web (sDeWaC) German main 755,165,551
Penn Corpora of Historical English English ondemand 3,800,639
Persian Trends Persian trial 317,172,244
PICAE 2010 English ondemand 31,025,920
Polish Drama Corpus Polish main 117,230
Polish language of the 1960s Polish main 546,042
Polish Parliamentary Corpus (PPC) Polish main 553,858,723
Polish parliamentary debates (ParlaMint 2.1) Polish trial 26,619,472
Polish parliamentary debates (ParlaMint 2.1, CoNLL format) Polish trial 26,882,964
Polish Trends Polish trial 673,382,464
Polish Web (PolishWac, Morfeusz and TaKIPI tagger) Polish main 103,028,410
Polish Web 2012 (plTenTen12, RFTagger) Polish main 7,715,835,214
Polish Web 2012 sample (plTenTen12) Polish main 45,208,497
Polish Web 2019 (plTenTen19) Polish trial 3,994,024,317
Polish Web 2019 term reference (plTenTen19_01) Polish trial 181,036,098
Portuguese Trends Portuguese trial 738,742,299
Portuguese Web 2011 (ptTenTen11) Portuguese main 3,896,392,719
Portuguese Web 2011 (ptTenTen11, Palavras parsed) Portuguese main 2,757,635,105
Portuguese Web 2018 (ptTenTen18) Portuguese trial 7,407,393,731
Portuguese Web 2023 (ptTenTen23) Portuguese trial 16,976,742,883
Project Gutenberg English English main 443,471,071
pukWaC (ukWaC parsed with MaltParser) English main 39,496,785
Quran annotated corpus [unvowelled Arabic] Arabic main 128,243
Quran annotated corpus [unvowelled Latin] Arabic main 99,268
Quran annotated corpus [vowelled Arabic] Arabic main 128,241
Quran annotated corpus [vowelled Latin] Arabic main 97,970
RapCor1360 - Francophone rap songs French trial 735,513
Riznica v0.1 Croatian main 85,273,724
Roman Drama Corpus Latin main 278,890
Romanian Web 2016 (roTenTen16) Romanian main 2,640,496,763
Romanian Web 2021 (roTenTen21) Romanian trial 2,763,173,824
ruSkELL 1.6 Russian main 975,584,449
Russian Drama Corpus Russian main 2,011,699
Russian Sites in Estonian Web 2017–2023 Russian main 312,244,562
Russian Trends Russian trial 1,559,839,691
Russian Web 2006 (v2 with lempos) Russian main 147,930,261
Russian Web 2011 (ruTenTen11) Russian trial 14,553,856,113
Russian Web 2017 (ruTenTen17) Russian trial 9,034,837,939
Samoan Web (SamoanWac1) Samoan trial 3,115,385
Santa Barbara Corpus of Spoken American English English main 249,655
ScienceBlogs English main 103,175,233
Scottish Gaelic Wiki 2015 (gdWiki) Scottish Gaelic trial 980,026
Semcor v3.0 (sense-tagged corpus) English main 664,038
Serbian Web (srWaC 1.2 processed by Hunpos) Serbian trial 477,724,164
Serbian Web (srWaC 1.2 processed by RFTagger v1) Serbian (Latin) trial 441,888,202
Serbian Web (srWaC 1.2) Serbian (Latin) trial 476,888,297
Setswana/Tswana Web (SetswanaWaC v2) Setswana trial 11,496,687
Shakespeare English Drama Corpus English main 810,929
Shakespeare German Drama Corpus German main 796,439
Slovak Trends Slovak trial 199,626,628
Slovak Web 2011 (skTenTen11) Slovak main 540,112,634
Slovak Web 2011 (skTenTen11, ambiguity tag, lempos) Slovak main 715,707,053
Slovak Web 2023 (skTenTen23) Slovak trial 898,031,101
Slovene Trends Slovenian trial 118,154,887
Slovenian parliamentary debates (ParlaMint 2.1) Slovenian trial 19,933,512
Slovenian parliamentary debates (ParlaMint 2.1, CoNLL format) Slovenian trial 19,933,836
Slovenian reference corpus (FidaPLUS v2) Slovenian trial 600,309,637
Slovenian Web (slWaC 2.1) Slovenian trial 754,255,589
Slovenian Web (slWaC 2.1, processed with TreeTagger version 2) Slovenian trial 755,255,547
Slovenian Web 2015 (slTenTen15, TreeTagger v2) Slovenian trial 829,544,337
Somali Web 2016 (soWaC16) Somali trial 71,871,585
SoNaR Dutch ondemand 425,978,755
Sorani Kurdish Wikipedia corpus 2020 (ckbwiki20) Kurdish (Sorani) trial 5,042,449
Spanish Calderon Drama Corpus Spanish main 2,112,643
Spanish Drama Corpus Spanish main 371,624
Spanish parliamentary debates (ParlaMint 2.1) Spanish trial 12,875,498
Spanish parliamentary debates (ParlaMint 2.1, CoNLL format) Spanish trial 12,930,870
Spanish Trends Spanish trial 1,431,421,603
Spanish Web 2005 (SpanishWaC) Spanish main 97,773,185
Spanish Web 2011 (esTenTen11, Eu + Am) Spanish main 9,497,213,009
Spanish Web 2018 (esTenTen18) Spanish main 16,953,735,742
Spanish Web 2023 (esTenTen23) Spanish trial 28,652,392,686
Susanne English trial 128,998
Swahili Web 2014 (swWaC) Swahili trial 17,882,483
Swedish Drama Corpus Swedish main 581,524
Swedish Parole Swedish main 21,735,113
Swedish Web 2014 (svTenTen14) Swedish trial 3,401,035,817
Tagalog (Filipino) Web 2019 (tlTenTen19) Tagalog trial 198,303,250
Tajik Web (TajikWaC) Tajik trial 93,151,897
TalkBank Persian (blog posts) Persian trial 269,753,238
Tamil Web 2015 (TamilWaC) Tamil main 26,750,515
Tamil Web 2021 (taTenTen21) Tamil trial 823,837,031
Tatar Drama Corpus Turkish main 10,595
Tatar Mixed Corpus Tatar trial 102,779,803
Tatar News (2000–2014) Tatar main 24,927,439
Tatar Web 2015 sample Tatar trial 195,901
Telugu Web 2017 (teTenTen) Telugu trial 126,807,158
Terms of Service (English) English open 168,199
Thai Web (ThaiWaC) Thai trial 82,787,119
Thai Web 2018 (thTenTen18) Thai trial 640,530,227
The Annotated Corpus of Classical Tibetan (ACTib 2.0) Tibetan trial 170,202,078
The Digital Corpus of Sanskrit (2010 – 2019) Sanskrit (romanised) trial 3,361,394
The Digital Parisian Stage Corpus French main 172,202
The New Corpus for Ireland Irish main 29,886,201
Tigrinya Web 2016 (tiWaC16) Tigrinya trial 2,087,613
Timestamped JSI web corpus 2014-2016 Catalan Catalan trial 99,395,494
Timestamped JSI web corpus 2014-2016 Finnish Finnish trial 119,109,490
Timestamped JSI web corpus 2014-2016 French French trial 1,870,341,756
Timestamped JSI web corpus 2014-2016 German German trial 1,987,759,563
Timestamped JSI web corpus 2014-2016 Hebrew Hebrew trial 111,339,363
Timestamped JSI web corpus 2014-2016 Hungarian Hungarian trial 180,843,359
Timestamped JSI web corpus 2014-2016 Korean Korean trial 438,816,127
Timestamped JSI web corpus 2014-2016 Polish Polish trial 157,930,228
Timestamped JSI web corpus 2014-2016 Portuguese Portuguese trial 1,109,771,393
Timestamped JSI web corpus 2014-2016 Russian Russian trial 1,120,731,416
Timestamped JSI web corpus 2014-2016 Serbian Serbian trial 86,380,673
Timestamped JSI web corpus 2014-2016 Spanish Spanish trial 4,055,944,612
Timestamped JSI web corpus 2014-2016 Swedish Swedish trial 335,782,681
Timestamped JSI web corpus 2014-2021 Catalan Catalan main 449,634,119
Timestamped JSI web corpus 2014-2021 Finnish Finnish main 421,879,841
Timestamped JSI web corpus 2014-2021 French French main 6,998,186,326
Timestamped JSI web corpus 2014-2021 German German main 7,055,641,455
Timestamped JSI web corpus 2014-2021 Hebrew Hebrew main 466,851,576
Timestamped JSI web corpus 2014-2021 Hungarian Hungarian main 903,862,798
Timestamped JSI web corpus 2014-2021 Korean Korean main 1,576,995,357
Timestamped JSI web corpus 2014-2021 Polish Polish main 973,863,152
Timestamped JSI web corpus 2014-2021 Portuguese Portuguese main 4,685,199,909
Timestamped JSI web corpus 2014-2021 Russian Russian main 5,788,590,952
Timestamped JSI web corpus 2014-2021 Serbian Serbian main 565,311,513
Timestamped JSI web corpus 2014-2021 Spanish Spanish main 16,358,148,966
Timestamped JSI web corpus 2014-2021 Swedish Swedish main 1,162,692,802
Timestamped JSI web corpus 2014-2022 Estonian Estonian main 270,502,859
Timestamped JSI web corpus 2021-03 Catalan Catalan main 12,107,597
Timestamped JSI web corpus 2021-03 Czech Czech main 20,431,801
Timestamped JSI web corpus 2021-03 Finnish Finnish main 6,154,402
Timestamped JSI web corpus 2021-03 French French main 145,384,862
Timestamped JSI web corpus 2021-03 German German main 126,775,824
Timestamped JSI web corpus 2021-03 Hebrew Hebrew main 8,450,710
Timestamped JSI web corpus 2021-03 Hungarian Hungarian main 30,439,114
Timestamped JSI web corpus 2021-03 Italian Italian main 365,307,999
Timestamped JSI web corpus 2021-03 Korean Korean main 19,324,576
Timestamped JSI web corpus 2021-03 Polish Polish main 38,911,481
Timestamped JSI web corpus 2021-03 Portuguese Portuguese main 108,540,406
Timestamped JSI web corpus 2021-03 Russian Russian main 150,971,438
Timestamped JSI web corpus 2021-03 Serbian Serbian main 15,122,285
Timestamped JSI web corpus 2021-03 Spanish Spanish main 373,185,400
Timestamped JSI web corpus 2021-03 Swedish Swedish main 22,715,935
Timestamped JSI web corpus 2021-04 Catalan Catalan main 8,926,986
Timestamped JSI web corpus 2021-04 Czech Czech main 15,095,366
Timestamped JSI web corpus 2021-04 Finnish Finnish main 5,624,514
Timestamped JSI web corpus 2021-04 French French main 113,581,013
Timestamped JSI web corpus 2021-04 German German main 89,579,085
Timestamped JSI web corpus 2021-04 Hebrew Hebrew main 6,544,178
Timestamped JSI web corpus 2021-04 Hungarian Hungarian main 23,392,828
Timestamped JSI web corpus 2021-04 Italian Italian main 261,813,779
Timestamped JSI web corpus 2021-04 Korean Korean main 15,506,235
Timestamped JSI web corpus 2021-04 Polish Polish main 28,676,001
Timestamped JSI web corpus 2021-04 Portuguese Portuguese main 85,486,841
Timestamped JSI web corpus 2021-04 Russian Russian main 117,645,204
Timestamped JSI web corpus 2021-04 Serbian Serbian main 12,237,307
Timestamped JSI web corpus 2021-04 Spanish Spanish main 289,923,417
Timestamped JSI web corpus 2021-04 Swedish Swedish main 16,876,787
Timestamped JSI web corpus 2021-2022 Ukrainian Ukrainian main 199,135,032
Timestamped JSI web corpus 2021-22 Spanish Spanish main 5,869,620,451
Toxicity Corpus English main 102,132,547
Transhistorical Corpus of Written English (TCWE) English open 501,633
Turkic web – Azerbaijani Azerbaijani trial 94,267,206
Turkic web – Kazakh Kazakh trial 139,417,763
Turkic web – Kyrgyz Kyrgyz trial 19,369,507
Turkic web – Turkmen Turkmen trial 2,105,359
Turkic web – Uzbek Uzbek trial 18,720,334
Turkish parliamentary debates (ParlaMint 2.1) Turkish trial 40,873,301
Turkish parliamentary debates (ParlaMint 2.1, CoNLL format) Turkish trial 42,913,306
Turkish Web (trWaC) Turkish main 32,791,491
Turkish Web 2012 (trTenTen12) Turkish main 3,388,418,900
Turkish Web 2020 (trTenTen20) Turkish trial 4,980,168,485
Ukrainian Drama Corpus Ukrainian main 322,441
Ukrainian Trends Ukrainian trial 713,557,915
Ukrainian Web 2014 (ukTenTen14) Ukrainian main 2,194,447,594
Ukrainian Web 2020 and 2014 (ukTenTen20) Ukrainian main 2,592,516,436
Ukrainian Web 2022 (ukTenTen22) Ukrainian trial 7,594,784,148
UKWaC super sensed English main 315,402,632
United Nations Parallel Corpus (UNPC) – Arabic Arabic trial 545,594,235
United Nations Parallel Corpus (UNPC) – Chinese Chinese Simplified trial 372,004,482
United Nations Parallel Corpus (UNPC) – English English trial 664,924,245
United Nations Parallel Corpus (UNPC) – French French trial 800,980,141
United Nations Parallel Corpus (UNPC) – Russian Russian trial 529,667,487
United Nations Parallel Corpus (UNPC) – Spanish Spanish trial 692,809,915
Urdu Web (UrduWaC) Urdu main 53,269,273
Urdu Web 2018 (urTenTen18) Urdu trial 245,656,128
Vietnamese Web (viWaC) Vietnamese trial 106,664,817
Vietnamese Web 2017 (viTenTen17) Vietnamese trial 6,056,899,600
Welsh Web 2013 (WelshWaC) Welsh trial 12,458,397
Welsh web corpus Welsh main 50,392,441
Western Frisian Web 2013 (FrisianWaC) Frisian trial 3,116,119
Western Punjabi Web 2017 in Shahmukhi script (pnbTenTen17) Punjabi (Gurmukhi) trial 2,806,904
Yiddish Drama Corpus Yiddish main 51,351
Yiddish Wikipedia corpus 2018 (yiwiki) Yiddish trial 2,106,912
Yoruba Web 2015 (YorubaWaC15) Yoruba trial 2,816,965