The NOW Corpus (News on the Web) is composed of [...] billion words of data. There are [...] billion words of data from the past year and [...] million words from the last month.
As an example of how the corpus is growing, there are 45,341,832 new words of text from the
last week (22-11-04 through 22-11-10).
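To put the weekly figure in perspective, the short sketch below turns it into an average daily rate and a rough yearly projection. It assumes the dates are written as YY-MM-DD (so the range is 4 to 10 November 2022, inclusive) and that the rate stays constant over a year, which is only an illustration and not a figure reported by the corpus.

```python
from datetime import date

# Figure quoted in the text; the dates are read as YY-MM-DD (an assumption).
new_words = 45_341_832
start, end = date(2022, 11, 4), date(2022, 11, 10)

# The range is inclusive, so add one day to the difference.
days = (end - start).days + 1
words_per_day = new_words / days

# Naive projection: assumes the weekly rate holds for a full year.
words_per_year = words_per_day * 365

print(f"{days} days, {words_per_day:,.0f} words/day, "
      f"about {words_per_year / 1e9:.2f} billion words/year")
```

On these assumptions the corpus grows by roughly 6.5 million words per day.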
Listings of the articles (textID, date, country, source, URL, title) are available for 2010-2015 and for each year from 2016 through 2023, along with totals by country and month, by country and source, and by country, year, and source.

The 10,000 or so articles added each day come from two sources. In addition to Bing News (Google News before July 2019), we also search 1,000+ websites to find articles that have appeared in the previous 24 hours. We then download the texts; clean them up with JusText to remove boilerplate material; tag and lemmatize them; and integrate them into our existing relational database architecture. The texts are usually available by about 7 PM (Utah time) for a given date.

COUNTRY | # WEBSITES | # TEXTS | # WORDS | |
United States | 12,932 | 1,484,115 | 1,122,633,101 | (through 22-11-10) |
Canada | 1,795 | 1,442,420 | 957,769,816 | |
Great Britain | 4,435 | 1,391,262 | 895,485,206 | |
India | 1,319 | 1,568,927 | 845,407,591 | |
Australia | 1,370 | 910,848 | 552,904,051 | |
Ireland | 471 | 989,943 | 512,267,807 | |
South Africa | 481 | 856,707 | 453,320,925 | |
Nigeria | 259 | 738,007 | 381,276,797 | |
New Zealand | 383 | 649,209 | 338,374,070 | |
Singapore | 371 | 657,256 | 314,598,872 | |
Malaysia | 227 | 618,929 | 278,029,285 | |
Philippines | 422 | 484,335 | 235,523,508 | |
Pakistan | 322 | 475,853 | 223,288,713 | |
Kenya | 155 | 277,437 | 119,303,289 | |
Ghana | 113 | 275,781 | 119,209,492 | |
Sri Lanka | 138 | 78,641 | 44,599,312 | |
Jamaica | 21 | 73,575 | 39,172,072 | |
Bangladesh | 62 | 71,647 | 34,018,356 | |
Hong Kong | 183 | 63,192 | 30,591,608 | |
Tanzania | 26 | 20,022 | 9,566,126 | |
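
The table above can be read programmatically to derive figures it does not state directly, such as the average length of a text per country. The following is a minimal sketch using three rows copied from the table; the `CountryStats` class and `parse_row` helper are hypothetical names introduced for illustration and are not part of any corpus tooling.

```python
from dataclasses import dataclass

# Three rows copied from the table above, in its pipe-separated layout.
SAMPLE_ROWS = """\
United States | 12,932 | 1,484,115 | 1,122,633,101 | (through 22-11-10) |
Canada | 1,795 | 1,442,420 | 957,769,816 | |
Hong Kong | 183 | 63,192 | 30,591,608 | |
"""


@dataclass
class CountryStats:
    """Hypothetical container for one table row."""
    country: str
    websites: int
    texts: int
    words: int

    @property
    def words_per_text(self) -> float:
        return self.words / self.texts


def to_int(cell: str) -> int:
    """Convert a number with thousands separators, e.g. '1,795', to int."""
    return int(cell.replace(",", ""))


def parse_row(line: str) -> CountryStats:
    """Parse one 'country | websites | texts | words | ...' row."""
    cells = [cell.strip() for cell in line.split("|")]
    country, websites, texts, words = cells[:4]  # ignore the notes column
    return CountryStats(country, to_int(websites), to_int(texts), to_int(words))


stats = [parse_row(line) for line in SAMPLE_ROWS.splitlines() if line.strip()]
total_words = sum(s.words for s in stats)

for s in stats:
    share = 100 * s.words / total_words
    print(f"{s.country:<15} {s.words_per_text:8.1f} words/text "
          f"{share:5.1f}% of sample")
```

The percentages are shares of the three sample rows only; pasting all rows of the table into `SAMPLE_ROWS` would give shares of the full listing.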