The NOW Corpus (News on the Web) is composed of billion words of data. There are billion words of data from the past year and million words from the last month.

As an example of how the corpus is growing, there are 45,341,832 new words of text from the last week (22-11-04 through 22-11-10).

See listing of 0 articles (textID, date, country, source, URL, title): 2010-2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023.

See totals by: country / month

See totals by: country / source

See totals by: country / year / source

The roughly 10,000 articles added each day come from two sources. We find articles through Bing News (Google News before July 2019) and by searching 1,000+ websites for articles that have appeared in the previous 24 hours. We then download the texts, clean them up with JusText (to remove boilerplate material), tag and lemmatize them, and integrate them into our existing relational database architecture. The texts are usually available by about 7 PM (Utah time) for a given date.
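The processing steps described above (clean, tag, store) can be illustrated with a small sketch. This is not the corpus's actual code: the paragraph-length filter is only a crude stand-in for JusText, the tokenizer is a placeholder for real tagging and lemmatization, and the SQLite schema, table names, and function names are all hypothetical. Only the metadata fields (textID, date, country, source, URL, title) are taken from the article listings mentioned on this page.

```python
import re
import sqlite3
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> element (crude stand-in for JusText)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means we are outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Normalize whitespace inside the paragraph before saving it.
            self.paragraphs.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def clean_html(html, min_words=5):
    """Drop short paragraphs, assuming (hypothetically) they are boilerplate."""
    parser = ParagraphExtractor()
    parser.feed(html)
    return [p for p in parser.paragraphs if len(p.split()) >= min_words]


def tokenize(text):
    """Placeholder for tagging and lemmatizing: lowercase word tokens only."""
    return re.findall(r"[a-z0-9']+", text.lower())


def make_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sources (text_id INTEGER PRIMARY KEY, date TEXT, "
        "country TEXT, source TEXT, url TEXT, title TEXT, n_words INTEGER)"
    )
    conn.execute(
        "CREATE TABLE tokens (text_id INTEGER, position INTEGER, word TEXT)"
    )
    return conn


def store_article(conn, text_id, date, country, source, url, title, tokens):
    """Insert one article's metadata row and one row per token."""
    conn.execute(
        "INSERT INTO sources VALUES (?, ?, ?, ?, ?, ?, ?)",
        (text_id, date, country, source, url, title, len(tokens)),
    )
    conn.executemany(
        "INSERT INTO tokens VALUES (?, ?, ?)",
        [(text_id, i, tok) for i, tok in enumerate(tokens)],
    )
```

In this sketch, one article would pass through `clean_html`, then `tokenize`, then `store_article`. A real pipeline would replace the first two with JusText and a proper tagger and lemmatizer.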

TOTAL = 0 WORDS (through 22-11-10)

COUNTRY        # WEBSITES    # TEXTS        # WORDS
United States      12,932  1,484,115  1,122,633,101
Canada              1,795  1,442,420    957,769,816
Great Britain       4,435  1,391,262    895,485,206
India               1,319  1,568,927    845,407,591
Australia           1,370    910,848    552,904,051
Ireland               471    989,943    512,267,807
South Africa          481    856,707    453,320,925
Nigeria               259    738,007    381,276,797
New Zealand           383    649,209    338,374,070
Singapore             371    657,256    314,598,872
Malaysia              227    618,929    278,029,285
Philippines           422    484,335    235,523,508
Pakistan              322    475,853    223,288,713
Kenya                 155    277,437    119,303,289
Ghana                 113    275,781    119,209,492
Sri Lanka             138     78,641     44,599,312
Jamaica                21     73,575     39,172,072
Bangladesh             62     71,647     34,018,356
Hong Kong             183     63,192     30,591,608
Tanzania               26     20,022      9,566,126
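
For readers who want to work with these per-country figures, the following is a minimal sketch of how rows of this shape could be parsed and summarized. It assumes each row is a country name followed by three comma-grouped integers (websites, texts, words) with no trailing note. The function names are illustrative only, and the three sample rows are copied from the table above.

```python
def parse_row(line):
    """Split 'Country name  websites  texts  words' into typed fields."""
    parts = line.split()
    # The last three fields are numeric; everything before them is the name.
    websites, texts, words = (int(p.replace(",", "")) for p in parts[-3:])
    return " ".join(parts[:-3]), websites, texts, words


def words_per_text(row):
    """Average article length in words for one parsed row."""
    _, _, texts, words = row
    return words / texts


# Three rows copied from the table; the remaining rows parse the same way.
sample = [
    "United States 12,932 1,484,115 1,122,633,101",
    "Jamaica 21 73,575 39,172,072",
    "Tanzania 26 20,022 9,566,126",
]

rows = [parse_row(line) for line in sample]
for row in rows:
    print(f"{row[0]:<15}{words_per_text(row):8.1f} words per text")

# Sum of the word counts for the sampled rows only, not for the whole corpus.
sample_total = sum(row[3] for row in rows)
```

Summing the word column over all rows in the same way gives the total for the countries listed in this table.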