We use HC Corpora data collected from English Twitter posts, blogs, and news to gain insight into natural language processing and, eventually, to construct an algorithm that predicts the next word during typing (e.g. auto-suggestions on mobile devices).

We will use the following files:

File                Lines      Words
en_US.blogs.txt     899288     37334114
en_US.news.txt      1010242    34365936
en_US.twitter.txt   2360148    30359804

The first step is to acquire and load the data. We obtain the English Twitter file from the Coursera project webpage and, in view of computational restrictions, choose a representative random subset of 10,000 lines (using the Linux sort -R command).
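For reference, such a subset can be produced with a shell one-liner along these lines (GNU sort; the output name test_file matches the R code below):

sort -R en_US.twitter.txt | head -n 10000 > test_file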

First, we load the text using the following R code.

# Representative subset of 10000 lines saved in test_file
fileName <- "test_file"
# Create a file connection
con <- file(fileName, "r")
# Read all whitespace-separated tokens
d <- scan(con, character(0))
# Close the connection
close(con)

Next, we want to exclude profanity from the analysis. We create a text file with common English profanity words and load it into our R session.

con <- file("bad_words", "r")
bad <- scan(con, character(0))
close(con)

For the analysis we tokenize the text into words (excluding the "bad" words), numbers, and punctuation, storing these elements in three separate vectors.

# toupper is vectorized, so the character vectors can be converted directly
d <- toupper(d)
bad <- toupper(bad)
words <- c(); numbers <- c(); puncts <- c()

for (i in seq_along(d)) {
  element <- d[i]
  # Keep only letters; skip empty results and profanity
  word <- gsub("[^A-Z]+", "", element)
  if (nchar(word) > 0 && !(word %in% bad)) {words <- c(words, word)}
  # Keep only digits
  number <- gsub("[^0-9]+", "", element)
  if (nchar(number) > 0) {numbers <- c(numbers, number)}
  # Keep only punctuation characters
  pun <- gsub("[^[:punct:]]", "", element)
  if (nchar(pun) > 0) {puncts <- c(puncts, pun)}
}



[Figure: word cloud of the most frequent words in the sample]

An interesting question is: what are the ten most common words in the dataset?

Word   Count
THE    14751
AND    8623
TO     8462
A      7202
OF     6953
I      6180
IN     4672
THAT   3763
IS     3439
IT     3194
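These counts can be reproduced from the words vector built above, for example:

# Frequency table of all words, most frequent first
freq <- sort(table(words), decreasing = TRUE)
head(freq, 10)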


These are common English function words. Let us now look at the most frequently used "long" words (more than 6 letters):

Word        Count
BECAUSE     467
THROUGH     282
SOMETHING   280
ANOTHER     247
THOUGHT     207
WITHOUT     155
DIFFERENT   153
FRIENDS     152
GETTING     148
STARTED     148
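A minimal sketch for this list, reusing the same words vector:

# Restrict to words longer than 6 letters before counting
long <- words[nchar(words) > 6]
head(sort(table(long), decreasing = TRUE), 10)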


To understand how many words constitute the major part of the language, we plot a histogram of the word frequencies.
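A sketch of such a histogram, reusing the frequency table freq from above (relative frequencies in percent):

# Relative frequency of each distinct word, in percent
rel <- 100 * freq / sum(freq)
hist(as.numeric(rel), breaks = 50,
     main = "Word frequencies", xlab = "Relative frequency (%)")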

We observe that most words occur with low (<5%) frequency.

The following graph shows how many distinct words are needed to cover a given percentage of the text.

An interesting observation is that only about 200 words are needed to cover roughly 50% of a text, and about 1000 words to cover 80%. We conclude that in English an average text consists of about 20% rare words.
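The coverage curve and the quoted numbers can be computed from cumulative frequencies, for instance:

# Cumulative share of the text covered by the n most frequent words
coverage <- 100 * cumsum(freq) / sum(freq)
plot(coverage, type = "l",
     xlab = "Number of distinct words", ylab = "Text coverage (%)")
# Smallest n reaching 50% and 80% coverage
which(coverage >= 50)[1]
which(coverage >= 80)[1]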

Many words tend to occur together, which means that the following word can often be predicted with high probability. For instance, let us look at the most common word pairs, counted as sketched below.
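A minimal sketch, assuming adjacent elements of the words vector form the pairs (sentence boundaries are ignored for simplicity):

# Pair each word with its successor and count the resulting bigrams
pairs <- paste(words[-length(words)], words[-1])
head(sort(table(pairs), decreasing = TRUE), 50)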

Word pair   Count
OF THE 1409
IN THE 1219
TO THE 679
ON THE 600
TO BE 532
AND THE 462
FOR THE 462
AND I 448
AT THE 403
IT WAS 402
I HAVE 396
IT IS 373
I WAS 373
IN A 371
IS A 352
I AM 325
WITH A 321
THAT I 319
WITH THE 312
OF A 281
FROM THE 271
IF YOU 257
THAT THE 241
THIS IS 239
ONE OF 235
WILL BE 230
IS THE 227
AS A 226
FOR A 226
YOU CAN 223
BY THE 220
OF MY 218
I HAD 216
BUT I 214
WAS A 203
ALL THE 202
HAVE A 201
WHEN I 197
A FEW 196
AND A 196
I THINK 193
HAVE BEEN 191
THE SAME 190
TO DO 190
GOING TO 188
HAVE TO 188
WANT TO 187
OUT OF 184
TO GET 183
TO MAKE 179


We can also explore collocations of longer words (>3 characters), counted as sketched below.
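A sketch under the assumption that short words are dropped before pairing, so previously non-adjacent long words become neighbours:

# Keep only words longer than 3 characters, then pair neighbours
lw <- words[nchar(words) > 3]
long_pairs <- paste(lw[-length(lw)], lw[-1])
head(sort(table(long_pairs), decreasing = TRUE), 50)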

Word pair      Count
HAVE BEEN 199
MORE THAN 114
WOULD HAVE 101
THEY WERE 100
THEY HAVE 86
THAT THEY 84
THAT HAVE 74
THAT WOULD 71
KNOW THAT 68
THAT THIS 60
THAT WILL 59
THIS WEEK 53
THIS YEAR 53
THERE WERE 52
WHEN THEY 52
DONT KNOW 46
KNOW WHAT 42
THAT THERE 42
LIKE THAT 41
LIKE THIS 41
THINK THAT 41
DONT HAVE 40
EVEN THOUGH 40
WITH THIS 40
THIS TIME 39
WHAT THEY 39
ALONG WITH 38
FACT THAT 38
ABOUT THIS 37
BECAUSE THEY 36
COULD HAVE 36
WOULD LIKE 36
RATHER THAN 35
THEY WILL 35
WILL HAVE 35
ABOUT WHAT 34
THAT COULD 34
WITH THAT 34
EACH OTHER 33
LAST NIGHT 33
LAST WEEK 33
WITH SOME 33
WITH YOUR 33
FEEL LIKE 32
MUCH MORE 32
THEY WOULD 32
AWAY FROM 31
TALKING ABOUT 30
THAT JUST 30
THINGS THAT 30



In the following publication we will provide a prediction algorithm for the next word during typing.