We use HC Corpora data collected from English Twitter posts, blogs, and news to gain insight into natural language processing and, eventually, to construct an algorithm that predicts the next word during typing (e.g. auto-suggestions on mobile devices).

We will use the following files:

File                Lines      Words
en_US.blogs.txt     899288     37334114
en_US.news.txt      1010242    34365936
en_US.twitter.txt   2360148    30359804

The first step is to acquire and load the data. We obtain the English Twitter file from the Coursera project webpage and, in view of computational restrictions, choose a representative random subset of 10,000 lines (using the Linux sort -R command).
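For reference, such a subset can be produced with a shell one-liner along these lines (GNU sort; the output name test_file matches the R code below):

sort -R en_US.twitter.txt | head -n 10000 > test_file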

First, we load the text using the following R code.

# Representative subset of 10000 lines saved in test_file
fileName <- "test_file"
# Create a file connection
con <- file(fileName, "r")
# Read all whitespace-separated tokens
d <- scan(con, character(0))
# Close the connection
close(con)

Next, we want to exclude profanity from the analysis. We create a text file with common English profanity words and load it into our R session.

con <- file("bad_words", "r")
bad <- scan(con, character(0))
close(con)

For the analysis we tokenize the text into words (excluding the "bad" words), numbers, and punctuation, storing these elements in three separate vectors.

# toupper is vectorized, so the character vectors can be converted directly
d <- toupper(d)
bad <- toupper(bad)
words <- c(); numbers <- c(); puncts <- c()

for (i in seq_along(d)) {
  element <- d[i]
  # Keep only letters; skip empty results and profanity
  word <- gsub("[^A-Z]+", "", element)
  if (nchar(word) > 0 && !(word %in% bad)) {words <- c(words, word)}
  # Keep only digits
  number <- gsub("[^0-9]+", "", element)
  if (nchar(number) > 0) {numbers <- c(numbers, number)}
  # Keep only punctuation characters
  pun <- gsub("[^[:punct:]]", "", element)
  if (nchar(pun) > 0) {puncts <- c(puncts, pun)}
}



[Figure: word cloud of the most frequent words in the sample]

An interesting question is: what are the ten most common words in the dataset?

Word   Count
THE    14751
AND    8623
TO     8462
A      7202
OF     6953
I      6180
IN     4672
THAT   3763
IS     3439
IT     3194
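These counts can be reproduced from the words vector built above, for example:

# Frequency table of all words, most frequent first
freq <- sort(table(words), decreasing = TRUE)
head(freq, 10)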


These are common English function words. Let us now look at the most frequently used "long" words (more than 6 letters):

Word        Count
BECAUSE     467
THROUGH     282
SOMETHING   280
ANOTHER     247
THOUGHT     207
WITHOUT     155
DIFFERENT   153
FRIENDS     152
GETTING     148
STARTED     148
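A minimal sketch for this list, reusing the same words vector:

# Restrict to words longer than 6 letters before counting
long <- words[nchar(words) > 6]
head(sort(table(long), decreasing = TRUE), 10)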


To understand how many words constitute the major part of the language, we plot a histogram of the word frequencies.
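A sketch of such a histogram, reusing the frequency table freq from above (relative frequencies in percent):

# Relative frequency of each distinct word, in percent
rel <- 100 * freq / sum(freq)
hist(as.numeric(rel), breaks = 50,
     main = "Word frequencies", xlab = "Relative frequency (%)")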

We observe that most words occur with low (<5%) frequency.

The following graph shows how many distinct words are needed to cover a given percentage of the text.

An interesting observation is that only about 200 words are needed to cover roughly 50% of a text, and about 1000 words to cover 80%. We conclude that in English an average text consists of about 20% rare words.
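The coverage curve and the quoted numbers can be computed from cumulative frequencies, for instance:

# Cumulative share of the text covered by the n most frequent words
coverage <- 100 * cumsum(freq) / sum(freq)
plot(coverage, type = "l",
     xlab = "Number of distinct words", ylab = "Text coverage (%)")
# Smallest n reaching 50% and 80% coverage
which(coverage >= 50)[1]
which(coverage >= 80)[1]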

Many words tend to occur together, which means that the following word can often be predicted with high probability. For instance, let us look at the most common word pairs, counted as sketched below.
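A minimal sketch, assuming adjacent elements of the words vector form the pairs (sentence boundaries are ignored for simplicity):

# Pair each word with its successor and count the resulting bigrams
pairs <- paste(words[-length(words)], words[-1])
head(sort(table(pairs), decreasing = TRUE), 50)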

Word pair   Count
OF THE 1409
IN THE 1219
TO THE 679
ON THE 600
TO BE 532
AND THE 462
FOR THE 462
AND I 448
AT THE 403
IT WAS 402
I HAVE 396
IT IS 373
I WAS 373
IN A 371
IS A 352
I AM 325
WITH A 321
THAT I 319
WITH THE 312
OF A 281
FROM THE 271
IF YOU 257
THAT THE 241
THIS IS 239
ONE OF 235
WILL BE 230
IS THE 227
AS A 226
FOR A 226
YOU CAN 223
BY THE 220
OF MY 218
I HAD 216
BUT I 214
WAS A 203
ALL THE 202
HAVE A 201
WHEN I 197
A FEW 196
AND A 196
I THINK 193
HAVE BEEN 191
THE SAME 190
TO DO 190
GOING TO 188
HAVE TO 188
WANT TO 187
OUT OF 184
TO GET 183
TO MAKE 179


We can also explore collocations of longer words (>3 characters), counted as sketched below.
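A sketch under the assumption that short words are dropped before pairing, so previously non-adjacent long words become neighbours:

# Keep only words longer than 3 characters, then pair neighbours
lw <- words[nchar(words) > 3]
long_pairs <- paste(lw[-length(lw)], lw[-1])
head(sort(table(long_pairs), decreasing = TRUE), 50)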

Word pair      Count
HAVE BEEN 199
MORE THAN 114
WOULD HAVE 101
THEY WERE 100
THEY HAVE 86
THAT THEY 84
THAT HAVE 74
THAT WOULD 71
KNOW THAT 68
THAT THIS 60
THAT WILL 59
THIS WEEK 53
THIS YEAR 53
THERE WERE 52
WHEN THEY 52
DONT KNOW 46
KNOW WHAT 42
THAT THERE 42
LIKE THAT 41
LIKE THIS 41
THINK THAT 41
DONT HAVE 40
EVEN THOUGH 40
WITH THIS 40
THIS TIME 39
WHAT THEY 39
ALONG WITH 38
FACT THAT 38
ABOUT THIS 37
BECAUSE THEY 36
COULD HAVE 36
WOULD LIKE 36
RATHER THAN 35
THEY WILL 35
WILL HAVE 35
ABOUT WHAT 34
THAT COULD 34
WITH THAT 34
EACH OTHER 33
LAST NIGHT 33
LAST WEEK 33
WITH SOME 33
WITH YOUR 33
FEEL LIKE 32
MUCH MORE 32
THEY WOULD 32
AWAY FROM 31
TALKING ABOUT 30
THAT JUST 30
THINGS THAT 30



In the following publication we will provide a prediction algorithm for the next word during typing.