1. Introduction

This report presents a brief exploratory analysis of the data provided for the Coursera Data Science Specialisation Capstone project.

The report is divided into three sections:

  1. Descriptive statistics for each of the three input corpora: news, blogs and twitter;
  2. An analysis of single word frequencies from a sample of the combined corpora; and
  3. An analysis of bigram, trigram and quadgram frequencies in the same sample.

The aim of this exploratory analysis is to guide the development of a simple algorithm to predict the next word in a given sentence.

Note: this report is not aimed at the specialist data scientist, so most of the code has been hidden.

2. Basic descriptions of the three files

We can get basic line, word and character counts for each file.
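The counting code itself is hidden, but a minimal sketch of how such counts might be produced in base R is shown below (the file names and the whitespace-based word split are assumptions, not the exact code used):

files <- c(blogs   = "en_US.blogs.txt",
           news    = "en_US.news.txt",
           twitter = "en_US.twitter.txt")

counts <- t(sapply(files, function(f) {
  lines <- readLines(f, skipNul = TRUE)   # skipNul avoids warnings about embedded NULs
  c(lines = length(lines),
    words = sum(lengths(strsplit(lines, "\\s+"))),
    chars = sum(nchar(lines)))
}))
counts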

##      file   lines    words     chars
## 1   blogs  899288 37345400 206824505
## 2    news 1010242 34376329 203223159
## 3 twitter 2360148 30374792 162096031
##      file words per line chars per line
## 1   blogs       41.52774      229.98695
## 2    news       34.02782      201.16285
## 3 twitter       12.86987       68.68045

From this we can see that, not surprisingly, the twitter documents have far fewer characters and words per line, while blogs and news are broadly comparable. This arises from the limit of 140 characters per tweet, which we can quickly confirm.

max(nchar(twitter))
## [1] 140

3. Analysis of word frequencies

Now I take a random sample of lines from each file and combine them into a single corpus for analysis. Then I can calculate the number of times each word is seen (its count).
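The sampling and counting code is hidden; a minimal sketch is shown below, assuming blogs, news and twitter are character vectors of lines (the 10,000-line sample size and the very simple tokenisation are illustrative choices, not the exact code used):

set.seed(1234)
sample_lines <- c(sample(blogs, 10000), sample(news, 10000), sample(twitter, 10000))

# split on anything that is not a letter, digit or apostrophe, then count occurrences
tokens <- unlist(strsplit(tolower(sample_lines), "[^a-z0-9']+"))
tokens <- tokens[tokens != ""]
words  <- as.data.frame(table(tokens), stringsAsFactors = FALSE)
names(words) <- c("word", "count")
words  <- words[order(-words$count), ]
head(words, 10)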

##    word count
## 1   the 16517
## 2    to 13057
## 3   and 12041
## 4     a 11763
## 5    of 10526
## 6    in  9393
## 7   for  6840
## 8     i  6395
## 9    is  6277
## 10 that  6277

From this we can see the ten most common words; there are no real surprises here. It’s worth considering what proportion of words are only seen once in this sample. We can also plot the distribution of word counts. In the panel below I make two plots, one of all words and one of the top 100 words, because there are so many words with near-zero frequency that the full plot is hard to read (the first plot appears to show nothing but axes; what looks like the axes is in fact the line).
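A sketch of how this panel might be drawn in base R (assuming words is sorted by decreasing count, as above):

par(mfrow = c(1, 2))
plot(words$count, type = "l", xlab = "word rank", ylab = "count", main = "All words")
plot(words$count[1:100], type = "l", xlab = "word rank", ylab = "count", main = "Top 100 words")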

You can see that the frequency of words drops off quickly: after about the 25th most common word the counts have almost reached zero. However, there are a lot of uncommon words:

# proportion of words seen exactly once
sum(words$count == 1) / length(words$count)
## [1] 0.6524673

So we can see that over 65% of words are only seen once. This will become important when predicting the next word from an ngram - you don’t want to rely heavily on a long ngram you’ve only seen once. You might prefer to predict from a shorter ngram you’ve seen more often. This is called a ‘backoff’ approach.
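As a toy illustration of the backoff idea only (the function and the prefix/word/count tables below are hypothetical, not part of the analysis code):

predict_next <- function(prefix, trigrams, bigrams, unigrams) {
  hit <- trigrams[trigrams$prefix == prefix, ]
  if (nrow(hit) > 0) return(hit$word[which.max(hit$count)])
  # back off: drop the first word of the prefix and try the shorter table
  shorter <- sub("^\\S+\\s+", "", prefix)
  hit <- bigrams[bigrams$prefix == shorter, ]
  if (nrow(hit) > 0) return(hit$word[which.max(hit$count)])
  # last resort: the single most common word
  unigrams$word[which.max(unigrams$count)]
}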

Another way to look at this is to ask how many distinct words are needed to cover, say, 90% of all word instances. We can calculate this using cumulative word counts, starting with the most common words and stopping once the cumulative count exceeds 90% of the total:

# words is sorted by decreasing count, so coverage is a cumulative sum of counts
words$cumulative <- cumsum(words$count)
cutoff <- sum(words$cumulative / sum(words$count) < 0.90)
cutoff / length(words$word)
## [1] 0.2892761

So you only need around 29% of the distinct words to cover 90% of all word instances! We can show this on a plot (the red line shows the cut-off point where 90% of word instances are covered):
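A sketch of that plot, reusing words$cumulative and cutoff from above:

plot(words$cumulative / sum(words$count), type = "l",
     xlab = "word rank", ylab = "cumulative proportion of word instances")
abline(v = cutoff, col = "red")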

It’s also worth looking at the appearance of numeric characters in the sample. It’s unlikely that we’d want to predict based on these - combinations of numbers and letters are very rare.

sum(grepl("[0-9]+", words$word))           # 'words' containing at least one digit
## [1] 5935
sum(grepl("[A-Za-z]+[0-9]+", words$word))  # letters followed by digits
## [1] 331

From this we can see that there are 5935 ‘words’ that contain digits, of which 331 are combinations of letters and numbers; most of the rest are simply numbers. I may want to remove these for the purposes of prediction.
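If I do remove them, something along these lines would work (illustrative only):

words_clean <- words[!grepl("[0-9]", words$word), ]   # drop any 'word' containing a digit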

4. Analysis of bigram, trigram and quadgram frequencies

Finally, I will do a simple analysis of the frequencies of bigrams, trigrams and quadgrams in the data. As I mentioned before, this is important if you want to predict the next word from ngram frequencies. For the purposes of this report I will just report the number of distinct 2-, 3- and 4-grams in the data, the most common ngrams, and the proportion of each that are only seen once in the data set.
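The counting code is hidden; the sketch below shows one way such counts might be built, reusing sample_lines from earlier (the count_ngrams helper and the simple tokenisation are illustrative, not the exact code used):

count_ngrams <- function(lines, n) {
  toks <- strsplit(tolower(lines), "[^a-z']+")
  grams <- unlist(lapply(toks, function(w) {
    w <- w[w != ""]
    if (length(w) < n) return(character(0))
    # paste each run of n consecutive words into a single string
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)
}

bigrams  <- count_ngrams(sample_lines, 2)
trigrams <- count_ngrams(sample_lines, 3)
head(bigrams, 5)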

##    bigram count
## 1  of the  3286
## 2  in the  3216
## 3  to the  1822
## 4  on the  1599
## 5 for the  1497
##       trigram count
## 1  one of the   335
## 2    a lot of   265
## 3  as well as   152
## 4 going to be   147
## 5     to be a   141
##           quadgram count
## 1  the rest of the    67
## 2   the end of the    64
## 3    at the end of    49
## 4 when it comes to    46
## 5  one of the most    45

From this, we can see a couple of things. First, the most common ngrams are mostly made up of the most common words, which is not surprising. However, it’s also important to note that the counts for even the most common ngrams decline quickly as the length of the ngram increases: the most common quadgram is seen 67 times, compared with the most common word, which is seen over 16,000 times. This shouldn’t be a surprise - there are more possible combinations of four words than of three, and so on. This is reflected in the number of distinct ngrams of different lengths:

##   ngram_length number seen_once
## 1            1  96178 0.6524673
## 2            2 464110 0.8437332
## 3            3 724542 0.9449086
## 4            4 778924 0.9871708
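For a given table of ngram counts, the seen-once proportion might be computed like this (illustrative):

seen_once <- function(counts) sum(counts == 1) / length(counts)
seen_once(bigrams)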

It’s another thing to bear in mind when building an ngram-based prediction model: the longer ngrams capture more subtlety and longer-distance relationships, but you need a lot of data, and even then you can expect to have to back off to shorter ngrams a lot.

5. Summary

In this short report I have briefly summarised each of the three files provided for this project. I have explored the frequencies of words and found their distribution to be very skewed, with some words seen much more often than others and many words seen only once. Finally, I have explored bigram, trigram and quadgram frequencies, and found that the counts for even the most common ngrams decrease sharply as the size of the ngram increases.