This report presents a brief exploratory analysis of the data provided for the Coursera Data Science Specialisation Capstone project.
The aim of this exploratory analysis is to guide the development of a simple algorithm to predict the next word in a given sentence.
The report is divided into three sections: basic summaries of the three source files, an analysis of word frequencies, and an analysis of bigram, trigram and quadgram frequencies.
Note: the report is not aimed at the specialist data scientist, so most of the code has been hidden.
We can now get basic line and word counts for each file.
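The code that reads the files and produces these counts is hidden; a minimal sketch of how it might look is below (the exact file names, and the use of readLines and strsplit, are assumptions):
blogs   <- readLines("en_US.blogs.txt",   warn = FALSE, skipNul = TRUE)
news    <- readLines("en_US.news.txt",    warn = FALSE, skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", warn = FALSE, skipNul = TRUE)
counts <- data.frame(
  file  = c("blogs", "news", "twitter"),
  lines = c(length(blogs), length(news), length(twitter)),
  words = sapply(list(blogs, news, twitter),
                 function(x) sum(lengths(strsplit(x, "\\s+")))),
  chars = sapply(list(blogs, news, twitter), function(x) sum(nchar(x)))
)
counts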
## file lines words chars
## 1 blogs 899288 37345400 206824505
## 2 news 1010242 34376329 203223159
## 3 twitter 2360148 30374792 162096031
## file words per line chars per line
## 1 blogs 41.52774 229.98695
## 2 news 34.02782 201.16285
## 3 twitter 12.86987 68.68045
From this we can see that, not surprisingly, the Twitter file has far fewer characters and words per line. Blogs and news are broadly comparable. This arises from the limit of 140 characters per tweet, which we can quickly confirm:
max(nchar(twitter))
## [1] 140
Now I take a random sample of lines from each file and combine them into a single corpus for analysis. I can then calculate the number of times each word is seen (its count).
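The sampling and tokenisation code is hidden as well; a rough sketch of the approach might be (the seed, the 5% sample fraction and the tokenisation rule are all assumptions):
set.seed(1234)                       # assumed seed, for reproducibility
frac <- 0.05                         # assumed sample fraction
corpus <- c(sample(blogs,   round(length(blogs)   * frac)),
            sample(news,    round(length(news)    * frac)),
            sample(twitter, round(length(twitter) * frac)))
# split into lower-case tokens and count how often each word is seen
tokens <- unlist(strsplit(tolower(corpus), "[^a-z0-9']+"))
tokens <- tokens[tokens != ""]
words  <- as.data.frame(table(tokens), stringsAsFactors = FALSE)
names(words) <- c("word", "count")
words  <- words[order(-words$count), ]
head(words, 10)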
## word count
## 1 the 16517
## 2 to 13057
## 3 and 12041
## 4 a 11763
## 5 of 10526
## 6 in 9393
## 7 for 6840
## 8 i 6395
## 9 is 6277
## 10 that 6277
These are the ten most common words, and there are no real surprises here. It’s worth considering what proportion of words are only seen once in this sample. We can also plot the distribution of word counts. In the panel below I make two plots, one of all words and one of the top 100 words, because there are so many words with near-zero frequency that a single plot of all words is unreadable (what looks like the axes on the first plot is actually the line).
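The plotting code is also hidden; a base-graphics sketch of the two panels described above could be:
par(mfrow = c(1, 2))
plot(words$count, type = "l", xlab = "word rank", ylab = "count", main = "All words")
plot(words$count[1:100], type = "l", xlab = "word rank", ylab = "count", main = "Top 100 words")
par(mfrow = c(1, 1))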
You can see that the frequency of words drops off quickly: after about the 25th most common word, the counts are already close to zero relative to the most common words. At the same time, there are a lot of words that are very uncommon:
sum(words$count == 1)/length(words$count)
## [1] 0.6524673
So we can see that over 65% of words are only seen once. This will become important when predicting the next word from an ngram - you don’t want to rely heavily on a long ngram you’ve only seen once. You might prefer to predict from a shorter ngram you’ve seen more often. This is called a ‘backoff’ approach.
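As an illustration of the backoff idea only (this is not code used in this report), a lookup could fall back from longer to shorter ngrams along these lines; the shape of the ngram tables (columns prefix, next_word and count) and the min_count threshold are assumptions:
predict_next <- function(context, ngram_tables, min_count = 2) {
  # ngram_tables[[k]] is assumed to hold (k+1)-grams split into a k-word
  # 'prefix', a 'next_word' and a 'count'
  for (k in rev(seq_along(ngram_tables))) {     # try the longest ngrams first
    prefix <- paste(tail(context, k), collapse = " ")
    tbl    <- ngram_tables[[k]]
    hits   <- tbl[tbl$prefix == prefix & tbl$count >= min_count, ]
    if (nrow(hits) > 0) return(hits$next_word[which.max(hits$count)])
  }
  "the"                                         # last resort: the most common word
}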
Another way to look at this skew is to ask how many distinct words are needed to cover, say, 90% of all word instances. We can calculate this using cumulative word counts, starting with the most common words and stopping when the cumulative count exceeds 90% of the total count.
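The cumulative column used in the next snippet is not created in any of the code shown so far; it can be built in one line from the sorted counts:
words$cumulative <- cumsum(words$count)         # words is sorted by decreasing count
With that column in place, the cut-off is: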
cutoff <- sum(words$cumulative / sum(words$count) < 0.90)
cutoff / length(words$word)
## [1] 0.2892761
So you only need just over 29% of the distinct words to cover 90% of all word instances! We can show this on a plot, with a red line marking the cut-off point where 90% of word instances are covered.
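A sketch of such a plot in base graphics, using the cutoff value computed above:
plot(words$cumulative / sum(words$count), type = "l",
     xlab = "number of distinct words (most common first)",
     ylab = "proportion of word instances covered")
abline(v = cutoff, col = "red")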
It’s also worth looking at the appearance of numeric characters in the corpus. It’s unlikely that we’d want to predict based on these, and combinations of letters and numbers are very rare.
sum(grepl("[0-9]+", words$word))
## [1] 5935
sum(grepl("[A-Za-z]+[0-9]+", words$word))
## [1] 331
From this we can see that there are 5935 ‘words’ that contain numbers and 331 ‘words’ that are combinations of letters and numbers. I may want to remove these for the purposes of prediction.
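If I do remove them, a one-line filter would be enough (sketch only; the words_clean name is just illustrative):
words_clean <- words[!grepl("[0-9]", words$word), ]   # drop any token containing a digit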
Finally, I will do a simple analysis of the frequencies of bigrams, trigrams and quadgrams in the data. As I mentioned before, this is important if you want to predict the next word from ngram frequencies. For the purposes of this report I will just report the number of distinct 2-, 3- and 4-grams in the data, the most common ngrams, and the proportion of each that are seen only once in the data set.
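The ngram counting code is hidden too; a simple (slow, but clear) sketch that builds on the tokens vector from the earlier sketch might be:
make_ngrams <- function(tokens, n) {
  # paste each run of n consecutive tokens into one string
  # (this simple version ignores line and sentence boundaries)
  idx <- seq_len(length(tokens) - n + 1)
  vapply(idx, function(i) paste(tokens[i:(i + n - 1)], collapse = " "), character(1))
}
ngram_counts <- function(tokens, n, label) {
  tbl <- as.data.frame(table(make_ngrams(tokens, n)), stringsAsFactors = FALSE)
  names(tbl) <- c(label, "count")
  tbl[order(-tbl$count), ]
}
bigrams   <- ngram_counts(tokens, 2, "bigram")
trigrams  <- ngram_counts(tokens, 3, "trigram")
quadgrams <- ngram_counts(tokens, 4, "quadgram")
head(bigrams, 5)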
## bigram count
## 1 of the 3286
## 2 in the 3216
## 3 to the 1822
## 4 on the 1599
## 5 for the 1497
## trigram count
## 1 one of the 335
## 2 a lot of 265
## 3 as well as 152
## 4 going to be 147
## 5 to be a 141
## quadgram count
## 1 the rest of the 67
## 2 the end of the 64
## 3 at the end of 49
## 4 when it comes to 46
## 5 one of the most 45
From this, we can see a couple of things. First, the most common ngrams are mostly made up of the most common words, which is not surprising. However, it’s also important to note that the counts for even the most common ngrams decline quickly as the length of the ngram increases: the most common quadgram is seen 67 times, compared with over 16,500 times for the most common word. This shouldn’t be a surprise - there are more possible combinations of four words than of three, and so on. This is reflected in the number of distinct ngrams of each length:
## ngram_length distinct_ngrams prop_seen_once
## 1 1 96178 0.6524673
## 2 2 464110 0.8437332
## 3 3 724542 0.9449086
## 4 4 778924 0.9871708
It’s another thing to bear in mind when building an ngram-based prediction model: longer ngrams capture more subtlety and longer-distance relationships, but you need a lot of data to estimate them, and even then you can expect to back off to shorter ngrams a lot.
In this short report I have briefly summarised each of the three files provided for this project. I have explored the frequencies of words and found their distribution to be heavily skewed, with some words seen far more often than others and many words seen only once. Finally, I have explored bigram, trigram and quadgram frequencies, and found that the counts for even the most common ngrams decrease sharply as the size of the ngram increases.