The Data Science capstone project offered by Coursera and Johns Hopkins University challenges the student to create a prediction algorithm similar to an “autocorrect” function. This document describes the exploratory data analysis of the given corpus and lays the groundwork for the model to be built. This report is aimed at a non-technical audience, so the amount of code shown is kept to a minimum.
The corpus to be used is downloaded from HC Corpora (http://www.corpora.heliohost.org/aboutcorpus.html). The English corpora for blogs, news items and Twitter feeds are used.
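As a rough illustration, the files might be read into R along the following lines. The file paths here are assumptions; the actual download location and directory layout may differ.

```r
# Read the three English corpora line by line (paths assumed for illustration)
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```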
We can first look at some statistics for the three data sets in order to understand their size and characteristics.
## FileName NumOfLines NumOfWords NumOfWordsUnique
## 1 en_US.blogs.txt 899288 38154238 445884
## 2 en_US.news.txt 1010242 35010783 354097
## 3 en_US.twitter.txt 2360148 30218125 501657
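A sketch of how counts like these could be produced with the stringi package is shown below; the exact code used to build the table above is not reproduced in this report, and the object names follow the earlier reading sketch.

```r
library(stringi)

# Count lines, total words and unique (lower-cased) words for one corpus
corpus_stats <- function(lines, name) {
  words <- unlist(stri_extract_all_words(stri_trans_tolower(lines)))
  data.frame(FileName         = name,
             NumOfLines       = length(lines),
             NumOfWords       = length(words),
             NumOfWordsUnique = length(unique(words)))
}

rbind(corpus_stats(blogs,   "en_US.blogs.txt"),
      corpus_stats(news,    "en_US.news.txt"),
      corpus_stats(twitter, "en_US.twitter.txt"))
```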
As can be seen from the table, the three corpora contain fairly comparable numbers of words and unique words. Their sheer size, however, means that we need to sample them to keep the computations in this analysis viable. We are going to use a 5% sampling rate to produce the new corpora, as shown here:
## FileName NumOfLines NumOfWords NumOfWordsUnique
## 1 en_US.blogs.txt 44964 1895441 88482
## 2 en_US.news.txt 50512 1744389 82650
## 3 en_US.twitter.txt 118007 1511185 86703
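A minimal sketch of the sampling step, assuming the full corpora are already loaded as character vectors:

```r
set.seed(1234)  # make the sample reproducible

# Keep roughly 5% of the lines in each corpus
sample_lines <- function(lines, rate = 0.05) {
  lines[rbinom(length(lines), size = 1, prob = rate) == 1]
}

blogs_sample   <- sample_lines(blogs)
news_sample    <- sample_lines(news)
twitter_sample <- sample_lines(twitter)
```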
As with any large collection of text, there will be undesirable words in our corpus. Profanity filtering is easily done with the help of the profanity list hosted at http://www.bannedwordlist.com. The method used in this analysis is fairly crude: the profanity is simply removed, which can have an effect on sentence structure.
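One possible way to apply such a word list is sketched below, assuming it has been downloaded as a plain text file with one word per line; the file name is an assumption.

```r
# Load the banned word list (file name assumed for illustration)
profanity <- readLines("swearWords.txt", encoding = "UTF-8")

# Regular expression matching any banned word as a whole word
# (assumes the list contains plain words with no regex metacharacters)
profanity_pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")

# Remove the matched words from the sampled corpora
blogs_sample   <- gsub(profanity_pattern, "", blogs_sample,   ignore.case = TRUE)
news_sample    <- gsub(profanity_pattern, "", news_sample,    ignore.case = TRUE)
twitter_sample <- gsub(profanity_pattern, "", twitter_sample, ignore.case = TRUE)
```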
The removal of junk characters is important, as these are usually introduced when the data is converted between storage formats. For example, a space in a URL is represented as “%20”, which can lead to text such as “Welcome%20to%20my%20blog” if the data is not read in the correct format.
Similarly, we also want to remove punctuation and numbers from the corpus, as these should not form part of our final data set.
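A sketch of this cleaning using base R string functions follows; the exact patterns are illustrative, and the quanteda tokeniser used below can also drop punctuation and numbers on its own.

```r
clean_lines <- function(lines) {
  lines <- iconv(lines, from = "UTF-8", to = "ASCII", sub = " ")  # drop non-ASCII junk
  lines <- gsub("%[0-9A-Fa-f]{2}", " ", lines)                    # URL-encoded characters such as %20
  lines <- gsub("[[:punct:]]", " ", lines)                        # punctuation
  lines <- gsub("[[:digit:]]", " ", lines)                        # numbers
  gsub("\\s+", " ", lines)                                        # collapse repeated whitespace
}

blogs_sample   <- clean_lines(blogs_sample)
news_sample    <- clean_lines(news_sample)
twitter_sample <- clean_lines(twitter_sample)
```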
In Natural Language Processing (NLP), the ideas of tokens and ngrams are very important. Tokens are simply “chopped up” pieces of data that we can use for analysis. For example, a paragraph in a blog can be tokenised into sentences and then tokenised again into individual words. Ngrams are sequences of n consecutive tokens, which let us analyse a word together with the words that surround it. For example, a sentence can be broken into ngrams showing a word alongside the words that come before and after it. This is central to this project, since predicting the next word depends on the words that precede it.
We are going to use the quanteda R package to help with the task of building ngrams of various lengths. This will help us understand the most popular words and phrases in the data, which in turn will be used in the next phase of this project:
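A sketch of how quanteda can be used to tokenise the sampled text and build ngrams of various lengths is shown below; the object names follow the earlier sketches and are assumptions.

```r
library(quanteda)

# Combine the three samples into a single quanteda corpus
corp <- corpus(c(blogs_sample, news_sample, twitter_sample))

# Tokenise into words, dropping punctuation and numbers at the same time
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)

# Build ngrams of length 1 to 4, joined with "_" as in the tables below
ngrams_1 <- tokens_ngrams(toks, n = 1, concatenator = "_")
ngrams_2 <- tokens_ngrams(toks, n = 2, concatenator = "_")
ngrams_3 <- tokens_ngrams(toks, n = 3, concatenator = "_")
ngrams_4 <- tokens_ngrams(toks, n = 4, concatenator = "_")
```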
A very easy way to visualise the ngrams is to plot, for each ngram length, the top 20 ngrams against the number of times they appear in the corpus.
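For example, the top 20 single words could be plotted along these lines; ggplot2 is assumed to be available, and the actual plots in this report may have been produced differently.

```r
library(ggplot2)

# Count ngram frequencies and keep the 20 most common ones
top_1grams <- topfeatures(dfm(ngrams_1), 20)
plot_data  <- data.frame(ngram = names(top_1grams), count = as.numeric(top_1grams))

ggplot(plot_data, aes(x = reorder(ngram, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "ngram", y = "Count", title = "Top 20 1-word ngrams")
```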
The 1-word ngrams are unique in that they simply show us the most common words in the data set. It is not yet known whether this will be useful for the final model, but it is interesting to look at.
The 2-word ngram gives us our first look at basic phrases that appear in the corpus. This starts to let us see how words are used.
The 3-word ngrams give us an even better look at phrases in the corpus. Using them, we can take a 2-word ngram and estimate how likely each possible 3rd word is, based on the 3-word ngrams that begin with it.
Finally, we come to the 4-word ngrams. These are potentially the most useful of all: given a phrase, we can now examine both its 2-word and 3-word ngrams to find the most probable 4th word in the sequence.
The large size of the corpus involved here means that we need to consider the word coverage of the model in order to keep it small enough to run quickly and efficiently while making predictions. We want to eliminate ngrams that appear the least in the corpus. The simplest solution is to eliminate a certain percentage of the least common ngrams. This is especially important as our ngrams get longer.
These graphs clearly show that for the 3- and 4-word ngrams we would need to keep at least half of the distinct ngrams just to reach 50% coverage. This obviously has serious implications for the memory usage of the model, but that will have to be fully addressed later in the project.
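A sketch of how such a coverage curve can be calculated from the ngram frequencies (the plotting itself is omitted here):

```r
# Frequency of every 3-word ngram, sorted from most to least common
freqs <- sort(colSums(dfm(ngrams_3)), decreasing = TRUE)

# Cumulative share of all ngram occurrences covered by the top-ranked ngrams
coverage <- cumsum(freqs) / sum(freqs)

# How many distinct 3-word ngrams are needed to cover 50% of all occurrences?
which(coverage >= 0.5)[1]
```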
Now that we have our ngrams, let’s take a look at the most basic model: predicting the 4th word that follows the most common 3-word ngram. From our current work, we can see that the most common 3-word ngram is “one of the”. We can now find the top 10 most common 4-word ngrams that start with this phrase:
## ngram count
## 1: one_of_the_most 251
## 2: one_of_the_best 161
## 3: one_of_the_first 48
## 4: one_of_the_biggest 37
## 5: one_of_the_things 37
## 6: one_of_the_few 36
## 7: one_of_the_reasons 32
## 8: one_of_the_top 27
## 9: one_of_the_many 25
## 10: one_of_the_worst 25
We can now start to see how ngrams can be used to predict the next word in a phrase, which is exactly what the final model needs to do.
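A sketch of how this lookup might be done once the 4-word ngram counts are stored in a data.table; the column layout with ngram and count matches the output above, but the exact code used for it is not shown.

```r
library(data.table)

# Count the 4-word ngrams and store them as a data.table
counts_4 <- colSums(dfm(ngrams_4))
dt_4     <- data.table(ngram = names(counts_4), count = as.numeric(counts_4))

# Top 10 most common 4-word ngrams that begin with "one_of_the_"
dt_4[startsWith(ngram, "one_of_the_")][order(-count)][1:10]
```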
There are multiple steps that still need to be taken to clean this corpus and make it more suitable for the prediction model we want to build: