The Data Science Capstone project offered by Coursera and Johns Hopkins University challenges the student to create a prediction algorithm similar to an “autocorrect” function. This document describes the exploratory data analysis of the given corpus and lays the groundwork for the model to be built. The report is aimed at a non-technical audience, so the code shown is kept to a minimum of short, illustrative sketches.

Acquiring The Data

The corpus to be used is downloaded from HC Corpora (http://www.corpora.heliohost.org/aboutcorpus.html). The English corpora for blogs, news items and Twitter feeds are used.
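
A minimal sketch of how the three files might be read into R is shown below. It assumes the archive has already been downloaded and unzipped into a local final/en_US/ folder; the paths are illustrative.

# Read the three English corpora line by line.
# The directory "final/en_US" is an assumed local download location.
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)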

Dataset Characteristics

We can first have a look at some statistics for the data sets in order to try to understand their size and characteristics (a sketch of how these counts are computed follows the table).

##            FileName NumOfLines NumOfWords NumOfWordsUnique
## 1   en_US.blogs.txt     899288   38154238           445884
## 2    en_US.news.txt    1010242   35010783           354097
## 3 en_US.twitter.txt    2360148   30218125           501657
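
These counts could be produced along the following lines, using the blogs, news and twitter objects read in earlier. This is only a sketch: splitting on whitespace is an assumed, simplistic definition of a word, so the unique-word figures depend on that choice.

# Per-file statistics: number of lines, words and unique words.
corpus_stats <- function(lines) {
  words <- unlist(strsplit(lines, "\\s+"))   # assumed word definition: whitespace-separated
  words <- words[words != ""]
  data.frame(NumOfLines       = length(lines),
             NumOfWords       = length(words),
             NumOfWordsUnique = length(unique(tolower(words))))
}
rbind(cbind(FileName = "en_US.blogs.txt",   corpus_stats(blogs)),
      cbind(FileName = "en_US.news.txt",    corpus_stats(news)),
      cbind(FileName = "en_US.twitter.txt", corpus_stats(twitter)))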

As can be seen from the table, there are fairly comparable numbers of words and unique words in each corpus. The corpus is large, so we need to sample it to keep the computations tractable at this stage. We are going to use a 5% sampling rate to create the new corpora, as shown here (a sketch of the sampling step follows the table):

##            FileName NumOfLines NumOfWords NumOfWordsUnique
## 1   en_US.blogs.txt      44964    1895441            88482
## 2    en_US.news.txt      50512    1744389            82650
## 3 en_US.twitter.txt     118007    1511185            86703
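
The samples summarised above can be drawn by randomly selecting lines from each corpus, for example (a sketch; the seed is arbitrary and the helper name is illustrative):

# Draw a 5% random sample of lines from each corpus.
set.seed(1234)   # arbitrary seed, only for reproducibility
sample_lines <- function(lines, rate = 0.05) {
  lines[sample(length(lines), size = round(rate * length(lines)))]
}
blogs_sample   <- sample_lines(blogs)
news_sample    <- sample_lines(news)
twitter_sample <- sample_lines(twitter)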

Data Cleaning

Profanity Filtering

As with any large collection of text, there will be undesirable words in our corpus. Profanity filtering is easily done with the help of the profanity list hosted at http://www.bannedwordlist.com. The method used in this analysis is fairly crude: the profanity is simply removed, which can affect sentence structure.
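
As a sketch, the word list could be fetched and the matching tokens dropped with quanteda's tokens_remove() once the text has been tokenised (see the tokenisation sketches below). The exact file name on bannedwordlist.com is an assumption.

library(quanteda)

# Fetch a simple profanity list; the file name on the site is an assumption.
profanity <- readLines("http://www.bannedwordlist.com/lists/swearWords.txt")

# Drop any token that matches an entry in the list.
# 'toks' stands for the tokenised sample built in the tokenisation step below.
toks <- tokens_remove(toks, pattern = profanity)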

Removal of Junk Characters

The removal of junk characters is important, as these are usually introduced when the data is converted between storage formats or encodings. For example, a space in a URL is represented as “%20”, which can lead to text such as “Welcome%20to%20my%20blog” if it is not decoded correctly.
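
One way to strip URL-encoding artefacts and other non-text bytes before tokenising is sketched below; the patterns handled here are assumptions and only cover the most common cases.

# Decode the common "%20" space and drop bytes outside the ASCII range.
clean_junk <- function(text) {
  text <- gsub("%20", " ", text, fixed = TRUE)
  iconv(text, from = "UTF-8", to = "ASCII", sub = " ")
}
blogs_sample   <- clean_junk(blogs_sample)
news_sample    <- clean_junk(news_sample)
twitter_sample <- clean_junk(twitter_sample)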

Removal of Punctuation and Numbers

Similarly, we also want to remove punctuation and numbers from the corpus, as these should not form part of our final data set.
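
With quanteda, punctuation and numbers can be dropped directly when the text is tokenised, as sketched here on the combined sample:

# Tokenise the combined sample, dropping punctuation, numbers, symbols and URLs.
library(quanteda)
toks <- tokens(c(blogs_sample, news_sample, twitter_sample),
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE,
               remove_url     = TRUE)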

Tokenization and Creation of ngrams

In Natural Language Processing (NLP), the concepts of tokens and ngrams are very important. Tokens are simply “chopped up” pieces of data that we can use for analysis. For example, a paragraph in a blog can be tokenised into multiple sentences and then tokenised again into individual words. Ngrams are sequences of n consecutive tokens, which let us analyse a word together with the words that come before and after it. For obvious reasons, this is very important to us in this project.

We are going to use the Quanteda R package to help with the task of building ngrams of various lengths. This will help us understand the most popular words and phrases in the data, which in turn will be used in the next phase of this project:

  1. Combine the three corpora into a single one.
  2. Using Quanteda, remove profanity, punctuation and other undesirable data.
  3. Using Quanteda, break the data set into individual sentences and then words to build the 1-, 2-, 3- and 4-word ngrams that will be used (a sketch of this step follows the list).
  4. Perform Exploratory Data Analysis to highlight any interesting data that may be found.
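
A sketch of step 3, building ngram frequency tables with quanteda's tokens_ngrams(); the helper and object names are illustrative, and toks is the cleaned token object from the sketches above.

# Build 1- to 4-word ngrams from the cleaned tokens and count their occurrences.
ngram_counts <- function(toks, n) {
  ng     <- tokens_ngrams(toks, n = n, concatenator = "_")
  dfm_ng <- dfm(ng)                              # document-feature matrix of ngram counts
  sort(colSums(dfm_ng), decreasing = TRUE)       # named vector, most frequent first
}
unigrams  <- ngram_counts(toks, 1)
bigrams   <- ngram_counts(toks, 2)
trigrams  <- ngram_counts(toks, 3)
quadgrams <- ngram_counts(toks, 4)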

Exploratory Data Analysis

An easy-to-understand way of exploring the ngrams is to plot a simple bar chart showing, for each ngram length, the top 20 ngrams against the number of times they appear in the corpus.
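
For example, the top 20 one-word ngrams could be plotted along these lines (a sketch using ggplot2, with unigrams being the named count vector from the previous step):

library(ggplot2)

# Bar chart of the 20 most frequent 1-word ngrams.
top20   <- head(unigrams, 20)
plot_df <- data.frame(ngram = names(top20), count = as.numeric(top20))
ggplot(plot_df, aes(x = reorder(ngram, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "ngram", y = "Frequency in sampled corpus")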

The 1-word Ngram

The 1-word ngram is unique in that it simply shows us the most common words in the dataset at this point. It is not known whether this will be useful for the final model, but it is interesting to look at.

The 2-word Ngram

The 2-word ngram gives us our first look at basic phrases that appear in the corpus. This starts to let us see how words are used.

The 3-word Ngram

The 3-word ngram gives us an even better look at phrases in the corpus. Using it, we can take a 2-word ngram and estimate how likely each candidate 3rd word is from the 3-word ngrams that begin with that pair.

The 4-word Ngram

Finally, we come to the 4-word ngram. This is potentially the most useful of all, in that we can now examine a phrase in terms of both its 2-word and 3-word ngrams to find the most probable 4th word in the sequence.

Word Coverage

The large size of the corpus means that we need to consider the word coverage of the model in order to keep it small enough to make predictions quickly and efficiently. We want to eliminate the ngrams that appear least often in the corpus. The simplest solution is to drop a certain percentage of the least common ngrams. This is especially important as the ngrams get longer.

The coverage graphs clearly show that for the 3- and 4-word ngrams we would need to keep at least half of the distinct ngrams to reach even 50% coverage. This has serious implications for memory usage in the model, which will have to be fully addressed later in the project.
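
The coverage figures come from a simple cumulative sum over the sorted ngram counts, sketched here:

# Share of distinct ngrams needed to cover a target fraction of all ngram instances.
# 'counts' must be sorted with the most frequent ngram first.
coverage <- function(counts, target = 0.5) {
  cum <- cumsum(counts) / sum(counts)
  which(cum >= target)[1] / length(counts)
}
coverage(unigrams,  0.5)   # a small fraction of unique words covers half the text
coverage(quadgrams, 0.5)   # a far larger share of the 4-word ngrams is needed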

Basic Modelling

Now that we have our ngrams, let’s take a look at the most basic model: predicting the 4th word that follows the most common 3-word ngram. From our current work, we can see that the most common 3-word ngram is “one of the”. We can now find the ten most common 4-word ngrams that start with this phrase:

##                  ngram count
##  1:    one_of_the_most   251
##  2:    one_of_the_best   161
##  3:   one_of_the_first    48
##  4: one_of_the_biggest    37
##  5:  one_of_the_things    37
##  6:     one_of_the_few    36
##  7: one_of_the_reasons    32
##  8:     one_of_the_top    27
##  9:    one_of_the_many    25
## 10:   one_of_the_worst    25

We can now start to see how ngrams can be used to build up the phrases a user might type, which is exactly what the prediction model needs to do.
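
The table above can be reproduced by filtering the 4-word ngram counts for those that begin with the 3-word prefix, for example (a sketch, assuming quadgrams is the named count vector built earlier):

# Ten most common 4-word ngrams that start with "one_of_the".
candidates <- quadgrams[grepl("^one_of_the_", names(quadgrams))]
head(sort(candidates, decreasing = TRUE), 10)

# The predicted 4th word is the final token of the best-scoring ngram.
sub("^one_of_the_", "", names(candidates)[which.max(candidates)])   # "most" in this sample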

Next Steps

There are multiple steps that still need to be taken to clean this corpus and make it more suitable for the prediction model we want to build:

  1. Remove words and phrases that are not common enough to be predicted. This will save both memory space and processing power in our final model.
  2. Correct for poor spelling. Some of the exploratory data analysis done here showed that poor spelling, especially (and predictably) in the Twitter dataset, can lead to invalid or unhelpful ngrams.
  3. Develop the predictive model. I am still undecided on the split between training and testing data. In theory, we should be able to use a fairly small subset of the data for testing, especially something like the blogs data set, because blogs typically use full, correct sentences.
  4. Construct and deploy the Shiny app that will allow the model to be used.