Project Introduction

The capstone project for the Data Science Specialization from Coursera is to produce a text prediction algorithm using Natural Language Processing (NLP). The algorithm should determine the most likely next word after a user has entered a few words.

The text used to build the model comes from three sources: blogs, Twitter and news articles. The data has been cut down for fair use considerations. Initial analysis will be done with the tidytext package, which is well described in the book “Text Mining with R”.

Project Goal

The given goal for this project is to take a string of words and predict the next word the user is likely to want. While the core of this project would work for any number of words, the reality is that an enormous corpus would be required to have enough n-grams to make a prediction for n > 3. A collection of trigrams and bigrams will therefore be generated from the three data sources. These n-grams will be represented as a root (the leading bigram or word, respectively) and a terminal word, for which a frequency score is given. The Shiny application will then take the user's input, extract the trailing bigram (or single word), look up the matching roots, and offer the highest-frequency terminal words as predictions.

Loading the Data

The three source files were loaded using the read_lines() function from readr (part of the tidyverse).
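A minimal sketch of the loading step is shown below; the file paths assume the standard capstone naming convention and may need to be adjusted.

```r
library(readr)

# Assumed file locations; adjust to the actual download paths
sources <- c(
  Blogs   = "data/en_US.blogs.txt",
  News    = "data/en_US.news.txt",
  Twitter = "data/en_US.twitter.txt"
)

# Read each file as a character vector of lines
texts <- lapply(sources, read_lines)

# String length summaries and total line counts per source
lapply(texts, function(x) summary(nchar(x)))
lapply(texts, length)
```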

## [1] "Blog string length statistics"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      47     156     230     329   40833
## [1] "News string length statistics"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   110.0   185.0   201.2   268.0 11384.0
## [1] "Twitter string length statistics"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   37.00   64.00   68.68  100.00  140.00
## [1] "The total lines in each source:"
## $Blogs
## [1] 899288
## 
## $News
## [1] 1010242
## 
## $Twitter
## [1] 2360148

As is to be expected, the Twitter data is capped at 140 characters, with fairly evenly distributed line lengths. The blog and news sources, on the other hand, have some large outliers but are generally in the sub-400-character range. Taking the log10() of the line lengths, the three distributions can be compared with a violin plot.
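A violin plot along those lines could be produced with ggplot2; this sketch assumes the `texts` list from the loading step above:

```r
library(dplyr)
library(ggplot2)

# Combine the line lengths from all three sources into one tidy data frame
line_lengths <- bind_rows(
  lapply(names(texts), function(nm) tibble(source = nm, chars = nchar(texts[[nm]])))
)

# Compare the log10 line-length distributions across sources
ggplot(line_lengths, aes(x = source, y = log10(chars))) +
  geom_violin() +
  labs(x = NULL, y = "log10(characters per line)")
```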

Processing

Tokenizing and Word Frequencies

Using the unnest_tokens() function from tidytext, the three sources were broken down into individual words. From there it was possible to count the word frequencies in each source.
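A sketch of the tokenizing and counting step, assuming the `texts` list built earlier:

```r
library(dplyr)
library(tidytext)

# One data frame of lines, tagged by source
corpus <- bind_rows(
  lapply(names(texts), function(nm) tibble(source = nm, text = texts[[nm]]))
)

# Break each line into individual word tokens
words <- corpus %>%
  unnest_tokens(word, text)

# Word frequencies within each source
word_counts <- words %>%
  count(source, word, sort = TRUE)
```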

The top words in each source were:

The top words are standard common words. One interesting observation is that pronouns (I, you, me, my, he) rank among the top words for blogs and Twitter, but not for news. It would be interesting to see which words are frequent beyond these common words, so the next step is removing “stop words” using the stop_words dataset from tidytext.
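A sketch of the stop word removal, reusing the `words` tokens from above and tidytext's stop_words dataset:

```r
# Drop stop words with an anti-join, then recount
word_counts_clean <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(source, word, sort = TRUE)

# Top 15 remaining words in each source
word_counts_clean %>%
  group_by(source) %>%
  slice_max(n, n = 15)
```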

Removing the stop words better illustrates the differences between the data sources. Aside from “time”, “day”, “people” and “two”, there is much less overlap between the top 15 most common words in each source.

Bigram Analysis

The same analysis that was performed on individual words can be done on n-grams by tokenizing with unnest_tokens(bigram, text, token = "ngrams", n = X), where X is the length of the n-gram. Setting n = 2 gives the collection of bigrams:
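A sketch of the bigram tokenization, again assuming the `corpus` data frame from the word-frequency step:

```r
# Tokenize each line into bigrams (pairs of consecutive words)
bigrams <- corpus %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))

# Bigram frequencies within each source
bigram_counts <- bigrams %>%
  count(source, bigram, sort = TRUE)
```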

Consistent with what was found in the original word frequencies, the most frequent bigrams are built from stop words. These were removed by splitting each bigram into its two words and anti-joining each word with the stop words list.
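A sketch of that filtering step, using tidyr to split and rejoin the bigrams:

```r
library(tidyr)

# Split each bigram, anti-join both words against the stop word list,
# then reunite and recount
bigram_counts_clean <- bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  anti_join(stop_words, by = c("word1" = "word")) %>%
  anti_join(stop_words, by = c("word2" = "word")) %>%
  unite(bigram, word1, word2, sep = " ") %>%
  count(source, bigram, sort = TRUE)
```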

As with the stop-word-removed word list, there is less overlap between the sources. It is also worth noting that the bigram frequencies are substantially lower: the most common (non-stop-word) words are roughly 4.6 times more common than the most common bigrams.

Next Steps

While the exploratory data analysis included removal of stop words, those words will actually be needed in the final model, since they are precisely the words most likely to be typed next. Once tokenized, these datasets become too large to process quickly on a laptop, so one of the next steps is to find a way to work around the size of the data while pulling all three sources together. Once combined, the bigrams and trigrams need to be counted, split into a root and a terminator, and then grouped by root. Depending on the desired final settings (how many choices are offered for the next word), each root can be trimmed to at most X terminators. An overall threshold should also be set to remove rare trigrams and bigrams, in order to minimize the size of the resulting data object.
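A sketch of how the trigram lookup might be built along those lines; X and the minimum-count threshold are placeholder settings, and `corpus` is the combined data frame from the tokenizing step:

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Tokenize into trigrams (stop words retained, as the final model needs them)
trigrams <- corpus %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram))

X <- 3            # number of next-word choices to offer
min_count <- 5    # threshold for dropping rare trigrams

# Count, split into a root bigram and a terminal word, and keep the
# top X terminators per root above the threshold
trigram_lookup <- trigrams %>%
  count(trigram) %>%
  filter(n >= min_count) %>%
  separate(trigram, into = c("w1", "w2", "terminator"), sep = " ") %>%
  unite(root, w1, w2, sep = " ") %>%
  group_by(root) %>%
  slice_max(n, n = X, with_ties = FALSE) %>%
  ungroup()
```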

The structure should then consist of a trigram object that lists each root bigram, its top X terminators, and the frequency of each terminator. A similar object will be built from the bigrams, with a single-word root and a set of Y terminators and their frequencies. The model algorithm will take the input bigram, gather the X terminators from the trigram object and the Y terminators from the bigram object, and produce an ordered list of both sets of terminators with their frequencies. Because of the difference in frequency between bigrams and trigrams, the Y terminators will need a scaling factor. This factor, k, will be determined by running a training set of trigrams and checking how often the terminator is correctly guessed. After k has been calculated, the out-of-sample error can be estimated using a test or validation set of trigrams drawn from the sources.
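A sketch of how that lookup might work, assuming a `bigram_lookup` table built the same way with a single-word root, and a value of k estimated from the training set (the k = 0.4 default here is only a placeholder):

```r
library(dplyr)
library(stringr)

# Hypothetical prediction function: combine trigram and bigram candidates,
# scaling the bigram frequencies by k, and return an ordered list
predict_next_word <- function(phrase, trigram_lookup, bigram_lookup, k = 0.4) {
  w <- str_split(str_to_lower(str_squish(phrase)), " ")[[1]]
  nw <- length(w)
  root2 <- if (nw >= 2) paste(w[nw - 1], w[nw]) else NA_character_
  root1 <- w[nw]

  # Terminators matched on the trailing bigram
  tri_cand <- trigram_lookup %>%
    filter(root == root2) %>%
    transmute(terminator, score = n)

  # Terminators matched on the trailing word, scaled down by k
  bi_cand <- bigram_lookup %>%
    filter(root == root1) %>%
    transmute(terminator, score = k * n)

  # Merge both candidate sets and rank by score
  bind_rows(tri_cand, bi_cand) %>%
    group_by(terminator) %>%
    summarise(score = max(score), .groups = "drop") %>%
    arrange(desc(score))
}
```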