Introduction

Created for the Coursera Data Science Specialization, this milestone report summarizes the exploratory analysis to date for the SwiftKey project, whose final deliverable will be an interactive text prediction app.

The code has been suppressed to maintain readability of this document.

File Basic Summaries

Note that the blog data set contains some extremely long articles, whereas the Twitter data set consists of a large number of short entries.
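As an illustration, the per-file summaries could be produced along the following lines; the file names are the standard en_US SwiftKey files and are assumed here, since the report's own code is suppressed.

files <- c(blogs   = "en_US.blogs.txt",
           news    = "en_US.news.txt",
           twitter = "en_US.twitter.txt")

summaries <- lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  chars <- nchar(lines)
  data.frame(entries      = length(lines),
             median_chars = median(chars),
             max_chars    = max(chars))
})
do.call(rbind, summaries)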

Sample text from each data set

Blog file

  • In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.

  • We love you Mr. Brown.

  • Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.

News file

  • He wasn’t home alone, apparently.

  • The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.

  • WSU’s plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building.

Twitter file

  • How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.

  • When you meet someone special… you’ll know. Your heart will beat more rapidly and you’ll smile for no reason.

  • they’ve decided its more fun if I don’t.

Boxplot of character frequency in each data set
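A minimal sketch of how such a boxplot could be drawn, reusing the files read above (a log scale accommodates the very long blog outliers):

char_counts <- lapply(files, function(f)
  nchar(readLines(f, encoding = "UTF-8", skipNul = TRUE)))
boxplot(char_counts, log = "y",
        ylab = "Characters per entry",
        main = "Character counts by data set")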

Data Cleaning

We can see from the initial exploratory analysis that there are some outliers with extremely long articles. Examining those outliers turns up entries such as a log of events from the Fukushima nuclear reactor incident and a log of stock prices. Likewise, outliers on the short side are also unlikely occurrences. Neither of these extremes reflects the type of text we want to predict, so we will limit each data set to entries whose length falls between the 1st and 3rd quartiles.
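A sketch of this filtering step, where `lines` is the character vector for one data set (the report's actual code is suppressed):

trim_to_iqr <- function(lines) {
  n <- nchar(lines)
  q <- quantile(n, probs = c(0.25, 0.75))
  # keep only entries whose length sits between the 1st and 3rd quartiles
  lines[n >= q[1] & n <= q[2]]
}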

We will merge the data sets into one file. Initially, the intention was to split the data into 60% training and 40% test sets. However, the full data set is far too large to process on a single computer: a corpus built from the blog data alone takes 3.4 GB of RAM, before performing any transformations, which demand substantially more resources. Given that the goal is a speed-efficient app served over the web, reducing the data set size also improves efficiency. For this report, we will limit the data set to 30,000 observations.
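For illustration, the merge-and-sample step might look like the following, where `blogs`, `news`, and `twitter` are the character vectors read earlier and the seed is arbitrary rather than taken from the original analysis:

set.seed(123)  # arbitrary seed, for reproducibility of the sample only
merged  <- c(trim_to_iqr(blogs), trim_to_iqr(news), trim_to_iqr(twitter))
sampled <- sample(merged, 30000)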

We will clean the data by removing punctuation and profanity. Ideally, correcting spelling and removing non-English words would also help, but both are out of scope for this phase of the milestone report. For the most part, spelling errors and non-English words appear to be edge cases when only the top ngrams are considered.
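A minimal sketch of this cleaning step; lower-casing is included because the sample output below is lower case, and `profanity.txt` is a placeholder for whatever word list is used:

clean_text <- function(lines, profanity) {
  lines <- tolower(lines)
  lines <- gsub("[[:punct:]]+", " ", lines)           # strip punctuation
  bad   <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
  lines <- gsub(bad, " ", lines)                       # drop profanity
  gsub("\\s+", " ", trimws(lines))                     # collapse whitespace
}
cleaned <- clean_text(sampled, profanity = readLines("profanity.txt"))  # placeholder list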

Stop words were left in place because they are likely candidates for next-word prediction.

Sample text of clean output

## sadly im awake the joys of having a cat that constantly want attention

Plotting the frequency of the top 10 ngrams
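One illustrative way to obtain such counts is to build bigrams with base R and plot the top 10; this simple approach concatenates tokens across entries, a simplification of whatever the suppressed code does:

tokens  <- unlist(strsplit(cleaned, "\\s+"))
bigrams <- paste(head(tokens, -1), tail(tokens, -1))   # adjacent word pairs
top10   <- sort(table(bigrams), decreasing = TRUE)[1:10]
barplot(top10, las = 2, cex.names = 0.7,
        main = "Top 10 bigrams", ylab = "Frequency")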

Word cloud visualization

Word clouds show the most frequent words in the data set.
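One way to draw such a cloud is with the wordcloud package (assumed to be installed), reusing the token counts computed above:

library(wordcloud)
freqs <- sort(table(tokens), decreasing = TRUE)
wordcloud(names(freqs), as.numeric(freqs),
          max.words = 100, random.order = FALSE)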

Interesting findings

Plans for the prediction algorithm and Shiny app

  1. The next step in the research is to investigate the speed and accuracy of predicting with trigrams, bigrams, and unigrams. I suspect trigrams would yield the highest accuracy, but what is the trade-off in speed and memory?

  2. The algorithm I anticipate using is a back-off approach: take the last two words of the input and look them up in the trigram table; if a match exists, use it for the prediction. If it does not, fall back to the single most recent word and predict from the bigram table. The highest-frequency match determines the top prediction. A minimal sketch of this lookup appears after this list.

  3. The Shiny app would have a simple interface where predictive results appear reactively as the user enters text. Speed would be critical. The usage scenario would be similar to how a user types short messages on a cell phone.
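The sketch below illustrates the back-off lookup described in item 2. Here `trigram_freq` and `bigram_freq` are hypothetical frequency tables with columns `prefix`, `next_word`, and `freq`; they are placeholders for whatever structures the final app uses, not code from this report.

predict_next <- function(input, trigram_freq, bigram_freq) {
  words <- tail(strsplit(tolower(input), "\\s+")[[1]], 2)
  # try the last two words against the trigram table first
  hits <- trigram_freq[trigram_freq$prefix == paste(words, collapse = " "), ]
  if (nrow(hits) == 0) {
    # back off to the most recent word and the bigram table
    hits <- bigram_freq[bigram_freq$prefix == tail(words, 1), ]
  }
  if (nrow(hits) == 0) return(NA_character_)
  # highest frequency determines the top prediction
  hits$next_word[which.max(hits$freq)]
}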