Introduction

Created for the Coursera Data Science Specialization, this milestone report summarizes the exploratory analysis to date for the SwiftKey project, whose final deliverable will be an interactive text prediction app.

The code has been suppressed to maintain readability of this document.

File Basic Summaries

Note that the blog data set contains some extremely long articles, whereas the Twitter data set consists of a large number of short entries.
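As an illustration, the per-file summaries could be produced along the following lines; the file names are the standard en_US SwiftKey files and are assumed here, since the report's own code is suppressed.

files <- c(blogs   = "en_US.blogs.txt",
           news    = "en_US.news.txt",
           twitter = "en_US.twitter.txt")

summaries <- lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  chars <- nchar(lines)
  data.frame(entries      = length(lines),
             median_chars = median(chars),
             max_chars    = max(chars))
})
do.call(rbind, summaries)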

Sample text from each data set

Blog file

  • In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.

  • We love you Mr. Brown.

  • Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.

News file

  • He wasn’t home alone, apparently.

  • The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.

  • WSU’s plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building.

Twitter file

  • How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.

  • When you meet someone special… you’ll know. Your heart will beat more rapidly and you’ll smile for no reason.

  • they’ve decided its more fun if I don’t.

Boxplot of character frequency in each data set
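A minimal sketch of how such a boxplot could be drawn, reusing the files read above (a log scale accommodates the very long blog outliers):

char_counts <- lapply(files, function(f)
  nchar(readLines(f, encoding = "UTF-8", skipNul = TRUE)))
boxplot(char_counts, log = "y",
        ylab = "Characters per entry",
        main = "Character counts by data set")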

Data Cleaning

We can see from the initial exploratory analysis that there are some outliers with extremely long articles. Examining those outliers turns up entries such as a log of events from the Fukushima nuclear reactor incident and a log of stock prices. Likewise, outliers on the short side are also unlikely occurrences. Neither of these extremes reflects the type of text we want to predict, so we will limit each data set to entries whose length falls between the 1st and 3rd quartiles.
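A sketch of this filtering step, where `lines` is the character vector for one data set (the report's actual code is suppressed):

trim_to_iqr <- function(lines) {
  n <- nchar(lines)
  q <- quantile(n, probs = c(0.25, 0.75))
  # keep only entries whose length sits between the 1st and 3rd quartiles
  lines[n >= q[1] & n <= q[2]]
}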

We will merge the data sets into one file. Initially, the intention was to split the data into 60% training and 40% test sets. However, the full data set is far too large to process on a single computer: a corpus built from the blog data alone takes 3.4 GB of RAM, before performing any transformations, which demand substantially more resources. Given that the goal is a speed-efficient app served over the web, reducing the data set size also improves efficiency. For this report, we will limit the data set to 30,000 observations.
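For illustration, the merge-and-sample step might look like the following, where `blogs`, `news`, and `twitter` are the character vectors read earlier and the seed is arbitrary rather than taken from the original analysis:

set.seed(123)  # arbitrary seed, for reproducibility of the sample only
merged  <- c(trim_to_iqr(blogs), trim_to_iqr(news), trim_to_iqr(twitter))
sampled <- sample(merged, 30000)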

We will clean the data by removing punctuation and profanity. Ideally, correcting spelling and removing non-English words would also help, but both are out of scope for this phase of the milestone report. For the most part, spelling errors and non-English words appear to be edge cases when only the top ngrams are considered.
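A minimal sketch of this cleaning step; lower-casing is included because the sample output below is lower case, and `profanity.txt` is a placeholder for whatever word list is used:

clean_text <- function(lines, profanity) {
  lines <- tolower(lines)
  lines <- gsub("[[:punct:]]+", " ", lines)           # strip punctuation
  bad   <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
  lines <- gsub(bad, " ", lines)                       # drop profanity
  gsub("\\s+", " ", trimws(lines))                     # collapse whitespace
}
cleaned <- clean_text(sampled, profanity = readLines("profanity.txt"))  # placeholder list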

Stop words were left in place because they are likely candidates for next-word prediction.

Sample text of clean output

## sadly im awake the joys of having a cat that constantly want attention

Plotting the frequency of the top 10 ngrams
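One illustrative way to obtain such counts is to build bigrams with base R and plot the top 10; this simple approach concatenates tokens across entries, a simplification of whatever the suppressed code does:

tokens  <- unlist(strsplit(cleaned, "\\s+"))
bigrams <- paste(head(tokens, -1), tail(tokens, -1))   # adjacent word pairs
top10   <- sort(table(bigrams), decreasing = TRUE)[1:10]
barplot(top10, las = 2, cex.names = 0.7,
        main = "Top 10 bigrams", ylab = "Frequency")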

Word cloud visualization

Word clouds show the most frequent words in the data set.
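One way to draw such a cloud is with the wordcloud package (assumed to be installed), reusing the token counts computed above:

library(wordcloud)
freqs <- sort(table(tokens), decreasing = TRUE)
wordcloud(names(freqs), as.numeric(freqs),
          max.words = 100, random.order = FALSE)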

Interesting findings

Plans for the prediction algorithm and Shiny app

  1. The next step in the research is to investigate the speed and accuracy of predicting with trigrams, bigrams, and unigrams. I suspect trigrams would yield the highest accuracy, but what is the trade-off in speed and memory?

  2. The algorithm I anticipate using is a back-off approach: take the last two words of the input and look them up in the trigram table; if a match exists, use it for the prediction. If it does not, fall back to the single most recent word and predict from the bigram table. The highest-frequency match determines the top prediction. A minimal sketch of this lookup appears after this list.

  3. The Shiny app would have a simple interface where predictive results appear reactively as the user enters text. Speed would be critical. The usage scenario would be similar to how a user types short messages on a cell phone.
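The sketch below illustrates the back-off lookup described in item 2. Here `trigram_freq` and `bigram_freq` are hypothetical frequency tables with columns `prefix`, `next_word`, and `freq`; they are placeholders for whatever structures the final app uses, not code from this report.

predict_next <- function(input, trigram_freq, bigram_freq) {
  words <- tail(strsplit(tolower(input), "\\s+")[[1]], 2)
  # try the last two words against the trigram table first
  hits <- trigram_freq[trigram_freq$prefix == paste(words, collapse = " "), ]
  if (nrow(hits) == 0) {
    # back off to the most recent word and the bigram table
    hits <- bigram_freq[bigram_freq$prefix == tail(words, 1), ]
  }
  if (nrow(hits) == 0) return(NA_character_)
  # highest frequency determines the top prediction
  hits$next_word[which.max(hits$freq)]
}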