Data Science Capstone: Word Prediction App

Jenina Halitsky
April 23, 2015

Understanding the Problem:

The final project for the Coursera Data Science Specialization is the SwiftKey Capstone Project. SwiftKey has partnered with Coursera by providing a corpus called HC Corpora. These corpora were collected from numerous webpages with the aim of building a varied and comprehensive corpus. This project uses a natural language processing (NLP) model to predict the next word in a phrase.

The training data was downloaded from the Coursera site (http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) on March 22, 2015. The files are named LOCALE.blogs.txt, where LOCALE is one of the 4 locales: en_US, de_DE, ru_RU and fi_FI. Each LOCALE has 3 types of sources: blogs, news and Twitter updates. This project uses only the en_US data.

Data Sample and Cleaning:

en_US.blogs.txt file size:   248.5 Mb  number of lines: 899288   number of words: 37334131 

en_US.news.txt file size:    249.6 Mb  number of lines: 1010242  number of words: 34372530 

en_US.twitter.txt file size: 301.4 Mb  number of lines: 2360148  number of words: 30373543

As you can see above, each of the training files is extremely large. Tidying the data dramatically decreases the time needed to run the code. Tidying means cleaning the dataset: removing special characters, punctuation, numbers, stopwords and profanity, trimming extra whitespace, and changing all the data to lowercase. This makes it easier to analyze the words and see how often each one repeats.
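The cleaning steps listed above can be sketched in a few lines. This is a minimal Python illustration, not the code used in the project; the function name `clean_line` and the tiny `STOPWORDS` and `PROFANITY` sets are hypothetical placeholders, and a real run would load full word lists from files.

```python
import re

# Hypothetical, tiny word lists for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is"}
PROFANITY = {"darn"}


def clean_line(line, stopwords=STOPWORDS, profanity=PROFANITY):
    """Return the list of cleaned tokens for one line of raw text."""
    text = line.lower()
    # Drop apostrophes so contractions stay together ("don't" -> "dont").
    text = text.replace("'", "")
    # Replace everything that is not a lowercase ASCII letter or whitespace
    # (numbers, punctuation, special characters) with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments also collapses runs of extra whitespace.
    tokens = text.split()
    # Remove stopwords and profanity.
    return [t for t in tokens if t not in stopwords and t not in profanity]


print(clean_line("The 3 quick   foxes, darn it, DON'T stop!"))
# prints ['quick', 'foxes', 'it', 'dont', 'stop']
```

Working line by line keeps memory use low, which matters for files of this size.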

Modeling:

Once the data was cleaned, the next step was to extract all of the N-grams and obtain their frequencies. Each 4-gram was split into a 3-gram (its first 3 words) and a final word. This produced the most common final word after each 3-gram. The process was repeated for the original set of 3-grams, splitting each into a 2-gram and a final word, and for the original set of 2-grams, splitting each into a single word and a final word.
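One way to picture the prefix and final-word split is the following Python sketch. It is an illustration under assumptions, not the project's actual code: the function name `build_table` and the sample token lists are made up for the example.

```python
from collections import Counter, defaultdict


def build_table(token_lines, n):
    """Map each (n-1)-word prefix to a Counter of the final words seen after it."""
    table = defaultdict(Counter)
    for tokens in token_lines:
        # Slide a window of n words across the line.
        for i in range(len(tokens) - n + 1):
            prefix = tuple(tokens[i:i + n - 1])  # first n-1 words
            final_word = tokens[i + n - 1]       # last word of the n-gram
            table[prefix][final_word] += 1
    return table


# Made-up, already-cleaned sample lines.
lines = [
    ["happy", "mothers", "day", "mom"],
    ["happy", "mothers", "day", "mom"],
    ["happy", "mothers", "day", "everyone"],
]

four_grams = build_table(lines, 4)
print(four_grams[("happy", "mothers", "day")].most_common(1))
# prints [('mom', 2)]
```

Calling `build_table(lines, 3)` and `build_table(lines, 2)` gives the corresponding tables for 3-grams and 2-grams.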

The application asks the user to type in a phrase, and the prediction algorithm examines the phrase entered. If the phrase was present in the training data, the algorithm returns the most common next word. If not, it searches again using 3 words, then 2 words, and then 1 word, until it is able to predict the most common next word.
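The fallback search described above can be sketched as follows. This Python example is a simplified illustration, not the app's implementation: `predict_next`, the layout of `TABLES` (prefix length mapped to prefix mapped to word counts) and all counts are hypothetical, and the sketch assumes the search uses the last 3, 2 or 1 words of the phrase.

```python
# Hypothetical frequency tables: prefix length -> prefix -> {final word: count}.
TABLES = {
    3: {("happy", "mothers", "day"): {"mom": 2, "everyone": 1}},
    2: {("new", "york"): {"city": 5, "times": 3}},
    1: {("day",): {"today": 4, "mom": 2}},
}


def predict_next(phrase, tables, max_prefix=3):
    """Try the longest usable prefix first, then back off to shorter ones."""
    tokens = phrase.lower().split()
    for k in range(min(max_prefix, len(tokens)), 0, -1):
        prefix = tuple(tokens[-k:])  # last k words of the phrase
        candidates = tables.get(k, {}).get(prefix)
        if candidates:
            # Return the most frequent final word seen after this prefix.
            return max(candidates, key=candidates.get)
    return None  # no prefix of any length was found


print(predict_next("happy mothers day", TABLES))  # prints mom
print(predict_next("i love new york", TABLES))    # prints city
print(predict_next("what a day", TABLES))         # prints today
```

When no prefix matches at any length the sketch returns `None`; a real app would need some default, such as the most frequent single word.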

Shiny App Instructions:

  1. Type your phrase in the text field.
  2. The phrase will begin to process while you are typing.
  3. Obtain the word prediction.
  4. The Prediction Word Cloud graphic presents the next-word predictions (up to a maximum of 50) for the entered phrase. The font size represents the probability of each prediction.
  5. The app takes about 30 seconds to fully load. Moving forward, I would like to find a way to reduce this time.

Please feel free to try the SwiftKey Data Science Capstone - Word Prediction app on ShinyApps:

(http://jmhalitsky.shinyapps.io/DataScienceCapstoneProject/)