Jenina Halitsky
April 23, 2015
The final project for the Coursera Data Science Specialization is the SwiftKey Capstone Project. SwiftKey has partnered with Coursera by providing a collection of corpora called HC Corpora. These corpora were collected from numerous web pages with the aim of building a varied and comprehensive sample of text. This project uses a natural language processing (NLP) model to predict the next word in a phrase.
The training data was downloaded from the Coursera site (http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) on March 22, 2015. The files are named in the form LOCALE.blogs.txt, where LOCALE is one of the 4 locales en_US, de_DE, ru_RU and fi_FI. Each LOCALE has 3 types of sources: blogs, news and Twitter updates. For this project, only the en_US data is used.
en_US.blogs.txt: file size 248.5 MB, number of lines 899,288, number of words 37,334,131
en_US.news.txt: file size 249.6 MB, number of lines 1,010,242, number of words 34,372,530
en_US.twitter.txt: file size 301.4 MB, number of lines 2,360,148, number of words 30,373,543
As shown above, each of the training files is extremely large. Tidying the data dramatically reduces the time needed to run the code. Tidying here means cleaning the dataset: removing special characters, punctuation, numbers, stopwords and profanity, trimming extra whitespace, and converting all text to lowercase. This also makes it easier to analyze the words and see how often each one repeats.
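The cleaning steps described here can be sketched in a few lines of Python. This is only an illustration, not the code used for the project: the function name clean_line and the tiny STOPWORDS and PROFANITY sets are hypothetical placeholders, since the report does not say which word lists were used.

```python
import re
from typing import List

# Hypothetical, tiny word lists for illustration only; the report does not
# say which stopword or profanity lists were actually used.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is"}
PROFANITY = {"darn"}


def clean_line(text: str) -> List[str]:
    """Lowercase a line, strip non-letters, and drop listed words."""
    text = text.lower()
    # Replace anything that is not a lowercase ASCII letter or whitespace
    # (punctuation, digits, special characters) with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments also trims and collapses extra whitespace.
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS and t not in PROFANITY]
```

For example, clean_line("The 3 quick, brown foxes!!") keeps only the words quick, brown and foxes. One side effect of this simple pattern is that contractions are split at the apostrophe, so a real pipeline would probably handle them separately.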
Once the data was cleaned, the next step was to extract all of the N-grams and obtain their frequencies. Each 4-gram was split into a 3-gram (its first 3 words) and a final word. This produced the most common final word after each 3-gram. The process was repeated for the original set of 3-grams, producing a set of 2-grams with their most common final words, and for the original set of 2-grams, producing a set of single words with their most common final words.
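As a rough sketch of this step, the Python below counts N-grams and splits each one into a prefix (all words but the last) and a final word, keeping the most frequent final word for each prefix. The helper names ngrams and build_table are hypothetical, and the input is assumed to be lines that have already been cleaned and split into words.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return all runs of n consecutive words as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_table(token_lines, n):
    """Map each (n-1)-word prefix to its most frequent final word."""
    counts = defaultdict(Counter)
    for tokens in token_lines:
        for gram in ngrams(tokens, n):
            # Split the n-gram into its first n-1 words and its final word.
            counts[gram[:-1]][gram[-1]] += 1
    return {prefix: c.most_common(1)[0][0] for prefix, c in counts.items()}
```

Calling build_table with n = 4, 3 and 2 gives one lookup table per prefix length (3 words, 2 words and 1 word). Only the single most common final word is kept per prefix here, which keeps the tables small at the cost of discarding the other candidates.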
The application asks the user to type in a phrase, and the prediction algorithm examines the phrase entered. If the phrase was present in the training data, the algorithm returns the most common next word. If not, it continues to search using 3 words, then 2 words and then 1 word, until it is able to predict the next word.
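The search order described here is a simple back-off lookup, which might look roughly like the Python sketch below. It rests on assumptions not stated in the report: the function name predict_next is hypothetical, the lookup tables are assumed to be dictionaries keyed by prefix length, and the search is assumed to use the last 3, 2 and 1 words of the phrase.

```python
def predict_next(phrase, tables, default=None):
    """Back off from the last 3 words to the last 1 word of the phrase.

    tables maps a prefix length k to a dict of {k-word tuple: next word}.
    """
    tokens = phrase.lower().split()
    for k in (3, 2, 1):
        if len(tokens) < k:
            continue  # phrase is too short for a k-word prefix
        prefix = tuple(tokens[-k:])
        word = tables.get(k, {}).get(prefix)
        if word is not None:
            return word
    # No prefix of any length was seen in the training data.
    return default


# Hypothetical toy tables, for illustration only.
toy_tables = {
    3: {("i", "love", "new"): "york"},
    2: {("love", "new"): "shoes"},
    1: {("new",): "year"},
}
```

With these toy tables, the phrase "I love new" matches at 3 words, "we love new" falls back to the 2-word table, and "brand new" falls back to the 1-word table. A phrase with no match at any length returns the default value.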
Please feel free to try the SwiftKey Data Science Capstone - Word Prediction app on ShinyApps:
(http://jmhalitsky.shinyapps.io/DataScienceCapstoneProject/)