The SwiftKey (http://swiftkey.com/en/) keyboard has been installed on over 250 million handheld (Android, iOS) devices. It gives its user base a text-input experience they love while also saving them time. This milestone report presents the preliminary approach, analysis, and progress on "The Data Science Capstone NLP Project". The objective of the project is to emulate SwiftKey's NLP (Natural Language Processing), text mining, and text prediction models with our own.
The tasks or milestones for the project are: data acquisition and cleaning, exploratory analysis, modeling, prediction, exploration, data product, and slide deck.
The data for this capstone project was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
The downloaded zip file contained four directories (German, Finnish, Russian, US English). The three US English files (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt) in the final/en_US/ directory were used as the training datasets.
The following table summarizes the raw dataset, obtained with the Unix commands "wc -lcwL en_US*.txt" and "ls -hl en_US*.txt":
| File name | Number of lines | Number of words | Number of characters | File size | Longest line |
|---|---|---|---|---|---|
| en_US.blogs.txt | 899,288 | 37,334,114 | 210,160,014 | 201M | 40,833 |
| en_US.news.txt | 1,010,242 | 34,365,936 | 205,811,889 | 197M | 11,384 |
| en_US.twitter.txt | 2,360,148 | 30,359,804 | 167,105,338 | 160M | 173 |
| Total | 4,269,678 | 102,059,854 | 583,077,241 | 558M | |
The files contained many binary characters, non-printable characters, emoticons, foreign words, etc. They were cleaned by:

- reading them in binary mode (UTF-8) and converting them to latin1 with the iconv function
- applying regex patterns such as "[[:cntrl:]]" to strip control characters
- removing all numbers, some punctuation, and extra white-space, and converting all text to lowercase
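A minimal sketch of these cleaning steps, using only base R functions, is shown below; the file name and object names are illustrative rather than the exact code used.

```r
# Read one raw file in binary/UTF-8 mode and convert to latin1,
# dropping characters that cannot be represented (file name is illustrative)
con   <- file("final/en_US/en_US.blogs.txt", open = "rb")
lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
lines <- iconv(lines, from = "UTF-8", to = "latin1", sub = "")

# Strip control characters, numbers, punctuation and extra white-space,
# then convert everything to lowercase
lines <- gsub("[[:cntrl:]]", " ", lines)
lines <- gsub("[[:digit:]]", "", lines)
lines <- gsub("[[:punct:]]", "", lines)
lines <- gsub("\\s+", " ", lines)
lines <- tolower(lines)
```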
Given the large size of the data files and the limited computing power of the PC used (Linux, 16 GB RAM, 4 cores), only a random sample of 5,000 lines of text from each file was used for the following analysis.
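As an illustration, the sampling can be done along these lines (the seed value is arbitrary and `lines` is the cleaned character vector from the sketch above):

```r
set.seed(1234)                      # arbitrary seed, for reproducibility
sample_lines <- sample(lines, 5000) # random sample of 5,000 cleaned lines per file
```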
There is a high frequency of stop words and SMART stop words in any English corpus.
Examples of English stop words and SMART stop words are listed below. Complete lists can be found at http://svn.tartarus.org/snowball/trunk/website/algorithms/english/stop.txt?revision=431&view=markup and http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
sample(stopwords("english"), 20)
## [1] "own" "to" "as" "any" "doesn't"
## [6] "there's" "they're" "did" "yourself" "i"
## [11] "what's" "over" "its" "by" "further"
## [16] "shouldn't" "once" "it's" "from" "above"
sample(stopwords("SMART"), 20)
## [1] "did" "sub" "two" "haven't" "themselves"
## [6] "better" "entirely" "twice" "serious" "were"
## [11] "greetings" "same" "keep" "specifying" "re"
## [16] "be" "using" "ex" "example" "namely"
For this reason, the stop words and SMART words were removed before the uni-gram analysis of the data.
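A hedged sketch of this removal step, assuming the sampled text is held in a tm VCorpus object named `corpus`:

```r
library(tm)

# Drop English and SMART stop words before the uni-gram analysis
# ('corpus' is assumed to be a tm VCorpus built from the sampled lines)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, stopwords("SMART"))
```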
The following is a word cloud visualization of the top 50 highest-frequency words from each of the three data sources (corpora). It shows some noticeable differences between the sources; word frequency is represented by print size and color.
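A rough sketch of how such a word cloud can be generated, assuming the wordcloud and RColorBrewer packages (the exact plotting options used in the report may differ):

```r
library(tm)
library(wordcloud)
library(RColorBrewer)

# Top-50 word cloud from one corpus ('corpus' as built above)
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(freq), freq, max.words = 50,
          colors = brewer.pal(8, "Dark2"), random.order = FALSE)
```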
A total of nine Term Document Matrices (TDMs) were created: each of the three corpora was tokenized with three different n-gram sizes (2, 3, and 4).
A very high sparsity (close to 100%) was observed in the TDMs, so different sparse thresholds (0.95 to 0.99999) were passed to the removeSparseTerms function to reduce their size.
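As an illustration, one of the nine TDMs could be built roughly as follows; the sketch assumes the RWeka package for n-gram tokenization, which is an assumption rather than the reported setup.

```r
library(tm)
library(RWeka)

# Build a bi-gram TDM for one corpus and trim its sparsity;
# the same pattern repeats for tri- and quad-grams and the other corpora
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
tdm2 <- removeSparseTerms(tdm2, 0.99)  # 0.99 is one of the thresholds tried (0.95-0.99999)
```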
A frequency matrix was created for each TDM; the matching n-grams and their frequencies were then combined to produce the following ggplot histograms for data visualization.
The following histogram shows a sample of 25 bi-grams that were present in all three corpora.
The following histogram shows a sample of 25 tri-grams that were present in all three corpora.
The following histogram shows a sample of 25 quad-grams that were present in all three corpora.
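A simplified sketch of how one such frequency plot can be produced from a single trimmed TDM (the report additionally combines the matching n-grams across the three corpora):

```r
library(ggplot2)

# Turn a trimmed n-gram TDM into a frequency table and plot the 25 most frequent n-grams
freq <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
df   <- data.frame(ngram = names(freq), count = freq, row.names = NULL)

ggplot(head(df, 25), aes(x = reorder(ngram, count), y = count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Bi-gram", y = "Frequency")
```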
This task is a work in progress. The planned steps are:
- Merge the three corpora built above into a single corpus.
- Explore stemming the corpus to reduce the overall size of the data in the final TDM by collapsing variations of English words.
- Reproduce the TDM and the bi-, tri-, and quad-gram frequencies.
- Build and test a predictive algorithm with statistical smoothing (back-off, Kneser-Ney, word2vec, Good-Turing) to implement next-word prediction in R; a starting-point sketch is shown after this list.
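A minimal sketch of a simple back-off lookup (no smoothing) is shown below; the data frames `trigram_freq` and `bigram_freq` are assumed to hold the n-gram frequency tables built earlier, with columns `ngram` and `count`, and the function name is illustrative.

```r
# Back-off sketch: try the tri-gram table first, then fall back to the bi-gram table
predict_next <- function(phrase, trigram_freq, bigram_freq, n = 3) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)

  # look for tri-grams starting with the last two words of the phrase
  if (length(words) == 2) {
    prefix <- paste(words, collapse = " ")
    hits <- trigram_freq[grepl(paste0("^", prefix, " "), trigram_freq$ngram), ]
    if (nrow(hits) > 0) {
      hits <- head(hits[order(-hits$count), ], n)
      return(sub(".* ", "", hits$ngram))  # last word of each matching tri-gram
    }
  }

  # back off to bi-grams starting with the last word
  prefix <- tail(words, 1)
  hits <- bigram_freq[grepl(paste0("^", prefix, " "), bigram_freq$ngram), ]
  hits <- head(hits[order(-hits$count), ], n)
  sub(".* ", "", hits$ngram)
}
```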
The following is an outline of the planned Shiny data product.
- Create a ui.R function to display usage information, a text-string input box, a submit button, three text outputs for the prediction choices, and a text output for messages and information.
- Create an R function to clean the user's input string: remove punctuation, numbers, and extra white-space, and convert to lowercase.
- Create a server.R function to load the previously derived n-gram .RData file into memory, call the prediction function, and have shinyServer respond reactively to the input with the output.
- Create a prediction R function that implements the chosen prediction algorithm, takes a string as input, and returns a data frame of predictions.
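A minimal skeleton of this structure is sketched below, reusing the illustrative `predict_next` function above; all file, object, and helper names (`ngram.RData`, `clean_input`, `trigram_freq`, `bigram_freq`) are placeholders, not the final implementation.

```r
library(shiny)

# Illustrative helper matching the cleaning step described above (placeholder name)
clean_input <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:digit:]]", "", x)
  gsub("\\s+", " ", trimws(x))
}

# ui piece (ui.R): usage note, text input, submit button, prediction and message outputs
ui <- fluidPage(
  titlePanel("Next word prediction"),
  helpText("Type a phrase and press Submit to see up to three predicted next words."),
  textInput("phrase", "Enter text:"),
  submitButton("Submit"),
  verbatimTextOutput("predictions"),
  textOutput("message")
)

# server piece (server.R): load the pre-built n-gram tables once, then react to input
# ('ngram.RData', 'trigram_freq' and 'bigram_freq' are placeholder names)
load("ngram.RData")

server <- function(input, output) {
  output$predictions <- renderPrint({
    predict_next(clean_input(input$phrase), trigram_freq, bigram_freq, n = 3)
  })
  output$message <- renderText("Top candidate words are shown above.")
}

shinyApp(ui = ui, server = server)
```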
The slide deck task has not been started yet.
Complete all the above-mentioned steps and construct a working Shiny app. Time permitting, do the following in an agile manner:
- Filter and mask profanity in the final dataset.
- Rework the R code for better performance and efficiency.
- Investigate ways to process larger datasets.
- Consider cloud computing via http://www.dominodatalab.com/ if the PC cannot handle the computational needs.
- Figure out the maximum size of data that can be used by the Shiny app server.
- Test and improve the accuracy and efficiency of the prediction model.