Background

The SwiftKey keyboard (http://swiftkey.com/en/) has been installed on over 250 million handheld (Android, iOS) devices. It gives its users a text input experience they love and saves them time. This milestone report presents the preliminary approach, analysis and progress on “The Data Science Capstone NLP Project”. The objective of the project is to emulate SwiftKey’s NLP (Natural Language Processing), text mining and predictive text models with models of our own.

The tasks (milestones) for the project are: Data acquisition and cleaning; Exploratory analysis; Modeling, Prediction, Exploration; and Data product & slide deck.

Data acquisition and cleaning

The data for this capstone project was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The downloaded zip file contained four directories (German, Finnish, Russian, US English). The three US English files in the final/en_US/ directory (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt) were used as training datasets.

The following table summarizes the raw dataset. It was produced with the Unix commands “wc -lcwL en_US*.txt” and “ls -hl en_US*.txt”.

File name          Number of lines  Number of words  Number of characters  File size  Longest line
-----------------  ---------------  ---------------  --------------------  ---------  ------------
en_US.blogs.txt            899,288       37,334,114           210,160,014       201M        40,833
en_US.news.txt           1,010,242       34,365,936           205,811,889       197M        11,384
en_US.twitter.txt        2,360,148       30,359,804           167,105,338       160M           173
-----------------  ---------------  ---------------  --------------------  ---------  ------------
Total                    4,269,678      102,059,854           583,077,241       558M
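
For readers without a Unix shell, roughly the same summary can be computed in base R. This is only a sketch: `summarise_text_file` is a hypothetical helper that was not part of the original analysis, and the demo file is a small temporary file rather than one of the en_US files.

```r
# Hypothetical helper: approximates the wc/ls columns of the table above.
summarise_text_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  # Split each line on runs of whitespace; empty lines contribute zero words
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    file         = basename(path),
    lines        = length(lines),
    words        = sum(words_per_line),
    bytes        = file.size(path),   # what `wc -c` and `ls -l` report
    longest_line = max(nchar(lines, type = "chars", allowNA = TRUE)),
    stringsAsFactors = FALSE
  )
}

# Demo on a small temporary file (not the real data)
demo_path <- tempfile(fileext = ".txt")
writeLines(c("hello world", "a much longer second line", ""), demo_path)
demo_summary <- summarise_text_file(demo_path)
print(demo_summary)
```

Applying the helper to each of the three en_US files and binding the rows with `rbind` would give a table of the same shape, although word counts may differ slightly from `wc` because of how whitespace and unusual characters are handled.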

The files contained many binary and non-printable characters, emoticons, foreign words, etc., and were cleaned before analysis.

All numbers, some punctuation and extra white-space were removed, and all text was converted to lowercase.
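
The cleaning steps described above could be sketched in base R as follows. The function name `clean_text` and the exact rules are assumptions: the report does not say which punctuation was kept, so this sketch keeps apostrophes (so that contractions survive) and drops everything else.

```r
# Hypothetical cleaning function; the real pipeline may have used different rules.
clean_text <- function(x) {
  # Drop anything that cannot be represented in ASCII (emoticons, foreign characters)
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
  x <- tolower(x)                      # convert to lowercase
  x <- gsub("[[:cntrl:]]", " ", x)     # non-printable control characters
  x <- gsub("[0-9]+", " ", x)          # numbers
  x <- gsub("[^a-z' ]", " ", x)        # punctuation, except apostrophes (assumption)
  x <- gsub("\\s+", " ", x)            # collapse runs of white-space
  trimws(x)
}

cleaned_demo <- clean_text("Hello,  World 123!")
print(cleaned_demo)
print(clean_text("I \u2665 R!!  It's 2015,\tOK?"))
```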

Given the large size of the data files and the limited computing power of the PC (Linux, 16 GB RAM, 4 cores), only a random sample of 5,000 lines of text from each file was used for the following analysis.
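
A reproducible way to draw such a sample is sketched below. The helper name `sample_lines` and the seed value are assumptions made for illustration; the toy vector stands in for the lines of one of the real files.

```r
# Hypothetical helper: draw n lines at random, keeping them in file order.
sample_lines <- function(lines, n = 5000, seed = 1234) {
  set.seed(seed)                       # arbitrary seed, only for reproducibility
  if (length(lines) <= n) return(lines)
  lines[sort(sample.int(length(lines), n))]
}

toy_lines  <- sprintf("line %d", 1:20000)   # stand-in for readLines("en_US.blogs.txt")
toy_sample <- sample_lines(toy_lines)
print(length(toy_sample))
```

Fixing the seed means the same 5,000 lines are drawn on every run, which keeps the downstream frequency tables and plots comparable between runs.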

Exploratory analysis

UniGram analysis

Stop-words and smart-words (the words on the SMART stop list) occur with high frequency in any English corpus.

Examples of English stop-words and smart-words are listed below. The complete lists can be found at http://svn.tartarus.org/snowball/trunk/website/algorithms/english/stop.txt?revision=431&view=markup and http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop

library(tm)  # provides stopwords()
sample(stopwords("english"), 20)
##  [1] "own"       "to"        "as"        "any"       "doesn't"  
##  [6] "there's"   "they're"   "did"       "yourself"  "i"        
## [11] "what's"    "over"      "its"       "by"        "further"  
## [16] "shouldn't" "once"      "it's"      "from"      "above"
sample(stopwords("SMART"), 20)
##  [1] "did"        "sub"        "two"        "haven't"    "themselves"
##  [6] "better"     "entirely"   "twice"      "serious"    "were"      
## [11] "greetings"  "same"       "keep"       "specifying" "re"        
## [16] "be"         "using"      "ex"         "example"    "namely"

For this reason, the stop-words and smart-words were removed from the uni-gram analysis of the data.
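
The effect of this filtering on uni-gram counts can be illustrated with a small base-R sketch. The stop list here is a tiny made-up one so that the example is self-contained, and `unigram_freq` is a hypothetical helper, not the code used for the report.

```r
# Tiny illustrative stop list (assumption); the report used the full
# "english" and "SMART" lists shown above.
toy_stopwords <- c("the", "a", "to", "is", "and", "of", "it")

# Hypothetical helper: tokenize, drop stop-words, count what is left.
unigram_freq <- function(text, stop_list) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_list)]
  sort(table(tokens), decreasing = TRUE)
}

freq <- unigram_freq(c("The cat sat and the dog sat", "It is a cat"), toy_stopwords)
print(freq)
```

Without the filter, words such as "the" would dominate the counts and hide the content words that distinguish one corpus from another.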

The following is a word cloud visualization of the 50 most frequent words from each of the three data sources (corpora). It shows some noticeable differences between the sources; word frequency is represented by the size and color of the print.

Bi, Tri, Quad Gram analysis

A total of nine Term Document Matrices (TDMs) were created: each of the three corpora was tokenized into three different n-gram sizes (2, 3 and 4).
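
The report does not say which tokenizer was used, so the following is only a minimal base-R sketch of n-gram tokenization; `make_ngrams` is a hypothetical helper.

```r
# Hypothetical helper: split a string into overlapping n-grams of words.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))   # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams   <- make_ngrams("thanks for the follow", 2)
trigrams  <- make_ngrams("thanks for the follow", 3)
quadgrams <- make_ngrams("thanks for the follow", 4)
print(bigrams)
```

Counting the n-grams produced for every line of a corpus gives one row per n-gram in the corresponding TDM.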

A high sparsity (100%) was observed in the TDMs, so different sparse values (0.95 – 0.99999) were passed to the removeSparseTerms function to reduce their size.
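
To make the sparse threshold concrete, the sketch below reproduces the idea on a small dense matrix in base R: a term is kept only if the fraction of documents in which it does not occur is below the threshold. The matrix values are made up, and `drop_sparse_terms` is a hypothetical stand-in written for illustration, not the tm implementation.

```r
# Toy term-document matrix (made-up counts): rows are terms, columns are documents.
tdm <- matrix(c(1, 1, 1,
                1, 0, 0,
                0, 0, 2,
                0, 1, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("of the", "thanks for", "right now", "in the"),
                              c("doc1", "doc2", "doc3")))

overall_sparsity <- mean(tdm == 0)       # fraction of zero cells in the matrix
term_sparsity    <- rowMeans(tdm == 0)   # fraction of documents missing each term

# Hypothetical stand-in: keep terms whose sparsity is below `sparse`.
drop_sparse_terms <- function(m, sparse) {
  m[rowMeans(m == 0) < sparse, , drop = FALSE]
}

reduced <- drop_sparse_terms(tdm, 0.5)
print(rownames(reduced))
```

With thousands of documents, most n-grams appear in only a handful of them, which is why a threshold very close to 1 is needed to keep any terms at all.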

A frequency matrix was created for each of the TDMs. The matching n-grams and their frequencies were then combined to produce the following ggplot histograms for data visualization.
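
One way to combine the per-corpus frequencies is sketched below in base R. All counts are made-up toy values and the variable names are assumptions; the result is a long-format data frame of the kind usually passed to ggplot.

```r
# Made-up frequency vectors, one per corpus (not real results).
blogs   <- c("of the" = 40, "in the" = 35, "i am" = 12)
news    <- c("of the" = 55, "in the" = 50, "said the" = 20)
twitter <- c("in the" = 18, "of the" = 15, "i am" = 30)
freq_list <- list(blogs = blogs, news = news, twitter = twitter)

# N-grams present in every corpus
common <- Reduce(intersect, lapply(freq_list, names))

# Long format: one row per (n-gram, corpus) pair
combined <- data.frame(
  ngram  = rep(common, times = length(freq_list)),
  corpus = rep(names(freq_list), each = length(common)),
  freq   = unlist(lapply(freq_list, function(f) f[common]), use.names = FALSE),
  stringsAsFactors = FALSE
)
print(combined)
```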

The following histogram shows a sample of 25 bi-grams that were present in all three corpora.

The following histogram shows a sample of 25 tri-grams that were present in all three corpora.

The following histogram shows a sample of 25 quad-grams that were present in all three corpora.

Modeling, Prediction, Exploration

This task is a work in progress.

Data product & slide deck

The following is an initial idea for the Shiny data product.

The slide deck task has not started yet.

Conclusion & next steps

Complete all the above-mentioned steps and construct a working Shiny app. Time permitting, do the following in an agile manner.