Introduction

This is the milestone report for the Johns Hopkins Data Science Specialization Capstone Project. The purpose of this milestone report is to demonstrate an understanding of the project and the current progress. It is an attempt to start further discussion and obtain constructive feedback from peers. Since the intended audience for this project is a non-data-scientist manager, the code will be separated from the report and kept to a minimum, while explaining how the data has been transformed.

Executive Summary

The objective of this Capstone Project is to produce a predictive text algorithm, written in R, that suggests the 8 most likely next words based on a user's text input.

As the user inputs characters, the entered text will be compared against a word list. The predicted word will be the one with the highest probability of following the previous word or multi-word phrase.

At the current project stage, the dataset has been downloaded from Coursera and SwiftKey. Some initial exploratory data analysis has been performed, along with some data preparation, in order to proceed with the predictive modeling and the construction of the end-user application.

The next objective is to find the optimal sample size from the dataset required to build a corpus on which to train the prediction algorithm.

The Problem

A user interface will need to be built to process the user's input and predict the word with the highest probability of following the previous word or multi-word phrase. Problems that could arise include how to handle undesirable features within the dataset, such as non-English words, Twitter handles, email addresses, abbreviations and contractions, numbers, and whitespace.
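As one hedged illustration of how such features could be stripped (not the cleaning routine actually used in this project, which relies on the tm package later in this report), a few regular expressions go a long way. The function name clean_text and the example string are illustrative only.

# A minimal sketch of regex-based cleaning (illustrative only)
clean_text <- function(x) {
  x <- gsub("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+", " ", x)  # drop email addresses first
  x <- gsub("@\\w+", " ", x)                             # drop Twitter handles
  x <- gsub("[0-9]+", " ", x)                            # drop numbers
  x <- gsub("\\s+", " ", x)                              # collapse whitespace
  trimws(x)
}
clean_text("Email me at foo@bar.com or tweet @someone in 2015")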

One of the largest challenges is how to achieve total coverage of all possible word combinations. The algorithm will also need to process large amounts of data quickly in order to keep the user's attention. To achieve this, a minimal amount of data should be used while maximizing coverage.

Deliverables for this milestone

The main deliverables are:

  • Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  • Create a basic report of summary statistics about the data sets.
  • Report any interesting findings that you amassed so far.

Download Dataset

The training dataset is downloadable from the URL shown in the code below.

setwd("L:\\Cousera\\Capstone\\week 2\\")

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
destFile <- "data/cs.zip"
output <- "L:\\Cousera\\Capstone\\week 2\\data"

if (!file.exists(destFile)) {
  download.file(url, destFile)
  unzip(destfile,exdir=output) 
}

Data Summary

The dataset includes three sources of textual data, namely blogs, news, and Twitter. Each source is available in four languages: Russian, Finnish, German, and English. The scope of this project will only include the English version of all three sources.

Files used are as follows:

  • en_US.blogs.txt
  • en_US.twitter.txt
  • en_US.news.txt

A table summarizing the characteristics of the three raw datasets (number of lines, characters, and words per line) appears in the Preprocessing section below.
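The exact code used to compute those statistics is not shown in this report; the following sketch, using base R only, shows one way the line, character, and word counts for a file could be obtained. The function name summarize_file is hypothetical, and the file path is an assumption based on the unzip output directory above.

# A minimal sketch of computing per-file summary statistics (assumed approach)
summarize_file <- function(path) {
  lines        <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  wordsPerLine <- sapply(strsplit(lines, "\\s+"), length)
  data.frame(FileName = basename(path),
             Lines    = length(lines),
             Chars    = sum(nchar(lines)),
             Words    = sum(wordsPerLine),
             Min.WPL  = min(wordsPerLine),
             Mean.WPL = round(mean(wordsPerLine), 2),
             Max.WPL  = max(wordsPerLine))
}
summarize_file("data/final/en_US/en_US.blogs.txt")  # path assumed from the unzip step above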

Preprocessing

In the interest of efficiency and time, a 1% sample of the blog, news, and Twitter datasets has been combined and explored for this milestone report (a sketch of the sampling step appears before the corpus-building code below). After sampling, the data goes through several transformations, including:

  • converting letters to lower case
  • removing emoticons
  • removing whitespace
  • removing punctuation
  • removing numbers
  • removing stop words
  • stemming words

These transformations are done by converting the vectors of data into a corpus object and applying the 'tm_map' function from the 'tm' package. The following table summarizes the characteristics of the three full English datasets.

              FileName    Lines      Chars     Words Min.WPL Mean.WPL Max.WPL
blogs      en_US.blogs   899288  208361438  37865888       0    42.43    6726
news        en_US.news    77259   15683765   2665742       1    34.87    1123
twitter  en_US.twitter  2360148  162384825  30578891       1    12.80      60
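The 1% sampling and combination step referenced above is not shown in full; a minimal sketch, assuming the three English files have already been read into character vectors (blogs, news, and twitter are illustrative object names), might look like the following. The object name combine.Data matches its use in the corpus-building code below.

# A minimal sketch of drawing a 1% sample from each source and combining them (assumed approach)
set.seed(1234)                               # for reproducibility
sample_lines <- function(x, pct = 0.01) {
  x[sample(length(x), size = round(length(x) * pct))]
}
combine.Data <- c(sample_lines(blogs),       # blogs, news, twitter are assumed to be
                  sample_lines(news),        # the raw lines read with readLines()
                  sample_lines(twitter))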
library(tm)

build_corpus <- function(x) {
  corpus <- VCorpus(VectorSource(x))                           # create the corpus
  corpus <- tm_map(corpus, content_transformer(tolower))       # all characters to lower case
  corpus <- tm_map(corpus, removePunctuation)                  # remove punctuation
  corpus <- tm_map(corpus, removeNumbers)                      # remove numbers
  corpus <- tm_map(corpus, stripWhitespace)                    # remove extra whitespace
  corpus <- tm_map(corpus, removeWords, stopwords("english"))  # remove English stop words
  corpus <- tm_map(corpus, stemDocument)                       # stem the documents
  corpus <- tm_map(corpus, PlainTextDocument)                  # keep plain text format
  corpus                                                       # return the processed corpus
}
corpusData <- build_corpus(combine.Data)

rm(combine.Data)

#Run garbage collection to reclaim memory
gc()
##           used (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 1656129 88.5    2572361 137.4  2637877 140.9
## Vcells 3498738 26.7    6851928  52.3  6007196  45.9

Exploratory Analysis

The frequencies of the words from the blogs, news, and Twitter datasets are summarised in the word cloud and three bar charts plotted below. Only the top 10 most frequent words are shown for each dataset.
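A hedged sketch of how the word frequencies behind these plots can be obtained from the corpus (assuming the corpusData object built above) is shown below; the exact plotting code used in this report is omitted.

# A minimal sketch of counting word frequencies with the tm package
library(tm)
tdm  <- TermDocumentMatrix(corpusData)            # terms x documents counts
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 10)                                    # ten most frequent words
barplot(head(freq, 10), las = 2, main = "Top 10 words in the sample")
wordcloud::wordcloud(names(freq), freq, max.words = 100)  # requires the wordcloud package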

It is noted that stop words were removed after punctuation had been removed; this has caused words like "dont" and "cant" to remain in the dataset. Moving forward, an adjustment will be made to remove stop words before removing punctuation.

The Plan

The data has been cleaned and combined for further work, which will include exploring the frequencies of multiple-word phrases using n-gram analysis. A predictive model built on multiple n-grams will be used to develop a text prediction app for the final submission. The prediction will also list several of the most likely next words, ranked by probability. The predictive app will be hosted as a Shiny app and will predict the next word from an input of up to 4-5 words.
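As a hedged illustration of the planned n-gram approach (not the final implementation), the sketch below counts bigrams in a cleaned character vector and ranks candidate next words by frequency. The names textSample and predict_next are hypothetical.

# A minimal sketch of bigram counting and next-word ranking in base R (illustrative only)
words      <- unlist(strsplit(tolower(textSample), "\\s+"))   # textSample: cleaned text vector
bigrams    <- paste(head(words, -1), tail(words, -1))         # adjacent word pairs
bigramFreq <- sort(table(bigrams), decreasing = TRUE)

predict_next <- function(word, n = 8) {
  hits <- bigramFreq[startsWith(names(bigramFreq), paste0(word, " "))]
  sub("^\\S+\\s+", "", names(head(hits, n)))                  # return the candidate next words
}
predict_next("happy")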

Final note

For readability purposes, this report hides most of the R code that generated the data and plots above.

Appendix

Corpus Inspection

The text mining (tm) package will be used to help with the collection and analysis of the data. A Corpus represents a collection of text documents. A corpus is an abstract concept, and several implementations can exist in parallel. A VCorpus, or Volatile Corpus, is held fully in memory; it is denoted as volatile because once the R object is destroyed, the whole corpus is gone.
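The output below appears to come from inspecting a small subset of the corpus; a minimal sketch of producing such output, assuming the corpusData object built earlier, is:

inspect(corpusData[1:3])   # print metadata and character counts for three documents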

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 82
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 45
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 78

Stop Words

The following stop words were removed from the corpus dataset.
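This list matches the English stop word list shipped with the tm package, which can be printed with:

stopwords("english")   # English stop word list from the tm package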

##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"