The goal of the Data Science Capstone project is to create a Shiny application that uses a natural language processing (NLP) algorithm to predict the next word in a phrase the user inputs. This report describes my exploratory analysis of the data provided for the project and my plans for developing the prediction algorithm.

Exploratory Data Analysis

The first step I took was to write code that downloaded the data set from the web (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) and unzipped the files. While there are directories for several languages, I am using the three text files from the US English directory: blogs, news, and twitter.
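The code below is a minimal sketch of this step; the file paths inside the zip archive are assumptions based on the standard layout of the Coursera-SwiftKey download, not necessarily the exact code used.

```r
# Download and unzip the data set (paths inside the zip are assumed)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}

# Read the three US English files
blogs  <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news   <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
tweets <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```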

All three files are quite large (see Table 1), so for my exploratory data analysis I randomly sampled 5% of the lines from the three texts, which resulted in over 200,000 lines of text.
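A sketch of the sampling step is shown below; the seed and the helper function are illustrative rather than the exact code used.

```r
# Randomly keep roughly 5% of the lines from each file
set.seed(1234)
sample_lines <- function(x, p = 0.05) x[rbinom(length(x), size = 1, prob = p) == 1]

blogs_s  <- sample_lines(blogs)
news_s   <- sample_lines(news)
tweets_s <- sample_lines(tweets)

length(blogs_s) + length(news_s) + length(tweets_s)  # roughly 200,000 lines in total
```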

##     file number.of.lines number.of.characters
## 1   news         1010242            203791405
## 2  blogs          899288            208361438
## 3 tweets         2360148            162384825

Table 1. File names with associated numbers of lines and characters.

Looking through the files, I noticed a number of non-English characters. While these characters occurred fairly often, the number of unique characters was small, and oftentimes they were used in place of apostrophes, e.g. in contractions such as "won't" and "I'm" written with a non-ASCII character instead of the apostrophe. Since there were only a few such characters, I wrote a function to remove them from the corpus.
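One possible implementation of such a function is sketched below; it maps apostrophe-like characters to a plain apostrophe and drops any remaining non-ASCII characters, which may differ in detail from the function actually used.

```r
# Replace apostrophe-like characters, then strip remaining non-ASCII characters
remove_non_english <- function(x) {
  x <- gsub("\u2019|\u00b4|\u0060", "'", x)          # curly quote, acute accent, backtick -> '
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")   # drop anything still outside ASCII
}

blogs_s  <- remove_non_english(blogs_s)
news_s   <- remove_non_english(news_s)
tweets_s <- remove_non_english(tweets_s)
```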

Text can sometimes contain anomalies, such as a backslash, that can cause problems when analyzed in R. To remove such anomalies and extraneous whitespace, I used the scrubber function in the qdap package. To further clean the data, I converted all characters to lowercase and removed numbers, punctuation, and stopwords. Punctuation is potentially useful for prediction, for example when a phrase is a question or an exclamation, but I decided to remove it for the exploratory analysis since, at this point, I am more concerned with the words and their frequencies. Stopwords are common words such as "my", "I", and "very". When I create my model I plan to test it both with and without stopwords, since having these words in the corpus may be important when predicting the next word in a phrase. Finally, I stemmed the words, i.e. reduced them to their root forms. This is another choice I will test when I get to the modeling stage.
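The sketch below shows one way to carry out these cleaning steps with the tm package (plus qdap's scrubber and SnowballC for stemming); the report does not spell out the exact pipeline, so the corpus construction and the ordering of steps are assumptions.

```r
library(tm)
library(qdap)
library(SnowballC)

# Keep each sampled file as one document so the later DTM has three rows
docs   <- c(blogs  = paste(blogs_s,  collapse = " "),
            news   = paste(news_s,   collapse = " "),
            tweets = paste(tweets_s, collapse = " "))
corpus <- VCorpus(VectorSource(docs))

corpus <- tm_map(corpus, content_transformer(scrubber))      # remove anomalies and extra whitespace
corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase
corpus <- tm_map(corpus, removeNumbers)                      # drop numbers
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stopwords
corpus <- tm_map(corpus, stemDocument)                       # stem to root forms
corpus <- tm_map(corpus, stripWhitespace)
```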

With cleaned data, I created a document-term matrix (DTM), which counts the number of times each word appears in each of the three documents and arranges the results in a matrix with the documents as rows and the words as columns. I summed the columns and then ordered the words by frequency. This allowed me to create lists and histograms of the most frequent words. For example, Table 2 lists the words that occur at least 3000 times in the corpus, and Figure 1 shows a histogram of the words that occurred at least 6000 times.
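A minimal sketch of this step, assuming the tm corpus built above:

```r
# Document-term matrix: 3 documents (rows) x terms (columns)
dtm <- DocumentTermMatrix(corpus)

# Total frequency of each word across the corpus, ordered from most to least frequent
word_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

names(word_freq[word_freq >= 3000])   # word list shown in Table 2
word_freq[word_freq >= 6000]          # frequencies plotted in Figure 1
```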

##   [1] "anoth"  "around" "ask"    "back"   "best"   "better" "big"   
##   [8] "book"   "call"   "can"    "cant"   "citi"   "come"   "day"   
##  [15] "didnt"  "dont"   "end"    "even"   "everi"  "famili" "feel"  
##  [22] "find"   "first"  "follow" "friend" "game"   "get"    "give"  
##  [29] "good"   "got"    "great"  "happi"  "help"   "home"   "hope"  
##  [36] "hous"   "ive"    "just"   "keep"   "know"   "last"   "let"   
##  [43] "life"   "like"   "littl"  "live"   "lol"    "long"   "look"  
##  [50] "lot"    "love"   "made"   "make"   "man"    "mani"   "may"   
##  [57] "month"  "much"   "need"   "never"  "new"    "next"   "night" 
##  [64] "now"    "one"    "peopl"  "place"  "play"   "point"  "put"   
##  [71] "realli" "right"  "run"    "said"   "say"    "school" "see"   
##  [78] "show"   "sinc"   "someth" "start"  "state"  "still"  "take"  
##  [85] "talk"   "team"   "thank"  "that"   "thing"  "think"  "time"  
##  [92] "today"  "tri"    "two"    "use"    "want"   "watch"  "way"   
##  [99] "week"   "well"   "will"   "work"   "world"  "year"   "your"

Table 2. Words occurring at least 3000 times.

Figure 1. Words occurring at least 6000 times.

Figure 2 is a correlation plot of the 14 words that occurred at least 5000 times and were highly correlated with one another. Lines are drawn between words that have a correlation of at least 0.5.

Figure 2. Correlation map of 14 words occurring at least 5000 times with a correlation of at least 0.5.
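A plot like Figure 2 can be produced with tm's plot method for term matrices (which requires the Rgraphviz package); the call below is a sketch, not necessarily the code used for the figure.

```r
# Draw edges between frequent terms whose correlation is at least 0.5
tdm <- TermDocumentMatrix(corpus)
plot(tdm, terms = findFreqTerms(tdm, lowfreq = 5000), corThreshold = 0.5)
```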

Model Development

To build the predictive model, I took the cleaned data set and created lists of bigrams and trigrams using the NGramTokenizer function from the RWeka package. An n-gram is a sequence of n words that appear together; bigrams are pairs of words and trigrams are sequences of three words. The next step will be to take these lists and create document-term matrices analogous to the DTM created for individual words.
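The sketch below shows how the bigram and trigram DTMs might be built with RWeka tokenizers; the control settings are assumptions.

```r
library(RWeka)

# Tokenizers that split the text into 2-word and 3-word sequences
bigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# N-gram document-term matrices, analogous to the single-word DTM
bigram_dtm  <- DocumentTermMatrix(corpus, control = list(tokenize = bigram_tokenizer))
trigram_dtm <- DocumentTermMatrix(corpus, control = list(tokenize = trigram_tokenizer))
```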

With the DTMs created, I will need to shrink their size by removing sparse terms, and then use a smoothing method to estimate frequencies for n-grams that are not on the list. I plan to use Good-Turing smoothing to estimate the probability of unseen n-grams. I will use this final DTM to predict the "next word" in my Shiny application.
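As an illustration of this plan, the sketch below prunes sparse terms with tm's removeSparseTerms and applies a simple Good-Turing count adjustment; the thresholds and implementation details are assumptions, not the final model code.

```r
# Keep bigrams that appear in at least two of the three documents (threshold assumed)
bigram_dtm_small <- removeSparseTerms(bigram_dtm, sparse = 0.66)

# Simple Good-Turing adjustment: c* = (c + 1) * N_{c+1} / N_c,
# with N_1 / N reserved as the probability mass for unseen n-grams
good_turing <- function(counts) {
  N  <- sum(counts)      # total number of n-gram tokens
  Nc <- table(counts)    # frequency of frequencies, N_c
  adjusted <- sapply(counts, function(c) {
    Nc_next <- Nc[as.character(c + 1)]
    if (is.na(Nc_next)) c else (c + 1) * as.numeric(Nc_next) / as.numeric(Nc[as.character(c)])
  })
  list(adjusted = adjusted, p_unseen = as.numeric(Nc["1"]) / N)
}

gt_bigrams <- good_turing(colSums(as.matrix(bigram_dtm_small)))
```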