Executive Summary

The purpose of this document is to demonstrate progress on the SwiftKey-based word prediction software development exercise for the Coursera Data Science Capstone course.

This report is not written for a general audience, but rather for managers of data scientists and software engineers.

The report shows:

  1. How the data were read, together with a preliminary exploratory analysis
  2. Some high-level statistics about the nature of the data
  3. How a first cut at creation of data structures might work
  4. A preliminary design approach for our algorithm
  5. Considerations to address in the construction of the application

Preliminary Exploratory Analysis

We first establish our working directory, then use the readLines function (base R) to load each file and take a high-level look at the data.

For the initial high-level analysis, we also compute word counts. For that exercise, we simply assert that any run of characters separated by whitespace is a word; that definition will be refined when we study the actual word composition. We also include timing measurements throughout to evaluate the performance of the various libraries.

Read and Analyze Data

library(readr)
library(stringi)

#twitter first
connection = file("raw/en_US.twitter.txt", encoding = 'UTF-8')
start = Sys.time()
twitter = readLines(connection,skipNul=TRUE)
close(connection)
Sys.time() - start
## Time difference of 1.424291 mins
length(twitter)
## [1] 2360148
head(twitter,3)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
# count words per line
start = Sys.time()
countst = sapply(gregexpr("\\S+", twitter), length)
Sys.time() - start
## Time difference of 2.654318 mins
wordst = round(sum(countst)/10e5,1)    # total words, in millions
# for comparison, count words with stringi's LaTeX-style statistics
wordst1 = round(stri_stats_latex(twitter)["Words"]/10e5,1)
linest = round(length(twitter)/1000,1) # line count, in thousands
summary(countst)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.87   18.00   47.00
#blogs
connection = file("raw/en_US.blogs.txt", encoding = 'UTF-8')
blogs = readLines(connection, skipNul = TRUE)
close(connection)
linesb = round(length(blogs)/1000,1)
# count words per line
countsb = sapply(gregexpr("\\S+", blogs), length)
wordsb= round(sum(countsb)/10e5,1)
length(blogs)
## [1] 899288
head(blogs,3)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan <U+0093>gods<U+0094>."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
summary(countsb)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    9.00   28.00   41.52   59.00 6630.00

We repeat the same processing for the news file (code not shown) and then generate a summary table for all three sources.

## [1] 1010242
## [1] "He wasn't home alone, apparently."                                                                                                                                                
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                        
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   31.00   34.02   45.00 1792.00

Preliminary summary analysis shows that the Twitter file has far fewer words per line than the other two files, due to the character limit on Twitter messages. However, the overall word count is similar across all three files.

Let’s take a deeper look at the file sizes, word counts, and word densities.
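The table can be assembled directly from the quantities computed above. A minimal sketch of how it might be built is shown below; the news-file quantities (countsn, wordsn, linesn) are assumed to have been computed in the hidden chunk, and the news file path is assumed to mirror the other two.

files = c("raw/en_US.twitter.txt", "raw/en_US.blogs.txt", "raw/en_US.news.txt")
summaryTable = data.frame(
    File = c("Twitter", "Blogs", "News"),
    SizeMB = round(file.info(files)$size / 2^20, 1),          # size on disk, in MB
    LinesK = c(linest, linesb, linesn),                       # lines, in thousands
    WordsM = c(wordst, wordsb, wordsn),                       # words, in millions
    WordsPerLine = round(c(mean(countst), mean(countsb), mean(countsn)), 1))
summaryTable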

Note that the file sizes are roughly the same, but the word counts and densities are quite different in the Twitter file vs. the other two files.

N-gram Analysis

We explore the use of n-grams to capture word groups for the next phase of the analysis, comparing the tm and ngram libraries.

Due to the sheer data volume, we reduce the initial datasets to a more manageable size via a 5% random sample.
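For illustration, the 5% sample could be drawn along the following lines; the seed value and the news object name are assumptions, as the actual sampling code is not shown.

set.seed(1234)   # illustrative seed for reproducibility
sampleLines = function(txt, rate = 0.05) {
    # keep roughly `rate` of the lines, chosen at random
    txt[as.logical(rbinom(length(txt), 1, rate))]
}
sampleTwitter = sampleLines(twitter)
sampleBlogs   = sampleLines(blogs)
sampleNews    = sampleLines(news)   # news assumed loaded in the hidden chunk above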

Next, we generate the sample set of data and remove extraneous characters: punctuation, digits, and non-ASCII characters are stripped, the text is converted to lower case, and extra whitespace is collapsed.

The function that does this is in Appendix I.
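For example, the sampled datasets could then be passed through that function before n-gram generation (the sample object names are the illustrative ones used above):

sampleTwitter = cleanText(sampleTwitter)
sampleBlogs   = cleanText(sampleBlogs)
sampleNews    = cleanText(sampleNews)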


It is important to compare the performance of the two n-gram libraries. The ngram library is very simple and does not offer many data-cleaning capabilities. However, since we have already developed a simple data-cleaning function using grep functionality, tm's built-in cleaning is not a major advantage. Instead, we evaluate the libraries based on performance.

We develop a simple constructor function to create an ngram corpus and compare it to the tm/RWeka libraries.

For performance testing, we include only the first 10k records.
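The comparison code itself is not shown. A minimal sketch of how the two approaches might be timed on the 10k-record subset, using the helper functions from Appendix I (the sample object name and its word counts are illustrative):

library(ngram)
library(tm)
library(RWeka)

small  = sampleTwitter[1:10000]                    # first 10k records of the cleaned sample
counts = sapply(gregexpr("\\S+", small), length)   # words per line

start = Sys.time()
ng2 = makeNgrams(small, counts, 2)                 # ngram-library bigrams (Appendix I)
Sys.time() - start

start = Sys.time()
tdm2 = tdm.generate(small, 2)                      # tm/RWeka bigram TDM (Appendix I)
Sys.time() - start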

## An ngram object with 71449 2-grams
## <<TermDocumentMatrix (terms: 2, documents: 10000)>>
## Non-/sparse entries: 20000/0
## Sparsity           : 0%
## Maximal term length: 4
## Weighting          : term frequency (tf)

The results of the comparison showed that the ngram library is almost two orders of magnitude faster.

We proceed with further analysis using only the ngram library; the code for this processing is not shown.
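The frequency tables used below (e.g. ngramst1 through ngramst4 for Twitter) are plain data frames with ngrams and freq columns, as returned by the ngram library's get.phrasetable function. A minimal sketch for one of them, assuming the illustrative sample objects above:

countss  = sapply(gregexpr("\\S+", sampleTwitter), length)          # words per sampled line
ngramst2 = get.phrasetable(makeNgrams(sampleTwitter, countss, 2))   # bigram table: ngrams, freq, prop
head(ngramst2, 3)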

Next, we plot histograms of the generated n-grams with a generalized ggplot function.

library(ggplot2)
library(gridExtra)

generateHistPlot = function (df, n, title) {
    # plot the n most frequent n-grams as a horizontal bar chart
    tmp = df[1:n, ]
    tmp$Phrases = reorder(tmp$ngrams, tmp$freq)   # order bars by frequency
    plt = ggplot(tmp, aes(Phrases, y = freq)) +
        ylab("Frequency") +
        geom_bar(stat = "identity") +
        coord_flip() +
        labs(title = title)
    plt
}

howMany = 25
gtw1 = generateHistPlot(ngramst1,howMany,"Twitter Words")
gtw2 = generateHistPlot(ngramst2,howMany,"Twitter Bigrams")
gtw3 = generateHistPlot(ngramst3,howMany,"Twitter Trigrams")
gtw4 = generateHistPlot(ngramst4,howMany,"Twitter Quadgrams")


grid.arrange(gtw1,gtw2,ncol=2)

grid.arrange(gtw3,gtw4,ncol=2)

gn1 = generateHistPlot(ngramsn1,howMany,"News Words")
gn2 = generateHistPlot(ngramsn2,howMany,"News Bigrams")
gn3 = generateHistPlot(ngramsn3,howMany,"News Trigrams")
gn4 = generateHistPlot(ngramsn4,howMany,"News Quadgrams")


grid.arrange(gn1,gn2,ncol=2)

grid.arrange(gn3,gn4,ncol=2)

gb1 = generateHistPlot(ngramsb1,howMany,"Blog Words")
gb2 = generateHistPlot(ngramsb2,howMany,"Blog Bigrams")
gb3 = generateHistPlot(ngramsb3,howMany,"Blog Trigrams")
gb4 = generateHistPlot(ngramsb4,howMany,"Blog Quadgrams")


grid.arrange(gb1,gb2,ncol=2)

grid.arrange(gb3,gb4,ncol=2)

gall1 = generateHistPlot(ngramsAll1,howMany,"All Words")
gall2 = generateHistPlot(ngramsAll2,howMany,"All Bigrams")
gall3 = generateHistPlot(ngramsAll3,howMany,"All Trigrams")
gall4 = generateHistPlot(ngramsAll4,howMany,"All Quadgrams")

grid.arrange(gall1,gall2,ncol=2)

grid.arrange(gall3,gall4,ncol=2)

The histograms seem to illustrate a reasonable sampling of ngrams from the combined data sources.

There was one major issue: an outlier record in the blogs file. The record contains, in part, the following:

“…Do yourself a favour. Start banging on about vested interests all week. Speak of nothing else. For god’s sake, give us an enemy that the public can identify with. Vested interests. Vested interests. Vested interests. Vested interests. Vested interests. Vested interests. Vested interests. Vested interests. Vested interests. Vested interests. Vested interests. Vested interests. Vested interests. Vested interests. Vested interests. Vested interests. Vested interests. Vested interests. Vested interests. Vested interests. …”

This single record skews the n-gram counts dramatically, so it was eliminated before these data structures were generated.
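One simple way such a record could be screened out is by its extreme word count. A sketch follows; the cutoff value is purely illustrative.

cutoff = 2000                        # illustrative word-count threshold
blogs  = blogs[countsb <= cutoff]    # drop extreme blog lines before sampling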

Predictive Algorithm Design

The design of the prediction algorithm does not make any assumptions at this point about memory restrictions or processing speed. Whether it will be memory-based or disk-based will be decided in later analysis.

Our overall algorithm design will be:

The algorithm will need an exception process to handle the case where no record is returned at all. That is TBD.

Further analysis should be done as to how to represent these data structures. Possible implementations might be either indexed disk-based (SQL or file lookup) or in-memory tables.
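As one illustration of the in-memory option, the n-gram frequency tables already built above could be keyed by their leading words and queried with a simple back-off: try the longest matching prefix first, then fall back to shorter n-grams. The sketch below is illustrative only; the function and object names are not part of the final design.

# split each phrase table into a prefix (all but the last word) and the predicted word
makeLookup = function(phrasetable) {
    words = strsplit(trimws(phrasetable$ngrams), " ")
    data.frame(
        prefix   = sapply(words, function(w) paste(head(w, -1), collapse = " ")),
        nextWord = sapply(words, function(w) tail(w, 1)),
        freq     = phrasetable$freq,
        stringsAsFactors = FALSE)
}

# lookup tables ordered from longest to shortest n-gram
lookups = list(makeLookup(ngramsAll4), makeLookup(ngramsAll3), makeLookup(ngramsAll2))

# back off from quadgrams to bigrams until a match is found
predictNext = function(phrase, lookups) {
    words = strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
    for (i in seq_along(lookups)) {
        n = 5 - i                                            # 4-, 3-, then 2-gram table
        prefix = paste(tail(words, n - 1), collapse = " ")
        hits = lookups[[i]][lookups[[i]]$prefix == prefix, ]
        if (nrow(hits) > 0) return(hits$nextWord[which.max(hits$freq)])
    }
    NA   # no match found; the exception process is TBD, as noted above
}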

Appendix I - Supporting Functions

# removes extraneous characters from text files
cleanText = function (txt) {
    # replace non-ASCII characters with byte codes
    txt = iconv(txt, 'UTF-8', 'ASCII', "byte")
    # convert to lower case
    txt = tolower(txt)
    # remove punctuation
    txt = gsub("[^[:^punct:]]", "", txt, perl = TRUE)
    # remove numbers
    txt = gsub('[[:digit:]]+', '', txt)
    # collapse contiguous whitespace to a single space
    txt = gsub("\\s+", ' ', txt)
    # trim leading/trailing whitespace
    txt = trimws(txt)
    txt
}
# sample call
twitter = cleanText(twitter)

# wrapper around ngram() that filters lines by word count

makeNgrams = function (str, wCnt, size) {
    # minimum words per line: at least 2, even for unigrams
    siz = size
    if (size == 1) {
        siz = 2
    }
    # keep only lines with enough words, then build the ngram object
    s = str[which(wCnt >= siz)]
    ngram(s, n = size)
}

# Wrapper function for tm TermDocumentMatrix

tdm.generate = function(string, ng){
    # create a corpus for tm processing
    corpus = Corpus(VectorSource(string))
    # tokenizer that produces n-grams of length ng
    BigramTokenizer = function(x) NGramTokenizer(
        x,
        Weka_control(min = ng, max = ng))
    # create the term-document matrix from the n-grams
    tdm = TermDocumentMatrix(
        corpus,
        control = list(tokenize = BigramTokenizer))
    tdm
}

References

ngram library - https://cran.r-project.org/web/packages/ngram/ngram.pdf

Discussion of Markov Chains - https://golang.org/doc/codewalk/markov/

Overview of NLP - https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

An ngram example project - http://amunategui.github.io/speak-like-a-doctor/