Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, founded in London in 2008, builds technology that makes it easy for everyone to communicate and work using their mobile devices. The company is best known for its smart SwiftKey keyboard app, which learns from users as they type and makes it easier for people to type faster on their cell phones and tablets. One cornerstone of this smart keyboard is its predictive text models. When someone types:
I went to the
the keyboard presents three options for what the next word might be; for example, the three words might be gym, store, or restaurant. In this capstone we will work on understanding and building predictive text models like those used by SwiftKey.
As part of this project, we will be applying data science in the area of natural language processing. As a first step, we will familiarize ourselves with Natural Language Processing, Text Mining, and the associated tools in R. We will obtain the training dataset and explore it to determine the best approach for cleaning up and pre-processing.
The data is from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available. The files have been language filtered but may still contain some foreign text.
This training data will be the basis for most of the capstone. It is downloaded from the link below:
datafolder <- "data"
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
fname <- "Coursera-SwiftKey.zip"
fpath <- paste(datafolder, fname, sep="/")
# Create the data folder and download the archive only if they are not already present
if (!dir.exists(datafolder)){
  dir.create(datafolder)
}
if (!file.exists(fpath)){
  download.file(url, destfile=fpath, method="curl")
}
unzip(zipfile=fpath, exdir=datafolder)
Since we're interested in the English text files, we filter down to those with the 'en_' prefix. Below is the output of our preliminary analysis, where we calculate the size of each file, its number of lines, the line number of its longest line, and its total word count:
flist <- list.files(path=datafolder, recursive=T, pattern=".*en_.*.txt")
l <- lapply(paste(datafolder, flist, sep="/"), function(f) {
  fsize <- file.info(f)$size/1024/1024                     # file size in MB
  con <- file(f, open="r")
  lines <- readLines(con)
  nchars <- nchar(lines)                                   # characters in each line
  maxchars <- which.max(nchars)                            # line number of the longest line
  nwords <- sum(sapply(strsplit(lines, "\\s+"), length))   # total word count
  close(con)
  return(c(f, format(round(fsize, 2), nsmall=2), length(lines), maxchars, nwords))
})
## Warning in readLines(con): line 167155 appears to contain an embedded nul
## Warning in readLines(con): line 268547 appears to contain an embedded nul
## Warning in readLines(con): line 1274086 appears to contain an embedded nul
## Warning in readLines(con): line 1759032 appears to contain an embedded nul
df <- data.frame(matrix(unlist(l), nrow=length(l), byrow=T))
colnames(df) <- c("file", "size(MB)", "num.of.lines", "longest.line", "num.of.words")
df
##                                  file size(MB) num.of.lines longest.line num.of.words
## 1    data/final/en_US/en_US.blogs.txt   200.42       899288       483415     37334131
## 2     data/final/en_US/en_US.news.txt   196.28      1010242       123628     34372530
## 3  data/final/en_US/en_US.twitter.txt   159.36      2360148           26     30373543
To work with this large amount of data efficiently, we take a random sample of 10% of the lines in each data file and perform cleanup and preprocessing:
set.seed(4321)   # fix the random seed so the sampling is reproducible
# Sample 10% of the blog lines
blog_data <- file(paste(datafolder, flist, sep="/")[1], open="r")
blog_lines <- readLines(blog_data)
num_blog_lines <- length(blog_lines)
blog_sample <- blog_lines[sample(1:num_blog_lines, num_blog_lines * 0.1, replace=FALSE)]
close(blog_data)
news_data <- file(paste(datafolder, flist, sep="/")[2], open="r")
news_lines <- readLines(news_data)
num_news_lines <- length(news_lines)
news_sample <- news_lines[sample(1:num_news_lines, num_news_lines * 0.1, replace=FALSE)]
close(news_data)
twit_data <- file(paste(datafolder, flist, sep="/")[3], open="r")
twit_lines <- readLines(twit_data)
## Warning in readLines(twit_data): line 167155 appears to contain an
## embedded nul
## Warning in readLines(twit_data): line 268547 appears to contain an
## embedded nul
## Warning in readLines(twit_data): line 1274086 appears to contain an
## embedded nul
## Warning in readLines(twit_data): line 1759032 appears to contain an
## embedded nul
num_twit_lines <- length(twit_lines)
twit_sample <- twit_lines[sample(1:num_twit_lines, num_twit_lines * 0.1, replace=FALSE)]
close(twit_data)
Below we plot the total number of lines in the blog, news, and Twitter files:
library(ggplot2)
numlines <- c(length(blog_lines),length(news_lines),length(twit_lines))
numlines <- data.frame(numlines)
numlines$names <- c("blogs","news","twitter")
ggplot(numlines, aes(x=names, y=numlines)) +
  geom_bar(stat='identity', color='blue') +
  xlab('File source') +
  ylab('Total No. of Lines') +
  ggtitle('Total Line Count per File Source')
The plot shows that the blog data has the fewest lines while the Twitter data has the most.
print(paste("Blog data sample no. of lines:", length(blog_sample), sep=" "))
## [1] "Blog data sample no. of lines: 89928"
print(paste("News data sample no. of lines:", length(news_sample), sep=" "))
## [1] "News data sample no. of lines: 101024"
print(paste("Twitter data sample no. of lines:", length(twit_sample), sep=" "))
## [1] "Twitter data sample no. of lines: 236014"
For our analysis, we are interested in the most commonly occurring phrases. To aid in this analysis, we have to strip non-alphanumeric characters (special characters) as well as profane words from the corpus data.
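A minimal sketch of such a cleanup step is shown below, assuming we use the 'tm' package (introduced in the next step) and a hypothetical word list 'profanity.txt'; the exact transformations may change as we refine the preprocessing:
library(tm)
# Sketch of the planned cleanup; 'profanity.txt' is a placeholder for whatever
# profane-word list we end up using (one word per line).
clean_corpus <- function(corpus, profanity_file = "profanity.txt") {
  corpus <- tm_map(corpus, content_transformer(tolower))   # lower-case all text
  corpus <- tm_map(corpus, removePunctuation)              # strip special characters
  corpus <- tm_map(corpus, removeNumbers)                  # strip digits
  if (file.exists(profanity_file)) {
    profanity <- readLines(profanity_file)
    corpus <- tm_map(corpus, removeWords, profanity)       # filter profane words
  }
  tm_map(corpus, stripWhitespace)                          # collapse repeated whitespace
}
Here removePunctuation handles the special characters mentioned above, while removeWords handles the profanity filtering.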
After doing further research on how to work with token combinations (n-grams) in R, we found that many recommendations point to the 'tm' package, which, together with the 'RWeka' tokenizer, can help us find n-grams and count their frequencies. This is a good starting point for building our predictive model.
The example below shows how we can analyze trigrams (phrases consisting of three tokens). The same approach applies to 1-gram and 2-gram phrases; a parameterized version is sketched after the trigram output:
library("tm")
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
library("RWeka")
sc <- Corpus(VectorSource(list(blog_sample)))   # treat the blog sample as a single document
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
termdocmatrix <- TermDocumentMatrix(sc, control = list(tokenize = TrigramTokenizer))
tdf <- as.data.frame(as.matrix(termdocmatrix))  # convert the term-document matrix to a data frame
names(tdf) <- c("blog_data")
tdf$Freq <- rowSums(tdf)
tdf <- tdf[order(-tdf$Freq),]                   # sort trigrams by descending frequency
head(tdf)
## blog_data Freq
## one of the 1443 1443
## i don t 1247 1247
## a lot of 1211 1211
## some of the 699 699
## the end of 677 677
## to be a 676 676
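As noted above, the same approach applies to unigrams and bigrams; a parameterized sketch (the helper name 'ngram_freq' is our own) might look like this:
# Generalized n-gram frequency count for the corpus built above
ngram_freq <- function(corpus, n) {
  tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer))
  sort(rowSums(as.matrix(tdm)), decreasing = TRUE)   # named vector of counts
}
unigram_freq <- ngram_freq(sc, 1)   # 1-gram frequencies
bigram_freq  <- ngram_freq(sc, 2)   # 2-gram frequencies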
For our predictive model, we plan to use the text mining package to build a model based on the frequency of 1-gram to 3-gram phrases. The application will allow the user to enter text. As the input is provided word by word, it will be evaluated against our predictive model, which will determine which word or set of words (we will probably show up to 5) is most likely to be the next token the user will type, based on the frequency of n-grams. The application will respond to the user's input by displaying word suggestions.
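As a rough illustration (not a final design), the sketch below shows how such a lookup could work against a trigram frequency table like the one built above; the function and variable names are our own placeholders:
# Illustrative next-word lookup against trigram counts.
# 'trigram_freq' is assumed to be a named numeric vector of trigram counts,
# e.g. setNames(tdf$Freq, rownames(tdf)) from the table built above.
suggest_next <- function(input, trigram_freq, n_suggestions = 5) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  if (length(words) < 2) return(character(0))
  prefix <- paste(tail(words, 2), collapse = " ")   # last two words typed
  matches <- trigram_freq[startsWith(names(trigram_freq), paste0(prefix, " "))]
  if (length(matches) == 0) return(character(0))
  matches <- sort(matches, decreasing = TRUE)
  # return the third word of the top-ranked matching trigrams
  sapply(strsplit(names(head(matches, n_suggestions)), " "), `[`, 3)
}
suggest_next("I went to the", setNames(tdf$Freq, rownames(tdf)))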
The following are considerations for our model: