Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, founded in London in 2008, builds technology that makes it easy for everyone to communicate and work using their mobile devices. The company is best known for its smart SwiftKey keyboard app, which learns from users as they type and makes it easier for people to type faster on their cell phones and tablets. One cornerstone of this smart keyboard is its predictive text models. When someone types:
I went to the
the keyboard presents three options for what the next word might be; for example, the three words might be gym, store, or restaurant. In this capstone we will work on understanding and building predictive text models like those used by SwiftKey.
As part of this project, we will be applying data science in the area of natural language processing. As a first step, we will familiarize ourselves with Natural Language Processing, Text Mining, and the associated tools in R. We will obtain the training dataset and explore it to determine the best approach for cleaning up and pre-processing.
The data is from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available. The files have been language filtered but may still contain some foreign text.
This training data will be the basis for most of the capstone. It is downloaded from the link below:
datafolder <- "data"
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
fname <- "Coursera-SwiftKey.zip"
fpath <- paste(datafolder, fname, sep="/")
# Create the data folder and download the archive only if they are not already present
if (!dir.exists(datafolder)){
  dir.create(datafolder)
}
if (!file.exists(fpath)){
  download.file(url, destfile=fpath, method="curl")
}
unzip(zipfile=fpath, exdir=datafolder)
Since we're interested in the English text files, we filter down to those with the 'en_' prefix. Below is the output of our preliminary analysis, where we calculate the size of each file, its number of lines, the line number of its longest line, and its total word count:
flist <- list.files(path=datafolder, recursive=T, pattern=".*en_.*.txt")
l <- lapply(paste(datafolder, flist, sep="/"), function(f) {
  fsize <- file.info(f)$size/1024/1024                     # file size in MB
  con <- file(f, open="r")
  lines <- readLines(con)
  nchars <- nchar(lines)                                   # characters in each line
  maxchars <- which.max(nchars)                            # line number of the longest line
  nwords <- sum(sapply(strsplit(lines, "\\s+"), length))   # total word count
  close(con)
  return(c(f, format(round(fsize, 2), nsmall=2), length(lines), maxchars, nwords))
})
## Warning in readLines(con): line 167155 appears to contain an embedded nul
## Warning in readLines(con): line 268547 appears to contain an embedded nul
## Warning in readLines(con): line 1274086 appears to contain an embedded nul
## Warning in readLines(con): line 1759032 appears to contain an embedded nul
df <- data.frame(matrix(unlist(l), nrow=length(l), byrow=T))
colnames(df) <- c("file", "size(MB)", "num.of.lines", "longest.line", "num.of.words")
df
##                                  file size(MB) num.of.lines longest.line num.of.words
## 1    data/final/en_US/en_US.blogs.txt   200.42       899288       483415     37334131
## 2     data/final/en_US/en_US.news.txt   196.28      1010242       123628     34372530
## 3  data/final/en_US/en_US.twitter.txt   159.36      2360148           26     30373543
To work with this large amount of data efficiently, we take a random sample of 10% of the lines in each data file and perform cleanup and preprocessing:
set.seed(4321)   # fix the random seed so the sampling is reproducible
# Sample 10% of the blog lines
blog_data <- file(paste(datafolder, flist, sep="/")[1], open="r")
blog_lines <- readLines(blog_data)
num_blog_lines <- length(blog_lines)
blog_sample <- blog_lines[sample(1:num_blog_lines, num_blog_lines * 0.1, replace=FALSE)]
close(blog_data)
news_data <- file(paste(datafolder, flist, sep="/")[2], open="r")
news_lines <- readLines(news_data)
num_news_lines <- length(news_lines)
news_sample <- news_lines[sample(1:num_news_lines, num_news_lines * 0.1, replace=FALSE)]
close(news_data)
twit_data <- file(paste(datafolder, flist, sep="/")[3], open="r")
twit_lines <- readLines(twit_data)
## Warning in readLines(twit_data): line 167155 appears to contain an
## embedded nul
## Warning in readLines(twit_data): line 268547 appears to contain an
## embedded nul
## Warning in readLines(twit_data): line 1274086 appears to contain an
## embedded nul
## Warning in readLines(twit_data): line 1759032 appears to contain an
## embedded nul
num_twit_lines <- length(twit_lines)
twit_sample <- twit_lines[sample(1:num_twit_lines, num_twit_lines * 0.1, replace=FALSE)]
close(twit_data)
Below we plot the total number of lines in the blog, news, and Twitter files:
library(ggplot2)
numlines <- c(length(blog_lines),length(news_lines),length(twit_lines))
numlines <- data.frame(numlines)
numlines$names <- c("blogs","news","twitter")
ggplot(numlines, aes(x=names, y=numlines)) +
  geom_bar(stat='identity', color='blue') +
  xlab('File source') +
  ylab('Total No. of Lines') +
  ggtitle('Total Line Count per File Source')
The plot shows that the blog data has the fewest lines while the Twitter data has the most.
print(paste("Blog data sample no. of lines:", length(blog_sample), sep=" "))
## [1] "Blog data sample no. of lines: 89928"
print(paste("News data sample no. of lines:", length(news_sample), sep=" "))
## [1] "News data sample no. of lines: 101024"
print(paste("Twitter data sample no. of lines:", length(twit_sample), sep=" "))
## [1] "Twitter data sample no. of lines: 236014"
For our analysis, we are interested in the most commonly occurring phrases. To aid in this analysis, we have to strip non-alphanumeric characters (special characters) as well as profane words from the corpus data.
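A minimal sketch of such a cleanup step is shown below, assuming we use the 'tm' package (introduced in the next step) and a hypothetical word list 'profanity.txt'; the exact transformations may change as we refine the preprocessing:
library(tm)
# Sketch of the planned cleanup; 'profanity.txt' is a placeholder for whatever
# profane-word list we end up using (one word per line).
clean_corpus <- function(corpus, profanity_file = "profanity.txt") {
  corpus <- tm_map(corpus, content_transformer(tolower))   # lower-case all text
  corpus <- tm_map(corpus, removePunctuation)              # strip special characters
  corpus <- tm_map(corpus, removeNumbers)                  # strip digits
  if (file.exists(profanity_file)) {
    profanity <- readLines(profanity_file)
    corpus <- tm_map(corpus, removeWords, profanity)       # filter profane words
  }
  tm_map(corpus, stripWhitespace)                          # collapse repeated whitespace
}
Here removePunctuation handles the special characters mentioned above, while removeWords handles the profanity filtering.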
After doing further research on how to work with token combinations (n-grams) in R, we found that many recommendations point to the 'tm' package, which, together with the 'RWeka' tokenizer, can help us find n-grams and count their frequencies. This is a good starting point for building our predictive model.
The example below shows how we can analyze trigrams (phrases consisting of three tokens). The same approach applies to 1-gram and 2-gram phrases; a parameterized version is sketched after the trigram output:
library("tm")
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
library("RWeka")
sc <- Corpus(VectorSource(list(blog_sample)))   # treat the blog sample as a single document
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
termdocmatrix <- TermDocumentMatrix(sc, control = list(tokenize = TrigramTokenizer))
tdf <- as.data.frame(as.matrix(termdocmatrix))  # convert the term-document matrix to a data frame
names(tdf) <- c("blog_data")
tdf$Freq <- rowSums(tdf)
tdf <- tdf[order(-tdf$Freq),]                   # sort trigrams by descending frequency
head(tdf)
## blog_data Freq
## one of the 1443 1443
## i don t 1247 1247
## a lot of 1211 1211
## some of the 699 699
## the end of 677 677
## to be a 676 676
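As noted above, the same approach applies to unigrams and bigrams; a parameterized sketch (the helper name 'ngram_freq' is our own) might look like this:
# Generalized n-gram frequency count for the corpus built above
ngram_freq <- function(corpus, n) {
  tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer))
  sort(rowSums(as.matrix(tdm)), decreasing = TRUE)   # named vector of counts
}
unigram_freq <- ngram_freq(sc, 1)   # 1-gram frequencies
bigram_freq  <- ngram_freq(sc, 2)   # 2-gram frequencies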
For our predictive model, we plan to use the text mining package to build a model based on the frequency of 1-gram to 3-gram phrases. The application will allow the user to enter text. As the input is provided word by word, it will be evaluated against our predictive model, which will determine which word or set of words (we will probably show up to 5) is most likely to be the next token the user will type, based on the frequency of n-grams. The application will respond to the user's input by displaying word suggestions.
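As a rough illustration (not a final design), the sketch below shows how such a lookup could work against a trigram frequency table like the one built above; the function and variable names are our own placeholders:
# Illustrative next-word lookup against trigram counts.
# 'trigram_freq' is assumed to be a named numeric vector of trigram counts,
# e.g. setNames(tdf$Freq, rownames(tdf)) from the table built above.
suggest_next <- function(input, trigram_freq, n_suggestions = 5) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  if (length(words) < 2) return(character(0))
  prefix <- paste(tail(words, 2), collapse = " ")   # last two words typed
  matches <- trigram_freq[startsWith(names(trigram_freq), paste0(prefix, " "))]
  if (length(matches) == 0) return(character(0))
  matches <- sort(matches, decreasing = TRUE)
  # return the third word of the top-ranked matching trigrams
  sapply(strsplit(names(head(matches, n_suggestions)), " "), `[`, 3)
}
suggest_next("I went to the", setNames(tdf$Freq, rownames(tdf)))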
The following are considerations for our model: