Introduction

Typing on devices without traditional keyboards can be slow and frustrating. By using machine learning to predict the next word a person will type, some of the friction of touchscreen typing can be reduced. The work presented here takes the first steps toward predicting the next word that a user will type.

First, I load the tm, NLP, and ggplot2 packages.

library(tm)
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.2.3
library(NLP)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate

Preliminary Analysis

Below is a quick summary of the files used in this project. For each file, the output table presents the file size in megabytes, the number of lines, the number of characters in the longest line, the number of words in the line with the most words, and the total word count.

files<-c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
fileData<-sapply(paste(files), function(x) {
  fileSize<-file.size(x)/1024^2
  con<-file(x, "r")
  text<-readLines(con)
  maxChar<-max(nchar(text))
  maxWord<-max(sapply(strsplit(text, "\\s+"), length))
  wordCount<-length(unlist(strsplit(text, "\\s+")))
  close(con)
  return(c(x, fileSize, length(text), maxChar, maxWord, wordCount))
})
summaryData<-as.data.frame(t(fileData)[,-1], stringsAsFactors=FALSE)
colnames(summaryData)<-c("fileSizeMB","nLines", "mostChar",
                         "mostWords","wordCount")
summaryData
##                         fileSizeMB  nLines mostChar mostWords wordCount
## en_US.blogs.txt   200.424207687378  899288    40833      6630  37334149
## en_US.news.txt    196.277512550354 1010242    11384      1792  34372814
## en_US.twitter.txt 159.364068984985 2360148      140        47  30373565

A quick look shows that the blog file is the largest, the Twitter file has the most lines, the blog file contains both the longest line by characters and the line with the most words, and the blog file has the highest total word count, roughly 3 million more words than the news file.

Dealing with the Data

First, I create functions to build bigrams, trigrams, 4-grams, and 5-grams, adapting the tokenizer from the tm FAQ: http://tm.r-forge.r-project.org/faq.html#Bigrams. For the sake of brevity, I will limit the discussion to bigrams; the higher-order versions are shown commented out.

twoGram <- function(lines) {
    unlist(lapply(ngrams(words(lines), 2), paste, collapse = " "), use.names = FALSE)}
#threeGram <- function(lines) {
#    unlist(lapply(ngrams(words(lines), 3), paste, collapse = " "), use.names = FALSE)}
#fourGram <- function(lines) {
#    unlist(lapply(ngrams(words(lines), 4), paste, collapse = " "), use.names = FALSE)}
#fiveGram <- function(lines) {
#    unlist(lapply(ngrams(words(lines), 5), paste, collapse = " "), use.names = FALSE)}

Next, to make this analysis reproducible, a seed is set. For the blog and news data, the files are read in line by line, a 1% sample is drawn, and the sample is converted into a corpus. From there, numbers and punctuation are removed, extra whitespace is stripped, and all of the text is converted to lowercase.

The TermDocumentMatrix function then collects the bigrams (the trigram and 4-gram calls are shown but commented out). Because TermDocumentMatrix defaults to keeping only terms of 3 or more characters (wordLengths = c(3, Inf)), the call was adjusted so that short words such as "I", "a", and "an" are included. The maximum word length was also capped at 15 characters; longer words are not only difficult to predict but also highly variable.

Because of the size of the Twitter data, certain adjustments were made: only 0.5% of the Twitter data is sampled, and numbers are not removed, since attempting to remove them caused very long processing times.

set.seed(111)

conBlog <- file("en_US.blogs.txt", "r") 
blog<-readLines(conBlog)
blogSample<-sample(blog,8992)
close(conBlog)
blogCorpus <- Corpus(VectorSource(blogSample), 
                     readerControl=list(reader=readPlain, language="en_US", load=TRUE))
rm(blog,blogSample)
blogCorpus <- tm_map(blogCorpus, content_transformer(removeNumbers))
blogCorpus <- tm_map(blogCorpus, content_transformer(removePunctuation))
blogCorpus <- tm_map(blogCorpus, content_transformer(tolower))
blogCorpus <- tm_map(blogCorpus, content_transformer(stripWhitespace))
blogTwoTDM <- TermDocumentMatrix(blogCorpus, control=list(tokenize=twoGram, wordLengths=c(1, 15)))
#blogThreeTDM <- TermDocumentMatrix(blogCorpus, control=list(tokenize=threeGram, wordLengths=c(1, 15)))
#blogFourTDM <- TermDocumentMatrix(blogCorpus, control=list(tokenize=fourGram, wordLengths=c(1, 15)))

conNews <- file("en_US.news.txt", "r") 
news<-readLines(conNews)
newsSample<-sample(news,10102)
close(conNews)
newsCorpus <- Corpus(VectorSource(newsSample), 
                     readerControl=list(reader=readPlain, language="en_US", load=TRUE))
rm(news,newsSample)
newsCorpus <- tm_map(newsCorpus, content_transformer(removeNumbers))
newsCorpus <- tm_map(newsCorpus, content_transformer(removePunctuation))
newsCorpus <- tm_map(newsCorpus, content_transformer(tolower))
newsCorpus <- tm_map(newsCorpus, content_transformer(stripWhitespace))
newsTwoTDM <- TermDocumentMatrix(newsCorpus, control=list(tokenize=twoGram, wordLengths=c(1, 15)))
#newsThreeTDM <- TermDocumentMatrix(newsCorpus, control=list(tokenize=threeGram, wordLengths=c(1, 15)))
#newsFourTDM <- TermDocumentMatrix(newsCorpus, control=list(tokenize=fourGram, wordLengths=c(1, 15)))

conTwitter <- file("en_US.twitter.txt", "r") 
twitter<-readLines(conTwitter)
twitterSample<-sample(twitter,11800)
close(conTwitter)
twitterCorpus <- Corpus(VectorSource(twitterSample), 
                     readerControl=list(reader=readPlain, language="en_US", load=TRUE))
rm(twitter,twitterSample)
#twitterCorpus <- tm_map(twitterCorpus, content_transformer(removeNumbers))
twitterCorpus <- tm_map(twitterCorpus, content_transformer(removePunctuation))
twitterCorpus <- tm_map(twitterCorpus, content_transformer(tolower))
twitterCorpus <- tm_map(twitterCorpus, content_transformer(stripWhitespace))
twitterTwoTDM <- TermDocumentMatrix(twitterCorpus, control=list(tokenize=twoGram, wordLengths=c(1, 15)))
#twitterThreeTDM <- TermDocumentMatrix(twitterCorpus, control=list(tokenize=threeGram, wordLengths=c(1, 15)))
#twitterFourTDM <- TermDocumentMatrix(twitterCorpus, control=list(tokenize=fourGram, wordLengths=c(1, 15)))
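
As an optional sanity check (not part of the original run), tm's findFreqTerms can list the bigrams that occur at least a given number of times before any plotting, for example:

# Illustration only: bigrams appearing at least 50 times in the blog sample
findFreqTerms(blogTwoTDM, lowfreq = 50)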

Graphing

The graphs reveal that phrases like “for the”, “in the”, “of the”, and “at the” (in other words, the beginnings of prepositional phrases) were the most common bigrams.

blog2gram<-data.frame(Names=row.names(as.matrix(blogTwoTDM)),
                     Count=rowSums(as.matrix(blogTwoTDM)))
b2g<-head(blog2gram[order(blog2gram$Count, decreasing = TRUE),],10)
ggplot(b2g, aes(x=Names, y=Count)) +
  geom_bar(stat="identity") +
  ggtitle("Frequent Bigrams in Blogs") +
  xlab("Bigram") +
  ylab("Frequency")

news2gram<-data.frame(Names=row.names(as.matrix(newsTwoTDM)),
                     Count=rowSums(as.matrix(newsTwoTDM)))
n2g<-head(news2gram[order(news2gram$Count, decreasing = TRUE),],10)
ggplot(n2g, aes(x=Names, y=Count)) +
  geom_bar(stat="identity") +
  ggtitle("Frequent Bigrams in News") +
  xlab("Bigram") +
  ylab("Frequency")

twit2gram<-data.frame(Names=row.names(as.matrix(twitterTwoTDM)),
                     Count=rowSums(as.matrix(twitterTwoTDM)))
t2g<-head(twit2gram[order(twit2gram$Count, decreasing = TRUE),],10)
ggplot(t2g, aes(x=Names, y=Count)) +
  geom_bar(stat="identity") +
  ggtitle("Frequent Bigrams in Twitter") +
  xlab("Bigram") +
  ylab("Frequency")

Plans from here

Remove profanity. Combine the corpora. Rank tokens by frequency. Use the words a user has entered as a lookup into this dictionary and predict the next word by picking the first (most frequent) matching entry. Start with 4-grams and back off to trigrams, then bigrams. Develop a way to handle unseen words. Develop a Shiny app.
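
To make the backoff idea concrete, below is a minimal sketch of the planned lookup. It assumes frequency-sorted n-gram tables named fourGramFreq, threeGramFreq, and twoGramFreq (hypothetical data frames with Names and Count columns, analogous to blog2gram above); it is an outline of the approach, not the final implementation:

# Sketch only: naive frequency backoff from 4-grams down to bigrams.
# Assumes each table has a Names column ("w1 w2 ..." strings) and a Count
# column, and is already sorted by Count in decreasing order.
predictNext <- function(input, fourGramFreq, threeGramFreq, twoGramFreq) {
  toks <- strsplit(tolower(input), "\\s+")[[1]]
  tables <- list(fourGramFreq, threeGramFreq, twoGramFreq)
  context <- c(3, 2, 1)  # words of context each table uses
  for (i in seq_along(tables)) {
    n <- context[i]
    if (length(toks) < n) next
    prefix <- paste0(paste(tail(toks, n), collapse = " "), " ")
    nm <- as.character(tables[[i]]$Names)
    hits <- nm[substr(nm, 1, nchar(prefix)) == prefix]
    if (length(hits) > 0) {
      # tables are sorted by Count, so the first hit is the most frequent;
      # its final word is the prediction
      return(tail(strsplit(hits[1], "\\s+")[[1]], 1))
    }
  }
  NA_character_  # unseen context: to be handled in a later step
}
# e.g. predictNext("thanks for", fourGramFreq, threeGramFreq, twoGramFreq)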