Milestone report for the text prediction algo

In this report we briefly review some preliminary findings and major features of the data, and discuss our plans for developing the algo and the associated Shiny app.

Downloading and loading the required data

download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")
library(NLP)
library(tm)
library(ngram)

blogs_file <- file("en_US/en_US.blogs.txt", "r")
news_file <- file("en_US/en_US.news.txt", "r")
twitter_file <- file("en_US/en_US.twitter.txt", "r")

conblogs <- readLines(blogs_file, warn = FALSE, encoding = "UTF-8")
connews <- readLines(news_file, warn = FALSE, encoding = "UTF-8")
contwitter <- readLines(twitter_file, warn = FALSE, encoding = "UTF-8")

close(blogs_file); close(news_file); close(twitter_file)

head(conblogs,1)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan <U+0093>gods<U+0094>."
head(connews,1)
## [1] "He wasn't home alone, apparently."
head(contwitter,1)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."

We can see that the blogs data contains 899,288 lines, the news data 77,259 lines, and the Twitter data 2,360,148 lines. The news count looks suspiciously low relative to its file size; readLines likely stopped early at an embedded control character, and opening the connection in binary mode ("rb") would read the full file. Now let's look at some summary statistics for each dataset.

Basic statistics by dataset

We’ll use the NLP package’s ngrams function. First we split each line on spaces with strsplit and flatten the result into a single word vector with unlist; the length of the list returned by ngrams then gives the total word count.

conblogs_ngrams <- unlist(strsplit(conblogs, " ", fixed = TRUE))
connews_ngrams <- unlist(strsplit(connews, " ", fixed = TRUE))
contwitter_ngrams <- unlist(strsplit(contwitter, " ", fixed = TRUE))

df<-data.frame("File Name" = c("Blogs", "News", "Twitter"),
           "File Size" = c(file.size("en_US/en_US.blogs.txt"), file.size("en_US/en_US.news.txt"), file.size("en_US/en_US.twitter.txt")),
           "Total Words" = c(length(ngrams(conblogs_ngrams, 1L)), length(ngrams(connews_ngrams, 1L)), length(ngrams(contwitter_ngrams, 1L))),
           "Avg chars/Line"=c(mean(nchar(conblogs)),mean(nchar(connews)),mean(nchar(contwitter))))
df
##   File Name File Size (bytes) Total Words Avg chars/Line
## 1     Blogs         210160014    37334131      229.98695
## 2      News         205811889     2643969      202.42830
## 3   Twitter         167105338    30373543       68.68045

Cleanup and exploration

Given the size of the datasets, we’ll sample and clean each to form our corpus. We’ll sample 1% of the lines from each source, which gives us about 8,993 lines from the blogs data, 773 from the news data, and 23,601 from the Twitter data.

set.seed(400)
textsample <- c(sample(conblogs, length(conblogs) * 0.01),
                 sample(connews, length(connews) * 0.01),
                 sample(contwitter, length(contwitter) * 0.01))

corpus <- VCorpus(VectorSource(textsample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")  # strip URLs
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")                      # strip @handles
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, toSpace, "[^[:graph:]]")                  # drop non-printable characters
corpus <- tm_map(corpus, content_transformer(tolower))             # wrapped so the corpus structure survives

con_corpus <- concatenate(corpus)                      # collapse the corpus into one string (ngram package)
con_corpus <- gsub("\\n", "", con_corpus)              # drop newlines
con_corpus <- gsub("\"", "", con_corpus)               # drop double quotes
con_corpus <- gsub(" ,", "", con_corpus)               # drop stray space-comma pairs
con_corpus <- gsub("[^0-9A-Za-z/' ]", "", con_corpus)  # keep only alphanumerics, slashes, apostrophes, spaces

Now that the corpus size is workable we can explore the data. Below are the most frequent word pairs (bigrams) and word triples (trigrams), which should be of direct use in developing the predictive text algo. Note that because lowercasing happens after stopword removal in the pipeline above, capitalized stopwords such as “I” survive, which is why “i” dominates the tables below; reordering those steps is worth revisiting in further cleanup.

unigram <- ngram(con_corpus, n = 1)
bigram <- ngram(con_corpus, n = 2)
trigram <- ngram(con_corpus, n = 3)

phrase <- get.phrasetable(bigram)    # bigram frequency table
phrase2 <- get.phrasetable(trigram)  # trigram frequency table
phrase3 <- get.phrasetable(unigram)  # unigram frequency table

phrase[1:30,]
##         ngrams freq         prop
## 1     i think   514 0.0012186852
## 2      i love   449 0.0010645713
## 3      i know   415 0.0009839579
## 4       i can   376 0.0008914896
## 5      i just   336 0.0007966503
## 6         i m   301 0.0007136659
## 7       don t   278 0.0006591333
## 8      i will   270 0.0006401654
## 9      i want   240 0.0005690359
## 10  right now   237 0.0005619230
## 11     i like   202 0.0004789386
## 12      i get   187 0.0004433738
## 13      i got   185 0.0004386318
## 14     i need   170 0.0004030671
## 15   i really   165 0.0003912122
## 16 last night   165 0.0003912122
## 17       it s   164 0.0003888412
## 18      now i   163 0.0003864702
## 19     know i   159 0.0003769863
## 20     i feel   158 0.0003746153
## 21     i hope   157 0.0003722443
## 22     time i   150 0.0003556474
## 23  i thought   148 0.0003509055
## 24       i ve   143 0.0003390506
## 25     didn t   143 0.0003390506
## 26     i wish   133 0.0003153407
## 27    think i   117 0.0002774050
## 28     like i   117 0.0002774050
## 29      i see   115 0.0002726630
## 30     i hate   114 0.0002702921
phrase2[1:30,]
##                        ngrams freq         prop
## 1                    i don t    92 2.181309e-04
## 2                  i think i    80 1.896791e-04
## 3                   i know i    70 1.659692e-04
## 4                   i wish i    57 1.351463e-04
## 5                i feel like    48 1.138075e-04
## 6                   i didn t    48 1.138075e-04
## 7          happy mothers day    42 9.958152e-05
## 8                    i can t    39 9.246855e-05
## 9                i thought i    38 9.009757e-05
## 10                don t know    32 7.587163e-05
## 11                 i m going    30 7.112966e-05
## 12                don t want    24 5.690373e-05
## 13                  i m sure    23 5.453274e-05
## 14              update as of    23 5.453274e-05
## 15                 i can see    22 5.216175e-05
## 16            happy new year    22 5.216175e-05
## 17                i think im    21 4.979076e-05
## 18               feel like i    21 4.979076e-05
## 19               years ago i    21 4.979076e-05
## 20              every time i    20 4.741977e-05
## 21                 i haven t    20 4.741977e-05
## 22               let us know    19 4.504878e-05
## 23                 i guess i    19 4.504878e-05
## 24              last night i    19 4.504878e-05
## 25               right now i    18 4.267779e-05
## 26 fukushima daiichi nuclear    18 4.267779e-05
## 27       little italy boston    17 4.030681e-05
## 28             i really want    17 4.030681e-05
## 29     magianos little italy    17 4.030681e-05
## 30       nuclear power plant    16 3.793582e-05

Now let’s look at the most common words in the corpus.

phrase3 <- phrase3[with(phrase3, order(-freq, ngrams)), ]  # sort by frequency, break ties alphabetically
phrase3 <- phrase3[1:30, ]                                 # keep the top 30 words
barplot(phrase3$freq, names.arg = phrase3$ngrams, main = "Unigram frequency",
        xlab = "Words", ylab = "Frequency", las = 2)

Next steps to produce the algo and associated app

Much of the analysis shown here will form the basis of our predictive text model. After further cleanup and analysis, we will use the n-gram frequencies to predict the most likely next word given the words a user has typed. A major challenge will be lookup speed: searching one or more n-gram tables on every input has to be fast enough that users never wait for the suggested word.
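
As a first illustration of how the lookup could work, here is a minimal sketch of a frequency-based backoff search against the phrase tables built above. The predict_next helper is hypothetical (not from any package), assumes its input has been cleaned the same way as the corpus, and simply returns the last word of the best-matching higher-order n-grams:

predict_next <- function(input, n = 3) {
  last_word <- function(x) sub(".* ", "", trimws(x))
  words <- tolower(unlist(strsplit(input, " ", fixed = TRUE)))
  # phrase tables from get.phrasetable are already sorted by frequency,
  # so the first matches are the most likely completions
  if (length(words) >= 2) {
    prefix <- paste(tail(words, 2), collapse = " ")
    hits <- phrase2[startsWith(phrase2$ngrams, paste0(prefix, " ")), ]
    if (nrow(hits) > 0) return(last_word(head(hits$ngrams, n)))
  }
  # back off to bigrams on the last word alone
  hits <- phrase[startsWith(phrase$ngrams, paste0(tail(words, 1), " ")), ]
  if (nrow(hits) > 0) return(last_word(head(hits$ngrams, n)))
  trimws(head(phrase3$ngrams, n))  # last resort: most frequent unigrams
}

predict_next("i really")

On the speed concern: linear scans over data.frames like the above will not scale once the tables are built from a larger sample. One possibility (an assumption about the eventual design, not a commitment) is to split each n-gram into prefix and completion columns and key the table with data.table, turning every lookup into an indexed search:

library(data.table)
bigram_dt <- as.data.table(phrase)
# split each bigram into its first word (the key) and the predicted word
bigram_dt[, c("w1", "w2") := tstrsplit(trimws(ngrams), " ", fixed = TRUE)]
setkey(bigram_dt, w1)
bigram_dt["i"][order(-freq)][1:5, w2]  # top 5 predictions following "i"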