In this report we briefly review some preliminary findings and major features of the data, and discuss go-forward plans for developing the prediction algorithm and the associated Shiny app.
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile = "Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")
library(NLP)
library(tm)
library(ngram)
conblogs_file   <- file("en_US/en_US.blogs.txt", "r")
connews_file    <- file("en_US/en_US.news.txt", "r")
contwitter_file <- file("en_US/en_US.twitter.txt", "r")
conblogs   <- readLines(conblogs_file, warn = FALSE, encoding = "UTF-8")
connews    <- readLines(connews_file, warn = FALSE, encoding = "UTF-8")
contwitter <- readLines(contwitter_file, warn = FALSE, encoding = "UTF-8")
close(conblogs_file); close(connews_file); close(contwitter_file)
head(conblogs,1)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan <U+0093>gods<U+0094>."
head(connews,1)
## [1] "He wasn't home alone, apparently."
head(contwitter,1)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
We can see that the blogs data contains 899,288 lines, the news data 77,259 lines, and the Twitter data 2,360,148 lines. Now let's look at some summary statistics for each dataset.
We'll use the NLP package's ngrams function to count words. First we split each line on spaces with strsplit and flatten the result into a single character vector; the length of the list returned by ngrams(x, 1L) then gives the total word count.
conblogs_ngrams   <- unlist(strsplit(conblogs, " ", fixed = TRUE))
connews_ngrams    <- unlist(strsplit(connews, " ", fixed = TRUE))
contwitter_ngrams <- unlist(strsplit(contwitter, " ", fixed = TRUE))
df <- data.frame("File Name" = c("Blogs", "News", "Twitter"),
                 # file sizes are reported in bytes
                 "File Size" = c(file.size("en_US/en_US.blogs.txt"), file.size("en_US/en_US.news.txt"), file.size("en_US/en_US.twitter.txt")),
                 "Total Words" = c(length(ngrams(conblogs_ngrams, 1L)), length(ngrams(connews_ngrams, 1L)), length(ngrams(contwitter_ngrams, 1L))),
                 "Avg chars/Line" = c(mean(nchar(conblogs)), mean(nchar(connews)), mean(nchar(contwitter))))
df
## File.Name File.Size Total.Words Avg.chars.Line
## 1 Blogs 210160014 37334131 229.98695
## 2 News 205811889 2643969 202.42830
## 3 Twitter 167105338 30373543 68.68045
Given the size of the datasets, we'll sample and clean each one to form our corpus. We sample 1% of the lines from each file, which gives us about 8,993 lines from the blogs data, 773 from the news data, and 23,601 from the Twitter data.
set.seed(400)
textsample <- c(sample(conblogs, length(conblogs) * 0.01),
sample(connews, length(connews) * 0.01),
sample(contwitter, length(contwitter) * 0.01))
corpus <- VCorpus(VectorSource(textsample))
# helper transformer: replace anything matching `pattern` with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")   # strip URLs
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")                       # strip Twitter handles
# note: stop words are removed before lower-casing, so capitalized forms such as "I" survive
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)                         # restore PlainTextDocument structure
corpus <- tm_map(corpus, toSpace, "[^[:graph:]]")                   # drop non-printable characters
corpus <- tm_map(corpus, tolower)                                   # lower-case everything
con_corpus <- concatenate(corpus)                      # collapse the corpus into one long string
con_corpus <- gsub("\\n", "", con_corpus)              # remove newline characters
con_corpus <- gsub("\"", "", con_corpus)               # remove double quotes
con_corpus <- gsub(" ,", "", con_corpus)               # remove any leftover space-comma pairs
con_corpus <- gsub("[^0-9A-Za-z/' ]", "", con_corpus)  # keep only letters, digits, slashes, apostrophes and spaces
Now that the corpus is a workable size, we can explore the data. Here we will show the most frequently occurring word pairs (bigrams) and word triples (trigrams), as these will be of direct interest in developing the predictive text algorithm.
unigram <- ngram(con_corpus, n = 1)
bigram  <- ngram(con_corpus, n = 2)
trigram <- ngram(con_corpus, n = 3)
phrase  <- get.phrasetable(bigram)    # bigram frequency table
phrase2 <- get.phrasetable(trigram)   # trigram frequency table
phrase3 <- get.phrasetable(unigram)   # unigram frequency table
phrase[1:30,]
## ngrams freq prop
## 1 i think 514 0.0012186852
## 2 i love 449 0.0010645713
## 3 i know 415 0.0009839579
## 4 i can 376 0.0008914896
## 5 i just 336 0.0007966503
## 6 i m 301 0.0007136659
## 7 don t 278 0.0006591333
## 8 i will 270 0.0006401654
## 9 i want 240 0.0005690359
## 10 right now 237 0.0005619230
## 11 i like 202 0.0004789386
## 12 i get 187 0.0004433738
## 13 i got 185 0.0004386318
## 14 i need 170 0.0004030671
## 15 i really 165 0.0003912122
## 16 last night 165 0.0003912122
## 17 it s 164 0.0003888412
## 18 now i 163 0.0003864702
## 19 know i 159 0.0003769863
## 20 i feel 158 0.0003746153
## 21 i hope 157 0.0003722443
## 22 time i 150 0.0003556474
## 23 i thought 148 0.0003509055
## 24 i ve 143 0.0003390506
## 25 didn t 143 0.0003390506
## 26 i wish 133 0.0003153407
## 27 think i 117 0.0002774050
## 28 like i 117 0.0002774050
## 29 i see 115 0.0002726630
## 30 i hate 114 0.0002702921
phrase2[1:30,]
## ngrams freq prop
## 1 i don t 92 2.181309e-04
## 2 i think i 80 1.896791e-04
## 3 i know i 70 1.659692e-04
## 4 i wish i 57 1.351463e-04
## 5 i feel like 48 1.138075e-04
## 6 i didn t 48 1.138075e-04
## 7 happy mothers day 42 9.958152e-05
## 8 i can t 39 9.246855e-05
## 9 i thought i 38 9.009757e-05
## 10 don t know 32 7.587163e-05
## 11 i m going 30 7.112966e-05
## 12 don t want 24 5.690373e-05
## 13 i m sure 23 5.453274e-05
## 14 update as of 23 5.453274e-05
## 15 i can see 22 5.216175e-05
## 16 happy new year 22 5.216175e-05
## 17 i think im 21 4.979076e-05
## 18 feel like i 21 4.979076e-05
## 19 years ago i 21 4.979076e-05
## 20 every time i 20 4.741977e-05
## 21 i haven t 20 4.741977e-05
## 22 let us know 19 4.504878e-05
## 23 i guess i 19 4.504878e-05
## 24 last night i 19 4.504878e-05
## 25 right now i 18 4.267779e-05
## 26 fukushima daiichi nuclear 18 4.267779e-05
## 27 little italy boston 17 4.030681e-05
## 28 i really want 17 4.030681e-05
## 29 magianos little italy 17 4.030681e-05
## 30 nuclear power plant 16 3.793582e-05
Now let’s look at the most common words in the corpus.
phrase3 <- phrase3[with(phrase3, order(-freq, ngrams)), ]   # order by descending frequency
phrase3 <- phrase3[1:30, ]                                  # keep the top 30 words
barplot(phrase3$freq, names.arg = phrase3$ngrams, main = "Unigram frequency", xlab = "Words", las = 2)
Much of the analysis shown here will form the basis of our predictive text model. After further cleanup and analysis, we will use the n-gram frequency tables to predict the most likely next word given the word or words a user has typed. A major challenge will be the speed of searching the n-gram table (or tables) for a given input: users should not be left waiting for the predicted word.
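As a rough illustration of the lookup the model will need to perform, the sketch below queries the bigram table phrase built above for the most frequent continuations of a single word. The helper name predict_next_word and its interface are hypothetical placeholders, not the final design.

# Hypothetical helper: most frequent words following `word` in the bigram table `phrase`
predict_next_word <- function(word, bigram_table = phrase, n = 3) {
  word  <- tolower(word)
  grams <- as.character(bigram_table$ngrams)             # bigrams as "w1 w2" strings
  # keep bigrams that start with the input word followed by a space
  matches <- grams[startsWith(grams, paste0(word, " "))]
  if (length(matches) == 0) return(character(0))
  # get.phrasetable() already sorts by frequency, so the first n matches are the most common
  top <- head(matches, n)
  # the second token of each matching bigram is the predicted next word
  vapply(strsplit(trimws(top), " ", fixed = TRUE), function(x) x[2], character(1))
}
predict_next_word("i")   # should return continuations such as "think", "love", "know"

Plain string matching over a large data frame like this is unlikely to be fast enough for the Shiny app; one option we will evaluate is a pre-split table keyed on the leading word(s) of each n-gram (for example with data.table) so that lookups stay responsive.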