The purpose of this report is to (1) present a concise exploratory analysis of the data, and (2) provide an overview of the major steps in the strategy to build an algorithm that predicts the next word based on a given user input.
The data set consists of three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. The files were read in and a basic summary of each was computed: line count and word count, the latter obtained with the Unix/Linux command wc.
# read in data
setwd("~/MyStuff/Programming/Rwork/CapstoneProject/final/en_US")
library(tm)
library(wordcloud)
library(ngram)
library(ggplot2)
require(RWeka)
blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
twitter <- readLines("en_US.twitter.txt")
df <- data.frame(blogs.lines = length(blogs), news.lines = length(news), twitter.lines = length(twitter))
print(df)
## blogs.lines news.lines twitter.lines
## 1 899288 1010242 2360148
system("wc -w *.txt >> word_count")
wcnt <- read.delim("word_count", header = F)
names(wcnt) <- "word.count"
print(wcnt)
## word.count
## 1 37334114 en_US.blogs.txt
## 2 34365936 en_US.news.txt
## 3 30359804 en_US.twitter.txt
## 4 102059854 total
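As a cross-check, the word counts can also be reproduced in R itself, avoiding the Unix dependency. The helper below is only a sketch (count.words is an illustrative name, not used elsewhere in this report), and its whitespace tokenization may differ slightly from wc around punctuation.
# approximate word count in pure R (illustrative alternative to wc)
count.words <- function(x) sum(vapply(strsplit(x, "\\s+"), length, integer(1L)))
data.frame(blogs = count.words(blogs), news = count.words(news), twitter = count.words(twitter))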
Furthermore, a 0.1% sample of the data was selected (sample.length below); such a small fraction was chosen due to memory limitations during processing, when matrices are needed for the word frequency counts. Some cleaning of the text corpus was performed: conversion to lowercase and removal of extra whitespace, punctuation, numbers, and English stopwords. A few user-defined stopwords were removed as well.
sample.length <- 0.001  # 0.1% of the lines in each file
blogs.sample <- sample(blogs, round(length(blogs) * sample.length))
news.sample <- sample(news, round(length(news) * sample.length))
twitter.sample <- sample(twitter, round(length(twitter) * sample.length))
myData.sample <- c(blogs.sample, news.sample, twitter.sample)
myData.corpus <- Corpus(VectorSource(myData.sample))
# Data cleaning
myData.corpus <- tm_map(myData.corpus, content_transformer(tolower))
myData.corpus <- tm_map(myData.corpus, removePunctuation)
myData.corpus <- tm_map(myData.corpus, removeNumbers)
myData.corpus <- tm_map(myData.corpus, stripWhitespace)
myData.corpus <- tm_map(myData.corpus, removeWords, stopwords("english"))
mystopwords <- c("cant", "dont", "isnt", "wont", "youre", "havent", "didnt",
"doesnt", "ive", "youve", "hasnt", "hadnt", "couldnt", "wouldnt")
myData.corpus <- tm_map(myData.corpus, removeWords, mystopwords)
myData.corpus <- tm_map(myData.corpus, PlainTextDocument)
dtm <- DocumentTermMatrix(myData.corpus)
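If memory remains a constraint even at this sampling rate, one possible refinement (not applied above, and the 0.999 threshold is an illustrative value) is to drop very sparse terms from the document-term matrix before computing frequencies:
# optional: remove terms absent from more than 99.9% of documents
dtm.small <- removeSparseTerms(dtm, 0.999)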
In order to get some insight into the data, an analysis of the most frequent words in the sample was performed, as shown below. The twenty most frequent words are tabulated, a word cloud is constructed from the fifty most frequent words, and a barplot is drawn for all words with frequency above 200.
frequency <- colSums(as.matrix(dtm))
frequency <- sort(frequency, decreasing=TRUE)
head(frequency, 20)
## just one will said like can get time new now
## 326 323 317 304 302 258 216 208 201 192
## good love know back day people last think first going
## 186 181 165 161 160 144 140 138 137 135
words <- names(frequency)
wordcloud(words[1:50], frequency[1:50])
wf <- data.frame(words = words, word_frequency = frequency)
p <- ggplot(subset(wf, word_frequency>200), aes(words, word_frequency))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
Associations were then sought for the two most frequent words, which typically come from the set said, one, just, and will; non-zero results were returned only at a weak correlation threshold. The first ten associations for each word are shown.
assocs <- findAssocs(dtm, words[1:2], 0.1)
print(assocs[[1]][1:10])
## breathe commentyou probs sos verification
## 0.20 0.20 0.20 0.20 0.20
## waging worn blogs like adopting
## 0.20 0.20 0.17 0.16 0.15
print(assocs[[2]][1:10])
## analyze brothers celeste cohort cohorts kaiser
## 0.20 0.20 0.20 0.20 0.20 0.20
## fascinated wire banter buggerdixon
## 0.17 0.17 0.15 0.15
Finally, the numbers of bigrams and trigrams in the sampled data were determined, and the n-grams themselves are displayed below. An n-gram is a sequence of n consecutive words in a text; n-gram frequencies are useful for predicting the next word from the preceding word(s).
ng <- ngram(tolower(myData.sample), n = 2)  # bigrams from the lowercased sample
ng
## [1] "An ngram object with 161 2-grams"
get.ngrams(ng)
## [1] "not the" "humorous flashes," "rico, we"
## [4] "universe pageant." "reasons i" "herself, no"
## [7] "joke. ever" "advised to" "the contestants"
## [10] "i enjoy" "the deal" "how she"
## [13] "contestants. when" "weber is" "working with"
## [16] "witnessed. ellen's" "no matter" "blew up"
## [19] "natural wit" "of the" "have ever"
## [22] "was on" "you can" "with loses"
## [25] "doctor who" "the door." "gifts of"
## [28] "the reasons" "she tripped" "wit daily."
## [31] "engages others" "she networks." "call for"
## [34] "business deals" "not feeling" "banter work"
## [37] "to see" "-- but" "stayed in"
## [40] "when presenting" "and fell" "pageant. one"
## [43] "called to" "makes business" "for contestants."
## [46] "deals or" "deal and" "in puerto"
## [49] "with a" "once when" "was called"
## [52] "from natural" "to laugh" "she experienced"
## [55] "a hotel" "is she" "but not"
## [58] "ever witnessed." "was not" "of all"
## [61] "was advised" "one of" "if a"
## [64] "action with" "walked to" "the doctor"
## [67] "others as" "feeling well" "her as"
## [70] "laughs i" "flashes, check" "who was"
## [73] "networks. if" "there. she" "or just"
## [76] "floodlight wire" "just engages" "puerto rico,"
## [79] "see how" "presenting in" "of wit"
## [82] "a well-timed" "she makes" "she has"
## [85] "on call" "to the" "fell as"
## [88] "and you" "wire and" "i have"
## [91] "contestants waiting" "at herself," "group we're"
## [94] "so was" "named was" "hotel hosting"
## [97] "ever tried" "collaborating with" "a group"
## [100] "physician, she" "in front" "miss universe"
## [103] "she draws" "daily. to" "and banter"
## [106] "ellen's gifts" "in a" "can quickly"
## [109] "work well" "belly laughs" "laugh at"
## [112] "the physician," "enjoy collaborating" "up --"
## [115] "her humorous" "what. once" "steam, ellen"
## [118] "ellen was" "she walked" "see the"
## [121] "out dinner" "hosting the" "the action"
## [124] "on a" "all the" "ellen weber"
## [127] "when ellen's" "check out" "front of"
## [130] "a floodlight" "as she" "experienced one"
## [133] "jump-starts the" "we stayed" "waiting there."
## [136] "for her" "wit and" "tried it?"
## [139] "loses steam," "ellen's named" "draws from"
## [142] "well for" "morning ellen" "well-timed joke."
## [145] "dinner blew" "most gregarious" "the most"
## [148] "tripped on" "matter what." "has ability"
## [151] "ability to" "ellen jump-starts" "gregarious belly"
## [154] "quickly see" "door. in" "with ellen"
## [157] "the miss" "well so" "we're working"
## [160] "one morning" "see her"
ng <- ngram(myData.sample, n = 3)  # trigrams; note the original case is kept here
ng
## [1] "An ngram object with 166 3-grams"
get.ngrams(ng)
## [1] "who was on" "witnessed. Ellen's gifts"
## [3] "Ellen's named was" "the doctor who"
## [5] "just engages others" "there. She experienced"
## [7] "matter what. Once" "Ellen jump-starts the"
## [9] "her as she" "hosting the Miss"
## [11] "most gregarious belly" "what. Once when"
## [13] "Ever tried it?" "all the contestants"
## [15] "she tripped on" "experienced one of"
## [17] "her humorous flashes," "Ellen's gifts of"
## [19] "she has ability" "was not feeling"
## [21] "loses steam, Ellen" "as she networks."
## [23] "In front of" "Dinner Blew Up"
## [25] "the contestants waiting" "One morning Ellen"
## [27] "work well for" "the reasons I"
## [29] "with a well-timed" "not feeling well"
## [31] "she makes business" "physician, she tripped"
## [33] "called to see" "a floodlight wire"
## [35] "of wit and" "belly laughs I"
## [37] "or just engages" "natural wit daily."
## [39] "banter work well" "Puerto Rico, we"
## [41] "see her humorous" "the door. In"
## [43] "to the door." "others as she"
## [45] "humorous flashes, check" "collaborating with Ellen"
## [47] "as she walked" "ability to laugh"
## [49] "One of the" "for contestants. When"
## [51] "laughs I have" "at herself, no"
## [53] "daily. To see" "Universe Pageant. One"
## [55] "to see the" "contestants waiting there."
## [57] "was advised to" "advised to see"
## [59] "how she has" "well for her"
## [61] "with Ellen Weber" "to laugh at"
## [63] "ever witnessed. Ellen's" "when presenting in"
## [65] "you can quickly" "To see her"
## [67] "call for contestants." "named was called"
## [69] "Pageant. One morning" "with loses steam,"
## [71] "was on call" "draws from natural"
## [73] "group we're working" "Once when presenting"
## [75] "Up -- But" "no matter what."
## [77] "waiting there. She" "gifts of wit"
## [79] "business deals or" "and banter work"
## [81] "we stayed in" "for her as"
## [83] "the action with" "working with loses"
## [85] "makes business deals" "on a floodlight"
## [87] "in Puerto Rico," "a group we're"
## [89] "from natural wit" "floodlight wire and"
## [91] "Not the Deal" "action with a"
## [93] "stayed in a" "Blew Up --"
## [95] "a hotel hosting" "have ever witnessed."
## [97] "steam, Ellen jump-starts" "engages others as"
## [99] "doctor who was" "see the doctor"
## [101] "But Not the" "networks. If a"
## [103] "Rico, we stayed" "deals or just"
## [105] "gregarious belly laughs" "the physician, she"
## [107] "she draws from" "see the physician,"
## [109] "of all the" "front of all"
## [111] "She experienced one" "wit and banter"
## [113] "and fell as" "If a group"
## [115] "Ellen was not" "the Deal and"
## [117] "and you can" "one of the"
## [119] "When Ellen's named" "as she makes"
## [121] "hotel hosting the" "laugh at herself,"
## [123] "I have ever" "is she draws"
## [125] "well so was" "the Miss Universe"
## [127] "door. In front" "Miss Universe Pageant."
## [129] "wit daily. To" "we're working with"
## [131] "was called to" "-- But Not"
## [133] "Ellen Weber is" "well-timed joke. Ever"
## [135] "she walked to" "morning Ellen was"
## [137] "Weber is she" "tripped on a"
## [139] "herself, no matter" "in a hotel"
## [141] "contestants. When Ellen's" "Deal and you"
## [143] "feeling well so" "enjoy collaborating with"
## [145] "quickly see how" "of the most"
## [147] "joke. Ever tried" "so was advised"
## [149] "fell as she" "can quickly see"
## [151] "see how she" "she networks. If"
## [153] "on call for" "walked to the"
## [155] "out Dinner Blew" "presenting in Puerto"
## [157] "jump-starts the action" "I enjoy collaborating"
## [159] "of the reasons" "a well-timed joke."
## [161] "wire and fell" "has ability to"
## [163] "the most gregarious" "reasons I enjoy"
## [165] "flashes, check out" "check out Dinner"
A representative sample of the data will be selected, large enough to provide a sufficient pool of words for accurate prediction. N-grams will be constructed for n = 2, 3, etc., and a value of n will be chosen that balances predictive power against computational cost. Based on the constructed n-grams, the next word will be predicted and suggested to the user from the word(s) given as input, as sketched below.
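The following is a minimal sketch of such a prediction, assuming frequency tables built with the ngram package; the helper names build.freq.table and predict.next are illustrative, and a production version would need smoothing and a much larger sample.
# build a frequency table of n-grams, sorted by decreasing frequency
build.freq.table <- function(text, n) {
  ng <- ngram(concatenate(tolower(text)), n = n)
  pt <- get.phrasetable(ng)  # columns: ngrams, freq, prop
  pt[order(-pt$freq), ]
}
bigram.freq <- build.freq.table(myData.sample, 2)
trigram.freq <- build.freq.table(myData.sample, 3)
# predict the next word: try trigrams keyed on the last two input words,
# then back off to bigrams keyed on the last word
predict.next <- function(input) {
  tokens <- strsplit(tolower(input), "\\s+")[[1]]
  if (length(tokens) >= 2) {
    prefix <- paste(tail(tokens, 2), collapse = " ")
    hits <- trigram.freq[startsWith(trigram.freq$ngrams, paste0(prefix, " ")), ]
    if (nrow(hits) > 0)
      return(tail(strsplit(trimws(hits$ngrams[1]), " ")[[1]], 1))
  }
  prefix <- tail(tokens, 1)
  hits <- bigram.freq[startsWith(bigram.freq$ngrams, paste0(prefix, " ")), ]
  if (nrow(hits) > 0)
    return(tail(strsplit(trimws(hits$ngrams[1]), " ")[[1]], 1))
  NA_character_  # no matching n-gram in the tables
}
predict.next("one of")  # returns the most frequent word following "one of"
Backing off from higher- to lower-order n-grams keeps the predictor usable when a longer context has never been observed in the sample.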