The objective of this project is to create a data product that can predict the next word given a word in context. The idea is to build a predictive model based on a library of text usage. The data used to train the model are the HC Corpora en_US blogs, Twitter and news sets.
Each text data set is 200MB in size. Running processes on 600MB of data is not feasible given the computing resources available. Instead, a random sample of 10% of the records from each set will be used, since we can infer the characteristics of the population from the sample. Sampling is done in 10-line chunks by a function named randomSample.
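The randomSample function itself is not listed in this report. As a rough illustration only, a chunked sampler along these lines might look like the sketch below. The name randomSampleSketch, its argument names and the choice to sample whole chunks without replacement are all assumptions, not the function actually used here.

```r
# Hypothetical sketch of a chunked random sampler (not the report's own code).
# Keeps roughly `fraction` of the input by sampling whole chunks of
# `chunkSize` consecutive lines.
randomSampleSketch <- function(infile, outfile, fraction = 0.1, chunkSize = 10) {
  lines <- readLines(infile, warn = FALSE, skipNul = TRUE)
  if (length(lines) == 0) {
    writeLines(character(0), outfile)
    return(invisible(0L))
  }
  # Label each line with the chunk it belongs to: 1,1,...,2,2,...
  chunkId <- ceiling(seq_along(lines) / chunkSize)
  nChunks <- max(chunkId)
  # Draw a simple random sample of chunks, without replacement
  nKeep <- max(1, round(nChunks * fraction))
  keep <- sample(nChunks, nKeep)
  # Keep the lines of the selected chunks, in their original order
  sampled <- lines[chunkId %in% keep]
  writeLines(sampled, outfile)
  invisible(length(sampled))
}

# Small demonstration on a temporary file of 1,000 lines
set.seed(42)  # makes the sample reproducible
tmpIn <- tempfile(fileext = ".txt")
tmpOut <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:1000), tmpIn)
n <- randomSampleSketch(tmpIn, tmpOut, fraction = 0.1, chunkSize = 10)
```

Sampling whole chunks keeps neighbouring lines together, at the cost of reading the full file into memory first; a streaming version would be needed if memory were the limiting factor.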
##Random sample a tenth of the total set
randomSample("~/coursera/data scientist/Capstone/final/en_US/en_US.blogs.txt",
"~/coursera/data scientist/Capstone/final/en_US/en_US.blogs.sample.txt",0.1,10);
randomSample("~/coursera/data scientist/Capstone/final/en_US/en_US.news.upd.txt",
"~/coursera/data scientist/Capstone/final/en_US/en_US.news.sample.txt",0.1,10);
randomSample("~/coursera/data scientist/Capstone/final/en_US/en_US.twitter.txt",
"~/coursera/data scientist/Capstone/final/en_US/en_US.twitter.sample.txt",0.1,10);
To allow for meaningful analysis of the text bodies we need to do some cleaning. This includes removing additional whitespace, converting all text to lower case, and removing stop words such as “and”, “it” and “so”, which add no information (the full list is shown below).
stopwords("english");
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very" "will"
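The cleaning steps described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the code used in this report: cleanTextSketch is a hypothetical helper, its default stop word vector is a small subset of the list above, and stemming is left out.

```r
# Minimal base-R sketch of the cleaning steps (stemming is omitted).
# `stops` is a small illustrative subset of the stop word list.
cleanTextSketch <- function(x, stops = c("and", "it", "so", "the", "a")) {
  # Convert everything to lower case
  x <- tolower(x)
  # Remove whole stop words; \\b marks a word boundary
  pattern <- paste0("\\b(", paste(stops, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  # Collapse runs of whitespace and trim both ends
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

cleaned <- cleanTextSketch("And so   it begins: the  Analysis of a Corpus")
print(cleaned)
# [1] "begins: analysis of corpus"
```

Lower-casing comes first so that capitalised stop words such as "And" are matched by the lower-case list.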
In addition, to enable meaningful aggregate analysis it is necessary to “tokenise” the text and reduce its words to word stems.
#Load the three cleaned tokenised English text body samples
##Remove whitespace, convert to lower case, remove stop words and perform stemming
en_US_blogs<-swiftToken("~/coursera/data scientist/Capstone/final/en_US/en_US.blogs.sample.txt")
en_US_news<-swiftToken("~/coursera/data scientist/Capstone/final/en_US/en_US.news.sample.txt")
en_US_twitter<-swiftToken("~/coursera/data scientist/Capstone/final/en_US/en_US.twitter.sample.txt")
#combine into a single corpus
corpus_cmb<-c(en_US_blogs,en_US_news,en_US_twitter)
For preliminary analysis, examine the number of lines in each of the texts.
#Check number of lines in each set
NROW(en_US_blogs[[1]][1]$content)
## [1] 89080
NROW(en_US_twitter[[1]][1]$content)
## [1] 235300
NROW(en_US_news[[1]][1]$content)
## [1] 100520
Next, examine the words in the combined corpus, focussing on word stems which occur in all 3 documents of the corpus and which are between 3 and 20 characters long. Look at the 30 most frequently occurring stems in each document.
##Create the combined term document matrix only including word stems of length 3 to 20
## which occur in all 3 documents in the corpus.
## (tm documents bounds$global as a length-2 vector: c(lower, upper))
tdm <-TermDocumentMatrix(corpus_cmb, control=list(wordLengths=c(3, 20),
bounds = list(global = c(3, Inf))))
##Examine the 30 most frequently used terms
findMostFreqTerms(tdm,30);
## $en_US.blogs.sample.txt
## one like time can just get make day know year use love
## 13184 10837 10289 9932 9837 9296 7937 7023 6823 6759 6428 6393
## work peopl thing want think now see even also look dont new
## 6202 6160 6116 6083 5920 5837 5597 5596 5584 5565 5553 5413
## way well back first good take
## 5255 5169 5071 5040 4961 4879
##
## $en_US.news.sample.txt
## said year one new time state say can also like
## 24775 11101 8526 6841 6647 6565 6268 6046 6002 5992
## get two first last just make peopl work game school
## 5908 5654 5334 5313 5245 5176 4969 4966 4965 4575
## citi play day includ want use take team back now
## 4497 4355 4310 3905 3773 3766 3748 3744 3612 3569
##
## $en_US.twitter.sample.txt
## just get thank like love day good dont can one
## 15064 14578 12977 12933 12396 10874 10103 9060 8904 8675
## know time now follow great see today make new lol
## 8586 8581 8121 7867 7695 7562 7445 7205 6973 6911
## look think come work need want back got cant peopl
## 6556 6425 6388 6335 6293 6175 5706 5679 5371 5301
It’s interesting to note the variation in the ranking of word stems across the 3 documents. Although many stems are common to all 3 sets and their rankings are similar, they are not exactly the same. This could point to a difference in the type of language used when communicating in a blog as opposed to a Twitter post or a news item.
This may also indicate the need to incorporate context into the predictive model. I also suspect that language may differ between locations.
Looking a bit deeper, we examine the words in the term document matrix in aggregate: the top 100 most frequently occurring stems.
##convert to matrix
tdmaggr<-as.matrix(tdm)
##look at top 100 most frequently occurring stems
v<-sort(rowSums(tdmaggr), decreasing=TRUE)
head(v, 100)
## one said just get like time can day year make
## 30385 30286 30146 29782 29762 25517 24882 22207 22055 20318
## love new know good dont now work peopl want say
## 20230 19227 18301 18201 17561 17527 17503 16430 16031 16007
## see think thank look come back need first use also
## 15775 15331 15186 15055 14437 14389 13892 13427 13382 13210
## thing last well take great way even much today two
## 13030 12817 12715 12616 12530 12398 12004 11730 11535 11302
## right realli follow got week start still play game call
## 11241 11150 11040 10979 10638 10592 10203 10093 10064 9634
## show tri state feel that life school home mani cant
## 9402 9401 9273 9200 8976 8933 8867 8809 8638 8553
## live help night littl made hope never let may best
## 8443 8356 8239 8214 8188 8135 8134 8105 7831 7766
## next friend give lol someth book lot world citi happi
## 7643 7535 7406 7167 7127 7055 6935 6907 6906 6884
## end find man didnt place keep better watch alway anoth
## 6864 6857 6758 6749 6731 6719 6705 6685 6667 6607
## run ive around everi team your put talk big read
## 6585 6580 6540 6470 6465 6456 6262 6251 6209 6085
The 983 most frequent word stems cover just under 60% of the text, so about 984 stems from the vocabulary are needed to reach 60% coverage.
dv<-data.frame(v,names(v))
colnames(dv)<-c("count","wordstem")
dvp<-dv %>%
mutate(
perc_cover = cumsum(count)/sum(count)
)
tail(dvp[dvp$perc_cover<=0.6,])
## count wordstem perc_cover
## 978 1111 common 0.5988054
## 979 1110 deep 0.5990116
## 980 1109 fast 0.5992176
## 981 1108 appar 0.5994233
## 982 1107 address 0.5996289
## 983 1106 absolut 0.5998343
The 7,879 most frequent of the 167,786 word stems cover just under 90% of the text, so about 7,880 stems are needed to reach 90% coverage. A very small portion of the vocabulary (under 5%) can therefore cover a large proportion of the text.
tail(dvp[dvp$perc_cover<=0.9,])
## count wordstem perc_cover
## 7874 51 reboot 0.8999452
## 7875 51 rejoic 0.8999547
## 7876 51 rhp 0.8999642
## 7877 51 roth 0.8999736
## 7878 51 salesman 0.8999831
## 7879 51 scenic 0.8999926
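The coverage calculation above can also be wrapped in a small helper that returns the number of stems needed directly. The sketch below is illustrative only: stemsForCoverage is a hypothetical name and the toy counts are made up for the demonstration.

```r
# Sketch: smallest number of top-ranked stems whose counts reach a
# target share of all word occurrences (hypothetical helper).
stemsForCoverage <- function(counts, target) {
  counts <- sort(counts, decreasing = TRUE)
  cover <- cumsum(counts) / sum(counts)
  # First rank at which cumulative coverage meets or exceeds the target
  unname(which(cover >= target)[1])
}

# Toy frequencies, invented for illustration
toy <- c(the = 50, cat = 25, sat = 15, mat = 7, hat = 3)
n60 <- stemsForCoverage(toy, 0.60)  # cumulative shares: 0.50, 0.75, ... -> 2
n85 <- stemsForCoverage(toy, 0.85)  # cumulative shares: ..., 0.75, 0.90 -> 3
```

Assuming v is the sorted frequency vector built earlier, stemsForCoverage(v, 0.6) and stemsForCoverage(v, 0.9) would give the counts for this corpus. Unlike filtering on perc_cover <= target, which returns the last rank still below the target, this returns the first rank that reaches it.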
The plot below shows this relationship: the number of word stems included versus the percentage of text covered.
g<-ggplot(dvp,mapping=aes(perc_cover,as.numeric(rownames(dvp)))) +
geom_line(color="red",size=1) +
scale_y_continuous(labels=comma) +
scale_x_continuous(labels = percent) +
labs(y="Number of Word Stems", x="Percentage of text covered") +
ggtitle("Word Stems against coverage of text")
g
The number of word stems required appears to grow roughly exponentially with the proportion of the text to be covered. In other words, achieving very high levels of coverage requires an increasingly large proportion of the total word stem vocabulary.
Exploratory analysis has shown that the body of text is extremely large. We have learned that achieving higher levels of coverage of the text requires an increasingly large proportion of the vocabulary. In addition, the character of the language seems to depend on the context, which may influence the way in which predictive models are constructed.