The data has been manually downloaded from the provided link (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) and unzipped into E:/Coursera_JHU_Capstone/Corpus/.
The archive contains files titled blogs.txt, news.txt, and twitter.txt (prefixed with the locale) for four languages: English (en_US), German (de_DE), Finnish (fi_FI), and Russian (ru_RU). Technically this procedure can be done for any language, but for the purposes of this project I will stick to English.
These files are huge for text corpora, totalling 556 MB for the English folder alone. A sample from the Twitter file:
How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.
When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.
they've decided its more fun if I don't.
So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)
Words from a complete stranger! Made my birthday even better :)
First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!
i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing
Looks like the text is arranged one entry per line. readr’s read_lines() function is much faster than base R’s readLines() for files this size. So:
## Filename Wordcount Linecount
## 1 en_US.blogs.txt 37334131 899288
## 2 en_US.news.txt 34372530 1010242
## 3 en_US.twitter.txt 30373543 2360148
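A sketch of how these counts can be produced (the file paths and the stringi word-count helper are assumptions; the report itself does not echo its code):

```r
library(readr)
library(stringi)

# Assumed location of the unzipped English files
path  <- "E:/Coursera_JHU_Capstone/Corpus/en_US/"
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

stats <- lapply(files, function(f) {
  lines <- read_lines(file.path(path, f))   # noticeably faster than readLines()
  data.frame(Filename  = f,
             Wordcount = sum(stri_count_words(lines)),
             Linecount = length(lines))
})
do.call(rbind, stats)
```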
These files are obviously huge, so we’re going to work with samples. Base R’s sample() takes a random sample of the specified size from the elements of a vector, which is all we need here.
## [1] 42394
## [1] 33986
## [1] 12691
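A sketch of the sampling step, assuming the three files were read into character vectors named blogs, news, and twitter (the object names and seed are assumptions; the report prints only the resulting sizes):

```r
set.seed(1234)   # assumed seed, purely for reproducibility
n <- 1000        # lines drawn from each corpus

blogs2   <- sample(blogs,   n)
news2    <- sample(news,    n)
twitter2 <- sample(twitter, n)
```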
Now that we have samples of a manageable size, a quick look at head(twitter2) shows that we need to deal with punctuation, inconsistent case, errant whitespace, URLs, emoji, and other artefacts in text data that would skew prediction functions.
The code below takes care of that. We also convert symbols and numbers to word equivalents and remove stopwords - ubiquitous words that appear so frequently in text corpora that they have little information value. The tm package ships with a handy default list of stopwords for this. The rest is a combination of base R and functions from the qdap package.
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 6
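A minimal sketch of this kind of cleaning pipeline, using tm and qdap (the object names, the exact transformations, and their order are assumptions; the actual chunk is not echoed in the report):

```r
library(tm)
library(qdap)

cleaned <- c(blogs2, news2, twitter2)

# Drop URLs and anything outside plain ASCII before tm sees the text
cleaned <- gsub("http\\S+|www\\.\\S+", "", cleaned)
cleaned <- iconv(cleaned, from = "UTF-8", to = "ASCII", sub = "")

# qdap helpers: spell out numbers and symbols as words
cleaned <- replace_number(cleaned)   # e.g. "100" -> "one hundred"
cleaned <- replace_symbol(cleaned)   # e.g. "$"   -> "dollar", "%" -> "percent"

corpus <- VCorpus(VectorSource(cleaned))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
```

The word-equivalent replacements are why terms like “thousand”, “hundred”, and “dollar” dominate the n-gram tables below.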
## [1] "Top 20 unigrams"
## word freq
## the the 487
## one one 481
## two two 360
## said said 303
## will will 293
## thousand thousand 278
## hundred hundred 274
## like like 229
## can can 222
## just just 214
## and and 192
## new new 182
## three three 174
## five five 161
## get get 156
## good good 155
## time time 140
## first first 138
## now now 137
## number number 133
## [1] "Top 20 bigrams"
## word freq
## i am i am 124
## two thousand two thousand 113
## one thousand one thousand 82
## it is it is 71
## nine hundred nine hundred 63
## thousand nine thousand nine 63
## one hundred one hundred 54
## i think i think 52
## i can i can 50
## i will i will 46
## i know i know 41
## don t don t 38
## i just i just 37
## two hundred two hundred 37
## i have i have 35
## hundred ninety hundred ninety 33
## i m i m 32
## three hundred three hundred 31
## that is that is 26
## i love i love 25
## [1] "Top 20 trigrams"
## word freq
## thousand nine hundred thousand nine hundred 58
## one thousand nine one thousand nine 52
## two thousand ten two thousand ten 23
## nine hundred ninety nine hundred ninety 18
## i don t i don t 15
## two thousand twelve two thousand twelve 14
## thousand eight hundred thousand eight hundred 13
## i think i i think i 12
## dollar one hundred dollar one hundred 10
## i am going i am going 10
## i am sure i am sure 10
## thousand one hundred thousand one hundred 10
## thousand three hundred thousand three hundred 10
## two thousand eight two thousand eight 10
## two thousand eleven two thousand eleven 10
## i know i i know i 9
## one thousand eight one thousand eight 9
## dollar two hundred dollar two hundred 8
## thousand five hundred thousand five hundred 8
## i am looking i am looking 7
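The three frequency tables can be generated by tokenizing the cleaned corpus into n-grams and summing term frequencies. One common approach, sketched here under the assumption that RWeka is available (the report does not show which tokenizer it actually used), is:

```r
library(tm)
library(RWeka)

# Count the `top` most frequent n-grams in a tm corpus
top_ngrams <- function(corpus, n, top = 20) {
  tok  <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = tok))
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  head(data.frame(word = names(freq), freq = freq), top)
}

print("Top 20 unigrams"); top_ngrams(corpus, 1)
print("Top 20 bigrams");  top_ngrams(corpus, 2)
print("Top 20 trigrams"); top_ngrams(corpus, 3)
```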
The invalid input ‘queení ½akes’ in ‘utf8towcs’ error became a massive headache: it was thrown by the tolower() call buried inside as.data.frame(as.matrix(TermDocumentMatrix(corpus))). Clearly the malformed character was throwing something off, so I had to add a more robust gsub() to strip it out. Other data-cleaning issues could still exist.
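A sketch of the kind of more robust character cleanup that resolves this (the exact pattern used is not shown in the report, so the vector name and regex here are assumptions):

```r
# Re-encode to ASCII, dropping byte sequences that cannot be represented;
# this removes the mangled characters that make tolower() fail in utf8towcs
sampled <- iconv(sampled, from = "UTF-8", to = "ASCII", sub = "")

# Belt and braces: replace anything that still is not printable ASCII
sampled <- gsub("[^ -~]", " ", sampled)
```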
This analysis uses a sample of 1,000 from each corpus. For greater robustness the sample should be significantly larger: there is enough data here to draw 10,000 from each corpus, and given the runtime that should be feasible.