Loading and testing data

The data was manually downloaded from the provided link (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) and unzipped into E:/Coursera_JHU_Capstone/Corpus/.

The data in question contains files titled blogs.txt, news.txt, and twitter.txt for four languages: English (en_US), German (de_DE), Finnish (fi_FI), and Russian (ru_RU). Technically this procedure can be done for any language, but for the purposes of this project I will stick to English.

These files are huge for text corpora, totalling 556 MB for the English folder alone. A sample of the Twitter file:

    How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.
    When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.
    they've decided its more fun if I don't.
    So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)
    Words from a complete stranger! Made my birthday even better :)
    First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!
    i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing

The text is arranged one entry per line. readr’s read_lines() is much faster than base R’s readLines() for files this size, so:
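The snippet below is a minimal sketch of that loading step. The directory layout follows the folder named earlier, and stringi’s stri_count_words() is my choice of word counter; the report only shows the resulting table.

    # Sketch: load each file with readr and tabulate word and line counts.
    library(readr)
    library(stringi)

    dir   <- "E:/Coursera_JHU_Capstone/Corpus/en_US"
    files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

    blogs   <- read_lines(file.path(dir, files[1]))
    news    <- read_lines(file.path(dir, files[2]))
    twitter <- read_lines(file.path(dir, files[3]))

    # Summarise word and line counts per file
    data.frame(
      Filename  = files,
      Wordcount = sapply(list(blogs, news, twitter),
                         function(x) sum(stri_count_words(x))),
      Linecount = sapply(list(blogs, news, twitter), length)
    )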

##            Filename Wordcount Linecount
## 1   en_US.blogs.txt  37334131    899288
## 2    en_US.news.txt  34372530   1010242
## 3 en_US.twitter.txt  30373543   2360148

Sampling

These files are obviously huge, so we’re going to work with a sample. R’s sample() takes a random sample of the specified size from the elements of a vector, so we can use it to draw a manageable subset of lines from each file.
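A sketch of the sampling step, assuming a draw of 1,000 lines per corpus as described in note 2 below. The set.seed() call and the word-count prints are my additions; the report does not state exactly what the three figures that follow measure.

    # Sketch: draw 1000 lines from each corpus for a manageable working set.
    set.seed(1234)                      # reproducibility (my addition)

    blogs2   <- sample(blogs,   1000)
    news2    <- sample(news,    1000)
    twitter2 <- sample(twitter, 1000)

    # Word counts of the three samples
    sum(stri_count_words(blogs2))
    sum(stri_count_words(news2))
    sum(stri_count_words(twitter2))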

## [1] 42394
## [1] 33986
## [1] 12691

Data cleaning

Now that we have samples of a manageable size, a quick look at head(twitter2) shows that we need to deal with punctuation, mixed case, errant whitespace, URLs, emojis, and other artefacts in text data that will skew prediction functions.

The code below takes care of that. We also convert symbols and numbers to word equivalents and remove stopwords - ubiquitous words that appear so frequently in text corpora that they have little information value. The tm package ships with a handy default list of stopwords for this. The rest is a combination of base R and functions from the qdap package.
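The following is a sketch of that cleaning pipeline using tm and qdap. The ordering of steps and the URL/non-ASCII handling are my assumptions, and the printed corpus below has six documents, which suggests the samples were assembled differently than the line-per-document construction shown here.

    # Sketch: normalise the sampled text, then build and clean a tm corpus.
    library(tm)
    library(qdap)

    clean_text <- function(x) {
      x <- gsub("http\\S+|www\\.\\S+", " ", x)      # drop URLs
      x <- iconv(x, "latin1", "ASCII", sub = " ")   # drop emojis / stray bytes
      x <- replace_symbol(x)                        # qdap: "$" -> "dollar", "%" -> "percent"
      x <- replace_number(x)                        # qdap: "100" -> "one hundred"
      tolower(x)
    }

    corpus <- VCorpus(VectorSource(clean_text(c(blogs2, news2, twitter2))))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("en"))
    corpus <- tm_map(corpus, stripWhitespace)
    corpus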

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 6

Exploratory statistics
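The frequency tables below come from term-document matrices built on the cleaned corpus. A sketch of how they might be computed, using RWeka’s NGramTokenizer for the bigram and trigram matrices (my assumption; the report only mentions TermDocumentMatrix()):

    # Sketch: top-n frequency table for n-grams of a given order.
    library(RWeka)

    top_ngrams <- function(corpus, n, top = 20) {
      tok  <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
      tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = tok))
      m    <- as.data.frame(as.matrix(tdm))
      freq <- sort(rowSums(m), decreasing = TRUE)
      head(data.frame(word = names(freq), freq = freq), top)
    }

    print("Top 20 unigrams"); top_ngrams(corpus, 1)
    print("Top 20 bigrams");  top_ngrams(corpus, 2)
    print("Top 20 trigrams"); top_ngrams(corpus, 3)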

## [1] "Top 20 unigrams"
##              word freq
## the           the  487
## one           one  481
## two           two  360
## said         said  303
## will         will  293
## thousand thousand  278
## hundred   hundred  274
## like         like  229
## can           can  222
## just         just  214
## and           and  192
## new           new  182
## three       three  174
## five         five  161
## get           get  156
## good         good  155
## time         time  140
## first       first  138
## now           now  137
## number     number  133

## [1] "Top 20 bigrams"
##                          word freq
## i am                     i am  124
## two thousand     two thousand  113
## one thousand     one thousand   82
## it is                   it is   71
## nine hundred     nine hundred   63
## thousand nine   thousand nine   63
## one hundred       one hundred   54
## i think               i think   52
## i can                   i can   50
## i will                 i will   46
## i know                 i know   41
## don t                   don t   38
## i just                 i just   37
## two hundred       two hundred   37
## i have                 i have   35
## hundred ninety hundred ninety   33
## i m                       i m   32
## three hundred   three hundred   31
## that is               that is   26
## i love                 i love   25

## [1] "Top 20 trigrams"
##                                          word freq
## thousand nine hundred   thousand nine hundred   58
## one thousand nine           one thousand nine   52
## two thousand ten             two thousand ten   23
## nine hundred ninety       nine hundred ninety   18
## i don t                               i don t   15
## two thousand twelve       two thousand twelve   14
## thousand eight hundred thousand eight hundred   13
## i think i                           i think i   12
## dollar one hundred         dollar one hundred   10
## i am going                         i am going   10
## i am sure                           i am sure   10
## thousand one hundred     thousand one hundred   10
## thousand three hundred thousand three hundred   10
## two thousand eight         two thousand eight   10
## two thousand eleven       two thousand eleven   10
## i know i                             i know i    9
## one thousand eight         one thousand eight    9
## dollar two hundred         dollar two hundred    8
## thousand five hundred   thousand five hundred    8
## i am looking                     i am looking    7

Notes:

  1. An invalid character sequence (‘queení ½akes’, flagged by ‘utf8towcs’) became a massive headache: it broke the tolower() step buried inside as.data.frame(as.matrix(TermDocumentMatrix(corpus))). Clearly the stray ½ there was throwing something off, so I had to add a more robust gsub. Other data cleaning issues could exist.

  2. This is a sample of 1,000 lines from each corpus. For more robustness, it should be significantly larger: there’s enough data here to draw 10,000 from each corpus, and given the runtime that should be feasible.