Obtaining the data and verifying data integrity

Below you will find a code snippet that downloads the training data and extracts it to an appropriate location. We assume that all relevant data will be stored in a directory named data, referenced as ../data relative to this document's working directory.

First, we want to ensure that we have downloaded a true copy of the project data set. Per this post (thanks to Oscar Fernando de León Osorio for sharing the file hashes), the valid MD5 hash for the data archive is:

library(tools)   # provides md5sum() for the file integrity checks below

# download the file only if necessary (file is > 500 MB!!!)
data.url <- 'http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip'
valid.md5 <- 'e0629c64b1747103a7e751b0a30d3858'
if(!file.exists('../data')) dir.create('../data')
fn <- file.path('../data', basename(data.url))
if(file.exists(fn)) {
        # do file integrity check
        chksum <- md5sum(fn)
        message(chksum)
        # if bad checksum re-download the data file
        if(chksum != valid.md5) {
                message('MD5 sum mismatch, re-downloading data file...')
                download.file(url = data.url, destfile = fn, quiet = FALSE, 
                              mode = 'wb')
        }
} else download.file(url = data.url, destfile = fn, quiet = FALSE, mode = 'wb')
## e0629c64b1747103a7e751b0a30d3858
# verify archive integrity
message('Checking data archive file integrity...')
## Checking data archive file integrity...
chksum <- md5sum(fn)
if(chksum != valid.md5) stop('ABORTING: Corrupt data file!') else {
        message('Data file checksums are correct, proceeding...')
}
## Data file checksums are correct, proceeding...
# extract archive
if(!file.exists('../data/final')) {
        unzip(fn, exdir = '../data')
}

Loading the Data

These files are very large and will be cumbersome to work with in a memory-constrained environment. For example, look at the sizes of the individual data files in the US English corpus:

f <- list.files('../data/final/en_US', full.names = TRUE, recursive = TRUE)
# report each file's size in MB
paste0(f, ': ', paste(file.info(f)$size * 1e-6, 'MB'))
## [1] "../data/final/en_US/en_US.blogs.txt: 210.160014 MB"  
## [2] "../data/final/en_US/en_US.news.txt: 205.811889 MB"   
## [3] "../data/final/en_US/en_US.twitter.txt: 167.105338 MB"

For our initial analysis we will load the entire US English corpus into memory. The operation below completes in roughly 50 seconds on my laptop.

Depending on your system's specifications, loading the entire dataset into memory may not be possible, or it may result in severe system slow-down!

library(stringi)   # fast string processing (stri_* functions used throughout)

# load all data
print(system.time(txt <- lapply(f, readLines))) # takes about 45 - 65 secs on my laptop
##    user  system elapsed 
##  45.747   0.470  47.626
# we will convert all words to uppercase for convenience
print(system.time(txt <- lapply(txt, stri_trans_toupper)))
##    user  system elapsed 
##  30.366   0.263  30.768
names(txt) <- basename(f)

Exploratory Analysis

Line counts

The files which comprise the en_US corpus contain the following numbers of lines:

  • en_US.blogs.txt: 899288
  • en_US.twitter.txt: 2360148
  • en_US.news.txt: 1010242
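
These counts can be reproduced directly from the list loaded above. A minimal sketch, assuming txt is still in memory:

# readLines() returns one element per line, so the line counts are simply the
# lengths of the list elements
sapply(txt, length)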

Tokenization

For the purposes of our initial exploratory analysis, we will use a very crude tokenizer that splits on whitespace. You can expect this operation to complete in roughly 30 - 60 seconds on hardware comparable to my laptop.

# very basic whitespace tokenizer
print(system.time(tokens <- lapply(txt, stri_split_regex, '[:blank:]')))
##    user  system elapsed 
##  17.146   0.056  17.289

Word counts

Using our simple whitespace tokenizer, we find a grand total of 102080206 tokens in the en_US corpus. Here are the total word counts for the individual data sets:

  • en_US.blogs.txt: 37334131
  • en_US.twitter.txt: 30373545
  • en_US.news.txt: 34372530
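
These totals can be reproduced from the tokenized data. A quick sketch, assuming the tokens list created above is still in memory:

# each file's entry in tokens is a list with one character vector of tokens
# per line, so per-file totals are the sums of the per-line lengths
token.totals <- sapply(tokens, function(x) sum(lengths(x)))
token.totals
sum(token.totals)   # grand total for the en_US corpus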

Unique words

Before we can determine word usage counts, our data needs a bit of cleaning; otherwise, a lot of extraneous characters make it difficult to do so. Let’s do a bit of data cleaning before we proceed to a deeper analysis of our data…

Data Cleaning

Here is a sample of data from each set…

lapply(txt, sample, 1)
## $en_US.blogs.txt
## [1] "#70...BABS"
## 
## $en_US.news.txt
## [1] "PROCEEDS WILL BENEFIT VOICES OUTREACH PROGRAMS, SUCH AS THE ANNUAL CHILDREN’S COMPOSITION CONTEST AND WORKSHOPS, VARIOUS SCHOLARSHIPS, AND CONCERTS AT ASSISTED LIVING AND RETIREMENT COMMUNITIES IN NEW JERSEY AND BUCKS COUNTY, PA."
## 
## $en_US.twitter.txt
## [1] "THEY DON'T REALLY UNDERSTAND"

Let’s jam all of our data together and see what kind of weird stuff is lurking in there. We will be looking for non-English characters as a proxy for things that will not be included in our eventual language model (non-English words, emoji, numbers, etc.).

rm(tokens)   # free memory before the next step
all.txt <- unlist(txt)
# find anything which is not alphanumeric or whitespace
non.alphanum.or.foreign <- stri_extract_all_regex(all.txt, '[^A-Za-z0-9 ]')
non.alphanum.or.foreign <- unlist(non.alphanum.or.foreign)
non.alphanum.or.foreign.tbl <- sort(table(non.alphanum.or.foreign), 
                                    decreasing = TRUE)
non.alphanum.or.foreign.proptbl <- sort(prop.table(non.alphanum.or.foreign.tbl), 
                                        decreasing = TRUE)
non.alphanum.or.foreign.chars <- sort(unique(non.alphanum.or.foreign))
rm(all.txt)

Across the entire corpus we find roughly 3400 distinct characters that are not part of the standard US English alphanumeric character set. These include standard English punctuation, various symbols, and characters from many other world languages (there seem to be quite a few Simplified Chinese, Japanese and French characters in there). Here is a small sampling:

sample(non.alphanum.or.foreign.chars, 10)
##  [1] "ய"          "疲"         "于"         "¡"          "ቴ"         
##  [6] "其"         "\U0001f620" "吟"         "È"          "조"

Here are the 100 most commonly occurring “non-standard” characters, by frequency:

names(head(non.alphanum.or.foreign.tbl, 100))
##   [1] "."          ","          "'"          "!"          "-"         
##   [6] "\""         ":"          "?"          "’"          ")"         
##  [11] "("          "#"          "“"          "”"          "/"         
##  [16] "$"          ";"          "&"          "—"          "–"         
##  [21] "<"          "*"          "…"          "_"          "%"         
##  [26] "‘"          "="          ">"          "@"          "~"         
##  [31] "+"          "É"          "Ø"          "\u0093"     "\u0094"    
##  [36] "^"          "\u0092"     "♥"          "]"          "["         
##  [41] "|"          "\u0096"     "£"          "❤"          "\u0097"    
##  [46] "`"          "�"          "•"          "½"          "\\"        
##  [51] "\u0095"     "\U0001f60a" "′"          "Ñ"          "Á"         
##  [56] "\U0001f44d" "\U0001f602" "»"          "´"          "°"         
##  [61] "€"          "☺"          "\U0001f601" "È"          "Í"         
##  [66] "\U0001f612" "Ö"          "}"          "Ó"          "«"         
##  [71] "″"          "\U0001f60d" "\U0001f618" "\U0001f614" "{"         
##  [76] "­"          "\U0001f609" "\U0001f633" "Ā"          "®"         
##  [81] "Ü"          "·"          "\U0001f49c" "Ä"          "\U0001f603"
##  [86] "\U0001f60f" "Â"          "™"          "☀"          "\U0001f61c"
##  [91] "\U0001f497" "\U0001f4a4" "✌"          "Ç"          "\U0001f62d"
##  [96] "\U0001f621" "\U0001f604" "А"          "Е"          "\U0001f499"

Here are the top 20 as a proportion of total “non-standard” characters found:

round(head(non.alphanum.or.foreign.proptbl, 20), 4)
## non.alphanum.or.foreign
##      .      ,      '      !      -      "      :      ?      ’      ) 
## 0.3176 0.2096 0.0865 0.0680 0.0592 0.0502 0.0311 0.0279 0.0230 0.0226 
##      (      #      “      ”      /      $      ;      &      —      – 
## 0.0173 0.0126 0.0084 0.0083 0.0077 0.0069 0.0067 0.0062 0.0032 0.0028

You can see that English punctuation marks dominate the set of non-alphanumeric characters found in the corpus by a wide margin. We also note quite a few entries that may be entirely in foreign languages, which we will need to filter out during data pre-processing to ensure that our model is not “confused” by what amounts to gobbledygook.

Let’s remove all non-alphabetic characters from our data set (for the sake of simplicity we will willingly sacrifice, for the time being, the contextual information provided by punctuation). We are interested only in English words, so we will get rid of any numbers we come across while trying to keep all apostrophes and dashes intact! We will then return to the problem of finding word counts and the distribution of those counts…

# let's first clean all text
clean.txt <- lapply(txt, stri_replace_all_regex, "[^A-Za-z\\'\\-]", ' ')
clean.txt <- lapply(clean.txt, stri_replace_all_regex, '[:blank:]{2,}', ' ')
clean.txt <- lapply(clean.txt, stri_trim_both)
names(clean.txt) <- basename(f)
lapply(clean.txt, sample, 2)
## $en_US.blogs.txt
## [1] "IT WEIGHS A LOT MORE THAN A POUND TOO"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
## [2] "I DO REALIZE THAT IF WE RE GOING TO KEEP OUR FRESH FIRE AND ENERGY AND VISION WE MUST HAVE THE PROTECTION OF PRAYER A SPIRITUAL BATTLE RAGES BETWEEN THE FORCES OF HEAVEN AND HELL AND WE CAN T AFFORD TO ENTER THE FIGHT WITHOUT COMMITTED PRAYER WARRIORS ENCIRCLING US EVERYONE WHO WANTS TO MAKE A LASTING DIFFERENCE FOR CHRIST AND HIS KINGDOM HAS TO FIND A WAY TO BUILDAND MAINTAIN AN EFFECTIVE PRAYER COVERING FAILURE IN MINISTRY CAN OFTEN BE TRACED TO FAILURE TO CREATE AN ACTIVE UNIFIED PRAYER TEAM WHEN WE DISREGARD OR NEGLECT THE CRUCIAL PLACE OF GROUP PRAYER WE ALLOW OUR BLIND SPOTS TO CONTINUE TO PLAGUE AND INJURE US WITH A COMMITTED PRAYER TEAM LABORING FOR AND WITH US HOWEVER WE TAP INTO THE INFINITE POWER OF GOD WE BEGIN TO SEE HIS MIND AND HIS WILL WITH INCREASING CLARITY WE FEEL HIS HEARTBEAT WITH GROWING CERTAINTY"
## 
## $en_US.news.txt
## [1] "THERE'S NOT A LOT OF TIME AS YOU KNOW HE SAID IT'S ONE OF THE CONCERNS THAT PEOPLE HAVE THERE ARE ROUGHLY DAYS TOTAL AS FAR AS TRANSPARENCY MY SENSE IS THAT EVERYONE WHO'S PART OF THIS COMMITTEE WANTS TO MAKE SURE THAT THERE'S TRANSPARENCY THAT THE PUBLIC HAS AN OPPORTUNITY TO HAVE ITS INPUT MADE THAT OTHER MEMBERS OF CONGRESS DO THAT INCLUDES POSSIBLY GETTING THE ADVICE OF EXISTING CONGRESSIONAL COMMITTEES ON BUDGETS FINANCE AND TAXES"
## [2] "COLUMBUS OHIO -- AT JIM TRESSEL WAS IN HIS FIFTH SEASON AS THE HEAD COACH AT DIVISION I-AA YOUNGSTOWN STATE AND HADN'T YET WON HIS FIRST NATIONAL TITLE JOE PATERNO WAS STILL A PENN STATE ASSISTANT BRADY HOKE WAS THE DEFENSIVE ENDS COACH AT MICHIGAN AND URBAN MEYER WAS GETTING READY FOR HIS FIRST YEAR AS THE BOSS AT BOWLING GREEN"                                                                                                             
## 
## $en_US.twitter.txt
## [1] "CAN'T WAIT TILL GODSMACK PLAYS SO I CAN SIT DOWN"                
## [2] "AH I NEED THT BOOK NOW I WANT PEETA AND KATNISS TOGETHER SOO BAD"

Now let’s tokenize our cleaned text and then get word counts…

# tokenize the cleaned text (not the raw txt)
clean.tokens <- lapply(clean.txt, stri_split_regex, '[:blank:]')
names(clean.tokens) <- basename(f)

unique.words.blogs <- sort(unique(unlist(clean.tokens$en_US.blogs.txt)))
unique.words.twitter <- sort(unique(unlist(clean.tokens$en_US.twitter.txt)))
unique.words.news <- sort(unique(unlist(clean.tokens$en_US.news.txt)))

We see the following counts of unique words in each data set:

  • en_US.blogs.txt: 964297
  • en_US.twitter.txt: 1080317
  • en_US.news.txt: 790251
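
These counts follow directly from the unique-word vectors computed above. A minimal sketch:

# number of unique tokens per data set
sapply(list(blogs   = unique.words.blogs,
            twitter = unique.words.twitter,
            news    = unique.words.news), length)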

I was somewhat surprised to see that Twitter appears to have the most diverse “vocabulary” of the three data sets!

Word counts and their distributions

blogs.word.counts <- sort(table(unlist(clean.tokens$en_US.blogs.txt)), decreasing = TRUE)
twitter.word.counts <- sort(table(unlist(clean.tokens$en_US.twitter.txt)), decreasing = TRUE)
news.word.counts <- sort(table(unlist(clean.tokens$en_US.news.txt)), decreasing = TRUE)

par(mfrow = c(1, 3))
hist(blogs.word.counts, col = 'gray', probability = TRUE)
hist(twitter.word.counts, col = 'gray', probability = TRUE)
hist(news.word.counts, col = 'gray', probability = TRUE)

par(mfrow = c(1, 1))

From the histograms we can see that the distributions of word counts for all data sets are highly skewed: there are many infrequently occurring words and a handful of extremely common ones.

Most common words by data set

Here are the top ten words by proportion of occurrence in each data set:

blogs.word.counts.prop <- sort(prop.table(blogs.word.counts), decreasing = TRUE)
twitter.word.counts.prop <- sort(prop.table(twitter.word.counts), decreasing = TRUE)
news.word.counts.prop <- sort(prop.table(news.word.counts), decreasing = TRUE)

head(blogs.word.counts.prop, 10)
## 
##         THE         AND          TO           A          OF           I 
## 0.049277510 0.028651825 0.028317065 0.023820134 0.023328868 0.020119847 
##          IN        THAT          IS         FOR 
## 0.015577623 0.011710330 0.011199832 0.009550189
head(twitter.word.counts.prop, 10)
## 
##        THE         TO          I          A        YOU        AND 
## 0.03047438 0.02565802 0.02311271 0.01983134 0.01566044 0.01412571 
##        FOR         IN         OF         IS 
## 0.01252090 0.01209530 0.01175204 0.01146577
head(news.word.counts.prop, 10)
## 
##         THE          TO         AND           A          OF          IN 
## 0.056421654 0.026069102 0.025404909 0.025091490 0.022379179 0.019342423 
##         FOR        THAT          IS          ON 
## 0.010097220 0.009583670 0.008064696 0.007589956

We can see that common words such as “THE”, “AND”, “TO”, “A”, “OF”, “I”, “IN”, “THAT”, and “IS” are the most frequently occurring words in each data set. This is consistent with what we would intuitively expect.

Profanity filtering

In a commercial product, we would ideally like to avoid generating predictions for words considered profane in the language we are working with. I was able to locate a list of potentially offensive/profane English words (the bad-words.txt list downloaded in the code below). There are many, many other freely available word lists on the internet which can be used for identifying profanity.

Prevalence of profanity

If we divide the number of words which might be considered profane by the total number of words in our entire corpus, we can estimate the proportion of potentially profane words:

profanity <- toupper((read.table('http://www.cs.cmu.edu/~biglou/resources/bad-words.txt')[ , 1 ]))
# all.clean.tokens <- unlist(clean.tokens)
# (length(which(all.clean.tokens %in% profanity)) / length(all.clean.tokens))

The resulting percentage of potentially profane words in the corpus is not enormous, but it is still significant. We have to be conscious of the distress profanity could cause users of the application in a commercial setting, so make sure to filter the naughty words out!
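
For completeness, here is a minimal sketch of how profane tokens could be dropped before modelling, assuming the clean.tokens and profanity objects created above; it is illustrative rather than the final pre-processing step:

# remove any token that appears in the profanity word list
filtered.tokens <- lapply(clean.tokens, function(file.tokens) {
        lapply(file.tokens, function(words) words[!(words %in% profanity)])
})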

Predictive algorithm plans

The problem of statistical language modelling is not a new one by any stretch of the imagination! In fact, there is a very significant existing body of research and open-source code related to solving this specific problem. One thing that many students attempting this project will discover is that R is actually a somewhat difficult environment (unless very specific optimal coding practices are used) in which to do the kind of processing needed to analyze large corpora of text efficiently. In many cases, commonly available command-line tools from the Linux world such as grep, sed, head, tail and others are many orders of magnitude faster at crunching and munging text than the standard R libraries (and even packages from the so-called “Hadleyverse” such as stringr). The stringi package (see here) is one notable exception: it provides highly efficient string processing capabilities with a code interface that will be somewhat familiar to users of the stringr package.
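
As a rough illustration of this point, the line and word counts for a single file can be obtained with a standard shell utility or with stringi. This sketch assumes a Unix-like system and the en_US files downloaded earlier, and is not a formal benchmark:

# count lines and words in the blogs file with the shell utility wc
system('wc -lw ../data/final/en_US/en_US.blogs.txt')

# the roughly equivalent computation in R using stringi on the loaded data
library(stringi)
length(txt$en_US.blogs.txt)                          # line count
sum(stri_count_regex(txt$en_US.blogs.txt, '\\S+'))   # approximate token count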

Basically, the entire aim of this project is to efficiently clean and tokenize a large corpus of text and then to generate bigrams and trigrams which will be used to create a statistical language model for word prediction. When faced with this reality, I realized I had basically two choices: build all of this text-processing and modelling machinery myself in R, or lean on an existing toolkit.

I think that in order to be able to focus more time on building high-quality models rather than on re-inventing the wheel, I will likely opt to use an existing library such as the CMU Statistical Language Modeling Toolkit v2 to do the grunt work of tokenization, tallying, calculating perplexity, etc., and then focus on building the smallest, most efficient and consistently accurate language models I can. I think that this is consistent with the course objectives and an appropriate course of action (provided I can demonstrate sufficient evidence of having understood the underlying techniques and algorithms).

Game plan summary

To summarize, we will approach the app development in the following way:

  1. Create an R wrapper for the CMU Statistical Language Modeling Toolkit v2. This will include an entire text processing and modelling pipeline which will perform the following steps:
    1. Download the source data and verify the integrity of the data
    2. Pre-process the data (clean, normalize, etc.) and create a training/test set split
    3. Fit a statistical language model to the training data using the CMU toolkit (producing an ARPA-format probability model)
    4. Evaluate the fitted model against the test data set created in step 2
    5. Transform the CMU-generated probability model into a data.table (see the sketch after this list)
  2. Optimize the language probability model for size so that we get acceptable accuracy and consistency while keeping runtime and space requirements suitable for use on a modern mobile device
  3. Use the model in a Shiny App as required by the final project rubric.
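
To make step 1.5 a bit more concrete, below is a minimal sketch of the kind of keyed data.table lookup structure the final model might use. The column names and toy n-gram values are purely illustrative assumptions, not output from the CMU toolkit:

library(data.table)

# hypothetical n-gram lookup table: history -> candidate next word -> log probability
model <- data.table(
        history = c('TO BE', 'TO BE', 'TO BE'),
        word    = c('ABLE', 'A', 'THE'),
        logprob = c(-1.2, -1.5, -2.0)
)
setkey(model, history)              # keyed for fast binary-search lookups

# prediction amounts to retrieving the candidates for a history and taking
# the most probable word
candidates <- model['TO BE']
candidates[order(-logprob)][1, word]

Keeping only what is needed for lookups (histories, candidate words and log probabilities) and pruning low-probability n-grams should help meet the size and runtime constraints described in step 2.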