Executive Summary

This report summarises the exploratory data analysis conducted on the dataset provided for the JHU Data Science Capstone Project. The raw data has been downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip and unzipped locally.
 
The data comprises three text files (news articles, tweets and blogs) in four languages (English, German, Finnish and Russian). The objective of the project is to produce a predictive text algorithm based on the data set provided. The analysis for this report focuses on the English text files. The resulting algorithm will be embedded in a web app using Shiny.
 
The bulk of the text mining operations use the quanteda library which, after much experimentation, proved faster than the equivalent tm functions. The process for transforming the text into a usable dictionary in quanteda is as follows:
text -> corpus -> tokens -> ngrams -> dfm -> dictionary
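
As a toy illustration of this pipeline, a minimal quanteda sketch is shown below (the two example texts and object names are illustrative only; the dictionary step is built later from the dfm frequencies):

    library(quanteda)
    
    ## toy run of the pipeline: text -> corpus -> tokens -> ngrams -> dfm
    txt   <- c(d1 = "the cat sat on the mat", d2 = "the dog sat on the rug")
    corp  <- corpus(txt)                                      ## text -> corpus
    toks  <- tokens(corp, remove_punct = TRUE)                ## corpus -> tokens
    bigrm <- tokens_ngrams(toks, n = 2L, concatenator = " ")  ## tokens -> ngrams
    xdfm  <- dfm(bigrm)                                       ## ngrams -> dfm
    topfeatures(xdfm, 3)                                      ## most frequent bigrams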
 
The process and analysis are described below. Full code for running this reproducible analysis can be found at the end of the document.
 

Setting up the Environment

The following code loads the required libraries, sets the working directory and creates the vectors containing file information.

    library(tm); library(readr); library(stringi); library(ggplot2)
    library(SnowballC); library(quanteda); library(RColorBrewer)
    
    setwd("~/DataScience/CapstoneJHU")
    dirData <- "~/DataScience/CapstoneJHU/data/"
    
    lstLanguage <- c("de_DE","en_US","fi_FI","ru_RU")
    lstDoctype <- c(".blogs",".news",".twitter")
    lstFileDE <- c("de_DE.blogs","de_DE.news","de_DE.twitter")
    lstFileEN <- c("en_US.blogs","en_US.news","en_US.twitter")
    lstFileFI <- c("fi_FI.blogs","fi_FI.news","fi_FI.twitter")
    lstFileRU <- c("ru_RU.blogs","ru_RU.news","ru_RU.twitter")

 

Reading the Text Files

For the English language analysis, there are over 3 million lines of text to be analysed. The size of each file and the contents of its first line are shown below.

## [1] "~/DataScience/CapstoneJHU/data/en_US/en_US.blogs.txt"
##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   208361438   171926076
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â\200œgodsâ\200\235."
## [1] "~/DataScience/CapstoneJHU/data/en_US/en_US.news.txt"
##       Lines LinesNEmpty       Chars CharsNWhite 
##       77259       77259    15683765    13117038
## [1] "He wasn't home alone, apparently."
## [1] "~/DataScience/CapstoneJHU/data/en_US/en_US.twitter.txt"
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162385035   134371036
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."

 

Converting the text file to a corpus

A corpus is designed to be a library of documents that have been converted to standardised text; it is intended to be a static container of original reference texts. In this step we also load the bad-word list, saved locally from https://www.cs.cmu.edu/~biglou/resources/, which is used later to remove profanity from the text.


To enable faster processing, a random sample of 20% of the text is read. The data is lightly preprocessed before being converted to a corpus. This initial preprocessing speeds up the later tokenisation and performs operations which are not available within the quanteda library, including:
* removing HTML-style tags
* removing URLs
* removing Twitter hashtags (not handled well in quanteda)
* removing non-alphanumeric characters

    preprocessText <- function(tData) {
        tSample <- 0.20                               ## sample 20% of the data
        tData <- sample(tData,floor(length(tData)*tSample),replace=FALSE)
        tData <- gsub("<.*?>", "", tData)             ## remove html style tags
        a <- iconv(tData,from="UTF-8",to="ASCII",sub="")  ## drop non-ASCII characters
        a <- gsub("(https?)?://[^[:blank:]]*", " ",a) ## remove urls
        a <- gsub("[[:blank:]]#[^[:blank:]]*", " ",a) ## remove hashtags
        a <- gsub("[[:digit:]]","",a)                 ## remove numbers
        a <- gsub("[[:punct:]]","",a)                 ## remove punctuation
        a <- gsub("[^[:alnum:][:space:]]"," ",a)      ## remove non-alphanumeric characters
        corpus(a) }
    
    a1 <- preprocessText(a1)
    a2 <- preprocessText(a2)
    a3 <- preprocessText(a3)
    x <- a1 + a2 + a3
    saveRDS(x,getFilename(tDir,tLanguage,tDocName,".corpus.rds"))

 

Converting corpus to tokens

The corpus is converted into a token dataset using quanteda and saved to an .rds file. Tokenising cleans the text and splits it into its individual features (words or terms). The second round of preprocessing:
* removes punctuation
* removes numbers
* removes symbols
* removes stopwords (e.g. “to”, “the”, “and”)
* removes Twitter hashtags
* removes hyphens
* removes any profanity

    x <-tokens(x,remove_punct=TRUE, remove_numbers=TRUE, 
        remove_symbols=TRUE, remove_twitter=TRUE, remove_hyphens=TRUE,
        include_docvars=FALSE, what = "word")
    x <-tokens_remove(x,stopwords(substr(tLanguage,1,2)))
    x <-tokens_remove(x,lstProfanity)
    saveRDS(x,getFilename(tDir,tLanguage,tDocName,".token.rds"))

 

Converting tokens to n-grams

In Natural Language Processing (NLP), an n-gram is a contiguous sequence of n items (here, words) from a given text. The n-grams are constructed from the tokens created in the previous step.
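
For example, a short sentence produces the following bigrams and trigrams (a toy illustration using the same tokens_ngrams() call as in getNgram below):

    library(quanteda)
    
    ## toy example: bigrams and trigrams of a single sentence
    toks <- tokens("thanks for the follow")
    tokens_ngrams(toks, n = 2L, concatenator = " ")
    ## "thanks for"  "for the"  "the follow"
    tokens_ngrams(toks, n = 3L, concatenator = " ")
    ## "thanks for the"  "for the follow"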

    readTokenFile <- function(tDir,tLanguage) {
        tDocName <- tLanguage
        fileRDS <-getFilename(tDir,tLanguage,tDocName,".token.rds")
        a <- readRDS(fileRDS)
        a } 
    getNgram <- function(x,n,nskip,tDir,tLanguage,tExtension) {
        xgram <- tokens_ngrams(x,n,nskip,concatenator=" ")
        saveRDS(xgram,getFilename(tDir,tLanguage,"token",tExtension)) 
        rm(xgram) }  
    
    x <- readTokenFile(tDir,tLanguage)
    getNgram(x,n=1L,nskip=0L,tDir,tLanguage,".x1gram.rds")
    getNgram(x,n=2L,nskip=0L,tDir,tLanguage,".x2gram.rds")
    getNgram(x,n=3L,nskip=0L,tDir,tLanguage,".x3gram.rds")
    getNgram(x,n=2L,nskip=1L,tDir,tLanguage,".x1skip.rds")

 

Converting the n-grams to a document frequency matrix

The n-grams are then converted to a document frequency matrix (dfm). A document frequency matrix is a table with one row per document and one column per unique token (feature), containing the count of each feature in each document; summing over documents gives the total count of each feature in the corpus.
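
A toy example with two one-sentence documents shows the structure (documents as rows, features as columns; topfeatures() gives the totals across documents):

    library(quanteda)
    
    ## toy document-feature matrix for two short documents
    toks <- tokens(c(doc1 = "one day at a time", doc2 = "one day one dream"))
    dfm(toks)                   ## 2 documents x 6 features with per-document counts
    topfeatures(dfm(toks), 3)   ## "one" (3) and "day" (2) are the top features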

    readNgramFile <- function(tDir,tLanguage,tExtension) {
        tDocName <- tLanguage
        fileRDS <-getFilename(tDir,tLanguage,"token",tExtension)
        a <- readRDS(fileRDS)
        a }    
    getdfm <- function(tDir,tLanguage,tExtension) {
        x <- readNgramFile(tDir,tLanguage,tExtension)
        xdfm <- dfm(x)
        xdfm <- dfm_sort(xdfm,decreasing=TRUE,margin="features")
        saveRDS(xdfm,getFilename(tDir,tLanguage,"dfm",tExtension))
        xdfm } 
    
    x.1skip <- getdfm(tDir,tLanguage,".x1skip.rds")
    x.1gram <- getdfm(tDir,tLanguage,".x1gram.rds")
    x.2gram <- getdfm(tDir,tLanguage,".x2gram.rds")
    x.3gram <- getdfm(tDir,tLanguage,".x3gram.rds")

 

Unigram Analysis

The most common unigrams and their absolute frequencies are shown below. When stopwords are retained, the top 15 unigrams are all stopwords, so the analysis below excludes them.

##    one   just    can   like   time    get    now     im    new   also 
##  15235  12251  11865  11674  11093   8915   7450   7449   7298   7233 
##    day   know  first people   make   said   back     us   well little 
##   6832   6809   6690   6571   6535   6508   6464   6453   6343   6231 
## really   much    see   good   love 
##   6117   6062   6023   5997   5892

 
The table below shows, for each level of coverage (the cumulative share of word occurrences in the corpus), the number of unique unigrams required, both as a count (cumfreq) and as a share of all unigram features (cumfreqpct). Fewer than 13% of the unigrams account for 90% of the words in the corpus.

##    coverage cumfreq   cumfreqpct
## 1       0.1      33 0.0002794905
## 2       0.2     109 0.0009231655
## 3       0.3     259 0.0021935768
## 4       0.4     515 0.0043617454
## 5       0.5     940 0.0079612440
## 6       0.6    1671 0.0141523816
## 7       0.7    3042 0.0257639406
## 8       0.8    6011 0.0509096145
## 9       0.9   15124 0.1280913341
## 10      1.0  118072 1.0000000000
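
The coverage figures above mirror the ngramCoverage() helper in the Code section; a minimal sketch, assuming x.1gram is the sorted unigram dfm created earlier:

    ## cumulative share of word occurrences covered by the top-ranked unigrams
    tbl <- textstat_frequency(x.1gram)
    tbl$cumfreq <- cumsum(tbl$frequency) / sum(tbl$frequency)
    ## number (and share) of unique unigrams needed for each coverage level
    coverage <- seq(0.1, 1.0, by = 0.1)
    cumcount <- sapply(coverage, function(p) sum(tbl$cumfreq <= p))
    data.frame(coverage, cumfreq = cumcount, cumfreqpct = cumcount / nrow(tbl))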

 

Bigram Analysis

The most common bigrams are shown below.

##  [1] "dont know"     "right now"     "years ago"     "last year"    
##  [5] "new york"      "even though"   "last week"     "can see"      
##  [9] "make sure"     "first time"    "feel like"     "im sure"      
## [13] "im going"      "last night"    "can get"       "every day"    
## [17] "high school"   "dont want"     "one day"       "little bit"   
## [21] "dont think"    "long time"     "united states" "year old"     
## [25] "let know"

 
There are many bigrams (about 86%) which occur only once in the text; these low-frequency bigrams will be pruned out when creating the dictionary for the algorithm (a dfm_trim() sketch follows the table below). In the table, freqcount is the number of occurrences, featcount the number of features with that count, and featfreq the cumulative share of all features.

##    freqcount featcount  featfreq
## 1          1   1381707 0.8628814
## 2          2    124964 0.9409219
## 3          3     38558 0.9650016
## 4          4     18195 0.9763644
## 5          5     10031 0.9826288
## 6          6      6243 0.9865276
## 7          7      4269 0.9891936
## 8          8      2993 0.9910627
## 9          9      2232 0.9924566
## 10        10      1747 0.9935476
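
Such pruning can be done with quanteda's dfm_trim(); a minimal sketch, assuming x.2gram is the bigram dfm created earlier (the minimum count of 2 is illustrative, not a final choice):

    ## drop bigram features that occur only once in the corpus
    nfeat(x.2gram)                                       ## features before pruning
    x.2gram.trim <- dfm_trim(x.2gram, min_termfreq = 2)
    nfeat(x.2gram.trim)                                  ## features after pruning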

 

Trigram Analysis

The most common trigrams are shown below.

##  [1] "vested interests vested"    "interests vested interests"
##  [3] "new york city"              "im pretty sure"            
##  [5] "new york ny"                "couple weeks ago"          
##  [7] "preheat oven degrees"       "cant wait see"             
##  [9] "new york times"             "two years ago"             
## [11] "lets just say"              "love love love"            
## [13] "world war ii"               "dont get wrong"            
## [15] "im looking forward"         "please let know"           
## [17] "please feel free"           "many years ago"            
## [19] "three years ago"            "two weeks ago"

 
As with bigrams, there is a high incidence of trigrams which occur fewer than five times within the corpus analysed. In addition, the number of unique features grows rapidly as n increases, resulting in much larger data sets.

##    freqcount featcount  featfreq
## 1          1   2040614 0.9841906
## 2          2     24686 0.9960967
## 3          3      4280 0.9981610
## 4          4      1704 0.9989828
## 5          5       766 0.9993523
## 6          6       420 0.9995548
## 7          7       247 0.9996740
## 8          8       149 0.9997458
## 9          9       127 0.9998071
## 10        10        98 0.9998543

 

Skipgram Analysis

The most common 1-skipgrams are shown below. A skip-gram is a sequence of words in which a fixed number of intervening words are skipped; for this analysis, the middle word of each trigram has been skipped to create the 1-skipgrams.

##  [1] "vested interests vested"    "interests vested interests"
##  [3] "new york city"              "im pretty sure"            
##  [5] "new york ny"                "couple weeks ago"          
##  [7] "preheat oven degrees"       "cant wait see"             
##  [9] "new york times"             "two years ago"             
## [11] "lets just say"              "love love love"            
## [13] "world war ii"               "dont get wrong"            
## [15] "im looking forward"         "please let know"           
## [17] "please feel free"           "many years ago"            
## [19] "three years ago"            "two weeks ago"             
## [21] "new years eve"              "level mp cost"             
## [23] "happy new year"             "dont even know"            
## [25] "couple years ago"

 

Observations and Further Analysis

The exploratory data analysis (EDA) provides a framework for understanding the data. Because of the high memory usage, intermediate results were saved to .rds files at key points in the transformation pipeline. In all, the EDA above took around 10 minutes to run.
 
Considerations for building the dictionary for the next stage of the predictive text model include how to:
* trim the n-grams and determine an appropriate dictionary size, balancing the trade-off between prediction speed and accuracy based on the coverage and frequency of the n-grams in the text
* determine whether 4-grams and higher add accuracy to the prediction
* determine the relative contribution of news articles vs blogs vs tweets to the accuracy of the overall model, to match the informal typing style expected for predictive text
* assess the effect on model accuracy of retaining stop words and of stemming when creating the n-gram dictionary

After the dictionaries have been created, the likely approach to predicting the next word will be based on the following set of rules (a sketch of this backoff lookup follows the list):
* if there are 3 or more words, prediction will be based on 3- and 4-gram windows
* if there are 2 words, prediction will be based on 2-, 3- and 4-gram windows
* if there is 1 word, prediction will be based on 2- and 3-gram windows
* during the typing of any word, the prediction and text correction will use unigrams
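
A minimal sketch of this backoff lookup is shown below. The lookup tables (tbl3gram, tbl2gram) and their entries are toy placeholders, not outputs of the analysis above; the real tables will be built from the pruned n-gram dictionaries.

    ## toy backoff lookup: try the longest available context first, then back off
    ## (tbl3gram / tbl2gram are hypothetical placeholders mapping a prefix to
    ##  candidate next words, sorted by frequency)
    tbl3gram <- list("happy new" = c("year", "years"))
    tbl2gram <- list("new"       = c("york", "year", "jersey"))
    
    predictNext <- function(words, nChoices = 3) {
        cand <- character(0)
        if (length(words) >= 2)      ## bigram context -> 3-gram table
            cand <- c(cand, tbl3gram[[paste(tail(words, 2), collapse = " ")]])
        if (length(words) >= 1)      ## unigram context -> 2-gram table
            cand <- c(cand, tbl2gram[[paste(tail(words, 1), collapse = " ")]])
        head(unique(cand), nChoices)
    }
    
    predictNext(c("happy", "new"))   ## "year" "years" "york"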

Issues to consider when building the algorithm include how to:
* calculate the next-word probabilities for each n-gram in the dictionary, and investigate the Katz back-off model, Naive Bayes and other NLP algorithms
* build training and test datasets to assess and validate the accuracy of next-word prediction
* investigate the likelihood of typing errors and how they affect next-word prediction
* investigate using the tf-idf of each n-gram when modelling the probabilities
 
I will also need to review existing predictive text models (e.g. on mobile phones and in Google search) in order to design the front end of the Shiny app. For example, predictive text within the iPhone Messages app returns the top 3 choices.
 

References

  1. Creating dfm step by step with quanteda https://stackoverflow.com/questions/38931507/create-dfm-step-by-step-with-quanteda
  2. Quanteda Package https://cran.r-project.org/web/packages/quanteda/quanteda.pdf
  3. Introduction to quantitative text analysis using quanteda https://tutorials.quanteda.io/introduction/
  4. Working with Textual Data http://kenbenoit.net/assets/courses/dmslTCD2015/Days5-6_Exercise.html
  5. Natural Language Corpus Data : Beautiful Data http://norvig.com/ngrams/
  6. R Programming/Text Processing https://en.wikibooks.org/wiki/R_Programming/Text_Processing#Comparing_two_strings
  7. A closer look at skip-gram modelling http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf
  8. Quanteda Quick Start Guide https://cran.r-project.org/web/packages/quanteda/vignettes/quickstart.html
     

Code

The full reproducible code for running the above processes is as follows:

    library(tm); library(readr); library(stringi); library(ggplot2)
    library(SnowballC); library(quanteda); library(RColorBrewer)
    
    setwd("~/DataScience/CapstoneJHU")
    dirData <- "~/DataScience/CapstoneJHU/data/"
    
    lstLanguage <- c("de_DE","en_US","fi_FI","ru_RU")
    lstDoctype <- c(".blogs",".news",".twitter")
    lstFileDE <- c("de_DE.blogs","de_DE.news","de_DE.twitter")
    lstFileEN <- c("en_US.blogs","en_US.news","en_US.twitter")
    lstFileFI <- c("fi_FI.blogs","fi_FI.news","fi_FI.twitter")
    lstFileRU <- c("ru_RU.blogs","ru_RU.news","ru_RU.twitter")
    

## -------------------------------------------------------------------------
## FUNCTIONS TO READ FILES
## -------------------------------------------------------------------------

    getFilename <- function(tDir,tLanguage,tDocName,tExtension) {
        tDir <- paste(tDir,tLanguage,"/",sep="")
        tFilename <-paste(tDir,tDocName,tExtension,sep="")
        tFilename }
    
    ## read text File, print no of lines, return data
    readTextFile <- function(tDir,tLanguage,tDocName) {
        fileData <-getFilename(tDir,tLanguage,tDocName,".txt")
        tData <- scan(file = fileData,sep = '\n', what = '', skipNul = TRUE)
        print(fileData)
        print(stri_stats_general(tData)) ## lines of data
        tData }

    ## read Token file
    readTokenFile <- function(tDir,tLanguage) {
        tDocName <- tLanguage
        fileRDS <-getFilename(tDir,tLanguage,tDocName,".token.rds")
        a <- readRDS(fileRDS)
        a }     

    ## read ngram file
    readNgramFile <- function(tDir,tLanguage,tExtension) {
        tDocName <- tLanguage
        fileRDS <-getFilename(tDir,tLanguage,"token",tExtension)
        a <- readRDS(fileRDS)
        a }   
    
    
## --------------------------------------------------------------------------
## PROCESSING FUNCTIONS FOR TEXT TO CORPUS
## --------------------------------------------------------------------------  
    ## convert text to corpus in quanteda and sample x% of data
    preprocessText <- function(tData) {
        tSample <- 0.20  ## sample 20% of the data
        tData <- sample(tData,floor(length(tData)*tSample),replace=FALSE)
        tData <- gsub("<.*?>", "", tData) ## remove html style tags
        a <- iconv(tData,from="UTF-8",to="ASCII",sub="")  ## drop non-ASCII characters
        a <- gsub("(https?)?://[^[:blank:]]*", " ",a) ## remove urls
        a <- gsub("[[:blank:]]#[^[:blank:]]*", " ",a) ## remove hashtags
        a <- gsub("[[:digit:]]","",a)                 ## remove numbers
        a <- gsub("[[:punct:]]","",a)                 ## remove punctuation
        a <- gsub("[^[:alnum:][:space:]]"," ",a)      ## remove non-alphanumeric characters
        corpus(a) }
    

## --------------------------------------------------------------------------
## PROCESSING FUNCTIONS FOR CORPUS TO TOKENS
## --------------------------------------------------------------------------  
    ## reads corpus, preprocessing and tokenising
    tokenText <- function(tDir,tLanguage,lstFile) {
        ## read text file and preprocess
        tDocName <- tLanguage
        a1 <- readTextFile(tDir,tLanguage,lstFile[1])
        a2 <- readTextFile(tDir,tLanguage,lstFile[2])
        a3 <- readTextFile(tDir,tLanguage,lstFile[3])
        a1 <- preprocessText(a1)
        a2 <- preprocessText(a2)
        a3 <- preprocessText(a3)
        x <- a1 + a2 + a3
        saveRDS(x,getFilename(tDir,tLanguage,tDocName,".corpus.rds"))
        ## make tokens
        x <-tokens(x,remove_punct=TRUE, remove_numbers=TRUE, 
            remove_symbols=TRUE, remove_twitter=TRUE, remove_hyphens=TRUE,
            include_docvars=FALSE, what = "word")
        x <-tokens_remove(x,stopwords(substr(tLanguage,1,2)))
        x <-tokens(x,what="fasterword",remove_url=TRUE)
        x <-tokens_remove(x,lstProfanity)
        saveRDS(x,getFilename(tDir,tLanguage,tDocName,".token.rds"))
        x }

    
## --------------------------------------------------------------------------
## PROCESSING FUNCTIONS FOR TOKENS TO NGRAMS
## -------------------------------------------------------------------------- 
    ## calculate n-grams from tokens
    getNgram <- function(x,n,nskip,tDir,tLanguage,tExtension) {
        xgram <- tokens_ngrams(x,n,nskip,concatenator=" ")
        saveRDS(xgram,getFilename(tDir,tLanguage,"token",tExtension)) 
        rm(xgram) }        
    
    ## convert tokens to ngrams, creates x.ngram file
    ngramToken <- function(tDir, tLanguage) {
        x <- readTokenFile(tDir,tLanguage)
        getNgram(x,n=1L,nskip=0L,tDir,tLanguage,".x1gram.rds")
        getNgram(x,n=2L,nskip=0L,tDir,tLanguage,".x2gram.rds")
        getNgram(x,n=3L,nskip=0L,tDir,tLanguage,".x3gram.rds")
        getNgram(x,n=2L,nskip=1L,tDir,tLanguage,".x1skip.rds")
        rm(x)
    }

            
## --------------------------------------------------------------------------
## PROCESSING FUNCTIONS FOR NGRAMS TO DFM
## -------------------------------------------------------------------------- 
    ## coverage by number of features - only works on dfm objects
    ngramCoverage <- function(xgram,n) {
        ## word cloud and top words
        textplot_wordcloud(xgram,max_words=60/n,color=brewer.pal(6,"Dark2"))
        print(topfeatures(xgram,n=25))
        ## word frequency table
        tblgram <- textstat_frequency(xgram)
        tblgram <- data.frame(tblgram, 
            cumfreq = cumsum(tblgram$frequency)/sum(tblgram$frequency),
            length=nchar(tblgram$feature))
        ## print(mean(tblgram$length))
        ## for x% coverage of text, cumfreq of features as abs and % of corpus
        tblgram2 <- data.frame(coverage = 1:10)
        tblgram2$coverage <- tblgram2$coverage/10
        cumfreq <- sapply(tblgram2$coverage, function(x)
            length(which(tblgram$cumfreq<=x)))
        cumfreqpct <- cumfreq/NROW(tblgram$freq)
        tblgram2 <- cbind(tblgram2,cumfreq,cumfreqpct)
        print(tblgram2) 
        ## ggplot(tblgram2,aes(cumfreqpct,coverage))+geom_line()
        ## count of features which appear x times in corpus
        tblgram3 <- data.frame(freqcount = 1:10)
        featcount <- sapply(tblgram3$freqcount, function(x)
            length(which(tblgram$freq==x)))
        featfreq <- cumsum(featcount)/NROW(tblgram$freq)
        tblgram3 <- cbind(tblgram3,featcount,featfreq)
        print(tblgram3)
        rm(tblgram, tblgram2,tblgram3) }
    
    ## calculate dfm from ngrams
    getdfm <- function(tDir,tLanguage,tExtension) {
        x <- readNgramFile(tDir,tLanguage,tExtension)
        xdfm <- dfm(x)
        xdfm <- dfm_sort(xdfm,decreasing=TRUE,margin="features")
        saveRDS(xdfm,getFilename(tDir,tLanguage,"dfm",tExtension))
        xdfm }     

    ## convert ngrams to dfm, trim and create dfm file
    dfmNgram <- function(tDir, tLanguage) {
        getdfm(tDir,tLanguage,".x1skip.rds")
        getdfm(tDir,tLanguage,".x1gram.rds")
        getdfm(tDir,tLanguage,".x2gram.rds")
        getdfm(tDir,tLanguage,".x3gram.rds")

    
## --------------------------------------------------------------------------
## REAL CODE STARTS HERE
## --------------------------------------------------------------------------
## bad word list https://www.cs.cmu.edu/~biglou/resources/
    fileProfanity <- getFilename("data","","bad-words",".txt")
    lstProfanity <- scan(file=fileProfanity,sep ='\n',what='',skipNul=TRUE) 
    
    ## en_US
    ## tokenText : convert text to tokens; creates .corpus and .token file
    system.time(tokenEN <- tokenText(dirData,"en_US",lstFileEN))
    
    ## ngramToken : convert tokens to ngrams, creates x.ngram file
    system.time(ngramEN <- ngramToken(dirData,"en_US"))
    
    ## dfmNgram : convert ngrams to dfm, create dfm file
    system.time(dfmEN <- dfmNgram(dirData,"en_US"))
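    
## --------------------------------------------------------------------------
## SKETCH : REPRODUCING THE FREQUENCY AND COVERAGE TABLES
## --------------------------------------------------------------------------
    ## The tables in the analysis sections above correspond to ngramCoverage()
    ## applied to each dfm; the calls below are a sketch of that step, reading
    ## the dfm objects back from the .rds files created by dfmNgram.
    x.1gram <- readRDS(getFilename(dirData,"en_US","dfm",".x1gram.rds"))
    x.2gram <- readRDS(getFilename(dirData,"en_US","dfm",".x2gram.rds"))
    x.3gram <- readRDS(getFilename(dirData,"en_US","dfm",".x3gram.rds"))
    ngramCoverage(x.1gram, n = 1)
    ngramCoverage(x.2gram, n = 2)
    ngramCoverage(x.3gram, n = 3)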