Natural Language Processing is the attempt to use computers and their data processing capabilities to understand, make use of, or predict text made up of words combined as human beings normally write, say, or read them. While humans are often intuitively capable of these sorts of tasks without much effort, the lack of hard and fast rules means that computers often have difficulty with them. This document contains an exploratory analysis of some input data (texts) and a basic function that predicts the next word in a phrase given the previous 1, 2, or 3 words.
The documents to be considered here comprise a set of tweets (from Twitter), a set of blog entries, and a set of news articles. There is no metadata about these documents, so it is unknown who the writers are, when the data was collected, what the specifics of the storage and capture media were, and so forth.
The code here aggregates the data, samples it, processes it, and creates a simple prediction function.
The data files must be downloaded, unzipped, loaded into R, and processed into a format that facilitates analysis. The original data is located at ‘https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip’. Additional data includes a list of profane words to be removed from the corpus. While there are many options for which word list to use, I ultimately decided on the list at ‘https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en’.
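The download and unzip step is not shown in the code chunks at the end of this document, so here is a minimal sketch of it. The destination paths ('Coursera-SwiftKey.zip', './CapstoneData', './profanity.txt') are assumptions chosen to line up with the paths used when the files are read in later.
#Sketch: fetch and unpack the corpus and the profanity list (destination paths are assumptions)
corpusUrl <- 'https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip'
profanityUrl <- 'https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en'
if (!file.exists('Coursera-SwiftKey.zip')) {
  download.file(corpusUrl, destfile = 'Coursera-SwiftKey.zip', mode = 'wb')
}
if (!dir.exists('./CapstoneData')) {
  unzip('Coursera-SwiftKey.zip', exdir = './CapstoneData')
}
if (!file.exists('./profanity.txt')) {
  download.file(profanityUrl, destfile = './profanity.txt')
}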
After the files are downloaded they must be loaded into R. I read each of the four files (profanity words/phrases, tweets, blog entries, and news articles) into R with readLines.
Once all the files are loaded we see that there are, in total, 377 profanities, 2,360,148 tweets, 1,010,242 news articles, and 899,288 blog posts. In a real application we might train the text prediction model using only documents like the ones involved in the prediction request (tweets, text messages, etc.). In this case, however, we will simply use all of the documents available.
Aggregating the documents results in a set of 4,269,678 lines (or, equivalently, a corpus with that many documents). This is far too much data for the capabilities of my computer or for any mobile device, so I took a 1% sample of the data.
This results in a sample of 42,494 rows (lines/documents) for building the prediction model. The R package quanteda is used to preprocess and analyze the data.
Quanteda can divide the documents into single-word “tokens”. From there we can also build tokens of size 2, 3, and 4.
As an example, consider the 3grams of the tweet “I am a really cool blue dog”. These would be “I_am_a”, “am_a_really”, “a_really_cool”, “really_cool_blue”, and “cool_blue_dog”. (During processing, underscores stand in for the spaces inside an ngram, but this isn’t something we need to worry about.)
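As a quick check, this example can be reproduced directly with quanteda (a sketch; the sentence is just the made-up tweet above):
library(quanteda)
#Tokenize the example tweet and form its 3grams
exampleTokens <- tokens('I am a really cool blue dog')
tokens_ngrams(exampleTokens, n = 3)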
At this point we generate document-feature matrices. These are simply matrices with one row per document (tweet, blog post, news article, etc.) and one column per feature (the individual 1grams, 2grams, 3grams, etc.). This gives us an easy way to aggregate and understand the distribution of the ngrams of various lengths. What we find is that the 1gram matrix has 57,841 features/columns, the 2gram matrix has 439,328, the 3gram matrix has 763,523, and the 4gram matrix has 839,493. These numbers grow as the size of the ngram increases, which is what we’d expect given the number of different ways English words can be combined.
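To make the structure concrete, here is a sketch of a document-feature matrix built from two made-up documents (the toy text is mine, not from the corpus):
library(quanteda)
#Each row of the dfm is a document, each column a feature (here, a 1gram)
toyDfm <- dfm(tokens(c(doc1 = 'the dog saw the cat', doc2 = 'the cat ran')))
toyDfm
#The cell counts record how many times each feature appears in each document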
We can sum across the rows of each of these document-feature matrices. This gives, roughly, a proxy for the length of each document. Given the prevalence of many short documents (probably tweets), we would expect the resulting histograms to show a large number of documents containing only a small number of ngrams.
And that’s exactly what we see. The histogram bar at length 1 gets quite tall as the size of the ngram increases, indicating a lot of very short documents that consist entirely of a single ngram.
Next, we can sum down the columns of each document-feature matrix to get an idea of how common particular ngrams are. Here I’ve adjusted the plots to show a density rather than a raw count, which shows how the ngram frequencies are distributed overall. As the graphs show, appearing only once becomes increasingly common as the size of the ngram increases. Basically, most of the 2grams, 3grams, and 4grams are individually quite rare.
Next, we can look at the numbers behind how common the most frequent ngrams are at each length. In the displays below I have included both the gross count and that count divided by the number of documents, expressed as a percentage. With the 1grams we see that the word “the” appears, on average, more than once per document (its value exceeds 100%); given how common a word it is in English, this is not surprising. Note that as the size of the ngram increases, the share of documents that contain even the most common ngrams drops sharply. While the most common 1grams show up almost or more than half the time, the most common 2grams appear in only about 10% of the documents, the most common 3grams in less than 1%, and even the most common 4grams in less than 0.2% of the documents.
## the to and a of
## 47031 27381 23836 23413 20064
## the to and a of
## 110.67680 64.43498 56.09262 55.09719 47.21608
## of_the in_the to_the for_the on_the
## 4211 4105 2126 1966 1906
## of_the in_the to_the for_the on_the
## 9.909634 9.660187 5.003059 4.626536 4.485339
## one_of_the a_lot_of thanks_for_the going_to_be i_want_to
## 309 300 241 178 164
## one_of_the a_lot_of thanks_for_the going_to_be i_want_to
## 0.7271615 0.7059820 0.5671389 0.4188827 0.3859368
## the_end_of_the for_the_first_time at_the_end_of
## 75 68 67
## the_end_of_the for_the_first_time at_the_end_of
## 0.1764955 0.1600226 0.1576693
Stopwords are words that are very common in the language and as such are not very useful for understanding the content of a document. As with profanity, exactly which words belong on the stopword list is up for debate. For this project I used the default English list returned by stopwords('en'). Exploratory analysis indicates that very few of the single-word features (tokens, 1grams) in this sample are stopwords: only 173 out of 57,841. Clearly none of the 2grams, 3grams, or 4grams can be stopwords, because stopwords are single words.
As a curiosity, it’s interesting to see how many of the ngrams at each length reach a given frequency. Using an arbitrary threshold of 75 occurrences, we see that 1,411 1grams appear at least 75 times. For 2grams this number is 801, for 3grams it’s 51, and for 4grams it is all the way down to 1. These results corroborate what we’ve seen earlier: larger ngrams are more and more uncommon.
Lexical diversity measures the variety of distinct words used in a document relative to its length. In general it’s preferable to compare lexical diversity across documents of the same kind; for example, comparing tweets to news articles might not make a whole lot of sense. Unfortunately, in this project that is the only option. The following plots show the lexical diversity (type-token ratio) for the different ngram lengths in our training set. Note the rather clear divisions in each plot, which likely correspond to the different kinds of documents (tweets, blog posts, and news articles).
The results aren’t surprising. Tweets are usually very short, there are a lot of them, and they make up a large portion of this set of documents. Also, the overall number of documents in this sample is rather small, so word choice, sentence length, and phrasing are likely to vary quite a lot relative to the size of the document set.
Document similarity is also something that could be investigated. In this project it’s not terribly useful because of the three different kinds of documents being considered.
As a final note, I created a basic prediction algorithm. Step 1 is to take an input phrase. That input has a length in words, so we consider the table of ngrams one word longer than the input. By matching the input against the beginning of those larger ngrams we can find the most frequent nth word following the given n-1 words.
As an example: Assume the input is “I want to”. That has 3 words in it, so we will consider the document-feature-matrix for the 4grams. If the most frequent 4grams that start with “I_want_to” are “I_want_to_go”, “I_want_to_eat”, and “I_want_to_sleep” in order of decreasing frequency then this algorithm predicts that the next word after “I want to” is “go”.
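A sketch of that lookup, assuming the sorted 4gram frequency table en4GramFreqs that is built in the code chunks at the end of this document (note that the stored features are lowercased by the dfm step):
#Find the most frequent 4grams that begin with the lowercased input "i want to"
candidates <- en4GramFreqs[grep('^i_want_to_', names(en4GramFreqs))]
head(candidates, 3)
#The predicted next word is the final token of the top-ranked candidate
sub('.*_', '', names(candidates)[1])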
As an example of the output of the basic prediction function, consider the input “You’re using”. The output is “rpgmaker” as the predicted next word.
A combination of tweets, blog posts, and news articles can be used to create a next-word prediction algorithm. Its accuracy and performance can be hindered by several factors, such as the small size of the sample that a typical computer or mobile device can handle. However, the quanteda package provides useful tools for preprocessing and analyzing text data in order to create such a prediction algorithm.
Here are the different code chunks that went into this analysis.
Reading the data into R.
conEnProfanity <- file('./profanity.txt', 'r')
enProfanityData <- readLines(con = conEnProfanity, skipNul = TRUE, warn = FALSE, encoding = 'UTF-8')
close(conEnProfanity)
length(enProfanityData) #377
conEnTwitter <- file('./CapstoneData/final/en_US/en_US.twitter.txt', 'r')
enTwitterData <- readLines(con = conEnTwitter, skipNul = TRUE, warn = FALSE, encoding = 'UTF-8')
close(conEnTwitter)
length(enTwitterData) #2360148
conEnBlogs <- file('./CapstoneData/final/en_US/en_US.blogs.txt', 'r')
enBlogsData <- readLines(con = conEnBlogs, skipNul = TRUE, warn = FALSE, encoding = 'UTF-8')
close(conEnBlogs)
length(enBlogsData) #899288
conEnNews <- file('./CapstoneData/final/en_US/en_US.news.txt', 'rb')
enNewsData <- readLines(con = conEnNews, skipNul = TRUE, warn = FALSE, encoding = 'UTF-8')
close(conEnNews)
length(enNewsData) #1010242
Combine the documents.
allDocs <- c(enTwitterData, enBlogsData, enNewsData)
length(allDocs) #4,269,678
Sample 1% of the total documents.
sampleDoc <-
function(doc, perc) {
set.seed(20190805)
a <- sample(
c(TRUE, FALSE),
size = length(doc),
replace = TRUE,
prob = c(perc, 1-perc)
)
doc[a]
}
allDocsSample <- sampleDoc(allDocs, perc = .01)
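#With perc = .01 and the seed above, the sample contains 42494 of the 4,269,678 documents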
Use the Quanteda package.
library(quanteda)
Tokenize the documents.
#Generate unigrams
docTokens <- function(doc) {
a <- tokens(
doc,
remove_punct = TRUE,
remove_numbers = TRUE,
remove_symbols = TRUE
)
a <- tokens_remove(a, pattern = enProfanityData)
a
}
allDocTokens <- docTokens(allDocsSample)
length(allDocTokens)
Generate all of the ngrams.
#generate ngrams
docNGram <- function(toks, num) {
a <- tokens_ngrams(toks, n = num)
a
}
all1Grams <- docNGram(allDocTokens, num = 1)
all2Grams <- docNGram(allDocTokens, num = 2)
all3Grams <- docNGram(allDocTokens, num = 3)
all4Grams <- docNGram(allDocTokens, num = 4)
Generate the document-feature-matrices.
#generate the document-feature-matrices
dfmat1 <- dfm(all1Grams)
ndoc(dfmat1) #42494
nfeat(dfmat1) #57841
dfmat2 <- dfm(all2Grams)
ndoc(dfmat2) #42494
nfeat(dfmat2) #439,328
dfmat3 <- dfm(all3Grams)
ndoc(dfmat3) #42494
nfeat(dfmat3) #763,523
dfmat4 <- dfm(all4Grams)
ndoc(dfmat4) #42494
nfeat(dfmat4) #839,493
Histograms of the row sums of the document-feature-matrices, a rough proxy for the length of each document.
set.seed(Sys.time())
yran <- c(0,5000)
hist(
rowSums(dfmat1),
ylim = yran,
xlim = c(0, 200),
xlab = 'Number of 1grams',
main = 'How many 1Grams in the document'
, breaks = c(0:800)
, col = 'forestgreen'
)
hist(
rowSums(dfmat2),
ylim = yran,
xlim = c(0, 200),
xlab = 'Number of 2grams',
main = 'How many 2Grams in the document'
, col = 'steelblue'
, breaks = c(0:800)
)
hist(
rowSums(dfmat3),
ylim = yran,
xlim = c(0, 200),
xlab = 'Number of 3grams',
main = 'How many 3Grams in the document'
, col = 'yellow'
, breaks = c(0:800)
)
hist(
rowSums(dfmat4),
ylim = yran,
xlim = c(0, 200),
xlab = 'Number of 4grams',
main = 'How many 4Grams in the document'
, col = 'purple'
, breaks = c(0:800)
)
Histograms of the column sums of the document-feature-matrix. How often does each particular ngram appear in the entire set of documents?
yran2 <- c(0,1)
hist(
colSums(dfmat1),
xlim = c(0, 30),
ylim = yran2,
xlab = '1grams',
main = 'Number of 1Grams',
breaks = range(colSums(dfmat1))[2] - range(colSums(dfmat1))[1]
,freq = FALSE
, col = 'forestgreen'
)
hist(
colSums(dfmat2),
xlim = c(0, 30),
ylim = yran2,
xlab = '2grams',
main = 'Number of 2Grams',
breaks = range(colSums(dfmat2))[2] - range(colSums(dfmat2))[1]
,freq = FALSE
, col = 'steelblue'
)
hist(
colSums(dfmat3),
xlim = c(0, 30),
ylim = yran2,
xlab = '3grams',
main = 'Number of 3Grams',
breaks = range(colSums(dfmat3))[2] - range(colSums(dfmat3))[1]
,freq = FALSE
, col = 'yellow'
)
hist(
colSums(dfmat4),
xlim = c(0, 30),
ylim = yran2,
xlab = '4grams',
main = 'Number of 4Grams',
breaks = range(colSums(dfmat4))[2] - range(colSums(dfmat4))[1]
,freq = FALSE
, col = 'purple'
)
hist(
colSums(dfmat4),
xlim = c(0, 30),
ylim = c(0,.003),
xlab = '4grams',
main = 'Number of 4Grams - Detail',
breaks = range(colSums(dfmat4))[2] - range(colSums(dfmat4))[1]
,freq = FALSE
, col = 'purple'
)
How often do the most frequent features at each ngram size occur?
siz <- ndoc(dfmat1)
topfeatures(dfmat1, 5);topfeatures(dfmat1, 5)/siz*100
# the to and a of
# 47031 27381 23836 23413 20064
# 110.67680  64.43498  56.09262  55.09719  47.21608
#The word 'the' appears more than once in a substantial number of the
#documents
topfeatures(dfmat2, 5);topfeatures(dfmat2, 5)/siz*100
# of_the in_the to_the for_the on_the
# 4211 4105 2126 1966 1906
#  9.909634  9.660187  5.003059  4.626536  4.485339
#As noted in the histograms there is a strong correlation between the 1grams and the 2grams
topfeatures(dfmat3, 5);topfeatures(dfmat3, 5)/siz*100
# one_of_the a_lot_of thanks_for_the going_to_be i_want_to
# 309 300 241 178 164
# 0.7271615 0.7059820 0.5671389 0.4188827 0.3859368
topfeatures(dfmat4, 3);topfeatures(dfmat4, 3)/siz*100
# the_end_of_the for_the_first_time at_the_end_of
#             75                 68            67
#      0.1764955          0.1600226     0.1576693
How many features are removed if stopwords are not included in the 1grams to be considered?
#Count the 1gram features, then count them again with stopwords removed;
#the difference is the number of features that are stopwords
nfeat(dfmat1) - nfeat(dfm_remove(dfmat1, pattern = stopwords('en'))) #173
nfeat(dfmat1) #57841
#173 of the tokens in dfm1 are stopwords
#The other matrices will have no stopwords in them because the tokens
#comprise multiple words
How many frequent features exist for each size ngram?
#How many features have a frequency at or above the minimum you set?
nfeat(dfm_trim(dfmat1, min_termfreq = 75)) #only 1411 bigger than freq = 75
nfeat(dfm_trim(dfmat2, min_termfreq = 75)) #only 801 bigger than freq = 75
nfeat(dfm_trim(dfmat3, min_termfreq = 75)) #only 51 bigger than freq = 75
nfeat(dfm_trim(dfmat4, min_termfreq = 75)) #only 1 bigger than freq = 75
#This corroborates the histograms, which show extraordinarily low density
#at the high-frequency end of the distribution
What does the lexical diversity of the documents look like for each size of ngram?
#Lexical diversity really has to be calculated over documents of the same type.
#It doesn't make sense to compare tweets to news articles.
#In general the lexical diversity for these documents, as presented, is high.
#The plot indicates that there are a lot of different words used in the twitter sample
plot(textstat_lexdiv(dfmat1)$TTR, xlab = NULL, ylab = "TTR", main = 'Unigram Lexical Diversity')
plot(textstat_lexdiv(dfmat2)$TTR, xlab = NULL, ylab = "TTR", main = 'Bigram Lexical Diversity')
plot(textstat_lexdiv(dfmat3)$TTR, xlab = NULL, ylab = "TTR", main = 'Trigram Lexical Diversity')
plot(textstat_lexdiv(dfmat4)$TTR, xlab = NULL, ylab = "TTR", main = 'Quadgram Lexical Diversity')
#Overall this shows that the included documents contain a lot of different
#words for their lengths
How is the final model constructed?
en2GramFreqs <- sort(colSums(dfmat2), decreasing = TRUE)
en3GramFreqs <- sort(colSums(dfmat3), decreasing = TRUE)
en4GramFreqs <- sort(colSums(dfmat4), decreasing = TRUE)
nextWord <- function(input) {
#input string
input <- tolower(input)
#how many words in input string = n-1?
noInput <- length(unlist(strsplit(input, split = ' ')))
#Get the list of ngrams that's already stored
library(stringr)
inputReplaced <-
str_replace_all(string = input,
pattern = ' ',
replacement = '_')
if(!(grepl(pattern = '_$', x = inputReplaced))) {inputReplaced <- paste0(inputReplaced, '_')}
if (noInput == 1) {
inputGrams <- en2GramFreqs
} else if (noInput == 2) {
inputGrams <- en3GramFreqs
} else if (noInput == 3) {
inputGrams <- en4GramFreqs
}
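  #inputs longer than three words are not handled by this basic model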
#use regexp to match the n-1 input words to the beginning of the ngram
num <-
grep(pattern = paste0('^', inputReplaced),
x = names(inputGrams))[1]
#inputGrams[num]
#choose the next word by finding the final word in that ngram
maxIndexDelim <-
max(gregexpr(pattern = '_', text = names(inputGrams)[num])[[1]])
bestWord <-
substr(names(inputGrams[num]), maxIndexDelim + 1, nchar(names(inputGrams[num])))
bestWord
}
What’s an example of the prediction function working?
nextWord('You\'re using')
## [1] "rpgmaker"