Project Description

This capstone project gives us the opportunity to apply the skills we have learned in the first nine modules of the Data Science Specialization to the area of natural language processing (NLP).

Data for this project is provided by SwiftKey, the industry partner for the Capstone Project. Our ultimate goal is to build a predictive algorithm that can guess the next word as a user enters text on a keyboard. For example, after “I love…” the algorithm might suggest “you.”

Our main deliverable will be a Shiny app that demonstrates the algorithm.

Packages

We will make use of the following packages.

library(quanteda)
library(ggplot2)
library(tm)

Data

Data sets are available on the course webpage (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip).

The corpora were collected from publicly available sources by a web crawler. Although files are provided in several languages, we are instructed to work only with the English files. These are three plain-text files containing a collection of blog posts, news articles, and tweets, respectively.

Size and Structure of English Data Set

The data sets are fairly large (roughly 167-210 MB).
It’s informative to look at the structure of the data (lines, tokens, sentences).

We can get the total number of lines in each file using readLines(), or use the quanteda package for more general summary information.

# set working directory
setwd("~/Desktop/texts")
# read files to get line count
con <- file("~/Desktop/CapstoneFiles/en_US.blogs.txt")
length(readLines(con, skipNul = TRUE))
close(con)

con <- file("~/Desktop/CapstoneFiles/en_US.news.txt")
length(readLines(con, skipNul = TRUE))
close(con)

con <- file("~/Desktop/CapstoneFiles/en_US.twitter.txt")
length(readLines(con, skipNul = TRUE))
close(con)

# build a quanteda corpus of the 3 text files to get summary information
cname <- file.path("~", "Desktop", "CapstoneFiles")
MyCorpus <- corpus(VCorpus(DirSource(cname)))
summary(MyCorpus)
File                 Size in MB   Line Count   Tokens       Sentences
en_US.blogs.txt           210.2      899,288   42,840,097   2,077,533
en_US.news.txt            205.8    1,010,242   39,918,312   1,868,674
en_US.twitter.txt         167.1    2,360,148   36,719,689   2,598,128

Sampling

The course instructors recommended working with a smaller subset of the data. We will select that subset through random sampling and use the representative sample to infer facts about the full data set.

The instructors suggested using the rbinom() function in R to “flip a biased coin” and decide whether each line of text is sampled or not. They also suggested writing the sub-sample to a separate file so that the sample does not have to be recreated each time. I wrote a function that randomly samples a given percentage of the lines in a file and saves the sample as a new file in a subdirectory (the code is suppressed, as instructed by the course mentor on the discussion forum). I sampled 1% of each file and saved each sample in a separate text file to work with. I will likely change this percentage in the future, but a sample this small lets everything run quickly for this exploratory milestone report.
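
For reference, here is a minimal sketch of what such a sampling function might look like (an illustration only, not the suppressed project code; the file paths and output names are assumptions):

# illustrative sketch: keep each line with probability pct by "flipping a
# biased coin" with rbinom(), and write the kept lines to a SampleFiles
# subdirectory so the sample can be reused
samplingFile <- function(type, pct) {
  inFile <- file.path("~", "Desktop", "CapstoneFiles",
                      paste0("en_US.", type, ".txt"))
  outDir <- file.path("~", "Desktop", "CapstoneFiles", "SampleFiles")
  if (!dir.exists(outDir)) dir.create(outDir)
  lines <- readLines(inFile, skipNul = TRUE)
  keep  <- rbinom(length(lines), size = 1, prob = pct) == 1
  writeLines(lines[keep], file.path(outDir, paste0("sample_", type, ".txt")))
}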

# take 1% of each file
samplingFile("blogs", 0.01)
samplingFile("news", 0.01)
samplingFile("twitter", 0.01)

Cleaning

We are asked to remove profanity and other words we do not want to predict from the data set. I found several lists available online, such as Google’s banned words (https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/). I compared a few lists and used the one provided by the Carnegie Mellon University School of Computer Science because it was more thorough (1,384 words). I used this list to filter profanity and other (potentially) offensive words. In the future, it will be necessary to shorten this list, as some of its words are only offensive in certain contexts (e.g., “crack”).

I also perform a number of transformations, such as converting to lowercase and removing numbers, stop words, punctuation, and extra whitespace:

library(tm)
# identify subsample files; they are saved locally in the "SampleFiles" subdirectory
cname <- file.path("~", "Desktop", "CapstoneFiles", "SampleFiles")   

# create corpus from subsample files
docs <- VCorpus(DirSource(cname))

# remove profanity
profanityWords <- readLines("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt")
docs <- tm_map(docs, removeWords, profanityWords)

# convert to lowercase (tolower is not one of tm's canonical transformations listed by getTransformations(), so wrap it in content_transformer)
docs <- tm_map(docs, content_transformer(tolower))

# remove non-ASCII characters
docs <- tm_map(docs, content_transformer(function(x) iconv(x, "latin1", "ASCII", sub = "")))

# Canonical transformations in tm package
# remove numbers
docs <- tm_map(docs, removeNumbers)

# remove stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))

# remove punctuation
docs <- tm_map(docs, removePunctuation, preserve_intra_word_dashes=TRUE)

# strip whitespace
docs <- tm_map(docs, stripWhitespace)

# convert corpus to plain text document
docs <- tm_map(docs, PlainTextDocument) 

# save the cleaned corpus as a file so we don't have to recreate it
saveRDS(docs, file = "./CleanSampleCorpus.RData")

# read the corpus as a data frame in preparation for tokenization below
DF <- readRDS("~/Desktop/texts/CleanSampleCorpus.RData")
finalCorpus <- data.frame(text = unlist(sapply(DF, `[`, "content")), stringsAsFactors = FALSE)

Exploratory Analysis

The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens (loosely, terms or words which are linguistically significant and methodologically useful), and phrases in the text. We want to understand the variation of both the frequencies of words and word pairs (or triplets, etc.) in the data. The pairing is important as our ultimate goal is to predict the next word when one or two are given.

We build the document-term matrix (the matrix that lists all the occurrences of words in the corpus by document). Documents are represented by rows and words by columns. These matrices are generally huge (hence the subsampling done above) and sparse (a large share of the entries are zeros, because most words do not appear in most documents). This converts a body of text into a mathematical object (a matrix) amenable to mathematical analysis. For example, the frequency of each word in the corpus is just the corresponding column sum. We will use this to generate some simple exploratory graphics.

I chose to use the function dfm() in the Quanteda package rather than DocumentTermMatrix() in the tm package due to efficiency considerations. (I changed course partway through this report.) This constructs a sparse document-feature matrix. It tokenizes and tabulates the extracted features into a matrix of documents by features. It is also possible to do a fair amount of cleaning (removing stop words, stemming, etc.) within this function call. I already cleaned the samples with functions in the tm package, but to improve efficiency when working with the larger data sets, this is worth noting.
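
As a rough sketch of how that could look (an illustration only, assuming the cleaned sample text is available in finalCorpus as built above, and using the tokens-based quanteda workflow rather than the exact calls I will ultimately use):

# illustrative sketch: fold the cleaning into the quanteda tokenization step
library(quanteda)
toks <- tokens(corpus(finalCorpus$text), remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, stopwords("english"))   # drop stop words
toks <- tokens_wordstem(toks)                       # optional stemming
cleandfm <- dfm(toks)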

Quanteda also has other useful functions that let us explore the data set, such as looking at the longest token, looking at associations with a chosen word (I chose “love” below), and generally summarizing the corpus. The output is lengthy, so I have suppressed it in this brief report.

# build a quanteda corpus from the cleaned sample text, then the dfm
QCorpus <- corpus(finalCorpus$text)
mydfm <- dfm(QCorpus)

# list the 20 most frequently occurring words
topfeatures(mydfm, 20)
##   said   will   just    one   like    can    get   time    new    now   good    day   know   love people     us   back     go    see  going 
##   3073   2987   2982   2720   2707   2458   2213   2128   1882   1791   1760   1692   1584   1566   1544   1496   1423   1383   1339   1271
# plot wordcloud with up to 300 words
library(wordcloud)
set.seed(123)
textplot_wordcloud(mydfm, min.freq = 6, max.words = 300, random.order = FALSE,
                   rot.per = .25, 
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))
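
For completeness, the other exploratory calls mentioned above could look roughly like the following (an illustrative sketch; as before, the lengthy output is suppressed):

# per-document summary of the quanteda corpus
summary(QCorpus)
# occurrences of "love" shown in context
kwic(tokens(QCorpus), "love")
# longest token among the dfm features
feats <- featnames(mydfm)
feats[which.max(nchar(feats))]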

N-gram Model

An N-gram is a sequence of N words, and we will use an N-gram model to estimate the probability of the last word of an N-gram given the previous words, and also assign probabilities to entire sequences.

An N-gram model makes a Markov assumption: the probability of a word depends only on the previous N-1 words.

We can compute the probability of an entire sequence of words by using the chain rule of probability: \[P(X_1,\ldots,X_n) = P(X_1)P(X_2|X_1)P(X_3|X_1^2)\cdots P(X_n|X_1^{n-1})\] This rule shows the link between computing the joint probability of a sequence and computing the conditional probability of a word given the previous words. Under the Markov assumption, \[P(X_n|X_1^{n-1}) \approx P(X_n|X_{n-N+1}^{n-1}).\]

We estimate the N-gram probabilities with maximum likelihood estimation (MLE): we take counts from the corpus and normalize each count by the total count of all N-grams that share the same prefix (the preceding N-1 words), so that the resulting probabilities lie between 0 and 1, as required.
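
In the bigram case, for example, the MLE estimate is \[P(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n)}{C(w_{n-1})},\] i.e., the bigram count normalized by the count of the preceding word.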

The packages have built-in functions for these calculations.

Bigrams and Trigrams and Quadgrams, Oh My!

As explained above, we are not interested only in the probability of a single word (unigram) occurring. We are interested in the probability of a word conditioned on the previous word or words. We therefore conduct exploratory analysis of bigrams, trigrams, and quadgrams in our sample.

We can again make use of the quanteda package and create document-feature matrices for bigrams and trigrams. We plot the 20 most frequent unigrams, bigrams, and trigrams respectively. I suppressed the code as instructed by the course mentor on the discussion forum, and I chose not to plot the quadgram frequencies in this report for brevity.
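
As a rough illustration (not the suppressed code), such document-feature matrices can be built from the quanteda tokens with tokens_ngrams():

# illustrative sketch: bigram and trigram dfms and their 20 most frequent features
toks <- tokens(QCorpus)
bigram_dfm  <- dfm(tokens_ngrams(toks, n = 2))
trigram_dfm <- dfm(tokens_ngrams(toks, n = 3))
topfeatures(bigram_dfm, 20)
topfeatures(trigram_dfm, 20)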

Future Work

We will develop a predictive model based on the steps outlined above, and will implement smoothing and backoff (explained below). We will create a Shiny application as a demo.

Smoothing: Smoothing addresses the problem of zero probabilities for N-grams (i.e., sparse data sets).

To prevent a language model from assigning zero probability to events that appear in the test set but were never seen in training, we must adjust the maximum likelihood estimates. Smoothing (also referred to as discounting) shaves a small amount of probability mass off frequent events and gives it to events we have not seen, so that unseen phrases do not receive zero probability. This is desirable because maximum likelihood estimates are poor when counts are small, so we need better estimators for rare events.
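
The simplest example is add-one (Laplace) smoothing, which, for bigrams, adds one to every count before normalizing: \[P_{Laplace}(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n) + 1}{C(w_{n-1}) + V},\] where V is the vocabulary size.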

Backoff: Backoff exploits the hierarchical nature of N-grams.

If we have no examples of a given trigram, we can estimate its probability from the associated bigram probability; if we have no examples of that, we can use the associated unigram probability. More generally, we only back off to a lower-order N-gram when we have zero evidence for the higher-order N-gram. There are a number of smoothing and backoff techniques, but we will focus on Katz backoff, which performs better than Jelinek-Mercer on large data sets (see Chen 1999) and was recommended in the course discussion forum.

The Katz backoff model (see Katz 1987) redistributes some of the probability mass of higher-order N-grams to lower-order N-grams and normalizes. If the N-gram has been seen more than k times in training, the conditional probability is proportional to the MLE of that N-gram. Otherwise, the conditional probability equals the backed-off conditional probability of the (N-1)-gram.

Large counts are taken to be reliable, so they are not discounted. “Large” is determined by a threshold k (Katz suggests k = 5). The discounts for the lower counts come from the Good-Turing estimate.
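
In the bigram case, for example, the Katz estimate takes the form \[P_{katz}(w_i|w_{i-1}) = \begin{cases} d_{C(w_{i-1}w_i)}\,\dfrac{C(w_{i-1}w_i)}{C(w_{i-1})} & \text{if } C(w_{i-1}w_i) > 0 \\ \alpha(w_{i-1})\,P(w_i) & \text{otherwise,} \end{cases}\] where the discount d is 1 for counts above the threshold k and comes from the Good-Turing estimate otherwise, and the back-off weight alpha redistributes the discounted probability mass over the unseen bigrams.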

Stemming

I will also consider stemming in the next iteration of this project. For example, “offer,” “offered,” and “offering” would all be stemmed to “offer.” This is advantageous because entries consolidate in the document-term matrix. I did not stem at this stage because some of the simple stemming algorithms (such as the function in the tm package) are considered crude, and I want more time to research more precise methods (for example, the SnowballC package). I would also like to research lemmatization, which incorporates grammatical context.
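
For illustration, the Porter stemmer in the SnowballC package collapses these forms to a single stem (a small example, not code used in this report):

library(SnowballC)
# all three forms reduce to the stem "offer"
wordStem(c("offer", "offered", "offering"))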

I also hesitate to stem because I want my algorithm to predict the correct word ending. For example, if a user types “Let’s go” I would want my algorithm to predict the grammatically appropriate “running” and not the bare stem “run.”

Stop Words

Similarly, I am considering putting stop words back into the data set. When looking at the top 20 trigrams, “looking forward seeing” comes up. Clearly the word “to” should be predicted after “forward,” but the algorithm will not do that well without stop words in the training data.

Efficiency versus Accuracy Tradeoff

Our ultimate goal is to build an application that runs on a web server and, realistically, on a mobile device (SwiftKey’s main application is texting). It is therefore important to keep the model as small as possible; removing N-grams that occur with very low frequency should help with this.
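
For example, quanteda’s dfm_trim() can drop rare features (a sketch only; the argument names have changed across quanteda versions):

# keep only features that occur at least twice in the sample dfm
mydfm_small <- dfm_trim(mydfm, min_termfreq = 2)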

From reading the discussion forums, I have gathered a list of other suggestions for optimizing efficiency. I intend to learn and use data.table because it was recommended on the forum and can dramatically reduce the time and memory required in the data-processing workflow. I also rewrote some of the code above to use the quanteda package (I originally used only the tm and RWeka packages). For the final project, I will likely rely on quanteda alone.

References