Summary

The goal of the capstone project is to develop a model to predict text based on a user's input. This milestone report demonstrates the progress so far, including:
1. downloading and loading the training data (blog posts, news articles and Twitter feeds)
2. some summary statistics
3. exploratory analysis showing any interesting findings
4. initial plans for developing a prediction algorithm, based on the analysis done so far.

The two major packages used for this analysis are quanteda, for storing the text, and tidytext, for exploratory analysis of the text.

Load data

Each dataset (blog, news and twitter) is loaded and stored. Summary statistics for the three datasets, including the number of tokens (total words), types (unique words per document, summed over all documents) and documents, are shown below.

library(readtext)
library(spacyr)
library(quanteda)
library(knitr)

## read the full datasets (as UTF-8; skipNul avoids embedded-nul warnings)
blog <- readLines("en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# transliterate non-ASCII characters (e.g. accented letters) to ASCII equivalents
blog <- iconv(blog,from="UTF-8",to="ASCII//TRANSLIT")
news <- iconv(news,from="UTF-8",to="ASCII//TRANSLIT")
twitter <- iconv(twitter,from="UTF-8",to="ASCII//TRANSLIT")

#  store in corpora
blogCorpus <- corpus(blog)
newsCorpus <- corpus(news)
twitterCorpus <- corpus(twitter)
fullCorpus <- blogCorpus + newsCorpus + twitterCorpus # store in quanteda corpus object for modelling

# summary stats
tokens <- c(sum(ntoken(blogCorpus)),sum(ntoken(newsCorpus)), sum(ntoken(twitterCorpus)))
types <- c(sum(ntype(blogCorpus)),sum(ntype(newsCorpus)),sum(ntype(twitterCorpus)))
documents <- c(ndoc(blogCorpus),ndoc(newsCorpus),ndoc(twitterCorpus))
summaryDF <- data.frame(tokens,types,documents)
rownames(summaryDF) <- c("blogs","news","twitter")
kable(format(summaryDF, big.mark=","))
            tokens        types   documents
blogs   42,803,583   31,239,074     899,288
news    39,383,861   31,622,924   1,010,242
twitter 36,306,802   32,301,039   2,360,148

Clean data

Some data cleaning steps were performed prior to exploratory data analysis, including:
1. profanity filtering
2. conversion of the text to lowercase
3. removal of whitespace
4. removal of punctuation
5. splitting (tokenizing) the text into one-, two- and three-word units (1-grams, 2-grams and 3-grams). These units form the basis of the prediction model.

A 10% sample of each dataset was taken to keep runtimes reasonable. Stopwords (common words such as the, of or and) were removed for the exploratory data analysis to better highlight the distinct character of each dataset. Stopwords will be kept for the modelling phase, as they are required for accurate text prediction.

library(tidytext)
library(dplyr)

# sample 10%  and store in quanteda corpora
set.seed(2017)
blogSample <- blog[sample(1:length(blog), size = 0.1*length(blog))]
blogCorpus <- corpus(blogSample)
docvars(blogCorpus, "source") <- "blog"
newsSample <- news[sample(1:length(news), size = 0.1*length(news))]
newsCorpus <- corpus(newsSample)
docvars(newsCorpus, "source") <- "news"
twitterSample <- twitter[sample(1:length(twitter), size = 0.1*length(twitter))]
twitterCorpus <- corpus(twitterSample)
docvars(twitterCorpus, "source") <- "twitter"
sampleCorpus <- blogCorpus + newsCorpus + twitterCorpus

# retrieve the profanity and stopword lists as character vectors
stopWords <- readtext("stop-word-list.txt") %>% unnest_tokens(term, text)
stopWords <- stopWords$term
profanity <- readtext("bad-words.txt") %>% unnest_tokens(term, text)
profanity <- profanity$term

# tokenize the corpora and convert them to document-feature matrices (dfms)
blogTokens <- tokens(blogCorpus, remove_numbers = TRUE, remove_punct = TRUE, remove_url = TRUE, remove_twitter = TRUE)
blogTokens <- tokens_remove(blogTokens, profanity)
blogTokens <- tokens_remove(blogTokens, stopWords)
blogDFMn1 <- dfm(blogTokens, ngrams = 1)
blogDFMn2 <- dfm(blogTokens, ngrams = 2)
blogDFMn3 <- dfm(blogTokens, ngrams = 3)

newsTokens <- tokens(newsCorpus, remove_numbers = TRUE, remove_punct = TRUE, remove_url = TRUE, remove_twitter = TRUE)
newsTokens <- tokens_remove(newsTokens, profanity)
newsTokens <- tokens_remove(newsTokens, stopWords)
newsDFMn1 <- dfm(newsTokens, ngrams = 1)
newsDFMn2 <- dfm(newsTokens, ngrams = 2)
newsDFMn3 <- dfm(newsTokens, ngrams = 3)

twitterTokens <- tokens(twitterCorpus, remove_numbers = TRUE, remove_punct = TRUE, remove_url = TRUE, remove_twitter = TRUE)
twitterTokens <- tokens_remove(twitterTokens, profanity)
twitterTokens <- tokens_remove(twitterTokens, stopWords)
twitterDFMn1 <- dfm(twitterTokens, ngrams = 1)
twitterDFMn2 <- dfm(twitterTokens, ngrams = 2)
twitterDFMn3 <- dfm(twitterTokens, ngrams = 3)

Exploratory data analysis

The document feature matrices (dfms) were converted into a tidy format using tidytext for exploratory analysis. The intent of this analysis is to see whether there are aspects of the data that will affect the development of a predictive model, as well as to understand the character of each dataset.

# convert dfms to tidy format and tag each with its source
blogTidy1 <- tidy(blogDFMn1) %>% mutate(source = "blog")
blogTidy2 <- tidy(blogDFMn2) %>% mutate(source = "blog")
blogTidy3 <- tidy(blogDFMn3) %>% mutate(source = "blog")

newsTidy1 <- tidy(newsDFMn1) %>% mutate(source = "news")
newsTidy2 <- tidy(newsDFMn2) %>% mutate(source = "news")
newsTidy3 <- tidy(newsDFMn3) %>% mutate(source = "news")

twitterTidy1 <- tidy(twitterDFMn1) %>% mutate(source = "twitter")
twitterTidy2 <- tidy(twitterDFMn2) %>% mutate(source = "twitter")
twitterTidy3 <- tidy(twitterDFMn3) %>% mutate(source = "twitter")

tidySample1 <- bind_rows(blogTidy1,newsTidy1,twitterTidy1) # combine 1-gram datasets together
tidySample2 <- bind_rows(blogTidy2,newsTidy2,twitterTidy2) # combine 2-gram datasets together
tidySample3 <- bind_rows(blogTidy3,newsTidy3,twitterTidy3) # combine 3-gram datasets together

# number of 1-gram observations
kable(format(nrow(tidySample1), big.mark=","))
4,666,831
# number of 2-gram observations
kable(format(nrow(tidySample2), big.mark=","))
4,507,289
# number of 3-gram observations
kable(format(nrow(tidySample3), big.mark=","))
4,128,342

Distribution histograms

library(ggplot2)

# plot total 1-gram frequency by source (wt = count sums token counts across documents)
tidySample1 %>% group_by(source) %>%
                count(term, wt = count, sort = TRUE) %>%
                top_n(20) %>%
                mutate(term = reorder(term, n)) %>%
                ggplot(aes(term, n, fill = source)) +
                geom_col() +
                xlab(NULL) +
                coord_flip()

# calculate 1-gram term frequency - inverse document frequency (tf-idf)
tidySampleTFIDF1 <-  tidySample1 %>%
                    count(source, term, wt = count, sort = TRUE) %>%
                    ungroup() %>%
                    bind_tf_idf(term, source, n) %>%
                    arrange(desc(tf_idf))

# plot 1-gram (tf-idf)
tidySampleTFIDF1 %>% mutate(term = factor(term, levels = rev(unique(term)))) %>% 
                    group_by(source) %>% 
                    top_n(20) %>% 
                    ungroup %>%
                    ggplot(aes(term, tf_idf, fill = source)) +
                    geom_col(show.legend = FALSE) +
                    labs(x = NULL, y = "tf-idf") +
                    facet_wrap(~source, ncol = 3, scales = "free") +
                    coord_flip()

The total-frequency 1-gram histogram indicates that some of the most frequent words (time, just, new) are shared by all three datasets. However, other words appear only in certain datasets: the twitter dataset contains terms unique to it, such as rt (short for retweet) or u.

The tf-idf plot highlights terms that are disproportionately common within a specific dataset and filters out words (such as time, just, new) that are common across all datasets. The blog dataset shows a wide variety of terms. The news dataset leans towards locations and proper names. The twitter dataset contains a number of pseudo-words, such as ur, thx and cuz, and acronyms like omg, lmao and idk.
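
For reference, bind_tf_idf treats each source as a document: tf is a term's total count divided by the total count of all terms in that source, idf is log(number of sources / number of sources containing the term), and tf-idf is their product. The short sketch below recomputes these values by hand as a sanity check (tfidfCheck is just an illustrative name; it uses the tidySample1 object created above).

# recompute 1-gram tf-idf by hand to illustrate what bind_tf_idf is doing
tfidfCheck <- tidySample1 %>%
    count(source, term, wt = count) %>%                      # n = total occurrences per source
    group_by(source) %>% mutate(tf = n / sum(n)) %>% ungroup() %>%
    group_by(term) %>% mutate(idf = log(3 / n_distinct(source))) %>% ungroup() %>%  # 3 sources
    mutate(tf_idf = tf * idf) %>%
    arrange(desc(tf_idf))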

# plot total 2-gram frequency by source (wt = count sums token counts across documents)
tidySample2 %>% group_by(source) %>%
                count(term, wt = count, sort = TRUE) %>%
                top_n(15) %>%
                mutate(term = reorder(term, n)) %>%
                ggplot(aes(term, n, fill = source)) +
                geom_col() +
                xlab(NULL) +
                coord_flip()

# calculate 2-gram term frequency - inverse document frequency (tf-idf)
tidySampleTFIDF2 <-  tidySample2 %>%
                    count(source, term, wt = count, sort = TRUE) %>%
                    ungroup() %>%
                    bind_tf_idf(term, source, n) %>%
                    arrange(desc(tf_idf))

# plot 2-gram (tf-idf)
tidySampleTFIDF2 %>% mutate(term = factor(term, levels = rev(unique(term)))) %>% 
                    group_by(source) %>% 
                    top_n(20) %>% 
                    ungroup %>%
                    ggplot(aes(term, tf_idf, fill = source)) +
                    geom_col(show.legend = FALSE) +
                    labs(x = NULL, y = "tf-idf") +
                    facet_wrap(~source, ncol = 3, scales = "free") +
                    coord_flip()

The total-frequency 2-gram histogram indicates that a few of the most frequent 2-grams (don’t_know, don’t_want) are shared by all three datasets. The news 2-grams again lean towards location names, while the twitter 2-grams feature many terms used in personal conversation (happy_birthday, good_morning, can’t_wait).

The tf-idf 2-gram histograms show considerable diversity across the datasets. The blog dataset has many terms containing amazon. The news 2-grams focus on locations and proper names, similar to the 1-grams. The twitter dataset consists mostly of pseudo-words.

# plot total 3-gram frequency by source (wt = count sums token counts across documents)
tidySample3 %>% group_by(source) %>%
                count(term, wt = count, sort = TRUE) %>%
                top_n(15) %>%
                mutate(term = reorder(term, n)) %>%
                ggplot(aes(term, n, fill = source)) +
                geom_col() +
                xlab(NULL) +
                coord_flip()

# calculate 3-gram term frequency - inverse document frequency (tf-idf)
tidySampleTFIDF3 <-  tidySample3 %>%
                    count(source, term, wt = count, sort = TRUE) %>%
                    ungroup() %>%
                    bind_tf_idf(term, source, n) %>%
                    arrange(desc(tf_idf))

# plot 3-gram (tf-idf)
tidySampleTFIDF3 %>% mutate(term = factor(term, levels = rev(unique(term)))) %>% 
                    group_by(source) %>% 
                    top_n(20) %>% 
                    ungroup %>%
                    ggplot(aes(term, tf_idf, fill = source)) +
                    geom_col(show.legend = FALSE) +
                    labs(x = NULL, y = "tf-idf") +
                    facet_wrap(~source, ncol = 3, scales = "free") +
                    coord_flip()

The total-frequency histogram shows that virtually none of the top 3-grams are shared by the three datasets. Once again, the news 3-grams focus on location names, while the twitter 3-grams feature phrases used in personal conversation. The tf-idf 3-gram histograms show a similar trend to the 2-grams.

Model plans

The linguistic diversity across the blog, news and twitter datasets indicates that all three should be incorporated to build a more robust text-prediction model.

For the prediction model, the following steps are planned:

  1. Tokenize the fullCorpus or sampleCorpus object (depending on runtimes) and apply the data-cleaning steps listed in the Clean data section. Note: stopwords will not be removed, as they are required for accurate prediction.
  2. Generate n-grams for n from 2 to 5. The n parameter will be adjusted for efficiency during the modelling process.
  3. Calculate frequency distributions for the n-grams.
  4. Apply a smoothing algorithm to shift some probability mass from observed n-grams to previously unseen n-grams. Algorithms such as Stupid Backoff and Kneser-Ney smoothing will be tested.
  5. Incorporate a back-off model that gives zero-frequency n-grams a non-zero probability by falling back to the (n-1)-gram probability instead (a sketch follows this list).
  6. Store the n-gram and frequency output in a data.table object, chosen for its speed and memory efficiency. This table will allow the prediction model to quickly look up the most likely next word given the preceding words.
  7. Design the final model to prioritize size and runtime efficiency, as it will need to run within a Shiny app.
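
To make steps 4 to 6 concrete, below is a minimal sketch (not the final implementation) of the planned frequency tables and a Stupid Backoff lookup, assuming the sampleCorpus and profanity objects built earlier. The object and function names (predTokens, countNgrams, stupidBackoff) are placeholders, only 2-grams and 3-grams are included for brevity, and the back-off weight of 0.4 is the value recommended in the original Stupid Backoff paper.

library(quanteda)
library(data.table)

# tokenize the sampled corpus, keeping stopwords but applying the other cleaning steps
predTokens <- tokens(sampleCorpus, remove_numbers = TRUE, remove_punct = TRUE,
                     remove_url = TRUE, remove_twitter = TRUE)
predTokens <- tokens_tolower(predTokens)
predTokens <- tokens_remove(predTokens, profanity)

# build an n-gram frequency table keyed on the (n-1)-word history for fast lookup
countNgrams <- function(toks, n) {
    ngramFreq <- colSums(dfm(tokens_ngrams(toks, n = n, concatenator = " ")))
    dt <- data.table(ngram = names(ngramFreq), count = as.numeric(ngramFreq))
    dt[, history := sub(" \\S+$", "", ngram)]  # first n-1 words
    dt[, word := sub("^.* ", "", ngram)]       # last word (the prediction target)
    setkey(dt, history)
    dt
}
bigramDT  <- countNgrams(predTokens, 2)
trigramDT <- countNgrams(predTokens, 3)
unigramFreq <- colSums(dfm(predTokens))
unigramDT <- data.table(word = names(unigramFreq), count = as.numeric(unigramFreq))

# Stupid Backoff: use trigram relative frequencies if the two-word history was seen,
# otherwise back off to bigram (then unigram) frequencies, scaled by 0.4 at each step
stupidBackoff <- function(lastTwoWords, lambda = 0.4) {
    cand <- trigramDT[.(paste(lastTwoWords, collapse = " ")),
                      .(word, score = count / sum(count)), nomatch = 0L]
    if (nrow(cand) > 0) return(head(cand[order(-score)], 3))
    cand <- bigramDT[.(lastTwoWords[2]),
                     .(word, score = lambda * count / sum(count)), nomatch = 0L]
    if (nrow(cand) > 0) return(head(cand[order(-score)], 3))
    head(unigramDT[order(-count)], 3)[, .(word, score = lambda^2 * count / sum(unigramDT$count))]
}

# example: the three most likely words to follow "at the"
stupidBackoff(c("at", "the"))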