Summary

I loaded the corpus of texts provided for the Coursera course and conducted some exploratory analyses. Since the corpus is quite large, I sampled 10% of its lines and used the sample to find the most common n-grams (n = 1, 2, 3, 4). In my sample, roughly 13,000 words provided 90% coverage of the corpus, while reaching 100% coverage required more than 100,000 words.

I used the following packages: ggplot2, gridExtra, knitr, readtext, quanteda, stringi, and tidyverse.

In the coming weeks, I will construct a predictive model for user text input. The data will be divided into training, development, and test sets to develop this model. N-gram frequency tables built from the training set will be used to predict the next word given the user’s text input (under the Markov assumption). Prediction will follow a backoff strategy: whenever an n-gram is not available, the model falls back to an (n-1)-gram, defaulting to a unigram if needed.
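As a rough illustration of this backoff idea, the sketch below assumes hypothetical lookup tables (one per prefix length) derived from the n-gram frequencies; it is not the final implementation.

# Sketch of the backoff lookup (table layout and names are assumptions, not final code).
# `tables` is a list of data frames, one per prefix length (1, 2, 3), each with
# columns `prefix` (previous words joined by "_"), `word`, and `freq`.
predict_next_word <- function(input_words, tables) {
    for (n in rev(seq_along(tables))) {               # try the longest context first
        if (length(input_words) < n) next
        prefix <- paste(tail(input_words, n), collapse = "_")
        hits   <- tables[[n]][tables[[n]]$prefix == prefix, ]
        if (nrow(hits) > 0) {
            return(hits$word[which.max(hits$freq)])   # most frequent continuation
        }
    }
    "the"   # default to a very common unigram if nothing matches
}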

In the Shiny app, users will input text, which will then:

  • Be converted to lowercase,
  • Be stripped of numbers, symbols, punctuation, non-ASCII characters, and whitespace, and
  • Be used to search the n-gram tables for matching patterns to present to the user.
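A minimal sketch of that preprocessing step, using quanteda's tokens() helpers (the exact options will likely be refined as cleaning procedures mature):

# Sketch of the input-cleaning step (options may change as cleaning is refined)
clean_input <- function(text) {
    text <- iconv(text, from = "UTF-8", to = "ASCII", sub = "")   # drop non-ASCII characters
    toks <- tokens(text,
                   remove_numbers = TRUE,
                   remove_punct = TRUE,
                   remove_symbols = TRUE,
                   remove_separators = TRUE)
    toks <- tokens_tolower(toks)
    as.character(toks)                                            # return a plain word vector
}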

Tasks

From Capstone Project descriptions:

  1. Obtaining the data - Can you download the data and load/manipulate it in R?
  2. Familiarizing yourself with NLP and text mining - Learn about the basics of natural language processing and how it relates to the data science process you have learned in the Data Science Specialization.
  3. Tokenization - identifying appropriate tokens such as words, punctuation, and numbers. Writing a function that takes a file as input and returns a tokenized version of it.
  4. Profanity filtering - removing profanity and other words you do not want to predict.
  5. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
  6. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.
  7. Build basic n-gram model - using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
  8. Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.
  9. Build a predictive model based on the previous data modeling steps - you may combine the models in any way you think is appropriate.
  10. Evaluate the model for efficiency and accuracy - use timing software to evaluate the computational complexity of your model. Evaluate the model accuracy using different metrics like perplexity, accuracy at the first word, second word, and third word.
  11. Explore new models and data to improve your predictive model.
  12. Evaluate your new predictions on both accuracy and efficiency.
  13. Create a data product to show off your prediction algorithm. You should create a Shiny app that accepts an n-gram and predicts the next word.
  14. Create a slide deck promoting your product. Write 5 slides using RStudio Presenter explaining your product and why it is awesome!

Loading data

We start by loading in the assigned corpus (a collection of texts to analyze), derived from blogs, news, and Twitter sources. The data set also includes texts in other languages, but we will focus only on English sources here.

We also load in a publicly available list of profanities to filter out later in our analysis.

# blogs
btemp <- file("en_US.blogs.txt", open = "r")
blogs <- readLines(btemp, encoding = "UTF-8", skipNul = TRUE)

# news
ntemp <- file("en_US.news.txt", open = "r")
news <- readLines(ntemp, encoding = "UTF-8", skipNul = TRUE)

# twitter
ttemp <- file("en_US.twitter.txt", open = "r")
twitter <- readLines(ttemp, encoding = "UTF-8", skipNul = TRUE)

close(btemp)
close(ntemp)
close(ttemp)
rm(btemp, ntemp, ttemp)
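The profanity list can be read in the same way; the filename below is a placeholder for whichever published list is used.

# profanity list (filename is a placeholder for the published list used)
ptemp <- file("profanity.txt", open = "r")
profanity <- readLines(ptemp, encoding = "UTF-8", skipNul = TRUE)
close(ptemp)
rm(ptemp)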

Next, we will quickly summarize the three text files comprising our corpus (i.e., file size, total number of lines, total number of words, and total number of characters). The table below shows that these are quite large files, with tens of millions of words per source. Texts from Twitter have fewer words per line, as expected given the platform’s constraint on message length.

# summarizing the data files
# file size
filesize <- round(file.info(c("en_US.blogs.txt", 
                                "en_US.news.txt", 
                                "en_US.twitter.txt"))$size / 1024 ^ 2)

# line count
nlines <- sapply(list(blogs, news, twitter), 
                 length)

# word count
nwords <- sapply(list(blogs, news, twitter), 
                 stri_stats_latex)[4,]

# mean words per line
blogslwords <- stri_count_words(blogs)
newslwords <- stri_count_words(news)
twitterlwords <- stri_count_words(twitter)

summarylwords <- sapply(list(blogslwords, newslwords, twitterlwords),
             function(x) summary(x)[c('Mean', 'Max.')])
rownames(summarylwords) = c('words_per_line_mean', 'words_per_line_max')

# character count
nchar <- sapply(list(blogs, news, twitter),
                stri_stats_general)[3,]

# summary table of the data sets
initsummary <- data.frame(source = c("blogs", "news", "twitter"),
        file_size_MB = format(filesize, big.mark = ","),
        total_lines = format(nlines, big.mark = ","),
        total_words = format(nwords, big.mark = ","),
        total_characters = format(nchar, big.mark = ","),
        t(rbind(round(summarylwords)))
           )
kable(initsummary, align = "r")
source   file_size_MB  total_lines  total_words  total_characters  words_per_line_mean  words_per_line_max
blogs             200      899,288   37,570,839       206,824,382                   42                6726
news              196    1,010,242   34,494,539       203,223,154                   34                1796
twitter           159    2,360,148   30,451,170       162,096,241                   13                  47
pl1 <- qplot(blogslwords,
        main = "Blogs",
        xlab = "Words per Line",
        ylab = "Frequency",
        binwidth = 10)
pl2 <- qplot(newslwords,
        main = "News",
        xlab = "Words per Line",
        ylab = "Frequency",
        binwidth = 10)
pl3 <- qplot(twitterlwords,
        main = "Twitter",
        xlab = "Words per Line",
        ylab = "Frequency",
        binwidth = 1)

grid.arrange(pl1, pl2, pl3)


We plot the distribution of words per line for each source. As suggested in the summary table above, blog texts in particular have a long tail, with the longest entry containing over 6,000 words.

Data sampling and cleaning

The text collections in our corpus are very large, and it would be computationally intensive to use the entire corpus. For our analysis, we sample 10% of the lines from each collection to generate a sample corpus for analysis.

The quanteda package is used here for its fast processing and ease of use.
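A sketch of the sampling and corpus construction, assuming a fixed seed and a 10% sample of lines from each collection (the object names are placeholders):

# sample 10% of lines from each collection (object names are placeholders)
set.seed(1234)
sampleblogs   <- sample(blogs,   round(length(blogs)   * 0.1))
samplenews    <- sample(news,    round(length(news)    * 0.1))
sampletwitter <- sample(twitter, round(length(twitter) * 0.1))

# combine the samples into a single quanteda corpus
samplecorpus <- corpus(c(sampleblogs, samplenews, sampletwitter))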

The documents in the corpus next need to be tokenized, and we will be removing the following (a sketch of this step follows the list):

  1. Profanity,
  2. Hashtags,
  3. Symbols,
  4. Uppercase lettering (by converting to lowercase),
  5. Numbers,
  6. Punctuation, and
  7. Excess whitespace.
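A sketch of this cleaning step with quanteda, assuming the samplecorpus and profanity objects defined above (the options may be refined later):

# tokenize and clean the sample corpus (options may be refined later)
sampletokens <- tokens(samplecorpus,
                       remove_numbers = TRUE,
                       remove_punct = TRUE,
                       remove_symbols = TRUE,
                       remove_separators = TRUE,
                       remove_url = TRUE)
sampletokens <- tokens_tolower(sampletokens)
# drop profanity and Twitter hashtags
sampletokens <- tokens_remove(sampletokens, pattern = profanity)
sampletokens <- tokens_remove(sampletokens, pattern = "#*")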

N-gram modelling

For our model, we will use n-grams (n = 2, 3, 4) to predict the next word based on the one, two, or three previous words, and also to demonstrate the most common words. We also visualize the frequencies of the most common n-grams.

First, we create a document-feature matrix (dfm) from the list of cleaned tokens. The dfm demonstrates the number of times each unique token appears in each text file in the corpus. Common English stop words (e.g., “the”, “me”, “my”) and individual letters are removed during this step.

# make n-gram dfms
unigram <- tokens_ngrams(sampletokens, n = 1)
unigramdfm <- dfm(unigram, 
                  remove = stopwords("english"),
                  remove_separators = TRUE)

bigram <- tokens_ngrams(sampletokens, n = 2)
bigramdfm <- dfm(bigram, remove = stopwords("english"))

trigram <- tokens_ngrams(sampletokens, n = 3)
trigramdfm <- dfm(trigram, remove = stopwords("english"))

tetragram <- tokens_ngrams(sampletokens, n = 4)
tetragramdfm <- dfm(tetragram, remove = stopwords("english"))

# top 20 unigrams
top20unigram <- topfeatures(unigramdfm, 20)
top20unigram <- data.frame(word = names(top20unigram), freq = top20unigram, row.names = NULL)

ggplot(top20unigram, aes(x = reorder(word, -freq), y = freq)) +
    geom_bar(stat = "identity") +
    labs(title = "Top 20 Unigram Words") +
    xlab("Frequency") +
    ylab("Unigram") +
    coord_flip()

# top 20 bigrams
top20bigrams <- topfeatures(bigramdfm, 20)
top20bigrams <- data.frame(word = names(top20bigrams), freq = top20bigrams, row.names = NULL)

ggplot(top20bigrams, aes(x = reorder(word, -freq), y = freq)) +
    geom_bar(stat = "identity") +
    labs(title = "Top 20 Bigrams") +
    xlab("Frequency") +
    ylab("Bigram") +
    coord_flip()

# top 20 trigrams
top20trigrams <- topfeatures(trigramdfm, 20)
top20trigrams <- data.frame(word = names(top20trigrams), freq = top20trigrams, row.names = NULL)

ggplot(top20trigrams, aes(x = reorder(word, -freq), y = freq)) +
    geom_bar(stat = "identity") +
    labs(title = "Top 20 Trigrams") +
    xlab("Frequency") +
    ylab("Trigram") +    
    coord_flip()

# top 20 4-grams
top20tetragrams <- topfeatures(tetragramdfm, 20)
top20tetragrams <- data.frame(word = names(top20tetragrams), freq = top20tetragrams, row.names = NULL)

ggplot(top20tetragrams, aes(x = reorder(word, -freq), y = freq)) +
    geom_bar(stat = "identity") +
    labs(title = "Top 20 4-grams") +
    xlab("Frequency") +
    ylab("4-gram") +    
    coord_flip()

Coverage

Next, we construct a plot to demonstrate the number of words needed to cover a given proportion of our corpus (i.e., 50%, 60%, 70%, 80%, 90%, 95%, 97.5%, 99%, 100%). The number of required words grows steeply between about 90% and 100% coverage.

# clean up objects from memory
rm(sampletokens, top20unigram, top20bigrams, top20trigrams, top20tetragrams, unigram, bigram, trigram, tetragram)

# save dfms to local RDS files for convenience
saveRDS(unigramdfm, file = "unigramdfm.RDS")
saveRDS(bigramdfm, file = "bigramdfm.RDS")
saveRDS(trigramdfm, file = "trigramdfm.RDS")
saveRDS(tetragramdfm, file = "tetragramdfm.RDS")

unigramfeatures <- textstat_frequency(unigramdfm)

# function to calculate how many of the most frequent unigrams are needed
# to cover a given proportion of the total corpus
coverage <- function(features, percentc) {
    cumfreq <- cumsum(features$frequency)   # features are already sorted by decreasing frequency
    which(cumfreq >= sum(features$frequency) * percentc)[1]
}

unigrams0.5 <- coverage(unigramfeatures, 0.5)
unigrams0.6 <- coverage(unigramfeatures, 0.6)
unigrams0.7 <- coverage(unigramfeatures, 0.7)
unigrams0.8 <- coverage(unigramfeatures, 0.8)
unigrams0.9 <- coverage(unigramfeatures, 0.9)
unigrams0.95 <- coverage(unigramfeatures, 0.95)
unigrams0.975 <- coverage(unigramfeatures, 0.975)
unigrams0.99 <- coverage(unigramfeatures, 0.99)
unigrams1.0 <- coverage(unigramfeatures, 1)

unigramcoverage <- data.frame(
    Percent.Coverage = c(50, 60, 70, 80, 90, 95, 97.5, 99, 100),
    Words = c(unigrams0.5, unigrams0.6, unigrams0.7, unigrams0.8, unigrams0.9,
              unigrams0.95, unigrams0.975, unigrams0.99, unigrams1.0)
)

# remove unnecessary objects
rm(unigrams0.5, unigrams0.6, unigrams0.7, unigrams0.8, unigrams0.9, unigrams0.95, unigrams0.975, unigrams0.99, unigrams1.0)

# graph of coverage
ggplot(unigramcoverage, aes(x = Words, y = Percent.Coverage, group = 0.5)) +
    geom_line(color = "maroon") + 
    geom_text(aes(label = Words, vjust = 1)) + 
    labs(x = "Number of unigrams", y = "Coverage (%)") + 
    labs(title = "Number of unigrams to cover a percentage of total corpus") +
    coord_flip()

Next steps

Work remains to be done on refining my data cleaning procedures. I will also need to read further about different modelling methods for natural language processing. The Katz back-off model is commonly used for word prediction; it is a discounting method that attempts to overcome data sparsity. Stupid Backoff appears to be a simpler but similar approach, and I will try to implement these in the next few weeks.
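As a starting point, here is a minimal sketch of the Stupid Backoff scoring idea, assuming counts like those built above are stored in a single named vector and using the commonly cited back-off factor of 0.4; the data layout and function name are placeholders, not the final implementation.

# Sketch of Stupid Backoff scoring (data layout and names are placeholders)
# counts: a named numeric vector mapping n-gram strings (words joined by "_") to counts
stupid_backoff <- function(word, context, counts, alpha = 0.4) {
    get_count <- function(key) if (key %in% names(counts)) counts[[key]] else 0
    if (length(context) == 0) {
        # unigram relative frequency (unigram keys contain no "_")
        return(get_count(word) / sum(counts[!grepl("_", names(counts))]))
    }
    ngram_count  <- get_count(paste(c(context, word), collapse = "_"))
    prefix_count <- get_count(paste(context, collapse = "_"))
    if (ngram_count > 0 && prefix_count > 0) {
        ngram_count / prefix_count
    } else {
        alpha * stupid_backoff(word, context[-1], counts, alpha)   # back off to a shorter context
    }
}

Here, counts would combine the unigram through 4-gram frequencies, with words joined by underscores as in the dfm feature names, so that candidate next words can be ranked by their back-off scores.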