In this capstone I will be applying data science to the area of natural language processing. The next-word prediction model will be trained on a data set from HC Corpora, which is available on the Coursera site. This report aims to provide a basic understanding of the data that will be used in the Data Science Capstone. I will describe the major features of the data and briefly summarize my plans for creating the prediction algorithm and Shiny app. The model will be an n-gram model using 2-grams and 3-grams.
The file from Coursera contains text data in four languages; this project will be built only on the English data. Three text files will be used to build the corpus, containing text from three different sources: blogs, news and Twitter.
In addition, I will be using a file containing profanities from Google in order to filter these words out of the corpus. This file can be downloaded at: https://code.google.com/archive/p/badwordslist/downloads
| File_Name | Object_Size_MB | Number_of_Lines | Number_of_Words | Max_Words_per_Line |
|---|---|---|---|---|
| en_US.blogs.txt | 255 | 899288 | 37334131 | 6630 |
| en_US.news.txt | 257 | 1010242 | 34372530 | 1792 |
| en_US.twitter.txt | 319 | 2360148 | 30373583 | 47 |
The table above summarizes the 3 files containing text from blogs, news and Twitter sources. All 3 data sets occupy a similar amount of memory, roughly 250-320 MB. The Twitter file contains more lines than blogs and news combined, yet all three have a comparable number of words, 30-37 million. This means there are fewer words per line in the Twitter file than in blogs and news. In fact, the longest Twitter line is only 47 words, which makes sense given Twitter's character limit.
Because of the very large number of words, the model cannot be built on the entire data set, so I will sample 10% of the lines from each source. These samples are then merged into one data set that is used to construct the corpus for the prediction model.
Before looking at word frequencies, the text corpus requires some additional pre-processing to improve the accuracy of the final model. The following transformations are applied to the text: removing non-English characters, punctuation, numbers, URLs, Twitter handles, email patterns and emojis; removing profane words based on the Google list; removing stop words; converting to lower case; and trimming extra white space. Word stemming is not used here, but it could improve the accuracy of the model if done correctly.
After building the corpus, the next step is to construct the tokens for the natural language processing model. I will be using 1-, 2- and 3-grams, which will be generated with the RWeka package in R.
It is useful to look at the frequencies of the n-grams, starting with unigrams, which are simply single words.
A graph of the top 20 words by number of occurrences in the text shows that “will” is the most common word with around 31,000 occurrences, while the 20th most common word has about 13,000. Words like “the”, “a”, “an”, “so” and “what” do not appear since these stop words have been removed from the corpus. A word cloud can be seen below.
It is also useful to look at word coverage. For example, how many unique words would be required in a frequency-sorted dictionary to cover 50% of all word instances in the language? What about 90%?
The graph shows that only 986 words are needed to cover 50% of the whole text, while 90% coverage requires 16,852 unique words in the dictionary. This should be kept in mind when building the model, since reducing the size of the dictionary can save processing time and memory, as long as it does not come at the cost of too large an accuracy loss.
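To illustrate how such a reduction could work, here is a minimal sketch that prunes a frequency-sorted dictionary at a chosen coverage level. It assumes the `word`/`freq` data frames built in the appendix code; the helper name `prune_dictionary` is hypothetical.

```r
# Minimal sketch: keep only the most frequent words needed to reach a target
# coverage. Assumes a data frame sorted by decreasing frequency with columns
# `word` and `freq`, as built in the appendix code.
prune_dictionary <- function(d, coverage = 0.9) {
  cum_percent <- cumsum(prop.table(d$freq))    # cumulative relative frequency
  cutoff <- which(cum_percent >= coverage)[1]  # first row reaching the target
  d[seq_len(cutoff), ]
}

# Example: reduce the unigram dictionary to the ~16852 words covering 90% of the text
# unigrams_small <- prune_dictionary(unigrams, coverage = 0.9)
```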
The most common bigram is “right now” with about 2400 occurrences and the 20th most common is “looks like” with about 900.
The most common trigram is “cant wait see” (the remnant of phrases like “can’t wait to see” after punctuation and stop words are removed), with about 400 occurrences, and the 20th most common is “gov chris christie” with about 90.
| Ngram_name | Object_Size_MB | Number_of_Ngrams |
|---|---|---|
| unigram | 30 | 214454 |
| bigram | 498 | 3264367 |
| trigram | 780 | 4567812 |
The bigram and especially the trigram dictionaries are very large and occupy significantly more memory than the single-word unigrams. This is because the same word can appear in many different bigrams and trigrams. Using anything longer than trigrams would likely exceed the memory available to a Shiny app.
The predictive algorithm will be developed using an n-gram model with n = 3. The current plan is to start from the 3-gram model and back off to the 2-gram and 1-gram models, interpolating the predicted probabilities of the individual models with one another. Katz’s back-off model will be used to estimate the probability of unobserved n-grams, and I am considering Laplace smoothing to address the problem of zero probabilities. The model will most likely be stored as a Markov chain. It will be evaluated by applying it to a test set and measuring metrics such as perplexity and accuracy.
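As a rough illustration of the back-off idea (simplified here to ranking candidates by raw frequency rather than the full Katz back-off with discounting), the sketch below looks up continuations in the trigram table and falls back to the bigram table when no match is found. It assumes the n-gram frequency tables saved by the appendix code; `predict_next` is a hypothetical helper, not the final algorithm.

```r
# Simplified back-off-style lookup, for illustration only. Assumes the bigram and
# trigram tables saved by the appendix code (columns `word` and `freq`, sorted by
# decreasing frequency, with space-separated n-grams in `word`).
bigrams  <- readRDS("./bigram.rds")
trigrams <- readRDS("./trigram.rds")

predict_next <- function(last_two_words, n = 5) {
  # Trigram candidates whose first two words match the input
  hits <- trigrams[startsWith(as.character(trigrams$word),
                              paste0(last_two_words, " ")), ]
  if (nrow(hits) == 0) {
    # Back off to bigrams that start with the final word only
    last_word <- tail(strsplit(last_two_words, " ")[[1]], 1)
    hits <- bigrams[startsWith(as.character(bigrams$word),
                               paste0(last_word, " ")), ]
  }
  # Return the last word of the top n candidates
  head(vapply(strsplit(as.character(hits$word), " "),
              function(w) tail(w, 1), character(1)), n)
}

# Example: predict_next("right now")
```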
The plan for the design of the Shiny app is to have a text field where the user enters a sentence; after every word, the app will display the top 5 predictions for the next word. The user can then click on one of the predicted words to append it to the sentence.
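The following is a minimal sketch of what that interface might look like, assuming a prediction function such as the hypothetical `predict_next()` above; it is not the final app.

```r
# Minimal Shiny sketch of the planned interface, assuming a prediction function
# such as the hypothetical predict_next() above. Not the final app.
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("sentence", "Type a sentence:", value = ""),
  uiOutput("suggestions")   # buttons showing the top 5 predicted words
)

server <- function(input, output, session) {
  preds <- reactive({
    words <- strsplit(trimws(tolower(input$sentence)), "\\s+")[[1]]
    if (length(words) == 0) return(character(0))
    predict_next(paste(tail(words, 2), collapse = " "), n = 5)
  })

  # One action button per predicted word
  output$suggestions <- renderUI({
    lapply(seq_along(preds()), function(i) {
      actionButton(paste0("pred", i), label = preds()[i])
    })
  })

  # Clicking a suggestion appends that word to the sentence
  lapply(1:5, function(i) {
    observeEvent(input[[paste0("pred", i)]], {
      chosen <- preds()[i]
      if (!is.na(chosen))
        updateTextInput(session, "sentence",
                        value = paste(input$sentence, chosen))
    })
  })
}

# shinyApp(ui, server)
```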
These plans are subject to change after exploring more options for the final model. As illustrated in this report, the main concern is the size of the data set, so attention needs to be given to memory and performance issues. The parameters of the model will be adjusted to make sure the app runs on the Shiny server and does not take too long to load. The dictionaries will most likely be reduced, taking into account the coverage of words.
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
library(ngram)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(RWeka)
#Download and unzip data files from Coursera
if (!file.exists("Coursera-SwiftKey.zip")) {
download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile = "./Coursera-SwiftKey.zip", method = "curl")
unzip("Coursera-SwiftKey.zip")
}
#Download list of profanities from Google
if (!file.exists("badwords.txt")) {
download.file(url = "http://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/badwordslist/badwords.txt", destfile = "badwords.txt", method = "curl")
}
#Read English version of text files used to construct the model
con <- file("./final/en_US/en_US.blogs.txt", open = "rb")
blogs <- readLines(con, skipNul = TRUE)
close(con)
con <- file("./final/en_US/en_US.news.txt", open = "rb")
news <- readLines(con, skipNul = TRUE)
close(con)
con <- file("./final/en_US/en_US.twitter.txt", open = "rb")
twitter <- readLines(con, skipNul = TRUE)
close(con)
#Read file containing bad words.
con <- file("./badwords.txt", open = "r")
profanity_filter <- readLines(con, skipNul = TRUE)
close(con)
#Construct a summary table for the size of character vectors in memory, number of lines and words
fileNames <- c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt")
fileSize <- c(round(object.size(blogs)/1048576,0), round(object.size(news)/1048576,0),
round(object.size(twitter)/1048576,0))
numLines <- c(length(blogs),length(news),length(twitter))
numWords <- c(wordcount(blogs), wordcount(news), wordcount(twitter))
#Find maximum number of words in a line for each file
max_blogs <- max(sapply(blogs, wordcount, USE.NAMES = FALSE))
max_news <- max(sapply(news, wordcount, USE.NAMES = FALSE))
max_twitter <- max(sapply(twitter, wordcount, USE.NAMES = FALSE))
maxWordsLine <- c(max_blogs,max_news,max_twitter)
fileSummary <- data.frame(File_Name = fileNames, Object_Size_MB = fileSize, Number_of_Lines = numLines,
Number_of_Words = numWords, Max_Words_per_Line = maxWordsLine)
knitr::kable(fileSummary)
#Sample 10% of blogs, news and twitter data
set.seed(47567)
blogs <- sample(blogs, length(blogs) * 0.1, replace = FALSE)
news <- sample(news, length(news) * 0.1, replace = FALSE)
twitter <- sample(twitter, length(twitter) * 0.1, replace = FALSE)
#Remove foreign characters from the words
blogs <- iconv(blogs, "latin1", "ASCII", sub = "")
news <- iconv(news, "latin1", "ASCII", sub = "")
twitter <- iconv(twitter, "latin1", "ASCII", sub = "")
#Construct one file from blogs, news and twitter files and save it
allSample <- c(blogs, news, twitter)
fileCon<-file("data.txt")
writeLines(allSample, fileCon)
close(fileCon)
#Preprocess the profanity filter
profanity_filter <- iconv(profanity_filter, "latin1", "ASCII", sub = "")
profanity_filter <- gsub(profanity_filter, pattern = "[^A-Z a-z.-]", replacement = "")
profanity_filter <- tolower(profanity_filter)
#Construct corpus, remove punctuation, numbers, special characters and patterns, convert to lowercase, remove stop words and profanities, remove extra white space
corpus <- Corpus(VectorSource(allSample))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, function(x) gsub(x, pattern = "(f|ht)tp(s?)://(.*)[.][a-z]+", replacement = ""))
corpus <- tm_map(corpus, function(x) gsub(x, pattern = "@[^\\s]+", replacement = ""))
corpus <- tm_map(corpus, function(x) gsub(x, pattern = "\\b[A-Z a-z 0-9._ - ]*[@](.*?)[.]{1,3} \\b", replacement = ""))
corpus <- tm_map(corpus, function(x) gsub(x, pattern = "[^\x01-\x7F]", replacement = ""))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeWords, profanity_filter)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
#Convert corpus to data frame and save to file
corpusText <- data.frame(text = sapply(corpus, as.character), stringsAsFactors = FALSE)
fileCon<-file("corpus.txt")
writeLines(corpusText$text, fileCon)
close(fileCon)
rm(allSample, blogs, news, twitter, profanity_filter, con, fileCon)
#Reconstruct corpus from data frame into a volatile corpus. Tokenization doesn't work on simple corpus
corpus <- VCorpus(VectorSource(corpusText))
#Define unigram, bigram and trigram functions
unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
#Build the term document matrix for unigrams and remove sparse terms
unigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = unigramTokenizer))
unigramMatrix2 <- removeSparseTerms(unigramMatrix, sparse = 0.99)
#Construct a frequency data frame for unigrams
m <- as.matrix(unigramMatrix2)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
#Plot the top 20 most frequent unigrams (words)
b <- barplot(d[1:20,]$freq, las = 2, names.arg = d[1:20,]$word,
col ="lightblue", ylim = range(pretty(c(0, d[1:20,]$freq)))*1.1,main ="Most Common Words (Unigrams)",
ylab = "Word Frequency")
text(x = b, y = d[1:20,]$freq, label = d[1:20,]$freq, pos = 3, cex = 0.8, col = "red")
#Build a word cloud
set.seed(47567)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
#Calculate the cumulative relative frequency for every word
d_coverage <- transform(d, cum_percent = cumsum(prop.table(freq)))
#Determine how many words are required for 50% and 90% coverage
uni50 <- which(d_coverage$cum_percent>=0.5)[1]
uni90 <- which(d_coverage$cum_percent>=0.9)[1]
#Plot the graph for word coverage
plot(d_coverage$cum_percent, type='l', xlab = 'Number of words', ylab = 'Cumulative Frequency', main = 'Word Coverage')
abline(v=uni50, col ="red")
abline(v=uni90, col ="blue")
abline(h=0.5, col ="red")
abline(h=0.9, col ="blue")
legend("bottomright", legend = c(paste("50th percentile = ", uni50), paste("90th percentile = ", uni90)), lwd = c(5,3), lty = c(2,1), col = c("red","blue"))
#Save the unigrams data frame to a file
unigrams <-d
saveRDS(unigrams,file = "./unigram.rds")
rm(unigramMatrix,unigramMatrix2, d_coverage, uni50, uni90)
#Build the term document matrix for bigrams and remove sparse terms
bigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))
bigramMatrix2 <- removeSparseTerms(bigramMatrix, sparse = 0.99)
#Construct a frequency data frame for bigrams
m <- as.matrix(bigramMatrix2)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
#Plot the top 20 most frequent bigrams
b <- barplot(d[1:20,]$freq, las = 2, names.arg = d[1:20,]$word,
col ="lightblue", ylim = range(pretty(c(0, d[1:20,]$freq)))*1.1,cex.names=0.8,main ="Most Common Bigrams",
ylab = "Bigram Frequency")
text(x = b, y = d[1:20,]$freq, label = d[1:20,]$freq, pos = 3, cex = 0.8, col = "red")
#Save the bigrams data frame to a file
bigrams <-d
saveRDS(bigrams,file = "./bigram.rds")
rm(bigramMatrix,bigramMatrix2)
#Build the term document matrix for trigrams and remove sparse terms
trigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = trigramTokenizer))
trigramMatrix2 <- removeSparseTerms(trigramMatrix, sparse = 0.99)
#Construct a frequency data frame for trigrams
m <- as.matrix(trigramMatrix2)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
#Plot the top 20 most frequent trigrams
par(mar=c(8,4,4,2))
b <- barplot(d[1:20,]$freq, las = 2, names.arg = d[1:20,]$word,
col ="lightblue", ylim = range(pretty(c(0, d[1:20,]$freq)))*1.1, cex.names=0.7 , main ="Most Common Trigrams",
ylab = "Trigram Frequency")
text(x = b, y = d[1:20,]$freq, label = d[1:20,]$freq, pos = 3, cex = 0.8, col = "red")
#Save the trigrams data frame to a file
trigrams <-d
saveRDS(trigrams,file = "./trigram.rds")
rm(trigramMatrix,trigramMatrix2,m,v,d,corpus,corpusText, unigramTokenizer,bigramTokenizer,trigramTokenizer)
#Construct a summary table for the size of the n-gram objects and the number of n-grams of each type
ngramNames <- c("unigram","bigram","trigram")
ngramSize <- c(round(object.size(unigrams)/1048576,0), round(object.size(bigrams)/1048576,0),
round(object.size(trigrams)/1048576,0))
numLines <- c(nrow(unigrams),nrow(bigrams),nrow(trigrams))
fileSummary <- data.frame(Ngram_name = ngramNames, Object_Size_MB = ngramSize, Number_of_Ngrams = numLines)
knitr::kable(fileSummary)
rm(unigrams,bigrams,trigrams)