In this capstone I will be applying data science to the area of natural language processing. The next-word prediction model will be trained on a data set from HC Corpora, which is available on the Coursera site. This report aims to provide a basic understanding of the data that will be used in the Data Science Capstone. I will describe the major features of the data and briefly summarize my plans for creating the prediction algorithm and Shiny app. The model will be an n-gram model using 2-grams and 3-grams.
The file from Coursera contains text data in four languages; this project will be built only on the English data. Three text files will be used to build the corpus, containing text from three different sources: blogs, news and Twitter.
In addition, I will be using a file containing profanities from Google in order to filter these words out of the corpus. This file can be downloaded at: https://code.google.com/archive/p/badwordslist/downloads
| File_Name | Object_Size_MB | Number_of_Lines | Number_of_Words | Max_Words_per_Line |
|---|---|---|---|---|
| en_US.blogs.txt | 255 | 899288 | 37334131 | 6630 |
| en_US.news.txt | 257 | 1010242 | 34372530 | 1792 |
| en_US.twitter.txt | 319 | 2360148 | 30373583 | 47 |
The table above summarizes the 3 files containing text from blogs, news and Twitter sources. All 3 data sets occupy a similar amount of memory, roughly 250-320 MB. The Twitter file contains more lines than blogs and news combined, yet all three have a comparable number of words, 30-37 million. This means there are fewer words per line in the Twitter file than in blogs and news. In fact, the longest Twitter line is only 47 words, which makes sense given Twitter's character limit.
Because of the very large number of words, the model cannot be built on the entire data set, so I will sample 10% of the lines from each source. These samples are then merged into one data set that is used to construct the corpus for the prediction model.
Before looking at word frequencies, the text corpus requires some additional pre-processing to improve the accuracy of the final model. The following transformations are applied to the text: removing non-English characters, punctuation, numbers, URLs, Twitter handles, email patterns and emojis; removing profane words based on the Google list; removing stop words; converting to lower case; and trimming extra white space. Word stemming is not used here, but it could improve the accuracy of the model if done correctly.
After building the corpus, the next step is to construct the tokens for the natural language processing model. I will be using 1-, 2- and 3-grams, which will be generated with the RWeka package in R.
It is useful to look at the frequencies of the n-grams, starting with unigrams, which are simply single words.
A graph of the top 20 words by number of occurrences in the text shows that “will” is the most common word with around 31,000 occurrences, while the 20th most common word has about 13,000. Words like “the”, “a”, “an”, “so” and “what” do not appear since these stop words have been removed from the corpus. A word cloud can be seen below.
It is also useful to look at word coverage. For example, how many unique words would be required in a frequency-sorted dictionary to cover 50% of all word instances in the language? What about 90%?
The graph shows that only 986 words are needed to cover 50% of the whole text, while 90% coverage requires 16,852 unique words in the dictionary. This should be kept in mind when building the model, since reducing the size of the dictionary can save processing time and memory, as long as it does not come at the cost of too large an accuracy loss.
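To illustrate how such a reduction could work, here is a minimal sketch that prunes a frequency-sorted dictionary at a chosen coverage level. It assumes the `word`/`freq` data frames built in the appendix code; the helper name `prune_dictionary` is hypothetical.

```r
# Minimal sketch: keep only the most frequent words needed to reach a target
# coverage. Assumes a data frame sorted by decreasing frequency with columns
# `word` and `freq`, as built in the appendix code.
prune_dictionary <- function(d, coverage = 0.9) {
  cum_percent <- cumsum(prop.table(d$freq))    # cumulative relative frequency
  cutoff <- which(cum_percent >= coverage)[1]  # first row reaching the target
  d[seq_len(cutoff), ]
}

# Example: reduce the unigram dictionary to the ~16852 words covering 90% of the text
# unigrams_small <- prune_dictionary(unigrams, coverage = 0.9)
```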
The most common bigram is “right now” with about 2400 occurrences and the 20th most common is “looks like” with about 900.
The most common trigram is “cant wait see” (the remnant of phrases like “can’t wait to see” after punctuation and stop words are removed), with about 400 occurrences, and the 20th most common is “gov chris christie” with about 90.
| Ngram_name | Object_Size_MB | Number_of_Ngrams |
|---|---|---|
| unigram | 30 | 214454 |
| bigram | 498 | 3264367 |
| trigram | 780 | 4567812 |
The bigram and especially the trigram dictionaries are very large and occupy significantly more memory than the single-word unigrams. This is because the same word can appear in many different bigrams and trigrams. Using anything longer than trigrams would likely exceed the memory available to a Shiny app.
The predictive algorithm will be developed using an n-gram model with n = 3. The current plan is to start from the 3-gram model and back off to the 2-gram and 1-gram models, interpolating the predicted probabilities of the individual models with one another. Katz’s back-off model will be used to estimate the probability of unobserved n-grams, and I am considering Laplace smoothing to address the problem of zero probabilities. The model will most likely be stored as a Markov chain. It will be evaluated by applying it to a test set and measuring metrics such as perplexity and accuracy.
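As a rough illustration of the back-off idea (simplified here to ranking candidates by raw frequency rather than the full Katz back-off with discounting), the sketch below looks up continuations in the trigram table and falls back to the bigram table when no match is found. It assumes the n-gram frequency tables saved by the appendix code; `predict_next` is a hypothetical helper, not the final algorithm.

```r
# Simplified back-off-style lookup, for illustration only. Assumes the bigram and
# trigram tables saved by the appendix code (columns `word` and `freq`, sorted by
# decreasing frequency, with space-separated n-grams in `word`).
bigrams  <- readRDS("./bigram.rds")
trigrams <- readRDS("./trigram.rds")

predict_next <- function(last_two_words, n = 5) {
  # Trigram candidates whose first two words match the input
  hits <- trigrams[startsWith(as.character(trigrams$word),
                              paste0(last_two_words, " ")), ]
  if (nrow(hits) == 0) {
    # Back off to bigrams that start with the final word only
    last_word <- tail(strsplit(last_two_words, " ")[[1]], 1)
    hits <- bigrams[startsWith(as.character(bigrams$word),
                               paste0(last_word, " ")), ]
  }
  # Return the last word of the top n candidates
  head(vapply(strsplit(as.character(hits$word), " "),
              function(w) tail(w, 1), character(1)), n)
}

# Example: predict_next("right now")
```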
The plan for the design of the Shiny app is to have a text field where the user enters a sentence; after every word, the app will display the top 5 predictions for the next word. The user can then click on one of the predicted words to append it to the sentence.
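The following is a minimal sketch of what that interface might look like, assuming a prediction function such as the hypothetical `predict_next()` above; it is not the final app.

```r
# Minimal Shiny sketch of the planned interface, assuming a prediction function
# such as the hypothetical predict_next() above. Not the final app.
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("sentence", "Type a sentence:", value = ""),
  uiOutput("suggestions")   # buttons showing the top 5 predicted words
)

server <- function(input, output, session) {
  preds <- reactive({
    words <- strsplit(trimws(tolower(input$sentence)), "\\s+")[[1]]
    if (length(words) == 0) return(character(0))
    predict_next(paste(tail(words, 2), collapse = " "), n = 5)
  })

  # One action button per predicted word
  output$suggestions <- renderUI({
    lapply(seq_along(preds()), function(i) {
      actionButton(paste0("pred", i), label = preds()[i])
    })
  })

  # Clicking a suggestion appends that word to the sentence
  lapply(1:5, function(i) {
    observeEvent(input[[paste0("pred", i)]], {
      chosen <- preds()[i]
      if (!is.na(chosen))
        updateTextInput(session, "sentence",
                        value = paste(input$sentence, chosen))
    })
  })
}

# shinyApp(ui, server)
```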
These plans are subject to change after exploring more options for the final model. As illustrated in this report, the main concern is the size of the data set, so attention needs to be given to memory and performance issues. The parameters of the model will be adjusted to make sure the app runs on the Shiny server and does not take too long to load. The dictionaries will most likely be reduced, taking into account the coverage of words.
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
library(ngram)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(RWeka)
#Download and unzip data files from Coursera
if (!file.exists("Coursera-SwiftKey.zip")) {
download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile = "./Coursera-SwiftKey.zip", method = "curl")
unzip("Coursera-SwiftKey.zip")
}
#Download list of profanities from Google
if (!file.exists("badwords.txt")) {
download.file(url = "http://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/badwordslist/badwords.txt", destfile = "badwords.txt", method = "curl")
}
#Read English version of text files used to construct the model
con <- file("./final/en_US/en_US.blogs.txt", open = "rb")
blogs <- readLines(con, skipNul = TRUE)
close(con)
con <- file("./final/en_US/en_US.news.txt", open = "rb")
news <- readLines(con, skipNul = TRUE)
close(con)
con <- file("./final/en_US/en_US.twitter.txt", open = "rb")
twitter <- readLines(con, skipNul = TRUE)
close(con)
#Read file containing bad words.
con <- file("./badwords.txt", open = "r")
profanity_filter <- readLines(con, skipNul = TRUE)
close(con)
#Construct a summary table for the size of character vectors in memory, number of lines and words
fileNames <- c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt")
fileSize <- c(round(object.size(blogs)/1048576,0), round(object.size(news)/1048576,0),
round(object.size(twitter)/1048576,0))
numLines <- c(length(blogs),length(news),length(twitter))
numWords <- c(wordcount(blogs), wordcount(news), wordcount(twitter))
#Find maximum number of words in a line for each file
max_blogs <- max(sapply(blogs, wordcount, USE.NAMES = FALSE))
max_news <- max(sapply(news, wordcount, USE.NAMES = FALSE))
max_twitter <- max(sapply(twitter, wordcount, USE.NAMES = FALSE))
maxWordsLine <- c(max_blogs,max_news,max_twitter)
fileSummary <- data.frame(File_Name = fileNames, Object_Size_MB = fileSize, Number_of_Lines = numLines,
Number_of_Words = numWords, Max_Words_per_Line = maxWordsLine)
knitr::kable(fileSummary)
#Sample 10% of blogs, news and twitter data
set.seed(47567)
blogs <- sample(blogs, length(blogs) * 0.1, replace = FALSE)
news <- sample(news, length(news) * 0.1, replace = FALSE)
twitter <- sample(twitter, length(twitter) * 0.1, replace = FALSE)
#Remove foreign characters from the words
blogs <- iconv(blogs, "latin1", "ASCII", sub = "")
news <- iconv(news, "latin1", "ASCII", sub = "")
twitter <- iconv(twitter, "latin1", "ASCII", sub = "")
#Construct one file from blogs, news and twitter files and save it
allSample <- c(blogs, news, twitter)
fileCon<-file("data.txt")
writeLines(allSample, fileCon)
close(fileCon)
#Preprocess the profanity filter
profanity_filter <- iconv(profanity_filter, "latin1", "ASCII", sub = "")
profanity_filter <- gsub(profanity_filter, pattern = "[^A-Z a-z.-]", replacement = "")
profanity_filter <- tolower(profanity_filter)
#Construct corpus, remove punctuation, numbers, special characters and patterns, convert to lowercase, remove stop words and profanities, remove extra white space
corpus <- Corpus(VectorSource(allSample))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, function(x) gsub(x, pattern = "(f|ht)tp(s?)://(.*)[.][a-z]+", replacement = ""))
corpus <- tm_map(corpus, function(x) gsub(x, pattern = "@[^\\s]+", replacement = ""))
corpus <- tm_map(corpus, function(x) gsub(x, pattern = "\\b[A-Z a-z 0-9._ - ]*[@](.*?)[.]{1,3} \\b", replacement = ""))
corpus <- tm_map(corpus, function(x) gsub(x, pattern = "[^\x01-\x7F]", replacement = ""))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeWords, profanity_filter)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
#Convert corpus to data frame and save to file
corpusText <- data.frame(text = sapply(corpus, as.character), stringsAsFactors = FALSE)
fileCon<-file("corpus.txt")
writeLines(corpusText$text, fileCon)
close(fileCon)
rm(allSample, blogs, news, twitter, profanity_filter, con, fileCon)
#Reconstruct corpus from data frame into a volatile corpus. Tokenization doesn't work on simple corpus
corpus <- VCorpus(VectorSource(corpusText))
#Define unigram, bigram and trigram functions
unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
#Build the term document matrix for unigrams and remove sparse terms
unigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = unigramTokenizer))
unigramMatrix2 <- removeSparseTerms(unigramMatrix, sparse = 0.99)
#Construct a frequency data frame for unigrams
m <- as.matrix(unigramMatrix2)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
#Plot the top 20 most frequent unigrams (words)
b <- barplot(d[1:20,]$freq, las = 2, names.arg = d[1:20,]$word,
col ="lightblue", ylim = range(pretty(c(0, d[1:20,]$freq)))*1.1,main ="Most Common Words (Unigrams)",
ylab = "Word Frequency")
text(x = b, y = d[1:20,]$freq, label = d[1:20,]$freq, pos = 3, cex = 0.8, col = "red")
#Build a word cloud
set.seed(47567)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
#Calculate the cumulative relative frequency for every word
d_coverage <- transform(d, cum_percent = cumsum(prop.table(freq)))
#Determine how many words are required for 50% and 90% coverage
uni50 <- which(d_coverage$cum_percent>=0.5)[1]
uni90 <- which(d_coverage$cum_percent>=0.9)[1]
#Plot the graph for word coverage
plot(d_coverage$cum_percent, type='l', xlab = 'Number of words', ylab = 'Cumulative Frequency', main = 'Word Coverage')
abline(v=uni50, col ="red")
abline(v=uni90, col ="blue")
abline(h=0.5, col ="red")
abline(h=0.9, col ="blue")
legend("bottomright", legend = c(paste("50th percentile = ", uni50), paste("90th percentile = ", uni90)), lwd = c(5,3), lty = c(2,1), col = c("red","blue"))
#Save the unigrams data frame to a file
unigrams <-d
saveRDS(unigrams,file = "./unigram.rds")
rm(unigramMatrix,unigramMatrix2, d_coverage, uni50, uni90)
#Build the term document matrix for bigrams and remove sparse terms
bigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))
bigramMatrix2 <- removeSparseTerms(bigramMatrix, sparse = 0.99)
#Construct a frequency data frame for bigrams
m <- as.matrix(bigramMatrix2)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
#Plot the top 20 most frequent bigrams
b <- barplot(d[1:20,]$freq, las = 2, names.arg = d[1:20,]$word,
col ="lightblue", ylim = range(pretty(c(0, d[1:20,]$freq)))*1.1,cex.names=0.8,main ="Most Common Bigrams",
ylab = "Bigram Frequency")
text(x = b, y = d[1:20,]$freq, label = d[1:20,]$freq, pos = 3, cex = 0.8, col = "red")
#Save the bigrams data frame to a file
bigrams <-d
saveRDS(bigrams,file = "./bigram.rds")
rm(bigramMatrix,bigramMatrix2)
#Build the term document matrix for trigrams and remove sparse terms
trigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = trigramTokenizer))
trigramMatrix2 <- removeSparseTerms(trigramMatrix, sparse = 0.99)
#Construct a frequency data frame for trigrams
m <- as.matrix(trigramMatrix2)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
#Plot the top 20 most frequent trigrams
par(mar=c(8,4,4,2))
b <- barplot(d[1:20,]$freq, las = 2, names.arg = d[1:20,]$word,
col ="lightblue", ylim = range(pretty(c(0, d[1:20,]$freq)))*1.1, cex.names=0.7 , main ="Most Common Trigrams",
ylab = "Trigram Frequency")
text(x = b, y = d[1:20,]$freq, label = d[1:20,]$freq, pos = 3, cex = 0.8, col = "red")
#Save the trigrams data frame to a file
trigrams <-d
saveRDS(trigrams,file = "./trigram.rds")
rm(trigramMatrix,trigramMatrix2,m,v,d,corpus,corpusText, unigramTokenizer,bigramTokenizer,trigramTokenizer)
#Construct a summary table for the size of the n-gram objects and the number of n-grams of each type
ngramNames <- c("unigram","bigram","trigram")
ngramSize <- c(round(object.size(unigrams)/1048576,0), round(object.size(bigrams)/1048576,0),
round(object.size(trigrams)/1048576,0))
numLines <- c(nrow(unigrams),nrow(bigrams),nrow(trigrams))
fileSummary <- data.frame(Ngram_name = ngramNames, Object_Size_MB = ngramSize, Number_of_Ngrams = numLines)
knitr::kable(fileSummary)
rm(unigrams,bigrams,trigrams)