0.1 Executive Summary

The goal of this project is to build a predictive text model using a large corpus of text documents. I will be using three different document sources: blogs, news and tweets. The process involves acquiring the data, cleaning it, structuring it for analysis, conducting exploratory analysis, building a predictive model and finally deploying a Shiny app that incorporates the model. This report does not cover model development or the Shiny app; I summarize my approach to both in the “Next Steps” section.

0.2 Data Acquisition and Cleaning

The dataset includes corpora for several languages. I have selected the English versions.

Initial observations of the corpora are below:

Corpus    Size (MB)    # of Lines    # of Words
Blogs           200       899,288    37,546,246
News            196     1,010,242     2,674,536
Tweets          160     2,360,148    30,093,410

Now let’s inspect the longest line in each corpus:

Corpus     Chars    Words
Blogs     40,833    6,726
News       5,760    1,123
Tweets       140       47
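The figures in the two tables above could be gathered along the following lines. This is a minimal sketch in base R, not the code that produced the report: `summarizeLines` is a hypothetical helper, and words are approximated by splitting on white space, so the counts may differ slightly from those of other tokenizers.

```r
# Hypothetical helper: summarize a character vector holding one line per element.
summarizeLines <- function(lines) {
  # Approximate the words per line by splitting each trimmed line on white space
  wordsPerLine <- lengths(strsplit(trimws(lines), "\\s+"))
  charsPerLine <- nchar(lines)
  longest <- which.max(charsPerLine)
  list(
    lines        = length(lines),
    words        = sum(wordsPerLine),
    longestChars = charsPerLine[longest],
    longestWords = wordsPerLine[longest]
  )
}

demo  <- c("the quick brown fox", "hello world", "a much longer line of seven words")
stats <- summarizeLines(demo)
print(stats)
```

For a real file, the lines would come from `readLines()` and the size in MB from `file.info(path)$size / 1024^2`.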

I am using the quanteda package instead of tm because it is much faster when working with large files. Since the files are huge, I am using the rbinom function to draw a random sample of the data, selecting about 10% of the lines for my analysis.
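The sampling step can be sketched as follows. `sampleLines` is a hypothetical helper, and the fixed seed is an assumption added here so that the same sample is drawn on every run.

```r
# Hypothetical helper: keep each line independently with probability `prob`.
sampleLines <- function(lines, prob, seed = 1234) {
  set.seed(seed)  # assumed seed, only to make the sample reproducible
  # With size = 1, each binomial draw is a single biased coin flip (0 or 1)
  keep <- as.logical(rbinom(n = length(lines), size = 1, prob = prob))
  lines[keep]
}

allLines <- paste("line", 1:10000)
subset1  <- sampleLines(allLines, 0.10)
subset2  <- sampleLines(allLines, 0.10)
length(subset1)
```

Because every line is kept or dropped independently, the sample holds roughly, not exactly, 10% of the lines. `sample()` would be the choice if an exact sample size were required.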

The following data transformation and cleanup steps are performed for this project, as they pertain to text prediction:

  1. All characters are converted to lower case.
  2. Excess white space is removed.
  3. All numbers are removed.
  4. Profane words are removed based on a custom list that I defined. This is a very small list and can easily be extended later. See https://en.wikipedia.org/wiki/Seven_dirty_words
  5. Twitter handles are stripped, as they have nothing to do with the context.
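As an illustration only, the five steps can be expressed with base R regular expressions. The report itself relies on quanteda for this, so `cleanText` is a hypothetical helper, and the two-word list is a placeholder standing in for the custom profanity list.

```r
# Hypothetical helper; `profanity` is a placeholder list, not the report's list.
cleanText <- function(x, profanity = c("darn", "heck")) {
  x <- tolower(x)                                   # 1. lower case
  x <- gsub("@\\w+", " ", x, perl = TRUE)           # 5. Twitter handles, before digits are touched
  x <- gsub("[0-9]+", " ", x)                       # 3. numbers
  if (length(profanity) > 0) {                      # 4. listed words, whole words only
    pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x, perl = TRUE)
  }
  x <- gsub("\\s+", " ", x, perl = TRUE)            # 2. collapse excess white space
  trimws(x)
}

cleaned <- cleanText("Thanks @someUser for the  DARN good tips in 2016")
print(cleaned)
# "thanks for the good tips in"
```

White space is collapsed last so that the gaps left by the other removals are cleaned up in the same pass.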

There are a few text-related transformations that I will not perform, as they would be detrimental to this project:

  1. Stopwords will not be removed. Although they are not the primary carriers of the message, they are part of what a user types and must therefore be part of the prediction in this exercise.
  2. Stemming will not be used, as N-grams are typically based on word forms.
  3. Sparse words will not be removed, as they provide clues to the probability of unseen N-grams (Jurafsky and Martin).

It is interesting to observe that although the Twitter corpus has far more lines than the other two, its file is the smallest of the three. This is because Twitter has a limit of 140 characters per tweet.
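A quick calculation with the figures from the corpus table makes this concrete. It is a sketch that assumes the reported sizes are in units of 2^20 bytes.

```r
# Sizes and line counts copied from the corpus table; MB assumed to mean 2^20 bytes.
sizeMB <- c(Blogs = 200, News = 196, Tweets = 160)
nLines <- c(Blogs = 899288, News = 1010242, Tweets = 2360148)

bytesPerLine <- sizeMB * 1024^2 / nLines
print(round(bytesPerLine))
```

Under that assumption the average line is roughly 233 bytes for blogs, 203 for news and 71 for tweets.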

0.3 Exploratory Data Analysis

Before we get to the EDA, here are a few more points about the data:

  1. I am only using a sample (10%) of the complete dataset at this time.
  2. All three corpora are combined into one corpus, to create one big training set for the purpose of the analysis.
  3. I will create a quanteda dfm (document-feature matrix) for unigrams, bigrams, trigrams and 4-grams.
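To make the N-gram idea concrete, here is a minimal base R sketch of how word N-grams are formed from a tokenized sentence. `makeNgrams` is a hypothetical helper for illustration. In the report, quanteda builds the N-grams and counts them in the dfm.

```r
# Hypothetical helper: build word n-grams from a vector of tokens.
makeNgrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))  # too short to form any n-gram
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens   <- strsplit("i want to go home", " ", fixed = TRUE)[[1]]
bigrams  <- makeNgrams(tokens, 2)
trigrams <- makeNgrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A sentence of five words yields four bigrams and three trigrams, so the number of distinct features grows quickly as N increases.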

The exploratory analysis starts with frequency plots for each of the four N-grams (1, 2, 3 and 4). I am only displaying the 40 most frequently seen terms in the corpora, as the plots become very difficult to read otherwise.

Next we look at the coverage of each of the N-grams. For unigrams, fewer than 9,000 words provide 90% coverage out of about 190K distinct words. For bigrams, trigrams and 4-grams, however, the same number of terms covers progressively less of the corpus, with 4-grams being the worst.
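Coverage here means the share of all term occurrences accounted for by the most frequent terms. A minimal sketch of the calculation, with a hypothetical helper `termsForCoverage` and made-up counts:

```r
# Hypothetical helper: number of top-ranked terms needed to reach `target` coverage.
termsForCoverage <- function(freq, target = 0.90) {
  sorted   <- sort(freq, decreasing = TRUE)
  coverage <- cumsum(sorted) / sum(sorted)  # cumulative share of all occurrences
  unname(which(coverage >= target)[1])
}

# Made-up counts for illustration only
freq <- c(the = 50, of = 30, and = 12, cat = 5, dog = 2, emu = 1)
needed <- termsForCoverage(freq, 0.90)
print(needed)
# 3
```

In this toy example the three most frequent of six terms already account for 92% of all occurrences.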

Last, we will look at the beautiful word clouds. While they do not provide much statistical value, they are visually appealing. Again, for clarity, I am only showing a subset of the high-frequency terms for each N-gram.

0.4 Next Steps

The initial basic analysis was done using the entire dataset, but the exploratory analysis was done on a 10% sample drawn with rbinom. This was due to the memory restrictions of my machine. I will be working on an approach to break up the corpora and process them in parts, so that my training set has 75% of the data and my test set has 25%.
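One possible shape for that approach is sketched below. It assumes a random split by line index followed by fixed-size chunks; `splitIndices`, `chunkIndices`, the seed and the chunk size are all hypothetical choices for illustration.

```r
# Hypothetical helper: randomly assign line indices to a train/test split.
splitIndices <- function(n, trainFraction = 0.75, seed = 1234) {
  set.seed(seed)                      # assumed seed, for a reproducible split
  shuffled <- sample.int(n)
  isTrain  <- seq_len(n) <= floor(trainFraction * n)
  list(train = sort(shuffled[isTrain]),
       test  = sort(shuffled[!isTrain]))
}

# Hypothetical helper: cut a set of indices into chunks of at most `chunkSize`.
chunkIndices <- function(idx, chunkSize) {
  split(idx, ceiling(seq_along(idx) / chunkSize))
}

parts       <- splitIndices(1000)
trainChunks <- chunkIndices(parts$train, 200)
print(lengths(trainChunks))
```

Each chunk could then be processed on its own and the resulting N-gram counts added together, which keeps the peak memory use bounded by the chunk size.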

I also have to build a predictive model using Katz back-off. This model estimates the conditional probability of a word given its history in the n-gram. I am also looking into applying Kneser-Ney smoothing to improve the model's predictions. The final goal is to build a Shiny app that incorporates this model. The challenge will be to balance quality against the response time of the app. I am also looking into the Naive Bayes classifier as an alternative prediction model, as it is an “eager learning” algorithm. Naive Bayes has a very close relationship to language modeling. I must confess that I need to dig deeper into this to understand it better.
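The back-off idea can be illustrated with a deliberately simplified sketch. It is not Katz back-off proper: there is no discounting and there are no back-off weights. It simply uses bigram counts when the previous word has been seen and falls back to unigram counts otherwise. `predictNext` and the toy counts are hypothetical.

```r
# Hypothetical helper: predict the next word with a simple back-off rule.
#   bigramCounts:  named counts whose names have the form "w1 w2"
#   unigramCounts: named counts of single words
predictNext <- function(prevWord, bigramCounts, unigramCounts) {
  prefix <- paste0(prevWord, " ")
  hits <- bigramCounts[startsWith(names(bigramCounts), prefix)]
  if (length(hits) > 0) {
    # Seen history: choose the continuation with the highest count
    best <- names(hits)[which.max(hits)]
    return(substring(best, nchar(prefix) + 1))
  }
  # Unseen history: back off to the most frequent single word
  names(unigramCounts)[which.max(unigramCounts)]
}

# Toy counts for illustration only
bigramCounts  <- c("i am" = 5, "i was" = 3, "you are" = 4)
unigramCounts <- c(the = 10, i = 8, you = 4)
print(predictNext("i", bigramCounts, unigramCounts))      # "am"
print(predictNext("zebra", bigramCounts, unigramCounts))  # "the"
```

A full model would extend the same lookup to trigrams and 4-grams and replace the raw counts with discounted probabilities.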

I will need to refine the model and algorithm, and trim the data, to ensure an acceptable response time. If time permits, I might add some more sophistication to this model to get better accuracy.

0.5 References

0.6 Source Code

Given the plagiarism that has been going on in Coursera DSS courses, I have been debating whether to share my code or not. There have been proponents and opponents of this in the discussion forum. I think that no matter what, cheaters will always find a way to beat the system. So I have finally decided to share my code.

I have gained a lot of insight from code snippets shared in the discussion forums and on Stack Overflow. I hope that you find value in my code and get some ideas for your projects. You may even have recommendations for optimising my code; if so, please provide feedback in your evaluation.

For the purpose of this report, I had to rely on saved objects, as recreating the objects took some time. However, you will not see those statements (saveRDS and readRDS) in the code below, as the lines in this section are not executed.
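The caching pattern mentioned above can be sketched as follows. `cachedObject` is a hypothetical helper and the file name is made up. The report's actual saveRDS and readRDS statements are not shown.

```r
# Hypothetical helper: load a cached object if present, otherwise build and save it.
cachedObject <- function(path, buildFn) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  obj <- buildFn()
  saveRDS(obj, path)
  obj
}

cacheFile <- file.path(tempdir(), "demo_cache.rds")  # made-up file name
unlink(cacheFile)                                    # start from a clean state
first  <- cachedObject(cacheFile, function() sum(1:10))            # built, then saved
second <- cachedObject(cacheFile, function() stop("not called"))   # read from the cache
print(c(first, second))
```

On the second call the build function is never run, which is what makes re-knitting a report with expensive objects fast.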

suppressPackageStartupMessages(library(quanteda))
suppressPackageStartupMessages(library(NLP)) 
suppressPackageStartupMessages(library(R.oo))
suppressPackageStartupMessages(library(R.methodsS3))
suppressPackageStartupMessages(library(R.utils))
suppressPackageStartupMessages(library(stringi))
suppressPackageStartupMessages(library(tm))
suppressPackageStartupMessages(library(SnowballC))
suppressPackageStartupMessages(library(RColorBrewer))
suppressPackageStartupMessages(library(wordcloud))
suppressPackageStartupMessages(library(slam))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(ggplot2))

# Data file location
blogFile <- "final/en_US/en_US.blogs.txt"
newsFile <- "final/en_US/en_US.news.txt"
twitterFile <- "final/en_US/en_US.twitter.txt"


# create Corpus object
getCorpus <- function(fileName, sampleSize) {
  fileData <- readLines(fileName, encoding="UTF-8", warn=FALSE, skipNul=TRUE)
  # remove emojis and other non-ASCII characters.
  fileData <- iconv(fileData, "latin1", "ASCII", sub="")
  fileData <- iconv(fileData, "ISO-8859-2", "ASCII", sub="")
  # We want to use only a sample data set as the file is very big.
  # At size 1, each binomial draw is a "biased coin flip" that keeps a line
  # with probability sampleSize.
  indexData <- as.logical(rbinom(n = length(fileData), size = 1, prob = sampleSize))
  fileData <- fileData[indexData]
  # replace $, +, ~ and ^ with spaces (one replacement character per original)
  fileData <- chartr("$+~^", "    ", fileData)
  ## create Corpus Object using Quanteda package
  corpus(fileData)
}

# get Quanteda DFM object
getQdfm <- function(corpusObj, ngram) {
  # https://en.wikipedia.org/wiki/Seven_dirty_words 
  profanityList <- c("shit", "piss", "fuck", "cunt", "tits", "cocksucker", "motherfucker")
  
  mydfm <- dfm(corpusObj, ngrams=ngram, concatenator=" ", 
                what = "fastestword", 
                toLower=TRUE, removeNumbers=TRUE,
                removePunct=TRUE, removeSeparators=TRUE,
                removeTwitter=TRUE, stem=FALSE,
                ignoredFeatures=NULL,
                language="english", dictionary=NULL)
  # drop the features that match the profanity list and return the dfm
  removeFeatures(mydfm, profanityList)
}

#create Frequency Plot
plotWordFrequency <- function(wordFreq, maxLimit, title, yLimit, labSize) {
  barplot(wordFreq[1:maxLimit], las=2, 
          main=title,
          xlab="Word", cex.names=labSize,
          ylab="Frequency", cex.axis=labSize,
          ylim=c(0,yLimit))
}

#create Data Table with cumulative coverage by frequency rank
# (wordFreq is expected to be sorted in decreasing order of frequency)
createDataTable <- function(dtable, wordFreq) {
  dtable  <- data.table(word = names(wordFreq), freq = wordFreq)
  sumFreq <- dtable[,sum(freq)]
  dtable[, cvalue:= cumsum(freq)]
  dtable[, coverage:= cvalue/sumFreq]
  dtable[, rank := 1: .N]
  dtable
}

#create Coverage plot
getCoveragePlot <- function(dtable,aTitle, xLimit, xBreak) {
  ggplot(dtable, aes(x=rank, y=coverage)) +
    geom_line() +
    labs(title=aTitle, x="Word", y="Coverage") +
    scale_x_continuous(breaks=c(seq(0,xLimit, by=xBreak))) + 
    scale_y_continuous(breaks=c(seq(0,1, by=0.10))) 
}

#create Word Cloud
plotWordCloud <- function(qdfm, maxWords, aScale, title) {
  layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
  par(mar=rep(0, 4))
  plot.new()
  text(x=0.5, y=0.5, title)
  plot(qdfm, max.words=maxWords, colors=brewer.pal(8,"Dark2"), 
       random.order=FALSE, scale=aScale)
}

# get 10% sample
corpusBlog <- getCorpus(blogFile, 0.10)
corpusNews <- getCorpus(newsFile, 0.10)
corpusTwitter <- getCorpus(twitterFile, 0.10)
corpusAll <- corpusBlog + corpusNews + corpusTwitter

dfm1 <- getQdfm(corpusAll, 1)
dfm2 <- getQdfm(corpusAll, 2)
dfm3 <- getQdfm(corpusAll, 3)
dfm4 <- getQdfm(corpusAll, 4)

#plot of frequent words
n <- 40
wordsFreq1 <- colSums(sort(dfm1))
wordsFreq2 <- colSums(sort(dfm2))
wordsFreq3 <- colSums(sort(dfm3))
wordsFreq4 <- colSums(sort(dfm4))

plotWordFrequency(wordsFreq1, n, paste(n, "Most Frequent Unigrams"), 350000, 0.75)
plotWordFrequency(wordsFreq2, n, paste(n, "Most Frequent Bigrams"), 30000, 0.75)
plotWordFrequency(wordsFreq3, n, paste(n, "Most Frequent Trigrams"), 2500, 0.60)
plotWordFrequency(wordsFreq4, n, paste(n, "Most Frequent 4grams"), 700, 0.50)

# conversion to data table
dtable1G <- createDataTable(data.table(), wordsFreq1)
dtable2G <- createDataTable(data.table(), wordsFreq2)
dtable3G <- createDataTable(data.table(), wordsFreq3)
dtable4G <- createDataTable(data.table(), wordsFreq4)

## plotting coverage for all ngrams
getCoveragePlot(dtable1G, "Coverage for Unigrams", 200000, 20000)
getCoveragePlot(dtable2G, "Coverage for Bigrams", 2000000, 200000)
getCoveragePlot(dtable3G, "Coverage for Trigrams", 4500000, 500000)
getCoveragePlot(dtable4G, "Coverage for 4grams", 5500000, 500000)

# Word Cloud plot
set.seed(1234)
plotWordCloud(dfm1, 200, c(4,.5), "Word Cloud for Unigram")
plotWordCloud(dfm2, 150, c(2,.5), "Word Cloud for Bigram")
plotWordCloud(dfm3, 125, c(1,.5), "Word Cloud for Trigram")
plotWordCloud(dfm4, 100, c(1,.5), "Word Cloud for 4gram")