Overview

The goal of the capstone is to develop an algorithm that predicts the most likely next word in a sequence of words. This report documents the Exploratory Data Analysis that will underpin the prediction application and algorithm. The model will be trained on a collection of text (i.e. a corpus) compiled from three sources: news, blogs, and tweets. The purpose of this report is to demonstrate how the data was downloaded, imported into R, and cleaned. It also contains some exploratory analyses investigating a few features of the data.

1. Include the libraries needed for exploratory analysis

library(NLP)
library(tm)
## Warning: package 'tm' was built under R version 3.4.3
library(stringi)
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.4.3
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.4.3
## Loading required package: RColorBrewer
library(RColorBrewer)

2. Download the content to a local directory

The raw corpus data is downloaded and stored in the local working directory below; the first 1,000 lines of each file are read in for this analysis.

setwd("/Sridharan/Others/Data Science/Capstone/")
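The download step itself is not shown above. For completeness, a minimal sketch follows; the archive URL and the ./final/en_US/ layout it unzips to are assumptions, not confirmed by this report.

# Sketch (assumption): fetch and unzip the corpus archive if it is not already present
zipUrl  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipFile <- "Coursera-SwiftKey.zip"
if (!file.exists(zipFile)) {
  download.file(zipUrl, destfile = zipFile, mode = "wb")
}
if (!dir.exists("./final/en_US")) {
  unzip(zipFile)   # creates the ./final/en_US/*.txt files read below
}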


# Read the first 1,000 lines of each source file
sBlog <- readLines("./final/en_US/en_US.blogs.txt", 1000)
sNews <- readLines("./final/en_US/en_US.news.txt", 1000)
sTwit <- readLines("./final/en_US/en_US.twitter.txt", 1000)
library(stringi)
library(knitr)

3. Validate the profile of the files: lines, characters, words, and words-per-line statistics

Summary statistics (lines, characters, words, and words per line) are computed for each of the three samples.

WordsPerLine <- sapply(list(sBlog,sNews,sTwit),function(x) summary(stri_count_words(x))[c('Min.','Mean','Max.')])
rownames(WordsPerLine) <- c('WordsPerLine_Min','WordsPerLine_Mean','WordsPerLine_Max')
filestats <- data.frame(
  FileName=c("blogs","news","twitter"),      
  t(rbind(
    sapply(list(sBlog,sNews,sTwit),stri_stats_general)[c('Lines','Chars'),],
    Words=sapply(list(sBlog,sNews,sTwit),stri_stats_latex)['Words',],
    WordsPerLine)
  ))
kable(filestats)
FileName   Lines   Chars   Words   WordsPerLine_Min   WordsPerLine_Mean   WordsPerLine_Max
blogs       1000  232636   42850                  1             42.877                395
news        1000  198531   33760                  1             34.189                156
twitter     1000   68647   12865                  2             12.749                 31

4. Read sample lines of the downloaded files

Select the first 100 lines of the blogs, news, and twitter files

blogs_samplelines <- readLines("./final/en_US/en_US.blogs.txt", 100)
news_samplelines <- readLines("./final/en_US/en_US.news.txt", 100)
twitter_samplelines <- readLines("./final/en_US/en_US.twitter.txt", 100)

5. Create a subset of the files for processing

The raw files are quite large, so a small subset is created to build the corpus.

blogs_subset <-  blogs_samplelines
news_subset <- news_samplelines
twitter_subset <- twitter_samplelines
# clean up objects that are no longer needed
rm( blogs_samplelines, news_samplelines, twitter_samplelines)

6-A. Data Cleanup: Convert to ASCII characters

Convert the text to ASCII so that special and non-ASCII characters are removed

# Remove non-standard characters for sampled Blogs/News/Twitter
blogs_subset <- iconv(blogs_subset, "UTF-8", "ASCII", sub="")
news_subset <- iconv(news_subset, "UTF-8", "ASCII", sub="")
twitter_subset <- iconv(twitter_subset, "UTF-8", "ASCII", sub="")
sampleData <- c(blogs_subset,news_subset,twitter_subset)
# clean up objects that are no longer needed
rm(blogs_subset,news_subset,twitter_subset)
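As a quick illustration (a hypothetical one-off example, not part of the pipeline), iconv with sub="" simply drops characters that have no ASCII equivalent:

# Hypothetical example: accented letters are dropped, plain ASCII is kept
iconv("caf\u00e9 r\u00e9sum\u00e9 100%", "UTF-8", "ASCII", sub = "")   # "caf rsum 100%"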

6-B. Data Cleanup: Convert to lower case, remove punctuation, numbers, and profanity

library(tm)
library(NLP)
corpus <- VCorpus(VectorSource(sampleData))
# content_transformer() keeps the documents as PlainTextDocuments when applying base functions
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
# Remove offensive words (https://www.cs.cmu.edu/~biglou/resources/bad-words.txt)

#bad_words <- read.csv("https://www.cs.cmu.edu/~biglou/resources/bad-words.txt",header =FALSE, strip.white = TRUE, stringsAsFactors = FALSE)
#corpus <- tm_map(corpus, removeWords, bad_words$V1)
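For reference, a runnable variant of the commented-out profanity step is sketched below; it assumes the linked list is a plain one-word-per-line file and simply passes those words to tm's removeWords transformation. This is a sketch, not the report's final cleaning step.

# Sketch (assumption): fetch the word list and drop those terms from the corpus
bad_words_url <- "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
bad_words <- tryCatch(readLines(bad_words_url, warn = FALSE),
                      error = function(e) character(0))
bad_words <- bad_words[nzchar(bad_words)]
if (length(bad_words) > 0) {
  corpus <- tm_map(corpus, removeWords, bad_words)
}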

7. Tokenize the sample

Define functions to create unigram, bigram, trigram, quadgram, and quintgram tokenizers

library(RWeka) # Weka is a collection of machine learning algorithms for data mining

UnigramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
QuadgramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
QuintgramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 5, max = 5))
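For intuition, applying one of these tokenizers to a short string produces overlapping windows of words (an illustrative call, not part of the report's pipeline):

# Illustration: two-word sliding windows over a short phrase
BigramTokens("thanks for the follow")
# expected: "thanks for"  "for the"  "the follow"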

8. Generate Term Document Matrices for the n-grams

Build a term document matrix for each n-gram size by calling the tokenizer functions defined above

Unigrams <- TermDocumentMatrix(corpus, control = list(tokenize = UnigramTokens))
Bigrams <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokens))
Trigrams <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokens))
Quadgrams <- TermDocumentMatrix(corpus, control = list(tokenize = QuadgramTokens))
Quintgrams <- TermDocumentMatrix(corpus, control = list(tokenize = QuintgramTokens))
Unigrams
## <<TermDocumentMatrix (terms: 2936, documents: 300)>>
## Non-/sparse entries: 6040/874760
## Sparsity           : 99%
## Maximal term length: 21
## Weighting          : term frequency (tf)
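The matrices can also be inspected directly; for example, tm's findFreqTerms() lists the terms that reach a given frequency. This is an optional check, not part of the original report:

# Optional check: unigrams that appear at least 50 times in the sample
findFreqTerms(Unigrams, lowfreq = 50)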

9. Remove Sparse Terms

Drop sparse terms: terms with sparsity above 99.9% (99.99% for quadgrams and quintgrams) are removed

UnigramsDense <- removeSparseTerms(Unigrams, 0.999)
BigramsDense <- removeSparseTerms(Bigrams, 0.999)
TrigramsDense <- removeSparseTerms(Trigrams, 0.999)
QuadgramsDense <- removeSparseTerms(Quadgrams, 0.9999)
QuintgramsDense <- removeSparseTerms(Quintgrams, 0.9999)

10. Analyze the word frequencies and sort the datasets

A function to compute the frequency of each term and sort the result in decreasing order

freq_frame <- function(tdm){
  freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
  freq_frame <- data.frame(word=names(freq), freq=freq)
  return(freq_frame)
}

Invoke the function for each of the n-grams

UnigramsDenseOrdered <- freq_frame(UnigramsDense)
BigramsDenseOrdered <- freq_frame(BigramsDense)
TrigramsDenseOrdered <- freq_frame(TrigramsDense)
QuadgramsDenseOrdered <- freq_frame(QuadgramsDense)
QuintgramsDenseOrdered <- freq_frame(QuintgramsDense)
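Since each of these data frames is sorted by frequency, the head of each table shows the most common n-grams (an optional inspection step, not in the original report):

# Peek at the most frequent unigrams and bigrams
head(UnigramsDenseOrdered, 10)
head(BigramsDenseOrdered, 10)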

11. Create ggplot charts for the n-grams

A function to generate a ggplot of the most frequent terms and their frequencies

library(ggplot2)

plotgrams <- function(data, title, num) {
  top_grams<-data[1:num,]
  top_grams$word<-as.character(top_grams$word)
  
  ggplot(top_grams, aes(x=reorder(word, -freq),y=freq, label = word, fill = factor(word)   )) + 
    geom_bar(stat="identity") + 
    ggtitle(paste(title, "- Top ", num)) + 
    xlab(title) + ylab("Frequency") + 
    theme(axis.text.x=element_text(angle=90, hjust=1)) +
    theme(legend.position="none")
  
}

12. Plot the graphs for the top 25 terms of each n-gram

Unigrams - Most Frequently Used words

plotgrams(UnigramsDenseOrdered,"Unigrams",25)

Bigrams - Most frequently used word pairs

plotgrams(BigramsDenseOrdered,"Bigrams",25)

Trigrams - Most Frequently used three consecutive words

plotgrams(TrigramsDenseOrdered,"Trigrams",25)

Quadgrams - Most Frequently used four consecutive words

plotgrams(QuadgramsDenseOrdered,"Quadgrams",25)

Quintgrams - Most Frequently used five consecutive words

plotgrams(QuintgramsDenseOrdered,"Quintgrams",25)

13. Create a word cloud of the top 100 words

Create the word cloud of the 100 most frequently used words

wordcloud(corpus, max.words = 100, random.order = FALSE,rot.per=0.35, use.r.layout=FALSE,colors=brewer.pal(8, "Dark2")) 
title("Wordcloud: 100 Most Frequently Used Words")

14. Next steps

The plan is to create a model that predicts the next word given a set of preceding words in a sentence. The next steps are:

  1. Create an RDS file:
    I had significant issues handling the large files while creating the Term Document Matrix, due to their size, limited memory, and network bandwidth. The sample size needs to be fine-tuned so that an optimal amount of data is used without losing quality.

  2. Identify the right prediction model:
    Currently no weights are assigned. A better model would assign weights using a backoff algorithm; a minimal sketch follows this list. Multiple approaches can be applied, and identifying the right model is critical for performance and prediction quality. In addition, decide on the maximum n-gram order (it may be limited to quadgrams or quintgrams).

  3. Sample size: possibly implement other smoothing techniques. I might use a large Linux box and upgrade to 64-bit RStudio to handle the file size.

  4. Learn from similar applications and from publicly available information.
    Applications such as Google's autocomplete provide a reference design for the development. Complementing the n-gram model with other, similar approaches (and datasets) should provide better accuracy.

  5. Develop a Shiny app with server and UI components that uses the n-gram functions to predict the next word.
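
As a starting point for item 2, here is a minimal sketch of a stupid-backoff-style next-word lookup built on the n-gram frequency tables above. The fixed 0.4 penalty per backoff level and the helper name predict_next_word are illustrative assumptions, not the final model.

# Sketch (assumption): score candidate next words from the n-gram tables,
# backing off from the longest context to shorter ones with a fixed penalty.
# ngram_tables must be ordered from longest to shortest n-gram, ending with bigrams,
# e.g. list(QuadgramsDenseOrdered, TrigramsDenseOrdered, BigramsDenseOrdered).
predict_next_word <- function(phrase, ngram_tables, penalty = 0.4) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  candidates <- data.frame(word = character(0), score = numeric(0),
                           stringsAsFactors = FALSE)
  for (i in seq_along(ngram_tables)) {
    tbl   <- ngram_tables[[i]]
    terms <- as.character(tbl$word)
    context_len <- length(ngram_tables) - i + 1   # words of context this table uses
    if (length(words) < context_len) next
    context <- paste(tail(words, context_len), collapse = " ")
    hits <- startsWith(terms, paste0(context, " "))
    if (any(hits)) {
      next_words <- substring(terms[hits], nchar(context) + 2)
      score <- penalty^(i - 1) * tbl$freq[hits] / sum(tbl$freq[hits])
      candidates <- rbind(candidates,
                          data.frame(word = next_words, score = score,
                                     stringsAsFactors = FALSE))
    }
  }
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$word[which.max(candidates$score)]
}
# Hypothetical usage:
# predict_next_word("thanks for the",
#                   list(QuadgramsDenseOrdered, TrigramsDenseOrdered, BigramsDenseOrdered))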