The goal of the capstone is to develop a prediction algorithm for the most likely next word in a sequence of words. This report describes the exploratory data analysis that will inform the prediction application and algorithm. The model will be trained on a collection of text (i.e. a corpus) compiled from three sources: news, blogs, and tweets. The report demonstrates how the data was downloaded, imported into R and cleaned, and includes some exploratory analyses that investigate a few features of the data.
library(NLP)
library(tm)
## Warning: package 'tm' was built under R version 3.4.3
library(stringi)
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.4.3
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.4.3
## Loading required package: RColorBrewer
library(RColorBrewer)
The raw corpus data is downloaded and stored locally. The working directory is set to the local data folder and the first 1,000 lines of each file are read in:
setwd("/Sridharan/Others/Data Science/Capstone/")
sBlog <- readLines("./final/en_US/en_US.blogs.txt", 1000)
sNews <- readLines("./final/en_US/en_US.news.txt", 1000)
sTwit <- readLines("./final/en_US/en_US.twitter.txt", 1000)
library(stringi)
library(knitr)
Compute basic summary statistics (lines, characters, words, and words per line) for each file:
WordsPerLine <- sapply(list(sBlog,sNews,sTwit),function(x) summary(stri_count_words(x))[c('Min.','Mean','Max.')])
rownames(WordsPerLine) <- c('WordsPerLine_Min','WordsPerLine_Mean','WordsPerLine_Max')
filestats <- data.frame(
FileName=c("blogs","news","twitter"),
t(rbind(
sapply(list(sBlog,sNews,sTwit),stri_stats_general)[c('Lines','Chars'),],
Words=sapply(list(sBlog,sNews,sTwit),stri_stats_latex)['Words',],
WordsPerLine)
))
kable(filestats)
| FileName | Lines | Chars | Words | WordsPerLine_Min | WordsPerLine_Mean | WordsPerLine_Max |
|---|---|---|---|---|---|---|
| blogs | 1000 | 232636 | 42850 | 1 | 42.877 | 395 |
| news | 1000 | 198531 | 33760 | 1 | 34.189 | 156 |
| twitter | 1000 | 68647 | 12865 | 2 | 12.749 | 31 |
Select the first 100 lines of blogs, news and twitter:
blogs_samplelines <- readLines("./final/en_US/en_US.blogs.txt", 100)
news_samplelines <- readLines("./final/en_US/en_US.news.txt", 100)
twitter_samplelines <- readLines("./final/en_US/en_US.twitter.txt", 100)
The raw files are quite large, so a small subset is used to create the corpus.
blogs_subset <- blogs_samplelines
news_subset <- news_samplelines
twitter_subset <- twitter_samplelines
# clean up objects that are no longer needed
rm( blogs_samplelines, news_samplelines, twitter_samplelines)
Make sure all special and non-ASCII characters are removed before building the corpus:
# Remove non-standard characters for sampled Blogs/News/Twitter
blogs_subset <- iconv(blogs_subset, "UTF-8", "ASCII", sub="")
news_subset <- iconv(news_subset, "UTF-8", "ASCII", sub="")
twitter_subset <- iconv(twitter_subset, "UTF-8", "ASCII", sub="")
sampleData <- c(blogs_subset,news_subset,twitter_subset)
# clean up objects that are no longer needed
rm(blogs_subset,news_subset,twitter_subset)
library(tm)
library(NLP)
corpus <- VCorpus(VectorSource(sampleData))
corpus <- tm_map(corpus, content_transformer(tolower)) # wrap base tolower so the corpus structure is preserved
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
# Remove offensive words (https://www.cs.cmu.edu/~biglou/resources/bad-words.txt)
#bad_words <- read.csv("https://www.cs.cmu.edu/~biglou/resources/bad-words.txt",header =FALSE, strip.white = TRUE, stringsAsFactors = FALSE)
#corpus <- tm_map(corpus, removeWords, bad_words$V1)
Define tokenizer functions to create unigrams, bigrams, trigrams, quadgrams and quintgrams:
library(RWeka) # Weka is a collection of machine learning algorithms for data mining
UnigramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
QuadgramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
QuintgramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 5, max = 5))
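As a quick check (assuming RWeka and a working Java installation), the bigram tokenizer should split a short sentence into overlapping word pairs:
BigramTokens("the quick brown fox")
## expected output: "the quick"   "quick brown"  "brown fox"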
Build the term-document matrices by calling the tokenizer functions above for unigrams, bigrams, trigrams, quadgrams and quintgrams:
Unigrams <- TermDocumentMatrix(corpus, control = list(tokenize = UnigramTokens))
Bigrams <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokens))
Trigrams <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokens))
Quadgrams <- TermDocumentMatrix(corpus, control = list(tokenize = QuadgramTokens))
Quintgrams <- TermDocumentMatrix(corpus, control = list(tokenize = QuintgramTokens))
Unigrams
## <<TermDocumentMatrix (terms: 2936, documents: 300)>>
## Non-/sparse entries: 6040/874760
## Sparsity : 99%
## Maximal term length: 21
## Weighting : term frequency (tf)
Exclude extremely sparse terms: terms with sparsity above 99.9% are dropped from the term-document matrices (a threshold of 99.99% is used for the quadgrams and quintgrams):
UnigramsDense <- removeSparseTerms(Unigrams, 0.999)
BigramsDense <- removeSparseTerms(Bigrams, 0.999)
TrigramsDense <- removeSparseTerms(Trigrams, 0.999)
QuadgramsDense <- removeSparseTerms(Quadgrams, 0.9999)
QuintgramsDense <- removeSparseTerms(Quintgrams, 0.9999)
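For comparison with the summary of Unigrams shown above, the pruned matrix can be printed the same way to confirm the reduction in terms (exact counts depend on the sample):
UnigramsDense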
Define a function to compute term frequencies and sort them in decreasing order:
freq_frame <- function(tdm){
freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
freq_frame <- data.frame(word=names(freq), freq=freq)
return(freq_frame)
}
Invoke the function for each of the n-grams
UnigramsDenseOrdered <- freq_frame(UnigramsDense)
BigramsDenseOrdered <- freq_frame(BigramsDense)
TrigramsDenseOrdered <- freq_frame(TrigramsDense)
QuadgramsDenseOrdered <- freq_frame(QuadgramsDense)
QuintgramsDenseOrdered <- freq_frame(QuintgramsDense)
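To sanity-check the frequency tables, the first few rows of one of them can be inspected (the actual words and counts depend on the sample):
head(BigramsDenseOrdered, 5)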
Define a function to generate a ggplot of the top terms by frequency:
library(ggplot2)
plotgrams <- function(data, title, num) {
top_grams<-data[1:num,]
top_grams$word<-as.character(top_grams$word)
ggplot(top_grams, aes(x=reorder(word, -freq),y=freq, label = word, fill = factor(word) )) +
geom_bar(stat="identity") +
ggtitle(paste(title, "- Top ", num)) +
xlab(title) + ylab("Frequency") +
theme(axis.text.x=element_text(angle=90, hjust=1)) +
theme(legend.position="none")
}
plotgrams(UnigramsDenseOrdered,"Unigrams",25)
plotgrams(BigramsDenseOrdered,"Bigrams",25)
plotgrams(TrigramsDenseOrdered,"Trigrams",25)
plotgrams(QuadgramsDenseOrdered,"Quadgrams",25)
plotgrams(QuintgramsDenseOrdered,"Quintgrams",25)
Create a word cloud of the top 100 most frequently used words:
wordcloud(corpus, max.words = 100, random.order = FALSE,rot.per=0.35, use.r.layout=FALSE,colors=brewer.pal(8, "Dark2"))
title("Wordcloud: 100 Most Frequently Used Words")
The plan is to create a model that predicts the next word given a preceding set of words in a sentence. The next steps are:
Create RDS files:
Handling the large files while building the term-document matrices was a significant challenge due to their size, limited memory and network bandwidth. The sample size needs to be fine-tuned so that an optimal amount of data is used without loss of quality.
Identify the right prediction model:
Currently no weights are assigned. A better model would assign weights using a backoff algorithm (a minimal sketch is shown below). Multiple approaches can be applied, and choosing the right model is critical for performance and prediction quality. In addition, a decision is needed on the n-gram order (the model may be limited to quadgrams or quintgrams).
Sample size:
Possibly implement other smoothing techniques. A large Linux machine and 64-bit R/RStudio may be needed to handle the file sizes.
Learn from similar applications and from publicly available information:
There are multiple applications, such as Google's autocomplete, that provide a reference design for the development. Complementing the n-gram model with other similar approaches (and datasets) should improve accuracy.
Develop a Shiny app with server and UI components that uses the n-gram functions to predict the next word.
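To illustrate the backoff idea, below is a minimal sketch that looks up the longest matching history in the n-gram frequency tables built above and falls back to shorter n-grams when no match is found. The helper predict_next_word and the simple first-match rule are assumptions for illustration only, not the final model.
# Minimal backoff-style lookup (sketch only; assumes the frequency tables above,
# with each n-gram stored as a single space-separated string)
predict_next_word <- function(phrase,
                              tables = list(QuadgramsDenseOrdered,
                                            TrigramsDenseOrdered,
                                            BigramsDenseOrdered)) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  for (tbl in tables) {
    # Infer the n-gram order from the first entry of the table
    n <- length(strsplit(as.character(tbl$word[1]), "\\s+")[[1]])
    if (length(words) < n - 1) next
    history <- paste(tail(words, n - 1), collapse = " ")
    # Tables are already sorted by frequency, so the first match is the best one
    hits <- tbl[startsWith(as.character(tbl$word), paste0(history, " ")), ]
    if (nrow(hits) > 0) {
      return(tail(strsplit(as.character(hits$word[1]), "\\s+")[[1]], 1))
    }
  }
  # Back off all the way to the single most frequent unigram
  as.character(UnigramsDenseOrdered$word[1])
}
predict_next_word("thanks for the")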