The goal of this project is to design an application, built on predictive text modeling, that suggests the next word based on the previously typed one, two, or three words. This can greatly help users type quickly and correctly, and ultimately save a lot of time.
Three English-language datasets are provided for exploration and training: en_US.twitter.txt, en_US.blogs.txt and en_US.news.txt. This report presents the exploratory data analysis and the plan for the prediction algorithm and application design.
library(tm)
## Loading required package: NLP
library(RWeka)
library(slam)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(wordcloud)
## Loading required package: RColorBrewer
# Read each source file, closing its connection before opening the next
con<-file("en_US.news.txt")
news_eng<-readLines(con)
close(con)
con<-file("en_US.blogs.txt")
blogs_eng<-readLines(con)
close(con)
con<-file("en_US.twitter.txt")
twitter_eng<-readLines(con)
close(con)
Sourcedetails<-data.frame(Source_name = c("News","Blogs","Twitter"),
Text_Lines=c(length(news_eng),length(blogs_eng),length(twitter_eng)))
Sourcedetails
##   Source_name Text_Lines
## 1        News    1010242
## 2       Blogs     899288
## 3     Twitter    2360148
For faster execution and better performance, we take a sample of 5,000 text lines from each data source and combine them into one corpus, i.e. a collection of text documents.
sample_news<-sample(news_eng,5000)
sample_blogs<-sample(blogs_eng,5000)
sample_twitter<-sample(twitter_eng,5000)
# Combine the three samples into a single character vector before building the corpus
allsources <- c(sample_news, sample_twitter, sample_blogs)
sample_comb<-VCorpus(VectorSource(allsources))
In this step, we remove words and characters that are of little use for prediction, and then create a term-document matrix to analyze the most frequent words in the corpus.
sample_comb<-tm_map(sample_comb,removePunctuation)
sample_comb<-tm_map(sample_comb,removeNumbers)
sample_comb<-tm_map(sample_comb,content_transformer(tolower))  # wrap base tolower so the corpus structure is preserved
sample_comb<-tm_map(sample_comb,stripWhitespace)
sample_comb<-tm_map(sample_comb,removeWords,stopwords("english"))
sample_comb_tdm<-TermDocumentMatrix(sample_comb)
wordsfreq<-row_sums(sample_comb_tdm,na.rm = T)   # total count of each term across all documents
wordsfreq_df <- data.frame(word=names(wordsfreq), freq=wordsfreq)
wordsfreq_df<-wordsfreq_df[order(wordsfreq_df$freq,decreasing = TRUE),]
# Order the factor levels by frequency so ggplot draws the bars in that order
wordsfreq_df$word <-factor(wordsfreq_df$word, levels=wordsfreq_df[order(wordsfreq_df$freq), "word"])
Here we look at a bar plot of the top 20 words and a word cloud of the top 100 words. This helps us analyze the most frequently occurring words.
p <- ggplot(wordsfreq_df[1:20,], aes(word, freq,fill = freq))
p <- p + geom_bar(stat="identity")
p <- p + theme_bw()
p <- p + xlab("Words")
p <- p + ylab("frequency")
p <- p + ggtitle("Single Word Occurrence")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1,size = 15))
p
wordcloud(names(wordsfreq), wordsfreq, max.words=100, rot.per=0.2, colors=rainbow(80))
The next step is to analyze the bigram, trigram and quadgram models; this shows which words are paired together most frequently.
# Collapse the cleaned corpus back into a data frame of plain text lines
sample_comb_df <-data.frame(text=unlist(sapply(sample_comb, `[`, "content")), stringsAsFactors=F)
# Tokenize a character vector into n-grams of the given order and return
# a frequency table sorted by count
myngram<-function(textvec,gram) {
temp <- NGramTokenizer(textvec,
Weka_control(min = gram, max = gram,
delimiters = " \\r\\n\\t.,;:\"()?!"))
tempdf<-data.frame(table(temp))
tempdf<-tempdf[order(tempdf$Freq,decreasing = TRUE),]
rownames(tempdf)<-NULL
colnames(tempdf)<-c("Words","Counts")
return(tempdf)
}
bigram<-myngram(sample_comb_df$text,2)
trigram<-myngram(sample_comb_df$text,3)
quadgram<-myngram(sample_comb_df$text,4)
bigram$Words <-factor(bigram$Words, levels=bigram[order(bigram$Counts), "Words"])
p <- ggplot(bigram[1:20,], aes(Words, Counts,fill = Counts))
p <- p + geom_bar(stat="identity")
p <- p + theme_bw()
p <- p + xlab("Words")
p <- p + ylab("Counts")
p <- p + ggtitle("Bigram Model - Top 20")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1,size = 15))
p
trigram$Words <-factor(trigram$Words, levels=trigram[order(trigram$Counts), "Words"])
p <- ggplot(trigram[1:20,], aes(Words, Counts,fill = Counts))
p <- p + geom_bar(stat="identity")
p <- p + theme_bw()
p <- p + xlab("Words")
p <- p + ylab("Counts")
p <- p + ggtitle("Trigram Model - Top 20")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1,size = 15))
p
quadgram$Words <-factor(quadgram$Words, levels=quadgram[order(quadgram$Counts), "Words"])
p <- ggplot(quadgram[1:20,], aes(Words, Counts,fill = Counts))
p <- p + geom_bar(stat="identity")
p <- p + theme_bw()
p <- p + xlab("Words")
p <- p + ylab("Counts")
p <- p + ggtitle("Quadgram Model - Top 20")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1,size = 15))
p
Each n-gram model provides a different set of frequent words and phrases.
After analyzing these word pairs, a mixed-model technique appears to be the best fit. The plan for the prediction algorithm and the application is as follows (illustrative sketches follow this list):
Partition the available data into a training set, a hold-out set for tuning the model, and a test set for final model evaluation.
Build the mixed model by interpolating the n-gram models with assigned weights (see the predictor sketch below).
Apply Good-Turing smoothing and Kneser-Ney smoothing for better results and to handle words unseen in the training data (see the Good-Turing sketch below).
Use the word history for next-word prediction, taking the continuation probability P(continuation) of candidate words into account (a rough estimate is sketched below).
Work on model performance in terms of faster execution and lower memory usage.
Build a Shiny app that quickly suggests three possible next words as soon as the previous words are typed, with an interactive and user-friendly interface (a minimal sketch closes this report).
Tune the Shiny app for optimal performance with minimal resource usage.
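As an illustration of the planned mixed-model approach, below is a minimal sketch of an interpolated back-off predictor built on the bigram, trigram and quadgram tables from this exploratory analysis. The helper name predict_next_word, the interpolation weights, and the use of these EDA tables in place of proper training-set counts are all assumptions for the sketch, not the final implementation.
predict_next_word <- function(phrase, n = 3,
                              weights = c(quad = 0.5, tri = 0.3, bi = 0.2)) {  # assumed weights
  tokens <- unlist(strsplit(tolower(phrase), "\\s+"))
  tokens <- tokens[tokens != ""]
  # Score candidate next words from one n-gram table for a given history
  score_from <- function(tbl, history, w) {
    if (length(history) == 0) return(NULL)
    prefix  <- paste0("^", paste(history, collapse = " "), " ")
    matches <- tbl[grepl(prefix, tbl$Words), , drop = FALSE]
    if (nrow(matches) == 0) return(NULL)
    data.frame(word  = sub(prefix, "", as.character(matches$Words)),
               score = w * matches$Counts / sum(matches$Counts),
               stringsAsFactors = FALSE)
  }
  last_k <- function(k) if (length(tokens) >= k) tail(tokens, k) else character(0)
  # Interpolate the three n-gram models by summing their weighted scores
  cand <- rbind(score_from(quadgram, last_k(3), weights["quad"]),
                score_from(trigram,  last_k(2), weights["tri"]),
                score_from(bigram,   last_k(1), weights["bi"]))
  if (is.null(cand)) {
    # Back off to the most frequent single words when the history is unseen
    return(head(as.character(wordsfreq_df$word), n))
  }
  scores <- tapply(cand$score, cand$word, sum)
  names(sort(scores, decreasing = TRUE))[seq_len(min(n, length(scores)))]
}
predict_next_word("new york")   # suggestions depend on the sampled data
In the final model the n-gram tables would be built from the training partition (likely with stop words retained), and the weights tuned on the hold-out set.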
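To handle n-grams that rarely or never occur in the training data, Good-Turing smoothing replaces each raw count c with c* = (c + 1) * N(c+1) / N(c), where N(c) is the number of n-grams observed exactly c times. The sketch below applies this adjustment to the bigram counts; good_turing_counts is an assumed helper name, and a production version would smooth the N(c) values before applying the formula.
good_turing_counts <- function(counts) {
  freq_of_freq <- table(counts)              # N(c): number of n-grams seen exactly c times
  adjusted <- as.numeric(counts)             # counts with no N(c+1) available stay unchanged
  for (k in as.integer(names(freq_of_freq))) {
    n_k  <- as.numeric(freq_of_freq[as.character(k)])
    n_k1 <- as.numeric(freq_of_freq[as.character(k + 1)])
    if (!is.na(n_k1)) {
      adjusted[counts == k] <- (k + 1) * n_k1 / n_k   # c* = (c + 1) * N(c+1) / N(c)
    }
  }
  adjusted
}
bigram$GTCounts <- good_turing_counts(bigram$Counts)
sum(bigram$Counts == 1) / sum(bigram$Counts)   # probability mass reserved for unseen bigrams: N(1) / N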
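For Kneser-Ney style prediction, the continuation probability of a word reflects how many distinct histories it completes rather than how often it occurs overall. A rough estimate from the bigram table of the exploratory sample (the final model would compute this on the training set and combine it with the discounted higher-order estimates):
second_word <- sub("^\\S+\\s+", "", as.character(bigram$Words))   # last word of each distinct bigram
p_continuation <- table(second_word) / nrow(bigram)               # distinct histories completed / all bigram types
head(sort(p_continuation, decreasing = TRUE), 10)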
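Finally, a minimal sketch of the planned Shiny interface, assuming the predict_next_word helper sketched above; the layout is an assumption, and the real app would focus on responsiveness and low resource usage.
library(shiny)
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:", value = ""),
  h4("Suggested next words:"),
  textOutput("suggestions")
)
server <- function(input, output) {
  output$suggestions <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    paste(predict_next_word(input$phrase, n = 3), collapse = " | ")
  })
}
shinyApp(ui, server)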