The goal of this project is to design an application, built on predictive text modeling, that suggests the next word based on the previously typed one, two, or three words. This can greatly help users type quickly and correctly, and ultimately save a lot of time.
Three English-language datasets are provided for exploration and training: en_US.twitter.txt, en_US.blogs.txt and en_US.news.txt. This report presents the exploratory data analysis and the plan for the prediction algorithm and application design.
library(tm)
## Loading required package: NLP
library(RWeka)
library(slam)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(wordcloud)
## Loading required package: RColorBrewer
# Read each source file, closing its connection before opening the next
con<-file("en_US.news.txt")
news_eng<-readLines(con)
close(con)
con<-file("en_US.blogs.txt")
blogs_eng<-readLines(con)
close(con)
con<-file("en_US.twitter.txt")
twitter_eng<-readLines(con)
close(con)
Sourcedetails<-data.frame(Source_name = c("News","Blogs","Twitter"),
Text_Lines=c(length(news_eng),length(blogs_eng),length(twitter_eng)))
Sourcedetails
##   Source_name Text_Lines
## 1        News    1010242
## 2       Blogs     899288
## 3     Twitter    2360148
For faster execution and better performance, we take a sample of 5,000 text lines from each data source and combine them into one corpus, i.e. a collection of text documents.
sample_news<-sample(news_eng,5000)
sample_blogs<-sample(blogs_eng,5000)
sample_twitter<-sample(twitter_eng,5000)
# Combine the three samples into a single character vector before building the corpus
allsources <- c(sample_news, sample_twitter, sample_blogs)
sample_comb<-VCorpus(VectorSource(allsources))
In this step, we remove words and characters that are of little use for prediction, and then create a term-document matrix to analyze the most frequent words in the corpus.
sample_comb<-tm_map(sample_comb,removePunctuation)
sample_comb<-tm_map(sample_comb,removeNumbers)
sample_comb<-tm_map(sample_comb,content_transformer(tolower))  # wrap base tolower so the corpus structure is preserved
sample_comb<-tm_map(sample_comb,stripWhitespace)
sample_comb<-tm_map(sample_comb,removeWords,stopwords("english"))
sample_comb_tdm<-TermDocumentMatrix(sample_comb)
wordsfreq<-row_sums(sample_comb_tdm,na.rm = T)   # total count of each term across all documents
wordsfreq_df <- data.frame(word=names(wordsfreq), freq=wordsfreq)
wordsfreq_df<-wordsfreq_df[order(wordsfreq_df$freq,decreasing = TRUE),]
# Order the factor levels by frequency so ggplot draws the bars in that order
wordsfreq_df$word <-factor(wordsfreq_df$word, levels=wordsfreq_df[order(wordsfreq_df$freq), "word"])
Here we look at a bar plot of the top 20 words and a word cloud of the top 100 words. This helps us analyze the most frequently occurring words.
p <- ggplot(wordsfreq_df[1:20,], aes(word, freq,fill = freq))
p <- p + geom_bar(stat="identity")
p <- p + theme_bw()
p <- p + xlab("Words")
p <- p + ylab("frequency")
p <- p + ggtitle("Single Word Occurrence")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1,size = 15))
p
wordcloud(names(wordsfreq), wordsfreq, max.words=100, rot.per=0.2, colors=rainbow(80))
The next step is to analyze the bigram, trigram and quadgram models; this shows which words are paired together most frequently.
# Collapse the cleaned corpus back into a data frame of plain text lines
sample_comb_df <-data.frame(text=unlist(sapply(sample_comb, `[`, "content")), stringsAsFactors=F)
# Tokenize a character vector into n-grams of the given order and return
# a frequency table sorted by count
myngram<-function(textvec,gram) {
temp <- NGramTokenizer(textvec,
Weka_control(min = gram, max = gram,
delimiters = " \\r\\n\\t.,;:\"()?!"))
tempdf<-data.frame(table(temp))
tempdf<-tempdf[order(tempdf$Freq,decreasing = TRUE),]
rownames(tempdf)<-NULL
colnames(tempdf)<-c("Words","Counts")
return(tempdf)
}
bigram<-myngram(sample_comb_df$text,2)
trigram<-myngram(sample_comb_df$text,3)
quadgram<-myngram(sample_comb_df$text,4)
bigram$Words <-factor(bigram$Words, levels=bigram[order(bigram$Counts), "Words"])
p <- ggplot(bigram[1:20,], aes(Words, Counts,fill = Counts))
p <- p + geom_bar(stat="identity")
p <- p + theme_bw()
p <- p + xlab("Words")
p <- p + ylab("Counts")
p <- p + ggtitle("Bigram Model - Top 20")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1,size = 15))
p
trigram$Words <-factor(trigram$Words, levels=trigram[order(trigram$Counts), "Words"])
p <- ggplot(trigram[1:20,], aes(Words, Counts,fill = Counts))
p <- p + geom_bar(stat="identity")
p <- p + theme_bw()
p <- p + xlab("Words")
p <- p + ylab("Counts")
p <- p + ggtitle("Trigram Model - Top 20")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1,size = 15))
p
quadgram$Words <-factor(quadgram$Words, levels=quadgram[order(quadgram$Counts), "Words"])
p <- ggplot(quadgram[1:20,], aes(Words, Counts,fill = Counts))
p <- p + geom_bar(stat="identity")
p <- p + theme_bw()
p <- p + xlab("Words")
p <- p + ylab("Counts")
p <- p + ggtitle("Quadgram Model - Top 20")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1,size = 15))
p
Each n-gram model provides a different set of frequent words and phrases.
After analyzing these word pairs, a mixed-model technique appears to be the best fit. The plan for the prediction algorithm and the application is as follows (illustrative sketches follow this list):
Partition the available data into a training set, a hold-out set for tuning the model, and a test set for final model evaluation.
Build the mixed model by interpolating the n-gram models with assigned weights (see the predictor sketch below).
Apply Good-Turing smoothing and Kneser-Ney smoothing for better results and to handle words unseen in the training data (see the Good-Turing sketch below).
Use the word history for next-word prediction, taking the continuation probability P(continuation) of candidate words into account (a rough estimate is sketched below).
Work on model performance in terms of faster execution and lower memory usage.
Build a Shiny app that quickly suggests three possible next words as soon as the previous words are typed, with an interactive and user-friendly interface (a minimal sketch closes this report).
Tune the Shiny app for optimal performance with minimal resource usage.
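As an illustration of the planned mixed-model approach, below is a minimal sketch of an interpolated back-off predictor built on the bigram, trigram and quadgram tables from this exploratory analysis. The helper name predict_next_word, the interpolation weights, and the use of these EDA tables in place of proper training-set counts are all assumptions for the sketch, not the final implementation.
predict_next_word <- function(phrase, n = 3,
                              weights = c(quad = 0.5, tri = 0.3, bi = 0.2)) {  # assumed weights
  tokens <- unlist(strsplit(tolower(phrase), "\\s+"))
  tokens <- tokens[tokens != ""]
  # Score candidate next words from one n-gram table for a given history
  score_from <- function(tbl, history, w) {
    if (length(history) == 0) return(NULL)
    prefix  <- paste0("^", paste(history, collapse = " "), " ")
    matches <- tbl[grepl(prefix, tbl$Words), , drop = FALSE]
    if (nrow(matches) == 0) return(NULL)
    data.frame(word  = sub(prefix, "", as.character(matches$Words)),
               score = w * matches$Counts / sum(matches$Counts),
               stringsAsFactors = FALSE)
  }
  last_k <- function(k) if (length(tokens) >= k) tail(tokens, k) else character(0)
  # Interpolate the three n-gram models by summing their weighted scores
  cand <- rbind(score_from(quadgram, last_k(3), weights["quad"]),
                score_from(trigram,  last_k(2), weights["tri"]),
                score_from(bigram,   last_k(1), weights["bi"]))
  if (is.null(cand)) {
    # Back off to the most frequent single words when the history is unseen
    return(head(as.character(wordsfreq_df$word), n))
  }
  scores <- tapply(cand$score, cand$word, sum)
  names(sort(scores, decreasing = TRUE))[seq_len(min(n, length(scores)))]
}
predict_next_word("new york")   # suggestions depend on the sampled data
In the final model the n-gram tables would be built from the training partition (likely with stop words retained), and the weights tuned on the hold-out set.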
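To handle n-grams that rarely or never occur in the training data, Good-Turing smoothing replaces each raw count c with c* = (c + 1) * N(c+1) / N(c), where N(c) is the number of n-grams observed exactly c times. The sketch below applies this adjustment to the bigram counts; good_turing_counts is an assumed helper name, and a production version would smooth the N(c) values before applying the formula.
good_turing_counts <- function(counts) {
  freq_of_freq <- table(counts)              # N(c): number of n-grams seen exactly c times
  adjusted <- as.numeric(counts)             # counts with no N(c+1) available stay unchanged
  for (k in as.integer(names(freq_of_freq))) {
    n_k  <- as.numeric(freq_of_freq[as.character(k)])
    n_k1 <- as.numeric(freq_of_freq[as.character(k + 1)])
    if (!is.na(n_k1)) {
      adjusted[counts == k] <- (k + 1) * n_k1 / n_k   # c* = (c + 1) * N(c+1) / N(c)
    }
  }
  adjusted
}
bigram$GTCounts <- good_turing_counts(bigram$Counts)
sum(bigram$Counts == 1) / sum(bigram$Counts)   # probability mass reserved for unseen bigrams: N(1) / N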
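For Kneser-Ney style prediction, the continuation probability of a word reflects how many distinct histories it completes rather than how often it occurs overall. A rough estimate from the bigram table of the exploratory sample (the final model would compute this on the training set and combine it with the discounted higher-order estimates):
second_word <- sub("^\\S+\\s+", "", as.character(bigram$Words))   # last word of each distinct bigram
p_continuation <- table(second_word) / nrow(bigram)               # distinct histories completed / all bigram types
head(sort(p_continuation, decreasing = TRUE), 10)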
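Finally, a minimal sketch of the planned Shiny interface, assuming the predict_next_word helper sketched above; the layout is an assumption, and the real app would focus on responsiveness and low resource usage.
library(shiny)
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:", value = ""),
  h4("Suggested next words:"),
  textOutput("suggestions")
)
server <- function(input, output) {
  output$suggestions <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    paste(predict_next_word(input$phrase, n = 3), collapse = " | ")
  })
}
shinyApp(ui, server)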