To build a Shiny app that predicts the next word, we first need to become familiar with the corpus, which comes from three domains: news articles, blogs, and Twitter posts. The goal of this report is to give a first look at the corpus and its properties, summarize the insights gained so far, and outline the next steps toward building the Shiny app.
Let us first look at a summary of the corpus built from the three documents, and of each document individually, after data pre-processing. A rough sketch of how such counts can be computed follows the table.
| Doc | Chars | Words |
|---|---|---|
| Blogs | 120789795 | 21970671 |
| News | 13617258 | 2271426 |
| Twitter | 160540139 | 29394601 |
| Corpus | 278091121 | 53062910 |
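As a rough sketch (not necessarily the exact code used to produce the table above), the per-document character and word counts can be computed along the following lines. The helper name count_doc is hypothetical; the file names are the ones read in the appendix.
#sketch: compute character and word counts for one document
count_doc<-function(path){
  lines<-readLines(path,skipNul=TRUE)
  c(chars=sum(nchar(lines)),
    words=sum(lengths(strsplit(trimws(lines),"\\s+"))))
}
#example calls
#count_doc("en_US.blogs.txt")
#count_doc("en_US.news.txt")
#count_doc("en_US.twitter.txt")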
This plot shows the top 20 unigrams, i.e., the single words (or letters) that occur most often in the corpus.
This plot shows the top 20 bigrams, i.e., the two-word combinations that occur most often in the corpus.
This plot shows the top 20 trigrams, i.e., the three-word combinations that occur most often in the corpus.
This plot shows the top 20 quadgrams, i.e., the four-word combinations that occur most often in the corpus.
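The plots above are built from the n-gram frequency tables created in the appendix (uniGrams, biGrams, triGrams, quadGrams). As a sketch of one way such a plot can be produced (the actual plotting code may differ; plot_top20 is a hypothetical helper name):
#sketch: plot the 20 most frequent n-grams from a term/freq data frame
plot_top20<-function(ngrams,main){
  top<-head(ngrams[order(-ngrams$freq),],20)
  barplot(top$freq,names.arg=as.character(top$term),las=2,
          cex.names=0.7,main=main,ylab="frequency")
}
#example call once the frequency tables from the appendix exist
#plot_top20(uniGrams,"Top 20 unigrams")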
We have gathered the text from all three documents and assembled it into a single corpus. The n-gram tables give us an idea of how likely different combinations of words are in the corpus, and they will be used by our text-prediction algorithm.
The prediction process will work as follows: the given input is split into tokens (words separated by spaces), and the most recent tokens are looked up in the highest-order n-gram frequency table. If that table cannot provide a prediction, the algorithm backs off to the next lower-order (n-1)-gram table, and if even the bigram table cannot predict the next word, the result will be NULL.
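A minimal sketch of this back-off lookup, assuming the quadGrams/triGrams/biGrams frequency tables built in the appendix; predict_next is a hypothetical helper name, not the final app code:
#sketch of the back-off lookup over the n-gram frequency tables
predict_next<-function(input,quadGrams,triGrams,biGrams){
  tokens<-unlist(strsplit(tolower(input),"\\s+"))
  if(length(tokens)==0) return(NULL)
  lookup<-function(ngrams,prefix){
    hits<-ngrams[grepl(paste0("^",prefix," "),ngrams$term),]
    if(nrow(hits)==0) return(NULL)
    best<-hits$term[which.max(hits$freq)]
    #return the last word of the best-matching n-gram
    tail(unlist(strsplit(as.character(best)," ")),1)
  }
  n<-length(tokens)
  if(n>=3){
    res<-lookup(quadGrams,paste(tail(tokens,3),collapse=" "))
    if(!is.null(res)) return(res)
  }
  if(n>=2){
    res<-lookup(triGrams,paste(tail(tokens,2),collapse=" "))
    if(!is.null(res)) return(res)
  }
  #NULL when even the bigram table has no match
  lookup(biGrams,tail(tokens,1))
}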
Since the n-gram tables are large, the Shiny app will have to balance prediction speed and accuracy against app size so that it is suitable for all kinds of environments; one possible pruning step is sketched at the end of the appendix.
#required libraries
library(tokenizers)
library(tm)
library(doSNOW)
#reading data
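#note (assumption): if readLines() warns about embedded nul characters
#(common for the twitter file), skipNul=TRUE can be added to the calls below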
blogs<-readLines("en_US.blogs.txt")
news<-readLines("en_US.news.txt")
twitter<-readLines("en_US.twitter.txt")
#converting all docs to a common encoding (latin1)
blogs<-iconv(blogs,to="latin1")
news<-iconv(news,to="latin1")
twitter<-iconv(twitter,to="latin1")
#collapsing all lines to form a single doc for each
blogs<-paste(blogs,collapse = " ")
news<-paste(news,collapse = " ")
twitter<-paste(twitter,collapse = " ")
#cleaning all docs
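#temporarily replacing "'t" and "'s" with placeholder strings so that
#contractions (don't, it's, etc.) survive the removal of non-letter characters below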
blogs<-gsub("'t","thisismyownapostrophetextfort",blogs)
news<-gsub("'t","thisismyownapostrophetextfort",news)
twitter<-gsub("'t","thisismyownapostrophetextfort",twitter)
blogs<-gsub("'s","thisismyownapostrophetextfors",blogs)
news<-gsub("'s","thisismyownapostrophetextfors",news)
twitter<-gsub("'s","thisismyownapostrophetextfors",twitter)
blogs<-gsub("[^A-Za-z]"," ",blogs)
news<-gsub("[^A-Za-z]"," ",news)
twitter<-gsub("[^A-Za-z]"," ",twitter)
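#restoring the protected "'s" and "'t" contractions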
blogs<-gsub("thisismyownapostrophetextfors","'s",blogs)
news<-gsub("thisismyownapostrophetextfors","'s",news)
twitter<-gsub("thisismyownapostrophetextfors","'s",twitter)
blogs<-gsub("thisismyownapostrophetextfort","'t",blogs)
news<-gsub("thisismyownapostrophetextfort","'t",news)
twitter<-gsub("thisismyownapostrophetextfort","'t",twitter)
#lower casing all vectors
blogs<-tolower(blogs)
news<-tolower(news)
twitter<-tolower(twitter)
#restoring "i m" (produced from "i'm" when the apostrophe was removed) to "i am"
blogs<-gsub(" i m "," i am ",blogs)
news<-gsub(" i m "," i am ",news)
twitter<-gsub(" i m "," i am ",twitter)
#using parallel computation
cl<-makeCluster(3)
registerDoSNOW(cl)
#removing profanity from the corpus using the following bad-words list
#https://www.cs.cmu.edu/~biglou/resources/bad-words.txt
badwords<-readLines("badwords.txt")
corpus<-paste(foreach(n=c(blogs,news,twitter)) %dopar%
tm::removeWords(n,badwords),collapse=" ")
stopCluster(cl)
#removing extra white space
corpus<-stripWhitespace(corpus)
#creating data frame for unigrams
uniGrams<-as.data.frame(table(tokenize_ngrams(corpus,n=1)[[1]]))
names(uniGrams)<-c("term","freq")
#creating data frame for bigrams
biGrams<-as.data.frame(table(tokenize_ngrams(corpus,n=2)[[1]]))
names(biGrams)<-c("term","freq")
#creating data frame for trigrams
triGrams<-as.data.frame(table(tokenize_ngrams(corpus,n=3)[[1]]))
names(triGrams)<-c("term","freq")
#creating data frame for quadgrams
x<-Sys.time()
quadGrams<-as.data.frame(table(tokenize_ngrams(corpus,n=4)[[1]]))
names(quadGrams)<-c("term","freq")
Sys.time()-x
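To address the speed and size concerns noted earlier, one possible next step (an assumption about future work, not something already implemented) is to prune rare n-grams and save the tables for use in the Shiny app; the cutoff of 1 and the file name "ngrams.rds" are placeholders.
#sketch: drop very rare n-grams to shrink the tables, then save them for the Shiny app
quadGrams<-quadGrams[quadGrams$freq>1,]
triGrams<-triGrams[triGrams$freq>1,]
biGrams<-biGrams[biGrams$freq>1,]
saveRDS(list(uni=uniGrams,bi=biGrams,tri=triGrams,quad=quadGrams),"ngrams.rds")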