This is the supporting pitch for our Shiny application, which is deployed on RStudio's Shiny server. The app, “Next word prediction”, predicts the next word of an input phrase using n-gram analysis.
A subsample of the combined training dataset is used to build the model.
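The subsampling helper itself is not part of this pitch. As a rough illustration only, a function like `sampleText` could look like the sketch below. It assumes that `sample.size` is a fraction between 0 and 1 and that lines are drawn without replacement; both are assumptions, and the `seed` argument is added here purely for reproducibility.

```r
# Hypothetical stand-in for sampleText(); the app's real definition is not shown.
# Assumption: sample.size is the fraction of lines to keep (0 < sample.size <= 1).
sampleText <- function(lines, sample.size, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)   # optional, for reproducible subsamples
  n <- max(1, round(length(lines) * sample.size))
  # Draw n distinct line indices and return the corresponding lines.
  lines[sample.int(length(lines), n)]
}

# Toy usage: keep 10% of 100 made-up lines.
toy_lines <- paste("line", 1:100)
subsample <- sampleText(toy_lines, 0.1, seed = 1234)
```

Sampling whole lines keeps each sentence intact, which matters later because n-grams are built from consecutive words.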
The training data is cleaned using several built-in functions of the tm and RWeka libraries: numbers, punctuation, non-English text and profanity are removed.
The data is tokenized using the ngram library, and a term-document matrix is computed that lists each term and its frequency of occurrence.
library(tm)    # VCorpus, tm_map and the text transformations
library(qdap)  # bracketX
# Draw a subsample of the blogs data (repeated for the news and Twitter data)
en_blogs <- sampleText(en_b_tot, sample.size)
# Combine the three subsamples
comb_data <- c(en_blogs, en_news, en_twitter)
# Build a volatile (in-memory) corpus
Corp <- VCorpus(VectorSource(list(comb_data)))
# Clean the corpus step by step
Corp <- tm_map(Corp, content_transformer(tolower))
Corp <- tm_map(Corp, removeNumbers)
Corp <- tm_map(Corp, removePunctuation)
Corp <- tm_map(Corp, removeWords, profanityList)
Corp <- tm_map(Corp, content_transformer(bracketX))
Corp <- tm_map(Corp, stripWhitespace)
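The tokenization and frequency-count step is described in the text but not shown in code. The sketch below illustrates the idea in base R only; it is not the app's implementation, which relies on the libraries named earlier. The helper `ngram_counts` and the toy input are hypothetical.

```r
# Hypothetical helper: count n-grams of order n in a character vector of
# cleaned lines and return them sorted by descending frequency.
ngram_counts <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, "\\s+"), function(words) {
    words <- words[nzchar(words)]          # drop empty tokens
    if (length(words) < n) return(character(0))
    # Paste each window of n consecutive words into one n-gram string.
    vapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(ngram = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy usage on two made-up lines.
toy <- c("the cat sat on the mat", "the cat ran")
bigrams <- ngram_counts(toy, 2)
```

N-grams are built within each line, so no n-gram spans the boundary between two unrelated sentences.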
This approach selects the next word with the maximum frequency.
The search starts with the largest n-gram model and tries to match the end of the input phrase; if no match is found, it backs off iteratively to smaller n, down to the unigram model.
N-grams are sorted by descending frequency, so the iterative search visits candidates in that order.
If more than one candidate has the same frequency, the order of those candidates is chosen randomly.
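The back-off search can be sketched as follows. This is a minimal illustration under stated assumptions, not the app's code: the function `predict_next`, the layout of `tables` (one data frame per n-gram order with hypothetical columns `prefix`, `word` and `freq`) and the toy counts are all made up for the example.

```r
# Hypothetical back-off lookup. tables[[n]] holds the n-gram table of order n:
#   prefix = the first n - 1 words ("" for unigrams),
#   word   = the last word, freq = its count.
predict_next <- function(phrase, tables) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  for (n in rev(seq_along(tables))) {      # largest n first
    k <- n - 1                             # context words needed for order n
    if (length(words) < k) next
    prefix <- paste(tail(words, k), collapse = " ")
    hits <- tables[[n]][tables[[n]]$prefix == prefix, , drop = FALSE]
    if (nrow(hits) > 0) {
      best <- hits$word[hits$freq == max(hits$freq)]
      # Break ties between equally frequent candidates at random.
      return(best[sample.int(length(best), 1)])
    }
  }
  NA_character_                            # no table produced a candidate
}

# Toy tables (made-up counts) for unigrams, bigrams and trigrams.
tables <- list(
  data.frame(prefix = "", word = c("the", "cat"), freq = c(5L, 3L),
             stringsAsFactors = FALSE),
  data.frame(prefix = c("the", "the", "cat"), word = c("cat", "mat", "sat"),
             freq = c(2L, 1L, 1L), stringsAsFactors = FALSE),
  data.frame(prefix = "the cat", word = "sat", freq = 1L,
             stringsAsFactors = FALSE)
)

predict_next("I saw the cat", tables)   # matched by the trigram table
predict_next("feed the", tables)        # backs off to the bigram table
predict_next("hello world", tables)     # backs off to the unigram table
```

Because an empty prefix matches every unigram row, the unigram table acts as the final fallback and returns the most frequent word overall.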