Sada
22/07/2015
The goal of the application is to predict the next word given the previous word(s)
The approach
a) Use the supplied blog, Twitter and news feed data as the input bag of words. Cleanse the text and build unigram, bigram and trigram frequency matrices
b) For the input text, choose the top 5 most probable terms from the bigram and trigram models. Combine the probabilities of the words suggested by the bigram and trigram models using a weighted sum. Compare the weighted-sum probabilities of the top 5 predictions and choose the highest one
The data from the text files is read into text buffers. The text is converted into a Corpus data structure
The cleansing is performed through the following steps
a) Remove punctuation, non-English characters and numbers
b) Replace runs of whitespace, including tabs, with a single space character
c) Convert all text into lower case letters
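The three cleansing steps above can be sketched in a few lines. This is only an illustration in Python, not the code behind the app: the function name `clean_text` is hypothetical, and it assumes that "non-English characters" means anything outside the ASCII letters and that removed characters are deleted rather than replaced by a space.

```python
import re


def clean_text(text: str) -> str:
    """Apply the three cleansing steps in order (illustrative sketch)."""
    # a) Keep only ASCII letters and whitespace. This drops punctuation,
    #    digits and non-English characters in a single pass (assumption:
    #    removed characters are deleted, not replaced by a space).
    text = re.sub(r"[^A-Za-z\s]", "", text)
    # b) Collapse runs of whitespace (spaces, tabs, newlines) into one space.
    #    Trimming the ends is an extra convenience, not a step from the slides.
    text = re.sub(r"\s+", " ", text).strip()
    # c) Convert to lower case.
    return text.lower()


cleaned = clean_text("Hello,\t World!! 123")
```

The same function would be applied both to the corpus and to the user's input so that the two are tokenized consistently.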
Build the n-gram models using the tokenizer in the Weka package. The space character is used for splitting the text, and a matrix of terms versus documents is built. This matrix is used to compute the frequency of each term across all the documents. This frequency is then used to compute the probability of each term appearing in the corpus.
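As a rough illustration of the counting step, the sketch below uses Python's `collections.Counter` as a stand-in for the Weka tokenizer and the term-document matrix. The helper names `ngram_counts` and `ngram_probabilities` are hypothetical, and the probability is assumed to be the relative frequency of a term among all n-grams of the same order, which is one reading of the description above.

```python
from collections import Counter


def ngram_counts(documents, n):
    """Count every n-gram across all documents, splitting on spaces."""
    counts = Counter()
    for doc in documents:
        # Split on the space character and skip empty tokens.
        tokens = [t for t in doc.split(" ") if t]
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def ngram_probabilities(counts):
    """Turn corpus-wide counts into relative frequencies (assumed definition)."""
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


docs = ["the cat sat", "the cat ran"]  # toy, already-cleansed documents
bigrams = ngram_counts(docs, 2)
bigram_probs = ngram_probabilities(bigrams)
```

Here "the cat" occurs twice among four bigrams, so its relative frequency is one half.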
Gather the user's input text and apply the same text cleansing rules that were applied to the corpus
Pick the last 2 words from the input. Select the bigrams that start with the last word and the trigrams that start with the last 2 words. Pick the top 5 from each of the 2 data sets based on probability
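The lookup can be pictured as a prefix match against the two probability tables. The sketch below is an assumption about how such a lookup might work, not the app's code: the tables are plain dictionaries keyed by space-joined n-grams, and `top_candidates` and `candidates_for_input` are hypothetical names.

```python
def top_candidates(probabilities, prefix, k=5):
    """Return up to k (next_word, probability) pairs for n-grams that start with prefix."""
    head = prefix + " "
    matches = []
    for term, prob in probabilities.items():
        if term.startswith(head):
            # What is left after the prefix is the predicted word.
            matches.append((term[len(head):], prob))
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return matches[:k]


def candidates_for_input(cleaned_input, bigram_probs, trigram_probs, k=5):
    """Use the last word for the bigram lookup and the last two for the trigram lookup."""
    words = cleaned_input.split()
    from_bigrams = top_candidates(bigram_probs, words[-1], k) if words else []
    from_trigrams = (
        top_candidates(trigram_probs, " ".join(words[-2:]), k)
        if len(words) >= 2
        else []
    )
    return from_bigrams, from_trigrams


# Toy tables with made-up probabilities, for illustration only.
toy_bigrams = {"the cat": 0.5, "cat sat": 0.3, "cat ran": 0.2}
toy_trigrams = {"the cat sat": 0.6, "the cat ran": 0.4}
from_bigrams, from_trigrams = candidates_for_input("see the cat", toy_bigrams, toy_trigrams)
```

A linear scan like this is simple but slow on a large table; an index keyed by prefix would be the obvious refinement.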
If both the bigram and trigram sets contain the same prediction, combine their probabilities using weights.
Weights are computed as follows
a) Say bigrams "A P1" and "A P2" have frequencies 2 and 3 and probabilities Pr1 and Pr2, and trigrams "A B P2" and "A B P3" have frequencies 4 and 5 and probabilities Pr3 and Pr4 respectively.
b) the total frequency within the subset for predicted words P1, P2 and P3 is (2+3) + (4+5) = 14
c) Combined probability for P1 = (2 * Pr1)/14
Combined probability for P2 = (3 * Pr2)/14 + (4 * Pr3)/14
Combined probability for P3 = (5 * Pr4)/14
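The weighting can be checked with a small worked example. The sketch below follows the scheme as described: each candidate's probability is weighted by its frequency divided by the total frequency of all candidates, and contributions for the same predicted word are added. The frequencies 2, 3, 4 and 5 come from the example above; the values used for Pr1 to Pr4 are made up purely for illustration, and the function name `combine` is hypothetical.

```python
from collections import defaultdict


def combine(bigram_hits, trigram_hits):
    """Each hit is (predicted_word, frequency, probability)."""
    hits = list(bigram_hits) + list(trigram_hits)
    total = sum(freq for _, freq, _ in hits)  # (2+3) + (4+5) = 14 in the example
    if total == 0:
        return {}
    combined = defaultdict(float)
    for word, freq, prob in hits:
        # Weight each probability by its share of the total frequency and
        # add up the contributions for the same predicted word.
        combined[word] += freq * prob / total
    return dict(combined)


# Frequencies from the example; the probabilities are hypothetical.
bigram_hits = [("P1", 2, 0.4), ("P2", 3, 0.6)]   # Pr1, Pr2
trigram_hits = [("P2", 4, 0.5), ("P3", 5, 0.5)]  # Pr3, Pr4
scores = combine(bigram_hits, trigram_hits)
best = max(scores, key=scores.get)
```

With these made-up values P2 receives contributions from both models and ends up with the highest combined score, so it would be the word shown to the user.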
The application is easy to use and available to all on the web.
Link is https://sadashivab.shinyapps.io/PredictWord
Note that the app runs more slowly on Shinyapps, so please bear with it for a few seconds in certain instances
Improve performance by preselecting the top predictions for each word
Add further modelling, such as clustering and stemming, to predict words that are not directly linked to the preceding word