Lynna Jirpongopas
Sat Apr 25 09:38:53 2015
These are the general steps taken to predict the next word:
A 5% sample of the Twitter data and a 1% sample of the news data were used to build the model
The model takes the user's input text and counts the number of words in it; let's call this “n”
The sampled data is then tokenized into (n+1)-grams
# Keep the (n+1)-grams that contain the input text as a whole-word, case-insensitive match
linesWithTheText <- tokenizedData[grepl(paste0("\\<", inputText, "\\>"), text, ignore.case = TRUE)]
Once matches are found, the model selects the ones that appeared most frequently in the sampled data
If there are ties, the model randomly selects one of them!
Stop words were not excluded. They are good indicators for predicting the next word in common phrases!
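The steps listed above can be sketched in base R roughly as follows. This is a minimal illustration, not the code behind the app: the function names `make_ngrams` and `predict_next_word`, the simple letters-and-apostrophes tokenizer, and the tiny `sampled` vector are all hypothetical. It also assumes that only (n+1)-grams beginning with the input text count as matches, whereas the `grepl` line above matches the input anywhere in the n-gram.

```r
# Hypothetical helper: lowercase each line, split it into words,
# and return every run of k consecutive words as one string.
make_ngrams <- function(lines, k) {
  grams <- lapply(lines, function(line) {
    words <- strsplit(tolower(line), "[^a-z']+")[[1]]
    words <- words[nzchar(words)]              # drop empty tokens
    if (length(words) < k) return(character(0))
    starts <- seq_len(length(words) - k + 1)
    vapply(starts,
           function(i) paste(words[i:(i + k - 1)], collapse = " "),
           character(1))
  })
  as.character(unlist(grams))
}

# Hypothetical sketch of the prediction steps:
# count the input words (n), build (n+1)-grams, keep those that start
# with the input, pick the most frequent one, break ties at random.
predict_next_word <- function(input_text, sampled_lines) {
  input_words <- strsplit(tolower(trimws(input_text)), "\\s+")[[1]]
  n <- length(input_words)
  grams <- make_ngrams(sampled_lines, n + 1)

  # Assumption: a match is an (n+1)-gram whose first n words are the input.
  prefix <- paste0(paste(input_words, collapse = " "), " ")
  matches <- grams[startsWith(grams, prefix)]
  if (length(matches) == 0) return(NA_character_)

  counts <- table(matches)
  top <- names(counts)[counts == max(counts)]  # all n-grams tied for first place
  chosen <- top[sample.int(length(top), 1)]    # random pick among ties

  words <- strsplit(chosen, " ")[[1]]
  words[length(words)]                         # the last word is the prediction
}

# Made-up sample data, for illustration only.
sampled <- c("Thanks for the follow",
             "thanks for the follow back",
             "thanks for the update",
             "see you at the game")
predict_next_word("thanks for the", sampled)
```

Stop words such as "for" and "the" are kept in the n-grams here, in line with the last point above. Returning `NA` when nothing matches is a choice made for this sketch; the steps above do not say what the model does in that case.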