Michael Inkles
October 7, 2016
Exploratory analysis showed that the news and blogs datasets had fairly similar word distributions, while the twitter dataset differed. I therefore kept the twitter data separate from the news/blogs data.
For each dataset, I performed the following steps:
The algorithm uses a combination of Kneser-Ney smoothing and backoff.
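To illustrate the idea, here is a minimal Python sketch of interpolated Kneser-Ney smoothing for bigrams only. It is not the app's implementation, which is not shown here and works with higher-order n-grams. The function name `train_bigram_kn` and the toy sentence are hypothetical, and the default discount of 0.9 is simply the value reported for the discounting parameter.

```python
from collections import Counter, defaultdict


def train_bigram_kn(tokens, discount=0.9):
    """Return a function prob(w1, w2) estimating P(w2 | w1).

    Illustrative bigram-only interpolated Kneser-Ney; the names here
    are hypothetical and not taken from the original app.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_count = Counter()       # how often w1 occurs as a bigram context
    followers = defaultdict(set)    # distinct words seen after w1
    preceders = defaultdict(set)    # distinct words seen before w2
    for (w1, w2), count in bigrams.items():
        context_count[w1] += count
        followers[w1].add(w2)
        preceders[w2].add(w1)
    n_bigram_types = len(bigrams)

    def prob(w1, w2):
        # Continuation probability: in how many distinct contexts does w2 appear?
        p_cont = len(preceders.get(w2, ())) / n_bigram_types
        c_ctx = context_count.get(w1, 0)
        if c_ctx == 0:
            # Unseen context: fall back entirely to the lower-order estimate.
            return p_cont
        # Subtract the discount from every observed bigram count...
        discounted = max(bigrams.get((w1, w2), 0) - discount, 0) / c_ctx
        # ...and give the freed probability mass to the lower-order model.
        backoff_weight = discount * len(followers[w1]) / c_ctx
        return discounted + backoff_weight * p_cont

    return prob


tokens = "going to the beach going to the gym going to the beach".split()
prob = train_bigram_kn(tokens)
print(prob("the", "beach"), prob("the", "gym"))
```

For a context that was seen in training, the probabilities over the vocabulary sum to 1, because the mass removed by the discount is exactly the mass redistributed through the continuation probabilities.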
Bigram model for “the ____” (top 3 possibilities)
      first        same        best
0.012223388 0.010758680 0.008319084
Trigram model for “to the ____”
       next       point         top
0.008185524 0.007809579 0.006681743
4-gram model for “going to the ____”
     beach     movies        gym
0.04269126 0.03566521 0.02867383
Perplexity was measured at several values of the discounting parameter. Each reported measurement is the average of 10 perplexity scores, each computed over 100 trials. By these measurements, 0.9 was the best value for the parameter, with an average perplexity of ~358.
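As a rough sketch of how such a measurement could be computed, the Python below treats each trial as one held-out prediction and takes perplexity as the exponential of the average negative log probability. That reading of "trial", and the function names `perplexity` and `average_perplexity`, are assumptions made for illustration.

```python
import math


def perplexity(probabilities):
    """Perplexity of one batch: exp of the mean negative log probability.

    `probabilities` holds the model's probability for the true next word
    in each trial; every value must be greater than zero.
    """
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))


def average_perplexity(batches):
    """Average the perplexity scores of several batches of trials."""
    scores = [perplexity(batch) for batch in batches]
    return sum(scores) / len(scores)


# Toy numbers only: two batches of two trials each.
print(average_perplexity([[0.5, 0.5], [0.25, 0.25]]))
```

Under this setup, one reported measurement would correspond to `average_perplexity` called on 10 batches of 100 probabilities each, repeated for every candidate discount.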
The Shiny app first loads the results of the data preparation steps, which are saved as CSV files. These files can take up to a couple of minutes to load, and the app will not return any results until loading is complete.
Type in any amount of text in any format. The app takes the raw input and reads the last three words, converted to lower case and stripped of punctuation. It then passes those three words to the algorithm, which is trained on either the twitter or the news/blogs dataset, depending on which button is selected.
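The input handling described above might look like the following Python sketch. The function name `last_three_words` is hypothetical, and two details are assumptions: apostrophes are kept so that contractions survive, and other punctuation is replaced with a space rather than deleted.

```python
import re


def last_three_words(raw_text):
    """Lower-case the input, drop punctuation, and return the last three words."""
    # Assumption: keep letters, digits, whitespace and apostrophes; everything
    # else becomes a space so that "well-known" splits into two words.
    cleaned = re.sub(r"[^\w\s']", " ", raw_text.lower())
    return cleaned.split()[-3:]


print(last_three_words("Tonight I am GOING to the, "))
```

Inputs shorter than three words simply return fewer words, which is where a backoff to a lower-order model would take over.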
You can then see the top 10 recommended words for each input, along with each word's probability of being correct. Unknown words and profanities are factored into the probability calculations but are not shown. As a result, the 10 words displayed are not always the 10 most likely words, but they are the most suitable for the purposes of the app, and their probabilities remain accurate even when likely profanities or unknown words are left out.
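One way to get this behaviour is to filter the candidate list after the probabilities have been computed, without renormalising. The Python sketch below shows the idea; the function name `top_suggestions`, the `<unk>` token, and the sample words and numbers are all hypothetical.

```python
def top_suggestions(candidates, blocked, n=10):
    """Return up to n (word, probability) pairs, highest probability first.

    `candidates` maps each word to its probability; `blocked` is a set of
    words to hide (for example profanities and an unknown-word token).
    Probabilities are not rescaled, so hidden words still hold their share.
    """
    shown = [(w, p) for w, p in candidates.items() if w not in blocked]
    shown.sort(key=lambda pair: pair[1], reverse=True)
    return shown[:n]


# Toy numbers only.
candidates = {"beach": 0.4, "<unk>": 0.3, "gym": 0.2, "darn": 0.1}
print(top_suggestions(candidates, blocked={"<unk>", "darn"}))
```

Because nothing is rescaled, the displayed probabilities can sum to less than 1, which matches the statement that they remain accurate when hidden words are left out.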