Text Prediction from N-grams

Michael Inkles
October 7, 2016

Preparing the Data

Exploratory Analysis showed that while the news and blogs datasets had fairly similar word distributions, the twitter dataset was different. So, I decided to keep the twitter data separate from the news/blogs data.

For each dataset, I performed the following steps (sketched in code after the list):

  • Separated the text entries into sentences, and the sentences into words and 2-, 3-, and 4-grams
  • Removed all punctuation other than apostrophes, and converted all words to lower case
  • Removed profanities and other offensive words
  • Downsampled each dataset to a 5,000,000-row data frame, with each row containing a word from the text and the 3 preceding words.
  • Replaced words with a frequency of less than 10 with a dummy word of “<UNK>”. This allowed for faster processing (because there were far fewer levels to each factor) and a reduction of noise.
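Below is a rough sketch of how these steps could look in R, assuming base R plus the stringi package. The function name prepare_corpus and the column names w3/w2/w1/word are illustrative, not the code actually used, and profanity filtering is omitted for brevity.

library(stringi)

prepare_corpus <- function(lines, min_count = 10, max_rows = 5e6) {
  # Split entries into sentences, lower-case them, and drop punctuation
  # other than apostrophes (profanity filtering omitted here)
  sentences <- unlist(stri_split_boundaries(lines, type = "sentence"))
  sentences <- tolower(sentences)
  sentences <- gsub("[^a-z0-9' ]", " ", sentences)

  # One row per word, together with its 3 preceding words
  rows <- do.call(rbind, lapply(sentences, function(s) {
    w <- stri_split_regex(s, "\\s+", omit_empty = TRUE)[[1]]
    if (length(w) < 4) return(NULL)
    i <- 4:length(w)
    data.frame(w3 = w[i - 3], w2 = w[i - 2], w1 = w[i - 1], word = w[i],
               stringsAsFactors = FALSE)
  }))

  # Downsample to at most max_rows rows
  if (nrow(rows) > max_rows) rows <- rows[sample(nrow(rows), max_rows), ]

  # Replace words with a frequency below min_count with "<UNK>"
  counts <- table(rows$word)
  rare <- names(counts)[counts < min_count]
  for (col in c("w3", "w2", "w1", "word")) {
    rows[[col]][rows[[col]] %in% rare] <- "<UNK>"
  }
  rows
}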

Algorithm

The algorithm uses a combination of Kneser-Ney smoothing and backoff; a rough code sketch follows the numbered steps below.

  1. A bigram model is created by narrowing the data frame down to rows whose previous word matches the last word of the input, then counting how many times each word follows it.
  2. A discount of 0.9 is subtracted from each nonzero count.
  3. The subtracted mass is then distributed among all words according to their overall distribution in the dataset.
  4. If possible, the bigram data frame is narrowed down to a trigram data frame and steps 2-3 are repeated, with the subtracted mass in step 3 instead assigned to all words according to their bigram probabilities.
  5. The same steps are repeated with a 4-gram model, if possible.
  6. The algorithm returns the highest-order model available.
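The sketch below illustrates the discount-and-backoff idea in R, using the hypothetical rows data frame from the preparation sketch above. next_word_probs and its arguments are illustrative names; the real implementation may differ in its details.

next_word_probs <- function(rows, context, d = 0.9) {
  # context: up to three preceding words, most recent last,
  # e.g. c("going", "to", "the")
  vocab <- sort(unique(rows$word))
  counts_for <- function(keep) table(factor(rows$word[keep], levels = vocab))

  # Start from the unigram distribution over the whole vocabulary
  probs <- as.numeric(counts_for(rep(TRUE, nrow(rows)))) / nrow(rows)
  names(probs) <- vocab

  cols <- c("w1", "w2", "w3")        # previous word, the one before it, ...
  context <- rev(tail(context, 3))   # most recent word first
  for (order in seq_along(context)) {
    # Keep rows whose last 'order' context words match the input
    keep <- rep(TRUE, nrow(rows))
    for (k in seq_len(order)) keep <- keep & rows[[cols[k]]] == context[k]
    if (!any(keep)) break            # no higher-order evidence: stop here
    counts <- as.numeric(counts_for(keep))
    n <- sum(counts)
    # Subtract the discount from each nonzero count, and hand the removed
    # probability mass to the lower-order distribution built so far
    probs <- pmax(counts - d, 0) / n + (d * sum(counts > 0) / n) * probs
    names(probs) <- vocab
  }
  sort(probs, decreasing = TRUE)
}

# Example: head(next_word_probs(rows, c("going", "to", "the")), 3)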

Algorithm In Action

Bigram model for “the ____” (top 3 possibilities)

      first        same        best 
0.012223388 0.010758680 0.008319084 

Trigram model for “to the ____”

       next       point         top 
0.008185524 0.007809579 0.006681743 

4-gram model for “going to the ____”

     beach     movies        gym 
0.04269126 0.03566521 0.02867383 

Model Performance

Perplexity was measured at various levels of the discounting parameter. Each reported value is the average of 10 perplexity measurements, each computed over 100 trials. According to these measurements, 0.9 was the best value for the parameter, with an average perplexity of approximately 358.

[Plot: average perplexity at each value of the discounting parameter]
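The sweep described above might be run along these lines, reusing the hypothetical next_word_probs() sketch; test_rows stands for a held-out set of rows, and the candidate discount values shown are illustrative.

perplexity_once <- function(rows, test_rows, d, n_trials = 100) {
  # One perplexity measurement over n_trials randomly chosen test rows
  trials <- test_rows[sample(nrow(test_rows), n_trials), ]
  logp <- mapply(function(w3, w2, w1, w) {
    p <- next_word_probs(rows, c(w3, w2, w1), d = d)[w]
    log(max(p, 1e-12, na.rm = TRUE))   # floor to avoid log(0) for unseen words
  }, trials$w3, trials$w2, trials$w1, trials$word)
  exp(-mean(logp))
}

# Average of 10 measurements at each candidate discount value
for (d in c(0.5, 0.7, 0.9)) {
  pp <- mean(replicate(10, perplexity_once(rows, test_rows, d)))
  cat("discount", d, "-> perplexity", round(pp, 1), "\n")
}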

Shiny App

The Shiny app first loads the results of the data preparation steps, which are saved as CSV files. These files can take up to a couple of minutes to load, so the app will not return any results until they are loaded.

Type in any amount of text in any format. The app will take the raw text input and read the last three words, converted to lower case and stripped of punctuation. It will then feed those three words into the algorithm, which is trained on either the twitter or news/blogs dataset, depending on which button is selected.
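The input handling can be sketched as follows; clean_input is an illustrative name, not the app's actual function.

clean_input <- function(text) {
  # Lower-case, keep only letters, digits, apostrophes, and spaces,
  # then return the last three words
  text <- tolower(text)
  text <- gsub("[^a-z0-9' ]", " ", text)
  words <- strsplit(trimws(text), "\\s+")[[1]]
  tail(words, 3)
}

clean_input("I think we're going to the")   # "going" "to" "the"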

You can then see the top 10 recommended words for each input, along with each word's probability of being correct. Unknown words and profanities are factored into the probability calculations but not shown, so the displayed words are not always the 10 most likely overall; they are, however, the most suitable for the purposes of the app, and their probabilities remain accurate even when likely profanities and/or unknown words are excluded.
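The display step might look roughly like this, with profanities standing in for whatever word list the app filters on; the probabilities come from the full distribution computed before the hidden words are dropped.

top_suggestions <- function(probs, profanities = character(0), n = 10) {
  # probs: the sorted, named probability vector from the prediction step
  shown <- probs[!(names(probs) %in% c("<UNK>", profanities))]
  head(shown, n)   # probabilities are left unrenormalized, as described above
}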