Capstone Shiny Project - 'Predictive Text App'

Li Jiming
22/01/2016

In this report, I describe how I built a predictive text app trained on a large corpus of text from news-related sources. I explain how the data was used to create n-gram models, which were then used to build my prediction model with the Kneser-Ney smoothing algorithm. The prediction model is available in my Shiny app: https://lijiming.shinyapps.io/ShinyApp/

Introduction

The purpose of this project was to create a predictive text application that predicts the next word that follows a sentence input.

This report is broken down into two major steps:

1. Training the app on a large corpus of text
- importing the corpus
- cleaning the corpus to remove undesired items
- converting the corpus into data tables of n-gram models
2. Generating the predictions
- smoothing the n-gram model
- retrieving predictions

Training on a large corpus of text

1. Import the corpus (or a subset of, say, 500,000 lines)
2. Clean the text (e.g. remove symbols, emoticons, profanity)
3. Tokenize the text
4. Create a data table of a unigram model with the token and token counts as the columns
5. Repeat #4 for higher-order n-grams, up to the 5-gram model.
6. Add columns to split each token into its individual words, for better indexing (e.g., token1, token2, token, and count as the columns of a data table for the bigram model).
7. Remove any tokens with a count of 1 from each model to reduce size.
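
The steps above can be sketched in a few lines. This is a minimal Python illustration, not the code behind the app: the three-line corpus, the function names (tokenize, ngram_counts, prune) and the simple regular-expression cleaning are all assumptions made for the example, and profanity filtering is left out. Each n-gram is stored as a tuple of words, which plays the role of the per-word columns in step 6.

```python
import re
from collections import Counter

def tokenize(line):
    """Lowercase a line and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", line.lower())

def ngram_counts(lines, n):
    """Count every n-gram (as a tuple of words) across the given lines."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def prune(counts, min_count=2):
    """Drop n-grams seen fewer than min_count times to reduce table size."""
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Hypothetical three-line corpus standing in for the news text.
corpus = [
    "The market fell sharply today.",
    "The market rose again today!",
    "Analysts say the market fell.",
]

# One table per order, from unigrams up to 5-grams (steps 4 and 5),
# with count-1 entries removed (step 7).
tables = {n: prune(ngram_counts(corpus, n)) for n in range(1, 6)}

for n, table in tables.items():
    print(n, table)
```

On this toy corpus only a handful of n-grams survive pruning, and the 4-gram and 5-gram tables end up empty, which shows how aggressively step 7 shrinks the higher-order models.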

Generating the predictions

Kneser-Ney smoothing adjusts n-gram probabilities by discounting the observed counts and redistributing the freed probability mass according to the continuation counts of lower-order n-grams.

Consider a sentence that should end with "glasses". A model based on raw word frequency may instead present "Francisco" as the suggested ending, because it appears more often than "glasses" in some text. However, "Francisco" rarely occurs outside of the context of "San Francisco". Thus, instead of observing how often a word appears, the Kneser-Ney algorithm takes into account how many distinct bigram types a word completes (e.g., "prescription glasses", "reading glasses", "small glasses" vs. "San Francisco").
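
To make the difference between raw frequency and continuation count concrete, here is a small Python sketch. The bigram counts are invented for illustration and are not taken from the corpus used in this project.

```python
from collections import Counter

# Hypothetical bigram counts, chosen so that "francisco" is frequent
# but only ever follows "san".
bigram_counts = {
    ("san", "francisco"): 10,
    ("prescription", "glasses"): 2,
    ("reading", "glasses"): 3,
    ("small", "glasses"): 1,
}

raw = Counter()           # how often each word ends a bigram
continuation = Counter()  # how many distinct words precede each word
for (first, second), count in bigram_counts.items():
    raw[second] += count
    continuation[second] += 1

# Continuation probability: share of all bigram types that end in the word.
total_types = len(bigram_counts)
p_cont = {w: continuation[w] / total_types for w in continuation}

print(raw)     # "francisco" has the higher raw count
print(p_cont)  # "glasses" has the higher continuation probability
```

With these numbers "francisco" wins on raw count (10 against 6) but "glasses" wins on continuation probability (0.75 against 0.25), which is the behaviour the paragraph above describes.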

I believe the smoothing algorithm is typically applied to all of the n-gram models (unigram, bigram, etc.) before any predictions are attempted.

Kneser-Ney Algorithm

I was able to write a fully recursive Kneser-Ney algorithm for n-gram models of any n. In practice, however, I limited the number of candidate words, so the result is often very inaccurate.
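
For reference, the recursion is usually written in the following interpolated form. This is the standard textbook formulation, not a transcription of my code. The discount d and the weight lambda are symbols introduced here; c_KN denotes the raw count at the highest order and the continuation count at lower orders.

```latex
P_{KN}(w_i \mid w_{i-n+1}^{i-1}) =
  \frac{\max\bigl(c_{KN}(w_{i-n+1}^{i}) - d,\ 0\bigr)}
       {\sum_{v} c_{KN}(w_{i-n+1}^{i-1}\, v)}
  + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})

\lambda(w_{i-n+1}^{i-1}) =
  \frac{d}{\sum_{v} c_{KN}(w_{i-n+1}^{i-1}\, v)}
  \,\bigl|\{\, w : c_{KN}(w_{i-n+1}^{i-1}\, w) > 0 \,\}\bigr|

P_{KN}(w_i) =
  \frac{\bigl|\{\, v : c(v\, w_i) > 0 \,\}\bigr|}
       {\bigl|\{\, (v, w) : c(v\, w) > 0 \,\}\bigr|}
```

The first line is the recursive step, the second defines the weight given to the lower-order model, and the third is one common choice for the unigram base case (the continuation probability).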

To run this in real time, I first select the candidates (the words that could come next in the sentence) to be used for smoothing.

The candidates for the next word are the top-ranking words that follow w_i in the bigram model, where w_i, the first word of the bigram, is the final word of the input sentence.

The Kneser-Ney probabilities are therefore computed only for these candidates (the possible bigram continuations of the final word in the sentence).
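
The candidate selection and scoring can be sketched as follows for the bigram case only. This is a simplified Python illustration under stated assumptions, not the recursive implementation used in the app: the bigram counts are invented, the discount of 0.75 is a commonly used default rather than a value from this project, and the function names (top_candidates, kn_bigram_prob, predict) are hypothetical.

```python
D = 0.75  # assumed absolute discount; a common default, not from this report

# Hypothetical bigram counts standing in for the pruned bigram table.
bigram_counts = {
    ("reading", "glasses"): 3,
    ("reading", "books"): 1,
    ("prescription", "glasses"): 2,
    ("small", "glasses"): 1,
    ("small", "books"): 1,
    ("san", "francisco"): 10,
}

def top_candidates(bigram_counts, last_word, k=3):
    """Return up to k words that most often follow last_word."""
    followers = {w2: c for (w1, w2), c in bigram_counts.items() if w1 == last_word}
    return sorted(followers, key=followers.get, reverse=True)[:k]

def kn_bigram_prob(bigram_counts, last_word, word, d=D):
    """Interpolated Kneser-Ney probability of word given last_word (bigram case)."""
    context_total = sum(c for (w1, _), c in bigram_counts.items() if w1 == last_word)
    followers = sum(1 for (w1, _) in bigram_counts if w1 == last_word)
    preceders = sum(1 for (_, w2) in bigram_counts if w2 == word)
    p_cont = preceders / len(bigram_counts)  # continuation probability
    if context_total == 0:
        return p_cont  # unseen context: fall back to the lower-order estimate
    discounted = max(bigram_counts.get((last_word, word), 0) - d, 0) / context_total
    lam = d * followers / context_total  # mass freed by discounting
    return discounted + lam * p_cont

def predict(bigram_counts, sentence, k=3):
    """Score only the top bigram continuations of the sentence's final word."""
    words = sentence.lower().split()
    if not words:
        return []
    last_word = words[-1]
    candidates = top_candidates(bigram_counts, last_word, k)
    scored = [(w, kn_bigram_prob(bigram_counts, last_word, w)) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(predict(bigram_counts, "I cannot see without my reading"))
```

With these toy counts, "glasses" ranks first with a probability of 0.75 and "books" second with about 0.19. Because only the bigram followers of the final word are scored, any word that never follows it in the table (here "francisco") cannot be suggested, which is one way restricting the candidates can hurt accuracy.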