Predicting the Next Word in a Sequence

Mahesh Arumugam
June 5, 2016

Introduction

  • In this project, we build an algorithm that predicts the next word in a sequence of words.
  • Steps involved in the project:
    • Data acquisition and preprocessing (Coursera site)
      • download, sample, and clean the datasets
      • build a corpus and apply transformations (e.g., strip whitespace, filter profanity); see the sketch after this list
    • Exploratory Analysis (report)
      • n-gram tokenization and coverage analysis
    • Prediction Model and Application Development
      • n-gram model and handling of unseen tokens (via smoothing algorithms)
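
As a concrete illustration, here is a minimal sketch of the corpus-building and tokenization steps using the `tm` package (the file names `sample.txt` and `profanity.txt` are hypothetical placeholders; the actual project uses the blog, news, and twitter files from the Coursera site):

```r
library(tm)

# Hypothetical inputs: a sampled text file and a profanity word list.
lines     <- readLines("sample.txt", encoding = "UTF-8", warn = FALSE)
profanity <- readLines("profanity.txt", warn = FALSE)

# Build a corpus and apply the cleaning transformations.
corpus <- VCorpus(VectorSource(lines))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, profanity)  # profanity filtering
corpus <- tm_map(corpus, stripWhitespace)

# Simple base-R bigram tokenization over the cleaned text.
text    <- sapply(corpus, as.character)
words   <- unlist(strsplit(text, "\\s+"))
words   <- words[nchar(words) > 0]
bigrams <- paste(head(words, -1), tail(words, -1))
head(sort(table(bigrams), decreasing = TRUE), 5)  # most frequent bigrams
```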

Prediction Model

  • We build unigram, bigram, trigram, and quadgram models from the sampled data.
  • We use raw frequency counts from the highest-order n-gram model that matches the typed context to predict the next word (see the sketch after this list).
  • To deal with unseen tokens, we implemented two smoothing algorithms that redistribute probability mass to unseen events (formulas below):
    • Add-one (Laplace) smoothing: add one to every count and renormalize the probabilities.
    • Kneser-Ney smoothing: a backoff algorithm that uses lower-order n-gram probabilities to handle unseen higher-order n-grams.
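
For reference, these are the standard bigram forms of the two estimators (textbook formulas, not taken from the app; the implementation generalizes them to higher orders). Add-one smoothing:

$$P_{\text{add-1}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) + 1}{c(w_{i-1}) + V}$$

where $c(\cdot)$ is a raw count and $V$ is the vocabulary size. Interpolated Kneser-Ney:

$$P_{\text{KN}}(w_i \mid w_{i-1}) = \frac{\max\!\big(c(w_{i-1} w_i) - d,\ 0\big)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{cont}}(w_i)$$

where $d$ is a fixed discount, $\lambda(w_{i-1})$ redistributes the discounted mass, and $P_{\text{cont}}(w_i)$ is the continuation probability: the fraction of distinct bigram types that end in $w_i$.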
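
And a minimal sketch of the lookup itself, with add-one smoothing and a simple back-off from bigrams to unigrams (the data frames `bigrams` and `unigrams` and their columns are illustrative names, not the app's actual data structures):

```r
# Sketch: rank candidate next words by add-one smoothed probability.
# `bigrams`: data frame with columns w1, w2, count (raw bigram counts).
# `unigrams`: data frame with columns word, count (raw unigram counts).
predict_next <- function(context, bigrams, unigrams, top_n = 3) {
  cand <- bigrams[bigrams$w1 == context, ]  # observed continuations of context
  V <- nrow(unigrams)                       # vocabulary size
  if (nrow(cand) > 0) {
    # add-one bigram estimate: (c(w1 w2) + 1) / (c(w1) + V)
    cand$prob <- (cand$count + 1) / (sum(cand$count) + V)
    head(cand[order(-cand$prob), c("w2", "prob")], top_n)
  } else {
    # unseen context: back off to the smoothed unigram distribution
    unigrams$prob <- (unigrams$count + 1) / (sum(unigrams$count) + V)
    head(unigrams[order(-unigrams$prob), c("word", "prob")], top_n)
  }
}

# Toy example
bigrams  <- data.frame(w1 = c("new", "new"), w2 = c("york", "car"),
                       count = c(5, 2), stringsAsFactors = FALSE)
unigrams <- data.frame(word = c("new", "york", "car", "the"),
                       count = c(7, 5, 2, 10), stringsAsFactors = FALSE)
predict_next("new", bigrams, unigrams)  # "york" ranks first
```

Restricting the candidate set to observed continuations keeps the lookup fast, while the add-one denominator still uses the full vocabulary size $V$.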

Next Word Prediction Application

Try the application here

  • Select the smoothing algorithm in the sidebar panel of the Shiny app.
  • Start typing your text in the main panel. The predictions appear as radio buttons.
  • The “Predicted Result Summary” tab shows the predicted probabilities.

Add-one smoothing screenshot

[Screenshots: add-one smoothing predictions; add-one probabilities]

Conclusion

  • The predictions take some time to appear; please allow 10-20 seconds for the session to load and the radio buttons to appear.
  • The prediction accuracy is low (approximately 10-20%).
  • In the current implementation, the Kneser-Ney algorithm does not appear to yield better predictions than the add-one smoothing algorithm.
  • Scope for future work:
    • Improve the accuracy of the algorithm
    • Develop the model with a larger sample (currently limited by hardware)
    • Implement other discounting models