Predicting the Next Word in a Sequence
Mahesh Arumugam
June 5, 2016
Introduction
In this project, we build an algorithm that predicts the next word in a sequence of text.
Steps involved in the project
Data acquisition and preprocessing (Coursera Site)
download, sample, and clean the datasets
build a corpus and apply transformations, e.g., strip whitespace and filter profanity (see the sketch after this list)
Exploratory Analysis (report)
n-gram tokenization and coverage analysis
Prediction Model and Application Development
n-gram model and handling of unseen tokens (via smoothing algorithms)
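As a rough illustration of the preprocessing and tokenization steps, here is a minimal R sketch using the tm package; the file name, sampling rate, and profanity list are placeholders rather than the project's actual code.

```r
library(tm)

# Read and sample the raw text (file name and sampling rate are placeholders)
lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
sampled <- sample(lines, size = floor(0.05 * length(lines)))

# Build a corpus and apply the cleaning transformations
profanity <- c("badword1", "badword2")   # placeholder profanity list
corpus <- VCorpus(VectorSource(sampled))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, profanity)
corpus <- tm_map(corpus, stripWhitespace)

# Simple n-gram tokenization in base R (n = 2 shown; n = 3 and n = 4 work the same way)
ngrams <- function(text, n) {
  words <- unlist(strsplit(text, "\\s+"))
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}
docs <- unlist(lapply(corpus, as.character))
bigram_counts <- sort(table(unlist(lapply(docs, ngrams, n = 2))), decreasing = TRUE)
```

The resulting frequency tables (unigram through quadgram) are what the coverage analysis and the prediction model below operate on.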
Prediction Model
We build unigram, bigram, trigram, and quadgram models from the sampled data.
We use raw frequency counts from the highest-order n-gram model available to predict the next word.
To handle tokens that are unseen in the n-gram model, we implement two smoothing algorithms that redistribute probability mass to unseen tokens.
Add-one smoothing (Laplace smoothing): add one to all counts and normalize the probabilities.
Kneser-Ney smoothing: a backoff smoothing algorithm in which lower-order n-gram probabilities are used to handle unseen higher-order n-grams.
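To make the lookup concrete, below is a minimal sketch of the highest-order backoff with add-one smoothing, assuming the n-gram counts have already been collected into named frequency tables as in the earlier sketch; the function name and data layout are illustrative, not the project's actual implementation. The Kneser-Ney variant, which replaces the raw lower-order counts with continuation counts, is omitted for brevity.

```r
# counts_list[[n]] holds the n-gram frequency table (1 = unigram, ..., 4 = quadgram),
# e.g. counts_list[[2]]["of the"] is the raw count of the bigram "of the".
predict_next <- function(prefix_words, counts_list, vocab_size, top_n = 3) {
  for (order in c(4, 3, 2)) {              # try the highest-order model first, then back off
    if (length(prefix_words) < order - 1) next
    history <- paste(tail(prefix_words, order - 1), collapse = " ")
    counts  <- counts_list[[order]]
    pattern <- paste0("^", history, " ")   # assumes the history contains plain words only
    matches <- counts[grepl(pattern, names(counts))]
    if (length(matches) > 0) {
      # add-one smoothing: P(w | history) = (c(history, w) + 1) / (c(history) + V),
      # with c(history) taken as the sum of the matching n-gram counts
      probs <- (matches + 1) / (sum(matches) + vocab_size)
      top   <- sort(probs, decreasing = TRUE)[seq_len(min(top_n, length(probs)))]
      return(setNames(as.numeric(top), sub(pattern, "", names(top))))
    }
  }
  # no history matched: fall back to the most frequent unigrams
  uni <- sort(counts_list[[1]], decreasing = TRUE)[seq_len(top_n)]
  setNames((as.numeric(uni) + 1) / (sum(counts_list[[1]]) + vocab_size), names(uni))
}

# Example call: predict the word following "one of the"
# predict_next(c("one", "of", "the"), counts_list, vocab_size = length(counts_list[[1]]))
```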
Next Word Prediction Application
Try the application here
Select the smoothing algorithm in the sidebar panel of the Shiny app.
Start typing your text in the main panel. The predictions will appear as radio buttons.
The “Predicted Result Summary” tab shows the prediction probabilities.
[Screenshot: add-one smoothing]
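For reference, the layout described above could be wired up in Shiny roughly as follows; `predict_next`, `counts_list`, `vocab_size`, and the input/output IDs are assumptions carried over from the sketches above, not the app's actual code, and the switch between the two smoothing algorithms is omitted for brevity.

```r
library(shiny)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      # smoothing algorithm selection, as described above
      radioButtons("smoothing", "Smoothing algorithm",
                   choices = c("Add-one", "Kneser-Ney"))
    ),
    mainPanel(
      textInput("phrase", "Start typing your text"),
      uiOutput("predictions"),                      # predicted words as radio buttons
      tabsetPanel(
        tabPanel("Predicted Result Summary", tableOutput("summary"))
      )
    )
  )
)

server <- function(input, output) {
  preds <- reactive({
    req(input$phrase)
    words <- strsplit(tolower(input$phrase), "\\s+")[[1]]
    predict_next(words, counts_list, vocab_size)    # assumed prediction function
  })
  output$predictions <- renderUI({
    radioButtons("choice", "Predicted next word", choices = names(preds()))
  })
  output$summary <- renderTable({
    data.frame(word = names(preds()), probability = as.numeric(preds()))
  })
}

shinyApp(ui, server)
```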
Conclusion
The predictions take some time to appear; please wait 10-20 seconds for the session to load and the radio buttons to appear.
The prediction accuracy is quite low (approximately 10-20%).
In the current implementation, the Kneser-Ney algorithm does not seem to provide better prediction results than the add-one smoothing algorithm.
Scope for future work:
Improve the accuracy of the algorithm
Develop the model with a larger sample (currently limited by hardware)
Implement other discounting models