Next Word Prediction

Cassie (Xi) Guo
Dec 20, 2016

Introduction

The goal of this project is to build a Shiny app that takes two input words and predicts the next word. An N-gram model is built from text drawn from blogs, news, and Twitter. The sections below describe the prediction algorithm, the app itself, and the model's performance.

Algorithm - Katz Back-Off Model

  • For an observed third word: use the discounted trigram probability (default discount: 0.5)

  • For an unobserved third word: the probability mass freed by discounting is redistributed among unobserved candidates via the bigram model

  • Final prediction: compute probabilities from both the bigram and trigram models and take the word with the highest probability (see the sketch below)
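
A minimal sketch of this back-off step in R, assuming the N-gram counts are held in data frames trigrams (columns w1, w2, w3, n) and bigrams (columns w1, w2, n); the function and column names are illustrative, not the app's actual code:

    # Predict the third word given (w1, w2) with Katz back-off and
    # absolute discount d; data layout is assumed, not the app's code
    predict_next <- function(w1, w2, trigrams, bigrams, d = 0.5) {
      # Observed third words: discounted trigram probability
      obs   <- trigrams[trigrams$w1 == w1 & trigrams$w2 == w2, ]
      p_obs <- setNames((obs$n - d) / sum(obs$n), obs$w3)

      # Probability mass freed by discounting the observed trigrams
      alpha <- 1 - sum(p_obs)

      # Unobserved third words: redistribute alpha in proportion to
      # their discounted bigram probabilities
      cand    <- bigrams[bigrams$w1 == w2 & !(bigrams$w2 %in% obs$w3), ]
      p_big   <- (cand$n - d) / sum(bigrams$n[bigrams$w1 == w2])
      p_unobs <- setNames(alpha * p_big / sum(p_big), cand$w2)

      # Highest-probability candidate wins
      p <- c(p_obs, p_unobs)
      names(which.max(p))
    }

Because alpha is exactly the mass removed from the observed trigrams, the combined probabilities still sum to one over the candidate set.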

The Shiny App
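
As a rough illustration of the app's structure, a minimal Shiny interface that takes two input words and displays the predicted next word could look like the following, reusing the predict_next sketch above; the widget IDs and layout are assumptions, not the published app's code:

    library(shiny)

    # Assumes the N-gram tables (trigrams, bigrams) are loaded at
    # startup, e.g. from .rds files
    ui <- fluidPage(
      titlePanel("Next Word Prediction"),
      textInput("w1", "First word"),
      textInput("w2", "Second word"),
      h4(textOutput("prediction"))
    )

    server <- function(input, output) {
      output$prediction <- renderText({
        req(input$w1, input$w2)  # wait until both words are entered
        predict_next(input$w1, input$w2, trigrams, bigrams)
      })
    }

    shinyApp(ui, server)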

Model Performance

  • Dataset
    • Training: 18% of the raw data (blogs, news, and Twitter)
    • Testing: 4% of the raw data (roughly an 80:20 train/test split of the sampled data)
  • Accuracy
    • 8%-12%, depending on the size of the training set (see the evaluation sketch after this list)
  • Novelty
    • Candidate words are drawn from both the bigram and trigram models
  • Caveat
    • Limited computing power: the model was trained on only a sample of the data, so some common words are not covered by the stored N-grams
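
The accuracy figure above could be computed as an exact-match check over held-out word triples; the sketch below is a hypothetical scoring loop, assuming a test data frame with columns w1, w2, w3, not the project's actual evaluation code:

    # For each held-out triple, predict the third word from the first
    # two and score an exact match (hypothetical test-set layout)
    hits <- mapply(
      function(a, b, truth) identical(predict_next(a, b, trigrams, bigrams), truth),
      test$w1, test$w2, test$w3
    )
    mean(hits)  # fraction of correct predictions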