Next Word Prediction

Cassie (Xi) Guo
Dec 20, 2016

Introduction

The goal of this project is to build a Shiny app that takes two input words and predicts the next word. An N-gram model is built from text drawn from blogs, news, and Twitter. The sections below describe the prediction algorithm, the app itself, and the model's performance.

Algorithm - Katz Back-Off Model

  • For an observed third word: use the discounted trigram probability (default discount: 0.5)

  • For an unobserved third word: the probability mass freed by discounting is redistributed among unobserved candidates via the bigram model

  • Final prediction: compute probabilities from both the bigram and trigram models and take the word with the highest probability (see the sketch below)
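
A minimal sketch of this back-off step in R, assuming the N-gram counts are held in data frames trigrams (columns w1, w2, w3, n) and bigrams (columns w1, w2, n); the function and column names are illustrative, not the app's actual code:

    # Predict the third word given (w1, w2) with Katz back-off and
    # absolute discount d; data layout is assumed, not the app's code
    predict_next <- function(w1, w2, trigrams, bigrams, d = 0.5) {
      # Observed third words: discounted trigram probability
      obs   <- trigrams[trigrams$w1 == w1 & trigrams$w2 == w2, ]
      p_obs <- setNames((obs$n - d) / sum(obs$n), obs$w3)

      # Probability mass freed by discounting the observed trigrams
      alpha <- 1 - sum(p_obs)

      # Unobserved third words: redistribute alpha in proportion to
      # their discounted bigram probabilities
      cand    <- bigrams[bigrams$w1 == w2 & !(bigrams$w2 %in% obs$w3), ]
      p_big   <- (cand$n - d) / sum(bigrams$n[bigrams$w1 == w2])
      p_unobs <- setNames(alpha * p_big / sum(p_big), cand$w2)

      # Highest-probability candidate wins
      p <- c(p_obs, p_unobs)
      names(which.max(p))
    }

Because alpha is exactly the mass removed from the observed trigrams, the combined probabilities still sum to one over the candidate set.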

The Shiny App
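
As a rough illustration of the app's structure, a minimal Shiny interface that takes two input words and displays the predicted next word could look like the following, reusing the predict_next sketch above; the widget IDs and layout are assumptions, not the published app's code:

    library(shiny)

    # Assumes the N-gram tables (trigrams, bigrams) are loaded at
    # startup, e.g. from .rds files
    ui <- fluidPage(
      titlePanel("Next Word Prediction"),
      textInput("w1", "First word"),
      textInput("w2", "Second word"),
      h4(textOutput("prediction"))
    )

    server <- function(input, output) {
      output$prediction <- renderText({
        req(input$w1, input$w2)  # wait until both words are entered
        predict_next(input$w1, input$w2, trigrams, bigrams)
      })
    }

    shinyApp(ui, server)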

Model Performance

  • Dataset
    • Training: 18% of the raw data (blogs, news, and Twitter)
    • Testing: 4% of the raw data (roughly an 80:20 train/test split of the sampled data)
  • Accuracy
    • 8%-12%, depending on the size of the training set (see the evaluation sketch after this list)
  • Novelty
    • Candidate words are drawn from both the bigram and trigram models
  • Caveat
    • Limited computing power: the model was trained on only a sample of the data, so some common words are not covered by the stored N-grams
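
The accuracy figure above could be computed as an exact-match check over held-out word triples; the sketch below is a hypothetical scoring loop, assuming a test data frame with columns w1, w2, w3, not the project's actual evaluation code:

    # For each held-out triple, predict the third word from the first
    # two and score an exact match (hypothetical test-set layout)
    hits <- mapply(
      function(a, b, truth) identical(predict_next(a, b, trigrams, bigrams), truth),
      test$w1, test$w2, test$w3
    )
    mean(hits)  # fraction of correct predictions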