Next Word Prediction

by Chan Chee-Foong on 1 Oct 2016
for Data Science Capstone Project of the Data Science Specialisation by Johns Hopkins University


Summary

The objective of this project is to build a predictive text model like those used by SwiftKey.
The data were downloaded from HC Corpora, and the Exploratory Data Analysis can be found here.

Prediction Model

  1. As an English-only 4-gram model, the application uses at most the last 3 words of the input to predict the next word.
  2. Using Natural Language Processing techniques, the prediction scores for the next word are computed with 2 methods: Stupid Backoff and Kneser-Ney Interpolation. The word with the highest score is the most likely next word. Each method has pros and cons in terms of speed and accuracy.
  3. Due to system limitations, only samples of the corpus are used for building and validating the model.
  4. Records with non-English words, bad words, repeated words, etc. are excluded from the sampling.
  5. Only proper English words are used in building the prediction model.
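
To make the scoring idea concrete, here is a minimal Python sketch of Stupid Backoff on a 4-gram model. It is not the application's actual code: the toy corpus, the function names (`build_counts`, `stupid_backoff_score`, `predict_next`) and the backoff factor of 0.4 (the value commonly quoted for this method) are all assumptions made for illustration.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; the slides do not state the value used


def build_counts(sentences, max_n=4):
    """Count every n-gram of order 1..max_n in a list of tokenised sentences."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff_score(word, context, counts, total_unigrams):
    """Relative frequency at the longest matching context, discounted by
    ALPHA for every word dropped from the front of the context."""
    context = tuple(context)
    penalty = 1.0
    while context:
        seen = counts[context + (word,)]
        if seen > 0:
            return penalty * seen / counts[context]
        context = context[1:]  # back off to a shorter context
        penalty *= ALPHA
    return penalty * counts[(word,)] / total_unigrams


def predict_next(text, counts, vocab, total_unigrams, max_n=4):
    """Return the vocabulary word with the highest score for the input text."""
    context = text.lower().split()[-(max_n - 1):]  # at most the last 3 words
    return max(vocab, key=lambda w: stupid_backoff_score(w, context, counts, total_unigrams))


# Toy corpus, purely illustrative.
sentences = [
    ["thank", "you", "very", "much"],
    ["thank", "you", "very", "much"],
    ["thank", "you", "so", "much"],
    ["i", "like", "you", "very", "little"],
]
counts = build_counts(sentences)
vocab = {gram[0] for gram in counts if len(gram) == 1}
total_unigrams = sum(c for gram, c in counts.items() if len(gram) == 1)

best = predict_next("Thank you very", counts, vocab, total_unigrams)
```

The scores are not true probabilities (they do not sum to 1), which is what makes the method cheap to compute. Kneser-Ney Interpolation needs extra continuation counts and discounting, which is consistent with the speed and accuracy trade-off noted in the list.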

Accuracy

Based on 3 input words, Kneser-Ney Interpolation achieved an accuracy of 33.5%, while Stupid Backoff achieved 32.2%.

[Figure: plot of chunk Slide3]
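
The slides do not say exactly how accuracy was measured. One plausible reading is top-1 accuracy on held-out 4-grams: feed the first 3 words to the model and check whether its best guess equals the 4th word. The Python sketch below illustrates that assumption only. The `top1_accuracy` helper and the toy lookup predictor are hypothetical, and the figures above are not reproduced by it.

```python
def top1_accuracy(test_ngrams, predict):
    """Fraction of held-out n-grams whose final word equals the prediction
    made from the preceding words."""
    if not test_ngrams:
        return 0.0
    hits = 0
    for ngram in test_ngrams:
        context, target = ngram[:-1], ngram[-1]
        if predict(context) == target:
            hits += 1
    return hits / len(test_ngrams)


# Hypothetical stand-in for a trained model: a fixed lookup table.
toy_model = {
    ("thank", "you", "very"): "much",
    ("see", "you", "next"): "time",
    ("have", "a", "nice"): "day",
}

held_out = [
    ("thank", "you", "very", "much"),
    ("see", "you", "next", "week"),
    ("have", "a", "nice", "day"),
]

accuracy = top1_accuracy(held_out, lambda context: toy_model.get(tuple(context)))
```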

Application

The application, developed with Shiny, can be found here.

Instructions

  1. Click on Dashboard to begin and test the application. Follow the instructions in the red box.
  2. End the message with a blank space so that the program gives a next-word prediction.
  3. Click on Model Results to see details of the N-gram prediction scores.
  4. Introduction, Notes, Appendix and References contain background information and the considerations made when building the prediction model.

Usage

  1. On mobile devices, to improve typing speed and the typing experience.
  2. In conjunction with speech recognition techniques, especially when the earlier words are recognised correctly but the last word is not.
  3. In real-time translation software, allowing communication among people speaking different languages.

Future Enhancements

  1. Text classification - Categorising the corpus documents and grouping similar words should improve prediction when more input words are available, so that prediction need not depend only on the last few words.
  2. Multi-word prediction - Could reduce the time taken to compose a message.
  3. Spelling correction - Could be implemented with the Minimum Edit Distance technique.
  4. Other languages - Prediction could be extended to other languages by using different word segmentation methods.
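
As an illustration of the spelling-correction idea, a minimal Python sketch of Minimum Edit Distance follows. It assumes the Levenshtein variant, with unit cost for insertion, deletion and substitution (some textbooks charge 2 for substitution). The `suggest` helper and its `max_distance` threshold are hypothetical and not part of the application.

```python
def edit_distance(source, target):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(target) + 1))  # distances from the empty prefix
    for i, s_char in enumerate(source, start=1):
        curr = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            curr.append(min(
                prev[j] + 1,         # delete s_char
                curr[j - 1] + 1,     # insert t_char
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return the closest vocabulary word, or None if nothing is close enough."""
    if not vocabulary:
        return None
    distance, best = min((edit_distance(word, v), v) for v in vocabulary)
    return best if distance <= max_distance else None


correction = suggest("conjuction", ["conjunction", "junction", "condition"])
```

A corrected word could then be passed to the n-gram lookup in place of the misspelt one, so that a typo in the input does not force an unnecessary backoff.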