Word Prediction

Steve Senior
August 2015

Rationale

Text input is central to human-computer interaction. However, typing text on mobile devices is fiddly and not user-friendly. Predictive text software can ease this. Predicting the next word in a sentence can also help speech recognition software choose more easily between the possible words it hears.

The market for this kind of software is likely to be very big. According to Ofcom (the UK's telecommunications regulator), 93% of adults in the UK own a mobile phone. This figure is likely to be similar in other developed countries. In less developed countries, mobile phones are becoming even more central to personal and business lives, as broadband and other telecommunications infrastructure is often lacking or low in quality.

Project aims

The aims of this project are to:

Load and analyse natural language data provided for the project;
Build software that can extract ngrams of length 1-5 from each sentence in the corpus;
Build a simple prediction algorithm that predicts the next word using ngrams seen in training data; and
Deploy this algorithm in a shiny app that is usable on the web and which allows the user to explore the different parameters in the app.
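
As an illustration of the second aim, here is a minimal sketch of ngram extraction. It is written in Python for brevity and is not the project's own R code. The cleaning rule (lowercase, keep only letters and apostrophes) and the function names `clean` and `extract_ngrams` are assumptions made for this example.

```python
import re

def clean(sentence):
    """Lowercase the sentence and split it into word tokens.

    Assumed cleaning rule: keep only runs of letters and apostrophes.
    """
    return re.findall(r"[a-z']+", sentence.lower())

def extract_ngrams(sentence, max_n=5):
    """Return every ngram of length 1..max_n in the sentence, as tuples."""
    tokens = clean(sentence)
    ngrams = []
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            ngrams.append(tuple(tokens[i:i + n]))
    return ngrams

ngrams = extract_ngrams("The cat sat on the mat")
print(len(ngrams))   # number of ngrams of length 1-5 in a six-word sentence
print(ngrams[:3])
```

Counting how often each tuple occurs across the corpus (for example with `collections.Counter`) would then give frequency tables of the kind the prediction step relies on.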

Description of algorithm

The input sentence fragment is cleaned and used to identify ngrams of length 1-5 that represent the final n words in the fragment.
Using ngram frequency tables generated from a sample of the training data provided, the conditional probability of the next word is calculated. Ngrams seen fewer than k times are discounted (ngrams seen once are always discounted for performance reasons).
Probabilities assigned to shorter ngrams are reduced by a factor, d (default 0.1), for each word by which the ngram is shorter than five. For example, length four ngrams have their probabilities reduced by a factor of d; length three ngrams have their probabilities reduced by a factor of d² and so on.
The word with the highest probability is returned as the prediction. The five most likely words are also returned.
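
The steps above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the app's R code. It assumes that "reduced by a factor of d" means multiplying by d, that an ngram of length n is matched against the final n-1 words of the fragment, that "discounted" rare ngrams are dropped before probabilities are computed, and that a word's score is its best score across ngram lengths. The names `build_tables` and `predict` and the toy training data are hypothetical.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=5):
    """Count every ngram of length 1..max_n in a list of tokenised sentences."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def predict(fragment, counts, k=2, d=0.1, max_n=5, top=5):
    """Return up to `top` candidate next words, best first."""
    tokens = fragment.lower().split()
    # Drop ngrams seen fewer than k times; ngrams seen once are always dropped.
    threshold = max(k, 2)
    kept = {g: c for g, c in counts.items() if c >= threshold}
    scores = defaultdict(float)
    for n in range(1, max_n + 1):
        if n - 1 > len(tokens):
            break  # the fragment is too short to supply this much context
        context = tuple(tokens[len(tokens) - (n - 1):])
        matches = {g: c for g, c in kept.items()
                   if len(g) == n and g[:-1] == context}
        total = sum(matches.values())
        for g, c in matches.items():
            # Conditional probability, scaled by d once per word short of max_n.
            score = (c / total) * d ** (max_n - n)
            # Assumption: keep the best score seen for each word.
            scores[g[-1]] = max(scores[g[-1]], score)
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top]

# Toy training data, invented for this example.
training = [
    ["i", "like", "green", "tea"],
    ["i", "like", "green", "tea"],
    ["i", "like", "green", "tea"],
    ["i", "like", "green", "apples"],
    ["i", "like", "green", "apples"],
]
counts = build_tables(training)
print(predict("I like green", counts))
```

Raising k removes rarer ngrams from consideration, and lowering d makes the prediction lean more heavily on the longest matching ngram.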

Using the app

The app provides two user interfaces, a basic and an advanced one.

The basic interface allows the user to enter a sentence fragment and predicts a result once the 'Submit' button is pressed.
The advanced interface also allows the user to alter the minimum frequency that ngrams must have in the training data to be included, and the discount rate for shorter ngrams.

To use the app, just type some words and press 'Submit'!

The app works by loading ngram frequency tables pre-saved as an R object. The app also contains the prediction function described on the previous slide.