Text Prediction through Data Science

joog2006
1/24/2016

Text prediction is used in a variety of applications, usually to let users enter text faster through autocomplete functions.
A variety of methods can be used for text prediction, but many popular methods rely on models built from the frequency of ngrams.
The ease of obtaining large corpora and the need for quick prediction make ngrams particularly attractive.

The training data came exclusively from the news corpus. My original plan was to use sentence structure and part of speech to make predictions, and the news corpus was more grammatically consistent than the Twitter or blog data.
Data were cleaned so that only letters remained. This means patterns with symbols and numbers will not be predicted, which might cause problems in some applications. For example, Twitter users often use hashtags denoted by “#”. Cleaning all symbols means that a hashtag can never be predicted.
The advantage of a smaller and simpler model that excludes symbols and numbers justified this cleaning.
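The cleaning step can be illustrated with a minimal Python sketch. It is not the code used to build the app; the function name `clean_text`, the lowercasing, and the choice to replace removed characters with a space are all assumptions made for illustration.

```python
import re

def clean_text(line: str) -> list[str]:
    """Return the word tokens of a line, keeping letters only.

    Assumptions (not stated in the slides): text is lowercased, and every
    non-letter character is replaced by a space before splitting.
    """
    letters_only = re.sub(r"[^a-z\s]", " ", line.lower())
    return letters_only.split()

# The hashtag symbol and the number are dropped, so neither can be predicted.
print(clean_text("Loving the #rstats meetup, 2 talks today!"))
```

Replacing symbols with a space splits contractions such as "don't" into two tokens; deleting the symbols instead would be an equally plausible choice.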

Ngram predictions are based on finding the probability of observing a word given the previous \( n-1 \) words. \[ P_{bo} (w_i \mid w_{i-n+1} \cdots w_{i-1}) \]
When multiple words are observed given the same set of preceding words, the \( w_i \) with the highest probability given the preceding words is selected.
Backoff models give the most weight to higher-order ngrams and fall back on down-weighted predictions from lower-order ngrams.
The version here places a zero weight on lower order ngrams, and simply uses the prediction based on the highest order ngram available.
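As a rough illustration of how such frequency tables could be precalculated, the sketch below counts ngrams in a toy token list and keeps only the most frequent next word for each context. The helper name `build_table` and the toy sentence are hypothetical. Because every candidate word for a given context shares the same denominator, picking the highest count is equivalent to picking the highest relative frequency.

```python
from collections import Counter, defaultdict

def build_table(tokens, n):
    """Map each (n-1)-word context to its most frequent next word."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])   # the preceding n-1 words
        counts[context][tokens[i + n - 1]] += 1
    # Keep one prediction per context: the word with the highest count.
    return {ctx: nxt.most_common(1)[0][0] for ctx, nxt in counts.items()}

tokens = "the cat sat on the mat and the cat ran".split()
bigrams = build_table(tokens, 2)
trigrams = build_table(tokens, 3)

print(bigrams[("the",)])        # cat
print(trigrams[("on", "the")])  # mat
```

This sketch ignores sentence boundaries and does not define how ties between equally frequent words are resolved; a real table would need a rule for both.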

Using the app is intuitive: the user only needs to start typing in the left-hand text box, and a prediction will render in the right-hand frame.

The app works as follows:

1. The app takes the user's input and parses it into a set of strings.
2. It looks up the last two words in a precalculated table of trigrams and returns the highest-frequency third word given the observation of the last two.
3. If step 2 fails to find a matching trigram, the last word alone is looked up in the bigram table.
4. If step 3 fails, the highest-frequency unigram, “the”, is used as the prediction.
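The lookup steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the app's actual code: the function name `predict`, the dictionary layout of the tables, and the toy entries are all hypothetical, and input parsing is reduced to lowercasing and splitting on whitespace.

```python
def predict(text, trigrams, bigrams, fallback="the"):
    """Return a next-word guess using the highest-order ngram that matches.

    trigrams: dict mapping (word1, word2) -> most frequent third word
    bigrams:  dict mapping word1 -> most frequent second word
    fallback: the most frequent unigram, used when nothing matches
    """
    words = text.lower().split()

    # Try the last two words against the trigram table.
    if len(words) >= 2:
        hit = trigrams.get((words[-2], words[-1]))
        if hit is not None:
            return hit

    # No trigram match: try the last word against the bigram table.
    if words:
        hit = bigrams.get(words[-1])
        if hit is not None:
            return hit

    # No bigram match either: fall back to the top unigram.
    return fallback

# Toy tables for illustration only.
toy_trigrams = {("going", "to"): "be"}
toy_bigrams = {"new": "york"}

print(predict("I am going to", toy_trigrams, toy_bigrams))  # be
print(predict("a brand new", toy_trigrams, toy_bigrams))    # york
print(predict("zebra", toy_trigrams, toy_bigrams))          # the
```

Storing only the single best word per context keeps each lookup to one dictionary access, which fits the need for quick prediction.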