Next Word Prediction

Kanti Chalasani
04/26/2015

The Next Word Prediction application suggests a likely next word in a sentence as users type on their smart devices.

Next Word Prediction (NWP) Application - PREVIEW! The NWP application is very easy to use, as you can see below. Access this application here. Access the source code here.

Application

Data Processing

News, Twitter and Blogs data in English from HC Corpora (over 4 million lines) are loaded.
R routines were built for data processing, data cleansing and analysis.
Data Sampling - 70% Training and 30% Testing/Cross Validation - from each type of data (Twitter, news and blogs)
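
The sampling step can be illustrated with a small sketch. It is written in Python for illustration only (the application itself uses R routines), and the function name `split_lines`, the fixed seed and the toy data are assumptions, not part of the original project.

```python
import random

def split_lines(lines, train_fraction=0.7, seed=42):
    """Shuffle a copy of `lines` and split it into (train, test) lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(lines)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy stand-ins for the three sources (hypothetical data, not the real corpus).
sources = {
    "twitter": [f"tweet {i}" for i in range(10)],
    "news": [f"news line {i}" for i in range(20)],
    "blogs": [f"blog line {i}" for i in range(30)],
}

train, test = [], []
for name, lines in sources.items():
    tr, te = split_lines(lines)  # 70/30 split within each source
    train.extend(tr)  # merged training ("seen") set
    test.extend(te)   # held-out ("unseen") set
```

Splitting each source separately before merging keeps the 70/30 ratio the same for Twitter, news and blogs, rather than letting the largest source dominate the held-out set.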

N-Grams - Language Model

Build an ngram language model on the merged training (“seen”) data set.
- Ngram : [Phrase (word1 … wordn); frequency count; Conditional Probability]
The conditional probability of a word given the words preceding it is estimated from frequency counts in the training data: the count of the full phrase divided by the count of the preceding words.
- p(awesome|This project is)= count(This project is awesome)/count(This project is)
- p(there|Hi)= count(Hi there)/count(Hi)
Ngram Reduction - Ngram model size was reduced by omitting the lower-frequency ngrams.
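
As a rough illustration of the counts, the conditional probability formula and the reduction step, here is a minimal Python sketch. The helper names (`tokenize`, `count_ngrams`, `conditional_probability`, `prune`), the whitespace tokenizer and the pruning threshold are assumptions made for the example; the original R routines and their cleaning rules are not shown in this deck.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a simplification of real cleaning)."""
    return text.lower().split()

def count_ngrams(sentences, n):
    """Count every run of n consecutive tokens, sentence by sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def conditional_probability(ngram, ngram_counts, prefix_counts):
    """p(last word | preceding words) = count(ngram) / count(prefix)."""
    prefix = ngram[:-1]
    if prefix_counts[prefix] == 0:
        return 0.0
    return ngram_counts[ngram] / prefix_counts[prefix]

def prune(counts, min_count):
    """Drop ngrams seen fewer than min_count times (threshold is hypothetical)."""
    return Counter({g: c for g, c in counts.items() if c >= min_count})

corpus = ["Hi there", "Hi there friend", "Hi all"]  # toy data
unigrams = count_ngrams(corpus, 1)
bigrams = count_ngrams(corpus, 2)

# p(there|hi) = count(hi there) / count(hi) = 2 / 3
p = conditional_probability(("hi", "there"), bigrams, unigrams)
reduced = prune(bigrams, min_count=2)  # keeps only ("hi", "there")
```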

Prediction Algorithm - Katz Backoff

Get the input phrase, then clean and tokenize it.
If the phrase has three or more words, look for evidence in quadgrams (fourgrams) and use it when found;
Otherwise back off to the trigram with the highest conditional probability and use it when evidence is found;
Otherwise back off to bigrams; if no bigram matches, use the maximum likelihood estimate (MLE) of a partially matched unigram.
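
The lookup order above can be sketched as follows. This is a simplified Python illustration, not the application's R code: it follows the quadgram, trigram, bigram, unigram order but applies no Katz discounting, and its final fallback is simply the most frequent unigram rather than a partially matched one. The names `build_model` and `predict_next` and the toy corpus are assumptions.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_order=4):
    """Map each context (0 to max_order - 1 preceding words) to next-word counts."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for k in range(max_order):
                if i - k < 0:
                    break
                model[tuple(tokens[i - k:i])][word] += 1
    return model

def predict_next(phrase, model, max_order=4):
    """Try the longest available context first, then back off to shorter ones."""
    tokens = phrase.lower().split()
    for k in range(min(max_order - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - k:])
        followers = model.get(context)
        if followers:
            # Within one context, the highest count is also the highest
            # conditional probability.
            return followers.most_common(1)[0][0]
    return None

corpus = [  # toy data
    "this project is awesome",
    "this project is awesome",
    "this project is hard",
    "the weather is nice",
    "the weather is nice",
    "the weather is nice",
    "the weather is nice",
]
model = build_model(corpus)

quad_hit = predict_next("This project is", model)  # evidence found in quadgrams
bigram_hit = predict_next("my dog is", model)      # backs off to the bigram "is ..."
unigram_hit = predict_next("zebra", model)         # backs off to unigrams
```

In this toy corpus the three calls return "awesome", "nice" and "is" respectively, showing how the same phrase ending ("is") can yield different predictions depending on how much of the context was seen in training.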

Additional Features

While the first word is being typed, the algorithm uses unigrams to predict: when a match is found it returns that word; otherwise it returns “the” or “and” as beginning words.
When the first space is entered, it looks for evidence in bigrams and uses it when found; if not, it backs off to unigrams. In general, the number of spaces in the phrase determines the highest-order ngram to start with before backing off to lower-order ngrams.
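
A small Python sketch of these two behaviours, under stated assumptions: the function names `starting_order` and `complete_first_word` are hypothetical, runs of whitespace are counted as one space, and when no unigram matches both default words are returned as suggestions (the deck does not say how the application chooses between them).

```python
import re
from collections import Counter

DEFAULT_FIRST_WORDS = ["the", "and"]

def starting_order(phrase, max_order=4):
    """Highest ngram order to try first: number of spaces + 1, capped at max_order."""
    spaces = len(re.findall(r"\s+", phrase.lstrip()))
    return min(spaces + 1, max_order)

def complete_first_word(partial, unigram_counts):
    """Suggest the most frequent unigram starting with `partial`, else the defaults."""
    partial = partial.lower()
    matches = {w: c for w, c in unigram_counts.items() if w.startswith(partial)}
    if matches:
        return [max(matches, key=matches.get)]
    return list(DEFAULT_FIRST_WORDS)

unigram_counts = Counter({"this": 5, "the": 9, "project": 3})  # toy counts

order_typing = starting_order("th")                # 1: still typing the first word
order_one_space = starting_order("this ")          # 2: start with bigrams
order_capped = starting_order("a b c d e ")        # 4: capped at quadgrams
suggestion = complete_first_word("th", unigram_counts)
fallback = complete_first_word("zz", unigram_counts)
```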

Future Enhancements

Spell correction; Handle numeric data; Process sentences
Improve algorithm performance
Incorporate parallel computing options

Accuracy & Performance

The language model was evaluated on partitioned test sets from the “unseen” test/validation data. Two thousand data partitions, each from news, blogs, twitter and combined data, were used for evaluation. About 250 phrases were included in each partition.
The 95% confidence interval for single-word prediction accuracy on combined data is 15.4% to 15.7%. Prediction accuracy improves to an average of 26.8% when up to four words are predicted.
The prediction model predicts the next word in about 0.03 sec.
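
The deck does not state how the confidence interval was computed. One common approach, shown here purely as an assumption, is a normal-approximation interval around the mean of the per-partition accuracies. The function name `mean_confidence_interval` and the input values are made up for illustration and are not the project's results.

```python
import math
import statistics

def mean_confidence_interval(values, z=1.96):
    """Normal-approximation 95% interval for the mean of `values`."""
    mean = statistics.mean(values)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean - z * sem, mean + z * sem

# Made-up per-partition accuracies (fraction of phrases predicted correctly).
partition_accuracies = [0.10, 0.20, 0.30]
low, high = mean_confidence_interval(partition_accuracies)
```

With many partitions the standard error shrinks, which is consistent with a narrow interval when two thousand partitions are used.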