Capstone Project: Word Prediction Algorithm

Sudheer Paladugu
April 15, 2016

The word prediction algorithm predicts the next word for user input. The model was built to run on a range of devices: desktops, tablets (iPads/Notes), and smartphones. It is optimized for quick response by reducing the memory footprint.

Getting and Processing Data

The dataset was downloaded from HC Corpora, and only the en_US data files were considered in order to build the prediction model for US English. As a first step, sample data sets were prepared with 0.1% of the data from the blogs, news, and Twitter files and saved as RData files for further processing.
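
A minimal sketch of this sampling step is shown below; the file paths, the rbinom-based 0.1% sampling, and the 'sampleData.RData' file name are assumptions for illustration, not the exact code used.

    # Draw a ~0.1% random sample of lines from each en_US file
    set.seed(1234)
    sample_lines <- function(path, rate = 0.001) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      lines[rbinom(length(lines), 1, rate) == 1]
    }

    blogs   <- sample_lines("final/en_US/en_US.blogs.txt")
    news    <- sample_lines("final/en_US/en_US.news.txt")
    twitter <- sample_lines("final/en_US/en_US.twitter.txt")
    save(blogs, news, twitter, file = "sampleData.RData")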

A Corpus object ({tm} package) was created from the sample data, and its content was cleaned by applying a series of transformations: converting to lower case; removing numbers and punctuation; stripping whitespace; removing links, repeated characters, and special characters; and removing profanity/swear words.
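
The pipeline below sketches these transformations with {tm}; the removeURLs and removeRepeats helpers are illustrative names, and 'badwords' is assumed to be a profanity word list loaded separately.

    library(tm)

    corpus <- VCorpus(VectorSource(c(blogs, news, twitter)))

    # Custom transformations: strip links and collapse repeated characters
    removeURLs    <- content_transformer(function(x) gsub("http\\S+|www\\.\\S+", " ", x))
    removeRepeats <- content_transformer(function(x) gsub("(.)\\1{2,}", "\\1", x))

    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeURLs)
    corpus <- tm_map(corpus, removeRepeats)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, badwords)   # 'badwords' = profanity list
    corpus <- tm_map(corpus, stripWhitespace)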

N-Gram Tokens: 1-gram, 2-gram, 3-gram, and 4-gram tokens were created for the corpus data using NGramTokenizer ({RWeka} package). Single-, double-, triple-, and four-word matrices were generated by passing the tokenizers as a control list to TermDocumentMatrix ({tm} package). Sparse terms were removed with 0.1 sparsity, and all n-grams were saved as RData files.
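
A sketch of the tokenization step, assuming the 'corpus' object from above; the tokenizer and matrix names are illustrative.

    library(RWeka)

    # One tokenizer per n-gram order, passed to TermDocumentMatrix as a control list
    bigramTokenizer   <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    trigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
    quadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))

    tdm1 <- TermDocumentMatrix(corpus)
    tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))
    tdm3 <- TermDocumentMatrix(corpus, control = list(tokenize = trigramTokenizer))
    tdm4 <- TermDocumentMatrix(corpus, control = list(tokenize = quadgramTokenizer))

    tdm4 <- removeSparseTerms(tdm4, 0.1)   # repeated for each matrix
    save(tdm1, tdm2, tdm3, tdm4, file = "ngrams.RData")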

Probability and Ordering Tokens

Data tables were created for the n-grams, holding the word(s) and their frequency. Frequency is used to find the most frequent word(s) among the n-gram tokens. A probability column 'p' was added for each word, which is used when listing the top 3 words for prediction. The 2-gram, 3-gram, and 4-gram tokens were split into individual word columns t1, t2, t3, and t4 respectively.
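
The snippet below sketches this for the 4-gram table (the same pattern applies to the other n-grams); the 'quadgram' name and the tstrsplit-based split are illustrative.

    library(data.table)

    # Frequency table from the term-document matrix, plus probability column 'p'
    freq     <- rowSums(as.matrix(tdm4))
    quadgram <- data.table(word = names(freq), freq = freq)
    quadgram[, p := freq / sum(freq)]

    # Split the four-word token into columns t1..t4
    quadgram[, c("t1", "t2", "t3", "t4") := tstrsplit(word, " ", fixed = TRUE)]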

Ordering Tokens

  • Set an index as the key for the 1-gram table.
  • Re-order the 2-gram, 3-gram, and 4-gram tokens using data.table's 'setorderv' on the given columns, such as word and probability.
  • List the top 3 entries from the n-gram tokens by words, frequency, and probability.
  • Update the column names.
  • Bind all n-gram tokens and save as 'model.RData' (a sketch of these steps follows this list).
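
A sketch of these ordering steps, assuming data.table objects named unigram, bigram, trigram, and quadgram built as in the previous sketch:

    setkey(unigram, word)   # index key for the 1-gram table

    # Sort by prefix words ascending, probability descending
    setorderv(bigram,   c("t1", "p"),             order = c(1, -1))
    setorderv(trigram,  c("t1", "t2", "p"),       order = c(1, 1, -1))
    setorderv(quadgram, c("t1", "t2", "t3", "p"), order = c(1, 1, 1, -1))

    # Keep only the top 3 candidates per prefix, then save the model
    bigram   <- bigram[,   head(.SD, 3), by = t1]
    trigram  <- trigram[,  head(.SD, 3), by = .(t1, t2)]
    quadgram <- quadgram[, head(.SD, 3), by = .(t1, t2, t3)]

    save(unigram, bigram, trigram, quadgram, file = "model.RData")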

The 'rm()' function is used to delete variable references after use, so that the memory can be flushed and re-used.
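
For example (object names follow the sketches above):

    rm(corpus, tdm1, tdm2, tdm3, tdm4, freq)   # drop intermediates
    gc()                                       # ask R to return the freed memory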

Prediction Algorithm

The word prediction algorithm is an implementation of a backoff algorithm. Prediction traverses from the 4-gram down to the 1-gram tokens, and the top 3 highest-frequency words are returned as output. The prediction algorithm flow is:

  • User input is cleaned and passed to the prediction model.
  • An input of 3 words is searched for predictions in the 4-gram model (grdgram[t1==w1 & t2==w2 & t3==w3, t4]), which returns the output. Otherwise the search continues in the 3-gram model using the last n-1 and n-2 words, and then down to the 2-gram and 1-gram models if no predictions are found in the hierarchy.
  • The 1-gram tokens are searched if no prediction is found in the higher-gram tokens, and the top 3 high-frequency words are listed as output.
  • Predictions are ordered by frequency and probability in decreasing order, and 3 words are returned (a sketch of this lookup follows the list).
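
A sketch of this backoff lookup, assuming the table and column names used in the earlier sketches and an input that has already been cleaned:

    predictWord <- function(w1, w2, w3) {
      # Try the 4-gram first, then back off to shorter contexts
      pred <- quadgram[t1 == w1 & t2 == w2 & t3 == w3, t4]
      if (length(pred) == 0) pred <- trigram[t1 == w2 & t2 == w3, t3]
      if (length(pred) == 0) pred <- bigram[t1 == w3, t2]
      if (length(pred) == 0) pred <- unigram[order(-freq), word]  # most frequent words
      head(pred, 3)
    }

Because each table was pre-sorted by probability within its prefix, head(pred, 3) returns the top candidates directly.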

Application Usage Instructions

The prediction model takes at most 3 words (the last three words of longer user input) to predict the next word.
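
A small illustrative helper for this step (the function name is hypothetical):

    # Reduce arbitrary input to its last three lower-cased words
    lastWords <- function(text, n = 3) {
      words <- unlist(strsplit(tolower(text), "\\s+"))
      words <- words[words != ""]
      c(rep(NA, max(0, n - length(words))), tail(words, n))  # pad short input on the left
    }

    lastWords("I would like a cup of")   # returns "a" "cup" "of"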

  • Enter some text in the text box provided.
  • Click the 'Predict' button to run the model on the input text.
  • The model's output predictions are printed in the right-side panel. Screenshots of the Shiny application are attached below.
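
A minimal sketch of such a Shiny app layout is shown below; the widget ids and the predictWord()/lastWords() helpers are assumptions carried over from the earlier sketches, not the deployed app's code.

    library(shiny)

    ui <- fluidPage(
      titlePanel("Word Prediction"),
      sidebarLayout(
        sidebarPanel(
          textInput("text", "Enter some text:"),
          actionButton("predict", "Predict")
        ),
        mainPanel(textOutput("predictions"))
      )
    )

    server <- function(input, output) {
      # Re-run the prediction only when the button is clicked
      result <- eventReactive(input$predict, {
        w <- lastWords(input$text)
        paste(predictWord(w[1], w[2], w[3]), collapse = ", ")
      })
      output$predictions <- renderText(result())
    }

    shinyApp(ui, server)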

Model Improvements and References

Model Improvements

  • Smoothing combined with the backoff model implementation would provide more accurate predictions for the input.
  • A number-based search algorithm, instead of text-based search, would considerably improve the prediction model's performance.
  • The {tm} package's stemDocument and dictionary features would provide a more precise search for English words.

References
  • Text Mining Infrastructure in R
  • Natural Language Processing
  • N-Gram
  • Katz's Back-off Model
  • Good-Turing Frequency Estimation