Sudheer Paladugu
April 15, 2016
The word prediction algorithm predicts the next word for user input. The model is enhanced to run on all devices: desktops, tablets (iPads/Notes), and smartphones. It is optimized for quick response by reducing the memory footprint.
The dataset was downloaded from HC Corpora, and only the en_US data files were considered to build the prediction model for US English. As a first step, sample data sets were prepared with 0.1% of the data from the blogs, news, and Twitter files and saved as RData files for further processing.
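A minimal sketch of this sampling step, assuming the en_US files sit in the working directory (the file names and rbinom-based sampling are illustrative assumptions; the 0.1% fraction follows the description above):

```r
set.seed(1234)

# Read one en_US file and keep roughly 0.1% of its lines
sampleFile <- function(path, fraction = 0.001) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[as.logical(rbinom(length(lines), 1, fraction))]
}

sampleBlogs   <- sampleFile("en_US.blogs.txt")
sampleNews    <- sampleFile("en_US.news.txt")
sampleTwitter <- sampleFile("en_US.twitter.txt")

# Save the samples as RData for further processing
save(sampleBlogs, sampleNews, sampleTwitter, file = "sampleData.RData")
```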
Created a Corpus object ({tm} package) from the sample data, then cleaned the content by applying transformations to the Corpus: convert to lower case; remove numbers and punctuation; strip whitespace; remove links, repeated characters, and special characters; and filter profanity/swear words.
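A sketch of these transformations with {tm}, assuming the sampled vectors from the previous step; the profanity.txt word list and the custom transformers are placeholders for whichever list and patterns were actually used:

```r
library(tm)

corpus <- VCorpus(VectorSource(c(sampleBlogs, sampleNews, sampleTwitter)))

# Helper transformers: replace a pattern with a space / collapse repeats
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
collapseRepeats <- content_transformer(function(x) gsub("(.)\\1{2,}", "\\1", x))

corpus <- tm_map(corpus, content_transformer(tolower))      # lower case
corpus <- tm_map(corpus, toSpace, "http[^[:space:]]*")      # remove links
corpus <- tm_map(corpus, collapseRepeats)                   # repeated characters
corpus <- tm_map(corpus, removeNumbers)                     # remove numbers
corpus <- tm_map(corpus, removePunctuation)                 # remove punctuation
corpus <- tm_map(corpus, toSpace, "[^[:alpha:][:space:]]")  # special characters
profanity <- readLines("profanity.txt")                     # placeholder word list
corpus <- tm_map(corpus, removeWords, profanity)            # profanity filter
corpus <- tm_map(corpus, stripWhitespace)                   # strip whitespace
```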
N-Gram Tokens: 1-gram, 2-gram, 3-gram, and 4-gram tokens were created for the Corpus data using NGramTokenizer ({RWeka} package). Single-, double-, triple-, and four-word matrices were generated by passing the tokenizers in the control list of TermDocumentMatrix. Sparse terms were removed at 0.1 sparsity, and all n-grams were saved as RData files.
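A sketch of the tokenization step, shown for the bigram case and assuming the corpus object from above; the same pattern applies for 1- through 4-grams, and the 0.1 sparsity value follows the description above:

```r
library(RWeka)
library(tm)

# n-gram tokenizer built on RWeka's NGramTokenizer
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# Pass the tokenizer through the control list of TermDocumentMatrix
tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))
tdm2 <- removeSparseTerms(tdm2, 0.1)   # drop sparse terms at 0.1 sparsity

save(tdm2, file = "bigrams.RData")
```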
Created data tables for the n-grams, holding the word(s) and their frequency. The frequency is used to find the more frequent word(s) among the n-gram tokens. A probability column 'p' was added for each word, which is used to list the top 3 words for prediction. The 2-gram, 3-gram, and 4-gram tokens were split into individual word columns t1, t2, t3, and t4 respectively.
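One way to build such a table with {data.table}, assuming the tdm2 term-document matrix from the previous step (the token/frequency column names are illustrative; t1 and t2 follow the convention above):

```r
library(data.table)

# Collapse the term-document matrix into term/frequency pairs
freq <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
dt2  <- data.table(token = names(freq), frequency = freq)

# Probability of each n-gram, later used to rank the top 3 predictions
dt2[, p := frequency / sum(frequency)]

# Split the 2-gram token into individual word columns t1 and t2
dt2[, c("t1", "t2") := tstrsplit(token, " ", fixed = TRUE)]
```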
Ordering Tokens
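The ordering can be done in place with {data.table}; a sketch, assuming the dt2 table built above, so that the top 3 candidates can be read off the head of each table:

```r
library(data.table)

# Order the n-gram table by descending frequency (in place)
setorder(dt2, -frequency)

# The most frequent completions now sit at the top
head(dt2, 3)
```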
The rm() function is used to delete variable references after usage, so that memory is flushed/cleared for re-use.
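For example, once the n-gram tables are saved to RData files, the large intermediate objects can be dropped (calling gc() afterwards is one way to prompt R to return the freed memory):

```r
# Free memory once intermediate objects are saved to disk
rm(corpus, tdm2)
gc()
```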
The word prediction algorithm is an implementation of a backoff algorithm. Prediction traverses from the 4-gram down to the 1-gram tokens, backing off to a shorter n-gram whenever no match is found. The top 3 highest-frequency words are returned as output. Prediction algorithm flow -
The prediction model takes at most three words, i.e. the last three words of longer user input, to predict the next word (see the sketch below).
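A minimal sketch of this backoff lookup, assuming data.tables uni, bi, tri, and quad with the word columns t1..t4 described above, each already ordered by descending frequency (the function name predictNext is hypothetical):

```r
library(data.table)

# Hypothetical backoff lookup over pre-built n-gram tables.
# quad/tri/bi hold columns t1..t4 / t1..t3 / t1..t2, ordered by frequency.
predictNext <- function(input, quad, tri, bi, uni) {
  words <- tail(unlist(strsplit(tolower(input), "\\s+")), 3)  # last 3 words
  n <- length(words)
  if (n >= 3) {                                   # try a 4-gram match first
    hit <- quad[t1 == words[n - 2] & t2 == words[n - 1] & t3 == words[n]]
    if (nrow(hit) > 0) return(head(hit$t4, 3))
  }
  if (n >= 2) {                                   # back off to 3-grams
    hit <- tri[t1 == words[n - 1] & t2 == words[n]]
    if (nrow(hit) > 0) return(head(hit$t3, 3))
  }
  if (n >= 1) {                                   # back off to 2-grams
    hit <- bi[t1 == words[n]]
    if (nrow(hit) > 0) return(head(hit$t2, 3))
  }
  head(uni$t1, 3)                                 # fall back to top unigrams
}
```

For instance, predictNext("thanks for the", quad, tri, bi, uni) would first look for a 4-gram match on the three input words and only back off to shorter n-grams if none is found.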
Model Improvements -
References

- Text Mining Infrastructure in R
- Natural Language Processing
- N-Gram
- Katz's back-off model
- Good-Turing Frequency Estimation