predictIV

Alexander Spray
2016-10-08

Ngram prediction

This application, available at the link below, presents a simple ngram text prediction algorithm: given an input string of n-1 words, the most common nth word is presented.

https://roastmoose.shinyapps.io/textPred/

  • Raw text data is recursively transformed into ngram-result pairs
  • ngrams containing invalid characters are trimmed
  • ngrams with exceedingly rare occurrence are trimmed
  • the most common response for each antecedent is stored in a “model” of ngram predictor-response pairs (sketched below)
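A minimal sketch of this pipeline in R, assuming the corpus has already been read into a character vector lines; the function name build_ngrams and the min_count cutoff are illustrative assumptions, not the application's actual code:

    library(data.table)

    build_ngrams <- function(lines, n = 3, min_count = 2) {
      ## drop invalid characters, lower-case, and split into a word stream
      words <- unlist(strsplit(gsub("[^a-z' ]", " ", tolower(lines)), "\\s+"))
      words <- words[words != ""]
      ## slide a window of n words over the stream
      idx <- seq_len(length(words) - n + 1)
      dt <- data.table(
        predictor = vapply(idx, function(i) paste(words[i:(i + n - 2)], collapse = " "),
                           character(1)),
        response  = words[idx + n - 1]
      )
      ## count each (predictor, response) pair and trim exceedingly rare ones
      dt <- dt[, .N, by = .(predictor, response)][N >= min_count]
      ## keep only the most common response per predictor
      dt[order(-N), .SD[1], by = predictor]
    }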

Backoff model

The user string is tested for existence in the highest-order stored ngram space (here, a 3-gram model). If the user string does not exist there, we recursively trim the string until we find an existing lower-order ngram and generate a prediction from that restricted ngram model (see the sketch below). Our model is perhaps somewhat large (1 GB), but the time to execute is relatively small (median < 1 s). This is a compromise chosen to balance the size of the model, its accuracy, and the time to fit and evaluate it.
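The backoff step can be sketched as below, assuming one keyed data.table per ngram order, with models[[k]] holding k-word predictors (setkey(models[[k]], predictor)); the name predict_word and the final fallback word are illustrative assumptions:

    library(data.table)

    predict_word <- function(input, models) {
      tokens <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
      ## try the longest available predictor first, trimming one word at a time
      for (k in rev(seq_len(min(length(tokens), length(models))))) {
        key <- paste(tail(tokens, k), collapse = " ")
        hit <- models[[k]][.(key), nomatch = NULL]  # keyed binary search
        if (nrow(hit) > 0) return(hit$response[1])
      }
      "the"  # last resort: fall back to the most common English word
    }

    ## usage: predict_word("thanks for the", models) returns the predicted next word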

Fitting

The training corpus is very large, and its sheer size presents issues in fitting the model described on the previous slide. The corpus contains some 70 million unique words (for the English dataset alone), so some care was necessary in fitting. The data.table package provides data structures that are efficient, importantly, for search, keeping lookup time low during model evaluation.
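To illustrate the difference (not to reproduce the actual timings, which are machine-dependent): a keyed data.table lookup is a binary search over a sorted key, while the equivalent data.frame subset scans every row. The synthetic keys below are an assumption for demonstration only:

    library(data.table)

    n <- 1e7
    keys <- sprintf("ngram%09d", seq_len(n))
    df <- data.frame(predictor = keys, response = "word", stringsAsFactors = FALSE)
    dt <- as.data.table(df)
    setkey(dt, predictor)  # sort once; subsequent lookups are binary searches

    system.time(df[df$predictor == "ngram005000000", ])  # linear scan of 10^7 rows
    system.time(dt[.("ngram005000000")])                 # O(log n) keyed lookup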

[Plot: time to fit the model]

Time to fit under data.frame was impractical: more than 24 hours for a 40% sample.

Model Accuracy

The model was found to predict the next word correctly approximately 20% of the time.

  • assessed against random sets of 1000 sample strings (see the evaluation sketch below).
  • 21% was the maximum accuracy observed.
  • time to execute increases with model size and accuracy.
  • the fastest models reached accuracies of around 7-8%.

The model performs adequately against human prediction: a simple test of 100 strings performed by the researcher reached 15% accuracy.
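A minimal sketch of such an assessment, assuming a held-out data.table test with columns predictor (the input string) and actual (the true next word), and the predict_word sketch from the backoff slide; evaluate_accuracy is an illustrative name:

    ## fraction of held-out strings whose next word is predicted exactly
    evaluate_accuracy <- function(test, models) {
      preds <- vapply(test$predictor, predict_word, character(1), models = models)
      mean(preds == test$actual)
    }

    ## e.g. draw a random set of 1000 strings and report the hit rate:
    ## evaluate_accuracy(test[sample(.N, 1000)], models)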

[Plot: model accuracy]

Extensions

Due to time considerations, many possible improvements to the application remain unimplemented. These include, but are not limited to:

  • filtering offensive language
  • spell checking and correction
  • word prediction from partial words (completion)
  • formal backoff modeling
  • trimming redundant predictors (ngram and (n-1)gram equivalents, etc.)