Predictive Text Modeling: A Statistical Application in Python and R

Alexander Lee
August 20, 2015

Problem and Approach

The problem: Devise a model that will predict a user's next desired word based on arbitrary text input

Givens:

  • 3 text corpora in .txt format
  • A requirement that the front-end application be built in R Shiny
  • a computer
  • my wits

My approach: Model the corpora in Python, then build the prediction logic in R

Tools:

  • Python used for text cleaning, frequency analysis, data formatting, and function prototyping
  • R used for final prediction model and Shiny application functionality

Corpora Modeling in Python

Raw text data were first processed in Python as follows:

  • Cleaned text using regular expressions
  • Split cleaned text into semantically ordered chunks using natural language separators (punctuation, line endings)
  • Mined the semantically ordered text for n-gram (word sequence) frequencies
  • Pruned the data to balance size against accuracy: term frequency was measured against the volume of terms in the corpus to determine the cutoff for dropping low-frequency terms
  • Exported cleaned term frequency data (term, term length, term frequency, leading lookup key, trailing prediction) for use in R
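
The processing steps above can be sketched roughly as follows. This is a minimal illustration, not the original code: the function names, the regular expressions, the maximum n-gram length of 4, and the fixed `min_count` cutoff are all assumptions. N-grams start at length 2 here so that every term has both a lookup key and a prediction.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase the text and replace anything other than letters,
    apostrophes, whitespace, and separator punctuation with a space."""
    return re.sub(r"[^a-z'\s.,;:!?]", " ", text.lower())

def split_chunks(text):
    """Split on punctuation and line endings so that no n-gram
    spans a sentence or clause boundary."""
    chunks = re.split(r"[.,;:!?\n]+", text)
    return [chunk.split() for chunk in chunks if chunk.strip()]

def count_ngrams(chunks, max_n=4):
    """Count every n-gram of length 2..max_n within each chunk."""
    counts = Counter()
    for words in chunks:
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

def cutoff_tradeoff(counts, cutoffs=(1, 2, 3, 5)):
    """For each candidate cutoff, report (cutoff, distinct terms kept,
    share of all term occurrences kept)."""
    total = sum(counts.values())
    if total == 0:
        return []
    report = []
    for k in cutoffs:
        kept = [c for c in counts.values() if c >= k]
        report.append((k, len(kept), sum(kept) / total))
    return report

def prune(counts, min_count=2):
    """Drop terms seen fewer than min_count times (hypothetical cutoff)."""
    return {term: c for term, c in counts.items() if c >= min_count}

def to_rows(counts):
    """Build export rows: (term, term length, term frequency,
    leading lookup key, trailing prediction)."""
    return [
        (" ".join(term), len(term), freq, " ".join(term[:-1]), term[-1])
        for term, freq in counts.items()
    ]

# Usage on a toy corpus
sample = "I like green eggs. I like green ham!\nI like toast"
rows = to_rows(prune(count_ngrams(split_chunks(clean(sample)))))
```

The rows can then be written out with the standard `csv.writer` for loading into R. `cutoff_tradeoff` is one simple way to look at how many terms, and how much of the corpus volume, survive at each candidate cutoff before choosing one.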

Prediction Algorithm in R

Algorithm logic:

  • Seeks exact term matches of the greatest length
  • Steps down term length if no matches found
  • Performs fuzzy match against input if no exact matches found
  • Favors longest matching sequences of the highest frequency
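
The final model was built in R, but the logic can be prototyped in Python along these lines. This is a sketch under stated assumptions, not the production algorithm: `build_index` and `predict` are hypothetical names, the rows follow the exported column layout (term, term length, term frequency, lookup key, prediction), and the fuzzy step uses the standard library's `difflib.get_close_matches` with an arbitrary 0.8 similarity cutoff as a stand-in for whatever fuzzy matching the R model uses.

```python
from collections import defaultdict
from difflib import get_close_matches

def build_index(rows):
    """Map each lookup key to its (prediction, frequency) pairs,
    sorted by frequency, highest first."""
    index = defaultdict(list)
    for _term, _length, freq, key, prediction in rows:
        index[key].append((prediction, freq))
    for pairs in index.values():
        pairs.sort(key=lambda pair: -pair[1])
    return dict(index)

def _collect(keys, index):
    """Gather predictions for the given keys in order, skipping duplicates."""
    found = []
    for key in keys:
        for prediction, _freq in index.get(key, []):
            if prediction not in found:
                found.append(prediction)
    return found

def predict(text, index, max_key_len=3, top_k=4):
    """Return up to top_k candidate next words for the input text."""
    words = text.lower().split()
    longest = min(max_key_len, len(words))
    # Candidate keys from longest to shortest trailing word sequence.
    keys = [" ".join(words[-n:]) for n in range(longest, 0, -1)]

    # 1. Exact matches: longer keys first, then by frequency within a key.
    candidates = _collect(keys, index)

    # 2. Fuzzy fallback, used only when no exact match exists at any length.
    if not candidates:
        known_keys = list(index)
        fuzzy_keys = []
        for key in keys:
            fuzzy_keys.extend(get_close_matches(key, known_keys, n=1, cutoff=0.8))
        candidates = _collect(fuzzy_keys, index)

    return candidates[:top_k]

# Usage with a few hand-written rows
rows = [
    ("i like", 2, 3, "i", "like"),
    ("like toast", 2, 5, "like", "toast"),
    ("like green", 2, 2, "like", "green"),
    ("i like green", 3, 2, "i like", "green"),
    ("i like toast", 3, 1, "i like", "toast"),
]
index = build_index(rows)
suggestions = predict("Well I like", index)
```

In this toy example "green" ranks ahead of "toast" for the input "Well I like": it comes from the longer key "i like", even though "toast" has the higher frequency after the shorter key "like".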

Algorithm performance:

  • In-sample*: 26% exact match, 34% in top 4 matches against known next word
  • Out-of-sample*: 8% exact match, 16% in top 4 matches against known next word

*In-sample text randomly selected from raw corpus data; out-of-sample text randomly selected from arbitrary Google News / Twitter content outside of provided corpora
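
A measurement of this kind can be sketched as below. The names `make_cases` and `evaluate` are hypothetical, and the sketch assumes that "exact match" means the top-ranked prediction equals the known next word. Each test case is made by cutting a sampled sentence at a random point and holding out the word that follows.

```python
import random

def make_cases(sentences, seed=0):
    """Cut each sentence at a random position and return
    (context, known next word) pairs."""
    rng = random.Random(seed)
    cases = []
    for sentence in sentences:
        words = sentence.lower().split()
        if len(words) < 2:
            continue  # need at least one context word and one target
        cut = rng.randint(1, len(words) - 1)
        cases.append((" ".join(words[:cut]), words[cut]))
    return cases

def evaluate(cases, predict_fn, top_k=4):
    """Return (exact-match rate, top-k rate) for a prediction function
    that maps a context string to a ranked list of words."""
    if not cases:
        return (0.0, 0.0)
    exact = in_top = 0
    for context, target in cases:
        guesses = predict_fn(context)[:top_k]
        if guesses and guesses[0] == target:
            exact += 1
        if target in guesses:
            in_top += 1
    return (exact / len(cases), in_top / len(cases))

# Usage with a toy lookup table standing in for the real model
toy_model = {"i like": ["green", "toast"], "we eat": ["ham"]}
toy_cases = [
    ("i like", "green"),
    ("i like", "toast"),
    ("we eat", "eggs"),
    ("you see", "it"),
]
exact_rate, top4_rate = evaluate(toy_cases, lambda c: toy_model.get(c, []))
```

Running the same function once on sentences drawn from the corpora and once on sentences from outside them gives the in-sample and out-of-sample figures side by side.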

Final Model and Front-End Application

The (final-ish) product