Predictive Text Modeling: A Statistical Application in Python and R

Alexander Lee
August 20, 2015

Problem and Approach

The problem: Devise a model that will predict a user's next desired word based on arbitrary text input

Givens:

3 text corpora in .txt format
frontend application requirement of R Shiny
a computer
my wits

My approach: Model the corpora using Python's capabilities, and build the prediction logic in R

Tools:

Python used for text cleaning, frequency analysis, data formatting, and function prototyping
R used for final prediction model and Shiny application functionality

Corpora Modeling in Python

Raw text data were first processed in Python as follows:

Cleaned text using regular expressions
Split cleaned text into semantically ordered chunks using natural language separators (punctuation, line endings)
Mined the semantically ordered text for n-gram (word sequence) frequencies
Pruned the data to balance model size against accuracy: term frequency was measured against the volume of terms in the corpus to set the cutoff for dropping low-frequency terms
Exported cleaned term frequency data (term, term length, term frequency, leading lookup key, trailing prediction) for use in R
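
The processing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the original code: the function names, the character whitelist, the maximum n-gram length, and the fixed min_count cutoff are all assumptions (the actual cutoff was chosen analytically, as described above).

```python
import re
from collections import Counter

def clean(text):
    """Lowercase and replace anything that is not a letter, apostrophe,
    whitespace, or separator punctuation with a space (assumed whitelist)."""
    return re.sub(r"[^a-z'\s.,;:!?]", " ", text.lower())

def chunks(text):
    """Split on punctuation and line endings so that no n-gram spans
    a sentence or clause boundary."""
    return [c.split() for c in re.split(r"[.,;:!?\n]+", text) if c.strip()]

def ngram_counts(text, max_n=4):
    """Count every n-gram of length 2..max_n within each chunk."""
    counts = Counter()
    for words in chunks(clean(text)):
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

def prune(counts, min_count=2):
    """Drop low-frequency terms; min_count is a placeholder cutoff."""
    return {gram: c for gram, c in counts.items() if c >= min_count}

def to_rows(counts):
    """Build export rows: (term, term length, term frequency,
    leading lookup key, trailing prediction)."""
    return [
        (" ".join(gram), len(gram), c, " ".join(gram[:-1]), gram[-1])
        for gram, c in sorted(counts.items())
    ]
```

The rows returned by to_rows could then be written out with Python's csv module and read into R as a data frame.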

Prediction Algorithm in R

Algorithm logic:

Seeks exact term matches of the greatest length
Steps down term length if no matches found
Performs fuzzy match against input if no exact matches found
Favors longest matching sequences of the highest frequency
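
A Python prototype of this logic might look like the sketch below. The final model was built in R, so this is an illustration only: build_index, predict, and rank are hypothetical names, the fuzzy step uses difflib from the standard library as a stand-in for whatever fuzzy matcher was actually used, and the 0.8 similarity cutoff is an assumption.

```python
import difflib
from collections import defaultdict

def build_index(rows):
    """rows: iterable of (lookup_key, prediction, frequency).
    Returns a dict mapping lookup_key -> {prediction: frequency}."""
    index = defaultdict(dict)
    for key, pred, freq in rows:
        index[key][pred] = freq
    return dict(index)

def rank(candidates, top):
    """Order predictions by descending frequency, ties broken alphabetically."""
    return sorted(candidates, key=lambda w: (-candidates[w], w))[:top]

def predict(index, text, max_key_len=3, top=4):
    """Return up to `top` next-word predictions for `text`."""
    words = text.lower().split()
    longest = min(max_key_len, len(words))
    # 1. Exact match on the longest trailing word sequence,
    #    stepping down one word at a time if nothing matches.
    for n in range(longest, 0, -1):
        key = " ".join(words[-n:])
        if key in index:
            return rank(index[key], top)
    # 2. No exact match at any length: fall back to the closest
    #    known lookup key, again trying the longest sequence first.
    for n in range(longest, 0, -1):
        key = " ".join(words[-n:])
        close = difflib.get_close_matches(key, list(index), n=1, cutoff=0.8)
        if close:
            return rank(index[close[0]], top)
    return []
```

For example, with an index built from the rows ("i like", "green", 2) and ("i like", "you", 1), the input "Well I like" finds no three-word key, steps down to "i like", and returns "green" ahead of "you".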

Algorithm performance:

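In-sample*: top prediction matched the known next word 26% of the time; known next word was among the top 4 predictions 34% of the time
Out-of-sample*: top prediction matched the known next word 8% of the time; known next word was among the top 4 predictions 16% of the time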

*In-sample text randomly selected from raw corpus data; out-of-sample text randomly selected from arbitrary Google News / Twitter content outside of provided corpora
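
The two measures reported above (exact match and top-4 match) can be computed with a small helper like the following. It is a sketch under assumptions: evaluate is a hypothetical name, predict_fn stands for any function that returns a ranked list of predictions for a context string, and samples is a list of (context, known next word) pairs drawn however the test text was sampled.

```python
def evaluate(predict_fn, samples, top=4):
    """Return (exact_rate, top_rate) over (context, actual_next_word) pairs.

    exact_rate: share of samples where the first prediction is the actual word.
    top_rate:   share of samples where the actual word is in the first `top`
                predictions.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    exact = in_top = 0
    for context, actual in samples:
        preds = predict_fn(context)[:top]
        if preds and preds[0] == actual:
            exact += 1
        if actual in preds:
            in_top += 1
    return exact / len(samples), in_top / len(samples)
```

Running the same helper on pairs drawn from the corpora and on pairs drawn from outside text gives the in-sample and out-of-sample figures side by side.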

Final Model and Front-End Application

User enters text at the input prompt
Application provides a top prediction and graphical summary of runners-up, scored by match length and frequency
Click here to try it out for yourself!

The (final-ish) product