Predictive Text Model

John Theodore
11/10/19

Predictive text algorithms have many applications:
- word completion when typing in a search engine
- word suggestion when texting on a mobile device
- word correction in text editors
The application presented today simply aids someone writing a sentence on an electronic device by suggesting a “next” word while they type.

Text data were collected from three sources: Twitter feeds, blogs and online news sources (approx. 1-2 million observations from each source)
These text samples (or corpora) were then pooled into one corpus, where the text was cleaned and prepared for word tokenization
Individual words and combinations of adjacent words in a sentence were tokenized and extracted to form various “ngrams”, which represent 1- to 4-word combinations
Using Markov-chain rules, transition probability matrices were calculated for each size of ngram; these matrices capture the probability of a word occurring given the previous word or combination of previous words in the corpora used
A Shiny application was built as the interface: as someone writes a sentence, the model automatically generates a list of “next” word options after each word typed
The model itself uses two-, three- and four-word combinations to predict the next word (a text “back-off” model)
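
The steps above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the R code behind the application; the function names (`tokenize`, `build_tables`, `predict_next`) and the toy corpus are assumptions made up for this example. It counts which word follows each one-, two- and three-word context, turns those counts into transition probabilities, and backs off from the longest context to shorter ones when a context was never seen.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_tables(sentences, max_n=4):
    """Count next-word frequencies for every context of 1 to max_n - 1 words."""
    tables = defaultdict(Counter)  # context tuple -> Counter of next words
    for sentence in sentences:
        words = tokenize(sentence)
        for n in range(2, max_n + 1):            # 2-, 3- and 4-word ngrams
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                tables[context][words[i + n - 1]] += 1
    return tables


def predict_next(tables, typed, top_k=3, max_n=4):
    """Try the longest context first, then back off to shorter ones."""
    words = tokenize(typed)
    for size in range(min(max_n - 1, len(words)), 0, -1):
        counts = tables.get(tuple(words[-size:]))
        if counts:
            total = sum(counts.values())
            # Sort by count (descending), then alphabetically for stable ties.
            ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
            # Transition probability = count of next word / count of context.
            return [(word, count / total) for word, count in ranked[:top_k]]
    return []  # no context matched; a real model might fall back to common words


# Toy corpus invented for this example (not the Twitter/blog/news data).
corpus = [
    "I want to go home",
    "I want to go out",
    "I want to sleep",
    "they want to go home",
]
tables = build_tables(corpus)
print(predict_next(tables, "I want to"))   # three-word context is found
print(predict_next(tables, "we want to"))  # backs off to the context "want to"
```

In this toy corpus, "I want to" is followed by "go" twice and "sleep" once, so "go" is ranked first. The phrase "we want to" never occurs, so the sketch drops the first word and predicts from "want to" instead, which is the back-off behaviour described above.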

To test the model's overall accuracy in predicting “next” words in a sentence, a series of 10 random sentences (each 8 words long) was fed into the model. Additionally, three different models were compared, each using a different amount of training data:
- Model 1 - 15,000 training observations
- Model 2 - 30,000 training observations
- Model 3 - 45,000 training observations
The following table compares the model results. “Accuracy” reflects the average percentage of words in each sentence that the model predicted accurately (e.g., Model 15K predicted an average of 58% of the words across all 10 sentences correctly). The run-time of each model (in seconds) was also collected using R's built-in “system.time” function.

Model	user (s)	elapsed (s)	Accuracy
Model 15K	0.03	0.04	0.58
Model 30K	0.05	0.06	0.61
Model 45K	0.08	0.09	0.63

The findings suggest that model accuracy improved with more training observations, from 58% to 63% with an additional 30K training observations. Model run-time more than doubled with this same increase (from 0.04 to 0.09 seconds elapsed). In summary, the 15K model was chosen for the application because its accuracy is similar to the others' while using fewer records.
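
The accuracy measure and timing described above can be sketched as follows. This is an illustrative Python sketch, not the R evaluation code that produced the table. Two details are assumptions, because the slides do not specify them: a word counts as predicted correctly if it appears anywhere in the top `top_k` suggestions, and the first word of each sentence is excluded because it has no preceding context. `TOY_MODEL`, `toy_predict` and the test sentences are hypothetical stand-ins for the real model and data, and `time.perf_counter` plays the role of R's “system.time”.

```python
import time


def sentence_accuracy(predict, sentence, top_k=3):
    """Fraction of words (after the first) found among the top_k suggestions."""
    words = sentence.lower().split()
    if len(words) < 2:
        return 0.0
    hits = 0
    for i in range(1, len(words)):
        suggestions = predict(words[:i])[:top_k]  # model sees only earlier words
        if words[i] in suggestions:
            hits += 1
    return hits / (len(words) - 1)


def mean_accuracy(predict, sentences, top_k=3):
    """Average the per-sentence accuracy over all test sentences."""
    scores = [sentence_accuracy(predict, s, top_k) for s in sentences]
    return sum(scores) / len(scores)


# Hypothetical stand-in predictor: a fixed lookup on the last word typed.
TOY_MODEL = {"i": ["want", "am"], "want": ["to"], "to": ["go", "be"]}


def toy_predict(previous_words):
    return TOY_MODEL.get(previous_words[-1], [])


test_sentences = ["I want to go home", "I am here"]

start = time.perf_counter()
accuracy = mean_accuracy(toy_predict, test_sentences)
elapsed = time.perf_counter() - start
print(f"accuracy={accuracy:.3f} elapsed={elapsed:.4f}s")
```

With the stand-in predictor, the first sentence scores 3 of 4 words and the second 1 of 2, and the reported accuracy is the mean of those two fractions. Swapping in a different predictor function is all that is needed to compare models trained on different amounts of data.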

Once in the application, start typing a sentence in the text box and a list of predicted “next” words will appear to the right of the text box (in order of highest to lowest probability).
PLEASE NOTE: WORDS WILL APPEAR AFTER TYPING THE LAST LETTER OF EACH WORD. DO NOT HIT THE “SPACE BAR” ON YOUR KEYBOARD UNTIL THE LIST OF WORDS APPEARS, THEN GO ON TO TYPE THE NEXT WORD IN YOUR SENTENCE.