> Predicting your next word

> K. MacAvaney
> October 4, 2016

Using the app

Type the beginning of a sentence
When you finish a word, one primary suggestion and two secondary suggestions appear
Suggestions appear in order of likelihood

The basic algorithm

Suggestions are based on corpora of text taken from Twitter, blogs, and news stories
Cleaning and analysis of the corpora were performed using the quanteda package
Suggestions are primarily taken from commonly used n-grams. If a quadgram is not found, the algorithm looks for a trigram; if no trigram, a bigram; if no bigram, it suggests commonly used unigrams.
To increase app speed and decrease the size of the probability tables, extraneous quad-, tri-, and bigrams are cut. For instance, hundreds of trigrams begin with “and the”. The app only needs 3 suggestions in total, so everything except the 3 most probable “and the” trigrams is removed.
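
The pruning and back-off steps described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the app's actual code: the toy counts, the column names `prefix`/`word`/`count`, and the helpers `prune_top_k`, `last_words`, and `suggest` are all hypothetical, and only bigram and trigram tables are shown (a quadgram table would be a third list element).

```r
# Toy n-gram count tables (hypothetical data, not the app's real tables).
# `prefix` holds the preceding words, `word` the word that follows them.
trigrams <- data.frame(
  prefix = c("and the", "and the", "and the", "and the", "gain access"),
  word   = c("other", "best", "rest", "same", "to"),
  count  = c(40, 35, 30, 25, 2),
  stringsAsFactors = FALSE
)
bigrams <- data.frame(
  prefix = c("the", "the", "the", "the", "to"),
  word   = c("first", "same", "best", "most", "the"),
  count  = c(90, 80, 70, 60, 50),
  stringsAsFactors = FALSE
)
unigrams <- c("the", "to", "and")  # fallback: assumed most common single words

# Keep only the k most frequent continuations for each prefix.
prune_top_k <- function(tbl, k = 3) {
  tbl <- tbl[order(tbl$prefix, -tbl$count), ]
  rank_in_prefix <- ave(tbl$count, tbl$prefix, FUN = seq_along)
  tbl <- tbl[rank_in_prefix <= k, ]
  rownames(tbl) <- NULL
  tbl
}

# Return the last n words of a character vector, joined by spaces.
last_words <- function(words, n) {
  paste(tail(words, n), collapse = " ")
}

# Try the longest available context first, then back off to shorter ones.
# tables[[n]] is assumed to be keyed on an n-word prefix.
suggest <- function(text, tables, unigrams, k = 3) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  for (n in rev(seq_along(tables))) {
    if (length(words) < n) next
    hits <- tables[[n]][tables[[n]]$prefix == last_words(words, n), ]
    if (nrow(hits) > 0) {
      return(head(hits$word[order(-hits$count)], k))
    }
  }
  head(unigrams, k)  # nothing matched: suggest common unigrams
}

tables <- list(prune_top_k(bigrams), prune_top_k(trigrams))
suggest("salt and the", tables, unigrams)  # matched in the trigram table
suggest("go to", tables, unigrams)         # backs off to the bigram table
suggest("zzz", tables, unigrams)           # falls through to unigrams
```

In this sketch a match that yields fewer than three words is returned as is; topping the list up from shorter n-grams would be a natural extension.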

More tinkering to do

It's interesting to see some suggestions that are clearly influenced by the corpora's time period and origin. I would like to train another algorithm with different data and see what changes. (Example: suggestions for “Welcome to” are proof we used Twitter data)

[1] "the"     "my"      "twitter"

Trimming the probability tables seemed easy, but on closer examination, many low-frequency n-grams are common phrases that probably should not be cut. I would like to explore this further to increase app speed. (Example: many infrequently occurring trigrams are fairly common phrases)

              one       two count
52599 gain access        to     2
52600    need for        an     2
52601    may deem necessary     2
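
One way to explore this trade-off is to measure what a simple count cutoff would remove before applying it. The sketch below is only an illustration: the first three rows come from the table above, while the two “and the” rows, the cutoff `min_count`, and the variable names are assumptions made up for the example.

```r
# Trigram table in the same layout as above: `one` is the two-word prefix,
# `two` is the following word. The last two rows are invented toy data.
trigrams <- data.frame(
  one   = c("gain access", "need for", "may deem", "and the", "and the"),
  two   = c("to", "an", "necessary", "other", "best"),
  count = c(2, 2, 2, 40, 35),
  stringsAsFactors = FALSE
)

min_count <- 3  # hypothetical cutoff: drop trigrams seen fewer than 3 times

dropped <- trigrams[trigrams$count < min_count, ]

# Share of table rows removed versus share of total occurrences removed.
row_share  <- nrow(dropped) / nrow(trigrams)
mass_share <- sum(dropped$count) / sum(trigrams$count)

# Inspect the phrases that would be lost before committing to the cutoff.
paste(dropped$one, dropped$two)
```

Comparing `row_share` with `mass_share` shows how much table size a cutoff saves relative to how much observed text it discards, and listing the dropped phrases makes it easy to spot common expressions among them.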

Citations

While refining this app, I read several articles and experimented with various text mining packages. Even though many of these articles led to dead ends where this project is concerned, they were critical in shaping my understanding of text mining.

Introduction to the tm package
FAQs about the tm package
Getting started with quanteda
Introduction to NLP - succinct explanation of Maximum Likelihood Estimate (MLE) and the Katz back-off model
Too many Stack Overflow comments to count