Predicting your next word

> K. MacAvaney
> October 4, 2016

Using the app

Try it here!

  • Type the beginning of a sentence
  • When you finish a word, one primary suggestion and two secondary suggestions will appear
  • Suggestions appear in order of likelihood

The basic algorithm

  • Suggestions are based on corpora of text taken from Twitter, blogs, and news stories
  • Cleaning and analysis of the corpora were performed using the quanteda package
  • Suggestions are taken primarily from commonly used n-grams. If no matching quadgram is found, the algorithm looks for a trigram; if there is no trigram, a bigram; if there is no bigram, it suggests commonly used unigrams.
  • To increase app speed and shrink the probability tables, extraneous quad-, tri-, and bigrams are cut. For instance, hundreds of trigrams begin with “and the”. The app only needs 3 suggestions in total, so every “and the” trigram except the 3 most probable is removed.
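
The two ideas above (pruning each prefix to its top 3 continuations, then backing off from longer to shorter n-grams) can be sketched in base R. This is a minimal illustration, not the app's actual code: the table layout, the column names `prefix` and `nxt`, the function names `prune_top_k` and `suggest`, and all counts are assumptions made up for the example.

```r
# Toy n-gram count table: `prefix` holds the preceding words and `nxt` the
# word that follows. The empty prefix "" stands for unigrams.
# All rows and counts are invented for illustration.
ngrams <- data.frame(
  prefix = c("and the", "and the", "and the", "and the",
             "the", "the", "", "", ""),
  nxt    = c("other", "rest", "first", "only",
             "same", "first", "the", "to", "and"),
  count  = c(40, 35, 30, 5, 90, 80, 500, 400, 300),
  stringsAsFactors = FALSE
)

# Keep only the k most frequent continuations of each prefix.
prune_top_k <- function(tbl, k = 3) {
  tbl <- tbl[order(tbl$prefix, -tbl$count), ]
  rank_in_prefix <- ave(tbl$count, tbl$prefix, FUN = seq_along)
  tbl[rank_in_prefix <= k, ]
}

# Try the longest available prefix first, then back off to shorter ones,
# ending with the unigram fallback (empty prefix).
suggest <- function(text, tbl, k = 3, max_prefix = 3) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  for (n in seq(min(max_prefix, length(words)), 0)) {
    prefix <- paste(tail(words, n), collapse = " ")
    hits <- tbl[tbl$prefix == prefix, ]
    if (nrow(hits) > 0) {
      return(head(hits$nxt[order(-hits$count)], k))
    }
  }
  character(0)
}

pruned <- prune_top_k(ngrams)
suggest("cats and the", pruned)      # falls back to the "and the" entries
suggest("something unseen", pruned)  # falls back to the unigrams
```

With this toy table, pruning drops the fourth “and the” row, and an unseen phrase ends up with the three most frequent unigrams. Storing every n-gram order in one table keyed by prefix is just one convenient layout; separate tables per order would work equally well.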

More tinkering to do

  • It's interesting to see some suggestions that are clearly influenced by the corpora's time period and origin. I would like to train another algorithm with different data and see what changes. (Example: suggestions for “Welcome to” are proof we used Twitter data)
[1] "the"     "my"      "twitter"
  • Trimming the probability tables seemed easy, but on closer examination, many low-frequency n-grams are common phrases that don't seem like they should be cut. I would like to explore this further to increase app speed. (Example: these trigrams each occur only twice, yet all are common phrases)
              one       two count
52599 gain access        to     2
52600    need for        an     2
52601    may deem necessary     2
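
To make the trade-off concrete, here is a small sketch comparing two trimming rules on the rows shown above. The first three rows are copied from that table; the fourth row, the `min_count` cutoff, and the per-prefix rule are assumptions added for illustration, not the approach the app necessarily uses.

```r
# Rows 1-3 come from the table above; row 4 and its count are invented.
trigrams <- data.frame(
  one   = c("gain access", "need for", "may deem", "and the"),
  two   = c("to", "an", "necessary", "other"),
  count = c(2, 2, 2, 40),
  stringsAsFactors = FALSE
)

min_count <- 3  # hypothetical global cutoff

# Global cutoff: drop every trigram seen fewer than min_count times.
kept_global <- trigrams[trigrams$count >= min_count, ]

# Per-prefix alternative: keep the most frequent continuation of each
# prefix, however rare the prefix itself is.
is_best <- trigrams$count == ave(trigrams$count, trigrams$one, FUN = max)
kept_per_prefix <- trigrams[is_best, ]

nrow(kept_global)      # the global cutoff removes all three rare phrases
nrow(kept_per_prefix)  # the per-prefix rule keeps them
```

A global frequency cutoff shrinks the table quickly but discards phrases like “gain access to”, while a per-prefix rule keeps them at the cost of a larger table.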

Citations

While refining this app, I read several articles and experimented with various text mining packages. Even though many of these articles led to dead ends where this project is concerned, they were critical in shaping my understanding of text mining.