Tri-gram word prediction for Twitter data

Jonathan Kropko
June 8, 2017

How to predict the next word in a tweet?

One method is as follows:

  • Calculate the tri-grams: every distinct sequence of three consecutive words, along with its frequency.
  • Take the final two words of the given phrase (the bi-gram).
  • Filter the tri-grams to those that begin with that bi-gram.
  • Predict the most frequently occurring third word.
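
The four steps above can be sketched in base R as follows. This is a minimal illustration, not the code behind wordPred: the helper names (tokenize, trigram_table, predict_next), the simple lowercase tokenizer, and the top_n argument are all assumptions made for the example.

```r
# Hypothetical helpers for illustration; not the app's actual implementation.

# Assumed tokenizer: lowercase, split on anything that is not a letter or apostrophe.
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words[nzchar(words)]
}

# Step 1: build every tri-gram (leading bi-gram + third word) with its frequency.
trigram_table <- function(tweets) {
  rows <- lapply(tweets, function(tweet) {
    w <- tokenize(tweet)
    n <- length(w)
    if (n < 3) return(NULL)
    data.frame(
      firstTerms = paste(w[1:(n - 2)], w[2:(n - 1)], sep = "_"),
      lastTerm = w[3:n],
      frequency = 1L,
      stringsAsFactors = FALSE
    )
  })
  grams <- do.call(rbind, Filter(Negate(is.null), rows))
  if (is.null(grams)) {
    return(data.frame(firstTerms = character(0), lastTerm = character(0),
                      frequency = integer(0), stringsAsFactors = FALSE))
  }
  counts <- aggregate(grams["frequency"],
                      by = grams[c("firstTerms", "lastTerm")], FUN = sum)
  # Most frequent first; ties broken alphabetically by third word.
  counts[order(-counts$frequency, as.character(counts$lastTerm)), ]
}

# Steps 2 to 4: take the final bi-gram, filter the tri-grams, return the top words.
predict_next <- function(phrase, tweets, top_n = 5) {
  w <- tokenize(phrase)
  if (length(w) < 2) return(character(0))
  key <- paste(tail(w, 2), collapse = "_")
  grams <- trigram_table(tweets)
  hits <- grams[grams$firstTerms == key, , drop = FALSE]
  head(as.character(hits$lastTerm), top_n)
}

# Toy corpus, made up for this example.
toy_tweets <- c("a bag of potato chips", "of potato chips please",
                "i love of potato salad", "of potato chips")
predict_next("all that and a bag of potato", toy_tweets)
```

Building the full tri-gram table first keeps the sketch close to the description, at the cost of memory on a large corpus.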

Example

wordPred("all that and a bag of potato")
# A tibble: 6 x 3
  firstTerms lastTerm frequency
       <chr>    <chr>     <int>
1  of_potato    chips         3
2  of_potato      and         1
3  of_potato     chip         1
4  of_potato     hold         1
5  of_potato    salad         1
6  of_potato     stix         1

The predicted next word in this case is “chips”.

Memory problems

http://www.shinyapps.io/ places strict limits on the amount of memory an app can use, and the raw Twitter data is over 300 MB. Here is what I did to reduce memory usage:

  • Don't calculate ALL tri-grams.
  • Instead, filter the Twitter data to only the tweets that contain the last two words of the given phrase, then calculate tri-grams for the filtered data.
  • Filter those tri-grams to only the ones beginning with the last two words of the given phrase.
  • If necessary, start with a random 50% sample of the original Twitter data.
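
A rough sketch of this filter-first approach in base R is shown below. It is an illustration under assumptions rather than the app's code: the function name predict_filtered, the whitespace tokenization, and the sample_frac and top_n arguments are hypothetical.

```r
# Hypothetical sketch of the memory-saving strategy; not the app's actual code.
predict_filtered <- function(phrase, tweets, sample_frac = 1, top_n = 5) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  bigram <- paste(tail(words, 2), collapse = " ")

  # Optional: start from a random subsample (e.g. sample_frac = 0.5).
  if (sample_frac < 1 && length(tweets) > 0) {
    keep_n <- floor(sample_frac * length(tweets))
    tweets <- tweets[sample.int(length(tweets), keep_n)]
  }

  # Keep only the tweets that contain the bi-gram at all.
  tweets <- trimws(tolower(tweets))
  kept <- tweets[grepl(bigram, tweets, fixed = TRUE)]
  if (length(kept) == 0) return(character(0))

  # Compute tri-grams on the small filtered set only, and keep the
  # third word of each tri-gram that begins with the bi-gram.
  next_words <- unlist(lapply(strsplit(kept, "\\s+"), function(w) {
    n <- length(w)
    if (n < 3) return(character(0))
    first <- paste(w[1:(n - 2)], w[2:(n - 1)])
    w[3:n][first == bigram]
  }))
  if (length(next_words) == 0) return(character(0))

  # Rank candidate words by how often they follow the bi-gram.
  freq <- sort(table(next_words), decreasing = TRUE)
  head(names(freq), top_n)
}

# Toy corpus, made up for this example.
toy_tweets <- c("All that and a bag of potato chips", "Of potato chips please",
                "I had of potato salad", "we ate of potatoes today",
                "nothing relevant here", "bag of potato chips")
predict_filtered("all that and a bag of potato", toy_tweets)
```

Because the substring filter runs before any tri-grams are built, only a small fraction of the corpus is ever tokenized, which is where the memory saving comes from. The substring match can let through near misses such as "of potatoes", so the exact comparison against the bi-gram afterwards is still needed.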

See the app in action!

Please visit https://jkropko.shinyapps.io/capstoneapp/ to try the app out for yourself.

Instructions: Type in a phrase, and the app will return the five words that are the best predictions for the next word in the phrase.

Thank you!