6/28/2021

Introduction

This interactive app takes any string as input and returns the predicted next word. It also shows the five words most likely to follow the user’s string, each with its own likelihood “score”. Scores are based on how often each word follows the preceding 3-gram, 2-gram and 1-gram, and on how often it follows the most recent non-stopword in skipGrams (n-2) that exclude stopwords.
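
As a rough illustration of the lookup keys described above, the sketch below pulls the last 3-gram, 2-gram, 1-gram and the most recent non-stopword out of an input string. The stopword list, the tokenizer and the function names (`tokenize`, `context_keys`) are assumptions made for this example, not the app's actual code.

```r
# Hypothetical stopword list; the app's real list is not given in the source.
stopwords <- c("the", "a", "an", "of", "to", "and", "in", "is", "it", "its")

# Simplistic tokenizer (an assumption): lowercase, drop ASCII apostrophes,
# turn every other non-letter into a space, then split on spaces.
tokenize <- function(text) {
  text <- tolower(text)
  text <- gsub("'", "", text, fixed = TRUE)
  text <- gsub("[^a-z ]", " ", text)
  words <- strsplit(text, " +")[[1]]
  words[words != ""]
}

# Build the four keys used to look up candidate next words.
context_keys <- function(text) {
  words <- tokenize(text)
  n <- length(words)
  # Last k words joined by spaces, or NA when the input is too short.
  last_n <- function(k) {
    if (n >= k) paste(utils::tail(words, k), collapse = " ") else NA_character_
  }
  content <- words[!words %in% stopwords]
  list(
    trigram = last_n(3),
    bigram  = last_n(2),
    unigram = last_n(1),
    # Most recent word that is not a stopword, if there is one.
    skip    = if (length(content) > 0) utils::tail(content, 1) else NA_character_
  )
}

keys <- context_keys("The weather in the")
```

Here the 1-gram key is the stopword "the", while the skip key falls back to "weather", which is the kind of extra context the stopword-free skipGrams are meant to supply.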

The app draws on millions of tweets and blog posts to predict the user’s next word. It stores its data with data.table: tables of ngrams and their frequencies, combined with dictionaries that pair ngrams with integer lookups.
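
A dictionary that pairs ngrams with integer lookups might look like the minimal sketch below. The table contents, the column names (`ngram`, `id`) and the helper `lookup_id` are made up for illustration; only the use of data.table comes from the description above.

```r
library(data.table)

# Toy dictionary: each distinct 2-gram gets an integer ID (values are made up).
dict2 <- data.table(ngram = c("in the", "of the", "the most"), id = 1:3)

# Keying the table sorts it by ngram and enables fast binary-search lookups.
setkey(dict2, ngram)

# Return the integer ID for a phrase, or NA if the phrase is not in the dictionary.
lookup_id <- function(dict, phrase) {
  dict[.(phrase), id]
}

known   <- lookup_id(dict2, "the most")
unknown <- lookup_id(dict2, "not seen")
```

Storing integers instead of repeated strings in the large frequency tables is one plausible reason for this design, since integer columns are smaller and faster to join on.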

Under the Hood: Steps to Set Up

  • Download data and packages into R
  • Create data tables
    • Four ngram tables, each holding the ngrams, their frequencies, their last word and the words that precede the last word
    • Four dictionaries: 2-gram, 3-gram, 4-gram, skipGram
  • Convert ngram data tables to integer lookup tables
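
The setup steps above can be sketched on a toy corpus as follows. This is a simplified illustration under stated assumptions: the corpus, the helper names (`ngrams_of`, `build_table`), the column names and the way IDs are assigned are all hypothetical, and the real app may build its dictionaries differently.

```r
library(data.table)

# Tiny made-up corpus standing in for the tweets and blogs.
corpus <- c("it is the most fun", "it is the best", "this is the most fun")

# All n-grams of one sentence, as space-joined strings.
ngrams_of <- function(sentence, n) {
  w <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  if (length(w) < n) return(character(0))
  vapply(seq_len(length(w) - n + 1),
         function(i) paste(w[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Table of n-grams (n >= 2) with frequency, last word and preceding words.
build_table <- function(corpus, n) {
  grams <- unlist(lapply(corpus, ngrams_of, n = n))
  dt <- data.table(ngram = grams)[, .(freq = .N), by = ngram]
  dt[, last_word := sub(".* ", "", ngram)]      # text after the final space
  dt[, prefix := sub(" [^ ]*$", "", ngram)]     # text before the final space
  dt[]
}

tri <- build_table(corpus, 3)

# Dictionaries that map each distinct prefix and word to an integer ID.
prefixes <- unique(tri$prefix)
words <- unique(tri$last_word)
prefix_dict <- data.table(prefix = prefixes, prefix_id = seq_along(prefixes))
word_dict <- data.table(word = words, word_id = seq_along(words))

# Integer lookup table: the same rows as `tri`, with strings replaced by IDs.
tri_int <- tri[, .(prefix_id = match(prefix, prefix_dict$prefix),
                   word_id   = match(last_word, word_dict$word),
                   freq      = freq)]
```

In this toy corpus "is the most" occurs twice, so its row carries `freq` 2, `last_word` "most" and `prefix` "is the".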

Under the Hood: Steps to Run the App

  • From the given input, compile new data frames holding each candidate word and the likelihood that it completes a 4-gram, 3-gram, 2-gram or skipGram
  • Combine these data frames and calculate an overall score for each candidate word
  • Print the most likely next word, as well as a bar chart of the 5 most likely next words
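
One way to picture the combine-and-score steps above is a weighted sum over the per-source candidate tables. The source does not say how the overall score is computed, so the weights, the candidate likelihoods and the column names below are purely illustrative assumptions.

```r
library(data.table)

# Made-up candidate tables, one per source; `p` is a likelihood for each word.
four  <- data.table(word = c("important", "beautiful"), p = c(0.6, 0.4))
three <- data.table(word = c("important", "popular", "beautiful"), p = c(0.5, 0.3, 0.2))
two   <- data.table(word = c("of", "important", "popular"), p = c(0.5, 0.3, 0.2))
skip  <- data.table(word = c("fun", "popular"), p = c(0.6, 0.4))

# Hypothetical weights that favor longer contexts (not taken from the app).
w <- c(four = 0.4, three = 0.3, two = 0.2, skip = 0.1)

# Stack the tables, tagging each row with the name of its source.
candidates <- rbindlist(list(four = four, three = three, two = two, skip = skip),
                        idcol = "src")

# Weight each likelihood by its source and sum per word.
candidates[, weighted := p * w[src]]
scores <- candidates[, .(score = sum(weighted)), by = word][order(-score)]

top5 <- utils::head(scores, 5)
best <- top5$word[1]
```

With these toy numbers `best` is "important", and a call such as `barplot(top5$score, names.arg = top5$word)` would draw the kind of five-bar chart the app displays.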

Example 1

What happens when you enter the string “It’s the most”?

Example 2

Evaluation

  • Competitively accurate predictor
  • Shows an easy-to-understand visual of possible next words
  • Combining many tables and rendering the chart slow the app down considerably
  • The extra consideration of the previous non-stopword reduces the likelihood that too many stopwords will be chosen, but it does lead to some stopwords being suggested at inappropriate times