3/30/2022

Motivation

This application was developed as a final submission to the capstone program for the Johns Hopkins University “Data Science and Machine Learning” specialization certificate. It demonstrates the flexibility of the R language and Shiny application development framework by showcasing a text-prediction algorithm. The application returns a next-word suggestion, along with an array of additional suggestions, as long as at least one word is present in the text box and a match for the input is found in the database used by the algorithm.

The application is available for use at https://5yf0hd-such70sam-pichardo.shinyapps.io/predictr/

Web App Usage

After the app initializes (this may take up to 10 seconds while the word databases are loaded), the user is prompted to type at least one word; a default word is supplied automatically. The app will not load any data if the text box is left empty.

The algorithm will provide a suggestion for the next word along with a table of the top 5 suggestions, if available. Uncommon misspellings and inputs for which there are no matches in the databases will not return any results.

Algorithm

This application operates on a “stupid backoff” model. When an input is detected, the app uses a function that searches a database of 3-grams extracted from a 5% sample of the English-language corpus of tweets, blog posts, and news articles available to download through this link. When the last two words of the input can be matched against a set of 3-grams in the corpus, a table of the five most frequent completions is shown, and the top result appears as the suggestion in the sidebar.
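
As a rough illustration of this backoff, the lookup might look like the sketch below. The trigrams and bigrams data frames and their column names (w1, w2, w3, freq) are assumptions made for the example, not the app's actual objects.

```r
# Minimal sketch of a "stupid backoff" lookup (hypothetical table and column names).
# trigrams: data frame with columns w1, w2, w3, freq
# bigrams:  data frame with columns w1, w2, freq
predict_next <- function(w_prev, w_last, trigrams, bigrams, top_n = 5) {
  # Try the 3-gram table first, using the last two words of the input.
  if (!is.na(w_prev)) {
    hits <- trigrams[which(trigrams$w1 == w_prev & trigrams$w2 == w_last), ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$freq), ]
      return(head(hits$w3, top_n))
    }
  }
  # Back off to the 2-gram table, matching only the last word.
  hits <- bigrams[which(bigrams$w1 == w_last), ]
  if (nrow(hits) > 0) {
    hits <- hits[order(-hits$freq), ]
    return(head(hits$w2, top_n))
  }
  character(0)  # no match in either table; the app shows no suggestion
}
```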

To parse the input, the application splits the words into a single-column data frame of length n. The nth and (n-1)th words are passed to the 3-gram function, which falls back to a 2-gram function if the (n-1)th word is not present, or if no 3-gram matches are found for the two-word input.
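
A minimal sketch of that parsing step is shown below; the parse_input() helper and its returned list structure are assumptions for illustration, and the app's actual function may differ.

```r
# Illustrative sketch of the input-parsing step (hypothetical helper name).
parse_input <- function(text) {
  # Split the input on whitespace and drop empty strings.
  words <- unlist(strsplit(tolower(trimws(text)), "\\s+"))
  words <- words[words != ""]
  n <- length(words)
  # Hold the words in a single-column data frame, as described above.
  word_df <- data.frame(word = words, stringsAsFactors = FALSE)
  # Return the (n-1)th and nth words for the n-gram lookup.
  list(
    prev = if (n >= 2) word_df$word[n - 1] else NA_character_,
    last = if (n >= 1) word_df$word[n] else NA_character_
  )
}

parse_input("thanks for the")  # returns list(prev = "for", last = "the")
```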

Known Issues

  • The app takes about 10 seconds to initialize because of the size of the 2- and 3-gram data files it must load.
  • Only 5% of the available data was used due to memory constraints.
    • Using more data would allow for more flexible predictions.
  • No statistical smoothing was applied to the data.
    • As a result, the algorithm cannot make predictions for word combinations it has not seen in the training data.
  • Contextual analysis is unavailable, and no more than two words of context are used for prediction.
    • Memory and speed constraints led to the removal of 4-gram processing during testing.