Data Science Capstone Project - Word Prediction

D. Ansell

2025-05-20

Overview & Instructions

This app predicts the next word to be typed, based on the frequency of 4-gram, 3-gram, and 2-gram word sequences and using a back-off algorithm.

The data for the n-grams was collected from three sources (news, blogs, and Twitter content) that can be sourced here:

Instructions:

Enter at least one word in the box and click ‘Submit’ to have the app return phrases with the most likely next word.

The entire phrase is returned, instead of just the next word, so that the user can see if the algorithm had to “back off” (more on this later).

Algorithm

The source data was heavily processed and merged to create data tables of bigrams, trigrams, and 4-grams, each paired with its frequency of occurrence across the news, blogs, and Twitter sources.
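As a rough illustration of what those frequency tables contain, here is a minimal Python sketch that counts n-grams in a toy sentence. The app itself was built in R with data tables, so the function name `count_ngrams` and the sample text are assumptions made for this example, and the real cleaning and merging steps are not shown.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy text standing in for the cleaned news, blog, and Twitter content.
tokens = "thanks for the follow and thanks for the support".split()

bigrams = count_ngrams(tokens, 2)
trigrams = count_ngrams(tokens, 3)
fourgrams = count_ngrams(tokens, 4)
```

Each counter maps a tuple of words to the number of times that sequence appears, which is the same information the bigram, trigram, and 4-gram tables hold.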

These n-grams make up the data that the algorithm searches in order to make a prediction. If three or more words have been typed, the algorithm takes the last three words, searches the 4-grams table for 4-grams that begin with that three-word sequence, ranks the matches by decreasing frequency, and returns the top three results.

If no matching 4-gram is found, the algorithm “backs off”: it takes the last two words typed and searches for trigrams that begin with that sequence. If no matching trigram is found, it searches the bigrams table for bigrams that begin with the last word typed.
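The back-off search described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the app's actual R code: the function name `predict`, the layout of `tables`, the toy frequencies, and the lower-casing of the input are all made up for the example.

```python
def predict(text, tables, top_n=3):
    """Return up to top_n full phrases, trying the longest context first.

    tables maps an n-gram order (4, 3, 2) to a dict of {word_tuple: frequency}.
    """
    words = text.lower().split()  # lower-casing is an assumption
    for order in (4, 3, 2):
        context_len = order - 1
        if len(words) < context_len:
            continue  # not enough words typed for this order
        context = tuple(words[-context_len:])
        matches = [(gram, freq) for gram, freq in tables[order].items()
                   if gram[:context_len] == context]
        if matches:
            # Rank by decreasing frequency and return the whole phrase.
            matches.sort(key=lambda m: m[1], reverse=True)
            return [" ".join(gram) for gram, _ in matches[:top_n]]
    return []  # nothing matched, even in the bigrams table

# Toy tables with invented frequencies.
tables = {
    4: {("thanks", "for", "the", "follow"): 12,
        ("thanks", "for", "the", "support"): 7},
    3: {("for", "the", "first"): 30, ("one", "of", "the"): 25},
    2: {("the", "same"): 50, ("of", "the"): 90},
}
```

With these toy tables, `predict("thanks for the", tables)` finds 4-gram matches, `predict("waiting for the", tables)` backs off to the trigram "for the first", and `predict("paint the", tables)` backs off to the bigram "the same". The length of the returned phrase shows how far the search had to back off.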

Technical Constraints

There were many challenges to this assignment, but the greatest came from having to deploy the app to the free hosting tier of Shinyapps.io, where resource limits are much tighter than on a personal PC.

While the original solution using data tables ran fine locally, it would not start on Shinyapps.io, presumably because of memory limits. The error feedback and logs on Shinyapps.io were so sparse that debugging the app there was impossible.

The solution I found was to avoid data tables, which must be loaded entirely into memory, and instead query an on-disk SQLite database (sqlite3). The drawback is that the database files take about seven times more disk space than the lightweight Apache Arrow Feather files that my data tables were stored in. I had to delete a considerable share of the data (~33%) just to be able to upload the database file to Shinyapps.io.
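To show what a database lookup of this kind might look like, here is a small Python sketch using the standard `sqlite3` module. The table name `fourgrams`, its columns, the index, and the helper `top_matches` are assumptions for illustration; the app's real schema and R code are not shown in this document. An in-memory database is used so the example runs anywhere, whereas the deployed app reads a database file from disk.

```python
import sqlite3

# In-memory database for the example; the deployed app would open a file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fourgrams (context TEXT, next_word TEXT, freq INTEGER)"
)
conn.executemany(
    "INSERT INTO fourgrams VALUES (?, ?, ?)",
    [("thanks for the", "follow", 12),
     ("thanks for the", "support", 7),
     ("thanks for the", "love", 3),
     ("thanks for the", "info", 2)],
)
# An index on the lookup column (an assumption, not confirmed for the app)
# lets SQLite find matching rows without scanning the whole table.
conn.execute("CREATE INDEX idx_fourgrams_context ON fourgrams (context)")

def top_matches(conn, context, limit=3):
    """Return the most frequent full phrases that start with context."""
    rows = conn.execute(
        "SELECT next_word FROM fourgrams "
        "WHERE context = ? ORDER BY freq DESC LIMIT ?",
        (context, limit),
    ).fetchall()
    return [f"{context} {word}" for (word,) in rows]
```

Only the rows that match the query are read, so the full n-gram table never has to sit in memory, at the cost of a disk read on every lookup.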

While the database approach allows the app to run on Shinyapps.io, the performance is worse than I had hoped. I suspect that there is a bottleneck when the app reads from the database file on Shinyapps.io disk storage.

Final Word

If you’re reading this, it probably means that you have just completed the last assignment for the Johns Hopkins Data Science Certificate. Congratulations to you!

It hasn’t always been an easy journey, but it has been rewarding. Best of luck to you in your Data Science pursuits!