2024-01-10

Background

  • Over the past few years, predictive text algorithms have become commonplace across applications and electronic devices
  • Thanks to a collaboration with SwiftKey, I was able to use their curated text corpora to create my own predictive text application

(Image: an example of a common predictive text feature)

The Data

  • As previously mentioned, SwiftKey provided the data, drawn from a corpus of text called HC Corpora
  • The data used to train the model consists of web-scraped lines from three different sources: news pages, blog postings, and Twitter posts
  • The training set consists of 100,000 lines (40,000 from news sources, 40,000 from blog postings, and 20,000 from Twitter posts); a sketch of how such a sample could be drawn is shown below
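
The post does not reproduce the sampling code, but a minimal sketch along these lines would produce the 40,000/40,000/20,000 split described above. The file names assume the standard HC Corpora / SwiftKey layout (en_US.news.txt, etc.) and the seed is arbitrary:

    # Sample 40k news, 40k blog, and 20k Twitter lines into one training set
    set.seed(42)  # arbitrary seed, for reproducibility only

    news   <- readLines("final/en_US/en_US.news.txt",    skipNul = TRUE)
    blogs  <- readLines("final/en_US/en_US.blogs.txt",   skipNul = TRUE)
    tweets <- readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE)

    train <- c(sample(news,   40000),
               sample(blogs,  40000),
               sample(tweets, 20000))

    writeLines(train, "train_100k.txt")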

Algorithm

  • For this text prediction model, I decided to use the Stupid Backoff algorithm, as implemented in the sbo package
  • The sbo_predictor() function does much of the heavy lifting when building the model, as its integrated arguments make it convenient to preprocess and filter the text data to a user’s needs
  • The model is built as an sbo_predtable() object, which is saved to disk as an .rds file rather than kept in physical memory; the Shiny app can then quickly recover/load it at startup
  • This means that model predictions are essentially instant within the app, with almost nonexistent lag or delay; a sketch of this workflow follows below
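
The exact model-building code is not shown in this post, but a minimal sketch with the sbo package could look like the following. The N-gram order, dictionary coverage, and file names are illustrative assumptions rather than the app's actual settings:

    library(sbo)

    train <- readLines("train_100k.txt")

    # Build the Stupid Backoff prediction table: a trigram model with a
    # dictionary covering ~75% of word occurrences (both values assumed
    # here), using the package's built-in preprocessing.
    pt <- sbo_predtable(train,
                        N = 3,                    # N-gram order (assumed)
                        dict = target ~ 0.75,     # dictionary coverage (assumed)
                        .preprocess = sbo::preprocess,
                        EOS = ".?!:;",            # end-of-sentence characters
                        L = 3L)                   # keep the top 3 predictions

    # Save the prediction table to disk so it stays out of physical memory...
    saveRDS(pt, "predtable.rds")

    # ...then, inside the Shiny app, load it and wrap it in a predictor:
    pt <- readRDS("predtable.rds")
    p  <- sbo_predictor(pt)
    predict(p, "thanks for the")  # returns the 3 most probable next words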

Using the App

  • To use the app, enter the sentence or phrase whose next word you wish to predict into the text box
  • From there, simply click the button to generate the next word, and the algorithm will return the three most probable words that follow the entered phrase; a minimal sketch of this interaction is included after this list
  • NOTE: This is a very basic, first-attempt implementation of a text prediction algorithm. There are certainly ways to improve the accuracy of such models that are outside the scope of this project
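
The app's source is not included in this post, but a minimal Shiny sketch of the interaction described above might look like the following; the UI labels and the predtable.rds file name are assumptions carried over from the earlier sketch, not the app's actual code:

    library(shiny)
    library(sbo)

    # Load the saved prediction table once at startup and build the predictor
    # (file name assumed from the earlier sketch).
    pt <- readRDS("predtable.rds")
    p  <- sbo_predictor(pt)

    ui <- fluidPage(
      titlePanel("Next-Word Prediction"),
      textInput("phrase", "Enter a sentence or phrase:"),
      actionButton("go", "Predict next word"),
      tableOutput("predictions")
    )

    server <- function(input, output) {
      # Compute the top-3 predictions only when the button is clicked
      preds <- eventReactive(input$go, {
        req(input$phrase)
        predict(p, input$phrase)
      })
      output$predictions <- renderTable({
        data.frame(Prediction = preds())
      })
    }

    shinyApp(ui, server)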