Capstone Project: Text Prediction App

R. Holley
Nov. 24, 2020

Johns Hopkins University/Coursera Data Science Certificate

Intro

This text-prediction application was built with R for data wrangling and Shiny for the UI and hosting. The initial model-building data was provided by SwiftKey for the Johns Hopkins University/Coursera Data Science Certificate Program.
The app is easy to use and has a clean, simple interface. Its salient features include:

  • Slider for control of output length
  • Dynamic output that updates as the user is typing
  • Immediate results from a lightweight named-list data structure
  • Server code that is easy to update for more advanced future models

Example Use

[Screenshot: example use]
In this screenshot, the user entered 'rather' and set the output length to 5. The results on the right list candidate next words, ranked by probability.

Under the Hood

This app uses a bigram Markov-chain model to build a 'dictionary.' The name of each dictionary entry is the input word; the entry itself is a named vector of probabilities, where each name is a candidate output word and the values are sorted from highest to lowest probability.

Below are the first few lines of the dictionary entry named apple, printed in table format for easy reading.

           apple
and   0.05120232
store 0.04104478
pie   0.04104478
cider 0.03565506
is    0.03130182
has   0.02611940
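
As a rough illustration of how a dictionary of this shape could be assembled, the sketch below counts bigrams in a tiny made-up corpus and converts the counts into sorted probabilities. The corpus, the function name build_dictionary, and the tokenization rule are all assumptions for illustration; the app's actual training code may differ.

```r
# Hypothetical toy corpus; the real app was trained on the SwiftKey data.
corpus <- c("apple pie is good",
            "apple cider is good",
            "apple pie and apple store")

# build_dictionary is an illustrative name, not taken from the app's source.
build_dictionary <- function(lines) {
  # Split each line into lowercase word tokens (assumed tokenization rule).
  tokens <- strsplit(tolower(lines), "[^a-z']+")
  # Collect (first word, next word) pairs within each line.
  pairs <- do.call(rbind, lapply(tokens, function(w) {
    w <- w[nzchar(w)]
    if (length(w) < 2) return(NULL)
    cbind(first = w[-length(w)], nxt = w[-1])
  }))
  # For each first word, turn next-word counts into sorted probabilities.
  lapply(split(pairs[, "nxt"], pairs[, "first"]), function(nxt) {
    counts <- table(nxt)
    probs <- as.numeric(counts) / sum(counts)
    names(probs) <- names(counts)
    sort(probs, decreasing = TRUE)
  })
}

dictionary <- build_dictionary(corpus)
print(dictionary[["apple"]])
```

Each element of the resulting list is a named numeric vector that sums to 1, matching the entry layout shown for apple.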

This type of structure makes retrieving entries fast and simple. Because each entry is named, it can be retrieved directly by name, without writing a search function that scans all entries looking for a match.
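
A minimal sketch of lookup by name, assuming the dictionary is an ordinary R named list. The function predict_next and its n argument (standing in for the output-length slider) are hypothetical names; the apple values are rounded from the table above.

```r
# One-entry dictionary in the shape described above (values rounded).
dictionary <- list(
  apple = c(and = 0.0512, store = 0.0410, pie = 0.0410,
            cider = 0.0357, is = 0.0313, has = 0.0261)
)

# predict_next is an illustrative helper, not the app's actual function.
predict_next <- function(word, dictionary, n = 5) {
  entry <- dictionary[[word]]              # retrieve the entry by name
  if (is.null(entry)) return(character(0)) # word not in the dictionary
  names(entry)[seq_len(min(n, length(entry)))]
}

print(predict_next("apple", dictionary, n = 3))
print(predict_next("zebra", dictionary))
```

Because the probabilities are stored already sorted, returning the top n words only requires taking the first n names.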

Unknowns and Errors

When a user inputs an unknown word, the app uses the last known word as input instead. For example, if the user accidentally enters “Christmas trer” instead of “Christmas tree,” the algorithm skips the unknown word “trer” and uses “Christmas” as the input instead.
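
The fallback described above could be sketched as follows. This assumes dictionary keys are stored in lowercase and that input is split on whitespace; the function name last_known_word and the toy probabilities are made up for illustration.

```r
# Toy dictionary with made-up probabilities, keyed by lowercase words.
dictionary <- list(
  christmas = c(tree = 0.3, eve = 0.2),
  tree      = c(is = 0.1)
)

# last_known_word is a hypothetical helper name.
last_known_word <- function(input, dictionary) {
  # Split the input into lowercase words.
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  # Keep only words that have a dictionary entry.
  known <- words[words %in% names(dictionary)]
  if (length(known) == 0) return(NA_character_)
  known[length(known)]                     # most recent known word
}

print(last_known_word("Christmas trer", dictionary))
print(last_known_word("Christmas tree", dictionary))
```

With the misspelled input, "trer" has no entry, so the function falls back to "christmas"; with the correct input it returns "tree".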

[Screenshot: unknown-word example]

Future Developments

Although the app is already deployed and usable, there are several options for future development to improve its overall performance. They include:

  • adding trigram and 4-gram probabilities to the model
  • expanding dictionary vocabulary
  • deploying models for languages other than English
  • adjusting input and layout for ease on small screens
  • exploring alternative models such as a recurrent neural network (RNN)

For further questions on this application and its development, visit the GitHub repository here.