Text Prediction Project

Brian Crilly
12/15/2021

The Challenge

Create a text prediction application

  • Predict the word that comes next following a given sequence of words
  • The target device could be a mobile phone
    • Balance accuracy, computational load, and memory requirements

The Approach

Markov Chain with Backoff

  • Parse trigrams from sample data (3-word sequences)
  • Use trigram frequencies to predict the third word based on the last two words given
    • If no trigram matches, back off to bigrams (2-word sequences) and predict from the last word only
    • If no bigram matches either, fall back to the most frequent unigrams (individual words)
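
The backoff idea above can be sketched in a few lines. This is a minimal illustration in Python, not the project's actual code (the demo itself was built with Shiny); the function names `build_model` and `predict_next` are hypothetical, and the sketch simply returns the single most frequent continuation at each level.

```python
from collections import Counter, defaultdict

def build_model(tokens):
    """Count unigrams plus next-word frequencies for 1- and 2-word contexts."""
    unigrams = Counter(tokens)
    bigrams = defaultdict(Counter)   # last word -> Counter of next words
    trigrams = defaultdict(Counter)  # (word1, word2) -> Counter of next words
    for i in range(len(tokens) - 1):
        bigrams[tokens[i]][tokens[i + 1]] += 1
    for i in range(len(tokens) - 2):
        trigrams[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1
    return unigrams, bigrams, trigrams

def predict_next(words, model):
    """Try the trigram table, then back off to bigrams, then to unigrams."""
    unigrams, bigrams, trigrams = model
    if len(words) >= 2:
        candidates = trigrams.get((words[-2], words[-1]))
        if candidates:
            return candidates.most_common(1)[0][0]
    if len(words) >= 1:
        candidates = bigrams.get(words[-1])
        if candidates:
            return candidates.most_common(1)[0][0]
    if unigrams:
        # Last resort: the most frequent word overall.
        return unigrams.most_common(1)[0][0]
    return None
```

For example, with a model built from "the cat sat on the mat the cat sat down", the context "the cat" is answered from the trigram table, while an unseen pair ending in "the" falls back to the bigram table.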

The Steps

Use data supplied for the project (Blogs, News Articles, Twitter Posts)

  • Split the data into a training set (70%) and a test set (30%)
  • Clean the data
    • Remove symbols, punctuation, numbers, and profanity
    • Convert to lower case
  • Create unigrams, bigrams, and trigrams
  • Drop n-grams that occur infrequently
    • A smaller model trades some accuracy for lower computational load and memory use
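
As a rough sketch of the steps above, assuming one text sample per line: the Python below is illustrative only and is not the project's own pipeline. The helper names, the fixed random seed, and the one-word `BLOCKLIST` are placeholders, and the minimum-count threshold is a free parameter rather than a value taken from the project.

```python
import random
import re
from collections import Counter

BLOCKLIST = {"darn"}  # hypothetical stand-in for a real profanity list

def split_lines(lines, train_fraction=0.7, seed=42):
    """Shuffle the lines and split them into training and test sets."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def clean(line):
    """Lower-case, drop symbols/punctuation/numbers, and filter blocked words."""
    line = re.sub(r"[^a-z'\s]", " ", line.lower())
    words = [w.strip("'") for w in line.split()]
    return [w for w in words if w and w not in BLOCKLIST]

def count_ngrams(token_lists, n):
    """Count n-grams within each cleaned line (no n-grams span two lines)."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def prune(counts, min_count):
    """Keep only n-grams seen at least min_count times."""
    return Counter({g: c for g, c in counts.items() if c >= min_count})
```

Raising `min_count` shrinks the tables, which is the lever for trading accuracy against size.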

Analysis

Find the balance between prediction accuracy and file size

  • The graph below helps balance prediction accuracy against computational complexity and memory requirements

[Figure: Accuracy vs. File Size]
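
One way such a curve could be produced is to sweep the pruning threshold and record model size and test accuracy at each setting. The Python sketch below is an assumption about the method, not the project's analysis code: it uses trigrams only (no backoff), measures size as the number of stored trigrams rather than bytes on disk, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_trigram_table(tokens, min_count):
    """Build (w1, w2) -> Counter(next word), keeping trigrams seen >= min_count times."""
    counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    table = defaultdict(Counter)
    for (w1, w2, w3), c in counts.items():
        if c >= min_count:
            table[(w1, w2)][w3] = c
    return table

def accuracy(table, test_tokens):
    """Fraction of test trigrams whose third word is the top prediction."""
    hits = total = 0
    for w1, w2, w3 in zip(test_tokens, test_tokens[1:], test_tokens[2:]):
        total += 1
        candidates = table.get((w1, w2))
        if candidates and candidates.most_common(1)[0][0] == w3:
            hits += 1
    return hits / total if total else 0.0

def sweep(train_tokens, test_tokens, thresholds):
    """Return (threshold, stored trigram count, accuracy) for each threshold."""
    rows = []
    for t in thresholds:
        table = train_trigram_table(train_tokens, t)
        size = sum(len(c) for c in table.values())
        rows.append((t, size, accuracy(table, test_tokens)))
    return rows
```

Plotting accuracy against size for each row gives a curve from which a threshold suited to the target device can be chosen.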

The Application

Shiny was used to create a demonstration application

  • The demo application can be found here: https://cartan.shinyapps.io/TextPredict/
  • As the user types in the text box, up to three suggested words are presented
    • If the user selects a predicted word, it is automatically appended to the text string
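
The suggestion behaviour described above can be approximated as follows. This Python sketch is not the Shiny app's code; `suggest` and `accept` are hypothetical names, and for brevity it looks up only the last typed word, topping up from a fallback list of common words when fewer than three candidates exist.

```python
from collections import Counter

def suggest(text, next_words, fallback, k=3):
    """Return up to k suggested next words for the text typed so far.

    next_words maps a word to a Counter of the words seen after it;
    fallback is a list of common words used to fill any remaining slots.
    """
    words = text.lower().split()
    ranked = []
    if words:
        candidates = next_words.get(words[-1], Counter())
        ranked = [w for w, _ in candidates.most_common(k)]
    for w in fallback:
        if len(ranked) >= k:
            break
        if w not in ranked:
            ranked.append(w)
    return ranked[:k]

def accept(text, word):
    """Append the chosen suggestion to the text, followed by a space."""
    return (text.rstrip() + " " + word).lstrip() + " "
```

Calling `suggest` on every keystroke and `accept` when a suggestion is chosen mirrors the flow described for the demo.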

[Figure: App Screenshot]