Text Prediction Project

Brian Crilly
12/15/2021

Create a text prediction application

Predict the word that comes next following a given sequence of words
The target device could be a mobile phone
- Balance accuracy, computational load, and memory requirements

Markov Chain with Backoff

Parse trigrams from sample data (3-word sequences)
Use trigram frequencies to predict the third word based on the last two words given
- If no trigram yields a prediction, back off to bigrams (2-word sequences) and predict from the last word only
- If no bigram yields a prediction either, back off to unigrams (individual words)
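
The backoff steps above can be sketched in a few lines. This is a minimal illustration in Python, not the project's actual code; the names `build_model` and `predict`, and the choice to return up to `k` candidates ranked by raw frequency, are assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_model(tokens):
    """Count which words follow each 2-word and 1-word context."""
    trigrams = defaultdict(Counter)  # (w1, w2) -> Counter of next words
    bigrams = defaultdict(Counter)   # w1 -> Counter of next words
    unigrams = Counter(tokens)       # overall word frequencies
    for i in range(len(tokens) - 1):
        bigrams[tokens[i]][tokens[i + 1]] += 1
        if i + 2 < len(tokens):
            trigrams[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1
    return trigrams, bigrams, unigrams

def predict(model, words, k=3):
    """Return up to k next-word candidates: trigram, then bigram, then unigram."""
    trigrams, bigrams, unigrams = model
    # Step 1: try the last two words as a trigram context.
    if len(words) >= 2 and tuple(words[-2:]) in trigrams:
        return [w for w, _ in trigrams[tuple(words[-2:])].most_common(k)]
    # Step 2: back off to the last word as a bigram context.
    if len(words) >= 1 and words[-1] in bigrams:
        return [w for w, _ in bigrams[words[-1]].most_common(k)]
    # Step 3: back off to the most frequent words overall.
    return [w for w, _ in unigrams.most_common(k)]

# Toy usage (the sentence is made up for illustration).
model = build_model("the cat sat on the mat the cat sat down".split())
print(predict(model, ["the", "cat"]))  # trigram context is known
print(predict(model, ["big", "cat"]))  # unseen pair, backs off to the bigram "cat"
print(predict(model, ["zebra"]))       # unseen word, backs off to unigrams
```

Storing each level as a lookup from context to next-word counts keeps prediction to a handful of dictionary lookups, which suits a memory-constrained target such as a phone.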

Use data supplied for the project (Blogs, News Articles, Twitter Posts)

Split the data into a training set (70%) and a test set (30%)
Clean the data
- Remove symbols, punctuation, numbers, and profanity
- Convert to lower case
Create unigrams, bigrams, and trigrams
Drop low-frequency n-grams
- Helps to balance accuracy against computational complexity and memory requirements
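
As a rough sketch of the preparation steps above, written in Python for illustration rather than taken from the project: the helper names, the random seed, the one-word `PROFANITY` placeholder, and the decision to delete apostrophes before stripping other symbols are all assumptions.

```python
import random
import re
from collections import Counter

PROFANITY = {"darn"}  # hypothetical placeholder; a real list would be much longer

def split_lines(lines, train_frac=0.7, seed=0):
    """Shuffle the lines and split them into training and test sets."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def clean(text, banned=PROFANITY):
    """Lower-case, keep only letters and spaces, then drop banned words."""
    text = text.lower().replace("'", "")   # "don't" -> "dont" (assumed choice)
    text = re.sub(r"[^a-z\s]", " ", text)  # symbols, punctuation, digits -> space
    return [w for w in text.split() if w not in banned]

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def prune(counts, min_count):
    """Keep only n-grams seen at least min_count times."""
    return Counter({g: c for g, c in counts.items() if c >= min_count})

# Toy usage on a made-up line of text.
tokens = clean("Hello, World! 123 darn hello world")
bigrams = ngram_counts(tokens, 2)
print(tokens)
print(prune(bigrams, min_count=2))
```

Raising `min_count` shrinks the stored tables at the cost of dropping rare continuations, which is the trade-off the pruning step is meant to control.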

Find the balance between prediction accuracy and file size

The graph below helps balance prediction accuracy against computational complexity and memory requirements

Accuracy vs. File Size
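
A curve of this kind could be produced by sweeping the pruning threshold and recording model size and accuracy at each setting. The sketch below is a simplified Python illustration under stated assumptions, not the method behind the graph: it scores the trigram level only (no backoff), counts a prediction as correct when the true word is among the top three suggestions, and uses the number of stored trigrams as a stand-in for file size. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def trigram_table(tokens, min_count):
    """Map each 2-word context to its next-word counts, dropping rare trigrams."""
    counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    table = defaultdict(Counter)
    for (w1, w2, w3), c in counts.items():
        if c >= min_count:
            table[(w1, w2)][w3] = c
    return table

def top_k_accuracy(table, test_tokens, k=3):
    """Fraction of test trigrams whose third word is among the top k guesses."""
    hits = total = 0
    for w1, w2, w3 in zip(test_tokens, test_tokens[1:], test_tokens[2:]):
        total += 1
        if (w1, w2) in table:
            guesses = [w for w, _ in table[(w1, w2)].most_common(k)]
            hits += w3 in guesses
    return hits / total if total else 0.0

def sweep(train_tokens, test_tokens, thresholds=(1, 2, 3)):
    """Return (threshold, stored trigrams, accuracy) for each pruning threshold."""
    rows = []
    for t in thresholds:
        table = trigram_table(train_tokens, t)
        size = sum(len(nexts) for nexts in table.values())  # proxy for file size
        rows.append((t, size, top_k_accuracy(table, test_tokens)))
    return rows

# Toy usage on made-up tokens; real corpora would be far larger.
for row in sweep("a b c a b c a b d".split(), "a b d".split(), thresholds=(1, 2)):
    print(row)
```

Plotting accuracy against size for each threshold gives one point per setting, and the operating point is chosen where further growth in size buys little additional accuracy.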

Shiny was used to create a demonstration application

The demo application can be found here: https://cartan.shinyapps.io/TextPredict/
As the user types in the text box, up to three suggested words are presented
- If the user selects a predicted word, it is automatically appended to the text string

App Screenshot