2025-02-20

Description

The Shiny interactive web app prompts a user to input text. It then predicts the next word based on the last few words input. The prediction is obtained using the Simple Backoff method.

The implementation first looks for the highest probability word grouping that begins with the last three words input by the user. If no match is found, it backs off to the last two words and then, if necessary, to only the final word. If no words match, the method returns the most common word, "the".
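
The backoff lookup described above can be sketched as follows. This is a minimal illustration in Python, not the app's actual source: the function names (build_tables, predict_next), the toy corpus, and the use of raw follower counts as a stand-in for probability are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_context=3):
    """Count which word follows each context of 1..max_context words."""
    tables = defaultdict(Counter)
    for i in range(1, len(tokens)):
        for n in range(1, max_context + 1):
            if i - n < 0:
                break  # not enough preceding words for a context this long
            context = tuple(tokens[i - n:i])
            tables[context][tokens[i]] += 1
    return tables

def predict_next(text, tables, fallback="the", max_context=3):
    """Try the last 3 words, then 2, then 1; fall back to a common word."""
    words = text.lower().split()
    for n in range(min(max_context, len(words)), 0, -1):
        followers = tables.get(tuple(words[-n:]))
        if followers:
            # Most frequent follower of this context
            return followers.most_common(1)[0][0]
    return fallback

# Toy corpus (hypothetical); the real app is built from a much larger text source.
corpus = ("thank you for your consideration "
          "thank you for your time "
          "thank you for your time").split()
tables = build_tables(corpus)

print(predict_next("you for your", tables))   # three-word context matches
print(predict_next("zzz for your", tables))   # backs off to the last two words
print(predict_next("xyz", tables))            # nothing matches, returns fallback
```

For a fixed context, choosing the follower with the highest count is equivalent to choosing the one with the highest conditional probability, so no normalization is needed for a single best guess.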

General functionality is summarized below.

  • Predicts next word using the final 1-3 words input by the user
  • Spell checks all user inputs and suggests a replacement
  • Displays a few common words and word combinations
  • Filters profanity and obscenities
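
The spelling suggestion feature listed above could work along these lines. The app's actual spell checker is not described here, so this is only a sketch under assumptions: it uses Python's standard difflib module for approximate matching, and the vocabulary, the suggest function, and the 0.6 similarity cutoff are hypothetical.

```python
import difflib

# Hypothetical vocabulary; the app's dictionary is far larger.
VOCAB = ["prediction", "predict", "word", "the"]

def suggest(word, vocab=VOCAB, cutoff=0.6):
    """Return the closest known word, or None if no replacement is needed or found."""
    word = word.lower()
    if word in vocab:
        return None  # already spelled correctly
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(suggest("predction"))  # close to a known word, so a replacement is suggested
print(suggest("word"))       # known word, no suggestion
```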

Performance

The prototype was built from a very small source of text, ~82,000 unique English words.

  • Using only the last word yielded ~2% prediction accuracy (tested on an unseen set of words)
  • Using the last 3 words yielded ~10% prediction accuracy (tested on an unseen set of words)
  • Rare word combinations were removed to reduce server load, resulting in ~37,000 unique words
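
The pruning and accuracy measurements above can be illustrated with a small sketch. It is not the app's evaluation code: for brevity it uses only the last word as context, and the function names, the min_count threshold, and the toy data are assumptions.

```python
from collections import Counter, defaultdict

def count_followers(tokens):
    """Count which word follows each single preceding word."""
    followers = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        followers[prev][nxt] += 1
    return followers

def prune(followers, min_count=2):
    """Drop word pairs seen fewer than min_count times (hypothetical threshold)."""
    pruned = {}
    for prev, counts in followers.items():
        kept = Counter({w: c for w, c in counts.items() if c >= min_count})
        if kept:
            pruned[prev] = kept
    return pruned

def accuracy(followers, test_tokens, fallback="the"):
    """Fraction of held-out words whose top prediction matches the actual next word."""
    pairs = list(zip(test_tokens, test_tokens[1:]))
    if not pairs:
        return 0.0
    hits = 0
    for prev, actual in pairs:
        counts = followers.get(prev)
        guess = counts.most_common(1)[0][0] if counts else fallback
        hits += guess == actual
    return hits / len(pairs)

# Toy data (hypothetical)
train = "the cat sat on the mat the cat ran".split()
held_out = "sat on the cat".split()

full = count_followers(train)
small = prune(full, min_count=2)

print(len(full), len(small))                       # contexts kept before and after pruning
print(accuracy(full, held_out), accuracy(small, held_out))
```

Pruning shrinks the lookup table at the cost of some accuracy, which is the trade-off made to reduce server load.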

These factors frame the seemingly low accuracy. However, even in its basic form the app detects a number of common words and phrases. These performance statistics reinforce the notion that we can somewhat reliably predict common words using a relatively small set of words.

Possible Improvements

Stanford professor Andre Miranda stated that, for large datasets, the Simple Backoff method has demonstrated accuracy on par with more complex models.

The current dictionary of words in the app uses only ~10 MB of space. The dictionary can be expanded considerably, and doing so would greatly improve accuracy.

Thus, many limitations can be addressed with a larger source of text. Other options include:

  • Use of longer word sequences for prediction (e.g. last 4 words, 5 words, etc.)
  • Use of more sophisticated probability models to predict words (e.g. Kneser-Ney smoothing)
  • Identifying more sophisticated prediction methods when user input does not exist within the app dictionary

Use Cases/Extensions

Potential use cases include:

  • Auto-complete features
  • Spell checking

Potential extensions of functionality include:

  • Predicting multiple words
  • Suggesting multiple spelling replacements
  • Suggesting synonyms or other similar words

These cases and extensions apply to scenarios such as typing and texting. Please feel free to suggest other use cases/extensions, or algorithmic improvements. Thank you for your consideration.