Text Prediction Presentation

Aliza Shoop
July 11, 2018

Background

This project set out to build a library of text from which a word prediction app can make educated guesses about the next word. The project involved the following sources:

App Summary

  • The app uses functions from the following libraries to perform all of the tasks.

– tm, dplyr, stringr, stylo

  • First, the function strips all numbers and punctuation from the input text and converts it to lower case, so that it matches the dictionaries, which have been cleaned the same way.
  • Next, the function searches the sorted n-gram dictionaries, starting with the longest one that the entered phrase allows, and looks for entries that begin with the same terms.
  • Finally, the function returns the top 5 terms that followed the user's input.
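
The three steps above can be sketched as follows. This is a minimal illustration in Python, not the app's own R code: the names clean_text and top_terms, the toy 3-gram dictionary, and the choice to keep only the letters a-z are all assumptions made for the example.

```python
import re
from collections import Counter


def clean_text(text: str) -> list[str]:
    """Lower-case the text, delete digits and punctuation, split into words."""
    text = text.lower()
    # Assumption: keep only a-z and whitespace; everything else is deleted.
    text = re.sub(r"[^a-z\s]", "", text)
    return text.split()


def top_terms(phrase: str,
              ngram_counts: dict[tuple[str, ...], int],
              k: int = 5) -> list[str]:
    """Return up to k most frequent words that follow `phrase`
    in a single n-gram dictionary (n-gram tuple -> frequency)."""
    words = tuple(clean_text(phrase))
    followers = Counter()
    for ngram, count in ngram_counts.items():
        # An entry matches when everything but its last word equals the phrase.
        if ngram[:-1] == words:
            followers[ngram[-1]] += count
    return [word for word, _ in followers.most_common(k)]


# Hypothetical toy 3-gram dictionary, for illustration only.
trigrams = {
    ("thanks", "for", "the"): 12,
    ("thanks", "for", "your"): 7,
    ("thanks", "for", "all"): 3,
    ("one", "of", "the"): 20,
}

print(top_terms("Thanks, for 2", trigrams))
```

Because the input and the dictionary go through the same cleaning, "Thanks, for 2" reduces to the words "thanks" and "for" before the lookup.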

Data Exploration

Data

Total Lines

Breakdown of the lines for each dataset

N-Grams

3-Grams

Summary of the top 3-grams in combined sampled data

Solutions & Accuracy

The final model uses a backoff approach: it matches the words in the input string against entries in an n-gram library and picks the next word by highest frequency. If no match is found in the longest n-gram library (6-grams), the function moves to the 5-gram library, and so on down to the 2-gram library, stopping as soon as a match is found. Accuracy is higher with a larger dataset; however, in the interest of saving time and space, the app uses a smaller one. At times the app is unable to find a prediction, most likely because the n-gram libraries contain too little data. The next step would be to add more data and drop the entries with very low frequencies.
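
A minimal sketch of the backoff lookup described above, written in Python for illustration rather than taken from the app's R code. The function names build_dictionaries and predict_backoff, the table layout, and the sample sentences are hypothetical; the input is assumed to be cleaned already.

```python
from collections import Counter, defaultdict


def build_dictionaries(sentences, max_n=6):
    """For each n in 2..max_n, map every (n-1)-word prefix to a Counter
    of the words that followed it in the (already cleaned) sentences."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        words = sentence.split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                prefix = tuple(words[i:i + n - 1])
                tables[n][prefix][words[i + n - 1]] += 1
    return tables


def predict_backoff(words, tables, max_n=6, k=5):
    """Try the longest usable n-gram table first, then back off to
    shorter ones; return up to k next words, or [] if nothing matches."""
    # An n-gram lookup needs n-1 words of context, so start no higher
    # than len(words) + 1.
    for n in range(min(max_n, len(words) + 1), 1, -1):
        prefix = tuple(words[-(n - 1):])
        followers = tables[n].get(prefix)
        if followers:
            return [word for word, _ in followers.most_common(k)]
    return []


# Hypothetical sample data, for illustration only.
sentences = [
    "i want to go home",
    "i want to go out",
    "i want to go home now",
    "we want to eat",
]
tables = build_dictionaries(sentences)

print(predict_backoff("i want to go".split(), tables))  # matched in the 5-gram table
print(predict_backoff("they want to".split(), tables))  # backs off to the 3-gram table
print(predict_backoff(["zebra"], tables))               # no match in any table
```

An empty result corresponds to the case in which the app cannot find a prediction.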

Resources
Link to data exploration here
Link to app here

Shiny App

This is the Shiny app created as a result of this project. It attempts to predict likely next words based on user input.

  1. The user enters a phrase in the box and clicks the button.
  2. The app will run the word prediction function and attempt to return likely terms that may follow the user's phrase.
  3. If no match is found, the app will return “No prediction found.”