8/31/2017

Project Description

We have developed a "proof-of-concept" text prediction engine that draws from social media sources and predicts the next word as a user types, with a goal of improving speed, efficiency, and accuracy in online communication and social media collaboration.

Proof-of-Concept

  • A Shiny R App has been created to prove out the approach
  • The model is simple and performant, leveraging an n-gram / back-off algorithm based on social media feeds
  • The model has been successfully validated with independent data

Data Model

Over two million records were cleaned and converted to n-grams (unigrams through 5-grams). To minimize real-time processing and improve performance, we serve predictions through lookups against a data table. These lookups are simple and fast, and they scale.
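As a rough illustration of the lookup structure described above, the following Python sketch builds a small context-to-next-word count table. It is not the app's actual R code: the function name `build_ngram_table`, the whitespace tokenizer, and the sample phrases are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_ngram_table(sentences, max_n=5):
    """Map each context (tuple of preceding words) to a Counter of next words.

    An n-gram of length n contributes a context of n - 1 words, so unigrams
    are stored under the empty context ().
    """
    table = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()  # naive tokenizer (assumption)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                *context, next_word = words[i:i + n]
                table[tuple(context)][next_word] += 1
    return table

# Hypothetical sample data, standing in for the cleaned social media records.
table = build_ngram_table(["see you later", "see you soon", "see you later today"])
print(table[("see", "you")].most_common(2))
```

Because every prediction becomes a single keyed lookup, the cost at typing time stays small regardless of how much text was used to build the table.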

The model uses a simple n-gram / back-off algorithm against this data table to predict user input:

  1. The model parses the last 4 words of the user's input string and compares them to an in-memory 5-gram table, showing the 3 most probable predictions.
  2. If the result contains fewer than 3 predictions, the algorithm "backs off" from 5-grams to 4-grams and searches for the most probable 4-grams based on the user's last 3 words.
  3. If the result still contains fewer than 3 predictions, the algorithm continues to back off to smaller n-grams (3-grams, then 2-grams, then unigrams).
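The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the app's R implementation: the `NGRAMS` table and the `predict` function are hypothetical, and the sketch assumes that predictions found at a higher order are kept while the remaining slots are filled from shorter contexts.

```python
from collections import Counter

# Hypothetical toy table: context tuple -> Counter of next-word counts.
NGRAMS = {
    ("thanks", "for", "the"): Counter({"follow": 3, "rt": 1}),
    ("for", "the"): Counter({"follow": 4, "first": 2, "rt": 1}),
    ("the",): Counter({"best": 5, "first": 3}),
    (): Counter({"the": 9, "to": 8, "and": 7}),  # unigram fallback
}

def predict(text, table, k=3, max_context=4):
    """Return up to k next-word predictions, backing off to shorter contexts."""
    words = text.lower().split()
    predictions = []
    # Start with the longest available context and shrink it one word at a time.
    for size in range(min(max_context, len(words)), -1, -1):
        context = tuple(words[len(words) - size:])
        for word, _count in table.get(context, Counter()).most_common():
            if word not in predictions:
                predictions.append(word)
            if len(predictions) == k:
                return predictions
    return predictions

print(predict("Thanks for the", NGRAMS))
```

In this toy run the three-word context yields only two candidates, so the third slot is filled from the two-word context.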

Accuracy and Feedback

  • Tested against two comparable validation sets (10,000 phrases each), the model predicted the correct word about 15% of the time with its first prediction and about 24% of the time within its top three predictions.
                                         Validation Set 1   Validation Set 2
    First Word Prediction Success (%)          14.4               15.9
    Top Three Prediction Success (%)           23.3               24.8
  • To remain relevant and accurate, the model must continually adapt as user inputs and styles of communication evolve. To that end, the model is set up to accept feedback from actual users to update the data.
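For readers who want to reproduce success rates like the ones reported above, here is one way the two measures could be computed. It is a sketch under stated assumptions: the `evaluate` function is hypothetical, it holds out only the last word of each phrase (the original test procedure may differ), and the stub predictor and phrases are invented for the example.

```python
def evaluate(phrases, predict_fn):
    """Return (first-prediction %, top-three %) over a list of phrases.

    Assumption: the last word of each phrase is held out as the target and
    the preceding words are passed to predict_fn as the typed text.
    """
    first_hits = top3_hits = total = 0
    for phrase in phrases:
        words = phrase.lower().split()
        if len(words) < 2:
            continue  # nothing to predict from
        *context, target = words
        guesses = predict_fn(" ".join(context))[:3]
        total += 1
        first_hits += guesses[:1] == [target]
        top3_hits += target in guesses
    if total == 0:
        return 0.0, 0.0
    return 100 * first_hits / total, 100 * top3_hits / total

# Stub predictor that always returns the same three words (illustration only).
stub = lambda text: ["you", "the", "to"]
first_pct, top3_pct = evaluate(["thank you", "going to", "see it", "love the"], stub)
print(first_pct, top3_pct)
```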

Using the Application

The Shiny App is meant to be simple:

  • Text is entered and submitted with a button.
  • The top prediction is displayed alongside the typed text.
  • The next two most likely words are shown as well.
  • To provide feedback for this pilot program, the associate can click on the word that they intended to use or enter it in a text box if it did not appear in the top three.
  • This combined phrase is then added to a database to use in the next version of the model.
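The feedback step in the last two bullets could be captured along these lines. This is only a sketch: the source does not say which database the pilot uses, so SQLite, the `feedback` table, and the function names `open_feedback_db` and `record_feedback` are assumptions.

```python
import sqlite3

def open_feedback_db(path=":memory:"):
    """Open (or create) a feedback store; an in-memory database by default."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS feedback (phrase TEXT NOT NULL)")
    return conn

def record_feedback(conn, typed_text, chosen_word):
    """Store the typed text plus the word the user actually intended."""
    phrase = f"{typed_text.strip()} {chosen_word.strip()}".strip()
    conn.execute("INSERT INTO feedback (phrase) VALUES (?)", (phrase,))
    conn.commit()
    return phrase

conn = open_feedback_db()
record_feedback(conn, "thanks for the", "follow")
rows = conn.execute("SELECT phrase FROM feedback").fetchall()
print(rows)
```

Storing the full combined phrase, rather than only the chosen word, keeps the context needed to regenerate n-grams when the next version of the model is built.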