Incredible Word Oracle

Tyler Byers
December 2014

The word prediction app originally built for the Coursera/Johns Hopkins University Data Science Capstone project

Contact: tybyers@gmail.com

Why a Word Prediction Machine?

This app, deployed on the R Shiny Server website, is a prototype that demonstrates the capabilities of the Word Oracle prediction machine. It is not meant to be a stand-alone product. If accepted and developed further, the prediction machine could be useful for the following:

  • Fill in “garbled” words in voice-to-text transcriptions (potential market: dictation devices for doctors' offices).
  • Deliver faster results for online marketplace searches (app deployed behind the scenes, with likely results “queued up” based on what users are typing).
  • Provide suggestions to a company's employees as they type in the intranet search bar (higher in-house productivity).
  • Speed up customer service: deploy this predictor for our customer service reps in online chats to allow faster communication with customers.

Application View

Application url: https://tybyers.shinyapps.io/wordoracle/.
Note: Initial chart and table load may take several seconds.

[Screenshot of the application]

Example: The user enters “I love the Denver” into the text box and clicks the “Get Score Charts” button; the top result is “broncos”! The top result and the other, less probable results are shown with their corresponding “scores” in the table and chart.

Prediction Machine -- Data

Text data for prediction are derived from news articles and tweets in English supplied in the HC Corpora corpus (http://www.corpora.heliohost.org/). The data were cleaned and put into a look-up table using the following process.

Data cleaning:

  • Used 6% of each of the news and tweet data sets. A better model would use more data, but resources were limited for this project.
  • Removed punctuation, capitalization, extra whitespace, numbers, and the “seven dirty words” (profanity).
  • Converted all characters to ASCII.
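
The cleaning steps above can be sketched in a few lines. This is only an illustration in Python, not the code used for the project: the function name `clean_line`, the placeholder blocklist, and the order of the steps are all assumptions, and the actual word list is not reproduced here.

```python
import re
import unicodedata

# Placeholder blocklist (assumption): stands in for the profanity list,
# which is not reproduced in this sketch.
BLOCKED_WORDS = {"badword"}

def clean_line(text, blocked=BLOCKED_WORDS):
    """Return a lower-case, ASCII-only, letters-only version of one line."""
    # Convert to ASCII: split accented characters into base letter + accent,
    # then drop everything that cannot be encoded as ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Remove capitalization.
    text = text.lower()
    # Remove punctuation and numbers (anything that is not a letter or whitespace).
    text = re.sub(r"[^a-z\s]", "", text)
    # Drop blocked words; split/join also collapses extra whitespace.
    words = [w for w in text.split() if w not in blocked]
    return " ".join(words)
```

For example, `clean_line("I LOVE the Café, 2014!!")` gives `"i love the cafe"`.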

Tokenization:

  • Used RWeka package to create 4-grams.
  • Put 4-grams into a table with 4 columns, one word per column, and 3.16 million rows.
  • Eliminated rows that contained words appearing fewer than 10 times in the table. This results in a 2.51-million-row table for look-up prediction.
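
The tokenization steps above can be illustrated as follows. The project used the RWeka package in R; this Python sketch only mirrors the idea. The names `make_ngrams` and `build_table` are hypothetical, and counting a word's appearances across every cell of the table is one reading of “appeared fewer than 10 times in the table”.

```python
from collections import Counter

def make_ngrams(line, n=4):
    """Split a cleaned line into overlapping n-grams, one word per position."""
    words = line.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_table(lines, n=4, min_count=10):
    """Build the n-gram table, then drop rows containing rare words."""
    rows = [gram for line in lines for gram in make_ngrams(line, n)]
    # Assumption: a word's frequency is its number of appearances
    # across all cells (all rows and columns) of the unfiltered table.
    word_counts = Counter(word for row in rows for word in row)
    return [row for row in rows if all(word_counts[w] >= min_count for w in row)]
```

Each row of the result is a 4-tuple, so “column 1” in the text corresponds to `row[0]` and “column 4” to `row[3]`.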

Prediction/Scoring Algorithm

The Word Oracle uses a look-up algorithm to determine the most likely word(s) for prediction, drawing on the 2.51-million-row 4-gram table. The algorithm works fairly quickly, and accuracy should improve if more data can be used. The algorithm is explained below. At this time, all results are returned as lower-case words with no punctuation or numerals.

If the user enters 3 or more words into the text box:

  • The predictor takes the last 3 words entered (a 3-gram), finds matching 3-grams in columns 1-3 of the 4-gram table, and outputs the best match(es) for the final word from column 4.
  • If no matching 3-gram is found, it searches for matching 2-grams using the last 2 words entered. Matches are sought in columns 2 and 3 of the 4-gram table, and predicted words come from column 4.
  • If no matching 2-gram is found, it searches for matching 1-grams using the last word entered. Matches are sought in column 3, and predicted words come from column 4.
  • If no matching 1-gram is found, it outputs a random word from the table (picks a random row and column and outputs the word in that cell).
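
The back-off steps above can be sketched as a single loop. This is a minimal Python illustration, not the app's code: the function name `predict`, the list-of-tuples table, and ranking candidates by how often they occur among the matching rows are assumptions (the text only says “best match(es)”).

```python
import random
from collections import Counter

def predict(words, table, top_n=5, rng=random):
    """words: cleaned input words (3 or more); table: list of 4-tuples."""
    last = tuple(words[-3:])
    # Back off from 3 context words to 2, then 1. A context of k words is
    # compared with the k columns immediately before column 4.
    for k in (3, 2, 1):
        context = last[-k:]
        counts = Counter(row[3] for row in table if row[3 - k:3] == context)
        if counts:
            # Assumption: "best" means most frequent among the matching rows.
            return [word for word, _ in counts.most_common(top_n)]
    # No match at any level: output the word in a random cell of the table.
    row = rng.choice(table)
    return [rng.choice(row)]
```

A linear scan like this is fine for a toy table; for millions of rows, an index keyed on the context words would be the more practical design.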

If the user enters fewer than 3 words into the text box:

  • The predictor assumes the text entered is the beginning of a sentence.
  • It searches for matching 2-grams in columns 1 and 2 (predicting from column 3) or for matching 1-grams in column 1 (predicting from column 2).
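
The short-input case can be sketched the same way. Again this is an illustrative Python sketch with a hypothetical function name, `predict_short`; frequency ranking is an assumption, and because the text does not say what happens when nothing matches, the sketch simply returns an empty list.

```python
from collections import Counter

def predict_short(words, table, top_n=5):
    """words: 1 or 2 cleaned input words; table: list of 4-tuples."""
    k = len(words)
    if k not in (1, 2):
        raise ValueError("predict_short expects 1 or 2 words")
    context = tuple(words)
    # With k words entered, match columns 1..k and predict from column k+1.
    counts = Counter(row[k] for row in table if row[:k] == context)
    # Assumption: rank candidates by frequency among the matching rows.
    return [word for word, _ in counts.most_common(top_n)]
```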

When the user clicks the Get Score Charts button:

  • The Word Oracle calculates all the scores for the predicted words on a 100-point basis, and then ranks the predicted words based on score and displays them in the table and chart.
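
The scoring formula is not spelled out above. One plausible reading, shown here purely as an assumption, is that each candidate's score is its share of the matching rows, scaled so that all scores sum to 100. The function name `score_candidates` is hypothetical.

```python
from collections import Counter

def score_candidates(candidates):
    """candidates: predicted words, one entry per matching table row."""
    counts = Counter(candidates)
    total = sum(counts.values())
    # Assumed formula: relative frequency on a 100-point scale.
    scored = [(word, 100.0 * count / total) for word, count in counts.items()]
    # Rank by score (highest first); break ties alphabetically.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

For example, `score_candidates(["broncos", "broncos", "broncos", "nuggets"])` returns `[("broncos", 75.0), ("nuggets", 25.0)]`.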