The Plan

Using over 4 million text entries drawn from Twitter, blog posts, and news articles, we will build a predictive text engine in a Shiny app.

The steps will be as follows:

Download the data
Describe the relevant functions we will use
Clean the data, using said functions
Trim the data, using said functions and the input given by the app user
Present the predictions, ranked from most to least frequent

The Benefits

Creating a text prediction app allows for more automation throughout web systems.

For example, many websites contain FAQs, but often people will search the website through a search bar instead. They use natural language and trust the search engine to get them to the answer. When they have trouble, they often go to the “contact” or “comment” feature of a site and leave a comment written in natural language as well.

A predictive text feature could be added to any website's search bar to steer users toward the most common searches on that site. From there, you could better connect pre-written FAQs to the predicted searches. This saves the business time and money by reducing the need for staff to read customer comments. It also better serves customers by helping them narrow their searches to those that return pre-written FAQs.

The Process

Katz-Backoff Search

The app takes an input, then searches the corpus for entries that contain it. It then repeats the search on progressively shorter versions of the input, dropping leading words to increase flexibility (by default, down to the last two words). Entries that match more of the words are duplicated, which gives them more weight.

Example:

“I want to go to the beach because I like _________”

First, it removes stopwords - words that are extremely common in English and often interchangeable. Our example sentence becomes:

“want go beach like”
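The stopword step can be sketched as follows. This is a minimal illustration in Python rather than the app's own Shiny code; the `STOPWORDS` set and the `remove_stopwords` helper are hypothetical, and the real app presumably uses a much larger stopword list.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch;
# the list used by the app is not given in the source).
STOPWORDS = {"i", "to", "the", "because", "a", "an", "and", "of", "is"}


def remove_stopwords(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]


keywords = remove_stopwords("I want to go to the beach because I like")
print(" ".join(keywords))  # want go beach like
```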

The search trims the corpus to all entries that include “beach like” and then duplicates any entries that also contain “go beach like” or “want go beach like”. Thus, entries that match more keywords give their predicted words more weight.
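A rough sketch of the trimming and duplication described above, written in Python for illustration only. The function names and sample entries are hypothetical, the entries are assumed to be already cleaned and stripped of stopwords, and the sketch follows the description in this deck rather than the textbook Katz back-off model, which uses discounted probabilities.

```python
def backoff_suffixes(keywords, min_words=2):
    """Return keyword suffixes, from the shortest (min_words) up to the full list."""
    if not keywords:
        return []
    start = max(1, min(min_words, len(keywords)))
    return [" ".join(keywords[-n:]) for n in range(start, len(keywords) + 1)]


def contains_phrase(entry, phrase):
    """Whole-word phrase match; padding with spaces avoids partial-word hits."""
    return f" {phrase} " in f" {entry} "


def weighted_matches(entries, keywords, min_words=2):
    """Keep entries that contain the shortest suffix, repeated once per suffix matched."""
    suffixes = backoff_suffixes(keywords, min_words)
    if not suffixes:
        return []
    matches = []
    for entry in entries:
        if not contains_phrase(entry, suffixes[0]):
            continue  # entry lacks even the shortest suffix, so it is trimmed away
        weight = sum(contains_phrase(entry, s) for s in suffixes)
        matches.extend([entry] * weight)
    return matches


# Hypothetical cleaned corpus entries.
entries = [
    "want go beach like swimming",  # matches all three suffixes
    "go beach like surfing",        # matches two
    "beach like sand",              # matches one
    "like beach",                   # matches none
]
matches = weighted_matches(entries, ["want", "go", "beach", "like"])
print(len(matches))  # 3 + 2 + 1 = 6 weighted entries
```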

From there, it builds n-grams of the relevant size (the number of keywords + 1), counts them, sorts them by frequency, and returns the last word of each n-gram (i.e. the predicted word) along with its share of the total count.
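The counting step might look like the sketch below, again in Python for illustration. It assumes a list of matched entries in which duplicates already carry the weighting; `predict_next` and the sample data are hypothetical names, not the app's actual code.

```python
from collections import Counter


def predict_next(entries, phrase_words, top=3):
    """Count the word that follows phrase_words and return (word, proportion) pairs."""
    n = len(phrase_words) + 1  # n-gram size: number of keywords + 1
    counts = Counter()
    for entry in entries:
        words = entry.split()
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[:-1] == phrase_words:
                counts[gram[-1]] += 1  # last word of the n-gram is the prediction
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(top)]


# Hypothetical weighted entries: repeated entries stand in for the duplication weighting.
weighted = (
    ["want go beach like swimming"] * 3
    + ["go beach like surfing"] * 2
    + ["beach like sand"]
)
for word, share in predict_next(weighted, ["beach", "like"]):
    print(f"{word}: {share:.2f}")
```

Because duplicated entries are counted once per copy, an entry that matched more keywords contributes more to its predicted word's share.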

Some Test Cases

The app returns the 3 most common words. Locally, the app can handle the full 606MB corpus, but due to Shinyapps.io memory limits, the demo app uses only 100,000 tweets, blog posts, and news articles to predict. This makes it much faster, but less accurate for phrases that have few keywords.
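One way to cut a corpus down to a fixed size is sketched below in Python. The source does not say how the 100,000 entries were chosen, so uniform random sampling is an assumption here, and `sample_corpus` is a hypothetical helper.

```python
import random


def sample_corpus(entries, size=100_000, seed=42):
    """Return up to `size` entries chosen uniformly at random (assumed strategy)."""
    if len(entries) <= size:
        return list(entries)
    rng = random.Random(seed)  # fixed seed keeps the demo subset reproducible
    return rng.sample(entries, size)


# Hypothetical stand-in corpus; the real one mixes tweets, blog posts, and news articles.
corpus = [f"entry {i}" for i in range(1_000)]
subset = sample_corpus(corpus, size=100)
print(len(subset))  # 100
```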

One example you can test is provided:

“I want to meet someone” predicts: new, else, like
