Next Word Predict

Felix G Lopez

September 20th, 2024

Objective

This presentation introduces the Next Word Predict app, covering its user interface and the text prediction algorithm behind it.

The Next Word Predict app is located at:

Shiny Application Overview

Next Word Predict is a Shiny application that uses an n-gram model to predict the next word(s) of a sentence from the text the user has entered.

An n-gram is a sequence of n consecutive words in a text. For example, "thank you" is a 2-gram and "thank you for" is a 3-gram.
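To make the definition concrete, here is a minimal sketch in base R that lists the n-grams of a sentence. The helper name extract_ngrams is hypothetical and is not taken from the app's source; the app itself may build its n-grams differently.

```r
# Hypothetical helper (not from the app): return every n-gram in a sentence.
extract_ngrams <- function(text, n) {
  # Split on whitespace to get the individual words.
  words <- strsplit(trimws(text), "\\s+")[[1]]
  # A sentence shorter than n words contains no n-grams.
  if (length(words) < n) {
    return(character(0))
  }
  # Slide a window of n words across the sentence.
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

extract_ngrams("thank you for your time", 2)
# "thank you" "you for" "for your" "your time"
```

A sentence of m words yields m - n + 1 n-grams, so the five-word sentence above gives four 2-grams and three 3-grams.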

The predictive model was trained on n-grams extracted from a large dataset of blogs, news articles, and tweets.

Several approaches from Natural Language Processing (NLP) and text mining were investigated to improve both the performance and the accuracy of the model.

Details of the Predictive Model

The predictive model was built from a subset of 800,000 lines drawn from the full dataset of blogs, news, and tweets. The data was cleaned with the tm package and several regular expressions applied via the gsub function.

During this process, the data was converted to lowercase, and non-ASCII characters, URLs, email addresses, Twitter handles, hashtags, ordinal numbers, punctuation, and extra whitespace were removed. The cleaned data was then tokenized into n-grams.
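The cleaning steps listed above can be sketched with gsub alone. This is an illustration under stated assumptions, not the app's actual code: the function name clean_text and every regular expression below are hypothetical, the input is assumed to be UTF-8, and the tm package is left out so the sketch runs in base R.

```r
# Hypothetical cleaning function; patterns are illustrative assumptions.
clean_text <- function(x) {
  x <- tolower(x)
  # Drop non-ASCII characters (assumes UTF-8 input).
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
  # Remove URLs, then email addresses, before "@" and "." are stripped.
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)
  x <- gsub("\\S+@\\S+", " ", x, perl = TRUE)
  # Remove Twitter handles and hashtags.
  x <- gsub("@\\w+", " ", x, perl = TRUE)
  x <- gsub("#\\w+", " ", x, perl = TRUE)
  # Remove ordinal numbers such as 1st, 2nd, 3rd, 4th.
  x <- gsub("\\b[0-9]+(st|nd|rd|th)\\b", " ", x, perl = TRUE)
  # Remove punctuation, then collapse repeated whitespace.
  x <- gsub("[[:punct:]]", "", x, perl = TRUE)
  x <- gsub("\\s+", " ", x, perl = TRUE)
  trimws(x)
}

clean_text("Check https://example.com NOW! Email me@example.com, @user #tag the 2nd time.")
# "check now email the time"
```

The order matters: URLs, email addresses, handles, and hashtags are removed before punctuation, because stripping punctuation first would delete the characters those patterns rely on.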

When a user enters text, the algorithm looks for a match starting with the longest n-gram (4-gram, using the last three words entered) and works down to the shortest (2-gram, using only the last word).

The predicted next word is the most frequent continuation of the longest matching n-gram. This is a simple back-off mechanism: a shorter n-gram is consulted only when the longer one has no match.
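The back-off lookup can be sketched as follows. The table layout (columns prefix, next_word, count), the function name predict_next, and the counts are all made up for illustration; the app's real data structures are not described in this presentation.

```r
# Toy n-gram table with invented counts: `prefix` holds the preceding
# words, `next_word` the word that followed, `count` how often it was seen.
ngrams <- data.frame(
  prefix    = c("thank you for", "you for", "you for", "for", "for"),
  next_word = c("your", "your", "the", "the", "a"),
  count     = c(5, 7, 9, 40, 30),
  stringsAsFactors = FALSE
)

# Hypothetical back-off lookup: try the longest prefix first, then shorter ones.
predict_next <- function(text, ngram_table, max_prefix = 3) {
  words <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  if (length(words) == 0) {
    return(NA_character_)
  }
  # A 3-word prefix corresponds to a 4-gram, a 1-word prefix to a 2-gram.
  for (k in min(max_prefix, length(words)):1) {
    prefix <- paste(tail(words, k), collapse = " ")
    hits <- ngram_table[ngram_table$prefix == prefix, ]
    if (nrow(hits) > 0) {
      # Most frequent continuation of the longest matching prefix.
      return(hits$next_word[which.max(hits$count)])
    }
  }
  NA_character_  # No n-gram matched at any length.
}

predict_next("Thank you for", ngrams)  # "your" (4-gram match)
predict_next("waiting for", ngrams)    # "the"  (backs off to the 2-gram)
```

Note that "Thank you for" returns "your" even though "the" is more frequent after "for" alone: the longest match wins, and frequency only breaks ties within that match.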

User Interface of the Application

The predicted next word(s) will appear once the application detects that the user has completed entering a word or phrase.

Please allow a few seconds for the predictions to be generated. The slider lets users choose how many predictions to display, up to three. The most likely next word appears first, followed by the second and third most likely.
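One way the slider value could drive the output is to rank the candidate words by frequency and keep the first k. The sketch below assumes a hypothetical candidates table with invented counts and a hypothetical function top_predictions; it is not the app's server code.

```r
# Invented candidate words and counts for a single matched n-gram prefix.
candidates <- data.frame(
  next_word = c("a", "the", "your", "my"),
  count     = c(30, 40, 12, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the k most frequent candidates, best first.
# `k` stands in for the value chosen on the slider (1 to 3).
top_predictions <- function(candidates, k = 3) {
  ranked <- candidates[order(candidates$count, decreasing = TRUE), ]
  head(ranked$next_word, k)
}

top_predictions(candidates, 3)
# "the" "a" "your"
```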

Thank you for taking the time to evaluate this app.