Coursera JHU Capstone Presentation

2024-12-04

Introduction

This document provides key takeaways for the Next Word Predict app, including the goal of the application, the user interface and details about the text prediction algorithm.

The Next Word Predict app is located at:
https://psylva79.shinyapps.io/next_word/
The source code files can be found on GitHub:
https://github.com/psylva79/DS_JHU_Capstone/tree/main/

Shiny Application

Next Word Predict is a Shiny app that employs a text prediction algorithm to forecast the next word(s) based on text entered by a user.

The application suggests several candidates for the next word in a sentence using an n-gram algorithm. An n-gram is a contiguous sequence of n words from a given sequence of text.
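To make the n-gram definition concrete, here is a minimal sketch in Python. It is an illustration only, not the app's R code; the function name ngrams and the sample phrase are assumptions chosen for the example.

```python
def ngrams(words, n):
    """Return every contiguous run of n words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Hypothetical sample phrase, already split into words.
words = "thank you for the follow".split()

bigrams = ngrams(words, 2)    # 2-grams: pairs of adjacent words
trigrams = ngrams(words, 3)   # 3-grams: triples of adjacent words
```

A phrase of five words yields four 2-grams and three 3-grams, because each n-gram starts one word later than the previous one.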

The predictive text model relies on a large corpus of blogs, news and Twitter data. N-grams were extracted from the corpus and then used to build the predictive text model.

Various methods were explored to improve speed and accuracy using natural language processing and text mining techniques.

The Predictive Text Model

The predictive text model was built from a sample of 30,000 lines extracted from the large corpus of blogs, news and Twitter data.

The sample data was then tokenized and cleaned using several packages, including tidytext, quanteda, tm and tokenizers. As part of the cleaning process, the data was converted to lowercase, and ordinal numbers, profane words, punctuation and extra whitespace were removed. The data was then split into tokens (n-grams).
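The cleaning steps described above can be sketched as follows. This is a rough Python illustration under stated assumptions, not the R pipeline the app uses: the PROFANITY set is a hypothetical placeholder list, the regular expressions are one possible reading of "ordinal numbers" and "punctuation", and keeping apostrophes inside words is a choice made for this example.

```python
import re

# Hypothetical placeholder list; the real app would load a full profanity list.
PROFANITY = {"darn", "heck"}

def clean_line(line, profanity=PROFANITY):
    """Lowercase a line, drop ordinals, punctuation and profane words, and
    return the remaining words as a list."""
    text = line.lower()
    # Remove ordinal numbers such as 1st, 2nd, 3rd, 10th.
    text = re.sub(r"\b\d+(?:st|nd|rd|th)\b", " ", text)
    # Replace punctuation with spaces, keeping apostrophes inside words.
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    # split() with no arguments also collapses repeated whitespace.
    tokens = [t.strip("'") for t in text.split()]
    return [t for t in tokens if t and t not in profanity]

cleaned = clean_line("The 2nd  game was, darn, GREAT!")
```

The cleaned word list can then be passed to an n-gram extraction step to produce the 2-grams and 3-grams used by the model.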

As the user enters text, the algorithm tries the longest n-gram (3-gram) first and falls back to the shorter one (2-gram) if no match is found. The predicted next word is taken from the most frequent match at the longest n-gram order that matches. This is a simple back-off strategy.
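A minimal sketch of this back-off lookup is shown below, again in Python for illustration rather than as the app's actual R implementation. The names build_model and predict, the tiny training sentences, and the choice to return up to three candidates ranked by raw frequency are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=3):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for words in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                model[context][words[i + n - 1]] += 1
    return model

def predict(model, text, max_n=3, top_k=3):
    """Try the longest context first, then back off to shorter ones."""
    words = text.lower().split()
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # not enough typed words for this n-gram order
        context = tuple(words[-(n - 1):])
        if context in model:
            # Most frequent continuations of the longest matching context.
            return [w for w, _ in model[context].most_common(top_k)]
    return []  # no 3-gram or 2-gram matched

# Hypothetical toy corpus, already cleaned and split into words.
sentences = [
    "i love new york".split(),
    "i love new shoes".split(),
    "i love new york".split(),
    "brand new day".split(),
]
model = build_model(sentences)

trigram_hit = predict(model, "I love new")   # matched on the 3-gram context
backed_off = predict(model, "a new")         # no 3-gram match, uses the 2-gram
no_match = predict(model, "zzz")             # nothing matches
```

In this sketch "a new" has no 3-gram match, so the lookup backs off to the single-word context "new" and ranks its continuations by count. Storing contexts as dictionary keys keeps each lookup to at most two hash probes per keystroke.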

Application User Interface

The predicted next word is shown once the app detects that you have finished typing a word. When entering text, please allow a few seconds for the output to appear. The top prediction is shown first, followed by the second and third most likely next words.