Next Word Prediction

Capstone Project

by Enrique Figueroa

December 19, 2021

The Shiny App predicts the next word of a user-entered English phrase.
Next-word prediction is already a familiar feature of, for instance, word processors and programming IDEs. We try to mimic it in R code.
Text sources have been provided by Coursera. These are English texts from three sources: news, tweets and blog contributions.
The user should enter a two- or three-word phrase in the App input box. The App will find and display the most probable next words, ranked by the conditional probability of each word following the entered phrase.
The App is stored at shinyapps.io.
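
The conditional probability of a next word can be written out explicitly. As an assumption (the estimator actually used by the App is not stated here), a common choice is the relative frequency of n-gram counts, where c(.) is the number of times a word sequence occurs in the sampled text:

```latex
P(w_3 \mid w_1, w_2) \approx \frac{c(w_1\, w_2\, w_3)}{c(w_1\, w_2)},
\qquad
P(w_2 \mid w_1) \approx \frac{c(w_1\, w_2)}{c(w_1)}
```

Under this assumption, the candidate words with the largest estimates for the entered phrase are the ones displayed.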

As reported in Part 1 of the course, we worked on a random sample of the large text bodies provided.
First, we remove punctuation, numbers, extra white space and graph characters; lowercase all words; and drop overly frequent words.
Next, a corpus is created in which all words are indexed, a necessary step for creating n-grams (“contiguous sequence of n items from a given sample of text”).
Afterwards, ordered tables of the most frequent 1-, 2- and 3-grams are created and stored in .Rdata file format.
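
The cleaning and n-gram counting steps can be sketched as follows. This is a minimal illustration in Python rather than the App's R code; the function names `clean` and `ngram_table`, the tiny `STOP_WORDS` set and the sample sentence are all hypothetical and stand in for the real corpus and frequent-word list.

```python
import re
from collections import Counter

# Hypothetical stand-in for the list of overly frequent words to drop.
STOP_WORDS = {"the", "a", "an", "of", "to", "and"}


def clean(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep only letters, and drop the stop words."""
    text = text.lower()
    # Replace anything that is not a letter or white space with a space,
    # which removes punctuation, digits and other symbols.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no argument also collapses repeated white space.
    return [word for word in text.split() if word not in stop_words]


def ngram_table(tokens, n):
    """Count contiguous n-grams, ordered from most to least frequent."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common()


sample = "I love New York. I love new ideas, and I love 2 cats!"
tokens = clean(sample)
bigrams = ngram_table(tokens, 2)
trigrams = ngram_table(tokens, 3)

print(bigrams[0])   # most frequent 2-gram with its count
print(trigrams[0])  # most frequent 3-gram with its count
```

In this sketch n-grams may span sentence boundaries; a fuller version would likely split the text into sentences first.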

The bag-of-words (BoW) tables, stored as .Rdata files created in the processing step, are available to the App.
The user's input is preprocessed before being passed to the function that retrieves the most probable words. For instance:
- Only the last two words will be considered if more than three words are provided.
The top 3 predictions will be returned if available.
If a search of the 3-gram table finds no match, the App falls back to a search of the 2-gram table.
The App can be found at github.com/efignav.
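
The lookup and fallback logic described above can be sketched as follows. This is an illustrative Python version, not the App's R code: the tables `TRIGRAMS` and `BIGRAMS` hold made-up counts, and `build_index` and `predict` are hypothetical names. It assumes the last two words of the input are always used as the 3-gram context.

```python
from collections import Counter, defaultdict

# Toy frequency tables with made-up counts (hypothetical stand-ins for the
# stored 3-gram and 2-gram tables).
TRIGRAMS = {
    ("i", "love", "you"): 5,
    ("i", "love", "new"): 3,
    ("i", "love", "it"): 2,
    ("i", "love", "cats"): 1,
}
BIGRAMS = {
    ("new", "york"): 4,
    ("new", "ideas"): 2,
    ("love", "you"): 6,
}


def build_index(table):
    """Map each context (all words but the last) to counts of the next word."""
    index = defaultdict(Counter)
    for gram, count in table.items():
        index[gram[:-1]][gram[-1]] += count
    return index


TRI_INDEX = build_index(TRIGRAMS)
BI_INDEX = build_index(BIGRAMS)


def predict(phrase, top=3):
    """Return up to `top` (word, probability) pairs for the next word."""
    words = phrase.lower().split()
    # Try the longest context first: the last two words against the 3-gram
    # table, then the last word alone against the 2-gram table.
    for index, size in ((TRI_INDEX, 2), (BI_INDEX, 1)):
        if len(words) < size:
            continue
        candidates = index.get(tuple(words[-size:]))
        if candidates:
            total = sum(candidates.values())
            return [(w, c / total) for w, c in candidates.most_common(top)]
    return []  # no match in either table


print(predict("Yes I love"))  # context found in the 3-gram table
print(predict("brand new"))   # falls back to the 2-gram table
```

Indexing the tables by context once, up front, keeps each prediction to a dictionary lookup instead of a scan over the whole table.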

Since we only want a proof of concept of the algorithms behind next-word prediction, the user interface is simple:

On the left panel, enter a phrase of 1, 2 or 3 words and press the Enter button.
The top 3 predictions are displayed on the right panel.