Next Word Prediction: a natural language processing approach

M Fisher (Coursera Student)
December 16, 2021

This presentation and accompanying app were developed for the capstone course of Coursera's Data Science Specialization, offered by Johns Hopkins University. You can read more about the specialization on the Coursera website. You can view the Shiny app at Next Word Prediction.

What is Natural Language Processing?

What is NLP?

Intersection of linguistics, computer science, and artificial intelligence
Processing and analysis of large volumes of natural language data
Goal of understanding the contents of documents

Source: Wikipedia

Uses of NLP

Classify and organize documents
Speech recognition
Sentiment analysis
Text prediction

How Does This Prediction Model Work?

This prediction model uses a data set generated by analyzing natural language data. The data set contains the top words following phrases of one, two, three, or four other words.

The prediction model reads the user's input, compares it to the data set, and suggests a next word using a simple back-off model. It always returns a suggestion, though the longer the phrase that fits, the better the prediction.
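The back-off lookup described above can be sketched roughly as follows. This is a minimal Python illustration, not the app's actual R code: the table `NEXT_WORDS`, its entries, and the function `predict_next` are all hypothetical names and data made up for the example. It assumes the data set can be treated as a mapping from a phrase to a ranked list of next words, with an empty-phrase entry that guarantees a suggestion is always returned.

```python
# Hypothetical lookup table: phrase (tuple of words) -> suggested next words, best first.
# The entries are invented for illustration, not taken from the app's data set.
NEXT_WORDS = {
    ("it", "was", "the", "best"): ["of"],
    ("was", "the"): ["first", "best"],
    ("the",): ["same", "first"],
    (): ["the"],  # empty history: fall back to an overall common word
}

MAX_HISTORY = 4  # the data set covers phrases of up to four words


def predict_next(text, table=NEXT_WORDS, max_history=MAX_HISTORY):
    """Return the top suggestion for the longest trailing phrase found in the table."""
    words = text.lower().split()
    history = words[-max_history:]
    # Try the longest phrase first, then drop the oldest word and retry.
    for start in range(len(history) + 1):
        phrase = tuple(history[start:])
        if phrase in table:
            return table[phrase][0]
    return None  # only reached if the table has no empty-history entry


print(predict_next("It was the best"))     # matches the full four-word phrase
print(predict_next("I think it was the"))  # backs off to "was the"
print(predict_next("something unseen"))    # backs off to the empty history
```

Trying the longest phrase first is what makes longer matches win: a shorter phrase is consulted only when every longer one is missing from the table.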

The model was built using R and the resulting application was published as a Shiny app. Try it out. It's awesome! ;)

What Does the Preprocessing Look Like?

We used text from blogs, news articles, and Twitter to generate the data set used by the model. Using that data, the processing script performed the following actions in order to prepare the data set for the Shiny app.

Read in data sets and filter them for profanity
Split the text into n-grams, phrases that are 1 to 5 words long (e.g., “it was”, “it was the”, “it was the best”, “it was the best of”)
Create a tally of the number of times each n-gram appears and grab the top n-grams
Combine the sets of n-grams of different lengths into a single data set that pairs each phrase with its predicted next words
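As a rough illustration of the steps above, here is a small Python sketch. The original script was written in R with tidytext, so this is not that code. The corpus, the profanity list, and the names `tokenize` and `build_table` are all invented for the example, and filtering profanity by dropping individual words is an assumption (the original may filter differently).

```python
from collections import Counter, defaultdict

# Toy corpus and profanity list, both invented for illustration.
CORPUS = [
    "It was the best of times",
    "It was the worst of times",
    "It was the best day",
]
PROFANITY = {"darn"}  # placeholder; a real list would be much longer


def tokenize(line):
    """Lowercase, split on whitespace, and drop words on the profanity list."""
    return [w for w in line.lower().split() if w not in PROFANITY]


def build_table(lines, max_n=5, top_k=3):
    """Map each phrase of 0 to max_n - 1 words to its top_k most frequent next words."""
    counts = defaultdict(Counter)
    for line in lines:
        words = tokenize(line)
        for n in range(1, max_n + 1):  # n-gram length
            for i in range(len(words) - n + 1):
                # The last word of the n-gram is the prediction; the rest is the phrase.
                *phrase, next_word = words[i:i + n]
                counts[tuple(phrase)][next_word] += 1
    # Keep only the most frequent continuations of each phrase.
    return {
        phrase: [word for word, _ in tally.most_common(top_k)]
        for phrase, tally in counts.items()
    }


table = build_table(CORPUS)
print(table[("it", "was", "the")])  # continuations of "it was the", most frequent first
print(table[("was", "the")])
```

Keeping only the top few continuations per phrase is what keeps the final data set small enough to ship with an app, at the cost of discarding rarer but valid next words.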

For more information, you can look at packages like tm, NLP, and quanteda. I used tidytext, and a big a-ha moment came when I read the first four chapters of Text Mining with R: A Tidy Approach. (I highly recommend that book!)

What Are Shortcomings of this Approach?

Ideally, one would apply this approach to the largest possible amount of training text and achieve perfect accuracy. However, there are limitations:

Preprocessing takes a lot of machine time and RAM
The larger the data set and word history, the more processing time and RAM are needed to run the model (an issue for phones)
A good model will provide suggestions that make sense, but unlike other kinds of predictions, there are too many “correct” answers (e.g., “I am a _______”) for the model to be right every time
The training data heavily influence the word suggestions, and there is no “right” data set unless the goal is to guess famous quotes
A more advanced NLP algorithm might be able to understand context, but the data and processing requirements would increase