Word prediction

Coursera Data Science Specialization capstone

Emilio González, 7/11/2020

Introduction

This presentation is about a Shiny web application that predicts the next word of a sentence using natural language processing (NLP) techniques. The app is the final product of the investigation carried out during the Capstone project of the Coursera Data Science Specialization.

One of the basic techniques used in word prediction is the analysis of groups of consecutive words called n-grams. The next word in the sentence is assigned a probability based on the occurrence of the preceding words in a training dataset, or corpus. This probability rests on the Markov assumption, which states that the transition probability depends only on the state defined by the previous \(n-1\) tokens.
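For a model built on n-grams of length \(N\), this assumption can be written, with the notation used later in the prediction section, as

\[P(w_n | w_1...w_{n-1}) \approx P(w_n | w_{n-N+1}...w_{n-1})\]

so only the last \(N-1\) words of the sentence are needed to predict the next one.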

For this project, an extensive dataset kindly prepared by SwiftKey was provided; we sampled it for speed, then cleaned and explored it. Later on, a model based on n-grams was devised and an algorithm to predict the next word from these n-grams was built.

Getting and cleaning the data. Exploratory analysis.

The provided English corpus had three sources: Twitter, blogs and news. It was loaded and processed with the quanteda library.
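A minimal sketch of this step (the file names follow the layout of the provided dataset and are assumptions here):

```r
# Read the three source files and build a quanteda corpus.
library(quanteda)

blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

corp <- corpus(c(blogs, news, twitter))
```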

Each file was about 200 MB of data (millions of tokens). That much data was not needed to achieve reasonable accuracy, so we sampled 0.1% of the lines to save resources and keep the app responsive.
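A sketch of the sampling step, assuming the character vectors from the previous snippet (the seed is illustrative):

```r
# Keep a 0.1% random sample of each source to speed up processing.
set.seed(1234)
rate <- 0.001

blogs_s   <- sample(blogs,   floor(length(blogs)   * rate))
news_s    <- sample(news,    floor(length(news)    * rate))
twitter_s <- sample(twitter, floor(length(twitter) * rate))

corp <- corpus(c(blogs_s, news_s, twitter_s))
```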

We used quanteda's functions to tokenize the text and build the n-grams, also applying a profanity filter to remove unwanted words. The token distribution is highly skewed: many words are very sparse, while only a few dozen words account for 90% of the token occurrences.
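A sketch of the cleaning step, assuming `corp` is the sampled corpus and `badwords.txt` is a profanity word list (both names are assumptions):

```r
library(quanteda)

# Profanity list used to drop unwanted words.
profanity <- readLines("badwords.txt", encoding = "UTF-8")

toks <- tokens(corp,
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE,
               remove_url     = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, pattern = profanity)

# A handful of very frequent words dominates the token counts.
topfeatures(dfm(toks), 20)
```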

N-gram generation

We generated data frames with 1-grams, 2-grams, 3-grams and 4-grams, ordered by frequency.

As we are using the Stupid Backoff (SBO) method, which requires no smoothing, only the text and frequency of each n-gram needed to be stored.
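A sketch of how one of these tables can be built (shown for 3-grams, assuming `toks` is the cleaned tokens object from the previous step):

```r
library(quanteda)
library(data.table)

ngrams3 <- tokens_ngrams(toks, n = 3, concatenator = " ")
dfm3    <- dfm(ngrams3)

# Frequency table of 3-grams, sorted by decreasing count.
freq3 <- data.table(ngram = featnames(dfm3),
                    count = as.integer(colSums(dfm3)))
setorder(freq3, -count)

# Split each 3-gram into its prefix (first two words) and the predicted word.
freq3[, prefix := sub(" \\S+$", "", ngram)]
freq3[, word   := sub("^.* ",   "", ngram)]
```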

Prediction

We decided to use the maximum likelihood estimate (MLE) to choose the next word. In this method, the probability of seeing a word \(w_n\), conditional on the occurrence of the previous words \(w_1\) to \(w_{n-1}\), is

\[P(w_n | w_1...w_{n-1}) = \frac{c(w_1...w_n)}{c(w_1...w_{n-1})}\]

where \(c(x)\) is the count of the n-gram defined by \(x\). In a simple implementation of the SBO method, to predict the next word we look up the last words of the input in the dictionary of n-grams of the next-higher order, and fall back to the lower-order dictionary if no matching n-gram is found.
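A minimal sketch of this backoff lookup. It assumes frequency tables `freq2`, `freq3` and `freq4` with columns `prefix`, `word` and `count` (sorted by decreasing count, as built above) and `freq1` with columns `word` and `count`; the function name is ours.

```r
predict_word <- function(input, n_results = 3) {
  words  <- tolower(unlist(strsplit(trimws(input), "\\s+")))
  tables <- list(freq2, freq3, freq4)   # indexed by prefix length

  # Try the longest matching prefix first, then back off to shorter ones.
  for (n in 3:1) {
    if (length(words) < n) next
    key  <- paste(tail(words, n), collapse = " ")
    hits <- tables[[n]][tables[[n]]$prefix == key, ]
    if (nrow(hits) > 0) return(head(hits$word, n_results))
  }

  # No prefix matched: fall back to the most frequent unigrams.
  head(freq1$word, n_results)
}

predict_word("thanks for the")
```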

Each frequency table is sorted in decreasing order, so the first match is the most likely word. If the word does not exist in the 1-gram dictionary, we do two things:

We also tested the Katz backoff method to smooth the probabilities (to account for unseen n-grams), but in our view there was no gain in performance that justified the added complexity of the algorithm, which penalized the user experience. We also tried using dictionaries to predict according to the context, but the results were not conclusive.

The app

In this app, the result of this project is available to try out.

(Notice that the app recovers from typos in the text box. I hope you enjoy it!)