Next Word Prediction

Manfredi Ruggeri
01/03/2022

Introduction

This report describes the steps involved in building a data-driven web application that aims to predict the next word in a phrase.

Roadmap

Getting and cleaning Data
Processing Data
Building the prediction algorithm

Getting and cleaning Data
The raw data, three large text files provided by SwiftKey, were preprocessed by:

removing non-alphabetic characters;
removing punctuation and extra spaces;
making all characters lowercase.
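
The three cleaning steps above can be sketched as follows. This is a minimal illustration in Python, not the code used in the app. The function name clean_text is hypothetical, and replacing apostrophes and digits with a space (rather than deleting them) is an assumption.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase the text, keep only a-z and spaces, and collapse extra spaces."""
    text = text.lower()
    # Assumption: every character that is not a-z or whitespace
    # (digits, punctuation, symbols) is replaced with a single space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_text("Hello,   World! 123")
```

Replacing with a space instead of deleting keeps neighbouring words from being glued together when a punctuation mark sits between them.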

Processing Data
This phase generates the tables that hold the information the data product needs to work. It consists of:

text tokenization
generating n-grams
calculating the maximum likelihood estimate (MLE) for n-grams
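
As an illustration of these three steps, the sketch below tokenizes on whitespace, builds n-grams, and computes the usual MLE for an n-gram: its count divided by the count of its first n-1 words. The helper names (tokenize, ngrams, mle_table) are hypothetical, and whitespace tokenization is an assumption; the report does not say which tokenizer was used.

```python
from collections import Counter

def tokenize(text):
    # Assumption: the text is already cleaned, so splitting on whitespace is enough.
    return text.split()

def ngrams(tokens, n):
    # All contiguous runs of n tokens, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mle_table(tokens, n):
    """Map each n-gram to count(n-gram) / count(its (n-1)-word prefix)."""
    counts = Counter(ngrams(tokens, n))
    if n == 1:
        # For unigrams the denominator is the total number of tokens.
        total = sum(counts.values())
        return {gram: c / total for gram, c in counts.items()}
    prefix_counts = Counter(ngrams(tokens, n - 1))
    return {gram: c / prefix_counts[gram[:-1]] for gram, c in counts.items()}

tokens = tokenize("the cat sat on the mat")
bigram_mle = mle_table(tokens, 2)
```

Here "the" occurs twice and "the cat" once, so the bigram ("the", "cat") gets an MLE of 0.5.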

Building the prediction algorithm

This last phase produces the prediction shown to the user. A stupid backoff model was implemented. In summary, the software first computes the MLE for the highest-order n-grams that match the input; if it finds no match, it backs off to the (n-1)-grams, and the MLE is multiplied by a factor of 0.4 at each backoff step.
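
A minimal sketch of stupid backoff scoring is given below, assuming a single table of raw n-gram counts up to trigrams. The names build_counts and predict are hypothetical. Note that this version collects candidates from every order and keeps each word's highest-order score, which may differ from the app if the app stops at the first order that yields a match.

```python
from collections import Counter

BACKOFF = 0.4  # penalty applied at each backoff step, as stated in the text

def build_counts(tokens, max_n=3):
    # Count every n-gram of order 1..max_n in one table keyed by tuple.
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def predict(context, counts, max_n=3, k=3):
    """Return up to k (word, score) pairs, best first, using stupid backoff."""
    context = tuple(context)[-(max_n - 1):] if max_n > 1 else ()
    total = sum(c for gram, c in counts.items() if len(gram) == 1)
    scores, penalty = {}, 1.0
    while True:
        n = len(context) + 1
        denom = counts[context] if context else total
        for gram, c in counts.items():
            if len(gram) == n and gram[:-1] == context:
                # setdefault keeps the score from the highest order that saw the word.
                scores.setdefault(gram[-1], penalty * c / denom)
        if not context:
            break
        context = context[1:]   # drop the oldest word: back off to a shorter context
        penalty *= BACKOFF
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

counts = build_counts("the cat sat on the mat the cat ran".split())
top = predict(["on", "the"], counts)
```

Scanning the whole table for each lookup is only acceptable for a toy corpus; a real table would be indexed by context.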

Implementation details:

n-grams with low frequency (fewer than two occurrences), the so-called long tail, were removed
the tables produced during data processing were, because of their size, split into smaller pieces to reduce resource usage and improve performance
when the algorithm cannot predict any word, the software shows the most common word
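
The three details above could look roughly like the sketch below. All function names are hypothetical, and splitting the table by n-gram order is only one possible scheme, chosen here as an assumption; the report does not say how the tables were partitioned.

```python
from collections import Counter, defaultdict

def prune(counts, min_count=2):
    # Drop the long tail: keep only n-grams seen at least min_count times.
    return {gram: c for gram, c in counts.items() if c >= min_count}

def split_by_order(counts):
    # Assumption: one smaller table per n-gram order (1 = unigrams, 2 = bigrams, ...).
    tables = defaultdict(dict)
    for gram, c in counts.items():
        tables[len(gram)][gram] = c
    return dict(tables)

def most_common_word(counts):
    # Fallback prediction when no n-gram matches the input.
    unigrams = {gram: c for gram, c in counts.items() if len(gram) == 1}
    return max(unigrams, key=unigrams.get)[0]

raw = Counter({("the",): 3, ("cat",): 2, ("mat",): 1,
               ("the", "cat"): 2, ("the", "mat"): 1})
kept = prune(raw)
tables = split_by_order(kept)
fallback = most_common_word(kept)
```

With smaller per-order tables, a lookup only needs to load the table that matches the length of the current context.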

Instructions

The app is as simple as possible and responds quickly: type some text, select the number of candidate words to show, and click Predict!

Go to Next Word Prediction and try it!
Thanks for reading