Next Word Prediction

Manfredi Ruggeri
01/03/2022

Introduction

This report describes the steps involved in building a data-driven web application that aims to predict the next word in a phrase.

Roadmap

  • Getting and cleaning Data
  • Processing Data
  • Building the prediction algorithm

Getting and cleaning Data
The raw data, three large text files provided by SwiftKey, were preprocessed by:

  • removing non-literal characters;
  • removing punctuation and extra spaces;
  • making all characters lowercase.
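
The three cleaning steps above can be sketched in a few lines. This is a minimal illustration, not the code used in the app: the function name `clean_text` is hypothetical, and it assumes that "non-literal characters" means anything other than the letters a-z, so digits and apostrophes are dropped as well.

```python
import re

def clean_text(raw: str) -> str:
    """Lowercase the text, keep only letters and spaces, collapse extra spaces."""
    text = raw.lower()
    # Assumption: everything that is not a-z or whitespace counts as
    # "non-literal" (digits, punctuation, symbols) and becomes a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Hello,   World!! It's 2022 :)"))  # hello world it s
```

Replacing unwanted characters with a space rather than deleting them keeps neighbouring words from being glued together; the cost is that contractions such as "it's" are split into two tokens.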

Processing Data
This phase generates the tables that hold the information the data product needs in order to work:

  • text tokenization
  • generating n-grams
  • calculating the maximum likelihood estimate (MLE) for n-grams
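
As a rough sketch of these three steps, the MLE of an n-gram can be taken as its count divided by the count of its (n-1)-word prefix (for unigrams, divided by the total number of tokens). The helper names `ngrams` and `mle_table` and the toy sentence are illustrative assumptions, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mle_table(tokens, n):
    """Map each n-gram to count(n-gram) / count(its (n-1)-word prefix)."""
    counts = Counter(ngrams(tokens, n))
    if n == 1:
        total = sum(counts.values())
        return {g: c / total for g, c in counts.items()}
    prefix_counts = Counter(ngrams(tokens, n - 1))
    return {g: c / prefix_counts[g[:-1]] for g, c in counts.items()}

# Toy corpus (hypothetical); assumes the text was already cleaned.
tokens = "the cat sat on the mat the cat ran".split()
bigram_mle = mle_table(tokens, 2)
print(bigram_mle[("the", "cat")])  # "the cat" occurs 2 times, "the" 3 times
```

Here "the" is followed by "cat" in two of its three occurrences, so the estimate for the bigram ("the", "cat") is 2/3.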

Building the prediction algorithm

This last phase produces the prediction shown to the user. A Stupid Backoff model was implemented. In summary, the software first looks up the MLE of the highest-order n-grams that match the input; if it finds no match, it backs off to the (n-1)-grams, and at every backoff step the MLE is multiplied by a factor of 0.4.
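
The backoff logic can be sketched as follows. This is one possible reading of the description above rather than the app's actual code: `build_counts`, `predict`, and the toy corpus are hypothetical, and only the 0.4 factor comes from the text. Each candidate word is scored at the highest order where it has been seen, and the score is discounted by 0.4 for every level backed off.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor stated in the report

def build_counts(tokens, max_n):
    """Count every n-gram of order 1..max_n in a single Counter."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def predict(context, counts, max_n=3, top_k=3):
    """Return up to top_k (word, score) pairs, best first."""
    vocab = [g[0] for g in counts if len(g) == 1]
    total = sum(counts[(w,)] for w in vocab)
    start = min(max_n, len(context) + 1)  # highest order the context allows
    scores, penalty = {}, 1.0
    for n in range(start, 0, -1):
        prefix = tuple(context[len(context) - (n - 1):])
        denom = counts[prefix] if n > 1 else total
        for w in vocab:
            c = counts[prefix + (w,)]
            if c > 0 and w not in scores:  # keep the highest-order score
                scores[w] = penalty * c / denom
        penalty *= ALPHA  # one more backoff step
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

counts = build_counts("the cat sat on the mat the cat ran".split(), 3)
print(predict(["the", "cat"], counts))
```

With this toy corpus, "sat" and "ran" are both found at the trigram level with score 1/2, while "the" is only reached at the unigram level and is scored 0.4 × 0.4 × 3/9. The scores are not probabilities: Stupid Backoff does not renormalize them.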

Implementation details:

  • n-grams with low frequency (fewer than two occurrences), the so-called long tail, were removed
  • the tables produced during data processing were split into smaller pieces because of their size, to minimize the resources used by the system and improve performance
  • when the algorithm cannot predict any word, the software shows the most common word
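
The first and last points can be illustrated with a short sketch. The names `prune` and `predict_or_default` and the toy counts are assumptions made for the example; the threshold of two occurrences is the one given above.

```python
from collections import Counter

def prune(counts, min_count=2):
    """Drop the long tail: keep only n-grams seen at least min_count times."""
    return Counter({g: c for g, c in counts.items() if c >= min_count})

def predict_or_default(candidates, unigram_counts):
    """Return the best candidate, or the most common word if there is none."""
    if candidates:
        return candidates[0]
    return unigram_counts.most_common(1)[0][0]

# Toy tables (hypothetical values).
bigrams = Counter({("the", "cat"): 2, ("cat", "sat"): 1, ("cat", "ran"): 1})
unigrams = Counter({"the": 3, "cat": 2, "sat": 1})

print(prune(bigrams))                    # only ("the", "cat") survives
print(predict_or_default([], unigrams))  # falls back to "the"
```

Pruning trades a little coverage for much smaller tables, since most distinct n-grams in a large corpus occur only once.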

Instructions

The app is as simple as possible and responds quickly: type some text, select the number of candidate words to show, and click Predict!

Go to Next Word Prediction and try it!
Thanks for reading