SwiftKey Capstone Project

Coursera Data Science Specialization

Johns Hopkins University

Myriam Ragni - April 2020

INTRODUCTION

The objective of this final project was to develop a text prediction algorithm using Natural Language Processing, together with a Shiny application that takes a phrase as input and outputs a prediction of the next word.
The following diagram depicts the different phases of the development of the Predictive Text Product.

HIGHLIGHTS

  • Data Acquisition: The predictive text model is based on the English blogs, news and Twitter files
  • Data Sampling
    • For each of the .txt files I created a ‘Training’ sample (70% of the data), a ‘Validation’ sample (the remaining 30%) and a ‘Test’ sample (10% of Train, for code-testing purposes); see the first sketch after this list.
    • The Training sample files are later combined to create the Corpus for the model.
  • Corpus Cleanup & Tokenization
    • Advanced text cleanup/transformations were applied to the data in the Corpus, e.g. replacement of contractions and removal of profanities, numbers, punctuation, special characters, URLs and tags. I decided not to remove stopwords, as they may be useful for predicting the next word in a sentence.
    • Due to memory limitations I could only generate up to 4-grams; of the packages I evaluated, quanteda was the most efficient.
    • For performance reasons I decided to ignore the n-grams with a frequency equal to 1 (pruning); the second sketch after this list shows both steps.
    • Possible improvements for the next product version: implement stemming/lemmatization, assign a special token to abbreviations/numerical values…
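
Below is a minimal sketch of the sampling step, assuming the three source files sit in a final/en_US folder; the paths and the seed are illustrative, only the split ratios come from the list above.

```r
# Minimal sampling sketch: split each source file into Training (70%),
# Validation (30%) and Test (10% of Training). Paths are illustrative.
set.seed(1234)
files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")

for (f in files) {
  lines    <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  in_train <- rbinom(length(lines), 1, 0.7) == 1         # 70% -> Training
  train    <- lines[in_train]
  validate <- lines[!in_train]                           # remaining 30% -> Validation
  test     <- sample(train, round(0.1 * length(train)))  # 10% of Train -> Test
  writeLines(train,    sub(".txt", ".train.txt",    f, fixed = TRUE))
  writeLines(validate, sub(".txt", ".validate.txt", f, fixed = TRUE))
  writeLines(test,     sub(".txt", ".test.txt",     f, fixed = TRUE))
}
```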
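
A second sketch covers the cleanup, tokenization and pruning steps with quanteda; `train_text` (the combined Training lines) and `profanity_list` are assumed to exist already, and the contraction replacement is omitted for brevity.

```r
# Minimal quanteda sketch: clean, tokenize, build 1- to 4-gram frequency
# tables, and prune the n-grams seen only once.
library(quanteda)

toks <- tokens(corpus(train_text),
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE,
               remove_url     = TRUE)
toks <- tokens_remove(toks, pattern = profanity_list)  # stopwords are kept

for (n in 1:4) {
  ng    <- tokens_ngrams(toks, n = n, concatenator = " ")
  dfm_n <- dfm_trim(dfm(ng), min_termfreq = 2)  # pruning: drop frequency-1 n-grams
  freqs <- sort(colSums(dfm_n), decreasing = TRUE)
  saveRDS(freqs, sprintf("freq_%dgram.rds", n))
}
```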

PREDICTION ALGORITHM

After the n-gram tokenization, uni-, bi-, tri- and quadgram term-frequency matrices were created; these form the foundation for the frequency dictionaries, which hold the smoothed probabilities of the different n-grams, calculated with the Kneser-Ney smoothing method.
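
For reference, the interpolated Kneser-Ney estimate for a bigram takes the form below (higher orders follow the same recursive pattern); $d$ is the absolute discount, commonly set around 0.75:

$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max\big(c(w_{i-1} w_i) - d,\ 0\big)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{cont}(w_i)$$

where $\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,\big|\{w : c(w_{i-1} w) > 0\}\big|$ redistributes the discounted mass, and $P_{cont}(w_i)$ is the continuation probability: the number of distinct words that precede $w_i$, divided by the total number of distinct bigram types.
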
The flow below shows the logic used in the Shiny app to predict the possible words following a sentence provided by the user. It is based on the Katz back-off technique; a simplified sketch of the lookup follows.
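
Below is a simplified sketch of that lookup, assuming `dicts` is a list of the four frequency dictionaries (data frames with illustrative columns `prefix`, `word` and `prob` holding the smoothed probabilities):

```r
# Simplified back-off lookup: try the longest available context first
# (last 3 words -> quadgram dictionary), then back off to shorter n-grams.
predict_next <- function(phrase, dicts, top_n = 10) {
  words <- tolower(unlist(strsplit(trimws(phrase), "\\s+")))
  for (n in 3:1) {
    if (length(words) < n) next
    ctx  <- paste(tail(words, n), collapse = " ")
    hits <- dicts[[n + 1]][dicts[[n + 1]]$prefix == ctx, ]
    if (nrow(hits) > 0)
      return(head(hits[order(-hits$prob), c("word", "prob")], top_n))
  }
  # No context matched: random draw from the top-100 unigrams
  top100 <- head(dicts[[1]][order(-dicts[[1]]$prob), ], 100)
  top100[sample(nrow(top100), top_n), c("word", "prob")]
}
```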

SHINY APPLICATION

  • INPUT: Enter a short sentence (minimum 1 word) in English and hit the ‘Submit next word…’ button
  • OUTPUT: A table listing the first 10 n-grams found in the appropriate dictionary, sorted by estimated probability. If the combination of the last 3/2 words of the sentence cannot be found in the dictionaries, the table falls back to a random selection from the top 100 unigrams. A minimal sketch of this wiring follows.
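
The sketch below shows the app's input/output wiring, assuming the dictionaries (`dicts`) and the predict_next() function sketched above are loaded at startup:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a short sentence (minimum 1 word) in English:"),
  actionButton("go", "Submit next word..."),
  tableOutput("predictions")
)

server <- function(input, output) {
  result <- eventReactive(input$go, {
    req(nchar(trimws(input$phrase)) > 0)  # require at least one word
    predict_next(input$phrase, dicts)     # top-10 table, sorted by probability
  })
  output$predictions <- renderTable(result())
}

shinyApp(ui, server)
```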


Detailed instructions and a description of the output are available in the 'About' tab of the application.