SwiftKey Capstone Project

Coursera Data Science Specialization

Johns Hopkins University

Myriam Ragni - April 2020

INTRODUCTION

The objective of this final project was to develop a text prediction algorithm using Natural Language Processing, together with a Shiny application that takes a phrase as input and outputs a prediction of the next word.
The following diagram depicts the different phases of the development of the Predictive Text Product.

HIGHLIGHTS

  • Data Acquisition: The predictive text model is based on the English blogs, news and Twitter files
  • Data Sampling
    • For each of the .txt files I created a ‘Training’ sample (70% of the data), a ‘Validation’ sample (the remaining 30%) and a ‘Test’ sample (10% of Train, for code-testing purposes); see the first sketch after this list.
    • The Training sample files are later combined to create the Corpus for the model.
  • Corpus Cleanup & Tokenization
    • Advanced text cleanup/transformations were applied to the data in the Corpus, e.g. replacement of contractions and removal of profanities, numbers, punctuation, special characters, URLs and tags. I decided not to remove stopwords, as they may be useful for predicting the next word in a sentence.
    • Due to memory limitations I could only generate up to 4-grams; of the packages I evaluated, quanteda was the most efficient.
    • For performance reasons I decided to ignore the n-grams with a frequency equal to 1 (pruning); the second sketch after this list shows both steps.
    • Possible improvements for the next product version: implement stemming/lemmatization, assign a special token to abbreviations/numerical values…
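
Below is a minimal sketch of the sampling step, assuming the three source files sit in a final/en_US folder; the paths and the seed are illustrative, only the split ratios come from the list above.

```r
# Minimal sampling sketch: split each source file into Training (70%),
# Validation (30%) and Test (10% of Training). Paths are illustrative.
set.seed(1234)
files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")

for (f in files) {
  lines    <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  in_train <- rbinom(length(lines), 1, 0.7) == 1         # 70% -> Training
  train    <- lines[in_train]
  validate <- lines[!in_train]                           # remaining 30% -> Validation
  test     <- sample(train, round(0.1 * length(train)))  # 10% of Train -> Test
  writeLines(train,    sub(".txt", ".train.txt",    f, fixed = TRUE))
  writeLines(validate, sub(".txt", ".validate.txt", f, fixed = TRUE))
  writeLines(test,     sub(".txt", ".test.txt",     f, fixed = TRUE))
}
```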
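
A second sketch covers the cleanup, tokenization and pruning steps with quanteda; `train_text` (the combined Training lines) and `profanity_list` are assumed to exist already, and the contraction replacement is omitted for brevity.

```r
# Minimal quanteda sketch: clean, tokenize, build 1- to 4-gram frequency
# tables, and prune the n-grams seen only once.
library(quanteda)

toks <- tokens(corpus(train_text),
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE,
               remove_url     = TRUE)
toks <- tokens_remove(toks, pattern = profanity_list)  # stopwords are kept

for (n in 1:4) {
  ng    <- tokens_ngrams(toks, n = n, concatenator = " ")
  dfm_n <- dfm_trim(dfm(ng), min_termfreq = 2)  # pruning: drop frequency-1 n-grams
  freqs <- sort(colSums(dfm_n), decreasing = TRUE)
  saveRDS(freqs, sprintf("freq_%dgram.rds", n))
}
```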

PREDICTION ALGORITHM

After the n-gram tokenization, uni-, bi-, tri- and quadgram term-frequency matrices were created; these form the foundation for the frequency dictionaries, which hold the smoothed probabilities of the different n-grams, calculated with the Kneser-Ney smoothing method.
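
For reference, the interpolated Kneser-Ney estimate for a bigram takes the form below (higher orders follow the same recursive pattern); $d$ is the absolute discount, commonly set around 0.75:

$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max\big(c(w_{i-1} w_i) - d,\ 0\big)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{cont}(w_i)$$

where $\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,\big|\{w : c(w_{i-1} w) > 0\}\big|$ redistributes the discounted mass, and $P_{cont}(w_i)$ is the continuation probability: the number of distinct words that precede $w_i$, divided by the total number of distinct bigram types.
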
The flow below shows the logic used in the Shiny app to predict the possible words following a sentence provided by the user. It is based on the Katz back-off technique; a simplified sketch of the lookup follows.
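
Below is a simplified sketch of that lookup, assuming `dicts` is a list of the four frequency dictionaries (data frames with illustrative columns `prefix`, `word` and `prob` holding the smoothed probabilities):

```r
# Simplified back-off lookup: try the longest available context first
# (last 3 words -> quadgram dictionary), then back off to shorter n-grams.
predict_next <- function(phrase, dicts, top_n = 10) {
  words <- tolower(unlist(strsplit(trimws(phrase), "\\s+")))
  for (n in 3:1) {
    if (length(words) < n) next
    ctx  <- paste(tail(words, n), collapse = " ")
    hits <- dicts[[n + 1]][dicts[[n + 1]]$prefix == ctx, ]
    if (nrow(hits) > 0)
      return(head(hits[order(-hits$prob), c("word", "prob")], top_n))
  }
  # No context matched: random draw from the top-100 unigrams
  top100 <- head(dicts[[1]][order(-dicts[[1]]$prob), ], 100)
  top100[sample(nrow(top100), top_n), c("word", "prob")]
}
```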

SHINY APPLICATION

  • INPUT: Enter a short sentence (minimum 1 word) in English and hit the ‘Submit next word…’ button
  • OUTPUT: A table listing the first 10 n-grams found in the appropriate dictionary, sorted by estimated probability. If the combination of the last 3/2 words of the sentence cannot be found in the dictionaries, the table falls back to a random selection from the top 100 unigrams. A minimal sketch of this wiring follows.
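
The sketch below shows the app's input/output wiring, assuming the dictionaries (`dicts`) and the predict_next() function sketched above are loaded at startup:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a short sentence (minimum 1 word) in English:"),
  actionButton("go", "Submit next word..."),
  tableOutput("predictions")
)

server <- function(input, output) {
  result <- eventReactive(input$go, {
    req(nchar(trimws(input$phrase)) > 0)  # require at least one word
    predict_next(input$phrase, dicts)     # top-10 table, sorted by probability
  })
  output$predictions <- renderTable(result())
}

shinyApp(ui, server)
```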


Detailed instructions and a description of the output are available in the 'About' tab of the application.