10/6/2020

Introduction

These slides are created as a pitch for the final Capstone project on the Data Science Specialization.

This project creates a word prediction machine learning system based on a given sentence or set of words.

This process is commonly used in texting applications in the mobile industry to facilitate the typing process to the user.

With that objective in mind, special care has been given to the portability and size of the dictionaries as well as time and computing process constraints even if doing so reduced its predictive capabilities.

Still the algorithm performs fairly well with over a 72 % rate of success.

Predictive Model

  • Working data set undergoes a process of optimized random sampling, cleansing & filtering to increase n-gram frequency rates and prediction probability.
  • Filtering process not only removes symbols but also hashtags, email, emoticons, word elongations, numbers, ordinas, etc.
  • Different dictionaries are created based on a combination of one to 5 n-gram using one to 5 minimum word size and removal or not of common stop words. Weights are calculated percentually.
  • Continues a filtering process based on extremely skewed histogram of repeated words to reduce dictionary size and based on a minimum number of repeating times for each word.
  • We create a training set from random sentences in the whole corpora. It includes for each given sentence 25 possible solutions with information based on different n-gram percentages minimum characters per word, part of speech tagging forming a Matrix.
  • This entire matrix of weights is trained using random forests and extreme gradient boosting to optimize the weighted hit value to find which dictionary’s chosen word is used next based on feature importance.

Predictive Algorithm Performance

  • A simple model of choosing 5-ngram, 4-ngram, 3-ngram, 2-ngram, 1-gram initially was providing a 3 % correct rate.
  • After optimization the prediction model reaches a 72 % accuracy rate.
    • It can recognize and predict nouns from politicians, actors, institutions.
    • It can predict adult or foul text
    • It can recognize smilies, some internet slang and perform some level of correction
    • It does all this in a reduced memory and processing environment that tries to keep stable performing periodic garbage collection.
    • The datafiles all together sum 39,5 Mb.
  • You can test it here ! -> https://dpellon.shinyapps.io/Capstone
  • Thanks !

Usage Example