12/5/2020

Introduction

This presentation introduces the application that was built for the capstone project of the Coursera Data Science specialization.

The application takes a string of words and predicts the next word, based on its probability of occurrence.

The prediction algorithm is trained on a set of three documents containing raw text from blogs, news articles, and tweets.

The original corpus was too large for the specs of the laptop and contained some noise, so it had to be cleaned and reduced. To shrink the corpus, the text lines were randomly sampled and fewer than 50% of them were kept. This greatly increased processing speed, at the cost of some accuracy.
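A minimal sketch of this sampling step in Python; the file names, the 50% keep rate, and the seed are assumptions for illustration, not the project's actual values:

    import random

    random.seed(42)  # makes the sample reproducible

    def sample_lines(path, keep_rate=0.5):
        """Keep roughly `keep_rate` of the lines in a text file."""
        with open(path, encoding="utf-8", errors="ignore") as f:
            return [line for line in f if random.random() < keep_rate]

    # Hypothetical file names; the real corpus files may be named differently.
    corpus = []
    for name in ["blogs.txt", "news.txt", "twitter.txt"]:
        corpus.extend(sample_lines(name, keep_rate=0.5))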

Analyzing the Corpus

The sentences in the corpus were subsequently split into individual words, i.e. the text was tokenized, and cleaned as follows (a short code sketch follows the list):

  • Punctuation is removed
  • All text is lowercased
  • Numbers and symbols are removed
  • Very common, low-information English stop words are removed
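A rough Python sketch of these cleaning steps, using only the standard library; the stop-word set here is a tiny illustrative subset, not the full list used by the app:

    import re

    STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}  # illustrative subset

    def tokenize(line):
        line = line.lower()                    # lowercase everything
        line = re.sub(r"[^a-z\s]", " ", line)  # drop punctuation, numbers, symbols
        return [w for w in line.split() if w not in STOP_WORDS]

    tokenize("The cat sat on 2 mats!")  # -> ['cat', 'sat', 'on', 'mats']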

Then n-grams were created: combinations of up to four words were extracted, and for each n-gram its frequency of occurrence was calculated. The algorithm looks for a match in the quadrigram (4-gram) table first, then in the trigram table, and so on. Based on these frequencies, it returns the most probable next word.
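Continuing the sketch above, the frequency tables could be built along these lines (function and variable names are assumptions for illustration):

    from collections import Counter

    def count_ngrams(token_lines, n):
        """Frequency table of n-grams across all tokenized lines."""
        counts = Counter()
        for tokens in token_lines:
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
        return counts

    token_lines = [tokenize(line) for line in corpus]
    tables = {n: count_ngrams(token_lines, n) for n in (1, 2, 3, 4)}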

Algorithm

Under the Markov chain assumption used in this application, any word more than three words prior to the predicted word is irrelevant to the prediction outcome.
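For the 4-gram model described above, this assumption can be written as

    P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-3}, w_{i-2}, w_{i-1})

i.e. only the three most recent words influence the predicted word.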

This app uses a backoff model to estimate the probability of unobserved n-grams. That means if the prior n-1 words cannot be found in the n-gram table, the algorithm looks up the prior n-2 words in the (n-1)-gram table and provides the most likely predictions, and so on down to shorter contexts.
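A simplified backoff lookup, continuing the earlier sketch; the deployed app may score and weight candidates differently:

    def predict(words, tables, max_n=4):
        """Back off from the longest context to shorter ones until a match is found."""
        for n in range(max_n, 1, -1):
            context = tuple(words[-(n - 1):])
            if len(context) < n - 1:
                continue  # not enough context for this n-gram order
            candidates = {
                gram[-1]: freq
                for gram, freq in tables[n].items()
                if gram[:-1] == context
            }
            if candidates:
                return max(candidates, key=candidates.get)
        # No context matched: fall back to the single most frequent unigram.
        return tables[1].most_common(1)[0][0][0]

    predict(tokenize("I love my"), tables)  # output depends on the sampled corpus

The linear scan over each table keeps the sketch short; a production version would index each table by context for speed.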

The App