Data Science Capstone Project

Ilya Tishchenko
08/02/2021

This is the final slide deck for the Data Science Capstone Project, completed on Coursera through Johns Hopkins University.

The data set was provided by SwiftKey and downloaded from the link given in the Coursera course.

The main workflow is taken from the paper Text Analysis in R.

The data is fairly large and my home Mac has only 8 GB of RAM ;) so I tried to follow the principle of simplicity.

Created a dictionary of 2-grams and 3-grams
Split each n-gram into its base (the first one or two words) and its rightmost word, which is kept as the prediction
The function first tries to match the last two words entered; if there is no match, it tries the last word alone; if there is still no match, the algorithm suggests the most popular word, 'the'
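
The lookup steps above can be sketched roughly as follows. This is a minimal Python illustration, not the project's actual code; the names `build_ngram_table` and `predict_next` and the toy corpus are hypothetical, and keeping only the single most frequent continuation per base is an assumption made here to keep the tables small.

```python
from collections import Counter, defaultdict


def build_ngram_table(tokens, n):
    """Map each (n-1)-word base to its most frequent rightmost word."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        base = tuple(tokens[i:i + n - 1])   # first one or two words
        target = tokens[i + n - 1]          # rightmost word, kept for prediction
        counts[base][target] += 1
    # Assumption: keep only the top continuation for each base.
    return {base: c.most_common(1)[0][0] for base, c in counts.items()}


def predict_next(words, trigram_table, bigram_table, default="the"):
    """Try the last two words, then the last word, then the default word."""
    if len(words) >= 2:
        hit = trigram_table.get(tuple(words[-2:]))
        if hit is not None:
            return hit
    if words:
        hit = bigram_table.get(tuple(words[-1:]))
        if hit is not None:
            return hit
    return default


# Toy corpus standing in for the SwiftKey data (hypothetical).
corpus = "i love data science and i love data analysis and i love coffee".split()
bigrams = build_ngram_table(corpus, 2)
trigrams = build_ngram_table(corpus, 3)

print(predict_next(["i", "love"], trigrams, bigrams))       # matched on two words
print(predict_next(["really", "love"], trigrams, bigrams))  # falls back to one word
print(predict_next(["hello", "world"], trigrams, bigrams))  # falls back to 'the'
```

Storing one continuation per base trades ranking flexibility for memory, which fits the 8 GB constraint; a version that returns several suggestions would keep the top few counts instead.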

Of course, pre-processing of the input also happens in the background, e.g. removing punctuation, lowercasing, etc.
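
A minimal sketch of that kind of input cleaning, in Python for illustration only. The function name `preprocess` is hypothetical, and the exact steps in the project may differ; this version only lowercases, strips ASCII punctuation, and splits on whitespace.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Lowercase the input, drop punctuation, and split into words."""
    cleaned = text.lower().translate(_PUNCT_TABLE)
    return cleaned.split()  # split() also collapses repeated whitespace


print(preprocess("Hello, World!  How are YOU?"))
```

Note that stripping punctuation this way turns contractions such as "don't" into "dont", so the same cleaning has to be applied to the training text for the lookups to match.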
