Data Science Capstone Project

Ilya Tishchenko
08/02/2021

Intro/Agenda

This is the final slide deck for the Data Science Capstone Project, completed on Coursera through Johns Hopkins University.

  • Data and pre-processing
  • Function
  • Link to the Shiny app

Data/Pre-processing

The data set was provided by SwiftKey and downloaded from the link given in the Coursera course.

The main workflow is taken from the paper Text Analysis in R.

  • Tokenisation
  • Lowercasing
  • Creating ngrams
  • Gathering statistics and creating a dictionary
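
The steps above can be sketched in a few lines. This is a minimal illustration in Python, not the R code used in the project: the function names (`tokenize`, `ngrams`, `build_counts`), the tokenising regular expression, and the three-sentence sample are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Punctuation is dropped; apostrophes inside words (e.g. "don't") are kept.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_counts(lines, n):
    """Count n-gram frequencies; n-grams never span two input lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(tokenize(line), n))
    return counts

# Hypothetical sample standing in for the SwiftKey corpus.
sample = ["I love the sea.", "I love the mountains!", "The sea is calm."]
bigram_counts = build_counts(sample, 2)
trigram_counts = build_counts(sample, 3)
```

The resulting counters are the "statistics" from which a prediction dictionary can then be built.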

How function works

The data is fairly large and my home Mac has only 8 GB of RAM ;) so I tried to follow the principle of simplicity.

  • Created a dictionary of 2-grams and 3-grams
  • Split each n-gram into its base (the first one or two words) and its rightmost word, which is kept as the prediction
  • The function first tries to match the last two words entered; if there is no match, it tries the last word alone; if there is still no match, the algorithm suggests the most popular word, 'the'

    Of course, pre-processing of the input also happens in the background, i.e. removing punctuation, lowercasing, etc.
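
The lookup logic can be sketched as follows. This is an illustrative Python version written under assumptions, not the project's R implementation: the names `clean`, `build_dictionary` and `predict` are hypothetical, the toy corpus is invented, and keeping only the single most frequent next word per base is a simplification chosen for the example.

```python
import re
from collections import Counter, defaultdict

FALLBACK = "the"  # suggested when neither the last two words nor the last word match

def clean(text):
    """Pre-process input: lowercase and drop punctuation, returning word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def build_dictionary(lines):
    """Map each base (one or two words) to its most frequent next word."""
    followers = defaultdict(Counter)
    for line in lines:
        tokens = clean(line)
        for n in (2, 3):  # 2-grams and 3-grams
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Base = all words but the last; the last word is the prediction.
                followers[tuple(gram[:-1])][gram[-1]] += 1
    return {base: c.most_common(1)[0][0] for base, c in followers.items()}

def predict(text, dictionary):
    """Try the last two words, then the last word, then fall back to 'the'."""
    tokens = clean(text)
    for k in (2, 1):
        if len(tokens) >= k:
            base = tuple(tokens[-k:])
            if base in dictionary:
                return dictionary[base]
    return FALLBACK

# Hypothetical toy corpus.
corpus = ["I love the sea.", "I love the sea breeze.",
          "They love music.", "They love music!", "They love music?"]
dictionary = build_dictionary(corpus)
```

With this toy corpus, `predict("I love", dictionary)` matches on two words, `predict("We LOVE", dictionary)` backs off to the last word only, and an unseen word falls through to the default.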

Shiny App

The Shiny app is available at https://ilyatishchenko.shinyapps.io/Predict_text/
