Data Science Capstone: Final Project

Marco Adamo
06-08-2020

Project goal

The goal of this exercise is to create a product to highlight the prediction algorithm that you have built and to provide an interface that can be accessed by others. For this project you must submit:

  • A Shiny app that takes as input a phrase (multiple words) in a text box input and outputs a prediction of the next word.
  • A slide deck consisting of no more than 5 slides created with RStudio Presenter pitching your algorithm and app as if you were presenting to your boss or an investor.

Dataset

The data used for this project is a collection of publicly available text from Twitter, blogs and news sites, aggregated by a web crawler. Only the English dataset has been used in this example.

The dataset is downloadable here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Modus Operandi

The following steps were carried out to prepare the dataset:

  • Extract a 5% sample of each file (Twitter, blogs and news)
  • Convert upper-case letters to lower case
  • Tokenise the text
  • Remove punctuation
  • Remove profanity (list accessible here: https://www.cs.cmu.edu/~biglou/resources/)
  • Remove stopwords listed in a database of English stopwords
  • Create unigrams, bigrams and trigrams
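
The preparation steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the code used in the project (which was built in R). The function names, the tiny STOPWORDS and PROFANITY sets, the per-line random sampling and the letters-only tokenising rule are all assumptions made for the example.

```python
import random
import re
from collections import Counter

# Hypothetical placeholder lists. The project used external profanity and
# English stopword lists, which are not reproduced here.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}
PROFANITY = {"darn"}


def sample_lines(lines, fraction=0.05, seed=42):
    """Keep each line with probability `fraction` (roughly a 5% sample)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def clean_tokens(text):
    """Lower-case, tokenise, then drop punctuation, profanity and stopwords."""
    # Assumption: a token is a run of letters or apostrophes, so digits and
    # punctuation are discarded at this step.
    tokens = [t.strip("'") for t in re.findall(r"[a-z']+", text.lower())]
    return [t for t in tokens if t and t not in STOPWORDS and t not in PROFANITY]


def ngram_counts(token_lists, n):
    """Count n-grams, stored as tuples of words, over several token lists."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    docs = [clean_tokens("The quick, brown fox is in the garden!")]
    print(ngram_counts(docs, 2))
```

Calling ngram_counts with n set to 1, 2 and 3 would give the unigram, bigram and trigram tables respectively.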

Next word prediction

The algorithm works as follows:

  • A word or sentence is taken as input and treated as a string
  • The string is processed in the same way as the dataset: it is converted to lower case and tokenised, and punctuation, profanity and English stopwords are removed
  • If the sentence is made of two or more words, the last two words are looked up in the list of trigrams from the dataset and, when they match the first two words of an entry, the third word of that trigram is used as the prediction
  • If there is no match, or if the input is a single word, the last word is looked up in the bigrams from the dataset. If it matches the first word of an entry, the second word of that bigram is used as the prediction
  • If there is still no match, the most frequent word from the list of unigrams is used as the prediction
  • If the
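
The trigram, bigram and unigram fallback described above can be sketched as below. Again, this is a Python illustration under stated assumptions rather than the project's R code: the function name predict_next, the tuple-keyed Counter tables and the demo counts are hypothetical. The text does not say which entry is chosen when several n-grams match, so this sketch assumes the most frequent one is used.

```python
from collections import Counter


def predict_next(tokens, unigrams, bigrams, trigrams):
    """Back off from trigrams to bigrams to the most frequent unigram.

    `tokens` is the already-cleaned input. Each table is a Counter keyed by
    tuples of words (a hypothetical layout chosen for this sketch).
    """
    if len(tokens) >= 2:
        context = tuple(tokens[-2:])
        matches = {k[2]: c for k, c in trigrams.items() if k[:2] == context}
        if matches:
            # Assumption: pick the most frequent matching trigram.
            return max(matches, key=matches.get)
    if tokens:
        last = tokens[-1]
        matches = {k[1]: c for k, c in bigrams.items() if k[0] == last}
        if matches:
            # Assumption: pick the most frequent matching bigram.
            return max(matches, key=matches.get)
    if not unigrams:
        return None  # nothing to fall back on
    return unigrams.most_common(1)[0][0][0]


# Made-up demo counts, for illustration only.
DEMO_TRIGRAMS = Counter({("new", "york", "city"): 3, ("new", "york", "times"): 5})
DEMO_BIGRAMS = Counter({("good", "morning"): 6, ("good", "night"): 1})
DEMO_UNIGRAMS = Counter({("time",): 10, ("day",): 7})

if __name__ == "__main__":
    for words in (["love", "new", "york"], ["very", "good"], ["zebra"]):
        print(words, "->", predict_next(words, DEMO_UNIGRAMS, DEMO_BIGRAMS, DEMO_TRIGRAMS))
```

Scanning every n-gram on each call keeps the sketch short; an app serving interactive requests would more likely index the tables by their context words.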

Work instruction: