Data Science Capstone: Final Project

Marco Adamo
06-08-2020

Dataset

The data used for this project is a collection of text gathered by a web crawler from publicly available Twitter posts, blogs and news articles. Only the English dataset was used in this example.

The dataset is downloadable here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Dataset preparation

The following steps were taken to prepare the dataset:

  • Extract 5% of each file (Twitter, blogs and news)
  • Convert all text to lower case
  • Tokenise
  • Remove punctuation
  • Remove profanity (list accessible here: https://www.cs.cmu.edu/~biglou/resources/)
  • Remove stopwords, using a list of English stopwords
  • Create unigrams, bigrams and trigrams
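
The preparation steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the code used in the project: the function names (sample_lines, preprocess, ngrams, count_ngrams) are hypothetical, the stopword and profanity sets are tiny placeholders for the external lists, and the 5% extract is assumed to be a random sample of lines. Tokenising and punctuation removal are done in a single regular-expression step.

```python
import random
import re
from collections import Counter

# Placeholder lists for illustration only; the project used external
# profanity and English stopword lists.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}
PROFANITY = {"darn"}


def sample_lines(lines, fraction=0.05, seed=42):
    """Keep roughly `fraction` of the lines (assumed: random line sampling)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def preprocess(text):
    """Lowercase, tokenise, and drop punctuation, profanity and stopwords."""
    # Keep runs of letters, allowing one internal apostrophe (e.g. "don't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [t for t in tokens if t not in STOPWORDS and t not in PROFANITY]


def ngrams(tokens, n):
    """Return all consecutive n-word tuples from a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))


def count_ngrams(lines, n):
    """Count n-grams over an iterable of raw text lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(preprocess(line), n))
    return counts
```

For example, count_ngrams(lines, 1), count_ngrams(lines, 2) and count_ngrams(lines, 3) would give the unigram, bigram and trigram frequency tables for a sampled list of lines.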

The algorithm

The algorithm works as follows:

  • A word or sentence is taken as input and treated as a string
  • The string is then preprocessed in the same way as the dataset (converted to lower case and tokenised, with punctuation, profanity and English stopwords removed)
  • If the sentence contains two or more words, the last two words are looked up in the list of trigrams from the dataset; when they match the first two words of an entry, the third word of that trigram is used as the prediction
  • If there is no match, or if the sentence contains only one word, the last word is looked up in the bigrams from the dataset. If it matches the first word of an entry, the second word of that bigram is used as the prediction
  • If there is still no match, the most frequent word from the list of unigrams is used as the prediction
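
The fallback from trigrams to bigrams to unigrams described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the project's implementation: the function name predict_next and the toy frequency tables are hypothetical, the input is assumed to be already preprocessed into tokens, and when several entries match, the most frequent one is assumed to win (the description above does not say how ties between matches are resolved).

```python
from collections import Counter


def predict_next(tokens, unigrams, bigrams, trigrams):
    """Predict the next word from preprocessed tokens, backing off as needed."""
    # Step 1: match the last two words against the start of each trigram.
    if len(tokens) >= 2:
        prefix = (tokens[-2], tokens[-1])
        matches = {g: c for g, c in trigrams.items() if g[:2] == prefix}
        if matches:
            # Assumption: pick the most frequent matching trigram.
            return max(matches, key=matches.get)[2]

    # Step 2: match the last word against the first word of each bigram.
    if tokens:
        matches = {g: c for g, c in bigrams.items() if g[0] == tokens[-1]}
        if matches:
            return max(matches, key=matches.get)[1]

    # Step 3: fall back to the most frequent unigram, if there is one.
    if unigrams:
        return unigrams.most_common(1)[0][0]
    return None


# Toy frequency tables, invented for illustration only.
unigrams = Counter({"time": 5, "day": 3, "good": 2})
bigrams = Counter({("good", "day"): 3, ("good", "time"): 1, ("long", "time"): 2})
trigrams = Counter({("long", "time", "ago"): 2, ("long", "time", "coming"): 1})
```

With these toy tables, the tokens ["long", "time"] are resolved by the trigram step, ["very", "good"] falls through to the bigram step, and an unseen word such as ["unknown"] falls through to the most frequent unigram.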

Instructions

Work instruction: