Data Science Capstone: Final Project

Marco Adamo
06-08-2020

Dataset

The data used for this project is a collection of text gathered by a web crawler from Twitter, blogs and news sites publicly available online. Only the English dataset has been used in this example.

The dataset is downloadable here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Dataset preparation

The following steps were taken to prepare the dataset:

Extract 5% of each file (twitter, blogs and news)
Transform upper-case letters to lower case
Tokenise
Remove punctuation
Remove profanity (list accessible here: https://www.cs.cmu.edu/~biglou/resources/)
Remove stopwords according to the database of English stopwords
Create unigrams, bigrams and trigrams
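
The preparation steps above can be sketched roughly as follows. This is a minimal Python illustration, not the code behind the app: the function names (sample_lines, clean, ngrams, build_counts), the random sampling with a fixed seed, the regular-expression tokeniser and the tiny PROFANITY and STOPWORDS sets are all assumptions made for the example. The real project uses the full profanity list linked above and a complete English stopword list.

```python
import random
import re
from collections import Counter

# Hypothetical placeholder lists, kept tiny for illustration only.
PROFANITY = {"darn"}
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}


def sample_lines(lines, fraction=0.05, seed=1234):
    """Return a random sample of the lines (random sampling is an assumption)."""
    k = int(len(lines) * fraction)
    return random.Random(seed).sample(lines, k)


def clean(text):
    """Lower-case, tokenise, drop punctuation, profanity and stopwords."""
    # Keep runs of letters and apostrophes; everything else acts as a separator.
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [t.strip("'") for t in tokens]
    return [t for t in tokens if t and t not in PROFANITY and t not in STOPWORDS]


def ngrams(tokens, n):
    """Return all consecutive n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_counts(lines, n):
    """Count n-grams over an iterable of raw text lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(clean(line), n))
    return counts
```

Calling build_counts with n set to 1, 2 and 3 on the sampled lines would give the unigram, bigram and trigram frequency tables.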

The algorithm

The algorithm works as follows:

A word or sentence is taken as input and treated as a string
The string is then processed in the same way as the dataset (converted to lower case, tokenised, and stripped of punctuation, profanity and English stopwords)
If the input contains two or more words, the last two words are looked up in the list of trigrams from the dataset and, when they match an entry, the third word of the trigram is used as the prediction
If there is no match, or if the input is a single word, the last word is looked up in the list of bigrams from the dataset. If it matches an entry, the second word of the bigram is used as the prediction
If there is still no match, the most frequent word from the list of unigrams is used as the prediction
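
The back-off logic above could look roughly like the Python sketch below. It is an illustration under stated assumptions rather than the app's actual code: the predict function and its arguments are hypothetical, the input tokens are assumed to be already cleaned, the bigram and trigram tables are assumed to be Counters keyed by word tuples, and the unigram table a Counter keyed by single words. When several n-grams match, the sketch picks the most frequent one, which the description above does not specify.

```python
from collections import Counter


def predict(tokens, unigrams, bigrams, trigrams):
    """Predict the next word with a trigram -> bigram -> unigram back-off."""
    # Step 1: try the last two words against the trigram table.
    if len(tokens) >= 2:
        context = tuple(tokens[-2:])
        candidates = {g[2]: c for g, c in trigrams.items() if g[:2] == context}
        if candidates:
            # Assumption: pick the most frequent matching trigram.
            return max(candidates, key=candidates.get)
    # Step 2: fall back to the last word against the bigram table.
    if tokens:
        last = tokens[-1]
        candidates = {g[1]: c for g, c in bigrams.items() if g[0] == last}
        if candidates:
            return max(candidates, key=candidates.get)
    # Step 3: fall back to the most frequent unigram overall.
    return unigrams.most_common(1)[0][0]


# Toy frequency tables, invented for this example only.
unigrams = Counter({"time": 5, "day": 3})
bigrams = Counter({("good", "morning"): 4, ("good", "day"): 2})
trigrams = Counter({("have", "good", "day"): 3, ("have", "good", "time"): 1})

print(predict(["have", "good"], unigrams, bigrams, trigrams))
print(predict(["very", "good"], unigrams, bigrams, trigrams))
print(predict(["hello"], unigrams, bigrams, trigrams))
```

With these toy tables the first call is resolved by the trigrams, the second falls back to the bigrams, and the third falls through to the most frequent unigram.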

Instructions

To try the app:

Open the following link: https://marcity.shinyapps.io/CourseraCapstoneFinal/
Wait for the dataset to load (this can take up to 45 seconds)
Enter a word or sentence
Press submit and wait for the result