Coursera Data Science Capstone - Final Project

Paolo Guderzo
August 31st, 2020

Overview

This project is the final assignment of the Data Science Capstone course offered by Johns Hopkins University on Coursera.

The main goal of this project is to create a Shiny application containing a predictive text model that predicts the next word of a sentence entered by the user.

Shiny app: https://duplo59.shinyapps.io/Data_Science_Capstone_Project/

Data Preparation

Before building the prediction model, the data were preprocessed and cleaned in the following steps:

reading the three provided text files (blogs, news and Twitter);

cleaning the text (stripping whitespace, removing punctuation and numbers, converting characters to lowercase);

merging the three sources into a single file;

sampling the merged file;

creating the corpus;

tokenizing and building the 2-gram, 3-gram and 4-gram data frames;

saving the n-gram data frames as RData files (compressed files).
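
The cleaning, sampling and n-gram counting steps above can be sketched as follows. The project itself is built in R, so this Python version is only an illustration of the idea, not the code behind the app. The function names (`clean_text`, `sample_lines`, `ngram_counts`), the sampling fraction and the ASCII-only letter filter are all assumptions made for this example.

```python
import random
import re
from collections import Counter


def clean_text(line):
    """Lowercase, delete everything except ASCII letters and whitespace,
    then collapse runs of whitespace (assumed cleaning rules)."""
    line = line.lower()
    line = re.sub(r"[^a-z\s]", "", line)   # drops punctuation and digits
    return re.sub(r"\s+", " ", line).strip()


def sample_lines(lines, fraction, seed=0):
    """Return a reproducible random subset of the lines.
    The fraction and the seed are hypothetical parameters."""
    rng = random.Random(seed)
    return rng.sample(lines, int(len(lines) * fraction))


def ngram_counts(lines, n):
    """Count every run of n consecutive tokens across the cleaned lines."""
    counts = Counter()
    for line in lines:
        tokens = clean_text(line).split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy input standing in for the merged blogs/news/Twitter text.
corpus = ["The cat sat on the mat.", "The cat ate 2 fish!"]
bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)
quadgrams = ngram_counts(corpus, 4)
```

Sorting each table by count, for example with `bigrams.most_common()`, gives the frequency ordering that a lookup-based predictor needs.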

Prediction Model

When the application starts, the RData files are loaded. The sentence typed by the user is then cleaned (e.g. punctuation is removed) and analyzed as follows:

the quadgrams are searched first, matching the last three words of the user's sentence against the first three words of each quadgram;

if there is no match, the trigrams are searched, matching the last two words of the user's sentence against the first two words of each trigram;

if there is no match, the bigrams are searched, matching the last word of the user's sentence against the first word of each bigram;

if there is still no match, a 'no match found' warning is displayed and the most frequent word ('the') is returned.
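
The lookup order above is a simple back-off scheme, which can be sketched as follows. This is a Python illustration under stated assumptions rather than the app's R code: the function name `predict_next`, the table layout (a context tuple mapped to a dictionary of next-word counts) and the toy counts are all hypothetical.

```python
import re


def predict_next(sentence, quadgrams, trigrams, bigrams, fallback="the"):
    """Return (predicted_word, n) where n is the order of the n-gram
    table that matched, or (fallback, None) when nothing matched."""
    # Same kind of cleaning as applied to the training text (assumed).
    tokens = re.sub(r"[^a-z\s]", "", sentence.lower()).split()

    # Try the longest context first, then back off to shorter ones.
    for context_len, table in ((3, quadgrams), (2, trigrams), (1, bigrams)):
        if len(tokens) < context_len:
            continue  # sentence too short for this table
        context = tuple(tokens[-context_len:])
        candidates = table.get(context)
        if candidates:
            # Pick the continuation with the highest count.
            best = max(candidates, key=candidates.get)
            return best, context_len + 1

    return fallback, None  # the 'no match found' case


# Toy tables with made-up counts, for illustration only.
quadgrams = {("thanks", "for", "the"): {"follow": 5, "help": 2}}
trigrams = {("for", "the"): {"first": 7, "last": 1}}
bigrams = {("the",): {"same": 9, "first": 4}}

print(predict_next("Thanks for the", quadgrams, trigrams, bigrams))
print(predict_next("Waiting for the", quadgrams, trigrams, bigrams))
print(predict_next("xyzzy", quadgrams, trigrams, bigrams))
```

Returning the matched order alongside the word lets the interface report which table produced the suggestion, and a `None` order signals that the warning should be shown.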

Shiny Application

[Figure: Shiny application]