Data Science Capstone

Isabel Méndez
18 March 2021

Introduction

This Capstone project is part of a 10-course certification, which I strongly recommend because it strengthens your skills and teaches the material in depth.

Link to the course: Data Science track by Johns Hopkins University on Coursera.

The goal of this exercise is to create a product to highlight the prediction algorithm that you have built and to provide an interface that can be accessed by others. For this project you must submit:

A Shiny app that takes as input a phrase (multiple words) in a text box input and outputs a prediction of the next word.

A slide deck consisting of no more than 5 slides created with RStudio Presenter, pitching the algorithm and app as if you were presenting to your boss or an investor.

Dataset

The data is from a corpus called Coursera Capstone.

The data consists of three text files: Twitter, Blog and News. In the end I used the RWeka library.

I used a random sample of 90% of the raw data to build the final model.
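As a rough illustration of the sampling step, here is a minimal sketch in base R. The helper name sample_lines, the seed, and the toy input are assumptions made for this example; the original code is not shown in this report.

```r
# Hypothetical helper: keep a random fraction of the lines in a character vector.
sample_lines <- function(lines, fraction = 0.9, seed = 1234) {
  set.seed(seed)                                  # fixed seed so the sample is reproducible
  n_keep <- round(length(lines) * fraction)       # number of lines to keep
  keep <- sort(sample.int(length(lines), n_keep)) # sorted indices preserve the original order
  lines[keep]
}

# Toy stand-in for one of the three text files (in practice: readLines on the file).
toy_lines <- paste("line", 1:20)
sampled <- sample_lines(toy_lines, fraction = 0.9)
length(sampled)
```

Sampling each file separately with the same fraction keeps the Twitter, Blog and News proportions of the raw data.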

Libraries

Other libraries used to compare n-grams, produce the plots, and build the Milestone Report:

library(dplyr)
library(ggplot2)
library(tm)
library(NLP)
library(stringi)
library(stringr)
library(kableExtra)
library(tidytext)
library(RColorBrewer)
library(plotly)

You can find the Milestone Report here: Milestone Report

Tokenization

The cleaning steps I applied for tokenization:

Remove non-English characters
Remove extra white space
Remove URLs and email addresses
Convert all words to lowercase
Remove English stop words
Remove punctuation marks
Convert to Latin encoding to remove lines with non-English words
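The steps above can be sketched in base R as follows. This is only an approximation under stated assumptions: clean_text and toy_stopwords are hypothetical names, the stop word list is a tiny stand-in for a full English list, and the encoding step converts to ASCII rather than Latin so that lines with non-convertible characters are dropped.

```r
# Hypothetical, tiny stop word list; a real run would use a full English list.
toy_stopwords <- c("the", "a", "an", "and", "of", "to", "is", "at")

clean_text <- function(x, stopwords = toy_stopwords) {
  # Assumption: convert to ASCII; lines that cannot be converted become NA and are dropped.
  x <- iconv(x, from = "UTF-8", to = "ASCII")
  x <- x[!is.na(x)]
  x <- tolower(x)                                         # lowercase
  x <- gsub("(https?://|www\\.)[^[:space:]]+", " ", x)    # URLs
  x <- gsub("[^[:space:]]+@[^[:space:]]+", " ", x)        # email addresses
  x <- gsub("[[:punct:]]", "", x)                         # punctuation marks
  # Remove stop words as whole words only.
  pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)
  # Collapse repeated white space and trim both ends.
  x <- trimws(gsub("[[:space:]]+", " ", x))
  x[nzchar(x)]                                            # drop lines that ended up empty
}

raw <- c("Visit https://example.com and email me at me@example.com!",
         "The   cat sat on the mat.",
         "Caf\u00e9 del mar")
cleaned <- clean_text(raw)
print(cleaned)
```

URLs and email addresses are removed before punctuation on purpose: once the dots, slashes and @ signs are gone, those patterns can no longer be recognised.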

Shiny Application

Once the data was cleaned, I built the n-grams: unigrams, bigrams, trigrams, and quadgrams. The n-gram data was saved in an RData file that is read by the server.R code, where the input is pre-processed again with the same filtering as in the previous step. I then filtered the n-gram data to get the prediction, and in ui.R I handled the formatting and added a Shiny theme.
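To make the n-gram filtering more concrete, here is a minimal sketch in base R. It is not the code used in the app: the functions ngram_counts and predict_next are hypothetical, the corpus is a toy example, and the simple back-off rule (try the longest context first, then shorter ones) is an assumption, since the lookup strategy is not described above. Base R is used instead of RWeka only to keep the example free of dependencies.

```r
# Count n-grams of size n in already-cleaned lines; returns a named integer
# vector sorted from most to least frequent.
ngram_counts <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(words) {
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) return(integer(0))
  counts <- table(grams)
  sort(setNames(as.integer(counts), names(counts)), decreasing = TRUE)
}

# Predict the next word. tables[[n]] must hold the n-gram counts for size n.
predict_next <- function(phrase, tables) {
  words <- strsplit(trimws(tolower(phrase)), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  for (n in rev(seq_along(tables))) {
    if (n == 1) return(names(tables[[1]])[1])   # no context left: top unigram
    k <- n - 1                                  # context length for this n-gram size
    if (length(words) < k) next
    prefix <- paste0(paste(tail(words, k), collapse = " "), " ")
    hits <- tables[[n]][startsWith(names(tables[[n]]), prefix)]
    if (length(hits) > 0) {
      # Tables are sorted by frequency, so the first hit is the most frequent.
      return(substring(names(hits)[1], nchar(prefix) + 1))
    }
  }
}

toy <- c("i love data science", "i love data analysis",
         "i love data science projects", "we love data science")
tables <- lapply(1:4, function(n) ngram_counts(toy, n))

predict_next("I love data", tables)   # matched in the quadgram table
predict_next("big data", tables)      # unseen context, backs off to bigrams
```

In a Shiny setup like the one described, tables of this kind would be computed once, saved to the RData file, and only the lookup would run in server.R on each input.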

You can find the app here: Shiny App