Data Science Capstone Slide Deck

Enrique Estrada
November 28, 2018

Objective

This is the Capstone project for the Coursera Data Science specialization, which involved developing a word-prediction application in R/Shiny. We were provided with a corpus of text from blogs, Twitter, and news from HC Corpora, a collection of corpora for various languages that is freely available to download. We were required to use only the English texts.

This project was conducted with the support of Johns Hopkins University and in cooperation with SwiftKey.

Development

The application uses natural language processing techniques, namely n-grams, a Markov model, and Katz's back-off model, to perform text prediction.

The model was built in the following steps:

Clean and prepare the data
Perform exploratory analysis
Build n-grams from the data corpus
Build frequencies from the n-grams
Build the prediction model
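
The cleaning, n-gram, and frequency steps can be sketched roughly as follows. This is a minimal illustration in Python rather than the R code used by the application; the function names (clean, ngrams, ngram_frequencies), the cleaning rules, and the two sample sentences are assumptions made for the example, not taken from the project.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase the text and keep only letters, apostrophes, and spaces.

    Assumed cleaning rules for illustration; the real project may differ.
    """
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return text.split()

def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_frequencies(lines, max_n=3):
    """Map each n (1..max_n) to a Counter of n-gram frequencies."""
    freqs = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean(line)
        for n in freqs:
            freqs[n].update(ngrams(tokens, n))
    return freqs

# Hypothetical two-line corpus standing in for the blogs, Twitter, and news texts.
sample = ["I love data science.", "I love R and Shiny!"]
freqs = ngram_frequencies(sample)
print(freqs[2].most_common(1))
```

The frequency tables produced this way are the input to the prediction model: the more often an n-gram occurs, the more likely its last word is offered as a prediction for the words that precede it.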

The Shiny Application

The application predicts the next word in a phrase or sentence. Up to four candidate next words are shown, and you can click any of them.

The selected word is added to your text, and the application then predicts the next word.
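
The predict-and-select behaviour described here can be sketched as below. This is an illustrative Python example, not the application's R/Shiny code. It uses a simplified back-off (try the longest context first, then shorter ones) without the discounting that Katz's model applies, and the names build_tables and predict as well as the toy sentence are assumptions for the example.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n=3):
    """Map each context (tuple of 0..max_n-1 words) to a Counter of next words."""
    tables = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            tables[tuple(context)][word] += 1
    return tables

def predict(tables, text, k=4, max_n=3):
    """Return up to k candidate next words, backing off to shorter contexts."""
    words = text.lower().split()
    candidates = []
    for size in range(max_n - 1, -1, -1):
        if size > len(words):
            continue  # not enough typed words for this context length
        context = tuple(words[len(words) - size:])
        # Candidates from longer contexts are ranked ahead of shorter ones.
        for word, _count in tables.get(context, Counter()).most_common():
            if word not in candidates:
                candidates.append(word)
            if len(candidates) == k:
                return candidates
    return candidates

# Hypothetical toy corpus standing in for the real training data.
tokens = "the cat sat on the mat and the cat ran".split()
tables = build_tables(tokens)

text = "on the"
suggestions = predict(tables, text)   # up to four candidates
text = text + " " + suggestions[0]    # simulate clicking the first suggestion
print(text, "->", predict(tables, text))
```

Returning a ranked list rather than a single word mirrors the interface: the user picks one of the candidates, the chosen word is appended, and prediction runs again on the extended text.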

Appendix

Natural Language Processing

N-grams

Markov Model

Katz's back-off model

Github Repository

Shiny App