2/25/2021

Introduction

The goal of this Data Science Capstone Project was to create a predictive text model that takes a word or phrase and, using natural language processing, predicts the next word.

This application used the English data set provided in the Capstone project as its training data. The data were taken from Twitter, blogs, and news articles. A random sample from each of the three files was combined and used to train the model.
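The sampling step can be sketched as below. The file names match the Capstone data set, but the 5% sample fraction and the random seed are assumptions made for illustration, not values taken from the report.

set.seed(1234)

# Read a text file and keep a random fraction of its lines (fraction is assumed).
sample_lines <- function(path, fraction = 0.05) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, size = floor(length(lines) * fraction))
}

twitter <- sample_lines("en_US.twitter.txt")
blogs   <- sample_lines("en_US.blogs.txt")
news    <- sample_lines("en_US.news.txt")

# Combine the three samples into one training set.
training_text <- c(twitter, blogs, news)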

Data Cleaning

The first step in building the model was to clean the data. The tm package was used to remove URLs, convert all text to lowercase, remove profanity, remove punctuation and numbers, and strip extra white space.
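A minimal sketch of that cleaning pipeline with the tm package is shown below; the URL pattern and the profanity list file ("profanity.txt") are assumptions, since the report does not name them.

library(tm)

corpus <- VCorpus(VectorSource(training_text))

# Custom transformation that strips URLs (assumed regex).
remove_urls <- content_transformer(function(x) gsub("http\\S+|www\\.\\S+", "", x))
# Assumed profanity word list, one word per line.
profanity <- readLines("profanity.txt", encoding = "UTF-8")

corpus <- tm_map(corpus, remove_urls)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, profanity)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)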

Modeling

The approach used for this model was the n-gram model. An n-gram is a sequence of n consecutive words in a text. The model works by looking up the last (n - 1) words of a phrase and using the most frequent matching n-gram to predict the next word.
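As a toy illustration (not the report's code), the 3-grams of a short sentence can be extracted in base R:

words <- strsplit("the quick brown fox jumps", " ")[[1]]
ngrams <- sapply(1:(length(words) - 2), function(i)
  paste(words[i:(i + 2)], collapse = " "))
ngrams
# "the quick brown"  "quick brown fox"  "brown fox jumps"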

The RWeka package was used to tokenize the data into n-grams, and the tidytext package was then used to sort and order the n-grams into frequency tables. The largest n-gram used for this model was a 5-gram and the smallest was a 2-gram. The model uses a backoff approach: if the last 4 words of a phrase are not found in the 5-gram table, the last 3 words are searched in the 4-gram table, and so on. If the model does not recognize even the last word in the phrase, the application states that the phrase is unknown.
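A minimal sketch of the frequency tables and the backoff lookup is shown below. It uses RWeka for tokenization as described, but counts n-grams with dplyr rather than the report's exact tidytext workflow, and the helper names (ngram_table, predict_next) are assumptions.

library(RWeka)
library(dplyr)
library(tibble)

text <- unlist(lapply(corpus, as.character))

# Tokenize the cleaned text into n-grams and build a frequency table,
# splitting each n-gram into its prefix (first n - 1 words) and its
# last word (the prediction).
ngram_table <- function(text, n) {
  tokens <- NGramTokenizer(text, Weka_control(min = n, max = n))
  tibble(ngram = tokens) %>%
    count(ngram, sort = TRUE) %>%
    mutate(prefix     = sub("\\s+\\S+$", "", ngram),
           prediction = sub("^.*\\s", "", ngram))
}

tables <- lapply(2:5, function(n) ngram_table(text, n))
names(tables) <- paste0(2:5, "gram")

# Backoff lookup: try the 5-gram table first, then shorter n-grams.
predict_next <- function(phrase, tables) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  for (n in 5:2) {
    k <- n - 1                           # number of trailing words to match
    if (length(words) < k) next
    target <- paste(tail(words, k), collapse = " ")
    hits <- filter(tables[[paste0(n, "gram")]], prefix == target)
    if (nrow(hits) > 0) return(hits$prediction[1])
  }
  "Phrase unknown"
}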

How to use the Shiny App

To use this predictive text model, enter a word or phrase in the text box and press the Submit button. The predicted next word is then shown below the box. The Shiny App is located here: Shiny App
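The structure of the app described above can be sketched as follows; the widget names, labels, and the predict_next helper from the previous sketch are assumptions rather than the deployed app's code.

library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a word or phrase:"),
  submitButton("Submit"),
  h4("Predicted next word:"),
  textOutput("prediction")
)

server <- function(input, output) {
  # The prediction updates only after the Submit button is pressed,
  # because submitButton() holds back all reactive inputs until then.
  output$prediction <- renderText({
    predict_next(input$phrase, tables)
  })
}

shinyApp(ui = ui, server = server)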

This is an example of what the app looks like.