Data Science Specialization Capstone Project

Felix E. Rivera-Mariani, PhD

Purpose of this Presentation

In this presentation, I'll briefly pitch and concisely summarize a Shiny web application for next-word prediction.

Project's Objective and Summary

The main purpose of the Data Science Capstone Project was to develop a Shiny web application that predicts the next word from the user's text input.

Data cleaning and exploratory analysis of word frequencies were performed in order to facilitate building a predictive model.

The text data used to build the predictive model come from the following corpus: HC Corpora.

Text mining and natural language processing were performed in R. The packages used can be found in the application code.

Methods and Applied Models

First, 1) a sample was drawn from the HC Corpora data. This sample was then 2) cleaned as follows: the text was converted to lowercase, and punctuation, special characters, extra white space, and links were removed.
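The cleaning step can be sketched roughly as follows. The project itself was carried out in R; this Python snippet is only an illustration, and the function name `clean_text`, the regular expressions, and the choice to drop digits and apostrophes are assumptions rather than the application's actual code.

```python
import re

def clean_text(text):
    """Lowercase the text and strip links, punctuation, special characters and extra whitespace."""
    text = text.lower()
    # Remove links first so their punctuation does not leave word fragments behind
    text = re.sub(r"(https?://|www\.)\S+", " ", text)
    # Assumption: apostrophes are deleted outright so "it's" becomes "its"
    text = text.replace("'", "")
    # Assumption: anything that is not a letter or whitespace counts as
    # punctuation or a special character (this also drops digits)
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_text("Check THIS out: https://example.com/page?x=1 -- it's   great!!")
print(cleaned)
```

Removing links before punctuation matters in a sketch like this: otherwise the URL would be broken into pseudo-words that would later pollute the n-gram counts.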

After cleaning, 3) the sample was tokenized into n-grams. Find more information and examples about n-grams here. Briefly, n-grams are “sets of co-occuring words within a given window”. (Reference for n-gram definition)

Frequency dictionaries were then built from the aggregated bi-, tri-, and quadgrams.
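As a rough illustration of tokenizing into n-grams and aggregating their frequencies, consider the sketch below. It is written in Python for brevity, whereas the project used R; the helper names `ngrams` and `ngram_frequencies` and the two-sentence sample are hypothetical and stand in for the cleaned HC Corpora sample.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_frequencies(lines, n):
    """Count how often each n-gram occurs across already-cleaned lines."""
    counts = Counter()
    for line in lines:
        # n-grams are built per line, so they never span two lines
        counts.update(ngrams(line.split(), n))
    return counts

# Hypothetical cleaned sample (not taken from the real corpus)
sample = ["the cat sat on the mat", "the cat ate the fish"]

bigram_freq = ngram_frequencies(sample, 2)
trigram_freq = ngram_frequencies(sample, 3)
quadgram_freq = ngram_frequencies(sample, 4)

# Most frequent bigrams first, analogous to a sorted frequency table
print(bigram_freq.most_common(3))
```

Each `Counter` plays the role of one frequency dictionary: the keys are the n-grams and the values are their counts in the sample.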

The data.frames built from these frequency dictionaries were then used for the next-word prediction model. The prediction links the application user's text input with the n-gram frequency tables described above, which work in the background to support the prediction.
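One way such a lookup could work is sketched below. The slides do not state how the bi-, tri-, and quadgram tables are combined, so the back-off from the longest to the shortest n-gram shown here is an assumption, as are the function name `predict_next` and the toy tables. The snippet is in Python for illustration only; the application itself was written in R.

```python
def predict_next(text, tables):
    """Return the most frequent continuation of `text`, or None if nothing matches.

    `tables` maps n (2, 3, 4, ...) to a dict of {n-gram tuple: count}.
    Assumption: try the longest n-grams first and back off to shorter ones.
    """
    tokens = text.lower().split()
    for n in sorted(tables, reverse=True):
        context = tuple(tokens[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough typed words for this n-gram order
        # Collect the last word of every n-gram whose first n-1 words match the context
        candidates = {gram[-1]: count
                      for gram, count in tables[n].items()
                      if gram[:-1] == context}
        if candidates:
            return max(candidates, key=candidates.get)
    return None

# Hypothetical toy frequency tables (not taken from the real corpus)
tables = {
    2: {("the", "cat"): 2, ("the", "mat"): 1, ("cat", "sat"): 1},
    3: {("the", "cat", "sat"): 1, ("on", "the", "mat"): 1},
    4: {("sat", "on", "the", "mat"): 1},
}

print(predict_next("The", tables))
print(predict_next("sat on the", tables))
```

With a single typed word only the bigram table can match, while longer inputs are first checked against the quadgram and trigram tables.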

Utility of the Application

Because smartphone usage keeps increasing, with an estimated 90% of the United States population using one, the application was designed to be “Smartphone Friendly.”

When the user types a word in the text box at the top of the application, the app predicts the word that follows. The predicted word appears below the text input box. In addition, the app displays the user's input text below the predicted word.

Screenshot of Application

Additional Information Related to the Project