Data Science Capstone Presentation

José Tapiz Arrondo AKA Pachi Tapiz Arrondo AKA PTA

2025-04-19

Objective

-Goal: Predicting the Next Word from a Given String
-A collaboration of Johns Hopkins University with SwiftKey

Introduction

The goal of the Data Science Capstone Project from Johns Hopkins University (JHU) is to create a usable application on natural language processing. This capstone project is offered in collaboration with SwiftKey.

The objective of the project is to build a functioning predictive text model. The data is from a corpus called HC Corpora, and, for this application, only the english datasets have been utilized.

For this project, the Text Mining packages tm and text2vec were used, along with the data manipulation package dplyr and the package doParallel. The app was created using the shiny package.

Predictive Model

To build the predictive model, 1.000.000 lines from twitter, blogs and news datasets were sampled. The sample dataset was then cleaned, by removing all non-ascii characters, like emoji, being converted to lowercase letters and then by removing all contractions, punctuation, numbers, profanities, leftout letters and extra whitespaces.

The data was then tokenized to form Maximum Likelihood Estimation (MLE) matrices of various n-grams. For the sake of accuracy, all frequencies up to 6-grams were computed.

Finally, the top 3 predictions, using a simple back-off model, are being calculated as predictions to the user input. The reason for having 3 predictions instead of 1 is that the accuracy the user experiences is substantially increased.

The Shiny Application

You can find the application here. Below is an image of the UI.

shiny-app
Each of the 3 predicted words is clickable, and when clicked the respective word is being added to the input text.
The application provides a prediction almost instantly.

Additional Information