Data Science Capstone Project

Chaithra

December 2024

Introduction

The goal of the Data Science Capstone Project from Johns Hopkins University (JHU) is to build a usable natural language processing application. The capstone is offered in collaboration with SwiftKey.

The objective of the project is to build a functioning predictive text model. The data comes from a corpus called HC Corpora; for this application, only the English datasets were used.

For this project, the text mining packages tm and text2vec were used, along with the data manipulation package dplyr and the parallel processing package doParallel. The app was created using the shiny package.

Predictive Model

To build the predictive model, 1,000,000 lines were sampled from all three datasets (Twitter, blogs, and news). The sample was then cleaned by removing all non-ASCII characters (such as emoji), converting the text to lowercase, and then removing all contractions, punctuation, numbers, profanities, leftover stray letters, and extra whitespace. Here, the news dataset is used for word prediction.
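The cleaning steps can be illustrated with a minimal sketch. The project itself was built in R with tm and text2vec, so this Python version is not the author's code; it only mirrors the order of the steps described above. The profanity list, the set of contraction endings, the choice to drop a contracted word entirely, and the rule for stray letters (keep only "a" and "i") are all assumptions made for illustration.

```python
import re

# Hypothetical placeholder list; the source does not say which profanity list was used.
PROFANITIES = {"darn", "heck"}

# Assumed set of contraction endings; matches whole words such as "can't" or "it's".
CONTRACTION_RE = re.compile(r"\b\w+'(?:s|t|re|ve|ll|d|m)\b")


def clean_line(text: str) -> str:
    """Apply the cleaning steps in the order the text lists them."""
    # 1. Drop non-ASCII characters such as emoji.
    text = text.encode("ascii", "ignore").decode("ascii")
    # 2. Convert to lowercase.
    text = text.lower()
    # 3. Remove contractions (assumption: the whole contracted word is dropped).
    text = CONTRACTION_RE.sub(" ", text)
    # 4. Remove punctuation and numbers by keeping only letters and whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    # 5. Remove profanities.
    words = [w for w in text.split() if w not in PROFANITIES]
    # 6. Remove stray single letters (assumption: "a" and "i" are real words and stay).
    words = [w for w in words if len(w) > 1 or w in {"a", "i"}]
    # 7. Joining on single spaces collapses any extra whitespace.
    return " ".join(words)
```

Calling `clean_line` on each sampled line would yield the cleaned text that the later tokenization step consumes.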

The data was then tokenized into n-grams of various orders, and Maximum Likelihood Estimation (MLE) matrices were built from the n-gram counts.
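As a rough illustration of what an MLE table for n-grams contains: the MLE probability of a word given its preceding words is the count of the full n-gram divided by the count of all n-grams sharing that prefix. The sketch below is written in Python rather than the project's R packages, and the function names `ngrams` and `mle_table` are hypothetical; it also uses plain dictionaries instead of the matrices the project built.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return all consecutive n-word tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mle_table(sentences, n):
    """Map each (n-1)-word prefix to {next_word: MLE probability}."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Whitespace tokenization is enough for already-cleaned text.
        for gram in ngrams(sentence.split(), n):
            counts[gram[:-1]][gram[-1]] += 1
    # MLE: count(prefix + word) / count(all n-grams starting with prefix).
    return {
        prefix: {w: c / sum(followers.values()) for w, c in followers.items()}
        for prefix, followers in counts.items()
    }


corpus = ["the cat sat", "the cat ran", "the dog sat"]
bigrams = mle_table(corpus, 2)
# bigrams[("the",)] holds the estimated probability of each word that follows "the".
```

Running `mle_table` once per order (for example n = 1, 2, 3) gives one table per n-gram size.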

Finally, a simple back-off model calculates the top 5 predictions for the user input. Offering 5 predictions instead of 1 substantially increases the accuracy the user experiences, because the intended word only has to appear among the five candidates.
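The source does not spell out the details of its back-off model, so the following is only one plausible reading, sketched in Python rather than the project's R: try the longest available context first, and fall back to shorter n-grams (ending with plain word frequencies) until five distinct candidates are collected. The names `build_counts` and `predict`, the maximum order of 3, and ranking by raw counts within each order are assumptions.

```python
from collections import Counter, defaultdict


def build_counts(sentences, max_n=3):
    """counts[n][prefix] is a Counter of the words that follow an (n-1)-word prefix."""
    counts = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                counts[n][gram[:-1]][gram[-1]] += 1
    return counts


def predict(counts, text, k=5):
    """Return up to k next-word candidates, backing off from long to short contexts."""
    tokens = text.lower().split()
    predictions = []
    for n in range(max(counts), 0, -1):
        # The context for an n-gram model is the last n-1 words typed so far.
        prefix = tuple(tokens[-(n - 1):]) if n > 1 else ()
        if len(prefix) < n - 1:
            continue  # not enough words typed for this order; back off
        for word, _ in counts[n].get(prefix, Counter()).most_common():
            if word not in predictions:
                predictions.append(word)
            if len(predictions) == k:
                return predictions
    return predictions


corpus = [
    "the cat sat on the mat",
    "the cat ran to the door",
    "the dog sat on the rug",
]
counts = build_counts(corpus)
suggestions = predict(counts, "the cat")
```

In this toy corpus the trigram context "the cat" supplies only two candidates, so the remaining slots are filled from lower-order counts, which is the behavior a back-off scheme is meant to provide.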

The Shiny Application

You can find the application here. Below is an image of the UI.

As soon as a letter or word is entered in the text box, the application provides a prediction almost instantly.