August 16, 2019

Overview

This application is the capstone project for the Coursera Data Science Specialization Track by Johns Hopkins University, in cooperation with SwiftKey.

The main goal of this capstone project is to build a Shiny application that predicts the next word based on the words the user has typed.

This app is hosted at: https://jrnaputo.shinyapps.io/DS_CapstoneProject/

Building the Algorithm

  1. The data consists of blog, news, and Twitter text files in US English. Due to hardware limitations, only a 1% sample of the full dataset was used.

  2. The sampled data is cleaned by converting uppercase to lowercase and removing profanity, links and URLs, punctuation, numbers, and extra white space.

  3. The cleaned data is tokenized into n-grams, up to 4-grams in this project, and the n-grams are tabulated with their frequencies and probabilities.

  4. Prediction is based on n-gram probability. The 4-grams are tried first; if they yield no prediction, the algorithm backs off to 3-grams, then 2-grams, and finally 1-grams.
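
The four steps above can be sketched in a few lines of code. This is a minimal illustration only, written in Python for brevity; it is not the code behind the hosted application. All function names, the tiny PROFANITY set, and the sampling seed are hypothetical stand-ins, since the slides do not say which profanity list or tokenizer was used. Within a single n-gram level, ranking followers by count gives the same order as ranking by conditional probability, so the sketch stores counts only.

```python
import random
import re
from collections import Counter, defaultdict

# Hypothetical placeholder; the real profanity list is not given in the source.
PROFANITY = {"darn", "heck"}


def sample_lines(lines, fraction=0.01, seed=42):
    """Step 1: keep roughly `fraction` of the lines, chosen at random."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def clean_text(text):
    """Step 2: lowercase, then drop URLs, punctuation, numbers, profanity."""
    text = text.lower()
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # links and URLs
    text = text.replace("'", "")                       # keep "don't" as one token
    text = re.sub(r"[^a-z\s]", " ", text)              # punctuation and numbers
    # split() also collapses any extra white space
    return [w for w in text.split() if w not in PROFANITY]


def build_ngram_tables(token_lists, max_n=4):
    """Step 3: map each (n-1)-word context to counts of the following word."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for tokens in token_lists:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables


def predict_next(tables, text, top_k=3, max_n=4):
    """Step 4: try the longest n-gram first, then back off to shorter ones."""
    words = clean_text(text)
    for n in range(max_n, 0, -1):
        if len(words) < n - 1:
            continue  # not enough typed words for this n-gram order
        context = tuple(words[len(words) - (n - 1):])
        followers = tables[n].get(context)
        if followers:
            return [w for w, _ in followers.most_common(top_k)]
    return []


corpus = [
    "I love data science!",
    "I love data analysis.",
    "I love data science 2019 http://x.y/z",
]
tables = build_ngram_tables([clean_text(line) for line in corpus])
print(predict_next(tables, "I love data"))
```

With this toy corpus, an input whose last three words match a stored 4-gram context is answered from the 4-gram table, while an input made of unseen words falls all the way through to the most frequent single words.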

Word Prediction Application

The application takes text typed into a text box and outputs the top 3 predictions for the next word. The user interface also includes instructions on how to use the application.
