JHU Coursera - Data Science Capstone Project

Anang Hudaya Muhamad Amin
9 September 2018

Introduction

This presentation has been prepared as part of the Johns Hopkins University - Coursera Data Science Capstone Project submission. It briefly describes the Word Prediction Application that has been developed for this course.

The aim of this project is to develop a web-based application that predicts the word a user is most likely to enter next, using a chosen prediction algorithm. The application currently handles English only. Future development may include other languages as well.

The application is built entirely in R and runs as a Shiny app. Predictions are based on the dataset provided by SwiftKey. You can retrieve the dataset from here.

Prediction Algorithm

The prediction algorithm applies maximum likelihood estimation (MLE) to an n-gram language model. A sample of 1,000 words is taken from each data file in the dataset (en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt).
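
The source does not spell out the estimator, so the following is only the standard textbook form of the n-gram MLE, offered as an assumption about what the app computes. The symbol C(...) is introduced here for illustration and denotes how often a word sequence occurs in the sampled text:

```latex
P_{\mathrm{MLE}}(w_n \mid w_1, \dots, w_{n-1})
  = \frac{C(w_1, \dots, w_{n-1}, w_n)}{C(w_1, \dots, w_{n-1})}
```

As a made-up illustration, if "thank you" occurred 8 times in the sample and "thank you very" occurred 2 times, the estimate of P(very | thank you) would be 2/8 = 0.25.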

The NLP and RWeka libraries are used to tokenize the words into 1-, 2-, 3-, and 4-gram models. These models are stored as object lists in four different Rds files.
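
To make the tokenization step concrete, here is a minimal sketch of building 1- to 4-gram frequency tables. It is written in Python for illustration and is not the app's R code; the function names `ngrams` and `count_ngrams`, the lower-casing, and the whitespace splitting are all assumptions, not details taken from the project.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(text, max_n=4):
    """Count all 1- to max_n-grams in lower-cased, whitespace-split text.

    Hypothetical helper: the real app tokenizes with the NLP and RWeka
    packages, which may treat punctuation and case differently.
    """
    tokens = text.lower().split()
    return {n: Counter(ngrams(tokens, n)) for n in range(1, max_n + 1)}


models = count_ngrams("the cat sat on the mat and the cat slept")
print(models[2][("the", "cat")])  # 2
```

Keeping one table per n-gram order mirrors the four separate model files described above.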

The MLE function finds the closest match between the input phrase given by the user (according to its number of words) and the four n-gram models, and from that match determines the most likely next word.
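
The matching step can be sketched as follows, in Python for illustration rather than the app's R code. The source does not say how the four models are combined, so trying the highest-order model first and falling back to shorter contexts is an assumption; `build_models` and `predict_next` are hypothetical names.

```python
from collections import Counter, defaultdict


def build_models(tokens, max_n=4):
    """For each n, map an (n-1)-word context to counts of the next word."""
    models = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            models[n][tuple(context)][word] += 1
    return models


def predict_next(phrase, models, max_n=4):
    """Return (word, n): the predicted word and the n-gram order used.

    Assumed strategy: start from the longest usable context and back off
    to shorter ones until some model has seen the context.
    """
    words = phrase.lower().split()
    for n in range(min(max_n, len(words) + 1), 0, -1):
        # The last n-1 words form the context (empty for the 1-gram model).
        context = tuple(words[len(words) - (n - 1):])
        followers = models[n].get(context)
        if followers:
            # Highest count is also the highest MLE probability.
            word, _ = followers.most_common(1)[0]
            return word, n
    return None, 0


tokens = "the cat sat on the mat the cat sat on the sofa".split()
models = build_models(tokens)
print(predict_next("the cat", models))  # ('sat', 3)
print(predict_next("a dog", models))    # ('the', 1)
```

Returning the model order alongside the word matches the two outputs the app displays: the predicted word and the n-gram model used.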

The Word Prediction Application

  • The app has an action button that starts the prediction process. The user enters a text phrase in the text box provided and clicks the Predict button to start.
  • Once the button is pressed, the predicted word and the n-gram model used will be displayed in two different output textboxes.

Reflection - Experience with the App

This is a simple prototype application for next-word prediction. It uses the dataset obtained from SwiftKey.

There are certain limitations with this app:

  • The prediction model was trained with only 5% of the entire data provided. As such, the prediction may not be highly accurate.

Completing this project has been a challenging task.

The Word Prediction App can be accessed at the following URL: https://ananghudaya.shinyapps.io/submission/