Predict a Word

Mark H
December, 2016

Introduction

The goal of this project is to build an R application that predicts the next word after a user enters a phrase (multiple words) in a text box.

This project is part of the Coursera Data Science course, which requires participants to develop the application described above in R using Shiny.

The app developed for the first part of the assignment is available at: https://markh.shinyapps.io/predict-a-word/

Data and Approach

The application uses course-provided data gathered from Twitter, news sources, and various blogs.

Due to the amount of data, the prediction model was built on only 30% of the sample data to keep it efficient.
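As a rough illustration of this sampling step, the sketch below keeps a random 30% of the lines in a character vector. The helper name `sample_lines`, the seed, and the stand-in corpus are assumptions made for this example; the original sampling code is not shown here.

```r
# Hypothetical helper: keep a random fraction of lines from a character vector.
sample_lines <- function(lines, fraction = 0.3, seed = 1234) {
  set.seed(seed)                                   # make the sample reproducible
  n_keep <- round(length(lines) * fraction)        # number of lines to keep
  keep <- sort(sample.int(length(lines), n_keep))  # sorted to preserve line order
  lines[keep]
}

# Stand-in corpus; in practice this would come from readLines() on a data file.
corpus <- sprintf("line %d", 1:20)
subset_lines <- sample_lines(corpus)
length(subset_lines)
```

Fixing the seed means the same subset is drawn on every run, which makes the n-gram files reproducible.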

Each data source was preprocessed into n-gram data files (bigrams, trigrams, four-grams, and five-grams).
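A minimal base-R sketch of how n-gram counts of a given order could be built from raw lines is shown below. The helpers `tokenize` and `count_ngrams` and the two sample sentences are hypothetical; the actual preprocessing may have used a text-mining package and different cleaning rules.

```r
# Hypothetical helper: lower-case a line and split it into word tokens.
tokenize <- function(line) {
  words <- strsplit(tolower(line), "[^a-z']+")[[1]]
  words[nzchar(words)]                     # drop empty tokens
}

# Hypothetical helper: count all n-grams of order n across a set of lines.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    words <- tokenize(line)
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)    # most frequent n-grams first
}

lines <- c("I love data science", "I love R")
bigrams  <- count_ngrams(lines, 2)
trigrams <- count_ngrams(lines, 3)
```

Calling `count_ngrams` with `n` set to 2 through 5 would give one frequency table per n-gram order, which could then be saved to separate files.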

However, the resulting application is still quite large (>15MB) and requires some time (>3 seconds) to load.

Prediction Algorithm and Drawbacks

The prediction capability of this application is based on a classic n-gram model, with Katz's back-off model as the main prediction model.

The Katz model “estimates the conditional probability of a word given its history in the n-gram. It accomplishes this estimation by ‘backing-off’ to models with smaller histories under certain conditions. By doing so, the model with the most reliable information about a given history is used to provide the better results.” [1]

[1] https://en.wikipedia.org/wiki/Katz's_back-off_model
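The back-off idea can be sketched as a lookup that tries the longest available history first and drops the oldest word whenever that history was never seen. The sketch below shows only this lookup order. It omits the discounting and back-off weights that a full Katz model uses, so it is closer to a simple unweighted back-off than to Katz's model proper. The function `predict_next`, the table layout, and all counts are hypothetical toy values, not taken from the app.

```r
# Hypothetical toy tables: each history maps to a named vector of next-word counts.
tables <- list(
  "love data" = c(science = 3, analysis = 1),   # trigram counts, 2-word history
  "data"      = c(set = 5, science = 2, is = 1) # bigram counts, 1-word history
)
unigrams <- c(the = 10, to = 8, and = 6)        # fallback when no history matches

# Hypothetical helper: return up to `top` candidate next words for a phrase.
predict_next <- function(phrase, tables, unigrams, max_history = 4, top = 6) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  k <- min(max_history, length(words))          # longest history to try
  while (k >= 1) {
    history <- paste(tail(words, k), collapse = " ")
    counts <- tables[[history]]
    if (!is.null(counts)) {
      return(names(head(sort(counts, decreasing = TRUE), top)))
    }
    k <- k - 1                                  # back off to a shorter history
  }
  names(head(sort(unigrams, decreasing = TRUE), top))
}

predict_next("I love data", tables, unigrams)   # matched on the history "love data"
predict_next("big data", tables, unigrams)      # backs off to the history "data"
```

With `max_history = 4` the lookup starts from a five-gram context, and `top = 6` corresponds to one top prediction plus five further guesses; both defaults are assumptions chosen to mirror the description in this document.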

Link to App and Usage

The app developed for the first part of the assignment is available at:

https://markh.shinyapps.io/capstone-word-predict/

Simply enter a short phrase into the text box and click Submit. The application will return its top next-word prediction as well as its next five best guesses.