Title

Predict Next Word - Language Modeling Capstone Project (coursera.org)

By: Cho Seng Mong

Date: 16 April 2016

Introduction

This presentation is a high-level description of the language modeling Capstone Project of the Coursera Data Science Specialization.

The purpose of this project is to build a natural language model that suggests an appropriate next word for a user-specified word sequence. Three types of data (Twitter, news, and blogs) were used to train the model. Data cleaning and sub-setting techniques were applied to produce the final training data. Word combinations of various lengths (N-grams) were then created from the clean data sets, and a predictive algorithm (Katz Back-off) was applied to predict the next word. The final predictive model was optimized to run as a Shiny application.

[Shiny App URL](https://chosengmong.shinyapps.io/Final_Capstone_Project/)

[Project Software](https://github.com/chosengmong/Final_Capstone_Project)

Data Handling and Cleaning

Prior to building the word prediction algorithm, the following steps were executed to handle and clean the very large Twitter, news, and blogs files.
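The individual steps are not reproduced in this text, so the block below is only a rough sketch of what typical cleaning and sub-setting for this kind of corpus might look like. It is written in Python rather than the project's own software, and the function names `clean_line` and `sample_lines`, the cleaning rules, and the sampling fraction are all assumptions for illustration.

```python
import random
import re


def clean_line(line):
    """Lowercase a line and keep only a-z, apostrophes, and single spaces.

    Assumed rules (hypothetical, not taken from the project):
    URLs, digits, punctuation, and non-ASCII characters are dropped.
    """
    line = line.lower()
    line = re.sub(r"https?://\S+|www\.\S+", " ", line)  # drop URLs
    line = re.sub(r"[^a-z' ]+", " ", line)  # drop everything except letters, ', space
    return re.sub(r"\s+", " ", line).strip()  # collapse repeated spaces


def sample_lines(lines, fraction, seed=42):
    """Keep a random subset of lines so the training data stays manageable."""
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return [ln for ln in lines if rng.random() < fraction]


if __name__ == "__main__":
    raw = ["Check THIS out: http://t.co/abc 123!!", "Good morning, world."]
    subset = sample_lines(raw, fraction=1.0)
    print([clean_line(ln) for ln in subset])
```

Sampling before cleaning keeps the expensive text processing limited to the lines that will actually be used for training.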

Word Prediction Model

The next word prediction model is based on the Katz Back-off algorithm. Here are the steps involved in predicting the next word for a user-specified sentence.
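To make the back-off idea concrete, here is a minimal Python sketch restricted to bigrams backing off to unigrams. It is not the project's implementation: it assumes a fixed absolute discount (the `discount` parameter) in place of the Good-Turing discounting used in classical Katz Back-off, and the names `build_counts`, `katz_bigram_probs`, and `predict_next` are hypothetical.

```python
from collections import Counter


def build_counts(tokens):
    """Count unigrams and bigrams from a flat list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def katz_bigram_probs(prev, unigrams, bigrams, discount=0.5):
    """Return {word: P(word | prev)} using back-off with a fixed discount."""
    total = sum(unigrams.values())
    # Words that were observed directly after `prev`, with their counts.
    seen = {w2: c for (w1, w2), c in bigrams.items() if w1 == prev}
    if not seen:
        # Context never observed: fall back to plain unigram frequencies.
        return {w: c / total for w, c in unigrams.items()}

    context_count = sum(seen.values())
    # Seen bigrams keep their discounted relative frequency.
    probs = {w: (c - discount) / context_count for w, c in seen.items()}
    # Probability mass freed by discounting is shared among unseen words.
    alpha = discount * len(seen) / context_count
    unseen_mass = sum(c for w, c in unigrams.items() if w not in seen)
    for w, c in unigrams.items():
        if w not in seen:
            probs[w] = alpha * c / unseen_mass
    return probs


def predict_next(sentence, unigrams, bigrams, k=3):
    """Suggest the k most probable next words given the last word typed."""
    words = sentence.lower().split()
    prev = words[-1] if words else ""
    probs = katz_bigram_probs(prev, unigrams, bigrams)
    return sorted(probs, key=probs.get, reverse=True)[:k]


if __name__ == "__main__":
    tokens = "the cat sat on the mat the cat ran".split()
    uni, bi = build_counts(tokens)
    print(predict_next("I saw the", uni, bi, k=1))
```

The same pattern extends to higher orders: look up the longest available context first and move to shorter contexts only for words that were never seen after the longer one.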

Shiny Application

A Shiny application was developed based on the next word prediction model described previously. Here are the key features of the app.