JHU DS Capstone Project: Next Word Prediction

JoshuaJ
2014-12-12


Overview

This is my JHU Data Science Specialization Capstone Project, which puts into practice the skills learned in the program.

The application demonstrates a text mining capability: it predicts the word most likely to follow the text that the user types in the input box. The problem statement is simple, but the techniques used to solve it have wide applications.

The Shiny application is hosted at https://jjjin.shinyapps.io/CapstoneProj for your review.


User Interface and Instructions

In this Shiny application, the UI layout uses a sidebar panel on the left to take the user's input and a main panel to display the word predicted by the model.

User Input Instructions: In the text input box on the left, type a word or a simple sentence. Almost immediately, the predicted next word will be displayed in the main panel on the right-hand side.

The UI design keeps display elements to a minimum.


Algorithm for Text Mining and N-gram

The training data can be retrieved from the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Data cleansing included the following steps: profanity removal, conversion to a canonical case, and stop word removal.
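The cleansing steps listed above could look roughly like the following. This is a minimal sketch in Python rather than the R code used in the project, and it rests on assumptions: the word lists are hypothetical placeholders, lowercase is assumed as the canonical case, and `clean_text` is an illustrative name rather than a function from the original application.

```python
import re

# Hypothetical word lists for illustration only; the lists used in the
# original project are not given in the source.
PROFANITY = {"darn", "heck"}
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "is", "in"}

def clean_text(text, profanity=PROFANITY, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop
    profane words and stop words."""
    # Canonical case: assume lowercase; keep letters and apostrophes only.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in profanity and t not in stop_words]

print(clean_text("The heck of a DAY in the Sun"))  # ['day', 'sun']
```

Note that removing stop words also removes them from the set of words the model can predict, which is a trade-off between table size and the naturalness of the suggestions.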

N-grams of order 2, 3, and 4 are then generated for use in the next word prediction model. The best match is used to predict the word that follows the input n-gram. As part of my learning process, in addition to R packages, I also explored Python NGram, and ended up using the RWeka library.
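One possible reading of "best match" is a simple back-off lookup: try the longest available context first (the last three words against the 4-gram table), and fall back to shorter contexts when nothing is found. The sketch below illustrates that idea in Python rather than with RWeka. The function names, the toy corpus, and the back-off rule itself are assumptions made for illustration, not the project's actual implementation.

```python
from collections import Counter, defaultdict

def build_ngram_tables(tokens, orders=(2, 3, 4)):
    """For each order n, map every (n-1)-word context to a Counter
    of the words observed right after it."""
    tables = {n: defaultdict(Counter) for n in orders}
    for n in orders:
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            tables[n][context][tokens[i + n - 1]] += 1
    return tables

def predict_next(tables, words):
    """Return the most frequent follower of the longest matching context,
    backing off to shorter contexts; None if nothing matches."""
    for n in sorted(tables, reverse=True):
        context = tuple(words[-(n - 1):])
        # Skip this order if the input is too short to fill the context.
        if len(context) == n - 1 and context in tables[n]:
            return tables[n][context].most_common(1)[0][0]
    return None

# Toy corpus, purely illustrative.
corpus = "i love new york i love new york i love new shoes".split()
tables = build_ngram_tables(corpus)
print(predict_next(tables, ["love", "new"]))    # york (trigram match)
print(predict_next(tables, ["unseen", "new"]))  # york (backs off to bigram)
```

Picking the raw most frequent follower is the simplest choice; smoothing or weighting across orders would be natural refinements.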


Project Summary

First of all, this was a good project to work on. The scope is simple and manageable; however, it has enough technical challenges to require a student to research topics that were not taught in the DS specialization program.

Over the seven-week project, I learned quite a bit about basic text processing in R and, more importantly, about n-gram construction and how to apply n-grams to a real-world problem. The prediction model runs at a reasonable speed: the predicted next word appears almost immediately after typing. The algorithm still has room for improvement.

I deeply appreciate the opportunity to learn from the professors and the professional students worldwide.

Thank you for your review!
