Coursera Data Science Capstone Project

Feng Qi
07-14-2016

This presentation is part of the capstone project for the Data Science Specialization offered by Johns Hopkins University on Coursera, in cooperation with SwiftKey. It briefly pitches the project's online app, which predicts the next word given a phrase.

Introduction

The goal of this final project is to create a product that highlights the prediction algorithm students have built and to provide an interface that others can access. For this project, students must submit:

  • A Shiny app (here) that takes a phrase (multiple words) in a text box and outputs a prediction of the next word.

  • A slide deck of no more than 5 slides, created with RStudio Presenter, pitching the algorithm and app.

All of the provided text data were used to build a single word dictionary and several n-gram frequency tables. Those tables are then used to predict the next word.

Course dataset

The training data used in the project was supplied on the course webpage:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The data exploration can be found in the milestone report.

Methods and Model

The cleaned texts were tokenized into n-grams. I created 2-, 3-, 4-, 5-, and 6-gram frequency tables, which the Shiny app uses to make its predictions.
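As a rough illustration of this counting step, here is a minimal Python sketch. The app itself is written in R, and this is not the project's code: the regex tokenizer, the function names, and the toy corpus are all assumptions made for the example, since the slides do not describe the cleaning rules.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes (assumed cleaning rule)."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(lines, n):
    """Count every run of n consecutive tokens, never crossing a line boundary."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical toy corpus standing in for the cleaned training text.
corpus = ["I want to go home", "I want to sleep", "they want to go out"]

# One frequency table per n-gram order, n = 2..6 as in the slides.
tables = {n: ngram_counts(corpus, n) for n in range(2, 7)}
```

With this toy corpus, `tables[2]` counts the bigram `("want", "to")` three times, while `tables[6]` stays empty because no line has six tokens.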

  • For the n-gram frequency tables (n>1), I converted each n-gram to a list of words, then encoded that list as a list of numbers giving each word's position in the master lookup table.

  • To save space and speed up queries, I kept only n-grams that appear at least 5 times across all of the training data.

  • To maximize speed, I took advantage of the fast search feature of data.table, so predictions are returned with no noticeable delay.
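The three points above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the app's R code: a plain dict stands in for data.table's keyed search, the toy counts are invented, and the longest-match-first backoff in `predict` is an assumed way of combining the tables, since the slides do not say how they are combined.

```python
from collections import Counter

MIN_COUNT = 5  # threshold stated in the slides

def build_vocab(all_counts):
    """Give each distinct word an integer id: its position in the lookup table."""
    vocab = {}
    for counts in all_counts.values():
        for ngram in counts:
            for word in ngram:
                vocab.setdefault(word, len(vocab))
    return vocab

def build_table(counts, vocab, min_count=MIN_COUNT):
    """Map integer-coded prefix -> {next-word id: count}, dropping rare n-grams."""
    table = {}
    for ngram, count in counts.items():
        if count < min_count:
            continue
        ids = tuple(vocab[w] for w in ngram)
        table.setdefault(ids[:-1], {})[ids[-1]] = count
    return table

def predict(words, tables, vocab):
    """Assumed strategy: try the longest n-gram table first, then back off."""
    id_to_word = {i: w for w, i in vocab.items()}
    ids = [vocab.get(w, -1) for w in words]  # -1 marks an unknown word
    for n in sorted(tables, reverse=True):
        if len(ids) < n - 1:
            continue  # phrase too short for this table
        followers = tables[n].get(tuple(ids[-(n - 1):]))
        if followers:
            return id_to_word[max(followers, key=followers.get)]
    return None

# Hypothetical toy counts, keyed by n-gram order.
counts = {
    2: Counter({("want", "to"): 9, ("to", "go"): 6, ("to", "sleep"): 5, ("go", "home"): 2}),
    3: Counter({("want", "to", "go"): 5, ("want", "to", "sleep"): 4}),
}
vocab = build_vocab(counts)
tables = {n: build_table(c, vocab) for n, c in counts.items()}
```

Storing integer ids instead of strings keeps the tables compact, and pruning at `MIN_COUNT` removes entries such as `("go", "home")` that would rarely produce a useful prediction.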

Application Usage

The user interface of this application is shown below.

As the user types a phrase into the text box, the app automatically updates the prediction. The app also shows basic n-gram information for the input phrase, as both tables and bar plots.

Application Screenshot