JHU Data Science Capstone: Next Word Prediction App Pitch

kuanhoong
20th April 2016

Overview

This Shiny App is a Next Word Prediction App built on the SwiftKey dataset. The App predicts the next word from the user's text input.

The next word is predicted from frequencies stored in the 5 N-Gram Data Models (unigram, bigram, trigram, fourgram and fivegram). The predicted word is the one with the highest occurrence, or frequency, among the matching entries.

Dataset

The dataset for the prediction application was built from the 3 available types of user-written data: News, Blogs and Twitter.

A Corpus was created by merging all three sources, and around 300,000 lines were then sampled from the merged combination.
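
The merge-then-sample step could look roughly like the sketch below. It is only an illustration in Python, not the app's actual code: the function name sample_lines, the fixed seed and the tiny in-memory lists are all assumptions made for the example.

```python
import random

def sample_lines(sources, n_lines, seed=42):
    """Merge several lists of lines and draw a random sample without replacement.

    `sources`, `n_lines` and `seed` are hypothetical names for this sketch.
    """
    merged = [line for lines in sources for line in lines]
    rng = random.Random(seed)          # fixed seed so the sample is reproducible
    k = min(n_lines, len(merged))      # never ask for more lines than exist
    return rng.sample(merged, k)

# Toy stand-ins for the News, Blogs and Twitter files.
news = ["the market rose today", "rain is expected tomorrow"]
blogs = ["i love baking bread", "my trip to the coast"]
twitter = ["cant wait for the weekend", "thanks for the follow"]

sample = sample_lines([news, blogs, twitter], n_lines=4)
```

With the real data, n_lines would be set to around 300,000.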

Sample Preprocessing: Lower case conversion, removing numbers, removing punctuation, removing profanity, stripping whitespace, etc. Subsequently, the sample was segmented into various N-Grams. For this app, bigrams, trigrams, fourgrams and fivegrams were built for prediction, with the frequency of each entry recorded.
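
As a rough illustration of the preprocessing and N-Gram counting described above, here is a minimal Python sketch. It is not the app's implementation: the helper names preprocess and count_ngrams are hypothetical, the profanity list is a one-word placeholder, and treating every non-letter as punctuation is a simplifying assumption (it splits contractions such as "don't").

```python
import re
from collections import Counter

PROFANITY = {"darn"}  # placeholder list, assumed for this sketch only

def preprocess(line, profanity=PROFANITY):
    """Lower-case, drop numbers and punctuation, remove profanity, split on whitespace."""
    line = line.lower()
    line = re.sub(r"[0-9]+", " ", line)      # remove numbers
    line = re.sub(r"[^a-z\s]", " ", line)    # remove punctuation and other non-letters
    # split() also collapses repeated whitespace
    return [tok for tok in line.split() if tok not in profanity]

def count_ngrams(token_lists, n):
    """Count how often each run of n consecutive words occurs."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

lines = ["I love New York!", "I love darn pizza 2 much", "i love new shoes"]
tokens = [preprocess(line) for line in lines]
bigrams = count_ngrams(tokens, 2)
trigrams = count_ngrams(tokens, 3)
```

The same count_ngrams call with n set to 4 and 5 would give the fourgram and fivegram tables.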

Algorithm

For the Shiny App, the user is asked to enter a minimum of two words. Since fivegrams were constructed, the last 4 words of the input are selected and searched for in the table of fivegrams.

The entered words are matched against all the fivegrams, and those matching all of the first 4 words are shown in order of frequency.

If there is no match in the fivegrams, the last 3 words are selected and a similar search, by matching and frequency, is performed in the fourgrams, where the most frequent candidate words are suggested.

If there is no match, this process is repeated with lower-order N-Grams until a match is found.
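
The back-off search described in this section can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the app's own code: build_tables and predict_next are hypothetical names, the corpus is a toy example, and returning an empty list when even the bigram table has no match is a choice made for this sketch.

```python
from collections import Counter, defaultdict

def build_tables(token_lists, max_n=5):
    """Map n -> {context of n-1 words -> Counter of the words that follow it}."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for tokens in token_lists:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables

def predict_next(text, tables, max_n=5, top_k=3):
    """Try the longest context first, then back off to shorter ones."""
    words = text.lower().split()
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue                      # input too short for this N-Gram order
        context = tuple(words[-(n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            # most frequent continuations first
            return [word for word, _ in candidates.most_common(top_k)]
    return []                             # assumption: no match at any order

corpus = [
    "thank you so much for the follow".split(),
    "thank you so much for the support".split(),
    "thank you so much for the follow".split(),
    "looking forward to the weekend".split(),
]
tables = build_tables(corpus)

full_match = predict_next("thank you so much for the", tables)  # fivegram hit
backed_off = predict_next("waiting for the", tables)            # falls back to trigrams
```

Keying each table by its context keeps every lookup a single dictionary access, instead of scanning all N-Grams of that order for each query.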

Shiny App

A Shiny App (http://kuanhoong1.shinyapps.io/final_project) has been created for the Data Science Capstone Project.

The user is required to enter a phrase containing a minimum of two words and click the submit button to display the next predicted word. The user can also copy and paste the five phrases from the Twitter/News dataset to test the accuracy of the prediction.

Interim Milestone Report can be obtained from here