Coursera Data Science Capstone Project: Word Prediction

May 19, 2017

Introduction

This project predicts the next word of a typed phrase through a Shiny app.

The data come from a corpus called HC Corpora, which can be downloaded from the link below.

The app loads five n-gram data sets derived from the cleaned sample data (0.1% of the corpora).
The n-gram representation of a text lists every sequence of n consecutive words that appears in it. The simplest case is the unigram (1 word), followed by the bigram (2 words), trigram (3 words), fourgram (4 words), and fivegram (5 words).
Each n-gram data set is converted into a frequency table that records how often each phrase occurs.
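
As an illustration of how such frequency tables can be derived, here is a minimal sketch in Python (the app itself runs on Shiny, so this is not its actual code). The function names, the toy sentences, and the cleaning step (lowercasing and splitting on whitespace) are assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequency_table(sentences, n):
    """Count how often each n-gram occurs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        # Assumed cleaning: lowercase and split on whitespace.
        tokens = sentence.lower().split()
        counts.update(ngrams(tokens, n))
    return counts

# Toy data standing in for the cleaned sample (hypothetical).
sample = ["thanks for the follow", "thanks for the help", "see you at the party"]

# One table per n-gram order, from unigrams (1) to fivegrams (5).
tables = {n: frequency_table(sample, n) for n in range(1, 6)}
```

With this toy data, `tables[3]` counts the trigram `("thanks", "for", "the")` twice, and `tables[5]` holds a single entry because only one sentence is five words long.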
The algorithm first cleans the typed phrase, then looks for the highest-frequency match among the phrases in the fivegram frequency table. If no fivegram matches the typed phrase, it falls back in turn to the fourgram, trigram, and bigram frequency tables, and lastly to the unigram frequency table.
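
The lookup order described above can be sketched as follows. This is a simplified Python illustration, not the app's own code: `clean`, `build_tables`, and `predict_next` are hypothetical names, the cleaning rule (keep lowercase letters and apostrophes) is an assumption, and the tables are keyed by the preceding words so that each lookup returns the most frequent following word.

```python
import re
from collections import Counter, defaultdict

def clean(phrase):
    """Assumed cleaning: lowercase, keep only letters and apostrophes."""
    return re.findall(r"[a-z']+", phrase.lower())

def build_tables(sentences, max_n=5):
    """Map n -> {context of n-1 words -> Counter of the words that follow}."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = clean(sentence)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                tables[n][tuple(context)][word] += 1
    return tables

def predict_next(phrase, tables, max_n=5):
    """Try the longest n-gram first, then fall back to shorter ones."""
    tokens = clean(phrase)
    for n in range(max_n, 0, -1):
        if len(tokens) < n - 1:
            continue  # typed phrase is too short for this n-gram order
        # Last n-1 typed words; an empty tuple for the unigram table.
        context = tuple(tokens[len(tokens) - (n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # only reached when the tables are empty

# Toy data standing in for the cleaned sample (hypothetical).
sample = [
    "thanks for the follow",
    "thanks for the follow back",
    "see you at the party",
    "the party was great",
    "the party is over",
]
tables = build_tables(sample)
```

Here `predict_next("Thanks for the", tables)` is answered from the fourgram table, while `predict_next("enjoy the", tables)` finds no trigram match and falls back to the bigram table. A phrase with no match at any level ends at the unigram table, which returns the most frequent single word.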

The prediction app is built on only a 0.1% sample of the corpora (Twitter, news, and blogs).
We need to balance Shiny Server performance against prediction accuracy: a larger sample gives better accuracy at the expense of longer processing time.
As data scientists, we always have to work within the available resources to analyse the data.
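
For illustration, one way to draw a sample of roughly 0.1% of the lines is sketched below in Python. The exact sampling method used for the app is not stated here, so the approach (keeping each line independently with a fixed probability), the function name, and the seed value are all assumptions.

```python
import random

def sample_lines(lines, fraction=0.001, seed=2017):
    """Keep each line independently with probability `fraction`.

    The seed (a hypothetical value) makes the sample reproducible.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

# Stand-in for the lines of one corpus file (hypothetical data).
corpus = [f"line {i}" for i in range(100_000)]
subset = sample_lines(corpus)
```

Raising `fraction` enlarges the n-gram tables, which is the trade-off described above: more matches for the predictor, but more data for the server to load and search.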