Coursera Data Science Capstone

2016-03-11

Objective

The main goal of this capstone is to build a web application that predicts the next word in the current sentence. All the data used to create the frequency tables comes from a corpus called HC Corpora.

Workflow

Load Data
Sample Data
Clean Data
Build frequency table
Build model
Test model
Build application
Test application

Clean & Model

How to Clean

A 'WORD' is defined as a sequence of letters and numbers only. All other characters are removed.
Common words such as articles (a, the) and forms of the verb 'be' (are, is) are kept.
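
The cleaning rule above can be illustrated with a minimal Python sketch. It is not the code used in the app. The function name clean_tokens is hypothetical, and two details are assumptions: only ASCII letters and digits count as word characters, and each removed character is replaced by a space so that neighbouring words do not merge.

```python
import re

# Assumption: a "word" consists of ASCII letters and digits only.
NON_WORD = re.compile(r"[^A-Za-z0-9]+")

def clean_tokens(text):
    """Replace every run of non-word characters with a space, then split into words.

    Common words (a, the, is, are, ...) are kept, because no stop-word
    filtering is applied.
    """
    return NON_WORD.sub(" ", text).split()

print(clean_tokens("The cat, is on the mat!"))
```

Keeping common words matters for this task: in next-word prediction, articles and forms of 'be' are often the correct answer.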

How to Model

Tokenization
Prepare unigrams, bigrams and trigrams from the data
Count the occurrences of each unique unigram, bigram and trigram
Get the text phrase from the user
Extract the last two tokens from the phrase.
Calculate the probability of every possible match
Return predicted words
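
The modeling steps above can be sketched in Python as follows. This is an illustration, not the app's actual code. The function names (ngrams, build_counts, predict_next) are hypothetical. The probability is assumed to be a simple relative frequency (count of the matching n-gram divided by the count of all n-grams that share the same context), and the fallback to a shorter context when nothing matches is also an assumption that the slides do not state.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all consecutive n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_counts(sentences, max_n=3):
    """Count every unique unigram, bigram and trigram in tokenized sentences."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts

def predict_next(counts, phrase_tokens, top_k=3):
    """Rank candidate next words given the last two tokens of the phrase.

    Assumption: if the two-token context was never seen, fall back to the
    last token, and finally to plain unigram frequencies.
    """
    context = tuple(phrase_tokens[-2:])
    while True:
        n = len(context) + 1
        # Candidates are n-grams whose first n-1 tokens equal the context.
        candidates = {gram[-1]: c for gram, c in counts[n].items()
                      if gram[:-1] == context}
        if candidates:
            total = sum(candidates.values())
            ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
            return [(word, c / total) for word, c in ranked[:top_k]]
        if not context:
            return []  # empty model: nothing to predict
        context = context[1:]  # drop the oldest token and retry

# Toy data, invented for illustration only.
sentences = [
    ["i", "like", "green", "tea"],
    ["i", "like", "green", "apples"],
    ["i", "like", "green", "tea"],
]
counts = build_counts(sentences)
print(predict_next(counts, ["we", "like", "green"]))
```

On this toy data the context ("like", "green") is followed by "tea" twice and "apples" once, so "tea" is ranked first.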

Shiny App & Code

The application takes some time to load, because my implementation is inefficient :(
Enter text on the screen
The app then shows the words predicted by 2 models, along with some plots