2016-03-11

Objective

The main goal of this capstone is to build a web application that predicts the next word in the current sentence. The frequency tables behind the predictions are built from a corpus called HC Corpora.

Flow to Goal

  1. Load Data
  2. Sample Data
  3. Clean Data
  4. Build frequency table
  5. Build model
  6. Test model
  7. Build application
  8. Test application

Clean & Model

How to Clean

  • A 'WORD' is defined as a sequence of letters and numbers only. All other characters are removed.
  • Common words such as articles (a, the) and forms of 'be' (are, is) are kept.
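
The cleaning rule above can be sketched in a few lines. This is a minimal Python illustration, not the app's actual code: the function name `clean_text` is hypothetical, and lowercasing is an added assumption that the rules above do not state.

```python
import re

def clean_text(text):
    """Return the list of WORD tokens found in `text`.

    Assumption (not stated in the cleaning rules): text is lowercased
    so that 'The' and 'the' count as the same word.
    """
    text = text.lower()
    # Remove every character that is not a letter, a digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Split on whitespace; common words such as 'the' and 'is' are kept.
    return text.split()

tokens = clean_text("The cat's toy, #1 is here!")
```

Note that removing characters outright, rather than replacing them with a space, joins contractions: "cat's" becomes "cats".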

How to Model

  • Tokenization
  • Prepare unigram, bigram and trigram from the data
  • Count the occurrences of each unique unigram, bigram and trigram
  • Get the text phrase from the user
  • Extract the last two tokens from the phrase.
  • Calculate the probability of all the possible matches
  • Return predicted words
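
The steps above can be sketched as follows. This is a rough Python illustration under stated assumptions, not the app's actual implementation: the function names are hypothetical, probabilities are plain relative frequencies, and falling back from trigram to bigram to unigram when nothing matches is an assumed strategy that the list above does not specify.

```python
from collections import Counter

def build_ngram_counts(tokens, max_n=3):
    """Count every unigram, bigram and trigram in a token list."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def predict_next(phrase_tokens, counts, top_k=3):
    """Return up to top_k (word, probability) pairs for the next word.

    Assumption: try the trigram table with the last two tokens first,
    then the bigram table with the last token, then plain unigram counts.
    """
    for n in (3, 2, 1):
        context = tuple(phrase_tokens[-(n - 1):]) if n > 1 else ()
        if len(context) < n - 1:
            continue  # phrase is too short for this n-gram order
        # Collect every n-gram whose leading tokens match the context.
        candidates = {gram[-1]: c for gram, c in counts[n].items()
                      if gram[:-1] == context}
        if candidates:
            total = sum(candidates.values())
            # Highest count first; ties broken alphabetically.
            ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
            return [(word, c / total) for word, c in ranked[:top_k]]
    return []

corpus = "i love you i love cats i love you so".split()
counts = build_ngram_counts(corpus)
predictions = predict_next(["i", "love"], counts)
```

In this toy corpus "i love" is followed by "you" twice and "cats" once, so "you" ranks first with probability 2/3.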

Shiny App & Code

Shiny App

  • Loading the application takes some time. - Inefficiency of my work :(
  • Enter text on the screen
  • The app then shows the words predicted by 2 models, and some plots

Github