20190712

Introduction

This is the final project for the Data Science Capstone course on Coursera.

The goal is to train N-Gram models on the data provided by the course and set up a Shiny app that predicts the next word in a partial sentence.

Clean corpus and build N-Gram models

Raw text data is first built into a corpus using the 'tm' R library.

Several data cleaning steps are then applied, including converting to lower case, removing extra whitespace, and removing punctuation.
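The cleaning steps named above can be sketched in base R. This is only an illustration that stands in for the tm transformations; the function name clean_text is hypothetical and is not taken from the project code.

```r
# Hypothetical helper: a base-R stand-in for the cleaning steps listed above.
clean_text <- function(x) {
  x <- tolower(x)                      # convert to lower case
  x <- gsub("[[:punct:]]+", "", x)     # remove punctuation
  x <- gsub("[[:space:]]+", " ", x)    # collapse runs of whitespace
  trimws(x)                            # drop leading and trailing whitespace
}

cleaned <- clean_text("Hello,   World!  ")
print(cleaned)
```

In the project itself the same effect would come from applying the corresponding tm transformations to the corpus rather than to plain character vectors.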

The clean corpus is then tokenized to create the term-document matrix. Quadgram, trigram, and bigram models are tried here.

From the term-document matrix, we can generate the n-gram word frequencies or probabilities.
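To make the step from counts to probabilities concrete, here is a small base-R sketch. It counts n-grams directly from cleaned lines instead of going through a term-document matrix, and it assumes that "probability" means the relative frequency of the last word given the preceding words. The names make_ngrams and ngram_table are hypothetical.

```r
# Hypothetical helper: all n-grams of one tokenized line.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: frequency table for n-grams (n >= 2) with
# P(word | prefix) = count(prefix word) / sum of counts sharing that prefix.
ngram_table <- function(lines, n) {
  tokens <- strsplit(lines, " ", fixed = TRUE)
  grams <- unlist(lapply(tokens, make_ngrams, n = n))
  counts <- table(grams)
  df <- data.frame(ngram = names(counts),
                   count = as.integer(counts),
                   stringsAsFactors = FALSE)
  df$prefix <- sub(" [^ ]+$", "", df$ngram)   # everything but the last word
  df$word <- sub("^.* ", "", df$ngram)        # the last word
  df$prob <- df$count / ave(df$count, df$prefix, FUN = sum)
  df
}

# Toy input, made up for illustration.
toy_lines <- c("i love you", "i love cats", "i like you")
bigrams <- ngram_table(toy_lines, 2)
print(bigrams[order(-bigrams$count), ])
```

The prefix and word columns are what a next-word lookup needs: given the prefix, pick the word with the highest count or probability.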

Efficient RAM usage

Generating a term-document matrix takes a lot of RAM. It's not possible to load the whole raw text data and go through the process on a normal home laptop, even with a 60% training dataset.

So I split the raw data into 1000 batches and build a term-document matrix for each batch. Any word combination that occurs only once in a batch is pruned. Finally, I combine these 1000 term-document matrices to create the final N-Gram model.
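The batch, prune, and combine idea can be sketched in base R as follows. This is a simplified illustration that works on named count vectors rather than tm term-document matrices; the function names and the min_count parameter are hypothetical.

```r
# Hypothetical helper: named integer vector of n-gram counts for some lines.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(tok) {
    if (length(tok) < n) return(character(0))
    vapply(seq_len(length(tok) - n + 1),
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- table(grams)
  setNames(as.integer(counts), names(counts))
}

# Hypothetical helper: count per batch, prune rare n-grams, sum across batches.
batched_counts <- function(lines, n, n_batches, min_count = 2) {
  batch_id <- ceiling(seq_along(lines) * n_batches / length(lines))
  total <- integer(0)
  for (batch in split(lines, batch_id)) {
    counts <- count_ngrams(batch, n)
    counts <- counts[counts >= min_count]   # prune n-grams seen only once
    if (length(counts) == 0) next
    merged <- c(total, counts)
    sums <- tapply(merged, names(merged), sum)
    total <- setNames(as.integer(sums), names(sums))
  }
  total
}

# Toy input, made up for illustration: two batches of three lines each.
toy_lines <- c("a b", "a b", "a c", "a b", "a b", "c d")
print(batched_counts(toy_lines, n = 2, n_batches = 2))
```

Only one batch is held in memory at a time, which is the point of the approach. The trade-off is that an n-gram appearing once in each of several batches is dropped even though its overall count is greater than one.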

Prediction Model

When the partial sentence is supplied by the user, it will be cleaned and tokenized, and then fed into our N-Gram model.

The Quadgram model is used first; if no matching word is found, the prediction backs off to the Trigram or Bigram model.
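The back-off lookup can be sketched like this in base R. It assumes each model is a data frame with prefix, word, and count columns and that the most frequent match wins; that layout, the function name predict_next, and the toy counts are all hypothetical.

```r
# Toy models, made up for illustration; keys are the n-gram order.
models <- list(
  "4" = data.frame(prefix = "thanks for the", word = "help", count = 3,
                   stringsAsFactors = FALSE),
  "3" = data.frame(prefix = c("for the", "for the"),
                   word = c("first", "record"), count = c(5, 2),
                   stringsAsFactors = FALSE),
  "2" = data.frame(prefix = "the", word = "same", count = 9,
                   stringsAsFactors = FALSE)
)

# Hypothetical helper: try the quadgram model, then trigram, then bigram.
predict_next <- function(text, models) {
  cleaned <- trimws(gsub("[[:punct:]]+", "", tolower(text)))
  tokens <- strsplit(cleaned, "[[:space:]]+")[[1]]
  for (n in c(4, 3, 2)) {
    k <- n - 1                               # words of context needed
    if (length(tokens) < k) next
    prefix <- paste(tail(tokens, k), collapse = " ")
    model <- models[[as.character(n)]]
    hits <- model[model$prefix == prefix, ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$count)])
  }
  NA_character_                              # no model had a match
}

print(predict_next("Thanks for the", models))
print(predict_next("waiting for the", models))
print(predict_next("over the", models))
```

Returning NA when nothing matches leaves it to the app to decide what to show, for example a list of the most common words.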