Coursera Data Science Capstone Project: Word Prediction

CY Ting
29/4/2016

Project Overview

This project uses three data sources: Twitter, blogs, and news. The combined dataset is more than 500 MB, but only 10% of it was used for next-word prediction.

This project uses n-gram models with n = 2, 3, and 4 only.

The final product can be found at https://tingshinyapps.shinyapps.io/WordPrediction/

Data Preprocessing

The original dataset consists of:

  • blogs.txt = 205 MB
  • news.txt = 200 MB
  • Twitter.txt = 163 MB

The data must be pre-processed before any next-word prediction can be made. Pre-processing includes data cleaning, outlier removal, data transformation, and data sampling. Only 10% of the cleaned dataset was used in this project.
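
As a rough illustration of the cleaning and sampling steps, here is a minimal sketch in base R. The helper names (clean_lines, sample_lines), the cleaning rules, and the seed are assumptions made for this example, not the project's actual code; outlier removal is left out because the slides do not say how it was done.

```r
# Hypothetical helper: lower-case the text, replace anything that is not a
# letter, apostrophe, or space with a space, then squeeze and trim whitespace.
clean_lines <- function(lines) {
  lines <- tolower(lines)
  lines <- gsub("[^a-z' ]+", " ", lines)
  lines <- gsub("\\s+", " ", lines)
  lines <- trimws(lines)
  lines[nzchar(lines)]  # drop lines that ended up empty
}

# Hypothetical helper: keep a random fraction of the lines (at least one).
# Assumes `lines` is not empty.
sample_lines <- function(lines, fraction = 0.10, seed = 1) {
  set.seed(seed)  # fixed seed so the sample is reproducible
  n_keep <- max(1, floor(length(lines) * fraction))
  lines[sample.int(length(lines), n_keep)]
}

raw <- c("Hello, World!!", "  R is   FUN 2016 ", "###", "Next-word prediction")
cleaned <- clean_lines(raw)
sampled <- sample_lines(cleaned, fraction = 0.10)
print(cleaned)
```

With real data, raw would come from something like readLines() on each of the three source files before cleaning and sampling.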

The dataset was then prepared for 2-gram, 3-gram, and 4-gram next-word prediction. The resulting data is saved in “.RData” format.
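
One possible way to build and store the n-gram frequency tables is sketched below in base R. The function name ngram_counts, the column names, the toy sentences, and the file name ngrams.RData are all hypothetical; the actual project may have used a text-mining package instead.

```r
# Hypothetical helper: count every n-gram in a character vector of cleaned
# lines and return them sorted from most to least frequent.
ngram_counts <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(words) {
    if (length(words) < n) return(character(0))  # line too short for an n-gram
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy input standing in for the sampled, cleaned text.
lines <- c("i love r", "i love data", "i love r a lot")

bi   <- ngram_counts(lines, 2)
tri  <- ngram_counts(lines, 3)
four <- ngram_counts(lines, 4)

# Save all three tables in one .RData file (hypothetical name and location).
rdata_path <- file.path(tempdir(), "ngrams.RData")
save(bi, tri, four, file = rdata_path)

# Later, e.g. inside the Shiny app, the tables can be restored with load().
restored <- new.env()
load(rdata_path, envir = restored)
print(head(bi))
```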

Prediction Algorithm

A step-by-step illustration of the prediction process is given below:

  1. count the words in the input text and keep only the last three
  2. load the datasets for 2-gram, 3-gram, and 4-gram
  3. match the input text with the 4-gram dataset
    3.1 if found, return the 4th word with the highest frequency
    3.2 if not found, match the last two words with the 3-gram dataset
    3.3 if not found, match the last word with the 2-gram dataset
    3.4 repeat until found
  4. output through ui.R
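
The back-off steps above can be sketched roughly as follows. This is an illustrative base R version, not the app's actual server code: the function name predict_next, the table layout (prefix, nextword, freq), and the toy counts are all assumptions, and returning NA when nothing matches is a choice made for this example only.

```r
# Toy n-gram tables with made-up counts. Each row holds a prefix, the word
# that follows it, and how often that n-gram was seen.
bigrams   <- data.frame(prefix = c("love", "love", "the"),
                        nextword = c("r", "data", "end"),
                        freq = c(5L, 3L, 2L), stringsAsFactors = FALSE)
trigrams  <- data.frame(prefix = c("i love", "i love"),
                        nextword = c("r", "data"),
                        freq = c(4L, 1L), stringsAsFactors = FALSE)
fourgrams <- data.frame(prefix = "do i love", nextword = "data",
                        freq = 2L, stringsAsFactors = FALSE)

# tables[[k]] is the table whose prefix has k words (k = 1 is the 2-gram table).
tables <- list(bigrams, trigrams, fourgrams)

# Hypothetical helper: back off from the 4-gram table to the 2-gram table.
predict_next <- function(text, tables) {
  words <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(NA_character_)
  # Use at most the last three words, longest prefix first.
  for (k in min(3, length(words)):1) {
    prefix <- paste(tail(words, k), collapse = " ")
    tbl <- tables[[k]]
    hits <- tbl[tbl$prefix == prefix, , drop = FALSE]
    if (nrow(hits) > 0) {
      return(hits$nextword[which.max(hits$freq)])  # most frequent next word
    }
  }
  NA_character_  # assumption: no match in any table
}

print(predict_next("What do I love", tables))   # matched in the 4-gram table
print(predict_next("I love", tables))           # matched in the 3-gram table
print(predict_next("we really love", tables))   # falls back to the 2-gram table
```

In a Shiny app, a function like this would typically be called from the server side, with the result passed to an output element defined in ui.R.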

R Shiny App