Coursera Data Science Capstone Project: Word Prediction

CY Ting
29/4/2016

Project Overview

This project uses three data sources: Twitter, blogs, and news. The combined dataset is more than 500 MB in size. To reduce the server loading time, however, only 10% of the combined dataset was used for next-word prediction.

This project implements n-gram models with n = 2, 3, and 4 only.

The final product can be found at https://tingshinyapps.shinyapps.io/WordPrediction/

Data Preprocessing

Details of the original dataset:

blogs.txt = 205 MB
news.txt = 200 MB
Twitter.txt = 163 MB

The data requires pre-processing before any next-word prediction can take place. The pre-processing includes data cleaning, outlier removal, data transformation, and data sampling. Only 10% of the cleaned dataset was used in this project.
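The cleaning and sampling steps can be sketched as follows. This is a minimal illustration in Python rather than the project's own R code, and the specific rules are assumptions: the source does not say which characters were removed or how the 10% sample was drawn. The names clean_line and sample_lines are hypothetical.

```python
import random
import re

def clean_line(line):
    """Assumed cleaning rule: lowercase, keep only letters, apostrophes, and spaces."""
    line = line.lower()
    line = re.sub(r"[^a-z' ]+", " ", line)    # replace digits, punctuation, symbols
    return re.sub(r"\s+", " ", line).strip()  # collapse repeated whitespace

def sample_lines(lines, fraction=0.10, seed=42):
    """Keep roughly `fraction` of the non-empty cleaned lines, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the sample can be regenerated
    cleaned = (clean_line(line) for line in lines)
    return [line for line in cleaned if line and rng.random() < fraction]
```

Sampling line by line keeps memory use low, because each source file can be streamed instead of loaded whole.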

The dataset was prepared for 2-gram, 3-gram, and 4-gram next-word prediction. The data is saved in “.RData” format.
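Preparing the data for n-gram prediction amounts to counting how often each sequence of n words occurs. The sketch below shows one way to do this in Python. It is illustrative only, since the project itself stores its tables in R, and the function name ngram_counts and whitespace tokenization are assumptions.

```python
from collections import Counter

def ngram_counts(lines, n):
    """Count every run of n consecutive words across the given cleaned lines."""
    counts = Counter()
    for line in lines:
        tokens = line.split()  # assumed tokenization: split on whitespace
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# One frequency table per model order used in the project (n = 2, 3, 4).
sample = ["the cat sat on the mat", "the cat ran"]
tables = {n: ngram_counts(sample, n) for n in (2, 3, 4)}
```

N-grams are counted within a line only, so a word at the end of one line is never paired with the first word of the next.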

Prediction Algorithm

A step-by-step illustration of the prediction process is given below:

1. Count the words in the input text and keep only the last three.
2. Load the datasets for 2-gram, 3-gram, and 4-gram.
3. Match the input text against the 4-gram dataset.
3.1 If found, return the fourth word with the highest frequency.
3.2 If not found, match the last two words against the 3-gram dataset.
3.3 If not found, match the last word against the 2-gram dataset.
3.4 Repeat until found.
4. Output the result through ui.R.
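The back-off lookup described in the steps above can be sketched as follows. This is a Python illustration under stated assumptions, not the app's R implementation: the names build_tables and predict_next are hypothetical, tokenization is plain whitespace splitting, and returning None when no n-gram matches is an assumption, since the source does not say what happens in that case.

```python
from collections import Counter

def build_tables(lines, orders=(2, 3, 4)):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    tables = {n: {} for n in orders}
    for line in lines:
        tokens = line.split()
        for n in orders:
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                next_word = tokens[i + n - 1]
                tables[n].setdefault(context, Counter())[next_word] += 1
    return tables

def predict_next(text, tables):
    """Try the 4-gram table first, then back off to 3-gram and 2-gram."""
    tokens = text.lower().split()[-3:]  # keep the last three words only
    for n in (4, 3, 2):
        if len(tokens) < n - 1:
            continue  # input too short for this model order
        context = tuple(tokens[-(n - 1):])
        followers = tables.get(n, {}).get(context)
        if followers:
            return followers.most_common(1)[0][0]  # most frequent next word
    return None  # assumption: no match at any order

corpus = ["i want to go home", "i want to go out",
          "i want to go home", "you want to eat"]
tables = build_tables(corpus)
print(predict_next("I want to go", tables))
print(predict_next("they need to", tables))
```

Indexing each table by its context makes every lookup a single dictionary access, which keeps the response time short even when the tables are large.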

R Shiny App

https://tingshinyapps.shinyapps.io/WordPrediction/