9/7/2017

Introduction

This presentation is created to fullfill the requirements of the Data Science Capstone Prpject of the Data Science Specialization provided by Johns Hopkins University.

The goal of this project is to build a predictive text model by analyzing a large corpus of text documents to discover the structure in the data and how words are put together. My work will cover cleaning and analyzing text data, then building and sampling from a predictive text model. Finally, I will present the predictive text product that I have built using Shiny application.

Getting and preparing data

  1. Sampling data I have used Blog, News and Twitter data sets to build my text model. Since these data sets are pretty big in size, I had to sample the data first and 0.1% of each data set is sampled to due to the memory capacity of my machine.
  2. Cleaning data Data cleaning process includes converstion to all lower cases, elimination of punctuations, elimination of numbers, and striping whitespaces.
  3. Create ngrams The corresponding n-grams are created with descending order frequencies and combined into a single list which is exported as RData file for further use by the prediction algorithm.

Next word prediction model

The prediction model for next word is based on the Katz Back-off algorithm.

  1. RData file is loaded. RData file contains ngrams with words in descending frequency orders.

  2. The prediction function is initialized where Trigram is used first followed by Bigram and Unigram.

  3. If the word is not found in the Trigram list then it backs off to Bigram.

  4. If no Bigram is found then it backs off to the most common word with highest frequency rate in the Unigram table.

Shiny Application and References