
Introduction

People around the world exchange a phenomenal volume of text through emails, social networks, and text messages.

The key challenge today is to parse and analyze this user-generated natural-language text across different digital media. Analyzing it gives insights into people’s preferences regarding products, services, and more. Companies can use these insights to build predictive models and deliver better products and services to consumers.

Executive Summary

This report presents an initial exploratory analysis of a large corpus of text documents, aimed at discovering structure in the data and relationships between words.

Download and Load the Data:

## Load the three English corpora; skipNul = TRUE skips embedded null
## characters, and UTF-8 keeps special characters intact
blogs   <- readLines("./final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("./final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)

Summary Statistics of the Data

STEP 1:

Get Summary of the Data

File Name        File Size (in MB)   Number of Words   Number of Lines
en_US.blogs      200.4242            38154238          899288
en_US.twitter    159.3641            30218125          2360148
en_US.news       196.2775            2693898           77259
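
One way these figures can be computed, using base R only (a sketch; word counts here come from a simple whitespace split, so they may differ slightly from the table):

## Summarise one file: size on disk, whitespace-delimited word count, line count
file_summary <- function(path, lines) {
  data.frame(
    File  = basename(path),
    MB    = round(file.info(path)$size / 1024^2, 4),
    Words = sum(lengths(strsplit(lines, "\\s+"))),
    Lines = length(lines)
  )
}

rbind(
  file_summary("./final/en_US/en_US.blogs.txt",   blogs),
  file_summary("./final/en_US/en_US.twitter.txt", twitter),
  file_summary("./final/en_US/en_US.news.txt",    news)
)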

STEP 2:

Create Sample CSV Files

dir.create("Sample/", showWarnings = FALSE)

## Draw uniform random samples of the original data (1% of blogs and news,
## 0.5% of twitter) and write them to new files
set.seed(20)
blogs1 <- sample(blogs, round(length(blogs) * 0.01))
write.csv(blogs1, file = "Sample/blogs1.csv", row.names = FALSE)

set.seed(20)
news1 <- sample(news, round(length(news) * 0.01))
write.csv(news1, file = "Sample/news1.csv", row.names = FALSE)

set.seed(20)
twitter1 <- sample(twitter, round(length(twitter) * 0.005))
write.csv(twitter1, file = "Sample/twitter1.csv", row.names = FALSE)
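
As a quick optional check, each sample should be roughly the intended fraction of its source:

length(blogs1)   / length(blogs)    # ~0.01
length(news1)    / length(news)     # ~0.01
length(twitter1) / length(twitter)  # ~0.005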

STEP 3:

Tokenization

The third step is to perform tokenization on the sample corpus. In natural language processing, tokenization breaks text into clusters of one, two, or more words (n-grams). Before any real text processing can be done, the text needs to be segmented into linguistic units such as words, punctuation, numbers, and alphanumerics.

A token is linguistically significant and methodologically useful for analysis and for building a predictive model. Identifying significant tokens helps reveal patterns of strong collocation.
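
As an illustration, unigram and bigram tokens can be produced from the news sample with the tokenizers package; the package choice here is an assumption, and RWeka or quanteda would work equally well:

library(tokenizers)

## Read the sampled news lines back in (write.csv stored them as one column)
news_sample <- read.csv("Sample/news1.csv", stringsAsFactors = FALSE)[[1]]

unigrams <- tokenize_words(news_sample)          # single-word tokens
bigrams  <- tokenize_ngrams(news_sample, n = 2)  # two-word clusters

## Ten most frequent bigrams in the sample
head(sort(table(unlist(bigrams)), decreasing = TRUE), 10)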

[Figures: tokenization results for the Twitter, Blogs, and News samples, and a word cloud for the News unigram tokenizer.]
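
A word cloud like the one referenced above could be drawn with the wordcloud package (again, the package choice is an assumption):

library(wordcloud)

## Frequency table of the news unigrams from the tokenization step
uni_freq <- sort(table(unlist(unigrams)), decreasing = TRUE)

## Plot the 100 most frequent news unigrams
wordcloud(names(uni_freq), as.integer(uni_freq),
          max.words = 100, random.order = FALSE)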

Interesting Observation

People seem to use Twitter to send out quick “Thank You” messages. Both Twitter and blogs allow individuals to present their personal perspectives, as indicated by the 3-gram tokenization results for those two sources.

Goals for Creating a Prediction Algorithm and Shiny App

Here’s the plan for building a predictive model and a Shiny App for our text data:

  1. Pick one of the n-gram models developed here; a 2-gram (bigram) model is a good starting point.
  2. Build a function that can generate an n-gram tokenizer of any order for analysis.
  3. Build an algorithm that moves from one state to the next using a weighted list of probabilities (a Markov chain); a minimal sketch follows this list.
  4. The algorithm should ultimately predict the next word or set of words from the characters or words a user has entered.
  5. The Shiny App will render the predicted next characters or words as output when the user enters a set of characters or words.
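
A minimal sketch of the Markov-chain step (item 3), assuming the bigram tokens built in the tokenization step above; the function name predict_next is illustrative, not part of the final app:

## One-step Markov chain over words: given the previous word, rank the
## words observed to follow it in the bigram tokens
predict_next <- function(word, bigram_tokens, top = 3) {
  hits <- bigram_tokens[startsWith(bigram_tokens, paste0(tolower(word), " "))]
  if (length(hits) == 0) return(character(0))
  followers <- sub("^\\S+\\s+", "", hits)   # drop the leading word
  ranked    <- sort(table(followers), decreasing = TRUE)
  names(head(ranked, top))                  # most likely next words
}

## Example: candidate words to follow "thank" in the news sample
predict_next("thank", unlist(bigrams))

The Shiny App would run the same lookup on the text typed by the user and render the ranked candidates as its output.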