
Introduction

People around the world exchange a phenomenal volume of text through emails, social networks, and text messages.

The key challenge today is to parse and analyze this user-generated natural-language text across different digital media. Analyzing it gives insights into people’s preferences regarding products, services, and more. Companies can use these insights to build predictive models and deliver better products and services to consumers.

Executive Summary

This report presents an initial exploratory analysis of a large corpus of text documents, aimed at discovering structure in the data and relationships between words.

Download and Load the Data:

## Load the three English corpora; skipNul = TRUE skips embedded null
## characters, and UTF-8 keeps special characters intact
blogs   <- readLines("./final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("./final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)

Summary Statistics of the Data

STEP 1:

Get Summary of the Data

File Name        File Size (in MB)   Number of Words   Number of Lines
en_US.blogs      200.4242            38154238          899288
en_US.twitter    159.3641            30218125          2360148
en_US.news       196.2775            2693898           77259
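
One way these figures can be computed, using base R only (a sketch; word counts here come from a simple whitespace split, so they may differ slightly from the table):

## Summarise one file: size on disk, whitespace-delimited word count, line count
file_summary <- function(path, lines) {
  data.frame(
    File  = basename(path),
    MB    = round(file.info(path)$size / 1024^2, 4),
    Words = sum(lengths(strsplit(lines, "\\s+"))),
    Lines = length(lines)
  )
}

rbind(
  file_summary("./final/en_US/en_US.blogs.txt",   blogs),
  file_summary("./final/en_US/en_US.twitter.txt", twitter),
  file_summary("./final/en_US/en_US.news.txt",    news)
)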

STEP 2:

Create Sample CSV Files

dir.create("Sample/", showWarnings = FALSE)

## Draw uniform random samples of the original data (1% of blogs and news,
## 0.5% of twitter) and write them to new files
set.seed(20)
blogs1 <- sample(blogs, round(length(blogs) * 0.01))
write.csv(blogs1, file = "Sample/blogs1.csv", row.names = FALSE)

set.seed(20)
news1 <- sample(news, round(length(news) * 0.01))
write.csv(news1, file = "Sample/news1.csv", row.names = FALSE)

set.seed(20)
twitter1 <- sample(twitter, round(length(twitter) * 0.005))
write.csv(twitter1, file = "Sample/twitter1.csv", row.names = FALSE)
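
As a quick optional check, each sample should be roughly the intended fraction of its source:

length(blogs1)   / length(blogs)    # ~0.01
length(news1)    / length(news)     # ~0.01
length(twitter1) / length(twitter)  # ~0.005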

STEP 3:

Tokenization

The third step is to perform tokenization on the sample corpus. In natural language processing, tokenization breaks text into clusters of one, two, or more words (n-grams). Before any real text processing can be done, the text needs to be segmented into linguistic units such as words, punctuation, numbers, and alphanumerics.

A token is linguistically significant and methodologically useful for analysis and for building a predictive model. Identifying significant tokens helps reveal patterns of strong collocation.
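
As an illustration, unigram and bigram tokens can be produced from the news sample with the tokenizers package; the package choice here is an assumption, and RWeka or quanteda would work equally well:

library(tokenizers)

## Read the sampled news lines back in (write.csv stored them as one column)
news_sample <- read.csv("Sample/news1.csv", stringsAsFactors = FALSE)[[1]]

unigrams <- tokenize_words(news_sample)          # single-word tokens
bigrams  <- tokenize_ngrams(news_sample, n = 2)  # two-word clusters

## Ten most frequent bigrams in the sample
head(sort(table(unlist(bigrams)), decreasing = TRUE), 10)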

[Figures: tokenization results for the Twitter, Blogs, and News samples, and a word cloud for the News unigram tokenizer.]
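
A word cloud like the one referenced above could be drawn with the wordcloud package (again, the package choice is an assumption):

library(wordcloud)

## Frequency table of the news unigrams from the tokenization step
uni_freq <- sort(table(unlist(unigrams)), decreasing = TRUE)

## Plot the 100 most frequent news unigrams
wordcloud(names(uni_freq), as.integer(uni_freq),
          max.words = 100, random.order = FALSE)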

Interesting Observation

People seem to use Twitter to send out quick “Thank You” messages. Both Twitter and blogs allow individuals to present their personal perspectives, as indicated by the 3-gram tokenization results for those two sources.

Goals for Creating a Prediction Algorithm and Shiny App

Here’s the plan for building a predictive model and a Shiny App for our text data:

  1. Pick one of the n-gram models developed here; a 2-gram (bigram) model is a good starting point.
  2. Build a function that can generate an n-gram tokenizer of any order for analysis.
  3. Build an algorithm that moves from one state to the next using a weighted list of probabilities (a Markov chain); a minimal sketch follows this list.
  4. The algorithm should ultimately predict the next word or set of words from the characters or words a user has entered.
  5. The Shiny App will render the predicted next characters or words as output when the user enters a set of characters or words.
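
A minimal sketch of the Markov-chain step (item 3), assuming the bigram tokens built in the tokenization step above; the function name predict_next is illustrative, not part of the final app:

## One-step Markov chain over words: given the previous word, rank the
## words observed to follow it in the bigram tokens
predict_next <- function(word, bigram_tokens, top = 3) {
  hits <- bigram_tokens[startsWith(bigram_tokens, paste0(tolower(word), " "))]
  if (length(hits) == 0) return(character(0))
  followers <- sub("^\\S+\\s+", "", hits)   # drop the leading word
  ranked    <- sort(table(followers), decreasing = TRUE)
  names(head(ranked, top))                  # most likely next words
}

## Example: candidate words to follow "thank" in the news sample
predict_next("thank", unlist(bigrams))

The Shiny App would run the same lookup on the text typed by the user and render the ranked candidates as its output.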