Introduction

This is the Week 2 milestone assignment for the Capstone project, part of the Data Science Specialization on Coursera. Presented here is an exploratory analysis of the Capstone data set, which consists of three English-language text files sourced from blogs, Twitter, and news articles.

The objective is to understand the distribution of, and relationships between, the words, tokens, and phrases in the texts. Ultimately, this exploratory analysis will serve as a foundation for the linguistic models used for next-word prediction.

Understanding the data

Let’s explore each data set to understand the basic features of the text files.

library(stringr)
library(stringi)
library(tokenizers)
library(wordcloud)
library(knitr)

# Read the raw text files (file names assume the standard Coursera/SwiftKey
# layout; adjust the paths as needed)
blogs   <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)

all_data <- list(blogs, twitter, news)

n_lines <- sapply(all_data, length)
n_words <- sapply(all_data, function(x) sum(stri_count_boundaries(x, type="word")))
n_sentence <- sapply(all_data, function(x) sum(stri_count_boundaries(x, type="sentence")))
n_characters <- sapply(all_data, function(x) sum(stri_count_boundaries(x, type="character")))

summary_data <- data.frame(Data_Source = c("Blogs", "Twitter", "News"), n_lines, n_words, n_sentence, n_characters)
kable(summary_data)
Data_Source    n_lines     n_words   n_sentence   n_characters
-----------  ---------  ----------  -----------  -------------
Blogs           899288    79345275      2381035      206043906
Twitter        2360148    65141209      3770622      161961555
News           1010242    74103402      2024367      202917604

The table above shows the number of lines, words, sentences, and characters in each of the three texts.

The histograms below show the distribution of the number of words per line for each data set.

The maximum number of words per line was 14,200 for blogs, 93 for Twitter, and 4,102 for news.
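As a minimal sketch, the per-line word counts behind these histograms could be computed with stri_count_words from stringi and plotted with base graphics, reusing all_data from above:

# Words per line for each source, plotted side by side
words_per_line <- lapply(all_data, stri_count_words)

par(mfrow = c(1, 3))
hist(words_per_line[[1]], breaks = 50, main = "Blogs", xlab = "Words per line")
hist(words_per_line[[2]], breaks = 50, main = "Twitter", xlab = "Words per line")
hist(words_per_line[[3]], breaks = 50, main = "News", xlab = "Words per line")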

Sample Data

Clearly, these are very large data sets, so it is necessary to sample representative subsets of each text file before performing further analyses. Lines will be randomly sampled from each of the three files using the rbinom function.

# Sampling roughly 5000 lines from each of the data sets
set.seed(123)
bset <- rbinom(length(blogs), 1, 0.005)   # keep each line with probability 0.005
sampleb <- blogs[bset == 1]

set.seed(987)
tset <- rbinom(length(twitter), 1, 0.002)
samplet <- twitter[tset == 1]

set.seed(106)
nset <- rbinom(length(news), 1, 0.005)
samplen <- news[nset == 1]

# Combining sub-sample from blogs, twitter and news
sample <- c(sampleb, samplet, samplen)

Further exploratory analyses will be performed with the sample data set that combines approximately 5000 lines from each of the three text sources.

Generating a Clean Corpus

These data sets may contain offensive and profane words. A list of about 450 such words can be found on GitHub. Let’s remove any of these bad words from the sample data set.

Let’s also strip extra whitespace, punctuation, and numbers for ease of analysis.

library(tm)

# Load the profanity list (file name assumed; a local copy of the
# ~450-word list downloaded from GitHub)
badWords <- readLines("badwords.txt", encoding = "UTF-8", skipNul = TRUE)

cleanSample <- removeWords(sample, badWords)
n_sampleWords <- sum(stri_count_words(sample))
n_cleanSampleWords <- sum(stri_count_words(cleanSample))

# Share of words removed as profanity
pctBadWords <- (n_sampleWords - n_cleanSampleWords) / n_sampleWords * 100

sampleCorpora <- VCorpus(VectorSource(sample))

# Fold the cleaning transformations into a single pass with tm_reduce
funs <- list(stripWhitespace, removePunctuation, removeNumbers, content_transformer(tolower))
sampleCorpora <- tm_map(sampleCorpora, FUN = tm_reduce, tmFuns = funs)
sampleCorpora <- tm_map(sampleCorpora, removeWords, badWords)

Comparing the word counts before and after removal shows that only about 0.11% of the words in the sample data were bad words.

N-gram Tokenization

Let’s create 1-gram, 2-gram, and 3-gram tokens from the data set. The bar graphs below show the 20 most common 1-grams, 2-grams, and 3-grams in the sample corpus.

library(RWeka)

unigram <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))

# Build term-document matrices, dropping very sparse n-grams to keep them manageable
tdm1 <- removeSparseTerms(TermDocumentMatrix(sampleCorpora, control = list(tokenize = unigram)), 0.999)
tdm2 <- removeSparseTerms(TermDocumentMatrix(sampleCorpora, control = list(tokenize = bigram)), 0.999)
tdm3 <- removeSparseTerms(TermDocumentMatrix(sampleCorpora, control = list(tokenize = trigram)), 0.999)

head(findFreqTerms(tdm1, lowfreq = 200)) # 1-grams occurring at least 200 times
## [1] "about"  "after"  "again"  "all"    "also"   "always"
head(twograms <- findFreqTerms(tdm2, lowfreq = 100)) # 2-grams occurring at least 100 times
## [1] "a few"    "a good"   "a great"  "a little" "a lot"    "a new"
head(threegrams <- findFreqTerms(tdm3, lowfreq = 50)) # 3-grams occurring at least 50 times
## [1] "a couple of" "a lot of"    "as well as"  "be able to"  "i dont know"
## [6] "i want to"
# Total frequency of each n-gram, sorted in decreasing order
freq1 <- sort(rowSums(as.matrix(tdm1)), decreasing = TRUE)
freq2 <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
freq3 <- sort(rowSums(as.matrix(tdm3)), decreasing = TRUE)

barplot(freq1[1:20], las = 2, ylab = "Single Word Frequency")

barplot(freq2[1:20], las = 2, ylab = "Couple Word Frequency", cex.names = 0.9)

barplot(freq3[1:20], las = 2, ylab = "Triple Word Frequency", cex.names = 0.57)
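
The wordcloud package loaded earlier offers another view of the same frequencies; a minimal sketch using the freq1 vector computed above:

# Word cloud of the 100 most frequent unigrams
library(RColorBrewer)
wordcloud(names(freq1), freq1, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))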

Summary

This exploratory analysis performs the following steps:

- Summarizes the line, word, sentence, and character counts of the blogs, Twitter, and news text files.
- Draws a random sample of roughly 5000 lines from each file to keep further analysis tractable.
- Builds a clean corpus by removing profanity, extra whitespace, punctuation, and numbers, and converting the text to lower case.
- Tokenizes the corpus into 1-grams, 2-grams, and 3-grams and examines their frequency distributions.

These n-gram frequencies will serve as the foundation for the next-word prediction model.