Introduction

This is the Week 2 milestone assignment for the Capstone project, part of the Data Science Specialization on Coursera. Presented here is an exploratory analysis of the Capstone data set, which consists of three English-language text files sourced from blogs, Twitter, and news articles.

The objective is to understand the distribution of, and relationships between, the words, tokens, and phrases in the texts. Ultimately, this exploratory analysis will serve as a foundation for the linguistic models used for next-word prediction.

Understanding the data

Let’s explore each data set to understand the basic features of the text files.

library(stringr)
library(stringi)
library(tokenizers)
library(wordcloud)
library(knitr)

# Read the raw text files (file names assume the standard Coursera/SwiftKey
# layout; adjust the paths as needed)
blogs   <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)

all_data <- list(blogs, twitter, news)

n_lines <- sapply(all_data, length)
n_words <- sapply(all_data, function(x) sum(stri_count_boundaries(x, type="word")))
n_sentence <- sapply(all_data, function(x) sum(stri_count_boundaries(x, type="sentence")))
n_characters <- sapply(all_data, function(x) sum(stri_count_boundaries(x, type="character")))

summary_data <- data.frame(Data_Source = c("Blogs", "Twitter", "News"), n_lines, n_words, n_sentence, n_characters)
kable(summary_data)
Data_Source    n_lines     n_words   n_sentence   n_characters
-----------  ---------  ----------  -----------  -------------
Blogs           899288    79345275      2381035      206043906
Twitter        2360148    65141209      3770622      161961555
News           1010242    74103402      2024367      202917604

The table above shows the number of lines, words, sentences, and characters in each of the three texts.

The histograms below show the distribution of the number of words per line for each data set.

The maximum number of words per line was 14,200 for blogs, 93 for Twitter, and 4,102 for news.
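As a minimal sketch, the per-line word counts behind these histograms could be computed with stri_count_words from stringi and plotted with base graphics, reusing all_data from above:

# Words per line for each source, plotted side by side
words_per_line <- lapply(all_data, stri_count_words)

par(mfrow = c(1, 3))
hist(words_per_line[[1]], breaks = 50, main = "Blogs", xlab = "Words per line")
hist(words_per_line[[2]], breaks = 50, main = "Twitter", xlab = "Words per line")
hist(words_per_line[[3]], breaks = 50, main = "News", xlab = "Words per line")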

Sample Data

Clearly, these are very large data sets, so it is necessary to sample representative subsets of each text file before performing further analyses. Lines will be randomly sampled from each of the three files using the rbinom function.

# Sampling roughly 5000 lines from each of the data sets
set.seed(123)
bset <- rbinom(length(blogs), 1, 0.005)   # keep each line with probability 0.005
sampleb <- blogs[bset == 1]

set.seed(987)
tset <- rbinom(length(twitter), 1, 0.002)
samplet <- twitter[tset == 1]

set.seed(106)
nset <- rbinom(length(news), 1, 0.005)
samplen <- news[nset == 1]

# Combining sub-sample from blogs, twitter and news
sample <- c(sampleb, samplet, samplen)

Further exploratory analyses will be performed with the sample data set that combines approximately 5000 lines from each of the three text sources.

Generating a Clean Corpus

These data sets may contain offensive and profane words. A list of about 450 such words can be found on GitHub. Let’s remove any of these bad words from the sample data set.

Let’s also strip extra whitespace, punctuation, and numbers for ease of analysis.

library(tm)

# Load the profanity list (file name assumed; a local copy of the
# ~450-word list downloaded from GitHub)
badWords <- readLines("badwords.txt", encoding = "UTF-8", skipNul = TRUE)

cleanSample <- removeWords(sample, badWords)
n_sampleWords <- sum(stri_count_words(sample))
n_cleanSampleWords <- sum(stri_count_words(cleanSample))

# Share of words removed as profanity
pctBadWords <- (n_sampleWords - n_cleanSampleWords) / n_sampleWords * 100

sampleCorpora <- VCorpus(VectorSource(sample))

# Fold the cleaning transformations into a single pass with tm_reduce
funs <- list(stripWhitespace, removePunctuation, removeNumbers, content_transformer(tolower))
sampleCorpora <- tm_map(sampleCorpora, FUN = tm_reduce, tmFuns = funs)
sampleCorpora <- tm_map(sampleCorpora, removeWords, badWords)

Comparing the word counts before and after removal shows that only about 0.11% of the words in the sample data were bad words.

N-gram Tokenization

Let’s create 1-gram, 2-gram, and 3-gram tokens from the data set. The bar graphs below show the 20 most common 1-grams, 2-grams, and 3-grams in the sample corpus.

library(RWeka)

unigram <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))

# Build term-document matrices, dropping very sparse n-grams to keep them manageable
tdm1 <- removeSparseTerms(TermDocumentMatrix(sampleCorpora, control = list(tokenize = unigram)), 0.999)
tdm2 <- removeSparseTerms(TermDocumentMatrix(sampleCorpora, control = list(tokenize = bigram)), 0.999)
tdm3 <- removeSparseTerms(TermDocumentMatrix(sampleCorpora, control = list(tokenize = trigram)), 0.999)

head(findFreqTerms(tdm1, lowfreq = 200)) # 1-grams occurring at least 200 times
## [1] "about"  "after"  "again"  "all"    "also"   "always"
head(twograms <- findFreqTerms(tdm2, lowfreq = 100)) # 2-grams occurring at least 100 times
## [1] "a few"    "a good"   "a great"  "a little" "a lot"    "a new"
head(threegrams <- findFreqTerms(tdm3, lowfreq = 50)) # 3-grams occurring at least 50 times
## [1] "a couple of" "a lot of"    "as well as"  "be able to"  "i dont know"
## [6] "i want to"
# Total frequency of each n-gram, sorted in decreasing order
freq1 <- sort(rowSums(as.matrix(tdm1)), decreasing = TRUE)
freq2 <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
freq3 <- sort(rowSums(as.matrix(tdm3)), decreasing = TRUE)

barplot(freq1[1:20], las = 2, ylab = "Single Word Frequency")

barplot(freq2[1:20], las = 2, ylab = "Couple Word Frequency", cex.names = 0.9)

barplot(freq3[1:20], las = 2, ylab = "Triple Word Frequency", cex.names = 0.57)
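
The wordcloud package loaded earlier offers another view of the same frequencies; a minimal sketch using the freq1 vector computed above:

# Word cloud of the 100 most frequent unigrams
library(RColorBrewer)
wordcloud(names(freq1), freq1, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))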

Summary

This exploratory analysis performs the following steps:

- Summarizes the line, word, sentence, and character counts of the blogs, Twitter, and news text files.
- Draws a random sample of roughly 5000 lines from each file to keep further analysis tractable.
- Builds a clean corpus by removing profanity, extra whitespace, punctuation, and numbers, and converting the text to lower case.
- Tokenizes the corpus into 1-grams, 2-grams, and 3-grams and examines their frequency distributions.

These n-gram frequencies will serve as the foundation for the next-word prediction model.