This is the Week 2 Milestone assignment for the Capstone project, part of the Data Science Specialization on Coursera. Presented here is an exploratory analysis of the Capstone data set. The data includes three English-language text files sourced from:
Blogs
Twitter
News
The objective is to understand the distribution of words, tokens and phrases in the texts and the relationships between them. Ultimately, this exploratory analysis will serve as a foundation for building the language models used for next-word prediction.
Let’s explore each data set to understand the basic features of the text file.
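The code in this section assumes that `blogs`, `twitter` and `news` are character vectors holding one line of text per element. The loading step is not shown in this report, so the following is only a minimal sketch of how it might look; the `read_text()` helper and the file path in the final comment are hypothetical.

```r
# Hypothetical helper: read a text file into a character vector, one line per element.
read_text <- function(path) {
  con <- file(path, open = "rb")   # binary mode, so stray control characters do not end the read early
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Demonstration on a small temporary file (a stand-in for the real data files).
tmp <- tempfile(fileext = ".txt")
writeLines(c("First line of text.", "Second line."), tmp)
demo_lines <- read_text(tmp)
length(demo_lines)

# Assumed usage with the real data, for example:
# blogs <- read_text("path/to/blogs.txt")
```

Reading through a connection opened in binary mode is a defensive choice for large, messy text files; a plain `readLines(path)` may be enough when the files are clean.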
library(stringr)
library(stringi)
library(tokenizers)
library(wordcloud)
library(knitr)
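# Collect the three corpora (each a character vector with one line of text per element)
all_data <- list(blogs, twitter, news)
n_lines <- sapply(all_data, length)
# Note: type = "word" counts every segment between word boundaries, including
# spaces and punctuation, so n_words is larger than the number of actual words
n_words <- sapply(all_data, function(x) sum(stri_count_boundaries(x, type="word")))
n_sentence <- sapply(all_data, function(x) sum(stri_count_boundaries(x, type="sentence")))
n_characters <- sapply(all_data, function(x) sum(stri_count_boundaries(x, type="character")))
# Assemble the per-source totals into one summary table
summary_data <- data.frame(Data_Source = c("Blogs", "Twitter", "News"), n_lines, n_words, n_sentence, n_characters)
kable(summary_data)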
| Data_Source | n_lines | n_words | n_sentence | n_characters |
|---|---|---|---|---|
| Blogs | 899288 | 79345275 | 2381035 | 206043906 |
| Twitter | 2360148 | 65141209 | 3770622 | 161961555 |
| News | 1010242 | 74103402 | 2024367 | 202917604 |
The above table shows the number of lines, words, sentences and characters for each of the three texts.
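One caveat about the word counts: as the stringi documentation describes it, `stri_count_boundaries(x, type = "word")` counts every segment between word boundaries, so spaces and punctuation marks are counted too, whereas `stri_count_words(x)` skips those segments. The `n_words` column is therefore likely an overestimate of the true word count. A small sketch of the difference on a toy sentence (the variable names are illustrative only):

```r
library(stringi)

x <- "Hello world."
# Counts every segment between word boundaries: "Hello", " ", "world", "."
n_segments <- stri_count_boundaries(x, type = "word")
# Counts only the segments that are actual words: "Hello", "world"
n_actual_words <- stri_count_words(x)
c(segments = n_segments, words = n_actual_words)
```

If true word counts are wanted, replacing the `n_words` line with a call to `stri_count_words` would be one option, at the cost of numbers that differ from the table shown here.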
The histograms below show the number of words per line for each data set.
The maximum number of words per line was 14200 for blogs, 93 for Twitter and 4102 for news.
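The code that produced the histograms is not included in this report. A minimal sketch of how words per line could be computed and plotted in base R is shown here on toy data; the `words_per_line()` helper is hypothetical and splits on whitespace, so its counts will not match the stringi-based figures exactly.

```r
# Hypothetical helper: approximate words per line by splitting on whitespace.
words_per_line <- function(lines) {
  lengths(strsplit(trimws(lines), "\\s+"))
}

toy_lines <- c("a short line", "a somewhat longer line of text", "one")
wpl <- words_per_line(toy_lines)

# One histogram per data set would be drawn the same way, e.g. hist(words_per_line(blogs))
hist(wpl, main = "Words per line (toy data)", xlab = "Words per line")
max(wpl)
```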
Clearly, these are very large data sets, so a representative sample is drawn from each text file before further analysis. Lines are sampled at random with the rbinom function: each line is kept independently with a fixed probability (0.005 for blogs and news, 0.002 for Twitter).
# Sampling ~5000 lines from each of the data sets
# rbinom() draws a 0/1 flag per line; lines flagged 1 are kept
set.seed(123)
bset <- rbinom(length(blogs), 1, 0.005)
sampleb <- blogs[bset == 1]
set.seed(987)
tset <- rbinom(length(twitter), 1, 0.002)
samplet <- twitter[tset == 1]
set.seed(106)
nset <- rbinom(length(news), 1, 0.005)
samplen <- news[nset == 1]
# Combining the sub-samples from blogs, twitter and news
sample <- c(sampleb, samplet, samplen)
Further exploratory analyses will be performed with the sample data set that combines approximately 5000 lines from each of the three text sources.
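To see why these probabilities give roughly 5000 lines per source: with one independent 0/1 draw per line, the expected sample size is the number of lines times the keep probability. The sketch below uses the line counts from the summary table and the probabilities from the sampling code; the variable names and the toy vector are illustrative only.

```r
# Line counts from the summary table and keep probabilities from the sampling code
n_lines_total <- c(blogs = 899288, twitter = 2360148, news = 1010242)
keep_prob     <- c(blogs = 0.005,  twitter = 0.002,   news = 0.005)

# Expected number of sampled lines is n * p when each line is kept with probability p
expected_lines <- n_lines_total * keep_prob
round(expected_lines)

# Toy illustration of the same sampling idea on a small vector
set.seed(1)
toy <- paste("line", 1:1000)
keep <- rbinom(length(toy), 1, 0.1) == 1
toy_sample <- toy[keep]
length(toy_sample)   # varies with the seed; about 100 on average
```

The actual sample sizes fluctuate around these expectations. If an exact size were required, drawing indices with base R's `sample()` function would be an alternative.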
These data sets may contain offensive or profane words. A list of 450 bad words can be found on GitHub. Let’s remove any of these bad words from the sample data set.
Let’s also remove extra whitespace, punctuation and numbers, and convert the text to lower case, for ease of analysis.
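library(tm)
# badWords: character vector holding the bad-word list (loaded beforehand)
# Remove bad words from the raw sample and compare word counts before and after
cleanSample <- removeWords(sample, badWords)
n_sampleWords <- sum(stri_count_words(sample))
n_cleanSampleWords <- sum(stri_count_words(cleanSample))
# Build a corpus from the sample, one document per line
sampleCorpora <- VCorpus(VectorSource(sample))
# tm_reduce folds the list from right to left: tolower runs first, stripWhitespace last
funs <- list(stripWhitespace, removePunctuation, removeNumbers, content_transformer(tolower))
sampleCorpora <- tm_map(sampleCorpora, FUN = tm_reduce, tmFuns = funs)
# Remove bad words from the lower-cased, cleaned corpus
sampleCorpora <- tm_map(sampleCorpora, removeWords, badWords)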
Comparing the word counts before and after removal (n_sampleWords and n_cleanSampleWords) suggests that only about 0.11% (0.1144369%) of the words in the sample data were “bad words”.
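The calculation behind this percentage is not shown in the report. Presumably it is the relative drop in word count after removal, that is, 100 * (n_sampleWords - n_cleanSampleWords) / n_sampleWords. The sketch below illustrates that assumed formula on toy data in base R; `count_words()`, `drop_words()` and the two-word stand-in list are hypothetical and only approximate what `stri_count_words` and `removeWords` do.

```r
# Hypothetical helpers, for illustration only
count_words <- function(x) sum(lengths(strsplit(trimws(x), "\\s+")))
drop_words  <- function(x, words) {
  # Match any listed word as a whole word and delete it
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

toy_text <- c("this darn thing", "a heck of a day")
toy_bad  <- c("darn", "heck")   # stand-in for the real bad-word list

n_before <- count_words(toy_text)
n_after  <- count_words(drop_words(toy_text, toy_bad))

# Share of words removed, as a percentage of the original word count
pct_removed <- 100 * (n_before - n_after) / n_before
pct_removed
```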
Let’s create 1-gram, 2-gram and 3-gram tokens of the data set. The bar graphs below show the 20 most common 1-gram, 2-gram and 3-gram word sets in the sample corpora.
library(RWeka)
unigram <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
tdm1 <- removeSparseTerms(TermDocumentMatrix(sampleCorpora, control = list(tokenize = unigram)),0.999)
tdm2 <- removeSparseTerms(TermDocumentMatrix(sampleCorpora, control = list(tokenize = bigram)),0.999)
tdm3 <- removeSparseTerms(TermDocumentMatrix(sampleCorpora, control = list(tokenize = trigram)),0.999)
head(findFreqTerms(tdm1, lowfreq=200)) # 1-grams occurring at least 200 times (first six, in alphabetical order)
## [1] "about" "after" "again" "all" "also" "always"
head(twograms <- findFreqTerms(tdm2, lowfreq=100)) # 2-grams occurring at least 100 times (first six, in alphabetical order)
## [1] "a few" "a good" "a great" "a little" "a lot" "a new"
head(threegrams <- findFreqTerms(tdm3, lowfreq=50)) # 3-grams occurring at least 50 times (first six, in alphabetical order)
## [1] "a couple of" "a lot of" "as well as" "be able to" "i dont know"
## [6] "i want to"
freq1 <- sort(rowSums(as.matrix(tdm1)), decreasing = TRUE)
freq2 <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
freq3 <- sort(rowSums(as.matrix(tdm3)), decreasing = TRUE)
barplot(freq1[1:20], las = 2, ylab = "Single Word Frequency")
barplot(freq2[1:20], las = 2, ylab = "Couple Word Frequency", cex.names = 0.9)
barplot(freq3[1:20], las = 2, ylab = "Triple Word Frequency", cex.names = 0.57)
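For readers unfamiliar with n-grams, the idea behind the tokenizers above can be sketched in a few lines of base R: slide a window of n consecutive words across each text and count how often each window occurs. The `make_ngrams()` helper below is hypothetical and is not how RWeka implements `NGramTokenizer`; it only illustrates the concept on toy data.

```r
# Hypothetical helper: all n-grams of a single string, split on whitespace.
make_ngrams <- function(text, n) {
  tokens <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  if (length(tokens) < n) return(character(0))   # too short to form an n-gram
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "), character(1))
}

toy_docs <- c("a lot of fun", "a lot of work")
bigrams <- unlist(lapply(toy_docs, make_ngrams, n = 2))

# Frequency of each bigram, most common first
sort(table(bigrams), decreasing = TRUE)
```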
This exploratory analysis performs the following