Introduction

This is the milestone report for the final course of the Coursera Data Science Specialization. In this project I apply data science techniques in the area of natural language processing. The analysis uses the provided data, which was collected from publicly available sources by a web crawler. This report summarizes the exploratory data analysis and the data preprocessing.

library(stringr)  # string manipulation helpers
library(tm)       # text mining: corpus and term-document matrix
library(ggplot2)  # plotting
set.seed(123)     # make the random sampling below reproducible

Loading the datasets

For this project, text documents have been provided in four languages: English, German, Finnish and Russian. For each language there are three documents: blogs, news and twitter. This project uses the English documents.

# Read the raw text files, one element per line.
# Note: en_US.news.txt contains embedded nul characters; on some platforms
# readLines() stops early unless skipNul = TRUE is supplied, which is why
# the news counts below are lower than expected for the full file.
blogs <- readLines("~/final/en_US/en_US.blogs.txt", encoding = "UTF-8")
news <- readLines("~/final/en_US/en_US.news.txt", encoding = "UTF-8")
twitter <- readLines("~/final/en_US/en_US.twitter.txt", encoding = "UTF-8")

Data summary

The following table provides basic information about the three documents (blogs, news and twitter) including size, word count and line count.

# Word count, line count and file size (in MB) for each document
blogs_words <- sum(stringi::stri_count_words(blogs))
blogs_length <- length(blogs)
blogs_size <- file.info("~/final/en_US/en_US.blogs.txt")$size/1024^2
news_words <- sum(stringi::stri_count_words(news))
news_length <- length(news)
news_size <- file.info("~/final/en_US/en_US.news.txt")$size/1024^2
twitter_words <- sum(stringi::stri_count_words(twitter))
twitter_length <- length(twitter)
twitter_size <- file.info("~/final/en_US/en_US.twitter.txt")$size/1024^2

data_table <- data.frame(filename = c("blogs", "news", "twitter"),
                         file_size_MB = c(blogs_size, news_size, twitter_size),
                         word_count = c(blogs_words, news_words, twitter_words),
                         line_count = c(blogs_length, news_length, twitter_length))
data_table
##   filename file_size_MB word_count line_count
## 1    blogs     200.4242   37546239     899288
## 2     news     196.2775    2674536      77259
## 3  twitter     159.3641   30093372    2360148

Data preprocessing

Because the datasets are quite large, we work with a small random sample of each of them for preprocessing.

We take 1% of each dataset as a sample:

# Sample 1% of the lines from each source, without replacement so that
# no line appears twice in the sample
blogs_sample <- sample(blogs, size = blogs_length * 0.01, replace = FALSE)
news_sample <- sample(news, size = news_length * 0.01, replace = FALSE)
twitter_sample <- sample(twitter, size = twitter_length * 0.01, replace = FALSE)

We now combine these three samples into one sample.

text_sample <- c(blogs_sample, news_sample, twitter_sample)
length(text_sample)
## [1] 33365
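
The combined sample keeps the three sources in proportion to their original line counts. A quick, illustrative check of the share contributed by each source (not part of the original output):

# Fraction of the combined sample coming from each source
round(c(blogs = length(blogs_sample),
        news = length(news_sample),
        twitter = length(twitter_sample)) / length(text_sample), 3)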

The combined sample consists of over 33,000 lines. The histogram below shows the distribution of the number of words per line: most lines contain roughly 25 words or fewer, with a long tail of much longer lines.

words <- stringi::stri_count_words(text_sample)
hist(words, breaks = 250, main = "Words per line in the sample", xlab = "Words per line")
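
To quantify the distribution, a five-number summary of the words-per-line counts can be printed (a small sanity check, not shown in the original report):

# Summary statistics of words per line in the combined sample
summary(words)
# Share of lines containing more than 25 words
mean(words > 25)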

Now we move on to cleaning the data so that it is ready for analysis. This includes removing punctuation, digits and special characters.

# Build a regular expression that matches any of the raw bytes 128-255,
# i.e. all non-ASCII bytes
special_char <- 128:255
special_char <- as.raw(special_char)
special_char <- sapply(special_char, rawToChar)
special_char <- paste(special_char, sep = "", collapse = "|")

# Remove punctuation, digits and non-ASCII bytes from the sample.
# useBytes = TRUE makes gsub() match raw bytes; without it the byte-level
# pattern is not a valid regular expression in a UTF-8 locale.
sample_clean <- gsub("[[:punct:]]", "", text_sample)
sample_clean <- gsub("[[:digit:]]", "", sample_clean)
sample_clean <- gsub(special_char, "", sample_clean, useBytes = TRUE)
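
As a quick spot check (an extra verification step, not in the original report), we can confirm that no digits survived the cleaning:

# Any digits left anywhere in the cleaned sample?
any(grepl("[[:digit:]]", sample_clean))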

Now we put the cleaned sample into a corpus object using the “tm” package.

# Build a corpus with one document per line of the cleaned sample
data <- Corpus(VectorSource(sample_clean))
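
The first few documents of the corpus can be inspected to verify its contents (an illustrative check, not in the original output):

# Each corpus element is one cleaned line of the sample
inspect(data[1:2])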

Word frequency

Next we preprocess the data into a term-document matrix, in which the frequency of each word in each line is stored. The following shows the most frequent words within the cleaned sample.

tdm <- TermDocumentMatrix(data)
# Drop terms with sparsity above 0.95, i.e. keep only terms that appear
# in at least roughly 5% of the lines
tdm1 <- removeSparseTerms(tdm, 0.95)
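
Dropping sparse terms shrinks the matrix considerably; comparing the dimensions before and after shows how many terms survive (an illustrative check, not in the original output):

# Rows are terms, columns are documents (lines in the sample)
dim(tdm)   # before removing sparse terms
dim(tdm1)  # after removing sparse terms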

# Helper: total frequency of each term across all documents, returned
# as a data frame sorted in decreasing order of frequency
function_freq <- function(tdm){
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  freq_frame <- data.frame(word = names(freq), freq = freq)
  return(freq_frame)
}

tdm_freq <- function_freq(tdm1)
tdm_top_20 <- tdm_freq[1:20,]
print(tdm_top_20, row.names = FALSE)
##  word  freq
##   the 29383
##   and 15976
##   you  8486
##   for  7852
##  that  7299
##  with  4803
##  this  4220
##   was  4135
##  have  3980
##   but  3467
##   are  3463
##   not  3139
##  your  2732
##   all  2641
##  just  2523
##  from  2500
##   its  2438
##   out  2292
##  will  2241
##  what  2226

This can also be shown graphically:

ggplot(tdm_top_20, aes(x = reorder(word, freq), y = freq, fill = freq)) +
  geom_col() +
  coord_flip() +
  labs(y = "Frequency", title = "Top 20 most frequent words in sample", x = "")