This is the milestone report for the final course of the Coursera Data Science Specialization. In this project I apply data science techniques in the area of natural language processing. The analysis uses the provided data, which was collected from publicly available sources by a web crawler. This report summarizes the exploratory data analysis and data preprocessing.
library(stringr)
library(tm)
library(ggplot2)
set.seed(123)
For this project, text documents have been provided in four languages: English, German, Finnish and Russian. For each language there are three documents: blogs, news and twitter. The documents used for this project are the English ones.
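A quick way to check the directory layout before reading anything in (the location ~/final is an assumption here, matching the file paths used below):
# List the locale folders and the English files (directory layout assumed)
list.files("~/final")
list.files("~/final/en_US")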
blogs <- readLines("~/final/en_US/en_US.blogs.txt", encoding = "UTF-8")
news <- readLines("~/final/en_US/en_US.news.txt", encoding = "UTF-8")
twitter <- readLines("~/final/en_US/en_US.twitter.txt", encoding = "UTF-8")
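On some systems readLines() warns about embedded nul characters in the news file and stops before the end; if that happens, a more defensive variant (skipNul is a standard argument of readLines()) can be used:
# Assumption: the news file contains embedded nuls; skipNul = TRUE reads past them
news <- readLines("~/final/en_US/en_US.news.txt", encoding = "UTF-8",
                  skipNul = TRUE, warn = FALSE)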
The following table provides basic information about the three documents (blogs, news and twitter), including file size, word count and line count.
blogs_words <- sum(stringi::stri_count_words(blogs))
blogs_length <- length(blogs)
blogs_size <- file.info("~/final/en_US/en_US.blogs.txt")$size/1024^2
news_words <- sum(stringi::stri_count_words(news))
news_length <- length(news)
news_size <- file.info("~/final/en_US/en_US.news.txt")$size/1024^2
twitter_words <- sum(stringi::stri_count_words(twitter))
twitter_length <- length(twitter)
twitter_size <- file.info("~/final/en_US/en_US.twitter.txt")$size/1024^2
data_table <- data.frame(filename = c("blogs", "news", "twitter"),
                         file_size_MB = c(blogs_size, news_size, twitter_size),
                         word_count = c(blogs_words, news_words, twitter_words),
                         line_count = c(blogs_length, news_length, twitter_length))
data_table
## filename file_size_MB word_count line_count
## 1 blogs 200.4242 37546239 899288
## 2 news NA 2674536 77259
## 3 twitter 159.3641 30093372 2360148
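The per-file statistics above are computed one file at a time; the same table could be built more compactly with a small helper function (a sketch only; summarize_file is a hypothetical name and the file paths are assumed as above):
# Hypothetical helper: size, word count and line count for one file
summarize_file <- function(path, lines) {
  data.frame(file_size_MB = file.info(path)$size / 1024^2,
             word_count   = sum(stringi::stri_count_words(lines)),
             line_count   = length(lines))
}
paths <- c(blogs   = "~/final/en_US/en_US.blogs.txt",
           news    = "~/final/en_US/en_US.news.txt",
           twitter = "~/final/en_US/en_US.twitter.txt")
texts <- list(blogs, news, twitter)
do.call(rbind, Map(summarize_file, paths, texts))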
Because the datasets are quite large, it is necessary to take a small random sample from each of them for preprocessing. We take 1% of each dataset as the sample.
blogs_sample <- sample(blogs, size = (blogs_length*0.01), replace = TRUE)
news_sample <- sample(news, size = (news_length*0.01), replace = TRUE)
twitter_sample <- sample(twitter, size = (twitter_length*0.01), replace = TRUE)
We now combine these three samples into one sample.
sample <- c(blogs_sample, news_sample, twitter_sample)
length(sample)
## [1] 33365
The combined sample consists of over 33,000 lines. The histogram below shows the distribution of the number of words per line; the most common line lengths are around 25 words or fewer.
words <- stringi::stri_count_words(sample)
hist(words, breaks = 250, main = "Words per line in the sample", xlab = "Number of words")
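A numeric summary of the per-line word counts supports this (exact values depend on the random sample that was drawn):
# Summary statistics of words per line in the combined sample
summary(words)
quantile(words, probs = c(0.50, 0.90, 0.99))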
Now we move on to cleaning the data so that it is ready for analysis. This includes removing punctuation, special characters and digits.
# Build a regular expression that matches the extended (non-ASCII) byte characters 128-255
special_char <- 128:255
special_char <- as.raw(special_char)
special_char <- sapply(special_char, rawToChar)
special_char <- paste(special_char, sep = "", collapse = "|")
# Remove punctuation, digits and the special characters from the sample
sample_clean <- gsub("[[:punct:]]", "", sample)
sample_clean <- gsub("[[:digit:]]", "", sample_clean)
sample_clean <- gsub(special_char, "", sample_clean)
Now we put the clean sample into a corpus object using the “tm” package.
data <- Corpus(VectorSource(sample_clean))
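As an alternative to the gsub() calls above, the same kind of cleaning could be applied directly to the corpus with tm's built-in transformations; the sketch below is not what was run for this report, and data_tm is a hypothetical object:
# Sketch: equivalent cleaning with tm transformations (not used further in this report)
data_tm <- Corpus(VectorSource(sample))
data_tm <- tm_map(data_tm, content_transformer(tolower))
data_tm <- tm_map(data_tm, removePunctuation)
data_tm <- tm_map(data_tm, removeNumbers)
data_tm <- tm_map(data_tm, stripWhitespace)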
Next we convert the data into a term-document matrix, in which the frequency of each word is stored. The following shows the most frequent words within the cleaned sample.
tdm <- TermDocumentMatrix(data)
# Drop terms that do not appear in at least 5% of the sampled lines
tdm1 <- removeSparseTerms(tdm, 0.95)
# Helper: word frequencies sorted in decreasing order
function_freq <- function(tdm){
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  freq_frame <- data.frame(word = names(freq), freq = freq)
  return(freq_frame)
}
tmd2 <- function_freq(tdm1)
tmd_top_20 <- tmd2[1:20,]
print(tmd_top_20, row.names = FALSE)
## word freq
## the 29383
## and 15976
## you 8486
## for 7852
## that 7299
## with 4803
## this 4220
## was 4135
## have 3980
## but 3467
## are 3463
## not 3139
## your 2732
## all 2641
## just 2523
## from 2500
## its 2438
## out 2292
## will 2241
## what 2226
This can also be shown graphically:
ggplot(tmd_top_20, aes(x=reorder(word,freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
coord_flip() +
labs(y="Frequency", title="Top 20 most frequent words in sample", x="")