Introduction

This is the Peer-graded Assignment (Milestone Report) for Week 2 of the Data Science Specialization from Johns Hopkins University (Coursera).

The report is graded against the following review criteria:

  1. Does the link lead to an HTML page describing the exploratory analysis of the training data set?
  2. Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
  3. Has the data scientist made basic plots, such as histograms to illustrate features of the data?
  4. Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?

Preparing the datasets and calculating basic summaries

In this assignment, I will only use the English datasets.
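The three English files come from the Coursera-SwiftKey archive provided by the course. A minimal sketch of downloading and extracting them, assuming the standard course download URL and the usual layout of the archive:

# Sketch: download and extract the English files
# (URL and paths inside the zip are assumed from the course materials; adjust if they have changed)
zip_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("en_US.twitter.txt")) {
    download.file(zip_url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
    unzip("Coursera-SwiftKey.zip",
          files = c("final/en_US/en_US.twitter.txt",
                    "final/en_US/en_US.blogs.txt",
                    "final/en_US/en_US.news.txt"),
          junkpaths = TRUE)  # drop the folder structure so the files land in the working directory
}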

Loading libraries and opening the files:

library(knitr)
library(stringi)

# Read each dataset line by line; passing the file path directly lets
# readLines open and close the connection on its own
twitter <- readLines("en_US.twitter.txt")
blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
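One of the review criteria asks for basic summaries of the three files, so their sizes on disk are worth reporting as well. A minimal sketch, assuming the files sit in the working directory:

# File sizes in megabytes (files assumed to be in the working directory)
sizes_mb <- round(file.size(c("en_US.twitter.txt",
                              "en_US.blogs.txt",
                              "en_US.news.txt")) / 1024^2, 1)
names(sizes_mb) <- c("Twitter", "Blogs", "News")
sizes_mb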

The basic summaries are word counts, line counts, and basic data tables.

Because of the size of the files, it is wise to store intermediate results in objects so that each calculation is done only once.

# Per-line word counts for each dataset
twitter_list <- stri_count_words(twitter)
blogs_list <- stri_count_words(blogs)
news_list <- stri_count_words(news)

# Total words and total lines per dataset
twitter_words <- sum(twitter_list)
twitter_lines <- length(twitter)

blogs_words <- sum(blogs_list)
blogs_lines <- length(blogs)

news_words <- sum(news_list)
news_lines <- length(news)

Word counts:

cat("The number of WORDS for each dataset is:", "\n", 
    "Twitter: ", twitter_words, "\n",
    "Blogs: ", blogs_words, "\n",
    "News: ", news_words, sep = "")
## The number of WORDS for each dataset is:
## Twitter: 30218125
## Blogs: 38154238
## News: 2693898

Line counts:

cat("The number of LINES for each dataset is:", "\n", 
    "Twitter: ", twitter_lines, "\n",
    "Blogs: ", blogs_lines, "\n",
    "News: ", news_lines, sep = "")
## The number of LINES for each dataset is:
## Twitter: 2360148
## Blogs: 899288
## News: 77259

Basic data tables:

cat("Dataset", "\t", "Words", "\t", "\t", "Lines",  "\t", "Average words per line", "\n",
    "Twitter", "\t", twitter_words, "\t", twitter_lines, "\t", twitter_words/twitter_lines, "\n",
    "Blogs", "\t", blogs_words, "\t", blogs_lines, "\t", blogs_words/blogs_lines, "\n",
    "News", "\t", "\t", news_words, "\t", "\t", news_lines, "\t", news_words/news_lines, "\n", sep = "")
## Dataset  Words       Lines   Average words per line
## Twitter  30218125    2360148 12.80349
## Blogs    38154238    899288  42.42716
## News     2693898     77259   34.8684
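Since knitr is already loaded, the same summary could also be rendered as a proper table; a minimal sketch using knitr::kable:

# Collect the counts in a data frame and render it as a table
summary_df <- data.frame(
    Dataset = c("Twitter", "Blogs", "News"),
    Words = c(twitter_words, blogs_words, news_words),
    Lines = c(twitter_lines, blogs_lines, news_lines)
)
summary_df$`Average words per line` <- round(summary_df$Words / summary_df$Lines, 2)
kable(summary_df)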

To better visualize the distribution of words per line, plot a histogram for each dataset:

# Stack the three histograms vertically in one figure
par(mfrow = c(3, 1))
hist(twitter_list, xlab = "Distribution (words per line)", main = "TWITTER", 
     xlim = c(0, 40), breaks = 50)
hist(blogs_list, xlab = "Distribution (words per line)", main = "BLOGS", 
     xlim = c(0, 200), breaks = 500)
hist(news_list, xlab = "Distribution (words per line)", main = "NEWS", 
     xlim = c(0, 200), breaks = 500)

Pre-processing the training datasets

Because of the size of the files, I will sample 1% of the lines from each dataset and combine the samples into a single training set.

# Reproducible 1% sample of each dataset
set.seed(123)
twitter_s <- sample(twitter, length(twitter) * 0.01)
blogs_s <- sample(blogs, length(blogs) * 0.01)
news_s <- sample(news, length(news) * 0.01)

# Combine the samples and remove the intermediate objects
tbn <- c(twitter_s, blogs_s, news_s)
rm(twitter_s, blogs_s, news_s)

Parse the text for further summary:

# Split each line on spaces and flatten into a single vector of tokens
tbn_corpus <- unlist(strsplit(tbn, " "))

Remove non-alphanumeric symbols (punctuation, special characters, etc.) and split again on runs of spaces:

# Strip the unwanted characters, then split on one or more spaces and flatten
tbn_corpus <- unlist(strsplit(gsub("[^[:alnum:] ]", "", tbn_corpus), " +"))

Many further pre-processing techniques could be applied, for example lowercasing, removing stop words, or profanity filtering. For the purpose of this assignment, I will stop here.
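For illustration, one such step would be lowercasing the tokens and dropping strings that became empty after cleaning; a minimal sketch (not applied to the analysis that follows):

# Sketch of one additional cleaning step (not used for the results reported below)
tbn_clean <- tolower(tbn_corpus)              # fold everything to lower case
tbn_clean <- tbn_clean[nchar(tbn_clean) > 0]  # drop empty tokens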

Exploring the data

There are many ways to explore and summarize the data.

Here, I will report the top 10 most common words (unigrams) with > 6 letters.

First, identify the desired words and their frequencies.

# Count frequencies of words longer than 6 letters and keep the 10 most common
top10 <- as.data.frame(table(tbn_corpus[nchar(tbn_corpus) > 6]))
top10 <- top10[order(top10$Freq, decreasing = TRUE), ]
top10 <- head(top10, 10)
names(top10) <- c("word", "freq")

Next, create a bar plot to better visualize them.

barplot(top10$freq, names.arg = top10$word, las = 2)  # las = 2 rotates the labels

As we can see, in a 1% sample of the combined datasets, the most frequent word with more than 6 letters was “because”.

Other frequent, meaningful words include “friends”, “tonight”, “something”, and “someone”.

Conclusion

This report is only a starting point for the further development of a Shiny app.

Thank you for your time. Have a good day.