This is the Peer-graded Assignment (Milestone Report) for “Week 2” of the Data Science Specialization from Johns Hopkins University (Coursera).
The goals of this report are to show that the datasets have been loaded, to report basic summary statistics, and to present some simple exploratory findings.
In this assignment, I will only use the English (en_US) datasets.
Loading libraries and opening the files:
library(knitr)
library(stringi)

# Read each file line by line; skipNul = TRUE avoids warnings about
# embedded nul characters (present in the Twitter file)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
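It can also be useful to check how large the raw files are before working with them; a minimal sketch using base R’s file.size(), assuming the three en_US files sit in the working directory:

files <- c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt")
round(file.size(files) / 1024^2, 1)  # file sizes in megabytes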
The basic summaries are word counts, line counts, and a basic data table.
Because of the size of the files, it is wise to assign the results of these calculations to objects so they are computed only once.
# Words per line for each dataset
twitter_list <- stri_count_words(twitter)
blogs_list <- stri_count_words(blogs)
news_list <- stri_count_words(news)

# Total words and total lines per dataset
twitter_words <- sum(twitter_list)
twitter_lines <- length(twitter)
blogs_words <- sum(blogs_list)
blogs_lines <- length(blogs)
news_words <- sum(news_list)
news_lines <- length(news)
Word counts:
cat("The number of WORDS for each dataset is:", "\n",
"Twitter: ", twitter_words, "\n",
"Blogs: ", blogs_words, "\n",
"News: ", news_words, sep = "")
## The number of WORDS for each dataset is:
## Twitter: 30218125
## Blogs: 38154238
## News: 2693898
Line counts:
cat("The number of LINES for each dataset is:", "\n",
"Twitter: ", twitter_lines, "\n",
"Blogs: ", blogs_lines, "\n",
"News: ", news_lines, sep = "")
## The number of LINES for each dataset is:
## Twitter: 2360148
## Blogs: 899288
## News: 77259
Basic data table:
cat("Dataset", "\t", "Words", "\t", "\t", "Lines", "\t", "Average words per line", "\n",
"Twitter", "\t", twitter_words, "\t", twitter_lines, "\t", twitter_words/twitter_lines, "\n",
"Blogs", "\t", blogs_words, "\t", blogs_lines, "\t", blogs_words/blogs_lines, "\n",
"News", "\t", "\t", news_words, "\t", "\t", news_lines, "\t", news_words/news_lines, "\n", sep = "")
## Dataset Words Lines Average words per line
## Twitter 30218125 2360148 12.80349
## Blogs 38154238 899288 42.42716
## News 2693898 77259 34.8684
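Since knitr is already loaded, the same summary can also be rendered as a proper table with kable(); a minimal sketch, reusing the counts computed above:

summary_df <- data.frame(
  Dataset = c("Twitter", "Blogs", "News"),
  Words = c(twitter_words, blogs_words, news_words),
  Lines = c(twitter_lines, blogs_lines, news_lines)
)
summary_df$Words_per_line <- round(summary_df$Words / summary_df$Lines, 2)
kable(summary_df)  # renders a formatted table in the knitted report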
To better visualize the distribution of words per line:
par(mfrow = c(3, 1))  # stack the three histograms vertically
hist(twitter_list, xlab = "Distribution (words per line)", main = "TWITTER",
     xlim = c(0, 40), breaks = 50)
hist(blogs_list, xlab = "Distribution (words per line)", main = "BLOGS",
     xlim = c(0, 200), breaks = 500)
hist(news_list, xlab = "Distribution (words per line)", main = "NEWS",
     xlim = c(0, 200), breaks = 500)
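After plotting, the layout can be reset so that later plots are not drawn into the three-panel grid:

par(mfrow = c(1, 1))  # restore the default single-panel layout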
Because of the size of the files, I will sample 1% of each dataset and then combine the three samples.
set.seed(123)  # make the sampling reproducible
twitter_s <- sample(twitter, floor(length(twitter) * 0.01))
blogs_s <- sample(blogs, floor(length(blogs) * 0.01))
news_s <- sample(news, floor(length(news) * 0.01))
tbn <- c(twitter_s, blogs_s, news_s); rm(twitter_s, blogs_s, news_s)
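A quick sanity check on the combined sample (object.size() reports the approximate in-memory footprint):

length(tbn)  # number of sampled lines
format(object.size(tbn), units = "MB")  # approximate memory used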
Split the text into individual words for further summarization:
# strsplit() already returns a list, so it can be unlisted directly
tbn_corpus <- unlist(strsplit(tbn, " "))
Remove punctuation and other non-alphanumeric symbols:
tbn_corpus <- unlist(strsplit(gsub("[^[:alnum:] ]", "", tbn_corpus), " +"))
There are many techniques that could be used to process the data further, such as lowercasing, stemming, and stop-word removal. For the purpose of this assignment, I will stop here.
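For illustration, one such optional step (not applied here, so the counts below stay as reported) would be lowercasing the tokens and dropping the empty strings left behind by the substitution above; tbn_clean is a hypothetical variant of the corpus:

tbn_clean <- tolower(tbn_corpus)  # fold case so "The" and "the" count as one word
tbn_clean <- tbn_clean[tbn_clean != ""]  # drop tokens that were pure punctuation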
There are many ways to explore and summarize the data.
Here, I will report the top 10 most common words (unigrams) with > 6 letters.
First, identify the desired words and their frequencies.
# Count words longer than 6 characters, then keep the 10 most frequent
top10 <- as.data.frame(table(tbn_corpus[nchar(tbn_corpus) > 6]))
top10 <- top10[order(top10$Freq, decreasing = TRUE), ]
top10 <- head(top10, 10)
names(top10) <- c("word", "freq")
Next, create a bar plot to better visualize them.
barplot(top10$freq, names.arg = top10$word, las = 2)  # las = 2 rotates the labels
As we can see, in a 1% sample of the combined datasets, the most frequent word with > 6 letters was “because”.
People also used meaningful words such as “friends”, “tonight”, “something”, and “someone” in many of their sentences.
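As one example of a further summary that points toward word prediction, the same token vector can be used to count bigrams (adjacent word pairs); a minimal sketch, reusing tbn_corpus from above:

# Pair each token with the one that follows it; note that pairs spanning
# line boundaries are counted too, since tbn_corpus is a flat vector
bigrams <- paste(head(tbn_corpus, -1), tail(tbn_corpus, -1))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq, 10)  # the ten most common word pairs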
This report was simply meant to get things started for the eventual development of a Shiny app.
Thank you for your time. Have a good day.