This milestone report is the week-two project of the Data Science Capstone course in the Data Science Specialization offered by Johns Hopkins University on Coursera.
The overall goal of the capstone is to develop a prediction algorithm for the most likely next word in a sequence of words. This milestone report presents exploratory analyses of the data that will inform the eventual prediction algorithm and app.
To get a sense of what the data looks like, we first compute, for each of the three datasets (blogs, news and Twitter), the in-memory size in megabytes, the number of lines, the number of characters and the number of words, followed by the minimum, mean and maximum number of words per line.
library(knitr)
library(dplyr)
library(doParallel)
library(tm)
library(SnowballC)
library(stringi)
library(ggplot2)
library(wordcloud)
library(kableExtra)
setwd("E:/Project/Data Science Capstone Project/final/en_US")
blogpath <- "./en_US.blogs.txt"
newspath <- "./en_US.news.txt"
twitpath <- "./en_US.twitter.txt"
# Read blogs data in binary mode
conn <- file(blogpath, open = "rb")
blog <- readLines(conn, skipNul = TRUE, encoding = "UTF-8")
close(conn)
# Read news data in binary mode
conn <- file(newspath, open = "rb")
news <- readLines(conn, skipNul = TRUE, encoding = "UTF-8")
close(conn)
# Read twitter data in binary mode
conn <- file(twitpath, open = "rb")
twit <- readLines(conn, skipNul = TRUE, encoding = "UTF-8")
close(conn)
# Remove temporary variable
rm(conn)
# Words per line (min, mean, max) for each dataset
WPL <- sapply(list(blog, news, twit), function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(WPL) <- c('WPL_Min', 'WPL_Mean', 'WPL_Max')
# Summary table; note that "File Size" is the in-memory size reported by object.size()
stats <- data.frame(
  FileName = c("en_US.blogs", "en_US.news", "en_US.twitter"),
  "File Size" = sapply(list(blog, news, twit), function(x) format(object.size(x), "MB")),
  t(rbind(
    sapply(list(blog, news, twit), stri_stats_general)[c('Lines', 'Chars'), ],
    Words = sapply(list(blog, news, twit), stri_stats_latex)['Words', ],
    WPL
  ))
)
View(stats)
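The reading and sizing steps above can be sketched on a small temporary file. This is only an illustration under stated assumptions: `read_lines_binary` is a hypothetical helper name that does not appear in the report, and the toy file stands in for the real corpus. The sketch also shows that `object.size()` reports the in-memory size of the R object, which may differ from the on-disk size returned by `file.size()`.

```r
# Hypothetical helper (an assumption, not from the report): read a text file
# through a binary-mode connection and make sure the connection is closed.
read_lines_binary <- function(path) {
  conn <- file(path, open = "rb")
  on.exit(close(conn))
  readLines(conn, skipNul = TRUE, encoding = "UTF-8")
}

# Toy file standing in for one of the corpus files
tmp <- tempfile(fileext = ".txt")
writeLines(rep("the quick brown fox", 1000), tmp)

toy <- read_lines_binary(tmp)

n_lines   <- length(toy)                   # number of lines read
disk_size <- file.size(tmp)                # bytes on disk
mem_size  <- as.numeric(object.size(toy))  # bytes used by the R object

n_lines
disk_size
mem_size
```

If the on-disk size is what the table should report, `file.size()` on the three paths would be the more direct measure.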
So far, we have built a table of summary statistics for the raw data using functions from the stringi package.
Sample the data and obtain descriptive statistics
Obtain a sample of the data
Now we will obtain the same set of statistics for a sample of the data. First, I will set the seed so that I can reproduce exactly the same samples later. I could sample by number of characters or words, but I will sample by line count, taking one tenth of the lines from each dataset.
set.seed(20170219)
## For whatever reason, the sample function (as below) truncated the news dataset
blog10 <- sample(blog, size = length(blog) / 10, replace = FALSE)
news10 <- sample(news, size = length(news)/10, replace = FALSE)
twit10 <- sample(twit, size = length(twit) / 10, replace = FALSE)
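## I also tried the rbinom subsetting method below, but it did not work for me:
## rbinom(n, size, prob) returns binomial counts, which here cluster around
## length(x)/2 and can repeat, so they do not index a uniform sample of lines.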
#blog10 <- blog[rbinom(length(blog)/10, length(blog), .5)]
#news10 <- news[rbinom(length(news)/10, length(news), .5)]
#twit10 <- twit[rbinom(length(twit)/10, length(twit), .5)]
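The two sampling ideas above can be compared on toy data. The sketch below is a hedged illustration, not the report's code: the `toy` vector and the 10% rate are assumptions chosen for the demo. It shows a fixed-size sample with an explicit integer size, a Bernoulli ("coin flip per line") sample that uses `rbinom()` with `size = 1`, and why the commented-out call concentrates on the middle of the data.

```r
set.seed(20170219)
toy <- paste("line", 1:1000)  # toy stand-in for a corpus

# Fixed-size sample: exactly one tenth of the lines, without replacement
toy10 <- sample(toy, size = floor(length(toy) / 10), replace = FALSE)

# Bernoulli sample: keep each line independently with probability 0.1
keep <- rbinom(length(toy), size = 1, prob = 0.1) == 1
toy_bern <- toy[keep]

# The commented-out form draws indices from Binomial(n, 0.5),
# so they cluster around n / 2 instead of covering the whole range
idx <- rbinom(length(toy) / 10, length(toy), 0.5)

length(toy10)
length(toy_bern)
range(idx)
```

The Bernoulli variant gives a sample whose size varies around 10% of the data, while `sample()` gives an exact size; either is a reasonable choice for exploratory work.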
The next few steps are (almost) the same as before, except this time I will use the samples instead of the full datasets.
First, I will obtain the sample sizes in mebibytes (MiB), the IEC binary unit (1 MiB = 1024^2 bytes). MiB still makes sense here even though I took a 1/10 sample of the datasets. For a much smaller sample, kibibytes (KiB) would be the better unit.
blog10MB <- format(object.size(blog10), standard = "IEC", units = "MiB")
news10MB <- format(object.size(news10), standard = "IEC", units = "MiB")
twit10MB <- format(object.size(twit10), standard = "IEC", units = "MiB")
## Get the number of lines
blog10Lines <- length(blog10)
news10Lines <- length(news10)
twit10Lines <- length(twit10)
## Get the number of words per line using sapply and gregexpr base functions
blog10Words <- sapply(gregexpr("[[:alpha:]]+", blog10), function(x) sum(x > 0))
news10Words <- sapply(gregexpr("[[:alpha:]]+", news10), function(x) sum(x > 0))
twit10Words <- sapply(gregexpr("[[:alpha:]]+", twit10), function(x) sum(x > 0))
## Sum the number of words in each line to get total words
blog10WordsSum <- sum(blog10Words)
news10WordsSum <- sum(news10Words)
twit10WordsSum <- sum(twit10Words)
## Get the character count (per line) for each data set
blog10Char <- nchar(blog10, type = "chars")
news10Char <- nchar(news10, type = "chars")
twit10Char <- nchar(twit10, type = "chars")
## Sum the character counts to get total number of characters
blog10CharSum <- sum(blog10Char)
news10CharSum <- sum(news10Char)
twit10CharSum <- sum(twit10Char)
## Alternative: Use the Unix command wc e.g. system("wc filepath")
## This will give the lines, words and characters.
## For simple things like these, I trust Unix commands > R base functions > R packages :)
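To make the base-R word count above concrete, here is a small sketch on toy strings (the strings are made up for illustration). `gregexpr()` returns the start positions of every run of letters, or -1 when there is no match, so `sum(x > 0)` counts matches and yields 0 for empty or digit-only lines. Note that a contraction such as "don't" is split at the apostrophe and counted as two words under this pattern.

```r
toy <- c("Hello, world!", "", "R is fun", "12345", "don't stop")

# Start positions of each alphabetic run; -1 means no match on that line
matches <- gregexpr("[[:alpha:]]+", toy)

# Count the positive positions, i.e. the number of alphabetic runs per line
words_per_line <- sapply(matches, function(x) sum(x > 0))
words_per_line

# Characters per line, as used for the character totals
chars_per_line <- nchar(toy, type = "chars")
chars_per_line
```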
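This is the second deliverable; it summarizes the information about our samples computed in the previous code chunk. It is important to check that the values in this table are consistent with the previous table: line, word and character counts should be roughly one tenth of the full-data values, and the per-line means should be similar. Small differences in word counts are expected because the two tables count words with different functions. If the values are far off, something may have gone wrong with the sampling.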
df10 <- data.frame(
  File = c("Blogs Sample", "News Sample", "Twitter Sample"),
  fileSize = c(blog10MB, news10MB, twit10MB),
  lineCount = c(blog10Lines, news10Lines, twit10Lines),
  wordCount = c(blog10WordsSum, news10WordsSum, twit10WordsSum),
  charCount = c(blog10CharSum, news10CharSum, twit10CharSum),
  wordMean = c(mean(blog10Words), mean(news10Words), mean(twit10Words)),
  charMean = c(mean(blog10Char), mean(news10Char), mean(twit10Char))
)
View(df10)
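The consistency check between the full data and the sample could look roughly like the sketch below. It runs on synthetic data (the `full` vector is generated here purely for illustration) and uses a hypothetical helper, `count_words`, that wraps the same `gregexpr()` pattern used above. With a 1/10 sample, the line ratio should be 0.1 and the mean words per line should stay close to the full-data mean.

```r
set.seed(1)

# Synthetic "corpus": 5000 lines with 1 to 30 words each (illustrative only)
full <- vapply(
  1:5000,
  function(i) paste(rep("word", sample(1:30, 1)), collapse = " "),
  character(1)
)
samp <- sample(full, size = floor(length(full) / 10), replace = FALSE)

# Hypothetical helper: words per line via runs of alphabetic characters
count_words <- function(x) sapply(gregexpr("[[:alpha:]]+", x), function(m) sum(m > 0))

check <- data.frame(
  dataset  = c("full", "sample"),
  lines    = c(length(full), length(samp)),
  wordMean = c(mean(count_words(full)), mean(count_words(samp)))
)

line_ratio <- check$lines[2] / check$lines[1]  # expected to be 0.1
mean_gap   <- abs(diff(check$wordMean))        # expected to be small

check
line_ratio
mean_gap
```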
For the data cleaning steps, I will first put all three of the dataset samples together. Then I will remove stop words, extra whitespace, punctuation, profanity, one-letter words and symbols. For many of these steps, I will use the tm package.
Put all of the dataset samples together
## Put all of the data samples together
#dat<- c(blog,news,twit)
dat10<- c(blog10,news10,twit10)
dat10NoPunc<- removePunctuation(dat10)
dat10NoWS<- stripWhitespace(dat10NoPunc)
dat10NoStop <- removeWords(dat10NoWS, stopwords("english"))
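The order of the cleaning steps can matter. The sketch below uses base R `gsub()` as a stand-in for the tm functions (it is not the tm implementation), with a tiny made-up stop list, to illustrate two points that may be worth checking in the chunk above: stop-word matching is case-sensitive, so lowercasing first catches words like "The", and removing words leaves gaps, so collapsing whitespace works best as the last step.

```r
toy <- c("The cat, the dog & I don't care!", "It   is   what it is.")
stop_toy <- c("the", "is", "it", "i")  # tiny illustrative stop list, not tm's

# Whole-word pattern for the stop list
pattern <- paste0("\\b(", paste(stop_toy, collapse = "|"), ")\\b")

# Without lowercasing, the capitalised "The" survives
no_lower <- gsub(pattern, "", "The cat", perl = TRUE)

# Lowercase, strip punctuation, drop stop words, then tidy whitespace last
clean <- tolower(toy)
clean <- gsub("[[:punct:]]+", "", clean)
clean <- gsub(pattern, "", clean, perl = TRUE)
clean <- trimws(gsub("\\s+", " ", clean, perl = TRUE))

no_lower
clean
```

Stripping punctuation before stop-word removal also turns "don't" into "dont", which no longer matches a stop list entry spelled with an apostrophe; whether that matters depends on the stop list used.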
At first, I was not sure whether to remove profanity, because I did not like the lists of “bad” words I found on GitHub, e.g. https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en and https://gist.github.com/jamiew/1112488. There are perfectly normal (in my opinion) words mixed into those lists, e.g. anatomical terms. We are all adults here (I assume), and I think the profane words can also be interesting for analysis. In the end, I decided to remove profanity because I was terrified at the thought that the N-word would be one of the top-ranked unigrams by frequency. Removing it did not make a big difference in object size, so it is probably not a big loss of data.
## download profanity word lists
download.file("https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en",
destfile = paste(getwd(),"/profan1.csv", sep=""))
download.file("https://gist.githubusercontent.com/jamiew/1112488/raw/7ca9b1669e1c24b27c66174762cb04e14cf05aa7/google_twunter_lol",
destfile = paste(getwd(),"/profan2.csv", sep=""))
##Read in profanity word lists
profan1<- as.character(read.csv("profan1.csv", header=FALSE))
profan2<- as.character(row.names(read.csv("profan2.csv", header=TRUE, sep = ":")))
## Put the two lists together
profan<-c(profan1, profan2)
## Trim the first and last line of profan
profan<-profan[-1]
profan<-profan[-length(profan)]
## Remove profanity
dat10NoProfan <- removeWords(dat10NoStop, profan)
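## Find out the object size difference between the punctuation-stripped text and the fully cleaned text
## (this difference also includes whitespace and stop-word removal, not only profanity)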
object.size(dat10NoPunc)
## 85298144 bytes
object.size(dat10NoProfan)
## 74721856 bytes
object.size(dat10NoPunc)-object.size(dat10NoProfan)
## 10576288 bytes
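One detail in the list-reading step above may be worth double-checking: `as.character()` applied to a data frame returns one string per column rather than one element per row, so a one-column word list collapses into a single string. The sketch below shows this on a toy file (the three words are placeholders, not the real list) and reads the same file with `readLines()` for comparison. Whether this affects the real lists depends on their exact layout, which is not verified here.

```r
# Toy one-term-per-line file standing in for a downloaded word list
tmp <- tempfile(fileext = ".csv")
writeLines(c("alpha", "beta", "gamma"), tmp)

# as.character() on a data frame deparses each column into one string
as_df <- as.character(read.csv(tmp, header = FALSE))
length(as_df)

# readLines() keeps one element per line
as_lines <- readLines(tmp)
length(as_lines)
as_lines
```

If the first list does collapse this way, dropping the first element of the combined vector would discard that whole list, so inspecting `length(profan1)` before combining seems a cheap safeguard.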