Overview

This Milestone Report for the Coursera Data Science Capstone Project examines basic features of the text data provided from three sources: blogs, news articles, and tweets.

This report shows the following about each of the text files:
- File size in MB
- Number of lines
- Word count
- Data cleaning/preprocessing process
- Bar charts and word clouds of the most frequent unigrams, including stop words (blogs only)
- Bar charts and word clouds of the most frequent unigrams with stop words (the, if, etc.) removed

Load Libraries

This loads the various libraries used in the document.

library(knitr)
library(ngram)
library(kableExtra)
library(corpus)
library(tm)
library(tidyverse)
library(tidytext)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)

Acquiring the Data

The data is downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip and unzipped into a data folder in the current working directory. This step also creates the other directories (tidy and ngram) used later in the project. Each step runs only if the corresponding directory or file does not already exist.

if(!file.exists("data")){dir.create("data")}
if(!file.exists("data/Coursera-SwiftKey.zip")){
      download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                    destfile="data/Coursera-SwiftKey.zip",mode = "wb")
      }
if(!file.exists("data/final")){
      unzip(zipfile="data/Coursera-SwiftKey.zip",exdir="data")
      }
if(!file.exists("data/final/en_US/tidy")){dir.create("data/final/en_US/tidy")}
if(!file.exists("data/final/en_US/ngram")){dir.create("data/final/en_US/ngram")}

Load the Data

The first step is to read the three text files into character vectors.

blogLines <- readLines("data/final/en_US/en_US.blogs.txt",
                          encoding="UTF-8", skipNul = TRUE)
newsLines <- readLines("data/final/en_US/en_US.news.txt",
                          encoding="UTF-8", skipNul = TRUE)
twitterLines <- readLines("data/final/en_US/en_US.twitter.txt",
                          encoding="UTF-8", skipNul = TRUE)

Data File Properties

Next, several basic properties of each data file in the English-language directory (data/final/en_US) are examined.

dataInfo <- data.frame(
      "File.Size"=c(file.info("data/final/en_US/en_US.blogs.txt")$size/(2^20),
                    file.info("data/final/en_US/en_US.news.txt")$size/(2^20),
                    file.info("data/final/en_US/en_US.twitter.txt")$size/(2^20)),
      "Line.Count"=c(length(blogLines), length(newsLines), length(twitterLines)),
      "Word.Count"=c(wordcount(blogLines, sep=" ", count.function = sum),
                     wordcount(newsLines, sep=" ", count.function = sum),
                     wordcount(twitterLines, sep=" ", count.function = sum))
      )
row.names(dataInfo) <- c("Blogs", "News", "Twitter")
kable(dataInfo, align = "c") %>% 
      kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) %>%
      column_spec(1, bold = T, border_right = T) %>%
      footnote(general = "All file sizes are in MB.")
          File.Size   Line.Count   Word.Count
Blogs      200.4242       899288     37334131
News       196.2775      1010242     34372530
Twitter    159.3641      2360148     30373583

Note: All file sizes are in MB.

Cleaning the Data

The raw data has not been cleaned, so the text contains a mix of upper- and lowercase letters, numbers, spelling errors, non-alphanumeric characters, and so on. Tidying the data removes many of these complications. The custom function tidyText, defined below, cleans each set of lines and saves the result to a new text file in the tidy directory.

tidyText <- function(inputVar, outputFile){
   #Input variable containing the raw lines of text
   lines <- inputVar
   #Convert the lines into a corpus object for cleaning
   post <- Corpus(VectorSource(lines))
   #Convert all letters to lowercase
   post <- tm_map(post, content_transformer(tolower))
   #Remove numbers
   post <- tm_map(post, removeNumbers)
   #Create a user-defined cleaning transformation that
   #replaces a supplied pattern with nothing
   removePattern <- content_transformer(function(x, pattern) gsub(pattern, "", x))
   #Remove tokens containing @, #, http://, or https:// (handles, hashtags, URLs)
   post <- tm_map(post, removePattern, "([^[:space:]]*)(@|#|http://|https://)([^[:space:]]*)")
   #Remove any character that isn't alphanumeric, an underscore, a space, or a period
   post <- tm_map(post, removePattern, "[^a-zA-Z0-9_. ]+")
   #Remove any lingering punctuation
   post <- tm_map(post, removePunctuation)
   #Collapse extra whitespace between characters
   post <- tm_map(post, stripWhitespace)
   #Remove whitespace at the beginning and end of each line
   post <- tm_map(post, removePattern, "^\\s+|\\s+$")
   #Convert the corpus object back to a character vector of cleaned text
   post <- sapply(post, identity)
   #Write the cleaned text to a txt file, one line per entry
   write.table(post, file=outputFile, sep="", col.names=FALSE,
               row.names=FALSE, quote=FALSE)
}
if(!file.exists("data/final/en_US/tidy/en_US.blogs.tidy.txt")){
   tidyText(blogLines,"data/final/en_US/tidy/en_US.blogs.tidy.txt")
}
if(!file.exists("data/final/en_US/tidy/en_US.news.tidy.txt")){
   tidyText(newsLines,"data/final/en_US/tidy/en_US.news.tidy.txt")
}
if(!file.exists("data/final/en_US/tidy/en_US.twitter.tidy.txt")){
   tidyText(twitterLines,"data/final/en_US/tidy/en_US.twitter.tidy.txt")
}
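To make the handle, hashtag, and URL removal pattern concrete, here is a small illustration (a made-up string, not drawn from the corpus) of the gsub() call that removePattern applies; the leftover double spaces are later collapsed by stripWhitespace.

#Illustration only: exampleLine is a made-up string, not from the corpus
exampleLine <- "thanks @user for the #rstats tip see https://example.com for more"
gsub("([^[:space:]]*)(@|#|http://|https://)([^[:space:]]*)", "", exampleLine)
#Returns "thanks  for the  tip see  for more"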

Analyzing the Data

The cleaned text files are read back in, and unigrams are computed with and without stop words (the, if, etc.). For each source, the top 10 most frequent words are displayed in a table, and the top 30 are displayed in a bar chart and a word cloud.

blogLines <- readLines("data/final/en_US/tidy/en_US.blogs.tidy.txt", encoding="UTF-8", skipNul = TRUE)
newsLines <- readLines("data/final/en_US/tidy/en_US.news.tidy.txt", encoding="UTF-8", skipNul = TRUE)
twitterLines <- readLines("data/final/en_US/tidy/en_US.twitter.tidy.txt", encoding="UTF-8", skipNul = TRUE)
blogUnigramStop <- term_stats(blogLines)
blogUnigram <- term_stats(blogLines, subset=!term %in% stopwords_en)
newsUnigram <- term_stats(newsLines, subset=!term %in% stopwords_en)
twitterUnigram <- term_stats(twitterLines, subset=!term %in% stopwords_en)
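For reference, term_stats() returns each term with its count (total number of occurrences) and support (the number of lines containing it). The toy example below (made-up text, not part of the report's data) illustrates the two columns.

#Illustration only: a tiny made-up set of lines
toyLines <- c("the cat sat on the mat", "the cat ran", "a dog ran")
#count is total occurrences; support is the number of lines containing the term
term_stats(toyLines)
term_stats(toyLines, subset=!term %in% stopwords_en)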

Blogs

kable(head(blogUnigramStop, 10), align="c", caption="Blog Unigrams with Stop Words") %>% 
      kable_styling(bootstrap_options = c("striped", "hover"), full_width = F, position="float_left") %>%
      column_spec(1, bold = T, border_right = T)
kable(head(blogUnigram, 10), align="c", caption="Blog Unigrams without Stop Words") %>% 
      kable_styling(bootstrap_options = c("striped", "hover"), full_width = F, position="right") %>%
      column_spec(1, bold = T, border_right = T)
Blog Unigrams with Stop Words

term      count    support
the     1855768     551649
and     1086108     471001
to      1065697     456676
a        896942     421329
of       875028     408516
in       593633     336534
i        769493     304213
that     459500     268832
is       431834     258758
for      362866     248739

Blog Unigrams without Stop Words

term     count    support
one     124339     101758
just    100015      85391
like     98257      82064
can      98108      80758
time     88143      74236
get      70768      60858
now      59604      54116
im       66950      53189
know     59932      51097
also     55256      49993

ggplot(data=blogUnigramStop[1:30,], aes(x=reorder(term,-count), y=count)) +
      geom_bar(stat="identity", fill="green3") +
      theme(axis.text.x = element_text(angle = 90)) +
      ggtitle("Top 30 Blog Unigrams with Stop Words") +
      xlab("Words") +
      ylab("Count")

wordcloud(words = blogUnigramStop$term, freq = blogUnigramStop$count, min.freq = 1, max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))

ggplot(data=blogUnigram[1:30,], aes(x=reorder(term,-count), y=count)) +
      geom_bar(stat="identity", fill="green3") +
      theme(axis.text.x = element_text(angle = 90)) +
      ggtitle("Top 30 Blog Unigrams without Stop Words") +
      xlab("Words") +
      ylab("Count")

wordcloud(words = blogUnigram$term, freq = blogUnigram$count, min.freq = 1, max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))

News

As the blog unigrams show, the most frequent words are overwhelmingly stop words, so stop words are removed from further consideration for the news and Twitter sources.

kable(head(newsUnigram, 10), align="c", caption="News Unigrams without Stop Words") %>% 
      kable_styling(bootstrap_options = c("striped", "hover"), full_width = F, position="center") %>%
      column_spec(1, bold = T, border_right = T)
News Unigrams without Stop Words

term     count    support
said    250348     227828
one      83171      75700
new      70305      62513
also     58757      56613
two      57341      52741
can      58673      51956
year     57616      51633
just     53150      49258
last     51522      48788
first    52630      48480

ggplot(data=newsUnigram[1:30,], aes(x=reorder(term,-count), y=count)) +
      geom_bar(stat="identity", fill="green3") +
      theme(axis.text.x = element_text(angle = 90)) +
      ggtitle("Top 30 News Unigrams without Stop Words") +
      xlab("Words") +
      ylab("Count")

wordcloud(words = newsUnigram$term, freq = newsUnigram$count, min.freq = 1, max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))

Twitter

kable(head(twitterUnigram, 10), align="c", caption="Twitter Unigrams without Stop Words") %>% 
      kable_styling(bootstrap_options = c("striped", "hover"), full_width = F, position="center") %>%
      column_spec(1, bold = T, border_right = T)
Twitter Unigrams without Stop Words

term      count    support
im       158502     147951
just     149601     145925
like     121294     115260
get      111905     107395
love     105249      99032
good      99616      95685
thanks    88595      87501
dont      90064      85888
can       89110      85312
day       89997      85301

ggplot(data=twitterUnigram[1:30,], aes(x=reorder(term,-count), y=count)) +
      geom_bar(stat="identity", fill="green3") +
      theme(axis.text.x = element_text(angle = 90)) +
      ggtitle("Top 30 Twitter Unigrams without Stop Words") +
      xlab("Words") +
      ylab("Count")

wordcloud(words = twitterUnigram$term, freq = twitterUnigram$count, min.freq = 1, max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))

Conclusions

It is clear from the unigrams that stop words dominate the text. Once they are removed, different patterns emerge across the blog, news, and Twitter sources. Blog text shows a strong preference for “one”, “just”, “like”, “can”, and “time”. News text strongly favors “said”, most likely because of quoted and attributed sources. Tweets have a frequency distribution that tails off quickly after “im” (presumably “I’m”) and “just”.
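A natural next step, presumably using the ngram directory created earlier, is to extend this analysis to higher-order n-grams for the eventual prediction model. The sketch below (not run for this report; the variable names are placeholders) shows one way to do this with term_stats(), which accepts an ngrams argument.

#Sketch only: counting bigrams and trigrams on the tidied blog text
blogBigram <- term_stats(blogLines, ngrams = 2)
blogTrigram <- term_stats(blogLines, ngrams = 3)
head(blogBigram, 10)
head(blogTrigram, 10)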