The goal of this work is to practice preparing different datasets for downstream analysis work.
The task is to:
Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:

- Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
- Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
- Perform the analysis requested in the discussion item.
- Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
Please include in your homework submission, for each of the three chosen datasets:

- The URL to the .Rmd file in your GitHub repository, and
- The URL for your rpubs.com web page.
This dataset contains the top-selling and most highly rated books in 5 genres (History, Law, Literature & Fiction, Science and Math, and Religion). It is used to produce a .csv file of 250 observations and 8 variables: an up-to-date record of the 50 top-selling books in each of the 5 genres, scraped from Amazon.
The data will be in a fairly clean, descriptive form, because specific pieces of information are targeted on the web pages. Each book has eight variables describing its title, author, genre, rating, price, type (Kindle edition, paperback, etc.), and so on.
The data will then be worked on with group_by, filter, arrange, select, and other relevant functions from the dplyr and tidyr packages, to compare the books by genre, type, price, and number of ratings, so that at the end I can identify the 5 cheapest books (one from each of the 5 genres) among the most reviewed ones.
This involves scraping the pages of each genre for the required information.
library(rvest)    # web scraping
library(stringr)  # string manipulation
library(dplyr)    # data transformation
library(tidyr)    # data tidying
library(ggplot2)  # plotting
library(DT)       # interactive tables

religionRaw <- html_nodes(read_html('https://www.amazon.com/Best-Sellers-Books-Religion-Spirituality/zgbs/books/22/ref=zg_bs_nav_b_1_b', encoding = "UTF-8"), 'ol#zg-ordered-list')
sciMathRaw <- html_nodes(read_html('https://www.amazon.com/Best-Sellers-Books-Science-Math/zgbs/books/75/ref=zg_bs_nav_b_1_b', encoding = "UTF-8"), 'ol#zg-ordered-list')
litFictionRaw <- html_nodes(read_html('https://www.amazon.com/Best-Sellers-Books-Literature-Fiction/zgbs/books/17/ref=zg_bs_nav_b_1_b', encoding = "UTF-8"), 'ol#zg-ordered-list')
historyRaw <- html_nodes(read_html('https://www.amazon.com/Best-Sellers-Books-History/zgbs/books/9/ref=zg_bs_nav_b_1_b', encoding = "UTF-8"), 'ol#zg-ordered-list')
lawRaw <- html_nodes(read_html('https://www.amazon.com/Best-Sellers-Books-Law/zgbs/books/10777/ref=zg_bs_nav_b_1_b', encoding = "UTF-8"), 'ol#zg-ordered-list')

Some of the book titles contain a colon (:), which introduces a long descriptive part of the title; such titles are therefore shortened at the colon.
processNodes <- function(rawNode, genre) {
  imgs <- vector()
  genres <- vector()
  titles <- vector()
  authors <- vector()
  ratings <- vector()
  numRaters <- vector()
  types <- vector()
  prices <- vector()
  itemsRaw <- html_nodes(rawNode, 'span.zg-item')
  for (i in seq_along(itemsRaw)) {
    x <- itemsRaw[i]
    # Book cover image URL, extracted from the raw HTML of the node
    xImg <- str_replace_all(str_extract(as.character(x), '(?<=src=)"(.+?)"'), '"', '')
    # Title, with line breaks removed
    xTitle <- str_trim(str_replace_all(html_text(html_node(x, '.a-link-normal > div')), "[\r\n]", ""), side = 'both')
    if (str_detect(xTitle, ':')) {
      xTitle <- str_split(xTitle, ':')[[1]][1]  # Shorten the title by splitting at the colon (:)
    }
    # Author; fall back to an alternative node when the first selector finds nothing
    xAuthor <- str_trim(html_text(html_node(x, '.a-link-child')), side = 'both')
    if (is.na(xAuthor) || is.null(xAuthor)) {
      xAuthor <- str_trim(html_text(html_node(x, 'span.a-color-base')), side = 'both')
    }
    # Star rating, e.g. "4.9" from "4.9 out of 5 stars"
    xRating <- str_trim(str_extract(html_text(html_node(x, '.a-icon-alt')), '\\d+(\\.\\d+)?'), side = 'both')
    # Number of reviewers, with the thousands separator removed
    xNumRaters <- str_replace(str_trim(html_text(html_node(x, 'div.a-icon-row > a.a-size-small')), side = 'both'), ',', '')
    if (is.na(xNumRaters) || is.null(xNumRaters)) {
      xNumRaters <- str_replace(str_trim(html_text(html_node(x, 'div.a-spacing-none > .a-link-normal')), side = 'both'), ',', '')
    }
    # Format (Hardcover, Paperback, Kindle Edition, ...)
    xType <- str_trim(html_text(html_node(x, 'div.a-size-small > span.a-color-secondary')), side = 'both')
    # Price; fall back to the textual price node when no numeric price is present
    p <- str_extract(html_text(html_node(x, 'span.p13n-sc-price')), '\\d+(\\.\\d+)?')
    xPrice <- round(as.numeric(p), 2)
    if (is.na(xPrice) || is.null(xPrice)) {
      xPrice <- str_extract(str_trim(html_text(html_node(x, 'span.a-color-price')), side = 'both'), '[[:alpha:] ]{2,}')
    }
    imgs <- c(imgs, xImg)
    genres <- c(genres, genre)
    titles <- c(titles, xTitle)
    authors <- c(authors, xAuthor)
    ratings <- c(ratings, xRating)
    numRaters <- c(numRaters, xNumRaters)
    types <- c(types, xType)
    prices <- c(prices, xPrice)
  }
  return(data.frame(title = titles, author = authors, genre = genres, rating = ratings, img = imgs,
                    numRater = numRaters, type = types, price = prices))
}
religion <- processNodes(religionRaw, 'Religion')
science <- processNodes(sciMathRaw, 'Science & Math')
fiction <- processNodes(litFictionRaw, 'Literature & Fiction')
history <- processNodes(historyRaw, 'History')
law <- processNodes(lawRaw, 'Law')

allBooks <- rbind(history, religion, science, fiction, law)
dim(allBooks)
## [1] 250 8
write.csv(allBooks, "topBooks.csv", row.names = FALSE)

Preview the data:
mixedBooks <- subset(allBooks, select = c(title, author, genre, rating, numRater, type, price))  # Remove the book image URL
datatable(head(mixedBooks, 10), colnames= c("Title", "Author", "Genre", "Rating", "Number of reviewers", "Type", "Price($)"), class = 'cell-border stripe', options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
"$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
"}")
))

Remove the Audible Audiobook and Kindle Edition types, since the interest is only in printed types, and convert the rating, numRater (number of reviewers), and price variables from factor to numeric:

books <- mixedBooks %>% filter(!type %in% "Kindle Edition" & !type %in% "Audible Audiobook")
books <- books %>%
  mutate(numRater = as.numeric(as.character(numRater))) %>%
  mutate(price = as.numeric(as.character(price))) %>%
  mutate(rating = as.numeric(as.character(rating)))
datatable(head(books, 10), colnames= c("Title", "Author", "Genre", "Rating", "Number of reviewers", "Type", "Price($)"), class = 'cell-border stripe', options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
"$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
"}")
))

The most reviewed book:

books[which.max(books$numRater), ]
##            title      author    genre rating numRater      type price
## 26 Jesus Calling Sarah Young Religion    4.9    17554 Hardcover  8.49

The highest rated book:

books[which.max(books$rating), ]
##        title     author   genre rating numRater      type price
## 10 Spearhead Adam Makos History      5       81 Hardcover  16.8
Group by genre and add up the numbers of reviewers, so as to find the most reviewed genre:

books2 <- books %>%
  group_by(genre) %>%
  summarize(reviewers = sum(na.omit(numRater))) %>%
  select(genre, reviewers)
books2
## # A tibble: 5 x 2
## genre reviewers
## <fct> <dbl>
## 1 History 15626
## 2 Religion 96426
## 3 Science & Math 35402
## 4 Literature & Fiction 6223
## 5 Law 38133
ggplot(books2, aes(x = genre, y = reviewers, fill = genre)) +
  scale_fill_manual(values = c("#1f77b4", "#26868d", '#D32F2F', '#7B1FA2', '#3F51B5')) +
  labs(title = "Genre vs. Reviewers", x = "Genre", y = "Reviewers") +
  geom_bar(stat = "identity", width = .7, show.legend = FALSE)

Let’s select the 5 most reviewed books in each of the genres:
topReviewd <- books %>%
  group_by(genre) %>%
  top_n(n = 5, wt = numRater)

datatable(topReviewd, colnames = c("Title", "Author", "Genre", "Rating", "Number of reviewers", "Type", "Price($)"), class = 'cell-border stripe', options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
"$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
"}")
))

Now pick the cheapest of these top reviewed books in each genre:

cheapestTopReviewd <- topReviewd %>% top_n(n = 1, wt = -price)

datatable(cheapestTopReviewd, colnames = c("Title", "Author", "Genre", "Rating", "Number of reviewers", "Type", "Price($)"), class = 'cell-border stripe', options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
"$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
"}")
))

Description: This dataset comes from an experiment a classmate ran with his daughter for her science fair project. For the experiment, they placed thermometers in t-shirts of 5 different colors and recorded the temperatures at 10-minute intervals for 1 hour as the garments were exposed to heat. The heater was then turned off, and the temperatures were measured again at 10-minute intervals as the garments cooled. The provided .csv file represents their findings.
Let’s read the data
untidyHeat <- read.csv("science_proj_data.csv", header = TRUE, stringsAsFactors = FALSE)  # Read the .csv file

datatable(untidyHeat, class = 'cell-border stripe', options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
"$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
"}")
))

dim(untidyHeat)
## [1] 10 9
The dataset has 10 observations and 9 variables.
Let’s gather the minute columns under a single column, say minute, with their values under temperature, and then remove the “minute.” prefix from the values:
temp <- untidyHeat %>%
  gather(minute.0:minute.60, key = 'minute', value = 'temperature') %>%
  mutate(minute = str_replace_all(minute, 'minute.', ''))

datatable(temp, class = 'cell-border stripe', options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
"$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
"}")
))

Let’s see how the colors fare during the heating and cooling phases.
heating <- subset(temp, phase == 'heating')
ggplot(heating, aes(x = minute, y = temperature, fill = color)) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(title = "Temperature performance of the colors during the heating phase", x = "Minute", y = "Temperature") +
  scale_fill_manual(values = c("#000000", "#2ca02c", "#440154", "#D32F2F", "#45367f"))

cooling <- subset(temp, phase == 'cooling')
ggplot(cooling, aes(x = minute, y = temperature, fill = color)) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(title = "Temperature performance of the colors during the cooling phase", x = "Minute", y = "Temperature") +
  scale_fill_manual(values = c("#000000", "#2ca02c", "#440154", "#D32F2F", "#45367f"))
Social Security Disability Claims - a dataset provided by Rajwant Mishra
This is a dataset of Social Security Administration disability benefits claims for fiscal years 2008 to 2017. Each year is broken into months, and each month shows the number of claims collected through other means (such as applications in person) and the number registered via the internet.
Fiscal Year (FY): the 12-month period from October 1st through September 30th.
We want to see how the monthly claims in each year compare with those of the other years, and whether the claims submitted through the internet increased over the period 2008 to 2017.
Reading the data from the provided .csv file:
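A minimal sketch of this step, assuming the provided file is named ssa_claims.csv (the actual file name may differ):

untidyClaims <- read.csv("ssa_claims.csv", header = TRUE, stringsAsFactors = FALSE)  # assumed file name
dim(untidyClaims)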
The data set has 10 observations and 25 variables.
Looking at the data:

- The monthly total columns can be gathered under a single column, say Total.
- The columns ending in _Internet can be gathered under, say, Internet.
- Remove FY, _Total, and _Internet from the values in the appropriate columns, and prepend 20 to the Fiscal_Year values (e.g., 08 becomes 2008); also remove the commas in the Value and Volume columns and convert their values to numeric.

Let’s see how the total claims for each year compare against those of the other years by aggregating all the monthly totals for each year. First, let’s subset the Fiscal_Year, Totals, Value, and Volume columns, renaming Totals to Months and creating a new column TotalValue as the sum of each month’s total figures, which is Value. Volume is already contained in Value; it was kept as a separate column to point out how many claims were made via the internet (a sketch of these steps follows below). The above data can be used to visualise monthly claims for each year.
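A sketch of these tidying steps, assuming month-column names of the form Oct_Total / Oct_Internet and Fiscal_Year values of the form “FY08”; the actual names in the file may differ:

totals <- untidyClaims %>%
  select(Fiscal_Year, ends_with("_Total")) %>%
  gather(key = "Months", value = "Value", -Fiscal_Year) %>%
  mutate(Months = str_replace(Months, "_Total", ""))      # drop the _Total suffix

internet <- untidyClaims %>%
  select(Fiscal_Year, ends_with("_Internet")) %>%
  gather(key = "Months", value = "Volume", -Fiscal_Year) %>%
  mutate(Months = str_replace(Months, "_Internet", ""))   # drop the _Internet suffix

claims <- totals %>%
  left_join(internet, by = c("Fiscal_Year", "Months")) %>%
  mutate(Fiscal_Year = paste0("20", str_replace(Fiscal_Year, "FY", "")),  # e.g. "FY08" -> "2008"
         Value = as.numeric(str_replace_all(Value, ",", "")),             # strip commas, make numeric
         Volume = as.numeric(str_replace_all(Volume, ",", "")))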
Let’s further group the data for the months of each year together, so as to get a total claims figure per year.
A summary of the yearly aggregations
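Continuing the sketch above, the per-year aggregation can be done with group_by and summarize:

yearlyClaims <- claims %>%
  group_by(Fiscal_Year) %>%
  summarize(TotalValue = sum(Value, na.rm = TRUE),        # all claims
            InternetValue = sum(Volume, na.rm = TRUE))    # internet claims only
yearlyClaims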
The maximum claims occurred in 2012, while the minimum occurred in 2017.
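One way to locate those extremes, continuing the sketch above:

yearlyClaims[which.max(yearlyClaims$TotalValue), ]  # year with the most claims
yearlyClaims[which.min(yearlyClaims$TotalValue), ]  # year with the fewest claims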
Overall, this shows that the claims were highest in 2011 and lowest in 2017. Let’s see whether further visual comparisons can show why the claims were lowest in 2017.
Let’s see how the monthly claims (via the internet and other means) over the years compare to each other.
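A sketch of one possible comparison plot, faceting the monthly totals by fiscal year (the claims frame comes from the sketch above):

ggplot(claims, aes(x = Months, y = Value, fill = Months)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ Fiscal_Year) +
  labs(title = "Monthly claims by fiscal year", x = "Month", y = "Claims")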
From this plot, it is obvious that the provided dataset has claims for only some of the months in 2017, hence the low claims total noted for that year.
Now, let’s see how the claims over the internet compare yearly and monthly.
Monthly
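A matching sketch for the monthly internet volumes, again assuming the column names from the sketch above:

ggplot(claims, aes(x = Months, y = Volume, fill = Months)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ Fiscal_Year) +
  labs(title = "Monthly internet claims by fiscal year", x = "Month", y = "Internet claims")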
The highest number of claims filed via the internet occurred in August 2014.
The trend increased steadily from 2008.
Yearly
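And a sketch of the yearly view of internet claims, using the yearlyClaims frame from the sketch above:

ggplot(yearlyClaims, aes(x = Fiscal_Year, y = InternetValue, fill = Fiscal_Year)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  labs(title = "Internet claims per fiscal year", x = "Fiscal year", y = "Internet claims")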