Data wrangling and transformations

The goal of this work is to practice preparing different datasets for downstream analysis work.

The task is to:

  1. Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets: ??? Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below. ??? Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!] ??? Perform the analysis requested in the discussion item. ??? Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

  2. Please include in your homework submission, for each of the three chosen datasets: ??? The URL to the .Rmd file in your GitHub repository, and ??? The URL for your rpubs.com web page.

Data set 1

Top 5 Best Seller books on Amazon

This is a dataset that will contain top best selling and highly rated books across 5 different genres (History, Law, Literature & Fiction, Science and Math, and Religion) and will be used to produce a .csv file containing 250 observations and 8 variables describing up to date record of 50 top selling books across the aforementioned 5 genres each scrapped from Amazon.

The data will contain descriptive observations in a fairly clean form because specific information were looked for on the web pages. Each book will have eight variables depicting its title, author, genre, rating, price, type (kindle edition, paperback, etc), etc.

The data will then be worked on using appropriate functions like group_by, filter, arrange, select, and other relevant functions from the dplyr and tidyr packages to see how the books compare with each others in terms of their genres, types, prices, and the most rated such that at the end, I will be able to to know the cheapest 5 (one from each of the 5 genres) from among them.

Data Scrapping

This involves scrapping the the pages of each genre for the required information.

religionRaw <- html_nodes(read_html('https://www.amazon.com/Best-Sellers-Books-Religion-Spirituality/zgbs/books/22/ref=zg_bs_nav_b_1_b', encoding = "en_US.UTF-8"), 'ol#zg-ordered-list')

sciMathRaw <- html_nodes(read_html('https://www.amazon.com/Best-Sellers-Books-Science-Math/zgbs/books/75/ref=zg_bs_nav_b_1_b', encoding = "en_US.UTF-8"), 'ol#zg-ordered-list')

litFictionRaw <- html_nodes(read_html('https://www.amazon.com/Best-Sellers-Books-Literature-Fiction/zgbs/books/17/ref=zg_bs_nav_b_1_b', encoding = "en_US.UTF-8"), 'ol#zg-ordered-list')

historyRaw <- html_nodes(read_html('https://www.amazon.com/Best-Sellers-Books-History/zgbs/books/9/ref=zg_bs_nav_b_1_b', encoding = "en_US.UTF-8"), 'ol#zg-ordered-list')

lawRaw <- html_nodes(read_html('https://www.amazon.com/Best-Sellers-Books-Law/zgbs/books/10777/ref=zg_bs_nav_b_1_b', encoding = "en_US.UTF-8"), 'ol#zg-ordered-list')

Some of the book titles contain : which implies a long decriptive part of the tile comes after the colon. These are therefore shortened at the :

processNodes <- function(rawNode, genre)
{
    imgs <- vector()
    genres <- vector()
    titles <- vector()
    authors <- vector()
    ratings <- vector()
    numRaters <- vector()
    types <- vector()
    prices <- vector()
    
    itemsRaw <-html_nodes(rawNode, 'span.zg-item')
  
    for(i in 1: length(itemsRaw))
    {
        x <- itemsRaw[i]
        xImg <- str_replace_all(str_extract(x, '(?<=src=)"(.+?)"'), '\"', '')
        xTitle <- str_trim(str_replace_all(html_text(html_node(x, '.a-link-normal > div')), "[\r\n]" , ""), side='both') 
        
        if(str_detect(xTitle, ':') == TRUE)
        {
          xTitle <- str_split(xTitle, ':')[[1]][1] # Shorten the title by splitting at the colon (:)
        }
        
        xAuthor <- str_trim(html_text(html_node(x, '.a-link-child')), side='both')
        
        if(is.na(xAuthor) || is.null(xAuthor))
        {
          xAuthor <- str_trim(html_text(html_node(x, 'span.a-color-base')), side='both')
        }
        
        xRating <- str_trim(str_extract(html_node(x, '.a-icon-alt'), '\\d+(\\.\\d+)'), side='both')
        
        xNumRaters <- str_replace(str_trim(html_text(html_node(x, 'div.a-icon-row > a.a-size-small')), side='both'), ',', '')
        
        if(is.na(xNumRaters) || is.null(xNumRaters))
        {
          xNumRaters <- str_replace(str_trim(html_text(html_node(x, 
                        'div.a-spacing-none > a-link-normal')), side='both'), ',', '')
        }
       
        xType <- str_trim(html_text(html_node(x, 'div.a-size-small > span.a-color-secondary')), side='both')
        
        p <- str_extract(html_node(x, 'span.p13n-sc-price'), '\\d+(\\.\\d+)')
        
        xPrice <-  round(as.numeric(p), 2)
        
        if(is.na(xPrice) || is.null(xPrice))
        {
          xPrice <- str_extract(str_trim(html_text(html_node(x, 
                        'span.a-color-price')), side='both'), '[[:alpha:] ]{2,}')
        }
        
        imgs <- c(imgs, xImg)
        genres <- c(genres, genre)
        titles <- c(titles, xTitle)
        authors <- c(authors, xAuthor)
        ratings <- c(ratings, xRating)
        numRaters <- c(numRaters, xNumRaters)
        types <- c(types, xType)
        prices <- c(prices, xPrice)
    }
    
    return(data.frame(title = titles, author = authors, genre = genres, rating = ratings, img = imgs,
                      numRater = numRaters, type = types, price = prices))
}
This Returns 5 different dataframes of 50 observations and 8 variables
religion <- processNodes(religionRaw, 'Religion')
science <- processNodes(sciMathRaw, 'Science & Math')
fiction <- processNodes(litFictionRaw, 'Literature & Fiction')
history <- processNodes(historyRaw, 'History')
law <- processNodes(lawRaw, 'Law')
Combine all datframes into one and check the dimensions of the resultant dataframe
allBooks <- rbind(history, religion, science, fiction, law)
dim(allBooks)
## [1] 250   8
Exporting the data to a .csv file
write.csv(allBooks, "topBooks", row.names=FALSE)
We have 250 books and 8 variables

Preview the data

mixedBooks <- subset(allBooks, select=c(title, author, genre, rating, numRater, type, price)) #Remove the book image url

datatable(head(mixedBooks, 10), colnames= c("Title", "Author", "Genre", "Rating", "Number of reviewers", "Type", "Price($)"), class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))
Let’s filter out the the Audible Audiobook and Kindle Edition types since the interest is only on printed types, and convert the rating, numRater (Number of Reviewers), and price variables from factor type to numeric type
books <- mixedBooks %>% filter(!type %in% "Kindle Edition" & !type %in% "Audible Audiobook") 

books <- books%>% mutate(numRater = as.numeric(as.character(numRater))) %>% mutate(price = as.numeric(as.character(price)))%>%mutate(rating = as.numeric(as.character(rating)))

datatable(head(books, 10), colnames= c("Title", "Author", "Genre", "Rating", "Number of reviewers", "Type", "Price($)"), class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))
Let’s see the book with the highest rating and the one with highest number of reviewers
 books[which.max(books$numRater),]
##            title      author    genre rating numRater      type price
## 26 Jesus Calling Sarah Young Religion    4.9    17554 Hardcover  8.49
books[which.max(books$rating),]
##        title     author   genre rating numRater      type price
## 10 Spearhead Adam Makos History      5       81 Hardcover  16.8
This then shows that the book with the highest rating or with a 5-star rating is not the most bought or reviewed. It then makes sense to lay emphasis more on the ones with the highest number of reviewers and still with appreciable ratings say between 4.0 and 5.0
Let’s see how the genres compare to each other. We will group_by the genre and add up the number of reviewers so as to know the most reviewed genre
books2 <- books%>% group_by(genre) %>% summarize(reviewers = sum(na.omit(numRater)))%>%select(genre, reviewers)
books2
## # A tibble: 5 x 2
##   genre                reviewers
##   <fct>                    <dbl>
## 1 History                  15626
## 2 Religion                 96426
## 3 Science & Math           35402
## 4 Literature & Fiction      6223
## 5 Law                      38133
ggplot(books2, aes(x=genre, y=reviewers, fill=genre))+
  scale_fill_manual(values = c("#1f77b4", "#26868d", '#D32F2F', '#7B1FA2', '#3F51B5') ) + 
  labs(title="Genre Vs Reviewers", x ="Genre", y="Reviewers") + 
  geom_bar(stat="identity", width=.7, show.legend = FALSE)

It appears Religion is the most reviewed and consequently most purchased genre among the 5 selected genres

Let’s select the 5 most reviewed books in each of the genres

topReviewd <- books %>%group_by(genre) %>%top_n(n = 5, wt = numRater)
datatable(topReviewd, colnames= c("Title", "Author", "Genre", "Rating", "Number of reviewers", "Type", "Price($)"), class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))
Now we have the 5 top most reviewed books from each of the genres.
Finally, let’s select the cheapest book from among each of the genres
cheapestTopReviewd <- topReviewd%>%top_n(n = 1, wt = -price)
datatable(cheapestTopReviewd, colnames= c("Title", "Author", "Genre", "Rating", "Number of reviewers", "Type", "Price($)"), class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

Conclusion

Finally, I have the 5 cheapest books from among top rated and best selling books from among 5 genres to purchase on Amazon!

Data set 2

Social Security Disability Claims - a Dataset provided by Rajwant Mishra

This is a data set of Social Security Administration Disability benefits Claims from 2008 to 2017 Fiscal Years. Each year has months with each month showing the amount of claims collected through other means (like application in person, etc) and the ones registered via the internet.

Fiscal Year (FY): is the 12-month period from October 1st through September 30th.

We want to see how each monthly claims in each year compare to those of other years and to see if the claims submitted through the internet increased with time or not over the period of 2008 to 2017.

Reading the data from the provided .csv file

untidySSD <- read.csv("ssadisability.csv", header = TRUE, stringsAsFactors = FALSE) # Read the .csv file
datatable(untidySSD, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))
dim(untidySSD)
## [1] 10 25

The data set has 10 observations and 25 variables.

Looking at the data, the monthly totals can be gathered under a single column say Total
ssda <- untidySSD %>% gather(matches("_Total"), key = 'Totals', value = 'Value')
datatable(ssda, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

The columns with _Internet can be gathered under say Internet

ssda1 <- ssda %>% gather(Oct_Internet:Sept_Internet, key = 'Internet', value = 'Volume')
datatable(ssda1, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))
Remove FY, _Total, _Internet from the values in the appropriate columns and adding 20 to the Fiscal_Year column, also remove the commas in the Value and Volume columns and convert their values to numeric
ssda2 <- ssda1%>%mutate(Volume = as.numeric(str_replace_all(Volume, ',', '')))%>%mutate(Value = as.numeric(str_replace_all(Value, ',', '')))%>%mutate(Fiscal_Year = str_replace_all(Fiscal_Year, 'FY', '20'))%>%mutate(Totals = str_replace_all(Totals, '_Total', ''))%>%mutate(Internet = str_replace_all(Internet, '_Internet', ''))
datatable(ssda2, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

Let’s see how the total claims for each year compare aginst other years by aggregating all the total monthly claims for each year: First, let’s subset the Fiscal_Year, Totals, Value, and Volume columns, renaming Totals to Months and creating a new column TotalValue to be the sum of each month’s total figures which is Value. Volume is already contained in Value. It was made to be a separete a column to point out how much claims were made via internet

s3 <- ssda2%>%mutate(TotalValue = Value, Month = Totals)%>%select('Fiscal_Year', 'Month', 'TotalValue')
s3 <- na.omit(s3)
datatable(s3, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

The above data can be used to vsualise monthly claims for each year.

Lets further group the data for months of each year together so as to get a total figure for the claims per year

s4 <- s3%>%group_by(Fiscal_Year)%>% summarise(Total = sum(TotalValue))
datatable(s4, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

A summary of the yearly aggregations

The maximum claims occured in 2012

s4[which.max(s4$Total),]
## # A tibble: 1 x 2
##   Fiscal_Year    Total
##   <chr>          <dbl>
## 1 2011        36541992

While the minimum claims occured in 2017

s4[which.min(s4$Total),]
## # A tibble: 1 x 2
##   Fiscal_Year   Total
##   <chr>         <dbl>
## 1 2017        8513220
ggplot(s4, aes(x=Fiscal_Year, y=Total, fill=Fiscal_Year))+
  scale_fill_manual(values = c("#1f77b4", "#26868d", '#440154', '#7B1FA2', '#3F51B5', '#D32F2F', '#00695c', '#303f9f', '#6acb5c', '#2e748e', '#6acb5c','#45bc72' ) ) + 
  labs(title="SSD Claims per year", x ="Fiscal Year", y="Claims") + 
  geom_bar(stat="identity", width=.7, show.legend = FALSE)

In overall, this shows that the claims were highest in 2011, and lowest in 2017. Let’s see if further visual comparisms can show why the claims were lowest in 2017

Let’s see how the monthly claims (via ineternet and other means) over ther years compare to each other

ggplot(s3, aes(x = Fiscal_Year, y = TotalValue, fill = Month)) + 
  geom_bar(position="fill", stat = "identity") + 
  labs(title="Monthly SSD Claims per year", x ="Fiscal Year", y="Claims") +
  scale_fill_manual(values = c("#1f77b4", "purple", "#440154", "#7986cb", "#26868d", '#2ca02c', '#35618d', '#ff7f0e', '#2dab81', '#2e748e', '#d62728', '#3d4c89') )

From this plot, it is obvious that the provided data set has claims for only for months in 2017 hence the low claims rate noted for that year.

Now, let’s see how the claims over the internet compares yearl and monthly
Monthly
s5 <- ssda2%>%mutate(Total = Volume, Month = Internet)%>%select('Fiscal_Year', 'Month', 'Total')
s5 <- na.omit(s5)
datatable(s5, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))
s5[which.max(s5$Total),]
##      Fiscal_Year  Month  Total
## 1207        2014 August 139000

The highst number of claims filed via the internet occured on August 2014

ggplot(s5, aes(x = Fiscal_Year, y = Total, fill = Month)) + 
  geom_bar(position="dodge", stat = "identity") + 
  labs(title="Monthly SSD Claims filed via the Internet per year", x ="Fiscal Year", y="Claims") +
  scale_fill_manual(values = c("#1f77b4", "#26868d", "#440154", "#45367f", "#45bc72", '#6acb5c', '#35618d', '#6acb5c', '#2dab81', '#2e748e', '#45367f', '#3d4c89') ) # + coord_flip()

The trend was progressive incrementally from 2008
Yearly
s6 <- s5%>%group_by(Fiscal_Year)%>% summarise(Total = sum(Total))
datatable(s6, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))
ggplot(s6, aes(x=Fiscal_Year, y=Total, fill=Fiscal_Year))+
  scale_fill_manual(values = c("#1f77b4", "#26868d", '#440154', '#7B1FA2', '#3F51B5', '#D32F2F', '#00695c', '#303f9f', '#2ca02c', '#ff7f0e' ) ) + 
  labs(title="SSD Claims via the internet per year", x ="Fiscal Year", y="Claims") + 
  geom_bar(stat="identity", width=.7, show.legend = FALSE)

ggplot(s5, aes(x = Fiscal_Year, y = Total)) + 
     geom_line(color = "#00AFBB", size = 3) + 
  labs(title="Monthly SSD Claims via the Internet per year", x ="Fiscal Year", y="Claims")

Conclusion

It is evident that the claims filed via the internet generally increased progressively from 2008, achieving maximum peak in 2014 but suprisingly started decreasing progressively from that point.

Data set 3

Color and heat absorption - a dataset provided by Christopher Ayre

Description: This is a dataset provided from an experiment with his daughter to complete her project for a science fair. For the experiment, they placed thermometers in 5 different color t-shirts and recorded the temperatures at 10 minute intervals for 1 hour as the garments were exposed to heat. The heater was then turned off, and the temperatures measured again at 10 minute intervals as the garments cooled. The attached provided .csv file represents their findings.

The experiment question then is to analyze the rate at which different colors absorb and release heat.

Let’s read the data

untidyHeat <- read.csv("science_proj_data.csv", header = TRUE, stringsAsFactors = FALSE) # Read the .csv file
datatable(untidyHeat, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))
dim(untidyHeat)
## [1] 10  9

The dataset has 10 observations and 9 variables.

Let’s gather the minutes colums under say minute and their values under temperature, and then remove minute. from the values

temp <- untidyHeat %>% gather(minute.0:minute.60, key = 'minute', value = 'temperature')%>%mutate(minute = str_replace_all(minute, 'minute.', ''))
datatable(temp, class = 'cell-border stripe', options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#1f77b4', 'color': '#fff', 'text-align': 'center !important'});",
    "$(this.api().table().body()).css({'color': '#000', 'text-align': 'center !important'});",
    "}")
))

Let’s how the colors fare during heating and cooling phases

For heating phase
heating <- subset(temp, phase == 'heating')
#ggplot(temp, aes(x = minute, y = temperature)) + geom_line()

 ggplot(heating, aes(x = minute, y = temperature, fill = color)) + 
  geom_bar(position="dodge", stat = "identity") + 
  labs(title="Temperature performances of the colors during heating phase", x ="Minute", y="Temperature") +
  scale_fill_manual(values = c("#000000", "#2ca02c", "#440154", "#D32F2F", "#45367f") )

For cooling phase
cooling <- subset(temp, phase == 'cooling')
#ggplot(temp, aes(x = minute, y = temperature)) + geom_line()

 ggplot(cooling, aes(x = minute, y = temperature, fill = color)) + 
  geom_bar(position="dodge", stat = "identity") + 
  labs(title="Temperature performances of the colors during cooling phase", x ="Minute", y="Temperature") +
  scale_fill_manual(values = c("#000000", "#2ca02c", "#440154", "#D32F2F", "#45367f") )

Conclusion

From all indications, Black colors tend tend to absorb heat the most (very fast) and slowest in releasing (emitting) heat. White colors on the other hand tend to be slowest in absorbing heat and very fast in cooling down (releasing/emiting heat). Red colors appears to retain heat as much and is next to black color in that regards. Other colors fall in-between these ones