Project 2 Assignment

Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.

  2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]

  3. Perform the analysis requested in the discussion item.

  4. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

load the required libraries

require(stringr) 
## Loading required package: stringr
require(dplyr)
## Loading required package: dplyr
## Warning: package 'dplyr' was built under R version 3.5.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(tidyr)
## Loading required package: tidyr
require(ggplot2)
## Loading required package: ggplot2
require(readr)
## Loading required package: readr

1 AvocadoPrices

Load AvocadoPrices.csv posted by Samriti

Avocado_Prices <- read_csv("~/Google Drive/CUNY SPRING 19/COURSES/data 607/ASSIGNMENTS/607 proj2/AvocadoPrices.csv")
## Parsed with column specification:
## cols(
##   Date = col_date(format = ""),
##   AveragePrice = col_double(),
##   `Total Volume` = col_double(),
##   `4046` = col_double(),
##   `4225` = col_double(),
##   `4770` = col_double(),
##   `Total Bags` = col_double(),
##   `Small Bags` = col_double(),
##   `Large Bags` = col_double(),
##   `XLarge Bags` = col_double()
## )

Identify the names of the columns

names(Avocado_Prices)
##  [1] "Date"         "AveragePrice" "Total Volume" "4046"        
##  [5] "4225"         "4770"         "Total Bags"   "Small Bags"  
##  [9] "Large Bags"   "XLarge Bags"

Select the useful columns therefore eliminating the unlabelled data, also reducing the table to one type of observation in order to be compliant with tidy data. Rename the columns to more accurate descriptions.

Avocado_Prices_bag <- Avocado_Prices %>% 
  select(Date, "Small Bags", "Large Bags", "XLarge Bags", "Total Bags") %>%
  rename(Week = Date, Small = "Small Bags", Large = "Large Bags", XLarge = "XLarge Bags", Total = "Total Bags")
Avocado_Prices_bag[1:10,]
## # A tibble: 10 x 5
##    Week           Small    Large  XLarge     Total
##    <date>         <dbl>    <dbl>   <dbl>     <dbl>
##  1 2018-12-02  7174281. 2895343.  81548. 10151173.
##  2 2018-11-25  6016143. 2183011.  91689.  8290842.
##  3 2018-11-18  9174635. 3103023. 105316. 12382973.
##  4 2018-11-11 10721367. 3865596. 111282. 14698246.
##  5 2018-11-04 10602849. 4538743.  80605. 15222197.
##  6 2018-10-28 11437563. 4320344. 107439. 15865346.
##  7 2018-10-21  9914170. 3463120. 172919. 13550209 
##  8 2018-10-14 10460714. 3928694. 189885. 14579294.
##  9 2018-10-07 11085672. 3792435. 236376. 15114483.
## 10 2018-09-30 11080362. 3869002. 257267. 15206631.

Moved from wide to long. Now each variable has its own column and each observation has its own row.

AvocadoPricesResult <- Avocado_Prices_bag %>% 
  gather(`Bag Size`, Total, Small:XLarge)
AvocadoPricesResult[1:10,]
## # A tibble: 10 x 3
##    Week       `Bag Size`     Total
##    <date>     <chr>          <dbl>
##  1 2018-12-02 Small       7174281.
##  2 2018-11-25 Small       6016143.
##  3 2018-11-18 Small       9174635.
##  4 2018-11-11 Small      10721367.
##  5 2018-11-04 Small      10602849.
##  6 2018-10-28 Small      11437563.
##  7 2018-10-21 Small       9914170.
##  8 2018-10-14 Small      10460714.
##  9 2018-10-07 Small      11085672.
## 10 2018-09-30 Small      11080362.

ANALYSIS

Identify the best and worst times of the year to buy hash avocados

AvocadoPricesResult %>%
 arrange(Total) %>% 
  slice(1:5)
## # A tibble: 5 x 3
##   Week       `Bag Size`   Total
##   <date>     <chr>        <dbl>
## 1 2018-11-04 XLarge      80605.
## 2 2018-12-02 XLarge      81548.
## 3 2018-11-25 XLarge      91689.
## 4 2018-11-18 XLarge     105316.
## 5 2018-10-28 XLarge     107439.
AvocadoPricesResult %>%
 arrange(desc(Total)) %>% 
  slice(1:3)
## # A tibble: 3 x 3
##   Week       `Bag Size`     Total
##   <date>     <chr>          <dbl>
## 1 2018-05-06 Small      15436247.
## 2 2018-02-04 Small      13384587.
## 3 2018-05-13 Small      13110016.

The best time to buy Avocados in 2018 was late October through early December. While the worst time to buy Avocados in 2018, was mid May and early February.The prices were sky high!

2 TopBooks

Load TopBooks.csv posted by Henry

Observe column name and charcteristics

Top_Books <- read_csv("~/Google Drive/CUNY SPRING 19/COURSES/data 607/ASSIGNMENTS/607 proj2/TopBooks.csv")
## Parsed with column specification:
## cols(
##   title = col_character(),
##   author = col_character(),
##   genre = col_character(),
##   rating = col_double(),
##   img = col_character(),
##   numRater = col_double(),
##   type = col_character(),
##   price = col_character()
## )
Top_Books[1:10,]
## # A tibble: 10 x 8
##    title     author    genre  rating img            numRater type   price  
##    <chr>     <chr>     <chr>   <dbl> <chr>             <dbl> <chr>  <chr>  
##  1 The Brok… Enjeela … Histo…    4.5 https://image…      214 Kindl… 4.99   
##  2 The Thre… Andrew G… Histo…    4.7 https://image…      128 Audib… Free w…
##  3 The Thre… Andrew G… Histo…    4.7 https://image…      128 Hardc… 17.99  
##  4 The Thre… Andrew G… Histo…    4.7 https://image…      128 Kindl… 14.99  
##  5 Sapiens   Yuval No… Histo…    4.5 https://image…     5603 Audib… Free w…
##  6 Extreme … Jocko Wi… Histo…    4.8 https://image…     2585 Audib… Free w…
##  7 The Wint… Neal Bas… Histo…    4.6 https://image…      567 Kindl… 9.99   
##  8 Born Sur… Wendy Ho… Histo…    4.8 https://image…     1337 Kindl… 11.99  
##  9 Man's Se… Viktor E… Histo…    4.7 https://image…     4525 Paper… 7.36   
## 10 Bury My … Dee Brown Histo…    4.7 https://image…     1225 Kindl… 9.99

The laws of tidy data indicate that each observation needs to be in its own table. However this data is more useful in a single table; it is easier to analyze and draw conclusions.

Change the class of the price column so it can be easily analyzed later on.

Top_Books$price <- as.numeric(Top_Books$price)
## Warning: NAs introduced by coercion

Select the Columns I will be working with then rename to add uppercase to Headers and more accurate naming.

TopBooksResult <- Top_Books %>% 
  select(title, author, genre, type, price) %>%
  rename(Title = title, Author = author, Genre = genre, Format = type, Price = price)
TopBooksResult[1:10,]
## # A tibble: 10 x 5
##    Title                    Author             Genre   Format         Price
##    <chr>                    <chr>              <chr>   <chr>          <dbl>
##  1 The Broken Circle        Enjeela Ahmadi-Mi… History Kindle Edition  4.99
##  2 The Threat               Andrew G. McCabe   History Audible Audio… NA   
##  3 The Threat               Andrew G. McCabe   History Hardcover      18.0 
##  4 The Threat               Andrew G. McCabe   History Kindle Edition 15.0 
##  5 Sapiens                  Yuval Noah Harari  History Audible Audio… NA   
##  6 Extreme Ownership        Jocko Willink      History Audible Audio… NA   
##  7 The Winter Fortress      Neal Bascomb       History Kindle Edition  9.99
##  8 Born Survivors           Wendy Holden       History Kindle Edition 12.0 
##  9 Man's Search for Meaning Viktor E. Frankl   History Paperback       7.36
## 10 Bury My Heart at Wounde… Dee Brown          History Kindle Edition  9.99

ANALYSIS

Henry would like to find the cheapest printed books on the table of best sellers he provided.

cheapbooks <- TopBooksResult %>%
  filter(Format == "Paperback") %>%
  arrange((Price))  
  cheapbooks[1:2,]
## # A tibble: 2 x 5
##   Title                         Author           Genre   Format    Price
##   <chr>                         <chr>            <chr>   <chr>     <dbl>
## 1 The Negro Motorist Green-Book Victor H. Green  History Paperback  6.64
## 2 Man's Search for Meaning      Viktor E. Frankl History Paperback  7.36

The cheapest paperback best sellers are The Negro Motorist Green-Book and Man’s search for meaning.

My daughter loves to read hardcover books, and the only available appropriate genre is Literature and fiction.

mybooks <- TopBooksResult %>%
  filter(Format == "Hardcover", Genre == "Literature & Fiction")
  mybooks
## # A tibble: 1 x 5
##   Title                   Author      Genre                Format    Price
##   <chr>                   <chr>       <chr>                <chr>     <dbl>
## 1 Where the Crawdads Sing Delia Owens Literature & Fiction Hardcover  16.2

The only hardcover book within the Literature and Fiction genre is Where the Crawdads Sing

I like to read hardcover books but I dont want to pay hardcover price. How do Hardcover books compare to paperback in terms of price.

hardcoverbookprice <- TopBooksResult %>%
  filter(Format == "Hardcover")
paperbackbookprice <- TopBooksResult %>%
  filter(Format == "Paperback")

plot(hardcoverbookprice$Price, type="o", col="blue")
lines(paperbackbookprice$Price, type="o", pch=22, lty=2, col="red")

legend(20, c("hardcover","paperback"), cex=0.8, 
   col=c("blue","red"), pch=21:22, lty=1:2)

3 ScienceProj

load ScienceProj.csv posted by Chris Ayre

Science_Proj <- read_csv("~/Google Drive/CUNY SPRING 19/COURSES/data 607/ASSIGNMENTS/607 proj2/ScienceProj.csv")
## Parsed with column specification:
## cols(
##   color = col_character(),
##   `minute 0` = col_double(),
##   `minute 10` = col_double(),
##   `minute 20` = col_double(),
##   `minute 30` = col_double(),
##   `minute 40` = col_double(),
##   `minute 50` = col_double(),
##   `minute 60` = col_double(),
##   phase = col_character()
## )
Science_Proj[1:10,]
## # A tibble: 10 x 9
##    color `minute 0` `minute 10` `minute 20` `minute 30` `minute 40`
##    <chr>      <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
##  1 white         78          81          83          88          93
##  2 red           78          82          90          93          98
##  3 pink          78          82          84          90          96
##  4 black         78          88          92          98         108
##  5 green         78          81          85          91          95
##  6 white         98          96          93          80          78
##  7 red          109         106          95          87          82
##  8 pink         102          96          90          83          80
##  9 black        121         108          98          90          84
## 10 green        105          94          90          82          80
## # … with 3 more variables: `minute 50` <dbl>, `minute 60` <dbl>,
## #   phase <chr>

Rename in preparation for the (7) minute Columns to be converted to a single minute column, Each Variable must have its own column for the table to be tidy.

Science_Proj <- Science_Proj %>% 
  rename(Color = color, "0" = "minute 0", "10" = "minute 10", "20" = "minute 20", "30" = "minute 30", "40" = "minute 40", "50" = "minute 50", "60" = "minute 60", Phase = phase)
Science_Proj[1:10,]
## # A tibble: 10 x 9
##    Color   `0`  `10`  `20`  `30`  `40`  `50`  `60` Phase  
##    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>  
##  1 white    78    81    83    88    93    96    98 heating
##  2 red      78    82    90    93    98   106   109 heating
##  3 pink     78    82    84    90    96    99   102 heating
##  4 black    78    88    92    98   108   116   121 heating
##  5 green    78    81    85    91    95   102   105 heating
##  6 white    98    96    93    80    78    78    78 cooling
##  7 red     109   106    95    87    82    80    78 cooling
##  8 pink    102    96    90    83    80    78    78 cooling
##  9 black   121   108    98    90    84    79    78 cooling
## 10 green   105    94    90    82    80    78    78 cooling

Convert the table from wide to tall by consolidating 7 minute columns to one. All recorded observations now have their own row.

Science_Proj_tall <- Science_Proj %>% 
  gather(`Minute`, Temperature, "0":"60")
Science_Proj_tall[1:10,]
## # A tibble: 10 x 4
##    Color Phase   Minute Temperature
##    <chr> <chr>   <chr>        <dbl>
##  1 white heating 0               78
##  2 red   heating 0               78
##  3 pink  heating 0               78
##  4 black heating 0               78
##  5 green heating 0               78
##  6 white cooling 0               98
##  7 red   cooling 0              109
##  8 pink  cooling 0              102
##  9 black cooling 0              121
## 10 green cooling 0              105

The table contains two different types of observations; heating and cooling. Tidy criteria requires each type of data to be in a single table. First we will create a new table from the observations which occurred during the heating phase of the experiment

Science_Proj_heating <- Science_Proj_tall %>%
  slice(1:5, 11:15, 21:25, 31:35, 41:45, 51:55, 61:65)
Science_Proj_heating[1:10,]
## # A tibble: 10 x 4
##    Color Phase   Minute Temperature
##    <chr> <chr>   <chr>        <dbl>
##  1 white heating 0               78
##  2 red   heating 0               78
##  3 pink  heating 0               78
##  4 black heating 0               78
##  5 green heating 0               78
##  6 white heating 10              81
##  7 red   heating 10              82
##  8 pink  heating 10              82
##  9 black heating 10              88
## 10 green heating 10              81

__ Create another table from the observations which occurred during the cooling phase of the experiment__

Science_Proj_cooling <- Science_Proj_tall %>%
  slice(6:10, 16:20, 26:30, 36:40, 46:50, 56:60, 66:70)
Science_Proj_cooling[1:10,]
## # A tibble: 10 x 4
##    Color Phase   Minute Temperature
##    <chr> <chr>   <chr>        <dbl>
##  1 white cooling 0               98
##  2 red   cooling 0              109
##  3 pink  cooling 0              102
##  4 black cooling 0              121
##  5 green cooling 0              105
##  6 white cooling 10              96
##  7 red   cooling 10             106
##  8 pink  cooling 10              96
##  9 black cooling 10             108
## 10 green cooling 10              94

ANALYSIS

Compare how the colors red and pink absorb heat

#identify the heating data on red and pink
Science_Proj_heating %>%
 filter(Color == "red")
## # A tibble: 7 x 4
##   Color Phase   Minute Temperature
##   <chr> <chr>   <chr>        <dbl>
## 1 red   heating 0               78
## 2 red   heating 10              82
## 3 red   heating 20              90
## 4 red   heating 30              93
## 5 red   heating 40              98
## 6 red   heating 50             106
## 7 red   heating 60             109
Science_Proj_heating %>%
 filter(Color == "green")
## # A tibble: 7 x 4
##   Color Phase   Minute Temperature
##   <chr> <chr>   <chr>        <dbl>
## 1 green heating 0               78
## 2 green heating 10              81
## 3 green heating 20              85
## 4 green heating 30              91
## 5 green heating 40              95
## 6 green heating 50             102
## 7 green heating 60             105

Red absorbed more heat than green at each 10 minute interval throughout the experiment

# define vectors from heating data
green <- c(78, 81, 85, 91, 95,102,105)
red <- c(78,82,90,93,98,106,109)

# Graph red, Label X and Y axis
plot(red, type="o",xlab="Minutes",ylab="Temperature", col="red", ylim=c(70,120))

# Make x axis using labels
axis(1, at=1:7, lab=c("0","10","20","30","40", "50","60"))

# Graph green
lines(green, type="o", pch=22, lty=2, col="green")

# Create a title with a red, bold/italic font
title(main="Heat Absorption - Red vs Green", col.main="blue", font.main=4)