Our task is to choose any three of the “wide” datasets identified in the Week 6 Discussion items.For each of the three chosen datasets:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform the analysis requested in the discussion item.
The code is in an R Markdown file, posted to rpubs.com, which includes narrative descriptions of data cleanup work, analysis, and conclusions.
Dataset: Retail Volume & Avocado Price for 2018
source: http://www.hassavocadoboard.com/retail/volume-and-price-data
Load packages
library(tidyr)
library(dplyr)
library(lubridate)
Read csv file into R
avocado <- read.csv("https://raw.githubusercontent.com/miachen410/DATA607/master/avocado.csv")
Take a look at the first 6 rows of the dataset
head(avocado)
## Date AveragePrice Total.Volume X4046 X4225 X4770
## 1 2018-12-02 1.22 27652426 9516477 7462866 521910.0
## 2 2018-11-25 1.25 21855494 7140335 6021450 402865.8
## 3 2018-11-18 1.14 29871135 9719229 7271528 497404.3
## 4 2018-11-11 1.00 39042283 13355864 10385141 603032.8
## 5 2018-11-04 1.01 38363283 12420772 10192035 528277.9
## 6 2018-10-28 1.01 39402227 12142265 10845124 549490.8
## Total.Bags Small.Bags Large.Bags XLarge.Bags
## 1 10151173 7174281 2895343 81547.92
## 2 8290842 6016143 2183011 91688.68
## 3 12382973 9174635 3103023 105315.78
## 4 14698246 10721367 3865596 111282.12
## 5 15222197 10602849 4538743 80604.81
## 6 15865346 11437563 4320345 107438.89
Take a look at the last 6 rows of the dataset
tail(avocado)
## Date AveragePrice Total.Volume X4046 X4225 X4770
## 39 2018-02-11 0.97 43167806 15870678 11541845 652856.6
## 40 2018-02-04 0.87 62505647 21620181 20445501 1066830.2
## 41 2018-01-28 1.09 40171641 14551800 12119885 575974.7
## 42 2018-01-21 1.08 42939822 14218844 13929702 928815.1
## 43 2018-01-14 1.20 37299945 12600918 11866198 652808.4
## 44 2018-01-07 1.13 36703157 13730993 10781339 677714.9
## Total.Bags Small.Bags Large.Bags XLarge.Bags
## 39 15102427 10844852 4023485 234089.7
## 40 19373134 13384587 5719097 269451.0
## 41 12923982 9749412 3041125 133444.4
## 42 13862460 9866218 3789723 206519.3
## 43 12180021 8128242 3917570 134208.9
## 44 11513110 8231766 3130919 150424.6
Let’s look at the structure of the dataset
str(avocado)
## 'data.frame': 44 obs. of 10 variables:
## $ Date : Factor w/ 44 levels "2018-01-07","2018-01-14",..: 44 43 42 41 40 39 38 37 36 35 ...
## $ AveragePrice: num 1.22 1.25 1.14 1 1.01 1.01 1.13 1.08 1.08 1.1 ...
## $ Total.Volume: num 27652426 21855494 29871135 39042283 38363283 ...
## $ X4046 : num 9516477 7140335 9719229 13355864 12420772 ...
## $ X4225 : num 7462866 6021450 7271528 10385141 10192035 ...
## $ X4770 : num 521910 402866 497404 603033 528278 ...
## $ Total.Bags : num 10151173 8290842 12382973 14698246 15222197 ...
## $ Small.Bags : num 7174281 6016143 9174635 10721367 10602849 ...
## $ Large.Bags : num 2895343 2183011 3103023 3865596 4538743 ...
## $ XLarge.Bags : num 81548 91689 105316 111282 80605 ...
Let’s look at the summary of the dataset
summary(avocado)
## Date AveragePrice Total.Volume X4046
## 2018-01-07: 1 Min. :0.870 Min. :21855494 Min. : 7140335
## 2018-01-14: 1 1st Qu.:1.030 1st Qu.:37192349 1st Qu.:11872406
## 2018-01-21: 1 Median :1.070 Median :40738473 Median :13719191
## 2018-01-28: 1 Mean :1.083 Mean :41058219 Mean :13527481
## 2018-02-04: 1 3rd Qu.:1.130 3rd Qu.:44350184 3rd Qu.:14710288
## 2018-02-11: 1 Max. :1.290 Max. :63716144 Max. :21620181
## (Other) :38
## X4225 X4770 Total.Bags
## Min. : 6021450 Min. : 402866 Min. : 8290842
## 1st Qu.:10407883 1st Qu.: 634174 1st Qu.:13672216
## Median :11588285 Median : 734278 Median :15212227
## Mean :11758540 Mean : 744075 Mean :15028123
## 3rd Qu.:12711307 3rd Qu.: 803143 3rd Qu.:16247283
## Max. :20445501 Max. :1169256 Max. :21625373
##
## Small.Bags Large.Bags XLarge.Bags
## Min. : 6016143 Min. :2183011 Min. : 80605
## 1st Qu.: 9845629 1st Qu.:3500838 1st Qu.:199184
## Median :11045183 Median :3929403 Median :267982
## Mean :10800320 Mean :3986979 Mean :240824
## 3rd Qu.:11709426 3rd Qu.:4470543 3rd Qu.:298018
## Max. :15436247 Max. :5840415 Max. :420065
##
avocado2 <- avocado %>% gather(ProductCode, Volume, -Date:-Total.Volume, -Total.Bags:-XLarge.Bags)
# New column ProductCode, with values gather from X4046, X4225 X4770, excluding all other columns
head(avocado2)
## Date AveragePrice Total.Volume Total.Bags Small.Bags Large.Bags
## 1 2018-12-02 1.22 27652426 10151173 7174281 2895343
## 2 2018-11-25 1.25 21855494 8290842 6016143 2183011
## 3 2018-11-18 1.14 29871135 12382973 9174635 3103023
## 4 2018-11-11 1.00 39042283 14698246 10721367 3865596
## 5 2018-11-04 1.01 38363283 15222197 10602849 4538743
## 6 2018-10-28 1.01 39402227 15865346 11437563 4320345
## XLarge.Bags ProductCode Volume
## 1 81547.92 X4046 9516477
## 2 91688.68 X4046 7140335
## 3 105315.78 X4046 9719229
## 4 111282.12 X4046 13355864
## 5 80604.81 X4046 12420772
## 6 107438.89 X4046 12142265
avocado3 <- avocado2 %>% gather(BagSize, Bags, -Date:-Total.Bags, -ProductCode, -Volume)
head(avocado3)
## Date AveragePrice Total.Volume Total.Bags ProductCode Volume
## 1 2018-12-02 1.22 27652426 10151173 X4046 9516477
## 2 2018-11-25 1.25 21855494 8290842 X4046 7140335
## 3 2018-11-18 1.14 29871135 12382973 X4046 9719229
## 4 2018-11-11 1.00 39042283 14698246 X4046 13355864
## 5 2018-11-04 1.01 38363283 15222197 X4046 12420772
## 6 2018-10-28 1.01 39402227 15865346 X4046 12142265
## BagSize Bags
## 1 Small.Bags 7174281
## 2 Small.Bags 6016143
## 3 Small.Bags 9174635
## 4 Small.Bags 10721367
## 5 Small.Bags 10602849
## 6 Small.Bags 11437563
Now we look at the structure again
str(avocado3)
## 'data.frame': 396 obs. of 8 variables:
## $ Date : Factor w/ 44 levels "2018-01-07","2018-01-14",..: 44 43 42 41 40 39 38 37 36 35 ...
## $ AveragePrice: num 1.22 1.25 1.14 1 1.01 1.01 1.13 1.08 1.08 1.1 ...
## $ Total.Volume: num 27652426 21855494 29871135 39042283 38363283 ...
## $ Total.Bags : num 10151173 8290842 12382973 14698246 15222197 ...
## $ ProductCode : chr "X4046" "X4046" "X4046" "X4046" ...
## $ Volume : num 9516477 7140335 9719229 13355864 12420772 ...
## $ BagSize : chr "Small.Bags" "Small.Bags" "Small.Bags" "Small.Bags" ...
## $ Bags : num 7174281 6016143 9174635 10721367 10602849 ...
Change the class of Date from “factor” to “Date” using lubridate package
avocado3$Date <- ymd(avocado3$Date)
class(avocado3$Date)
## [1] "Date"
Change ProductCode and BagSize from character to factors
avocado3 <- avocado3 %>% mutate_at(vars(ProductCode, BagSize), funs(as.factor))
class(avocado3$ProductCode)
## [1] "factor"
class(avocado3$BagSize)
## [1] "factor"
Let’s look at the distribution of avocado average price: - The average price is centered around 1.0 to 1.1
hist(avocado3$AveragePrice)
Relationship between total volume and average price showing on a scatterplot: - There is a decreasing linear relationship: the more the cheaper
plot(avocado3$Total.Volume, avocado3$AveragePrice)
Average Price changes throughout the year: - Higher price in January, September and December
avocado3 %>% select(Date, AveragePrice) %>% plot()
Let’s find out the price differences among the products:
# For each month, summarize the sum of volume and mean of average price, arrange from high to low price
avocado3 %>% group_by(Month = month(Date)) %>% summarize(TotalVolume = sum(Volume), AvgPrice = mean(AveragePrice)) %>% arrange(desc(AvgPrice))
## # A tibble: 12 x 3
## Month TotalVolume AvgPrice
## <dbl> <dbl> <dbl>
## 1 12 52503759. 1.22
## 2 9 337349911. 1.21
## 3 8 143750709. 1.17
## 4 1 319904974. 1.12
## 5 11 235613809. 1.1
## 6 10 267680249. 1.08
## 7 3 312853337. 1.06
## 8 6 348435216. 1.06
## 9 4 415004435. 1.06
## 10 7 254879204. 1.05
## 11 2 361425548. 0.995
## 12 5 386571538. 0.985
# For each month, summarize the sum of volume and mean of average price, arrange from high to low quantity
avocado3 %>% group_by(Month = month(Date)) %>% summarize(TotalVolume = sum(Volume), AvgPrice = mean(AveragePrice)) %>% arrange(desc(TotalVolume))
## # A tibble: 12 x 3
## Month TotalVolume AvgPrice
## <dbl> <dbl> <dbl>
## 1 4 415004435. 1.06
## 2 5 386571538. 0.985
## 3 2 361425548. 0.995
## 4 6 348435216. 1.06
## 5 9 337349911. 1.21
## 6 1 319904974. 1.12
## 7 3 312853337. 1.06
## 8 10 267680249. 1.08
## 9 7 254879204. 1.05
## 10 11 235613809. 1.1
## 11 8 143750709. 1.17
## 12 12 52503759. 1.22
- In 2018, December is the month in which avocado had the highest price and lowest volume.
- In 2018, May is the month in which avocado had the lowest price and second highest volume.