Goal of this project:

- Practice in preparing different datasets for downstream analysis work.

Our task is to choose any three of the “wide” datasets identified in the Week 6 Discussion items.For each of the three chosen datasets:

  • Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.

  • Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

  • Perform the analysis requested in the discussion item.

  • The code is in an R Markdown file, posted to rpubs.com, which includes narrative descriptions of data cleanup work, analysis, and conclusions.


Dataset: Retail Volume & Avocado Price for 2018

source: http://www.hassavocadoboard.com/retail/volume-and-price-data


Read the Data

Load packages

library(tidyr)
library(dplyr)
library(lubridate)

Read csv file into R

avocado <- read.csv("https://raw.githubusercontent.com/miachen410/DATA607/master/avocado.csv")

Take a look at the first 6 rows of the dataset

head(avocado)
##         Date AveragePrice Total.Volume    X4046    X4225    X4770
## 1 2018-12-02         1.22     27652426  9516477  7462866 521910.0
## 2 2018-11-25         1.25     21855494  7140335  6021450 402865.8
## 3 2018-11-18         1.14     29871135  9719229  7271528 497404.3
## 4 2018-11-11         1.00     39042283 13355864 10385141 603032.8
## 5 2018-11-04         1.01     38363283 12420772 10192035 528277.9
## 6 2018-10-28         1.01     39402227 12142265 10845124 549490.8
##   Total.Bags Small.Bags Large.Bags XLarge.Bags
## 1   10151173    7174281    2895343    81547.92
## 2    8290842    6016143    2183011    91688.68
## 3   12382973    9174635    3103023   105315.78
## 4   14698246   10721367    3865596   111282.12
## 5   15222197   10602849    4538743    80604.81
## 6   15865346   11437563    4320345   107438.89

Take a look at the last 6 rows of the dataset

tail(avocado)
##          Date AveragePrice Total.Volume    X4046    X4225     X4770
## 39 2018-02-11         0.97     43167806 15870678 11541845  652856.6
## 40 2018-02-04         0.87     62505647 21620181 20445501 1066830.2
## 41 2018-01-28         1.09     40171641 14551800 12119885  575974.7
## 42 2018-01-21         1.08     42939822 14218844 13929702  928815.1
## 43 2018-01-14         1.20     37299945 12600918 11866198  652808.4
## 44 2018-01-07         1.13     36703157 13730993 10781339  677714.9
##    Total.Bags Small.Bags Large.Bags XLarge.Bags
## 39   15102427   10844852    4023485    234089.7
## 40   19373134   13384587    5719097    269451.0
## 41   12923982    9749412    3041125    133444.4
## 42   13862460    9866218    3789723    206519.3
## 43   12180021    8128242    3917570    134208.9
## 44   11513110    8231766    3130919    150424.6

Let’s look at the structure of the dataset

str(avocado)
## 'data.frame':    44 obs. of  10 variables:
##  $ Date        : Factor w/ 44 levels "2018-01-07","2018-01-14",..: 44 43 42 41 40 39 38 37 36 35 ...
##  $ AveragePrice: num  1.22 1.25 1.14 1 1.01 1.01 1.13 1.08 1.08 1.1 ...
##  $ Total.Volume: num  27652426 21855494 29871135 39042283 38363283 ...
##  $ X4046       : num  9516477 7140335 9719229 13355864 12420772 ...
##  $ X4225       : num  7462866 6021450 7271528 10385141 10192035 ...
##  $ X4770       : num  521910 402866 497404 603033 528278 ...
##  $ Total.Bags  : num  10151173 8290842 12382973 14698246 15222197 ...
##  $ Small.Bags  : num  7174281 6016143 9174635 10721367 10602849 ...
##  $ Large.Bags  : num  2895343 2183011 3103023 3865596 4538743 ...
##  $ XLarge.Bags : num  81548 91689 105316 111282 80605 ...

Let’s look at the summary of the dataset

summary(avocado)
##          Date     AveragePrice    Total.Volume          X4046         
##  2018-01-07: 1   Min.   :0.870   Min.   :21855494   Min.   : 7140335  
##  2018-01-14: 1   1st Qu.:1.030   1st Qu.:37192349   1st Qu.:11872406  
##  2018-01-21: 1   Median :1.070   Median :40738473   Median :13719191  
##  2018-01-28: 1   Mean   :1.083   Mean   :41058219   Mean   :13527481  
##  2018-02-04: 1   3rd Qu.:1.130   3rd Qu.:44350184   3rd Qu.:14710288  
##  2018-02-11: 1   Max.   :1.290   Max.   :63716144   Max.   :21620181  
##  (Other)   :38                                                        
##      X4225              X4770           Total.Bags      
##  Min.   : 6021450   Min.   : 402866   Min.   : 8290842  
##  1st Qu.:10407883   1st Qu.: 634174   1st Qu.:13672216  
##  Median :11588285   Median : 734278   Median :15212227  
##  Mean   :11758540   Mean   : 744075   Mean   :15028123  
##  3rd Qu.:12711307   3rd Qu.: 803143   3rd Qu.:16247283  
##  Max.   :20445501   Max.   :1169256   Max.   :21625373  
##                                                         
##    Small.Bags         Large.Bags       XLarge.Bags    
##  Min.   : 6016143   Min.   :2183011   Min.   : 80605  
##  1st Qu.: 9845629   1st Qu.:3500838   1st Qu.:199184  
##  Median :11045183   Median :3929403   Median :267982  
##  Mean   :10800320   Mean   :3986979   Mean   :240824  
##  3rd Qu.:11709426   3rd Qu.:4470543   3rd Qu.:298018  
##  Max.   :15436247   Max.   :5840415   Max.   :420065  
## 

Data tidying and transformation

Gather columns(Part 1)

  • Columns X4046, X4225 X4770 should be gathered under the variable “ProductCode”
  • We will use the gather function in tidyr to turn the wide table into long table: Product Code and Volume show as key-value pairs
avocado2 <- avocado %>% gather(ProductCode, Volume, -Date:-Total.Volume, -Total.Bags:-XLarge.Bags)
# New column ProductCode, with values gather from X4046, X4225 X4770, excluding all other columns
head(avocado2)
##         Date AveragePrice Total.Volume Total.Bags Small.Bags Large.Bags
## 1 2018-12-02         1.22     27652426   10151173    7174281    2895343
## 2 2018-11-25         1.25     21855494    8290842    6016143    2183011
## 3 2018-11-18         1.14     29871135   12382973    9174635    3103023
## 4 2018-11-11         1.00     39042283   14698246   10721367    3865596
## 5 2018-11-04         1.01     38363283   15222197   10602849    4538743
## 6 2018-10-28         1.01     39402227   15865346   11437563    4320345
##   XLarge.Bags ProductCode   Volume
## 1    81547.92       X4046  9516477
## 2    91688.68       X4046  7140335
## 3   105315.78       X4046  9719229
## 4   111282.12       X4046 13355864
## 5    80604.81       X4046 12420772
## 6   107438.89       X4046 12142265

Gather columns(Part 2)

  • Columns Small.Bags, Large.Bags, XLarge.Bages should be gathered under the variable “BagSize”
  • Again, we will use the gather function in tidyr
avocado3 <- avocado2 %>% gather(BagSize, Bags, -Date:-Total.Bags, -ProductCode, -Volume)
head(avocado3)
##         Date AveragePrice Total.Volume Total.Bags ProductCode   Volume
## 1 2018-12-02         1.22     27652426   10151173       X4046  9516477
## 2 2018-11-25         1.25     21855494    8290842       X4046  7140335
## 3 2018-11-18         1.14     29871135   12382973       X4046  9719229
## 4 2018-11-11         1.00     39042283   14698246       X4046 13355864
## 5 2018-11-04         1.01     38363283   15222197       X4046 12420772
## 6 2018-10-28         1.01     39402227   15865346       X4046 12142265
##      BagSize     Bags
## 1 Small.Bags  7174281
## 2 Small.Bags  6016143
## 3 Small.Bags  9174635
## 4 Small.Bags 10721367
## 5 Small.Bags 10602849
## 6 Small.Bags 11437563

Now we look at the structure again

str(avocado3)
## 'data.frame':    396 obs. of  8 variables:
##  $ Date        : Factor w/ 44 levels "2018-01-07","2018-01-14",..: 44 43 42 41 40 39 38 37 36 35 ...
##  $ AveragePrice: num  1.22 1.25 1.14 1 1.01 1.01 1.13 1.08 1.08 1.1 ...
##  $ Total.Volume: num  27652426 21855494 29871135 39042283 38363283 ...
##  $ Total.Bags  : num  10151173 8290842 12382973 14698246 15222197 ...
##  $ ProductCode : chr  "X4046" "X4046" "X4046" "X4046" ...
##  $ Volume      : num  9516477 7140335 9719229 13355864 12420772 ...
##  $ BagSize     : chr  "Small.Bags" "Small.Bags" "Small.Bags" "Small.Bags" ...
##  $ Bags        : num  7174281 6016143 9174635 10721367 10602849 ...

Change the class of Date from “factor” to “Date” using lubridate package

avocado3$Date <- ymd(avocado3$Date)
class(avocado3$Date)
## [1] "Date"

Change ProductCode and BagSize from character to factors

avocado3 <- avocado3 %>% mutate_at(vars(ProductCode, BagSize), funs(as.factor))
class(avocado3$ProductCode)
## [1] "factor"
class(avocado3$BagSize)
## [1] "factor"

Data Analysis

Let’s look at the distribution of avocado average price: - The average price is centered around 1.0 to 1.1

hist(avocado3$AveragePrice)

Relationship between total volume and average price showing on a scatterplot: - There is a decreasing linear relationship: the more the cheaper

plot(avocado3$Total.Volume, avocado3$AveragePrice)

Average Price changes throughout the year: - Higher price in January, September and December

avocado3 %>% select(Date, AveragePrice) %>% plot()

Let’s find out the price differences among the products:

# For each month, summarize the sum of volume and mean of average price, arrange from high to low price
avocado3 %>% group_by(Month = month(Date)) %>% summarize(TotalVolume = sum(Volume), AvgPrice = mean(AveragePrice)) %>% arrange(desc(AvgPrice))
## # A tibble: 12 x 3
##    Month TotalVolume AvgPrice
##    <dbl>       <dbl>    <dbl>
##  1    12   52503759.    1.22 
##  2     9  337349911.    1.21 
##  3     8  143750709.    1.17 
##  4     1  319904974.    1.12 
##  5    11  235613809.    1.1  
##  6    10  267680249.    1.08 
##  7     3  312853337.    1.06 
##  8     6  348435216.    1.06 
##  9     4  415004435.    1.06 
## 10     7  254879204.    1.05 
## 11     2  361425548.    0.995
## 12     5  386571538.    0.985
# For each month, summarize the sum of volume and mean of average price, arrange from high to low quantity
avocado3 %>% group_by(Month = month(Date)) %>% summarize(TotalVolume = sum(Volume), AvgPrice = mean(AveragePrice)) %>% arrange(desc(TotalVolume))
## # A tibble: 12 x 3
##    Month TotalVolume AvgPrice
##    <dbl>       <dbl>    <dbl>
##  1     4  415004435.    1.06 
##  2     5  386571538.    0.985
##  3     2  361425548.    0.995
##  4     6  348435216.    1.06 
##  5     9  337349911.    1.21 
##  6     1  319904974.    1.12 
##  7     3  312853337.    1.06 
##  8    10  267680249.    1.08 
##  9     7  254879204.    1.05 
## 10    11  235613809.    1.1  
## 11     8  143750709.    1.17 
## 12    12   52503759.    1.22

Conclusion:

  • In 2018, December is the month in which avocado had the highest price and lowest volume.
  • In 2018, May is the month in which avocado had the lowest price and second highest volume.