Goal of this project:

- Practice in preparing different datasets for downstream analysis work.

Our task is to choose any three of the “wide” datasets identified in the Week 6 Discussion items. For each of the three chosen datasets:

  • Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below. (A minimal MySQL sketch follows this list.)

  • Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

  • Perform the analysis requested in the discussion item.

  • Post the code in an R Markdown file to rpubs.com, including narrative descriptions of the data cleanup work, analysis, and conclusions.
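
For the optional MySQL route, here is a minimal sketch assuming the DBI and RMySQL packages and a local MySQL server; the connection details and the wide_df data frame name are placeholders for illustration only.

library(DBI)
# Connect to a local MySQL server (credentials below are placeholders)
con <- dbConnect(RMySQL::MySQL(), dbname = "project2", host = "localhost",
                 user = "user", password = "password")
# wide_df stands for any of the wide data frames prepared below
dbWriteTable(con, "wide_dataset", wide_df, overwrite = TRUE)
wide_from_db <- dbReadTable(con, "wide_dataset")
dbDisconnect(con)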


Dataset 1: Steel Imports Data from U.S. Census

Load the required libraries for tidying and modifying the data for further analysis

knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)

Read the data from the CSV file, skipping the first 15 header rows and keeping the first 47 data rows

steel <- read.csv('https://raw.githubusercontent.com/weizhou2273/DATA607/master/projects/project%202/steel_import_1.csv',skip=15, stringsAsFactors = F, header= FALSE)[1:47,]

DATA CLEANING

Rename columns

names(steel) <- c('Country_areas','201810_prequantity','201810_prevalue',
                                  '201809_prequantity','201809_prevalue',
                                  '201809_finalquantity','201809_finalvalue',
                                  '201710_finalquantity','201710_finalvalue',
                                  '201709_finalquantity','201709_finalvalue')
head(steel)
##    Country_areas 201810_prequantity 201810_prevalue 201809_prequantity
## 1  North America             739346          673946             637423
## 2         Canada             409300          403874             380465
## 3         Mexico             330046          270073             256958
## 4         Europe             682053          738322             673569
## 5 European Union             427571          541635             390937
## 6        Austria              25793           38920              25552
##   201809_prevalue 201809_finalquantity 201809_finalvalue
## 1          611864               638543            613821
## 2          380754               380484            380790
## 3          231110               258059            233032
## 4          681575               675161            682885
## 5          487953               392240            488745
## 6           35841                25621             35954
##   201710_finalquantity 201710_finalvalue 201709_finalquantity
## 1               743537            665119               729999
## 2               482878            454037               435284
## 3               260659            211081               294716
## 4               820708            794907               804507
## 5               435696            563048               455216
## 6                17302             30929                25836
##   201709_finalvalue
## 1            639544
## 2            414721
## 3            224823
## 4            750333
## 5            536615
## 6             41803

Replace “(-)” with NA and convert columns 2 through 11 to numeric

steel[steel=='(-)']<-NA
steel[,2:11] = steel%>%
      select(2:11)%>%
      # mutate_all(funs(str_replace(., ",", "")))%>%  
      mutate_all(as.numeric)
head(steel)
##    Country_areas 201810_prequantity 201810_prevalue 201809_prequantity
## 1  North America             739346          673946             637423
## 2         Canada             409300          403874             380465
## 3         Mexico             330046          270073             256958
## 4         Europe             682053          738322             673569
## 5 European Union             427571          541635             390937
## 6        Austria              25793           38920              25552
##   201809_prevalue 201809_finalquantity 201809_finalvalue
## 1          611864               638543            613821
## 2          380754               380484            380790
## 3          231110               258059            233032
## 4          681575               675161            682885
## 5          487953               392240            488745
## 6           35841                25621             35954
##   201710_finalquantity 201710_finalvalue 201709_finalquantity
## 1               743537            665119               729999
## 2               482878            454037               435284
## 3               260659            211081               294716
## 4               820708            794907               804507
## 5               435696            563048               455216
## 6                17302             30929                25836
##   201709_finalvalue
## 1            639544
## 2            414721
## 3            224823
## 4            750333
## 5            536615
## 6             41803
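
If a raw export contains comma thousands separators (the commented-out str_replace line above hints at this), readr::parse_number() is an alternative way to strip them while coercing to numeric. This is only a sketch of an alternative to the conversion above, assuming the readr package; the “(-)” placeholders still need the NA replacement first, or they will trigger parsing warnings.

library(readr)
steel[, 2:11] <- steel %>%
  select(2:11) %>%
  mutate_all(parse_number)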

Transform the dataset: gather it into long form, then spread it back out by quantity/value type and by year/month

steel_final_1 = gather(steel,year_month_type, Number,2:11, factor_key = TRUE)%>%
                separate(year_month_type, c('year_month','quant_value'),sep='_')%>%
                spread(quant_value,'Number')
head(steel_final_1,10)
##    Country_areas year_month finalquantity finalvalue prequantity prevalue
## 1      Argentina     201709         13724      16354          NA       NA
## 2      Argentina     201710         20635      21591          NA       NA
## 3      Argentina     201809         12559      18054       12559    18054
## 4      Argentina     201810            NA         NA       10130    12041
## 5      Australia     201709         22087      17841          NA       NA
## 6      Australia     201710         21412      15981          NA       NA
## 7      Australia     201809         18775      17519       18775    17519
## 8      Australia     201810            NA         NA       28739    28244
## 9        Austria     201709         25836      41803          NA       NA
## 10       Austria     201710         17302      30929          NA       NA
steel_final_2 = gather(steel,year_month_type, Number,2:11, factor_key = TRUE)%>%
                separate(year_month_type, c('year_month','quant_value'),sep='_')%>%
                spread(year_month,'Number')
head(steel_final_2,10)
##    Country_areas   quant_value 201709 201710 201809 201810
## 1      Argentina finalquantity  13724  20635  12559     NA
## 2      Argentina    finalvalue  16354  21591  18054     NA
## 3      Argentina   prequantity     NA     NA  12559  10130
## 4      Argentina      prevalue     NA     NA  18054  12041
## 5      Australia finalquantity  22087  21412  18775     NA
## 6      Australia    finalvalue  17841  15981  17519     NA
## 7      Australia   prequantity     NA     NA  18775  28739
## 8      Australia      prevalue     NA     NA  17519  28244
## 9        Austria finalquantity  25836  17302  25621     NA
## 10       Austria    finalvalue  41803  30929  35954     NA
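
As a side note, the same reshaping can be expressed with the newer tidyr verbs pivot_longer() and pivot_wider() (tidyr >= 1.0.0). This is only a sketch of an equivalent pipeline, not the code that produced the tables above.

steel_long <- steel %>%
  pivot_longer(cols = 2:11,
               names_to = c("year_month", "quant_value"),
               names_sep = "_",
               values_to = "Number")
# Wide on quantity/value type, analogous to steel_final_1
steel_final_1b <- pivot_wider(steel_long, names_from = quant_value, values_from = Number)
# Wide on year/month, analogous to steel_final_2
steel_final_2b <- pivot_wider(steel_long, names_from = year_month, values_from = Number)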

DATA ANALYSIS

Questions:

1. Which country/area has the largest overall final quantity / value?
final_quantity = steel_final_2%>%
  filter(quant_value =='finalquantity')%>%
  rowwise()%>%
  mutate(finalquantity_overall= sum(`201709`,`201710`,`201809`,`201810`,na.rm=TRUE))%>%
  arrange(desc(finalquantity_overall))
head(final_quantity)
## # A tibble: 6 x 7
##   Country_areas quant_value `201709` `201710` `201809` `201810`
##   <chr>         <chr>          <dbl>    <dbl>    <dbl>    <dbl>
## 1 Europe        finalquant~   804507   820708   675161       NA
## 2 North America finalquant~   729999   743537   638543       NA
## 3 Pacific Rim ~ finalquant~   691501   662335   418794       NA
## 4 Canada        finalquant~   435284   482878   380484       NA
## 5 European Uni~ finalquant~   455216   435696   392240       NA
## 6 Euro Area     finalquant~   396011   358825   292800       NA
## # ... with 1 more variable: finalquantity_overall <dbl>
final_value = steel_final_2%>%
  filter(quant_value =='finalvalue')%>%
  rowwise()%>%
  mutate(finalvalue_overall= sum(`201709`,`201710`,`201809`,`201810`,na.rm=TRUE))%>%
  arrange(desc(finalvalue_overall))
head(final_value)
## # A tibble: 6 x 7
##   Country_areas quant_value `201709` `201710` `201809` `201810`
##   <chr>         <chr>          <dbl>    <dbl>    <dbl>    <dbl>
## 1 Europe        finalvalue    750333   794907   682885       NA
## 2 North America finalvalue    639544   665119   613821       NA
## 3 Pacific Rim ~ finalvalue    667787   657377   473571       NA
## 4 European Uni~ finalvalue    536615   563048   488745       NA
## 5 Euro Area     finalvalue    447539   450147   364049       NA
## 6 Canada        finalvalue    414721   454037   380790       NA
## # ... with 1 more variable: finalvalue_overall <dbl>

Europe has the highest overall quantity.

final_quantity$Country_areas <- factor(final_quantity$Country_areas, levels = final_quantity$Country_areas[order(final_quantity$finalquantity_overall)])

 g=ggplot(final_quantity,aes(y=finalquantity_overall,x=Country_areas,label=Country_areas))
 g+geom_bar(stat='identity', aes(fill=finalquantity_overall), width=.5)+
   coord_flip()

Europe has the highest overall value.

final_value$Country_areas <- factor(final_value$Country_areas, levels = final_value$Country_areas[order(final_value$finalvalue_overall)])

 g=ggplot(final_value,aes(y=finalvalue_overall,x=Country_areas,label=Country_areas))
 g+geom_bar(stat='identity', aes(fill=finalvalue_overall), width=.5)+
   coord_flip()

2. Which country has the highest unit value (Value / Quantity)?

Brazil has the highest unit price, while Singapore has the lowest.

unit_value=data.frame(final_value$Country_areas,final_quantity$finalquantity_overall,final_value$finalvalue_overall)
names(unit_value) = c('country','quantity','value')
unit_value=unit_value%>%
  mutate(unit_value = quantity /value )
  
unit_value$country <- factor(unit_value$country, levels = unit_value$country[order(unit_value$unit_value)])

g1=ggplot(unit_value,aes(y=unit_value,x=country,label=country))
 g1+geom_bar(stat='identity', aes(fill=unit_value), width=.5)+
   coord_flip()
## Warning: Removed 1 rows containing missing values (position_stack).

3. What does the unit value distribution look like?

The unit value distribution is approximately normal with a slight left skew.
The average unit value is 0.9373.

ggplot(unit_value, aes(x=unit_value)) + 
  geom_histogram(binwidth = 0.1, color="darkblue", fill="lightblue")+
  geom_density(alpha=.2, fill="#FF6666") 
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing non-finite values (stat_density).

summary(unit_value)
##            country      quantity           value           unit_value    
##  Singapore     : 1   Min.   :      0   Min.   :      0   Min.   :0.2360  
##  Colombia      : 1   1st Qu.:  27939   1st Qu.:  31471   1st Qu.:0.8046  
##  Hong Kong     : 1   Median :  86679   Median : 104412   Median :0.9368  
##  Norway        : 1   Mean   : 377578   Mean   : 361586   Mean   :0.9373  
##  Poland        : 1   3rd Qu.: 515234   3rd Qu.: 407678   3rd Qu.:1.0529  
##  United Kingdom: 1   Max.   :2300376   Max.   :2228125   Max.   :1.4819  
##  (Other)       :41                                       NA's   :1
4. What is the difference between the preliminary and final quantity?

Most of the quantity differences fall in the 0–100 bin; the distribution is right skewed.

diff  = steel_final_1%>%
  mutate(quantity_diff = finalquantity - prequantity,
         value_diff = finalvalue - prevalue)%>%
  filter(!is.na(quantity_diff))

ggplot(diff, aes(x=quantity_diff)) +
  geom_histogram(binwidth = 100, color="darkblue", fill="lightblue")+
  geom_density(alpha=.2, fill="#FF6666")
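
A quick numeric check of that claim is sketched below (not part of the original output): summary() shows the spread of the differences, and the share of positive differences shows how often the final quantity exceeds the preliminary figure.

summary(diff$quantity_diff)
# Proportion of observations where the final quantity exceeds the preliminary quantity
mean(diff$quantity_diff > 0, na.rm = TRUE)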


Dataset 2: Zillow data on 2018 Median Rental Price/sqft by State by Sunny Mehta

Load the CSV dataset into R

url <- "https://raw.githubusercontent.com/Weicaidata/607/master/State_MedianRentalPricePerSqft_AllHomes.csv"
df1 <- read.csv(url,header=TRUE)
head(df1)
##     RegionName SizeRank  X2018.01  X2018.02  X2018.03  X2018.04  X2018.05
## 1   California        1 1.8674136 1.8846154 1.9001086 1.9235033 1.9478528
## 2        Texas        2 0.9056244 0.9134432 0.9237706 0.9344262 0.9387974
## 3     New York        3 4.2682584 4.2000000 3.9166667 4.1900000 4.3078412
## 4      Florida        4 1.3180646 1.3242778 1.3303769 1.3429752 1.3420245
## 5     Illinois        5 1.2173913 1.2288786 1.2594270 1.2901892 1.2925563
## 6 Pennsylvania        6 1.0256410 1.0351064 1.0568107 1.0817308 1.0890381
##   X2018.06  X2018.07  X2018.08  X2018.09  X2018.10  X2018.11  X2018.12
## 1 1.956392 1.9519520 1.9389239 1.9215785 1.9117647 1.9245725 1.9344438
## 2 0.940000 0.9373203 0.9320449 0.9276438 0.9252537 0.9281437 0.9318022
## 3 4.042553 3.8974345 3.4742328 3.2467532 3.1003876 3.1424668 3.5211268
## 4 1.333333 1.3353116 1.3493800 1.3536748 1.3489209 1.3542038 1.3584906
## 5 1.299667 1.2840895 1.2759585 1.2681159 1.2653354 1.2568210 1.2536585
## 6 1.080456 1.0574495 1.0512691 1.0416667 1.0341616 1.0380623 1.0416667

Rename the columns for easier interpretation; the letter prefixes (a through l) keep the months in calendar order when sorted alphabetically

colnames(df1) <- c("state","size","aJan","bFeb","cMar","dApr","eMay","fJun","gJul","hAug","iSep","jOct","kNov","lDec")
head(df1)
##          state size      aJan      bFeb      cMar      dApr      eMay
## 1   California    1 1.8674136 1.8846154 1.9001086 1.9235033 1.9478528
## 2        Texas    2 0.9056244 0.9134432 0.9237706 0.9344262 0.9387974
## 3     New York    3 4.2682584 4.2000000 3.9166667 4.1900000 4.3078412
## 4      Florida    4 1.3180646 1.3242778 1.3303769 1.3429752 1.3420245
## 5     Illinois    5 1.2173913 1.2288786 1.2594270 1.2901892 1.2925563
## 6 Pennsylvania    6 1.0256410 1.0351064 1.0568107 1.0817308 1.0890381
##       fJun      gJul      hAug      iSep      jOct      kNov      lDec
## 1 1.956392 1.9519520 1.9389239 1.9215785 1.9117647 1.9245725 1.9344438
## 2 0.940000 0.9373203 0.9320449 0.9276438 0.9252537 0.9281437 0.9318022
## 3 4.042553 3.8974345 3.4742328 3.2467532 3.1003876 3.1424668 3.5211268
## 4 1.333333 1.3353116 1.3493800 1.3536748 1.3489209 1.3542038 1.3584906
## 5 1.299667 1.2840895 1.2759585 1.2681159 1.2653354 1.2568210 1.2536585
## 6 1.080456 1.0574495 1.0512691 1.0416667 1.0341616 1.0380623 1.0416667

Use the glimpse function from dplyr to examine the dataset, and the summary function to check for missing values

glimpse(df1)
## Observations: 51
## Variables: 14
## $ state <fct> California, Texas, New York, Florida, Illinois, Pennsylv...
## $ size  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
## $ aJan  <dbl> 1.8674136, 0.9056244, 4.2682584, 1.3180646, 1.2173913, 1...
## $ bFeb  <dbl> 1.8846154, 0.9134432, 4.2000000, 1.3242778, 1.2288786, 1...
## $ cMar  <dbl> 1.9001086, 0.9237706, 3.9166667, 1.3303769, 1.2594270, 1...
## $ dApr  <dbl> 1.9235033, 0.9344262, 4.1900000, 1.3429752, 1.2901892, 1...
## $ eMay  <dbl> 1.9478528, 0.9387974, 4.3078412, 1.3420245, 1.2925563, 1...
## $ fJun  <dbl> 1.9563919, 0.9400000, 4.0425532, 1.3333333, 1.2996666, 1...
## $ gJul  <dbl> 1.9519520, 0.9373203, 3.8974345, 1.3353116, 1.2840895, 1...
## $ hAug  <dbl> 1.9389239, 0.9320449, 3.4742328, 1.3493800, 1.2759585, 1...
## $ iSep  <dbl> 1.9215785, 0.9276438, 3.2467532, 1.3536748, 1.2681159, 1...
## $ jOct  <dbl> 1.9117647, 0.9252537, 3.1003876, 1.3489209, 1.2653354, 1...
## $ kNov  <dbl> 1.9245725, 0.9281437, 3.1424668, 1.3542038, 1.2568210, 1...
## $ lDec  <dbl> 1.9344438, 0.9318022, 3.5211268, 1.3584906, 1.2536585, 1...
summary(df1)
##         state         size           aJan             bFeb       
##  Alabama   : 1   Min.   : 1.0   Min.   :0.7045   Min.   :0.7082  
##  Alaska    : 1   1st Qu.:13.5   1st Qu.:0.8461   1st Qu.:0.8577  
##  Arizona   : 1   Median :26.0   Median :0.9665   Median :0.9722  
##  Arkansas  : 1   Mean   :26.0   Mean   :1.1827   Mean   :1.1938  
##  California: 1   3rd Qu.:38.5   3rd Qu.:1.3010   3rd Qu.:1.2932  
##  Colorado  : 1   Max.   :51.0   Max.   :4.2683   Max.   :4.2000  
##  (Other)   :45                                                   
##       cMar             dApr             eMay             fJun       
##  Min.   :0.7186   Min.   :0.7277   Min.   :0.7246   Min.   :0.7292  
##  1st Qu.:0.8702   1st Qu.:0.8852   1st Qu.:0.8964   1st Qu.:0.8856  
##  Median :0.9900   Median :0.9958   Median :1.0000   Median :0.9997  
##  Mean   :1.2019   Mean   :1.2258   Mean   :1.2326   Mean   :1.2268  
##  3rd Qu.:1.3188   3rd Qu.:1.3363   3rd Qu.:1.3373   3rd Qu.:1.3333  
##  Max.   :3.9167   Max.   :4.1900   Max.   :4.3078   Max.   :4.0426  
##                                                                     
##       gJul             hAug             iSep             jOct       
##  Min.   :0.7283   Min.   :0.7246   Min.   :0.7252   Min.   :0.7253  
##  1st Qu.:0.8832   1st Qu.:0.8786   1st Qu.:0.8760   1st Qu.:0.8799  
##  Median :0.9908   Median :0.9793   Median :0.9759   Median :0.9912  
##  Mean   :1.2148   Mean   :1.1995   Mean   :1.1896   Mean   :1.1794  
##  3rd Qu.:1.3189   3rd Qu.:1.3193   3rd Qu.:1.3106   3rd Qu.:1.3078  
##  Max.   :3.8974   Max.   :3.4742   Max.   :3.2468   Max.   :3.1004  
##                                                                     
##       kNov             lDec       
##  Min.   :0.7333   Min.   :0.7323  
##  1st Qu.:0.8739   1st Qu.:0.8731  
##  Median :0.9866   Median :0.9927  
##  Mean   :1.1789   Mean   :1.1905  
##  3rd Qu.:1.3087   3rd Qu.:1.3090  
##  Max.   :3.1425   Max.   :3.5211  
## 

Use the gather function from tidyr to reshape the dataset from wide to long format, then sort by “size”

df2 <- gather(df1,"month","sales",3:14) %>% 
  arrange(size)
head(df2)
##        state size month    sales
## 1 California    1  aJan 1.867414
## 2 California    1  bFeb 1.884615
## 3 California    1  cMar 1.900109
## 4 California    1  dApr 1.923503
## 5 California    1  eMay 1.947853
## 6 California    1  fJun 1.956392
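
As an alternative to the letter-prefix renaming, one could keep Zillow’s original X2018.MM column names and parse them into real dates. The sketch below assumes the lubridate package and re-reads the raw file, so it is independent of the renamed df1.

df_alt <- read.csv(url, header = TRUE) %>%
  rename(state = RegionName, size = SizeRank) %>%
  gather("month", "sales", -state, -size) %>%
  mutate(month = ymd(paste0(sub("^X", "", month), ".01")))  # "X2018.01" -> 2018-01-01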

Answer the question “overall current distribution”

We can use a histogram to see that the distribution is right skewed, and the summary function to find the mean, minimum, maximum, and first and third quartiles.

hist(df2$sales,main="sales",xlab="Average")

summary(df2$sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7045  0.8753  0.9875  1.2013  1.3169  4.3078

Answer the question “sales trend by months”

Use the summarise function, grouped by month, to find the average price for each month.

df3 <- df2 %>% 
  arrange(month) %>% 
  group_by(month) %>% 
  summarise(avg=mean(sales))
df3
## # A tibble: 12 x 2
##    month   avg
##    <chr> <dbl>
##  1 aJan   1.18
##  2 bFeb   1.19
##  3 cMar   1.20
##  4 dApr   1.23
##  5 eMay   1.23
##  6 fJun   1.23
##  7 gJul   1.21
##  8 hAug   1.20
##  9 iSep   1.19
## 10 jOct   1.18
## 11 kNov   1.18
## 12 lDec   1.19

From the graph below, we can see that summer is the high season for rental prices and winter is the low season, which is in line with our expectations.

ggplot(data=df3,aes(x=month,y=avg,group=1))+
  geom_line()+
  geom_point()+
  xlab("Month")+
  ylab("Total")+
  labs(title="Sales trend by month")

answer question" trend by states"

We want to look at the rental price trend over the year for the five largest states.

df6 <- df2 %>% 
  filter(size<=5)
head(df6)
##        state size month    sales
## 1 California    1  aJan 1.867414
## 2 California    1  bFeb 1.884615
## 3 California    1  cMar 1.900109
## 4 California    1  dApr 1.923503
## 5 California    1  eMay 1.947853
## 6 California    1  fJun 1.956392

New York has the highest rental price per sqft compared to the other states, but the prices in the other states are more stable.

ggplot(df6,aes(x=month,y=sales,color=state,group=state))+
   geom_line() +
  xlab('Month')+labs(title = "Top 5 states sales trend")


Dataset 3: Retail Volume & Avocado Price for 2018

source: http://www.hassavocadoboard.com/retail/volume-and-price-data

Read the CSV file into R

avocado <- read.csv("https://raw.githubusercontent.com/miachen410/DATA607/master/avocado.csv")

Take a look at the first 6 rows of the dataset

head(avocado)
##         Date AveragePrice Total.Volume    X4046    X4225    X4770
## 1 2018-12-02         1.22     27652426  9516477  7462866 521910.0
## 2 2018-11-25         1.25     21855494  7140335  6021450 402865.8
## 3 2018-11-18         1.14     29871135  9719229  7271528 497404.3
## 4 2018-11-11         1.00     39042283 13355864 10385141 603032.8
## 5 2018-11-04         1.01     38363283 12420772 10192035 528277.9
## 6 2018-10-28         1.01     39402227 12142265 10845124 549490.8
##   Total.Bags Small.Bags Large.Bags XLarge.Bags
## 1   10151173    7174281    2895343    81547.92
## 2    8290842    6016143    2183011    91688.68
## 3   12382973    9174635    3103023   105315.78
## 4   14698246   10721367    3865596   111282.12
## 5   15222197   10602849    4538743    80604.81
## 6   15865346   11437563    4320345   107438.89

Take a look at the last 6 rows of the dataset

tail(avocado)
##          Date AveragePrice Total.Volume    X4046    X4225     X4770
## 39 2018-02-11         0.97     43167806 15870678 11541845  652856.6
## 40 2018-02-04         0.87     62505647 21620181 20445501 1066830.2
## 41 2018-01-28         1.09     40171641 14551800 12119885  575974.7
## 42 2018-01-21         1.08     42939822 14218844 13929702  928815.1
## 43 2018-01-14         1.20     37299945 12600918 11866198  652808.4
## 44 2018-01-07         1.13     36703157 13730993 10781339  677714.9
##    Total.Bags Small.Bags Large.Bags XLarge.Bags
## 39   15102427   10844852    4023485    234089.7
## 40   19373134   13384587    5719097    269451.0
## 41   12923982    9749412    3041125    133444.4
## 42   13862460    9866218    3789723    206519.3
## 43   12180021    8128242    3917570    134208.9
## 44   11513110    8231766    3130919    150424.6

Let’s look at the structure of the dataset

str(avocado)
## 'data.frame':    44 obs. of  10 variables:
##  $ Date        : Factor w/ 44 levels "2018-01-07","2018-01-14",..: 44 43 42 41 40 39 38 37 36 35 ...
##  $ AveragePrice: num  1.22 1.25 1.14 1 1.01 1.01 1.13 1.08 1.08 1.1 ...
##  $ Total.Volume: num  27652426 21855494 29871135 39042283 38363283 ...
##  $ X4046       : num  9516477 7140335 9719229 13355864 12420772 ...
##  $ X4225       : num  7462866 6021450 7271528 10385141 10192035 ...
##  $ X4770       : num  521910 402866 497404 603033 528278 ...
##  $ Total.Bags  : num  10151173 8290842 12382973 14698246 15222197 ...
##  $ Small.Bags  : num  7174281 6016143 9174635 10721367 10602849 ...
##  $ Large.Bags  : num  2895343 2183011 3103023 3865596 4538743 ...
##  $ XLarge.Bags : num  81548 91689 105316 111282 80605 ...

Let’s look at the summary of the dataset

summary(avocado)
##          Date     AveragePrice    Total.Volume          X4046         
##  2018-01-07: 1   Min.   :0.870   Min.   :21855494   Min.   : 7140335  
##  2018-01-14: 1   1st Qu.:1.030   1st Qu.:37192349   1st Qu.:11872406  
##  2018-01-21: 1   Median :1.070   Median :40738473   Median :13719191  
##  2018-01-28: 1   Mean   :1.083   Mean   :41058219   Mean   :13527481  
##  2018-02-04: 1   3rd Qu.:1.130   3rd Qu.:44350184   3rd Qu.:14710288  
##  2018-02-11: 1   Max.   :1.290   Max.   :63716144   Max.   :21620181  
##  (Other)   :38                                                        
##      X4225              X4770           Total.Bags      
##  Min.   : 6021450   Min.   : 402866   Min.   : 8290842  
##  1st Qu.:10407883   1st Qu.: 634174   1st Qu.:13672216  
##  Median :11588285   Median : 734278   Median :15212227  
##  Mean   :11758540   Mean   : 744075   Mean   :15028123  
##  3rd Qu.:12711307   3rd Qu.: 803143   3rd Qu.:16247283  
##  Max.   :20445501   Max.   :1169256   Max.   :21625373  
##                                                         
##    Small.Bags         Large.Bags       XLarge.Bags    
##  Min.   : 6016143   Min.   :2183011   Min.   : 80605  
##  1st Qu.: 9845629   1st Qu.:3500838   1st Qu.:199184  
##  Median :11045183   Median :3929403   Median :267982  
##  Mean   :10800320   Mean   :3986979   Mean   :240824  
##  3rd Qu.:11709426   3rd Qu.:4470543   3rd Qu.:298018  
##  Max.   :15436247   Max.   :5840415   Max.   :420065  
## 

Data tidying and transformation

Gather columns (Part 1)

  • Columns X4046, X4225 and X4770 should be gathered under the variable “ProductCode”
  • We will use the gather function in tidyr to turn the wide table into a long table: ProductCode and Volume appear as key-value pairs
avocado2 <- avocado %>% gather(ProductCode, Volume, -Date:-Total.Volume, -Total.Bags:-XLarge.Bags)
# New column ProductCode, with values gathered from X4046, X4225 and X4770, excluding all other columns
head(avocado2)
##         Date AveragePrice Total.Volume Total.Bags Small.Bags Large.Bags
## 1 2018-12-02         1.22     27652426   10151173    7174281    2895343
## 2 2018-11-25         1.25     21855494    8290842    6016143    2183011
## 3 2018-11-18         1.14     29871135   12382973    9174635    3103023
## 4 2018-11-11         1.00     39042283   14698246   10721367    3865596
## 5 2018-11-04         1.01     38363283   15222197   10602849    4538743
## 6 2018-10-28         1.01     39402227   15865346   11437563    4320345
##   XLarge.Bags ProductCode   Volume
## 1    81547.92       X4046  9516477
## 2    91688.68       X4046  7140335
## 3   105315.78       X4046  9719229
## 4   111282.12       X4046 13355864
## 5    80604.81       X4046 12420772
## 6   107438.89       X4046 12142265

Gather columns (Part 2)

  • Columns Small.Bags, Large.Bags and XLarge.Bags should be gathered under the variable “BagSize”
  • Again, we will use the gather function in tidyr
avocado3 <- avocado2 %>% gather(BagSize, Bags, -Date:-Total.Bags, -ProductCode, -Volume)
head(avocado3)
##         Date AveragePrice Total.Volume Total.Bags ProductCode   Volume
## 1 2018-12-02         1.22     27652426   10151173       X4046  9516477
## 2 2018-11-25         1.25     21855494    8290842       X4046  7140335
## 3 2018-11-18         1.14     29871135   12382973       X4046  9719229
## 4 2018-11-11         1.00     39042283   14698246       X4046 13355864
## 5 2018-11-04         1.01     38363283   15222197       X4046 12420772
## 6 2018-10-28         1.01     39402227   15865346       X4046 12142265
##      BagSize     Bags
## 1 Small.Bags  7174281
## 2 Small.Bags  6016143
## 3 Small.Bags  9174635
## 4 Small.Bags 10721367
## 5 Small.Bags 10602849
## 6 Small.Bags 11437563

Now we look at the structure again

str(avocado3)
## 'data.frame':    396 obs. of  8 variables:
##  $ Date        : Factor w/ 44 levels "2018-01-07","2018-01-14",..: 44 43 42 41 40 39 38 37 36 35 ...
##  $ AveragePrice: num  1.22 1.25 1.14 1 1.01 1.01 1.13 1.08 1.08 1.1 ...
##  $ Total.Volume: num  27652426 21855494 29871135 39042283 38363283 ...
##  $ Total.Bags  : num  10151173 8290842 12382973 14698246 15222197 ...
##  $ ProductCode : chr  "X4046" "X4046" "X4046" "X4046" ...
##  $ Volume      : num  9516477 7140335 9719229 13355864 12420772 ...
##  $ BagSize     : chr  "Small.Bags" "Small.Bags" "Small.Bags" "Small.Bags" ...
##  $ Bags        : num  7174281 6016143 9174635 10721367 10602849 ...

Change the class of Date from “factor” to “Date” using the lubridate package

avocado3$Date <- ymd(avocado3$Date)
class(avocado3$Date)
## [1] "Date"

Change ProductCode and BagSize from character to factors

avocado3 <- avocado3 %>% mutate_at(vars(ProductCode, BagSize), funs(as.factor))
class(avocado3$ProductCode)
## [1] "factor"
class(avocado3$BagSize)
## [1] "factor"

Data Analysis

Let’s look at the distribution of avocado average price: the average price is centered around 1.0 to 1.1.

hist(avocado3$AveragePrice)

Relationship between total volume and average price, shown on a scatterplot: there is a decreasing linear relationship, i.e., the higher the volume, the lower the price.

plot(avocado3$Total.Volume, avocado3$AveragePrice)
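
To quantify that negative association, here is a quick sketch using base R’s cor() and lm() (not part of the original analysis):

# Correlation and a simple linear fit between volume and price
cor(avocado3$Total.Volume, avocado3$AveragePrice)
fit <- lm(AveragePrice ~ Total.Volume, data = avocado3)
coef(fit)  # a negative slope indicates price falls as volume rises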

Average price changes throughout the year: prices are higher in January, September, and December.

avocado3 %>% select(Date, AveragePrice) %>% plot()

Let’s look at how total volume and average price vary by month (a product-level volume sketch follows the monthly summaries):

# For each month, summarize the sum of volume and mean of average price, arrange from high to low price
avocado3 %>% group_by(Month = month(Date)) %>% summarize(TotalVolume = sum(Volume), AvgPrice = mean(AveragePrice)) %>% arrange(desc(AvgPrice))
## # A tibble: 12 x 3
##    Month TotalVolume AvgPrice
##    <dbl>       <dbl>    <dbl>
##  1    12   52503759.    1.22 
##  2     9  337349911.    1.21 
##  3     8  143750709.    1.17 
##  4     1  319904974.    1.12 
##  5    11  235613809.    1.1  
##  6    10  267680249.    1.08 
##  7     3  312853337.    1.06 
##  8     6  348435216.    1.06 
##  9     4  415004435.    1.06 
## 10     7  254879204.    1.05 
## 11     2  361425548.    0.995
## 12     5  386571538.    0.985
# For each month, summarize the sum of volume and mean of average price, arrange from high to low quantity
avocado3 %>% group_by(Month = month(Date)) %>% summarize(TotalVolume = sum(Volume), AvgPrice = mean(AveragePrice)) %>% arrange(desc(TotalVolume))
## # A tibble: 12 x 3
##    Month TotalVolume AvgPrice
##    <dbl>       <dbl>    <dbl>
##  1     4  415004435.    1.06 
##  2     5  386571538.    0.985
##  3     2  361425548.    0.995
##  4     6  348435216.    1.06 
##  5     9  337349911.    1.21 
##  6     1  319904974.    1.12 
##  7     3  312853337.    1.06 
##  8    10  267680249.    1.08 
##  9     7  254879204.    1.05 
## 10    11  235613809.    1.1  
## 11     8  143750709.    1.17 
## 12    12   52503759.    1.22
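
The dataset carries a single overall AveragePrice per week, so price differences among the individual product codes cannot be computed directly; volumes by product can, as in this sketch built on avocado2 (one row per date and product code):

avocado2 %>%
  group_by(ProductCode) %>%
  summarize(TotalVolume = sum(Volume)) %>%
  arrange(desc(TotalVolume))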

Conclusion:

  • In 2018, December is the month in which avocados had the highest price and the lowest volume.
  • In 2018, May is the month in which avocados had the lowest price and the second-highest volume.