Goal of this project:

- Practice in preparing different datasets for downstream analysis work.

Our task is to choose any three of the “wide” datasets identified in the Week 6 Discussion items. For each of the three chosen datasets:

  • Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below. (A minimal MySQL sketch follows this list.)

  • Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

  • Perform the analysis requested in the discussion item.

  • Post the code in an R Markdown file to rpubs.com, including narrative descriptions of the data cleanup work, analysis, and conclusions.
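
For the optional MySQL route, here is a minimal sketch assuming the DBI and RMySQL packages and a local MySQL server; the connection details and the wide_df data frame name are placeholders for illustration only.

library(DBI)
# Connect to a local MySQL server (credentials below are placeholders)
con <- dbConnect(RMySQL::MySQL(), dbname = "project2", host = "localhost",
                 user = "user", password = "password")
# wide_df stands for any of the wide data frames prepared below
dbWriteTable(con, "wide_dataset", wide_df, overwrite = TRUE)
wide_from_db <- dbReadTable(con, "wide_dataset")
dbDisconnect(con)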


Dataset 1: Steel Imports Data from U.S. Census

Load the required libraries for tidying and modifying the data for further analysis

knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)

Read the data from the CSV file, skipping the first 15 header rows and keeping the first 47 data rows

steel <- read.csv('https://raw.githubusercontent.com/weizhou2273/DATA607/master/projects/project%202/steel_import_1.csv',skip=15, stringsAsFactors = F, header= FALSE)[1:47,]

DATA CLEANING

Rename columns

names(steel) <- c('Country_areas','201810_prequantity','201810_prevalue',
                                  '201809_prequantity','201809_prevalue',
                                  '201809_finalquantity','201809_finalvalue',
                                  '201710_finalquantity','201710_finalvalue',
                                  '201709_finalquantity','201709_finalvalue')
head(steel)
##    Country_areas 201810_prequantity 201810_prevalue 201809_prequantity
## 1  North America             739346          673946             637423
## 2         Canada             409300          403874             380465
## 3         Mexico             330046          270073             256958
## 4         Europe             682053          738322             673569
## 5 European Union             427571          541635             390937
## 6        Austria              25793           38920              25552
##   201809_prevalue 201809_finalquantity 201809_finalvalue
## 1          611864               638543            613821
## 2          380754               380484            380790
## 3          231110               258059            233032
## 4          681575               675161            682885
## 5          487953               392240            488745
## 6           35841                25621             35954
##   201710_finalquantity 201710_finalvalue 201709_finalquantity
## 1               743537            665119               729999
## 2               482878            454037               435284
## 3               260659            211081               294716
## 4               820708            794907               804507
## 5               435696            563048               455216
## 6                17302             30929                25836
##   201709_finalvalue
## 1            639544
## 2            414721
## 3            224823
## 4            750333
## 5            536615
## 6             41803

Replace “(-)” with NA and convert columns 2 through 11 to numeric

steel[steel=='(-)']<-NA
steel[,2:11] = steel%>%
      select(2:11)%>%
      # mutate_all(funs(str_replace(., ",", "")))%>%  
      mutate_all(as.numeric)
head(steel)
##    Country_areas 201810_prequantity 201810_prevalue 201809_prequantity
## 1  North America             739346          673946             637423
## 2         Canada             409300          403874             380465
## 3         Mexico             330046          270073             256958
## 4         Europe             682053          738322             673569
## 5 European Union             427571          541635             390937
## 6        Austria              25793           38920              25552
##   201809_prevalue 201809_finalquantity 201809_finalvalue
## 1          611864               638543            613821
## 2          380754               380484            380790
## 3          231110               258059            233032
## 4          681575               675161            682885
## 5          487953               392240            488745
## 6           35841                25621             35954
##   201710_finalquantity 201710_finalvalue 201709_finalquantity
## 1               743537            665119               729999
## 2               482878            454037               435284
## 3               260659            211081               294716
## 4               820708            794907               804507
## 5               435696            563048               455216
## 6                17302             30929                25836
##   201709_finalvalue
## 1            639544
## 2            414721
## 3            224823
## 4            750333
## 5            536615
## 6             41803
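
If a raw export contains comma thousands separators (the commented-out str_replace line above hints at this), readr::parse_number() is an alternative way to strip them while coercing to numeric. This is only a sketch of an alternative to the conversion above, assuming the readr package; the “(-)” placeholders still need the NA replacement first, or they will trigger parsing warnings.

library(readr)
steel[, 2:11] <- steel %>%
  select(2:11) %>%
  mutate_all(parse_number)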

Transform the dataset: gather it into long form, then spread it back out by quantity/value type and by year/month

steel_final_1 = gather(steel,year_month_type, Number,2:11, factor_key = TRUE)%>%
                separate(year_month_type, c('year_month','quant_value'),sep='_')%>%
                spread(quant_value,'Number')
head(steel_final_1,10)
##    Country_areas year_month finalquantity finalvalue prequantity prevalue
## 1      Argentina     201709         13724      16354          NA       NA
## 2      Argentina     201710         20635      21591          NA       NA
## 3      Argentina     201809         12559      18054       12559    18054
## 4      Argentina     201810            NA         NA       10130    12041
## 5      Australia     201709         22087      17841          NA       NA
## 6      Australia     201710         21412      15981          NA       NA
## 7      Australia     201809         18775      17519       18775    17519
## 8      Australia     201810            NA         NA       28739    28244
## 9        Austria     201709         25836      41803          NA       NA
## 10       Austria     201710         17302      30929          NA       NA
steel_final_2 = gather(steel,year_month_type, Number,2:11, factor_key = TRUE)%>%
                separate(year_month_type, c('year_month','quant_value'),sep='_')%>%
                spread(year_month,'Number')
head(steel_final_2,10)
##    Country_areas   quant_value 201709 201710 201809 201810
## 1      Argentina finalquantity  13724  20635  12559     NA
## 2      Argentina    finalvalue  16354  21591  18054     NA
## 3      Argentina   prequantity     NA     NA  12559  10130
## 4      Argentina      prevalue     NA     NA  18054  12041
## 5      Australia finalquantity  22087  21412  18775     NA
## 6      Australia    finalvalue  17841  15981  17519     NA
## 7      Australia   prequantity     NA     NA  18775  28739
## 8      Australia      prevalue     NA     NA  17519  28244
## 9        Austria finalquantity  25836  17302  25621     NA
## 10       Austria    finalvalue  41803  30929  35954     NA
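
As a side note, the same reshaping can be expressed with the newer tidyr verbs pivot_longer() and pivot_wider() (tidyr >= 1.0.0). This is only a sketch of an equivalent pipeline, not the code that produced the tables above.

steel_long <- steel %>%
  pivot_longer(cols = 2:11,
               names_to = c("year_month", "quant_value"),
               names_sep = "_",
               values_to = "Number")
# Wide on quantity/value type, analogous to steel_final_1
steel_final_1b <- pivot_wider(steel_long, names_from = quant_value, values_from = Number)
# Wide on year/month, analogous to steel_final_2
steel_final_2b <- pivot_wider(steel_long, names_from = year_month, values_from = Number)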

DATA ANALYSIS

Questions:

1. Which country/area has the largest overall final quantity / value?
final_quantity = steel_final_2%>%
  filter(quant_value =='finalquantity')%>%
  rowwise()%>%
  mutate(finalquantity_overall= sum(`201709`,`201710`,`201809`,`201810`,na.rm=TRUE))%>%
  arrange(desc(finalquantity_overall))
head(final_quantity)
## # A tibble: 6 x 7
##   Country_areas quant_value `201709` `201710` `201809` `201810`
##   <chr>         <chr>          <dbl>    <dbl>    <dbl>    <dbl>
## 1 Europe        finalquant~   804507   820708   675161       NA
## 2 North America finalquant~   729999   743537   638543       NA
## 3 Pacific Rim ~ finalquant~   691501   662335   418794       NA
## 4 Canada        finalquant~   435284   482878   380484       NA
## 5 European Uni~ finalquant~   455216   435696   392240       NA
## 6 Euro Area     finalquant~   396011   358825   292800       NA
## # ... with 1 more variable: finalquantity_overall <dbl>
final_value = steel_final_2%>%
  filter(quant_value =='finalvalue')%>%
  rowwise()%>%
  mutate(finalvalue_overall= sum(`201709`,`201710`,`201809`,`201810`,na.rm=TRUE))%>%
  arrange(desc(finalvalue_overall))
head(final_value)
## # A tibble: 6 x 7
##   Country_areas quant_value `201709` `201710` `201809` `201810`
##   <chr>         <chr>          <dbl>    <dbl>    <dbl>    <dbl>
## 1 Europe        finalvalue    750333   794907   682885       NA
## 2 North America finalvalue    639544   665119   613821       NA
## 3 Pacific Rim ~ finalvalue    667787   657377   473571       NA
## 4 European Uni~ finalvalue    536615   563048   488745       NA
## 5 Euro Area     finalvalue    447539   450147   364049       NA
## 6 Canada        finalvalue    414721   454037   380790       NA
## # ... with 1 more variable: finalvalue_overall <dbl>

Europe has the highest overall quantity.

final_quantity$Country_areas <- factor(final_quantity$Country_areas, levels = final_quantity$Country_areas[order(final_quantity$finalquantity_overall)])

 g=ggplot(final_quantity,aes(y=finalquantity_overall,x=Country_areas,label=Country_areas))
 g+geom_bar(stat='identity', aes(fill=finalquantity_overall), width=.5)+
   coord_flip()

Europe has the highest overall value.

final_value$Country_areas <- factor(final_value$Country_areas, levels = final_value$Country_areas[order(final_value$finalvalue_overall)])

 g=ggplot(final_value,aes(y=finalvalue_overall,x=Country_areas,label=Country_areas))
 g+geom_bar(stat='identity', aes(fill=finalvalue_overall), width=.5)+
   coord_flip()

2. Which country has the highest unit value (Value / Quantity)?

Brazil has the highest unit price, while Singapore has the lowest.

unit_value=data.frame(final_value$Country_areas,final_quantity$finalquantity_overall,final_value$finalvalue_overall)
names(unit_value) = c('country','quantity','value')
unit_value=unit_value%>%
  mutate(unit_value = quantity /value )
  
unit_value$country <- factor(unit_value$country, levels = unit_value$country[order(unit_value$unit_value)])

g1=ggplot(unit_value,aes(y=unit_value,x=country,label=country))
 g1+geom_bar(stat='identity', aes(fill=unit_value), width=.5)+
   coord_flip()
## Warning: Removed 1 rows containing missing values (position_stack).

3. What does the unit value distribution look like?

The unit value distribution is approximately normal with a slight left skew.
The average unit value is 0.9373.

ggplot(unit_value, aes(x=unit_value)) + 
  geom_histogram(binwidth = 0.1, color="darkblue", fill="lightblue")+
  geom_density(alpha=.2, fill="#FF6666") 
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing non-finite values (stat_density).

summary(unit_value)
##            country      quantity           value           unit_value    
##  Singapore     : 1   Min.   :      0   Min.   :      0   Min.   :0.2360  
##  Colombia      : 1   1st Qu.:  27939   1st Qu.:  31471   1st Qu.:0.8046  
##  Hong Kong     : 1   Median :  86679   Median : 104412   Median :0.9368  
##  Norway        : 1   Mean   : 377578   Mean   : 361586   Mean   :0.9373  
##  Poland        : 1   3rd Qu.: 515234   3rd Qu.: 407678   3rd Qu.:1.0529  
##  United Kingdom: 1   Max.   :2300376   Max.   :2228125   Max.   :1.4819  
##  (Other)       :41                                       NA's   :1
4. What is the difference between the preliminary and final quantity?

Most of the quantity differences fall in the 0–100 bin; the distribution is right skewed.

diff  = steel_final_1%>%
  mutate(quantity_diff = finalquantity - prequantity,
         value_diff = finalvalue - prevalue)%>%
  filter(!is.na(quantity_diff))

ggplot(diff, aes(x=quantity_diff)) +
  geom_histogram(binwidth = 100, color="darkblue", fill="lightblue")+
  geom_density(alpha=.2, fill="#FF6666")
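
A quick numeric check of that claim is sketched below (not part of the original output): summary() shows the spread of the differences, and the share of positive differences shows how often the final quantity exceeds the preliminary figure.

summary(diff$quantity_diff)
# Proportion of observations where the final quantity exceeds the preliminary quantity
mean(diff$quantity_diff > 0, na.rm = TRUE)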


Dataset 2: Zillow data on 2018 Median Rental Price/sqft by State by Sunny Mehta

Load the CSV dataset into R

url <- "https://raw.githubusercontent.com/Weicaidata/607/master/State_MedianRentalPricePerSqft_AllHomes.csv"
df1 <- read.csv(url,header=TRUE)
head(df1)
##     RegionName SizeRank  X2018.01  X2018.02  X2018.03  X2018.04  X2018.05
## 1   California        1 1.8674136 1.8846154 1.9001086 1.9235033 1.9478528
## 2        Texas        2 0.9056244 0.9134432 0.9237706 0.9344262 0.9387974
## 3     New York        3 4.2682584 4.2000000 3.9166667 4.1900000 4.3078412
## 4      Florida        4 1.3180646 1.3242778 1.3303769 1.3429752 1.3420245
## 5     Illinois        5 1.2173913 1.2288786 1.2594270 1.2901892 1.2925563
## 6 Pennsylvania        6 1.0256410 1.0351064 1.0568107 1.0817308 1.0890381
##   X2018.06  X2018.07  X2018.08  X2018.09  X2018.10  X2018.11  X2018.12
## 1 1.956392 1.9519520 1.9389239 1.9215785 1.9117647 1.9245725 1.9344438
## 2 0.940000 0.9373203 0.9320449 0.9276438 0.9252537 0.9281437 0.9318022
## 3 4.042553 3.8974345 3.4742328 3.2467532 3.1003876 3.1424668 3.5211268
## 4 1.333333 1.3353116 1.3493800 1.3536748 1.3489209 1.3542038 1.3584906
## 5 1.299667 1.2840895 1.2759585 1.2681159 1.2653354 1.2568210 1.2536585
## 6 1.080456 1.0574495 1.0512691 1.0416667 1.0341616 1.0380623 1.0416667

Rename the columns for easier interpretation; the letter prefixes (a through l) keep the months in calendar order when sorted alphabetically

colnames(df1) <- c("state","size","aJan","bFeb","cMar","dApr","eMay","fJun","gJul","hAug","iSep","jOct","kNov","lDec")
head(df1)
##          state size      aJan      bFeb      cMar      dApr      eMay
## 1   California    1 1.8674136 1.8846154 1.9001086 1.9235033 1.9478528
## 2        Texas    2 0.9056244 0.9134432 0.9237706 0.9344262 0.9387974
## 3     New York    3 4.2682584 4.2000000 3.9166667 4.1900000 4.3078412
## 4      Florida    4 1.3180646 1.3242778 1.3303769 1.3429752 1.3420245
## 5     Illinois    5 1.2173913 1.2288786 1.2594270 1.2901892 1.2925563
## 6 Pennsylvania    6 1.0256410 1.0351064 1.0568107 1.0817308 1.0890381
##       fJun      gJul      hAug      iSep      jOct      kNov      lDec
## 1 1.956392 1.9519520 1.9389239 1.9215785 1.9117647 1.9245725 1.9344438
## 2 0.940000 0.9373203 0.9320449 0.9276438 0.9252537 0.9281437 0.9318022
## 3 4.042553 3.8974345 3.4742328 3.2467532 3.1003876 3.1424668 3.5211268
## 4 1.333333 1.3353116 1.3493800 1.3536748 1.3489209 1.3542038 1.3584906
## 5 1.299667 1.2840895 1.2759585 1.2681159 1.2653354 1.2568210 1.2536585
## 6 1.080456 1.0574495 1.0512691 1.0416667 1.0341616 1.0380623 1.0416667

Use the glimpse function from dplyr to examine the dataset, and the summary function to check for missing values

glimpse(df1)
## Observations: 51
## Variables: 14
## $ state <fct> California, Texas, New York, Florida, Illinois, Pennsylv...
## $ size  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
## $ aJan  <dbl> 1.8674136, 0.9056244, 4.2682584, 1.3180646, 1.2173913, 1...
## $ bFeb  <dbl> 1.8846154, 0.9134432, 4.2000000, 1.3242778, 1.2288786, 1...
## $ cMar  <dbl> 1.9001086, 0.9237706, 3.9166667, 1.3303769, 1.2594270, 1...
## $ dApr  <dbl> 1.9235033, 0.9344262, 4.1900000, 1.3429752, 1.2901892, 1...
## $ eMay  <dbl> 1.9478528, 0.9387974, 4.3078412, 1.3420245, 1.2925563, 1...
## $ fJun  <dbl> 1.9563919, 0.9400000, 4.0425532, 1.3333333, 1.2996666, 1...
## $ gJul  <dbl> 1.9519520, 0.9373203, 3.8974345, 1.3353116, 1.2840895, 1...
## $ hAug  <dbl> 1.9389239, 0.9320449, 3.4742328, 1.3493800, 1.2759585, 1...
## $ iSep  <dbl> 1.9215785, 0.9276438, 3.2467532, 1.3536748, 1.2681159, 1...
## $ jOct  <dbl> 1.9117647, 0.9252537, 3.1003876, 1.3489209, 1.2653354, 1...
## $ kNov  <dbl> 1.9245725, 0.9281437, 3.1424668, 1.3542038, 1.2568210, 1...
## $ lDec  <dbl> 1.9344438, 0.9318022, 3.5211268, 1.3584906, 1.2536585, 1...
summary(df1)
##         state         size           aJan             bFeb       
##  Alabama   : 1   Min.   : 1.0   Min.   :0.7045   Min.   :0.7082  
##  Alaska    : 1   1st Qu.:13.5   1st Qu.:0.8461   1st Qu.:0.8577  
##  Arizona   : 1   Median :26.0   Median :0.9665   Median :0.9722  
##  Arkansas  : 1   Mean   :26.0   Mean   :1.1827   Mean   :1.1938  
##  California: 1   3rd Qu.:38.5   3rd Qu.:1.3010   3rd Qu.:1.2932  
##  Colorado  : 1   Max.   :51.0   Max.   :4.2683   Max.   :4.2000  
##  (Other)   :45                                                   
##       cMar             dApr             eMay             fJun       
##  Min.   :0.7186   Min.   :0.7277   Min.   :0.7246   Min.   :0.7292  
##  1st Qu.:0.8702   1st Qu.:0.8852   1st Qu.:0.8964   1st Qu.:0.8856  
##  Median :0.9900   Median :0.9958   Median :1.0000   Median :0.9997  
##  Mean   :1.2019   Mean   :1.2258   Mean   :1.2326   Mean   :1.2268  
##  3rd Qu.:1.3188   3rd Qu.:1.3363   3rd Qu.:1.3373   3rd Qu.:1.3333  
##  Max.   :3.9167   Max.   :4.1900   Max.   :4.3078   Max.   :4.0426  
##                                                                     
##       gJul             hAug             iSep             jOct       
##  Min.   :0.7283   Min.   :0.7246   Min.   :0.7252   Min.   :0.7253  
##  1st Qu.:0.8832   1st Qu.:0.8786   1st Qu.:0.8760   1st Qu.:0.8799  
##  Median :0.9908   Median :0.9793   Median :0.9759   Median :0.9912  
##  Mean   :1.2148   Mean   :1.1995   Mean   :1.1896   Mean   :1.1794  
##  3rd Qu.:1.3189   3rd Qu.:1.3193   3rd Qu.:1.3106   3rd Qu.:1.3078  
##  Max.   :3.8974   Max.   :3.4742   Max.   :3.2468   Max.   :3.1004  
##                                                                     
##       kNov             lDec       
##  Min.   :0.7333   Min.   :0.7323  
##  1st Qu.:0.8739   1st Qu.:0.8731  
##  Median :0.9866   Median :0.9927  
##  Mean   :1.1789   Mean   :1.1905  
##  3rd Qu.:1.3087   3rd Qu.:1.3090  
##  Max.   :3.1425   Max.   :3.5211  
## 

Use the gather function from tidyr to reshape the dataset from wide to long format, then sort by “size”

df2 <- gather(df1,"month","sales",3:14) %>% 
  arrange(size)
head(df2)
##        state size month    sales
## 1 California    1  aJan 1.867414
## 2 California    1  bFeb 1.884615
## 3 California    1  cMar 1.900109
## 4 California    1  dApr 1.923503
## 5 California    1  eMay 1.947853
## 6 California    1  fJun 1.956392
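
As an alternative to the letter-prefix renaming, one could keep Zillow’s original X2018.MM column names and parse them into real dates. The sketch below assumes the lubridate package and re-reads the raw file, so it is independent of the renamed df1.

df_alt <- read.csv(url, header = TRUE) %>%
  rename(state = RegionName, size = SizeRank) %>%
  gather("month", "sales", -state, -size) %>%
  mutate(month = ymd(paste0(sub("^X", "", month), ".01")))  # "X2018.01" -> 2018-01-01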

Answer the question “overall current distribution”

We can use a histogram to see that the distribution is right skewed, and the summary function to find the mean, minimum, maximum, and first and third quartiles.

hist(df2$sales,main="sales",xlab="Average")

summary(df2$sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7045  0.8753  0.9875  1.2013  1.3169  4.3078

Answer the question “sales trend by months”

Use the summarise function, grouped by month, to find the average price for each month.

df3 <- df2 %>% 
  arrange(month) %>% 
  group_by(month) %>% 
  summarise(avg=mean(sales))
df3
## # A tibble: 12 x 2
##    month   avg
##    <chr> <dbl>
##  1 aJan   1.18
##  2 bFeb   1.19
##  3 cMar   1.20
##  4 dApr   1.23
##  5 eMay   1.23
##  6 fJun   1.23
##  7 gJul   1.21
##  8 hAug   1.20
##  9 iSep   1.19
## 10 jOct   1.18
## 11 kNov   1.18
## 12 lDec   1.19

From the graph below, we can see that summer is the high season for rental prices and winter is the low season, which is in line with our expectations.

ggplot(data=df3,aes(x=month,y=avg,group=1))+
  geom_line()+
  geom_point()+
  xlab("Month")+
  ylab("Total")+
  labs(title="Sales trend by month")

answer question" trend by states"

We want to look at the rental price trend over the year for the five largest states.

df6 <- df2 %>% 
  filter(size<=5)
head(df6)
##        state size month    sales
## 1 California    1  aJan 1.867414
## 2 California    1  bFeb 1.884615
## 3 California    1  cMar 1.900109
## 4 California    1  dApr 1.923503
## 5 California    1  eMay 1.947853
## 6 California    1  fJun 1.956392

New York has the highest rental price per sqft compared to the other states, but the prices in the other states are more stable.

ggplot(df6,aes(x=month,y=sales,color=state,group=state))+
   geom_line() +
  xlab('Month')+labs(title = "Top 5 states sales trend")


Dataset 3: Retail Volume & Avocado Price for 2018

source: http://www.hassavocadoboard.com/retail/volume-and-price-data

Read the CSV file into R

avocado <- read.csv("https://raw.githubusercontent.com/miachen410/DATA607/master/avocado.csv")

Take a look at the first 6 rows of the dataset

head(avocado)
##         Date AveragePrice Total.Volume    X4046    X4225    X4770
## 1 2018-12-02         1.22     27652426  9516477  7462866 521910.0
## 2 2018-11-25         1.25     21855494  7140335  6021450 402865.8
## 3 2018-11-18         1.14     29871135  9719229  7271528 497404.3
## 4 2018-11-11         1.00     39042283 13355864 10385141 603032.8
## 5 2018-11-04         1.01     38363283 12420772 10192035 528277.9
## 6 2018-10-28         1.01     39402227 12142265 10845124 549490.8
##   Total.Bags Small.Bags Large.Bags XLarge.Bags
## 1   10151173    7174281    2895343    81547.92
## 2    8290842    6016143    2183011    91688.68
## 3   12382973    9174635    3103023   105315.78
## 4   14698246   10721367    3865596   111282.12
## 5   15222197   10602849    4538743    80604.81
## 6   15865346   11437563    4320345   107438.89

Take a look at the last 6 rows of the dataset

tail(avocado)
##          Date AveragePrice Total.Volume    X4046    X4225     X4770
## 39 2018-02-11         0.97     43167806 15870678 11541845  652856.6
## 40 2018-02-04         0.87     62505647 21620181 20445501 1066830.2
## 41 2018-01-28         1.09     40171641 14551800 12119885  575974.7
## 42 2018-01-21         1.08     42939822 14218844 13929702  928815.1
## 43 2018-01-14         1.20     37299945 12600918 11866198  652808.4
## 44 2018-01-07         1.13     36703157 13730993 10781339  677714.9
##    Total.Bags Small.Bags Large.Bags XLarge.Bags
## 39   15102427   10844852    4023485    234089.7
## 40   19373134   13384587    5719097    269451.0
## 41   12923982    9749412    3041125    133444.4
## 42   13862460    9866218    3789723    206519.3
## 43   12180021    8128242    3917570    134208.9
## 44   11513110    8231766    3130919    150424.6

Let’s look at the structure of the dataset

str(avocado)
## 'data.frame':    44 obs. of  10 variables:
##  $ Date        : Factor w/ 44 levels "2018-01-07","2018-01-14",..: 44 43 42 41 40 39 38 37 36 35 ...
##  $ AveragePrice: num  1.22 1.25 1.14 1 1.01 1.01 1.13 1.08 1.08 1.1 ...
##  $ Total.Volume: num  27652426 21855494 29871135 39042283 38363283 ...
##  $ X4046       : num  9516477 7140335 9719229 13355864 12420772 ...
##  $ X4225       : num  7462866 6021450 7271528 10385141 10192035 ...
##  $ X4770       : num  521910 402866 497404 603033 528278 ...
##  $ Total.Bags  : num  10151173 8290842 12382973 14698246 15222197 ...
##  $ Small.Bags  : num  7174281 6016143 9174635 10721367 10602849 ...
##  $ Large.Bags  : num  2895343 2183011 3103023 3865596 4538743 ...
##  $ XLarge.Bags : num  81548 91689 105316 111282 80605 ...

Let’s look at the summary of the dataset

summary(avocado)
##          Date     AveragePrice    Total.Volume          X4046         
##  2018-01-07: 1   Min.   :0.870   Min.   :21855494   Min.   : 7140335  
##  2018-01-14: 1   1st Qu.:1.030   1st Qu.:37192349   1st Qu.:11872406  
##  2018-01-21: 1   Median :1.070   Median :40738473   Median :13719191  
##  2018-01-28: 1   Mean   :1.083   Mean   :41058219   Mean   :13527481  
##  2018-02-04: 1   3rd Qu.:1.130   3rd Qu.:44350184   3rd Qu.:14710288  
##  2018-02-11: 1   Max.   :1.290   Max.   :63716144   Max.   :21620181  
##  (Other)   :38                                                        
##      X4225              X4770           Total.Bags      
##  Min.   : 6021450   Min.   : 402866   Min.   : 8290842  
##  1st Qu.:10407883   1st Qu.: 634174   1st Qu.:13672216  
##  Median :11588285   Median : 734278   Median :15212227  
##  Mean   :11758540   Mean   : 744075   Mean   :15028123  
##  3rd Qu.:12711307   3rd Qu.: 803143   3rd Qu.:16247283  
##  Max.   :20445501   Max.   :1169256   Max.   :21625373  
##                                                         
##    Small.Bags         Large.Bags       XLarge.Bags    
##  Min.   : 6016143   Min.   :2183011   Min.   : 80605  
##  1st Qu.: 9845629   1st Qu.:3500838   1st Qu.:199184  
##  Median :11045183   Median :3929403   Median :267982  
##  Mean   :10800320   Mean   :3986979   Mean   :240824  
##  3rd Qu.:11709426   3rd Qu.:4470543   3rd Qu.:298018  
##  Max.   :15436247   Max.   :5840415   Max.   :420065  
## 

Data tidying and transformation

Gather columns (Part 1)

  • Columns X4046, X4225 and X4770 should be gathered under the variable “ProductCode”
  • We will use the gather function in tidyr to turn the wide table into a long table: ProductCode and Volume appear as key-value pairs
avocado2 <- avocado %>% gather(ProductCode, Volume, -Date:-Total.Volume, -Total.Bags:-XLarge.Bags)
# New column ProductCode, with values gathered from X4046, X4225 and X4770, excluding all other columns
head(avocado2)
##         Date AveragePrice Total.Volume Total.Bags Small.Bags Large.Bags
## 1 2018-12-02         1.22     27652426   10151173    7174281    2895343
## 2 2018-11-25         1.25     21855494    8290842    6016143    2183011
## 3 2018-11-18         1.14     29871135   12382973    9174635    3103023
## 4 2018-11-11         1.00     39042283   14698246   10721367    3865596
## 5 2018-11-04         1.01     38363283   15222197   10602849    4538743
## 6 2018-10-28         1.01     39402227   15865346   11437563    4320345
##   XLarge.Bags ProductCode   Volume
## 1    81547.92       X4046  9516477
## 2    91688.68       X4046  7140335
## 3   105315.78       X4046  9719229
## 4   111282.12       X4046 13355864
## 5    80604.81       X4046 12420772
## 6   107438.89       X4046 12142265

Gather columns (Part 2)

  • Columns Small.Bags, Large.Bags and XLarge.Bags should be gathered under the variable “BagSize”
  • Again, we will use the gather function in tidyr
avocado3 <- avocado2 %>% gather(BagSize, Bags, -Date:-Total.Bags, -ProductCode, -Volume)
head(avocado3)
##         Date AveragePrice Total.Volume Total.Bags ProductCode   Volume
## 1 2018-12-02         1.22     27652426   10151173       X4046  9516477
## 2 2018-11-25         1.25     21855494    8290842       X4046  7140335
## 3 2018-11-18         1.14     29871135   12382973       X4046  9719229
## 4 2018-11-11         1.00     39042283   14698246       X4046 13355864
## 5 2018-11-04         1.01     38363283   15222197       X4046 12420772
## 6 2018-10-28         1.01     39402227   15865346       X4046 12142265
##      BagSize     Bags
## 1 Small.Bags  7174281
## 2 Small.Bags  6016143
## 3 Small.Bags  9174635
## 4 Small.Bags 10721367
## 5 Small.Bags 10602849
## 6 Small.Bags 11437563

Now we look at the structure again

str(avocado3)
## 'data.frame':    396 obs. of  8 variables:
##  $ Date        : Factor w/ 44 levels "2018-01-07","2018-01-14",..: 44 43 42 41 40 39 38 37 36 35 ...
##  $ AveragePrice: num  1.22 1.25 1.14 1 1.01 1.01 1.13 1.08 1.08 1.1 ...
##  $ Total.Volume: num  27652426 21855494 29871135 39042283 38363283 ...
##  $ Total.Bags  : num  10151173 8290842 12382973 14698246 15222197 ...
##  $ ProductCode : chr  "X4046" "X4046" "X4046" "X4046" ...
##  $ Volume      : num  9516477 7140335 9719229 13355864 12420772 ...
##  $ BagSize     : chr  "Small.Bags" "Small.Bags" "Small.Bags" "Small.Bags" ...
##  $ Bags        : num  7174281 6016143 9174635 10721367 10602849 ...

Change the class of Date from “factor” to “Date” using the lubridate package

avocado3$Date <- ymd(avocado3$Date)
class(avocado3$Date)
## [1] "Date"

Change ProductCode and BagSize from character to factors

avocado3 <- avocado3 %>% mutate_at(vars(ProductCode, BagSize), funs(as.factor))
class(avocado3$ProductCode)
## [1] "factor"
class(avocado3$BagSize)
## [1] "factor"

Data Analysis

Let’s look at the distribution of avocado average price: the average price is centered around 1.0 to 1.1.

hist(avocado3$AveragePrice)

Relationship between total volume and average price, shown on a scatterplot: there is a decreasing linear relationship, i.e., the higher the volume, the lower the price.

plot(avocado3$Total.Volume, avocado3$AveragePrice)
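
To quantify that negative association, here is a quick sketch using base R’s cor() and lm() (not part of the original analysis):

# Correlation and a simple linear fit between volume and price
cor(avocado3$Total.Volume, avocado3$AveragePrice)
fit <- lm(AveragePrice ~ Total.Volume, data = avocado3)
coef(fit)  # a negative slope indicates price falls as volume rises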

Average price changes throughout the year: prices are higher in January, September, and December.

avocado3 %>% select(Date, AveragePrice) %>% plot()

Let’s look at how total volume and average price vary by month (a product-level volume sketch follows the monthly summaries):

# For each month, summarize the sum of volume and mean of average price, arrange from high to low price
avocado3 %>% group_by(Month = month(Date)) %>% summarize(TotalVolume = sum(Volume), AvgPrice = mean(AveragePrice)) %>% arrange(desc(AvgPrice))
## # A tibble: 12 x 3
##    Month TotalVolume AvgPrice
##    <dbl>       <dbl>    <dbl>
##  1    12   52503759.    1.22 
##  2     9  337349911.    1.21 
##  3     8  143750709.    1.17 
##  4     1  319904974.    1.12 
##  5    11  235613809.    1.1  
##  6    10  267680249.    1.08 
##  7     3  312853337.    1.06 
##  8     6  348435216.    1.06 
##  9     4  415004435.    1.06 
## 10     7  254879204.    1.05 
## 11     2  361425548.    0.995
## 12     5  386571538.    0.985
# For each month, summarize the sum of volume and mean of average price, arrange from high to low quantity
avocado3 %>% group_by(Month = month(Date)) %>% summarize(TotalVolume = sum(Volume), AvgPrice = mean(AveragePrice)) %>% arrange(desc(TotalVolume))
## # A tibble: 12 x 3
##    Month TotalVolume AvgPrice
##    <dbl>       <dbl>    <dbl>
##  1     4  415004435.    1.06 
##  2     5  386571538.    0.985
##  3     2  361425548.    0.995
##  4     6  348435216.    1.06 
##  5     9  337349911.    1.21 
##  6     1  319904974.    1.12 
##  7     3  312853337.    1.06 
##  8    10  267680249.    1.08 
##  9     7  254879204.    1.05 
## 10    11  235613809.    1.1  
## 11     8  143750709.    1.17 
## 12    12   52503759.    1.22
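
The dataset carries a single overall AveragePrice per week, so price differences among the individual product codes cannot be computed directly; volumes by product can, as in this sketch built on avocado2 (one row per date and product code):

avocado2 %>%
  group_by(ProductCode) %>%
  summarize(TotalVolume = sum(Volume)) %>%
  arrange(desc(TotalVolume))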

Conclusion:

  • In 2018, December is the month in which avocados had the highest price and the lowest volume.
  • In 2018, May is the month in which avocados had the lowest price and the second-highest volume.