Our task is to choose any three of the “wide” datasets identified in the Week 6 Discussion items. For each of the three chosen datasets:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice the tidying and transformations described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform the analysis requested in the discussion item.
The code is in an R Markdown file, posted to rpubs.com, and includes narrative descriptions of the data cleanup work, analysis, and conclusions.
Load the required libraries to tidy, transform, and visualize the data for further analysis.
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
# Skip the 15-line preamble in the source file and keep the first 47 data rows.
steel <- read.csv('https://raw.githubusercontent.com/weizhou2273/DATA607/master/projects/project%202/steel_import_1.csv',skip=15, stringsAsFactors = F, header= FALSE)[1:47,]
# Column names encode the period (YYYYMM) and whether the figure is
# preliminary or final, for both quantity and value.
names(steel) <- c('Country_areas','201810_prequantity','201810_prevalue',
'201809_prequantity','201809_prevalue',
'201809_finalquantity','201809_finalvalue',
'201710_finalquantity','201710_finalvalue',
'201709_finalquantity','201709_finalvalue')
head(steel)
## Country_areas 201810_prequantity 201810_prevalue 201809_prequantity
## 1 North America 739346 673946 637423
## 2 Canada 409300 403874 380465
## 3 Mexico 330046 270073 256958
## 4 Europe 682053 738322 673569
## 5 European Union 427571 541635 390937
## 6 Austria 25793 38920 25552
## 201809_prevalue 201809_finalquantity 201809_finalvalue
## 1 611864 638543 613821
## 2 380754 380484 380790
## 3 231110 258059 233032
## 4 681575 675161 682885
## 5 487953 392240 488745
## 6 35841 25621 35954
## 201710_finalquantity 201710_finalvalue 201709_finalquantity
## 1 743537 665119 729999
## 2 482878 454037 435284
## 3 260659 211081 294716
## 4 820708 794907 804507
## 5 435696 563048 455216
## 6 17302 30929 25836
## 201709_finalvalue
## 1 639544
## 2 414721
## 3 224823
## 4 750333
## 5 536615
## 6 41803
steel[steel=='(-)']<-NA   # '(-)' marks missing values in the source table
# Convert the ten measure columns from character to numeric.
steel[,2:11] = steel%>%
select(2:11)%>%
# mutate_all(funs(str_replace(., ",", "")))%>%  # uncomment if the raw file has comma thousands separators (needs stringr)
mutate_all(as.numeric)
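If the raw file had contained comma thousands separators (as the commented-out str_replace suggests), readr::parse_number() would strip the commas and coerce to numeric in one step. A minimal sketch, assuming the readr package and dplyr >= 1.0, meant to run on the freshly read character columns in place of the chunk above; steel_clean is a hypothetical name:
library(readr)
# parse_number() drops grouping marks before converting; run this on the
# raw character columns instead of mutate_all(as.numeric).
steel_clean <- steel %>% mutate(across(2:11, parse_number))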
head(steel)
## Country_areas 201810_prequantity 201810_prevalue 201809_prequantity
## 1 North America 739346 673946 637423
## 2 Canada 409300 403874 380465
## 3 Mexico 330046 270073 256958
## 4 Europe 682053 738322 673569
## 5 European Union 427571 541635 390937
## 6 Austria 25793 38920 25552
## 201809_prevalue 201809_finalquantity 201809_finalvalue
## 1 611864 638543 613821
## 2 380754 380484 380790
## 3 231110 258059 233032
## 4 681575 675161 682885
## 5 487953 392240 488745
## 6 35841 25621 35954
## 201710_finalquantity 201710_finalvalue 201709_finalquantity
## 1 743537 665119 729999
## 2 482878 454037 435284
## 3 260659 211081 294716
## 4 820708 794907 804507
## 5 435696 563048 455216
## 6 17302 30929 25836
## 201709_finalvalue
## 1 639544
## 2 414721
## 3 224823
## 4 750333
## 5 536615
## 6 41803
# Reshape to long, split names like "201810_prequantity" into year_month and
# measure type, then spread the measure types back out as columns.
steel_final_1 = gather(steel,year_month_type, Number,2:11, factor_key = TRUE)%>%
separate(year_month_type, c('year_month','quant_value'),sep='_')%>%
spread(quant_value,'Number')
head(steel_final_1,10)
## Country_areas year_month finalquantity finalvalue prequantity prevalue
## 1 Argentina 201709 13724 16354 NA NA
## 2 Argentina 201710 20635 21591 NA NA
## 3 Argentina 201809 12559 18054 12559 18054
## 4 Argentina 201810 NA NA 10130 12041
## 5 Australia 201709 22087 17841 NA NA
## 6 Australia 201710 21412 15981 NA NA
## 7 Australia 201809 18775 17519 18775 17519
## 8 Australia 201810 NA NA 28739 28244
## 9 Austria 201709 25836 41803 NA NA
## 10 Austria 201710 17302 30929 NA NA
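The same reshape can be written with tidyr’s newer pivot verbs, which replaced gather/separate/spread; a sketch assuming tidyr >= 1.0.0 (steel_final_1_alt is a hypothetical name):
# pivot_longer() splits the column names directly via names_sep, so no
# separate() step is needed.
steel_final_1_alt <- steel %>%
  pivot_longer(cols = 2:11,
               names_to = c("year_month", "quant_value"),
               names_sep = "_",
               values_to = "Number") %>%
  pivot_wider(names_from = quant_value, values_from = Number)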
# Reshape again, but spread by year_month instead: one row per country and measure.
steel_final_2 = gather(steel,year_month_type, Number,2:11, factor_key = TRUE)%>%
separate(year_month_type, c('year_month','quant_value'),sep='_')%>%
spread(year_month,'Number')
head(steel_final_2,10)
## Country_areas quant_value 201709 201710 201809 201810
## 1 Argentina finalquantity 13724 20635 12559 NA
## 2 Argentina finalvalue 16354 21591 18054 NA
## 3 Argentina prequantity NA NA 12559 10130
## 4 Argentina prevalue NA NA 18054 12041
## 5 Australia finalquantity 22087 21412 18775 NA
## 6 Australia finalvalue 17841 15981 17519 NA
## 7 Australia prequantity NA NA 18775 28739
## 8 Australia prevalue NA NA 17519 28244
## 9 Austria finalquantity 25836 17302 25621 NA
## 10 Austria finalvalue 41803 30929 35954 NA
# rowwise() makes sum() add across the four month columns within each row.
final_quantity = steel_final_2%>%
filter(quant_value =='finalquantity')%>%
rowwise()%>%
mutate(finalquantity_overall= sum(`201709`,`201710`,`201809`,`201810`,na.rm=TRUE))%>%
arrange(desc(finalquantity_overall))
head(final_quantity)
## # A tibble: 6 x 7
## Country_areas quant_value `201709` `201710` `201809` `201810`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Europe finalquant~ 804507 820708 675161 NA
## 2 North America finalquant~ 729999 743537 638543 NA
## 3 Pacific Rim ~ finalquant~ 691501 662335 418794 NA
## 4 Canada finalquant~ 435284 482878 380484 NA
## 5 European Uni~ finalquant~ 455216 435696 392240 NA
## 6 Euro Area finalquant~ 396011 358825 292800 NA
## # ... with 1 more variable: finalquantity_overall <dbl>
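The rowwise() sum can also be computed vectorised, which avoids per-row overhead; a sketch assuming dplyr >= 1.0 (final_quantity_alt is a hypothetical name):
# rowSums() adds across the four month columns without rowwise().
final_quantity_alt <- steel_final_2 %>%
  filter(quant_value == 'finalquantity') %>%
  mutate(finalquantity_overall = rowSums(across(`201709`:`201810`), na.rm = TRUE)) %>%
  arrange(desc(finalquantity_overall))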
# Repeat the aggregation for final value.
final_value = steel_final_2%>%
filter(quant_value =='finalvalue')%>%
rowwise()%>%
mutate(finalvalue_overall= sum(`201709`,`201710`,`201809`,`201810`,na.rm=TRUE))%>%
arrange(desc(finalvalue_overall))
head(final_value)
## # A tibble: 6 x 7
## Country_areas quant_value `201709` `201710` `201809` `201810`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Europe finalvalue 750333 794907 682885 NA
## 2 North America finalvalue 639544 665119 613821 NA
## 3 Pacific Rim ~ finalvalue 667787 657377 473571 NA
## 4 European Uni~ finalvalue 536615 563048 488745 NA
## 5 Euro Area finalvalue 447539 450147 364049 NA
## 6 Canada finalvalue 414721 454037 380790 NA
## # ... with 1 more variable: finalvalue_overall <dbl>
Europe has the highest overall quantity.
# Order the factor levels by overall quantity so the bars plot in rank order.
final_quantity$Country_areas <- factor(final_quantity$Country_areas, levels = final_quantity$Country_areas[order(final_quantity$finalquantity_overall)])
g=ggplot(final_quantity,aes(y=finalquantity_overall,x=Country_areas,label=Country_areas))
g+geom_bar(stat='identity', aes(fill=finalquantity_overall), width=.5)+
coord_flip()
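The manual releveling can be written more compactly with forcats, if that package is available; a hedged sketch (fq_plot is a hypothetical name):
library(forcats)
# fct_reorder() orders the factor levels by overall quantity in one call.
fq_plot <- final_quantity %>%
  mutate(Country_areas = fct_reorder(Country_areas, finalquantity_overall))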
Europe has the highest overall value.
final_value$Country_areas <- factor(final_value$Country_areas, levels = final_value$Country_areas[order(final_value$finalvalue_overall)])
g=ggplot(final_value,aes(y=finalvalue_overall,x=Country_areas,label=Country_areas))
g+geom_bar(stat='identity', aes(fill=finalvalue_overall), width=.5)+
coord_flip()
Brazil has the highest quantity-to-value ratio (the most tonnage per dollar, i.e., the lowest unit price), while Singapore has the lowest ratio (the highest unit price).
# Join by country so each quantity is paired with the matching value; the two
# frames were sorted by different columns, so binding them positionally (as a
# plain data.frame() call would) misaligns the rows.
unit_value = final_quantity %>%
  ungroup() %>%
  transmute(country = as.character(Country_areas), quantity = finalquantity_overall) %>%
  inner_join(transmute(ungroup(final_value),
                       country = as.character(Country_areas),
                       value = finalvalue_overall),
             by = 'country') %>%
  mutate(unit_value = quantity / value)   # tons per dollar: a higher ratio means cheaper steel
unit_value$country <- factor(unit_value$country, levels = unit_value$country[order(unit_value$unit_value)])
g1=ggplot(unit_value,aes(y=unit_value,x=country,label=country))
g1+geom_bar(stat='identity', aes(fill=unit_value), width=.5)+
coord_flip()
## Warning: Removed 1 rows containing missing values (position_stack).
What does the unit value distribution look like?
The unit value distribution is approximately normal with a slight left skew.
The average unit value is 0.9373.
ggplot(unit_value, aes(x=unit_value)) +
geom_histogram(aes(y = after_stat(density)),   # density scale so the overlay is visible (use ..density.. on ggplot2 < 3.3)
               binwidth = 0.1, color="darkblue", fill="lightblue")+
geom_density(alpha=.2, fill="#FF6666")
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing non-finite values (stat_density).
summary(unit_value)
## country quantity value unit_value
## Singapore : 1 Min. : 0 Min. : 0 Min. :0.2360
## Colombia : 1 1st Qu.: 27939 1st Qu.: 31471 1st Qu.:0.8046
## Hong Kong : 1 Median : 86679 Median : 104412 Median :0.9368
## Norway : 1 Mean : 377578 Mean : 361586 Mean :0.9373
## Poland : 1 3rd Qu.: 515234 3rd Qu.: 407678 3rd Qu.:1.0529
## United Kingdom: 1 Max. :2300376 Max. :2228125 Max. :1.4819
## (Other) :41 NA's :1
Finally, we compare the preliminary and final figures: quantity_diff is the final quantity minus the preliminary quantity, for rows where both were reported. Most quantity differences fall in the 0-100 bin; the distribution is right-skewed.
diff = steel_final_1%>%
mutate(quantity_diff = finalquantity - prequantity,
value_diff = finalvalue - prevalue)%>%
filter(!is.na(quantity_diff))
ggplot(diff, aes(x=quantity_diff)) +
geom_histogram(aes(y = after_stat(density)),   # density scale so the overlay is visible (use ..density.. on ggplot2 < 3.3)
               binwidth = 100, color="darkblue", fill="lightblue")+
geom_density(alpha=.2, fill="#FF6666")
Dataset 2: state-level median rental price per square foot. Load the CSV dataset into R.
url <- "https://raw.githubusercontent.com/Weicaidata/607/master/State_MedianRentalPricePerSqft_AllHomes.csv"
df1 <- read.csv(url,header=TRUE)
head(df1)
## RegionName SizeRank X2018.01 X2018.02 X2018.03 X2018.04 X2018.05
## 1 California 1 1.8674136 1.8846154 1.9001086 1.9235033 1.9478528
## 2 Texas 2 0.9056244 0.9134432 0.9237706 0.9344262 0.9387974
## 3 New York 3 4.2682584 4.2000000 3.9166667 4.1900000 4.3078412
## 4 Florida 4 1.3180646 1.3242778 1.3303769 1.3429752 1.3420245
## 5 Illinois 5 1.2173913 1.2288786 1.2594270 1.2901892 1.2925563
## 6 Pennsylvania 6 1.0256410 1.0351064 1.0568107 1.0817308 1.0890381
## X2018.06 X2018.07 X2018.08 X2018.09 X2018.10 X2018.11 X2018.12
## 1 1.956392 1.9519520 1.9389239 1.9215785 1.9117647 1.9245725 1.9344438
## 2 0.940000 0.9373203 0.9320449 0.9276438 0.9252537 0.9281437 0.9318022
## 3 4.042553 3.8974345 3.4742328 3.2467532 3.1003876 3.1424668 3.5211268
## 4 1.333333 1.3353116 1.3493800 1.3536748 1.3489209 1.3542038 1.3584906
## 5 1.299667 1.2840895 1.2759585 1.2681159 1.2653354 1.2568210 1.2536585
## 6 1.080456 1.0574495 1.0512691 1.0416667 1.0341616 1.0380623 1.0416667
Rename the columns for easier reading; the letter prefixes (aJan through lDec) keep the months in calendar order when sorted alphabetically.
colnames(df1) <- c("state","size","aJan","bFeb","cMar","dApr","eMay","fJun","gJul","hAug","iSep","jOct","kNov","lDec")
head(df1)
## state size aJan bFeb cMar dApr eMay
## 1 California 1 1.8674136 1.8846154 1.9001086 1.9235033 1.9478528
## 2 Texas 2 0.9056244 0.9134432 0.9237706 0.9344262 0.9387974
## 3 New York 3 4.2682584 4.2000000 3.9166667 4.1900000 4.3078412
## 4 Florida 4 1.3180646 1.3242778 1.3303769 1.3429752 1.3420245
## 5 Illinois 5 1.2173913 1.2288786 1.2594270 1.2901892 1.2925563
## 6 Pennsylvania 6 1.0256410 1.0351064 1.0568107 1.0817308 1.0890381
## fJun gJul hAug iSep jOct kNov lDec
## 1 1.956392 1.9519520 1.9389239 1.9215785 1.9117647 1.9245725 1.9344438
## 2 0.940000 0.9373203 0.9320449 0.9276438 0.9252537 0.9281437 0.9318022
## 3 4.042553 3.8974345 3.4742328 3.2467532 3.1003876 3.1424668 3.5211268
## 4 1.333333 1.3353116 1.3493800 1.3536748 1.3489209 1.3542038 1.3584906
## 5 1.299667 1.2840895 1.2759585 1.2681159 1.2653354 1.2568210 1.2536585
## 6 1.080456 1.0574495 1.0512691 1.0416667 1.0341616 1.0380623 1.0416667
Use the glimpse function from dplyr to examine the dataset, and the summary function to check for missing values.
glimpse(df1)
## Observations: 51
## Variables: 14
## $ state <fct> California, Texas, New York, Florida, Illinois, Pennsylv...
## $ size <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
## $ aJan <dbl> 1.8674136, 0.9056244, 4.2682584, 1.3180646, 1.2173913, 1...
## $ bFeb <dbl> 1.8846154, 0.9134432, 4.2000000, 1.3242778, 1.2288786, 1...
## $ cMar <dbl> 1.9001086, 0.9237706, 3.9166667, 1.3303769, 1.2594270, 1...
## $ dApr <dbl> 1.9235033, 0.9344262, 4.1900000, 1.3429752, 1.2901892, 1...
## $ eMay <dbl> 1.9478528, 0.9387974, 4.3078412, 1.3420245, 1.2925563, 1...
## $ fJun <dbl> 1.9563919, 0.9400000, 4.0425532, 1.3333333, 1.2996666, 1...
## $ gJul <dbl> 1.9519520, 0.9373203, 3.8974345, 1.3353116, 1.2840895, 1...
## $ hAug <dbl> 1.9389239, 0.9320449, 3.4742328, 1.3493800, 1.2759585, 1...
## $ iSep <dbl> 1.9215785, 0.9276438, 3.2467532, 1.3536748, 1.2681159, 1...
## $ jOct <dbl> 1.9117647, 0.9252537, 3.1003876, 1.3489209, 1.2653354, 1...
## $ kNov <dbl> 1.9245725, 0.9281437, 3.1424668, 1.3542038, 1.2568210, 1...
## $ lDec <dbl> 1.9344438, 0.9318022, 3.5211268, 1.3584906, 1.2536585, 1...
summary(df1)
## state size aJan bFeb
## Alabama : 1 Min. : 1.0 Min. :0.7045 Min. :0.7082
## Alaska : 1 1st Qu.:13.5 1st Qu.:0.8461 1st Qu.:0.8577
## Arizona : 1 Median :26.0 Median :0.9665 Median :0.9722
## Arkansas : 1 Mean :26.0 Mean :1.1827 Mean :1.1938
## California: 1 3rd Qu.:38.5 3rd Qu.:1.3010 3rd Qu.:1.2932
## Colorado : 1 Max. :51.0 Max. :4.2683 Max. :4.2000
## (Other) :45
## cMar dApr eMay fJun
## Min. :0.7186 Min. :0.7277 Min. :0.7246 Min. :0.7292
## 1st Qu.:0.8702 1st Qu.:0.8852 1st Qu.:0.8964 1st Qu.:0.8856
## Median :0.9900 Median :0.9958 Median :1.0000 Median :0.9997
## Mean :1.2019 Mean :1.2258 Mean :1.2326 Mean :1.2268
## 3rd Qu.:1.3188 3rd Qu.:1.3363 3rd Qu.:1.3373 3rd Qu.:1.3333
## Max. :3.9167 Max. :4.1900 Max. :4.3078 Max. :4.0426
##
## gJul hAug iSep jOct
## Min. :0.7283 Min. :0.7246 Min. :0.7252 Min. :0.7253
## 1st Qu.:0.8832 1st Qu.:0.8786 1st Qu.:0.8760 1st Qu.:0.8799
## Median :0.9908 Median :0.9793 Median :0.9759 Median :0.9912
## Mean :1.2148 Mean :1.1995 Mean :1.1896 Mean :1.1794
## 3rd Qu.:1.3189 3rd Qu.:1.3193 3rd Qu.:1.3106 3rd Qu.:1.3078
## Max. :3.8974 Max. :3.4742 Max. :3.2468 Max. :3.1004
##
## kNov lDec
## Min. :0.7333 Min. :0.7323
## 1st Qu.:0.8739 1st Qu.:0.8731
## Median :0.9866 Median :0.9927
## Mean :1.1789 Mean :1.1905
## 3rd Qu.:1.3087 3rd Qu.:1.3090
## Max. :3.1425 Max. :3.5211
##
Use the gather function from tidyr to reshape the wide dataset into long format, then sort by “size”.
df2 <- gather(df1,"month","sales",3:14) %>%
arrange(size)
head(df2)
## state size month sales
## 1 California 1 aJan 1.867414
## 2 California 1 bFeb 1.884615
## 3 California 1 cMar 1.900109
## 4 California 1 dApr 1.923503
## 5 California 1 eMay 1.947853
## 6 California 1 fJun 1.956392
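The letter prefixes exist only to force calendar order when sorting; an alternative is to recode month as a factor with readable labels, reusing the prefix codes defined above (df2_alt is a hypothetical name):
# Map the prefixed codes to plain month abbreviations, preserving order.
df2_alt <- df2 %>%
  mutate(month = factor(month,
                        levels = c("aJan","bFeb","cMar","dApr","eMay","fJun",
                                   "gJul","hAug","iSep","jOct","kNov","lDec"),
                        labels = month.abb))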
The histogram shows a right-skewed distribution, and the summary function reports the mean, minimum, maximum, and 1st and 3rd quartiles.
hist(df2$sales, main="Median rental price per sqft", xlab="Average price")
summary(df2$sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.7045 0.8753 0.9875 1.2013 1.3169 4.3078
Use the summarise function, grouped by month, to find the average price for each month.
df3 <- df2 %>%
arrange(month) %>%
group_by(month) %>%
summarise(avg=mean(sales))
df3
## # A tibble: 12 x 2
## month avg
## <chr> <dbl>
## 1 aJan 1.18
## 2 bFeb 1.19
## 3 cMar 1.20
## 4 dApr 1.23
## 5 eMay 1.23
## 6 fJun 1.23
## 7 gJul 1.21
## 8 hAug 1.20
## 9 iSep 1.19
## 10 jOct 1.18
## 11 kNov 1.18
## 12 lDec 1.19
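To pull out the peak and trough months programmatically rather than by eye, a one-line sketch:
# Rows with the highest and lowest monthly average price.
df3 %>% filter(avg == max(avg) | avg == min(avg))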
From the graph below, we can see that late spring and summer are the high season for rental prices and winter is the low season, which matches expectations.
ggplot(data=df3,aes(x=month,y=avg,group=1))+
geom_line()+
geom_point()+
xlab("Month")+
ylab("Total")+
labs(title="Sales trend by month")
Next we look at the five largest states’ price trends over the year.
df6 <- df2 %>%
filter(size<=5)
head(df6)
## state size month sales
## 1 California 1 aJan 1.867414
## 2 California 1 bFeb 1.884615
## 3 California 1 cMar 1.900109
## 4 California 1 dApr 1.923503
## 5 California 1 eMay 1.947853
## 6 California 1 fJun 1.956392
New York has the highest price per square foot of the five states, but prices in the other states are much steadier.
ggplot(df6,aes(x=month,y=sales,color=state,group=state))+
geom_line() +
xlab('Month')+labs(title = "Top 5 states sales trend")
Dataset 3: Hass Avocado Board retail volume and price data.
Source: http://www.hassavocadoboard.com/retail/volume-and-price-data
Read the CSV file into R.
avocado <- read.csv("https://raw.githubusercontent.com/miachen410/DATA607/master/avocado.csv")
Take a look at the first 6 rows of the dataset
head(avocado)
## Date AveragePrice Total.Volume X4046 X4225 X4770
## 1 2018-12-02 1.22 27652426 9516477 7462866 521910.0
## 2 2018-11-25 1.25 21855494 7140335 6021450 402865.8
## 3 2018-11-18 1.14 29871135 9719229 7271528 497404.3
## 4 2018-11-11 1.00 39042283 13355864 10385141 603032.8
## 5 2018-11-04 1.01 38363283 12420772 10192035 528277.9
## 6 2018-10-28 1.01 39402227 12142265 10845124 549490.8
## Total.Bags Small.Bags Large.Bags XLarge.Bags
## 1 10151173 7174281 2895343 81547.92
## 2 8290842 6016143 2183011 91688.68
## 3 12382973 9174635 3103023 105315.78
## 4 14698246 10721367 3865596 111282.12
## 5 15222197 10602849 4538743 80604.81
## 6 15865346 11437563 4320345 107438.89
Take a look at the last 6 rows of the dataset
tail(avocado)
## Date AveragePrice Total.Volume X4046 X4225 X4770
## 39 2018-02-11 0.97 43167806 15870678 11541845 652856.6
## 40 2018-02-04 0.87 62505647 21620181 20445501 1066830.2
## 41 2018-01-28 1.09 40171641 14551800 12119885 575974.7
## 42 2018-01-21 1.08 42939822 14218844 13929702 928815.1
## 43 2018-01-14 1.20 37299945 12600918 11866198 652808.4
## 44 2018-01-07 1.13 36703157 13730993 10781339 677714.9
## Total.Bags Small.Bags Large.Bags XLarge.Bags
## 39 15102427 10844852 4023485 234089.7
## 40 19373134 13384587 5719097 269451.0
## 41 12923982 9749412 3041125 133444.4
## 42 13862460 9866218 3789723 206519.3
## 43 12180021 8128242 3917570 134208.9
## 44 11513110 8231766 3130919 150424.6
Let’s look at the structure of the dataset
str(avocado)
## 'data.frame': 44 obs. of 10 variables:
## $ Date : Factor w/ 44 levels "2018-01-07","2018-01-14",..: 44 43 42 41 40 39 38 37 36 35 ...
## $ AveragePrice: num 1.22 1.25 1.14 1 1.01 1.01 1.13 1.08 1.08 1.1 ...
## $ Total.Volume: num 27652426 21855494 29871135 39042283 38363283 ...
## $ X4046 : num 9516477 7140335 9719229 13355864 12420772 ...
## $ X4225 : num 7462866 6021450 7271528 10385141 10192035 ...
## $ X4770 : num 521910 402866 497404 603033 528278 ...
## $ Total.Bags : num 10151173 8290842 12382973 14698246 15222197 ...
## $ Small.Bags : num 7174281 6016143 9174635 10721367 10602849 ...
## $ Large.Bags : num 2895343 2183011 3103023 3865596 4538743 ...
## $ XLarge.Bags : num 81548 91689 105316 111282 80605 ...
Let’s look at the summary of the dataset
summary(avocado)
## Date AveragePrice Total.Volume X4046
## 2018-01-07: 1 Min. :0.870 Min. :21855494 Min. : 7140335
## 2018-01-14: 1 1st Qu.:1.030 1st Qu.:37192349 1st Qu.:11872406
## 2018-01-21: 1 Median :1.070 Median :40738473 Median :13719191
## 2018-01-28: 1 Mean :1.083 Mean :41058219 Mean :13527481
## 2018-02-04: 1 3rd Qu.:1.130 3rd Qu.:44350184 3rd Qu.:14710288
## 2018-02-11: 1 Max. :1.290 Max. :63716144 Max. :21620181
## (Other) :38
## X4225 X4770 Total.Bags
## Min. : 6021450 Min. : 402866 Min. : 8290842
## 1st Qu.:10407883 1st Qu.: 634174 1st Qu.:13672216
## Median :11588285 Median : 734278 Median :15212227
## Mean :11758540 Mean : 744075 Mean :15028123
## 3rd Qu.:12711307 3rd Qu.: 803143 3rd Qu.:16247283
## Max. :20445501 Max. :1169256 Max. :21625373
##
## Small.Bags Large.Bags XLarge.Bags
## Min. : 6016143 Min. :2183011 Min. : 80605
## 1st Qu.: 9845629 1st Qu.:3500838 1st Qu.:199184
## Median :11045183 Median :3929403 Median :267982
## Mean :10800320 Mean :3986979 Mean :240824
## 3rd Qu.:11709426 3rd Qu.:4470543 3rd Qu.:298018
## Max. :15436247 Max. :5840415 Max. :420065
##
avocado2 <- avocado %>% gather(ProductCode, Volume, -Date:-Total.Volume, -Total.Bags:-XLarge.Bags)
# New columns ProductCode and Volume, with values gathered from X4046, X4225 and X4770; all other columns are kept as identifiers
head(avocado2)
## Date AveragePrice Total.Volume Total.Bags Small.Bags Large.Bags
## 1 2018-12-02 1.22 27652426 10151173 7174281 2895343
## 2 2018-11-25 1.25 21855494 8290842 6016143 2183011
## 3 2018-11-18 1.14 29871135 12382973 9174635 3103023
## 4 2018-11-11 1.00 39042283 14698246 10721367 3865596
## 5 2018-11-04 1.01 38363283 15222197 10602849 4538743
## 6 2018-10-28 1.01 39402227 15865346 11437563 4320345
## XLarge.Bags ProductCode Volume
## 1 81547.92 X4046 9516477
## 2 91688.68 X4046 7140335
## 3 105315.78 X4046 9719229
## 4 111282.12 X4046 13355864
## 5 80604.81 X4046 12420772
## 6 107438.89 X4046 12142265
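For reference, the same step with tidyr’s newer pivot_longer(), naming the product columns explicitly; a sketch assuming tidyr >= 1.0.0 (avocado2_alt is a hypothetical name):
# Gather only the three product-code columns into ProductCode/Volume.
avocado2_alt <- avocado %>%
  pivot_longer(cols = c(X4046, X4225, X4770),
               names_to = "ProductCode", values_to = "Volume")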
# Gather the bag-size columns (Small/Large/XLarge) into BagSize and Bags the same way.
avocado3 <- avocado2 %>% gather(BagSize, Bags, -Date:-Total.Bags, -ProductCode, -Volume)
head(avocado3)
## Date AveragePrice Total.Volume Total.Bags ProductCode Volume
## 1 2018-12-02 1.22 27652426 10151173 X4046 9516477
## 2 2018-11-25 1.25 21855494 8290842 X4046 7140335
## 3 2018-11-18 1.14 29871135 12382973 X4046 9719229
## 4 2018-11-11 1.00 39042283 14698246 X4046 13355864
## 5 2018-11-04 1.01 38363283 15222197 X4046 12420772
## 6 2018-10-28 1.01 39402227 15865346 X4046 12142265
## BagSize Bags
## 1 Small.Bags 7174281
## 2 Small.Bags 6016143
## 3 Small.Bags 9174635
## 4 Small.Bags 10721367
## 5 Small.Bags 10602849
## 6 Small.Bags 11437563
Now we look at the structure again
str(avocado3)
## 'data.frame': 396 obs. of 8 variables:
## $ Date : Factor w/ 44 levels "2018-01-07","2018-01-14",..: 44 43 42 41 40 39 38 37 36 35 ...
## $ AveragePrice: num 1.22 1.25 1.14 1 1.01 1.01 1.13 1.08 1.08 1.1 ...
## $ Total.Volume: num 27652426 21855494 29871135 39042283 38363283 ...
## $ Total.Bags : num 10151173 8290842 12382973 14698246 15222197 ...
## $ ProductCode : chr "X4046" "X4046" "X4046" "X4046" ...
## $ Volume : num 9516477 7140335 9719229 13355864 12420772 ...
## $ BagSize : chr "Small.Bags" "Small.Bags" "Small.Bags" "Small.Bags" ...
## $ Bags : num 7174281 6016143 9174635 10721367 10602849 ...
Change the class of Date from “factor” to “Date” using the lubridate package.
avocado3$Date <- ymd(avocado3$Date)
class(avocado3$Date)
## [1] "Date"
Change ProductCode and BagSize from character to factors
avocado3 <- avocado3 %>% mutate_at(vars(ProductCode, BagSize), funs(as.factor))
class(avocado3$ProductCode)
## [1] "factor"
class(avocado3$BagSize)
## [1] "factor"
Let’s look at the distribution of the average avocado price: it is centered around 1.0 to 1.1.
hist(avocado3$AveragePrice)
A scatterplot of total volume against average price shows a decreasing relationship: the higher the volume, the lower the price.
plot(avocado3$Total.Volume, avocado3$AveragePrice)
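To put a number on that relationship, we can run a simple linear fit; a sketch (the slope only quantifies the downward trend visible in the plot):
# Fit price on volume; a negative slope confirms the inverse relationship.
fit <- lm(AveragePrice ~ Total.Volume, data = avocado3)
coef(fit)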
Average price changes throughout the year, with higher prices in January, September, and December.
avocado3 %>% select(Date, AveragePrice) %>% plot()
Now let’s see how total volume and average price vary across the months:
# For each month, sum the volume and average the price, arranged from high to low price.
# Note: avocado3 repeats each product row once per bag size, so sum(Volume) is
# three times the true total; the monthly ranking is unaffected.
avocado3 %>% group_by(Month = month(Date)) %>% summarize(TotalVolume = sum(Volume), AvgPrice = mean(AveragePrice)) %>% arrange(desc(AvgPrice))
## # A tibble: 12 x 3
## Month TotalVolume AvgPrice
## <dbl> <dbl> <dbl>
## 1 12 52503759. 1.22
## 2 9 337349911. 1.21
## 3 8 143750709. 1.17
## 4 1 319904974. 1.12
## 5 11 235613809. 1.1
## 6 10 267680249. 1.08
## 7 3 312853337. 1.06
## 8 6 348435216. 1.06
## 9 4 415004435. 1.06
## 10 7 254879204. 1.05
## 11 2 361425548. 0.995
## 12 5 386571538. 0.985
# Same summary, arranged from high to low volume.
avocado3 %>% group_by(Month = month(Date)) %>% summarize(TotalVolume = sum(Volume), AvgPrice = mean(AveragePrice)) %>% arrange(desc(TotalVolume))
## # A tibble: 12 x 3
## Month TotalVolume AvgPrice
## <dbl> <dbl> <dbl>
## 1 4 415004435. 1.06
## 2 5 386571538. 0.985
## 3 2 361425548. 0.995
## 4 6 348435216. 1.06
## 5 9 337349911. 1.21
## 6 1 319904974. 1.12
## 7 3 312853337. 1.06
## 8 10 267680249. 1.08
## 9 7 254879204. 1.05
## 10 11 235613809. 1.1
## 11 8 143750709. 1.17
## 12 12 52503759. 1.22
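To get volume totals without the bag-size triplication noted above, deduplicate the Date/ProductCode rows before summing; a sketch:
# Drop the bag-size duplication, then sum true volumes by month.
avocado3 %>%
  distinct(Date, ProductCode, Volume) %>%
  group_by(Month = month(Date)) %>%
  summarize(TotalVolume = sum(Volume)) %>%
  arrange(desc(TotalVolume))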
- In 2018, December had the highest average price and the lowest total volume; note, though, that December here contains only one reporting week (2018-12-02), so its volume total is not directly comparable to the full months.
- In 2018, May had the lowest average price and the second-highest volume, consistent with the inverse price-volume relationship seen above.