This data was downloaded from kaggle about historical data on avocado prices and sales volume in multiple US markets from 2015 to 2017.
Some relevant columns in the data :
- Date : The date of the observation
- AveragePrice : The average price of a single avocado in US dollar
- type : Conventional or organic
- year : The year of the observation
- region : The city or region of the observation
- Total.Volume : Total number of avocados sold
- X4046 : Total number of avocados with PLU 4046 sold
- X4225 : Total number of avocados with PLU 4225 sold
- X4770 : Total number of avocados with PLU 4770 sold
avo <- read.csv("data/avocado.csv")library(lubridate)##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
head(avo)## X Date AveragePrice Total.Volume X4046 X4225 X4770 Total.Bags
## 1 0 2015-12-27 1.33 64236.62 1036.74 54454.85 48.16 8696.87
## 2 1 2015-12-20 1.35 54876.98 674.28 44638.81 58.33 9505.56
## 3 2 2015-12-13 0.93 118220.22 794.70 109149.67 130.50 8145.35
## 4 3 2015-12-06 1.08 78992.15 1132.00 71976.41 72.58 5811.16
## 5 4 2015-11-29 1.28 51039.60 941.48 43838.39 75.78 6183.95
## 6 5 2015-11-22 1.26 55979.78 1184.27 48067.99 43.61 6683.91
## Small.Bags Large.Bags XLarge.Bags type year region
## 1 8603.62 93.25 0 conventional 2015 Albany
## 2 9408.07 97.49 0 conventional 2015 Albany
## 3 8042.21 103.14 0 conventional 2015 Albany
## 4 5677.40 133.76 0 conventional 2015 Albany
## 5 5986.26 197.69 0 conventional 2015 Albany
## 6 6556.47 127.44 0 conventional 2015 Albany
tail(avo)## X Date AveragePrice Total.Volume X4046 X4225 X4770
## 16323 47 2017-02-05 1.15 227693.3 35703.61 46107.73 71.74
## 16324 48 2017-01-29 1.29 205843.2 38850.43 44464.74 109.61
## 16325 49 2017-01-22 1.20 208810.3 32006.16 45982.34 70.40
## 16326 50 2017-01-15 1.06 268743.1 43816.58 41834.41 471.92
## 16327 51 2017-01-08 1.09 267703.0 29794.90 45439.29 125.69
## 16328 52 2017-01-01 1.19 189700.7 20783.06 35758.45 36.46
## Total.Bags Small.Bags Large.Bags XLarge.Bags type year region
## 16323 145810.2 41936.54 103873.71 0 organic 2017 West
## 16324 122418.4 37767.77 84650.62 0 organic 2017 West
## 16325 130751.4 45987.98 84763.43 0 organic 2017 West
## 16326 182620.2 62810.00 119810.20 0 organic 2017 West
## 16327 192343.1 41364.31 150978.80 0 organic 2017 West
## 16328 133122.7 33998.50 99124.23 0 organic 2017 West
dim(avo)## [1] 16328 14
names(avo)## [1] "X" "Date" "AveragePrice" "Total.Volume" "X4046"
## [6] "X4225" "X4770" "Total.Bags" "Small.Bags" "Large.Bags"
## [11] "XLarge.Bags" "type" "year" "region"
Conclusion :
- The data contain 16328 rows and 14 columns
- Each of column name : “X”, “Date”, “AveragePrice”, “Total.Volume” “X4046”, “X4225”, “X4770”, “Total.Bags”, “Small.Bags”, “Large.Bags”, “XLarge.Bags”, “type”, “year”, “region”
To check if there is any missing value from the data.
anyNA(avo)## [1] FALSE
colSums(is.na(avo))## X Date AveragePrice Total.Volume X4046 X4225
## 0 0 0 0 0 0
## X4770 Total.Bags Small.Bags Large.Bags XLarge.Bags type
## 0 0 0 0 0 0
## year region
## 0 0
Conclusion : There is no missing value from the data
Check data type for each column
str(avo)## 'data.frame': 16328 obs. of 14 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Date : chr "2015-12-27" "2015-12-20" "2015-12-13" "2015-12-06" ...
## $ AveragePrice: num 1.33 1.35 0.93 1.08 1.28 1.26 0.99 0.98 1.02 1.07 ...
## $ Total.Volume: num 64237 54877 118220 78992 51040 ...
## $ X4046 : num 1037 674 795 1132 941 ...
## $ X4225 : num 54455 44639 109150 71976 43838 ...
## $ X4770 : num 48.2 58.3 130.5 72.6 75.8 ...
## $ Total.Bags : num 8697 9506 8145 5811 6184 ...
## $ Small.Bags : num 8604 9408 8042 5677 5986 ...
## $ Large.Bags : num 93.2 97.5 103.1 133.8 197.7 ...
## $ XLarge.Bags : num 0 0 0 0 0 0 0 0 0 0 ...
## $ type : chr "conventional" "conventional" "conventional" "conventional" ...
## $ year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ region : chr "Albany" "Albany" "Albany" "Albany" ...
summary(avo)## X Date AveragePrice Total.Volume
## Min. : 0.00 Length:16328 Min. :0.440 Min. : 85
## 1st Qu.:13.00 Class :character 1st Qu.:1.100 1st Qu.: 10088
## Median :26.00 Mode :character Median :1.380 Median : 97536
## Mean :25.67 Mean :1.415 Mean : 530796
## 3rd Qu.:39.00 3rd Qu.:1.680 3rd Qu.: 378850
## Max. :52.00 Max. :3.250 Max. :11274749
## X4046 X4225 X4770 Total.Bags
## Min. : 0 Min. : 0 Min. : 0.0 Min. : 0
## 1st Qu.: 790 1st Qu.: 2871 1st Qu.: 0.0 1st Qu.: 4343
## Median : 7482 Median : 26717 Median : 162.9 Median : 36255
## Mean : 179505 Mean : 189540 Mean : 14905.6 Mean : 146845
## 3rd Qu.: 97342 3rd Qu.: 133182 3rd Qu.: 5684.6 3rd Qu.: 97129
## Max. :5160897 Max. :5246830 Max. :804558.2 Max. :4145407
## Small.Bags Large.Bags XLarge.Bags type
## Min. : 0 Min. : 0.0 Min. : 0.00 Length:16328
## 1st Qu.: 2262 1st Qu.: 99.2 1st Qu.: 0.00 Class :character
## Median : 23560 Median : 2276.2 Median : 0.00 Mode :character
## Mean : 112607 Mean : 32368.1 Mean : 1869.10
## 3rd Qu.: 74724 3rd Qu.: 17593.4 3rd Qu.: 84.78
## Max. :3403581 Max. :1627453.1 Max. :130892.80
## year region
## Min. :2015 Length:16328
## 1st Qu.:2015 Class :character
## Median :2016 Mode :character
## Mean :2016
## 3rd Qu.:2017
## Max. :2017
avo$Date <- as.Date(avo$Date)
avo$type <- as.factor(avo$type)
avo$region <- as.factor(avo$region)
str(avo)## 'data.frame': 16328 obs. of 14 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Date : Date, format: "2015-12-27" "2015-12-20" ...
## $ AveragePrice: num 1.33 1.35 0.93 1.08 1.28 1.26 0.99 0.98 1.02 1.07 ...
## $ Total.Volume: num 64237 54877 118220 78992 51040 ...
## $ X4046 : num 1037 674 795 1132 941 ...
## $ X4225 : num 54455 44639 109150 71976 43838 ...
## $ X4770 : num 48.2 58.3 130.5 72.6 75.8 ...
## $ Total.Bags : num 8697 9506 8145 5811 6184 ...
## $ Small.Bags : num 8604 9408 8042 5677 5986 ...
## $ Large.Bags : num 93.2 97.5 103.1 133.8 197.7 ...
## $ XLarge.Bags : num 0 0 0 0 0 0 0 0 0 0 ...
## $ type : Factor w/ 2 levels "conventional",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ region : Factor w/ 52 levels "Albany","Atlanta",..: 1 1 1 1 1 1 1 1 1 1 ...
Subset to delete the information that we don’t need, and save it to avo_dt to be processed and analyzed
avo_dt <- avo[, c("Date", "AveragePrice", "Total.Volume", "type", "year", "region")]
avo_dt$Total.Revenue <- avo_dt$AveragePrice * avo_dt$Total.Volume
avo_dt$month <- month(avo_dt$Date, label = T)
head(avo_dt)## Date AveragePrice Total.Volume type year region Total.Revenue
## 1 2015-12-27 1.33 64236.62 conventional 2015 Albany 85434.70
## 2 2015-12-20 1.35 54876.98 conventional 2015 Albany 74083.92
## 3 2015-12-13 0.93 118220.22 conventional 2015 Albany 109944.80
## 4 2015-12-06 1.08 78992.15 conventional 2015 Albany 85311.52
## 5 2015-11-29 1.28 51039.60 conventional 2015 Albany 65330.69
## 6 2015-11-22 1.26 55979.78 conventional 2015 Albany 70534.52
## month
## 1 Dec
## 2 Dec
## 3 Dec
## 4 Dec
## 5 Nov
## 6 Nov
summary(avo_dt)## Date AveragePrice Total.Volume type
## Min. :2015-01-04 Min. :0.440 Min. : 85 conventional:8164
## 1st Qu.:2015-10-04 1st Qu.:1.100 1st Qu.: 10088 organic :8164
## Median :2016-07-03 Median :1.380 Median : 97536
## Mean :2016-07-03 Mean :1.415 Mean : 530796
## 3rd Qu.:2017-04-02 3rd Qu.:1.680 3rd Qu.: 378850
## Max. :2017-12-31 Max. :3.250 Max. :11274749
##
## year region Total.Revenue month
## Min. :2015 Albany : 314 Min. : 134 Jan :1456
## 1st Qu.:2015 Atlanta : 314 1st Qu.: 15852 May :1456
## Median :2016 BaltimoreWashington: 314 Median : 126622 Jul :1456
## Mean :2016 Boise : 314 Mean : 582529 Oct :1456
## 3rd Qu.:2017 Boston : 314 3rd Qu.: 457222 Mar :1352
## Max. :2017 BuffaloRochester : 314 Max. :8501979 Apr :1352
## (Other) :14444 (Other):7800
Summary :
1. First avocado sales occurred in 04 January 2015
2. The highest average price per avocado is USD 3.250
3. The lowest average price per avocado is USD 0.440
4. The highest total revenue is USD 8501979
5. The lowest total revenue is USD 134
6. There is two type of cultivating avocado that is organic and conventional
Checking outlier with AveragePrice
mean(avo_dt$AveragePrice)## [1] 1.415005
var(avo_dt$AveragePrice)## [1] 0.1663989
sd(avo_dt$AveragePrice)## [1] 0.4079203
boxplot(avo_dt$AveragePrice, main = "Average Price per Avocado")Conclusion : The mean of Average Price is 1.42 with standard deviation 0.41
avo_dt[avo_dt$Total.Volume == max(avo_dt$Total.Volume), ]## Date AveragePrice Total.Volume type year region
## 8159 2017-02-05 0.66 11274749 conventional 2017 West
## Total.Revenue month
## 8159 7441334 Feb
Conclusion : The highest total volume sales of avocado is at 05 February 2017
vol_year <- aggregate(Total.Volume ~ year, avo_dt, sum)
vol_year## year Total.Volume
## 1 2015 2686689443
## 2 2016 2957923317
## 3 2017 3022222547
Conclusion : The total volume of the avocado sold in 2015 is 2,686,689,443, in 2016 is 2,957,923,317, and in 2017 is 3,022,222,547
xtabs(Total.Revenue ~ type, avo_dt)## type
## conventional organic
## 9134978991 376560149
Conclusion : Avocado with the type of conventional makes more revenue than the avocado with the type of organic
aggregate(Total.Revenue ~ month + year, avo_dt, mean)## month year Total.Revenue
## 1 Jan 2015 478157.4
## 2 Feb 2015 522174.6
## 3 Mar 2015 508196.3
## 4 Apr 2015 537803.2
## 5 May 2015 588785.5
## 6 Jun 2015 577700.2
## 7 Jul 2015 566627.8
## 8 Aug 2015 535415.8
## 9 Sep 2015 505327.8
## 10 Oct 2015 466238.9
## 11 Nov 2015 426637.3
## 12 Dec 2015 413159.6
## 13 Jan 2016 524637.3
## 14 Feb 2016 551841.5
## 15 Mar 2016 547863.8
## 16 Apr 2016 542997.4
## 17 May 2016 593959.4
## 18 Jun 2016 624552.8
## 19 Jul 2016 638773.8
## 20 Aug 2016 617045.9
## 21 Sep 2016 606864.2
## 22 Oct 2016 563493.3
## 23 Nov 2016 513202.1
## 24 Dec 2016 488799.0
## 25 Jan 2017 607579.0
## 26 Feb 2017 628805.1
## 27 Mar 2017 646408.5
## 28 Apr 2017 705734.5
## 29 May 2017 763498.8
## 30 Jun 2017 738026.1
## 31 Jul 2017 717495.8
## 32 Aug 2017 702085.3
## 33 Sep 2017 674593.0
## 34 Oct 2017 655251.3
## 35 Nov 2017 580539.4
## 36 Dec 2017 590012.4
Conclusion : The average of the total revenue is being showed in the table above.
region_sum <- aggregate(Total.Revenue ~ region + year, avo_dt, sum)
reg_most <- region_sum[order(-region_sum$Total.Revenue), ]
reg_most[1:5,]## region year Total.Revenue
## 111 California 2017 386774783
## 156 West 2017 370051840
## 134 Northeast 2017 340067375
## 59 California 2016 335212658
## 104 West 2016 313204902
Conclusion : The region that has the most revenue is California in 2017
cor(avo_dt$Total.Volume, avo_dt$Total.Revenue) ## [1] 0.9692353
plot(avo_dt$Total.Volume, avo_dt$Total.Revenue, main = "Correlation Between The Total Volume Sales With The Total Revenue", xlab = "Total Volume Sales", ylab = "Total Revenue") Conclusion : There is a strong positive correlation between the total volume sales of avocado and the total revenue
sum(avo_dt$Total.Volume)## [1] 8666835307
sum(avo_dt$Total.Revenue)## [1] 9511539140
Conclusion : The total volume sales of avocado sold is 8,666,835,307 with total revenue of USD 9,511,539,140
Since 2015 to 2017, the total volume sales of avocado sold is 8,666,835,307 with total revenue of USD 9,511,539,140. In 2017, California is the region that has the most total revenue. The total volume sales of avocado is increasing every year. The total volume sales of avocado with conventional type is higher than the total volume sales of avocado with organic type. The highest average total revenue of all observation region is on May 2017.