1 Explanation

This data was downloaded from kaggle about historical data on avocado prices and sales volume in multiple US markets from 2015 to 2017.

Some relevant columns in the data :
- Date : The date of the observation
- AveragePrice : The average price of a single avocado in US dollar
- type : Conventional or organic
- year : The year of the observation
- region : The city or region of the observation
- Total.Volume : Total number of avocados sold
- X4046 : Total number of avocados with PLU 4046 sold
- X4225 : Total number of avocados with PLU 4225 sold
- X4770 : Total number of avocados with PLU 4770 sold

1.1 Input Data

avo <- read.csv("data/avocado.csv")

1.1.1 Attaching Packages

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

1.2 Data Inspection

head(avo)
##   X       Date AveragePrice Total.Volume   X4046     X4225  X4770 Total.Bags
## 1 0 2015-12-27         1.33     64236.62 1036.74  54454.85  48.16    8696.87
## 2 1 2015-12-20         1.35     54876.98  674.28  44638.81  58.33    9505.56
## 3 2 2015-12-13         0.93    118220.22  794.70 109149.67 130.50    8145.35
## 4 3 2015-12-06         1.08     78992.15 1132.00  71976.41  72.58    5811.16
## 5 4 2015-11-29         1.28     51039.60  941.48  43838.39  75.78    6183.95
## 6 5 2015-11-22         1.26     55979.78 1184.27  48067.99  43.61    6683.91
##   Small.Bags Large.Bags XLarge.Bags         type year region
## 1    8603.62      93.25           0 conventional 2015 Albany
## 2    9408.07      97.49           0 conventional 2015 Albany
## 3    8042.21     103.14           0 conventional 2015 Albany
## 4    5677.40     133.76           0 conventional 2015 Albany
## 5    5986.26     197.69           0 conventional 2015 Albany
## 6    6556.47     127.44           0 conventional 2015 Albany
tail(avo)
##        X       Date AveragePrice Total.Volume    X4046    X4225  X4770
## 16323 47 2017-02-05         1.15     227693.3 35703.61 46107.73  71.74
## 16324 48 2017-01-29         1.29     205843.2 38850.43 44464.74 109.61
## 16325 49 2017-01-22         1.20     208810.3 32006.16 45982.34  70.40
## 16326 50 2017-01-15         1.06     268743.1 43816.58 41834.41 471.92
## 16327 51 2017-01-08         1.09     267703.0 29794.90 45439.29 125.69
## 16328 52 2017-01-01         1.19     189700.7 20783.06 35758.45  36.46
##       Total.Bags Small.Bags Large.Bags XLarge.Bags    type year region
## 16323   145810.2   41936.54  103873.71           0 organic 2017   West
## 16324   122418.4   37767.77   84650.62           0 organic 2017   West
## 16325   130751.4   45987.98   84763.43           0 organic 2017   West
## 16326   182620.2   62810.00  119810.20           0 organic 2017   West
## 16327   192343.1   41364.31  150978.80           0 organic 2017   West
## 16328   133122.7   33998.50   99124.23           0 organic 2017   West
dim(avo)
## [1] 16328    14
names(avo)
##  [1] "X"            "Date"         "AveragePrice" "Total.Volume" "X4046"       
##  [6] "X4225"        "X4770"        "Total.Bags"   "Small.Bags"   "Large.Bags"  
## [11] "XLarge.Bags"  "type"         "year"         "region"

Conclusion :
- The data contain 16328 rows and 14 columns
- Each of column name : “X”, “Date”, “AveragePrice”, “Total.Volume” “X4046”, “X4225”, “X4770”, “Total.Bags”, “Small.Bags”, “Large.Bags”, “XLarge.Bags”, “type”, “year”, “region”

1.3 Data Cleansing

To check if there is any missing value from the data.

anyNA(avo)
## [1] FALSE
colSums(is.na(avo))
##            X         Date AveragePrice Total.Volume        X4046        X4225 
##            0            0            0            0            0            0 
##        X4770   Total.Bags   Small.Bags   Large.Bags  XLarge.Bags         type 
##            0            0            0            0            0            0 
##         year       region 
##            0            0

Conclusion : There is no missing value from the data

Check data type for each column

str(avo)
## 'data.frame':    16328 obs. of  14 variables:
##  $ X           : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Date        : chr  "2015-12-27" "2015-12-20" "2015-12-13" "2015-12-06" ...
##  $ AveragePrice: num  1.33 1.35 0.93 1.08 1.28 1.26 0.99 0.98 1.02 1.07 ...
##  $ Total.Volume: num  64237 54877 118220 78992 51040 ...
##  $ X4046       : num  1037 674 795 1132 941 ...
##  $ X4225       : num  54455 44639 109150 71976 43838 ...
##  $ X4770       : num  48.2 58.3 130.5 72.6 75.8 ...
##  $ Total.Bags  : num  8697 9506 8145 5811 6184 ...
##  $ Small.Bags  : num  8604 9408 8042 5677 5986 ...
##  $ Large.Bags  : num  93.2 97.5 103.1 133.8 197.7 ...
##  $ XLarge.Bags : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ type        : chr  "conventional" "conventional" "conventional" "conventional" ...
##  $ year        : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ region      : chr  "Albany" "Albany" "Albany" "Albany" ...
summary(avo)
##        X             Date            AveragePrice    Total.Volume     
##  Min.   : 0.00   Length:16328       Min.   :0.440   Min.   :      85  
##  1st Qu.:13.00   Class :character   1st Qu.:1.100   1st Qu.:   10088  
##  Median :26.00   Mode  :character   Median :1.380   Median :   97536  
##  Mean   :25.67                      Mean   :1.415   Mean   :  530796  
##  3rd Qu.:39.00                      3rd Qu.:1.680   3rd Qu.:  378850  
##  Max.   :52.00                      Max.   :3.250   Max.   :11274749  
##      X4046             X4225             X4770            Total.Bags     
##  Min.   :      0   Min.   :      0   Min.   :     0.0   Min.   :      0  
##  1st Qu.:    790   1st Qu.:   2871   1st Qu.:     0.0   1st Qu.:   4343  
##  Median :   7482   Median :  26717   Median :   162.9   Median :  36255  
##  Mean   : 179505   Mean   : 189540   Mean   : 14905.6   Mean   : 146845  
##  3rd Qu.:  97342   3rd Qu.: 133182   3rd Qu.:  5684.6   3rd Qu.:  97129  
##  Max.   :5160897   Max.   :5246830   Max.   :804558.2   Max.   :4145407  
##    Small.Bags        Large.Bags         XLarge.Bags            type          
##  Min.   :      0   Min.   :      0.0   Min.   :     0.00   Length:16328      
##  1st Qu.:   2262   1st Qu.:     99.2   1st Qu.:     0.00   Class :character  
##  Median :  23560   Median :   2276.2   Median :     0.00   Mode  :character  
##  Mean   : 112607   Mean   :  32368.1   Mean   :  1869.10                     
##  3rd Qu.:  74724   3rd Qu.:  17593.4   3rd Qu.:    84.78                     
##  Max.   :3403581   Max.   :1627453.1   Max.   :130892.80                     
##       year         region         
##  Min.   :2015   Length:16328      
##  1st Qu.:2015   Class :character  
##  Median :2016   Mode  :character  
##  Mean   :2016                     
##  3rd Qu.:2017                     
##  Max.   :2017
avo$Date <- as.Date(avo$Date)
avo$type <- as.factor(avo$type)
avo$region <- as.factor(avo$region)
str(avo)
## 'data.frame':    16328 obs. of  14 variables:
##  $ X           : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Date        : Date, format: "2015-12-27" "2015-12-20" ...
##  $ AveragePrice: num  1.33 1.35 0.93 1.08 1.28 1.26 0.99 0.98 1.02 1.07 ...
##  $ Total.Volume: num  64237 54877 118220 78992 51040 ...
##  $ X4046       : num  1037 674 795 1132 941 ...
##  $ X4225       : num  54455 44639 109150 71976 43838 ...
##  $ X4770       : num  48.2 58.3 130.5 72.6 75.8 ...
##  $ Total.Bags  : num  8697 9506 8145 5811 6184 ...
##  $ Small.Bags  : num  8604 9408 8042 5677 5986 ...
##  $ Large.Bags  : num  93.2 97.5 103.1 133.8 197.7 ...
##  $ XLarge.Bags : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ type        : Factor w/ 2 levels "conventional",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year        : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ region      : Factor w/ 52 levels "Albany","Atlanta",..: 1 1 1 1 1 1 1 1 1 1 ...

Subset to delete the information that we don’t need, and save it to avo_dt to be processed and analyzed

avo_dt <- avo[, c("Date", "AveragePrice", "Total.Volume", "type", "year", "region")]
avo_dt$Total.Revenue <- avo_dt$AveragePrice * avo_dt$Total.Volume
avo_dt$month <- month(avo_dt$Date, label = T)
head(avo_dt)
##         Date AveragePrice Total.Volume         type year region Total.Revenue
## 1 2015-12-27         1.33     64236.62 conventional 2015 Albany      85434.70
## 2 2015-12-20         1.35     54876.98 conventional 2015 Albany      74083.92
## 3 2015-12-13         0.93    118220.22 conventional 2015 Albany     109944.80
## 4 2015-12-06         1.08     78992.15 conventional 2015 Albany      85311.52
## 5 2015-11-29         1.28     51039.60 conventional 2015 Albany      65330.69
## 6 2015-11-22         1.26     55979.78 conventional 2015 Albany      70534.52
##   month
## 1   Dec
## 2   Dec
## 3   Dec
## 4   Dec
## 5   Nov
## 6   Nov

2 Data Explanation

summary(avo_dt)
##       Date             AveragePrice    Total.Volume                type     
##  Min.   :2015-01-04   Min.   :0.440   Min.   :      85   conventional:8164  
##  1st Qu.:2015-10-04   1st Qu.:1.100   1st Qu.:   10088   organic     :8164  
##  Median :2016-07-03   Median :1.380   Median :   97536                      
##  Mean   :2016-07-03   Mean   :1.415   Mean   :  530796                      
##  3rd Qu.:2017-04-02   3rd Qu.:1.680   3rd Qu.:  378850                      
##  Max.   :2017-12-31   Max.   :3.250   Max.   :11274749                      
##                                                                             
##       year                      region      Total.Revenue         month     
##  Min.   :2015   Albany             :  314   Min.   :    134   Jan    :1456  
##  1st Qu.:2015   Atlanta            :  314   1st Qu.:  15852   May    :1456  
##  Median :2016   BaltimoreWashington:  314   Median : 126622   Jul    :1456  
##  Mean   :2016   Boise              :  314   Mean   : 582529   Oct    :1456  
##  3rd Qu.:2017   Boston             :  314   3rd Qu.: 457222   Mar    :1352  
##  Max.   :2017   BuffaloRochester   :  314   Max.   :8501979   Apr    :1352  
##                 (Other)            :14444                     (Other):7800

Summary :
1. First avocado sales occurred in 04 January 2015
2. The highest average price per avocado is USD 3.250
3. The lowest average price per avocado is USD 0.440
4. The highest total revenue is USD 8501979
5. The lowest total revenue is USD 134
6. There is two type of cultivating avocado that is organic and conventional

Checking outlier with AveragePrice

mean(avo_dt$AveragePrice)
## [1] 1.415005
var(avo_dt$AveragePrice)
## [1] 0.1663989
sd(avo_dt$AveragePrice)
## [1] 0.4079203
boxplot(avo_dt$AveragePrice, main = "Average Price per Avocado")

Conclusion : The mean of Average Price is 1.42 with standard deviation 0.41

3 Data Manipulation & Transformation

3.1 When is the highest total volume sales of avocado?

avo_dt[avo_dt$Total.Volume == max(avo_dt$Total.Volume), ]
##            Date AveragePrice Total.Volume         type year region
## 8159 2017-02-05         0.66     11274749 conventional 2017   West
##      Total.Revenue month
## 8159       7441334   Feb

Conclusion : The highest total volume sales of avocado is at 05 February 2017

3.2 What is the total volume sales of avocado every year?

vol_year <- aggregate(Total.Volume ~ year, avo_dt, sum)
vol_year
##   year Total.Volume
## 1 2015   2686689443
## 2 2016   2957923317
## 3 2017   3022222547

Conclusion : The total volume of the avocado sold in 2015 is 2,686,689,443, in 2016 is 2,957,923,317, and in 2017 is 3,022,222,547

3.4 Total revenue based on type

xtabs(Total.Revenue ~ type, avo_dt)
## type
## conventional      organic 
##   9134978991    376560149

Conclusion : Avocado with the type of conventional makes more revenue than the avocado with the type of organic

3.5 How much is the average of total revenue per month in 2015 to 2017

aggregate(Total.Revenue ~ month + year, avo_dt, mean)
##    month year Total.Revenue
## 1    Jan 2015      478157.4
## 2    Feb 2015      522174.6
## 3    Mar 2015      508196.3
## 4    Apr 2015      537803.2
## 5    May 2015      588785.5
## 6    Jun 2015      577700.2
## 7    Jul 2015      566627.8
## 8    Aug 2015      535415.8
## 9    Sep 2015      505327.8
## 10   Oct 2015      466238.9
## 11   Nov 2015      426637.3
## 12   Dec 2015      413159.6
## 13   Jan 2016      524637.3
## 14   Feb 2016      551841.5
## 15   Mar 2016      547863.8
## 16   Apr 2016      542997.4
## 17   May 2016      593959.4
## 18   Jun 2016      624552.8
## 19   Jul 2016      638773.8
## 20   Aug 2016      617045.9
## 21   Sep 2016      606864.2
## 22   Oct 2016      563493.3
## 23   Nov 2016      513202.1
## 24   Dec 2016      488799.0
## 25   Jan 2017      607579.0
## 26   Feb 2017      628805.1
## 27   Mar 2017      646408.5
## 28   Apr 2017      705734.5
## 29   May 2017      763498.8
## 30   Jun 2017      738026.1
## 31   Jul 2017      717495.8
## 32   Aug 2017      702085.3
## 33   Sep 2017      674593.0
## 34   Oct 2017      655251.3
## 35   Nov 2017      580539.4
## 36   Dec 2017      590012.4

Conclusion : The average of the total revenue is being showed in the table above.

3.7 Which region that has the most total revenue and what year is it?

region_sum <- aggregate(Total.Revenue ~ region + year, avo_dt, sum)
reg_most <- region_sum[order(-region_sum$Total.Revenue), ]
reg_most[1:5,]
##         region year Total.Revenue
## 111 California 2017     386774783
## 156       West 2017     370051840
## 134  Northeast 2017     340067375
## 59  California 2016     335212658
## 104       West 2016     313204902

Conclusion : The region that has the most revenue is California in 2017

3.8 Is there any correlation between the total volume sales of avocado and the total revenue?

cor(avo_dt$Total.Volume, avo_dt$Total.Revenue) 
## [1] 0.9692353
plot(avo_dt$Total.Volume, avo_dt$Total.Revenue, main = "Correlation Between The Total Volume Sales With The Total Revenue", xlab = "Total Volume Sales", ylab = "Total Revenue") 

Conclusion : There is a strong positive correlation between the total volume sales of avocado and the total revenue

3.9 How much total avocado sold and the total revenue in 2015 to 2017?

sum(avo_dt$Total.Volume)
## [1] 8666835307
sum(avo_dt$Total.Revenue)
## [1] 9511539140

Conclusion : The total volume sales of avocado sold is 8,666,835,307 with total revenue of USD 9,511,539,140

4 Explanatory Text

Since 2015 to 2017, the total volume sales of avocado sold is 8,666,835,307 with total revenue of USD 9,511,539,140. In 2017, California is the region that has the most total revenue. The total volume sales of avocado is increasing every year. The total volume sales of avocado with conventional type is higher than the total volume sales of avocado with organic type. The highest average total revenue of all observation region is on May 2017.