Introduction

It is a popular and reasonable belief that avocados are a premium commodity for retail, hospitality, and related industries, as well as for the direct consumer. Americans cannot get enough avocados whenever they visit a Chipotle, Torchy’s Tacos, or any other franchise that offers them. What is not known, however, is whether avocado prices differ by region and whether those differences will persist in the future. We also want to predict whether we would wish to stay in the same city that we originally chose.

Specifically, we want to know whether it is possible to estimate the price of an avocado based on our limited data.

First, we will look at certain geographic factors.

Next, we will look at the seasonal variation.

Then we will attempt to combine the two.

Finally, we will run decision tree models for each pairing of avocado type and region (8 models total) and benchmark them against the overall average of both the Conventional and Organic Avocado types.

This analysis can be used by the consumer to decide where and when they may want to take their next vacation if their diet consists primarily of avocados.

Packages Required

library(tidyverse) #For data cleaning, preparation and exploration
library(ggplot2) #For data exploration
library(class) #For classification functions (e.g., knn)
library(tree) #For decision tree
library(forecast) #For time series analysis

NOTE: This report was produced with the latest versions of R and RStudio available at the time.

> version
               _
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status
major          3
minor          6.1
year           2019
month          07
day            05
svn rev        76782
language       R
version.string R version 3.6.1 (2019-07-05)
nickname       Action of the Toes

Data Preparation

Source: https://www.kaggle.com/neuromusic/avocado-prices

According to the source website, the Hass Avocado Board (HAB) is the only avocado organization that equips the entire global industry for success by collecting, focusing and distributing investments to maintain and expand demand for avocados in the United States. HAB collected the data from weekly retail scans of national retail volume (units) and price. The data come directly from retailers’ cash registers and are based on actual retail sales of Hass avocados. To answer our questions, we will need to encode a dummy variable and condense the regions into a new feature.

New Feature regionCondensed: The regions have been condensed into four primary regions as a more efficient means to run models and get a more general sense of where the cheaper avocados might be located. This process was done manually.
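
The manual condensing step could also be written as a lookup table. The sketch below is only illustrative: the `region_lookup` vector and the four city-to-region assignments in it are assumptions chosen for the example, not the full mapping used in this report.

```r
# Hypothetical lookup: names are values of `region`, values are condensed regions.
# Only a few illustrative entries are shown; the report's full mapping was built by hand.
region_lookup <- c(
  Boston     = "Northeast",
  Chicago    = "Midwest",
  Houston    = "South",
  LosAngeles = "West"
)

# Toy data standing in for the `region` column of the dataset
toy <- data.frame(region = c("Houston", "Boston", "LosAngeles", "Chicago", "Houston"),
                  stringsAsFactors = FALSE)

# Index the lookup by region name; unname() drops the carried-over names
toy$regionCondensed <- factor(unname(region_lookup[toy$region]))

print(table(toy$regionCondensed))
```

A lookup like this keeps the assignment of each city in one place, which makes a manual mapping easier to review.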

Dummy Coding for Port Access: The data does not include a feature for our point of interest. Thus, we need to create a dummy variable that shows whether a city (labeled ‘region’ in the dataset) has ocean port access. The rule is that the city must have an ocean port or be within 100 miles of the nearest one. Below is a breakdown of the target feature:

Yes = 1              No = 0
SanFrancisco         Albany
LosAngeles           Syracuse
Atlanta              Indianapolis
Portland             CincinnatiDayton
Tampa                Louisville
Seattle              Raleigh
MiamiFtLauderdale    Boise
Boston               Columbus
Houston              DallasFtWorth
Chicago              HarrisburgScranton
BuffaloRochester     Roanoke
Philadelphia         Las Vegas
Charlotte            Grand Rapids
Detroit
GreatLakes
SouthCentral
Southeast
West
Northeast

# Set the working directory to the project folder and load the data
setwd('C:/Users/jimmy/OneDrive/UHD Graduate MS Data Analytics/SEM 3 - Fall 2019/Applied Regression Analysis/Project/Avocados')
Avocados <- read.csv('avocado.csv')
Avocados_New <- Avocados
Avocados_New$region <- as.character(Avocados_New$region)

# Cities and aggregate regions coded as having ocean port access
port_regions <- c('SanFrancisco', 'LosAngeles', 'Atlanta', 'Portland', 'Seattle',
                  'Tampa', 'MiamiFtLauderdale', 'Boston', 'Houston', 'Philadelphia',
                  'Chicago', 'BuffaloRochester', 'Charlotte', 'Detroit', 'Jacksonville',
                  'NewOrleansMobile', 'SanDiego', 'HartfordSpringfield', 'RichmondNorfolk',
                  'Orlando', 'Sacramento', 'GreatLakes', 'SouthCentral', 'Southeast',
                  'West', 'Northeast')

# Dummy variable: 1 = port access, 0 = no port access
Avocados_New$PortAccess <- factor(ifelse(Avocados_New$region %in% port_regions, 1, 0))

# Drop the nationwide aggregate (filter() comes from dplyr, loaded with tidyverse)
Avocados_New <- filter(Avocados_New, region != 'TotalUS')
Avocados_New <- Avocados_New[1:17911,]
## 
## 2015 2016 2017 2018 
## 5511 5512 5616 1272

Data Exploration

Plotting Historical Pricing Data:

Avocado Type Encoded

Nationwide, this plot shows that Organic Hass Avocados cost about 50 cents more on average than Conventional Hass Avocados.
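
The comparison behind this plot amounts to a grouped mean. The following is a minimal sketch on toy data; the prices are made up, and the column names `AveragePrice` and `type` are assumed to match the Kaggle file as read by `read.csv`.

```r
# Toy data with the dataset's assumed column names; the prices are made up for illustration
toy <- data.frame(
  type         = c("conventional", "conventional", "organic", "organic"),
  AveragePrice = c(1.10, 1.20, 1.60, 1.70)
)

# Mean price per avocado type
price_by_type <- aggregate(AveragePrice ~ type, data = toy, FUN = mean)
print(price_by_type)

# Gap between the organic and conventional averages
gap <- price_by_type$AveragePrice[price_by_type$type == "organic"] -
       price_by_type$AveragePrice[price_by_type$type == "conventional"]
print(gap)
```

The same `aggregate()` call works for the port access and condensed region comparisons by swapping the grouping variable.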

Port Access Encoded

Nationwide, this plot shows that avocados in cities and regions with port access cost more on average than those without port access. The effect is very slight, however, which suggests that port access is likely not a contributing factor to price.

regionCondensed Encoded

The consumer may be less inclined to purchase avocados in the Northeastern region, given that it has the highest average price at $1.52 per unit, 14 cents more than the next most expensive region, the West.

Plotting Historical Volume Data:

Avocado Type Encoded

This plot suggests that conventional avocados make up the majority of the nationwide volume.

Port Access Encoded

This plot suggests that there are higher volumes of avocados in areas with port access than without; these areas make up nearly 64% of the volume.

## [1] 0.6368297
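
A share like the one printed above can be computed as a ratio of sums. This is a sketch on toy data, not the report's actual call; the volumes are made up and `Total.Volume` is an assumed column name.

```r
# Toy data: PortAccess uses the 1/0 coding described earlier; volumes are made up
toy <- data.frame(
  PortAccess   = factor(c(1, 1, 0, 0, 1)),
  Total.Volume = c(300, 200, 150, 250, 100)
)

# Share of total volume sold in regions with port access
port_share <- sum(toy$Total.Volume[toy$PortAccess == 1]) / sum(toy$Total.Volume)
print(port_share)
```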

regionCondensed Encoded

This plot shows that, out of about 10 billion avocados sold between January 2015 and part of 2018, over two-thirds of the volume is concentrated in the South and West regions, and that there is less volume in the Northeast. This is also consistent with the breakdown of the Conventional and Organic Avocados.

Based on the above plots, there is a clear seasonal price spike that begins in March, peaks in September and October, and returns to normal levels in December.
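
One way to check a seasonal pattern numerically is to average price by calendar month. The sketch below uses made-up dates and prices, and assumes the `Date` and `AveragePrice` column names; it is not the code behind the plots.

```r
# Toy weekly observations; dates and prices are made up for illustration
toy <- data.frame(
  Date         = as.Date(c("2017-01-08", "2017-01-15", "2017-09-03",
                           "2017-09-10", "2017-12-03", "2017-12-10")),
  AveragePrice = c(1.20, 1.24, 1.80, 1.84, 1.30, 1.34)
)

# Extract the calendar month (1-12) and average the price within each month
toy$month <- as.integer(format(toy$Date, "%m"))
monthly <- aggregate(AveragePrice ~ month, data = toy, FUN = mean)
print(monthly)

# Month with the highest average price
peak_month <- monthly$month[which.max(monthly$AveragePrice)]
print(peak_month)
```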

Average Price Prediction - Direction

This section covers the models developed to predict pricing for each type of avocado, broken down further by the four condensed regions. The reason for this breakdown is as follows:

Each model will have a new response variable called ‘High’, with its own threshold value set by the average price for the respective avocado type and region (e.g., Midwest Conventional, Northeast Organic, etc.). After the models have been run, we will compare the original average prices to the predicted average prices and benchmark them against the overall price averages for each type of avocado.
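
As a rough sketch of this setup, the code below builds a ‘High’ response from a subset's own average price and fits a classification tree with the `tree` package. The data are simulated, and the price relationship, the 50/50 train/test split, and the reduced predictor list are all assumptions made for the example rather than the report's actual settings.

```r
library(tree)  # classification trees, as loaded under Packages Required

set.seed(1)  # arbitrary seed so the simulated data is reproducible
n <- 400

# Simulated stand-in for one type/region subset; all values are made up
sim <- data.frame(
  X4046      = runif(n, 0, 1000),
  X4225      = runif(n, 0, 1000),
  Total.Bags = runif(n, 0, 1000)
)
# Price falls as volume rises, plus noise (an assumption for this toy example)
sim$AveragePrice <- 1.6 - 0.0005 * sim$X4046 - 0.0003 * sim$X4225 + rnorm(n, sd = 0.05)

# Threshold response: "Yes" when the price is above the subset's own average
sim$High <- factor(ifelse(sim$AveragePrice > mean(sim$AveragePrice), "Yes", "No"))

# Hold out half of the rows for testing (assumed split)
train_idx <- sample(seq_len(n), n / 2)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit on the training rows; AveragePrice is left out because High is derived from it
fit <- tree(High ~ X4046 + X4225 + Total.Bags, data = train)
print(summary(fit))

# Predicted classes for the held-out rows
pred <- predict(fit, newdata = test, type = "class")
print(table(pred, test$High))
```

Leaving `AveragePrice` out of the formula matters here, since a tree given the price itself would reproduce the threshold trivially.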

Region | Conventional Avocados

We will begin with a breakdown of Conventional Avocados by region.

Midwest Region

This tree model is predicting the average price of Midwest Conventional Avocados to decrease by less than a tenth of a cent.

## 
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type + 
##     PortAccess, data = Avocados_Midwest)
## Variables actually used in tree construction:
## [1] "X4225"      "Total.Bags" "X4046"     
## Number of terminal nodes:  13 
## Residual mean deviance:  0.8743 = 1318 / 1508 
## Misclassification error rate: 0.217 = 330 / 1521

##             High.test
## Midwest.pred  No Yes
##          No  327  78
##          Yes  49 131
## [1] 0.782906
## [1] 0.217094
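
The two figures printed under the confusion matrix are the test accuracy and its complement, the test error rate. The arithmetic can be reproduced from the Midwest counts shown above; the object names in this sketch are illustrative.

```r
# Confusion matrix copied from the output above (rows = predicted, columns = actual)
conf_mat <- matrix(c(327, 49, 78, 131), nrow = 2,
                   dimnames = list(Midwest.pred = c("No", "Yes"),
                                   High.test    = c("No", "Yes")))

# Accuracy: correct predictions (the diagonal) over all test observations
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
error    <- 1 - accuracy

print(round(c(accuracy = accuracy, error = error), 6))
```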

Northeast Region

This tree model is predicting the average price of Northeast Conventional Avocados to decrease by about half a cent.

## 
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type + 
##     PortAccess, data = Avocados_Northeast)
## Variables actually used in tree construction:
## [1] "X4046" "X4225" "X4770"
## Number of terminal nodes:  10 
## Residual mean deviance:  1.071 = 2523 / 2356 
## Misclassification error rate: 0.2688 = 636 / 2366

##               High.test
## Northeast.pred  No Yes
##            No  297  59
##            Yes 202 352
## [1] 0.7131868
## [1] 0.2868132

South Region

This tree model is predicting the average price of South Conventional Avocados to increase by about a fifth of a cent.

## 
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type + 
##     PortAccess, data = Avocados_South)
## Variables actually used in tree construction:
## [1] "X4225"      "X4770"      "Total.Bags" "X4046"     
## Number of terminal nodes:  13 
## Residual mean deviance:  0.997 = 2683 / 2691 
## Misclassification error rate: 0.2618 = 708 / 2704

##           High.test
## South.pred  No Yes
##        No  445 144
##        Yes 126 325
## [1] 0.7403846
## [1] 0.2596154

West Region

This tree model is predicting the average price of West Conventional Avocados to decrease by almost one cent.

## 
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type + 
##     PortAccess, data = Avocados_West)
## Variables actually used in tree construction:
## [1] "X4046"      "Total.Bags" "PortAccess" "X4770"      "X4225"     
## Number of terminal nodes:  11 
## Residual mean deviance:  1.001 = 2357 / 2355 
## Misclassification error rate: 0.254 = 601 / 2366

##          High.test
## West.pred  No Yes
##       No  324  90
##       Yes 190 306
## [1] 0.6923077
## [1] 0.3076923

All Regions

This tree model is predicting the average price of all Conventional Avocados to decrease.

## 
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type + 
##     PortAccess, data = Avocados_ALL)
## Variables actually used in tree construction:
## [1] "X4046"      "X4770"      "X4225"      "Total.Bags"
## Number of terminal nodes:  8 
## Residual mean deviance:  1.185 = 10610 / 8949 
## Misclassification error rate: 0.3369 = 3018 / 8957

##         High.test
## ALL.pred   No  Yes
##      No  1461  702
##      Yes  477  805
## [1] 0.6577649
## [1] 0.3422351

Price-Predicting Results - CONVENTIONAL

The predictions show that our cheapest options are still in the West and the South, but the average price in the South is rising while the average price in the Midwest is decreasing. Assuming the buyer of the conventional avocado is more cost-conscious, it is safe to say the buyer will book their next Spirit Airlines flight to the Western region of the US in the wintertime.

Midwest: Down

Original Predicted
$1.175345 $1.174855

Northeast: Down

Original Predicted
$1.307363 $1.302088

South: Up

Original Predicted
$1.105987 $1.108231

West: Down

Original Predicted
$1.061796 $1.053802

OVERALL: Down

Original Predicted
$1.159285 $1.155161
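
To make the size of each change easier to read, the differences can be expressed in cents. This sketch uses the Conventional averages listed above; the data frame and column names are illustrative.

```r
# Original and predicted averages for Conventional avocados, copied from the results above
conv <- data.frame(
  region    = c("Midwest", "Northeast", "South", "West", "OVERALL"),
  original  = c(1.175345, 1.307363, 1.105987, 1.061796, 1.159285),
  predicted = c(1.174855, 1.302088, 1.108231, 1.053802, 1.155161)
)

# Change in cents (positive = increase) and the implied direction
conv$change_cents <- round((conv$predicted - conv$original) * 100, 2)
conv$direction    <- ifelse(conv$change_cents > 0, "Up", "Down")
print(conv)
```

All of the changes are under one cent, so the directions are better read as slight tendencies than as large price moves.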

Region | Organic Avocados

Now we will break down Organic Avocado average pricing by region.

Midwest Region

This tree model is predicting the average price of Midwest Organic Avocados to decrease.

## 
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type + 
##     PortAccess, data = Avocados_Midwest)
## Variables actually used in tree construction:
## [1] "Total.Bags" "X4046"      "X4225"      "PortAccess"
## Number of terminal nodes:  9 
## Residual mean deviance:  1.107 = 1675 / 1512 
## Misclassification error rate: 0.2834 = 431 / 1521

##             High.test
## Midwest.pred  No Yes
##          No  244 127
##          Yes  36 178
## [1] 0.6871795
## [1] 0.2786325

Northeast Region

This tree model is predicting the average price of Northeast Organic Avocados to decrease.

## 
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type + 
##     PortAccess, data = Avocados_Northeast)
## Variables actually used in tree construction:
## [1] "X4225"      "X4770"      "PortAccess" "Total.Bags"
## Number of terminal nodes:  12 
## Residual mean deviance:  0.9791 = 2305 / 2354 
## Misclassification error rate: 0.2574 = 609 / 2366

##               High.test
## Northeast.pred  No Yes
##            No  284  66
##            Yes 192 368
## [1] 0.7164835
## [1] 0.2835165

South Region

This tree model is predicting the average price of South Organic Avocados to decrease.

## 
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type + 
##     PortAccess, data = Avocados_South)
## Variables actually used in tree construction:
## [1] "X4770"      "X4046"      "Total.Bags" "X4225"      "PortAccess"
## Number of terminal nodes:  13 
## Residual mean deviance:  0.9012 = 2425 / 2691 
## Misclassification error rate: 0.2104 = 569 / 2704

##           High.test
## South.pred  No Yes
##        No  444 128
##        Yes 117 351
## [1] 0.7644231
## [1] 0.2355769

West Region

This tree model is predicting the average price of West Organic Avocados to decrease.

## 
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type + 
##     PortAccess, data = Avocados_West)
## Variables actually used in tree construction:
## [1] "Total.Bags" "X4046"      "X4225"     
## Number of terminal nodes:  6 
## Residual mean deviance:  1.163 = 2741 / 2357 
## Misclassification error rate: 0.2954 = 698 / 2363

##          High.test
## West.pred  No Yes
##       No  405 164
##       Yes  99 239
## [1] 0.7100331
## [1] 0.2899669

All Regions

This tree model is predicting the average price of all Organic Avocados to decrease.

## 
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type + 
##     PortAccess, data = Avocados_ALL)
## Variables actually used in tree construction:
## [1] "Total.Bags" "X4225"     
## Number of terminal nodes:  3 
## Residual mean deviance:  1.31 = 11720 / 8951 
## Misclassification error rate: 0.3913 = 3504 / 8954

##         High.test
## ALL.pred   No  Yes
##      No  1282  787
##      Yes  595  778
## [1] 0.5979681
## [1] 0.4011611

Price-Predicting Results - ORGANIC

The predictions show a downward trend in Organic Avocado pricing across the United States. The Organic Avocado buyer is more than likely conscious of the avocado’s quality rather than its price. It would be wise for the consumer to take advantage of the artificial savings this model has to offer regardless of the market. A cost-conscious Millennial, though, will more than likely take the next Southwest Airlines flight to the South or Midwest, where the decreasing average price was already below the overall Organic Avocado average price.

Midwest: Down

Original Predicted
$1.562327 $1.557145

Northeast: Down

Original Predicted
$1.740900 $1.735198

South: Down

Original Predicted
$1.593765 $1.591183

West: Down

Original Predicted
$1.702641 $1.690551

OVERALL: Down

Original Predicted
$1.656036 $1.654611
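
The benchmark comparison can be checked directly from the figures above. This sketch flags the regions whose predicted Organic average falls below the predicted overall Organic average; the object names are illustrative.

```r
# Predicted Organic averages by region, copied from the results above
organic_pred <- c(Midwest = 1.557145, Northeast = 1.735198,
                  South = 1.591183, West = 1.690551)
overall_pred <- 1.654611  # predicted overall Organic average

# Regions whose predicted average is below the overall benchmark
below <- names(organic_pred)[organic_pred < overall_pred]
print(below)
```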

Summary

Based on our exploratory analysis, it appears that both location and season have an effect on the price of avocados. We explored the possibility that shipping by sea matters but found no correlation between port access and prices. Proximity to one of the growing zones is a more likely indicator, but that is not explored here because we do not have reliable data on where the avocados are grown. A second predictor we discovered was the time of year. It appears that supply and demand play a large part in this, as the growing season for avocados is June to October, which is also when the price jumps significantly because the previous year's stocks are running out and a new supply has not yet been introduced for that year. These assumptions could be further verified by someone with more experience in time series completing the model.

Based on our price-predicting results, it appears that average prices will mostly decrease across the pairings of region and avocado type. In particular, we see our lowest-priced Conventional avocados in areas that are already on the cheaper side, like the West and South regions. The Organic avocado consumer may want to take advantage of the artificial savings that these optimistic models provide, but the cost-conscious consumer may want to take a flight to the South or Midwest regions. Regardless, the West makes its case as the best market in which to buy your next avocado in terms of scale, pricing and product offerings. The Millennial consumer will more than likely book their next Spirit Airlines flight headed West in the earlier parts of 2019 and 2020.