It is a popular and reasonable belief that avocados are a premium commodity for retail, hospitality, and related industries, as well as for the direct consumer. Americans cannot get enough avocados whenever they visit Chipotle, Torchy’s Tacos, or any other franchise that offers them. What is not known, however, is whether avocado prices differ by region and whether those differences will persist in the future. We also want to predict whether we would wish to stay in the city we originally chose.
Specifically, we want to know whether it is possible to estimate the price of an avocado based on our limited data.
First, we will look at certain geographic factors.
Next, we will look at the seasonal variation.
Then we will attempt to combine the two.
Finally, we will run decision tree models for each pairing of avocado type and region (8 models total) and benchmark them against the overall average of both the Conventional and Organic Avocado types.
This analysis can be used by the consumer to decide where and when they may want to take their next vacation if their diet consists primarily of avocados.
library(tidyverse) #For data cleaning, preparation and exploration
library(ggplot2) #For data exploration
library(class) #Classification utilities (e.g., k-nearest neighbours)
library(tree) #For decision tree
library(forecast) #For time series analysis
NOTE: This report was produced with the latest versions of R and RStudio available at the time.
> version _
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 6.1
year 2019
month 07
day 05
svn rev 76782
language R
version.string R version 3.6.1 (2019-07-05)
nickname Action of the Toes
Source: https://www.kaggle.com/neuromusic/avocado-prices
According to the source website, the Hass Avocado Board (HAB) is the only avocado organization that equips the entire global industry for success by collecting, focusing and distributing investments to maintain and expand demand for avocados in the United States. HAB collected the data from a weekly retail scan of national retail volume (units) and price. The data comes directly from retailers’ cash registers and is based on actual retail sales of Hass avocados. To answer our questions, we will need to encode dummy variables and condense the regions into a new feature with fewer levels.
New Feature regionCondensed: The regions have been condensed into four primary regions as a more efficient means to run models and get a more general sense of where the cheaper avocados might be located. This process was done manually.
Dummy Coding for Port Access: The data does not come with a feature that captures our point of interest. Thus, we will need to create a dummy variable that shows whether a city (labeled ‘region’ in the dataset) has ocean port access. The rule is that the city must have an ocean port or be within 100 miles of the nearest ocean port. Below is a breakdown of the target feature:
| Yes = 1 | No = 0 |
|---|---|
| SanFrancisco | Albany |
| LosAngeles | Syracuse |
| Atlanta | Indianapolis |
| Portland | CincinnatiDayton |
| Tampa | Louisville |
| Seattle | Raleigh |
| MiamiFtLauderdale | Boise |
| Boston | Columbus |
| Houston | DallasFtWorth |
| Chicago | HarrisburgScranton |
| BuffaloRochester | Roanoke |
| Philadelphia | LasVegas |
| Charlotte | GrandRapids |
| Detroit | |
| GreatLakes | |
| SouthCentral | |
| Southeast | |
| West | |
| Northeast | |
setwd('C:/Users/jimmy/OneDrive/UHD Graduate MS Data Analytics/SEM 3 - Fall 2019/Applied Regression Analysis/Project/Avocados')
Avocados <- read.csv('avocado.csv')
Avocados_New <- Avocados
Avocados_New$region <- as.character(Avocados_New$region)
# Regions treated as having ocean port access (coded 1); all others are coded 0
port_regions <- c('SanFrancisco', 'LosAngeles', 'Atlanta', 'Portland', 'Seattle', 'Tampa',
'MiamiFtLauderdale', 'Boston', 'Houston', 'Philadelphia', 'Chicago',
'BuffaloRochester', 'Charlotte', 'Detroit', 'Jacksonville', 'NewOrleansMobile',
'SanDiego', 'HartfordSpringfield', 'RichmondNorfolk', 'Orlando', 'Sacramento',
'GreatLakes', 'SouthCentral', 'Southeast', 'West', 'Northeast')
Avocados_New$PortAccess <- factor(ifelse(Avocados_New$region %in% port_regions, 1, 0))
Avocados_New <- filter(Avocados_New, region != 'TotalUS')
Avocados_New <- Avocados_New[1:17911,]
##
## 2015 2016 2017 2018
## 5511 5512 5616 1272
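The regionCondensed feature described earlier was built manually, and the mapping itself is not shown in this report. As a rough sketch of one way to do it, a named lookup vector can translate each region into one of the four condensed regions. The city-to-region assignments and the `region_lookup` and `toy` objects below are illustrative assumptions, not the mapping used in the analysis.

```r
# Hypothetical lookup: these city-to-region assignments are illustrative
# assumptions, not the mapping used in the original analysis.
region_lookup <- c(
  Boston       = "Northeast",
  Albany       = "Northeast",
  Chicago      = "Midwest",
  Indianapolis = "Midwest",
  Houston      = "South",
  Atlanta      = "South",
  Seattle      = "West",
  LosAngeles   = "West"
)

# Toy stand-in for the 'region' column of the avocado data
toy <- data.frame(
  region = c("Boston", "Seattle", "Houston", "Chicago", "Albany"),
  stringsAsFactors = FALSE
)

# Index the named vector by region name to get the condensed label
toy$regionCondensed <- factor(unname(region_lookup[toy$region]),
                              levels = c("Midwest", "Northeast", "South", "West"))

# Any region missing from the lookup would be counted as NA here
print(table(toy$regionCondensed, useNA = "ifany"))
```

A lookup like this keeps the manual decisions in one place, so a region that was forgotten shows up as an NA count rather than silently falling into the wrong group.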
Avocado Type Encoded
Nationwide, this plot shows that Organic Hass Avocados cost about 50 cents more on average than Conventional Hass Avocados.
Port Access Encoded
Nationwide, this plot shows that avocados in cities and regions with port access cost more on average than those without port access. The effect is very slight, however, which suggests that port access is likely not a contributing factor to price.
regionCondensed Encoded
The consumer may be less inclined to purchase avocados in the Northeast region, given that it has the highest average price at $1.52 per unit, 14 cents more than the next most expensive region, the West.
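The group averages behind price comparisons like these can be computed with base R's `aggregate()`. The sketch below uses a handful of made-up rows rather than the real data; the column names follow avocado.csv, and the prices are assumptions chosen only to make the arithmetic easy to follow.

```r
# Toy rows with made-up prices; column names follow avocado.csv
toy <- data.frame(
  AveragePrice    = c(1.10, 1.20, 1.60, 1.70, 1.00, 1.50),
  type            = c("conventional", "conventional", "organic",
                      "organic", "conventional", "organic"),
  regionCondensed = c("Northeast", "West", "Northeast", "West", "South", "South")
)

# Mean price for each avocado type
price_by_type <- aggregate(AveragePrice ~ type, data = toy, FUN = mean)

# Mean price for each condensed region, most expensive first
price_by_region <- aggregate(AveragePrice ~ regionCondensed, data = toy, FUN = mean)
price_by_region <- price_by_region[order(-price_by_region$AveragePrice), ]

print(price_by_type)
print(price_by_region)
```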
Avocado Type Encoded
This plot suggests that conventional avocados make up the majority of the nationwide volume.
Port Access Encoded
This plot suggests that areas with port access see higher avocado volumes than areas without, accounting for nearly 64% of the total volume.
## [1] 0.6368297
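A share like the one printed above is the volume sold in port-access regions divided by the total volume. The following minimal sketch shows the calculation on made-up volumes; `Total.Volume` and `PortAccess` are the column names used in this report, and the numbers are assumptions.

```r
# Toy volumes (made up); PortAccess is the 0/1 dummy described earlier
toy <- data.frame(
  Total.Volume = c(500, 300, 150, 50),
  PortAccess   = factor(c(1, 1, 0, 0))
)

# Share of total volume sold in regions with port access
port_share <- sum(toy$Total.Volume[toy$PortAccess == 1]) / sum(toy$Total.Volume)

# Same idea for both groups at once
share_by_group <- prop.table(tapply(toy$Total.Volume, toy$PortAccess, sum))

print(port_share)
print(share_by_group)
```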
regionCondensed Encoded
This plot shows that, out of about 10 billion avocados sold between January 2015 and part of 2018, over two-thirds of the volume is concentrated in the South and West regions, and that there is less volume in the Northeast. The same pattern holds when the volume is broken down into Conventional and Organic Avocados.
Based on the above plots, there is a clear seasonal price spike that begins in March, peaks in September to October, and returns to normal levels in December.
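One simple way to check a seasonal pattern numerically is to average the price by calendar month. The sketch below does this on a few made-up weekly observations; `Date` and `AveragePrice` are column names from avocado.csv, while the values and the `toy` object are assumptions for illustration.

```r
# Toy weekly observations (made-up prices)
toy <- data.frame(
  Date = as.Date(c("2017-01-01", "2017-01-08", "2017-03-05",
                   "2017-09-03", "2017-09-10", "2017-12-03")),
  AveragePrice = c(1.00, 1.10, 1.20, 1.80, 1.90, 1.10)
)

# Extract the calendar month (1-12) from each date
toy$month <- as.integer(format(toy$Date, "%m"))

# Average price per month, then the month with the highest average
monthly_avg <- tapply(toy$AveragePrice, toy$month, mean)
peak_month  <- as.integer(names(monthly_avg)[which.max(monthly_avg)])

print(monthly_avg)
print(peak_month)
```

In the real file the `Date` column would first need converting with `as.Date()`, since `read.csv()` does not parse dates on its own.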
This section covers the models developed to predict pricing for each type of avocado, broken down further by the four condensed regions. The reasoning behind this breakdown is as follows:
Each model will have a new response variable called ‘High’, with its own threshold value set by the average price for the respective avocado type and region (e.g., Midwest Conventional, Northeast Organic). After the models have been run, we will compare the original average prices to the predicted average prices and benchmark them against the overall price average for each type of avocado.
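The setup described above can be sketched as follows: label each row ‘Yes’ or ‘No’ depending on whether its price is above the subset's own average, hold out a test set, and fit a classification tree. This is only a sketch on synthetic data. The volume-to-price relationship, the 70/30 split, and the `toy`, `train` and `test` objects are assumptions, and the tree is fitted only if the `tree` package is installed.

```r
set.seed(1)  # reproducible toy data

# Synthetic stand-in for one type/region subset (all values are made up)
n <- 200
toy <- data.frame(
  X4046      = runif(n, 1e3, 1e5),
  X4225      = runif(n, 1e3, 1e5),
  Total.Bags = runif(n, 1e3, 5e4)
)
# Assumed relationship for illustration: price falls as volume rises
toy$AveragePrice <- 1.6 - 3e-6 * toy$X4046 - 2e-6 * toy$X4225 + rnorm(n, sd = 0.1)

# The threshold is the subset's own average price
threshold <- mean(toy$AveragePrice)
toy$High  <- factor(ifelse(toy$AveragePrice > threshold, "Yes", "No"),
                    levels = c("No", "Yes"))

# Hold out a test set (the 70/30 proportion is an assumption)
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit and evaluate only if the 'tree' package is available
if (requireNamespace("tree", quietly = TRUE)) {
  fit  <- tree::tree(High ~ X4046 + X4225 + Total.Bags, data = train)
  pred <- predict(fit, newdata = test, type = "class")
  print(table(pred, High.test = test$High))
}
```

AveragePrice is deliberately left out of the formula, since ‘High’ is derived from it and including it would make the classification trivial.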
We will begin with a breakdown of Conventional Avocados by region.
This tree model is predicting the average price of Midwest Conventional Avocados to decrease by less than a tenth of a cent.
##
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type +
## PortAccess, data = Avocados_Midwest)
## Variables actually used in tree construction:
## [1] "X4225" "Total.Bags" "X4046"
## Number of terminal nodes: 13
## Residual mean deviance: 0.8743 = 1318 / 1508
## Misclassification error rate: 0.217 = 330 / 1521
## High.test
## Midwest.pred No Yes
## No 327 78
## Yes 49 131
## [1] 0.782906
## [1] 0.217094
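The last two values printed above are the test-set accuracy and error rate, which follow directly from the confusion matrix. As a worked check, the sketch below re-enters the Midwest Conventional matrix by hand; the object names `conf_mat`, `accuracy` and `error_rate` are introduced here for illustration.

```r
# Midwest Conventional confusion matrix copied from the output above
# (rows = predicted class, columns = actual class in the test set)
conf_mat <- matrix(c(327, 49, 78, 131), nrow = 2,
                   dimnames = list(Midwest.pred = c("No", "Yes"),
                                   High.test    = c("No", "Yes")))

# Correct predictions sit on the diagonal
accuracy   <- sum(diag(conf_mat)) / sum(conf_mat)
error_rate <- 1 - accuracy

print(conf_mat)
print(accuracy)    # 458 / 585
print(error_rate)  # 127 / 585
```

Computing the error rate as one minus the accuracy guarantees that the two figures always sum to one.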
This tree model is predicting the average price of Northeast Conventional Avocados to decrease by about half a cent.
##
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type +
## PortAccess, data = Avocados_Northeast)
## Variables actually used in tree construction:
## [1] "X4046" "X4225" "X4770"
## Number of terminal nodes: 10
## Residual mean deviance: 1.071 = 2523 / 2356
## Misclassification error rate: 0.2688 = 636 / 2366
## High.test
## Northeast.pred No Yes
## No 297 59
## Yes 202 352
## [1] 0.7131868
## [1] 0.2868132
This tree model is predicting the average price of South Conventional Avocados to increase by about a fifth of a cent.
##
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type +
## PortAccess, data = Avocados_South)
## Variables actually used in tree construction:
## [1] "X4225" "X4770" "Total.Bags" "X4046"
## Number of terminal nodes: 13
## Residual mean deviance: 0.997 = 2683 / 2691
## Misclassification error rate: 0.2618 = 708 / 2704
## High.test
## South.pred No Yes
## No 445 144
## Yes 126 325
## [1] 0.7403846
## [1] 0.2596154
This tree model is predicting the average price of West Conventional Avocados to decrease by almost one cent.
##
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type +
## PortAccess, data = Avocados_West)
## Variables actually used in tree construction:
## [1] "X4046" "Total.Bags" "PortAccess" "X4770" "X4225"
## Number of terminal nodes: 11
## Residual mean deviance: 1.001 = 2357 / 2355
## Misclassification error rate: 0.254 = 601 / 2366
## High.test
## West.pred No Yes
## No 324 90
## Yes 190 306
## [1] 0.6923077
## [1] 0.3076923
This tree model is predicting the average price of all Conventional Avocados to decrease.
##
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type +
## PortAccess, data = Avocados_ALL)
## Variables actually used in tree construction:
## [1] "X4046" "X4770" "X4225" "Total.Bags"
## Number of terminal nodes: 8
## Residual mean deviance: 1.185 = 10610 / 8949
## Misclassification error rate: 0.3369 = 3018 / 8957
## High.test
## ALL.pred No Yes
## No 1461 702
## Yes 477 805
## [1] 0.6577649
## [1] 0.3422351
The predictions show that our cheapest options are still in the West and the South, although the average price in the South is rising while the average price in the Midwest is decreasing. Assuming the buyer of conventional avocados is more cost-conscious, it is reasonable to expect that buyer to book their next Spirit Airlines flight to the Western region of the US in the wintertime.
Midwest: Down
| Original | Predicted |
|---|---|
| $1.175345 | $1.174855 |
Northeast: Down
| Original | Predicted |
|---|---|
| $1.307363 | $1.302088 |
South: Up
| Original | Predicted |
|---|---|
| $1.105987 | $1.108231 |
West: Down
| Original | Predicted |
|---|---|
| $1.061796 | $1.053802 |
OVERALL: Down
| Original | Predicted |
|---|---|
| $1.159285 | $1.155161 |
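To make the comparison above concrete, the change in each regional average can be expressed in cents and each predicted average can be checked against the overall predicted average. The prices in the sketch below are copied from the tables above; the object names are introduced here for illustration.

```r
# Conventional averages copied from the tables above (dollars per unit)
original  <- c(Midwest = 1.175345, Northeast = 1.307363,
               South = 1.105987, West = 1.061796)
predicted <- c(Midwest = 1.174855, Northeast = 1.302088,
               South = 1.108231, West = 1.053802)
overall_predicted <- 1.155161

# Change from original to predicted, in cents
change_cents <- (predicted - original) * 100
direction    <- ifelse(change_cents > 0, "Up", "Down")

# Regions whose predicted average is below the overall benchmark
below_overall <- names(predicted)[predicted < overall_predicted]

print(round(change_cents, 2))
print(direction)
print(below_overall)
```

Every change is under one cent, and only the South and West fall below the overall benchmark.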
Now we will break down Organic Avocado average pricing by region.
This tree model is predicting the average price of Midwest Organic Avocados to decrease.
##
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type +
## PortAccess, data = Avocados_Midwest)
## Variables actually used in tree construction:
## [1] "Total.Bags" "X4046" "X4225" "PortAccess"
## Number of terminal nodes: 9
## Residual mean deviance: 1.107 = 1675 / 1512
## Misclassification error rate: 0.2834 = 431 / 1521
## High.test
## Midwest.pred No Yes
## No 244 127
## Yes 36 178
## [1] 0.6871795
## [1] 0.2786325
This tree model is predicting the average price of Northeast Organic Avocados to decrease.
##
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type +
## PortAccess, data = Avocados_Northeast)
## Variables actually used in tree construction:
## [1] "X4225" "X4770" "PortAccess" "Total.Bags"
## Number of terminal nodes: 12
## Residual mean deviance: 0.9791 = 2305 / 2354
## Misclassification error rate: 0.2574 = 609 / 2366
## High.test
## Northeast.pred No Yes
## No 284 66
## Yes 192 368
## [1] 0.7164835
## [1] 0.2835165
This tree model is predicting the average price of South Organic Avocados to decrease.
##
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type +
## PortAccess, data = Avocados_South)
## Variables actually used in tree construction:
## [1] "X4770" "X4046" "Total.Bags" "X4225" "PortAccess"
## Number of terminal nodes: 13
## Residual mean deviance: 0.9012 = 2425 / 2691
## Misclassification error rate: 0.2104 = 569 / 2704
## High.test
## South.pred No Yes
## No 444 128
## Yes 117 351
## [1] 0.7644231
## [1] 0.2355769
This tree model is predicting the average price of West Organic Avocados to decrease.
##
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type +
## PortAccess, data = Avocados_West)
## Variables actually used in tree construction:
## [1] "Total.Bags" "X4046" "X4225"
## Number of terminal nodes: 6
## Residual mean deviance: 1.163 = 2741 / 2357
## Misclassification error rate: 0.2954 = 698 / 2363
## High.test
## West.pred No Yes
## No 405 164
## Yes 99 239
## [1] 0.7100331
## [1] 0.2899669
This tree model is predicting the average price of all Organic Avocados to decrease.
##
## Classification tree:
## tree(formula = High ~ X4046 + X4225 + X4770 + Total.Bags + type +
## PortAccess, data = Avocados_ALL)
## Variables actually used in tree construction:
## [1] "Total.Bags" "X4225"
## Number of terminal nodes: 3
## Residual mean deviance: 1.31 = 11720 / 8951
## Misclassification error rate: 0.3913 = 3504 / 8954
## High.test
## ALL.pred No Yes
## No 1282 787
## Yes 595 778
## [1] 0.5979681
## [1] 0.4011611
The predictions show a downward trend in Organic Avocado pricing across the United States. The Organic Avocado buyer is more than likely conscious of the avocado’s quality rather than its price. It would be wise for the consumer to take advantage of the artificial savings this model offers, regardless of the market. A cost-conscious Millennial, though, will more than likely take the next Southwest Airlines flight to the South or Midwest, where the decreasing average price was already below the overall Organic Avocado average price.
Midwest: Down
| Original | Predicted |
|---|---|
| $1.562327 | $1.557145 |
Northeast: Down
| Original | Predicted |
|---|---|
| $1.740900 | $1.735198 |
South: Down
| Original | Predicted |
|---|---|
| $1.593765 | $1.591183 |
West: Down
| Original | Predicted |
|---|---|
| $1.702641 | $1.690551 |
OVERALL: Down
| Original | Predicted |
|---|---|
| $1.656036 | $1.654611 |
Based on our exploratory analysis, it appears that both location and season have an effect on the price of avocados. We explored the possibility that access to sea shipping matters but found no correlation with prices. Proximity to one of the growing zones is a more likely indicator, but that is not explored here because we do not have reliable data on where the avocados are grown. A second predictor we discovered was the time of year. Supply and demand appear to play a large part in this: the growing season for avocados is June to October, which is also when the price jumps significantly, because the previous year’s stock is running out and the new supply has not yet been introduced. These assumptions could be further verified by someone with more experience in time series completing the model.
Based on our price prediction results, it appears that average prices will mostly decrease across the pairings of region and avocado type. In particular, we see our lowest-priced Conventional avocados in areas that are already on the cheaper side, namely the West and South regions. The Organic avocado consumer may want to take advantage of the artificial savings that these optimistic models provide, but the cost-conscious consumer may want to take a flight to the South or Midwest regions. Regardless, the West makes its case as the best market in which to buy your next avocado in terms of scale, pricing and product offerings. The Millennial consumer will more than likely book their next Spirit Airlines flight headed West in the earlier parts of 2019 and 2020.