In a kaggle
competition, this data was provided by the Hass avocado board to
analyze the perception that avocado prices surged in 2017. R has devoted
a package to a more recent version of this data which combines an extensive
description of variables and three datasets: - at the country level
(hass_usa) - at the region level (hass_region)
- at the city or subregion level (hass) You can decide
which data you will work with. I would advise for those who feel
uncertain about their capabilities, to start with the more general data
(at the USA level).
getwd()# make sure you download the csv file in your working directory!
## [1] "/Users/amitaharoni/Library/CloudStorage/Dropbox/03. Main Personal/01. Knowledge, Courses & School/06. CUNY - BMCC/07. Study Abroad/Data Analysis in R/03. Projects/DataAnalysisRCourse/FinalAssignment"
# install.packages("avocado")
library(avocado)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(ggpubr)
data('hass_region')
# ?hass_region # for more information
head(hass_region)
## # A tibble: 6 × 16
## week_ending region avg_price_nonorg plu4046 plu4225 plu4770
## <dttm> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2017-01-02 00:00:00 California 0.89 2266313. 2877688. 90900.
## 2 2017-01-02 00:00:00 Great Lakes 0.88 636278. 2157250. 189357.
## 3 2017-01-02 00:00:00 Midsouth 1.12 653896 1285364. 64703.
## 4 2017-01-02 00:00:00 Northeast 1.35 174843. 2589316. 39607.
## 5 2017-01-02 00:00:00 Plains 0.83 1462454. 509660. 4781.
## 6 2017-01-02 00:00:00 South Central 0.64 3587283. 1719720. 27549.
## # ℹ 10 more variables: small_nonorg_bag <dbl>, large_nonorg_bag <dbl>,
## # xlarge_nonorg_bag <dbl>, avg_price_org <dbl>, plu94046 <dbl>,
## # plu94225 <dbl>, plu94770 <dbl>, small_org_bag <dbl>, large_org_bag <dbl>,
## # xlarge_org_bag <dbl>
summary(hass_region)
## week_ending region avg_price_nonorg
## Min. :2017-01-02 00:00:00.00 Length:1576 Min. :0.620
## 1st Qu.:2017-12-10 00:00:00.00 Class :character 1st Qu.:0.990
## Median :2018-11-18 00:00:00.00 Mode :character Median :1.130
## Mean :2018-12-01 15:57:33.80 Mean :1.147
## 3rd Qu.:2019-11-24 00:00:00.00 3rd Qu.:1.290
## Max. :2020-11-01 00:00:00.00 Max. :1.900
## plu4046 plu4225 plu4770 small_nonorg_bag
## Min. : 10779 Min. : 4016 Min. : 26.8 Min. : 12775
## 1st Qu.: 728445 1st Qu.: 627577 1st Qu.: 15463.1 1st Qu.: 904415
## Median :1482193 Median :1059240 Median : 32828.1 Median :1324807
## Mean :1588680 Mean :1294277 Mean : 93410.4 Mean :1400776
## 3rd Qu.:2310060 3rd Qu.:1618820 3rd Qu.:142694.2 3rd Qu.:1802267
## Max. :5160897 Max. :6533193 Max. :794742.9 Max. :4017035
## large_nonorg_bag xlarge_nonorg_bag avg_price_org plu94046
## Min. : 764 Min. : 0 Min. :0.650 Min. : 0
## 1st Qu.: 169019 1st Qu.: 10049 1st Qu.:1.400 1st Qu.: 2522
## Median : 443211 Median : 27462 Median :1.550 Median : 6862
## Mean : 667406 Mean : 53216 Mean :1.591 Mean : 14574
## 3rd Qu.: 922094 3rd Qu.: 66460 3rd Qu.:1.740 3rd Qu.: 23970
## Max. :3533282 Max. :730833 Max. :2.950 Max. :156528
## plu94225 plu94770 small_org_bag large_org_bag
## Min. : 0 Min. : 0.00 Min. : 6.3 Min. : 0
## 1st Qu.: 5976 1st Qu.: 0.00 1st Qu.: 49844.1 1st Qu.: 6365
## Median : 22677 Median : 39.08 Median : 94482.8 Median : 17558
## Mean : 27304 Mean : 402.70 Mean : 99030.5 Mean : 33364
## 3rd Qu.: 39145 3rd Qu.: 268.42 3rd Qu.:136281.7 3rd Qu.: 42416
## Max. :425617 Max. :15013.85 Max. :771846.5 Max. :247924
## xlarge_org_bag
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 0.00
## Mean : 54.76
## 3rd Qu.: 0.00
## Max. :16336.25
table(hass_region$region)
##
## California Great Lakes Midsouth Northeast Plains
## 197 197 197 197 197
## South Central Southeast West
## 197 197 197
The data contains the total weight (US pounds) of Hass avocado’s
being sold per region and year. They distinguish between avocados sold
per unit (PLU variables) and those sold in bags. In the columns
plu4046, plu4225 and plu4770 you
can find the total weight of small, large and extra large Hass avocado’s
being sold:
PLU refers to a bulk sale. On the other hand, the bags indicates a pre-packaged container consisting of a variable number of avocados of mixed PLU type. For instance, a package of six avocados may consist of 2 PLU 4046, 3 PLU 4770 and 1 PLU 4225. In other words, bagged amount sold are unable to account for individual PLU amount sold. The data also contains the average price per organic and nonorganic avocado being sold in that region and year. Multiplying the price with the volume will create total sales.
Your goal will be to predict prices in the last observed time period. You are free to select regions, time periods, and variables as you think is necessary to well predict the prices of the avocado’s. How good your prediction is determined by the difference between the predicted value and the observed value of the last time period. In the final submission, please focus on a few analyses that you found most useful and interesting to answer the questions. Do not submit a full paper with all the possible analyses you’ve tried.
In order to inspect the data over time, we create three new variables, day, month and year.
for(i in seq(1, nrow(hass_region))){
hass_region$year[i] <- unlist(strsplit(as.character(hass_region$week_ending[i]),"-"))[1]
hass_region$month[i] <- unlist(strsplit(as.character(hass_region$week_ending[i]),"-"))[2]
hass_region$day[i] <- unlist(strsplit(as.character(hass_region$week_ending[i]),"-"))[3]
}
An alternative approach to this is to use functions from the
lubridate package:
hass_region %>%
mutate(
year = year(week_ending),
month = month(week_ending),
day = day(week_ending)
) -> hass_region
I advise you to work from this notebook, so the results are well organized, and represented in such a way that others can make sense of it. For the assignment, it is important to show that you’ve learned how to code, but also how to interpret the plots, hypothesis tests and regression results. To get started, I added some examples below. Please make sure you add code and text below each question, and that the analyses combined become a report that is easy to read, and attractive to see.
When you inspect the variables, pay special attention to measurement levels (how many categories, numeric values etc), distributions (use plots!), explore conditional distributions (plots separated for groups of observations), make selections on the data so you investigate how ‘complex’ the data is.
For the assignment, select those inspections that you felt were most insightful to you, and interpret these plots/summaries. As an example below I added my first inspection of the data:
In preliminary analyses, the main goal is to get a feel for the data. How are variables construed, how do they relate with one another, what is the structure of the data? You might have noted during your inspection of the variables that this is a so-called long file with respect to all years and regions. This means that years and regions are placed below one another on separate rows.
As you will see towards the end of this assignment, this makes time analysis easy. For the first analyses, you do not have to analyze the effect of time, but you do need to control for time. If not, the effect of time will confound the effect of the other variables.
The following questions help you to get to know the data better. If you think of other questions, feel free to add those analyses below.
The boxplot above shows that the price changed over the years. Keep
in mind that there are regional differences that confound this
comparisons over the years, as the price change might have differed per
region. In the following we explore the price per region, and focus only
on the larger regions in the US that are listed in the
hass_region data. This is just to illustrate what you can
do with the data. Of course you can further explore
hass_county if you want! Look into the different regions
here: https://cran.r-project.org/web/packages/avocado/vignettes/a_intro.html
# first delete levels so that you can focus on specific part of the data
hass_region$region <- as.character(hass_region$region)
unique(hass_region$region)
## [1] "California" "Great Lakes" "Midsouth" "Northeast"
## [5] "Plains" "South Central" "Southeast" "West"
boxplot(avg_price_nonorg~region, hass_region)
boxplot(avg_price_org~region, hass_region)
# Process and visualize the hass_region data
hass_region %>%
mutate(volume_org = small_org_bag + large_org_bag + xlarge_org_bag, # Calculate total organic volume
volume_nonorg = small_nonorg_bag + large_nonorg_bag + xlarge_nonorg_bag, # Calculate total non-organic volume
year = as.integer(format(as.Date(week_ending), "%Y"))) %>% # Extract year from week_ending
dplyr::select(avg_price_nonorg, avg_price_org, week_ending, region, volume_org, volume_nonorg, year) %>% # Select relevant columns
pivot_longer(cols = c(avg_price_nonorg, avg_price_org), names_to = "type", values_to = "price") %>% # Reshape data to long format
ggplot(aes(x = region, y = price)) + # Create ggplot object with region on x-axis and price on y-axis
geom_boxplot() + # Add boxplot layer
facet_grid(year ~ type, scales="free") + # Create separate facets for each year and type of price
labs(title = "Average Avocado Prices by Region, Year, and Type", x = "Region", y = "Price") + # Add titles and labels
theme_minimal() + # Use minimal theme
theme(
axis.text.x = element_text(size = 12, angle = 45, hjust = 1), # Increase size and rotate x-axis labels for readability
axis.text.y = element_text(size = 12), # Increase size of y-axis labels
strip.text = element_text(size = 14), # Increase size of facet labels
plot.title = element_text(size = 16, face = "bold"), # Increase size and bold the title
axis.title = element_text(size = 14) # Increase size of axis titles
)
# Process and visualize the hass_region data
hass_region %>%
filter(year %in% c(2018, 2019, 2020), region %in% c("California", "South Central", "West")) %>% # Filter for the years 2018, 2019, and 2020
mutate(volume_org = small_org_bag + large_org_bag + xlarge_org_bag, # Calculate total organic volume
volume_nonorg = small_nonorg_bag + large_nonorg_bag + xlarge_nonorg_bag # Calculate total non-organic volume
) %>%
dplyr::select(avg_price_nonorg, avg_price_org, year, month, region, volume_org, volume_nonorg) %>% # Select relevant columns
pivot_longer(cols = c(avg_price_nonorg, avg_price_org), names_to = "type", values_to = "price") %>% # Reshape data to long format
ggplot(aes(x = as.numeric(month), y = price, color = region, group = region)) + # Create ggplot object with month on x-axis and price on y-axis
geom_smooth(method = "loess", se = FALSE) + # Add smooth line layer using LOESS smoothing without confidence interval
# geom_line() + # Add line layer
# facet_grid(~ type + year, scales = "free") +
facet_wrap(~ type + year, scales = "free") + # Create separate facets for each type of price and year
geom_vline(xintercept = 7, linetype = "dashed", color = "red", size = 1) + # Add vertical line at July (7th month)
# geom_segment(data = . %>% group_by(type, year) %>% summarise(max_price = max(price, na.rm = TRUE)),
# aes(x = 7, xend = 7, y = max_price, yend = max_price * 0.95),
# arrow = arrow(length = unit(0.2, "cm")), color = "red") + # Add arrow pointing to the peak in July
# annotate("text", x = 7, y = Inf, label = "July", vjust = 2, color = "red", size = 4) + # Annotate with text "July"
scale_x_continuous(breaks = 1:12, labels = month.abb) + # Set x-axis to display months
labs(title = "Monthly Average Avocado Prices by Region, Year, and Type", x = "Month", y = "Price") + # Add titles and labels
theme_minimal() + # Use minimal theme
theme(
axis.text.x = element_text(size = 7, angle = 45, hjust = 1), # Increase size and rotate x-axis labels for readability
axis.text.y = element_text(size = 10), # Increase size of y-axis labels
strip.text = element_text(size = 8), # Increase size of facet labels
plot.title = element_text(size = 14, face = "bold"), # Increase size and bold the title
axis.title = element_text(size = 12) # Increase size of axis titles
)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
As you can see the price of avocado’s in the West is much more variable than those in South Central (where the price of avocado is the lowest). In the Northeast the price of the avocado is the highest.
Now let’s inspect the time dimension using tapply and
month.
tapply(hass_region$avg_price_nonorg , hass_region$month, mean)
## 1 2 3 4 5 6 7 8
## 1.0338889 0.9726562 1.1188889 1.1757639 1.1527941 1.1844118 1.2206250 1.2114706
## 9 10 11 12
## 1.2522917 1.2016912 1.1114423 1.0796250
A tidyverse solution instead of the tapply function:
hass_region %>%
group_by(month) %>%
summarise(avg_price_nonorg = mean(avg_price_nonorg))
## # A tibble: 12 × 2
## month avg_price_nonorg
## <dbl> <dbl>
## 1 1 1.03
## 2 2 0.973
## 3 3 1.12
## 4 4 1.18
## 5 5 1.15
## 6 6 1.18
## 7 7 1.22
## 8 8 1.21
## 9 9 1.25
## 10 10 1.20
## 11 11 1.11
## 12 12 1.08
It seems that the average price of avocado’s somewhat becomes higher in the months August till October. Indeed, https://www.wisegeek.com/when-is-avocado-season.htm states the following:
The U.S. state of California is one of the world’s largest producers of avocados. Farmers there tend to grow the “Hass” variety, which matures quickly and produces a large, attractive fruit. The Hass is generally available beginning in late March and ending in early to mid-September, though of course a lot depends on weather patterns, storms, and basic growing conditions. Droughts and unusually cold weather tend to reduce crop availability, which can alter the start and end points of avocado season while also impacting price. Export demands may also place a strain on availability. Many California farms sell their goods to grocers around the world during the peak growing season.
Apparently, on average, the price for avocado’s is higher when it is in season in California.
If you now understand the data structure, please see if you can
compare conventional and organic avocados. In the columns you can find
both bags (avocados sold in package) or PLU numbers (avocados sold per
unit). For the bags, you have two columns, one for the organic and one
for the conventional avocado. The PLU numbers indicate the type of
avocado and whether it is organic. When the number has a 9 in front, as
in plu94046 which is a small organic Hass avocado.
Since you have a column for organic and conventional, the data is wide with respect to the type. If you want to compare the two, it makes sense to reshape the data to a long format, so that organic and conventional avocados are placed under one another on a separate row.
organic_hass_check <- hass_region %>%
dplyr::select(avg_price_nonorg, avg_price_org) %>%
pivot_longer(cols=everything(),names_to = "type", values_to = "price") %>%
mutate(type=case_when(
type == "avg_price_nonorg" ~ "conventional",
TRUE~ "organic"))
# organic_hass_check
# Plot the data
ggplot(organic_hass_check, aes(x = type, y = price, fill = type)) +
geom_boxplot() +
labs(title = "Price Comparison of Avocados",
x = "Type",
y = "Average Price",
fill = "Type") +
theme_minimal()
# Note that it is also possible to modify the pivot_longer function to keep other variables such as the regions and years
library(dplyr)
library(tidyr)
library(ggplot2)
# Transform the dataset
organic_hass_check <- hass_region %>%
dplyr::select(region, avg_price_nonorg, avg_price_org) %>% # Added region to selection
pivot_longer(cols = c(avg_price_nonorg, avg_price_org), names_to = "type", values_to = "price") %>%
mutate(type = case_when(
type == "avg_price_nonorg" ~ "conventional",
TRUE ~ "organic"
))
# Plot the data by region
ggplot(organic_hass_check, aes(x = region, y = price, fill = type)) +
geom_boxplot() +
labs(title = "Price Comparison of Avocados by Region",
x = "Region",
y = "Average Price",
fill = "Type") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) # Rotate x-axis labels for readability
library(dplyr)
library(tidyr)
library(ggplot2)
# Transform the dataset
organic_hass_check <- hass_region %>%
dplyr::select(year, avg_price_nonorg, avg_price_org) %>% # Added year to selection
pivot_longer(cols = c(avg_price_nonorg, avg_price_org), names_to = "type", values_to = "price") %>%
mutate(type = case_when(
type == "avg_price_nonorg" ~ "conventional",
TRUE ~ "organic"
))
# Plot the data by year
ggplot(organic_hass_check, aes(x = factor(year), y = price, fill = type)) +
geom_boxplot() +
labs(title = "Price Comparison of Avocados by Year",
x = "Year",
y = "Average Price",
fill = "Type") +
theme_minimal()
This data set allows you to test the simple price difference between organic and conventional avocados. As a follow-up analysis, you could test differences in prices across regions and years.
Now let’s move to regression. For this, you need to reshape data so that you have a column of prices and volumes for the different types of avocados.
# It depends on what exactly you want to test, but there is an example of how you can reshape the data with the amount of avocados sold in bags:
vol_hass_check <- hass_region %>%
mutate(volume_org = small_org_bag + large_org_bag + xlarge_org_bag,
volume_nonorg = small_nonorg_bag + large_nonorg_bag + xlarge_nonorg_bag
) %>%
dplyr::select(avg_price_nonorg, avg_price_org, week_ending, region, volume_org, volume_nonorg) %>%
pivot_longer(cols=-c(week_ending, region, volume_org, volume_nonorg), names_to = "type", values_to = "price") %>%
mutate(case = if_else("avg_price_org" == type, "organic", "nonorganic")) %>%
dplyr::select(-type) %>%
mutate(volume = if_else("organic" == case, volume_org, volume_nonorg)) %>%
dplyr::select(-c("volume_org", "volume_nonorg" ))
vol_hass_check
## # A tibble: 3,152 × 5
## week_ending region price case volume
## <dttm> <chr> <dbl> <chr> <dbl>
## 1 2017-01-02 00:00:00 California 0.89 nonorganic 1940375.
## 2 2017-01-02 00:00:00 California 1.46 organic 67433.
## 3 2017-01-02 00:00:00 Great Lakes 0.88 nonorganic 1242361.
## 4 2017-01-02 00:00:00 Great Lakes 1.44 organic 30249.
## 5 2017-01-02 00:00:00 Midsouth 1.12 nonorganic 875004.
## 6 2017-01-02 00:00:00 Midsouth 1.72 organic 31030.
## 7 2017-01-02 00:00:00 Northeast 1.35 nonorganic 709623.
## 8 2017-01-02 00:00:00 Northeast 2 organic 68924.
## 9 2017-01-02 00:00:00 Plains 0.83 nonorganic 405848.
## 10 2017-01-02 00:00:00 Plains 1.62 organic 25883.
## # ℹ 3,142 more rows
For the assignment, you need to further analyze how these effects depend on other variables in your data. Think of nice research questions to answer using the methods you’ve learned. Inspect your questions using plots, hypothesis tests, and regression models. Again, feel free to add any other analysis that you may have found useful to understand the data.
# Plot the linear regression model affect of price vs volume
ggplot(vol_hass_check, aes(x = volume, y = price, color = case)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Avocado Price vs Volume by Type",
x = "Volume",
y = "Average Price",
color = "Type") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
| Type | Correlation | p_value |
|---|---|---|
| Organic | -0.1264782 | 5e-07 |
| Nonorganic | -0.4766489 | 0e+00 |
| Type | Coefficient | p_value | R_squared |
|---|---|---|---|
| Organic | -4e-07 | 5e-07 | 0.0159967 |
| Nonorganic | -1e-07 | 0e+00 | 0.2271942 |
| case | mean_price | sd_price | mean_volume | sd_volume |
|---|---|---|---|---|
| nonorganic | 1.147157 | 0.2257183 | 2121398.9 | 1175647.88 |
| organic | 1.591123 | 0.2890327 | 132449.7 | 88233.38 |
• Both organic and non-organic avocado prices are negatively impacted by the volume sold, meaning that higher volumes are associated with lower prices.
• The effect is more pronounced for non-organic avocados, as indicated by the higher correlation and R-squared values in the regression analysis.
• Despite the statistical significance, the actual impact (coefficients) is very small, indicating that other factors may play a more substantial role in determining avocado prices.
After your first initial exploration of the data, you’re ready to start analyzing the effects over time. For instance, you could explore whether the effects you find differ per year, month or seasons. To do this, you’ll need a trend variable to predict increase or decrease in sales, volumes or prices across time. To create a trend variable, I first reorder the data, so that the first row contains the first time, and the last row the last time measured.
hass_region<-hass_region[order(hass_region$year, hass_region$month, hass_region$day),]
hass_region$trend <- NA
i<-1
for(date in unique(hass_region$week_ending)){
hass_region$trend[hass_region$week_ending==date]<-i
i<- i + 1
}
An alternative approach using tidyverse to create the trend variable:
hass_region %>%
group_by(region) %>%
arrange(week_ending) %>%
mutate(trend2 = row_number()) %>%
ungroup() -> hass_region
# Transform the dataset for plotting
hass_trend <- hass_region %>%
mutate(volume_org = small_org_bag + large_org_bag + xlarge_org_bag,
volume_nonorg = small_nonorg_bag + large_nonorg_bag + xlarge_nonorg_bag) %>%
dplyr::select(avg_price_nonorg, avg_price_org, week_ending, region, volume_org, volume_nonorg, trend2) %>%
pivot_longer(cols = c(avg_price_nonorg, avg_price_org), names_to = "type", values_to = "price") %>%
mutate(case = if_else(type == "avg_price_org", "organic", "nonorganic")) %>%
dplyr::select(-type) %>%
mutate(volume = if_else(case == "organic", volume_org, volume_nonorg)) %>%
dplyr::select(-c(volume_org, volume_nonorg))
# Plot the trend for avocado prices over time
ggplot(hass_trend, aes(x = trend2, y = price, color = case)) +
geom_line() +
facet_wrap(~ region) +
labs(title = "Trend of Avocado Prices Over Time by Region",
x = "Time (Trend)",
y = "Average Price",
color = "Type") +
theme_minimal()
• Prices Peak: There seems to be a price peak at the beginning of time and the third quarter, organic and non-organic, across all regions.
• Volumes Difference: There seems to be a larger volume of non-organic avocados across all regions, which increases over time.
• Market Volatility: The avocado market shows considerable volatility in both prices and volumes, with certain regions experiencing more pronounced fluctuations.
Make sure you check the regression assumptions, specifically linearity that might be violated by nonlinear trends. Explain what you see.
It would be nice to see how well your model to explain prices can predict prices. To do this, you need to coefficients of your model, and you need to extrapolate what the price would be given these coefficients. I present an example below. Using linear regression I can estimate the effect of trend and region on the price. As I will use the coefficients to predict the last time measured, I exclude this time from the analysis. The coefficients are later used to predict this time. As you can see below, I overpredict the price of both organic and conventional avocado’s.
As an example, I show you how you can compare the observed price for one particular day with the predicted price.
# observed values:
hass_region$date <- as.character(hass_region$week_ending)
hass_region[hass_region$date=="2018-08-05" & hass_region$region=="Northeast",c("avg_price_nonorg","avg_price_org","region","trend")]
## # A tibble: 1 × 4
## avg_price_nonorg avg_price_org region trend
## <dbl> <dbl> <chr> <dbl>
## 1 1.21 1.75 Northeast 84
# Tidyverse approach to the same code:
# hass_region %>%
# mutate(date = as.character(week_ending)) %>%
# filter(date == "2018-08-05", region=="Northeast") %>%
# dplyr::select(avg_price_nonorg, avg_price_org, region, trend)
# Note that with the tidyverse approach you did not save the variable called date as a new variable in your dataset, this was only created temporarily only for the calculations here
I run different models to predict organic versus conventional avocados and show the predictions:
# predicted values:
# conventional
fit<-lm(avg_price_nonorg ~ trend + region,hass_region[hass_region$date!="2018-08-05",])
print(paste("The predicted price for region Northeast of conventional avocados is:", as.character(fit$coefficients[1] +
fit$coefficients['trend']*84 +
fit$coefficients['regionNortheast']*1)))
## [1] "The predicted price for region Northeast of conventional avocados is: 1.35839228334958"
# organic
fit<-lm(avg_price_org ~ trend + region,hass_region[hass_region$date!="2018-08-05",])
print(paste("The predicted price for region Northeast of organic avocados is:", as.character(fit$coefficients[1] +
fit$coefficients['trend']*84 +
fit$coefficients['regionNortheast']*1)))
## [1] "The predicted price for region Northeast of organic avocados is: 1.70932902159318"
To create a decision boundary between two regions, we need to reshape data:
# inspect two regions
glm_region <- glm(I(as.numeric(factor(region))-1) ~ small_nonorg_bag + avg_price_nonorg, hass_region[hass_region$region == "Northeast" | hass_region$region == "California",], family=binomial)
# create data for decision boundary
# summary(hass_region$small_nonorg_bag[hass_region$region == "Northeast" ])
plot_x <- c(1300000,1700000) # pick values based on x value in region of boundary decision
plot_y <- (-1 /coef(glm_region)[3]) * (coef(glm_region)[2] * plot_x + coef(glm_region)[1])
db.data <- data.frame(cbind(plot_x, plot_y))
colnames(db.data) <- c('x','y')
ggplot(hass_region[hass_region$region == "Northeast" | hass_region$region == "California",], aes(x=small_nonorg_bag,y=avg_price_nonorg, color=region))+
labs(x="small_org_bag", y="avg_price_org", title="Average organic avocado prices")+
geom_point(size = 4, shape = 19, alpha=0.8)+
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
geom_line(data=db.data,aes(x=x, y=y), col = "black") # decision boundary
## Predicted conventional (non-organic) avocado price for 2021: 0.9824529
## Predicted organic avocado price for 2021: 1.436892
fit<-lm(avg_price_org~trend+ I(trend^2) + I(trend^3), hass_region)
ggplot(fit, aes(y=hass_region$avg_price_org, x=hass_region$trend,colour=hass_region$region))+
geom_point()+
geom_smooth(method='lm', formula=y~x+I(x^2)+I(x^3))+
facet_wrap(~hass_region$region) +
labs(title = "Trend Differentiation Across Regions",
x = "Trend",
y = "Average Price (Organic)",
color = "Type") +
theme_minimal()
You can use PCA or cluster analysis to classify the countries based on multiple variables. Select only one year to avoid correlated observations over time. You could start with using the variables you’ve chosen to study above, and include them in the PCA. The components that are obtained from this PCA can be used to create classes (i.e. groups of states/regions that have similar values). Interpret the components, and argue why you have chosen a specific classification.
hass_region2<-hass_region[hass_region$region %in% c("California","Northeast","Great Lakes"),]
lda.fit<-lda(region ~ avg_price_org + avg_price_nonorg , hass_region2)
mylda_plot <- cbind(hass_region2, predict(lda.fit)$x)
ggscatterhist(
mylda_plot, x = "LD2", y = "avg_price_org",
color = "region", size = 3, alpha = 0.6,
margin.params = list(fill = "region", color = "black", size = 0.2))
ggscatterhist(
mylda_plot, x = "LD1", y = "avg_price_nonorg",
color = "region", size = 3, alpha = 0.6,
margin.params = list(fill = "region", color = "black", size = 0.2))