Assignment 1

1. Load the dataset into R

In a kaggle competition, this data was provided by the Hass avocado board to analyze the perception that avocado prices surged in 2017. R has devoted a package to a more recent version of this data which combines an extensive description of variables and three datasets: - at the country level (hass_usa) - at the region level (hass_region) - at the city or subregion level (hass) You can decide which data you will work with. I would advise for those who feel uncertain about their capabilities, to start with the more general data (at the USA level).

getwd()# make sure you download the csv file in your working directory!

## [1] "/Users/amitaharoni/Library/CloudStorage/Dropbox/03. Main Personal/01. Knowledge, Courses & School/06. CUNY - BMCC/07. Study Abroad/Data Analysis in R/03. Projects/DataAnalysisRCourse/FinalAssignment"

# install.packages("avocado")
library(avocado)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

library(ggpubr)
data('hass_region')
# ?hass_region # for more information
head(hass_region)

## # A tibble: 6 × 16
##   week_ending         region        avg_price_nonorg  plu4046  plu4225 plu4770
##   <dttm>              <chr>                    <dbl>    <dbl>    <dbl>   <dbl>
## 1 2017-01-02 00:00:00 California                0.89 2266313. 2877688.  90900.
## 2 2017-01-02 00:00:00 Great Lakes               0.88  636278. 2157250. 189357.
## 3 2017-01-02 00:00:00 Midsouth                  1.12  653896  1285364.  64703.
## 4 2017-01-02 00:00:00 Northeast                 1.35  174843. 2589316.  39607.
## 5 2017-01-02 00:00:00 Plains                    0.83 1462454.  509660.   4781.
## 6 2017-01-02 00:00:00 South Central             0.64 3587283. 1719720.  27549.
## # ℹ 10 more variables: small_nonorg_bag <dbl>, large_nonorg_bag <dbl>,
## #   xlarge_nonorg_bag <dbl>, avg_price_org <dbl>, plu94046 <dbl>,
## #   plu94225 <dbl>, plu94770 <dbl>, small_org_bag <dbl>, large_org_bag <dbl>,
## #   xlarge_org_bag <dbl>

summary(hass_region)

##   week_ending                        region          avg_price_nonorg
##  Min.   :2017-01-02 00:00:00.00   Length:1576        Min.   :0.620   
##  1st Qu.:2017-12-10 00:00:00.00   Class :character   1st Qu.:0.990   
##  Median :2018-11-18 00:00:00.00   Mode  :character   Median :1.130   
##  Mean   :2018-12-01 15:57:33.80                      Mean   :1.147   
##  3rd Qu.:2019-11-24 00:00:00.00                      3rd Qu.:1.290   
##  Max.   :2020-11-01 00:00:00.00                      Max.   :1.900   
##     plu4046           plu4225           plu4770         small_nonorg_bag 
##  Min.   :  10779   Min.   :   4016   Min.   :    26.8   Min.   :  12775  
##  1st Qu.: 728445   1st Qu.: 627577   1st Qu.: 15463.1   1st Qu.: 904415  
##  Median :1482193   Median :1059240   Median : 32828.1   Median :1324807  
##  Mean   :1588680   Mean   :1294277   Mean   : 93410.4   Mean   :1400776  
##  3rd Qu.:2310060   3rd Qu.:1618820   3rd Qu.:142694.2   3rd Qu.:1802267  
##  Max.   :5160897   Max.   :6533193   Max.   :794742.9   Max.   :4017035  
##  large_nonorg_bag  xlarge_nonorg_bag avg_price_org      plu94046     
##  Min.   :    764   Min.   :     0    Min.   :0.650   Min.   :     0  
##  1st Qu.: 169019   1st Qu.: 10049    1st Qu.:1.400   1st Qu.:  2522  
##  Median : 443211   Median : 27462    Median :1.550   Median :  6862  
##  Mean   : 667406   Mean   : 53216    Mean   :1.591   Mean   : 14574  
##  3rd Qu.: 922094   3rd Qu.: 66460    3rd Qu.:1.740   3rd Qu.: 23970  
##  Max.   :3533282   Max.   :730833    Max.   :2.950   Max.   :156528  
##     plu94225         plu94770        small_org_bag      large_org_bag   
##  Min.   :     0   Min.   :    0.00   Min.   :     6.3   Min.   :     0  
##  1st Qu.:  5976   1st Qu.:    0.00   1st Qu.: 49844.1   1st Qu.:  6365  
##  Median : 22677   Median :   39.08   Median : 94482.8   Median : 17558  
##  Mean   : 27304   Mean   :  402.70   Mean   : 99030.5   Mean   : 33364  
##  3rd Qu.: 39145   3rd Qu.:  268.42   3rd Qu.:136281.7   3rd Qu.: 42416  
##  Max.   :425617   Max.   :15013.85   Max.   :771846.5   Max.   :247924  
##  xlarge_org_bag    
##  Min.   :    0.00  
##  1st Qu.:    0.00  
##  Median :    0.00  
##  Mean   :   54.76  
##  3rd Qu.:    0.00  
##  Max.   :16336.25

table(hass_region$region)

## 
##    California   Great Lakes      Midsouth     Northeast        Plains 
##           197           197           197           197           197 
## South Central     Southeast          West 
##           197           197           197

The data contains the total weight (US pounds) of Hass avocado’s being sold per region and year. They distinguish between avocados sold per unit (PLU variables) and those sold in bags. In the columns plu4046, plu4225 and plu4770 you can find the total weight of small, large and extra large Hass avocado’s being sold:

4046: non-organic small/medium Hass Avocados (~3-5 oz)
4225: non-organic large Hass Avocados (~8-10 oz)
4770: non-organic extra large Hass Avocados (~10-15 oz)
94046: organic small/medium Hass Avocados (~3-5 oz)
94225: organic large Hass Avocados (~8-10 oz)
94770: organic extra large Hass Avocados

PLU refers to a bulk sale. On the other hand, the bags indicates a pre-packaged container consisting of a variable number of avocados of mixed PLU type. For instance, a package of six avocados may consist of 2 PLU 4046, 3 PLU 4770 and 1 PLU 4225. In other words, bagged amount sold are unable to account for individual PLU amount sold. The data also contains the average price per organic and nonorganic avocado being sold in that region and year. Multiplying the price with the volume will create total sales.

Your goal will be to predict prices in the last observed time period. You are free to select regions, time periods, and variables as you think is necessary to well predict the prices of the avocado’s. How good your prediction is determined by the difference between the predicted value and the observed value of the last time period. In the final submission, please focus on a few analyses that you found most useful and interesting to answer the questions. Do not submit a full paper with all the possible analyses you’ve tried.

In order to inspect the data over time, we create three new variables, day, month and year.

for(i in seq(1, nrow(hass_region))){
  hass_region$year[i] <- unlist(strsplit(as.character(hass_region$week_ending[i]),"-"))[1]
  hass_region$month[i] <- unlist(strsplit(as.character(hass_region$week_ending[i]),"-"))[2]
  hass_region$day[i] <- unlist(strsplit(as.character(hass_region$week_ending[i]),"-"))[3]
}

An alternative approach to this is to use functions from the lubridate package:

hass_region %>%
  mutate(
    year = year(week_ending),
    month = month(week_ending),
    day = day(week_ending)
    ) -> hass_region

I advise you to work from this notebook, so the results are well organized, and represented in such a way that others can make sense of it. For the assignment, it is important to show that you’ve learned how to code, but also how to interpret the plots, hypothesis tests and regression results. To get started, I added some examples below. Please make sure you add code and text below each question, and that the analyses combined become a report that is easy to read, and attractive to see.

2. Inspect variables

When you inspect the variables, pay special attention to measurement levels (how many categories, numeric values etc), distributions (use plots!), explore conditional distributions (plots separated for groups of observations), make selections on the data so you investigate how ‘complex’ the data is.

For the assignment, select those inspections that you felt were most insightful to you, and interpret these plots/summaries. As an example below I added my first inspection of the data:

3. Preliminary analyses

In preliminary analyses, the main goal is to get a feel for the data. How are variables construed, how do they relate with one another, what is the structure of the data? You might have noted during your inspection of the variables that this is a so-called long file with respect to all years and regions. This means that years and regions are placed below one another on separate rows.

As you will see towards the end of this assignment, this makes time analysis easy. For the first analyses, you do not have to analyze the effect of time, but you do need to control for time. If not, the effect of time will confound the effect of the other variables.

The following questions help you to get to know the data better. If you think of other questions, feel free to add those analyses below.

a. How does price differ per region?

The boxplot above shows that the price changed over the years. Keep in mind that there are regional differences that confound this comparisons over the years, as the price change might have differed per region. In the following we explore the price per region, and focus only on the larger regions in the US that are listed in the hass_region data. This is just to illustrate what you can do with the data. Of course you can further explore hass_county if you want! Look into the different regions here: https://cran.r-project.org/web/packages/avocado/vignettes/a_intro.html

# first delete levels so that you can focus on specific part of the data
hass_region$region <- as.character(hass_region$region)
unique(hass_region$region)

## [1] "California"    "Great Lakes"   "Midsouth"      "Northeast"    
## [5] "Plains"        "South Central" "Southeast"     "West"

boxplot(avg_price_nonorg~region, hass_region)

boxplot(avg_price_org~region, hass_region)

# Process and visualize the hass_region data
hass_region %>%
  mutate(volume_org = small_org_bag + large_org_bag + xlarge_org_bag, # Calculate total organic volume
         volume_nonorg = small_nonorg_bag + large_nonorg_bag + xlarge_nonorg_bag, # Calculate total non-organic volume
         year = as.integer(format(as.Date(week_ending), "%Y"))) %>% # Extract year from week_ending
  dplyr::select(avg_price_nonorg, avg_price_org, week_ending, region, volume_org, volume_nonorg, year) %>% # Select relevant columns
  pivot_longer(cols = c(avg_price_nonorg, avg_price_org), names_to = "type", values_to = "price") %>% # Reshape data to long format
  ggplot(aes(x = region, y = price)) + # Create ggplot object with region on x-axis and price on y-axis
  geom_boxplot() + # Add boxplot layer
  facet_grid(year ~ type, scales="free") + # Create separate facets for each year and type of price
  labs(title = "Average Avocado Prices by Region, Year, and Type", x = "Region", y = "Price") + # Add titles and labels
  theme_minimal() + # Use minimal theme
  theme(
    axis.text.x = element_text(size = 12, angle = 45, hjust = 1), # Increase size and rotate x-axis labels for readability
    axis.text.y = element_text(size = 12), # Increase size of y-axis labels
    strip.text = element_text(size = 14), # Increase size of facet labels
    plot.title = element_text(size = 16, face = "bold"), # Increase size and bold the title
    axis.title = element_text(size = 14) # Increase size of axis titles
  )

# Process and visualize the hass_region data
hass_region %>%
  filter(year %in% c(2018, 2019, 2020), region %in% c("California", "South Central", "West")) %>% # Filter for the years 2018, 2019, and 2020
  mutate(volume_org = small_org_bag + large_org_bag + xlarge_org_bag, # Calculate total organic volume
         volume_nonorg = small_nonorg_bag + large_nonorg_bag + xlarge_nonorg_bag # Calculate total non-organic volume
         ) %>%
  dplyr::select(avg_price_nonorg, avg_price_org, year, month, region, volume_org, volume_nonorg) %>% # Select relevant columns
  pivot_longer(cols = c(avg_price_nonorg, avg_price_org), names_to = "type", values_to = "price") %>% # Reshape data to long format
  ggplot(aes(x = as.numeric(month), y = price, color = region, group = region)) + # Create ggplot object with month on x-axis and price on y-axis
   geom_smooth(method = "loess", se = FALSE) + # Add smooth line layer using LOESS smoothing without confidence interval
  # geom_line() + # Add line layer
  # facet_grid(~ type + year, scales = "free") + 
  facet_wrap(~ type + year, scales = "free") + # Create separate facets for each type of price and year
  geom_vline(xintercept = 7, linetype = "dashed", color = "red", size = 1) + # Add vertical line at July (7th month)
  # geom_segment(data = . %>% group_by(type, year) %>% summarise(max_price = max(price, na.rm = TRUE)),
  #              aes(x = 7, xend = 7, y = max_price, yend = max_price * 0.95),
  #              arrow = arrow(length = unit(0.2, "cm")), color = "red") + # Add arrow pointing to the peak in July
  # annotate("text", x = 7, y = Inf, label = "July", vjust = 2, color = "red", size = 4) + # Annotate with text "July"
  scale_x_continuous(breaks = 1:12, labels = month.abb) + # Set x-axis to display months
  labs(title = "Monthly Average Avocado Prices by Region, Year, and Type", x = "Month", y = "Price") + # Add titles and labels
  theme_minimal() + # Use minimal theme
  theme(
    axis.text.x = element_text(size = 7, angle = 45, hjust = 1), # Increase size and rotate x-axis labels for readability
    axis.text.y = element_text(size = 10), # Increase size of y-axis labels
    strip.text = element_text(size = 8), # Increase size of facet labels
    plot.title = element_text(size = 14, face = "bold"), # Increase size and bold the title
    axis.title = element_text(size = 12) # Increase size of axis titles
  )

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `geom_smooth()` using formula = 'y ~ x'

As you can see the price of avocado’s in the West is much more variable than those in South Central (where the price of avocado is the lowest). In the Northeast the price of the avocado is the highest.

Now let’s inspect the time dimension using tapply and month.

tapply(hass_region$avg_price_nonorg , hass_region$month, mean)

##         1         2         3         4         5         6         7         8 
## 1.0338889 0.9726562 1.1188889 1.1757639 1.1527941 1.1844118 1.2206250 1.2114706 
##         9        10        11        12 
## 1.2522917 1.2016912 1.1114423 1.0796250

A tidyverse solution instead of the tapply function:

hass_region %>%
  group_by(month) %>%
  summarise(avg_price_nonorg = mean(avg_price_nonorg))

## # A tibble: 12 × 2
##    month avg_price_nonorg
##    <dbl>            <dbl>
##  1     1            1.03 
##  2     2            0.973
##  3     3            1.12 
##  4     4            1.18 
##  5     5            1.15 
##  6     6            1.18 
##  7     7            1.22 
##  8     8            1.21 
##  9     9            1.25 
## 10    10            1.20 
## 11    11            1.11 
## 12    12            1.08

It seems that the average price of avocado’s somewhat becomes higher in the months August till October. Indeed, https://www.wisegeek.com/when-is-avocado-season.htm states the following:

The U.S. state of California is one of the world’s largest producers of avocados. Farmers there tend to grow the “Hass” variety, which matures quickly and produces a large, attractive fruit. The Hass is generally available beginning in late March and ending in early to mid-September, though of course a lot depends on weather patterns, storms, and basic growing conditions. Droughts and unusually cold weather tend to reduce crop availability, which can alter the start and end points of avocado season while also impacting price. Export demands may also place a strain on availability. Many California farms sell their goods to grocers around the world during the peak growing season.

Apparently, on average, the price for avocado’s is higher when it is in season in California.

b. Are organic more expensive than conventional?

If you now understand the data structure, please see if you can compare conventional and organic avocados. In the columns you can find both bags (avocados sold in package) or PLU numbers (avocados sold per unit). For the bags, you have two columns, one for the organic and one for the conventional avocado. The PLU numbers indicate the type of avocado and whether it is organic. When the number has a 9 in front, as in plu94046 which is a small organic Hass avocado.

Since you have a column for organic and conventional, the data is wide with respect to the type. If you want to compare the two, it makes sense to reshape the data to a long format, so that organic and conventional avocados are placed under one another on a separate row.

organic_hass_check <- hass_region %>% 
  dplyr::select(avg_price_nonorg, avg_price_org) %>%
  pivot_longer(cols=everything(),names_to = "type", values_to = "price") %>% 
  mutate(type=case_when(
    type == "avg_price_nonorg" ~ "conventional", 
    TRUE~ "organic")) 

# organic_hass_check

# Plot the data
ggplot(organic_hass_check, aes(x = type, y = price, fill = type)) +
  geom_boxplot() +
  labs(title = "Price Comparison of Avocados",
       x = "Type",
       y = "Average Price",
       fill = "Type") +
  theme_minimal()

# Note that it is also possible to modify the pivot_longer function to keep other variables such as the regions and years

library(dplyr)
library(tidyr)
library(ggplot2)
# Transform the dataset
organic_hass_check <- hass_region %>%
  dplyr::select(region, avg_price_nonorg, avg_price_org) %>% # Added region to selection
  pivot_longer(cols = c(avg_price_nonorg, avg_price_org), names_to = "type", values_to = "price") %>%
  mutate(type = case_when(
    type == "avg_price_nonorg" ~ "conventional",
    TRUE ~ "organic"
  ))


# Plot the data by region
ggplot(organic_hass_check, aes(x = region, y = price, fill = type)) +
  geom_boxplot() +
  labs(title = "Price Comparison of Avocados by Region",
       x = "Region",
       y = "Average Price",
       fill = "Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # Rotate x-axis labels for readability

library(dplyr)
library(tidyr)
library(ggplot2)

# Transform the dataset
organic_hass_check <- hass_region %>%
  dplyr::select(year, avg_price_nonorg, avg_price_org) %>% # Added year to selection
  pivot_longer(cols = c(avg_price_nonorg, avg_price_org), names_to = "type", values_to = "price") %>%
  mutate(type = case_when(
    type == "avg_price_nonorg" ~ "conventional",
    TRUE ~ "organic"
  ))

# Plot the data by year
ggplot(organic_hass_check, aes(x = factor(year), y = price, fill = type)) +
  geom_boxplot() +
  labs(title = "Price Comparison of Avocados by Year",
       x = "Year",
       y = "Average Price",
       fill = "Type") +
  theme_minimal()

This data set allows you to test the simple price difference between organic and conventional avocados. As a follow-up analysis, you could test differences in prices across regions and years.

c. Does the volume of organic versus nonorganic avocado’s produced affect the price?

Now let’s move to regression. For this, you need to reshape data so that you have a column of prices and volumes for the different types of avocados.

# It depends on what exactly you want to test, but there is an example of how you can reshape the data with the amount of avocados sold in bags:

vol_hass_check <- hass_region %>%
  mutate(volume_org = small_org_bag + large_org_bag + xlarge_org_bag,
         volume_nonorg = small_nonorg_bag + large_nonorg_bag + xlarge_nonorg_bag
         ) %>%
  dplyr::select(avg_price_nonorg, avg_price_org, week_ending, region, volume_org, volume_nonorg) %>%
  pivot_longer(cols=-c(week_ending, region, volume_org, volume_nonorg), names_to = "type", values_to = "price") %>%
  mutate(case = if_else("avg_price_org" == type, "organic", "nonorganic")) %>%
  dplyr::select(-type) %>%
  mutate(volume = if_else("organic" == case, volume_org, volume_nonorg)) %>%
  dplyr::select(-c("volume_org", "volume_nonorg" ))

vol_hass_check

## # A tibble: 3,152 × 5
##    week_ending         region      price case         volume
##    <dttm>              <chr>       <dbl> <chr>         <dbl>
##  1 2017-01-02 00:00:00 California   0.89 nonorganic 1940375.
##  2 2017-01-02 00:00:00 California   1.46 organic      67433.
##  3 2017-01-02 00:00:00 Great Lakes  0.88 nonorganic 1242361.
##  4 2017-01-02 00:00:00 Great Lakes  1.44 organic      30249.
##  5 2017-01-02 00:00:00 Midsouth     1.12 nonorganic  875004.
##  6 2017-01-02 00:00:00 Midsouth     1.72 organic      31030.
##  7 2017-01-02 00:00:00 Northeast    1.35 nonorganic  709623.
##  8 2017-01-02 00:00:00 Northeast    2    organic      68924.
##  9 2017-01-02 00:00:00 Plains       0.83 nonorganic  405848.
## 10 2017-01-02 00:00:00 Plains       1.62 organic      25883.
## # ℹ 3,142 more rows

For the assignment, you need to further analyze how these effects depend on other variables in your data. Think of nice research questions to answer using the methods you’ve learned. Inspect your questions using plots, hypothesis tests, and regression models. Again, feel free to add any other analysis that you may have found useful to understand the data.

# Plot the linear regression model affect of price vs volume
ggplot(vol_hass_check, aes(x = volume, y = price, color = case)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Avocado Price vs Volume by Type",
       x = "Volume",
       y = "Average Price",
       color = "Type") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

Correlation Test Results for Avocado Prices and Volumes
Type	Correlation	p_value
Organic	-0.1264782	5e-07
Nonorganic	-0.4766489	0e+00

Regression Analysis Summary for Avocado Prices and Volumes
Type	Coefficient	p_value	R_squared
Organic	-4e-07	5e-07	0.0159967
Nonorganic	-1e-07	0e+00	0.2271942

Descriptive Statistics for Avocado Prices and Volumes
case	mean_price	sd_price	mean_volume	sd_volume
nonorganic	1.147157	0.2257183	2121398.9	1175647.88
organic	1.591123	0.2890327	132449.7	88233.38

Conclusion

• Both organic and non-organic avocado prices are negatively impacted by the volume sold, meaning that higher volumes are associated with lower prices.

• The effect is more pronounced for non-organic avocados, as indicated by the higher correlation and R-squared values in the regression analysis.

• Despite the statistical significance, the actual impact (coefficients) is very small, indicating that other factors may play a more substantial role in determining avocado prices.

4. Explore time

After your first initial exploration of the data, you’re ready to start analyzing the effects over time. For instance, you could explore whether the effects you find differ per year, month or seasons. To do this, you’ll need a trend variable to predict increase or decrease in sales, volumes or prices across time. To create a trend variable, I first reorder the data, so that the first row contains the first time, and the last row the last time measured.

hass_region<-hass_region[order(hass_region$year, hass_region$month, hass_region$day),]

hass_region$trend <- NA
i<-1
for(date in unique(hass_region$week_ending)){
  hass_region$trend[hass_region$week_ending==date]<-i
  i<- i + 1
}

An alternative approach using tidyverse to create the trend variable:

hass_region %>%
  group_by(region) %>%
  arrange(week_ending) %>%
  mutate(trend2 = row_number()) %>%
  ungroup() -> hass_region


# Transform the dataset for plotting
hass_trend <- hass_region %>%
  mutate(volume_org = small_org_bag + large_org_bag + xlarge_org_bag,
         volume_nonorg = small_nonorg_bag + large_nonorg_bag + xlarge_nonorg_bag) %>%
  dplyr::select(avg_price_nonorg, avg_price_org, week_ending, region, volume_org, volume_nonorg, trend2) %>%
  pivot_longer(cols = c(avg_price_nonorg, avg_price_org), names_to = "type", values_to = "price") %>%
  mutate(case = if_else(type == "avg_price_org", "organic", "nonorganic")) %>%
  dplyr::select(-type) %>%
  mutate(volume = if_else(case == "organic", volume_org, volume_nonorg)) %>%
  dplyr::select(-c(volume_org, volume_nonorg))

# Plot the trend for avocado prices over time
ggplot(hass_trend, aes(x = trend2, y = price, color = case)) +
  geom_line() +
  facet_wrap(~ region) +
  labs(title = "Trend of Avocado Prices Over Time by Region",
       x = "Time (Trend)",
       y = "Average Price",
       color = "Type") +
  theme_minimal()

Trends Conclusion

• Prices Peak: There seems to be a price peak at the beginning of time and the third quarter, organic and non-organic, across all regions.

• Volumes Difference: There seems to be a larger volume of non-organic avocados across all regions, which increases over time.

• Market Volatility: The avocado market shows considerable volatility in both prices and volumes, with certain regions experiencing more pronounced fluctuations.

Make sure you check the regression assumptions, specifically linearity that might be violated by nonlinear trends. Explain what you see.

5. Predict

It would be nice to see how well your model to explain prices can predict prices. To do this, you need to coefficients of your model, and you need to extrapolate what the price would be given these coefficients. I present an example below. Using linear regression I can estimate the effect of trend and region on the price. As I will use the coefficients to predict the last time measured, I exclude this time from the analysis. The coefficients are later used to predict this time. As you can see below, I overpredict the price of both organic and conventional avocado’s.

As an example, I show you how you can compare the observed price for one particular day with the predicted price.

# observed values:

hass_region$date <- as.character(hass_region$week_ending)
hass_region[hass_region$date=="2018-08-05" & hass_region$region=="Northeast",c("avg_price_nonorg","avg_price_org","region","trend")]

## # A tibble: 1 × 4
##   avg_price_nonorg avg_price_org region    trend
##              <dbl>         <dbl> <chr>     <dbl>
## 1             1.21          1.75 Northeast    84

# Tidyverse approach to the same code:
# hass_region %>%
#   mutate(date = as.character(week_ending)) %>%
#   filter(date == "2018-08-05", region=="Northeast") %>%
#   dplyr::select(avg_price_nonorg, avg_price_org, region, trend)
# Note that with the tidyverse approach you did not save the variable called date as a new variable in your dataset, this was only created temporarily only for the calculations here

I run different models to predict organic versus conventional avocados and show the predictions:

# predicted values:

# conventional
fit<-lm(avg_price_nonorg ~ trend + region,hass_region[hass_region$date!="2018-08-05",])
print(paste("The predicted price for region Northeast of conventional avocados is:", as.character(fit$coefficients[1] + 
  fit$coefficients['trend']*84 +
  fit$coefficients['regionNortheast']*1)))

## [1] "The predicted price for region Northeast of conventional avocados is: 1.35839228334958"

# organic
fit<-lm(avg_price_org ~ trend + region,hass_region[hass_region$date!="2018-08-05",])
print(paste("The predicted price for region Northeast of organic avocados is:", as.character(fit$coefficients[1] + 
  fit$coefficients['trend']*84 + 
  fit$coefficients['regionNortheast']*1)))

## [1] "The predicted price for region Northeast of organic avocados is: 1.70932902159318"

To create a decision boundary between two regions, we need to reshape data:

# inspect two regions
glm_region <- glm(I(as.numeric(factor(region))-1) ~ small_nonorg_bag + avg_price_nonorg, hass_region[hass_region$region == "Northeast" | hass_region$region == "California",], family=binomial)

# create data for decision boundary
# summary(hass_region$small_nonorg_bag[hass_region$region == "Northeast" ])
plot_x <- c(1300000,1700000) # pick values based on x value in region of boundary decision
plot_y <- (-1 /coef(glm_region)[3]) * (coef(glm_region)[2] * plot_x + coef(glm_region)[1])
db.data <- data.frame(cbind(plot_x, plot_y))
colnames(db.data) <- c('x','y')

ggplot(hass_region[hass_region$region == "Northeast" | hass_region$region == "California",], aes(x=small_nonorg_bag,y=avg_price_nonorg, color=region))+ 
  labs(x="small_org_bag", y="avg_price_org", title="Average organic avocado prices")+
  geom_point(size = 4, shape = 19, alpha=0.8)+
  theme_bw() + 
  theme(plot.title = element_text(hjust = 0.5))  +
  geom_line(data=db.data,aes(x=x, y=y), col = "black")  # decision boundary

## Predicted conventional (non-organic) avocado price for 2021: 0.9824529

## Predicted organic avocado price for 2021: 1.436892

b. Explore how trend differs across regions.

fit<-lm(avg_price_org~trend+ I(trend^2) + I(trend^3), hass_region)
ggplot(fit, aes(y=hass_region$avg_price_org, x=hass_region$trend,colour=hass_region$region))+
  geom_point()+
  geom_smooth(method='lm', formula=y~x+I(x^2)+I(x^3))+
  facet_wrap(~hass_region$region) +
  labs(title = "Trend Differentiation Across Regions",
     x = "Trend",
     y = "Average Price (Organic)",
     color = "Type") +
  theme_minimal()

7. Classification

You can use PCA or cluster analysis to classify the countries based on multiple variables. Select only one year to avoid correlated observations over time. You could start with using the variables you’ve chosen to study above, and include them in the PCA. The components that are obtained from this PCA can be used to create classes (i.e. groups of states/regions that have similar values). Interpret the components, and argue why you have chosen a specific classification.

hass_region2<-hass_region[hass_region$region %in% c("California","Northeast","Great Lakes"),]

lda.fit<-lda(region ~ avg_price_org + avg_price_nonorg , hass_region2)

mylda_plot <- cbind(hass_region2, predict(lda.fit)$x)

ggscatterhist(
  mylda_plot, x = "LD2", y = "avg_price_org",
  color = "region", size = 3, alpha = 0.6,
  margin.params = list(fill = "region", color = "black", size = 0.2))

ggscatterhist(
  mylda_plot, x = "LD1", y = "avg_price_nonorg",
  color = "region", size = 3, alpha = 0.6,
  margin.params = list(fill = "region", color = "black", size = 0.2))