library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(readr)
library(modelr)
library(readxl)
Avocado <- read_excel("~/Desktop/Avocado.xlsx")

Data Set Linked- https://www.kaggle.com/datasets/pedroisrael/avocado-sales

Introduction

Millennials all across America are heartbroken over soaring avocado prices. Demand is strong and the fruits need a vast amount of water and labor to grow. As high as the demand for avocados is, so too is the amount of ways the fruit is used; avocado juice, avocado toast, avocado chips, guac the list goes on and on for the ways people enjoy avocados. However, Mexico, the main producer of avocados is struggling with lower crop yields making 25% lower shipments. The increased scarcity has raised prices across the country, and many are now paying an extra buck for their favorite fruit. This national tragedy has encouraged us to help America’s millennials and find where they can get the best cost per avocado ratio. We will do so by using a data set detailing avocado sales in U.S cities and comparing the average price between 2015-2018. Additionally we plan to look at the trends in these cities to find how prices have changed across months and years. Our data set contains two categorical variables outlining whether avocados are conventional or organic, and quantitative variables detailing time, quantity of avocados, weight, and price. Due to pre-existing knowledge, we know that the West Coast is the leading U.S. supplier of agricultural exports and the widest range of farm sizes. Based on this knowledge, we hypothesize that cities in the West Coast will have the best cost to quantity ratio of avocados over time, readily available to the public.

Procedure:

  1. Import data into RStudio and tidy if necessary

  2. Run tidy verse and initial visualizations comparing the average price of an avocado and the total bags being sold.

  3. Find mean values of average price and total avocado and arrange the data set to see the highest supply of avocados.

  4. Create a data set with observations that only fit the mean average criteria.

  5. Make more visualizations such as box plots and line graphs in order to make comparisons across larger groups

  6. Filter the most interesting and best cities

  7. Create a linear model in terms of time to ensure the best cities are being constant throughout the years and their trends.

  8. Create a linear model with all cities side by side with faceted graphs to compare

  9. T-testing

  10. Conclusion

For our first visualization, we wanted to see the correlations between the average price of avocados and the amount of product for each of the forty four cities.. We decided to use facet_wrap() to view each city separately and which cities have odd and distinct patterns. While the first visualization is misleading and not entirely legible, or even useful, it can be seen that Los Angeles, New York City, and Houstan had interesting curved shapes and irregular clusters that raised questions. These three cities are located in different parts of the United States and raised questions such as, could places with higher population densities sell more avocados for better prices?

ggplot(data=Avocado)+geom_point(mapping=aes(x=Avg_Price, y=Total_Bags))+facet_wrap(~Cities)

We know that throughout time prices fluctuate based on the economy and high demand, so wondering how this impacted each city we decided to make a visualization using geom_smooth() to view the cities’ average price for one avocado throughout the years. However a concern of ours was if the “Date” variable would be entirely legible when displayed on the graph, as it was input into the data set with the full month day and year and that was too specific to be seen clearly on the graph. Thankfully R selected only the year and we got the following graph:

ggplot(data=Avocado)+geom_smooth(mapping=aes(x=Avg_Price, y=Date, color=Cities), se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data=Avocado)+geom_smooth(mapping=aes(x=Avg_Price, y=Date, color=Cities), se=FALSE, show.legend = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

While the date variable was nicely displayed on the axis, we ran into a larger problem: there were too many cities to view. The graph is completely illegible if you wanted to examine a single city. We decided then to restrict our investigation to cities with the best deals for avocados. We defined cities with the best deal as those that had above the average total avocados and below the average price. By using summarize(), we were able to condense the data set and find the average price of avocados and average total avocados from 2015-2018 in 44 cities across America. These two numbers will be used as a reference point and eventually the “null” in the following t-tests as we restrict our interest to the cities where lots of avocados are sold with prices lower than the average.

Avocado %>% summarize(mean(Avg_Price))
## # A tibble: 1 × 1
##   `mean(Avg_Price)`
##               <dbl>
## 1              1.41
Avocado %>% summarize(mean(Total_Avo))
## # A tibble: 1 × 1
##   `mean(Total_Avo)`
##               <dbl>
## 1           178582.

Using arrange(), and having the results be organized in descending order, we wanted to see which cities tend to have the largest supply of avocados in order to get an idea of which cities and regions in the United States have big supplies.

arrange( Avocado, desc(Total_Avo)) %>% select(Total_Avo, Cities, Avg_Price, Date)
## # A tibble: 14,869 × 4
##    Total_Avo Cities      Avg_Price Date               
##        <dbl> <chr>           <dbl> <dttm>             
##  1   3689769 Los Angeles      0.74 2015-02-01 00:00:00
##  2   3645777 Los Angeles      0.78 2015-05-03 00:00:00
##  3   3592025 Los Angeles      0.73 2018-02-04 00:00:00
##  4   3314693 Los Angeles      0.79 2015-06-14 00:00:00
##  5   3195106 Los Angeles      0.78 2015-05-10 00:00:00
##  6   3059349 Los Angeles      0.94 2015-08-09 00:00:00
##  7   3037181 Los Angeles      0.96 2015-07-05 00:00:00
##  8   2887007 Los Angeles      0.85 2015-05-31 00:00:00
##  9   2876127 Los Angeles      0.84 2015-06-07 00:00:00
## 10   2808899 Los Angeles      0.58 2016-02-07 00:00:00
## # … with 14,859 more rows

With it being the only city show in the top 10 tibble, it is obvious that Los Angeles is a hotspot for selling avocados, and compared with the average price of $1.41, it can be assumed that stores in Los Angeles have good deals on avocados as well as an abundance available to sell. The largest supply of avocados that Los Angeles had was in 2015, 2018, 2016 (according to the tibble).

In order to view only the observations that fit our criteria, we used mutate() and filter() to create a new data set named “avocados2” only presenting the observations that have an average price less than $1.41 and amounts of avocados greater than 200,000.

avocados2<- filter(Avocado, Avg_Price < 1.41, Total_Avo > 200000 ) %>% select(Avg_Price, Total_Avo, Cities, Type, Date)

arrange(avocados2, Avg_Price) %>% select(Total_Avo, Avg_Price, Cities, Date)
## # A tibble: 3,195 × 4
##    Total_Avo Avg_Price Cities              Date               
##        <dbl>     <dbl> <chr>               <dttm>             
##  1   1750185      0.46 Phoenix Tucson      2017-02-05 00:00:00
##  2   1036815      0.49 Phoenix Tucson      2015-12-27 00:00:00
##  3   1281938      0.51 Houston             2017-01-01 00:00:00
##  4   1269280      0.51 Phoenix Tucson      2015-04-19 00:00:00
##  5   1312113      0.51 Phoenix Tucson      2016-03-06 00:00:00
##  6   1335088      0.52 Phoenix Tucson      2015-06-07 00:00:00
##  7   1213541      0.52 West Tex New Mexico 2017-02-05 00:00:00
##  8   1322008      0.53 Houston             2017-01-15 00:00:00
##  9   1226305      0.53 Houston             2017-02-26 00:00:00
## 10   2768618      0.53 Los Angeles         2017-02-05 00:00:00
## # … with 3,185 more rows

Looking at this new data set we selected the first four cities that appeared and picked them as the best places where avocados can be bought cheaply with a high probability of guaranteed supply. Seeing that these cities had different values at different times we decided to look at the general trend of average price and total avocado amounts across all cities.

ggplot(data=Avocado)+geom_smooth(mapping=aes(x=Date, y=Avg_Price))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(data=Avocado)+geom_smooth(mapping=aes(x=Date, y=Total_Avo))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The general trends of these two factors shows an inverse relationship with each other, with dips in quantity leading to spikes in price. Real world events like the import bans and low crop turnout can also be traced on the graph and their impacts shown. We tried next to make a visualization of the four cities to see the individual trends in each, however Phoenix Tucson did not have enough observations to graph properly. So looking at just the other three top cities we can see that each generally follows the trend of the average price across all cities but as mentioned before Los Angles has a far higher number of total avocados than the other cities. West Tex New Mexico has a tighter spread than the other two cities with generally lower prices, and Los Angeles has the largest variability between observations, with more distance between each.

Modeling for Los Angeles, Phoenix Tucson, Houston, West Tex New Mexico

avocados2<- filter(Avocado, Avg_Price < 1.41, Total_Avo > 200000 ) %>% select(Avg_Price, Total_Avo, Cities, Type, Date)

avocados3<-filter(avocados2, Cities %in% c("Los Angeles", "Phoneix Tucson", "Houston", "West Tex New Mexico"))


ggplot(data=avocados3, mapping=aes(x=Avg_Price, y=Date, color=Total_Avo))+geom_point()+facet_wrap(~Cities)+coord_flip()

Trying to do linear modeling for this data set proved to be mostly fruitless as the data was so scattered but there was a general upwards trends when looking at variables like average price.

avo_mod1<-avocados3 %>% data_grid(Date)
avo_mod2<-lm(Avg_Price~Date, data=avocados3)
coef(avo_mod2)
##   (Intercept)          Date 
## -1.726249e+00  1.760415e-09
grid<-avo_mod1 %>% add_predictions(avo_mod2)
ggplot(avocados3, aes(Date))+geom_point(aes(y=Avg_Price))+geom_line(aes(y=pred), data=grid, color="pink", size=1)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.

avocados3a<-avocados3 %>% add_residuals(avo_mod2)
ggplot(avocados3a, aes(Avg_Price, resid))+geom_point()+geom_smooth()+facet_grid(~Cities)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(avocados3, aes(x=Date, y=Avg_Price))+geom_point()+geom_smooth(method="lm", color="pink")+facet_wrap(~Cities, ncol=4)
## `geom_smooth()` using formula = 'y ~ x'

T-Testing:

We ran the following tests in RStudio for Los Angeles and Houston (West Tex was once again not included because it did not have enough observations.) Since the tibbles showed a total number of avocados much higher than our reference point, 200,000, we decided to focus on price, a variable that is very important to consumers. We selected the average price and ran a t-test with a null of $1.41, our average across all cities.

LA <- filter(Avocado, Cities=="Los Angeles") %>% select(Avg_Price) %>%  t.test(mu=1.41, alternative="greater")

Houston<-filter(Avocado, Cities=="Houston") %>% select(Avg_Price) %>% 
  t.test(mu=1.41, alternative="greater")

In Los Angeles, there is a 100% chance of observing our results with the average price of 1.41. Since the p-value is greater than 0.05, we fail to reject the null hypothesis meaning there is evidence to suggest that the average price of a single avocado is above 1.41. Similarly in Houston, there is a 100% chance of observations having a greater average price value than $1.41.

Conclussion

We began with a single data set displaying 44 cities and their information on avocado supply and prices. After multiple coding attempts, visualizations, and ranking, we slowly began the process of elimination to find the best cities with large supplies of avocados and lowest prices. While Los Angeles had the largest supply in the country over the span of three years, their prices fluctuate frequently due to economic and agricultural factors. Houston, Texas was a city with very consistent prices over the three years. Since it is difficult to view this data with a linear model, the results may be mis-interpretated; however, Houston Texas appeared as a good middle ground between big supply and low cost. The overall prices continue to climb, increasing nineteen percent, after a brief import ban of avocados from Mexico in 2022. We wonder how the modeling graphs would change if the data included observations from 2018-2022, including the time of this important ban in which avocado prices must have spiked due to the increased demand. The t-tests evoked another area of interest. The average prices for both Houston and Los Angeles have a 100% chance of being higher than the null, a value we calculated by summarizing the mean of all forty four cities. To better our project, we would like to see how the t-tests would change if the null hypothesis was a value representing the average prices between the three cities of interest: Houston, Los Angeles, and West Text New Mexico.