class: center, middle, inverse, title-slide

.title[
# 🥑 An Avocado Price Point Adventure Across the U.S. 🥑
]
.author[
### Tyger Smolar
]
.date[
### 2025-04-15
]

---

<!--
-->

## Introduction

In this data report, I analyze trends in avocado prices in the United States. The dataset is rich, with many variables that lend themselves to data visualization. I chose it because it contains several parameters that I initially found challenging to understand. I wanted to take on this challenge and turn it into a project whose visualizations make these parameters easier for others to understand.

<div style="display: flex;flex-direction: column;align-items: center;background-color: white;padding: 10px;">
<div style="border: 1px solid white;padding: 10px;width: 60%;"><a href=""><img src="https://assets.farmjournal.com/dims4/default/7f17e59/2147483647/strip/true/crop/1200x801+0+28/resize/800x534!/quality/90/?url=https%3A%2F%2Ffj-corp-pub.s3.us-east-2.amazonaws.com%2Fs3fs-public%2F2023-11%2Favocados.png" style="display: block;width: 80%;"></a></div>
<p style="color: white;margin-top: 10px;font-size: 16px;text-align: center;">Avocados (image: Farm Journal)</p></div>

## Research Question

The **research questions** I aim to address with this dataset are:

1. **How have avocado prices changed across the United States, and have some avocado sizes become more expensive in certain regions?**
2. **Are regions closer to Mexico, the largest exporter of avocados, getting avocados for a lower price?**

## Data

<!--
-->

The avocado dataset was published by a Kaggle user and contains data from the Hass Avocado Board. It was assembled to analyze trends and price fluctuations, and it covers avocado sales from 2015 to 2018. The data comes from the Hass Avocado Board's public records, with regional, descriptive, and numerical fields filled in by Kaggle users. The variables most important to me are the region (categorical), the average price (numerical), the number of small (4046), medium (4225), and large (4770) avocados sold (numerical), and the year (categorical).

#### 1. Overview

The avocado prices dataset describes trends in avocado sales in the United States from 2015 to 2018. It was taken from large public records and published on Kaggle.

#### 2. Source & Collection

- **Source**: The data was extracted from *Kaggle*.
- **Original Purpose**: Describe trends in avocado sales.
- **Collection Method**: The data was taken from a large Hass Avocado Board dataset; missing data was averaged and supplemented by Kaggle users.

#### 3. Variables & Descriptions

The dataset contains 18,249 observations and 14 variables, including quantitative, categorical, and date/time information.

#### 4. Key Characteristics

- **4046**: Number of small Hass avocados (PLU 4046)
- **4225**: Number of medium Hass avocados (PLU 4225)
- **4770**: Number of large Hass avocados (PLU 4770)

#### 5. Why Was This Data Collected?

- **Goal**: To analyze consumer trends in avocado prices across the United States.

#### 6. Limitations & Considerations

Because the dataset was compiled by a Kaggle user, some missing values may have been replaced with averaged or otherwise imputed values.

## Exploratory data analysis

### Summary table

The table below summarizes the main statistics (range, median, etc.)
of the key quantitative variables (`AveragePrice`, `Total Volume`, `4046`, `4225`):

<table class="table table-striped" style="color: black; width: auto !important; margin-left: auto; margin-right: auto;">
<caption>Summary Statistics for Avocado Data</caption>
<thead>
<tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> AveragePrice </th> <th style="text-align:left;"> Total Volume </th> <th style="text-align:left;"> 4046 </th> <th style="text-align:left;"> 4225 </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> Min. :0.440 </td> <td style="text-align:left;"> Min. : 85 </td> <td style="text-align:left;"> Min. : 0 </td> <td style="text-align:left;"> Min. : 0 </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> 1st Qu.:1.100 </td> <td style="text-align:left;"> 1st Qu.: 10839 </td> <td style="text-align:left;"> 1st Qu.: 854 </td> <td style="text-align:left;"> 1st Qu.: 3009 </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> Median :1.370 </td> <td style="text-align:left;"> Median : 107377 </td> <td style="text-align:left;"> Median : 8645 </td> <td style="text-align:left;"> Median : 29061 </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> Mean :1.406 </td> <td style="text-align:left;"> Mean : 850644 </td> <td style="text-align:left;"> Mean : 293008 </td> <td style="text-align:left;"> Mean : 295155 </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> 3rd Qu.:1.660 </td> <td style="text-align:left;"> 3rd Qu.: 432962 </td> <td style="text-align:left;"> 3rd Qu.: 111020 </td> <td style="text-align:left;"> 3rd Qu.: 150207 </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:left;"> Max. :3.250 </td> <td style="text-align:left;"> Max. :62505647 </td> <td style="text-align:left;"> Max. :22743616 </td> <td style="text-align:left;"> Max.
:20470573 </td> </tr>
</tbody>
</table>

**Interpretation**: The average price ranges from $0.44 to $3.25, with a mean of $1.41 and a median of $1.37. Comparing small (4046) and medium (4225) avocados, the small size has a higher maximum volume sold than the medium size but, interestingly, a slightly lower mean.

---

### Average Cost of an Avocado

[Data-visualization-1: Histogram of avocado prices, Author=Tyger Smolar]

**Insight**: Most avocado prices fall between $1.00 and $1.50. The distribution is roughly normal with a slight positive skew, meaning that typical prices sit toward the lower end of the range.

---

### Total Volume Sold vs Average Avocado Price

[Data-visualization-2: scatterplot of total volume versus average price, Author=Tyger Smolar]

**Insights**: The trend suggests that periods of high sales volume tend to have lower avocado prices. The relationship is noisy, which indicates that region or year could have an additional effect.

---

### Influence of Region on Avocado Price (How close to Mexico?)

[Data-visualization-3: scatterplot of average price by region, Author=Tyger Smolar]

**Insights**: This graph shows that more southern regions have lower average prices. It also shows that organic avocados are more expensive, and that regions farther from Mexico, one of the largest exporters of avocados, tend to have higher avocado prices (New York, for example). This graph cannot account for regional economic differences in produce sales, which is a limitation.
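
The regional comparison described above can be sketched in base R as a grouped mean. This is a minimal sketch under stated assumptions, not the code behind the figure: the data frame `avocado_demo` and all of its prices are hypothetical values invented for illustration, the three region names are only examples, and the column names `region` and `type` are assumed to match the dataset.

```r
# Hypothetical mini-sample: every price below is made up for illustration.
# Column names `region` and `type` are assumptions about the dataset layout.
avocado_demo <- data.frame(
  region = rep(c("Houston", "NewYork", "Chicago"), each = 4),
  type = rep(c("conventional", "conventional", "organic", "organic"), times = 3),
  AveragePrice = c(0.80, 0.90, 1.30, 1.40,
                   1.40, 1.50, 2.10, 2.20,
                   1.20, 1.30, 1.80, 1.90)
)

# Mean price for every region/type combination.
price_by_group <- aggregate(AveragePrice ~ region + type,
                            data = avocado_demo, FUN = mean)

# Mean price per region, sorted from cheapest to most expensive.
price_by_region <- aggregate(AveragePrice ~ region,
                             data = avocado_demo, FUN = mean)
price_by_region <- price_by_region[order(price_by_region$AveragePrice), ]

print(price_by_group)
print(price_by_region)
```

Sorting the per-region means makes it easy to check whether the cheapest regions are also the southern ones, and the region/type table separates the organic premium from the regional effect.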

---

### Comparison of Average Price by Avocado Type

[Data-visualization-4: boxplot of average price by types of avocados, Author=Tyger Smolar]

**Insight**: Organic avocados are more expensive than conventional ones: their maximum, mean, median, and range are all higher than those of conventional avocados.

To summarize, here are the key takeaways of this EDA:

1. **Numeric Summaries**: Show average price ranges for both types.
2. **Distributions**: Organic avocados are distributed over higher prices than conventional ones.
3. **Relationships**: The box plot shows a relationship between price and whether an avocado is organic.
4. **Facets**: The box plot shows that organic avocados are more expensive than conventional avocados.

## Analysis

The most important model for my research question is the one that shows the influence of region on avocado prices.

### Research Questions

1. **Primary Question**: How have avocado prices changed across the United States, and have some avocado sizes become more expensive in certain regions?
2. **Secondary Question**: Are regions closer to Mexico, the largest exporter of avocados, getting avocados for a lower price?

---

### 1. Dimensionality Reduction: PCA for Multivariate Patterns

[Data-visualization-5: biplot of PCA on avocado features, Author=Tyger Smolar]

**Key Insights**:

- **PC1 (60% variance)**
- **PC2 (20% variance)**

**Conclusion**: The top of the biplot likely represents expensive avocado sales, most probably in regions without high demand.

---

### 2. Regression Analysis

We use regression to quantify the impact of total volume sold on average price. To model the relationship between `AveragePrice` and `Total Volume`, we fit a **linear regression alongside a LOESS smooth**:

[Data-visualization-6: Regression plot with LOESS smooth of average price vs total volume, Author=Tyger Smolar]

**Key Insights**:

- **LOESS curve (red)**: Shows a nonlinear pattern in which average price decreases as volume increases.
- **Linear fit (dashed blue)**: Shows a negative linear trend.

**Conclusion**: Total volume sold may be a better predictor of price than region or even year, although those factors may still play an important role.

---

### 3. Final Model

Model: Avocado Price versus Region

How much does price vary by region?

[Data-visualization-7: Multiple linear regression of average price versus volume, by region, Author=Tyger Smolar]

```
##
## Call:
## lm(formula = AveragePrice ~ region_zone + Total.Volume + X4046 +
##     X4225 + X4770, data = avocado)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.99881 -0.29792 -0.03579  0.24251  1.81194
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)       1.414e+00  1.240e-02 114.005  < 2e-16 ***
## region_zoneNorth  2.363e-02  1.282e-02   1.844  0.06521 .
## region_zoneSouth -1.637e-01  1.625e-02 -10.074  < 2e-16 ***
## Total.Volume      1.632e-08  8.065e-09   2.023  0.04305 *
## X4046            -1.147e-07  1.324e-08  -8.666  < 2e-16 ***
## X4225             4.171e-08  1.345e-08   3.100  0.00194 **
## X4770            -3.939e-07  5.854e-08  -6.729 1.76e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3898 on 18242 degrees of freedom
## Multiple R-squared:  0.06321, Adjusted R-squared:  0.0629
## F-statistic: 205.1 on 6 and 18242 DF,  p-value: < 2.2e-16
```

**Key Insights + Conclusion**:

- There is a disparity between southern and northern regions: southern regions sit lower on the average-price axis and northern regions higher. In the model, the South zone is about $0.16 cheaper than the reference zone (coefficient -0.164, p < 2e-16), while the North zone does not differ significantly at the 5% level (p = 0.065). The volume terms are statistically significant but very small, and the model explains only about 6% of the variance in price (R-squared = 0.063), so the association between total volume and price is present but weak in this model.

---

### 4. Supplemental Visualization

### **Final Answers to Research Questions**

1. **Primary Question**: How does region contribute to price disparities in costs of avocados?
2. **Secondary Question**: Is total volume sold a factor in finding out the average cost of an avocado?
3. **Unexpected Finding**: Total volume of avocados sold may be more strongly associated with price than location/region.

---

## Discussion & Conclusion

Ultimately, if you are looking to buy an avocado, it could be a good idea to buy when avocados are in season and the total volume being sold in the United States is at its peak. Based on this dataset, avocados bought under those conditions are likely to be cheaper. Note that regional and size variations also contribute to price.

**Future Research Directions**:

- Include datasets from more northern regions, and do not limit the data to the United States.
- Use a dataset that covers more years.
- Validate whether the findings are consistent with other fruits and vegetables.

## References

1. **Kaggle Avocado Dataset** https://www.kaggle.com/datasets/neuromusic/avocado-prices