Response Variable: ‘rating’

Explanatory Variable: ‘loc_country’

1. Consolidate top 10 countries

# Find the 10 most frequent roaster locations
top_countries <- coffee_clean %>%
  count(loc_country, sort = TRUE) %>%
  slice_head(n = 10) %>%
  pull(loc_country)

# Keep the top 10 as their own groups and lump everything else into "Other"
coffee_clean <- coffee_clean %>%
  mutate(
    roast_country_group = if_else(loc_country %in% top_countries,
                                  loc_country,
                                  "Other")
  )
top_countries
##  [1] "United States" "Taiwan"        "Hawai'i"       "Canada"       
##  [5] "Guatemala"     "Hong Kong"     "Japan"         "China"        
##  [9] "Malaysia"      "Australia"

2. ANOVA Hypothesis

Null Hypothesis: All roast country groups have the same mean rating

\[ H_0 : \mu_1 = \mu_2 = \cdots = \mu_k \]

\[ H_1 : \text{At least one group mean differs} \]

anova_model <- aov(rating ~ roast_country_group, data = coffee_clean)
summary(anova_model)
##                       Df Sum Sq Mean Sq F value Pr(>F)    
## roast_country_group   10    259  25.881   11.09 <2e-16 ***
## Residuals           2069   4831   2.335                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
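As a quick check on how to read the table, the sketch below recomputes the F statistic from the printed mean squares and adds eta squared, an effect-size measure that is not part of the original output. All inputs are copied from the rounded table above, so the results only match to rounding; the variable names are illustrative.

```r
# Values copied from the printed ANOVA table (rounded as printed).
ss_between <- 259      # Sum Sq, roast_country_group
ss_resid   <- 4831     # Sum Sq, Residuals
df_between <- 10       # 11 groups (top 10 + "Other") minus 1
df_resid   <- 2069
ms_between <- 25.881   # Mean Sq, roast_country_group
ms_resid   <- 2.335    # Mean Sq, Residuals

# F is the ratio of the between-group to the within-group mean square.
f_stat <- ms_between / ms_resid

# Eta squared: share of the total variation in rating explained by the grouping.
eta_sq <- ss_between / (ss_between + ss_resid)

# Upper-tail p-value of the F distribution with (10, 2069) degrees of freedom.
p_value <- pf(f_stat, df_between, df_resid, lower.tail = FALSE)

round(c(F = f_stat, eta_sq = eta_sq), 3)
p_value
```

The recomputed F differs from the printed 11.09 only in the second decimal because the inputs are rounded. Eta squared comes out near 0.05, so the grouping explains roughly 5% of the variation in rating.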

3. Visualization

# Box line = group median; red point = group mean
ggplot(coffee_clean, aes(x = roast_country_group, y = rating, fill = roast_country_group)) + 
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", color = "red") + 
  theme_minimal() +
  labs(
    title = "Coffee Rating by Roast Country",
    x = "Roast Country",
    y = "Rating"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

4. Conclusion

We reject the null hypothesis. The ANOVA p-value is below 2e-16, far under the 0.05 threshold, and F = 11.09, meaning the mean square between roast-country groups is about 11 times the mean square within groups. I also tried other groupings (the top 5 countries only, grouping by median rating, and the highest-rated countries); the F value did not drop much and more often increased, so the result does not depend on this particular grouping.

Rejecting the null supports the alternative hypothesis that at least one group mean differs. It does not say which groups differ or by how much. The residual mean square is 2.335, which corresponds to a within-group standard deviation of about 1.5 rating points, so ratings within each group are fairly tightly clustered. The grouping accounts for about 5% of the total variation (259 of 5,090 in the Sum Sq column), so the effect is statistically clear but modest in size.

I conclude that ‘loc_country’ is a statistically significant explanatory variable for coffee ratings. One plausible explanation is that trained, skilled roasters with the right tools and supplies get more value out of the coffee, but the ANOVA itself does not test that. The box plot above is consistent with the result: some countries, such as the United States, Taiwan, and Australia, tend to receive higher ratings, whereas others consistently receive lower ones. The tighter a box, the more consistent the ratings in that group.
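The ANOVA only says that at least one group mean differs. A common follow-up for finding which pairs differ is Tukey's HSD test. The sketch below runs it on simulated data, not on the coffee data: the `toy` data frame, its three groups, and their means are made up for illustration.

```r
# Simulated example data (hypothetical, NOT the coffee dataset):
# groups A and B share a mean, group C is shifted upward.
set.seed(1)
toy <- data.frame(
  group  = factor(rep(c("A", "B", "C"), each = 30)),
  rating = c(rnorm(30, mean = 93, sd = 1.5),
             rnorm(30, mean = 93, sd = 1.5),
             rnorm(30, mean = 95, sd = 1.5))
)

# Fit the one-way ANOVA, then compare every pair of groups.
toy_model <- aov(rating ~ group, data = toy)
tukey <- TukeyHSD(toy_model)

# One row per pair: difference in means, confidence bounds, adjusted p-value.
tukey$group
```

Applied to the model in this report, the equivalent call would be `TukeyHSD(anova_model)`, which returns one row for each of the 55 pairs among the 11 groups.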

5. Linear Regression Model

Continuous Explanatory Variable: ‘100g_USD’ (price per 100g)

lm_model <- lm(rating ~ `100g_USD`, data = coffee_clean)
summary(lm_model)
## 
## Call:
## lm(formula = rating ~ `100g_USD`, data = coffee_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1907 -0.9303  0.0454  1.0301  4.0985 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 92.736655   0.043965  2109.3   <2e-16 ***
## `100g_USD`   0.041205   0.003195    12.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.506 on 2078 degrees of freedom
## Multiple R-squared:  0.07412,    Adjusted R-squared:  0.07367 
## F-statistic: 166.3 on 1 and 2078 DF,  p-value: < 2.2e-16
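The numbers in this summary are linked, and a short check makes the links explicit. The sketch below uses only values copied from the printed output (rounded as printed); the variable names are illustrative. In a regression with a single predictor, the slope's t value is the estimate divided by its standard error, the F-statistic is that t value squared, and the correlation between price and rating is the square root of R-squared with the sign of the slope.

```r
# Values copied from the printed lm() summary.
estimate  <- 0.041205   # slope for `100g_USD`
std_error <- 0.003195   # its standard error
r_squared <- 0.07412    # Multiple R-squared

# t value = estimate / standard error
t_value <- estimate / std_error

# With a single predictor, F equals t squared.
f_from_t <- t_value^2

# Correlation between price and rating, signed like the slope.
r <- sign(estimate) * sqrt(r_squared)

round(c(t = t_value, F = f_from_t, r = r), 3)
```

The results agree with the printed t value of 12.9 and F-statistic of 166.3 up to rounding. The correlation is about 0.27, so price explains roughly 7% of the variation in rating.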

6. Interpretation of Coefficients

Regression Model Summary

\[ \widehat{\text{rating}} = 92.7367 + 0.0412 \times \text{100g_USD} \]
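To make the fitted equation concrete, the sketch below evaluates it at a few prices. The function name `predict_rating` and the example prices are hypothetical; the intercept and slope are the rounded values from the equation above.

```r
# Predicted rating from the fitted line, using the rounded coefficients.
predict_rating <- function(price_100g_usd) {
  92.7367 + 0.0412 * price_100g_usd
}

# Hypothetical prices in USD per 100g.
prices <- c(0, 10, 20)
data.frame(price_100g_usd = prices, predicted_rating = predict_rating(prices))

# Each additional $10 per 100g raises the predicted rating by 10 * 0.0412.
predict_rating(20) - predict_rating(10)
```

With the fitted model object, the same prediction should be available through `predict(lm_model, newdata = data.frame(`100g_USD` = 10, check.names = FALSE))`, where `check.names = FALSE` keeps the non-syntactic column name intact.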

Coefficients