Week 8 Data Dive - Regression Modeling

Response Variable: ‘rating’

Explanatory Variable: ‘loc_country’

1. Consolidate top 10 countries

top_countries <- coffee_clean %>%
  count(loc_country, sort = TRUE) %>%
  slice_head(n = 10) %>%
  pull(loc_country)

coffee_clean <- coffee_clean %>%
  mutate(
    roast_country_group = if_else(loc_country %in% top_countries,
                                  loc_country,
                                  "Other")
  )
top_countries

##  [1] "United States" "Taiwan"        "Hawai'i"       "Canada"       
##  [5] "Guatemala"     "Hong Kong"     "Japan"         "China"        
##  [9] "Malaysia"      "Australia"

2. ANOVA Hypothesis

Null Hypothesis: All roast country groups have the same mean rating

\[ H_0 : \mu_1 = \mu_2 = \cdots = \mu_k \] \[ H_1 : \text{At least one group mean differs} \]

anova_model <- aov(rating ~ roast_country_group, data = coffee_clean)
summary(anova_model)

##                       Df Sum Sq Mean Sq F value Pr(>F)    
## roast_country_group   10    259  25.881   11.09 <2e-16 ***
## Residuals           2069   4831   2.335                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
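The table reports F and its p-value but not an effect size. As a rough check, the sketch below recomputes F from the rounded sums of squares printed above and adds eta squared, the share of total variation associated with the grouping. Eta squared is not part of the original analysis, and the variable names are my own.

```r
# Values copied from the ANOVA table above (sums of squares are rounded)
ss_between <- 259     # Sum Sq for roast_country_group
ss_within  <- 4831    # Sum Sq for Residuals
df_between <- 10
df_within  <- 2069

# F is the ratio of the between-group to the within-group mean square
f_stat <- (ss_between / df_between) / (ss_within / df_within)

# Eta squared: share of the total sum of squares tied to the grouping
eta_sq <- ss_between / (ss_between + ss_within)

round(f_stat, 2)
round(eta_sq, 3)
```

With these inputs eta squared comes out to roughly 0.05, so the grouping is associated with about 5% of the total variation in rating even though the F test is highly significant.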

3. Visualization

ggplot(coffee_clean, aes(x = roast_country_group, y = rating, fill = roast_country_group)) + 
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", color = "red") + 
  theme_minimal() +
  labs(
    title = "Coffee Rating by Roast Country",
    x = "Roast Country",
    y = "Rating"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

4. Conclusion

I reject the null hypothesis. The ANOVA output gives F = 11.09 on 10 and 2069 degrees of freedom with a p-value below 2e-16, far under the 0.05 threshold. F is the ratio of the between-group mean square (25.881) to the residual mean square (2.335), so the variation among roast-country group means is about 11 times what within-group variation alone would be expected to produce. I also filtered the countries in various ways (the top 5 countries, the median rating, the countries with the highest ratings); the F value did not drop much and increased more often than not, so the result does not appear to depend on how the groups are defined.

Rejecting the null supports the alternative hypothesis that at least one group mean differs. It does not say which groups differ or by how much. The residual mean square of 2.335 corresponds to a within-group standard deviation of about 1.5 rating points, so ratings within each group are fairly tightly clustered, and with 2,080 coffees even modest differences in group means are detectable.

I conclude that ‘loc_country’ has a statistically significant association with coffee rating. One plausible explanation is that trained, skilled roasters with the necessary tools and supplies get more value out of the coffee through their roasting technique, but ANOVA alone cannot establish that cause, and there are likely many contributing factors. The box plot above is consistent with this: some groups, such as the United States, Taiwan, and Australia, appear to receive higher ratings, whereas others consistently rate lower. The tighter a box is, the more consistent the ratings are within that group.
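ANOVA only says that at least one group mean differs. A common follow-up is Tukey's HSD, which compares every pair of groups. The sketch below runs it on a small simulated data set, not the coffee data; the `toy` data frame and its group labels are hypothetical and exist only to show the shape of the output.

```r
set.seed(1)

# Hypothetical toy data: three made-up groups, NOT the coffee data
toy <- data.frame(
  group  = factor(rep(c("A", "B", "C"), each = 30)),
  rating = c(rnorm(30, mean = 93, sd = 1.5),
             rnorm(30, mean = 93, sd = 1.5),
             rnorm(30, mean = 95, sd = 1.5))
)

toy_model <- aov(rating ~ group, data = toy)

# Tukey's HSD compares every pair of groups and adjusts the p-values
toy_tukey <- TukeyHSD(toy_model)
toy_tukey$group   # one row per pair; columns: diff, lwr, upr, p adj
```

For the coffee data the analogous call would presumably be `TukeyHSD(anova_model)`, after converting `roast_country_group` to a factor, which `TukeyHSD` may require. With 11 groups that produces 55 pairwise comparisons.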

5. Linear Regression Model

Continuous Explanatory Variable: ‘100g_USD’ (price per 100g)

lm_model <- lm(rating ~ `100g_USD`, data = coffee_clean)
summary(lm_model)

## 
## Call:
## lm(formula = rating ~ `100g_USD`, data = coffee_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1907 -0.9303  0.0454  1.0301  4.0985 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 92.736655   0.043965  2109.3   <2e-16 ***
## `100g_USD`   0.041205   0.003195    12.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.506 on 2078 degrees of freedom
## Multiple R-squared:  0.07412,    Adjusted R-squared:  0.07367 
## F-statistic: 166.3 on 1 and 2078 DF,  p-value: < 2.2e-16

6. Interpretation of Coefficients

Regression Model Summary

\[ \widehat{\text{rating}} = 92.7367 + 0.0412 \times \text{100g\_USD} \]

Coefficients

Intercept = 92.7367. The fitted line predicts a rating of 92.7367 for coffee priced at 0 USD per 100g. No coffee in practice costs nothing, so this should not be interpreted literally; it only anchors the line.
Slope = 0.0412. For every 1 USD increase in price per 100g of coffee, the expected rating increases by about 0.0412 points. To put this into context, a $20 increase in price per 100g is associated with an expected increase of about 0.82 rating points (20 × 0.0412). The slope is clearly distinguishable from zero (t = 12.9, p < 2e-16), so the association is unlikely to be due to chance, but it is small: a $20 increase per 100g is a large price jump, and yet it does not raise the expected rating by even one full point. Higher-priced coffees do tend to receive higher ratings, but the overall difference is minimal. Consumers and roasters should keep in mind that price alone does not determine the value of a coffee, and that many other factors contribute to high ratings.
R² = 0.074. ‘100g_USD’ explains about 7.4% of the variation in rating, which indicates a weak positive relationship between price and rating. The relationship is statistically significant (p < 2.2e-16), helped by the large sample of 2,080 coffees, but price by itself is a weak predictor of rating. This makes sense, as multiple variables contribute to the overall ‘rating’ of a coffee, such as the origin, the type of roast, and the roaster's country, which I investigated above using ANOVA.
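The arithmetic behind these interpretations can be reproduced from the printed regression output alone. The sketch below uses only numbers copied from `summary(lm_model)`. The variable names are my own, and the 95% confidence interval for the slope is an addition that the original output does not print.

```r
# Values copied from the summary(lm_model) output above
slope    <- 0.041205
slope_se <- 0.003195
df_resid <- 2078
r_sq     <- 0.07412

# Expected change in rating for a 20 USD increase per 100g
effect_20 <- 20 * slope

# 95% confidence interval for the slope, using the t distribution
t_crit   <- qt(0.975, df = df_resid)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se

# Correlation implied by R-squared (positive because the slope is positive)
r <- sqrt(r_sq)

round(effect_20, 2)
round(slope_ci, 4)
round(r, 2)
```

With the fitted model object available, `confint(lm_model)` should give the same interval directly. The interval excludes zero but stays well under 0.05 rating points per dollar, which matches the reading of a real but small effect.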