Binary Variable: ‘high_rating’
Goal: Model whether a coffee is highly rated based on various factors
coffee_clean <- coffee_clean %>%
mutate(
high_rating = if_else(rating >= 95, 1, 0)
)
table(coffee_clean$high_rating)
##
## 0 1
## 1742 338
# Modifying 'is_top_country' to top 5 to test statistical significance
top_countries <- coffee_clean %>%
count(loc_country, sort = TRUE) %>%
slice_head(n = 5) %>%
pull(loc_country)
coffee_clean <- coffee_clean %>%
mutate(is_top_country = if_else(loc_country %in% top_countries, 1, 0))
table(coffee_clean$is_top_country)
##
## 0 1
## 53 2027
‘100g_USD’ - price per 100g of coffee in USD (continuous)
‘roast_num’ - ordered roast level (integer)
‘is_top_country’ - 1 if roast is in top 10, 0 otherwise
logit_mod <- glm(
high_rating ~ `100g_USD` + roast_num,
data = coffee_clean,
family = binomial
)
summary(logit_mod)
##
## Call:
## glm(formula = high_rating ~ `100g_USD` + roast_num, family = binomial,
## data = coffee_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.651363 0.238283 -6.93 4.2e-12 ***
## `100g_USD` 0.056151 0.006345 8.85 < 2e-16 ***
## roast_num -0.282473 0.109479 -2.58 0.00988 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1846.2 on 2079 degrees of freedom
## Residual deviance: 1725.3 on 2077 degrees of freedom
## AIC: 1731.3
##
## Number of Fisher Scoring iterations: 4
Model Interpretation:
\[ logit(Pr(\text{high_rating=1}))=β_0+β_1(\text{100g_USD})+β_2(\text{roast_num})+β_3(\text{is_top_country}) \]
exp(coef(logit_mod)["`100g_USD`"])
## `100g_USD`
## 1.057757
\(Estimate = 0.057\)
\(p < 2e^{-16}\)
# 1. Fit the model
model <- glm(
high_rating ~ `100g_USD` + roast_num,
family = binomial,
data = coffee_clean
)
# 2. Compute log odds
coffee_clean$logit <- predict(model, type = "link")
# 3. Plot price vs logit
ggplot(coffee_clean, aes(x = `100g_USD`, y = logit)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "loess", se = FALSE, color = "blue") +
labs(
x = "Price per 100g (USD)",
y = "Log-Odds of High Rating",
title = "Checking Linearity of the Logit for Price"
)
## `geom_smooth()` using formula = 'y ~ x'
Interpretation: Holding roast level and country group constant, each $1 increase per 100g increases the odds of being highly rated by about 5.9%. This is statistically strong and practically meaningful: higher‑priced coffees are more likely to be elite. The assumption of linearity is confirmed through this statistical check, as the visualization shows an almost straight line demonstrating the strong positive relationship between the two variables.
exp(coef(logit_mod)["roast_num"])
## roast_num
## 0.753917
\(Estimate = -0.288\)
\(p = 0.00852\)
Interpretation: For each level darker in coffee roast, the odds of being highly rating, which I defined as rated 95/100 and above, decrease by approximately 24%, holding price and country constant. I conclude from this that light roasts receive consistently far higher ratings.
exp(coef(logit_mod)["is_top_country"])
## <NA>
## NA
\(Estimate = 0.485\)
\(p = 0.2474\)
UPDATE: ‘is_top_country’ is not statistically significant as an indicator of high rating, and has subsequently been removed from the model for best practice.
Calculate 95% Confidence Interval for ‘100g_USD’
coef_est <- coef(logit_mod)["`100g_USD`"]
se_est <- summary(logit_mod)$coefficients["`100g_USD`", "Std. Error"]
# 95% CI on log-odds scale
lower_log <- coef_est - 1.96 * se_est
upper_log <- coef_est + 1.96 * se_est
c(lower_log, upper_log)
## `100g_USD` `100g_USD`
## 0.04371461 0.06858712
# Convert to odds ratio CI
exp(c(lower_log, upper_log))
## `100g_USD` `100g_USD`
## 1.044684 1.070994
Odds Ratio Confidence Interval:
\[ [1.0455, 1.072] \]
Holding roast level and country group constant, each additional $1 per 100g of coffee is associated with an increase in the odds of being highly rated by between 4.5% and 7.2%, with 95% confidence.
Because the entire confidence interval lies above 1, the effect of price is:
positive
statistically significant
consistent across plausible values of the coefficient
This means that even after accounting for roast level and top‑country status, higher‑priced coffees reliably have higher odds of being rated 95 or above. This is statistically important because the effect of price is not only significant, but highly precise. From this, I can confidently conclude that the practical effect of this model is real. Ultimately, while the top countries are insignificant to this model, the roast level, but especially price are major indicators of coffee ratings.
saveRDS(coffee_clean, "coffee_clean.rds")