Building a Logistic Regression Model

1. Create Binary Variable

  • Binary Variable: ‘high_rating’

  • Goal: Model whether a coffee is highly rated based on various factors

coffee_clean <- coffee_clean %>%
  mutate(
    high_rating = if_else(rating >= 95, 1, 0)
  )
table(coffee_clean$high_rating)
## 
##    0    1 
## 1742  338
# Modifying 'is_top_country' to top 5 to test statistical significance
top_countries <- coffee_clean %>%
  count(loc_country, sort = TRUE) %>%
  slice_head(n = 5) %>%
  pull(loc_country)

coffee_clean <- coffee_clean %>%
  mutate(is_top_country = if_else(loc_country %in% top_countries, 1, 0))
table(coffee_clean$is_top_country)
## 
##    0    1 
##   53 2027

2. Choose Explanatory Variables

  • ‘100g_USD’ - price per 100g of coffee in USD (continuous)

  • ‘roast_num’ - ordered roast level (integer)

  • ‘is_top_country’ - 1 if roast is in top 10, 0 otherwise

3. Fit Logistic Regression Model

logit_mod <- glm(
  high_rating ~ `100g_USD` + roast_num,
  data = coffee_clean,
  family = binomial
)

summary(logit_mod)
## 
## Call:
## glm(formula = high_rating ~ `100g_USD` + roast_num, family = binomial, 
##     data = coffee_clean)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.651363   0.238283   -6.93  4.2e-12 ***
## `100g_USD`   0.056151   0.006345    8.85  < 2e-16 ***
## roast_num   -0.282473   0.109479   -2.58  0.00988 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1846.2  on 2079  degrees of freedom
## Residual deviance: 1725.3  on 2077  degrees of freedom
## AIC: 1731.3
## 
## Number of Fisher Scoring iterations: 4

4. Interpret Coefficients

Model Interpretation:

\[ logit(Pr⁡(\text{high_rating=1}))=β_0+β_1(\text{100g_USD})+β_2(\text{roast_num})+β_3(\text{is_top_country}) \]

Price (100g_USD)

exp(coef(logit_mod)["`100g_USD`"])
## `100g_USD` 
##   1.057757

\(Estimate = 0.057\)

\(p < 2e^{-16}\)

Testing for linearity

# 1. Fit the model
model <- glm(
  high_rating ~ `100g_USD` + roast_num,
  family = binomial,
  data = coffee_clean
)

# 2. Compute log odds
coffee_clean$logit <- predict(model, type = "link")

# 3. Plot price vs logit
ggplot(coffee_clean, aes(x = `100g_USD`, y = logit)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  labs(
    x = "Price per 100g (USD)",
    y = "Log-Odds of High Rating",
    title = "Checking Linearity of the Logit for Price"
  )
## `geom_smooth()` using formula = 'y ~ x'

Interpretation: Holding roast level and country group constant, each $1 increase per 100g increases the odds of being highly rated by about 5.9%. This is statistically strong and practically meaningful: higher‑priced coffees are more likely to be elite. The assumption of linearity is confirmed through this statistical check, as the visualization shows an almost straight line demonstrating the strong positive relationship between the two variables.

Roast Level (roast_num)

exp(coef(logit_mod)["roast_num"])
## roast_num 
##  0.753917

\(Estimate = -0.288\)

\(p = 0.00852\)

Interpretation: For each level darker in coffee roast, the odds of being highly rating, which I defined as rated 95/100 and above, decrease by approximately 24%, holding price and country constant. I conclude from this that light roasts receive consistently far higher ratings.

Top Country (is_top_country)*

exp(coef(logit_mod)["is_top_country"])
## <NA> 
##   NA

\(Estimate = 0.485\)

\(p = 0.2474\)

UPDATE: ‘is_top_country’ is not statistically significant as an indicator of high rating, and has subsequently been removed from the model for best practice.

5. Confidence Interval: Price

Calculate 95% Confidence Interval for ‘100g_USD’

coef_est <- coef(logit_mod)["`100g_USD`"]
se_est   <- summary(logit_mod)$coefficients["`100g_USD`", "Std. Error"]

# 95% CI on log-odds scale
lower_log <- coef_est - 1.96 * se_est
upper_log <- coef_est + 1.96 * se_est

c(lower_log, upper_log)
## `100g_USD` `100g_USD` 
## 0.04371461 0.06858712
# Convert to odds ratio CI
exp(c(lower_log, upper_log))
## `100g_USD` `100g_USD` 
##   1.044684   1.070994

CI Insights

Odds Ratio Confidence Interval:

\[ [1.0455,  1.072] \]

Holding roast level and country group constant, each additional $1 per 100g of coffee is associated with an increase in the odds of being highly rated by between 4.5% and 7.2%, with 95% confidence.

CI Meaning

Because the entire confidence interval lies above 1, the effect of price is:

  • positive

  • statistically significant

  • consistent across plausible values of the coefficient

This means that even after accounting for roast level and top‑country status, higher‑priced coffees reliably have higher odds of being rated 95 or above. This is statistically important because the effect of price is not only significant, but highly precise. From this, I can confidently conclude that the practical effect of this model is real. Ultimately, while the top countries are insignificant to this model, the roast level, but especially price are major indicators of coffee ratings.

Saving Dataset Update

saveRDS(coffee_clean, "coffee_clean.rds")