Building a Logistic Regression Model

1. Create Binary Variable

Binary Variable: ‘high_rating’
Goal: Model whether a coffee is highly rated based on various factors

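library(dplyr)  # provides %>%, mutate(), and if_else(); skip if already loaded

# Flag coffees rated 95 or above as highly rated (1); all others are 0
coffee_clean <- coffee_clean %>%
  mutate(
    high_rating = if_else(rating >= 95, 1, 0)
  )

# Count how many coffees fall in each class
table(coffee_clean$high_rating)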

## 
##    0    1 
## 1742  338
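
As a quick check on class balance, the counts above can be turned into proportions. This is a minimal sketch that re-enters the two counts by hand (the `counts` vector is my own illustration, not an object from the analysis); with the real data, `prop.table(table(coffee_clean$high_rating))` should give the same shares.

```r
# Counts copied from the table() output above
counts <- c(`0` = 1742, `1` = 338)

# Share of coffees in each class
props <- prop.table(counts)
round(props, 4)
```

About 16% of the coffees are highly rated, so the outcome is imbalanced and the model's baseline odds of a high rating are low.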

2. Choose Explanatory Variables

‘100g_USD’ - price per 100g of coffee in USD (continuous)
‘roast_num’ - ordered roast level (integer)
‘is_top_country’ - 1 if the coffee was roasted in a top-10 country, 0 otherwise (binary)

3. Fit Logistic Regression Model

logit_mod <- glm(
  high_rating ~ `100g_USD` + roast_num + is_top_country,
  data = coffee_clean,
  family = binomial
)

summary(logit_mod)

## 
## Call:
## glm(formula = high_rating ~ `100g_USD` + roast_num + is_top_country, 
##     family = binomial, data = coffee_clean)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -3.359102   1.379100  -2.436   0.0149 *  
## `100g_USD`      0.056863   0.006346   8.961   <2e-16 ***
## roast_num      -0.279408   0.109508  -2.551   0.0107 *  
## is_top_country  1.700220   1.348236   1.261   0.2073    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1846.2  on 2079  degrees of freedom
## Residual deviance: 1723.1  on 2076  degrees of freedom
## AIC: 1731.1
## 
## Number of Fisher Scoring iterations: 5
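
One way to judge the model as a whole is a likelihood-ratio test built from the two deviances in the summary. The sketch below re-enters those reported values by hand (the variable names are my own) and compares the drop in deviance against a chi-squared distribution whose degrees of freedom equal the number of predictors added.

```r
# Values copied from the summary(logit_mod) output above
null_dev  <- 1846.2   # intercept-only model
resid_dev <- 1723.1   # model with three predictors
df_null   <- 2079
df_resid  <- 2076

lr_stat <- null_dev - resid_dev   # drop in deviance
df_diff <- df_null - df_resid     # number of predictors added

# Upper-tail chi-squared probability for the drop in deviance
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
c(lr_stat = lr_stat, df = df_diff, p_value = p_value)
```

The drop of 123.1 on 3 degrees of freedom gives a p-value far below 0.001, so the three predictors together fit better than an intercept-only model. With the fitted object, `anova(logit_mod, test = "Chisq")` reports sequential versions of this test.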

4. Interpret Coefficients

Model Interpretation:

\[ \operatorname{logit}\big(\Pr(\mathrm{high\_rating} = 1)\big) = \beta_0 + \beta_1(\mathrm{100g\_USD}) + \beta_2(\mathrm{roast\_num}) + \beta_3(\mathrm{is\_top\_country}) \]
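
To make the equation concrete, the sketch below plugs the reported coefficients into the linear predictor and converts the log-odds to a probability with the inverse logit. The coffee described here is hypothetical (a price of $10 per 100g, `roast_num = 2`, roasted in a top-10 country), and I am assuming 2 is a valid roast level in this coding.

```r
# Coefficients copied from summary(logit_mod)
beta <- c(intercept = -3.359102, price = 0.056863,
          roast = -0.279408, top_country = 1.700220)

# Hypothetical coffee, values chosen only for illustration:
# intercept term, $10 per 100g, roast_num = 2, top-10 country
x <- c(1, 10, 2, 1)

eta  <- sum(beta * x)   # linear predictor on the log-odds scale
prob <- plogis(eta)     # inverse logit: 1 / (1 + exp(-eta))
c(log_odds = eta, probability = prob)
```

For this coffee the log-odds are about -1.65, which corresponds to a probability of roughly 0.16. With the real model, `predict(logit_mod, newdata = ..., type = "response")` returns the same kind of probability.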

Price (100g_USD)

exp(coef(logit_mod)["`100g_USD`"])

## `100g_USD` 
##   1.058511

$Estimate = 0.0569$

$p < 2 \times 10^{-16}$

Interpretation: Holding roast level and country group constant, each additional $1 per 100g is associated with about a 5.9% increase in the odds of being highly rated (odds ratio 1.059). This is statistically strong and practically meaningful: higher‑priced coffees are more likely to be rated 95 or above.
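
A $1 change is small relative to coffee prices, so it can help to rescale the effect. The sketch below uses the reported estimate (re-entered by hand) to compute the odds ratio for a $10 difference per 100g; the $10 step is my own choice for illustration.

```r
beta_price <- 0.056863   # estimate copied from summary(logit_mod)

or_per_1  <- exp(beta_price)        # odds ratio for a $1 increase
or_per_10 <- exp(10 * beta_price)   # odds ratio for a $10 increase
round(c(per_1 = or_per_1, per_10 = or_per_10), 3)
```

A $10 difference multiplies the odds by roughly 1.77. Effects compound on the odds scale, so this is larger than ten times the 5.9% figure would suggest.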

Roast Level (roast_num)

exp(coef(logit_mod)["roast_num"])

## roast_num 
## 0.7562313

$Estimate = -0.2794$

$p = 0.0107$

Interpretation: For each level darker in roast, the odds of being highly rated (defined here as 95/100 or above) decrease by approximately 24%, holding price and country group constant. I conclude from this that lighter roasts are more likely to earn a top rating; the model does not show how much higher their ratings are on average.
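
The p-value for roast level comes from a Wald z-test: the estimate divided by its standard error, compared against a standard normal distribution. The sketch below reproduces it from the two reported numbers, re-entered by hand.

```r
# Values copied from the roast_num row of summary(logit_mod)
est <- -0.279408
se  <-  0.109508

z <- est / se              # Wald z statistic
p <- 2 * pnorm(-abs(z))    # two-sided p-value under the standard normal
round(c(z = z, p = p), 4)
```

This matches the z value of -2.551 and the p-value of 0.0107 in the summary table.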

Top Country (is_top_country)

exp(coef(logit_mod)["is_top_country"])

## is_top_country 
##       5.475151

$Estimate = 1.7002$

$p = 0.2073$

Interpretation: Coffees roasted in a top‑10 country have an estimated 5.5 times the odds of being highly rated compared to others, controlling for price and roast. However, the p‑value (0.2073) indicates this effect is not statistically significant in this model. Ultimately, the estimated direction is positive, but the uncertainty is too large to draw a firm conclusion.
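
To see how wide the uncertainty is, the same Wald approach used for price can be applied to this coefficient. This is a sketch using the reported estimate and standard error, re-entered by hand, with 1.96 as the normal critical value.

```r
# Values copied from the is_top_country row of summary(logit_mod)
est <- 1.700220
se  <- 1.348236

# 95% Wald interval on the log-odds scale, then on the odds-ratio scale
ci_log <- c(lower = est - 1.96 * se, upper = est + 1.96 * se)
ci_or  <- exp(ci_log)
round(ci_or, 2)
```

The odds-ratio interval runs from about 0.39 to well above 70. It contains 1, so the data are compatible with anything from lower odds to far higher odds for top-10 countries.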

5. Confidence Interval: Price

Calculate 95% Confidence Interval for ‘100g_USD’

coef_est <- coef(logit_mod)["`100g_USD`"]
se_est   <- summary(logit_mod)$coefficients["`100g_USD`", "Std. Error"]

# 95% CI on log-odds scale
lower_log <- coef_est - 1.96 * se_est
upper_log <- coef_est + 1.96 * se_est

c(lower_log, upper_log)

## `100g_USD` `100g_USD` 
## 0.04442523 0.06930109

# Convert to odds ratio CI
exp(c(lower_log, upper_log))

## `100g_USD` `100g_USD` 
##   1.045427   1.071759

CI Insights

Odds Ratio Confidence Interval:

\[ [1.0454, 1.0718] \]

Holding roast level and country group constant, each additional $1 per 100g of coffee is associated with an increase in the odds of being highly rated by between 4.5% and 7.2%, with 95% confidence.
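
The manual calculation above is a Wald interval, which base R can also produce with `confint.default()`. The sketch below checks that the two agree. It uses simulated data unrelated to the coffee dataset, and the names `fit`, `x`, and `y` are placeholders of my own.

```r
set.seed(1)  # simulated data, not the coffee dataset

n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.5 * x))
fit <- glm(y ~ x, family = binomial)

# Manual Wald interval: estimate +/- z * SE
est <- coef(fit)["x"]
se  <- summary(fit)$coefficients["x", "Std. Error"]
z   <- qnorm(0.975)   # about 1.96
manual <- c(est - z * se, est + z * se)

# Built-in Wald interval from base R
builtin <- confint.default(fit, parm = "x", level = 0.95)

exp(manual)    # odds-ratio scale
exp(builtin)
```

Note that `confint()` on a glm uses profile likelihood rather than the Wald formula, so its limits can differ slightly from these.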

CI Meaning

Because the entire confidence interval lies above 1, the effect of price is:

positive
statistically significant
consistent across plausible values of the coefficient

This means that even after accounting for roast level and top‑country status, higher‑priced coffees reliably have higher odds of being rated 95 or above. This matters statistically because the effect of price is not only significant but also precisely estimated: the interval is narrow. From this, I can confidently conclude that the association between price and a high rating in this data is real rather than an artifact of sampling noise.

Week10 Data Dive - GLMs

Woods

2026-03-30