1. Introduction

The goal is to build a linear or generalized linear model using any response and explanatory variables from the ‘coffee_clean’ dataset. Last week I built a logistic regression model; this week I return to a linear regression framework. In recent weeks I engineered new variables such as ‘roast_num’ and ‘is_top_country’. This week's analysis uses them, together with price, to investigate what affects my chosen response variable, coffee ‘rating’.

2. Model Specification

‘100g_USD’: continuous measure of price (USD per 100g)
‘roast_num’: ordered roast level (1 = Light → 5 = Dark)
‘is_top_country’: binary indicator for top 10 roasting countries

lm <- lm(rating ~ `100g_USD` + roast_num + is_top_country, data = coffee_clean)
summary(lm)

## 
## Call:
## lm(formula = rating ~ `100g_USD` + roast_num + is_top_country, 
##     data = coffee_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9760 -0.9476  0.0117  0.9894  4.1463 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    93.709771   0.440944 212.521   <2e-16 ***
## `100g_USD`      0.037870   0.003117  12.151   <2e-16 ***
## roast_num      -0.614648   0.053790 -11.427   <2e-16 ***
## is_top_country  0.307470   0.423381   0.726    0.468    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.461 on 2076 degrees of freedom
## Multiple R-squared:  0.1293, Adjusted R-squared:  0.128 
## F-statistic: 102.7 on 3 and 2076 DF,  p-value: < 2.2e-16
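
To make the fitted equation concrete, the sketch below plugs the printed coefficients into a small helper. The helper name ‘predict_rating’ and the example coffee ($5 per 100g, roast level 2, roasted in a top 10 country) are hypothetical and chosen only for illustration; they are not taken from the dataset.

```r
# Coefficients copied from the summary(lm) output above
b <- c(intercept = 93.709771, price = 0.037870, roast = -0.614648, top = 0.307470)

# Hypothetical helper: predicted rating for given predictor values
predict_rating <- function(price_usd, roast_num, is_top_country) {
  unname(b["intercept"] + b["price"] * price_usd +
           b["roast"] * roast_num + b["top"] * is_top_country)
}

# Illustrative coffee: $5 per 100g, roast level 2, top 10 country
example_rating <- predict_rating(price_usd = 5, roast_num = 2, is_top_country = 1)
print(round(example_rating, 2))
```

With the real data, predict(lm, newdata = ...) would give the same number without retyping the coefficients.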

Model Fit

Adjusted R² = 0.128
Residual SE = 1.461
F‑statistic p < 2.2e‑16
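
As a sanity check, the fit statistics can be approximately reproduced from the printed R² and degrees of freedom alone. This is a minimal sketch that assumes the rounded values shown in the output, so the results match only to rounding.

```r
# Values copied from the summary(lm) output
r2       <- 0.1293   # multiple R-squared
p        <- 3        # number of predictors
df_resid <- 2076     # residual degrees of freedom

n <- df_resid + p + 1                          # observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic: explained variance per predictor over residual variance
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

print(c(n = n, adj_r2 = round(adj_r2, 3), f_stat = round(f_stat, 1)))
```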

3. Regression Coefficients

Price

Estimate = 0.0379

Each additional $1 per 100g is associated with a 0.0379-point increase in expected rating, holding roast level and country constant.

This effect is small but highly statistically significant (p < 2e‑16). It suggests that price captures some underlying quality signal, but the magnitude is modest: even a $20 increase per 100g corresponds to a predicted rating only about 0.76 points higher.
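
The scale of the price effect is easier to see across a few price differences. The sketch below multiplies the printed slope by some illustrative price changes; the specific dollar amounts are my own choices, not values from the data.

```r
# Price slope copied from the summary(lm) output
b_price <- 0.037870

# Illustrative price differences in USD per 100g
price_change  <- c(1, 10, 20)
rating_change <- b_price * price_change

print(data.frame(price_change, rating_change = round(rating_change, 2)))
```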

Roast Level

Estimate = -0.6146

Each one-level step toward a darker roast is associated with a decrease of about 0.615 points in expected rating, holding price and country constant.

This is a large, significant effect given that there are only 5 roast levels in this dataset. The lightest and darkest roasts are 4 steps apart, so the model predicts an average gap of about 4 × 0.6146 ≈ 2.46 rating points between them. This indicates a consistent preference for lighter roasts in these ratings, although the roast I would recommend to any individual still depends mainly on that person's own taste.
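
The light-versus-dark gap can also be given a rough uncertainty range. The sketch below assumes 5 roast levels, as stated in the text, and uses the fact that the standard error of a multiple of one coefficient is that multiple times the coefficient's standard error.

```r
# Slope and standard error copied from the summary(lm) output
b_roast  <- -0.614648
se_roast <- 0.053790
df_resid <- 2076

steps <- 5 - 1                      # assumed: lightest and darkest are 4 levels apart
gap    <- steps * b_roast           # predicted rating difference, dark minus light
se_gap <- steps * se_roast          # SE of a constant multiple of one coefficient

# Approximate 95% confidence interval for the gap
gap_ci <- gap + c(-1, 1) * qt(0.975, df = df_resid) * se_gap

print(round(c(gap = gap, lower = gap_ci[1], upper = gap_ci[2]), 2))
```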

Top Country

Estimate = 0.3074

Coffees roasted in top 10 countries score about 0.31 points higher on average, holding price and roast level constant.

This effect is not statistically significant (p = 0.468): the standard error (0.423) is larger than the estimate itself, so the data cannot distinguish the difference from zero. With price and roast level already in the model, ‘is_top_country’ adds little, and it is far less important than either of those variables.
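
To show why this coefficient is inconclusive, the sketch below rebuilds its t-statistic, p-value, and an approximate 95% confidence interval from the printed estimate and standard error. It relies only on the rounded values in the output, so small rounding differences are expected.

```r
# Estimate and standard error copied from the summary(lm) output
est      <- 0.307470
se       <- 0.423381
df_resid <- 2076

t_val <- est / se                                      # t-statistic
p_val <- 2 * pt(-abs(t_val), df = df_resid)            # two-sided p-value
ci    <- est + c(-1, 1) * qt(0.975, df = df_resid) * se  # approximate 95% CI

print(round(c(t = t_val, p = p_val, lower = ci[1], upper = ci[2]), 3))
```

An interval that spans zero is consistent with either a small positive or a small negative country effect.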

4. Model Diagnostics

Plots

par(mfrow = c(2,2))
plot(lm)

par(mfrow = c(1,1))

Evaluation

Residuals vs Fitted

Slight curvature at low and high fitted values
No major heteroskedasticity
Linearity mostly satisfied, but mild nonlinearity present

Q-Q Residuals

Points follow the line closely
Slight deviation in the upper tail
Residuals approximately normal

Scale-Location

Red line nearly flat
Spread consistent across fitted values
Constant variance assumption met

Residuals vs Leverage

No points near Cook’s distance contours
No influential observations

5. Issues Identified

Low adjusted R² (0.128): The model explains only about 13% of the variation in rating. A low R² is common for sensory data, but it indicates that many factors affecting rating are not in the model.
Mild nonlinearity: Slight curvature in the residuals vs fitted plot suggests that the price or roast effects may not be perfectly linear.
Country effect not significant: Country differences may already be captured by roast style or price, although this model alone cannot confirm that.
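
One way to follow up on the mild nonlinearity would be to add a squared price term and compare the two nested models with an F-test. The sketch below runs on simulated stand-in data, not on ‘coffee_clean’: the data frame ‘sim’, its column names, and every number used to generate it are assumptions made up for illustration, so its output says nothing about the real coffees.

```r
# Simulated stand-in data (NOT coffee_clean); all values below are made up
set.seed(1)
n <- 500
sim <- data.frame(
  price     = runif(n, min = 3, max = 60),     # stand-in for `100g_USD`
  roast_num = sample(1:5, n, replace = TRUE)   # stand-in for roast_num
)
# The simulated rating is deliberately curved in price
sim$rating <- 92 + 1.2 * log(sim$price) - 0.6 * sim$roast_num + rnorm(n, sd = 1.4)

# Straight-line model versus a model with an extra squared price term
fit_linear <- lm(rating ~ price + roast_num, data = sim)
fit_quad   <- lm(rating ~ price + I(price^2) + roast_num, data = sim)

# Nested-model F-test: a small p-value favors keeping the squared term
cmp <- anova(fit_linear, fit_quad)
print(cmp)
```

On the real data, the same comparison would use ‘coffee_clean’ and the `100g_USD` column in place of the simulated ones.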

6. Insights

Price has a small but positive association with rating
Darker roasts consistently receive lower ratings
Lighter roasts consistently receive higher ratings
With price and roast level in the model, the roast country effect is not statistically significant
The model as a whole is highly significant, yet it explains only about 13% of the variability in rating

Overall, a major conclusion from this analysis is that coffee rating depends on numerous factors. Price and roast level are two of the more significant ones, but together they leave most of the variation unexplained. Other variables such as flavor notes, origin, and roaster may also influence the rating and would be worth adding to the model.

WK11DataDive

Woods

2026-04-01