The goal is to build a linear or generalized linear model using any response and explanatory variables from the ‘coffee_clean’ dataset. Last week I built a logistic regression model; this week I shift back to a linear regression framework. In recent weeks, I engineered new variables such as ‘roast_num’ and ‘is_top_country’. In this week's analysis, I investigate how these variables, together with price, relate to my chosen response variable, coffee ‘rating’. The three explanatory variables are:
‘100g_USD’ — continuous measure of price
‘roast_num’ — ordered roast level (1 = Light → 4 = Dark)
‘is_top_country’ — binary indicator for top 10 roasting countries
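# Fit a multiple linear regression of rating on price, roast level, and top-country status
lm <- lm(rating ~ `100g_USD` + roast_num + is_top_country, data = coffee_clean)
# Print coefficient estimates, standard errors, and overall fit statistics
summary(lm)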
##
## Call:
## lm(formula = rating ~ `100g_USD` + roast_num + is_top_country,
##     data = coffee_clean)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7.9760 -0.9476  0.0117  0.9894  4.1463
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    93.709771   0.440944 212.521   <2e-16 ***
## `100g_USD`      0.037870   0.003117  12.151   <2e-16 ***
## roast_num      -0.614648   0.053790 -11.427   <2e-16 ***
## is_top_country  0.307470   0.423381   0.726    0.468
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.461 on 2076 degrees of freedom
## Multiple R-squared:  0.1293, Adjusted R-squared:  0.128
## F-statistic: 102.7 on 3 and 2076 DF,  p-value: < 2.2e-16
Adjusted R² = 0.128
Residual SE = 1.461
F‑statistic p < 2.2e‑16
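As a quick consistency check, the adjusted R² and the F-statistic can be recomputed from the multiple R² and the degrees of freedom printed in the summary. This is a minimal sketch that uses only the printed values; the variable names are my own, and because the printed R² is rounded, the recomputed F agrees with the reported 102.7 only up to rounding.

```r
# Values copied from the summary() output
r2     <- 0.1293          # multiple R-squared (rounded as printed)
p      <- 3               # number of predictors
df_res <- 2076            # residual degrees of freedom
n      <- df_res + p + 1  # observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res

# Overall F-statistic for the null hypothesis that all slopes are zero
f_stat <- (r2 / p) / ((1 - r2) / df_res)

round(adj_r2, 3)
round(f_stat, 1)
```

Both quantities are functions of R² and the sample size alone, which is why a highly significant F-test can coexist with a low R² when n is large.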
‘100g_USD’: Estimate = 0.0379
Each additional $1 per 100g is associated with a 0.0379-point increase in expected rating, holding roast level and top-country status constant.
This effect is small but highly statistically significant (p < 2e‑16). It suggests that price captures some underlying quality signal, but the magnitude is modest: even a $20 increase per 100g corresponds to only about 0.76 rating points.
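The arithmetic behind the price interpretation can be sketched directly from the reported coefficient. The price changes below are illustrative values I chose, not quantities taken from the data.

```r
# Coefficient on `100g_USD` copied from the summary() output
b_price <- 0.037870

# Illustrative price increases in USD per 100g (assumed values)
price_changes <- c(1, 5, 20)

# Implied change in expected rating, other predictors held constant
implied_change <- b_price * price_changes
round(implied_change, 2)
```

Because the model is linear in price, the implied change scales proportionally with the size of the price increase.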
‘roast_num’: Estimate = -0.6146
Each one-level step toward a darker roast is associated with a decrease of about 0.615 points in expected rating, holding the other predictors constant.
This is a large, statistically significant effect given how few roast levels there are. With ‘roast_num’ coded from 1 (Light) to 4 (Dark) as listed above, light and dark roasts are three steps apart, which implies an average gap of about 3 × 0.615 ≈ 1.84 rating points; the 2.46-point figure (4 × 0.615) would apply only if the scale had five levels. This indicates a consistent preference for lighter roasts in the ratings, although the roast I would recommend to any individual still depends mainly on that person's taste.
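The light-to-dark gap depends on how many steps separate the two ends of the ‘roast_num’ scale. A minimal sketch, using only the reported coefficient, for both a four-level and a five-level coding (the names `steps` and `gap` are my own):

```r
# Coefficient on roast_num copied from the summary() output
b_roast <- -0.614648

# Steps between the lightest and darkest roast under each coding
steps <- c(four_levels = 3, five_levels = 4)

# Implied difference in expected rating, dark minus light
gap <- b_roast * steps
round(gap, 2)
```

A scale with k levels has k - 1 steps between its endpoints, so the multiplier is one less than the number of roast levels.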
‘is_top_country’: Estimate = 0.3075
Coffees roasted in top 10 countries score about 0.31 points higher on average, holding price and roast level constant.
This effect is not statistically significant (p = 0.468), and its standard error (0.423) is larger than the estimate itself, so the data are consistent with no difference at all. ‘is_top_country’ is therefore far less informative than price and roast level: once those two variables are in the model, roasting country adds little detectable signal.
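To make the uncertainty around the ‘is_top_country’ estimate concrete, a 95% confidence interval and the two-sided p-value can be rebuilt from the reported estimate, standard error, and residual degrees of freedom. This is a sketch based on the printed values only; with the fitted model object available, `confint()` would give the same interval without manual arithmetic.

```r
# Values copied from the summary() output
est    <- 0.307470   # estimate for is_top_country
se     <- 0.423381   # its standard error
df_res <- 2076       # residual degrees of freedom

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df_res)

# Confidence interval: estimate plus or minus t_crit standard errors
ci <- est + c(-1, 1) * t_crit * se

# Two-sided p-value for H0: coefficient equals zero
p_val <- 2 * pt(-abs(est / se), df_res)

round(ci, 2)
round(p_val, 3)
```

The interval spans zero, which is the same conclusion the p-value gives: the sign of the country effect is not pinned down by these data.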
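# Arrange the four standard lm diagnostic plots in a 2 x 2 grid
par(mfrow = c(2,2))
plot(lm)
# Restore the default single-plot layout
par(mfrow = c(1,1))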
Residuals vs Fitted: Slight curvature at low and high fitted values, with no major heteroskedasticity. Linearity is mostly satisfied, but mild nonlinearity is present.
Normal Q-Q: Points follow the line closely, with a slight deviation in the upper tail. Residuals are approximately normal.
Scale-Location: The red line is nearly flat and the spread is consistent across fitted values, so the constant variance assumption appears to be met.
Residuals vs Leverage: No points fall near the Cook’s distance contours, so there are no influential observations.
Low R² (0.128): The model explains only about 13% of the variation in rating. A low R² is not unusual for subjective sensory scores, but it indicates that many factors affecting rating are not in the model.
Mild nonlinearity: Slight curvature in the residuals vs fitted plot suggests that price or roast effects may not be perfectly linear.
Country effect not significant: Suggests that country differences may be mediated by roast style or price.
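One way to probe the mild nonlinearity is to compare the linear specification against a more flexible nested model with an F-test. The sketch below runs on synthetic data that I generate purely for illustration: the data frame `demo`, the column name `price`, and every simulated value are hypothetical and are not taken from ‘coffee_clean’.

```r
# Synthetic stand-in data (hypothetical values, NOT the real coffee data)
set.seed(1)
n <- 200
demo <- data.frame(
  price     = runif(n, 3, 60),                 # assumed price range
  roast_num = sample(1:4, n, replace = TRUE)   # roast levels 1 to 4
)
demo$rating <- 94 + 0.04 * demo$price - 0.6 * demo$roast_num +
  rnorm(n, sd = 1.5)

# Baseline: price and roast level both enter linearly
fit_linear <- lm(rating ~ price + roast_num, data = demo)

# Flexible: quadratic price term and roast level as unordered categories
fit_flex <- lm(rating ~ price + I(price^2) + factor(roast_num), data = demo)

# The linear model is nested in the flexible one, so an F-test applies
cmp <- anova(fit_linear, fit_flex)
cmp
```

A small p-value in the second row would suggest that the extra flexibility is capturing real curvature; a large one would support keeping the simpler linear form. The same comparison could be run on the real data by swapping in ‘coffee_clean’ and its column names.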
Price has a minimal, yet positive effect on rating
Darker roasts consistently receive lower ratings
Lighter roasts consistently receive higher ratings
With price and roast level in the model, the roast country effect is not statistically distinguishable from zero
The model is statistically significant overall (F = 102.7, p < 2.2e‑16), yet it explains only about 13% of the variability in rating
Overall, the main conclusion from this analysis is that coffee rating depends on numerous factors. Price and roast level are two of the more significant ones, but together with roasting country they explain only about 13% of the variation in rating. Variables not included in this model, such as flavor notes, origin, and roaster, likely account for much of the rest.