1. Introduction

The goal is to build a linear or generalized linear model using any response and explanatory variables from the ‘coffee_clean’ dataset. Last week I built a logistic regression model; this week I return to a linear regression framework. In recent weeks I engineered new variables such as ‘roast_num’ and ‘is_top_country’. In this week's analysis I investigate how these variables, together with price, relate to my chosen response variable, coffee ‘rating’.

2. Model Specification

lm <- lm(rating ~ `100g_USD` + roast_num + is_top_country, data = coffee_clean)
summary(lm)
## 
## Call:
## lm(formula = rating ~ `100g_USD` + roast_num + is_top_country, 
##     data = coffee_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9760 -0.9476  0.0117  0.9894  4.1463 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    93.709771   0.440944 212.521   <2e-16 ***
## `100g_USD`      0.037870   0.003117  12.151   <2e-16 ***
## roast_num      -0.614648   0.053790 -11.427   <2e-16 ***
## is_top_country  0.307470   0.423381   0.726    0.468    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.461 on 2076 degrees of freedom
## Multiple R-squared:  0.1293, Adjusted R-squared:  0.128 
## F-statistic: 102.7 on 3 and 2076 DF,  p-value: < 2.2e-16

Model Fit

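  • Adjusted R² = 0.128 (the model explains about 13% of the variance in rating)

  • Residual SE = 1.461 rating points

  • F‑statistic p < 2.2e‑16 (the predictors are jointly significant)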
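
As a consistency check, the adjusted R² and the F-statistic can be recovered from the multiple R² and the degrees of freedom printed in the summary. The sketch below uses only those printed values; the variable names are illustrative, and because R² is rounded to four decimals the results match the output only approximately.

```r
# Values copied from the summary output above
r2     <- 0.1293          # multiple R-squared (rounded in the printout)
p      <- 3               # number of predictors
df_res <- 2076            # residual degrees of freedom
n      <- df_res + p + 1  # observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res

# Overall F-statistic: explained variation per predictor
# divided by unexplained variation per residual degree of freedom
f_stat <- (r2 / p) / ((1 - r2) / df_res)

round(adj_r2, 3)
round(f_stat, 1)
```

Both values should land close to the reported 0.128 and 102.7; any small gap comes from the rounded R².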

3. Regression Coefficients

Price

Estimate = 0.0379

Each additional $1 per 100g is associated with a 0.0379-point increase in expected rating, holding roast level and top-country status constant.

This effect is small but highly statistically significant (p < 2e‑16). It suggests that price captures some underlying quality signal, but the magnitude is modest: even a $20 increase raises the expected rating by only about 0.76 points.
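
The arithmetic behind the price effect can be sketched directly from the fitted coefficient. The price increases below are hypothetical values chosen for illustration, not quantities from the dataset.

```r
# Price coefficient copied from the summary output
b_price <- 0.037870   # rating points per $1 per 100 g

# Hypothetical price increases in USD per 100 g (illustrative only)
price_increase <- c(1, 5, 10, 20)

# Predicted change in rating, other predictors held constant
predicted_change <- b_price * price_increase

data.frame(price_increase, predicted_change = round(predicted_change, 2))
```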

Roast Level

Estimate = -0.6146

Each one-level step toward a darker roast decreases the expected rating by about 0.615 points, holding the other predictors constant.

This is a large, significant effect given that there are only 5 roast levels in this dataset. Across the 4 steps from the lightest to the darkest roast, it amounts to a difference of about 2.46 rating points on average (4 × 0.615). This indicates a consistent preference for lighter roasts in the ratings, although the roast I would recommend still depends on each consumer's own taste.
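
The light-to-dark gap can be worked out level by level. This sketch assumes ‘roast_num’ is coded as consecutive integers from 1 (lightest) to 5 (darkest); that coding is an assumption consistent with the 5 levels mentioned above, not something confirmed by the output.

```r
# Roast coefficient copied from the summary output
b_roast <- -0.614648   # rating points per one-level step darker

# Assumed coding: 1 = lightest roast, 5 = darkest roast
roast_levels <- 1:5

# Predicted rating shift relative to the lightest roast
shift <- b_roast * (roast_levels - 1)

round(shift, 2)
```

The last element is the predicted difference between the darkest and lightest roasts, about -2.46 points.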

Top Country

Estimate = 0.3075

Coffees roasted in a top 10 country score about 0.31 points higher on average, holding price and roast level constant.

This effect is not statistically significant (p = 0.468). The standard error (0.423) is larger than the estimate itself, so the data cannot distinguish the effect from zero. Once price and roast level are in the model, ‘is_top_country’ adds little and is far less important than either of them.
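
One way to see why this coefficient is not significant is to rebuild its t value, p-value, and an approximate 95% confidence interval from the printed estimate and standard error. This is a minimal sketch using only numbers from the summary output; on the fitted model itself, `confint()` would give the same kind of interval.

```r
# Values copied from the summary output
est    <- 0.307470   # is_top_country estimate
se     <- 0.423381   # its standard error
df_res <- 2076       # residual degrees of freedom

# t statistic and two-sided p-value
t_value <- est / se
p_value <- 2 * pt(-abs(t_value), df = df_res)

# 95% confidence interval: estimate +/- critical t * standard error
ci <- est + c(-1, 1) * qt(0.975, df = df_res) * se

round(t_value, 3)
round(p_value, 3)
round(ci, 2)
```

The interval spans zero, which is the same conclusion the p-value gives: the data are compatible with no top-country effect at all.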

4. Model Diagnostics

Plots

par(mfrow = c(2,2))
plot(lm)

par(mfrow = c(1,1))

Evaluation

Residuals vs Fitted

  • Slight curvature at low and high fitted values

  • No major heteroskedasticity

  • Linearity mostly satisfied, but mild nonlinearity present

Q-Q Residuals

  • Points follow the line closely

  • Slight deviation in the upper tail

  • Residuals approximately normal

Scale-Location

  • Red line nearly flat

  • Spread consistent across fitted values

  • Constant variance assumption met

Residuals vs Leverage

  • No points near Cook’s distance contours

  • No influential observations
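
The visual checks above can be backed up with numbers. Since ‘coffee_clean’ is not reproduced here, the sketch below runs on a simulated stand-in: the data frame `toy`, its columns, and the coefficients used to generate it are all hypothetical. The same calls (`cooks.distance()`, `update()`, `anova()`) would apply to the fitted model from Section 2.

```r
set.seed(42)

# Simulated stand-in for the real data (hypothetical values throughout)
n <- 300
toy <- data.frame(
  price     = runif(n, min = 3, max = 40),
  roast_num = sample(1:5, n, replace = TRUE)
)
toy$rating <- 93.7 + 0.04 * toy$price - 0.6 * toy$roast_num +
  rnorm(n, sd = 1.5)

fit <- lm(rating ~ price + roast_num, data = toy)

# Influence: Cook's distance for every observation
cooks_d <- cooks.distance(fit)
max(cooks_d)           # largest influence value in the sample
sum(cooks_d > 4 / n)   # count above a common rule-of-thumb cutoff

# Nonlinearity: does adding a squared price term improve the fit?
fit_quad <- update(fit, . ~ . + I(price^2))
anova(fit, fit_quad)
```

The 4/n cutoff is only a screening heuristic; points above it deserve a look but are not automatically influential. A significant F-test in the `anova()` comparison would support the mild curvature seen in the Residuals vs Fitted plot.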

5. Issues Identified

  • Mild nonlinearity in the Residuals vs Fitted plot at low and high fitted values

  • Slight departure from normality in the upper tail of the Q-Q plot

  • Low explanatory power: adjusted R² = 0.128, so most of the variation in rating is unexplained

  • ‘is_top_country’ is not statistically significant (p = 0.468)

6. Insights

Overall, a major conclusion from this analysis is that coffee rating depends on many factors. Price and roast level are two of the more significant ones, but with an adjusted R² of 0.128 most of the variation in rating remains unexplained by this model. Other variables such as flavor notes, origin, and roaster may also influence rating and are worth exploring.