Week 9 Data Dive - Regression Diagnostics

1. Introduction

Goal: Build off of last weeks linear regression model analyzing price and rating

       lm(formula = rating ~ `100g_USD`, data = coffee_clean)

Regression model summary

\[ rating = 92.7367 + 0.0412 * \text{100g_USD} \]

Main Objective: Construct a regression model with 2-4 terms and evaluate its performance by analyzing diagnostic plots to assess if the assumptions are met and if there are any issues evident in the model.

2. Variables

‘100g_USD’ (continuous)
‘roast_num’ (ordered integer)
‘is_top_country’ (binary indicator for top 10 roasting countries)

Create binary variable

top_countries <- coffee_clean %>%
  count(loc_country, sort = TRUE) %>%
  slice_head(n = 10) %>%
  pull(loc_country)

coffee_clean <- coffee_clean %>%
  mutate(is_top_country = if_else(loc_country %in% top_countries, 1, 0))

Create ordered numeric variable

coffee_clean <- coffee_clean %>%
  mutate(
    roast = factor(
      roast,
      levels = c("Light", "Medium-Light", "Medium", "Medium-Dark", "Dark"),
      ordered = TRUE
    ),
    roast_num = as.numeric(roast)
  )

3. Fit Regression Models

Model 1: Baseline

 m1 <- lm(rating ~ `100g_USD`, data=coffee_clean)

Model 2: Add roast level

m2 <- lm(rating ~ `100g_USD` + roast_num, data=coffee_clean)

Model 3: Add binary country indicator

m3 <- lm(rating ~ `100g_USD` + roast_num + is_top_country, data = coffee_clean)
summary(m3)

## 
## Call:
## lm(formula = rating ~ `100g_USD` + roast_num + is_top_country, 
##     data = coffee_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9760 -0.9476  0.0117  0.9894  4.1463 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    93.709771   0.440944 212.521   <2e-16 ***
## `100g_USD`      0.037870   0.003117  12.151   <2e-16 ***
## roast_num      -0.614648   0.053790 -11.427   <2e-16 ***
## is_top_country  0.307470   0.423381   0.726    0.468    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.461 on 2076 degrees of freedom
## Multiple R-squared:  0.1293, Adjusted R-squared:  0.128 
## F-statistic: 102.7 on 3 and 2076 DF,  p-value: < 2.2e-16

4. Multicollinearity check

library(car)

## Warning: package 'car' was built under R version 4.5.2

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.5.2

## 
## Attaching package: 'car'

## The following object is masked from 'package:purrr':
## 
##     some

## The following object is masked from 'package:dplyr':
## 
##     recode

vif(m3)

##     `100g_USD`      roast_num is_top_country 
##       1.010918       1.009769       1.001877

Interpretation: All variance inflation factor (VIF) values discovered in the multicollinearity check for model 3, the final model, are below 2. With 100g_USD, roast_num, and is_top_country all well beneath 2 VIF, I conclude that there are no multicollinearity concerns, as each variable provides unique data to the model.

5. Final Model

\[ rating = \beta_0 + \beta_1(\text{100g_USD}) + \beta_2(\text{roast_num}) + \beta_3(\text{is_top_country}) \]

Analysis & Findings

‘100g_USD’ still has a positive correlation to rating, although effect size remains small
‘roast_num’ reveals insight into customer preferences as higher roast levels correspond with different ratings. Specifically, this refers to lighter roasts receiving consistently high ratings.
‘is_top_country’ tells us that coffees roasted in the top roasting countries worldwide produce significantly higher ratings on average.

6. Diagnostic Plots

par(mfrow = c(2,2))
plot(m3)

par(mfrow = c(1,1))
hist(residuals(m3), breaks = 30, main = "Histogram of Residuals")

Plot Takeaways

Residual vs Fitted: Because the red line is only slightly curved, I have a high confidence in my linearity assumption. The slight curve at the end is likely to include fitted values but the results are encouraging, supporting my initial assumption. In addition to the linearity assumption being satisfied, this diagnostic plot shows strong variance.

Q-Q Residuals: This type of plot analyzes whether residuals follow a normal distribution. It is good to see that the points very closely follow the line for the most part, especially in the middle of the plot, which I can conclude from this that the residuals are mostly normal. While there is minor deviation from the line at the start and end of the plot, my confidence in normality is high.

Scale-Location: This plot shows that the red line is nearly horizontal meaning that the spread of square root standardized residuals is for the most part fitted evenly across fitted values. Because of this, there is no strong evidence of heteroskedasticity, as variance is constant for the most part, meaning the assumption is met.

Residuals vs Leverage: This checks for major observations that could influence the results of the model. This plot shows a major clustering of the points at a low leverage. Given the red line is fairly flat, there is no indication of a major pattern. From all of this, I conclude that there are no major influential outliers enough to be a threat to the stability and accuracy of the model. This is another assumption met as I am very confident in the model stability.

Histogram of Residuals: This plot shows a roughly symmetric distribution across the residuals, with no extreme outliers. The normality assumption is supported by the mostly symmetric residuals. Ultimately, I have high confidence that the assumptions are met and my final model is highly stable and accurate.