Linear Regression (Part 2)

In this lab, we cover the following topics:

  • Regression characteristics: coefficient tests, \(R^2\), adjusted \(R^2\), and influence/leverage
  • Model diagnostics and evaluation
  • A note on error metrics
  • Simpson’s Paradox

library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.4.2
## Warning: package 'lubridate' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)

# remove scientific notation
options(scipen = 6)

# default theme, unless otherwise noted
theme_set(theme_minimal())

(We’ll continue to use the ames_basic dataset we built last week.)

ames <- make_ames()

ames_basic <- ames |>
  rename_with(tolower) |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(great_qual = ifelse(overall_qual %in%
           c("Very_Excellent", "Excellent", "Very_Good"),
           1, 0))

Regression Characteristics

Coefficients

A fitted coefficient can be very close to zero, i.e., the estimated relationship between \(X\) and \(Y\) is essentially nil. We can frame this as a hypothesis test and ask whether there is enough evidence against the claim that there is no relationship between a variable and the response. In R, each coefficient is tested with a t-test whose null hypothesis is that the coefficient equals zero.

\[ \text{H}_0: \beta_i=0 \]

Let’s return to the model which used the interaction term, and use the broom tidy function to view the model coefficients.

model <- lm(sale_price ~ year_remod_add + great_qual 
            + year_remod_add:great_qual, ames_basic)

tidy(model, conf.int = TRUE)
## # A tibble: 4 × 7
##   term                   estimate std.error statistic p.value conf.low conf.high
##   <chr>                     <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)             -8.30e5  5327026.    -0.156   0.876  -1.13e7  9650105.
## 2 year_remod_add           5.13e2     2656.     0.193   0.847  -4.71e3     5739.
## 3 great_qual              -1.01e7  7545297.    -1.34    0.182  -2.49e7  4755240.
## 4 year_remod_add:great_…   5.09e3     3762.     1.35    0.177  -2.31e3    12488.

Notice that all of these p-values are quite high (none falls below the usual 0.05 cutoff). There is not sufficient evidence that the year remodeled, the quality indicator, or their interaction has a significant effect on price. In other words, coefficient estimates this far from zero would not be surprising if the true coefficients were in fact zero (the null hypothesis).
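
Each of these rows is just a t-test. As a minimal sketch, we can recompute the statistic and p-value for year_remod_add from its estimate and standard error, using the model’s residual degrees of freedom:

coefs  <- tidy(model)
est    <- coefs$estimate[coefs$term == "year_remod_add"]
se     <- coefs$std.error[coefs$term == "year_remod_add"]

# t-statistic: estimate divided by its standard error
t_stat <- est / se

# two-sided p-value from the t distribution
c(statistic = t_stat,
  p.value   = 2 * pt(abs(t_stat), df = df.residual(model), lower.tail = FALSE))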

We can also visualize these tests as confidence intervals instead.

tidy(model, conf.int = TRUE) |>
  ggplot(mapping = aes(x = estimate, y = term)) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = 'dotted', color = 'gray') +
  geom_errorbarh(mapping = aes(xmin = conf.low, 
                               xmax = conf.high), 
                 height = 0.5) +
  labs(title = "Model Coefficient C.I.")

It’s true that 0 is contained in the great_qual confidence interval, but intuitively, house quality really should affect the price, so we likely need to consider other variables in a new model.
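
The intervals themselves come straight from the estimates and standard errors. Here is a minimal sketch of the 95% interval for year_remod_add, which should line up with the conf.low and conf.high columns above:

est    <- coef(summary(model))["year_remod_add", "Estimate"]
se     <- coef(summary(model))["year_remod_add", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(model))

c(conf.low  = est - t_crit * se,
  conf.high = est + t_crit * se)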

R-Squared

A metric that is often used for linear regression is \(R^2\), or the coefficient of determination. The idea behind this metric is to measure the “variance explained” in the response variable by the model. The response variable has an intrinsic variance around its mean \(\overline{y}\), given by \(\sum_{i=1}^{n} (y_i - \overline{y})^2/n\), whose numerator is called the SST (total sum of squares). The SSE (sum of squared errors, \(\sum_{i=1}^n (y_i - \hat{y}_i)^2\)) is what remains after fitting, so the ratio \(SSE/SST\) is the proportion of variance left unexplained once we’ve modeled the data. Since we want the % of variance explained, we take the difference between 1 and this ratio:

\[ R^2 = 1 - SSE/SST = 1 - \frac{\sum _{i=1}^n \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} (y_i - \overline{y})^2} \]

(So, for a least-squares model with an intercept, \(0\leq R^2 \leq 1\).)

If the model is a perfect fit, \(SSE = 0\) so \(R^2 = 1\). On the other hand, if we have a constant model that predicted the mean for all observations, then \(SSE=SST\) and \(R^2 = 0\) (further, if \(SST = 0\), then \(R^2\) is undefined). So, \(R^2\) measures how good our regression fit is relative to a constant model that predicts the mean.

There is no notion of an \(R^2\) value that is “too low”. It should only be used to compare one model to another.

(Some problems are just harder than others!)
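
For concreteness, here is a minimal sketch computing \(R^2\) by hand for the interaction model above, compared against the value that summary() reports:

# R^2 by hand: 1 - SSE/SST
sse <- sum(residuals(model)^2)
sst <- sum((ames_basic$sale_price - mean(ames_basic$sale_price))^2)

c(by_hand = 1 - sse / sst,
  from_summary = summary(model)$r.squared)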

Adjusted R-Squared

The Adjusted R-squared value is meant to account for the natural increase in \(R^2\) as more explanatory variables are introduced. Explicitly, if there are \(n\) rows of data and \(p\) estimated coefficients in the model (including the intercept), then:

\[ R_{\text{adj}}^2 = 1 - \frac{SSE/\text{df}_e}{SST/\text{df}_t} = 1 - \frac{\left[\sum _{i=1}^n \left(y_i - \hat{y}_i\right)^2\right]/(n-p)}{\left[\sum_{i=1}^{n} (y_i - \overline{y})^2\right]/(n-1)} \]

The degrees of freedom for the denominator (\(\text{df}_t = n-1\)) comes from estimating the mean \(\bar{y}\) in the calculation, and the degrees of freedom for the numerator (\(\text{df}_e = n-p\)) comes from the fact that \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + \cdots + \hat{\beta}_{p-1}x_{p-1}\) contains \(p\) estimated parameters.

To see where the adjustment comes from, note that the plain \(R^2\) can be written as a ratio of (biased) sample variances:

\[ R^2 = 1 - \frac{\text{var(Model Errors)}}{\text{var(Response)}} = 1 - \frac{\left[\sum _{i=1}^n \left(y_i - \hat{y}_i\right)^2\right]/n}{\left[\sum_{i=1}^{n} (y_i - \overline{y})^2\right]/n} \]

Naturally, as we increase the number of explanatory variables in a model, the fit can only improve, at least slightly (simply by introducing more information). So, the numerator of the second term can only decrease, while the denominator stays exactly the same. The correction in the \(R_\text{adj}^2\) calculation balances the decrease in the sum of squared errors against the number of parameters in the model.
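
Here is the same by-hand sketch for the adjusted version, where \(p\) counts every estimated coefficient, including the intercept:

n   <- nobs(model)              # rows used to fit the model
p   <- length(coef(model))      # coefficients, including the intercept
sse <- sum(residuals(model)^2)
sst <- sum((ames_basic$sale_price - mean(ames_basic$sale_price))^2)

c(by_hand = 1 - (sse / (n - p)) / (sst / (n - 1)),
  from_summary = summary(model)$adj.r.squared)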

# four (nested) models with increasing variables
model1 <- lm(sale_price ~ year_remod_add + great_qual, ames_basic) |>
            summary()

model2 <- lm(sale_price ~ year_remod_add + great_qual 
            + lot_area, ames_basic) |>
            summary()

model3 <- lm(sale_price ~ year_remod_add + great_qual 
            + lot_area + great_qual:lot_area, ames_basic) |>
            summary()

model4 <- lm(sale_price ~ year_remod_add + great_qual 
            + lot_area + great_qual:lot_area + lot_frontage, ames_basic) |>
            summary()

df_plot <- data.frame(p=c(2, 3, 4, 5),
                      r_squared = c(model1$r.squared, 
                                    model2$r.squared, 
                                    model3$r.squared,
                                    model4$r.squared),
                      adj_r_squared = c(model1$adj.r.squared, 
                                        model2$adj.r.squared, 
                                        model3$adj.r.squared,
                                        model4$adj.r.squared)) |>
          mutate(r_perc_increase = 100 * round((r_squared - 
                                     model1$r.squared)
                                     /model1$r.squared, 3),
                 r_adj_perc_increase = 100 * round((adj_r_squared - 
                                     model1$adj.r.squared)
                                     /model1$adj.r.squared, 3),
                 diff = r_perc_increase - r_adj_perc_increase)

ggplot(data=df_plot) +
  geom_line(mapping = aes(x=p, y = r_perc_increase, 
                          color = 'R Squared Improvement')) +
  geom_line(mapping = aes(x=p, y = r_adj_perc_increase, 
                          color = 'Adjusted R Squared Improvement')) +
  geom_text_repel(mapping = aes(x=p,  
                                y = r_adj_perc_increase, 
                                label = paste("diff = ", round(diff, 1))),
                  color = "darkred", 
                  nudge_x = 0.3) +
  labs(color='', 
       x='Number of Variables (p)',
       y='% Increase in R-Squared',
       title = "Improving on a Model with 2 Coefficients",
       subtitle = "(Initial Model: R=0.4 and R_adj=0.399)") +
  scale_color_brewer(palette = 'Dark2') +
  theme(legend.position="bottom")

Both of these values increase here as we introduce more variables, but the Adjusted R-squared does not increase as fast as the R-squared does; with an unhelpful variable, it can even decrease. Consider the implications of this with several more variables in the model.
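
To see the difference directly, here is a minimal sketch that adds a hypothetical pure-noise predictor (noise is not a real Ames variable): R-squared can only go up when a variable is added, while Adjusted R-squared is free to go down.

set.seed(1)
ames_noise <- mutate(ames_basic, noise = rnorm(n()))

base  <- summary(lm(sale_price ~ year_remod_add + great_qual, ames_noise))
noisy <- summary(lm(sale_price ~ year_remod_add + great_qual + noise, ames_noise))

c(r2_base    = base$r.squared,      r2_noisy    = noisy$r.squared,
  r2adj_base = base$adj.r.squared,  r2adj_noisy = noisy$adj.r.squared)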

Influence and Leverage

An “outlier” is an extreme point among many in a single column of data. “Leverage” and “influence” approach the same idea, but from the perspective of linear regression:

  • Points of high leverage have extreme \(X\) values. We can think of them as being far away from the fulcrum (the mean of \(X\)) of a lever (the regression line). In the case of simple linear regression, these points are horizontally distant from the mean.
  • If a point exerts a strong influence on the slope of the regression line, we call it an influential point. In other words, removing a single influential point may strongly change the slope of a regression line.

Cook’s Distance, or Cook’s D, measures how much a model’s fitted values would change if a point were removed. In other words, it is a direct measure of each observation’s influence.

\[ \text{Cook's D} = \displaystyle D_{i} = {\frac {\sum _{j=1}^{n}\left(\hat {y}_j-\hat{y}_{j(i)}\right)^{2}}{k\cdot \text{MSE}}} \]

This is a calculation for each observation \(i\), where \(\hat{y}_{j(i)}\) is the fitted value for observation \(j\) when observation \(i\) is removed from the model, and \(k\) is the number of parameters in the model (i.e., the number of \(\beta\) coefficients, including the intercept).

model_3 <- lm(y3 ~ x3, anscombe)
model_4 <- lm(y4 ~ x4, anscombe)

anscombe <- anscombe |>
  mutate(influential_3 = y3 > 11,
         influential_4 = x4 > 11,
         cooks_3 = cooks.distance(model_3),
         cooks_4 = cooks.distance(model_4))

mean(anscombe$cooks_3)
## [1] 0.1672688
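
As a minimal sketch, we can verify the leave-one-out definition above for a single observation (here \(i = 3\)) against R’s cooks.distance():

i <- 3
fit_minus  <- lm(y3 ~ x3, data = anscombe[-i, ])       # refit without row i
yhat_full  <- predict(model_3)                         # fitted values, full model
yhat_minus <- predict(fit_minus, newdata = anscombe)   # fitted values, row i removed

k   <- length(coef(model_3))                           # number of parameters
mse <- sum(residuals(model_3)^2) / df.residual(model_3)

c(by_hand = sum((yhat_full - yhat_minus)^2) / (k * mse),
  from_r  = unname(cooks.distance(model_3)[i]))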

Different kinds of “outliers” yield different leverage values.

anscombe |>
  ggplot(mapping = aes(x = x3, y = y3)) +
  geom_point(size = 2, color = 'darkblue') +
  geom_text_repel(data = filter(anscombe, influential_3),
                  mapping = aes(label = paste("Cook's D",
                                              round(cooks_3, 3))),
                  color = "darkred") +
  geom_smooth(data = filter(anscombe, influential_3 == FALSE),
              method = 'lm', se = FALSE, color = 'lightblue') +
  geom_smooth(method = 'lm', se = FALSE, color = 'darkred')
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

As a rule of thumb: an observation with Cook’s distance larger than three times the mean Cook’s distance might be an outlier or influential point. In this case, the point has a Cook’s D of \(1.393 > 3 \cdot 0.167\).
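
In code, that rule of thumb is a simple filter (a minimal sketch):

# flag observations whose Cook's D exceeds three times the mean Cook's D
anscombe |>
  filter(cooks_3 > 3 * mean(cooks_3)) |>
  select(x3, y3, cooks_3)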

The Hat Matrix

The explicit measure for leverage comes from “the hat matrix”. First, recall from linear algebra that the least-squares fitted values \(\hat{y}\) are the projection of the response \(y\) onto the column space of the data matrix \(X\):

\[ \hat{y} = X(X^\top X)^{-1}X^\top y \]

where the first column of \(X\) contains all ones (our constant intercept). In this way, we can define the square matrix \(H=X(X^\top X)^{-1}X^\top\), so

\[ \hat{y} = Hy \]

Writing out the \(i\)th row of this matrix product, the \(i\)th prediction is:

\[ \hat{y}_i = h_{i1}y_1 + h_{i2}y_2 + \cdots + h_{ii}y_i + \cdots + h_{in}y_n \]

In other words, \(h_{ii}\) represents the amount of leverage that the observed value \(y_i\) has on the predicted value \(\hat{y}_i\). If \(h_{ii}\) is high, then the predicted value for the \(i\)th row depends heavily on the value of \(y_i\) — so, it has an effect on the “lever” of the line itself. And, vice versa when \(h_{ii}\) is low.
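
As a quick check, here is a minimal sketch that builds \(H\) directly from the design matrix of model_3 and confirms that its diagonal matches R’s hatvalues():

X <- model.matrix(model_3)              # first column is all ones (the intercept)
H <- X %*% solve(t(X) %*% X) %*% t(X)   # H = X (X'X)^{-1} X'

all.equal(unname(diag(H)), unname(hatvalues(model_3)))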

anscombe <- anscombe |>
  mutate(hat_3 = hatvalues(model_3))

anscombe |>
  ggplot(mapping = aes(x = x3, y = y3)) +
  geom_point(size = 2, color = 'darkblue') +
  geom_smooth(data = filter(anscombe, influential_3 == FALSE),
              method = 'lm', se = FALSE, color = 'lightblue') +
  geom_smooth(method = 'lm', se = FALSE, color = 'darkred') +
  geom_text_repel(data = anscombe[c(3, 4, 6, 8),],
                  mapping = aes(label = paste("Hat Value",
                                              round(hat_3, 3))),
                  color = "darkred",
                  nudge_y = -0.7)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Typically, end points are going to have higher hat values, and middle points (closer to the mean) will have lower ones. This makes sense when we think about the idea that the further you are from the fulcrum of a lever, the more effect you’ll have on it.
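
For simple regression, the hat value has a closed form that makes the fulcrum idea explicit: \(h_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2\), which grows with the distance of \(x_i\) from \(\bar{x}\). A minimal sketch, checked against hatvalues():

x         <- anscombe$x3
h_by_hand <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)

all.equal(h_by_hand, unname(hatvalues(model_3)))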

Model Summary

In R, the summary function provides a summary of whatever object you pass into it. For instance, we’ve already seen the output for an ANOVA (aov) object. It will also give us an overview of a linear model:

# let's use this model for the remainder of the notebook
model <- lm(sale_price ~ year_remod_add + lot_area, ames_basic)

summary(model)
## 
## Call:
## lm(formula = sale_price ~ year_remod_add + lot_area, data = ames_basic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -421288  -42869  -11152   33313  353571 
## 
## Coefficients:
##                     Estimate    Std. Error t value Pr(>|t|)    
## (Intercept)    -14580203.909   4154884.442  -3.509 0.000513 ***
## year_remod_add      7338.075      2071.626   3.542 0.000455 ***
## lot_area              11.279         1.166   9.672  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80860 on 324 degrees of freedom
## Multiple R-squared:  0.252,  Adjusted R-squared:  0.2473 
## F-statistic: 54.56 on 2 and 324 DF,  p-value: < 2.2e-16
  • The F-Test is as we’ve seen before, but here the null hypothesis is that all of the slope coefficients (everything except the intercept) are 0. The tiny p-value means it is very unlikely we would see a fit this good if that were true, so at least one of these explanatory variables likely has a linear relationship with the response.
  • Each p-value Pr(>|t|) is computed under the null hypothesis that there is no linear relationship between \(Y\) and the corresponding variable \(X\), i.e., that the coefficient equals zero (given the other variables in the model).
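
If you prefer these headline numbers in a data frame (for example, to compare several models programmatically), broom’s glance() returns one row per model; a minimal sketch:

glance(model) |>
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df.residual)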

Diagnostics and Evaluation

Recall the 5 assumptions of linear regression:

  1. VARIABLE \(x\) IS LINEARLY CORRELATED WITH RESPONSE \(y\).
  2. ERRORS HAVE CONSTANT VARIANCE ACROSS ALL PREDICTIONS
  3. OBSERVATIONS ARE INDEPENDENT AND UNCORRELATED
  4. INDEPENDENT VARIABLES CANNOT BE LINEARLY CORRELATED
  5. ERRORS ARE NORMALLY DISTRIBUTED OVER THE PREDICTION LINE*

* This is equivalent to saying “\(y\) is normally distributed given any \(x\) value.” This is not the same as \(y\) being normally distributed as a whole, but normally distributed \(y\) variables usually satisfy this assumption.

Diagnosing our model is akin to ensuring that these assumptions are met, and evaluation is determining how well our model is fit.

Diagnostic Plots

We can evaluate all our assumptions by looking at diagnostic plots. In this class, I recommend using lindia as the platform for regression diagnostic plots. Specifically, I advise using the following plots:

  • Residuals vs. \(\hat{y}\) values
  • Residuals vs. \(x\) values
  • Residual Histogram
  • QQ-Plot
  • Cook’s D by Observation

Our goal here is to identify issues in our model using diagnostic plots, not necessarily to fix them. We will talk about how to solve issues from diagnostics in other lessons.

Note: Other plots like the Scale-Location Plot, or Residual vs. Leverage plots (even in other R packages) tend to use more advanced calculations such as standardized residuals, and these complicate the interpretation of the plots. Practically, the most valuable conclusions you might draw from these plots can be gathered from the (more readable) plots listed above. For this reason, I do not cover them here.

Residuals vs. Fitted Values

The first plot I tend to look at is the residual values vs. the fitted values:

gg_resfitted(model) +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

For this particular plot, we can already see that one of our assumptions is being violated: the variance of the errors increases for higher predicted sale prices. This is indicated by the “fanning” effect from left to right.

Notice that there is a smoothed curve that follows the “path” of these residuals. This is a great way to find out if there is a pattern to the residuals other than just fanning. For example, if this curve trends upward, then the model tends to under-predict at higher fitted values and over-predict at lower fitted values (recall that a residual is observed minus fitted).

Residuals vs. X Values

plots <- gg_resX(model, plot.all = FALSE)

# for each variable of interest ...
plots$year_remod_add +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

plots$lot_area +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

These plots provide a deeper look into the 2nd and 5th assumptions, above. What we want are normally distributed errors with a constant variance across values of each \(X\) variable. Here we see a violation for one of the explanatory variables: the residuals spread out for larger values of lot_area.

This plot is also a great way to test the linearity assumption between X and y. If the relationship between \(X\) and \(y\) were perfectly linear, then the blue line would line up with the red dotted line. We can also investigate linearity by plotting the variable with the response:

ggplot(data = ames_basic) +
  geom_point(mapping = aes(x=lot_area, y=sale_price))

Residual Histogram

gg_reshist(model)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

When we look at the histogram of residuals, we want to see a distribution that is roughly normal. Here there is a slight concern: the tails are not symmetric. There is a “shorter” tail on the left and a longer one on the right. This is not severe, but it suggests that some aspect of the phenomenon is missing from the model (i.e., we may need more or different explanatory variables).

QQ-Plots

Be very careful when using QQ-plots to diagnose a regression model. These plots are great at detecting even the slightest differences between an empirical distribution and a theoretical one. If you’re not careful, you may find deviations in a QQ-plot that are not actually severe enough to significantly affect your regression model. For example:

# the normal QQ plot
gg_qqplot(model)

To build this plot, we follow these steps (a hand-rolled sketch appears after the list):

  1. Sort all the residual values from the smallest value to the largest value (negatives included). Of course, these residuals should be centered (roughly) at 0.
  2. Each point also defines a quantile. That is, after sorting, the \(k\)th residual value indicates that \(\displaystyle \frac{k}{n} \times 100\) % of the data lies beneath that value.
  3. On the y-axis, we plot the residual values (or, in other words, the observed quantiles).
  4. If we take these quantiles (i.e., the percentages), and map them to the corresponding areas under a standard normal distribution, we will get theoretical quantiles. These values are mapped on the x-axis.
  5. Residuals that are perfectly normally distributed will fall along a straight line here. Tails that are heavier than normal (even if symmetric) show up as opposing “curves” on the left and right of this plot. The closer the points are to the line, the more normally the residuals are distributed.
  6. Balance your conclusions from this plot with the histogram. In practice, residuals just need to be normal “enough”. Our residuals are normal enough, but the QQ-Plot might (mis)lead you into thinking otherwise.
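
Putting steps 1–4 together, here is a minimal sketch that builds the same points by hand (using ppoints() for the plotting positions, which are approximately \(k/n\) but adjusted away from 0 and 1):

res <- sort(residuals(model))                                 # step 1: sorted residuals
qq  <- data.frame(theoretical = qnorm(ppoints(length(res))),  # step 4: theoretical quantiles
                  observed    = res)                          # step 3: observed quantiles

ggplot(data = qq, mapping = aes(x = theoretical, y = observed)) +
  geom_point() +
  labs(title = "QQ-Plot by Hand")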

For more on QQ-plots, you can take a look at this article or this YouTube video.

Cook’s D by Observation

# take a look at the docs for `threshold`
gg_cooksd(model, threshold = 'matlab')

Here, we can see a few rows of data have a high influence on the model. Each number represents the row in the data. Rows 9, 134, 177, and 254 may be troublesome, so we can investigate them by plotting various variables against the response variable.

For example:

ggplot(data = slice(ames_basic, 
                    c(134, 177, 254))) +
  geom_point(data = ames_basic,
             mapping = aes(x = lot_area, y = sale_price)) +
  geom_point(mapping = aes(x = lot_area, y = sale_price),
             color = 'darkred') +
  geom_text_repel(mapping = aes(x = lot_area, 
                                y = sale_price,
                                label = neighborhood),
             color = 'darkred') +
  labs(title="Investigating High Influence Points",
       subtitle="Label = Neighborhood Name")

Clearly, some of these rows have a stronger influence when we consider just the lot_area, but not all of them. We may want to find where “highly influential rows” like these intersect across the different explanatory variables of the model. I.e., which rows are very highly influential among most (or all) explanatory variables.

Note: as usual, one should never remove these data points from the model without good reason. At the very least, you could “tag” these as outliers, and present multiple models.

EXERCISE

Consider adding different combinations of variables to see how they affect the diagnostic plots … What improves them, or what makes them worse?

print("your code here")
## [1] "your code here"

A Note on Error Metrics

Error metrics are far more useful when doing predictive statistics. When we do inferential statistics, the notion of “expected error on new data” is not the focus: we are not looking to predict new values; rather, we’re trying to understand relationships between variables. Still, you can calculate them in R:

# sum of squared errors
sse <- sum(model$residuals ^ 2)

# mean squared error
mse <- mean(model$residuals ^ 2)

# root mean squared error
rmse <- sqrt(mse)

# mean absolute error
mae <- mean(abs(model$residuals))

# mean absolute percent error
mape <- mean((abs(model$residuals) + 1) / 
               (predict(model) + 1))

c('sse' = sse, 'mse' = mse, 'rmse' = rmse, 'mae' = mae, 'mape' = mape)
##          sse          mse         rmse          mae         mape 
## 2.118348e+12 6.478130e+09 8.048683e+04 5.738466e+04 2.081019e-01
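
Note that the mape above is a variant: it divides by the predictions, with a +1 guard against zeros. The more common textbook definition divides the absolute errors by the observed responses (which are never zero for sale prices); a minimal sketch, assuming no rows were dropped when fitting:

# conventional MAPE: absolute errors relative to the observed values
mean(abs(model$residuals) / ames_basic$sale_price)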

Simpson’s Paradox

Consider the following dataset.

# Create data
x <- rnorm(50, 0, 1)
y_a <- -1/3*x + rnorm(50, 1, 0.5)
y_b <- -1/3*x + rnorm(50, 2, 0.5)
y_c <- -1/3*x + rnorm(50, 3, 0.5)

simpsons <- data.frame(x = c(x + 1, x + 3, x + 5),
                       y = c(y_a, y_b, y_c),
                       sub_pop = c(rep('a', 50),
                                   rep('b', 50),
                                   rep('c', 50)))

ggplot(data=simpsons) +
  geom_point(mapping = aes(x=x,y=y, color=sub_pop)) +
  scale_color_brewer(palette = 'Dark2')

ggplot(data=simpsons) +
  geom_point(mapping = aes(x=x,y=y))

Notice how our understanding of “as x increases, y _____” changes based on whether or not we include the confounding variable sub_pop in our model. In short, be careful with your interpretations, keeping in mind that there may be a confounding variable at play. Usually, this can be avoided when we think about what the variables mean in context.
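
We can see the same flip numerically by fitting the model with and without the confounder (a minimal sketch; exact values vary since the data are randomly generated):

coef(lm(y ~ x, data = simpsons))["x"]             # pooled slope: positive
coef(lm(y ~ x + sub_pop, data = simpsons))["x"]   # within-group slope: negative (about -1/3)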