In this lab, we cover the following topics:
* hypothesis tests for model coefficients
* \(R^2\) and Adjusted \(R^2\)
* outliers, leverage, and influence (Cook’s Distance and hat values)
* model diagnostics with lindia
* error metrics
* confounding variables
library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.4.2
## Warning: package 'lubridate' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)
# remove scientific notation
options(scipen = 6)
# default theme, unless otherwise noted
theme_set(theme_minimal())
(We’ll need to continue using the ames_basic dataset we built last week.)
ames <- make_ames()
ames_basic <- ames |>
  rename_with(tolower) |>
  filter(bldg_type == "OneFam",
         house_style == "One_Story",
         year_built >= 2000) |>
  mutate(great_qual = ifelse(overall_qual %in%
                               c("Very_Excellent", "Excellent", "Very_Good"),
                             1, 0))
Consider that a coefficient in a model can be very close to zero, i.e., the relationship between \(X\) and \(Y\) would be essentially zero. We can frame this as a hypothesis test and ask whether there is enough evidence against the claim that there is no relationship between a variable and the response. In R, each coefficient is tested with a t-test whose null hypothesis is that the coefficient's value is zero:
\[ \text{H}_0: \beta_i=0 \]
Let’s return to the model which used the interaction term, and use the broom tidy function to view the model coefficients.
model <- lm(sale_price ~ year_remod_add + great_qual
+ year_remod_add:great_qual, ames_basic)
tidy(model, conf.int = TRUE)
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -8.30e5 5327026. -0.156 0.876 -1.13e7 9650105.
## 2 year_remod_add 5.13e2 2656. 0.193 0.847 -4.71e3 5739.
## 3 great_qual -1.01e7 7545297. -1.34 0.182 -2.49e7 4755240.
## 4 year_remod_add:great_… 5.09e3 3762. 1.35 0.177 -2.31e3 12488.
Notice how the top two p-values (for the intercept and year_remod_add) are very high, and the other two are not small either. This indicates that there is not sufficient evidence that the year remodeled, the quality indicator, or the interaction term have a significant effect on price. In other words, coefficient estimates like these are not surprising under the null hypothesis (that the true coefficients are in fact zero).
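As a sanity check, each statistic in the table is just the estimate divided by its standard error, and each p-value comes from a t distribution with the model’s residual degrees of freedom. A quick sketch, reproducing the row for year_remod_add:
# t-statistic and two-sided p-value for one coefficient, by hand
coefs <- summary(model)$coefficients
t_stat <- coefs["year_remod_add", "Estimate"] / coefs["year_remod_add", "Std. Error"]
p_val <- 2 * pt(abs(t_stat), df = df.residual(model), lower.tail = FALSE)
c(t_statistic = t_stat, p_value = p_val)   # should match the statistic and p.value above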
We can also visualize these tests as confidence intervals instead.
tidy(model, conf.int = TRUE) |>
  ggplot(mapping = aes(x = estimate, y = term)) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = 'dotted', color = 'gray') +
  geom_errorbarh(mapping = aes(xmin = conf.low,
                               xmax = conf.high),
                 height = 0.5) +
  labs(title = "Model Coefficient C.I.")
It’s true that 0 is contained in the great_qual confidence interval, but intuitively, house quality really should affect the price, so we likely need to consider other variables in a new model.
A metric that is often used for linear regression is \(R^2\), or the coefficient of determination. The idea behind this metric is to measure the “variance explained” in the response variable by the model. The response variable has an intrinsic variance about its mean \(\overline{y}\) given by \(\sum_{i=1}^{n} (y_i - \overline{y})^2/n\), whose numerator is called the SST (total sum of squares). Writing \(SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2\) for the model’s sum of squared errors, we think of the ratio \(SSE/SST\) as the proportion of variance that remains once we’ve modeled the data. Since we want the proportion of variance explained, we take the difference between 1 and this ratio:
\[ R^2 = 1 - SSE/SST = 1 - \frac{\sum _{i=1}^n \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} (y_i - \overline{y})^2} \]
(So, for any least-squares fit that includes an intercept, \(0\leq R^2 \leq 1\).)
If the model is a perfect fit, \(SSE = 0\) so \(R^2 = 1\). On the other hand, if we have a constant model that predicted the mean for all observations, then \(SSE=SST\) and \(R^2 = 0\) (further, if \(SST = 0\), then \(R^2\) is undefined). So, \(R^2\) measures how good our regression fit is relative to a constant model that predicts the mean.
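As a quick check, we can compute \(R^2\) for the interaction model above directly from this definition:
# R^2 by hand: 1 - SSE/SST
sse <- sum(residuals(model)^2)
sst <- sum((ames_basic$sale_price - mean(ames_basic$sale_price))^2)
1 - sse / sst   # should match summary(model)$r.squared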
There is no notion of an \(R^2\) value that is “too low”. It should only be used to compare one model to another.
(Some problems are just harder than others!)
The Adjusted R-squared value is meant to account for the natural increase in \(R^2\) as more explanatory variables are introduced. Explicitly, if there are \(n\) rows of data and \(p\) parameters or coefficients in the model, then:
\[ R_{\text{adj}}^2 = 1 - \frac{SSE/\text{df}_e}{SST/\text{df}_t} = 1 - \frac{\left[\sum _{i=1}^n \left(y_i - \hat{y}_i\right)^2\right]/(n-p)}{\left[\sum_{i=1}^{n} (y_i - \overline{y})^2\right]/(n-1)} \]
The degrees of freedom in the denominator (\(\text{df}_t = n-1\)) comes from estimating the mean \(\bar{y}\) in the calculation, and the degrees of freedom in the numerator (\(\text{df}_e = n-p\)) comes from the fact that \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + \cdots + \hat{\beta}_{p-1}x_{p-1}\) contains \(p\) estimated parameters (including the intercept).
To see where the adjustment comes in, note that the ordinary \(R^2\) can be written in terms of (biased) sample variances:
\[ R^2 = 1 - \frac{\text{var(Model Errors)}}{\text{var(Response)}} = 1 - \frac{\left[\sum _{i=1}^n \left(y_i - \hat{y}_i\right)^2\right]/n}{\left[\sum_{i=1}^{n} (y_i - \overline{y})^2\right]/n} \]
Naturally, as we increase the number of explanatory variables in a model, the model will necessarily improve, at least slightly (simply by introducing more information). So, the numerator of the second term will decrease, while the denominator stays exactly the same. The correction in the \(R_\text{adj}^2\) calculation attempts to balance the sum of squared errors with the number of parameters in the model.
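Here is the same kind of hand check for the adjusted version, using the interaction model above (with \(p\) counted as the number of coefficients, including the intercept):
# adjusted R^2 by hand: 1 - (SSE/(n - p)) / (SST/(n - 1))
n <- nrow(ames_basic)
p <- length(coef(model))   # 4 coefficients, including the intercept
sse <- sum(residuals(model)^2)
sst <- sum((ames_basic$sale_price - mean(ames_basic$sale_price))^2)
1 - (sse / (n - p)) / (sst / (n - 1))   # should match summary(model)$adj.r.squared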
# four (nested) models with increasing variables
model1 <- lm(sale_price ~ year_remod_add + great_qual, ames_basic) |>
  summary()

model2 <- lm(sale_price ~ year_remod_add + great_qual
             + lot_area, ames_basic) |>
  summary()

model3 <- lm(sale_price ~ year_remod_add + great_qual
             + lot_area + great_qual:lot_area, ames_basic) |>
  summary()

model4 <- lm(sale_price ~ year_remod_add + great_qual
             + lot_area + great_qual:lot_area + lot_frontage, ames_basic) |>
  summary()

df_plot <- data.frame(p = c(2, 3, 4, 5),
                      r_squared = c(model1$r.squared,
                                    model2$r.squared,
                                    model3$r.squared,
                                    model4$r.squared),
                      adj_r_squared = c(model1$adj.r.squared,
                                        model2$adj.r.squared,
                                        model3$adj.r.squared,
                                        model4$adj.r.squared)) |>
  mutate(r_perc_increase = 100 * round((r_squared - model1$r.squared)
                                       / model1$r.squared, 3),
         r_adj_perc_increase = 100 * round((adj_r_squared - model1$adj.r.squared)
                                           / model1$adj.r.squared, 3),
         diff = r_perc_increase - r_adj_perc_increase)
ggplot(data = df_plot) +
  geom_line(mapping = aes(x = p, y = r_perc_increase,
                          color = 'R Squared Improvement')) +
  geom_line(mapping = aes(x = p, y = r_adj_perc_increase,
                          color = 'Adjusted R Squared Improvement')) +
  geom_text_repel(mapping = aes(x = p,
                                y = r_adj_perc_increase,
                                label = paste("diff = ", round(diff, 1))),
                  color = "darkred",
                  nudge_x = 0.3) +
  labs(color = '',
       x = 'Number of Variables (p)',
       y = '% Increase in R-Squared',
       title = "Improving on a Model with 2 Coefficients",
       subtitle = "(Initial Model: R=0.4 and R_adj=0.399)") +
  scale_color_brewer(palette = 'Dark2') +
  theme(legend.position = "bottom")
Both of these values increase as we introduce more variables here, but the Adjusted R-squared does not increase as fast as the R-squared does (and it can even decrease when a new variable adds little information). Consider the implications of this with several more variables in the model.
“Outlier” is a term used to describe a single point that is unusual relative to the rest of a column of data. “Leverage” and “influence” approach the same idea, but from the perspective of linear regression: a high-leverage point has unusual explanatory-variable values, and an influential point is one whose removal would substantially change the fitted model.
Cook’s Distance, or Cook’s D, measures how much a model’s fitted values would change if a given point were removed. In other words, it is a natural measure of influence for each observation.
\[ \text{Cook's D} = \displaystyle D_{i} = {\frac {\sum _{j=1}^{n}\left(\hat {y}_j-\hat{y}_{j(i)}\right)^{2}}{k\cdot \text{MSE}}} \]
This is a calculation for each observation \(i\), and \(\hat{y}_{j(i)}\) is the fitted value for observation \(j\) when observation \(i\) is removed from the model. \(k\) is the number of parameters in the model (i.e., the number of \(\beta_*\) terms).
model_3 <- lm(y3 ~ x3, anscombe)
model_4 <- lm(y4 ~ x4, anscombe)
anscombe <- anscombe |>
  mutate(influential_3 = y3 > 11,
         influential_4 = x4 > 11,
         cooks_3 = cooks.distance(model_3),
         cooks_4 = cooks.distance(model_4))
mean(anscombe$cooks_3)
## [1] 0.1672688
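For intuition, we can reproduce one of these values straight from the definition by refitting the model without the observation in question. A sketch for the high-y3 point (row 3, the only row with y3 > 11):
# Cook's D for observation 3 of model_3, computed from the definition
i <- 3
fit_drop <- lm(y3 ~ x3, anscombe[-i, ])
yhat_full <- predict(model_3)                        # fitted values using all rows
yhat_drop <- predict(fit_drop, newdata = anscombe)   # fitted values after dropping row i
k <- length(coef(model_3))                           # number of parameters (2)
mse <- sum(residuals(model_3)^2) / df.residual(model_3)
sum((yhat_full - yhat_drop)^2) / (k * mse)           # should match cooks.distance(model_3)[i]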
Different kinds of “outliers” yield different leverage values.
anscombe |>
  ggplot(mapping = aes(x = x3, y = y3)) +
  geom_point(size = 2, color = 'darkblue') +
  geom_text_repel(data = filter(anscombe, influential_3),
                  mapping = aes(label = paste("Cook's D",
                                              round(cooks_3, 3))),
                  color = "darkred") +
  geom_smooth(data = filter(anscombe, influential_3 == FALSE),
              method = 'lm', se = FALSE, color = 'lightblue') +
  geom_smooth(method = 'lm', se = FALSE, color = 'darkred')
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
As a rule of thumb: an observation with Cook’s distance larger than three times the mean Cook’s distance might be an outlier or influential point. In this case, the point has a Cook’s D of \(1.393 > 3 \cdot 0.167\).
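We can apply this rule of thumb directly in code:
# flag observations whose Cook's D exceeds three times the mean Cook's D
which(anscombe$cooks_3 > 3 * mean(anscombe$cooks_3))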
The explicit measure for leverage comes from “the hat matrix”. First, note that, in linear-algebra terms, a linear model obtains its fitted values \(\hat{y}\) by projecting the response \(y\) onto the column space of the data matrix \(X\):
\[ \hat{y} = X(X^\top X)^{-1}X^\top y \]
where the first column of \(X\) contains all ones (our constant intercept). In this way, we can define the square matrix \(H=X(X^\top X)^{-1}X^\top\), so
\[ \hat{y} = Hy \]
Multiplying the \(i\)th row of \(H\) by \(y\) gives the \(i\)th prediction:
\[ \hat{y}_i = h_{i1}y_1 + h_{i2}y_2 + \cdots + h_{ii}y_i + \cdots + h_{in}y_n \]
In other words, \(h_{ii}\) represents the amount of leverage that the observed value \(y_i\) has on the predicted value \(\hat{y}_i\). If \(h_{ii}\) is high, then the predicted value for the \(i\)th row depends heavily on the value of \(y_i\) — so, it has an effect on the “lever” of the line itself. And, vice versa when \(h_{ii}\) is low.
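Before using the built-in helper, here is a sketch that builds the hat matrix by hand for model_3 and checks its diagonal:
# hat matrix by hand: H = X (X'X)^{-1} X'
X <- model.matrix(model_3)                # design matrix; first column is all ones
H <- X %*% solve(t(X) %*% X) %*% t(X)
head(diag(H))                             # should match head(hatvalues(model_3))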
anscombe <- anscombe |>
  mutate(hat_3 = hatvalues(model_3))

anscombe |>
  ggplot(mapping = aes(x = x3, y = y3)) +
  geom_point(size = 2, color = 'darkblue') +
  geom_smooth(data = filter(anscombe, influential_3 == FALSE),
              method = 'lm', se = FALSE, color = 'lightblue') +
  geom_smooth(method = 'lm', se = FALSE, color = 'darkred') +
  geom_text_repel(data = anscombe[c(3, 4, 6, 8), ],
                  mapping = aes(label = paste("Hat Value",
                                              round(hat_3, 3))),
                  color = "darkred",
                  nudge_y = -0.7)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Typically, end points are going to have higher hat values, and middle points (closer to the mean) will have lower ones. This makes sense when we think about the idea that the further you are from the fulcrum of a lever, the more effect you’ll have on it.
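For simple regression this intuition has a closed form: the leverage of observation \(i\) grows with its squared distance from the mean of \(x\),
\[ h_{ii} = \frac{1}{n} + \frac{(x_i - \overline{x})^2}{\sum_{j=1}^{n}(x_j - \overline{x})^2} \]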
In R, the summary function provides a summary of whatever object you pass into it. For instance, we’ve already seen the output for an ANOVA (aov) object. It will also give us an overview of a linear model:
# let's use this model for the remainder of the notebook
model <- lm(sale_price ~ year_remod_add + lot_area, ames_basic)
summary(model)
##
## Call:
## lm(formula = sale_price ~ year_remod_add + lot_area, data = ames_basic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -421288 -42869 -11152 33313 353571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14580203.909 4154884.442 -3.509 0.000513 ***
## year_remod_add 7338.075 2071.626 3.542 0.000455 ***
## lot_area 11.279 1.166 9.672 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80860 on 324 degrees of freedom
## Multiple R-squared: 0.252, Adjusted R-squared: 0.2473
## F-statistic: 54.56 on 2 and 324 DF, p-value: < 2.2e-16
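If you would rather have these model-level numbers in a tibble (handy when comparing several models), broom’s glance function collects them in a single row:
glance(model)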
The Pr(>|t|) column reports the p-value for the test which assumes there is no linear relationship between \(Y\) and the corresponding variable \(X\), i.e., that the coefficient is equal to zero.
Recall the 5 assumptions of linear regression:
* This is equivalent to saying “\(y\) is normally distributed given any \(x\) value.” This is not the same as \(y\) being normally distributed as a whole, but normally distributed \(y\) variables usually satisfy this assumption.
Diagnosing our model is akin to ensuring that these assumptions are met, and evaluation is determining how well our model is fit.
We can evaluate all of these assumptions by looking at diagnostic plots. In this class, I recommend using lindia as the platform for regression diagnostic plots. Specifically, I advise using the following plots, each of which is demonstrated below:
* residuals vs. fitted values (gg_resfitted)
* residuals vs. each explanatory variable (gg_resX)
* a histogram of the residuals (gg_reshist)
* a normal QQ-plot of the residuals (gg_qqplot)
* Cook’s Distance by observation (gg_cooksd)
Our goal here is to identify issues in our model using diagnostic plots, not necessarily to fix them. We will talk about how to solve issues from diagnostics in other lessons.
Note: Other plots like the Scale-Location Plot, or Residual vs. Leverage plots (even in other R packages) tend to use more advanced calculations such as standardized residuals, and these complicate the interpretation of the plots. Practically, the most valuable conclusions you might draw from these plots can be gathered from the (more readable) plots listed above. For this reason, I do not cover them here.
The first plot I tend to look at is the residual values vs. the fitted values:
gg_resfitted(model) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
For this particular plot, we can already see that one of our assumptions is being violated: the variance of the errors increases for higher fitted values (i.e., higher predicted sale prices). This is indicated by the “fanning” effect from left to right.
Notice that there is a smoothed curve following the “path” of these residuals. This is a great way to find out whether there is a pattern in the residuals beyond the fanning. For example, if this curve trends upward, then the model tends to underestimate at higher fitted values (positive residuals) and overestimate at lower fitted values (negative residuals).
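Keep the sign convention in mind when reading this: R’s residuals are the observed values minus the fitted values, so positive residuals mean the model under-predicted. A quick check:
# residuals are observed minus fitted
head(residuals(model))
head(ames_basic$sale_price - predict(model))   # same values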
plots <- gg_resX(model, plot.all = FALSE)
# for each variable of interest ...
plots$year_remod_add +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
plots$lot_area +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
These plots provide a deeper look into the 5th assumption above. What we want are normally distributed errors with a constant variance across the values of each \(X\) variable. And here we see a violation of that assumption for one of the explanatory variables: the residuals spread out more for larger values of lot_area.
This plot is also a great way to test the linearity assumption between X and y. If the relationship between \(X\) and \(y\) were perfectly linear, then the blue line would line up with the red dotted line. We can also investigate linearity by plotting the variable with the response:
ggplot(data = ames_basic) +
geom_point(mapping = aes(x=lot_area, y=sale_price))
gg_reshist(model)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
When we look at the histogram of residuals, we want to see a distribution that is roughly normal. Here there is a slight concern, as the tails are not symmetric: there is a “shorter” tail on the left and a longer one on the right. This is not severe, but it indicates that some aspect of the phenomenon is missing from the model (i.e., we may need more or different explanatory variables).
Be very careful when using QQ-plots to diagnose a regression model. These plots are great at detecting even the slightest differences between an empirical distribution and a theoretical one. However, if you’re not careful, you may read deviations in a QQ-plot as problems even when they are not severe enough to meaningfully affect your regression model. For example:
# the normal QQ plot
gg_qqplot(model)
To build this plot, we (sketched in code after the links below):
* sort the residuals from smallest to largest,
* compute a matching set of theoretical quantiles from the standard normal distribution, and
* plot each sorted residual against its theoretical quantile; points falling near a straight line indicate roughly normal residuals.
For more on QQ-plots, you can take a look at this article or this YouTube video.
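To make the construction concrete, here is a rough sketch of the same idea in plain ggplot2: sorted residuals plotted against standard-normal quantiles (the exact scaling may differ slightly from lindia’s version):
# a hand-rolled normal QQ plot of the residuals
resids <- sort(residuals(model))
theoretical <- qnorm(ppoints(length(resids)))   # standard-normal quantiles
ggplot(data = data.frame(theoretical = theoretical, resids = resids)) +
  geom_point(mapping = aes(x = theoretical, y = resids)) +
  labs(x = "Theoretical Quantiles", y = "Sorted Residuals")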
# take a look at the docs for `threshold`
gg_cooksd(model, threshold = 'matlab')
Here, we can see a few rows of data have a high influence on the model. Each number represents the row in the data. Rows 9, 134, 177, and 254 may be troublesome, so we can investigate them by plotting various variables against the response variable.
For example:
ggplot(data = slice(ames_basic, c(134, 177, 254))) +
  geom_point(data = ames_basic,
             mapping = aes(x = lot_area, y = sale_price)) +
  geom_point(mapping = aes(x = lot_area, y = sale_price),
             color = 'darkred') +
  geom_text_repel(mapping = aes(x = lot_area,
                                y = sale_price,
                                label = neighborhood),
                  color = 'darkred') +
  labs(title = "Investigating High Influence Points",
       subtitle = "Label = Neighborhood Name")
Clearly, some of these rows have a stronger influence when we consider just lot_area, but not all of them. We may want to find where “highly influential rows” like these intersect across the different explanatory variables of the model, i.e., which rows are highly influential for most (or all) of the explanatory variables.
Note: as usual, one should never remove these data points from the model without good reason. At the very least, you could “tag” these as outliers, and present multiple models.
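For example, one low-effort way to tag them is sketched below, using the rule of thumb from earlier (this assumes lm() dropped no rows for missing values, so the Cook’s D vector lines up with ames_basic):
ames_tagged <- ames_basic |>
  mutate(high_influence = cooks.distance(model) >
           3 * mean(cooks.distance(model)))
count(ames_tagged, high_influence)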
Consider adding different combinations of variables to see how they affect the diagnostic plots … What improves them, or what makes them worse?
print("your code here")
## [1] "your code here"
Error metrics are far more useful when doing predictive statistics. When we do inferential statistics, the notion of “expected error” is not really applicable: here we are not looking to predict new values; rather, we are trying to understand relationships between variables. However, you can still calculate these metrics in R:
# sum of squared errors
sse <- sum(model$residuals ^ 2)
# mean squared error
mse <- mean(model$residuals ^ 2)
# root mean squared error
rmse <- sqrt(mse)
# mean absolute error
mae <- mean(abs(model$residuals))
# mean absolute percent error (absolute error relative to the observed sale price)
mape <- mean(abs(model$residuals) / ames_basic$sale_price)
c('sse' = sse, 'mse' = mse, 'rmse' = rmse, 'mae' = mae, 'mape' = mape)
## sse mse rmse mae mape
## 2.118348e+12 6.478130e+09 8.048683e+04 5.738466e+04 2.081019e-01
Consider the following dataset.
# Create data
x <- rnorm(50, 0, 1)
y_a <- -1/3 * x + rnorm(50, 1, 0.5)
y_b <- -1/3 * x + rnorm(50, 2, 0.5)
y_c <- -1/3 * x + rnorm(50, 3, 0.5)

simpsons <- data.frame(x = c(x + 1, x + 3, x + 5),
                       y = c(y_a, y_b, y_c),
                       sub_pop = c(rep('a', 50),
                                   rep('b', 50),
                                   rep('c', 50)))

ggplot(data = simpsons) +
  geom_point(mapping = aes(x = x, y = y, color = sub_pop)) +
  scale_color_brewer(palette = 'Dark2')

ggplot(data = simpsons) +
  geom_point(mapping = aes(x = x, y = y))
Notice how our understanding of “as x increases, y _____” changes based on whether or not we include the confounding variable sub_pop in our model. In short, be careful with your interpretations, keeping in mind that there may be a confounding variable at play. Usually, this can be avoided by thinking about what the variables mean in context.
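We can see the same reversal directly in the fitted coefficients (the exact numbers depend on the random draws above):
# slope of x without and with the grouping variable
coef(lm(y ~ x, data = simpsons))["x"]             # typically positive
coef(lm(y ~ x + sub_pop, data = simpsons))["x"]   # close to the true within-group slope of -1/3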