This week, we focus on checking the assumptions of regression models. Many of these techniques are also useful for models other than the linear ones we have learned up to this point. We will use a dataset with a lot of known problems as our demo, to show what violations of the assumptions “look” like.
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy dplyr
  print.tbl_sql  dplyr
── Attaching packages ────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.3 ✓ purrr 0.3.4
✓ tibble 3.0.6 ✓ dplyr 1.0.4
✓ tidyr 1.1.2 ✓ stringr 1.4.0
✓ readr 1.4.0 ✓ forcats 0.5.1
── Conflicts ───────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Registered S3 method overwritten by 'htmlwidgets':
  method           from
  print.htmlwidget tools:rstudio
Attaching package: ‘Hmisc’
The following objects are masked from ‘package:dplyr’:
src, summarize
The following objects are masked from ‘package:base’:
format.pval, units
library(psych)
Attaching package: ‘psych’
The following object is masked from ‘package:Hmisc’:
describe
The following objects are masked from ‘package:ggplot2’:
%+%, alpha
library(dplyr)
library(ggplot2)
Use psych::describe to get descriptive statistics for skew1, the (already loaded) demo dataset:
psych::describe(skew1)
Let’s start with a regular multiple linear regression (MLR). We will use this model as our demo and check the assumptions for it:
linearmodel1 <- lm(y ~ x1 + x2 + x3, data = skew1)
summary(linearmodel1)
Call:
lm(formula = y ~ x1 + x2 + x3, data = skew1)
Residuals:
Min 1Q Median 3Q Max
-20.2324 -3.2209 0.6368 4.1809 13.4686
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.0601 7.8212 -0.647 0.518
x1 0.7425 0.1614 4.601 6.74e-06 ***
x2 -0.7812 1.7776 -0.439 0.661
x3 1.2894 1.5421 0.836 0.404
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.085 on 246 degrees of freedom
Multiple R-squared: 0.0811, Adjusted R-squared: 0.0699
F-statistic: 7.238 on 3 and 246 DF, p-value: 0.0001131
Use the augment function from the broom package to get predicted values, residuals, Cook’s Distances, and other diagnostics. We also use mutate to generate an ID number for our new “augmented” dataset.
library(broom)
diagnostics <- augment(linearmodel1) %>%
mutate(.,
ID = row_number())
glimpse(diagnostics)
Rows: 250
Columns: 11
$ y <dbl> 3.52, 5.17, -3.00, 2.08, 10.45, 8.22, 0.99, 5.14,…
$ x1 <dbl> 0.43, 2.46, -0.09, 5.99, 2.82, -0.05, 3.22, 3.85,…
$ x2 <dbl> 11.012794, 10.521316, 10.148856, 9.138934, 9.6354…
$ x3 <dbl> 11.612148, 11.654059, 10.942583, 10.113733, 10.41…
$ .fitted <dbl> 1.629127, 3.574360, 1.054567, 5.289135, 2.939575,…
$ .resid <dbl> 1.89087332, 1.59563982, -4.05456701, -3.20913506,…
$ .hat <dbl> 0.034030296, 0.010170061, 0.010185926, 0.02359367…
$ .sigma <dbl> 6.096286, 6.096664, 6.091963, 6.093994, 6.078457,…
$ .cooksd <dbl> 8.803705e-04, 1.784325e-04, 1.153942e-03, 1.72072…
$ .std.resid <dbl> 0.316163414, 0.263563643, -0.669728223, -0.533707…
$ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
Start with the normality of the residuals, using a histogram of the standardized residuals, a normal Q-Q plot, and the Shapiro-Wilk test.
ggplot(data = diagnostics, mapping = aes(x = .std.resid)) +
geom_histogram(binwidth = .25) +
labs(title = "Histogram of Residuals for Model 1",
x = "Residual Value") +
geom_vline(xintercept = c(-2.5, 2.5), linetype = "dotted", color = "red")
plot(linearmodel1, 2)  # base R: normal Q-Q plot of the residuals
shapiro.test(diagnostics$.resid)
Shapiro-Wilk normality test
data: diagnostics$.resid
W = 0.96122, p-value = 2.838e-06
The Shapiro-Wilk p-value is far below .05, so we reject the null hypothesis that the residuals are normally distributed. Next, an RVF (residuals vs. fitted) plot and related statistical tests are useful for checking that there is no heteroskedasticity and that the relationship appears linear.
ggplot(data = diagnostics, mapping = aes(x = .fitted, y = .resid)) +
geom_point() +
geom_smooth(method = "loess") +
labs(title = "RVF Plot for Model 1",
x = "Predicted Value",
y = "Residual Value")
Or, in base R:
plot(linearmodel1, 1)  # residuals vs. fitted plot
Ideally, the residual plot will show no fitted pattern. That is, the red line should be approximately horizontal at zero. The presence of a pattern may indicate a problem with some aspect of the linear model.
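To make that reference explicit, the same RVF plot can be redrawn with a dashed horizontal line at zero; this is a small variation on the ggplot code above, not part of the original output:
ggplot(data = diagnostics, mapping = aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(method = "loess") +
  geom_hline(yintercept = 0, linetype = "dashed") +  # the loess line should stay close to this
  labs(title = "RVF Plot for Model 1",
       x = "Predicted Value",
       y = "Residual Value")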
library(lmtest)
Loading required package: zoo
Attaching package: ‘zoo’
The following objects are masked from ‘package:base’:
as.Date, as.Date.numeric
bptest(linearmodel1)
studentized Breusch-Pagan test
data: linearmodel1
BP = 1.3423, df = 3, p-value = 0.7191
Look at the p-value to determine whether or not you should reject the null hypothesis, which here is that there is no heteroskedasticity. If the p-value is less than your significance level, reject the null; with p = 0.72, we do not reject, so there is no evidence of heteroskedasticity in this model.
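A minimal sketch of that decision rule (the .05 cutoff is an assumed significance level, not stated above):
bp <- bptest(linearmodel1)
bp$p.value < 0.05   # FALSE here (p = 0.7191): fail to reject the null of homoskedasticity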
The Ramsey RESET test checks the functional form; its null hypothesis is that the model is correctly specified (no omitted nonlinear terms of the regressors). Pass it the fitted model:
resettest(linearmodel1, power = 2, type = "regressor")
RESET test
data: diagnostics
RESET = 38.284, df1 = 10, df2 = 236, p-value < 2.2e-16
ggplot(data = diagnostics, mapping = aes(x = .fitted, y = .cooksd, label = ID)) +
geom_point() + geom_text(nudge_x = .25) +
labs(title = "Cook's Distance Plot for Model",
x = "Predicted Value",
y = "Cook's Distance") +
geom_hline(yintercept = 4/250, linetype = "dotted", color = "red")
Two rules of thumb for interpreting Cook’s Distance: 1) any observation with a Cook’s Distance greater than 1 should be excluded from the model, and 2) remove any observations above the threshold of 4/n, where n is the sample size (here, 4/250 = .016).
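A quick sketch of applying the 4/n rule to the augmented data, listing the flagged observations (this code is not part of the original output):
cutoff <- 4 / nrow(diagnostics)   # 4/250 = .016
diagnostics %>%
  filter(., .cooksd > cutoff) %>%
  select(., ID, .cooksd) %>%
  arrange(., desc(.cooksd))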
See how these potential outliers might influence your results, if at all. The fancy word for this: sensitivity analysis. Are my results sensitive to the presence of outliers? Hopefully not, but sometimes yes!
summary(linearmodel1)
Call:
lm(formula = y ~ x1 + x2 + x3, data = skew1)
Residuals:
Min 1Q Median 3Q Max
-20.2324 -3.2209 0.6368 4.1809 13.4686
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.0601 7.8212 -0.647 0.518
x1 0.7425 0.1614 4.601 6.74e-06 ***
x2 -0.7812 1.7776 -0.439 0.661
x3 1.2894 1.5421 0.836 0.404
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.085 on 246 degrees of freedom
Multiple R-squared: 0.0811, Adjusted R-squared: 0.0699
F-statistic: 7.238 on 3 and 246 DF, p-value: 0.0001131
trimmed <- diagnostics %>%
  filter(., .cooksd < .10)   # first pass: drop only the most extreme influential observations
high.cooks.removed <- lm(y ~ x1 + x2 + x3, data = trimmed)
summary(high.cooks.removed)
Call:
lm(formula = y ~ x1 + x2 + x3, data = trimmed)
Residuals:
Min 1Q Median 3Q Max
-19.2380 -3.5036 0.5714 3.9238 13.4437
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.94881 7.45948 0.261 0.794
x1 0.78352 0.15885 4.932 1.51e-06 ***
x2 -0.10683 1.68127 -0.064 0.949
x3 0.05053 1.47075 0.034 0.973
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.727 on 243 degrees of freedom
Multiple R-squared: 0.09228, Adjusted R-squared: 0.08107
F-statistic: 8.234 on 3 and 243 DF, p-value: 3.065e-05
prod.trimmed <- diagnostics %>%
  filter(., .cooksd < .016)   # stricter pass: apply the 4/n = 4/250 = .016 threshold
all.removed <- lm(y ~ x1 + x2 + x3, data = prod.trimmed)
summary(all.removed)
Call:
lm(formula = y ~ x1 + x2 + x3, data = prod.trimmed)
Residuals:
Min 1Q Median 3Q Max
-11.3804 -3.5478 0.0409 3.2187 12.5848
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.1021 6.2131 -0.660 0.510
x1 0.8508 0.1485 5.731 3.15e-08 ***
x2 -0.1485 1.4194 -0.105 0.917
x3 0.7013 1.2585 0.557 0.578
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.646 on 228 degrees of freedom
Multiple R-squared: 0.1262, Adjusted R-squared: 0.1147
F-statistic: 10.98 on 3 and 228 DF, p-value: 9.224e-07
Use the modelsummary package to build a summary table comparing all three models and see how the model improved.
library(modelsummary)
Attaching package: ‘modelsummary’
The following object is masked from ‘package:psych’:
SD
The following object is masked from ‘package:Hmisc’:
Mean
models <- list(linearmodel1, high.cooks.removed, all.removed)
modelsummary(models, output = "markdown")
|             | Model 1  | Model 2  | Model 3  |
|-------------|----------|----------|----------|
| (Intercept) | -5.060   | 1.949    | -4.102   |
|             | (7.821)  | (7.459)  | (6.213)  |
| x1          | 0.742    | 0.784    | 0.851    |
|             | (0.161)  | (0.159)  | (0.148)  |
| x2          | -0.781   | -0.107   | -0.149   |
|             | (1.778)  | (1.681)  | (1.419)  |
| x3          | 1.289    | 0.051    | 0.701    |
|             | (1.542)  | (1.471)  | (1.258)  |
| Num.Obs.    | 250      | 247      | 232      |
| R2          | 0.081    | 0.092    | 0.126    |
| R2 Adj.     | 0.070    | 0.081    | 0.115    |
| AIC         | 1618.4   | 1569.0   | 1377.1   |
| BIC         | 1636.0   | 1586.6   | 1394.3   |
| Log.Lik.    | -804.180 | -779.516 | -683.537 |
| F           | 7.238    | 8.234    | 10.979   |
Last, but not least: checking for multicollinearity using the Variance Inflation Factor (VIF). A common rule of thumb is that a VIF above 5 (or, more leniently, 10) signals problematic multicollinearity.
car::vif(linearmodel1)
x1 x2 x3
1.014198 5.358846 5.359217
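The VIFs for x2 and x3 are just above that cutoff, so those two predictors are fairly collinear with each other. Finally, the simpleboot package can bootstrap the linear model as one way to address possible assumption violations. The call that creates the bootstrap object does not appear in the output below, so here is a minimal sketch of the setup; the number of replicates (1,000) and the seed are assumptions:
library(simpleboot)
set.seed(1234)                                # hypothetical seed, for reproducibility
bootstrap <- lm.boot(linearmodel1, R = 1000)  # case (row) bootstrap: resample rows and refit the model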
Then summarize the bootstrap object to see the original coefficients alongside their bootstrap standard deviations:
summary(bootstrap)
BOOTSTRAP OF LINEAR MODEL (method = rows)
Original Model Fit
------------------
Call:
lm(formula = y ~ x1 + x2 + x3, data = skew1)
Coefficients:
(Intercept) x1 x2 x3
-5.0601 0.7425 -0.7812 1.2894
Bootstrap SD's:
(Intercept) x1 x2 x3
7.5243693 0.2491075 1.6874591 1.6212080
Use perc.lm to pull percentile-based estimates from the bootstrap (the 2.5th, 50th, and 97.5th percentiles give the median and a 95% confidence interval for each coefficient) and examine the results:
perc.lm(bootstrap, p = c(0.025, 0.50, 0.975))
(Intercept) x1 x2 x3
2.5% -19.635413 0.2525468 -4.0837777 -1.689145
50% -4.728313 0.7565141 -0.7065138 1.198821
97.5% 8.918464 1.2451756 2.3524789 4.307554