This week, we focus on checking the assumptions of regression models. Many of these techniques are also useful for models other than the linear ones we have learned up to this point. We will use a dataset with a lot of known problems as our demo, to show what violations of the assumptions “look” like.
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy dplyr
  print.tbl_sql  dplyr
── Attaching packages ────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.3 ✓ purrr 0.3.4
✓ tibble 3.0.6 ✓ dplyr 1.0.4
✓ tidyr 1.1.2 ✓ stringr 1.4.0
✓ readr 1.4.0 ✓ forcats 0.5.1
── Conflicts ───────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Registered S3 method overwritten by 'htmlwidgets':
  method           from
  print.htmlwidget tools:rstudio
Attaching package: ‘Hmisc’
The following objects are masked from ‘package:dplyr’:
src, summarize
The following objects are masked from ‘package:base’:
format.pval, units
library(psych)
Attaching package: ‘psych’
The following object is masked from ‘package:Hmisc’:
describe
The following objects are masked from ‘package:ggplot2’:
%+%, alpha
library(dplyr)
library(ggplot2)
Use psych::describe to get descriptive statistics for skew1, the (already loaded) demo dataset:
psych::describe(skew1)
Let’s start with a regular multiple linear regression (MLR). We will use this model as our demo and check the assumptions for it:
linearmodel1 <- lm(y ~ x1 + x2 + x3, data = skew1)
summary(linearmodel1)
Call:
lm(formula = y ~ x1 + x2 + x3, data = skew1)
Residuals:
Min 1Q Median 3Q Max
-20.2324 -3.2209 0.6368 4.1809 13.4686
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.0601 7.8212 -0.647 0.518
x1 0.7425 0.1614 4.601 6.74e-06 ***
x2 -0.7812 1.7776 -0.439 0.661
x3 1.2894 1.5421 0.836 0.404
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.085 on 246 degrees of freedom
Multiple R-squared: 0.0811, Adjusted R-squared: 0.0699
F-statistic: 7.238 on 3 and 246 DF, p-value: 0.0001131
Use the augment function from the broom package to get predicted values, residuals, Cook’s Distances, and other diagnostics. We also use mutate to generate an ID number for our new “augmented” dataset.
library(broom)
diagnostics <- augment(linearmodel1) %>%
mutate(.,
ID = row_number())
glimpse(diagnostics)
Rows: 250
Columns: 11
$ y <dbl> 3.52, 5.17, -3.00, 2.08, 10.45, 8.22, 0.99, 5.14,…
$ x1 <dbl> 0.43, 2.46, -0.09, 5.99, 2.82, -0.05, 3.22, 3.85,…
$ x2 <dbl> 11.012794, 10.521316, 10.148856, 9.138934, 9.6354…
$ x3 <dbl> 11.612148, 11.654059, 10.942583, 10.113733, 10.41…
$ .fitted <dbl> 1.629127, 3.574360, 1.054567, 5.289135, 2.939575,…
$ .resid <dbl> 1.89087332, 1.59563982, -4.05456701, -3.20913506,…
$ .hat <dbl> 0.034030296, 0.010170061, 0.010185926, 0.02359367…
$ .sigma <dbl> 6.096286, 6.096664, 6.091963, 6.093994, 6.078457,…
$ .cooksd <dbl> 8.803705e-04, 1.784325e-04, 1.153942e-03, 1.72072…
$ .std.resid <dbl> 0.316163414, 0.263563643, -0.669728223, -0.533707…
$ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
Start with the normality of the residuals, using a histogram of the standardized residuals, a normal Q-Q plot, and the Shapiro-Wilk test.
ggplot(data = diagnostics, mapping = aes(x = .std.resid)) +
geom_histogram(binwidth = .25) +
labs(title = "Histogram of Residuals for Model 1",
x = "Residual Value") +
geom_vline(xintercept = c(-2.5, 2.5), linetype = "dotted", color = "red")
plot(linearmodel1, 2)  # base R: normal Q-Q plot of the residuals
shapiro.test(diagnostics$.resid)
Shapiro-Wilk normality test
data: diagnostics$.resid
W = 0.96122, p-value = 2.838e-06
The Shapiro-Wilk p-value is far below .05, so we reject the null hypothesis that the residuals are normally distributed. Next, an RVF (residuals vs. fitted) plot and related statistical tests are useful for checking that there is no heteroskedasticity and that the relationship appears linear.
ggplot(data = diagnostics, mapping = aes(x = .fitted, y = .resid)) +
geom_point() +
geom_smooth(method = "loess") +
labs(title = "RVF Plot for Model 1",
x = "Predicted Value",
y = "Residual Value")
Or, in base R:
plot(linearmodel1, 1)  # residuals vs. fitted plot
Ideally, the residual plot will show no fitted pattern. That is, the red line should be approximately horizontal at zero. The presence of a pattern may indicate a problem with some aspect of the linear model.
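To make that reference explicit, the same RVF plot can be redrawn with a dashed horizontal line at zero; this is a small variation on the ggplot code above, not part of the original output:
ggplot(data = diagnostics, mapping = aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(method = "loess") +
  geom_hline(yintercept = 0, linetype = "dashed") +  # the loess line should stay close to this
  labs(title = "RVF Plot for Model 1",
       x = "Predicted Value",
       y = "Residual Value")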
library(lmtest)
Loading required package: zoo
Attaching package: ‘zoo’
The following objects are masked from ‘package:base’:
as.Date, as.Date.numeric
bptest(linearmodel1)
studentized Breusch-Pagan test
data: linearmodel1
BP = 1.3423, df = 3, p-value = 0.7191
Look at the p-value to determine whether or not you should reject the null hypothesis, which here is that there is no heteroskedasticity. If the p-value is less than your significance level, reject the null; with p = 0.72, we do not reject, so there is no evidence of heteroskedasticity in this model.
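A minimal sketch of that decision rule (the .05 cutoff is an assumed significance level, not stated above):
bp <- bptest(linearmodel1)
bp$p.value < 0.05   # FALSE here (p = 0.7191): fail to reject the null of homoskedasticity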
The Ramsey RESET test checks the functional form; its null hypothesis is that the model is correctly specified (no omitted nonlinear terms of the regressors). Pass it the fitted model:
resettest(linearmodel1, power = 2, type = "regressor")
RESET test
data: diagnostics
RESET = 38.284, df1 = 10, df2 = 236, p-value < 2.2e-16
ggplot(data = diagnostics, mapping = aes(x = .fitted, y = .cooksd, label = ID)) +
geom_point() + geom_text(nudge_x = .25) +
labs(title = "Cook's Distance Plot for Model",
x = "Predicted Value",
y = "Cook's Distance") +
geom_hline(yintercept = 4/250, linetype = "dotted", color = "red")
Two rules of thumb for interpreting Cook’s Distance: 1) any observation with a Cook’s Distance greater than 1 should be excluded from the model, and 2) remove any observations above the threshold of 4/n, where n is the sample size (here, 4/250 = .016).
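A quick sketch of applying the 4/n rule to the augmented data, listing the flagged observations (this code is not part of the original output):
cutoff <- 4 / nrow(diagnostics)   # 4/250 = .016
diagnostics %>%
  filter(., .cooksd > cutoff) %>%
  select(., ID, .cooksd) %>%
  arrange(., desc(.cooksd))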
See how these potential outliers might influence your results, if at all. The fancy word for this: sensitivity analysis. Are my results sensitive to the presence of outliers? Hopefully not, but sometimes yes!
summary(linearmodel1)
Call:
lm(formula = y ~ x1 + x2 + x3, data = skew1)
Residuals:
Min 1Q Median 3Q Max
-20.2324 -3.2209 0.6368 4.1809 13.4686
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.0601 7.8212 -0.647 0.518
x1 0.7425 0.1614 4.601 6.74e-06 ***
x2 -0.7812 1.7776 -0.439 0.661
x3 1.2894 1.5421 0.836 0.404
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.085 on 246 degrees of freedom
Multiple R-squared: 0.0811, Adjusted R-squared: 0.0699
F-statistic: 7.238 on 3 and 246 DF, p-value: 0.0001131
trimmed <- diagnostics %>%
  filter(., .cooksd < .10)   # first pass: drop only the most extreme influential observations
high.cooks.removed <- lm(y ~ x1 + x2 + x3, data = trimmed)
summary(high.cooks.removed)
Call:
lm(formula = y ~ x1 + x2 + x3, data = trimmed)
Residuals:
Min 1Q Median 3Q Max
-19.2380 -3.5036 0.5714 3.9238 13.4437
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.94881 7.45948 0.261 0.794
x1 0.78352 0.15885 4.932 1.51e-06 ***
x2 -0.10683 1.68127 -0.064 0.949
x3 0.05053 1.47075 0.034 0.973
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.727 on 243 degrees of freedom
Multiple R-squared: 0.09228, Adjusted R-squared: 0.08107
F-statistic: 8.234 on 3 and 243 DF, p-value: 3.065e-05
prod.trimmed <- diagnostics %>%
  filter(., .cooksd < .016)   # stricter pass: apply the 4/n = 4/250 = .016 threshold
all.removed <- lm(y ~ x1 + x2 + x3, data = prod.trimmed)
summary(all.removed)
Call:
lm(formula = y ~ x1 + x2 + x3, data = prod.trimmed)
Residuals:
Min 1Q Median 3Q Max
-11.3804 -3.5478 0.0409 3.2187 12.5848
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.1021 6.2131 -0.660 0.510
x1 0.8508 0.1485 5.731 3.15e-08 ***
x2 -0.1485 1.4194 -0.105 0.917
x3 0.7013 1.2585 0.557 0.578
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.646 on 228 degrees of freedom
Multiple R-squared: 0.1262, Adjusted R-squared: 0.1147
F-statistic: 10.98 on 3 and 228 DF, p-value: 9.224e-07
Use the modelsummary package to build a summary table comparing all three models and see how the model improved.
library(modelsummary)
Attaching package: ‘modelsummary’
The following object is masked from ‘package:psych’:
SD
The following object is masked from ‘package:Hmisc’:
Mean
models <- list(linearmodel1, high.cooks.removed, all.removed)
modelsummary(models, output = "markdown")
|             | Model 1  | Model 2  | Model 3  |
|-------------|----------|----------|----------|
| (Intercept) | -5.060   | 1.949    | -4.102   |
|             | (7.821)  | (7.459)  | (6.213)  |
| x1          | 0.742    | 0.784    | 0.851    |
|             | (0.161)  | (0.159)  | (0.148)  |
| x2          | -0.781   | -0.107   | -0.149   |
|             | (1.778)  | (1.681)  | (1.419)  |
| x3          | 1.289    | 0.051    | 0.701    |
|             | (1.542)  | (1.471)  | (1.258)  |
| Num.Obs.    | 250      | 247      | 232      |
| R2          | 0.081    | 0.092    | 0.126    |
| R2 Adj.     | 0.070    | 0.081    | 0.115    |
| AIC         | 1618.4   | 1569.0   | 1377.1   |
| BIC         | 1636.0   | 1586.6   | 1394.3   |
| Log.Lik.    | -804.180 | -779.516 | -683.537 |
| F           | 7.238    | 8.234    | 10.979   |
Last, but not least: checking for multicollinearity using the Variance Inflation Factor (VIF). A common rule of thumb is that a VIF above 5 (or, more leniently, 10) signals problematic multicollinearity.
car::vif(linearmodel1)
x1 x2 x3
1.014198 5.358846 5.359217
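The VIFs for x2 and x3 are just above that cutoff, so those two predictors are fairly collinear with each other. Finally, the simpleboot package can bootstrap the linear model as one way to address possible assumption violations. The call that creates the bootstrap object does not appear in the output below, so here is a minimal sketch of the setup; the number of replicates (1,000) and the seed are assumptions:
library(simpleboot)
set.seed(1234)                                # hypothetical seed, for reproducibility
bootstrap <- lm.boot(linearmodel1, R = 1000)  # case (row) bootstrap: resample rows and refit the model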
Then summarize the bootstrap object to see the original coefficients alongside their bootstrap standard deviations:
summary(bootstrap)
BOOTSTRAP OF LINEAR MODEL (method = rows)
Original Model Fit
------------------
Call:
lm(formula = y ~ x1 + x2 + x3, data = skew1)
Coefficients:
(Intercept) x1 x2 x3
-5.0601 0.7425 -0.7812 1.2894
Bootstrap SD's:
(Intercept) x1 x2 x3
7.5243693 0.2491075 1.6874591 1.6212080
Use perc.lm to pull percentile-based estimates from the bootstrap (the 2.5th, 50th, and 97.5th percentiles give the median and a 95% confidence interval for each coefficient) and examine the results:
perc.lm(bootstrap, p = c(0.025, 0.50, 0.975))
(Intercept) x1 x2 x3
2.5% -19.635413 0.2525468 -4.0837777 -1.689145
50% -4.728313 0.7565141 -0.7065138 1.198821
97.5% 8.918464 1.2451756 2.3524789 4.307554