There are many potential issues that we can run into when dealing with real, messy data. Regression diagnostics refers to the process of diagnosing potential issues with the data and deciding how to handle them.
We’ll cover the following regression diagnostics: outliers, missing data, non-linearity, non-normally distributed residuals, heteroskedasticity, non-independence of errors, and multicollinearity.
For today’s lab, we’re going to be working with the variables
happiness and extraversion to demonstrate each
of the potential issues you can run into when fitting a linear
regression model.
Specifically, let’s say we’re interested in predicting
happiness (outcome variable) from extraversion
(predictor variable).
When performing regression diagnostics, outliers should be dealt with first because they can be the cause of some of the other issues that we run into, like non-normality or heteroskedasticity.
Import data
data_out <- import("outliers.csv")
head(data_out)
## id extraversion happiness
## 1 1 66 68
## 2 2 69 78
## 3 3 56 70
## 4 4 91 88
## 5 5 97 90
## 6 6 47 63
Two types of outliers:
Univariate outliers: observations with an unusual value on a single variable, either a predictor or the outcome variable
Multivariate outliers: observations with an unusual set of values compared to where the model lies
To visualize univariate outliers, we can use a boxplot for each of our variables.
ggplot(data_out) +
aes(y = happiness) +
geom_boxplot() +
theme_minimal()
ggplot(data_out) +
aes(y = extraversion) +
geom_boxplot() +
theme_minimal()
Question: Do you see any observations that are potential univariate outliers?
To quantitatively assess which observations may be potential outliers on a predictor variable, we can calculate each observation’s leverage (also called its hat value).
Leverage values measure how strongly a particular observation’s value on the predictor variable contributes to the estimation of the model’s slope, where \(X_i\) values further from \(\bar{X}\) contribute more. We don’t want an outlying value on a predictor to have too much influence on the estimate of the model’s overall slope.
To get leverage values, we’ll first fit the model. Then, we’ll get
regression diagnostic statistics for the model by passing it to the
augment() function from the broom package. The
.hat column contains each observation’s
leverage.
# First, fit the model
model_out <- lm(happiness ~ extraversion, data = data_out)
# Obtain a table of regression diagnostics
diagnostics <- augment(model_out)
diagnostics
## # A tibble: 165 × 8
## happiness extraversion .fitted .resid .hat .sigma .cooksd .std.resid
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 68 66 72.1 -4.13 0.00785 11.2 0.000543 -0.370
## 2 78 69 73.3 4.66 0.00900 11.2 0.000795 0.418
## 3 70 56 68.1 1.90 0.00607 11.2 0.0000884 0.170
## 4 88 91 82.2 5.80 0.0261 11.2 0.00369 0.525
## 5 90 97 84.6 5.39 0.0334 11.2 0.00413 0.489
## 6 63 47 64.5 -1.48 0.00715 11.2 0.0000631 -0.132
## 7 60 28 56.8 3.18 0.0178 11.2 0.000743 0.286
## 8 65 63 70.9 -5.92 0.00699 11.2 0.000991 -0.531
## 9 75 72 74.5 0.454 0.0104 11.2 0.00000878 0.0408
## 10 83 72 74.5 8.45 0.0104 11.2 0.00304 0.759
## # ℹ 155 more rows
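If you would like to see where these values come from, the optional check below (a sketch, not part of the required workflow) compares the .hat column to base R’s hatvalues() function and to the simple-regression leverage formula.
# Optional check: the .hat column matches base R's hatvalues()
all.equal(unname(hatvalues(model_out)), diagnostics$.hat)
# For a simple regression, leverage can also be computed by hand:
# h_i = 1/n + (X_i - mean(X))^2 / sum((X - mean(X))^2)
x <- data_out$extraversion
h_manual <- 1/length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
head(h_manual)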
Let’s add the leverage column (.hat) to the original data frame so we can see which participant each leverage value belongs to.
# Add leverage column to original data
data_out <- cbind.data.frame(data_out, diagnostics$.hat)
# Organize the data by descending leverage values
data_out %>%
arrange(desc(diagnostics$.hat)) %>%
head(n = 20)
## id extraversion happiness diagnostics$.hat
## 1 88 13 45 0.03423379
## 2 5 97 90 0.03336919
## 3 23 97 85 0.03336919
## 4 160 15 35 0.03163427
## 5 162 15 40 0.03163427
## 6 134 16 85 0.03038169
## 7 54 94 78 0.02957885
## 8 139 17 37 0.02916056
## 9 165 17 91 0.02916056
## 10 106 18 32 0.02797088
## 11 163 18 45 0.02797088
## 12 4 91 88 0.02607157
## 13 20 91 75 0.02607157
## 14 25 91 83 0.02607157
## 15 115 20 46 0.02568587
## 16 149 20 87 0.02568587
## 17 72 22 60 0.02352666
## 18 92 22 43 0.02352666
## 19 152 22 44 0.02352666
## 20 161 22 95 0.02352666
Rules of thumb, especially for diagnostic tests, can vary depending on the source and are not necessarily universally agreed upon. However, they can be a good place to start when interpreting regression diagnostic tests for the first time.
Rule of thumb for interpreting leverage values: investigate observations whose leverage exceeds \(3p/n\), where \(p\) is the number of parameters in the model (here, 2: the intercept and the slope) and \(n\) is the sample size.
p <- 2
n <- nrow(data_out)
(3*p)/n
## [1] 0.03636364
This standard suggests we should investigate any observations with a leverage greater than 0.03636. None of our observations have a leverage that surpasses this value.
But, if we needed to, we could inspect observations with leverage values above this threshold by filtering them from the data set.
data_out %>%
filter(diagnostics$.hat > 0.03636)
## [1] id extraversion happiness diagnostics$.hat
## <0 rows> (or 0-length row.names)
If any observations had exceeded the threshold, this would allow us to inspect how those participants scored on all of the variables in the data set to gain insight into how to interpret their outlying value on our predictor variable.
To quantitatively assess which observations may be outliers on the
outcome variable, we can calculate each observation’s
studentized residual by passing our model to the
rstudent() function.
Studentized residuals compare, in standardized units, the distance between each actual score on Y, \(Y_i\), and the value of Y predicted by a model that excludes that observation, \(\hat{Y}_{(i)}\).
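To make that definition concrete, the optional sketch below spot-checks the studentized residual for the first observation by refitting the model without it and standardizing the resulting prediction error; this is just an illustration of what rstudent() computes.
# Optional check: leave observation 1 out, refit, and standardize its prediction error
i <- 1
fit_loo <- lm(happiness ~ extraversion, data = data_out[-i, ])
pred <- predict(fit_loo, newdata = data_out[i, ], se.fit = TRUE)
# deleted residual divided by the SE of prediction for a new observation
(data_out$happiness[i] - pred$fit) / sqrt(pred$se.fit^2 + pred$residual.scale^2)
# should match
rstudent(model_out)[i]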
# Obtain studentized residuals
student_resids <- rstudent(model_out)
# Combine with original data
data_out <- cbind.data.frame(data_out, student_resids)
# Organize the data by descending studentized residual values
data_out %>%
arrange(desc(abs(student_resids))) %>%
head(n = 20)
## id extraversion happiness diagnostics$.hat student_resids
## 1 161 22 95 0.023526661 3.819637
## 2 165 17 91 0.029160560 3.628564
## 3 149 20 87 0.025685870 3.101559
## 4 134 16 85 0.030381691 3.071177
## 5 43 53 42 0.006145777 -2.258516
## 6 159 66 50 0.007851814 -2.002909
## 7 150 28 78 0.017803844 1.924586
## 8 106 18 32 0.027970880 -1.899319
## 9 17 59 90 0.006272722 1.868277
## 10 117 69 53 0.009000323 -1.838297
## 11 140 81 98 0.016424890 1.798357
## 12 11 69 93 0.009000323 1.776115
## 13 44 32 39 0.014617639 -1.760220
## 14 74 38 80 0.010781842 1.730394
## 15 71 41 43 0.009288523 -1.720887
## 16 87 63 90 0.006986358 1.720660
## 17 22 75 58 0.012146501 -1.603405
## 18 155 50 48 0.006506884 -1.592439
## 19 113 27 39 0.018679021 -1.578172
## 20 160 15 35 0.031634272 -1.511761
Rule of thumb for interpreting studentized residuals: observations with studentized residuals larger than about 3 in absolute value are worth investigating further.
Let’s identify the observations in our data set with studentized residuals greater than 3 in absolute value so we can inspect them further:
data_out %>%
filter(abs(student_resids) > 3)
## id extraversion happiness diagnostics$.hat student_resids
## 1 134 16 85 0.03038169 3.071177
## 2 149 20 87 0.02568587 3.101559
## 3 161 22 95 0.02352666 3.819637
## 4 165 17 91 0.02916056 3.628564
Multivariate outliers are outliers that have an unusual combination of scores on the set of variables included in the model.
One way of assessing whether an observation is a multivariate outlier is by assessing whether or not it has undue influence on the fit of the model.
One straightforward way to visually inspect the data for multivariate outliers is to use bivariate scatterplots for each combination of outcome and predictor.
ggplot(data_out, aes(x = extraversion, y = happiness)) +
geom_point()
Question: Do you see any observations that are potential multivariate outliers?
Yes! There is a cluster of four individuals whose scores on extraversion and happiness do not fit the general pattern found in the data. These four people are very low on extraversion yet very high on happiness.
Cook’s Distance (aka, Cook’s D) summarizes how much the regression model would change if you removed a particular case. It examines how different all of the fitted values would be with versus without a particular case.
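For reference, Cook’s D can also be reconstructed from quantities already in the augment() output; the optional sketch below uses the identity \(D_i = \frac{r_i^2}{p} \times \frac{h_i}{1 - h_i}\), where \(r_i\) is the standardized residual, \(h_i\) is the leverage, and \(p\) is the number of model parameters.
# Optional check: rebuild Cook's D from leverage and standardized residuals
p <- 2  # number of parameters (intercept + slope)
d_manual <- (diagnostics$.std.resid^2 / p) * (diagnostics$.hat / (1 - diagnostics$.hat))
all.equal(d_manual, diagnostics$.cooksd)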
# Obtain Cook's D values
cooks_d <- cooks.distance(model_out)
# Combine with original data
data_out <- cbind.data.frame(data_out, cooks_d)
# Organize the data by descending Cook's D values
data_out %>%
arrange(desc(cooks_d)) %>%
head(n = 20)
## id extraversion happiness diagnostics$.hat student_resids cooks_d
## 1 165 17 91 0.029160560 3.628564 0.18400290
## 2 161 22 95 0.023526661 3.819637 0.16223203
## 3 134 16 85 0.030381691 3.071177 0.14050319
## 4 149 20 87 0.025685870 3.101559 0.12043312
## 5 106 18 32 0.027970880 -1.899319 0.05108585
## 6 160 15 35 0.031634272 -1.511761 0.03703766
## 7 150 28 78 0.017803844 1.924586 0.03302285
## 8 139 17 37 0.029160560 -1.399721 0.02925184
## 9 140 81 98 0.016424890 1.798357 0.02663819
## 10 113 27 39 0.018679021 -1.578172 0.02348920
## 11 44 32 39 0.014617639 -1.760220 0.02268929
## 12 162 15 40 0.031634272 -1.052279 0.01807441
## 13 74 38 80 0.010781842 1.730394 0.01612054
## 14 22 75 58 0.012146501 -1.603405 0.01565488
## 15 159 66 50 0.007851814 -2.002909 0.01558601
## 16 43 53 42 0.006145777 -2.258516 0.01538436
## 17 117 69 53 0.009000323 -1.838297 0.01512490
## 18 50 84 93 0.018988664 1.230476 0.01460727
## 19 11 69 93 0.009000323 1.776115 0.01413818
## 20 71 41 43 0.009288523 -1.720887 0.01371764
Rule of thumb for interpreting Cook’s D: observations with Cook’s D values greater than 0.5 are worth investigating further, and values greater than 1 are often taken to indicate a highly influential observation.
Let’s investigate whether any of the observations have Cook’s D values above this threshold.
data_out %>%
filter(cooks_d > 0.5)
## [1] id extraversion happiness diagnostics$.hat
## [5] student_resids cooks_d
## <0 rows> (or 0-length row.names)
None of our observations exceed that threshold. Among the observations with the highest Cook’s D values, however, there appears to be a jump from about 0.05 to 0.12, so let’s inspect the observations above 0.05.
data_out %>%
filter(cooks_d > 0.05)
## id extraversion happiness diagnostics$.hat student_resids cooks_d
## 1 106 18 32 0.02797088 -1.899319 0.05108585
## 2 134 16 85 0.03038169 3.071177 0.14050319
## 3 149 20 87 0.02568587 3.101559 0.12043312
## 4 161 22 95 0.02352666 3.819637 0.16223203
## 5 165 17 91 0.02916056 3.628564 0.18400290
Once potential outliers have been identified, we have three choices for how to deal with them: leave them in, correct them, or remove them.
With data errors (e.g., an incorrect value was entered), it’s very easy to decide what to do with the outlier: we either fix the value, if we can infer with great confidence what it should be, or we remove it.
With values that appear to reflect a real phenomenon (not an error), it’s much harder to decide what to do with that outlier.
Let’s examine which observations are outliers based on all three of our diagnostic tests:
# Outliers based on Predictor Variable (Leverage)
data_out %>%
filter(diagnostics$.hat > 0.03636)
## [1] id extraversion happiness diagnostics$.hat
## [5] student_resids cooks_d
## <0 rows> (or 0-length row.names)
# Outliers based on Distance from Model (Studentized Residuals)
data_out %>%
filter(abs(student_resids) > 3)
## id extraversion happiness diagnostics$.hat student_resids cooks_d
## 1 134 16 85 0.03038169 3.071177 0.1405032
## 2 149 20 87 0.02568587 3.101559 0.1204331
## 3 161 22 95 0.02352666 3.819637 0.1622320
## 4 165 17 91 0.02916056 3.628564 0.1840029
# Outliers based on Undue Model Influence (Cook's D)
data_out %>%
filter(cooks_d > 0.05)
## id extraversion happiness diagnostics$.hat student_resids cooks_d
## 1 106 18 32 0.02797088 -1.899319 0.05108585
## 2 134 16 85 0.03038169 3.071177 0.14050319
## 3 149 20 87 0.02568587 3.101559 0.12043312
## 4 161 22 95 0.02352666 3.819637 0.16223203
## 5 165 17 91 0.02916056 3.628564 0.18400290
Question: Which observations are outliers based on all three of our diagnostic tests?
Participants 134, 149, 161, and 165. These are the four people we identified earlier with low extraversion and extremely high happiness, a combination that doesn’t fit the general pattern in the data (as extraversion increases, happiness tends to increase).
Question: What do you think we should do with these outliers? What is your rationale for that decision?
There’s a lot of discussion about what to do with outliers. Should we leave them in the analysis because they reflect a real phenomenon that we want our model to capture? Should we remove them from the analysis because they’re having a strong influence on the model’s overall fit and don’t fit the pattern demonstrated by the majority of participants’ scores on the variables?
Whatever you decide, be able to provide rationale for why you made that decision.
Import data
data_miss <- import("miss.csv")
head(data_miss)
## id extraversion happiness
## 1 1 66 68
## 2 2 69 78
## 3 3 56 70
## 4 4 91 NA
## 5 5 NA 90
## 6 6 47 63
Missing values are indicated in R by an NA in the data
frame.
We can visualize the rows with missing data on each variable using
the vis_miss() function from the naniar
package. It also provides a measure of the percentage of missing data on
each variable.
data_miss %>%
vis_miss()
It looks like extraversion is missing 8% of responses and happiness is missing 5%.
Listwise deletion excludes all rows from the analysis that have NAs for any of the variables included in the model.
Listwise deletion is not recommended unless the percentage of missing data is very small because it 1) reduces power by fitting the model to a smaller sample, and 2) can lead to biased parameter estimates.
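To see how many rows listwise deletion would actually use here, a quick check with base R’s complete.cases() (just a sketch; the counts should line up with the “observations deleted due to missingness” note in the model summary below):
# Number of rows with complete data (what listwise deletion keeps)
sum(complete.cases(data_miss))
# Number of rows that would be dropped
sum(!complete.cases(data_miss))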
The lm function can be told how to handle missing values
using the na.action argument. By default, the
na.action argument is most likely set to
na.omit unless you have changed that default value. Let’s
check.
getOption("na.action")
## [1] "na.omit"
na.omit means the model will use listwise deletion. If
you don’t choose another method of handling missing data, then you’re
going with the default of listwise deletion.
model_listwise <- lm(happiness ~ extraversion, data = data_miss)
summary(model_listwise)
##
## Call:
## lm(formula = happiness ~ extraversion, data = data_miss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.707 -5.706 0.095 5.893 33.293
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.50296 3.23831 14.051 < 2e-16 ***
## extraversion 0.40007 0.05297 7.552 7.85e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.872 on 125 degrees of freedom
## (20 observations deleted due to missingness)
## Multiple R-squared: 0.3133, Adjusted R-squared: 0.3078
## F-statistic: 57.03 on 1 and 125 DF, p-value: 7.848e-12
Social scientists have increasingly encouraged the community to use either multiple imputation or full information maximum likelihood (FIML) to handle missing data as opposed to the traditional approach of listwise deletion. Studies have found that these two methods for handling missing data 1) maintain better power, and 2) produce less biased parameter estimates.
We’re going to cover multiple imputation in this lab. You’ll learn how to use FIML, which is another great way of handling missing data, in 613 when you cover multilevel modeling.
Multiple imputation can be performed using functions from the mice package. It occurs in a few steps: 1) impute several complete copies of the data set (filling in the missing values), 2) fit the regression model to each imputed data set, and 3) pool the parameter estimates and standard errors across the fitted models.
Missing values will be predicted based on the associations
observed for other participants in the data set plus random noise to
reflect the uncertainty in predicting these values. Set a seed
when imputing the data sets if you would like the results to be
replicable using the seed argument.
We also have to choose the number of data sets to impute using
the m argument. A commonly chosen value is 5, but there are
arguments for using larger numbers of imputations as the proportion of
missing data increases.
# Impute m = 5 data sets
imp <- mice(data_miss, m = 5, seed = 382, print = FALSE)
# Fit the model to each imputed data set (note: no data argument inside lm();
# with() supplies each completed data set in turn)
model <- with(imp, lm(happiness ~ extraversion))
Finally, we can average the results across all of the imputed models to get a final set of parameter estimates and a test of their significance.
pooled_model <- pool(model)
# Summary() Output
summary(pooled_model)
## term estimate std.error statistic df p.value
## 1 (Intercept) 45.5029606 3.2383114 14.051447 123.0345 2.409915e-27
## 2 extraversion 0.4000701 0.0529747 7.552098 123.0345 8.372387e-12
The parameter estimates are simply averaged across each of the estimated models.
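You can verify this yourself: the individual fitted models are stored in the analyses element of the object returned by with(), so averaging their slope estimates should reproduce the pooled estimate (an optional check).
# Optional check: the pooled slope is the average of the per-imputation slopes
slopes <- sapply(model$analyses, function(fit) coef(fit)["extraversion"])
slopes
mean(slopes)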
The calculation of the pooled degrees of freedom gets… hairy. See Grund, Lüdtke, and Robitzsch (2016) and Li et al. (1991).
One of the assumptions underlying linear regression is that the form of the relationship between the predictor(s) and outcome is correctly specified. Most often, we assume the relationship is linear. If the actual form of the relationship is non-linear, then the linear model estimates could be quite inaccurate.
If a linear model is used to fit the relationship between the predictor(s) and outcome variable but the relationship actually follows a non-linear pattern, this will show up as a systematic pattern in the residuals plot.
Import data
data_nonlin <- import("nonlinear.csv")
head(data_nonlin)
## id extraversion happiness
## 1 1 66 72.49974
## 2 2 69 88.45730
## 3 3 56 37.50700
## 4 4 91 238.90074
## 5 5 97 311.96884
## 6 6 47 18.14539
Fit model
model_nonlin <- lm(happiness ~ extraversion, data = data_nonlin)
Examine residuals plot
plot(model_nonlin, 1)
If the linear model is a good fit to the form of the relationship, there will be no discernible pattern in the residuals plot. If there is a systematic pattern, that is an indication that there may be a non-linear trend not being captured by the linear model. Our model certainly seems to show a systematic pattern in its residuals!
We should have started out with a visual inspection of the relationship between the predictor and outcome variable to see if it fit the assumption of linearity before fitting the model. Let’s look at it now:
ggplot(data_nonlin, aes(x = extraversion, y = happiness)) +
geom_point()
Potential Solution: If your residuals plot suggests the relationship is non-linear, then consider fitting a non-linear model that more accurately captures the form of the relationship between the predictor(s) and outcome. We don’t get to discuss fitting non-linear models much in this course, so I would recommend looking into this more if you are interested. You’ll get to discuss it a little more with logistic regression in 613, though. The other option would be to transform the variable(s) so that the relationship between them fits the pattern of a straight line.
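As one illustration of the first option (a sketch only; whether a quadratic is the right functional form for these data is a judgment call), you could add a polynomial term to the model and re-check the residuals plot:
# A possible non-linear specification: add a quadratic term for extraversion
model_quad <- lm(happiness ~ extraversion + I(extraversion^2), data = data_nonlin)
# Re-examine the residuals plot for any remaining systematic pattern
plot(model_quad, 1)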
Another assumption underlying linear regression is that the model’s residuals are normally distributed.
To diagnose this problem, you can either plot the distribution of the model’s residuals or examine a QQ-plot of the residuals.
Import data
data_nonnorm <- import("nonnormal.csv")
head(data_nonnorm)
## id extraversion happiness
## 1 1 66 68.44909
## 2 2 69 78.10721
## 3 3 56 79.77801
## 4 4 91 88.05083
## 5 5 97 90.61513
## 6 6 47 63.17076
Fit the model
model_nonnorm <- lm(happiness ~ extraversion, data = data_nonnorm)
Plotting a distribution of the residuals
# storing residuals
regr_diags <- augment(model_nonnorm)
regr_diags
## # A tibble: 147 × 8
## happiness extraversion .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 68.4 66 77.1 -8.61 0.00788 17.0 1.03e-3 -0.510
## 2 78.1 69 78.1 0.0522 0.00907 17.0 4.37e-8 0.00309
## 3 79.8 56 73.7 6.05 0.00708 17.0 4.58e-4 0.358
## 4 88.1 91 85.4 2.67 0.0313 17.0 4.12e-4 0.160
## 5 90.6 97 87.4 3.23 0.0414 17.0 8.19e-4 0.195
## 6 63.2 47 70.7 -7.56 0.0105 17.0 1.07e-3 -0.448
## 7 62.0 28 64.4 -2.42 0.0309 17.0 3.35e-4 -0.145
## 8 65.1 63 76.1 -11.0 0.00712 17.0 1.51e-3 -0.650
## 9 75.6 72 79.1 -3.47 0.0107 17.0 2.28e-4 -0.205
## 10 85.3 72 79.1 6.20 0.0107 17.0 7.32e-4 0.368
## # ℹ 137 more rows
# plotting histogram of residuals
ggplot(data = regr_diags, aes(x = .resid)) +
geom_density(fill = "purple") +
stat_function(linetype = 2,
fun = dnorm,
args = list(mean = mean(regr_diags$.resid),
sd = sd(regr_diags$.resid))) +
theme_minimal()
Examining a QQ-Plot
ggplot(model_nonnorm) +
geom_abline(color = "turquoise4") +
stat_qq(aes(sample = .stdresid), color = "darkorchid4", alpha = .50) +
theme_minimal()
Question: Do you see evidence of non-normality in the residuals based on the distribution and QQ plot?
Yes, there appear to be one or two individuals with extremely large residuals that are making the distribution of residuals positively skewed.
Potential Solution: Linear regression is robust to some degree of violation of the normality assumption. If you see evidence of non-normality, make sure you have dealt with (or are aware of) any potential outliers that could be driving the non-normality in the residuals. If the non-normality is not due to outliers, then you can apply a transformation to Y and/or X to try to produce a model with normally distributed residuals.
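As an illustration of the transformation option (a sketch only, assuming all happiness scores are positive so the log is defined), you could log-transform the outcome and re-check the residual distribution:
# A possible remedy if the skew is not outlier-driven: log-transform Y
model_log <- lm(log(happiness) ~ extraversion, data = data_nonnorm)
# Re-check the distribution of the residuals
ggplot(augment(model_log), aes(x = .resid)) +
  geom_density(fill = "purple") +
  theme_minimal()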
Let’s see if we can gain more insight into the cause of the
non-normality by looking at a scatterplot of the relationship between
extraversion and happiness in this data
set.
ggplot(data_nonnorm, aes(x = extraversion, y = happiness)) +
geom_point()
Question: Do you notice anything in the scatterplot that could be driving the non-normality in the residuals?
Yes - we missed a potential multivariate outlier!
Another assumption underlying linear regression is called homoscedasticity, which is the assumption that the variance of the outcome is approximately the same across all values of the predictor(s). Heteroskedasticity is the violation of this assumption. The accuracy of the model’s standard errors and confidence intervals depends on this assumption being met.
The residuals plot can also be used to check for heteroskedasticity.
Import data
data_heteroskedastic <- import("heteroskedastic.csv")
head(data_heteroskedastic)
## id extraversion happiness
## 1 1 66 43.19
## 2 2 69 120.93
## 3 3 56 55.53
## 4 4 91 112.42
## 5 5 97 19.37
## 6 6 47 58.42
Fit the model
model_heteroskedastic <- lm(happiness ~ extraversion, data = data_heteroskedastic)
Examine the residuals plot
plot(model_heteroskedastic, 1)
Question: Do you see any evidence of heteroskedasticity?
Yes, the fan-shaped residuals illustrate heteroskedasticity. Residuals are more spread out at higher values of extraversion than they are at lower values of extraversion.
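If you want a numerical check to go along with the plot, one option (assuming you have the lmtest package installed; it isn’t otherwise used in this lab) is the Breusch-Pagan test, where a small p-value suggests the residual variance is not constant:
# Optional: Breusch-Pagan test for heteroskedasticity (requires lmtest)
library(lmtest)
bptest(model_heteroskedastic)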
Potential Solution: Sometimes, dealing with non-normality can also treat heteroskedasticity. If the heteroskedasticity remains, though, then one option for treating it is to fit your model using weighted least squares.
Weighted least squares assigns a weight to each observation to be used when finding the best-fitting model. A common weight that’s used when the residuals are heteroskedastic is \(1/X^2\).
model_weighted <- lm(happiness ~ extraversion, data = data_heteroskedastic, weights = 1/extraversion^2)
summary(model_weighted)
##
## Call:
## lm(formula = happiness ~ extraversion, data = data_heteroskedastic,
## weights = 1/extraversion^2)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.53798 -0.36680 -0.06792 0.35110 1.20461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.3092 5.5239 6.392 2.1e-09 ***
## extraversion 0.6171 0.1120 5.509 1.6e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5256 on 145 degrees of freedom
## Multiple R-squared: 0.1731, Adjusted R-squared: 0.1674
## F-statistic: 30.35 on 1 and 145 DF, p-value: 1.601e-07
Another assumption underlying linear regression is that the errors are independent of each other. Non-independence can occur if people’s scores on the outcome variable are related to each other for some reason (e.g., the same person responds multiple times or you measured responses from clusters of individuals in the same setting).
The design of the study is the most obvious way to know whether to expect non-independence among one’s residuals. For instance, if the researcher intentionally collected multiple scores from the same participants, they should not expect participants’ errors to be independent.
To test for non-independence when the research design does not inform you, you can look at a plot of the residuals with a variable that the errors should not correspond to (e.g., ID number, time). This is an imperfect way of deducing issues of non-independence, though, and the best way of ensuring independence occurs during data collection.
Let’s get some more insight, though, by plotting residuals against ID numbers.
Import data
data_nonind <- import("nonindependent.csv")
head(data_nonind)
## ID extraversion happiness V4 V5
## 1 1 low_ext 5 NA 3.520833
## 2 2 low_ext 6 NA NA
## 3 3 low_ext 2 NA NA
## 4 4 low_ext 1 NA NA
## 5 5 low_ext 2 NA NA
## 6 6 low_ext 2 NA NA
Fit the model
model_nonind <- lm(happiness ~ extraversion, data = data_nonind)
Plot the residuals by ID
# storing residuals
regr_diags <- augment(model_nonind)
regr_diags
## # A tibble: 48 × 8
## happiness extraversion .fitted .resid .hat .sigma .cooksd .std.resid
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 low_ext 3.06 1.94 0.0625 2.08 0.0206 0.962
## 2 6 low_ext 3.06 2.94 0.0625 2.05 0.0472 1.46
## 3 2 low_ext 3.06 -1.06 0.0625 2.10 0.00618 -0.527
## 4 1 low_ext 3.06 -2.06 0.0625 2.08 0.0233 -1.02
## 5 2 low_ext 3.06 -1.06 0.0625 2.10 0.00618 -0.527
## 6 2 low_ext 3.06 -1.06 0.0625 2.10 0.00618 -0.527
## 7 5 low_ext 3.06 1.94 0.0625 2.08 0.0206 0.962
## 8 6 low_ext 3.06 2.94 0.0625 2.05 0.0472 1.46
## 9 1 low_ext 3.06 -2.06 0.0625 2.08 0.0233 -1.02
## 10 1 low_ext 3.06 -2.06 0.0625 2.08 0.0233 -1.02
## # ℹ 38 more rows
# Add ID column
regr_diags$ID <- data_nonind$ID
# Plot residuals by ID
ggplot(data = regr_diags, aes(x = ID, y = .resid)) +
geom_point() +
geom_smooth(se = F) +
geom_hline(yintercept = 0)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
It looks like adjacent residuals tend to be similar to each other, which could indicate that participants came and participated in pairs or that the same individual completed the study back-to-back.
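If you want a rough numerical check to accompany the plot (again assuming the lmtest package is available), the Durbin-Watson test looks for first-order autocorrelation among the residuals in their row order; values far from 2 suggest adjacent residuals are related:
# Optional: Durbin-Watson test for autocorrelation among adjacent residuals
library(lmtest)
dwtest(model_nonind)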
Potential Solution: Relationships among participants’ residuals can be driven by 1) repeated measures from the same (or related) participants, and 2) clustering among participants. The first source of non-independence can be dealt with by using a repeated-measures or longitudinal analysis. The second source of non-independence can be dealt with by using multilevel modeling. You’ll discuss both of these analyses more in 613.
Multicollinearity occurs when two or more predictors in our model are highly correlated with each other. This causes an issue because 1) it becomes more difficult to estimate the unique relationship between each predictor variable and the outcome, and 2) high multicollinearity inflates standard errors and widens confidence intervals, which makes it more difficult to detect the significance of predictors.
For this example, we’re using a data set that has two predictors of
happiness, extraversion and
social_engagement.
Import data
data_multicoll <- import("multicollinear.csv")
head(data_multicoll)
## id happiness extraversion social_engagement
## 1 1 68 66 72
## 2 2 78 69 85
## 3 3 70 56 63
## 4 4 88 91 104
## 5 5 90 97 100
## 6 6 63 47 56
First, you can simply look at a correlation matrix of your predictors.
data_multicoll %>%
cor()
## id happiness extraversion social_engagement
## id 1.00000000 -0.07198186 -0.0501857 -0.06811339
## happiness -0.07198186 1.00000000 0.5591066 0.53413605
## extraversion -0.05018570 0.55910662 1.0000000 0.96193732
## social_engagement -0.06811339 0.53413605 0.9619373 1.00000000
The variables extraversion and
social_engagement appear to be VERY highly correlated
(r = 0.96)!
We can’t rely only on correlation matrices to identify collinearity, because it is possible for multicollinearity to exist among three or more variables (even if each pair of variables is not highly correlated).
Thus, we also want to calculate numerical measures of multicollinearity, including tolerance and VIFs.
Fit the model
model_multicoll <- lm(happiness ~ extraversion + social_engagement, data = data_multicoll)
Examining tolerance & VIFs
ols_vif_tol(model_multicoll)
## Variables Tolerance VIF
## 1 extraversion 0.07467659 13.39108
## 2 social_engagement 0.07467659 13.39108
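To see where these numbers come from: with two predictors, each predictor’s tolerance is \(1 - R^2\) from regressing it on the other predictor, and the VIF is the reciprocal of the tolerance (which is why both rows are identical here). A quick check:
# VIF by hand: regress one predictor on the other
r2 <- summary(lm(extraversion ~ social_engagement, data = data_multicoll))$r.squared
1 - r2        # tolerance
1 / (1 - r2)  # VIF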
Rule of thumb for interpreting multicollinearity: Either a low tolerance (below 0.20 is one rule of thumb) or a high VIF (above 5 or 10) is an indication of a problem with multicollinearity.
Potential solution: One potential solution for handling multicollinearity is to combine predictor variables that appear to be measuring very similar conceptual constructs. The other potential solution is to drop one of the variables from the analysis that is of less theoretical importance.
Let’s combine extraversion and
social_engagement into a single variable that we’ll call
sociality and use this new variable to predict
happiness.
Since the predictor variables could originally have been measured in different units, we’ll first standardize them (i.e., convert them into z-scores) before combining participants’ scores on the two variables. There are different ways of combining scores across variables; here, we will simply average them.
data_multicoll <- data_multicoll %>%
mutate(ext_std = scale(extraversion, center = TRUE, scale = TRUE),
soc_eng_std = scale(social_engagement, center = TRUE, scale = TRUE),
sociality = ((ext_std + soc_eng_std)/2))
model_sociality <- lm(happiness ~ sociality, data = data_multicoll)
Look at how the summary() output for the two models differs.
To compare the two models, let’s fit the first model with standardized scores on each predictor so their scales are comparable to that of the predictor in the second model.
model_multicoll <- lm(happiness ~ ext_std + soc_eng_std, data = data_multicoll)
summary(model_multicoll)
##
## Call:
## lm(formula = happiness ~ ext_std + soc_eng_std, data = data_multicoll)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.229 -5.918 0.354 5.718 32.578
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.7143 0.8128 85.769 <2e-16 ***
## ext_std 7.1619 2.9846 2.400 0.0177 *
## soc_eng_std -0.5833 2.9846 -0.195 0.8453
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.855 on 144 degrees of freedom
## Multiple R-squared: 0.3128, Adjusted R-squared: 0.3032
## F-statistic: 32.77 on 2 and 144 DF, p-value: 1.865e-12
summary(model_sociality)
##
## Call:
## lm(formula = happiness ~ sociality, data = data_multicoll)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.090 -6.255 0.906 5.818 34.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.7143 0.8148 85.558 < 2e-16 ***
## sociality 6.5787 0.8255 7.969 4.31e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.879 on 145 degrees of freedom
## Multiple R-squared: 0.3046, Adjusted R-squared: 0.2998
## F-statistic: 63.51 on 1 and 145 DF, p-value: 4.311e-13
Question: What do you notice about the difference between the two model outputs?
The standard errors are much larger in the model containing both highly correlated predictors.