There are many potential issues that we can run into when dealing with real, messy data. Regression diagnostics refers to the process of diagnosing potential issues with the data and deciding how to handle them.
We’ll cover the following regression diagnostics: outliers, missing data, non-linearity, non-normally distributed residuals, heteroskedasticity, non-independence of errors, and multicollinearity.
For today’s lab, we’re going to be working with the variables
happiness and extraversion to demonstrate each
of the potential issues you can run into when fitting a linear
regression model.
Specifically, let’s say we’re interested in predicting
happiness (outcome variable) from extraversion
(predictor variable).
When performing regression diagnostics, outliers should be dealt with first because they can be the cause of some of the other issues that we run into, like non-normality or heteroskedasticity.
Import data
data_out <- import("outliers.csv")
head(data_out)
## id extraversion happiness
## 1 1 66 68
## 2 2 69 78
## 3 3 56 70
## 4 4 91 88
## 5 5 97 90
## 6 6 47 63
Two types of outliers:
Univariate outliers: observations with an unusual value on a single variable, either a predictor or the outcome variable
Multivariate outliers: observations with an unusual set of values compared to where the model lies
To visualize univariate outliers, we can use a boxplot for each of our variables.
ggplot(data_out) +
aes(y = happiness) +
geom_boxplot() +
theme_minimal()
ggplot(data_out) +
aes(y = extraversion) +
geom_boxplot() +
theme_minimal()
Question: Do you see any observations that are potential univariate outliers?
To quantitatively assess which observations may be potential outliers on a predictor variable, we can calculate each observation’s leverage (also called its hat value).
Leverage values measure how strongly a particular observation’s value on the predictor variable contributes to the estimation of the model’s slope, where \(X_i\) values further from \(\bar{X}\) contribute more. We don’t want an outlying value on a predictor to have too much influence on the estimate of the model’s overall slope.
To get leverage values, we’ll first fit the model. Then, we’ll get
regression diagnostic statistics for the model by passing it to the
augment() function from the broom package. The
.hat column contains each observation’s
leverage.
# First, fit the model
model_out <- lm(happiness ~ extraversion, data = data_out)
# Obtain a table of regression diagnostics
diagnostics <- augment(model_out)
diagnostics
## # A tibble: 165 × 8
## happiness extraversion .fitted .resid .hat .sigma .cooksd .std.resid
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 68 66 72.1 -4.13 0.00785 11.2 0.000543 -0.370
## 2 78 69 73.3 4.66 0.00900 11.2 0.000795 0.418
## 3 70 56 68.1 1.90 0.00607 11.2 0.0000884 0.170
## 4 88 91 82.2 5.80 0.0261 11.2 0.00369 0.525
## 5 90 97 84.6 5.39 0.0334 11.2 0.00413 0.489
## 6 63 47 64.5 -1.48 0.00715 11.2 0.0000631 -0.132
## 7 60 28 56.8 3.18 0.0178 11.2 0.000743 0.286
## 8 65 63 70.9 -5.92 0.00699 11.2 0.000991 -0.531
## 9 75 72 74.5 0.454 0.0104 11.2 0.00000878 0.0408
## 10 83 72 74.5 8.45 0.0104 11.2 0.00304 0.759
## # ℹ 155 more rows
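If you would like to see where these values come from, the optional check below (a sketch, not part of the required workflow) compares the .hat column to base R’s hatvalues() function and to the simple-regression leverage formula.
# Optional check: the .hat column matches base R's hatvalues()
all.equal(unname(hatvalues(model_out)), diagnostics$.hat)
# For a simple regression, leverage can also be computed by hand:
# h_i = 1/n + (X_i - mean(X))^2 / sum((X - mean(X))^2)
x <- data_out$extraversion
h_manual <- 1/length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
head(h_manual)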
Let’s add the leverage column (.hat) to the original data frame so we can see which participant each leverage value belongs to.
# Add leverage column to original data
data_out <- cbind.data.frame(data_out, diagnostics$.hat)
# Organize the data by descending leverage values
data_out %>%
arrange(desc(diagnostics$.hat)) %>%
head(n = 20)
## id extraversion happiness diagnostics$.hat
## 1 88 13 45 0.03423379
## 2 5 97 90 0.03336919
## 3 23 97 85 0.03336919
## 4 160 15 35 0.03163427
## 5 162 15 40 0.03163427
## 6 134 16 85 0.03038169
## 7 54 94 78 0.02957885
## 8 139 17 37 0.02916056
## 9 165 17 91 0.02916056
## 10 106 18 32 0.02797088
## 11 163 18 45 0.02797088
## 12 4 91 88 0.02607157
## 13 20 91 75 0.02607157
## 14 25 91 83 0.02607157
## 15 115 20 46 0.02568587
## 16 149 20 87 0.02568587
## 17 72 22 60 0.02352666
## 18 92 22 43 0.02352666
## 19 152 22 44 0.02352666
## 20 161 22 95 0.02352666
Rules of thumb, especially for diagnostic tests, can vary depending on the source and are not necessarily universally agreed upon. However, they can be a good place to start when interpreting regression diagnostic tests for the first time.
Rule of thumb for interpreting leverage values: investigate observations whose leverage exceeds \(3p/n\), where \(p\) is the number of parameters in the model (here, 2: the intercept and the slope) and \(n\) is the sample size.
p <- 2
n <- nrow(data_out)
(3*p)/n
## [1] 0.03636364
This standard suggests we should investigate any observations with a leverage greater than 0.03636. None of our observations have a leverage that surpasses this value.
But, if we needed to, we could inspect observations with leverage values above this threshold by filtering them from the data set.
data_out %>%
filter(diagnostics$.hat > 0.03636)
## [1] id extraversion happiness diagnostics$.hat
## <0 rows> (or 0-length row.names)
If any observations had exceeded the threshold, this would allow us to inspect how those participants scored on all of the variables in the data set to gain insight into how to interpret their outlying value on our predictor variable.
To quantitatively assess which observations may be outliers on the
outcome variable, we can calculate each observation’s
studentized residual by passing our model to the
rstudent() function.
Studentized residuals compare, in standardized units, the distance between each actual score on Y, \(Y_i\), and the value of Y predicted by a model that excludes that observation, \(\hat{Y}_{(i)}\).
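To make that definition concrete, the optional sketch below spot-checks the studentized residual for the first observation by refitting the model without it and standardizing the resulting prediction error; this is just an illustration of what rstudent() computes.
# Optional check: leave observation 1 out, refit, and standardize its prediction error
i <- 1
fit_loo <- lm(happiness ~ extraversion, data = data_out[-i, ])
pred <- predict(fit_loo, newdata = data_out[i, ], se.fit = TRUE)
# deleted residual divided by the SE of prediction for a new observation
(data_out$happiness[i] - pred$fit) / sqrt(pred$se.fit^2 + pred$residual.scale^2)
# should match
rstudent(model_out)[i]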
# Obtain studentized residuals
student_resids <- rstudent(model_out)
# Combine with original data
data_out <- cbind.data.frame(data_out, student_resids)
# Organize the data by descending studentized residual values
data_out %>%
arrange(desc(abs(student_resids))) %>%
head(n = 20)
## id extraversion happiness diagnostics$.hat student_resids
## 1 161 22 95 0.023526661 3.819637
## 2 165 17 91 0.029160560 3.628564
## 3 149 20 87 0.025685870 3.101559
## 4 134 16 85 0.030381691 3.071177
## 5 43 53 42 0.006145777 -2.258516
## 6 159 66 50 0.007851814 -2.002909
## 7 150 28 78 0.017803844 1.924586
## 8 106 18 32 0.027970880 -1.899319
## 9 17 59 90 0.006272722 1.868277
## 10 117 69 53 0.009000323 -1.838297
## 11 140 81 98 0.016424890 1.798357
## 12 11 69 93 0.009000323 1.776115
## 13 44 32 39 0.014617639 -1.760220
## 14 74 38 80 0.010781842 1.730394
## 15 71 41 43 0.009288523 -1.720887
## 16 87 63 90 0.006986358 1.720660
## 17 22 75 58 0.012146501 -1.603405
## 18 155 50 48 0.006506884 -1.592439
## 19 113 27 39 0.018679021 -1.578172
## 20 160 15 35 0.031634272 -1.511761
Rule of thumb for interpreting studentized residuals: observations with studentized residuals larger than about 3 in absolute value are worth investigating further.
Let’s identify the observations in our data set with studentized residuals greater than 3 in absolute value so we can inspect them further:
data_out %>%
filter(abs(student_resids) > 3)
## id extraversion happiness diagnostics$.hat student_resids
## 1 134 16 85 0.03038169 3.071177
## 2 149 20 87 0.02568587 3.101559
## 3 161 22 95 0.02352666 3.819637
## 4 165 17 91 0.02916056 3.628564
Multivariate outliers are outliers that have an unusual combination of scores on the set of variables included in the model.
One way of assessing whether an observation is a multivariate outlier is by assessing whether or not it has undue influence on the fit of the model.
One straightforward way to visually inspect the data for multivariate outliers is to use bivariate scatterplots for each combination of outcome and predictor.
ggplot(data_out, aes(x = extraversion, y = happiness)) +
geom_point()
Question: Do you see any observations that are potential multivariate outliers?
Yes! There is a cluster of four individuals whose scores on extraversion and happiness do not fit the general pattern found in the data. These four people are very low on extraversion yet very high on happiness.
Cook’s Distance (aka, Cook’s D) summarizes how much the regression model would change if you removed a particular case. It examines how different all of the fitted values would be with versus without a particular case.
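For reference, Cook’s D can also be reconstructed from quantities already in the augment() output; the optional sketch below uses the identity \(D_i = \frac{r_i^2}{p} \times \frac{h_i}{1 - h_i}\), where \(r_i\) is the standardized residual, \(h_i\) is the leverage, and \(p\) is the number of model parameters.
# Optional check: rebuild Cook's D from leverage and standardized residuals
p <- 2  # number of parameters (intercept + slope)
d_manual <- (diagnostics$.std.resid^2 / p) * (diagnostics$.hat / (1 - diagnostics$.hat))
all.equal(d_manual, diagnostics$.cooksd)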
# Obtain Cook's D values
cooks_d <- cooks.distance(model_out)
# Combine with original data
data_out <- cbind.data.frame(data_out, cooks_d)
# Organize the data by descending Cook's D values
data_out %>%
arrange(desc(cooks_d)) %>%
head(n = 20)
## id extraversion happiness diagnostics$.hat student_resids cooks_d
## 1 165 17 91 0.029160560 3.628564 0.18400290
## 2 161 22 95 0.023526661 3.819637 0.16223203
## 3 134 16 85 0.030381691 3.071177 0.14050319
## 4 149 20 87 0.025685870 3.101559 0.12043312
## 5 106 18 32 0.027970880 -1.899319 0.05108585
## 6 160 15 35 0.031634272 -1.511761 0.03703766
## 7 150 28 78 0.017803844 1.924586 0.03302285
## 8 139 17 37 0.029160560 -1.399721 0.02925184
## 9 140 81 98 0.016424890 1.798357 0.02663819
## 10 113 27 39 0.018679021 -1.578172 0.02348920
## 11 44 32 39 0.014617639 -1.760220 0.02268929
## 12 162 15 40 0.031634272 -1.052279 0.01807441
## 13 74 38 80 0.010781842 1.730394 0.01612054
## 14 22 75 58 0.012146501 -1.603405 0.01565488
## 15 159 66 50 0.007851814 -2.002909 0.01558601
## 16 43 53 42 0.006145777 -2.258516 0.01538436
## 17 117 69 53 0.009000323 -1.838297 0.01512490
## 18 50 84 93 0.018988664 1.230476 0.01460727
## 19 11 69 93 0.009000323 1.776115 0.01413818
## 20 71 41 43 0.009288523 -1.720887 0.01371764
Rule of thumb for interpreting Cook’s D: observations with Cook’s D values greater than 0.5 are worth investigating further, and values greater than 1 are often taken to indicate a highly influential observation.
Let’s investigate whether any of the observations have Cook’s D values above this threshold.
data_out %>%
filter(cooks_d > 0.5)
## [1] id extraversion happiness diagnostics$.hat
## [5] student_resids cooks_d
## <0 rows> (or 0-length row.names)
None of our observations exceed that threshold. Among the observations with the highest Cook’s D values, however, there appears to be a jump from about 0.05 to 0.12, so let’s inspect the observations above 0.05.
data_out %>%
filter(cooks_d > 0.05)
## id extraversion happiness diagnostics$.hat student_resids cooks_d
## 1 106 18 32 0.02797088 -1.899319 0.05108585
## 2 134 16 85 0.03038169 3.071177 0.14050319
## 3 149 20 87 0.02568587 3.101559 0.12043312
## 4 161 22 95 0.02352666 3.819637 0.16223203
## 5 165 17 91 0.02916056 3.628564 0.18400290
Once potential outliers have been identified, we have three choices for how to deal with them: leave them in, correct them, or remove them.
With data errors (e.g., an incorrect value was entered), it’s very easy to decide what to do with the outlier: we either fix the value, if we can infer with great confidence what it should be, or we remove it.
With values that appear to reflect a real phenomenon (not an error), it’s much harder to decide what to do with that outlier.
Let’s examine which observations are outliers based on all three of our diagnostic tests:
# Outliers based on Predictor Variable (Leverage)
data_out %>%
filter(diagnostics$.hat > 0.03636)
## [1] id extraversion happiness diagnostics$.hat
## [5] student_resids cooks_d
## <0 rows> (or 0-length row.names)
# Outliers based on Distance from Model (Studentized Residuals)
data_out %>%
filter(abs(student_resids) > 3)
## id extraversion happiness diagnostics$.hat student_resids cooks_d
## 1 134 16 85 0.03038169 3.071177 0.1405032
## 2 149 20 87 0.02568587 3.101559 0.1204331
## 3 161 22 95 0.02352666 3.819637 0.1622320
## 4 165 17 91 0.02916056 3.628564 0.1840029
# Outliers based on Undue Model Influence (Cook's D)
data_out %>%
filter(cooks_d > 0.05)
## id extraversion happiness diagnostics$.hat student_resids cooks_d
## 1 106 18 32 0.02797088 -1.899319 0.05108585
## 2 134 16 85 0.03038169 3.071177 0.14050319
## 3 149 20 87 0.02568587 3.101559 0.12043312
## 4 161 22 95 0.02352666 3.819637 0.16223203
## 5 165 17 91 0.02916056 3.628564 0.18400290
Question: Which observations are outliers based on all three of our diagnostic tests?
Participants 134, 149, 161, and 165. These are the four people we identified earlier with low extraversion and extremely high happiness, a combination that doesn’t fit the general pattern in the data (as extraversion increases, happiness tends to increase).
Question: What do you think we should do with these outliers? What is your rationale for that decision?
There’s a lot of discussion about what to do with outliers. Should we leave them in the analysis because they reflect a real phenomenon that we want our model to capture? Should we remove them from the analysis because they’re having a strong influence on the model’s overall fit and don’t fit the pattern demonstrated by the majority of participants’ scores on the variables?
Whatever you decide, be able to provide rationale for why you made that decision.
Import data
data_miss <- import("miss.csv")
head(data_miss)
## id extraversion happiness
## 1 1 66 68
## 2 2 69 78
## 3 3 56 70
## 4 4 91 NA
## 5 5 NA 90
## 6 6 47 63
Missing values are indicated in R by an NA in the data
frame.
We can visualize the rows with missing data on each variable using
the vis_miss() function from the naniar
package. It also provides a measure of the percentage of missing data on
each variable.
data_miss %>%
vis_miss()
It looks like extraversion is missing 8% of responses and happiness is missing 5%.
Listwise deletion excludes all rows from the analysis that have NAs for any of the variables included in the model.
Listwise deletion is not recommended unless the percentage of missing data is very small because it 1) reduces power by fitting the model to a smaller sample, and 2) can lead to biased parameter estimates.
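To see how many rows listwise deletion would actually use here, a quick check with base R’s complete.cases() (just a sketch; the counts should line up with the “observations deleted due to missingness” note in the model summary below):
# Number of rows with complete data (what listwise deletion keeps)
sum(complete.cases(data_miss))
# Number of rows that would be dropped
sum(!complete.cases(data_miss))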
The lm function can be told how to handle missing values
using the na.action argument. By default, the
na.action argument is most likely set to
na.omit unless you have changed that default value. Let’s
check.
getOption("na.action")
## [1] "na.omit"
na.omit means the model will use listwise deletion. If
you don’t choose another method of handling missing data, then you’re
going with the default of listwise deletion.
model_listwise <- lm(happiness ~ extraversion, data = data_miss)
summary(model_listwise)
##
## Call:
## lm(formula = happiness ~ extraversion, data = data_miss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.707 -5.706 0.095 5.893 33.293
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.50296 3.23831 14.051 < 2e-16 ***
## extraversion 0.40007 0.05297 7.552 7.85e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.872 on 125 degrees of freedom
## (20 observations deleted due to missingness)
## Multiple R-squared: 0.3133, Adjusted R-squared: 0.3078
## F-statistic: 57.03 on 1 and 125 DF, p-value: 7.848e-12
Social scientists have increasingly encouraged the community to use either multiple imputation or full information maximum likelihood (FIML) to handle missing data as opposed to the traditional approach of listwise deletion. Studies have found that these two methods for handling missing data 1) maintain better power, and 2) produce less biased parameter estimates.
We’re going to cover multiple imputation in this lab. You’ll learn how to use FIML, which is another great way of handling missing data, in 613 when you cover multilevel modeling.
Multiple imputation can be performed using functions from the mice package. It occurs in a few steps: 1) impute several complete copies of the data set (filling in the missing values), 2) fit the regression model to each imputed data set, and 3) pool the parameter estimates and standard errors across the fitted models.
Missing values will be predicted based on the associations
observed for other participants in the data set plus random noise to
reflect the uncertainty in predicting these values. Set a seed
when imputing the data sets if you would like the results to be
replicable using the seed argument.
We also have to choose the number of data sets to impute using
the m argument. A commonly chosen value is 5, but there are
arguments for using larger numbers of imputations as the proportion of
missing data increases.
# Impute m = 5 data sets
imp <- mice(data_miss, m = 5, seed = 382, print = FALSE)
# Fit the model to each imputed data set (note: no data argument inside lm();
# with() supplies each completed data set in turn)
model <- with(imp, lm(happiness ~ extraversion))
Finally, we can average the results across all of the imputed models to get a final set of parameter estimates and a test of their significance.
pooled_model <- pool(model)
# Summary() Output
summary(pooled_model)
## term estimate std.error statistic df p.value
## 1 (Intercept) 45.5029606 3.2383114 14.051447 123.0345 2.409915e-27
## 2 extraversion 0.4000701 0.0529747 7.552098 123.0345 8.372387e-12
The parameter estimates are simply averaged across each of the estimated models.
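You can verify this yourself: the individual fitted models are stored in the analyses element of the object returned by with(), so averaging their slope estimates should reproduce the pooled estimate (an optional check).
# Optional check: the pooled slope is the average of the per-imputation slopes
slopes <- sapply(model$analyses, function(fit) coef(fit)["extraversion"])
slopes
mean(slopes)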
The calculation of the pooled degrees of freedom gets… hairy. See Grund, Lüdtke, and Robitzsch (2016) and Li et al. (1991).
One of the assumptions underlying linear regression is that the form of the relationship between the predictor(s) and outcome is correctly specified. Most often, we assume the relationship is linear. If the actual form of the relationship is non-linear, then the linear model estimates could be quite inaccurate.
If a linear model is used to fit the relationship between the predictor(s) and outcome variable but the relationship actually follows a non-linear pattern, this will show up as a systematic pattern in the residuals plot.
Import data
data_nonlin <- import("nonlinear.csv")
head(data_nonlin)
## id extraversion happiness
## 1 1 66 72.49974
## 2 2 69 88.45730
## 3 3 56 37.50700
## 4 4 91 238.90074
## 5 5 97 311.96884
## 6 6 47 18.14539
Fit model
model_nonlin <- lm(happiness ~ extraversion, data = data_nonlin)
Examine residuals plot
plot(model_nonlin, 1)
If the linear model is a good fit to the form of the relationship, there will be no discernible pattern in the residuals plot. If there is a systematic pattern, that is an indication that there may be a non-linear trend not being captured by the linear model. Our model certainly seems to show a systematic pattern in its residuals!
We should have started out with a visual inspection of the relationship between the predictor and outcome variable to see if it fit the assumption of linearity before fitting the model. Let’s look at it now:
ggplot(data_nonlin, aes(x = extraversion, y = happiness)) +
geom_point()
Potential Solution: If your residuals plot suggests the relationship is non-linear, then consider fitting a non-linear model that more accurately captures the form of the relationship between the predictor(s) and outcome. We don’t get to discuss fitting non-linear models much in this course, so I would recommend looking into this more if you are interested. You’ll get to discuss it a little more with logistic regression in 613, though. The other option would be to transform the variable(s) so that the relationship between them fits the pattern of a straight line.
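As one illustration of the first option (a sketch only; whether a quadratic is the right functional form for these data is a judgment call), you could add a polynomial term to the model and re-check the residuals plot:
# A possible non-linear specification: add a quadratic term for extraversion
model_quad <- lm(happiness ~ extraversion + I(extraversion^2), data = data_nonlin)
# Re-examine the residuals plot for any remaining systematic pattern
plot(model_quad, 1)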
Another assumption underlying linear regression is that the model’s residuals are normally distributed.
To diagnose this problem, you can either plot the distribution of the model’s residuals or examine a QQ-plot of the residuals.
Import data
data_nonnorm <- import("nonnormal.csv")
head(data_nonnorm)
## id extraversion happiness
## 1 1 66 68.44909
## 2 2 69 78.10721
## 3 3 56 79.77801
## 4 4 91 88.05083
## 5 5 97 90.61513
## 6 6 47 63.17076
Fit the model
model_nonnorm <- lm(happiness ~ extraversion, data = data_nonnorm)
Plotting a distribution of the residuals
# storing residuals
regr_diags <- augment(model_nonnorm)
regr_diags
## # A tibble: 147 × 8
## happiness extraversion .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 68.4 66 77.1 -8.61 0.00788 17.0 1.03e-3 -0.510
## 2 78.1 69 78.1 0.0522 0.00907 17.0 4.37e-8 0.00309
## 3 79.8 56 73.7 6.05 0.00708 17.0 4.58e-4 0.358
## 4 88.1 91 85.4 2.67 0.0313 17.0 4.12e-4 0.160
## 5 90.6 97 87.4 3.23 0.0414 17.0 8.19e-4 0.195
## 6 63.2 47 70.7 -7.56 0.0105 17.0 1.07e-3 -0.448
## 7 62.0 28 64.4 -2.42 0.0309 17.0 3.35e-4 -0.145
## 8 65.1 63 76.1 -11.0 0.00712 17.0 1.51e-3 -0.650
## 9 75.6 72 79.1 -3.47 0.0107 17.0 2.28e-4 -0.205
## 10 85.3 72 79.1 6.20 0.0107 17.0 7.32e-4 0.368
## # ℹ 137 more rows
# plotting histogram of residuals
ggplot(data = regr_diags, aes(x = .resid)) +
geom_density(fill = "purple") +
stat_function(linetype = 2,
fun = dnorm,
args = list(mean = mean(regr_diags$.resid),
sd = sd(regr_diags$.resid))) +
theme_minimal()
Examining a QQ-Plot
ggplot(model_nonnorm) +
geom_abline(color = "turquoise4") +
stat_qq(aes(sample = .stdresid), color = "darkorchid4", alpha = .50) +
theme_minimal()
Question: Do you see evidence of non-normality in the residuals based on the distribution and QQ plot?
Yes, there appear to be one or two individuals with extremely large residuals that are making the distribution of residuals positively skewed.
Potential Solution: Linear regression is robust to some degree of violation of the normality assumption. If you see evidence of non-normality, make sure you have dealt with (or are aware of) any potential outliers that could be driving the non-normality in the residuals. If the non-normality is not due to outliers, then you can apply a transformation to Y and/or X to try to produce a model with normally distributed residuals.
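As an illustration of the transformation option (a sketch only, assuming all happiness scores are positive so the log is defined), you could log-transform the outcome and re-check the residual distribution:
# A possible remedy if the skew is not outlier-driven: log-transform Y
model_log <- lm(log(happiness) ~ extraversion, data = data_nonnorm)
# Re-check the distribution of the residuals
ggplot(augment(model_log), aes(x = .resid)) +
  geom_density(fill = "purple") +
  theme_minimal()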
Let’s see if we can gain more insight into the cause of the
non-normality by looking at a scatterplot of the relationship between
extraversion and happiness in this data
set.
ggplot(data_nonnorm, aes(x = extraversion, y = happiness)) +
geom_point()
Question: Do you notice anything in the scatterplot that could be driving the non-normality in the residuals?
Yes - we missed a potential multivariate outlier!
Another assumption underlying linear regression is called homoscedasticity, which is the assumption that the variance of the outcome is approximately the same across all values of the predictor(s). Heteroskedasticity is the violation of this assumption. The accuracy of the model’s standard errors and confidence intervals depends on this assumption being met.
The residuals plot can also be used to check for heteroskedasticity.
Import data
data_heteroskedastic <- import("heteroskedastic.csv")
head(data_heteroskedastic)
## id extraversion happiness
## 1 1 66 43.19
## 2 2 69 120.93
## 3 3 56 55.53
## 4 4 91 112.42
## 5 5 97 19.37
## 6 6 47 58.42
Fit the model
model_heteroskedastic <- lm(happiness ~ extraversion, data = data_heteroskedastic)
Examine the residuals plot
plot(model_heteroskedastic, 1)
Question: Do you see any evidence of heteroskedasticity?
Yes, the fan-shaped residuals illustrate heteroskedasticity. Residuals are more spread out at higher values of extraversion than they are at lower values of extraversion.
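If you want a numerical check to go along with the plot, one option (assuming you have the lmtest package installed; it isn’t otherwise used in this lab) is the Breusch-Pagan test, where a small p-value suggests the residual variance is not constant:
# Optional: Breusch-Pagan test for heteroskedasticity (requires lmtest)
library(lmtest)
bptest(model_heteroskedastic)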
Potential Solution: Sometimes, dealing with non-normality can also treat heteroskedasticity. If the heteroskedasticity remains, though, then one option for treating it is to fit your model using weighted least squares.
Weighted least squares assigns a weight to each observation to be used when finding the best-fitting model. A common weight that’s used when the residuals are heteroskedastic is \(1/X^2\).
model_weighted <- lm(happiness ~ extraversion, data = data_heteroskedastic, weights = 1/extraversion^2)
summary(model_weighted)
##
## Call:
## lm(formula = happiness ~ extraversion, data = data_heteroskedastic,
## weights = 1/extraversion^2)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.53798 -0.36680 -0.06792 0.35110 1.20461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.3092 5.5239 6.392 2.1e-09 ***
## extraversion 0.6171 0.1120 5.509 1.6e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5256 on 145 degrees of freedom
## Multiple R-squared: 0.1731, Adjusted R-squared: 0.1674
## F-statistic: 30.35 on 1 and 145 DF, p-value: 1.601e-07
Another assumption underlying linear regression is that the errors are independent of each other. Non-independence can occur if people’s scores on the outcome variable are related to each other for some reason (e.g., the same person responds multiple times or you measured responses from clusters of individuals in the same setting).
The design of the study is the most obvious way to know whether to expect non-independence among one’s residuals. For instance, if the researcher intentionally collected multiple scores from the same participants, they should not expect participants’ errors to be independent.
To test for non-independence when the research design does not inform you, you can look at a plot of the residuals with a variable that the errors should not correspond to (e.g., ID number, time). This is an imperfect way of deducing issues of non-independence, though, and the best way of ensuring independence occurs during data collection.
Let’s get some more insight, though, by plotting residuals against ID numbers.
Import data
data_nonind <- import("nonindependent.csv")
head(data_nonind)
## ID extraversion happiness V4 V5
## 1 1 low_ext 5 NA 3.520833
## 2 2 low_ext 6 NA NA
## 3 3 low_ext 2 NA NA
## 4 4 low_ext 1 NA NA
## 5 5 low_ext 2 NA NA
## 6 6 low_ext 2 NA NA
Fit the model
model_nonind <- lm(happiness ~ extraversion, data = data_nonind)
Plot the residuals by ID
# storing residuals
regr_diags <- augment(model_nonind)
regr_diags
## # A tibble: 48 × 8
## happiness extraversion .fitted .resid .hat .sigma .cooksd .std.resid
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 low_ext 3.06 1.94 0.0625 2.08 0.0206 0.962
## 2 6 low_ext 3.06 2.94 0.0625 2.05 0.0472 1.46
## 3 2 low_ext 3.06 -1.06 0.0625 2.10 0.00618 -0.527
## 4 1 low_ext 3.06 -2.06 0.0625 2.08 0.0233 -1.02
## 5 2 low_ext 3.06 -1.06 0.0625 2.10 0.00618 -0.527
## 6 2 low_ext 3.06 -1.06 0.0625 2.10 0.00618 -0.527
## 7 5 low_ext 3.06 1.94 0.0625 2.08 0.0206 0.962
## 8 6 low_ext 3.06 2.94 0.0625 2.05 0.0472 1.46
## 9 1 low_ext 3.06 -2.06 0.0625 2.08 0.0233 -1.02
## 10 1 low_ext 3.06 -2.06 0.0625 2.08 0.0233 -1.02
## # ℹ 38 more rows
# Add ID column
regr_diags$ID <- data_nonind$ID
# Plot residuals by ID
ggplot(data = regr_diags, aes(x = ID, y = .resid)) +
geom_point() +
geom_smooth(se = F) +
geom_hline(yintercept = 0)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
It looks like adjacent residuals tend to be similar to each other, which could indicate that participants came and participated in pairs or that the same individual completed the study back-to-back.
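If you want a rough numerical check to accompany the plot (again assuming the lmtest package is available), the Durbin-Watson test looks for first-order autocorrelation among the residuals in their row order; values far from 2 suggest adjacent residuals are related:
# Optional: Durbin-Watson test for autocorrelation among adjacent residuals
library(lmtest)
dwtest(model_nonind)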
Potential Solution: Relationships among participants’ residuals can be driven by 1) repeated measures from the same (or related) participants, and 2) clustering among participants. The first source of non-independence can be dealt with by using a repeated-measures or longitudinal analysis. The second source of non-independence can be dealt with by using multilevel modeling. You’ll discuss both of these analyses more in 613.
Multicollinearity occurs when two or more predictors in our model are highly correlated with each other. This causes an issue because 1) it becomes more difficult to estimate the unique relationship between each predictor variable and the outcome, and 2) high multicollinearity inflates standard errors and widens confidence intervals, which makes it more difficult to detect the significance of predictors.
For this example, we’re using a data set that has two predictors of
happiness, extraversion and
social_engagement.
Import data
data_multicoll <- import("multicollinear.csv")
head(data_multicoll)
## id happiness extraversion social_engagement
## 1 1 68 66 72
## 2 2 78 69 85
## 3 3 70 56 63
## 4 4 88 91 104
## 5 5 90 97 100
## 6 6 63 47 56
First, you can simply look at a correlation matrix of your predictors.
data_multicoll %>%
cor()
## id happiness extraversion social_engagement
## id 1.00000000 -0.07198186 -0.0501857 -0.06811339
## happiness -0.07198186 1.00000000 0.5591066 0.53413605
## extraversion -0.05018570 0.55910662 1.0000000 0.96193732
## social_engagement -0.06811339 0.53413605 0.9619373 1.00000000
The variables extraversion and
social_engagement appear to be VERY highly correlated
(r = 0.96)!
We can’t rely only on correlation matrices to identify collinearity, because it is possible for multicollinearity to exist among three or more variables (even if each pair of variables is not highly correlated).
Thus, we also want to calculate numerical measures of multicollinearity, including tolerance and VIFs.
Fit the model
model_multicoll <- lm(happiness ~ extraversion + social_engagement, data = data_multicoll)
Examining tolerance & VIFs
ols_vif_tol(model_multicoll)
## Variables Tolerance VIF
## 1 extraversion 0.07467659 13.39108
## 2 social_engagement 0.07467659 13.39108
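To see where these numbers come from: with two predictors, each predictor’s tolerance is \(1 - R^2\) from regressing it on the other predictor, and the VIF is the reciprocal of the tolerance (which is why both rows are identical here). A quick check:
# VIF by hand: regress one predictor on the other
r2 <- summary(lm(extraversion ~ social_engagement, data = data_multicoll))$r.squared
1 - r2        # tolerance
1 / (1 - r2)  # VIF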
Rule of thumb for interpreting multicollinearity: Either a low tolerance (below 0.20 is one rule of thumb) or a high VIF (above 5 or 10) is an indication of a problem with multicollinearity.
Potential solution: One potential solution for handling multicollinearity is to combine predictor variables that appear to be measuring very similar conceptual constructs. The other potential solution is to drop one of the variables from the analysis that is of less theoretical importance.
Let’s combine extraversion and
social_engagement into a single variable that we’ll call
sociality and use this new variable to predict
happiness.
Since the predictor variables could originally have been measured in different units, we’ll first standardize them (i.e., convert them into z-scores) before combining participants’ scores on the two variables. There are different ways of combining scores across variables; here, we will simply average them.
data_multicoll <- data_multicoll %>%
mutate(ext_std = scale(extraversion, center = TRUE, scale = TRUE),
soc_eng_std = scale(social_engagement, center = TRUE, scale = TRUE),
sociality = ((ext_std + soc_eng_std)/2))
model_sociality <- lm(happiness ~ sociality, data = data_multicoll)
Look at how the summary() output for the two models differs.
To compare the two models, let’s fit the first model with standardized scores on each predictor so their scales are comparable to that of the predictor in the second model.
model_multicoll <- lm(happiness ~ ext_std + soc_eng_std, data = data_multicoll)
summary(model_multicoll)
##
## Call:
## lm(formula = happiness ~ ext_std + soc_eng_std, data = data_multicoll)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.229 -5.918 0.354 5.718 32.578
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.7143 0.8128 85.769 <2e-16 ***
## ext_std 7.1619 2.9846 2.400 0.0177 *
## soc_eng_std -0.5833 2.9846 -0.195 0.8453
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.855 on 144 degrees of freedom
## Multiple R-squared: 0.3128, Adjusted R-squared: 0.3032
## F-statistic: 32.77 on 2 and 144 DF, p-value: 1.865e-12
summary(model_sociality)
##
## Call:
## lm(formula = happiness ~ sociality, data = data_multicoll)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.090 -6.255 0.906 5.818 34.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.7143 0.8148 85.558 < 2e-16 ***
## sociality 6.5787 0.8255 7.969 4.31e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.879 on 145 degrees of freedom
## Multiple R-squared: 0.3046, Adjusted R-squared: 0.2998
## F-statistic: 63.51 on 1 and 145 DF, p-value: 4.311e-13
Question: What do you notice about the difference between the two model outputs?
The standard errors are much larger in the model containing both highly correlated predictors.