Multiple linear regression relies on several assumptions. We can create diagnostic plots to check whether our data meet those assumptions. Here is the model we will be analyzing:
Let’s compare this to the plot of Feature_1 against Agreeableness:
library(ggplot2)
ggplot(df_small, aes(Agreeableness, Feature_1)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
How do these two graphs correspond to each other?
QQ-plots
Using the same model as we used previously, we can generate the same residual plots and more by directly using the plot() function on the model.
# Residuals vs fitted model
plot(model, which = 1)
(We use which = 1 to specify that we want the residual plot. We will see the other plots in a bit.)
In the residual plot, we can see that the residuals are generally linear and their spread is fairly consistent across the fitted values. This suggests that the data adequately meet the assumptions of linearity and homoscedasticity (constant variance) of residuals.
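To make it clearer what this plot contains, here is a minimal sketch that builds a residuals-versus-fitted plot by hand. It uses R's built-in mtcars data and a hypothetical model called demo_model as stand-ins, since the lab's own data are not reproduced here; the smoothed red line is only an approximation of the one plot() draws.

```r
# Stand-in model on built-in data (an assumption; not the lab's model).
demo_model <- lm(mpg ~ wt + hp, data = mtcars)

fitted_vals <- fitted(demo_model)  # predicted values from the model
resid_vals  <- resid(demo_model)   # observed minus predicted values

# Scatter the residuals against the fitted values.
plot(fitted_vals, resid_vals,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted (by hand)")
abline(h = 0, lty = 2)                                # reference line at zero
lines(lowess(fitted_vals, resid_vals), col = "red")   # smoothed trend
```

If the linearity assumption holds, the red trend should stay close to the dashed zero line; if the variance is constant, the vertical scatter should look similar from left to right.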
Q-Q plot of normality
# Q-Q plot of normality
plot(model, which = 2)
The Q-Q plot checks the normality of the residuals: if the residuals are normally distributed, the points fall along the diagonal dashed line. Since our points diverge from that line, the residuals may not be normal, which could indicate skew. This isn’t a huge problem for large data sets, but it may affect the validity of hypothesis tests.
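As a rough illustration of what the Q-Q plot compares, the sketch below draws one by hand from the standardized residuals. The model demo_model and the mtcars data are stand-ins chosen for this example, not the lab's model.

```r
# Stand-in model on built-in data (an assumption; not the lab's model).
demo_model <- lm(mpg ~ wt + hp, data = mtcars)

# Standardized residuals: residuals rescaled to have roughly unit variance.
std_resid <- rstandard(demo_model)

# Plot sample quantiles of the residuals against normal quantiles.
qqnorm(std_resid, main = "Normal Q-Q (by hand)")
qqline(std_resid, lty = 2)  # reference line through the quartiles
```

Points that bend away from the reference line at one or both ends are the kind of divergence described above.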
There are other diagnostic plots provided in the plot() function, but these two are the most relevant. You can view all the plots using the code below:
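A minimal sketch of one way to do this is shown here, assuming a fitted lm object. The built-in mtcars data and the name demo_model are stand-ins; in the lab you would pass your own model object to plot() instead.

```r
# Stand-in model on built-in data (an assumption; not the lab's model).
demo_model <- lm(mpg ~ wt + hp, data = mtcars)

# Arrange the plots in a 2 x 2 grid and remember the previous layout.
old_par <- par(mfrow = c(2, 2))

# Without `which`, plot() on an lm object draws its four default plots:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage.
plot(demo_model)

# Restore the previous plotting layout.
par(old_par)
```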
Call:
lm(formula = Feature_1 ~ Openness + Conscientiousness + Extraversion +
    Agreeableness + Neuroticism, data = df_full)

Residuals:
     Min       1Q   Median       3Q      Max
-0.52692 -0.23912 -0.00349  0.24649  0.52444

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        0.52111    0.02547  20.463   <2e-16 ***
Openness          -0.02980    0.02231  -1.336   0.1818
Conscientiousness -0.02057    0.02198  -0.936   0.3493
Extraversion      -0.02339    0.02189  -1.069   0.2853
Agreeableness      0.06010    0.02198   2.735   0.0063 **
Neuroticism       -0.01933    0.02192  -0.882   0.3780
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2868 on 1994 degrees of freedom
Multiple R-squared:  0.006178,  Adjusted R-squared:  0.003686
F-statistic: 2.479 on 5 and 1994 DF,  p-value: 0.03009
During last week’s lab section, we looked at various versions of the multiple linear regression model. Along with the full model with all five predictors, let’s consider two additional models: one with openness removed as a predictor, and one with agreeableness removed as a predictor.
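As a rough sketch of how these three models might be set up and compared, the code below fits the full model and then drops one predictor at a time with update(). The data frame df_demo is simulated purely so the example runs on its own; it is not the lab's df_full, and its numbers mean nothing. The model names and the choice of adjusted R-squared and a nested F-test as comparison tools are also assumptions made for illustration.

```r
# Simulated stand-in data with the same column names (hypothetical values).
set.seed(42)
n <- 200
df_demo <- data.frame(
  Openness = rnorm(n), Conscientiousness = rnorm(n), Extraversion = rnorm(n),
  Agreeableness = rnorm(n), Neuroticism = rnorm(n)
)
df_demo$Feature_1 <- 0.5 + 0.06 * df_demo$Agreeableness + rnorm(n, sd = 0.3)

# Full model with all five predictors.
full_model <- lm(Feature_1 ~ Openness + Conscientiousness + Extraversion +
                   Agreeableness + Neuroticism, data = df_demo)

# Reduced models: remove one predictor from the full model's formula.
no_open_model  <- update(full_model, . ~ . - Openness)
no_agree_model <- update(full_model, . ~ . - Agreeableness)

# Adjusted R-squared for each model, side by side.
adj_r2 <- sapply(
  list(full = full_model, no_open = no_open_model, no_agree = no_agree_model),
  function(m) summary(m)$adj.r.squared
)
print(adj_r2)

# Nested-model F-tests: does adding the dropped predictor back improve fit?
print(anova(no_open_model, full_model))
print(anova(no_agree_model, full_model))
```

Because each reduced model is nested inside the full model, anova() can test whether the removed predictor explains a significant amount of additional variance.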