lab 5

Diagnostic Plots

Model

There are several assumptions used in multiple linear regression. We can create diagnostic plots to check if our data meets those assumptions. In this case, Here is the model we will be analyzing:

df_full <- read.csv("handwriting.csv")
df <- df_full[1:100, ]
model <- lm(Feature_1 ~ Openness + Conscientiousness + Extraversion + Agreeableness + Neuroticism, df)
summary(model)


Call:
lm(formula = Feature_1 ~ Openness + Conscientiousness + Extraversion + 
    Agreeableness + Neuroticism, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.50603 -0.21734 -0.05915  0.25007  0.51523 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)   
(Intercept)        0.349732   0.121135   2.887  0.00482 **
Openness          -0.018672   0.104424  -0.179  0.85847   
Conscientiousness  0.065906   0.089107   0.740  0.46137   
Extraversion      -0.001501   0.102142  -0.015  0.98831   
Agreeableness      0.097962   0.096796   1.012  0.31412   
Neuroticism        0.137614   0.100713   1.366  0.17507   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2777 on 94 degrees of freedom
Multiple R-squared:  0.03594,   Adjusted R-squared:  -0.01534 
F-statistic: 0.7008 on 5 and 94 DF,  p-value: 0.6242

resid()

One easy way to find the residuals of any linear model is by using the resid() function.

resid(model)

           1            2            3            4            5            6 
-0.302025656  0.320678142 -0.065808574  0.342422852  0.442289049  0.205916404 
           7            8            9           10           11           12 
-0.241512998  0.461053861 -0.170823414  0.252262065 -0.084112423 -0.288535344 
          13           14           15           16           17           18 
 0.395723099 -0.315899812 -0.329471395 -0.340871271  0.475122642  0.346971451 
          19           20           21           22           23           24 
 0.059448853  0.325558021  0.174582592 -0.244916263  0.210717713  0.515233856 
          25           26           27           28           29           30 
 0.501616604  0.034264209 -0.088495781 -0.305680631 -0.304870792  0.130661500 
          31           32           33           34           35           36 
-0.156243687  0.329641717 -0.506027434  0.344060860  0.378699182  0.317799045 
          37           38           39           40           41           42 
-0.110339714 -0.030039792 -0.153273203 -0.193352238  0.233341408 -0.439221992 
          43           44           45           46           47           48 
-0.081764796 -0.335268425  0.376609160 -0.052493267  0.382846576  0.085716389 
          49           50           51           52           53           54 
 0.357770387 -0.370412870 -0.199040943 -0.348784697  0.249335832 -0.196583878 
          55           56           57           58           59           60 
-0.144820979 -0.441057447 -0.275725947  0.090460312 -0.146173420  0.331902441 
          61           62           63           64           65           66 
-0.320624065 -0.049404993 -0.115442039 -0.154745954 -0.313845120  0.448573142 
          67           68           69           70           71           72 
-0.066331113 -0.221351397  0.041259578 -0.225738296 -0.208640906 -0.143812873 
          73           74           75           76           77           78 
-0.154446249 -0.169553514 -0.259676120 -0.216007518  0.002217333 -0.001589025 
          79           80           81           82           83           84 
 0.328001399 -0.227199364 -0.303652167 -0.043547191  0.387648649  0.302271259 
          85           86           87           88           89           90 
 0.137315267  0.003673868  0.102328566  0.308020600  0.458338353 -0.021998006 
          91           92           93           94           95           96 
-0.200562883 -0.067653127  0.131525859  0.188461190  0.220627314 -0.322695773 
          97           98           99          100 
-0.038663529 -0.182075826 -0.231723366 -0.208339106

resid() vs fitted line

However, looking at a long list of residuals is not very useful, so we can compare it to the fitted line instead.

res <- resid(model)
fit <- fitted(model)

(try seeing what the output of fitted(model) is for yourself)

We can plot the comparison using the plot() function in R.

plot(fit, res)

(As a side note, you can also add a horizontal line using abline(0, 0)

plot(res, fit)
abline(0, 0)

What does this mean?

Let’s look a smaller model to interpret this plot

df_small <- df[1:3, ]
model_small <- lm(Feature_1 ~ Agreeableness, df_small)
res <- resid(model_small)
fit <- fitted(model_small)
plot(fit, res)

Let’s compare this to the plot of Feature_1 on Agreeableness

library(ggplot2)
ggplot(df_small, aes(Agreeableness, Feature_1)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

How do these two graphs correspond to each other?

QQ-plots

Using the same model as we used previously, we can generate the same residual plots and more by directly using the plot() function on the model.

# Residuals vs fitted model
plot(model, which = 1)

(we use which = 1 to specify that we want the residual plot. We will see the other plots in a bit)

For the residuals, we can see that they are generally linear and the spread is fairly consistent. This shows this data adequately meet the assumptions of linearity and homoscedasticity (constant variance) of residuals.

QQ-plot of normalcy

# Q-Q plot of normalcy
plot(model, which = 2)

The Q-Q plot checks for normalcy of residuals indicated by the diagonal dashed line. Since our data diverges from that line, it could indicate skew. This isn’t a huge problem for large data sets but may affect the validity of hypothesis tests.

There are other diagnostic plots provided in the plot() function, but these two are the most relevant. You can view all the plots using the code below:

par(mfrow=c(2,2))
plot(model)

General Linear F-Test

What is the best model?

model <- lm(Feature_1 ~ Openness + Conscientiousness + Extraversion + Agreeableness + Neuroticism, df_full)
summary(model)


Call:
lm(formula = Feature_1 ~ Openness + Conscientiousness + Extraversion + 
    Agreeableness + Neuroticism, data = df_full)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.52692 -0.23912 -0.00349  0.24649  0.52444 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        0.52111    0.02547  20.463   <2e-16 ***
Openness          -0.02980    0.02231  -1.336   0.1818    
Conscientiousness -0.02057    0.02198  -0.936   0.3493    
Extraversion      -0.02339    0.02189  -1.069   0.2853    
Agreeableness      0.06010    0.02198   2.735   0.0063 ** 
Neuroticism       -0.01933    0.02192  -0.882   0.3780    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2868 on 1994 degrees of freedom
Multiple R-squared:  0.006178,  Adjusted R-squared:  0.003686 
F-statistic: 2.479 on 5 and 1994 DF,  p-value: 0.03009

During last week’s lab section, we looked at various versions of the multiple linear regression model. Along with the full model with all five predictors, let’s consider two additional models: one with openness removed as a predictor, and one with agreeableness removed as a predictor.

model_no_openness <- lm(Feature_1 ~ Conscientiousness + Extraversion + Agreeableness + Neuroticism, df_full)

model_no_agreeableness <- lm(Feature_1 ~ Openness + Conscientiousness + Extraversion + Neuroticism, df_full)

We can then conduct a general linear f-test by calling anova() on the two models.

anova(model_no_openness, model)

Analysis of Variance Table

Model 1: Feature_1 ~ Conscientiousness + Extraversion + Agreeableness + 
    Neuroticism
Model 2: Feature_1 ~ Openness + Conscientiousness + Extraversion + Agreeableness + 
    Neuroticism
  Res.Df    RSS Df Sum of Sq     F Pr(>F)
1   1995 164.13                          
2   1994 163.99  1   0.14671 1.784 0.1818

anova(model_no_agreeableness, model)

Analysis of Variance Table

Model 1: Feature_1 ~ Openness + Conscientiousness + Extraversion + Neuroticism
Model 2: Feature_1 ~ Openness + Conscientiousness + Extraversion + Agreeableness + 
    Neuroticism
  Res.Df    RSS Df Sum of Sq     F   Pr(>F)   
1   1995 164.60                               
2   1994 163.99  1   0.61507 7.479 0.006298 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

`anova()` output

Let’s take a closer look at the anova() output:

Res.Df: residual degrees of freedom. More predictors means fewer degrees of freedom.
RSS: residual sum of squares. More predictors means lower or equal residual sum of squares.
Df: The number of predictors that is being removed from the full to reduced model.
Sum of Sq: the result of subtracting the two RSS’s
F: The general linear f-test conducted on the two models
Pr(>F): The resulting p-value from the general linear f-test

Removing More than One Predictor

So far you’ve only seen a difference of one predictor from the full to reduced model. Now let’s see what happens if we remove more than one.

model_agreeableness_only <- lm(Feature_1 ~ Agreeableness, df_full)
anova(model_agreeableness_only, model)

Analysis of Variance Table

Model 1: Feature_1 ~ Agreeableness
Model 2: Feature_1 ~ Openness + Conscientiousness + Extraversion + Agreeableness + 
    Neuroticism
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1   1998 164.37                           
2   1994 163.99  4   0.38067 1.1572 0.3279

How many predictors are being removed?
Should the full or reduced model be kept?