Diagnosis of regression models

Once a regression model is developed, such as lm or lme4::lmer, people need to carry out diagnosis. Some packages provide plotting functions to show the diagnosis (ex lme4::plot.merMod or performance::check_model).

A new package diagnoser will provide various diagnosis plottings using ggplot2. Here, I will show how to use diagnoser package.

Installation

Currently, diagnoser is provided only from GitHub.

devtools::install_github("i-kiwamu/diagnoser")

lm model

There are various diagnoses of multiple regression with numerical and factorial variables. A great package car provides a lot of diagnosis plotting functions. Let’s follow the examples in An R Companion to Applied Regression using diagnoser package.

Data & model

We use the same dataset as Chapter 8 of An R Companion to Applied Regression.

library(car)
## Loading required package: carData
Prestige$type <- factor(Prestige$type, levels = c("bc", "wc", "prof"))
lm_prestige <- lm(prestige ~ education + income + type, Prestige)
brief(lm_prestige)
##            (Intercept) education   income typewc typeprof
## Estimate        -0.623     3.673 0.001013  -2.74     6.04
## Std. Error       5.228     0.641 0.000221   2.51     3.87
## 
##  Residual SD = 7.09 on 93 df, R-squared = 0.835

Now diagnosis object is prepared.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::recode() masks car::recode()
## ✖ purrr::some()   masks car::some()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(diagnoser)
diag_lm_prestige <- diagnose(lm_prestige)

Measured vs fitted plot

plot(diag_lm_prestige, type = "mf")

Measured-fitted plot is used for goodness-of-fit.

Residuals vs fitted plot

plot(diag_lm_prestige, type = "rf")

Residuals-fitted plot is used for checking the normality. Here, the y-axis is the studentized residuals. Thus, possible outliers can be found if the studentized residuals > 2 or < -2. In this case, we can find 4 possible outliers.

Also, if the plot has a pattern, people need to consider the transformation, usage of weight, usage of hierarchical models, etc.

QQ plot

plot(diag_lm_prestige, type = "qq")

QQ plot is used for checking the normality as well. In this case, the hypothesis of normality is valid.

Scale-Location plot

plot(diag_lm_prestige, type = "sl")

Scale-Location plot is also used for finding outliers or checking the normality.

Cook’s distance

plot(diag_lm_prestige, type = "cook")

Cook’s distance is used for checking the influential data. The data is possibly influential on the model if Cook’s distance is higher than 1.

Marginal-model plot

plot(diag_lm_prestige, type = "mm", aes(x = income))

Marginal-model plot is used for the marginal goodness-of-fit. The blue line Data is the smoothed line between prestige and income. On the other hand, the red line Model is the smoothed line between the fitted prestige and income. If two lines are not similar, we can doubt that the model does not fit the data well.

Added-variable plot

plot(diag_lm_prestige, type = "av", aes(x = income))

Added-variable plot is used for visualizing the linear effect of an explanatory variable (income in this case) on the response (prestige in this case). The slope of the line is same as the partial regression coefficient of income.

Also, we can say that the income data are not explained by the other explanatory variables if the value of x-axis is far from the center. In this case, there are two data far from the center of x-axis. These data are candidates of influential data so that we need to consider recheck the original data, data transformation, model improvement, etc. Data removal is the final option.

Note that the added-variable plot is misleading if the data show nonlinearity.

Component+Residual plot

plot(diag_lm_prestige, type = "cr", aes(x = income))

The black line shows the partial regression coefficients, and the blue line shows the loess splines between component+residual and income.

Component+Residual plot is used for considering the data transformation. If the linear model is true, the black and blue lines are almost same. If not, we need to consider the data transformation.

lme of lmer model

Mixed-effects models have additional hypotheses from lm model:

Let’s follow the examples in Mixed-Effects Models in S and S-PLUS.

Data & model

We use the same dataset as Chapter 4 of Mixed-Effects Models in S and S-PLUS. Here, we use nlme::lme function but it is almost the same as lme4::lmer.

library(nlme)
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
options(contrasts = c("contr.sum", "contr.poly"))  # The book uses contr.helmert
Orthodont$age_11 <- Orthodont$age - 11
lme_orth <- lme(distance ~ Sex * age_11, Orthodont, random = ~ age_11 | Subject)

Diagnosis object is prepared.

diag_lme_orth <- diagnose(lme_orth)

Measured vs fitted plot

plot(diag_lme_orth, type = "mf")

Residuals vs fitted plot

fig_rf <- plot(diag_lme_orth, type = "rf")

There are some possible outliers. Let’s divide the plot into Sex.

fig_rf + facet_wrap(~ Sex)

As you can see, the variance of the studentized residuals are different in Sex: Male shows higher variance than Female. Let’s check the variability of the residuals in each Sex.

fig_r <- plot(diag_lme_orth, type = "rf", aes(Subject))

fig_r + facet_wrap(~ Sex, scales = "free_x")

The variance of the studentized residuals are different between Sex. These results show we need to consider heteroscedacity in fitting model by setting weight argument of lme.

QQ plot for errors

plot(diag_lme_orth, type = "qq")

In most cases, we can almost say the data follow a normal distribution. However, there are 3 outliers.

QQ plot for random intercept

plot(diag_lme_orth, type = "qq", aes(sample = .ranef.intercept), level = "Subject")

Random intercept almost follows a normal distribution.

QQ plot for random slope

plot(diag_lme_orth, type = "qq", aes(sample = .ranef.age_11), level = "Subject")

Random slope almost follows a normal distribution.