Once a regression model has been fitted, such as with lm or
lme4::lmer, the next step is to carry out diagnostics. Some
packages provide plotting functions for this purpose (e.g.
lme4::plot.merMod or
performance::check_model).
A new package, diagnoser, provides various diagnostic
plots built on ggplot2. Here, I will show how to use the
diagnoser package.
Currently, diagnoser is available only from GitHub.
devtools::install_github("i-kiwamu/diagnoser")
lm model
There are various diagnostics for multiple regression with numerical and
factorial variables, and the excellent car package provides many
diagnostic plotting functions. Let’s follow the examples in An R Companion to
Applied Regression using the diagnoser package.
We use the same dataset as Chapter 8 of An R Companion to Applied Regression.
library(car)
## Loading required package: carData
Prestige$type <- factor(Prestige$type, levels = c("bc", "wc", "prof"))
lm_prestige <- lm(prestige ~ education + income + type, Prestige)
brief(lm_prestige)
## (Intercept) education income typewc typeprof
## Estimate -0.623 3.673 0.001013 -2.74 6.04
## Std. Error 5.228 0.641 0.000221 2.51 3.87
##
## Residual SD = 7.09 on 93 df, R-squared = 0.835
Now the diagnosis object is prepared.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::recode() masks car::recode()
## ✖ purrr::some() masks car::some()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(diagnoser)
diag_lm_prestige <- diagnose(lm_prestige)
plot(diag_lm_prestige, type = "mf")
The measured-fitted plot is used to assess goodness of fit.
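As a numeric companion to this plot, the squared correlation between the measured and fitted values reproduces the R-squared reported above (a quick base-R check, not part of diagnoser):
mf <- model.frame(lm_prestige)  # rows with missing type are already dropped here
cor(mf$prestige, fitted(lm_prestige))^2  # should match R-squared = 0.835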
plot(diag_lm_prestige, type = "rf")
The residuals-fitted plot is used for checking the residuals for outliers and systematic patterns. Here, the y-axis shows the studentized residuals, so possible outliers can be flagged where the studentized residuals are > 2 or < -2. In this case, we can find 4 possible outliers.
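To list those outliers numerically, base R’s rstudent() can be used alongside the plot (a quick check, not part of diagnoser):
rstud <- rstudent(lm_prestige)
rstud[abs(rstud) > 2]  # observations flagged as possible outliers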
Also, if the plot shows a pattern, we should consider a transformation, weighting, a hierarchical model, etc.
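As an illustrative sketch of the weighting option (the weights below are a hypothetical choice, not derived from these data):
# downweight observations assumed to have larger variance; 1/income is only
# a placeholder to show the mechanics of the weights argument
lm_w <- lm(prestige ~ education + income + type, Prestige,
           weights = 1 / Prestige$income)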
plot(diag_lm_prestige, type = "qq")
The QQ plot is used for checking the normality of the residuals. In this case, the normality assumption appears reasonable.
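A formal test can complement the visual check, for example base R’s Shapiro-Wilk test applied to the studentized residuals (again not part of diagnoser):
shapiro.test(rstudent(lm_prestige))  # a small p-value would cast doubt on normality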
plot(diag_lm_prestige, type = "sl")
The scale-location plot is also used for finding outliers and for checking whether the residual variance is constant.
plot(diag_lm_prestige, type = "cook")
Cook’s distance is used for identifying influential observations. An observation is potentially influential on the model if its Cook’s distance is higher than 1.
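To inspect the distances numerically (base R):
cd <- cooks.distance(lm_prestige)
which(cd > 1)                      # observations above the rule-of-thumb cutoff
head(sort(cd, decreasing = TRUE))  # the most influential observations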
plot(diag_lm_prestige, type = "mm", aes(x = income))
The marginal-model plot is used to assess marginal goodness of fit. The
blue line, labelled Data, is a smooth of
prestige against income; the
red line, labelled Model, is a smooth of the
fitted prestige values against income.
If the two lines are not similar, we may doubt that the model fits
the data well.
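For a cross-check against car’s own implementation, marginalModelPlots() should produce the analogous base-graphics plot for income (the call below assumes car’s current API):
marginalModelPlots(lm_prestige, ~ income)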
plot(diag_lm_prestige, type = "av", aes(x = income))
The added-variable plot is used for visualizing the linear effect of an
explanatory variable (income in this case) on the response
(prestige in this case). The slope of the line is the same as
the partial regression coefficient of income.
Also, points far from the center of the x-axis correspond to
income values that are poorly explained by the other
explanatory variables. In this case, there are two points far from the
center of the x-axis. These are candidates for influential data, so we
should consider rechecking the original data, transforming the data,
improving the model, etc. Removing data is the final option.
Note that the added-variable plot is misleading if the data show nonlinearity.
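The slope claim above can be verified directly via the Frisch-Waugh idea: regress the residuals of prestige (with income omitted) on the residuals of income, each taken from a model containing the remaining predictors (a quick base-R check):
e_y <- residuals(lm(prestige ~ education + type, Prestige))  # response, income excluded
e_x <- residuals(lm(income ~ education + type, Prestige))    # income vs the others
coef(lm(e_y ~ e_x))[2]  # equals coef(lm_prestige)["income"]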
plot(diag_lm_prestige, type = "cr", aes(x = income))
The black line shows the partial regression fit (its slope is the
partial regression coefficient), and the blue line shows a loess smooth
of component+residual against income.
The component+residual plot is used for judging whether a transformation is needed. If the linear model is adequate, the black and blue lines are almost the same. If not, we should consider transforming the data.
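For instance, one candidate here is a log transformation of income, as used elsewhere in An R Companion to Applied Regression (shown only as a sketch; whether it is appropriate should be judged from the plot):
lm_prestige_log2 <- lm(prestige ~ education + log2(income) + type, Prestige)
brief(lm_prestige_log2)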
lme or lmer model
Mixed-effects models have additional assumptions beyond the lm
model: the within-group errors are normally distributed with constant
variance, and the random effects themselves follow a normal distribution.
Let’s follow the examples in Mixed-Effects Models in S and S-PLUS;
we use the same dataset as its Chapter 4. Here, we use the
nlme::lme function, but the usage is almost the same with
lme4::lmer.
library(nlme)
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
options(contrasts = c("contr.sum", "contr.poly")) # The book uses contr.helmert
Orthodont$age_11 <- Orthodont$age - 11
lme_orth <- lme(distance ~ Sex * age_11, Orthodont, random = ~ age_11 | Subject)
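For reference, a sketch of the equivalent model in lme4 (assuming lme4 is installed; the rest of this post keeps using the lme fit):
library(lme4)  # note: lme4 masks some nlme functions
lmer_orth <- lmer(distance ~ Sex * age_11 + (age_11 | Subject), Orthodont)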
The diagnosis object is prepared in the same way.
diag_lme_orth <- diagnose(lme_orth)
plot(diag_lme_orth, type = "mf")
fig_rf <- plot(diag_lme_orth, type = "rf")
There are some possible outliers. Let’s facet the plot by Sex.
fig_rf + facet_wrap(~ Sex)
As you can see, the variance of the studentized residuals differs
between the sexes: Male shows higher variance than
Female. Let’s check the variability of the residuals within
each Sex.
fig_r <- plot(diag_lme_orth, type = "rf", aes(Subject))
fig_r + facet_wrap(~ Sex, scales = "free_x")
The variance of the studentized residuals differs between the sexes.
These results show that we should account for heteroscedasticity when
fitting the model by setting the weights argument of lme.
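A sketch of such a fit, using nlme’s varIdent variance function to allow a separate residual variance per Sex:
lme_orth_het <- update(lme_orth, weights = varIdent(form = ~ 1 | Sex))
anova(lme_orth, lme_orth_het)  # likelihood-ratio comparison of the two fits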
plot(diag_lme_orth, type = "qq")
In most cases we can say the data approximately follow a normal distribution; however, there are 3 outliers.
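These can be located numerically with the standardized (Pearson) residuals from nlme (a base check, not part of diagnoser):
r_p <- residuals(lme_orth, type = "pearson")
Orthodont[abs(r_p) > 2, ]  # rows flagged as possible outliers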
plot(diag_lme_orth, type = "qq", aes(sample = .ranef.intercept), level = "Subject")
The random intercepts approximately follow a normal distribution.
plot(diag_lme_orth, type = "qq", aes(sample = .ranef.age_11), level = "Subject")
The random slopes approximately follow a normal distribution.
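The random effects behind these two QQ plots can also be inspected directly:
re <- ranef(lme_orth)  # one row per Subject, columns "(Intercept)" and "age_11"
head(re)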