Source of the code and discussion:
Check your outliers! An introduction to identifying statistical outliers in R with easystats
library(performance)
# Create some artificial outliers and an ID column
data <- rbind(mtcars[1:4], 12, 55)
data <- cbind(car = row.names(data), data)
Check outliers using the robust z-score method:
outliers <- check_outliers(data, method = "zscore_robust", ID = "car")
outliers
1 outlier detected: case 34.
- Based on the following method and threshold: zscore_robust (3.291).
- For variables: mpg, cyl, disp, hp.
-----------------------------------------------------------------------------
The following observations were considered outliers for two or more
variables by at least one of the selected methods:
  Row car n_Zscore_robust
1  34  34               2
-----------------------------------------------------------------------------
Outliers per variable (zscore_robust):
$mpg
   Row car Distance_Zscore_robust
34  34  34               6.271888
$cyl
   Row car Distance_Zscore_robust
34  34  34               16.52502
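The distances reported above can be reproduced by hand, which helps make the method concrete. The following is a minimal sketch in base R. It assumes that the robust z-score is the absolute deviation from the median divided by the MAD (as returned by `stats::mad()`), and that the 3.291 threshold corresponds to `qnorm(1 - 0.001 / 2)`. Both are assumptions about the method, not a copy of the package's internals, and the helper name `robust_z` is hypothetical.

```r
# Rebuild the example data: first four mtcars columns plus two artificial rows
data <- rbind(mtcars[1:4], 12, 55)

# Hypothetical helper: absolute deviation from the median, in MAD units
robust_z <- function(x) abs(x - median(x)) / mad(x)

# Assumed threshold: two-sided normal quantile for p = 0.001
threshold <- qnorm(1 - 0.001 / 2)

z <- sapply(data, robust_z)      # matrix with one column of distances per variable
z[34, c("mpg", "cyl")]           # distances for the second artificial row
which(z[, "mpg"] > threshold)    # rows flagged on mpg
```

Under these assumptions, the values for row 34 should agree with the `Distance_Zscore_robust` entries printed above.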
which(outliers) # get row index of outlier
[1] 34
data_clean <- data[-which(outliers), ] # remove outlier from data
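One caveat with negative indexing: when no outliers are found, `which()` returns an empty vector, and `data[-integer(0), ]` selects zero rows rather than all of them. Below is a minimal sketch of a safer pattern in base R. It uses a plain logical vector `flag` as a stand-in (an assumption) for the result of `check_outliers()`, and a toy data frame `df`.

```r
df <- data.frame(x = 1:5)
flag <- rep(FALSE, 5)                     # stand-in: no outliers detected

nrow(df[-which(flag), , drop = FALSE])    # negative indexing with an empty index
nrow(df[!flag, , drop = FALSE])           # logical indexing keeps every row

# Guarded version of the removal step
idx <- which(flag)
df_clean <- if (length(idx) > 0) df[-idx, , drop = FALSE] else df
```

The guard only matters when the outlier set can be empty, for example when the same script is rerun on new data.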
Plot outliers:
library(see)
plot(outliers) +
ggplot2::theme(axis.text.x = ggplot2::element_text(
angle = 45, size = 8
))
Check outliers using a robust version of the Mahalanobis distance, the minimum covariance determinant (MCD) method:
outliers <- check_outliers(data, method = "mcd")
outliers
2 outliers detected: cases 33, 34.
- Based on the following method and threshold: mcd (20).
- For variables: mpg, cyl, disp, hp.
plot(outliers) +
ggplot2::theme(axis.text.x = ggplot2::element_text(
angle = 45, size = 8
))
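To make the MCD approach more concrete, here is a rough sketch of a robust Mahalanobis distance computed by hand. It assumes the MASS package is installed and uses `MASS::cov.mcd()` for the robust center and covariance. The chi-squared cutoff with `alpha = 0.001` is an illustrative assumption and need not equal the threshold reported above. Because the MCD estimator relies on random subsampling, results can vary slightly between runs unless a seed is set.

```r
library(MASS)

data <- rbind(mtcars[1:4], 12, 55)   # same artificial data as above
set.seed(123)                        # cov.mcd() draws random subsets

mcd <- cov.mcd(data)                 # robust center and covariance estimates
d2 <- mahalanobis(data, center = mcd$center, cov = mcd$cov)  # squared distances

# Illustrative cutoff (assumption): chi-squared quantile, one df per variable
cutoff <- qchisq(1 - 0.001, df = ncol(data))
which(d2 > cutoff)
```

The flagged set from this sketch may differ from the package output, since the cutoff and the subsampling details are not taken from the source.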
Check outliers in a regression model using Cook's distance:
model <- lm(mpg ~ disp * hp, data = data)
outliers <- check_outliers(model, method = "cook")
outliers
1 outlier detected: case 33.
- Based on the following method and threshold: cook (0.806).
- For variable: (Whole model).
plot(outliers)
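For comparison, Cook's distances can be obtained directly with base R's `cooks.distance()`. The sketch below refits the model and applies the 0.806 threshold reported in the output above. Treating that printed value as a fixed cutoff is a simplifying assumption.

```r
data <- rbind(mtcars[1:4], 12, 55)            # same artificial data as above
model <- lm(mpg ~ disp * hp, data = data)

cd <- unname(cooks.distance(model))           # one influence value per row
head(order(cd, decreasing = TRUE), 3)         # rows with the largest influence
which(cd > 0.806)                             # rows above the reported threshold
```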
Combine several methods in a single call:
outliers <- check_outliers(model, method = c("zscore_robust", "mcd", "cook"))
outliers
2 outliers detected: cases 33, 34.
- Based on the following methods and thresholds: zscore_robust (3.291),
mcd (20), cook (0.806).
- For variable: (Whole model).
Note: Outliers were classified as such by at least half of the selected methods.
-----------------------------------------------------------------------------
The following observations were considered outliers for two or more
variables by at least one of the selected methods:
  Row n_Zscore_robust          n_MCD         n_Cook
1  34               1 (Multivariate)              0
2  31               0 (Multivariate)              0
3  33               0 (Multivariate) (Multivariate)
which(outliers)
[1] 33 34
attributes(outliers)$outlier_var$zscore_robust
$mpg
   Row Distance_Zscore_robust
34  34               6.271888
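The rule quoted in the note, that a case counts as an outlier when at least half of the selected methods flag it, can be illustrated with a few lines of base R. This sketch hard-codes the per-method flags from the table above. Reading any nonzero count or "(Multivariate)" entry as a flag is an interpretation of the printed output, not the package's code, and the object name `votes` is hypothetical.

```r
# One row per case listed in the table; TRUE means the method flagged it
votes <- data.frame(
  zscore_robust = c(TRUE, FALSE, FALSE),
  mcd           = c(TRUE, TRUE, TRUE),
  cook          = c(FALSE, FALSE, TRUE),
  row.names     = c("34", "31", "33")
)

share <- rowMeans(votes)              # proportion of methods flagging each case
composite <- share >= 0.5             # "at least half of the selected methods"
sort(as.integer(names(composite)[composite]))
```

Case 31 is flagged by only one of three methods, so it does not reach the one-half mark, which matches the two cases returned by `which(outliers)`.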
Removing outliers that do not belong to the distribution of interest can be a valid strategy in this case; ideally, one would report results both with and without the outliers to show how much they affect the conclusions. Removal, however, can reduce statistical power. Some authors therefore propose recoding instead, namely winsorization: bringing outliers back within acceptable limits (e.g., three MADs; Tukey & McLaughlin, 1963). Where possible, though, it is recommended to collect enough data that sufficient statistical power remains even after removing outliers, without having to resort to winsorization (Leys et al., 2019).
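Reporting results with and without outliers can be as simple as fitting the same model twice. The sketch below is one possible way to do this in base R. It assumes the two artificial rows (33 and 34) are the cases to exclude, and the object names `fit_all` and `fit_clean` are hypothetical.

```r
data <- rbind(mtcars[1:4], 12, 55)                 # same artificial data as above

fit_all   <- lm(mpg ~ disp * hp, data = data)                # all rows
fit_clean <- lm(mpg ~ disp * hp, data = data[-c(33, 34), ])  # outliers removed

# Side-by-side coefficients to gauge the outliers' impact
comparison <- cbind(with_outliers = coef(fit_all),
                    without_outliers = coef(fit_clean))
round(comparison, 4)
```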
data[33:34, 2:3] # see the outlier rows
   mpg cyl
33  12  12
34  55  55
# Winsorizing using the MAD
library(datawizard)
winsorized_data <- winsorize(data, method = "zscore", robust = TRUE, threshold = 3)
# Values beyond the median +/- 3 MADs have been winsorized
winsorized_data[33:34, 2:3]
        mpg     cyl
33 12.00000 12.0000
34 36.32403 14.8956
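The winsorized values above are consistent with clamping each variable to the interval median ± 3 MADs. Here is a minimal sketch of that computation in base R, assuming this is what the robust z-score option amounts to. The helper `winsorize_mad` is hypothetical and is not part of datawizard.

```r
data <- rbind(mtcars[1:4], 12, 55)   # same artificial data as above

# Hypothetical helper: clamp values to median +/- k * MAD
winsorize_mad <- function(x, k = 3) {
  center <- median(x)
  spread <- mad(x)                   # scaled MAD (default constant 1.4826)
  pmin(pmax(x, center - k * spread), center + k * spread)
}

manual <- as.data.frame(lapply(data, winsorize_mad))
manual[33:34, c("mpg", "cyl")]
```

Row 33 is left unchanged because 12 lies within three MADs of the median for both variables, whereas the value 55 in row 34 is pulled back to the upper limit.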