Consider the artificial data in the object df. There are two obvious outliers.
library(tidyverse)
library(broom)
library(palmerpenguins)
library(MASS) # needed for rlm
df <- tibble(x = 1:20, y = c(1:18, 49:50))
ggplot(df, aes(x, y)) +
geom_point(size = 2) + theme_bw(base_size = 12)
Which observations are the outliers?
The outliers are the two last observations in the dataset.
A simple linear model where parameters are estimated using least squares estimate are not robust to outliers. Below we fit two models:
fit_lm, andfit_rlm. To explain briefly, a robust linear model down-weights the contribution of the observations that are outlying observations to estimate parameters.fit_lm <- lm(y ~ x, data = df)
fit_rlm <- rlm(y ~ x, data = df)
What is the class of fit_rlm?
class(fit_rlm)
## [1] "rlm" "lm"
The class of
fit_rlmisrlmandlm
Study the object structure of both fit_lm and fit_rlm. How are the structures different between fit_lm and fit_rlm?
We can use
str()function to examine the structure of these objects.
Open here for the structure of
fit_lm
str(fit_lm)
## List of 12
## $ coefficients : Named num [1:2] -5.53 1.81
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
## $ residuals : Named num [1:20] 4.71 3.9 3.09 2.28 1.47 ...
## ..- attr(*, "names")= chr [1:20] "1" "2" "3" "4" ...
## $ effects : Named num [1:20] -60.374 46.728 1.69 0.972 0.253 ...
## ..- attr(*, "names")= chr [1:20] "(Intercept)" "x" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:20] -3.7143 -1.9023 -0.0902 1.7218 3.5338 ...
## ..- attr(*, "names")= chr [1:20] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:20, 1:2] -4.472 0.224 0.224 0.224 0.224 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:20] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "x"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.22 1.26
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 18
## $ xlevels : Named list()
## $ call : language lm(formula = y ~ x, data = df)
## $ terms :Classes 'terms', 'formula' language y ~ x
## .. ..- attr(*, "variables")= language list(y, x)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "y" "x"
## .. .. .. ..$ : chr "x"
## .. ..- attr(*, "term.labels")= chr "x"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(y, x)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
## $ model :'data.frame': 20 obs. of 2 variables:
## ..$ y: int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ x: int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language y ~ x
## .. .. ..- attr(*, "variables")= language list(y, x)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "y" "x"
## .. .. .. .. ..$ : chr "x"
## .. .. ..- attr(*, "term.labels")= chr "x"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(y, x)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
## - attr(*, "class")= chr "lm"
Open here for the structure of
fit_rlm
str(fit_rlm)
## List of 21
## $ coefficients : Named num [1:2] -0.00064 1.00009
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
## $ residuals : Named num [1:20] 0.000546 0.000452 0.000358 0.000263 0.000169 ...
## ..- attr(*, "names")= chr [1:20] "1" "2" "3" "4" ...
## $ wresid : Named num [1:20] 0.000546 0.000452 0.000358 0.000263 0.000169 ...
## ..- attr(*, "names")= chr [1:20] "1" "2" "3" "4" ...
## $ effects : Named num [1:20] -3.99e+01 2.18e+01 1.92e-04 1.11e-04 2.91e-05 ...
## ..- attr(*, "names")= chr [1:20] "(Intercept)" "x" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:20] 0.999 2 3 4 5 ...
## ..- attr(*, "names")= chr [1:20] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:20, 1:2] -4.227 0.237 0.237 0.237 0.237 ...
## .. ..- attr(*, "assign")= int [1:2] 0 1
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:20] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "x"
## ..$ qraux: num [1:2] 1.24 1.27
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : logi NA
## $ w : num [1:20] 1 1 1 1 1 1 1 1 1 1 ...
## $ s : num 0.00165
## $ psi :function (u, k = 1.345, deriv = 0)
## $ k2 : num 1.34
## $ weights : num [1:20] 1 1 1 1 1 1 1 1 1 1 ...
## $ conv : num [1:11] 0.342 0.1907 0.0879 0.0382 0.0163 ...
## $ converged : logi TRUE
## $ x : num [1:20, 1:2] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:20] "1" "2" "3" "4" ...
## .. ..$ : chr [1:2] "(Intercept)" "x"
## ..- attr(*, "assign")= int [1:2] 0 1
## $ call : language rlm(formula = y ~ x, data = df)
## $ terms :Classes 'terms', 'formula' language y ~ x
## .. ..- attr(*, "variables")= language list(y, x)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "y" "x"
## .. .. .. ..$ : chr "x"
## .. ..- attr(*, "term.labels")= chr "x"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(y, x)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
## $ xlevels : Named list()
## $ model :'data.frame': 20 obs. of 2 variables:
## ..$ y: int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ x: int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language y ~ x
## .. .. ..- attr(*, "variables")= language list(y, x)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "y" "x"
## .. .. .. .. ..$ : chr "x"
## .. .. ..- attr(*, "term.labels")= chr "x"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(y, x)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
## - attr(*, "class")= chr [1:2] "rlm" "lm"
We can see that
fit_lmandfit_rlmhave a different structure.fit_lmhas 12 components/lists, whilstfit_rlmhas 21 components/lists, including weights denoted byw.
Below the augment method from the broom package extracts some model-related values but it does not return the weights from the iterated re-weighted least squares.
# modify augment method to the desired output
dfa <- augment(fit_rlm)
Modify the augment method for the appropriate class so that the code below gets the desired output.
Open here for the desired output for
dfa
dfa
# A tibble: 20 x 7
y x .fitted .resid .hat .sigma w
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0.999 0.000546 0.206 10.3 1
2 2 2 2.00 0.000452 0.172 10.3 1
3 3 3 3.00 0.000358 0.143 10.3 1
4 4 4 4.00 0.000263 0.118 10.3 1
5 5 5 5.00 0.000169 0.0974 10.3 1
6 6 6 6.00 0.0000749 0.0808 10.3 1
7 7 7 7.00 -0.0000193 0.0685 10.3 1
8 8 8 8.00 -0.000114 0.0603 10.3 1
9 9 9 9.00 -0.000208 0.0564 10.3 1
10 10 10 10.0 -0.000302 0.0566 10.3 1
11 11 11 11.0 -0.000396 0.0611 10.3 1
12 12 12 12.0 -0.000490 0.0698 10.3 1
13 13 13 13.0 -0.000585 0.0827 10.3 1
14 14 14 14.0 -0.000679 0.0998 10.3 1
15 15 15 15.0 -0.000773 0.121 10.3 1
16 16 16 16.0 -0.000867 0.147 10.3 1
17 17 17 17.0 -0.000961 0.172 10.3 0.977
18 18 18 18.0 -0.00106 0.187 10.3 0.890
19 49 19 19.0 30.0 0.0000184 7.28 0.0000742
20 50 20 20.0 30.0 0.0000216 7.28 0.0000742
The modification of
augment()method to produce column calledwis done below:
# modify augment method to the desired output
dfa_modif <- augment(fit_rlm) %>%
mutate(w = fit_rlm$w)
DT::datatable(dfa_modif)
Once the above is done, you can the code below to get the plot seen at the end of this question.
ggplot(dfa_modif, aes(x, y, color = w)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = FALSE, color = "#C8008F") +
geom_smooth(method = "rlm", se = FALSE, color = "#027EB6") +
annotate("text", x = 20.5, y = c(21, 31),
label = c("rlm", "lm")) +
theme_bw(base_size = 12) +
labs(title = "Robust vs. non-robust linear models") +
scale_color_continuous(type = "viridis")
What do you think the w is doing?
According to UCLA Statistical Consulting the weight or
win robust linear model gives the value or weigh the observations differently based on how well behaved these observations are. Based on the plot above, the outlier observations have a smaller weight than the normal observation. Thus this weight will keep the regression line on the right track (not pushed upward).