install.packages(c("broom", "DT", "kableExtra", "palmerpenguins"))
Working with robust linear models
Consider the artificial data in the object df. There are two obvious outliers. Which observations are the outliers? >> Outliers are observations with value of 49 and 50.
library(emo)
library(tidyverse)
library(broom)
library(palmerpenguins)
library(MASS) # needed for rlm
df <- tibble(x = 1:20, y = c(1:18, 49:50))
ggplot(df, aes(x, y)) +
geom_point(size = 2) + theme_bw(base_size = 12)
A simple linear model where parameters are estimated using least squares estimate are not robust to outliers. Below we fit two models: (1) a linear model with least square estimates contained in fit_lm and (2) a robust linear model contained in fit_rlm. To explain briefly, a robust linear model down-weights the contribution of the observations that are outlying observations to estimate parameters.
fit_lm <- lm(y ~ x, data = df)
fit_rlm <- rlm(y ~ x, data = df)
class(fit_rlm)
## [1] "rlm" "lm"
What is the class of fit_rlm? Study the object structure of both fit_lm and fit_rlm. How are the structures different between fit_lm and fit_rlm? >> Class of fit_rlm is “rlm” and “lm”. In terms of structures, fit_rlm has variable of w which down-weighting for each observation (hence it can be used to take care of outliers), where fit_lm does not have w.
Below the augment method from the broom package extracts some model-related values but it does not return the weights from the iterated re-weighted least squares. Modify the augment method for the appropriate class so that the code below gets the desired output.
# modify augment method to the desired output
dfa <- augment(fit_rlm) %>%
mutate(w = fit_rlm$w)
dfa
## # A tibble: 20 x 7
## y x .fitted .resid .hat .sigma w
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 0.999 0.000546 0.206 10.3 1
## 2 2 2 2.00 0.000452 0.172 10.3 1
## 3 3 3 3.00 0.000358 0.143 10.3 1
## 4 4 4 4.00 0.000263 0.118 10.3 1
## 5 5 5 5.00 0.000169 0.0974 10.3 1
## 6 6 6 6.00 0.0000749 0.0808 10.3 1
## 7 7 7 7.00 -0.0000193 0.0685 10.3 1
## 8 8 8 8.00 -0.000114 0.0603 10.3 1
## 9 9 9 9.00 -0.000208 0.0564 10.3 1
## 10 10 10 10.0 -0.000302 0.0566 10.3 1
## 11 11 11 11.0 -0.000396 0.0611 10.3 1
## 12 12 12 12.0 -0.000490 0.0698 10.3 1
## 13 13 13 13.0 -0.000585 0.0827 10.3 1
## 14 14 14 14.0 -0.000679 0.0998 10.3 1
## 15 15 15 15.0 -0.000773 0.121 10.3 1
## 16 16 16 16.0 -0.000867 0.147 10.3 1
## 17 17 17 17.0 -0.000961 0.172 10.3 0.977
## 18 18 18 18.0 -0.00106 0.187 10.3 0.890
## 19 49 19 19.0 30.0 0.0000184 7.28 0.0000742
## 20 50 20 20.0 30.0 0.0000216 7.28 0.0000742
Once the above is done, you can the code below to get the plot seen at the end of this question. What do you think the w is doing? >> w gives weight for all observations based on their normality, hence outliers will be given small weight. Fit_rlm model captures all observations better and disregards outliers.
ggplot(dfa, aes(x, y, color = w)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = FALSE, color = "#C8008F") +
geom_smooth(method = "rlm", se = FALSE, color = "#027EB6") +
annotate("text", x = 20.5, y = c(21, 31),
label = c("rlm", "lm")) +
theme_bw(base_size = 12) +
labs(title = "Robust vs. non-robust linear models") +
scale_color_continuous(type = "viridis")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
knitr::include_graphics("images/tutorial03-rlm-1.png")