Objectives

Preparation

  1. Install R-packages
install.packages(c("broom", "DT", "kableExtra", "palmerpenguins"))

🌐️ Exercise 3A

Working with robust linear models

Consider the artificial data in the object df. There are two obvious outliers. Which observations are the outliers? >> Outliers are observations with value of 49 and 50.

library(emo)
library(tidyverse)
library(broom)
library(palmerpenguins)
library(MASS) # needed for rlm

df <- tibble(x = 1:20, y = c(1:18, 49:50))
ggplot(df, aes(x, y)) + 
  geom_point(size = 2) + theme_bw(base_size = 12)

A simple linear model where parameters are estimated using least squares estimate are not robust to outliers. Below we fit two models: (1) a linear model with least square estimates contained in fit_lm and (2) a robust linear model contained in fit_rlm. To explain briefly, a robust linear model down-weights the contribution of the observations that are outlying observations to estimate parameters.

fit_lm <- lm(y ~ x, data = df) 
fit_rlm <- rlm(y ~ x, data = df) 

class(fit_rlm)
## [1] "rlm" "lm"

What is the class of fit_rlm? Study the object structure of both fit_lm and fit_rlm. How are the structures different between fit_lm and fit_rlm? >> Class of fit_rlm is “rlm” and “lm”. In terms of structures, fit_rlm has variable of w which down-weighting for each observation (hence it can be used to take care of outliers), where fit_lm does not have w.

Below the augment method from the broom package extracts some model-related values but it does not return the weights from the iterated re-weighted least squares. Modify the augment method for the appropriate class so that the code below gets the desired output.

# modify augment method to the desired output
dfa <- augment(fit_rlm) %>% 
  mutate(w = fit_rlm$w)

dfa
## # A tibble: 20 x 7
##        y     x .fitted     .resid      .hat .sigma         w
##    <int> <int>   <dbl>      <dbl>     <dbl>  <dbl>     <dbl>
##  1     1     1   0.999  0.000546  0.206      10.3  1        
##  2     2     2   2.00   0.000452  0.172      10.3  1        
##  3     3     3   3.00   0.000358  0.143      10.3  1        
##  4     4     4   4.00   0.000263  0.118      10.3  1        
##  5     5     5   5.00   0.000169  0.0974     10.3  1        
##  6     6     6   6.00   0.0000749 0.0808     10.3  1        
##  7     7     7   7.00  -0.0000193 0.0685     10.3  1        
##  8     8     8   8.00  -0.000114  0.0603     10.3  1        
##  9     9     9   9.00  -0.000208  0.0564     10.3  1        
## 10    10    10  10.0   -0.000302  0.0566     10.3  1        
## 11    11    11  11.0   -0.000396  0.0611     10.3  1        
## 12    12    12  12.0   -0.000490  0.0698     10.3  1        
## 13    13    13  13.0   -0.000585  0.0827     10.3  1        
## 14    14    14  14.0   -0.000679  0.0998     10.3  1        
## 15    15    15  15.0   -0.000773  0.121      10.3  1        
## 16    16    16  16.0   -0.000867  0.147      10.3  1        
## 17    17    17  17.0   -0.000961  0.172      10.3  0.977    
## 18    18    18  18.0   -0.00106   0.187      10.3  0.890    
## 19    49    19  19.0   30.0       0.0000184   7.28 0.0000742
## 20    50    20  20.0   30.0       0.0000216   7.28 0.0000742

Once the above is done, you can the code below to get the plot seen at the end of this question. What do you think the w is doing? >> w gives weight for all observations based on their normality, hence outliers will be given small weight. Fit_rlm model captures all observations better and disregards outliers.

ggplot(dfa, aes(x, y, color = w)) + 
  geom_point(size = 2) + 
  geom_smooth(method = "lm", se = FALSE, color = "#C8008F") +
  geom_smooth(method = "rlm", se = FALSE, color = "#027EB6") + 
  annotate("text", x = 20.5, y = c(21, 31), 
           label = c("rlm", "lm")) + 
  theme_bw(base_size = 12) + 
  labs(title = "Robust vs. non-robust linear models") + 
  scale_color_continuous(type = "viridis")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

knitr::include_graphics("images/tutorial03-rlm-1.png")