ETC5523: Communicating with Data

🌐️ Exercise 3A

🔧 Question 1

Consider the artificial data in the object df. There are two obvious outliers.

library(tidyverse)
library(broom)
library(palmerpenguins)
library(MASS) # needed for rlm
df <- tibble(x = 1:20, y = c(1:18, 49:50))
ggplot(df, aes(x, y)) + 
  geom_point(size = 2) + theme_bw(base_size = 12)

Which observations are the outliers?

The outliers are the two last observations in the dataset.

🔧 Question 2

A simple linear model where parameters are estimated using least squares estimate are not robust to outliers. Below we fit two models:

A linear model with least square estimates contained in fit_lm, and
A robust linear model contained in fit_rlm. To explain briefly, a robust linear model down-weights the contribution of the observations that are outlying observations to estimate parameters.

fit_lm <- lm(y ~ x, data = df) 
fit_rlm <- rlm(y ~ x, data = df)

What is the class of fit_rlm?

class(fit_rlm)

## [1] "rlm" "lm"

The class of fit_rlm is rlm and lm

🔧 Question 3

Study the object structure of both fit_lm and fit_rlm. How are the structures different between fit_lm and fit_rlm?

We can use str() function to examine the structure of these objects.

Open here for the structure of fit_lm

str(fit_lm)

## List of 12
##  $ coefficients : Named num [1:2] -5.53 1.81
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
##  $ residuals    : Named num [1:20] 4.71 3.9 3.09 2.28 1.47 ...
##   ..- attr(*, "names")= chr [1:20] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:20] -60.374 46.728 1.69 0.972 0.253 ...
##   ..- attr(*, "names")= chr [1:20] "(Intercept)" "x" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:20] -3.7143 -1.9023 -0.0902 1.7218 3.5338 ...
##   ..- attr(*, "names")= chr [1:20] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:20, 1:2] -4.472 0.224 0.224 0.224 0.224 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:20] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "x"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.22 1.26
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 18
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = y ~ x, data = df)
##  $ terms        :Classes 'terms', 'formula'  language y ~ x
##   .. ..- attr(*, "variables")= language list(y, x)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "y" "x"
##   .. .. .. ..$ : chr "x"
##   .. ..- attr(*, "term.labels")= chr "x"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(y, x)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
##  $ model        :'data.frame':   20 obs. of  2 variables:
##   ..$ y: int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ x: int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language y ~ x
##   .. .. ..- attr(*, "variables")= language list(y, x)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "y" "x"
##   .. .. .. .. ..$ : chr "x"
##   .. .. ..- attr(*, "term.labels")= chr "x"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(y, x)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
##  - attr(*, "class")= chr "lm"

Open here for the structure of fit_rlm

str(fit_rlm)

## List of 21
##  $ coefficients : Named num [1:2] -0.00064 1.00009
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
##  $ residuals    : Named num [1:20] 0.000546 0.000452 0.000358 0.000263 0.000169 ...
##   ..- attr(*, "names")= chr [1:20] "1" "2" "3" "4" ...
##  $ wresid       : Named num [1:20] 0.000546 0.000452 0.000358 0.000263 0.000169 ...
##   ..- attr(*, "names")= chr [1:20] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:20] -3.99e+01 2.18e+01 1.92e-04 1.11e-04 2.91e-05 ...
##   ..- attr(*, "names")= chr [1:20] "(Intercept)" "x" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:20] 0.999 2 3 4 5 ...
##   ..- attr(*, "names")= chr [1:20] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:20, 1:2] -4.227 0.237 0.237 0.237 0.237 ...
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:20] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "x"
##   ..$ qraux: num [1:2] 1.24 1.27
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : logi NA
##  $ w            : num [1:20] 1 1 1 1 1 1 1 1 1 1 ...
##  $ s            : num 0.00165
##  $ psi          :function (u, k = 1.345, deriv = 0)  
##  $ k2           : num 1.34
##  $ weights      : num [1:20] 1 1 1 1 1 1 1 1 1 1 ...
##  $ conv         : num [1:11] 0.342 0.1907 0.0879 0.0382 0.0163 ...
##  $ converged    : logi TRUE
##  $ x            : num [1:20, 1:2] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:20] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:2] "(Intercept)" "x"
##   ..- attr(*, "assign")= int [1:2] 0 1
##  $ call         : language rlm(formula = y ~ x, data = df)
##  $ terms        :Classes 'terms', 'formula'  language y ~ x
##   .. ..- attr(*, "variables")= language list(y, x)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "y" "x"
##   .. .. .. ..$ : chr "x"
##   .. ..- attr(*, "term.labels")= chr "x"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(y, x)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
##  $ xlevels      : Named list()
##  $ model        :'data.frame':   20 obs. of  2 variables:
##   ..$ y: int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ x: int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language y ~ x
##   .. .. ..- attr(*, "variables")= language list(y, x)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "y" "x"
##   .. .. .. .. ..$ : chr "x"
##   .. .. ..- attr(*, "term.labels")= chr "x"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(y, x)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
##  - attr(*, "class")= chr [1:2] "rlm" "lm"

We can see that fit_lm and fit_rlm have a different structure. fit_lm has 12 components/lists, whilst fit_rlm has 21 components/lists, including weights denoted by w.

🔧 Question 4

Below the augment method from the broom package extracts some model-related values but it does not return the weights from the iterated re-weighted least squares.

# modify augment method to the desired output
dfa <- augment(fit_rlm)

Modify the augment method for the appropriate class so that the code below gets the desired output.

Open here for the desired output for dfa

dfa
# A tibble: 20 x 7
       y     x .fitted     .resid      .hat .sigma         w
   <int> <int>   <dbl>      <dbl>     <dbl>  <dbl>     <dbl>
 1     1     1   0.999  0.000546  0.206      10.3  1        
 2     2     2   2.00   0.000452  0.172      10.3  1        
 3     3     3   3.00   0.000358  0.143      10.3  1        
 4     4     4   4.00   0.000263  0.118      10.3  1        
 5     5     5   5.00   0.000169  0.0974     10.3  1        
 6     6     6   6.00   0.0000749 0.0808     10.3  1        
 7     7     7   7.00  -0.0000193 0.0685     10.3  1        
 8     8     8   8.00  -0.000114  0.0603     10.3  1        
 9     9     9   9.00  -0.000208  0.0564     10.3  1        
10    10    10  10.0   -0.000302  0.0566     10.3  1        
11    11    11  11.0   -0.000396  0.0611     10.3  1        
12    12    12  12.0   -0.000490  0.0698     10.3  1        
13    13    13  13.0   -0.000585  0.0827     10.3  1        
14    14    14  14.0   -0.000679  0.0998     10.3  1        
15    15    15  15.0   -0.000773  0.121      10.3  1        
16    16    16  16.0   -0.000867  0.147      10.3  1        
17    17    17  17.0   -0.000961  0.172      10.3  0.977    
18    18    18  18.0   -0.00106   0.187      10.3  0.890    
19    49    19  19.0   30.0       0.0000184   7.28 0.0000742
20    50    20  20.0   30.0       0.0000216   7.28 0.0000742

The modification of augment() method to produce column called w is done below:

# modify augment method to the desired output
dfa_modif <- augment(fit_rlm) %>%
  mutate(w = fit_rlm$w)

DT::datatable(dfa_modif)

🔧 Question 5

Once the above is done, you can the code below to get the plot seen at the end of this question.

ggplot(dfa_modif, aes(x, y, color = w)) + 
  geom_point(size = 2) + 
  geom_smooth(method = "lm", se = FALSE, color = "#C8008F") +
  geom_smooth(method = "rlm", se = FALSE, color = "#027EB6") + 
  annotate("text", x = 20.5, y = c(21, 31), 
           label = c("rlm", "lm")) + 
  theme_bw(base_size = 12) + 
  labs(title = "Robust vs. non-robust linear models") + 
  scale_color_continuous(type = "viridis")

What do you think the w is doing?

According to UCLA Statistical Consulting the weight or w in robust linear model gives the value or weigh the observations differently based on how well behaved these observations are. Based on the plot above, the outlier observations have a smaller weight than the normal observation. Thus this weight will keep the regression line on the right track (not pushed upward).