suppressPackageStartupMessages( require(oetteR) )
suppressPackageStartupMessages( require(tidyverse) )
suppressPackageStartupMessages( require(caret) )
suppressPackageStartupMessages( require(corrplot) )
suppressPackageStartupMessages( require(recipes) )

1 Introduction

We will try out several regression models that are supported by caret and presented in Applied Predictive Modelling.

In this POC we will introduce some benchmark datasets and look into robust linear regression models.

2 Data

The mlbench package includes several simulated datasets that can be used as benchmarks.

set.seed(1)

df = tibble( data = list(mlbench::mlbench.friedman1( 1000 )
                         , mlbench::mlbench.friedman2( 1000 )
                         , mlbench::mlbench.friedman3( 1000 ) 
                        )
  ) %>%
  mutate( x = map(data, 'x')
          , y = map(data, 'y')
          , x = map( x, as_tibble )
          , y = map( y, function(z) tibble(resp = z) )
          , data = map2( y ,x, bind_cols) 
          ) %>%
  mutate( data_name = c('Friedman 1', 'Friedman 2', 'Friedman 3') ) %>%
  select( data_name, data )

3 Adding outliers

The datasets are designed not to carry any outliers or collinear variables.

We would like to test the outlier sensitivity of the various models and therefore add some outlying observations to the data. For dataset 1 we duplicate some of the observations with the highest response values, downscale two of the best predictor variables (V4, V5) and increase the response of these duplicates by 50%.

To increase the impact of the outliers we reduce the number of regular samples to 300.

set.seed(1)

data = df$data[[1]] 

mods = data %>%
  filter( resp > max(resp) * 0.9 ) %>%
  mutate( V4 = V4 * 0.1
          , V5 = V5 * 0.1
          , resp = resp * 1.5 )
  

data_mod = data %>%
  sample_n(300) %>%
  bind_rows(mods)

df = df %>%
  bind_rows( tibble( data = list(data_mod), data_name = 'Friedman 1 - outlier') )

We will also add a classical example of outlier-ridden data, the hill racing dataset (MASS::hills), where the variables dist and climb are used to predict time.

data = MASS::hills %>%
  as_tibble()

df = df %>%
  bind_rows( tibble ( data_name = 'Hill Racing', data = list(data) ) )

4 Adding Collinearity

To add collinearity we will add an eleventh variable, V11, that is inversely correlated with variable V4.

data = df$data[[1]] %>%
  mutate( V11 = 1-V4 )

df = df %>%
  bind_rows( tibble( data = list(data), data_name = 'Friedman 1 - colinear') )

Add response variable name to modelling dataframe

df = df %>%
  mutate( response = ifelse( startsWith(data_name, 'Fried'), 'resp', 'time') )

5 X-Y Plots

df_plot = df %>%
  mutate( p = map2(data, response,  function(data, resp) gather( data, key = 'key', value = 'value', - resp ) ) 
          , p  = map2(p, response, function(p, resp)  oetteR::f_plot_pretty_points( p, 'value', as.character(resp),'key', scales = 'free_x') )
          , p  = map(p, function(x) x + geom_smooth() + geom_rug() + ggpubr::stat_cor() )
  )
## [1] "Number of excluded observations: 0"
## [1] "Number of excluded observations: 0"
## [1] "Number of excluded observations: 0"
## [1] "Number of excluded observations: 0"
## [1] "Number of excluded observations: 0"
## [1] "Number of excluded observations: 0"
df_plot$p
(Output: six scatter-plot grids, one per dataset, plotting each predictor against the response with a smoothing line, rug marks, and the correlation coefficient.)

6 Histograms

df_plot = df %>%
  mutate( hist = map(data, gather, key = 'key', value = 'value')
          , hist = map( hist, f_clean_data_no_changes )
          , hist = map( hist, function(x) f_plot_hist('value', x , add = 'none')  )
          , hist = map( hist, function(x) x = x + facet_wrap(~key, scales = 'free') )
  ) 


df_plot$hist
(Output: six histogram grids, one per dataset, faceted by variable.)

7 Correlations

df_plot = df_plot %>%
  mutate( cor = map( data , cor)
          , corrplot = map(cor, corrplot, method = 'number') )

8 Summary Friedman Datasets

There is no collinearity between any of the variables.

8.1 Dataset 1

All variables are uniformly distributed and on the same scale.
- V4, V5 correlate with the response in a linear fashion
- V1, V2 correlate almost linearly but adopt a non-linear trend in the higher ranges
- V3 has a clearly non-linear correlation that cannot be described or exploited via linear fits

8.2 Dataset 2

The response variable looks beta or gamma distributed with a right skew.
The variables are not on the same scale.
- V2, V3 correlate with the response in a linear fashion

8.3 Dataset 3

The response variable is skewed to the left. The variables are not on the same scale.
- V1 has a linear correlation with the response
- V2 has a semi-linear correlation
- V3 has a non-linear correlation

9 Transformations

9.1 Scale and center predictor variables

df = df %>%
  mutate( formula = map( response, paste, '~ .' )
          , rec = map2(data, formula, recipe)
          , rec = map(rec, step_scale, all_predictors() )
          , rec = map(rec, step_center, all_predictors() )
          , rec = map2(rec, data, prep)
          , data  = map2(rec, data, bake) ) %>%
  select( - formula, - rec )

10 Feature Selection for linear regression using lasso

df_lasso = df %>%
  mutate(  formula = map( data, names )
          , formula = map2( formula, response,  function(x,y) x[ x != y] )
          , formula = map( formula, paste, collapse = '+')
          , formula = map2( formula, response, function(x,y) paste( y , '~', x) )
          , formula = map( formula, as.formula )
          ) %>%
  mutate( lasso = map2( data, formula, f_train_lasso, k = 10, p = NULL)
          , lasso_formula = map_chr(lasso, 'formula_str_lambda_1se') 
          )
## [1] "Progress: 1 / 1 ; 100 % ; ETA: 0 min"
## [1] "Progress: 1 / 1 ; 100 % ; ETA: 0 min"
## [1] "Progress: 1 / 1 ; 100 % ; ETA: 0 min"
## [1] "Progress: 1 / 1 ; 100 % ; ETA: 0 min"
## [1] "Progress: 1 / 1 ; 100 % ; ETA: 0 min"
## [1] "Progress: 1 / 1 ; 100 % ; ETA: 0 min"
select( df_lasso, data_name, lasso_formula )
## # A tibble: 6 x 2
##   data_name             lasso_formula           
##   <chr>                 <chr>                   
## 1 Friedman 1            resp ~ V1 + V2 + V4 + V5
## 2 Friedman 2            resp ~ V2 + V3          
## 3 Friedman 3            resp ~ V1 + V2 + V3     
## 4 Friedman 1 - outlier  resp ~ V1 + V2 + V4     
## 5 Hill Racing           time ~ dist + climb     
## 6 Friedman 1 - colinear resp ~ V1 + V2 + V4 + V5

We find that the lasso selects exactly the variables that our previous analysis confirmed as good linear predictors. The lasso also does not mind the collinear variable V11 that we added and does not select it. There are, however, non-linear qualities in some of the variables; V3 in dataset 1, which has no value as a linear predictor, could still be exploited by a model that is able to capture non-linear relationships.

We also find that it selects fewer variables in the reduced Friedman 1 dataset with the added outliers.

11 Rsample

We are going to add an rsample object to our modelling dataframe. We will choose 10 x 10 cross validation pairs. Each time the data is split into cross validation pairs, the results depend on the specific splits and vary every time we repeat the splitting. By repeating it 10 times we are much more likely to obtain reproducible results and can be more certain about the error estimates we are calculating.

df = df %>%
  mutate( cv = map( data, rsample::vfold_cv, v = 10, repeats = 10)
          , cv = map( cv, rsample::rsample2caret) )

12 Wrapper for caret

car = function( formula, rsample, method, data, grid){
  
  # wrapper around caret::train that reuses the rsample cross validation
  # indices and returns the resampled predictions as a tibble
  if( all( is.na(grid) ) ) grid = NULL
  
  car = caret::train( formula
                      , data = data
                      , method = method
                      , tuneGrid = grid
                      , trControl = caret::trainControl(index = rsample$index
                                                        , indexOut = rsample$indexOut
                                                        , method = 'cv'
                                                        , verboseIter = F
                                                        , savePredictions = T
                                                        )
                        )
  return( as_tibble(car$pred) )
}

13 linear regression

df_linear = df %>%
  left_join( select( df_lasso, data_name, lasso_formula ) , by = 'data_name') %>%
  rename( formula = lasso_formula ) %>%
  mutate( method = 'glm'
          , formula = map( formula, as.formula )
          , grid = NA
          , preds = pmap( list(formula, cv, method, data, grid), car ) 
          ) 

13.1 Example Predictions returned by caret

This is an example of the predictions returned by caret::train. We can use map and the dplyr functions to directly calculate MSE inside the modelling dataframe. Note that we have a row index and a Resample index which will theoretically allow us to trace the prediction of a single observation or the mean performance of a certain cross validation sample throughout different models.

preds = df_linear$preds[[1]] %>%
  as_tibble()

preds 
## # A tibble: 10,000 x 5
##     pred   obs rowIndex parameter Resample       
##    <dbl> <dbl>    <int> <fct>     <chr>          
##  1 14.4   14.5        1 none      Repeat01.Fold01
##  2 19.5   19.7        9 none      Repeat01.Fold01
##  3 20.6   20.7       14 none      Repeat01.Fold01
##  4  8.49  10.2       22 none      Repeat01.Fold01
##  5 10.3   11.5       28 none      Repeat01.Fold01
##  6 25.3   26.4       39 none      Repeat01.Fold01
##  7 15.4   15.6       49 none      Repeat01.Fold01
##  8 13.1   13.9       53 none      Repeat01.Fold01
##  9 12.3   11.5       66 none      Repeat01.Fold01
## 10 11.6   11.2       71 none      Repeat01.Fold01
## # ... with 9,990 more rows
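
As a quick illustration of the tracing mentioned above (a sketch for the first dataset only, not part of the later analysis), the rowIndex column lets us follow a single observation across all 10 x 10 resamples:

# rank the observations of the first dataset by their mean absolute error
# across all resamples in which they ended up in the held-out fold
preds %>%
  group_by( rowIndex ) %>%
  summarise( n_resamples = n()
             , mean_ae = mean( abs(pred - obs) ) ) %>%
  arrange( desc(mean_ae) )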
df_linear_perf = df_linear %>%
  mutate( cv_sample_err = map(preds, mutate
                              , se = (pred-obs)^2 
                              , ae = abs(pred-obs) )
          , cv_sample_err = map(cv_sample_err, group_by, Resample )
          , cv_sample_err = map(cv_sample_err, summarize
                                , mse = mean(se)
                                , mae  = mean(ae)
                                , n = length(se) )
          , means = map(cv_sample_err, ungroup )
          , means = map(means, summarise
                        , mean_mse  = mean(mse)
                        , mean_mae  = mean(mae)
                        , sem_mean_mse = sd(mse) / sqrt( length(mse) )
                        , sem_mean_mae = sd(mae) / sqrt( length(mae) )
                        , n = length(mae) ) ) %>%
  unnest(means)

13.2 Example Modelling dataframe with predictions and performance

Inside the modelling dataframe we keep the preds column containing all the predictions. We also keep the cv_sample_err column with the summarized MSE and MAE for each of the 10 x 10 cross validation samples. These allow us to calculate error bars for the overall mean_mse value.

df_linear
## # A tibble: 6 x 8
##   data_name             data    response cv    formula method grid  preds 
##   <chr>                 <list>  <chr>    <lis> <list>  <chr>  <lgl> <list>
## 1 Friedman 1            <tibbl~ resp     <lis~ <S3: f~ glm    NA    <tibb~
## 2 Friedman 2            <tibbl~ resp     <lis~ <S3: f~ glm    NA    <tibb~
## 3 Friedman 3            <tibbl~ resp     <lis~ <S3: f~ glm    NA    <tibb~
## 4 Friedman 1 - outlier  <tibbl~ resp     <lis~ <S3: f~ glm    NA    <tibb~
## 5 Hill Racing           <tibbl~ time     <lis~ <S3: f~ glm    NA    <tibb~
## 6 Friedman 1 - colinear <tibbl~ resp     <lis~ <S3: f~ glm    NA    <tibb~

13.3 Performance Summary

select( df_linear_perf, data_name
        , mean_mse
        , sem_mean_mse
        , mean_mae
        , sem_mean_mae
        , n
        ) %>%
  knitr::kable( align = 'lc')
| data_name             | mean_mse     | sem_mean_mse | mean_mae    | sem_mean_mae | n   |
|:----------------------|:------------:|:------------:|:-----------:|:------------:|:---:|
| Friedman 1            | 7.040845e+00 | 0.1018119    | 2.0686265   | 0.0145599    | 100 |
| Friedman 2            | 3.462868e+04 | 493.0083767  | 145.9202266 | 1.1644356    | 100 |
| Friedman 3            | 4.999430e-02 | 0.0010096    | 0.1638645   | 0.0013057    | 100 |
| Friedman 1 - outlier  | 1.794176e+01 | 1.3928311    | 2.7039636   | 0.0529205    | 100 |
| Hill Racing           | 2.862335e+02 | 46.1327034   | 9.4115625   | 0.7219416    | 100 |
| Friedman 1 - colinear | 7.039092e+00 | 0.1074555    | 2.0691472   | 0.0154816    | 100 |

As expected, the linear regression performs worse on the dataset to which we added the outliers.

13.4 Distributions predicted vs. observed

df_linear %>%
  unnest( preds, .drop = F ) %>%
  select( data_name, pred, obs ) %>%
  gather( key = 'key', value = 'value', -data_name ) %>%
  ggplot( aes( x = value, fill = key ) ) +
    geom_histogram( position = 'identity'
                    , alpha = 0.5) +
    facet_wrap(~data_name, scales = 'free' )

13.5 Residual Plots

df_res = df_linear %>%
  unnest( preds, .drop = F ) %>%
  select( data_name, pred, obs ) %>%
  mutate( resid = pred-obs ) %>%
  group_by( data_name ) %>%
  nest() %>%
  mutate( p = pmap( list( df = data, title = data_name),  oetteR::f_plot_pretty_points, col_x = 'obs', col_y = 'resid' )
          , p = map(p, function(p) p = p + geom_hline( yintercept = 0, color = 'black', size = 1) ))
## [1] "Number of excluded observations: 0"
## [1] "Number of excluded observations: 0"
## [1] "Number of excluded observations: 0"
## [1] "Number of excluded observations: 0"
## [1] "Number of excluded observations: 0"
## [1] "Number of excluded observations: 0"
df_res$p
(Output: six residual plots, one per dataset, showing resid vs. obs with a horizontal reference line at zero.)

We can see that the distributions of the predictions for the Friedman 2 and 3 datasets do not match the distributions of the actual observations. In these cases we would have to choose a different modelling method, or fit the model on a transformed version of the response. The latter has the disadvantage that we cannot easily transform the predictions back to the original scale without a retransformation bias (see the sketch below).

Further we find that there is a trend to overestimate small observations and to underestimate large observations for the Friedman 1 dataset.
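
To illustrate the retransformation bias mentioned above (a sketch with simulated data, not one of the benchmark datasets): if we fit on log(y) and simply exponentiate the predictions, we systematically underestimate the mean on the original scale, because the mean of the exponentiated values is larger than the exponentiated mean. Duan's smearing factor is one way to correct for this.

set.seed(1)

x = runif(1000)
y = exp( 1 + 2 * x + rnorm(1000, sd = 0.5) )   # log-normally distributed response

m_log = lm( log(y) ~ x )

mean( y )                        # observed mean on the original scale
mean( exp( predict(m_log) ) )    # naive back-transformation underestimates it

# Duan's smearing factor corrects most of the retransformation bias
smear = mean( exp( residuals(m_log) ) )
mean( exp( predict(m_log) ) * smear )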

14 robust linear regression

Regular regression models seek to minimize the SSE (sum of squared errors). This penalizes large errors much more heavily than small errors, which makes linear regression quite sensitive to outliers. The Huber method counts smaller errors as squared errors and larger errors as absolute errors, making the regression more robust.
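
For illustration (a sketch, not part of the original analysis), the Huber loss is quadratic for residuals below the threshold k and linear beyond it, so large residuals pull on the fit far less than under a squared loss:

# Huber loss: quadratic below the threshold k, linear above it
huber_loss = function( resid, k = 1.345 ){
  ifelse( abs(resid) <= k
          , 0.5 * resid^2
          , k * abs(resid) - 0.5 * k^2 )
}

resid = seq( -5, 5, 0.1 )

tibble( resid = resid
        , squared = 0.5 * resid^2
        , huber = huber_loss(resid) ) %>%
  gather( key = 'loss', value = 'value', - resid ) %>%
  ggplot( aes( resid, value, color = loss ) ) +
    geom_line()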

| Property                                               | Sensitivity                                                          |
|:-------------------------------------------------------|:---------------------------------------------------------------------|
| Collinearity                                            | sensitive                                                             |
| Outliers                                                | less sensitive                                                        |
| Not normally distributed predictors                     | sensitive                                                             |
| Not normally distributed response                       | sensitive                                                             |
| Non-linear correlation between predictors and response  | insensitive, does not detect                                          |
| Irrelevant variables                                    | sensitive                                                             |
| Unscaled variables                                      | insensitive, but scaling is needed to calculate variable importance   |

Tuning parameter: Huber threshold (1.345 almost always the best option?), searched over the range 0.01 * 2^x.
Variable importance: scaled absolute coefficient values.

The MASS package offers MASS::rlm, which also supports a few other improved loss functions in addition to the Huber method. If we pass grid = NULL to caret it will automatically try them all; they are described in this presentation. However, I am not sure whether the individual Huber threshold really gets tuned here. Huber proposed 1.345 as an almost optimal threshold value, and it seems that caret keeps it at that value and only tunes the type of loss function.

df_rlm = df %>%
  left_join( select( df_lasso, data_name, lasso_formula ) , by = 'data_name') %>%
  rename( formula = lasso_formula ) %>%
  mutate( method = 'rlm'
          , formula = map( formula, as.formula )
          , grid = NA
          , preds = pmap( list(formula, cv, method, data, grid), car ) 
          ) 

14.1 Example predictions with tuning parameters

df_rlm$preds[[1]]
## # A tibble: 60,000 x 6
##     pred   obs rowIndex intercept psi       Resample       
##    <dbl> <dbl>    <int> <lgl>     <fct>     <chr>          
##  1 14.5   14.5        1 TRUE      psi.huber Repeat01.Fold01
##  2 19.8   19.7        9 TRUE      psi.huber Repeat01.Fold01
##  3 20.9   20.7       14 TRUE      psi.huber Repeat01.Fold01
##  4  8.67  10.2       22 TRUE      psi.huber Repeat01.Fold01
##  5 10.3   11.5       28 TRUE      psi.huber Repeat01.Fold01
##  6 25.7   26.4       39 TRUE      psi.huber Repeat01.Fold01
##  7 15.6   15.6       49 TRUE      psi.huber Repeat01.Fold01
##  8 13.1   13.9       53 TRUE      psi.huber Repeat01.Fold01
##  9 12.6   11.5       66 TRUE      psi.huber Repeat01.Fold01
## 10 11.4   11.2       71 TRUE      psi.huber Repeat01.Fold01
## # ... with 59,990 more rows

14.2 Find the best set of tuning parameters

We will select the best tuning parameters with respect to the absolute error (AE). The AE does not penalize large errors over small errors as heavily as the MSE does. We reckon that this is a more adequate measure for checking whether the rlm method leads to any improvement on the outlier-ridden datasets.
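
A small numeric illustration of the difference between the two measures (not taken from the datasets above):

# four small errors and one outlying error
errors = c(1, 1, 1, 1, 10)

mean( errors^2 )     # MSE = 20.8, dominated by the single large error
mean( abs(errors) )  # MAE = 2.8, much less affected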

We first select the best tuning parameters, ignoring the different cv sets.

df_tune = df_rlm %>%
  mutate( tune_err = map(preds, mutate
                         , se = (pred-obs)^2 
                         , ae = abs(pred - obs) )
          , tune_err = map(tune_err, group_by, intercept, psi )
          , tune_err = map(tune_err, summarize
                           , mse = mean(se)
                           , mae = mean(ae) )
  ) %>%
  unnest(tune_err) %>%
  group_by(data_name) %>%
  mutate( rank = rank(mae) ) %>%
  filter( rank == 1 ) %>%
  select( data_name, intercept, psi ) %>%
  mutate( psi = as.character(psi) )

df_tune %>%
  knitr::kable()
| data_name             | intercept | psi          |
|:----------------------|:---------:|:-------------|
| Friedman 1            | TRUE      | psi.bisquare |
| Friedman 2            | TRUE      | psi.hampel   |
| Friedman 3            | TRUE      | psi.bisquare |
| Friedman 1 - outlier  | TRUE      | psi.bisquare |
| Hill Racing           | TRUE      | psi.bisquare |
| Friedman 1 - colinear | TRUE      | psi.bisquare |

14.3 Performance

Then we filter for the best tuning parameters and group by cross validation set.

df_rlm_perf = df_rlm %>%
  left_join( df_tune ) %>%
  mutate( cv_sample_err = map(preds, mutate
                              , se = (pred-obs)^2
                              , ae = abs(pred-obs) )
          # filter best tuning parameters
          , cv_sample_err = pmap( list(cv_sample_err, intercept, psi) 
                                  , function(x,y,z) filter( x , intercept == y, psi == z ) )
          , cv_sample_err = map(cv_sample_err, group_by, Resample )
          , cv_sample_err = map(cv_sample_err, summarize
                                , mse = mean(se)
                                , mae = mean(ae) )
          , mean_err = map(cv_sample_err, ungroup )
          , mean_err = map(mean_err, summarise
                           , mean_mse = mean(mse)
                           , mean_mae = mean(mae)
                           , sem_mean_mse = sd(mse) / sqrt( length(mse) )
                           , sem_mean_mae = sd(mae) / sqrt( length(mae) )
                           , n = length(mae)
                           ) ) %>%
  unnest(mean_err)


df_rlm_perf %>%
  bind_rows(df_linear_perf) %>%
  select( data_name, method
          , mean_mse
          , sem_mean_mse
          , mean_mae
          , sem_mean_mae
          , n ) %>%
  arrange( data_name, method) %>%
  knitr::kable( align = 'lcc')
| data_name             | method | mean_mse     | sem_mean_mse | mean_mae    | sem_mean_mae | n   |
|:----------------------|:------:|:------------:|:------------:|:-----------:|:------------:|:---:|
| Friedman 1            | glm    | 7.040845e+00 | 0.1018119    | 2.0686265   | 0.0145599    | 100 |
| Friedman 1            | rlm    | 7.140213e+00 | 0.1129233    | 2.0442659   | 0.0149400    | 100 |
| Friedman 1 - colinear | glm    | 7.039092e+00 | 0.1074555    | 2.0691472   | 0.0154816    | 100 |
| Friedman 1 - colinear | rlm    | 7.142437e+00 | 0.1187323    | 2.0452381   | 0.0164512    | 100 |
| Friedman 1 - outlier  | glm    | 1.794176e+01 | 1.3928311    | 2.7039636   | 0.0529205    | 100 |
| Friedman 1 - outlier  | rlm    | 1.793077e+01 | 1.4344045    | 2.6680967   | 0.0546601    | 100 |
| Friedman 2            | glm    | 3.462868e+04 | 493.0083767  | 145.9202266 | 1.1644356    | 100 |
| Friedman 2            | rlm    | 3.463121e+04 | 492.0832659  | 145.9155159 | 1.1649865    | 100 |
| Friedman 3            | glm    | 4.999430e-02 | 0.0010096    | 0.1638645   | 0.0013057    | 100 |
| Friedman 3            | rlm    | 5.840560e-02 | 0.0016142    | 0.1550368   | 0.0018368    | 100 |
| Hill Racing           | glm    | 2.862335e+02 | 46.1327034   | 9.4115625   | 0.7219416    | 100 |
| Hill Racing           | rlm    | 2.576500e+02 | 44.3555424   | 7.9273861   | 0.7263739    | 100 |

We are getting a minimal improvement on the datasets with the outliers using rlm. If we apply the rule of thumb that a model is preferable to another only if its mean error plus SEM is smaller than the mean error of the other model, we find that the rlm method gives us a valid improvement over glm only when considering the MAE for the Hill Racing dataset: 7.93 + 0.73 < 9.41.
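
The rule of thumb can be checked directly from the two performance tables (a minimal sketch, assuming df_rlm_perf and df_linear_perf as constructed above):

# rlm is considered preferable if its mean MAE plus one SEM is still
# smaller than the mean MAE of the glm fit on the same dataset
df_rlm_perf %>%
  select( data_name, rlm_mae = mean_mae, rlm_sem = sem_mean_mae ) %>%
  left_join( select( df_linear_perf, data_name, glm_mae = mean_mae )
             , by = 'data_name' ) %>%
  mutate( rlm_preferable = (rlm_mae + rlm_sem) < glm_mae )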

14.4 rlm without caret

I would like to check whether the findings above hold up when the Huber threshold itself is tuned. caret does not offer that option, so we do it manually.

wr_rlm = function( formula, split, huber ){
  
  # fit MASS::rlm with a given huber threshold on the analysis part of the
  # rsample split and return predictions for the assessment part
  train = as.data.frame(split, data = 'analysis')
  test = as.data.frame(split, data = 'assessment')
  
  m = MASS::rlm( formula, train , psi = MASS::psi.huber, k = huber )
  
  preds = predict(m, newdata = test)
  
  resp = f_manip_get_response_variable_from_formula(formula)
  
  obs = test[[resp]]
  
  return( tibble(obs = obs, preds = preds) )
}

x = 1:15

grid = expand.grid( list( huber = 0.01 * 2^x) )

df_wr_rlm = df %>%
  left_join( select(df_lasso, data_name, lasso_formula ) ) %>%
  mutate( cv = map(data, rsample::vfold_cv, v = 10, repeats = 10)
          , huber = list(grid) # one copy of the tuning grid per dataset; expanded by unnest() below
          , formula = map(lasso_formula, as.formula ) ) %>%
  unnest( huber , .drop = FALSE) %>%
  select( - data ) %>%
  unnest( cv , .drop = FALSE) %>%
  mutate( preds = pmap( list(formula, splits, huber)
                        , wr_rlm) )

#  calculate errors
df_wr_lm_perf = df_wr_rlm %>%
  mutate( preds = map(preds, mutate
                      , se = (preds-obs)^2
                      , ae = abs(preds-obs) )
          , preds = map(preds, summarise
                        , mse = mean(se)
                        , mae = mean(ae) ) ) %>%
  unnest( preds, .drop = FALSE )
  
# select best huber parameter by MAE
df_wr_lm_tune = df_wr_lm_perf %>%
  group_by( data_name, huber ) %>%
  summarise( mean_mse       = mean(mse)
             , sem_mean_mse = sd(mse) / sqrt( length(mse) )
             , mean_mae     = mean(mae)
             , sem_mean_mae = sd(mae) / sqrt( length(mae) )
             , n = length(mae)
             ) %>%
  group_by( data_name ) %>%
  mutate( rank = rank(mean_mae, ties.method = 'first') ) %>%
  filter( rank == 1)

df_wr_lm_tune %>%
  select( data_name, huber) %>%
  knitr::kable( align = 'lc')
| data_name             | huber |
|:----------------------|:-----:|
| Friedman 1            | 0.32  |
| Friedman 1 - colinear | 0.64  |
| Friedman 1 - outlier  | 0.16  |
| Friedman 2            | 5.12  |
| Friedman 3            | 0.64  |
| Hill Racing           | 0.64  |

We can see that 1.345 does not always come up as the best threshold value.

Filter by best tuning parameters

df_wr_lm_perf_filt = df_wr_lm_tune %>%
  select( data_name, huber) %>%
  left_join( df_wr_lm_perf ) %>%
  group_by( data_name ) %>%
  summarise( mean_mse       = mean(mse)
             , sem_mean_mse = sd(mse) / sqrt( length(mse) )
             , mean_mae     = mean(mae)
             , sem_mean_mae = sd(mae) / sqrt( length(mae) )
             , n = length(mae)
             ) 


final = df_wr_lm_perf_filt %>%
  mutate( method = 'wr_rlm') %>%
  bind_rows( df_rlm_perf ) %>%
  bind_rows(df_linear_perf ) %>%
  select( data_name, method, mean_mse, sem_mean_mse, mean_mae, sem_mean_mae, n  )%>%
  arrange( data_name, method )


f_datatable_minimal(final)

Even with manual tuning of the Huber threshold we could not increase the modelling performance on the dataset with the artificial outliers that we added.

14.5 Plot Performance of datasets with outliers

final %>%
  filter( data_name == 'Hill Racing' | data_name == 'Friedman 1 - outlier' ) %>%
  ggplot() +
    geom_pointrange( aes(x = method
                         , y = mean_mae
                         , ymin = mean_mae - sem_mean_mae
                         , ymax = mean_mae + sem_mean_mae
                         , color = method)
                     ) +
  facet_wrap( ~ data_name, scales = 'free_y' )

final %>%
  filter( data_name == 'Hill Racing' | data_name == 'Friedman 1 - outlier' ) %>%
  ggplot() +
    geom_pointrange( aes(x = method
                         , y = mean_mse
                         , ymin = mean_mse - sem_mean_mse
                         , ymax = mean_mse + sem_mean_mse
                         , color = method)
                     ) +
  facet_wrap( ~ data_name, scales = 'free_y' )

14.6 Summary

The rlm method does not really hold up in a scenario with thorough cross validation. When generating the training and test sets, the outliers have a reasonable chance of ending up in both, and thus affect both model training and the prediction error. When an outlier comes up in the test set but has been ignored by the robust model, or was not present in the training data at all, it produces a large error. However, if an outlier comes up in the training data and a similar outlier appears during testing, it produces a smaller error. These effects cancel each other out over a large number of cross validation sets, which explains the indifference of the MSE (which penalizes large errors over small errors) to the robust linear regression methods. When using the MAE (which does not overly penalize large errors over small errors) we do see improvements with the robust linear regression methods.

The robust linear regression methods build a better descriptive model of the regular data in the presence of outliers. If the model is tested on a validation set containing outliers, it only outperforms a regular regression model if we do not square the errors but use the absolute error to measure predictive quality.

In practice we would use robust linear regression to build descriptive models for data with outliers. For predictive models it only makes sense if the outliers are completely random and unrelated, such as noise from sensor data. If there is a recurring set of measurements, such as a peculiar group of customers with irrational behaviour, it would be better to filter them out and build a separate model for this group of customers.