Robust regression is an alternative to least squares regression when the data are contaminated with outliers or influential observations; it can also be used to detect such observations.
In this analysis we will use both approaches in order to see how they behave. We will calculate:

* LQS regression via LMS, LWS and LTS (MASS library).
* RLM regression via bisquare-weighted least squares (MASS library).
* RQ (quantile) regression (quantreg library).
As an example, we will use the Duncan dataset (from the car package), which records the income and education level of 45 U.S. occupations.
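A quick look at the data (this inspection chunk is not part of the original output; it is only illustrative):

library(car)  # provides the Duncan dataset
head(Duncan)  # columns: type, income, education, prestige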
If we fit a traditional linear regression, we can already see an abnormal pattern in the data, with some clear outliers:
library(ggplot2)

# OLS fit drawn over the raw scatter of income on education
ggplot(Duncan, aes(x = education, y = income)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red") +
  theme_minimal() +
  ggtitle("Income vs. education")
Let’s draw a Cook’s distance bar plot (olsrr library) to identify the observations that are too influential.
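The original plotting chunk is not shown; a minimal sketch using olsrr’s ols_plot_cooksd_bar() (the model name fitLS is an assumption, matching the comparison plot below) would be:

library(olsrr)
fitLS <- lm(income ~ education, data = Duncan)  # baseline OLS fit
ols_plot_cooksd_bar(fitLS)  # bars above the threshold line flag influential points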
From this graph we can clearly see that there are three outliers which are distorting the linear regression, so we should try a robust version of the regression method. Removing them is not an option, since they are not input errors but real data: occupations whose income is unusually high given their education level.
We can fit all of the robust models with the MASS and quantreg libraries.
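The fitting chunk does not appear in the output above; a sketch consistent with the abline() calls below (the formulas and method choices are assumptions) is:

library(MASS)      # rlm() and lqs()
library(quantreg)  # rq()

fitLS  <- lm(income ~ education, data = Duncan)                   # ordinary least squares
fitH   <- rlm(income ~ education, data = Duncan)                  # Huber M-estimator
fitLAD <- rq(income ~ education, data = Duncan, tau = 0.5)        # least absolute deviations
fitLTS <- lqs(income ~ education, data = Duncan, method = "lts")  # least trimmed squares
fitLMS <- lqs(income ~ education, data = Duncan, method = "lms")  # least median of squares
fitS   <- lqs(income ~ education, data = Duncan, method = "S")    # S-estimator
fitMM  <- rlm(income ~ education, data = Duncan, method = "MM")   # MM-estimator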
# Overlay every fitted line on the scatter plot for visual comparison
plot(Duncan$education, Duncan$income,
     xlab = "education", ylab = "income", type = "p",
     pch = 20, cex = .8)
abline(fitLS, col = 1)
abline(fitH, col = 2)
abline(fitLAD, col = 3)
abline(fitLTS, col = 4)
abline(fitLMS, col = 5)
abline(fitS, col = 6)
abline(fitMM, col = 7)
legend(0, 80, c("LS", "Huber", "LAD", "LTS", "LMS",
                "S-estimator", "MM-estimator"),
       lty = rep(1, 7), bty = "n",
       col = c(1, 2, 3, 4, 5, 6, 7))
Now we are going to test the models we have just fitted, splitting the data and comparing predictions on the held-out rows with the min-max accuracy.
# Random 80/20 split; without a fixed seed the exact numbers below will vary
training_rows <- sample(1:nrow(Duncan), 0.8 * nrow(Duncan))
training_data <- Duncan[training_rows, ]
test_data <- Duncan[-training_rows, ]

# Note: the models were fitted on the full dataset above; refitting them on
# training_data would give a stricter out-of-sample comparison
lm_Predicted <- predict(fitLS, test_data)
rob_Predicted <- predict(fitLTS, test_data)
lm_actuals_pred <- cbind(lm_Predicted, test_data$income)
rob_actuals_pred <- cbind(rob_Predicted, test_data$income)

# Min-max accuracy: row-wise min/max ratio, averaged over the test set
mean(apply(lm_actuals_pred, 1, min) /
     apply(lm_actuals_pred, 1, max))
## [1] 0.6436558
mean(apply(rob_actuals_pred, 1, min) /
     apply(rob_actuals_pred, 1, max))
## [1] 0.6489016
As we can see, the robust LTS fit attains a slightly higher min-max accuracy than the simple linear regression.
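The same check could be run over every robust fit; a short sketch (the helper minmax_acc is hypothetical, not part of the original code):

# Hypothetical helper: min-max accuracy of a fitted model on new data
minmax_acc <- function(fit, newdata) {
  ap <- cbind(predict(fit, newdata), newdata$income)
  mean(apply(ap, 1, min) / apply(ap, 1, max))
}
sapply(list(Huber = fitH, LAD = fitLAD, LTS = fitLTS,
            LMS = fitLMS, S = fitS, MM = fitMM),
       minmax_acc, newdata = test_data)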
A model is classically chosen by its \(R^2\) value. For the ordinary least squares fit, the \(R^2\) and the coefficient p-values are:

## [1] 0.5249182
##  (Intercept)    education
## 4.754253e-02 1.839946e-08

For the robust fit:

## [1] 0.7578587
##    Intercept Duncan$education
## 2.533966e-01     6.863292e-14
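The chunk behind these numbers is not shown; a plausible reconstruction (using the squared correlation between fitted and observed values as the robust pseudo-\(R^2\) is an assumption) is:

# OLS: R-squared and coefficient p-values straight from summary()
summary(fitLS)$r.squared
summary(fitLS)$coefficients[, "Pr(>|t|)"]

# Robust fit: squared correlation between fitted and observed values
# as a simple R-squared analogue (one common choice, assumed here)
cor(fitted(fitMM), Duncan$income)^2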
The robust method improves \(R^2\) by about 23 percentage points (from 0.52 to 0.76), which is a substantial improvement.