Although linear regression is a very common tool among data analysts, real data rarely present a clean linear relationship. Because of that, we have to consider robust versions of the classical linear regression method.
The main objective of robust statistics is to adjust the weight of each observation depending on how atypically it behaves. This is especially useful when we have outliers and extreme values that we do not want to eliminate (which, by the way, would be a bad idea in most cases).
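As a minimal illustration of this reweighting idea (a sketch, not part of the analysis itself), the Huber weight function from MASS keeps a weight of 1 for small standardized residuals and progressively downweights larger ones:

r <- c(-4, -1.5, 0, 1.5, 4)    # some standardized residuals
MASS::psi.huber(r, k = 1.345)  # weights: 1 when |r| <= k, k/|r| otherwise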
We have two different kinds of estimators:

* Those which keep the least-squares criterion of traditional linear regression but change the estimator.
* Those which use a different residual (loss) function.
In this analysis we will use both kinds in order to see how they behave. We will fit:

* LQS regression via LMS, LTS and S-estimation (MASS library).
* RLM regression via iteratively re-weighted least squares: Huber M-estimation and MM-estimation (MASS library).
* RQ quantile regression, i.e. LAD (quantreg library).
As an example, we will use the Duncan dataset, which records the type, income, education and prestige of 45 U.S. occupations.
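(The Duncan data ships with the car package; presumably it was loaded and previewed along these lines.)

library(car)      # provides the Duncan occupational data
data(Duncan)
head(Duncan, 10)  # the ten rows shown below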
occupation | type | income | education | prestige |
---|---|---|---|---|
accountant | prof | 62 | 86 | 82 |
pilot | prof | 72 | 76 | 83 |
architect | prof | 75 | 92 | 90 |
author | prof | 55 | 90 | 76 |
chemist | prof | 64 | 86 | 90 |
minister | prof | 21 | 84 | 87 |
professor | prof | 64 | 93 | 93 |
dentist | prof | 80 | 100 | 90 |
reporter | wc | 67 | 87 | 52 |
engineer | prof | 72 | 86 | 88 |
If we fit a traditional linear regression, we can already see an abnormal pattern in the data, with some outliers:
ggplot(Duncan, aes(x = education, y = income)) +
geom_point() + stat_smooth(method = "lm", col = "red") + theme_minimal() +
ggtitle("Income vs. education")
Let’s draw a bar plot of Cook’s distances (olsrr library) to spot the observations that are too influential.
library(olsrr)  # for ols_plot_cooksd_bar()
fitLS <- lm(income ~ education, data = Duncan)  # ordinary least-squares fit
ols_plot_cooksd_bar(fitLS)  # bar plot of Cook's distances
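For reference, the same distances can be inspected numerically (an extra check, not in the original code); the cut-off drawn by olsrr is the usual 4/n rule:

cd <- cooks.distance(fitLS)  # Cook's distance of each observation
cd[cd > 4 / nrow(Duncan)]    # observations above the 4/n threshold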
We can see in the plot three outliers which are distorting the linear regression, so it is clear that we should try a robust version of the regression method. Removing them is not an option, since they are not input errors but real data: occupations whose income is unusual given their education level.
We can fit the robust models with the MASS and quantreg libraries.
library(MASS)      # rlm() and lqs()
library(quantreg)  # rq()
fitH <- rlm(income ~ education, data = Duncan, k2 = 1.345)        # Huber M-estimation
fitLMS <- lqs(income ~ education, data = Duncan, method = "lms")  # least median of squares
fitLTS <- lqs(income ~ education, data = Duncan, method = "lts")  # least trimmed squares
fitLAD <- rq(income ~ education, data = Duncan, tau = 0.5)        # median (LAD) regression
fitS <- lqs(income ~ education, data = Duncan, method = "S")      # S-estimation
fitMM <- rlm(income ~ education, data = Duncan, method = "MM")    # MM-estimation
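Before plotting, we can line up the fitted coefficients side by side (a quick sanity check, not part of the original analysis):

sapply(list(LS = fitLS, Huber = fitH, LAD = fitLAD, LTS = fitLTS,
            LMS = fitLMS, S = fitS, MM = fitMM), coef)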
plot(Duncan$education, Duncan$income,
xlab = "education", ylab = "income", type = "p",
pch = 20, cex = .8)
abline(fitLS, col = 1)
abline(fitH, col = 2)
abline(fitLAD, col = 3)
abline(fitLTS, col = 4)
abline(fitLMS, col = 5)
abline(fitS, col = 6)
abline(fitMM, col = 7)
legend(0, 80, c("LS", "Huber", "LAD", "LTS", "LMS",
                "S-estimator", "MM-estimator"),
       lty = rep(1, 7), bty = "n",
       col = c(1, 2, 3, 4, 5, 6, 7))
Interestingly enough, all the fitted lines resist the outliers except the Huber M-estimate and the simple linear regression, which are pulled towards the influential points.
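This is expected: the Huber loss bounds the influence of large residuals but not of high-leverage points in the x-direction. A quick way to inspect leverage (an extra check, not in the original code) is through the hat values of the least-squares fit:

h <- hatvalues(fitLS)  # leverage of each observation
h[h > 2 * mean(h)]     # rule of thumb: leverage above 2p/n is high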
Now we are going to test the methods we have just created on held-out data.
# The split below is random and unseeded, so the exact numbers vary run to run.
training_rows <- sample(1:nrow(Duncan), 0.8 * nrow(Duncan))
training_data <- Duncan[training_rows, ]
test_data <- Duncan[-training_rows, ]
# Note: fitLS and fitLTS were fitted on the full data set above; strictly,
# they should be refitted on training_data before predicting on test_data.
lm_Predicted <- predict(fitLS, test_data)
rob_Predicted <- predict(fitLTS, test_data)
lm_actuals_pred <- cbind(lm_Predicted, test_data$income)
rob_actuals_pred <- cbind(rob_Predicted, test_data$income)
# Min-max accuracy: mean of min(actual, predicted) / max(actual, predicted)
# per observation; the closer to 1, the better the agreement.
mean(apply(lm_actuals_pred, 1, min)/
apply(lm_actuals_pred, 1, max))
## [1] 0.7131016
mean(apply(rob_actuals_pred, 1, min)/
apply(rob_actuals_pred, 1, max))
## [1] 0.7558502
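As a complementary check (an addition, not in the original analysis), the mean absolute error of the predictions offers a second view of both fits:

mean(abs(lm_Predicted - test_data$income))   # MAE of the least-squares fit
mean(abs(rob_Predicted - test_data$income))  # MAE of the robust LTS fit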
As we can see, the robust LTS method predicts the held-out data better than the simple linear regression (a min-max accuracy of 0.756 versus 0.713).
The preferred model is classically chosen by its R2 value, so let's compare both fits on that criterion as well.
summary(fitLS)$r.squared  # R2 of the least-squares fit
## [1] 0.5249182
summary(fitLS)$coefficients[,4]  # p-values of the coefficients
## (Intercept) education
## 4.754253e-02 1.839946e-08
library(robustbase)  # ltsReg()
fitLTS <- ltsReg(Duncan$education, Duncan$income)  # least trimmed squares fit
summary(fitLTS)$r.squared  # robust R2 of the LTS fit
## [1] 0.7578587
summary(fitLTS)$coefficients[,4]  # p-values of the robust fit
## Intercept Duncan$education
## 2.533966e-01 6.863292e-14
The robust method raises the R2 from 0.52 to 0.76, an improvement of about 23 percentage points, which is a substantial gain.