Although linear regression methods are a very common tool among data analysts, real data rarely present a truly linear relationship. Because of that, we should consider robust versions of the classical linear regression method.

The main objective of robust statistics is to adjust the weight of each observation according to how atypically it behaves. This is especially useful when we have outliers and extreme values that we do not want to eliminate (which, by the way, would be a bad idea in most cases).
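
As a quick illustration of this weighting idea, the psi.huber function in the MASS library returns the weight a scaled residual receives under the Huber loss (a minimal sketch; k = 1.345 is the function's default tuning constant):

library(MASS)
# Huber weights are 1 for small scaled residuals and decay as k / |u|
# beyond the tuning constant, so extreme observations pull the fit less
u <- c(0.5, 1, 2, 5, 10)
psi.huber(u)
## [1] 1.0000 1.0000 0.6725 0.2690 0.1345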

We have two different kinds of estimators:

* Those that keep the method of least squares of traditional linear regression but change the estimator.
* Those that use a different residual function altogether.

In this analysis we will use both kinds in order to see how they behave. We will calculate:

* LQS regression through LMS, LTS and the S-estimator (MASS library).
* RLM regression through iteratively re-weighted least squares with Huber and bi-square (MM) weights (MASS library).
* RQ (quantile) regression (quantreg library).

As an example, we will use the Duncan dataset, which records the type, income, education and prestige of 45 U.S. occupations; here we will model the relationship between education level and income.
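
The dataset ships with the carData package (and is re-exported by the car package); assuming it is installed, we can load it and inspect the first rows:

library(carData)  # provides the Duncan data frame
head(Duncan, 10)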

##            type income education prestige
## accountant prof     62        86       82
## pilot      prof     72        76       83
## architect  prof     75        92       90
## author     prof     55        90       76
## chemist    prof     64        86       90
## minister   prof     21        84       87
## professor  prof     64        93       93
## dentist    prof     80       100       90
## reporter     wc     67        87       52
## engineer   prof     72        86       88

If we fit a traditional linear regression, we can already see an abnormal pattern in the data, with some clear outliers:

library(ggplot2)
ggplot(Duncan, aes(x = education, y = income)) + 
  geom_point() + stat_smooth(method = "lm", col = "red") + theme_minimal() +
  ggtitle("Income vs. education")

1. Outliers

Let’s draw a Cook’s distance bar plot (olsrr library) to spot the observations that are too influential.

library(olsrr)
fitLS <- lm(income ~ education, data = Duncan)
ols_plot_cooksd_bar(fitLS)
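
If olsrr is unavailable, the same influential points can be flagged with base R’s cooks.distance and the usual 4/n rule of thumb (a sketch, assuming the fitLS model from above):

cd <- cooks.distance(fitLS)
# observations above the conventional 4/n cutoff count as influential
cd[cd > 4 / nrow(Duncan)]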

The plot shows three outliers that are distorting the linear regression, so it is clear we should try a robust version of the regression method. Removing them is not an option: they are not input errors, but real data representing people whose income is unusually high given their education level.

2. Fitting the robust models

We can fit all of these robust models with the MASS and quantreg libraries.
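
The functions we need load as follows (rlm and lqs come from MASS, rq from quantreg):

library(MASS)      # rlm() and lqs()
library(quantreg)  # rq()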

2.1. Huber loss

fitH <- rlm(income ~ education, data = Duncan, k2 = 1.345)  # Huber M-estimation; 1.345 is the usual tuning constant

2.2. LMS

fitLMS <- lqs(income ~ education, data = Duncan, method = "lms")  # least median of squares

2.3. LTS

fitLTS <- lqs(income ~ education, data = Duncan, method = "lts")  # least trimmed squares

2.4. LAD

fitLAD <- rq(income ~ education, data = Duncan, tau = 0.5)  # median regression, i.e. least absolute deviations

2.5. S-estimator

fitS <- lqs(income ~ education, data = Duncan, method = "S")  # S-estimator: minimises a robust scale of the residuals

2.6. MM-estimator

fitMM <- rlm(income ~ education, data = Duncan, method = "MM")  # MM-estimator: S-estimate refined by M-estimation
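
Before plotting, we can compare the fitted coefficients side by side; a quick sketch, assuming all of the fits above succeeded (coef() works for lm, rlm, lqs and rq objects):

sapply(list(LS = fitLS, Huber = fitH, LAD = fitLAD,
            LTS = fitLTS, LMS = fitLMS, S = fitS, MM = fitMM),
       coef)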

3. Plot

plot(Duncan$education, Duncan$income,
      xlab = "education", ylab = "income", type = "p", 
      pch = 20, cex = .8)
abline(fitLS, col = 1) 
abline(fitH, col = 2) 
abline(fitLAD, col = 3) 
abline(fitLTS, col = 4) 
abline(fitLMS, col = 5) 
abline(fitS, col = 6) 
abline(fitMM, col = 7) 
legend(0, 80, c("LS", "Huber", "LAD", "LTS", "LMS",
                  "S-estimator", "MM-estimator"),
           lty = rep(1, 7), bty = "n",
           col = c(1, 2, 3, 4, 5, 6, 7))

Interestingly enough, all of the fitted lines resist the outliers except the Huber fit, which stays close to the simple linear regression line.

4. Precision of the methods

Now we are going to test the methods we have just created.

training_rows <- sample(1:nrow(Duncan), 0.8 * nrow(Duncan))  # random 80/20 split (set a seed to reproduce)
training_data <- Duncan[training_rows, ] 
test_data <- Duncan[-training_rows, ] 
    
lm_Predicted <- predict(fitLS, test_data)
rob_Predicted <- predict(fitLTS, test_data)
    
lm_actuals_pred <- cbind(lm_Predicted, test_data$income)
rob_actuals_pred <- cbind(rob_Predicted, test_data$income)
    
# min/max accuracy: mean of min(actual, predicted) / max(actual, predicted) per row
mean(apply(lm_actuals_pred, 1, min)/
        apply(lm_actuals_pred, 1, max))  
## [1] 0.7131016
mean(apply(rob_actuals_pred, 1, min)/ 
        apply(rob_actuals_pred, 1, max))
## [1] 0.7558502
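
The min/max ratio is only one notion of accuracy; as a cross-check, a short sketch computing the test RMSE of every model fitted in section 2 (assuming they are still in scope; lower is better):

rmse <- function(fit) {
  pred <- predict(fit, test_data)
  sqrt(mean((test_data$income - pred)^2))
}
sapply(list(LS = fitLS, Huber = fitH, LAD = fitLAD,
            LTS = fitLTS, LMS = fitLMS, S = fitS, MM = fitMM),
       rmse)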

As we can see, the robust LTS fit predicts more accurately than the simple linear regression (min/max accuracy 0.756 vs. 0.713).

5. Choosing the best robust regression method

Classically, the best model is chosen by its R2 value.

summary(fitLS)$r.squared
## [1] 0.5249182
summary(fitLS)$coefficients[,4]
##  (Intercept)    education 
## 4.754253e-02 1.839946e-08
library(robustbase)  # ltsReg() reports an R2 for the LTS fit
fitLTS <- ltsReg(Duncan$education, Duncan$income)
summary(fitLTS)$r.squared
## [1] 0.7578587
summary(fitLTS)$coefficients[,4]
##        Intercept Duncan$education 
##     2.533966e-01     6.863292e-14

The robust method raises the R2 from 0.52 to 0.76, an improvement of roughly 23 percentage points, which is a substantial gain.