Although linear regression is a very common tool among data analysts, real data rarely present a clean linear relationship. Because of that, we have to consider robust versions of the classical linear regression method.
The main objective of robust statistics is to adjust the weight of each observation depending on how atypically it behaves. This is especially useful when we have outliers and extreme values that we do not want to eliminate (which, by the way, would be a bad idea in most cases).
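As a minimal illustration of this reweighting idea (a sketch, not part of the analysis itself), the Huber weight function from MASS keeps a weight of 1 for small standardized residuals and progressively downweights larger ones:

r <- c(-4, -1.5, 0, 1.5, 4)    # some standardized residuals
MASS::psi.huber(r, k = 1.345)  # weights: 1 when |r| <= k, k/|r| otherwise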
We have two different kinds of estimators:

* Those which keep the least-squares criterion of traditional linear regression but change the estimator.
* Those which use a different residual (loss) function.
In this analysis we will use both kinds in order to see how they behave. We will fit:

* LQS regression via LMS, LTS and S-estimation (MASS library).
* RLM regression via iteratively re-weighted least squares: Huber M-estimation and MM-estimation (MASS library).
* RQ quantile regression, i.e. LAD (quantreg library).
As an example, we will use the Duncan dataset, which records the type, income, education and prestige of 45 U.S. occupations.
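(The Duncan data ships with the car package; presumably it was loaded and previewed along these lines.)

library(car)      # provides the Duncan occupational data
data(Duncan)
head(Duncan, 10)  # the ten rows shown below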
occupation | type | income | education | prestige |
---|---|---|---|---|
accountant | prof | 62 | 86 | 82 |
pilot | prof | 72 | 76 | 83 |
architect | prof | 75 | 92 | 90 |
author | prof | 55 | 90 | 76 |
chemist | prof | 64 | 86 | 90 |
minister | prof | 21 | 84 | 87 |
professor | prof | 64 | 93 | 93 |
dentist | prof | 80 | 100 | 90 |
reporter | wc | 67 | 87 | 52 |
engineer | prof | 72 | 86 | 88 |
If we fit a traditional linear regression, we can already see an abnormal pattern in the data, with some outliers:
ggplot(Duncan, aes(x = education, y = income)) +
geom_point() + stat_smooth(method = "lm", col = "red") + theme_minimal() +
ggtitle("Income vs. education")
Let’s draw a bar plot of Cook’s distances (olsrr library) to spot the observations that are too influential.
library(olsrr)  # for ols_plot_cooksd_bar()
fitLS <- lm(income ~ education, data = Duncan)  # ordinary least-squares fit
ols_plot_cooksd_bar(fitLS)  # bar plot of Cook's distances
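For reference, the same distances can be inspected numerically (an extra check, not in the original code); the cut-off drawn by olsrr is the usual 4/n rule:

cd <- cooks.distance(fitLS)  # Cook's distance of each observation
cd[cd > 4 / nrow(Duncan)]    # observations above the 4/n threshold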
We can see in the plot three outliers which are distorting the linear regression, so it is clear that we should try a robust version of the regression method. Removing them is not an option, since they are not input errors but real data: occupations whose income is unusual given their education level.
We can fit the robust models with the MASS and quantreg libraries.
library(MASS)      # rlm() and lqs()
library(quantreg)  # rq()
fitH <- rlm(income ~ education, data = Duncan, k2 = 1.345)        # Huber M-estimation
fitLMS <- lqs(income ~ education, data = Duncan, method = "lms")  # least median of squares
fitLTS <- lqs(income ~ education, data = Duncan, method = "lts")  # least trimmed squares
fitLAD <- rq(income ~ education, data = Duncan, tau = 0.5)        # median (LAD) regression
fitS <- lqs(income ~ education, data = Duncan, method = "S")      # S-estimation
fitMM <- rlm(income ~ education, data = Duncan, method = "MM")    # MM-estimation
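Before plotting, we can line up the fitted coefficients side by side (a quick sanity check, not part of the original analysis):

sapply(list(LS = fitLS, Huber = fitH, LAD = fitLAD, LTS = fitLTS,
            LMS = fitLMS, S = fitS, MM = fitMM), coef)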
plot(Duncan$education, Duncan$income,
xlab = "education", ylab = "income", type = "p",
pch = 20, cex = .8)
abline(fitLS, col = 1)
abline(fitH, col = 2)
abline(fitLAD, col = 3)
abline(fitLTS, col = 4)
abline(fitLMS, col = 5)
abline(fitS, col = 6)
abline(fitMM, col = 7)
legend(0, 80, c("LS", "Huber", "LAD", "LTS", "LMS",
                "S-estimator", "MM-estimator"),
       lty = rep(1, 7), bty = "n",
       col = c(1, 2, 3, 4, 5, 6, 7))
Interestingly enough, all the fitted lines resist the outliers except the Huber M-estimate and the simple linear regression, which are pulled towards the influential points.
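This is expected: the Huber loss bounds the influence of large residuals but not of high-leverage points in the x-direction. A quick way to inspect leverage (an extra check, not in the original code) is through the hat values of the least-squares fit:

h <- hatvalues(fitLS)  # leverage of each observation
h[h > 2 * mean(h)]     # rule of thumb: leverage above 2p/n is high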
Now we are going to test the methods we have just created on held-out data.
# The split below is random and unseeded, so the exact numbers vary run to run.
training_rows <- sample(1:nrow(Duncan), 0.8 * nrow(Duncan))
training_data <- Duncan[training_rows, ]
test_data <- Duncan[-training_rows, ]
# Note: fitLS and fitLTS were fitted on the full data set above; strictly,
# they should be refitted on training_data before predicting on test_data.
lm_Predicted <- predict(fitLS, test_data)
rob_Predicted <- predict(fitLTS, test_data)
lm_actuals_pred <- cbind(lm_Predicted, test_data$income)
rob_actuals_pred <- cbind(rob_Predicted, test_data$income)
# Min-max accuracy: mean of min(actual, predicted) / max(actual, predicted)
# per observation; the closer to 1, the better the agreement.
mean(apply(lm_actuals_pred, 1, min)/
apply(lm_actuals_pred, 1, max))
## [1] 0.7131016
mean(apply(rob_actuals_pred, 1, min)/
apply(rob_actuals_pred, 1, max))
## [1] 0.7558502
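As a complementary check (an addition, not in the original analysis), the mean absolute error of the predictions offers a second view of both fits:

mean(abs(lm_Predicted - test_data$income))   # MAE of the least-squares fit
mean(abs(rob_Predicted - test_data$income))  # MAE of the robust LTS fit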
As we can see, the robust LTS method predicts the held-out data better than the simple linear regression (a min-max accuracy of 0.756 versus 0.713).
The preferred model is classically chosen by its R2 value, so let's compare both fits on that criterion as well.
summary(fitLS)$r.squared  # R2 of the least-squares fit
## [1] 0.5249182
summary(fitLS)$coefficients[,4]  # p-values of the coefficients
## (Intercept) education
## 4.754253e-02 1.839946e-08
library(robustbase)  # ltsReg()
fitLTS <- ltsReg(Duncan$education, Duncan$income)  # least trimmed squares fit
summary(fitLTS)$r.squared  # robust R2 of the LTS fit
## [1] 0.7578587
summary(fitLTS)$coefficients[,4]  # p-values of the robust fit
## Intercept Duncan$education
## 2.533966e-01 6.863292e-14
The robust method raises the R2 from 0.52 to 0.76, an improvement of about 23 percentage points, which is a substantial gain.