Robust regression is an alternative to least squares regression when the data are contaminated with outliers or influential observations; it can also be used to detect such observations.
In this analysis we will use both approaches in order to see how they behave. We will calculate:

* LQS regression via LMS, LWS and LTS (MASS library).
* RLM regression via bisquare-weighted least squares (MASS library).
* RQ (quantile) regression (quantreg library).
As an example, we will use the Duncan dataset (from the car package), which records the income and education level of 45 U.S. occupations.
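A quick look at the data (this inspection chunk is not part of the original output; it is only illustrative):

library(car)  # provides the Duncan dataset
head(Duncan)  # columns: type, income, education, prestige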
If we fit a traditional linear regression, we can already see an abnormal pattern in the data, with some clear outliers:
library(ggplot2)

# OLS fit drawn over the raw scatter of income on education
ggplot(Duncan, aes(x = education, y = income)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red") +
  theme_minimal() +
  ggtitle("Income vs. education")
Let’s draw a Cook’s distance bar plot (olsrr library) to identify the observations that are too influential.
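The original plotting chunk is not shown; a minimal sketch using olsrr’s ols_plot_cooksd_bar() (the model name fitLS is an assumption, matching the comparison plot below) would be:

library(olsrr)
fitLS <- lm(income ~ education, data = Duncan)  # baseline OLS fit
ols_plot_cooksd_bar(fitLS)  # bars above the threshold line flag influential points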
From this graph we can clearly see that there are three outliers which are distorting the linear regression, so we should try a robust version of the regression method. Removing them is not an option, since they are not input errors but real data: occupations whose income is unusually high given their education level.
We can fit all of the robust models with the MASS and quantreg libraries.
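The fitting chunk does not appear in the output above; a sketch consistent with the abline() calls below (the formulas and method choices are assumptions) is:

library(MASS)      # rlm() and lqs()
library(quantreg)  # rq()

fitLS  <- lm(income ~ education, data = Duncan)                   # ordinary least squares
fitH   <- rlm(income ~ education, data = Duncan)                  # Huber M-estimator
fitLAD <- rq(income ~ education, data = Duncan, tau = 0.5)        # least absolute deviations
fitLTS <- lqs(income ~ education, data = Duncan, method = "lts")  # least trimmed squares
fitLMS <- lqs(income ~ education, data = Duncan, method = "lms")  # least median of squares
fitS   <- lqs(income ~ education, data = Duncan, method = "S")    # S-estimator
fitMM  <- rlm(income ~ education, data = Duncan, method = "MM")   # MM-estimator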
# Overlay every fitted line on the scatter plot for visual comparison
plot(Duncan$education, Duncan$income,
     xlab = "education", ylab = "income", type = "p",
     pch = 20, cex = .8)
abline(fitLS, col = 1)
abline(fitH, col = 2)
abline(fitLAD, col = 3)
abline(fitLTS, col = 4)
abline(fitLMS, col = 5)
abline(fitS, col = 6)
abline(fitMM, col = 7)
legend(0, 80, c("LS", "Huber", "LAD", "LTS", "LMS",
                "S-estimator", "MM-estimator"),
       lty = rep(1, 7), bty = "n",
       col = c(1, 2, 3, 4, 5, 6, 7))
Now we are going to test the models we have just fitted, splitting the data and comparing predictions on the held-out rows with the min-max accuracy.
# Random 80/20 split; without a fixed seed the exact numbers below will vary
training_rows <- sample(1:nrow(Duncan), 0.8 * nrow(Duncan))
training_data <- Duncan[training_rows, ]
test_data <- Duncan[-training_rows, ]

# Note: the models were fitted on the full dataset above; refitting them on
# training_data would give a stricter out-of-sample comparison
lm_Predicted <- predict(fitLS, test_data)
rob_Predicted <- predict(fitLTS, test_data)
lm_actuals_pred <- cbind(lm_Predicted, test_data$income)
rob_actuals_pred <- cbind(rob_Predicted, test_data$income)

# Min-max accuracy: row-wise min/max ratio, averaged over the test set
mean(apply(lm_actuals_pred, 1, min) /
     apply(lm_actuals_pred, 1, max))
## [1] 0.6436558
mean(apply(rob_actuals_pred, 1, min) /
     apply(rob_actuals_pred, 1, max))
## [1] 0.6489016
As we can see, the robust LTS fit attains a slightly higher min-max accuracy than the simple linear regression.
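The same check could be run over every robust fit; a short sketch (the helper minmax_acc is hypothetical, not part of the original code):

# Hypothetical helper: min-max accuracy of a fitted model on new data
minmax_acc <- function(fit, newdata) {
  ap <- cbind(predict(fit, newdata), newdata$income)
  mean(apply(ap, 1, min) / apply(ap, 1, max))
}
sapply(list(Huber = fitH, LAD = fitLAD, LTS = fitLTS,
            LMS = fitLMS, S = fitS, MM = fitMM),
       minmax_acc, newdata = test_data)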
A model is classically chosen by its \(R^2\) value. For the ordinary least squares fit, the \(R^2\) and the coefficient p-values are:

## [1] 0.5249182
##  (Intercept)    education
## 4.754253e-02 1.839946e-08

For the robust fit:

## [1] 0.7578587
##    Intercept Duncan$education
## 2.533966e-01     6.863292e-14
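The chunk behind these numbers is not shown; a plausible reconstruction (using the squared correlation between fitted and observed values as the robust pseudo-\(R^2\) is an assumption) is:

# OLS: R-squared and coefficient p-values straight from summary()
summary(fitLS)$r.squared
summary(fitLS)$coefficients[, "Pr(>|t|)"]

# Robust fit: squared correlation between fitted and observed values
# as a simple R-squared analogue (one common choice, assumed here)
cor(fitted(fitMM), Duncan$income)^2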
The robust method improves \(R^2\) by about 23 percentage points (from 0.52 to 0.76), which is a substantial improvement.