Consider the dataset provided (rrdata.RData), where one is interested in predicting crimes per 100,000 people (crime), given the percent of population living under poverty line (poverty), and percent of population that are single parents (single).
## load the data
load(file="rrdata.RData")
library("MASS")
rr.mlr <- lm(crime ~ poverty + single, data = rrdata)
rr.huber <- rlm(crime ~ poverty + single, data = rrdata)
summary(rr.mlr)
##
## Call:
## lm(formula = crime ~ poverty + single, data = rrdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -811.14 -114.27 -22.44 121.86 689.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1368.189 187.205 -7.308 2.48e-09 ***
## poverty 6.787 8.989 0.755 0.454
## single 166.373 19.423 8.566 3.12e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 243.6 on 48 degrees of freedom
## Multiple R-squared: 0.7072, Adjusted R-squared: 0.695
## F-statistic: 57.96 on 2 and 48 DF, p-value: 1.578e-13
summary(rr.huber)
##
## Call: rlm(formula = crime ~ poverty + single, data = rrdata)
## Residuals:
## Min 1Q Median 3Q Max
## -846.09 -125.80 -16.49 119.15 679.94
##
## Coefficients:
## Value Std. Error t value
## (Intercept) -1423.0373 167.5899 -8.4912
## poverty 8.8677 8.0467 1.1020
## single 168.9858 17.3878 9.7186
##
## Residual standard error: 181.8 on 48 degrees of freedom
Residual Standard Error for MLR is 243.6 while the same for Huber is 181.8.
Plotting graphs to determine outliers:
par(mfrow = c(1,1))
plot(rr.mlr$fitted.values ~ rrdata$crime, main = "Fitted vs Actual")
plot(rr.mlr$residuals,main = "Residual Plot")
outlier.res = boxplot(rr.mlr$residuals, plot=FALSE)$out
Identifying the outliers in residuals:
#identify(rep(1, length(rr.mlr$residuals)), rr.mlr$residuals, labels = seq_along(rr.mlr$residuals))
rrdata$residuals <- rr.mlr$residuals
tail(rrdata$residuals,5)
## [1] -145.4986 -183.6078 -138.3937 -232.9081 434.1663
Summary statistics & Residual Standard Error of the models after removing the outliers:
#Sort rrdata by residuals
rrdata1 <- rrdata[order(abs(rrdata$residuals)),]
#Remove last 5 rows from the dataset
rrdataNoOutlier <- head(rrdata1, -5)
rr.mlr.wo.outlier <- lm(crime ~ poverty + single, data = rrdataNoOutlier)
rr.huber.wo.outlier <- rlm(crime ~ poverty + single, data = rrdataNoOutlier)
summary(rr.mlr.wo.outlier)
##
## Call:
## lm(formula = crime ~ poverty + single, data = rrdataNoOutlier)
##
## Residuals:
## Min 1Q Median 3Q Max
## -355.35 -114.16 -30.70 96.06 321.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1168.379 192.998 -6.054 3.05e-07 ***
## poverty 8.500 6.521 1.303 0.199
## single 147.054 18.526 7.938 5.81e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 166.7 on 43 degrees of freedom
## Multiple R-squared: 0.6592, Adjusted R-squared: 0.6433
## F-statistic: 41.58 on 2 and 43 DF, p-value: 8.901e-11
summary(rr.huber.wo.outlier)
##
## Call: rlm(formula = crime ~ poverty + single, data = rrdataNoOutlier)
## Residuals:
## Min 1Q Median 3Q Max
## -348.79 -106.77 -23.25 101.04 329.26
##
## Coefficients:
## Value Std. Error t value
## (Intercept) -1155.2532 211.2772 -5.4679
## poverty 8.8081 7.1391 1.2338
## single 144.8869 20.2806 7.1441
##
## Residual standard error: 158 on 43 degrees of freedom
Residual Standard Error for MLR is 167.7 while the same for Huber is 158.
Since robust regression takes care of outliers while modeling on training dataset, we notice that the RMSE for MLR drops significantly on removing the outliers, but does not drop as much for robust regression with huber loss.
Our dataset essentially had 2-3 outliers keeping the trend in picture, but since we removed 5 points with highest residuals, we see a drop in RSE for robust regression with huber loss as well.