Consider the dataset provided (rrdata.RData), where one is interested in predicting crimes per 100,000 people (crime), given the percent of population living under poverty line (poverty), and percent of population that are single parents (single).

## load the data
load(file="rrdata.RData")

You are trying to predict “crime” based on only two independent variables, “poverty” and “single”.

(a) Try to predict “crime” using (i) an MLR (using lm function), and then (ii) a robust regression using Huber loss on rrdata. For (ii), use the rlm function, available in MASS package, for robust regression variant that uses Huber loss as follows. Report “Residual Standard Error” for both MLR and robust regression with Huber loss.

library("MASS")
rr.mlr <- lm(crime ~ poverty + single, data = rrdata)
rr.huber <- rlm(crime ~ poverty + single, data = rrdata)

summary(rr.mlr)
## 
## Call:
## lm(formula = crime ~ poverty + single, data = rrdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -811.14 -114.27  -22.44  121.86  689.82 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1368.189    187.205  -7.308 2.48e-09 ***
## poverty         6.787      8.989   0.755    0.454    
## single        166.373     19.423   8.566 3.12e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 243.6 on 48 degrees of freedom
## Multiple R-squared:  0.7072, Adjusted R-squared:  0.695 
## F-statistic: 57.96 on 2 and 48 DF,  p-value: 1.578e-13
summary(rr.huber)
## 
## Call: rlm(formula = crime ~ poverty + single, data = rrdata)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -846.09 -125.80  -16.49  119.15  679.94 
## 
## Coefficients:
##             Value      Std. Error t value   
## (Intercept) -1423.0373   167.5899    -8.4912
## poverty         8.8677     8.0467     1.1020
## single        168.9858    17.3878     9.7186
## 
## Residual standard error: 181.8 on 48 degrees of freedom

Residual Standard Error for MLR is 243.6 while the same for Huber is 181.8.

(b) Try to identify at most five outliers in this dataset so that if you remove these 5 points the error decreases the most. For example you can look at a plot of actual versus fitted values, or a plot of the residuals to help you determine these 5 points. Remove the outliers and call the resulting dataset rrdataNoOutlier. Fit MLR and robust regression on rrdataNoOutlier, and report residual standard errors.

Plotting graphs to determine outliers:

par(mfrow = c(1,1))
plot(rr.mlr$fitted.values ~ rrdata$crime, main = "Fitted vs Actual")

plot(rr.mlr$residuals,main = "Residual Plot")

outlier.res = boxplot(rr.mlr$residuals, plot=FALSE)$out

Identifying the outliers in residuals:

#identify(rep(1, length(rr.mlr$residuals)), rr.mlr$residuals, labels = seq_along(rr.mlr$residuals))
rrdata$residuals <- rr.mlr$residuals
tail(rrdata$residuals,5)
## [1] -145.4986 -183.6078 -138.3937 -232.9081  434.1663

Summary statistics & Residual Standard Error of the models after removing the outliers:

#Sort rrdata by residuals
rrdata1 <- rrdata[order(abs(rrdata$residuals)),]

#Remove last 5 rows from the dataset
rrdataNoOutlier <- head(rrdata1, -5)

rr.mlr.wo.outlier <- lm(crime ~ poverty + single, data = rrdataNoOutlier)
rr.huber.wo.outlier <- rlm(crime ~ poverty + single, data = rrdataNoOutlier)

summary(rr.mlr.wo.outlier)
## 
## Call:
## lm(formula = crime ~ poverty + single, data = rrdataNoOutlier)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -355.35 -114.16  -30.70   96.06  321.65 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1168.379    192.998  -6.054 3.05e-07 ***
## poverty         8.500      6.521   1.303    0.199    
## single        147.054     18.526   7.938 5.81e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 166.7 on 43 degrees of freedom
## Multiple R-squared:  0.6592, Adjusted R-squared:  0.6433 
## F-statistic: 41.58 on 2 and 43 DF,  p-value: 8.901e-11
summary(rr.huber.wo.outlier)
## 
## Call: rlm(formula = crime ~ poverty + single, data = rrdataNoOutlier)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -348.79 -106.77  -23.25  101.04  329.26 
## 
## Coefficients:
##             Value      Std. Error t value   
## (Intercept) -1155.2532   211.2772    -5.4679
## poverty         8.8081     7.1391     1.2338
## single        144.8869    20.2806     7.1441
## 
## Residual standard error: 158 on 43 degrees of freedom

Residual Standard Error for MLR is 167.7 while the same for Huber is 158.

(c) Compare and comment on the model fits (plot actual vs. predicted values) obtained in (a) and (b). How do outliers affect the relative performance of MLR and robust regression with Huber loss?

Since robust regression takes care of outliers while modeling on training dataset, we notice that the RMSE for MLR drops significantly on removing the outliers, but does not drop as much for robust regression with huber loss.

Our dataset essentially had 2-3 outliers keeping the trend in picture, but since we removed 5 points with highest residuals, we see a drop in RSE for robust regression with huber loss as well.