Robust Regression

Robust regression deals with influential observations and outliers.

What do we do when we find outliers in linear regression model? If we drop outliers, it changes the data distribution. We often drop outliers when they are very extreme.

Robust regression does not drop outliers, but it reduces the effect of outliers.

Whenever Linear Regression (least square regression) model fundamental assumptions are violated, Robust or Resistant regression method is preferred to build the model.

library(MASS)
View(phones)
phones <- data.frame(phones)

View(phones)
colnames(phones)
## [1] "year"  "calls"
#Plot to check relationship
plot(phones$year, phones$calls)

#Boxplot to check outliers
boxplot(phones$calls, horizontal = T)

#Linear regression model
reg <- lm(calls  ~ year, data = phones)
reg
## 
## Call:
## lm(formula = calls ~ year, data = phones)
## 
## Coefficients:
## (Intercept)         year  
##    -260.059        5.041
summary(reg)
## 
## Call:
## lm(formula = calls ~ year, data = phones)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -78.97 -33.52 -12.04  23.38 124.20 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -260.059    102.607  -2.535   0.0189 * 
## year           5.041      1.658   3.041   0.0060 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.22 on 22 degrees of freedom
## Multiple R-squared:  0.2959, Adjusted R-squared:  0.2639 
## F-statistic: 9.247 on 1 and 22 DF,  p-value: 0.005998
plot(reg)

The above plots show that linear regression model assumptions are violated. we can also see quite a few influential observations and outliers as well. So in this case we build Robust Regression to overcome this issue.

#Prediction
#should split data to use train and test, but used phones
p1 <- predict(reg, phones)
#evaluation of residuals
library(DMwR)
## Warning: package 'DMwR' was built under R version 3.5.3
## Loading required package: lattice
## Loading required package: grid
regr.eval(p1, phones$calls)
##          mae          mse         rmse         mape 
##   42.4531014 2897.6474925   53.8298012    0.8776112

Robust Regression

#rlm stands for robust linear method
robust.reg <- rlm(calls ~ year, data = phones)
## Warning in rlm.default(x, y, weights, method = method, wt.method =
## wt.method, : 'rlm' failed to converge in 20 steps
robust.reg
## Call:
## rlm(formula = calls ~ year, data = phones)
## Ran 20 iterations without convergence
## 
## Coefficients:
## (Intercept)        year 
## -105.992444    2.105068 
## 
## Degrees of freedom: 24 total; 22 residual
## Scale estimate: 9.85

In the above output, failed to converge in 20 iterations means the effect of outliers or influential values still there in this model. By default, the Robust regression runs 20 itations. But we can also increase the number of iterations to reduce the effect of outliers or influential values by using a command called “maxit=?”.

summary(robust.reg)
## 
## Call: rlm(formula = calls ~ year, data = phones)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.467  -6.358  -1.784  26.421 172.743 
## 
## Coefficients:
##             Value     Std. Error t value  
## (Intercept) -105.9924   28.8789    -3.6702
## year           2.1051    0.4666     4.5112
## 
## Residual standard error: 9.855 on 22 degrees of freedom
robust.reg.50 <- rlm(calls ~ year, data=phones, maxit=50)
plot(robust.reg)

#prediction of robust model
pred.robust <- predict(robust.reg, phones)

#evaluation of errors
regr.eval(pred.robust, phones$calls)
##         mae         mse        rmse        mape 
##   35.618901 4014.249971   63.358109    1.474489

Resistant Regression

Resistant regression always rejects outliers and it is resistant to outliers. It is very useful in terms of detecting outliers.

These are two types: 1. Least median squares (LMS) 2. Least trimmed squares (LTS)

LMS minimizes the median of squared residuals. Replaces the sum in least square model method with median.

LTS removes the outliers and builds the model.

#LTS method
resis.lts.reg <- lqs(calls ~ year, data = phones)
resis.lts.reg
## Call:
## lqs.formula(formula = calls ~ year, data = phones)
## 
## Coefficients:
## (Intercept)         year  
##     -56.162        1.159  
## 
## Scale estimates 1.249 1.131
summary(resis.lts.reg)
##               Length Class      Mode     
## crit           1     -none-     numeric  
## sing           1     -none-     character
## coefficients   2     -none-     numeric  
## bestone        2     -none-     numeric  
## fitted.values 24     -none-     numeric  
## residuals     24     -none-     numeric  
## scale          2     -none-     numeric  
## terms          3     terms      call     
## call           3     -none-     call     
## xlevels        0     -none-     list     
## model          2     data.frame list
#prediction
res.lts.pred <- predict(resis.lts.reg, phones)

#evaluation of errors
regr.eval(res.lts.pred, phones$calls)
##         mae         mse        rmse        mape 
##   35.306410 4837.056918   69.548953    1.759383