Robust Regression

Introduction

In trying to better understand how to handle real-world data sets, I dug into robust regression to better understand the scenario when errors are not normally distributed. Robust regression “is designed to estimate the mean relationship between the predictors and response” according to the LMR textbook. Robust regression is a good approach for use with multiple outliers, not a couple outliers.

Implementation

Using the data from https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-07-07, I wanted to understand the advantages to using robust regression and compare the performance of robust regression against a naive linear regression model. The below implementation attempts to predict a coffee rating based on the body, balance, and aftertaste of the coffee.

library(MASS)
coffee_data <- read.csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv')

# complete cases
coffee_data <- coffee_data[complete.cases(coffee_data), ]

summary(coffee_data)
##  total_cup_points    species                             owner   
##  Min.   :63.08    Arabica:130   juan luis alvarado romero   :45  
##  1st Qu.:81.73    Robusta:  2   ipanema coffees             :23  
##  Median :82.75                  lin, che-hao krude 林哲豪   :13  
##  Mean   :82.34                  bismarck castro             :11  
##  3rd Qu.:83.58                  consejo salvadoreño del café: 7  
##  Max.   :86.58                  ceca, s.a.                  : 4  
##                                 (Other)                     :29  
##    country_of_origin              farm_name 
##  Guatemala  :45      fazenda capoeirnha:13  
##  Brazil     :26      capoeirinha       :10  
##  Honduras   :17      la esmeralda      : 7  
##  Taiwan     :13      la esperanza      : 6  
##  Costa Rica : 8      las cuchillas     : 5  
##  El Salvador: 7      las merceditas    : 5  
##  (Other)    :16      (Other)           :86  
##                                                          lot_number 
##  020/17                                                       :  6  
##  019/17                                                       :  5  
##  1                                                            :  4  
##  2016 Tainan Coffee Cupping Event Micro Lot 臺南市咖啡評鑑批次:  3  
##  1-198                                                        :  2  
##  103                                                          :  2  
##  (Other)                                                      :110  
##                                mill            ico_number
##  beneficio ixchel                :32   Taiwan       :13  
##  dry mill                        :30   002/1660/0105: 7  
##  cigrah s.a de c.v.              : 9   002/1660/0080: 6  
##  beneficio montañas del diamante : 4   002/1660/0079: 5  
##  beneficio atlantic condega      : 3   002/1660/0106: 2  
##  beneficio exportacafe agua santa: 3   09-030-273   : 2  
##  (Other)                         :51   (Other)      :97  
##                          company      altitude             region  
##  unex guatemala, s.a.        :32   1700   :11   south of minas:23  
##  ipanema coffees             :23   890    :11   san marcos    :11  
##  taiwan coffee laboratory    :13   4000   :10   comayagua     :10  
##  cigrah s.a de c.v           :11   1500   : 9   oriente       :10  
##  ceca,s.a.                   : 4   1400   : 7   tarrazu       : 8  
##  consejo salvadoreño del café: 4   1250   : 5   santa rosa    : 7  
##  (Other)                     :45   (Other):79   (Other)       :63  
##                  producer  number_of_bags    bag_weight
##  Ipanema Agricola    :12   Min.   :  1.0   69 kg  :77  
##  Ipanema Agricola S.A:11   1st Qu.: 50.0   60 kg  :29  
##  JESUS RAMIREZ       : 7   Median :250.0   50 kg  : 5  
##  ANGEL DE LEON       : 5   Mean   :190.2   10 kg  : 3  
##  JORGE LEAL          : 5   3rd Qu.:290.0   20 kg  : 3  
##  Reinerio Zepeda     : 5   Max.   :550.0   59 kg  : 3  
##  (Other)             :87                   (Other):12  
##                                   in_country_partner      harvest_year
##  Asociacion Nacional Del Café              :45       2016       :56   
##  Brazil Specialty Coffee Association       :26       2017       :40   
##  Instituto Hondureño del Café              :22       2015       :22   
##  Specialty Coffee Association              :17       2017 / 2018:12   
##  Specialty Coffee Association of Costa Rica: 8       2016 / 2017: 2   
##  Salvadoran Coffee Council                 : 7       08/09 crop : 0   
##  (Other)                                   : 7       (Other)    : 0   
##              grading_date                         owner_1             variety  
##  October 20th, 2017:11    Juan Luis Alvarado Romero   :45   Bourbon       :62  
##  August 16th, 2016 : 9    Ipanema Coffees             :23   Caturra       :33  
##  June 1st, 2017    : 8    Lin, Che-Hao Krude 林哲豪   :13   Catuai        :10  
##  August 22nd, 2017 : 7    Bismarck Castro             :11   Typica        : 8  
##  May 18th, 2016    : 6    Consejo Salvadoreño del Café: 7   Yellow Bourbon: 5  
##  August 23rd, 2017 : 5    CECA, S.A.                  : 4   Pacas         : 4  
##  (Other)           :86    (Other)                     :29   (Other)       :10  
##                  processing_method     aroma           flavor     
##  Natural / Dry            :34      Min.   :7.080   Min.   :6.580  
##  Other                    : 5      1st Qu.:7.500   1st Qu.:7.500  
##  Pulped natural / honey   : 5      Median :7.580   Median :7.580  
##  Semi-washed / Semi-pulped: 0      Mean   :7.568   Mean   :7.555  
##  Washed / Wet             :88      3rd Qu.:7.670   3rd Qu.:7.690  
##                                    Max.   :8.080   Max.   :8.000  
##                                                                   
##    aftertaste       acidity           body          balance     
##  Min.   :6.330   Min.   :6.250   Min.   :6.420   Min.   :6.080  
##  1st Qu.:7.250   1st Qu.:7.500   1st Qu.:7.420   1st Qu.:7.420  
##  Median :7.420   Median :7.580   Median :7.500   Median :7.500  
##  Mean   :7.404   Mean   :7.581   Mean   :7.537   Mean   :7.505  
##  3rd Qu.:7.580   3rd Qu.:7.750   3rd Qu.:7.670   3rd Qu.:7.670  
##  Max.   :8.000   Max.   :8.250   Max.   :8.000   Max.   :8.170  
##                                                                 
##    uniformity       clean_cup        sweetness      cupper_points  
##  Min.   : 6.000   Min.   : 6.000   Min.   : 6.000   Min.   :5.170  
##  1st Qu.:10.000   1st Qu.:10.000   1st Qu.:10.000   1st Qu.:7.330  
##  Median :10.000   Median :10.000   Median :10.000   Median :7.500  
##  Mean   : 9.863   Mean   : 9.924   Mean   : 9.896   Mean   :7.498  
##  3rd Qu.:10.000   3rd Qu.:10.000   3rd Qu.:10.000   3rd Qu.:7.690  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :8.500  
##                                                                    
##     moisture       category_one_defects    quakers                color    
##  Min.   :0.00000   Min.   : 0.0000      Min.   :0.0000   Blue-Green  : 10  
##  1st Qu.:0.10000   1st Qu.: 0.0000      1st Qu.:0.0000   Bluish-Green: 15  
##  Median :0.10000   Median : 0.0000      Median :0.0000   Green       :107  
##  Mean   :0.09038   Mean   : 0.3561      Mean   :0.7197   None        :  0  
##  3rd Qu.:0.11000   3rd Qu.: 0.0000      3rd Qu.:1.0000                     
##  Max.   :0.13000   Max.   :31.0000      Max.   :8.0000                     
##                                                                            
##  category_two_defects              expiration
##  Min.   : 0.000       October 20th, 2018:11  
##  1st Qu.: 1.000       August 16th, 2017 : 9  
##  Median : 2.000       June 1st, 2018    : 8  
##  Mean   : 2.795       August 22nd, 2018 : 7  
##  3rd Qu.: 4.000       May 18th, 2017    : 6  
##  Max.   :12.000       August 23rd, 2018 : 5  
##                       (Other)           :86  
##                                   certification_body
##  Asociacion Nacional Del Café              :45      
##  Brazil Specialty Coffee Association       :26      
##  Instituto Hondureño del Café              :22      
##  Specialty Coffee Association              :17      
##  Specialty Coffee Association of Costa Rica: 8      
##  Salvadoran Coffee Council                 : 7      
##  (Other)                                   : 7      
##                               certification_address
##  b1f20fe3a819fd6b2ee0eb8fdc3da256604f1e53:45       
##  3297cfa4c538e3dd03f72cc4082c54f7999e1f9d:26       
##  b4660a57e9f8cc613ae5b8f02bfce8634c763ab4:22       
##  36d0d00a3724338ba7937c52a378d085f2172daa:15       
##  8e0b118f3cf3121ab27c5387deacdb7d4d2a60b1: 8       
##  3d4987e3b91399dbb3938b5bdf53893b6ef45be1: 7       
##  (Other)                                 : 9       
##                               certification_contact unit_of_measurement
##  724f04ad10ed31dbb9d260f0dfd221ba48be8a95:45        ft: 17             
##  8900f0bf1d0b2bafe6807a73562c7677d57eb980:26        m :115             
##  7f521ca403540f81ec99daec7da19c2788393880:22                           
##  0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660:15                           
##  5eb2b7129d9714c43825e44dc3bca9423de209e9: 8                           
##  27b21e368fb8291cbea02c60623fe6c98f84524d: 7                           
##  (Other)                                 : 9                           
##  altitude_low_meters altitude_high_meters altitude_mean_meters
##  Min.   :   157.9    Min.   :   157.9     Min.   :   157.9    
##  1st Qu.:   941.5    1st Qu.:   941.5     1st Qu.:   941.5    
##  Median :  1250.0    Median :  1250.0     Median :  1250.0    
##  Mean   :  4153.7    Mean   :  4153.7     Mean   :  4153.7    
##  3rd Qu.:  1573.8    3rd Qu.:  1573.8     3rd Qu.:  1573.8    
##  Max.   :190164.0    Max.   :190164.0     Max.   :190164.0    
## 

Output summary of the dataset to understand independent variables and possibility of missing data.

rlmod <- rlm(total_cup_points ~ body + balance + aftertaste, data=coffee_data)

summary(rlmod)
## 
## Call: rlm(formula = total_cup_points ~ body + balance + aftertaste, 
##     data = coffee_data)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -11.61620  -0.24950   0.07583   0.26297   1.10181 
## 
## Coefficients:
##             Value   Std. Error t value
## (Intercept) 34.4199  1.3326    25.8291
## body         1.5604  0.2865     5.4455
## balance      2.6729  0.2881     9.2764
## aftertaste   2.2129  0.2648     8.3566
## 
## Residual standard error: 0.3979 on 128 degrees of freedom

Output the robust regression linear model.

lmod <- lm(total_cup_points ~ body + balance + aftertaste, data=coffee_data)

summary(lmod)
## 
## Call:
## lm(formula = total_cup_points ~ body + balance + aftertaste, 
##     data = coffee_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9253 -0.2255  0.3555  0.6374  1.6192 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.4004     3.9588   5.658 9.53e-08 ***
## body          1.0740     0.8513   1.262  0.20937    
## balance       4.7842     0.8560   5.589 1.31e-07 ***
## aftertaste    2.1519     0.7867   2.735  0.00711 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.309 on 128 degrees of freedom
## Multiple R-squared:  0.6733, Adjusted R-squared:  0.6657 
## F-statistic: 87.95 on 3 and 128 DF,  p-value: < 2.2e-16

Output the naive linear regression model.

# From: http://r-statistics.co/Robust-Regression-With-R.html
# Errors from lm() model
DMwR::regr.eval(coffee_data$total_cup_points, lmod$fitted.values)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
##         mae         mse        rmse        mape 
## 0.791620490 1.660919716 1.288766742 0.009932308

Output the evaluation of the fitted values for the naive linear regression model.

# Errors from rlm() model
DMwR::regr.eval(coffee_data$total_cup_points, rlmod$fitted.values)
##         mae         mse        rmse        mape 
## 0.609320722 1.940124354 1.392883468 0.007801582

Output the evaluation of the fitted values for the robust regression model.

As expected, the robust regression model outperforms the simple linear regression model in MAE, mean absolute error.

# From: https://stats.idre.ucla.edu/r/dae/robust-regression/
hweights <- data.frame(points = coffee_data$total_cup_points, resid = rlmod$resid, weight = rlmod$w)
hweights2 <- hweights[order(rlmod$w), ]
hweights2[1:15, ]
##      points      resid     weight
## 1309  63.08 -11.616203 0.04606813
## 1236  78.00  -4.389796 0.12190522
## 1193  79.08  -4.228610 0.12655212
## 1254  77.33  -3.833515 0.13959405
## 963   81.25  -3.075324 0.17400909
## 1185  79.17  -3.005965 0.17802770
## 837   81.83  -2.650429 0.20190915
## 1323  81.58  -2.343134 0.22838061
## 1320  82.50  -2.294385 0.23324204
## 1145  79.75  -1.983669 0.26978077
## 1158  79.67  -1.963544 0.27254765
## 535   82.92  -1.936620 0.27632506
## 1015  80.92  -1.735061 0.30844406
## 317   83.67  -1.588809 0.33682775
## 260   83.92   1.101814 0.48567135

Note above output shows that as the absolute residual goes down, the weight goes up.

Conclusion

Reading through further definitions of robust regression outside of the textbook, I realized the potential values in the statistical method recognizing that heteroscedasticity is likely in many data sets. The further I journey through the MSDS program, the more I do realize the assumptions needed for ordinary least squares evaluation may not be true for a given data set. Another appealing aspect of robust regression is the use with income or wealth and trying to understand inequality better. As the year 2020 has shown, inequality of income and access to resources can play a meaningful role in one’s life. The application of robust regression can be applied to income and expenditures as those with greater income have a likely greater variance in expenditure size.