In trying to better understand how to handle real-world data sets, I dug into robust regression to better understand the scenario when errors are not normally distributed. Robust regression “is designed to estimate the mean relationship between the predictors and response” according to the LMR textbook. Robust regression is a good approach for use with multiple outliers, not a couple outliers.
Using the data from https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-07-07, I wanted to understand the advantages to using robust regression and compare the performance of robust regression against a naive linear regression model. The below implementation attempts to predict a coffee rating based on the body, balance, and aftertaste of the coffee.
library(MASS)
coffee_data <- read.csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv')
# complete cases
coffee_data <- coffee_data[complete.cases(coffee_data), ]
summary(coffee_data)
## total_cup_points species owner
## Min. :63.08 Arabica:130 juan luis alvarado romero :45
## 1st Qu.:81.73 Robusta: 2 ipanema coffees :23
## Median :82.75 lin, che-hao krude 林哲豪 :13
## Mean :82.34 bismarck castro :11
## 3rd Qu.:83.58 consejo salvadoreño del café: 7
## Max. :86.58 ceca, s.a. : 4
## (Other) :29
## country_of_origin farm_name
## Guatemala :45 fazenda capoeirnha:13
## Brazil :26 capoeirinha :10
## Honduras :17 la esmeralda : 7
## Taiwan :13 la esperanza : 6
## Costa Rica : 8 las cuchillas : 5
## El Salvador: 7 las merceditas : 5
## (Other) :16 (Other) :86
## lot_number
## 020/17 : 6
## 019/17 : 5
## 1 : 4
## 2016 Tainan Coffee Cupping Event Micro Lot 臺南市咖啡評鑑批次: 3
## 1-198 : 2
## 103 : 2
## (Other) :110
## mill ico_number
## beneficio ixchel :32 Taiwan :13
## dry mill :30 002/1660/0105: 7
## cigrah s.a de c.v. : 9 002/1660/0080: 6
## beneficio montañas del diamante : 4 002/1660/0079: 5
## beneficio atlantic condega : 3 002/1660/0106: 2
## beneficio exportacafe agua santa: 3 09-030-273 : 2
## (Other) :51 (Other) :97
## company altitude region
## unex guatemala, s.a. :32 1700 :11 south of minas:23
## ipanema coffees :23 890 :11 san marcos :11
## taiwan coffee laboratory :13 4000 :10 comayagua :10
## cigrah s.a de c.v :11 1500 : 9 oriente :10
## ceca,s.a. : 4 1400 : 7 tarrazu : 8
## consejo salvadoreño del café: 4 1250 : 5 santa rosa : 7
## (Other) :45 (Other):79 (Other) :63
## producer number_of_bags bag_weight
## Ipanema Agricola :12 Min. : 1.0 69 kg :77
## Ipanema Agricola S.A:11 1st Qu.: 50.0 60 kg :29
## JESUS RAMIREZ : 7 Median :250.0 50 kg : 5
## ANGEL DE LEON : 5 Mean :190.2 10 kg : 3
## JORGE LEAL : 5 3rd Qu.:290.0 20 kg : 3
## Reinerio Zepeda : 5 Max. :550.0 59 kg : 3
## (Other) :87 (Other):12
## in_country_partner harvest_year
## Asociacion Nacional Del Café :45 2016 :56
## Brazil Specialty Coffee Association :26 2017 :40
## Instituto Hondureño del Café :22 2015 :22
## Specialty Coffee Association :17 2017 / 2018:12
## Specialty Coffee Association of Costa Rica: 8 2016 / 2017: 2
## Salvadoran Coffee Council : 7 08/09 crop : 0
## (Other) : 7 (Other) : 0
## grading_date owner_1 variety
## October 20th, 2017:11 Juan Luis Alvarado Romero :45 Bourbon :62
## August 16th, 2016 : 9 Ipanema Coffees :23 Caturra :33
## June 1st, 2017 : 8 Lin, Che-Hao Krude 林哲豪 :13 Catuai :10
## August 22nd, 2017 : 7 Bismarck Castro :11 Typica : 8
## May 18th, 2016 : 6 Consejo Salvadoreño del Café: 7 Yellow Bourbon: 5
## August 23rd, 2017 : 5 CECA, S.A. : 4 Pacas : 4
## (Other) :86 (Other) :29 (Other) :10
## processing_method aroma flavor
## Natural / Dry :34 Min. :7.080 Min. :6.580
## Other : 5 1st Qu.:7.500 1st Qu.:7.500
## Pulped natural / honey : 5 Median :7.580 Median :7.580
## Semi-washed / Semi-pulped: 0 Mean :7.568 Mean :7.555
## Washed / Wet :88 3rd Qu.:7.670 3rd Qu.:7.690
## Max. :8.080 Max. :8.000
##
## aftertaste acidity body balance
## Min. :6.330 Min. :6.250 Min. :6.420 Min. :6.080
## 1st Qu.:7.250 1st Qu.:7.500 1st Qu.:7.420 1st Qu.:7.420
## Median :7.420 Median :7.580 Median :7.500 Median :7.500
## Mean :7.404 Mean :7.581 Mean :7.537 Mean :7.505
## 3rd Qu.:7.580 3rd Qu.:7.750 3rd Qu.:7.670 3rd Qu.:7.670
## Max. :8.000 Max. :8.250 Max. :8.000 Max. :8.170
##
## uniformity clean_cup sweetness cupper_points
## Min. : 6.000 Min. : 6.000 Min. : 6.000 Min. :5.170
## 1st Qu.:10.000 1st Qu.:10.000 1st Qu.:10.000 1st Qu.:7.330
## Median :10.000 Median :10.000 Median :10.000 Median :7.500
## Mean : 9.863 Mean : 9.924 Mean : 9.896 Mean :7.498
## 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.:7.690
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :8.500
##
## moisture category_one_defects quakers color
## Min. :0.00000 Min. : 0.0000 Min. :0.0000 Blue-Green : 10
## 1st Qu.:0.10000 1st Qu.: 0.0000 1st Qu.:0.0000 Bluish-Green: 15
## Median :0.10000 Median : 0.0000 Median :0.0000 Green :107
## Mean :0.09038 Mean : 0.3561 Mean :0.7197 None : 0
## 3rd Qu.:0.11000 3rd Qu.: 0.0000 3rd Qu.:1.0000
## Max. :0.13000 Max. :31.0000 Max. :8.0000
##
## category_two_defects expiration
## Min. : 0.000 October 20th, 2018:11
## 1st Qu.: 1.000 August 16th, 2017 : 9
## Median : 2.000 June 1st, 2018 : 8
## Mean : 2.795 August 22nd, 2018 : 7
## 3rd Qu.: 4.000 May 18th, 2017 : 6
## Max. :12.000 August 23rd, 2018 : 5
## (Other) :86
## certification_body
## Asociacion Nacional Del Café :45
## Brazil Specialty Coffee Association :26
## Instituto Hondureño del Café :22
## Specialty Coffee Association :17
## Specialty Coffee Association of Costa Rica: 8
## Salvadoran Coffee Council : 7
## (Other) : 7
## certification_address
## b1f20fe3a819fd6b2ee0eb8fdc3da256604f1e53:45
## 3297cfa4c538e3dd03f72cc4082c54f7999e1f9d:26
## b4660a57e9f8cc613ae5b8f02bfce8634c763ab4:22
## 36d0d00a3724338ba7937c52a378d085f2172daa:15
## 8e0b118f3cf3121ab27c5387deacdb7d4d2a60b1: 8
## 3d4987e3b91399dbb3938b5bdf53893b6ef45be1: 7
## (Other) : 9
## certification_contact unit_of_measurement
## 724f04ad10ed31dbb9d260f0dfd221ba48be8a95:45 ft: 17
## 8900f0bf1d0b2bafe6807a73562c7677d57eb980:26 m :115
## 7f521ca403540f81ec99daec7da19c2788393880:22
## 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660:15
## 5eb2b7129d9714c43825e44dc3bca9423de209e9: 8
## 27b21e368fb8291cbea02c60623fe6c98f84524d: 7
## (Other) : 9
## altitude_low_meters altitude_high_meters altitude_mean_meters
## Min. : 157.9 Min. : 157.9 Min. : 157.9
## 1st Qu.: 941.5 1st Qu.: 941.5 1st Qu.: 941.5
## Median : 1250.0 Median : 1250.0 Median : 1250.0
## Mean : 4153.7 Mean : 4153.7 Mean : 4153.7
## 3rd Qu.: 1573.8 3rd Qu.: 1573.8 3rd Qu.: 1573.8
## Max. :190164.0 Max. :190164.0 Max. :190164.0
##
Output summary of the dataset to understand independent variables and possibility of missing data.
rlmod <- rlm(total_cup_points ~ body + balance + aftertaste, data=coffee_data)
summary(rlmod)
##
## Call: rlm(formula = total_cup_points ~ body + balance + aftertaste,
## data = coffee_data)
## Residuals:
## Min 1Q Median 3Q Max
## -11.61620 -0.24950 0.07583 0.26297 1.10181
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 34.4199 1.3326 25.8291
## body 1.5604 0.2865 5.4455
## balance 2.6729 0.2881 9.2764
## aftertaste 2.2129 0.2648 8.3566
##
## Residual standard error: 0.3979 on 128 degrees of freedom
Output the robust regression linear model.
lmod <- lm(total_cup_points ~ body + balance + aftertaste, data=coffee_data)
summary(lmod)
##
## Call:
## lm(formula = total_cup_points ~ body + balance + aftertaste,
## data = coffee_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9253 -0.2255 0.3555 0.6374 1.6192
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.4004 3.9588 5.658 9.53e-08 ***
## body 1.0740 0.8513 1.262 0.20937
## balance 4.7842 0.8560 5.589 1.31e-07 ***
## aftertaste 2.1519 0.7867 2.735 0.00711 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.309 on 128 degrees of freedom
## Multiple R-squared: 0.6733, Adjusted R-squared: 0.6657
## F-statistic: 87.95 on 3 and 128 DF, p-value: < 2.2e-16
Output the naive linear regression model.
# From: http://r-statistics.co/Robust-Regression-With-R.html
# Errors from lm() model
DMwR::regr.eval(coffee_data$total_cup_points, lmod$fitted.values)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## mae mse rmse mape
## 0.791620490 1.660919716 1.288766742 0.009932308
Output the evaluation of the fitted values for the naive linear regression model.
# Errors from rlm() model
DMwR::regr.eval(coffee_data$total_cup_points, rlmod$fitted.values)
## mae mse rmse mape
## 0.609320722 1.940124354 1.392883468 0.007801582
Output the evaluation of the fitted values for the robust regression model.
As expected, the robust regression model outperforms the simple linear regression model in MAE, mean absolute error.
# From: https://stats.idre.ucla.edu/r/dae/robust-regression/
hweights <- data.frame(points = coffee_data$total_cup_points, resid = rlmod$resid, weight = rlmod$w)
hweights2 <- hweights[order(rlmod$w), ]
hweights2[1:15, ]
## points resid weight
## 1309 63.08 -11.616203 0.04606813
## 1236 78.00 -4.389796 0.12190522
## 1193 79.08 -4.228610 0.12655212
## 1254 77.33 -3.833515 0.13959405
## 963 81.25 -3.075324 0.17400909
## 1185 79.17 -3.005965 0.17802770
## 837 81.83 -2.650429 0.20190915
## 1323 81.58 -2.343134 0.22838061
## 1320 82.50 -2.294385 0.23324204
## 1145 79.75 -1.983669 0.26978077
## 1158 79.67 -1.963544 0.27254765
## 535 82.92 -1.936620 0.27632506
## 1015 80.92 -1.735061 0.30844406
## 317 83.67 -1.588809 0.33682775
## 260 83.92 1.101814 0.48567135
Note above output shows that as the absolute residual goes down, the weight goes up.
Reading through further definitions of robust regression outside of the textbook, I realized the potential values in the statistical method recognizing that heteroscedasticity is likely in many data sets. The further I journey through the MSDS program, the more I do realize the assumptions needed for ordinary least squares evaluation may not be true for a given data set. Another appealing aspect of robust regression is the use with income or wealth and trying to understand inequality better. As the year 2020 has shown, inequality of income and access to resources can play a meaningful role in one’s life. The application of robust regression can be applied to income and expenditures as those with greater income have a likely greater variance in expenditure size.