Introduction

When dealing with linear regression models, it is tempting to convert numeric fields to factors. However, caution must be taken when doing this. One of the problems that can arrise when attempting this type of conversion is that it can cloud the detection of outliers in the dataset. This is not surprising. When we think of the numeric values of (1,2,3,36), it is not hard to detect the outliner. However, if we have a dataset like this: (“red”,“blue”,“green”,“yellow”) outlier detection becomes much more difficult if not impossible. The following example demonstrates how this can be manifested in a simple dataset in R.

The following cell constructs a dataset based on the builtin mtcars dataset in R.

rm(list=ls())
library(dplyr)
library(tidyverse)
mynewcars <- mtcars
mynewcars$ID <- seq.int(nrow(mynewcars))
mynewcars$model <- row.names(mynewcars)
row.names(mynewcars) <- mynewcars$ID
mynewcars <- mynewcars %>% select(model, cyl, wt, mpg)
mynewcars <- as_tibble(mynewcars)

The following code introduces an outliers to this dataset. And, what is more of an outlier The Homer?!

mynewcars <- rbind(mynewcars, c(model="The Homer",cyl=36, wt=7.8, mpg=9.3))

mynewcars$cyl <- as.numeric(mynewcars$cyl)
mynewcars$wt <- as.numeric(mynewcars$wt)
mynewcars$mpg <- as.numeric(mynewcars$mpg)
mynewcars

Next we will construct a model with the predictors in our dataset as numeric (dbl) values. We are attempting to predict the fuel efficiency of the cars based on their number of cylinders and weight.

mdl <- lm(mpg~cyl+wt, data=mynewcars)
summary(mdl)

Call:
lm(formula = mpg ~ cyl + wt, data = mynewcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.2918 -2.3340 -0.3548  1.7413  7.5235 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  35.2816     1.8984  18.585  < 2e-16 ***
cyl           0.3590     0.1804   1.991   0.0557 .  
wt           -5.3823     0.7902  -6.811 1.48e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.373 on 30 degrees of freedom
Multiple R-squared:  0.7246,    Adjusted R-squared:  0.7062 
F-statistic: 39.47 on 2 and 30 DF,  p-value: 3.977e-09

Based on the above we have a resonably good model with an adjusted R-squared value of 0.7062339.

If we look at the outliers for this data, what we get is no surprise.

cd1 <- cooks.distance(mdl)
outliers <- which(cd1>1)
plot(cd1, main="Numeric Cyc Column")

Based on the “rule of thumb” for Cooks distance, Homer’s car appears to be an outlier.

Now, let’s do the same thing, but convert the number of cylinders from a numeric to a factor…

mynewcars_f <- mynewcars
mynewcars_f$cyl <- as.factor(mynewcars_f$cyl)
mdl2 <- lm(mpg~cyl+wt, data=mynewcars_f)
summary(mdl2)

Call:
lm(formula = mpg ~ cyl + wt, data = mynewcars_f)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5890 -1.1557 -0.4879  1.3370  5.7915 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  33.9908     1.8878  18.006  < 2e-16 ***
cyl6         -4.2556     1.3861  -3.070 0.004718 ** 
cyl8         -6.0709     1.6523  -3.674 0.000999 ***
cyl36         0.3130     4.9411   0.063 0.949942    
wt           -3.2056     0.7539  -4.252 0.000213 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.557 on 28 degrees of freedom
Multiple R-squared:  0.8522,    Adjusted R-squared:  0.8311 
F-statistic: 40.38 on 4 and 28 DF,  p-value: 3.056e-11
cd2 <- cooks.distance(mdl2)
plot(cd2, main="Factor Cyl Column")

Now, all of a sudden, our adjusted R-Squared value has shot way up at o over 0.8311402. However, we should be cautious. If we look at the plot of the cooks distance for this model, no observations have a cooks distance greater than one. We appear to have gotten rid of the outliers in our model, but really we just maked them. If we look very carefully, we will note that the cooks distance was unable to be computed for The Homer.

Why is that?

The cooks distance formula is:

\[ D_i = \frac{\sum_{j=1}^n(\hat Y_j-\hat Y_{j(i)})^2}{(p+1)\hat\sigma^2} \]

Not to get into too much of the gorey details, but note that p on the bottom. Cooks distance summarizes how much the all the values in the model change with the ith observation is removed. In the above formula, p represents the number of parameters in our model. If we remove the The Homer from our dataset, the number of parameters in our dataset changes. We no longer have a predictor for the dummy variable for a car with 36 cylinders. If you want to, you can add more cars with 36 cylinders to the data (maybe the Batmobile or Wonder Woman’s car.) If you do, you will notice that the Cook’s distances for the models with numeric values are going to be different from those with the factors.

Now, just to wrap things up, lets return to model 1 and remove the outlier…


mynewcars2 <- mynewcars %>% slice(-outliers)
mdl3 <- lm(mpg~cyl+wt, data=mynewcars2)
summary(mdl3)

Call:
lm(formula = mpg ~ cyl + wt, data = mynewcars2)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2893 -1.5512 -0.4684  1.5743  6.1004 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
cyl          -1.5078     0.4147  -3.636 0.001064 ** 
wt           -3.1910     0.7569  -4.216 0.000222 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.568 on 29 degrees of freedom
Multiple R-squared:  0.8302,    Adjusted R-squared:  0.8185 
F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12

After removing the outlier in our data, notice that the Adjusted R2 value is now 0.8185189, which is line with the factor version of the model. So, converting to a factor didn’t really gain us anything in accuracy anyway, AND the factor hid the fact that one of our observations was an outlier.

Conclusion

This data is clearly made up. The point that I am trying to make, yet again, is that great care should be taken when converting numeric values in models into factors (categorical variables). As seen in the above example, it can have unexpected statistical ramifications including masking outliers in the data.

LS0tCnRpdGxlOiAiWWV0IEFub3RoZXIgUmVhc29uIHRvIE5PVCBVc2UgTnVtZXJpY3MgYXMgRmFjdG9ycyIKb3V0cHV0OiBodG1sX25vdGVib29rCmF1dGhvcjogTWlsZXMgUi4gUG9ydGVyCmRhdGU6IE1heSAyOSwgMjAyMAotLS0KCiMgSW50cm9kdWN0aW9uCgpXaGVuIGRlYWxpbmcgd2l0aCBsaW5lYXIgcmVncmVzc2lvbiBtb2RlbHMsIGl0IGlzIHRlbXB0aW5nIHRvIGNvbnZlcnQgbnVtZXJpYyBmaWVsZHMgdG8gZmFjdG9ycy4gIEhvd2V2ZXIsIGNhdXRpb24gbXVzdCBiZSB0YWtlbiB3aGVuIGRvaW5nIHRoaXMuICBPbmUgb2YgdGhlIHByb2JsZW1zIHRoYXQgY2FuIGFycmlzZSB3aGVuIGF0dGVtcHRpbmcgdGhpcyB0eXBlIG9mIGNvbnZlcnNpb24gaXMgdGhhdCBpdCBjYW4gY2xvdWQgdGhlIGRldGVjdGlvbiBvZiBvdXRsaWVycyBpbiB0aGUgZGF0YXNldC4gIFRoaXMgaXMgbm90IHN1cnByaXNpbmcuICBXaGVuIHdlIHRoaW5rIG9mIHRoZSBudW1lcmljIHZhbHVlcyBvZiAoMSwyLDMsMzYpLCBpdCBpcyBub3QgaGFyZCB0byBkZXRlY3QgdGhlIG91dGxpbmVyLiAgSG93ZXZlciwgaWYgd2UgaGF2ZSBhIGRhdGFzZXQgbGlrZSB0aGlzOiAgKCJyZWQiLCJibHVlIiwiZ3JlZW4iLCJ5ZWxsb3ciKSBvdXRsaWVyIGRldGVjdGlvbiBiZWNvbWVzIG11Y2ggbW9yZSBkaWZmaWN1bHQgaWYgbm90IGltcG9zc2libGUuICBUaGUgZm9sbG93aW5nIGV4YW1wbGUgZGVtb25zdHJhdGVzIGhvdyB0aGlzIGNhbiBiZSBtYW5pZmVzdGVkIGluIGEgc2ltcGxlIGRhdGFzZXQgaW4gUi4KClRoZSBmb2xsb3dpbmcgY2VsbCBjb25zdHJ1Y3RzIGEgZGF0YXNldCBiYXNlZCBvbiB0aGUgYnVpbHRpbiBtdGNhcnMgZGF0YXNldCBpbiBSLgoKYGBge3J9CnJtKGxpc3Q9bHMoKSkKbGlicmFyeShkcGx5cikKbGlicmFyeSh0aWR5dmVyc2UpCm15bmV3Y2FycyA8LSBtdGNhcnMKbXluZXdjYXJzJElEIDwtIHNlcS5pbnQobnJvdyhteW5ld2NhcnMpKQpteW5ld2NhcnMkbW9kZWwgPC0gcm93Lm5hbWVzKG15bmV3Y2FycykKcm93Lm5hbWVzKG15bmV3Y2FycykgPC0gbXluZXdjYXJzJElECm15bmV3Y2FycyA8LSBteW5ld2NhcnMgJT4lIHNlbGVjdChtb2RlbCwgY3lsLCB3dCwgbXBnKQpteW5ld2NhcnMgPC0gYXNfdGliYmxlKG15bmV3Y2FycykKYGBgCgpUaGUgZm9sbG93aW5nIGNvZGUgaW50cm9kdWNlcyBhbiBvdXRsaWVycyB0byB0aGlzIGRhdGFzZXQuICBBbmQsIHdoYXQgaXMgbW9yZSBvZiBhbiBvdXRsaWVyIFtUaGUgSG9tZXJdKGh0dHBzOi8vd3d3LnlvdXR1YmUuY29tL3dhdGNoP3Y9UHc5Z2FFaVFBeFkpPyEKCmBgYHtyfQpteW5ld2NhcnMgPC0gcmJpbmQobXluZXdjYXJzLCBjKG1vZGVsPSJUaGUgSG9tZXIiLGN5bD0zNiwgd3Q9Ny44LCBtcGc9OS4zKSkKCm15bmV3Y2FycyRjeWwgPC0gYXMubnVtZXJpYyhteW5ld2NhcnMkY3lsKQpteW5ld2NhcnMkd3QgPC0gYXMubnVtZXJpYyhteW5ld2NhcnMkd3QpCm15bmV3Y2FycyRtcGcgPC0gYXMubnVtZXJpYyhteW5ld2NhcnMkbXBnKQpteW5ld2NhcnMKYGBgCgpOZXh0IHdlIHdpbGwgY29uc3RydWN0IGEgbW9kZWwgd2l0aCB0aGUgcHJlZGljdG9ycyBpbiBvdXIgZGF0YXNldCBhcyBudW1lcmljIChkYmwpIHZhbHVlcy4gIFdlIGFyZSBhdHRlbXB0aW5nIHRvIHByZWRpY3QgdGhlIGZ1ZWwgZWZmaWNpZW5jeSBvZiB0aGUgY2FycyBiYXNlZCBvbiB0aGVpciBudW1iZXIgb2YgY3lsaW5kZXJzIGFuZCB3ZWlnaHQuCgpgYGB7cn0KbWRsIDwtIGxtKG1wZ35jeWwrd3QsIGRhdGE9bXluZXdjYXJzKQpzdW1tYXJ5KG1kbCkKYGBgCgpCYXNlZCBvbiB0aGUgYWJvdmUgd2UgaGF2ZSBhIHJlc29uYWJseSBnb29kIG1vZGVsIHdpdGggYW4gYWRqdXN0ZWQgUi1zcXVhcmVkIHZhbHVlIG9mIGByIHN1bW1hcnkobWRsKSRhZGouci5zcXVhcmVkYC4KCklmIHdlIGxvb2sgYXQgdGhlIG91dGxpZXJzIGZvciB0aGlzIGRhdGEsIHdoYXQgd2UgZ2V0IGlzIG5vIHN1cnByaXNlLgoKYGBge3J9CmNkMSA8LSBjb29rcy5kaXN0YW5jZShtZGwpCm91dGxpZXJzIDwtIHdoaWNoKGNkMT4xKQpwbG90KGNkMSwgbWFpbj0iTnVtZXJpYyBDeWMgQ29sdW1uIikKCmBgYAoKQmFzZWQgb24gdGhlICJydWxlIG9mIHRodW1iIiBmb3IgQ29va3MgZGlzdGFuY2UsIEhvbWVyJ3MgY2FyIGFwcGVhcnMgdG8gYmUgYW4gb3V0bGllci4KCk5vdywgbGV0J3MgZG8gdGhlIHNhbWUgdGhpbmcsIGJ1dCBjb252ZXJ0IHRoZSBudW1iZXIgb2YgY3lsaW5kZXJzIGZyb20gYSBudW1lcmljIHRvIGEgZmFjdG9yLi4uCgpgYGB7cn0KbXluZXdjYXJzX2YgPC0gbXluZXdjYXJzCm15bmV3Y2Fyc19mJGN5bCA8LSBhcy5mYWN0b3IobXluZXdjYXJzX2YkY3lsKQptZGwyIDwtIGxtKG1wZ35jeWwrd3QsIGRhdGE9bXluZXdjYXJzX2YpCnN1bW1hcnkobWRsMikKCmNkMiA8LSBjb29rcy5kaXN0YW5jZShtZGwyKQpwbG90KGNkMiwgbWFpbj0iRmFjdG9yIEN5bCBDb2x1bW4iKQoKYGBgCgpOb3csIGFsbCBvZiBhIHN1ZGRlbiwgb3VyIGFkanVzdGVkIFItU3F1YXJlZCB2YWx1ZSBoYXMgc2hvdCB3YXkgdXAgYXQgbyBvdmVyIGByIHN1bW1hcnkobWRsMikkYWRqLnIuc3F1YXJlZGAuICBIb3dldmVyLCB3ZSBzaG91bGQgYmUgY2F1dGlvdXMuICBJZiB3ZSBsb29rIGF0IHRoZSBwbG90IG9mIHRoZSBjb29rcyBkaXN0YW5jZSBmb3IgdGhpcyBtb2RlbCwgbm8gb2JzZXJ2YXRpb25zIGhhdmUgYSBjb29rcyBkaXN0YW5jZSBncmVhdGVyIHRoYW4gb25lLiAgV2UgYXBwZWFyIHRvIGhhdmUgZ290dGVuIHJpZCBvZiB0aGUgb3V0bGllcnMgaW4gb3VyIG1vZGVsLCBidXQgcmVhbGx5IHdlIGp1c3QgbWFrZWQgdGhlbS4gIElmIHdlIGxvb2sgdmVyeSBjYXJlZnVsbHksIHdlIHdpbGwgbm90ZSB0aGF0IHRoZSBjb29rcyBkaXN0YW5jZSB3YXMgdW5hYmxlIHRvIGJlIGNvbXB1dGVkIGZvciBUaGUgSG9tZXIuICAKCldoeSBpcyB0aGF0PwoKVGhlIGNvb2tzIGRpc3RhbmNlIGZvcm11bGEgaXM6CgokJCBEX2kgPSBcZnJhY3tcc3VtX3tqPTF9Xm4oXGhhdCBZX2otXGhhdCBZX3tqKGkpfSleMn17KHArMSlcaGF0XHNpZ21hXjJ9ICQkCgpOb3QgdG8gZ2V0IGludG8gdG9vIG11Y2ggb2YgdGhlIGdvcmV5IGRldGFpbHMsIGJ1dCBub3RlIHRoYXQgcCBvbiB0aGUgYm90dG9tLiAgQ29va3MgZGlzdGFuY2Ugc3VtbWFyaXplcyBob3cgbXVjaCB0aGUgYWxsIHRoZSB2YWx1ZXMgaW4gdGhlIG1vZGVsIGNoYW5nZSB3aXRoIHRoZSBpdGggb2JzZXJ2YXRpb24gaXMgcmVtb3ZlZC4gIEluIHRoZSBhYm92ZSBmb3JtdWxhLCBwIHJlcHJlc2VudHMgdGhlIG51bWJlciBvZiBwYXJhbWV0ZXJzIGluIG91ciBtb2RlbC4gIElmIHdlIHJlbW92ZSB0aGUgVGhlIEhvbWVyIGZyb20gb3VyIGRhdGFzZXQsIHRoZSBudW1iZXIgb2YgcGFyYW1ldGVycyBpbiBvdXIgZGF0YXNldCBjaGFuZ2VzLiAgV2Ugbm8gbG9uZ2VyIGhhdmUgYSBwcmVkaWN0b3IgZm9yIHRoZSBkdW1teSB2YXJpYWJsZSBmb3IgYSBjYXIgd2l0aCAzNiBjeWxpbmRlcnMuICBJZiB5b3Ugd2FudCB0bywgeW91IGNhbiBhZGQgbW9yZSBjYXJzIHdpdGggMzYgY3lsaW5kZXJzIHRvIHRoZSBkYXRhIChtYXliZSB0aGUgQmF0bW9iaWxlIG9yIFdvbmRlciBXb21hbidzIGNhci4pICBJZiB5b3UgZG8sIHlvdSB3aWxsIG5vdGljZSB0aGF0IHRoZSBDb29rJ3MgZGlzdGFuY2VzIGZvciB0aGUgbW9kZWxzIHdpdGggbnVtZXJpYyB2YWx1ZXMgYXJlIGdvaW5nIHRvIGJlIGRpZmZlcmVudCBmcm9tIHRob3NlIHdpdGggdGhlIGZhY3RvcnMuCgpOb3csIGp1c3QgdG8gd3JhcCB0aGluZ3MgdXAsIGxldHMgcmV0dXJuIHRvIG1vZGVsIDEgYW5kIHJlbW92ZSB0aGUgb3V0bGllci4uLgoKYGBge3J9CgpteW5ld2NhcnMyIDwtIG15bmV3Y2FycyAlPiUgc2xpY2UoLW91dGxpZXJzKQptZGwzIDwtIGxtKG1wZ35jeWwrd3QsIGRhdGE9bXluZXdjYXJzMikKc3VtbWFyeShtZGwzKQoKYGBgCgpBZnRlciByZW1vdmluZyB0aGUgb3V0bGllciBpbiBvdXIgZGF0YSwgbm90aWNlIHRoYXQgdGhlIEFkanVzdGVkIFIyIHZhbHVlIGlzIG5vdyBgciBzdW1tYXJ5KG1kbDMpJGFkai5yLnNxdWFyZWRgLCB3aGljaCBpcyBsaW5lIHdpdGggdGhlIGZhY3RvciB2ZXJzaW9uIG9mIHRoZSBtb2RlbC4gIFNvLCBjb252ZXJ0aW5nIHRvIGEgZmFjdG9yIGRpZG4ndCByZWFsbHkgZ2FpbiB1cyBhbnl0aGluZyBpbiBhY2N1cmFjeSBhbnl3YXksIEFORCB0aGUgZmFjdG9yIGhpZCB0aGUgZmFjdCB0aGF0IG9uZSBvZiBvdXIgb2JzZXJ2YXRpb25zIHdhcyBhbiBvdXRsaWVyLgoKIyBDb25jbHVzaW9uCgpUaGlzIGRhdGEgaXMgY2xlYXJseSBtYWRlIHVwLiAgVGhlIHBvaW50IHRoYXQgSSBhbSB0cnlpbmcgdG8gbWFrZSwgeWV0IGFnYWluLCBpcyB0aGF0IGdyZWF0IGNhcmUgc2hvdWxkIGJlIHRha2VuIHdoZW4gY29udmVydGluZyBudW1lcmljIHZhbHVlcyBpbiBtb2RlbHMgaW50byBmYWN0b3JzIChjYXRlZ29yaWNhbCB2YXJpYWJsZXMpLiAgQXMgc2VlbiBpbiB0aGUgYWJvdmUgZXhhbXBsZSwgaXQgY2FuIGhhdmUgdW5leHBlY3RlZCBzdGF0aXN0aWNhbCByYW1pZmljYXRpb25zIGluY2x1ZGluZyBtYXNraW5nIG91dGxpZXJzIGluIHRoZSBkYXRhLgo=