The unmitigated pursuit of a great model (high \(R^2\) and low \(p\)-values) has led to creative forms of quant trickery. This article exposes three quant crimes: independently sorting the data, including the regression residuals as an explanatory variable, and filtering out observations that do not fit the model.
In order to see the deceptions, let’s simulate two data series that are unrelated to each other.
set.seed(1983)
resp <- rnorm(1000,10,10)     # response: 1,000 draws from N(mean = 10, sd = 10)
explan <- rnorm(1000,20,20)   # explanatory variable: 1,000 draws from N(mean = 20, sd = 20)
df <- data.frame(explan=explan, resp=resp)
A simple regression reveals that there is, as expected, no relationship between \(resp\) and \(explan\).
lm.truth <- lm(resp ~ explan, data=df)
summary(lm.truth)
##
## Call:
## lm(formula = resp ~ explan, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.771 -6.625 -0.341 6.958 29.692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.59229 0.43548 24.323 <2e-16 ***
## explan -0.01436 0.01538 -0.934 0.351
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.89 on 998 degrees of freedom
## Multiple R-squared: 0.0008727, Adjusted R-squared: -0.0001285
## F-statistic: 0.8717 on 1 and 998 DF, p-value: 0.3507
A simple scatter plot also reveals no relationship between the two variables.
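The plot itself is not reproduced here, but a quick version can be drawn with base R graphics (a minimal sketch; the original figure may have been produced differently):
plot(df$explan, df$resp,
     xlab = "explan", ylab = "resp",
     main = "resp vs. explan (no apparent relationship)")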
The modeler may be unsatisfied with the results of \(lm.truth\): the \(R^2\) is very close to 0% and the \(p\)-value is large. Suppose also that the modeler believes the coefficient for \(explan\) should be positive. A deceptive sorting of the data would “remedy” all of these issues.
resp.sort <- sort(resp)       # sort the response on its own
explan.sort <- sort(explan)   # sort the explanatory variable on its own
df.sort <- data.frame(explan.sort=explan.sort, resp.sort=resp.sort)  # re-pair the independently sorted columns
The modeler has deceptively “fixed” all the issues. The \(R^2\) is now close to 100%, the \(p\)-value is close to 0%, and the coefficient is positive.
lm.sort <- lm(resp.sort ~ explan.sort, data=df.sort)
summary(lm.sort)
##
## Call:
## lm(formula = resp.sort ~ explan.sort, data = df.sort)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9885 -0.1519 0.0248 0.2155 2.3106
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7435733 0.0215511 34.5 <2e-16 ***
## explan.sort 0.4855277 0.0007612 637.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4894 on 998 degrees of freedom
## Multiple R-squared: 0.9976, Adjusted R-squared: 0.9976
## F-statistic: 4.069e+05 on 1 and 998 DF, p-value: < 2.2e-16
Independently sorting columns in your data is an atrocious quant crime. The keyword here is “independently”. A single row of data records the facts and values associated with one observation. When each column is sorted on its own, those values lose their connection to the observation they came from.
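One way to see the damage is to ask which original rows supplied the smallest value of each sorted column (a quick check, assuming the simulated data above). They are almost certainly two different rows, so the first row of \(df.sort\) pairs values that never occurred together:
which.min(df$explan)   # original row that supplies explan.sort[1]
which.min(df$resp)     # original row that supplies resp.sort[1]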
The low \(R^2\) in \(lm.truth\) reflects the large errors between the model's predictions and the actual values of \(resp\). The modeler may “control for error” by including the regression residuals as an explanatory variable.
df.resid <- cbind(df, resid=lm.truth$residuals)
Now the model looks amazing.
lm.resid <- lm(resp ~ explan + resid, data=df.resid)
summary(lm.resid)
## Warning in summary.lm(lm.resid): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = resp ~ explan + resid, data = df.resid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.852e-13 -2.240e-16 2.610e-16 6.700e-16 4.516e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.059e+01 2.630e-16 4.027e+16 <2e-16 ***
## explan -1.436e-02 9.291e-18 -1.546e+15 <2e-16 ***
## resid 1.000e+00 1.912e-17 5.230e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.974e-15 on 997 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.369e+33 on 2 and 997 DF, p-value: < 2.2e-16
This model is unusable because the residuals are calculated from the actual values of \(resp\).
\[residual = resp - \widehat{resp}\]
The actual value of \(resp\) is usually unknown at the time the model is used. If the values of \(resp\) were always known, there would be no need for a model to predict \(resp\) in the first place.
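As a concrete illustration, consider a hypothetical new observation (the name new.obs is made up for this sketch). There is no way to supply the resid column without already knowing \(resp\), so the “perfect” model cannot actually be used for prediction:
new.obs <- data.frame(explan = 25)       # resp, and therefore resid, is unknown here
# predict(lm.resid, newdata = new.obs)   # cannot work: the resid column would require resp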
The modeler may also “reduce the noise” in the data by filtering out the observations that do not fall near the regression line. The following code selects the rows of data where the absolute value of the residual is less than 1, shrinking the data set from 1,000 observations to 77.
df.filt <- df.resid[abs(df.resid$resid) < 1 ,]
head(df.filt)
## explan resp resid
## 1 29.3759446 9.829480 -0.34095783
## 7 8.5435229 10.424163 -0.04544049
## 8 8.8757525 9.875803 -0.58902956
## 11 -1.9103426 11.334630 0.71490352
## 20 -0.5136898 9.952783 -0.64688665
## 34 35.6643273 9.334346 -0.74578671
nrow(df.filt)
## [1] 77
The model with the filtered data looks much better. Notice that the coefficients of the new model are essentially the same as the coefficients of the true model, but the \(R^2\) and \(p\)-value have both improved drastically.
lm.filt <- lm(resp ~ explan, data=df.filt)
summary(lm.filt)
##
## Call:
## lm(formula = resp ~ explan, data = df.filt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9247 -0.5355 -0.1025 0.5514 1.0525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.518976 0.087101 120.768 < 2e-16 ***
## explan -0.014237 0.003322 -4.285 5.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5774 on 75 degrees of freedom
## Multiple R-squared: 0.1967, Adjusted R-squared: 0.186
## F-statistic: 18.36 on 1 and 75 DF, p-value: 5.351e-05
Data filtering is a much more subtle quant crime than the first two. There are several legitimate statistical techniques for identifying influential observations, high-leverage points, and outliers, and in some cases those observations should be removed from the data set. Modelers and model validators should proceed with caution whenever data is filtered out of a model.
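For instance, base R offers standard influence diagnostics that can justify removing an observation on statistical grounds rather than on how well its removal flatters the fit (a minimal sketch; the 4/n cutoff for Cook's distance is only a common rule of thumb):
cd <- cooks.distance(lm.truth)    # influence of each observation on the fitted coefficients
lev <- hatvalues(lm.truth)        # leverage of each observation
head(which(cd > 4 / nrow(df)))    # observations flagged by the 4/n rule of thumb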
Modeling is hard work. Be honest.