The unmitigated pursuit of a great model (high \(R^2\) and low \(p\)-values) has led to creative forms of quant trickery. This article exposes three quant crimes: independently sorting the data, including the regression residuals as an explanatory variable, and filtering out observations that do not fit the model.
In order to see the deceptions, let’s simulate two data series that are unrelated to each other.
set.seed(1983)
resp <- rnorm(1000,10,10)     # response: 1,000 draws from N(mean = 10, sd = 10)
explan <- rnorm(1000,20,20)   # explanatory variable: 1,000 draws from N(mean = 20, sd = 20)
df <- data.frame(explan=explan, resp=resp)
A simple regression reveals that there is, as expected, no relationship between \(resp\) and \(explan\).
lm.truth <- lm(resp ~ explan, data=df)
summary(lm.truth)
##
## Call:
## lm(formula = resp ~ explan, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.771 -6.625 -0.341 6.958 29.692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.59229 0.43548 24.323 <2e-16 ***
## explan -0.01436 0.01538 -0.934 0.351
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.89 on 998 degrees of freedom
## Multiple R-squared: 0.0008727, Adjusted R-squared: -0.0001285
## F-statistic: 0.8717 on 1 and 998 DF, p-value: 0.3507
A simple scatter plot also reveals no relationship between the two variables.
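The plot itself is not reproduced here, but a quick version can be drawn with base R graphics (a minimal sketch; the original figure may have been produced differently):
plot(df$explan, df$resp,
     xlab = "explan", ylab = "resp",
     main = "resp vs. explan (no apparent relationship)")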
The modeler may be unsatisfied with the results of \(lm.truth\): the \(R^2\) is very close to 0% and the \(p\)-value is large. Suppose also that the modeler believes the coefficient for \(explan\) should be positive. A deceptive sorting of the data would “remedy” all of these issues.
resp.sort <- sort(resp)       # sort the response on its own
explan.sort <- sort(explan)   # sort the explanatory variable on its own
df.sort <- data.frame(explan.sort=explan.sort, resp.sort=resp.sort)  # re-pair the independently sorted columns
The modeler has deceptively “fixed” all the issues. The \(R^2\) is now close to 100%, the \(p\)-value is close to 0%, and the coefficient is positive.
lm.sort <- lm(resp.sort ~ explan.sort, data=df.sort)
summary(lm.sort)
##
## Call:
## lm(formula = resp.sort ~ explan.sort, data = df.sort)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9885 -0.1519 0.0248 0.2155 2.3106
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7435733 0.0215511 34.5 <2e-16 ***
## explan.sort 0.4855277 0.0007612 637.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4894 on 998 degrees of freedom
## Multiple R-squared: 0.9976, Adjusted R-squared: 0.9976
## F-statistic: 4.069e+05 on 1 and 998 DF, p-value: < 2.2e-16
Independently sorting columns in your data is an atrocious quant crime. The keyword here is “independently”. A single row of data records the facts and values associated with one observation. When each column is sorted on its own, those values lose their connection to the observation they came from.
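One way to see the damage is to ask which original rows supplied the smallest value of each sorted column (a quick check, assuming the simulated data above). They are almost certainly two different rows, so the first row of \(df.sort\) pairs values that never occurred together:
which.min(df$explan)   # original row that supplies explan.sort[1]
which.min(df$resp)     # original row that supplies resp.sort[1]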
The low \(R^2\) in \(lm.truth\) reflects the large errors between the model's predictions and the actual values of \(resp\). The modeler may “control for error” by including the regression residuals as an explanatory variable.
df.resid <- cbind(df, resid=lm.truth$residuals)
Now the model looks amazing.
lm.resid <- lm(resp ~ explan + resid, data=df.resid)
summary(lm.resid)
## Warning in summary.lm(lm.resid): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = resp ~ explan + resid, data = df.resid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.852e-13 -2.240e-16 2.610e-16 6.700e-16 4.516e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.059e+01 2.630e-16 4.027e+16 <2e-16 ***
## explan -1.436e-02 9.291e-18 -1.546e+15 <2e-16 ***
## resid 1.000e+00 1.912e-17 5.230e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.974e-15 on 997 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.369e+33 on 2 and 997 DF, p-value: < 2.2e-16
This model is unusable because the residuals are calculated from the actual values of \(resp\).
\[residual = resp - \widehat{resp}\]
The actual value of \(resp\) is usually unknown at the time the model is used. If the values of \(resp\) were always known, there would be no need for a model to predict \(resp\) in the first place.
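As a concrete illustration, consider a hypothetical new observation (the name new.obs is made up for this sketch). There is no way to supply the resid column without already knowing \(resp\), so the “perfect” model cannot actually be used for prediction:
new.obs <- data.frame(explan = 25)       # resp, and therefore resid, is unknown here
# predict(lm.resid, newdata = new.obs)   # cannot work: the resid column would require resp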
The modeler may also “reduce the noise” in the data by filtering out the observations that do not fall near the regression line. The following code selects the rows of data where the absolute value of the residual is less than 1, shrinking the data set from 1,000 observations to 77.
df.filt <- df.resid[abs(df.resid$resid) < 1 ,]
head(df.filt)
## explan resp resid
## 1 29.3759446 9.829480 -0.34095783
## 7 8.5435229 10.424163 -0.04544049
## 8 8.8757525 9.875803 -0.58902956
## 11 -1.9103426 11.334630 0.71490352
## 20 -0.5136898 9.952783 -0.64688665
## 34 35.6643273 9.334346 -0.74578671
nrow(df.filt)
## [1] 77
The model with the filtered data looks much better. Notice that the coefficients of the new model are essentially the same as the coefficients of the true model, but the \(R^2\) and \(p\)-value have both improved drastically.
lm.filt <- lm(resp ~ explan, data=df.filt)
summary(lm.filt)
##
## Call:
## lm(formula = resp ~ explan, data = df.filt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9247 -0.5355 -0.1025 0.5514 1.0525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.518976 0.087101 120.768 < 2e-16 ***
## explan -0.014237 0.003322 -4.285 5.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5774 on 75 degrees of freedom
## Multiple R-squared: 0.1967, Adjusted R-squared: 0.186
## F-statistic: 18.36 on 1 and 75 DF, p-value: 5.351e-05
Data filtering is a much more subtle quant crime than the first two. There are several legitimate statistical techniques for identifying influential observations, high-leverage points, and outliers, and in some cases those observations should be removed from the data set. Modelers and model validators should proceed with caution whenever data is filtered out of a model.
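For instance, base R offers standard influence diagnostics that can justify removing an observation on statistical grounds rather than on how well its removal flatters the fit (a minimal sketch; the 4/n cutoff for Cook's distance is only a common rule of thumb):
cd <- cooks.distance(lm.truth)    # influence of each observation on the fitted coefficients
lev <- hatvalues(lm.truth)        # leverage of each observation
head(which(cd > 4 / nrow(df)))    # observations flagged by the 4/n rule of thumb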
Modeling is hard work. Be honest.