knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(car)
library(splines)
library(ModelMetrics)
library(plotmo)
After fitting a linear model, it is important to check the model residuals for substantial violations of linearity. In this post, we will check for such violations, compare a linear model against a regression spline model using BIC and test set RMSE, and show how to implement a regression spline model in production.
Suppose annual income increases (with diminishing returns) with age. In addition, annual income increases with years of education but decreases after the twentieth year of education. The following “true model” would simulate this environment.
\[\text{AnnualIncome} = -2\times10^{5} + 10^{5}\log(\text{Age}) + 2\times10^{4}\,\text{YearsEdu} - 500\,\text{YearsEdu}^{2} + \epsilon\]
set.seed(2020)
# Simulate Age (in years) and whole years of education
Age <- abs(rnorm(1000, 40, 10)) + 14
YearsEdu <- as.integer(abs(rnorm(1000, 16, 8)))
# prevent YearsEdu > Age
YearsEdu <- pmin(YearsEdu, Age)
# Generate AnnualIncome from the "true model" plus Gaussian noise
AnnualIncome <- -2e5 + 1e5*log(Age) + 2e4*YearsEdu - 500*YearsEdu^2 + rnorm(1000, 0, 30000)
df <- data.frame(AnnualIncome=AnnualIncome,
Age=Age,
YearsEdu=YearsEdu)
rm(AnnualIncome, Age, YearsEdu)
summary(df)
## AnnualIncome Age YearsEdu
## Min. : 81636 Min. :22.34 Min. : 0.00
## 1st Qu.:326590 1st Qu.:46.87 1st Qu.:11.00
## Median :363294 Median :53.42 Median :15.00
## Mean :355772 Mean :53.73 Mean :15.65
## 3rd Qu.:395709 3rd Qu.:60.39 3rd Qu.:21.00
## Max. :498660 Max. :91.03 Max. :44.79
Thanks to Holly for the train/test split code below.
set.seed(123)
# Add id
df <- df %>% mutate(id = row_number())
# Create training set
train <- df %>% sample_frac(.70)
# Create test set
test <- anti_join(df, train, by = 'id')
Using the train set, visualize the bivariate relationship between each predictor and the target variable. To keep the test set unbiased, we should pretend it does not exist until we are near the end of the model development process.
ggplot(train, aes(x=Age, y=AnnualIncome)) +
geom_point()
ggplot(train, aes(x=YearsEdu, y=AnnualIncome)) +
geom_point()
It is apparent that there is a non-linear relationship between YearsEdu and AnnualIncome. However, the relationship between Age and AnnualIncome appears linear.
Ignoring the non-linear relationship, we could fit a linear model. Again, we will use the train set.
lin_mod <- lm(AnnualIncome ~ Age + YearsEdu, data=train)
summary(lin_mod)
##
## Call:
## lm(formula = AnnualIncome ~ Age + YearsEdu, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -244162 -24634 5349 31132 114601
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 201205.5 10243.0 19.64 <2e-16 ***
## Age 1885.7 175.3 10.76 <2e-16 ***
## YearsEdu 3515.3 238.2 14.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47940 on 697 degrees of freedom
## Multiple R-squared: 0.328, Adjusted R-squared: 0.3261
## F-statistic: 170.1 on 2 and 697 DF, p-value: < 2.2e-16
residualPlots(lin_mod)
## Test stat Pr(>|Test stat|)
## Age -1.8069 0.0712 .
## YearsEdu -32.6175 <2e-16 ***
## Tukey test -15.2505 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The residualPlots function returns the residual plots (by predictor and by fitted values) as well as tests of curvature. Based on the curvature of the blue lines and the hypothesis tests, we reject the null hypothesis of linearity for YearsEdu and for the model overall (the Tukey test). In particular, the relationship between YearsEdu and AnnualIncome appears highly non-linear.
There is some evidence that the relationship between Age and AnnualIncome may also be non-linear, but that evidence is weaker (p ≈ 0.07).
In our example, we know that Age should be transformed with a logarithm and that YearsEdu should be transformed with a quadratic polynomial. In practice, however, the correct transformations are unknown to the data scientist.
An alternative to parametric transformations (e.g., logarithms or polynomials) is to use regression splines to approximate the non-linear relationship. My favorite is the natural cubic spline with 5 degrees of freedom, which places the interior knots at quantiles of each predictor.
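As a quick check (not part of the original analysis), we can inspect the ns() basis attributes to see where the knots land: with 5 degrees of freedom there are 4 interior knots at the 20th, 40th, 60th, and 80th percentiles, plus boundary knots at the minimum and maximum of the training data.
# Interior and boundary knot locations chosen by ns() for Age
attr(ns(train$Age, df = 5), "knots")
attr(ns(train$Age, df = 5), "Boundary.knots")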
Again, we use the train set.
ns_mod <- lm(AnnualIncome ~ ns(Age, 5) + ns(YearsEdu, 5), train)
summary(ns_mod)
##
## Call:
## lm(formula = AnnualIncome ~ ns(Age, 5) + ns(YearsEdu, 5), data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79912 -20004 -631 19323 89445
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116230 13585 8.556 < 2e-16 ***
## ns(Age, 5)1 75892 11289 6.722 3.75e-11 ***
## ns(Age, 5)2 84334 13071 6.452 2.08e-10 ***
## ns(Age, 5)3 86788 9214 9.419 < 2e-16 ***
## ns(Age, 5)4 175852 27091 6.491 1.63e-10 ***
## ns(Age, 5)5 109827 13345 8.230 9.38e-16 ***
## ns(YearsEdu, 5)1 176458 6690 26.375 < 2e-16 ***
## ns(YearsEdu, 5)2 196456 8152 24.099 < 2e-16 ***
## ns(YearsEdu, 5)3 164066 7268 22.575 < 2e-16 ***
## ns(YearsEdu, 5)4 228988 16403 13.960 < 2e-16 ***
## ns(YearsEdu, 5)5 -33384 11609 -2.876 0.00415 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30250 on 689 degrees of freedom
## Multiple R-squared: 0.7356, Adjusted R-squared: 0.7317
## F-statistic: 191.7 on 10 and 689 DF, p-value: < 2.2e-16
BIC(lin_mod)
## [1] 17098.54
BIC(ns_mod)
## [1] 16498.08
BIC penalizes model fit for complexity. Despite being more complex, the regression spline model returns a lower BIC than the linear model, so the regression spline model is preferred.
You may argue that the difference in BIC looks small relative to the size of the BIC values. However, as we will see on the test set, BIC is a conservative criterion with a substantial penalty for complexity, so even modest differences in BIC can correspond to substantially better out-of-sample fits.
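For reference, here is a minimal sketch of how BIC is computed for a Gaussian lm (the manual_bic helper is not from the original post); it should reproduce the values returned by BIC().
# BIC = -2 * logLik + log(n) * k, where k counts all estimated parameters
# (the regression coefficients plus the residual variance)
manual_bic <- function(model) {
  k <- attr(logLik(model), "df")
  -2 * as.numeric(logLik(model)) + log(nobs(model)) * k
}
manual_bic(lin_mod)  # should match BIC(lin_mod)
manual_bic(ns_mod)   # should match BIC(ns_mod)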
residualPlots(ns_mod)
## Warning in residualPlot.default(model, ...): Splines replaced by a fitted linear
## combination
## Warning in residualPlot.default(model, ...): Splines replaced by a fitted linear
## combination
## Test stat Pr(>|Test stat|)
## ns(Age, 5)
## ns(YearsEdu, 5)
## Tukey test -0.5924 0.5536
The blue curves in the residual plots and the Tukey test suggest that there is no additional information that could be extracted from the predictors (through additional transformations).
preds_ns_mod <- predict(ns_mod, newdata=test)
preds_lin_mod <- predict(lin_mod, newdata=test)
rmse(test$AnnualIncome, preds_ns_mod)
## [1] 30686.17
rmse(test$AnnualIncome, preds_lin_mod)
## [1] 52777.1
The prediction error of the regression spline model is substantially lower than that of the linear model.
One of the main concerns with using a regression spline model is the loss of interpretability: the coefficients of a regression spline model have no direct business meaning.
Despite this drawback, plotting the sensitivity of the predictions across the range of each predictor can reveal interpretable behavior.
plotmo(ns_mod, caption="Prediction Sensitivity Plot")
## plotmo grid: Age YearsEdu
## 53.42795 15
AnnualIncome increases with Age. AnnualIncome also increases with YearsEdu up to 20 years; each additional year of education beyond 20 reduces AnnualIncome.
Another concern with regression spline models is extrapolation. How would the predictions behave if the predictors contained outlandishly large or small values?
This could be addressed by running plotmo again but with a wider range of predictor values.
plotmo(ns_mod, caption="Extreme Prediction Sensitivity Plot", extend=3)
## plotmo grid: Age YearsEdu
## 53.42795 15
Natural cubic splines extrapolate linearly outside the training range. Based on the plots above, we may wish to cap and floor the predictor values to prevent nonsensical predictions such as negative incomes.
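As a minimal sketch (the clamp_to_train helper below is not from the original post), one way to cap and floor predictors at the training range before scoring new data:
# Clamp each predictor in new data to the range observed in the training set
clamp_to_train <- function(newdata, traindata, vars) {
  for (v in vars) {
    newdata[[v]] <- pmin(pmax(newdata[[v]], min(traindata[[v]])), max(traindata[[v]]))
  }
  newdata
}
test_capped <- clamp_to_train(test, train, c("Age", "YearsEdu"))
preds_ns_capped <- predict(ns_mod, newdata = test_capped)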
Readers familiar with polynomial regression may argue that it could account for the non-linearity without the cryptic nature of ns().
poly_mod <- lm(AnnualIncome ~ poly(Age, 5, raw=TRUE) + poly(YearsEdu, 5, raw=TRUE), data=train)
summary(poly_mod)
##
## Call:
## lm(formula = AnnualIncome ~ poly(Age, 5, raw = TRUE) + poly(YearsEdu,
## 5, raw = TRUE), data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79780 -20165 -161 19392 88155
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.836e+05 5.258e+05 -0.730 0.466
## poly(Age, 5, raw = TRUE)1 4.549e+04 5.521e+04 0.824 0.410
## poly(Age, 5, raw = TRUE)2 -1.591e+03 2.239e+03 -0.711 0.477
## poly(Age, 5, raw = TRUE)3 2.893e+01 4.398e+01 0.658 0.511
## poly(Age, 5, raw = TRUE)4 -2.611e-01 4.198e-01 -0.622 0.534
## poly(Age, 5, raw = TRUE)5 9.328e-04 1.561e-03 0.598 0.550
## poly(YearsEdu, 5, raw = TRUE)1 2.059e+04 4.068e+03 5.061 5.36e-07 ***
## poly(YearsEdu, 5, raw = TRUE)2 -5.615e+02 6.514e+02 -0.862 0.389
## poly(YearsEdu, 5, raw = TRUE)3 1.312e+00 4.387e+01 0.030 0.976
## poly(YearsEdu, 5, raw = TRUE)4 4.429e-02 1.299e+00 0.034 0.973
## poly(YearsEdu, 5, raw = TRUE)5 -1.223e-03 1.390e-02 -0.088 0.930
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30230 on 689 degrees of freedom
## Multiple R-squared: 0.7359, Adjusted R-squared: 0.732
## F-statistic: 192 on 10 and 689 DF, p-value: < 2.2e-16
The in-sample fit is quite good. Let’s do some model selection checks.
BIC(lin_mod)
## [1] 17098.54
BIC(ns_mod)
## [1] 16498.08
BIC(poly_mod)
## [1] 16497.29
BIC slightly favors the polynomial regression over the natural spline model, although the difference is less than one point.
Let’s check the residuals.
residualPlots(poly_mod)
## Test stat Pr(>|Test stat|)
## poly(Age, 5, raw = TRUE)
## poly(YearsEdu, 5, raw = TRUE)
## Tukey test 0.0223 0.9822
No additional information can be extracted from the predictors.
Let’s check the sensitivities.
plotmo(poly_mod, caption="Polynomial Regression")
## plotmo grid: Age YearsEdu
## 53.42795 15
The relationships appear sensible.
Let’s check the test set performance.
preds_ns_mod <- predict(ns_mod, newdata=test)
preds_lin_mod <- predict(lin_mod, newdata=test)
preds_poly_mod <- predict(poly_mod, newdata=test)
rmse(test$AnnualIncome, preds_ns_mod)
## [1] 30686.17
rmse(test$AnnualIncome, preds_lin_mod)
## [1] 52777.1
rmse(test$AnnualIncome, preds_poly_mod)
## [1] 30772.01
The test set performance of the polynomial regression model and the natural spline model is almost identical (the difference in RMSE is about $86 per year).
Up to this point, the polynomial regression model and the natural spline model appear comparable.
However, the case against polynomial regression emerges when we extrapolate beyond the training range: the predictions become erratic.
plotmo(poly_mod, caption="Polynomial Regression (Extreme Values)", extend=3)
## plotmo grid: Age YearsEdu
## 53.42795 15
Unlike natural cubic splines, which extrapolate linearly, polynomial regression extrapolates in an unpredictable manner. Nonetheless, extreme values are a challenge for all of these models (linear, spline, and polynomial). Appropriate upper and lower bounds on the predictors can reduce some of the impact of extrapolation.
One of the major advantages of polynomial regression is its ease of implementation (most databases can compute polynomials), while a natural cubic spline is more difficult to implement in a database system.
Harrell’s rms package has a utility function that converts a restricted (natural) cubic spline model into a parametric equation (see ?rms::Function.rms). Here is the equivalent model built with rms:
library(rms)
# rms functions such as Predict() and summary() use a datadist object
# describing the predictor distributions
ddist <- datadist(train)
options("datadist" = "ddist")
# Place 6 knots (including the two boundary knots) at equally spaced quantiles,
# matching the knot locations used by ns() with 5 degrees of freedom
knotsAge <- quantile(train$Age, probs = seq.int(from = 0, to = 1, length.out = 4 + 2L))
knotsYearsEdu <- quantile(train$YearsEdu, probs = seq.int(from = 0, to = 1, length.out = 4 + 2L))
rcs_mod <- ols(AnnualIncome ~ rcs(Age, knotsAge) + rcs(YearsEdu, knotsYearsEdu), data = train)
Check the test set performance.
rmse(test$AnnualIncome, preds_ns_mod)
## [1] 30686.17
rmse(test$AnnualIncome, preds_lin_mod)
## [1] 52777.1
rmse(test$AnnualIncome, preds_poly_mod)
## [1] 30772.01
preds_rcs_mod <- predict(rcs_mod, newdata=test)
rmse(test$AnnualIncome, preds_rcs_mod)
## [1] 30686.17
Check the test set predictions.
round(rmse(preds_ns_mod, preds_rcs_mod), 8)
## [1] 0
The test set predictions are identical between rcs_mod and ns_mod. Now we can convert rcs_mod into a parametric equation.
Function(rcs_mod)
## function (Age = 53.427951, YearsEdu = 15)
## {
## 43944.701 + 3235.8933 * Age - 0.74477603 * pmax(Age - 22.33875,
## 0)^3 + 2.9092309 * pmax(Age - 45.833692, 0)^3 + 3.041269 *
## pmax(Age - 51.393723, 0)^3 - 9.171219 * pmax(Age - 56.027236,
## 0)^3 + 4.2174327 * pmax(Age - 62.242424, 0)^3 - 0.25193764 *
## pmax(Age - 86.016317, 0)^3 + 18118.709 * YearsEdu - 29.705732 *
## pmax(YearsEdu, 0)^3 + 64.489825 * pmax(YearsEdu - 9,
## 0)^3 - 25.627078 * pmax(YearsEdu - 13, 0)^3 - 39.862552 *
## pmax(YearsEdu - 17, 0)^3 + 45.124056 * pmax(YearsEdu -
## 22, 0)^3 - 14.418519 * pmax(YearsEdu - 39, 0)^3
## }
## <environment: 0x7ffc03134ba0>
The spline regression is now ready to be implemented in a production system that does not run R.
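Before porting the formula, it may be worth verifying in R that the generated function reproduces the model’s predictions (a quick check, not in the original post):
score_fn <- Function(rcs_mod)
# The generated function is vectorized, so it can score the test set directly
manual_preds <- score_fn(Age = test$Age, YearsEdu = test$YearsEdu)
round(rmse(manual_preds, preds_rcs_mod), 8)  # expect 0 (up to floating-point error)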
The number of knots and the degree of the polynomial were both chosen arbitrarily (degrees of freedom equal to 5 in each case). A more objective approach is to try a grid of values (e.g., degrees of freedom from 1 to 5) and then use BIC or cross-validation RMSE to choose the best setting, as sketched below.
Readers should remember not to tune these parameters on the test set.
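As a minimal sketch (not from the original post), the spline degrees of freedom could be selected with BIC on the training set only:
# Fit the spline model for each candidate degrees of freedom and compare BIC
df_grid <- 1:5
bic_by_df <- sapply(df_grid, function(d) {
  BIC(lm(AnnualIncome ~ ns(Age, d) + ns(YearsEdu, d), data = train))
})
data.frame(df = df_grid, BIC = bic_by_df)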