knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(car)
library(splines)
library(ModelMetrics)
library(plotmo)

Summary

After fitting a linear model, it is important to check the model residuals for substantial violations of linearity.

A linear model can be compared against a regression spline model using BIC or test set RMSE.

A regression spline model can be implemented in production by converting it into a parametric equation.

Simulate Fake Data

Suppose annual income increases with age, with diminishing returns. In addition, annual income increases with years of education but decreases after the twentieth year of education. The following “true model” simulates this environment.

\[AnnualIncome = -2\times 10^5 + 1\times 10^5 \cdot \log(Age) + 2\times 10^4 \cdot YearsEdu - 500 \cdot YearsEdu^2 + \epsilon\]

set.seed(2020)

Age <- abs(rnorm(1000, 40, 10)) + 14
YearsEdu <- as.integer(abs(rnorm(1000, 16, 8)))

# prevent YearsEdu > Age

YearsEdu <- pmin(YearsEdu, Age)

# generate AnnualIncome from the true model plus Gaussian noise
AnnualIncome <- -2e5 + 1e5*log(Age) + 2e4*YearsEdu - 500*YearsEdu^2 + rnorm(1000, 0, 30000)

df <- data.frame(AnnualIncome=AnnualIncome,
                 Age=Age,
                 YearsEdu=YearsEdu)

rm(AnnualIncome, Age, YearsEdu)

summary(df)
##   AnnualIncome         Age           YearsEdu    
##  Min.   : 81636   Min.   :22.34   Min.   : 0.00  
##  1st Qu.:326590   1st Qu.:46.87   1st Qu.:11.00  
##  Median :363294   Median :53.42   Median :15.00  
##  Mean   :355772   Mean   :53.73   Mean   :15.65  
##  3rd Qu.:395709   3rd Qu.:60.39   3rd Qu.:21.00  
##  Max.   :498660   Max.   :91.03   Max.   :44.79

Split the Data Between Train and Test

Thank you, Holly, for your code.

set.seed(123)

# Add id
df <- df %>% mutate(id = row_number())

# Create training set
train <- df %>% sample_frac(.70)

# Create test set
test  <- anti_join(df, train, by = 'id')

Data Visualization

Using the train set, visualize the bivariate relationship between each predictor and the target variable. To keep the test set an unbiased estimate of out-of-sample performance, we should pretend the test set does not exist until we are near the end of the model development process.

ggplot(train, aes(x=Age, y=AnnualIncome)) +
  geom_point()

ggplot(train, aes(x=YearsEdu, y=AnnualIncome)) +
  geom_point()

It is apparent that there is a non-linear relationship between YearsEdu and AnnualIncome. However, the relationship between Age and AnnualIncome appears linear.

Fit a Linear Model

Ignoring the non-linear relationship, we could fit a linear model. Again, we will use the train set.

lin_mod <- lm(AnnualIncome ~ Age + YearsEdu, data=train)

summary(lin_mod)
## 
## Call:
## lm(formula = AnnualIncome ~ Age + YearsEdu, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -244162  -24634    5349   31132  114601 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 201205.5    10243.0   19.64   <2e-16 ***
## Age           1885.7      175.3   10.76   <2e-16 ***
## YearsEdu      3515.3      238.2   14.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47940 on 697 degrees of freedom
## Multiple R-squared:  0.328,  Adjusted R-squared:  0.3261 
## F-statistic: 170.1 on 2 and 697 DF,  p-value: < 2.2e-16

Check Your Residuals

residualPlots(lin_mod)

##            Test stat Pr(>|Test stat|)    
## Age          -1.8069           0.0712 .  
## YearsEdu    -32.6175           <2e-16 ***
## Tukey test  -15.2505           <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The residualPlots function returns both the residual plots (by predictor and by fitted values) and tests of curvature. Based on the curvature of the blue lines and the hypothesis tests, we reject the null hypothesis of linearity for YearsEdu and for the model as a whole (the Tukey test). In particular, the relationship between YearsEdu and AnnualIncome appears highly non-linear.

There is some evidence that the relationship between Age and AnnualIncome may be non-linear, but the evidence is weaker.

Fit a Regression Spline Model

In our example, we know that Age should be transformed with a logarithm and that YearsEdu should be transformed with a quadratic polynomial. However, in practice, the correct transformations are unknowable to the data scientist.

An alternative to parametric transformations (e.g., log or polynomial) is to use regression splines to approximate a non-linear relationship. My favorite is the natural cubic spline with 5 degrees of freedom, which places the interior knots at quantiles of each predictor.
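As a quick check, the knot locations that ns() will use can be read off the attributes of the basis matrix: with 5 degrees of freedom there are four interior knots at quantiles of the predictor plus two boundary knots at its extremes.

# inspect the knots that ns(Age, 5) places (4 interior knots at quantiles,
# plus boundary knots at the minimum and maximum of Age)
attr(ns(train$Age, df = 5), "knots")
attr(ns(train$Age, df = 5), "Boundary.knots")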

Again, we use the train set.

ns_mod <- lm(AnnualIncome ~ ns(Age, 5) + ns(YearsEdu, 5), data=train)

summary(ns_mod)
## 
## Call:
## lm(formula = AnnualIncome ~ ns(Age, 5) + ns(YearsEdu, 5), data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -79912 -20004   -631  19323  89445 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        116230      13585   8.556  < 2e-16 ***
## ns(Age, 5)1         75892      11289   6.722 3.75e-11 ***
## ns(Age, 5)2         84334      13071   6.452 2.08e-10 ***
## ns(Age, 5)3         86788       9214   9.419  < 2e-16 ***
## ns(Age, 5)4        175852      27091   6.491 1.63e-10 ***
## ns(Age, 5)5        109827      13345   8.230 9.38e-16 ***
## ns(YearsEdu, 5)1   176458       6690  26.375  < 2e-16 ***
## ns(YearsEdu, 5)2   196456       8152  24.099  < 2e-16 ***
## ns(YearsEdu, 5)3   164066       7268  22.575  < 2e-16 ***
## ns(YearsEdu, 5)4   228988      16403  13.960  < 2e-16 ***
## ns(YearsEdu, 5)5   -33384      11609  -2.876  0.00415 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30250 on 689 degrees of freedom
## Multiple R-squared:  0.7356, Adjusted R-squared:  0.7317 
## F-statistic: 191.7 on 10 and 689 DF,  p-value: < 2.2e-16

Compare Linear Model with Regression Spline Model

BIC(lin_mod)
## [1] 17098.54
BIC(ns_mod)
## [1] 16498.08

BIC penalizes a model's in-sample fit for complexity, so models with lower BIC are preferred. Despite being more complex, the regression spline model returns a lower BIC than the linear model. Hence, the regression spline model is preferred over the linear model.

You may argue that the difference in BIC looks small relative to the magnitude of the values. However, BIC applies a substantial penalty against complexity, making it a conservative criterion; as we will see on the test set, even seemingly small differences in BIC can correspond to substantially better out-of-sample fits.
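If you want to see where the penalty comes from, BIC can also be computed by hand from the log-likelihood; the values below should match the BIC() calls above.

# BIC = -2 * log-likelihood + log(n) * (number of estimated parameters)
ll_lin <- logLik(lin_mod)
ll_ns  <- logLik(ns_mod)

-2 * as.numeric(ll_lin) + log(nobs(lin_mod)) * attr(ll_lin, "df")
-2 * as.numeric(ll_ns)  + log(nobs(ns_mod))  * attr(ll_ns, "df")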

Check Your Residuals Again

residualPlots(ns_mod)
## Warning in residualPlot.default(model, ...): Splines replaced by a fitted linear
## combination

## Warning in residualPlot.default(model, ...): Splines replaced by a fitted linear
## combination

##                 Test stat Pr(>|Test stat|)
## ns(Age, 5)                                
## ns(YearsEdu, 5)                           
## Tukey test        -0.5924           0.5536

The blue curves in the residual plots and the Tukey test suggest that there is no additional information that could be extracted from the predictors (through additional transformations).

Compare Test Set Performance

preds_ns_mod <- predict(ns_mod, newdata=test)
preds_lin_mod <- predict(lin_mod, newdata=test)

rmse(test$AnnualIncome, preds_ns_mod)
## [1] 30686.17
rmse(test$AnnualIncome, preds_lin_mod)
## [1] 52777.1

The test set prediction error of the regression spline model is substantially lower than that of the linear model.

Interpreting the Regression Spline Model

One of the main concerns with using a regression spline model is the loss of interpretability. The coefficients in a regression spline model have no business meaning.

Despite this drawback, plotting the sensitivity of the predictions across the range of each predictor can reveal interpretable behavior.

plotmo(ns_mod, caption="Prediction Sensitivity Plot")
##  plotmo grid:    Age YearsEdu
##             53.42795       15

AnnualIncome increases with Age. AnnualIncome also increases with YearsEdu up to about 20 years. Beyond 20 years, each additional year of education reduces AnnualIncome.

Extrapolation with Regression Splines

Another concern with regression spline models is extrapolation. How would the predictions behave if the predictors contained outlandishly large or small values?

We can investigate this by running plotmo again with a wider range of predictor values.

plotmo(ns_mod, caption="Extreme Prediction Sensitivity Plot", extend=3)
##  plotmo grid:    Age YearsEdu
##             53.42795       15

Natural cubic splines extrapolate linearly outside the training range. Based on the plots above, we may wish to cap and floor the predictor values to prevent nonsense predictions such as negative incomes, as sketched below.
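Here is a minimal sketch of that idea; clamp() is just an ad-hoc helper (not from any of the loaded packages) that restricts incoming predictor values to the range observed in the training data before predicting.

# ad-hoc helper: restrict x to the interval [lo, hi]
clamp <- function(x, lo, hi) pmin(pmax(x, lo), hi)

# cap and floor incoming predictors at the training range before predicting
new_data <- test
new_data$Age      <- clamp(new_data$Age,      min(train$Age),      max(train$Age))
new_data$YearsEdu <- clamp(new_data$YearsEdu, min(train$YearsEdu), max(train$YearsEdu))

preds_capped <- predict(ns_mod, newdata = new_data)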

The Case Against Polynomial Regression

Folks familiar with polynomial regression may argue that it would account for the non-linearity without the cryptic nature of ns().

poly_mod <- lm(AnnualIncome ~ poly(Age, 5, raw=TRUE) + poly(YearsEdu, 5, raw=TRUE), data=train)

summary(poly_mod)
## 
## Call:
## lm(formula = AnnualIncome ~ poly(Age, 5, raw = TRUE) + poly(YearsEdu, 
##     5, raw = TRUE), data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -79780 -20165   -161  19392  88155 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -3.836e+05  5.258e+05  -0.730    0.466    
## poly(Age, 5, raw = TRUE)1       4.549e+04  5.521e+04   0.824    0.410    
## poly(Age, 5, raw = TRUE)2      -1.591e+03  2.239e+03  -0.711    0.477    
## poly(Age, 5, raw = TRUE)3       2.893e+01  4.398e+01   0.658    0.511    
## poly(Age, 5, raw = TRUE)4      -2.611e-01  4.198e-01  -0.622    0.534    
## poly(Age, 5, raw = TRUE)5       9.328e-04  1.561e-03   0.598    0.550    
## poly(YearsEdu, 5, raw = TRUE)1  2.059e+04  4.068e+03   5.061 5.36e-07 ***
## poly(YearsEdu, 5, raw = TRUE)2 -5.615e+02  6.514e+02  -0.862    0.389    
## poly(YearsEdu, 5, raw = TRUE)3  1.312e+00  4.387e+01   0.030    0.976    
## poly(YearsEdu, 5, raw = TRUE)4  4.429e-02  1.299e+00   0.034    0.973    
## poly(YearsEdu, 5, raw = TRUE)5 -1.223e-03  1.390e-02  -0.088    0.930    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30230 on 689 degrees of freedom
## Multiple R-squared:  0.7359, Adjusted R-squared:  0.732 
## F-statistic:   192 on 10 and 689 DF,  p-value: < 2.2e-16

The in-sample fit is quite good. Let’s do some model selection checks.

BIC(lin_mod)
## [1] 17098.54
BIC(ns_mod)
## [1] 16498.08
BIC(poly_mod)
## [1] 16497.29

BIC suggests the polynomial regression is marginally better than the natural spline model, although the difference is less than one point.

Let’s check the residuals.

residualPlots(poly_mod)

##                               Test stat Pr(>|Test stat|)
## poly(Age, 5, raw = TRUE)                                
## poly(YearsEdu, 5, raw = TRUE)                           
## Tukey test                       0.0223           0.9822

No additional information can be extracted from the predictors.

Let’s check the sensitivities.

plotmo(poly_mod, caption="Polynomial Regression")
##  plotmo grid:    Age YearsEdu
##             53.42795       15

The relationships appear sensible.

Let’s check the test set performance.

preds_ns_mod <- predict(ns_mod, newdata=test)
preds_lin_mod <- predict(lin_mod, newdata=test)
preds_poly_mod <- predict(poly_mod, newdata=test)

rmse(test$AnnualIncome, preds_ns_mod)
## [1] 30686.17
rmse(test$AnnualIncome, preds_lin_mod)
## [1] 52777.1
rmse(test$AnnualIncome, preds_poly_mod)
## [1] 30772.01

The test set performance of the polynomial regression model and the natural spline model is almost the same (the RMSEs differ by about $86 of annual income).

Up to this point, the polynomial regression model and the natural spline model appear comparable.

However, the case against polynomial regression emerges when we extrapolate beyond the training range: the predictions become chaotic.

plotmo(poly_mod, caption="Polynomial Regression (Extreme Values)", extend=3)
##  plotmo grid:    Age YearsEdu
##             53.42795       15

Unlike natural cubic splines, which extrapolate linearly, polynomial regression extrapolates in an erratic manner. Nonetheless, extreme values are a challenge for all of these models (linear, spline, and polynomial). Appropriate upper and lower bounds on the predictors could reduce some of the impact of extrapolation.

Implementing a Regression Spline Model

One of the major advantages of polynomial regression is its ease of implementation (i.e., most databases can calculate polynomials), while a natural cubic spline is more difficult to implement in a database system.
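To illustrate the point, a raw polynomial prediction is just the dot product of the coefficient vector with powers of the predictors, which almost any system can compute. The sketch below (with ad-hoc variable names) should reproduce predict() for a single test row.

b <- coef(poly_mod)
age1 <- test$Age[1]
edu1 <- test$YearsEdu[1]

# manual prediction: intercept + powers of Age + powers of YearsEdu
sum(b * c(1, age1^(1:5), edu1^(1:5)))

# should match the model's own prediction for the same row
predict(poly_mod, newdata = test[1, ])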

Harrell’s rms package has a utility function that converts a natural cubic spline model into a parametric equation (?rms::Function.rms). Here’s the equivalent model built with rms:

library(rms)

ddist <- datadist(train)
options("datadist" = "ddist")

# 6 knots per predictor (4 interior + 2 boundary), matching ns(x, 5)
knotsAge <- quantile(train$Age, probs=seq.int(from = 0, to = 1,
                                              length.out = 4 + 2L))

knotsYearsEdu <- quantile(train$YearsEdu, probs=seq.int(from = 0, to = 1,
                                                        length.out = 4 + 2L))

rcs_mod <- ols(AnnualIncome ~ rcs(Age, knotsAge) + rcs(YearsEdu, knotsYearsEdu), data=train)

Check the test set performance.

rmse(test$AnnualIncome, preds_ns_mod)
## [1] 30686.17
rmse(test$AnnualIncome, preds_lin_mod)
## [1] 52777.1
rmse(test$AnnualIncome, preds_poly_mod)
## [1] 30772.01
preds_rcs_mod <- predict(rcs_mod, newdata=test)

rmse(test$AnnualIncome, preds_rcs_mod)
## [1] 30686.17

Check the test set predictions.

round(rmse(preds_ns_mod, preds_rcs_mod), 8)
## [1] 0

The test set predictions are identical between rcs_mod and ns_mod. Now we can convert rcs_mod into a parametric equation.

Function(rcs_mod)
## function (Age = 53.427951, YearsEdu = 15) 
## {
##     43944.701 + 3235.8933 * Age - 0.74477603 * pmax(Age - 22.33875, 
##         0)^3 + 2.9092309 * pmax(Age - 45.833692, 0)^3 + 3.041269 * 
##         pmax(Age - 51.393723, 0)^3 - 9.171219 * pmax(Age - 56.027236, 
##         0)^3 + 4.2174327 * pmax(Age - 62.242424, 0)^3 - 0.25193764 * 
##         pmax(Age - 86.016317, 0)^3 + 18118.709 * YearsEdu - 29.705732 * 
##         pmax(YearsEdu, 0)^3 + 64.489825 * pmax(YearsEdu - 9, 
##         0)^3 - 25.627078 * pmax(YearsEdu - 13, 0)^3 - 39.862552 * 
##         pmax(YearsEdu - 17, 0)^3 + 45.124056 * pmax(YearsEdu - 
##         22, 0)^3 - 14.418519 * pmax(YearsEdu - 39, 0)^3
## }
## <environment: 0x7ffc03134ba0>

The spline regression is now ready to be implemented in a production system (one that does not run R).
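As a final sanity check before shipping, we can evaluate the exported equation in R and confirm that it reproduces the model's test set predictions; the maximum absolute difference should be essentially zero (f_rcs is just an ad-hoc name for the generated function).

# evaluate the exported parametric equation and compare against predict()
f_rcs <- Function(rcs_mod)
max(abs(f_rcs(Age = test$Age, YearsEdu = test$YearsEdu) - preds_rcs_mod))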

Other Thoughts

The number of knots and the degree of the polynomial were both chosen arbitrarily (degrees of freedom equal to 5). A more objective approach is to try a grid of values (e.g., degrees of freedom from 1 to 5) and use BIC or cross-validated RMSE to choose the best setting, as sketched below.
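A small grid search over the spline degrees of freedom might look like the following sketch; the 1-to-5 grid is an arbitrary illustrative choice, and BIC is used as the selection criterion (cross-validated RMSE would work similarly).

# fit ns() models over a grid of degrees of freedom (one value per predictor)
# and keep the combination with the lowest BIC
df_grid <- expand.grid(df_age = 1:5, df_edu = 1:5)

df_grid$bic <- apply(df_grid, 1, function(p) {
  fit <- lm(AnnualIncome ~ ns(Age, df = p["df_age"]) + ns(YearsEdu, df = p["df_edu"]),
            data = train)
  BIC(fit)
})

df_grid[which.min(df_grid$bic), ]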

Remember not to tune these parameters on the test set.