Non-Linear Modeling Approaches - Polynomial Regression, Step Functions, Splines and GAMs

This is an analysis of non-linear modeling approaches following chapter 7 of An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani. We build polynomial regression models, then apply step functions and splines, and finally compare these approaches with generalized additive models (GAMs).

Using the Wage data from the ISLR package, we build several non-linear models following the fitting steps described in the book.

library(ISLR)
attach(Wage)

Examine the data

head(Wage)
##        year age           maritl     race       education             region
## 231655 2006  18 1. Never Married 1. White    1. < HS Grad 2. Middle Atlantic
## 86582  2004  24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
## 161300 2003  45       2. Married 1. White 3. Some College 2. Middle Atlantic
## 155159 2003  43       2. Married 3. Asian 4. College Grad 2. Middle Atlantic
## 11443  2005  50      4. Divorced 1. White      2. HS Grad 2. Middle Atlantic
## 376662 2008  54       2. Married 1. White 4. College Grad 2. Middle Atlantic
##              jobclass         health health_ins  logwage      wage
## 231655  1. Industrial      1. <=Good      2. No 4.318063  75.04315
## 86582  2. Information 2. >=Very Good      2. No 4.255273  70.47602
## 161300  1. Industrial      1. <=Good     1. Yes 4.875061 130.98218
## 155159 2. Information 2. >=Very Good     1. Yes 5.041393 154.68529
## 11443  2. Information      1. <=Good     1. Yes 4.318063  75.04315
## 376662 2. Information 2. >=Very Good     1. Yes 4.845098 127.11574
str(Wage)
## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...

We see that we have 3,000 observations on 11 variables, including several qualitative variables stored as factors with between 2 and 9 levels.

Polynomial Regression and Step Functions

Polynomial regression extends linear regression to situations in which the relationship between response and predictor is non-linear. The standard linear model

\[y_i=\beta_0+\beta_1x_i+\epsilon_i\] is replaced with a polynomial function \[y_i=\beta_0+\beta_1x_i+\beta_2x_i^2+\beta_3x_i^3+...+\beta_dx_i^d+\epsilon_i\] where \(\epsilon_i\) is the error and \(d\) is the degree of the polynomial. For a large enough \(d\) polynomial regression fits an extremely non-linear curve.

To begin, we use the lm() function to fit a fourth-degree polynomial in the predictor age and call the model pr_fit.

pr_fit <- lm(wage ~ poly(age, 4), data = Wage)
coef(summary(pr_fit))
##                 Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)    111.70361  0.7287409 153.283015 0.000000e+00
## poly(age, 4)1  447.06785 39.9147851  11.200558 1.484604e-28
## poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
## poly(age, 4)3  125.52169 39.9147851   3.144742 1.678622e-03
## poly(age, 4)4  -77.91118 39.9147851  -1.951938 5.103865e-02

The poly() function shortens the process of coding the powers of age. It returns a matrix whose columns form a basis of orthogonal polynomials, so each column is a linear combination of the variables age, age^2, age^3 and age^4.

If we add the argument raw = TRUE, we get the age, age^2, age^3 and age^4 variables directly

pr_fit2 <- lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)
coef(summary(pr_fit2))
##                                Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept)               -1.841542e+02 6.004038e+01 -3.067172 0.0021802539
## poly(age, 4, raw = TRUE)1  2.124552e+01 5.886748e+00  3.609042 0.0003123618
## poly(age, 4, raw = TRUE)2 -5.638593e-01 2.061083e-01 -2.735743 0.0062606446
## poly(age, 4, raw = TRUE)3  6.810688e-03 3.065931e-03  2.221409 0.0263977518
## poly(age, 4, raw = TRUE)4 -3.203830e-05 1.641359e-05 -1.951938 0.0510386498
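Although the coefficient estimates differ between the orthogonal and raw parameterizations, the fitted values are identical. A quick check (the object names below are illustrative; this comparison is not part of the original output):

preds_orth <- predict(pr_fit, newdata = list(age = 18:80))
preds_raw <- predict(pr_fit2, newdata = list(age = 18:80))
max(abs(preds_orth - preds_raw))   # effectively zero, up to numerical error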

We can also fit the model using the wrapper function I() to protect terms like age^2, as shown below.

pr_fit3 <- lm(wage ~ age + I(age^2) + I(age ^3) + I(age^4), data = Wage)
coef(summary(pr_fit3))
##                  Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept) -1.841542e+02 6.004038e+01 -3.067172 0.0021802539
## age          2.124552e+01 5.886748e+00  3.609042 0.0003123618
## I(age^2)    -5.638593e-01 2.061083e-01 -2.735743 0.0062606446
## I(age^3)     6.810688e-03 3.065931e-03  2.221409 0.0263977518
## I(age^4)    -3.203830e-05 1.641359e-05 -1.951938 0.0510386498

We can also do the same thing more compactly using the cbind() function to build a matrix from the columns of powers of age.

pr_fit4 <- lm(wage ~ cbind(age, age^2, age^3, age^4), data = Wage)
coef(summary(pr_fit4))
##                                         Estimate   Std. Error   t value
## (Intercept)                        -1.841542e+02 6.004038e+01 -3.067172
## cbind(age, age^2, age^3, age^4)age  2.124552e+01 5.886748e+00  3.609042
## cbind(age, age^2, age^3, age^4)    -5.638593e-01 2.061083e-01 -2.735743
## cbind(age, age^2, age^3, age^4)     6.810688e-03 3.065931e-03  2.221409
## cbind(age, age^2, age^3, age^4)    -3.203830e-05 1.641359e-05 -1.951938
##                                        Pr(>|t|)
## (Intercept)                        0.0021802539
## cbind(age, age^2, age^3, age^4)age 0.0003123618
## cbind(age, age^2, age^3, age^4)    0.0062606446
## cbind(age, age^2, age^3, age^4)    0.0263977518
## cbind(age, age^2, age^3, age^4)    0.0510386498

We next create a grid of age values at which we want predictions, call the predict() function, and specify that we also want the standard errors.

agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2])
preds <- predict(pr_fit, newdata = list(age = age.grid), se = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)

We can plot the data and add the fit from the degree-4 polynomial using the following

par(mfrow = c(1, 2), mar = c(4.5, 4.5, 1, 1), oma = c(0, 0, 4, 0))
plot(age, wage, xlim = agelims, cex = 0.5, col = "darkgrey")
title("Degree-4 Polynomial", outer = T)
lines(age.grid, preds$fit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)

We decide on the degree of the polynomial using hypothesis tests. To illustrate, we fit models ranging from linear to a fifth-degree polynomial to explore the relationship between age and wage. We perform the comparison using the anova() function, which carries out an analysis of variance (ANOVA, using an F-test) to test the null hypothesis that a model \(M_1\) is sufficient to explain the data against the alternative hypothesis that a more complex model \(M_2\) is needed.

To use the anova() function, \(M_1\) and \(M_2\) must be nested models; that is, the predictors in \(M_1\) must be a subset of those in \(M_2\). Here, we fit five models and sequentially compare each simpler model to the next more complex one.

fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 ** 
## 4   2995 4771604  1      6070   3.8098  0.051046 .  
## 5   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The very low p-values comparing model 1 to model 2 and model 2 to model 3 indicate that the linear and quadratic fits are not sufficient. The p-value comparing the degree-3 model to the degree-4 model is approximately 0.05, so the quartic term is borderline, while the p-value of 0.37 comparing the degree-4 to the degree-5 polynomial indicates that the fifth-degree term is unnecessary. A cubic or quartic polynomial therefore appears to provide a reasonable fit; higher degrees are likely to be overly complex and prone to overfitting.
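Because poly() creates orthogonal polynomials, we could also have obtained these p-values more succinctly from the coefficient table of the degree-5 fit; the square of each t-statistic there equals the corresponding F-statistic in the ANOVA table. A quick check, not shown in the original output:

coef(summary(fit.5))
(-11.983)^2   # roughly 143.6, matching the F value for the quadratic comparison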

The ANOVA method also works when other terms are included in the models.

fit.1 <- lm(wage ~ education + age, data = Wage)
fit.2 <- lm(wage ~ education + poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ education + poly(age, 3), data = Wage)
anova(fit.1, fit.2, fit.3)
## Analysis of Variance Table
## 
## Model 1: wage ~ education + age
## Model 2: wage ~ education + poly(age, 2)
## Model 3: wage ~ education + poly(age, 3)
##   Res.Df     RSS Df Sum of Sq        F Pr(>F)    
## 1   2994 3867992                                 
## 2   2993 3725395  1    142597 114.6969 <2e-16 ***
## 3   2992 3719809  1      5587   4.4936 0.0341 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We could also use k-fold cross-validation to choose the best polynomial degree, as sketched below.
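A minimal sketch of that approach, assuming the boot package's cv.glm() for 10-fold cross-validation (the seed, loop, and object names are illustrative and were not run in the original analysis):

library(boot)
set.seed(1)
cv.errors <- rep(NA, 5)
for (d in 1:5) {
  glm.fit <- glm(wage ~ poly(age, d), data = Wage)   # glm() with the default gaussian family is equivalent to lm()
  cv.errors[d] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
cv.errors            # estimated test MSE for degrees 1 through 5
which.min(cv.errors) # degree with the lowest estimated test error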

How do we predict whether an individual earns more than $250,000 per year?

We use the glm() function with family = “binomial” to fit a polynomial logistic regression model and make predictions using the predict() function.

fit <- glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = binomial)
preds <- predict(fit, newdata = list(age = age.grid), se = T)

Calculating confidence intervals for these predictions requires some care. The default prediction type for a glm() model is type = “link”, which returns predictions for the logit of the form

\[log \Big( \frac{Pr(Y = 1 \vert X)}{1 - Pr(Y = 1 \vert X)} \Big) = X \beta\] The standard errors returned are also on this logit scale. To obtain confidence intervals for \(Pr(Y = 1\vert X)\) we apply the transformation \[Pr(Y = 1\vert X)=\frac{exp(X\beta)}{1+exp(X\beta)}\]

pfit <- exp(preds$fit) / (1 + exp(preds$fit))
se.bands.logit <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
se.bands <- exp(se.bands.logit) / (1 + exp(se.bands.logit))
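Alternatively, specifying type = “response” in predict() returns the probabilities directly; however, confidence intervals built from standard errors on that scale can fall outside \([0, 1]\), so the logit-scale transformation above is preferred. For illustration only:

preds_resp <- predict(fit, newdata = list(age = age.grid), type = "response", se = T)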

We can plot the results using

plot(age, I(wage > 250), xlim = agelims, type = "n", ylim = c(0, 0.2))
points(jitter(age), I((wage > 250) / 5), cex = 0.5, pch = "|", col = "darkgrey")
lines(age.grid, pfit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)

Step Functions

Using polynomial functions of the features as predictors in a linear model imposes a global structure on the non-linear function of \(X\). Step functions avoid imposing such a global structure.

The idea is to break the range of \(X\) into bins and fit a different constant in each bin, which converts the continuous variable \(X\) into an ordered categorical variable.

In greater detail, we create cutpoints \(c_1, c_2,...,c_K\) in the range of \(X\) and then build \(K+1\) new variables

\[\begin{aligned} C_0(X) &= I(X<c_1)\\ C_1(X) &= I(c_1\leq X<c_2)\\ C_2(X) &= I(c_2\leq X<c_3)\\ \vdots \\ C_{K-1}(X) &= I(c_{K-1}\leq X < c_K)\\ C_K(X) &= I(c_K\leq X)\\ \end{aligned}\]

where \(I(\cdot)\) is an indicator function that returns \(1\) if true and \(0\) otherwise. These are dummy variables. For any value of \(X\), \(C_0(X)+C_1(X)+...+C_K(X)=1\) because \(X\) must lie in exactly one of the \(K+1\) intervals. We can then use least squares to fit a linear model with \(C_1(X),C_2(X),...,C_K(X)\) as predictors:

\[y_i=\beta_0+\beta_1C_1(x_i)+\beta_2C_2(x_i)+...+\beta_KC_K(x_i) + \epsilon_i\]

For a given value of \(X\), at most one of \(C_1, C_2,...,C_K\) can be non-zero; when \(X<c_1\), all of them are zero and the prediction is simply \(\beta_0\).

In order to fit a step function we use the cut() function

table(cut(age, 4))
## 
## (17.9,33.5]   (33.5,49]   (49,64.5] (64.5,80.1] 
##         750        1399         779          72
fit <- lm(wage ~ cut(age, 4), data = Wage)
coef(summary(fit))
##                         Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)            94.158392   1.476069 63.789970 0.000000e+00
## cut(age, 4)(33.5,49]   24.053491   1.829431 13.148074 1.982315e-38
## cut(age, 4)(49,64.5]   23.664559   2.067958 11.443444 1.040750e-29
## cut(age, 4)(64.5,80.1]  7.640592   4.987424  1.531972 1.256350e-01

We see that cut() automatically picked the cutpoints at \(33.5, 49\), and \(64.5\) years of age.
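We could also specify our own cutpoints directly via the breaks argument to cut(). A small sketch (the cutpoints below are arbitrary and this model is not part of the original analysis):

table(cut(age, breaks = c(17, 25, 40, 60, 81)))
fit_breaks <- lm(wage ~ cut(age, breaks = c(17, 25, 40, 60, 81)), data = Wage)
coef(summary(fit_breaks))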

Splines

Polynomial regression models are a special case of the basis function approach. We want to apply a family of functions to \(X\) to fit \[y_i=\beta_0+\beta_1b_1(x_i)+\beta_2b_2(x_i)+\beta_3b_3(x_i)+...+\beta_Kb_K(x_i)+\epsilon_i\] where the basis functions \(b_1(\cdot), b_2(\cdot), ..., b_K(\cdot)\) are fixed and known. Then we can use least squares to estimate the unknown regression coefficients.

If we extend this idea we can fit piecewise polynomials over different ranges of \(X\). A piecewise cubic polynomial, for example, is found by fitting a cubic regression model of the form \[y_i=\beta_0+\beta_1x_i+\beta_2x_i^2+\beta_3x_i^3+\epsilon_i\] where the coefficients differ in different parts of the range of \(X\). The points where the coefficients change are called knots.

A piecewise cubic with no knots is just a standard cubic polynomial but if there is a knot at a point \(c\) then the form becomes

\[ \begin{aligned} y_i = \begin{cases} \beta_{01}+\beta_{11}x_i+\beta_{21}x_i^2+\beta_{31}x_i^3+\epsilon_i\text{ if }x_i<c;\\ \beta_{02}+\beta_{12}x_i+\beta_{22}x_i^2+\beta_{32}x_i^3+\epsilon_i\text{ if }x_i\geq c.\\ \end{cases} \end{aligned}\]

This fits two polynomials to the data, one for observations below \(c\) and the other for observations at or above \(c\). Each can be fit using least squares. Using more knots leads to more flexibility. The general definition of a degree-d spline is that it is a piecewise degree-d polynomial with continuity in derivatives up to degree \(d-1\) at each knot.

In order to fit regression splines in R we use the splines library. The bs() function generates the matrix of basis functions for splines with the specified set of knots.

library(splines)
fit <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
pred <- predict(fit, newdata = list(age = age.grid), se = T)
plot(age, wage, col = "gray")
lines(age.grid, pred$fit, lwd = 2)
lines(age.grid, pred$fit + 2 * pred$se, lty = "dashed")
lines(age.grid, pred$fit - 2 * pred$se, lty = "dashed")

This generates a spline with six basis functions given pre-specified knots at \(25, 40\), and \(60\).

We can also use the df argument to bs() to produce a spline with knots at uniform quantiles of the data.

dim(bs(age, knots = c(25, 40, 60)))
## [1] 3000    6
dim(bs(age, df = 6))
## [1] 3000    6
attr(bs(age, df = 6), "knots")
##   25%   50%   75% 
## 33.75 42.00 51.00

In this case, knots were chosen at ages \(33.8, 42.0,\) and \(51\) corresponding to the \(25\)th, \(50\)th and \(75\)th percentiles of age.
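The bs() function also has a degree argument (the default of 3 gives cubic splines), so splines of other degrees can be requested. A small sketch, not fit in the original analysis:

dim(bs(age, df = 6, degree = 2))   # quadratic spline basis with 6 degrees of freedom
fit_quad <- lm(wage ~ bs(age, df = 6, degree = 2), data = Wage)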

To fit a natural spline we can use the ns() function

fit2 <- lm(wage ~ ns(age, df = 4), data = Wage)
pred2 <- predict(fit2, newdata = list(age = age.grid), se = T)
#lines(age.grid, pred2$fit, col = "red", lwd = 2)  # would overlay the natural-spline fit on an existing plot

# Next we fit smoothing splines to the age and wage data
plot(age, wage, xlim = agelims, cex = 0.5, col = "darkgrey")
title("Smoothing Spline")
fit <- smooth.spline(age, wage, df = 16)
fit2 <- smooth.spline(age, wage, cv = TRUE)
## Warning in smooth.spline(age, wage, cv = TRUE): cross-validation with non-unique
## 'x' values seems doubtful
fit2$df
## [1] 6.794596
lines(fit, col = "red", lwd = 2)
lines(fit2, col = "blue", lwd = 2)
legend("topright", legend = c("16 DF", "6.8 DF"),
       col = c("red", "blue"), lty = 1, cex = 0.8)

In the first call to smooth.spline() we specified \(df = 16\). The function then determines the value of \(\lambda\) that leads to \(16\) degrees of freedom. In the second call the smoothness level is selected by cross-validation, resulting in a value of \(\lambda\) that yields \(6.8\) degrees of freedom.

To perform local regression we use the loess() function.

plot(age, wage, xlim = agelims, cex = 0.5, col = "darkgrey")
title("Local Regression")
fit <- loess(wage ~ age, span = 0.2, data = Wage)
fit2 <- loess(wage ~ age, span = 0.5, data = Wage)
lines(age.grid, predict(fit, data.frame(age = age.grid)), 
      col = "red", lwd = 2)
lines(age.grid, predict(fit2, data.frame(age = age.grid)),
      col = "blue", lwd = 2)
legend("topright", legend = c("Span = 0.2", "Span = 0.5"),
       col = c("red", "blue"), lty = 1, lwd = 2, cex = 0.8)

This produces local regression fits using spans of 20% and 50% of the observations: the larger the span, the smoother the fit. The locfit library can also be used for local regression; a rough sketch follows.
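A rough sketch with locfit, assuming its locfit() and lp() functions for specifying the neighborhood fraction (this parallels, rather than reproduces, the loess() fits above):

library(locfit)
fit_lf <- locfit(wage ~ lp(age, nn = 0.5), data = Wage)   # nn = 0.5: roughly 50% of observations per neighborhood
lines(age.grid, predict(fit_lf, newdata = data.frame(age = age.grid)), col = "green", lwd = 2)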

GAMs

We fit a GAM with wage as the response and natural spline functions of year and age, treating education as a qualitative predictor. Since this is just a linear regression model with an appropriate choice of basis functions, we can fit it using lm()

gam1 <- lm(wage ~ ns(year, 4) + ns(age, 5) + education, data = Wage)
coef(summary(gam1))
##                              Estimate Std. Error    t value      Pr(>|t|)
## (Intercept)                 46.949491   4.704182  9.9803733  4.255933e-23
## ns(year, 4)1                 8.624692   3.466060  2.4883275  1.288867e-02
## ns(year, 4)2                 3.762266   2.959187  1.2713851  2.036907e-01
## ns(year, 4)3                 8.126549   4.211446  1.9296340  5.374684e-02
## ns(year, 4)4                 6.806473   2.396969  2.8396166  4.547335e-03
## ns(age, 5)1                 45.170123   4.193491 10.7714834  1.435245e-26
## ns(age, 5)2                 38.449704   5.076174  7.5745439  4.775895e-14
## ns(age, 5)3                 34.239237   4.382517  7.8126879  7.692241e-15
## ns(age, 5)4                 48.677634  10.571609  4.6045623  4.306057e-06
## ns(age, 5)5                  6.557265   8.367147  0.7836919  4.332831e-01
## education2. HS Grad         10.983419   2.430148  4.5196494  6.434935e-06
## education3. Some College    23.472889   2.561672  9.1631133  9.124942e-20
## education4. College Grad    38.313667   2.547137 15.0418574  2.397682e-49
## education5. Advanced Degree 62.553971   2.761309 22.6537345 5.566864e-105

Next we fit the model using smoothing splines. Smoothing splines cannot be expressed in terms of basis functions for lm(), so we need the gam library. The s() function indicates a smoothing spline.

library(gam)
## Loading required package: foreach
## Loaded gam 1.20
gam.m3 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
par(mfrow = c(1, 3))
plot(gam.m3, se = TRUE, col = "blue")

We can also use plot.Gam() on gam1, even though it was fit with lm() rather than gam()

plot.Gam(gam1, se = TRUE, col = "red")

ANOVA helps us decide which of these models is best:

- a GAM that excludes year \((M_1)\),
- a GAM that uses a linear function of year \((M_2)\), or
- a GAM that uses a spline function of year \((M_3)\).

gam.m1 <- gam(wage ~ s(age, 5) + education, data = Wage)
gam.m2 <- gam(wage ~ year + s(age, 5) + education, data = Wage)
anova(gam.m1, gam.m2, gam.m3, test = "F")
## Analysis of Deviance Table
## 
## Model 1: wage ~ s(age, 5) + education
## Model 2: wage ~ year + s(age, 5) + education
## Model 3: wage ~ s(year, 4) + s(age, 5) + education
##   Resid. Df Resid. Dev Df Deviance       F    Pr(>F)    
## 1      2990    3711731                                  
## 2      2989    3693842  1  17889.2 14.4771 0.0001447 ***
## 3      2986    3689770  3   4071.1  1.0982 0.3485661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is strong evidence, indicated by the p-value near 0, that a GAM with a linear function of year is better than a GAM that omits year, and there is no evidence, given the p-value of roughly 0.35, that a non-linear function of year is needed. Therefore \(M_2\) is the preferred model.

The summary() function produces a summary of the GAM fit

summary(gam.m3)
## 
## Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -119.43  -19.70   -3.33   14.17  213.48 
## 
## (Dispersion Parameter for gaussian family taken to be 1235.69)
## 
##     Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3689770 on 2986 degrees of freedom
## AIC: 29887.75 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##              Df  Sum Sq Mean Sq F value    Pr(>F)    
## s(year, 4)    1   27162   27162  21.981 2.877e-06 ***
## s(age, 5)     1  195338  195338 158.081 < 2.2e-16 ***
## education     4 1069726  267432 216.423 < 2.2e-16 ***
## Residuals  2986 3689770    1236                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##             Npar Df Npar F  Pr(F)    
## (Intercept)                          
## s(year, 4)        3  1.086 0.3537    
## s(age, 5)         4 32.380 <2e-16 ***
## education                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-values for year and age in the “Anova for Nonparametric Effects” table correspond to a null hypothesis of a linear relationship against the alternative of a non-linear relationship. The large p-value of \(0.3537\) for year reinforces that a linear function is adequate for this term, while the very small p-value for age indicates that a non-linear term is required for age.

Using the predict() function we can make predictions from GAM objects just as we would from lm() objects.

preds <- predict(gam.m2, newdata = Wage)
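As a quick illustrative check (not in the original write-up), we can compute the training-set mean squared error of the chosen model from these predictions:

mean((Wage$wage - preds)^2)   # training-set MSE for gam.m2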

The lo() function lets us use local regression fits as building blocks in a GAM before calling the gam() function; it can also create interaction terms, as sketched after the example below.

gam.lo <- gam(wage ~ s(year, df = 4) + lo(age, span = 0.7) + education, data = Wage)
plot.Gam(gam.lo, se = TRUE, col = "green")
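Following the same chapter, lo() can also take two predictors, which fits a two-dimensional local regression surface, i.e. an interaction between year and age. Plotting the fitted surface relies on the akima package:

gam.lo.i <- gam(wage ~ lo(year, age, span = 0.5) + education, data = Wage)
# library(akima)   # plot() uses akima to render the two-dimensional lo() surface
# plot(gam.lo.i)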