Often we assume linearity, but sometimes we are confronted with non-linearity. When the linearity assumption does not suffice, we can turn to the following approaches:
Polynomials
Step functions
Splines
Local Regression
Generalized Additive Models
This lab walks through the process of fitting non-linear models using each of these approaches.
First, we load the packages needed and attach the data set we will be using:
library(ISLR2)
library(tidyverse)
library(ggplot2)
attach(Wage)
view(Wage)
head(Wage, 10)
Polynomial regression is useful when a linear fit under-fits the data and the complexity of the model needs to be increased.
It extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power.
fit <- lm(wage ~ poly(age, 4), data = Wage)
coef(summary(fit))

This code fits a linear model in order to predict wage, using a fourth-degree polynomial in age: poly(age, 4). The poly() function lets us raise a variable to any power and avoids writing out each term by hand.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70361 0.7287409 153.283015 0.000000e+00
## poly(age, 4)1 447.06785 39.9147851 11.200558 1.484604e-28
## poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
## poly(age, 4)3 125.52169 39.9147851 3.144742 1.678622e-03
## poly(age, 4)4 -77.91118 39.9147851 -1.951938 5.103865e-02
The function returns a matrix whose columns are a basis of orthogonal polynomials, which essentially means that each column is a linear combination of the variables age, age^2, age^3 and age^4. We can also obtain age, age^2, age^3 and age^4 directly by passing the raw = TRUE argument to poly(); this changes the coefficient estimates but not the fitted values.
fit2 <- lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)

There are several other equivalent ways of fitting this model:

fit2a <- lm(wage ~ age + I(age^2) + I(age^3) + I(age^4),
    data = Wage)
coef(fit2a)
fit2b <- lm(wage ~ cbind(age, age^2, age^3, age^4),
    data = Wage)

The model fit2a uses the function I() and the ^ symbol to spell out the polynomial terms directly; the wrapper I() is needed because ^ has a special meaning in formulas. The model fit2b uses the function cbind() to build a matrix from a collection of vectors; any function call such as cbind() inside a formula also serves as a wrapper.
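As a quick check (an added illustration, not part of the original lab), we can verify that these parameterizations give the same fitted values even though their coefficient estimates differ:

## The orthogonal and raw polynomial bases describe the same model space,
## so the fitted values should agree to machine precision
max(abs(fitted(fit) - fitted(fit2a)))
max(abs(fitted(fit2) - fitted(fit2b)))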
Now we create a grid of values for age at which we want predictions, and then call the predict() function, specifying that we want standard errors as well.
agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2]) ## the specific grid range for age values
preds <- predict(fit, newdata = list(age = age.grid),
se = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit,
    preds$fit - 2 * preds$se.fit)

We can now plot the data and add the fit from the degree-4 polynomial.
par(mfrow = c(1, 2), mar = c(4.5, 4.5, 1, 1),
oma = c(0, 0, 4, 0))
plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Degree-4 Polynomial", outer = T)
lines(age.grid, preds$fit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)

The mar and oma arguments to par() allow us to control the margins of the plot, and the title() function creates a figure title that spans both subplots.
This is a plot of wage against age for the Wage data set, which contains income and demographic information for males who reside in the central Atlantic region of the United States. The plot shows the results of fitting a degree-4 polynomial using least squares.
In this model, the individual coefficients are not of particular
interest. Instead, we look at the entire fitted function across a grid
of 63 values for age from 18 to 80 in order to understand
the relationship between age and wage.
The solid blue curve is a degree-4 polynomial of
wage (in thousands of dollars) as a function of
age, fit by least squares.
The dashed curves indicate an estimated 95% confidence interval: they are drawn at twice the standard error above and below the fit. We use (2×) the standard error because, for normally distributed error terms, this quantity corresponds to an approximate 95% confidence interval.
Computation:
Least squares returns variance estimates for each of the fitted coefficients βˆj, as well as the covariances between pairs of coefficient estimates. Combining these gives the estimated pointwise variance of the fit at a point x0, and the estimated pointwise standard error is the square root of this variance.
This computation is repeated at each reference point x0, and we plot the fitted curve, as well as twice the standard error on either side of the fitted curve.
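As an illustration of this computation (a sketch added here; predict() already does this for us via se.fit), we can reproduce the pointwise standard errors from the coefficient covariance matrix:

## Evaluate the orthogonal polynomial basis at the grid points, plus an intercept column
X0 <- cbind(1, predict(poly(age, 4), age.grid))
## Pointwise variance of the fit: the diagonal of X0 %*% vcov(fit) %*% t(X0)
var.fit <- rowSums((X0 %*% vcov(fit)) * X0)
## The pointwise standard error is its square root; it should match preds$se.fit
max(abs(sqrt(var.fit) - preds$se.fit))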
In performing polynomial regression we must decide on the degree of the polynomial to use. One way to do this is by using hypothesis tests and the anova() function. We fit models ranging from linear to a degree-5 polynomial and seek to determine the simplest model which is sufficient to explain the relationship between wage and age.
Using an ANOVA, we test the null hypothesis that a simpler model is sufficient to explain the data against the alternative hypothesis that a more complex model is required. In order to use the anova() function, the models must be nested: the predictors in one model must be a subset of the predictors in the other model.
In this case, we fit five different models and sequentially compare the simpler model to the more complex model.
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage) ##(age, #) is the polynomial degree
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)

## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value comparing the linear Model 1 to the quadratic Model 2 is essentially zero, indicating that a linear fit is not sufficient. Similarly, the p-value comparing the quadratic Model 2 to the cubic Model 3 is very low (0.0017), so the quadratic fit is also insufficient. The p-value comparing the cubic and degree-4 polynomials, Model 3 and Model 4, is approximately 5%, while the degree-5 polynomial Model 5 seems unnecessary because its p-value is 0.37. Hence, either a cubic or a quartic polynomial (Model 3 or Model 4) appears to provide a reasonable fit to the data, but lower- or higher-order models are not justified.
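As an aside (this follows from poly() producing orthogonal polynomials; it is not needed for the analysis), the same p-values can be read off the coefficient t-statistics of the degree-5 fit:

## Because the polynomials are orthogonal, the square of each t-statistic below
## equals the corresponding F-statistic in the ANOVA table above
coef(summary(fit.5))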
As an alternative to using hypothesis tests and ANOVA, we could choose the polynomial degree using cross-validation. However, the ANOVA method works whether or not we used orthogonal polynomials; it also works when we have other terms in the model as well.
fit.1 <- lm(wage ~ education + age, data = Wage)
fit.2 <- lm(wage ~ education + poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ education + poly(age, 3), data = Wage)
anova(fit.1, fit.2, fit.3)

## Analysis of Variance Table
##
## Model 1: wage ~ education + age
## Model 2: wage ~ education + poly(age, 2)
## Model 3: wage ~ education + poly(age, 3)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2994 3867992
## 2 2993 3725395 1 142597 114.6969 <2e-16 ***
## 3 2992 3719809 1 5587 4.4936 0.0341 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
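Returning to the cross-validation alternative mentioned above, here is a minimal sketch that chooses the polynomial degree by 10-fold cross-validation with cv.glm() from the boot package (the boot library and the choice of 10 folds are assumptions; they are not used elsewhere in this lab):

library(boot)
set.seed(1)
cv.errors <- rep(NA, 5)
for (d in 1:5) {
  ## glm() with its default gaussian family fits the same model as lm(),
  ## but can be passed to cv.glm() for cross-validation
  glm.fit <- glm(wage ~ poly(age, d), data = Wage)
  cv.errors[d] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
cv.errors            ## estimated test MSE for degrees 1 through 5
which.min(cv.errors) ## the degree with the smallest CV error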
Looking at the previous plot, we see what appear to be two distinct populations: a group of high earners making more than $250,000 per year, and a group of low earners. We can treat wage as a binary variable by splitting it into these two groups.
Next we consider the task of predicting whether an individual earns more than $250,000 per year. We first create the appropriate response vector, and then apply the glm() function using family = "binomial" in order to fit a polynomial logistic regression model.
fit <- glm(I(wage > 250) ~ poly(age, 4), data = Wage,
family = binomial)
preds <- predict(fit, newdata = list(age = age.grid), se = T)

The wrapper I() is used to create a binary response
variable. The expression wage > 250 evaluates to a
logical variable containing TRUEs and FALSEs, which
glm() coerces to binary by setting the TRUEs to
1 and the FALSEs to 0.
To get confidence intervals, compute upper and lower bounds on the logit scale (or log-odds), and then invert to get on probability scale.
pfit <- exp(preds$fit) / (1 + exp(preds$fit))
se.bands.logit <- cbind(preds$fit + 2 * preds$se.fit,
preds$fit - 2 * preds$se.fit)
se.bands <- exp(se.bands.logit) / (1 + exp(se.bands.logit))

We can now use these bounds in the plot:
plot(age, I(wage > 250), xlim = agelims, type = "n",
ylim = c(0, .2))
points(jitter(age), I((wage > 250) / 5), cex = .5, pch = "|", col = "darkgrey")
lines(age.grid, pfit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)

The jitter() function was used to jitter the age values so that observations with the same age value do not cover each other up.
The plot shows the age values of the observations with wage above 250 as gray marks at the top of the panel (the high earners), and the ages of those with wage below 250 as gray marks at the bottom of the panel (the low earners).
The solid blue curve shows the fitted probability of being a high earner, as a function of age.
Using polynomial functions of the features as predictors in a linear model imposes a global structure on the non-linear function of X. We can instead use step functions to avoid imposing such a global structure.
With step functions, we break up the range of X into bins and fit a different constant in each bin. This converts a continuous variable into an ordered categorical variable.
In order to fit a step function, we use the cut()
function.
table(cut(age, 4))
## (17.9,33.5]   (33.5,49]   (49,64.5] (64.5,80.1]
##         750        1399         779          72

fit <- lm(wage ~ cut(age, 4), data = Wage)
coef(summary(fit))
##                         Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)            94.158392   1.476069 63.789970 0.000000e+00
## cut(age, 4)(33.5,49]   24.053491   1.829431 13.148074 1.982315e-38
## cut(age, 4)(49,64.5]   23.664559   2.067958 11.443444 1.040750e-29
## cut(age, 4)(64.5,80.1]  7.640592   4.987424  1.531972 1.256350e-01
Here cut() automatically picked the cut points at 33.5, 49, and 64.5 years of age. Using the breaks option, we can specify the cut points ourselves, as shown below.
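For example (the cut points here are a hypothetical choice, used only to illustrate the breaks option):

## breaks must span the full range of age (18 to 80), so include the endpoints
fit.breaks <- lm(wage ~ cut(age, breaks = c(17, 30, 45, 60, 81)), data = Wage)
coef(summary(fit.breaks))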
The cut() function returns an ordered categorical variable, and the lm() function then creates a set of dummy variables for use in the regression.
The age < 33.5 category is left out, so the intercept coefficient of $94,160 is interpreted as the average salary for those under 33.5 years of age, and the other coefficients are interpreted as the average additional salary for those in the other age groups.
Regression Splines are more flexible than polynomials and step functions, and are an extension of the two. They involve dividing the range of X into K distinct regions. Within each region, a polynomial function is fit to the data.
In order to fit regression splines in R, we use the
splines library.
Regression splines can be fit by constructing an appropriate matrix
of basis functions. The bs() function generates the entire
matrix of basis functions for splines with the specified set of
knots.
library(splines)
fit <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
pred <- predict(fit, newdata = list(age = age.grid), se = T)
plot(age, wage, col = "gray")
lines(age.grid, pred$fit, lwd = 2)
lines(age.grid, pred$fit + 2 * pred$se, lty = "dashed")
lines(age.grid, pred$fit - 2 * pred$se, lty = "dashed")

Here we have pre-specified knots at ages 25, 40, and 60. The cubic polynomials are constrained to be continuous, and to have continuous first and second derivatives.
This produces a cubic spline with six basis functions. Recall that a cubic spline with three knots has seven degrees of freedom; these are used up by an intercept plus the six basis functions.
One option is to place more knots in places where the function might vary most rapidly, and to place fewer knots where it seems more stable.
However, it is common practice to place knots in a uniform fashion.
In R we can do this conveniently with the df option, which produces a spline with knots at uniform quantiles of the data.

dim(bs(age, knots = c(25, 40, 60)))
## [1] 3000 6
dim(bs(age, df = 6))
## [1] 3000 6
attr(bs(age, df = 6), "knots")
##   25%   50%   75%
## 33.75 42.00 51.00
These numbers correspond to the 25th, 50th, and 75th percentiles of
age.
To fit a natural spline, we use the
ns() function.
A natural spline with K knots has K degrees of freedom.
fit2 <- lm(wage ~ ns(age, df = 4), data = Wage)
pred2 <- predict(fit2, newdata = list(age = age.grid),
se = T)
plot(age, wage, col = "gray")
lines(age.grid, pred2$fit, col = "red", lwd = 2)

Here we fit a natural spline with four degrees of freedom. The spline is fit to wage as a function of age.
A simple option is to try out different numbers of knots and see which produces the best looking curve.
The more objective option is to use cross validation:
With this method, we remove a portion of the data, fit a spline with a certain number of knots to the remaining data, and then use the spline to make predictions for the held-out portion.
We repeat this process multiple times until each observation has been left out once, and then compute the overall cross-validated RSS.
This procedure can be repeated for different numbers of knots K. Then the value of K giving the smallest RSS is chosen.
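A minimal sketch of this procedure, reusing the cv.glm() approach from the earlier polynomial-degree sketch (the boot library is assumed to be loaded, and the candidate df values 3 through 10 are an illustrative choice):

set.seed(2)
cv.spline <- rep(NA, 8)
for (d in 3:10) {
  ## a gaussian glm() is equivalent to lm(), but works with cv.glm()
  glm.fit <- glm(wage ~ ns(age, df = d), data = Wage)
  cv.spline[d - 2] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
cv.spline                     ## estimated test MSE for df = 3, ..., 10
(3:10)[which.min(cv.spline)]  ## df with the smallest cross-validated error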
In fitting a smooth curve to a set of data, what we really want to do is find some function, say g(x), that fits the observed data well; that is, we want the RSS to be small.
However, without any constraint on g(x) we could always drive the RSS to zero by interpolating the data, so we want a function that makes the RSS small but that is also smooth.
Smoothing Spline Criterion:
A smoothing spline is the function g that minimizes

Σ_i (y_i − g(x_i))^2 + λ ∫ g''(t)^2 dt,  where λ ≥ 0 is a tuning parameter.

The first term is the RSS and encourages g to fit the data well; the second term is a roughness penalty that controls how wiggly g(x) is, modulated by λ. The smaller λ, the more wiggly the function, eventually interpolating the y_i when λ = 0. As λ → ∞, the function g(x) becomes perfectly smooth, i.e. linear.
The solution to this criterion is a natural cubic spline with a knot at every unique value of x_i. The roughness penalty still controls its roughness via λ. This also avoids the knot-selection issue, leaving a single parameter λ to be chosen.
In order to fit a smoothing spline, we use the
smooth.spline() function.
plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Smoothing Spline")
fit <- smooth.spline(age, wage, df = 16)
fit2 <- smooth.spline(age, wage, cv = TRUE)
fit2$df
lines(fit, col = "red", lwd = 2)
lines(fit2, col = "blue", lwd = 2)
legend("topright", legend = c("16 DF", "6.8 DF"),
col = c("red", "blue"), lty = 1, lwd = 2, cex = .8)Notice that in the first call to smooth.spline(), we
specified df = 16. The function then determines which value
of λ leads to 16 degrees of freedom. In the second call to
smooth.spline(), we select the smoothness level by
cross-validation; this results in a value of λ that yields 6.8 degrees
of freedom.
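To inspect the smoothing parameter selected by cross-validation directly (a small illustration; lambda and df are components of the object returned by smooth.spline()):

fit2$lambda  ## the selected value of λ
fit2$df      ## the corresponding effective degrees of freedom (about 6.8)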
Local regression is another approach for fitting flexible non-linear functions, which involves computing the fit at a target point x0 using only the nearby training observations.
In order to perform local regression, we use the loess() function. Using a sliding weight function, it fits separate weighted least squares regressions over the range of X, i.e. it performs a local regression around each point.
plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Local Regression")
fit <- loess(wage ~ age, span = .2, data = Wage)
fit2 <- loess(wage ~ age, span = .5, data = Wage)
lines(age.grid, predict(fit, data.frame(age = age.grid)),
col = "red", lwd = 2)
lines(age.grid, predict(fit2, data.frame(age = age.grid)),
col = "blue", lwd = 2)
legend("topright", legend = c("Span = 0.2", "Span = 0.5"),
col = c("red", "blue"), lty = 1, lwd = 2, cex = .8)Here we have performed local linear regression using spans of 0.2 and 0.5: that is, each neighborhood consists of 20% or 50% of the observations. The larger the span, the smoother the fit.
The span specifies the fraction of the data used to compute the fit at each target point.
Red: Span = 0.2
Blue: Span = 0.5
GAMs provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity. GAMs can be applied to both quantitative and qualitative responses.
We now fit a GAM to predict wage using natural spline functions of year and age, treating education as a qualitative predictor. Since this is just a linear regression model using an appropriate choice of basis functions, we can simply do this using the lm() function.
gam1 <- lm(wage ~ ns(year, 4) + ns(age, 5) + education,
    data = Wage)

We now fit the model using smoothing splines rather than natural
splines. In order to fit more general sorts of GAMs, using smoothing
splines or other components that cannot be expressed in terms of basis
functions and then fit using least squares regression, we will need to
use the gam library in R.
The s() function, which is part of the gam library, is used to indicate that we would like to use a smoothing spline. We specify that the function of year should have 4 degrees of freedom, and that the function of age will have 5 degrees of freedom. Since education is qualitative, we leave it as is, and it is converted into four dummy variables.
We use the gam() function in order to fit a GAM using these components. All of the terms are fit simultaneously, taking each other into account to explain the response.
library(gam)
gam.m3 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
## s() specifies the degrees of freedom for each smooth term

We now call the plot() function to plot gam.m3:
par(mfrow = c(1, 3))
plot(gam.m3, se = TRUE, col = "blue")

The generic plot() function recognizes that gam.m3 is an object of class Gam, and invokes the appropriate plot.Gam() method. Conveniently, even though gam1 is not of class Gam but rather of class lm, we can still use plot.Gam() on it.
plot.Gam(gam1, se = TRUE, col = "red")

gam.m1 <- gam(wage ~ s(age, 5) + education, data = Wage)
gam.m2 <- gam(wage ~ year + s(age, 5) + education,
    data = Wage)
anova(gam.m1, gam.m2, gam.m3, test = "F")

Model 1: a GAM that excludes year
Model 2: a GAM that uses a linear function of year
Model 3: a GAM that uses a spline function of year
## Analysis of Deviance Table
##
## Model 1: wage ~ s(age, 5) + education
## Model 2: wage ~ year + s(age, 5) + education
## Model 3: wage ~ s(year, 4) + s(age, 5) + education
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 2990 3711731
## 2 2989 3693842 1 17889.2 14.4771 0.0001447 ***
## 3 2986 3689770 3 4071.1 1.0982 0.3485661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see there is compelling evidence that a GAM with a linear function
of year is better than a GAM that does not include
year at all.
However, there is no evidence that a non-linear function of year is needed (p-value = 0.349). In other words, based on the results of this ANOVA, Model 2 is preferred.
The summary() function produces a summary of the gam
fit.
summary(gam.m3)

##
## Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -119.43 -19.70 -3.33 14.17 213.48
##
## (Dispersion Parameter for gaussian family taken to be 1235.69)
##
## Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3689770 on 2986 degrees of freedom
## AIC: 29887.75
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(year, 4) 1 27162 27162 21.981 2.877e-06 ***
## s(age, 5) 1 195338 195338 158.081 < 2.2e-16 ***
## education 4 1069726 267432 216.423 < 2.2e-16 ***
## Residuals 2986 3689770 1236
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(year, 4) 3 1.086 0.3537
## s(age, 5) 4 32.380 <2e-16 ***
## education
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The "Anova for Parametric Effects" p-values clearly demonstrate that `year`, `age`, and `education` are all highly statistically significant, even when only assuming a linear relationship.
The "Anova for Nonparametric Effects" p-values for year and age correspond to a null hypothesis of a linear relationship versus the alternative of a non-linear relationship. The large p-value for year reinforces our conclusion from the ANOVA test that a linear function is adequate for this term. However, there is very clear evidence that a non-linear term is required for age.
We can make predictions using the predict() method for
the class Gam. Here we make predictions on the training set. We can also
use local regression fits as building blocks in a GAM, using the
lo() function.
preds <- predict(gam.m2, newdata = Wage)
gam.lo <- gam(
wage ~ s(year, df = 4) + lo(age, span = 0.7) + education,
data = Wage
)
plot.Gam(gam.lo, se = TRUE, col = "green")

Here we have used local regression for the age term, with a span of 0.7.
We can also use the lo() function to create interactions
before calling the gam() function.
gam.lo.i <- gam(wage ~ lo(year, age, span = 0.5) + education,
    data = Wage)

This code fits a two-term model, in which the first term is an interaction between year and age, fit by a local regression surface. We can plot the resulting two-dimensional surface if we first install the akima package.
library(akima)
library(interp)  ## interpolation methods used to plot the bivariate surface

plot(gam.lo.i)

In order to fit a logistic regression GAM, we once again use the
I() function in constructing the binary response variable,
and set family=binomial.
gam.lr <- gam(
I(wage > 250) ~ year + s(age, df = 5) + education,
family = binomial, data = Wage
)
par(mfrow = c(1, 3))
plot(gam.lr, se = T, col = "green")

It is easy to see that there are no high earners in the < HS category.
table(education, I(wage > 250))

##
## education FALSE TRUE
## 1. < HS Grad 268 0
## 2. HS Grad 966 5
## 3. Some College 643 7
## 4. College Grad 663 22
## 5. Advanced Degree 381 45
We fit a logistic regression GAM using all but this category. This provides more sensible results.
gam.lr.s <- gam(
I(wage > 250) ~ year + s(age, df = 5) + education,
family = binomial, data = Wage,
subset = (education != "1. < HS Grad")
)
plot(gam.lr.s, se = T, col = "green")

Pros:
GAMs allow us to fit a non-linear fj to each Xj , so that we can automatically model non-linear relationships that standard linear regression will miss. This means that we do not need to manually try out many different transformations on each variable individually.
The non-linear fits can potentially make more accurate predictions for the response Y
Cons:
The main limitation of GAMs is that the model is restricted to be additive. With many variables, important interactions can be missed.
However, as with linear regression, we can manually add interaction terms to the GAM by including additional predictors of the form Xj × Xk, as sketched below.
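A minimal sketch of this idea (the specific interaction term is an illustrative assumption, not part of the original lab): we add a simple parametric year × age interaction alongside the smooth terms.

## Smooth main effects plus a parametric interaction between year and age
gam.int <- gam(wage ~ s(year, 4) + s(age, 5) + education + year:age,
    data = Wage)
summary(gam.int)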