Often we assume linearity, but sometimes we are confronted with non-linearity. When the linearity assumption does not suffice, we can turn to the following approaches:
Polynomials
Step functions
Splines
Local Regression
Generalized Additive Models
This lab walks through the process of fitting non-linear models using each of these approaches.
First, we load the packages needed and attach the data set we will be using:
library(ISLR2)
library(tidyverse)
library(ggplot2)
attach(Wage)
view(Wage)
head(Wage, 10)
Polynomial regression is useful when a linear fit under-fits the data and the complexity of the model needs to be increased.
It extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power.
fit <- lm(wage ~ poly(age, 4), data = Wage)
coef(summary(fit))

This code fits a linear model in order to predict wage, using a fourth-degree polynomial in age: poly(age, 4). The poly() function lets us raise a variable to any power and avoids writing out each term by hand.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70361 0.7287409 153.283015 0.000000e+00
## poly(age, 4)1 447.06785 39.9147851 11.200558 1.484604e-28
## poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
## poly(age, 4)3 125.52169 39.9147851 3.144742 1.678622e-03
## poly(age, 4)4 -77.91118 39.9147851 -1.951938 5.103865e-02
The function returns a matrix whose columns are a basis of orthogonal polynomials, which essentially means that each column is a linear combination of the variables age, age^2, age^3 and age^4. We can also obtain age, age^2, age^3 and age^4 directly by passing the raw = TRUE argument to poly(); this changes the coefficient estimates but not the fitted values.
fit2 <- lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)

There are several other equivalent ways of fitting this model:

fit2a <- lm(wage ~ age + I(age^2) + I(age^3) + I(age^4),
    data = Wage)
coef(fit2a)
fit2b <- lm(wage ~ cbind(age, age^2, age^3, age^4),
    data = Wage)

The model fit2a uses the function I() and the ^ symbol to spell out the polynomial terms directly; the wrapper I() is needed because ^ has a special meaning in formulas. The model fit2b uses the function cbind() to build a matrix from a collection of vectors; any function call such as cbind() inside a formula also serves as a wrapper.
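As a quick check (an added illustration, not part of the original lab), we can verify that these parameterizations give the same fitted values even though their coefficient estimates differ:

## The orthogonal and raw polynomial bases describe the same model space,
## so the fitted values should agree to machine precision
max(abs(fitted(fit) - fitted(fit2a)))
max(abs(fitted(fit2) - fitted(fit2b)))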
Now we create a grid of values for age at which we want predictions, and then call the predict() function, specifying that we want standard errors as well.
agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2]) ## the specific grid range for age values
preds <- predict(fit, newdata = list(age = age.grid),
se = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit,
    preds$fit - 2 * preds$se.fit)

We can now plot the data and add the fit from the degree-4 polynomial.
par(mfrow = c(1, 2), mar = c(4.5, 4.5, 1, 1),
oma = c(0, 0, 4, 0))
plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Degree-4 Polynomial", outer = T)
lines(age.grid, preds$fit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)

The mar and oma arguments to par() allow us to control the margins of the plot, and the title() function creates a figure title that spans both subplots.
This is a plot of wage against age for the Wage data set, which contains income and demographic information for males who reside in the central Atlantic region of the United States. The plot shows the results of fitting a degree-4 polynomial using least squares.
In this model, the individual coefficients are not of particular
interest. Instead, we look at the entire fitted function across a grid
of 63 values for age from 18 to 80 in order to understand
the relationship between age and wage.
The solid blue curve is a degree-4 polynomial of
wage (in thousands of dollars) as a function of
age, fit by least squares.
The dashed curves indicate an estimated 95% confidence interval: they are drawn at twice the standard error above and below the fit. We use (2×) the standard error because, for normally distributed error terms, this quantity corresponds to an approximate 95% confidence interval.
Computation:
Least squares returns variance estimates for each of the fitted coefficients βˆj, as well as the covariances between pairs of coefficient estimates. Combining these gives the estimated pointwise variance of the fit at a point x0, and the estimated pointwise standard error is the square root of this variance.
This computation is repeated at each reference point x0, and we plot the fitted curve, as well as twice the standard error on either side of the fitted curve.
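As an illustration of this computation (a sketch added here; predict() already does this for us via se.fit), we can reproduce the pointwise standard errors from the coefficient covariance matrix:

## Evaluate the orthogonal polynomial basis at the grid points, plus an intercept column
X0 <- cbind(1, predict(poly(age, 4), age.grid))
## Pointwise variance of the fit: the diagonal of X0 %*% vcov(fit) %*% t(X0)
var.fit <- rowSums((X0 %*% vcov(fit)) * X0)
## The pointwise standard error is its square root; it should match preds$se.fit
max(abs(sqrt(var.fit) - preds$se.fit))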
In performing polynomial regression we must decide on the degree of the polynomial to use. One way to do this is by using hypothesis tests and the anova() function. We fit models ranging from linear to a degree-5 polynomial and seek to determine the simplest model which is sufficient to explain the relationship between wage and age.
Using an ANOVA, we test the null hypothesis that a simpler model is sufficient to explain the data against the alternative hypothesis that a more complex model is required. In order to use the anova() function, the models must be nested: the predictors in one model must be a subset of the predictors in the other model.
In this case, we fit five different models and sequentially compare the simpler model to the more complex model.
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage) ##(age, #) is the polynomial degree
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)

## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value comparing the linear Model 1 to the quadratic Model 2 is essentially zero, indicating that a linear fit is not sufficient. Similarly, the p-value comparing the quadratic Model 2 to the cubic Model 3 is very low (0.0017), so the quadratic fit is also insufficient. The p-value comparing the cubic and degree-4 polynomials, Model 3 and Model 4, is approximately 5%, while the degree-5 polynomial Model 5 seems unnecessary because its p-value is 0.37. Hence, either a cubic or a quartic polynomial (Model 3 or Model 4) appears to provide a reasonable fit to the data, but lower- or higher-order models are not justified.
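As an aside (this follows from poly() producing orthogonal polynomials; it is not needed for the analysis), the same p-values can be read off the coefficient t-statistics of the degree-5 fit:

## Because the polynomials are orthogonal, the square of each t-statistic below
## equals the corresponding F-statistic in the ANOVA table above
coef(summary(fit.5))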
As an alternative to using hypothesis tests and ANOVA, we could choose the polynomial degree using cross-validation. However, the ANOVA method works whether or not we used orthogonal polynomials; it also works when we have other terms in the model as well.
fit.1 <- lm(wage ~ education + age, data = Wage)
fit.2 <- lm(wage ~ education + poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ education + poly(age, 3), data = Wage)
anova(fit.1, fit.2, fit.3)

## Analysis of Variance Table
##
## Model 1: wage ~ education + age
## Model 2: wage ~ education + poly(age, 2)
## Model 3: wage ~ education + poly(age, 3)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2994 3867992
## 2 2993 3725395 1 142597 114.6969 <2e-16 ***
## 3 2992 3719809 1 5587 4.4936 0.0341 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
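Returning to the cross-validation alternative mentioned above, here is a minimal sketch that chooses the polynomial degree by 10-fold cross-validation with cv.glm() from the boot package (the boot library and the choice of 10 folds are assumptions; they are not used elsewhere in this lab):

library(boot)
set.seed(1)
cv.errors <- rep(NA, 5)
for (d in 1:5) {
  ## glm() with its default gaussian family fits the same model as lm(),
  ## but can be passed to cv.glm() for cross-validation
  glm.fit <- glm(wage ~ poly(age, d), data = Wage)
  cv.errors[d] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
cv.errors            ## estimated test MSE for degrees 1 through 5
which.min(cv.errors) ## the degree with the smallest CV error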
Looking at the previous plot, we see what appear to be two distinct populations: a group of high earners making more than $250,000 per year, and a group of low earners. We can treat wage as a binary variable by splitting it into these two groups.
Next we consider the task of predicting whether an individual earns more than $250,000 per year. We first create the appropriate response vector, and then apply the glm() function using family = "binomial" in order to fit a polynomial logistic regression model.
fit <- glm(I(wage > 250) ~ poly(age, 4), data = Wage,
family = binomial)
preds <- predict(fit, newdata = list(age = age.grid), se = T)

The wrapper I() is used to create a binary response
variable. The expression wage > 250 evaluates to a
logical variable containing TRUEs and FALSEs, which
glm() coerces to binary by setting the TRUEs to
1 and the FALSEs to 0.
To get confidence intervals, compute upper and lower bounds on the logit scale (or log-odds), and then invert to get on probability scale.
pfit <- exp(preds$fit) / (1 + exp(preds$fit))
se.bands.logit <- cbind(preds$fit + 2 * preds$se.fit,
preds$fit - 2 * preds$se.fit)
se.bands <- exp(se.bands.logit) / (1 + exp(se.bands.logit))

We can now use these bounds in the plot:
plot(age, I(wage > 250), xlim = agelims, type = "n",
ylim = c(0, .2))
points(jitter(age), I((wage > 250) / 5), cex = .5, pch = "|", col = "darkgrey")
lines(age.grid, pfit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)

The jitter() function was used to jitter the age values so that observations with the same age value do not cover each other up.
The plot shows the age values of the observations with wage above 250 as gray marks at the top of the panel (the high earners), and the ages of those with wage below 250 as gray marks at the bottom of the panel (the low earners).
The solid blue curve shows the fitted probability of being a high earner, as a function of age.
Using polynomial functions of the features as predictors in a linear model imposes a global structure on the non-linear function of X. We can instead use step functions to avoid imposing such a global structure.
With step functions, we break up the range of X into bins and fit a different constant in each bin. This converts a continuous variable into an ordered categorical variable.
In order to fit a step function, we use the cut()
function.
table(cut(age, 4))
## (17.9,33.5]   (33.5,49]   (49,64.5] (64.5,80.1]
##         750        1399         779          72

fit <- lm(wage ~ cut(age, 4), data = Wage)
coef(summary(fit))
##                         Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)            94.158392   1.476069 63.789970 0.000000e+00
## cut(age, 4)(33.5,49]   24.053491   1.829431 13.148074 1.982315e-38
## cut(age, 4)(49,64.5]   23.664559   2.067958 11.443444 1.040750e-29
## cut(age, 4)(64.5,80.1]  7.640592   4.987424  1.531972 1.256350e-01
Here cut() automatically picked the cut points at 33.5, 49, and 64.5 years of age. Using the breaks option, we can specify the cut points ourselves, as shown below.
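For example (the cut points here are a hypothetical choice, used only to illustrate the breaks option):

## breaks must span the full range of age (18 to 80), so include the endpoints
fit.breaks <- lm(wage ~ cut(age, breaks = c(17, 30, 45, 60, 81)), data = Wage)
coef(summary(fit.breaks))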
The cut() function returns an ordered categorical variable, and the lm() function then creates a set of dummy variables for use in the regression.
The age < 33.5 category is left out, so the intercept coefficient of $94,160 is interpreted as the average salary for those under 33.5 years of age, and the other coefficients are interpreted as the average additional salary for those in the other age groups.
Regression Splines are more flexible than polynomials and step functions, and are an extension of the two. They involve dividing the range of X into K distinct regions. Within each region, a polynomial function is fit to the data.
In order to fit regression splines in R, we use the
splines library.
Regression splines can be fit by constructing an appropriate matrix
of basis functions. The bs() function generates the entire
matrix of basis functions for splines with the specified set of
knots.
library(splines)
fit <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
pred <- predict(fit, newdata = list(age = age.grid), se = T)
plot(age, wage, col = "gray")
lines(age.grid, pred$fit, lwd = 2)
lines(age.grid, pred$fit + 2 * pred$se, lty = "dashed")
lines(age.grid, pred$fit - 2 * pred$se, lty = "dashed")

Here we have pre-specified knots at ages 25, 40, and 60. The cubic polynomials are constrained to be continuous, and to have continuous first and second derivatives.
This produces a cubic spline with six basis functions. Recall that a cubic spline with three knots has seven degrees of freedom; these are used up by an intercept plus the six basis functions.
One option is to place more knots in places where the function might vary most rapidly, and to place fewer knots where it seems more stable.
However, it is common practice to place knots in a uniform fashion.
In R we can do this conveniently with the df option, which produces a spline with knots at uniform quantiles of the data.

dim(bs(age, knots = c(25, 40, 60)))
## [1] 3000 6
dim(bs(age, df = 6))
## [1] 3000 6
attr(bs(age, df = 6), "knots")
##   25%   50%   75%
## 33.75 42.00 51.00
These numbers correspond to the 25th, 50th, and 75th percentiles of
age.
To fit a natural spline, we use the
ns() function.
A natural spline with K knots has K degrees of freedom.
fit2 <- lm(wage ~ ns(age, df = 4), data = Wage)
pred2 <- predict(fit2, newdata = list(age = age.grid),
se = T)
plot(age, wage, col = "gray")
lines(age.grid, pred2$fit, col = "red", lwd = 2)

Here we fit a natural spline with four degrees of freedom. The spline is fit to wage as a function of age.
A simple option is to try out different numbers of knots and see which produces the best looking curve.
The more objective option is to use cross validation:
With this method, we remove a portion of the data, fit a spline with a certain number of knots to the remaining data, and then use the spline to make predictions for the held-out portion.
We repeat this process multiple times until each observation has been left out once, and then compute the overall cross-validated RSS.
This procedure can be repeated for different numbers of knots K. Then the value of K giving the smallest RSS is chosen.
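A minimal sketch of this procedure, reusing the cv.glm() approach from the earlier polynomial-degree sketch (the boot library is assumed to be loaded, and the candidate df values 3 through 10 are an illustrative choice):

set.seed(2)
cv.spline <- rep(NA, 8)
for (d in 3:10) {
  ## a gaussian glm() is equivalent to lm(), but works with cv.glm()
  glm.fit <- glm(wage ~ ns(age, df = d), data = Wage)
  cv.spline[d - 2] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
cv.spline                     ## estimated test MSE for df = 3, ..., 10
(3:10)[which.min(cv.spline)]  ## df with the smallest cross-validated error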
In fitting a smooth curve to a set of data, what we really want to do is find some function, say g(x), that fits the observed data well; that is, we want the RSS to be small.
However, without any constraint on g(x) we could always drive the RSS to zero by interpolating the data, so we want a function that makes the RSS small but that is also smooth.
Smoothing Spline Criterion:
A smoothing spline is the function g that minimizes

Σ_i (y_i − g(x_i))^2 + λ ∫ g''(t)^2 dt,  where λ ≥ 0 is a tuning parameter.

The first term is the RSS and encourages g to fit the data well; the second term is a roughness penalty that controls how wiggly g(x) is, modulated by λ. The smaller λ, the more wiggly the function, eventually interpolating the y_i when λ = 0. As λ → ∞, the function g(x) becomes perfectly smooth, i.e. linear.
The solution to this criterion is a natural cubic spline with a knot at every unique value of x_i. The roughness penalty still controls its roughness via λ. This also avoids the knot-selection issue, leaving a single parameter λ to be chosen.
In order to fit a smoothing spline, we use the
smooth.spline() function.
plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Smoothing Spline")
fit <- smooth.spline(age, wage, df = 16)
fit2 <- smooth.spline(age, wage, cv = TRUE)
fit2$df
lines(fit, col = "red", lwd = 2)
lines(fit2, col = "blue", lwd = 2)
legend("topright", legend = c("16 DF", "6.8 DF"),
col = c("red", "blue"), lty = 1, lwd = 2, cex = .8)Notice that in the first call to smooth.spline(), we
specified df = 16. The function then determines which value
of λ leads to 16 degrees of freedom. In the second call to
smooth.spline(), we select the smoothness level by
cross-validation; this results in a value of λ that yields 6.8 degrees
of freedom.
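To inspect the smoothing parameter selected by cross-validation directly (a small illustration; lambda and df are components of the object returned by smooth.spline()):

fit2$lambda  ## the selected value of λ
fit2$df      ## the corresponding effective degrees of freedom (about 6.8)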
Local regression is another approach for fitting flexible non-linear functions, which involves computing the fit at a target point x0 using only the nearby training observations.
In order to perform local regression, we use the loess() function. Using a sliding weight function, it fits separate weighted least squares regressions over the range of X, i.e. it performs a local regression around each point.
plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Local Regression")
fit <- loess(wage ~ age, span = .2, data = Wage)
fit2 <- loess(wage ~ age, span = .5, data = Wage)
lines(age.grid, predict(fit, data.frame(age = age.grid)),
col = "red", lwd = 2)
lines(age.grid, predict(fit2, data.frame(age = age.grid)),
col = "blue", lwd = 2)
legend("topright", legend = c("Span = 0.2", "Span = 0.5"),
col = c("red", "blue"), lty = 1, lwd = 2, cex = .8)Here we have performed local linear regression using spans of 0.2 and 0.5: that is, each neighborhood consists of 20% or 50% of the observations. The larger the span, the smoother the fit.
The span specifies the fraction of the data used to compute the fit at each target point.
Red: Span = 0.2
Blue: Span = 0.5
GAMs provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity. GAMs can be applied to both quantitative and qualitative responses.
We now fit a GAM to predict wage using natural spline functions of year and age, treating education as a qualitative predictor. Since this is just a linear regression model using an appropriate choice of basis functions, we can simply do this using the lm() function.
gam1 <- lm(wage ~ ns(year, 4) + ns(age, 5) + education,
    data = Wage)

We now fit the model using smoothing splines rather than natural
splines. In order to fit more general sorts of GAMs, using smoothing
splines or other components that cannot be expressed in terms of basis
functions and then fit using least squares regression, we will need to
use the gam library in R.
The s() function, which is part of the gam library, is used to indicate that we would like to use a smoothing spline. We specify that the function of year should have 4 degrees of freedom, and that the function of age will have 5 degrees of freedom. Since education is qualitative, we leave it as is, and it is converted into four dummy variables.
We use the gam() function in order to fit a GAM using these components. All of the terms are fit simultaneously, taking each other into account to explain the response.
library(gam)
gam.m3 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
## s() specifies the degrees of freedom for each smooth term

We now call the plot() function to plot gam.m3:
par(mfrow = c(1, 3))
plot(gam.m3, se = TRUE, col = "blue")

The generic plot() function recognizes that gam.m3 is an object of class Gam, and invokes the appropriate plot.Gam() method. Conveniently, even though gam1 is not of class Gam but rather of class lm, we can still use plot.Gam() on it.
plot.Gam(gam1, se = TRUE, col = "red")

gam.m1 <- gam(wage ~ s(age, 5) + education, data = Wage)
gam.m2 <- gam(wage ~ year + s(age, 5) + education,
    data = Wage)
anova(gam.m1, gam.m2, gam.m3, test = "F")

Model 1: a GAM that excludes year
Model 2: a GAM that uses a linear function of year
Model 3: a GAM that uses a spline function of year
## Analysis of Deviance Table
##
## Model 1: wage ~ s(age, 5) + education
## Model 2: wage ~ year + s(age, 5) + education
## Model 3: wage ~ s(year, 4) + s(age, 5) + education
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 2990 3711731
## 2 2989 3693842 1 17889.2 14.4771 0.0001447 ***
## 3 2986 3689770 3 4071.1 1.0982 0.3485661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see there is compelling evidence that a GAM with a linear function
of year is better than a GAM that does not include
year at all.
However, there is no evidence that a non-linear function of year is needed (p-value = 0.349). In other words, based on the results of this ANOVA, Model 2 is preferred.
The summary() function produces a summary of the gam
fit.
summary(gam.m3)

##
## Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -119.43 -19.70 -3.33 14.17 213.48
##
## (Dispersion Parameter for gaussian family taken to be 1235.69)
##
## Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3689770 on 2986 degrees of freedom
## AIC: 29887.75
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(year, 4) 1 27162 27162 21.981 2.877e-06 ***
## s(age, 5) 1 195338 195338 158.081 < 2.2e-16 ***
## education 4 1069726 267432 216.423 < 2.2e-16 ***
## Residuals 2986 3689770 1236
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(year, 4) 3 1.086 0.3537
## s(age, 5) 4 32.380 <2e-16 ***
## education
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The "Anova for Parametric Effects" p-values clearly demonstrate that `year`, `age`, and `education` are all highly statistically significant, even when only assuming a linear relationship.
The "Anova for Nonparametric Effects" p-values for year and age correspond to a null hypothesis of a linear relationship versus the alternative of a non-linear relationship. The large p-value for year reinforces our conclusion from the ANOVA test that a linear function is adequate for this term. However, there is very clear evidence that a non-linear term is required for age.
We can make predictions using the predict() method for
the class Gam. Here we make predictions on the training set. We can also
use local regression fits as building blocks in a GAM, using the
lo() function.
preds <- predict(gam.m2, newdata = Wage)
gam.lo <- gam(
wage ~ s(year, df = 4) + lo(age, span = 0.7) + education,
data = Wage
)
plot.Gam(gam.lo, se = TRUE, col = "green")

Here we have used local regression for the age term, with a span of 0.7.
We can also use the lo() function to create interactions
before calling the gam() function.
gam.lo.i <- gam(wage ~ lo(year, age, span = 0.5) + education,
    data = Wage)

This code fits a two-term model, in which the first term is an interaction between year and age, fit by a local regression surface. We can plot the resulting two-dimensional surface if we first install the akima package.
library(akima)
library(interp)  ## interpolation methods used to plot the bivariate surface

plot(gam.lo.i)

In order to fit a logistic regression GAM, we once again use the
I() function in constructing the binary response variable,
and set family=binomial.
gam.lr <- gam(
I(wage > 250) ~ year + s(age, df = 5) + education,
family = binomial, data = Wage
)
par(mfrow = c(1, 3))
plot(gam.lr, se = T, col = "green")

It is easy to see that there are no high earners in the < HS category.
table(education, I(wage > 250))

##
## education FALSE TRUE
## 1. < HS Grad 268 0
## 2. HS Grad 966 5
## 3. Some College 643 7
## 4. College Grad 663 22
## 5. Advanced Degree 381 45
We fit a logistic regression GAM using all but this category. This provides more sensible results.
gam.lr.s <- gam(
I(wage > 250) ~ year + s(age, df = 5) + education,
family = binomial, data = Wage,
subset = (education != "1. < HS Grad")
)
plot(gam.lr.s, se = T, col = "green")

Pros:
GAMs allow us to fit a non-linear fj to each Xj , so that we can automatically model non-linear relationships that standard linear regression will miss. This means that we do not need to manually try out many different transformations on each variable individually.
The non-linear fits can potentially make more accurate predictions for the response Y
Cons:
The main limitation of GAMs is that the model is restricted to be additive. With many variables, important interactions can be missed.
However, as with linear regression, we can manually add interaction terms to the GAM by including additional predictors of the form Xj × Xk, as sketched below.
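A minimal sketch of this idea (the specific interaction term is an illustrative assumption, not part of the original lab): we add a simple parametric year × age interaction alongside the smooth terms.

## Smooth main effects plus a parametric interaction between year and age
gam.int <- gam(wage ~ s(year, 4) + s(age, 5) + education + year:age,
    data = Wage)
summary(gam.int)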