Non-Linear Methods


The linearity assumption is often inadequate: very few real relationships are truly linear. The standard linear model can be extended with

  1. Polynomial regression
  2. Step functions
  3. Splines
  4. Local regression
  5. Generalized additive models

This report focuses on items 1 to 3 (including smoothing splines); the lab at the end also includes a brief local regression example.

Polynomial Regression


  • Creates a non-linear fit by including powers of the original predictor as additional terms in the model.
  • Example: a cubic regression includes \(X,\space X^2,\space X^3\).

The standard linear model:

\[y_i = \beta_0 + \beta_1x_i + \epsilon_i\]

becomes:

\[y_i = \beta_0 + \beta_1x_i + \beta_2x^2_i + \beta_3x^3_i+ \dotsc+ \beta_dx^d_i + \epsilon_i\]

  • A large value of \(d\) will produce an extremely non-linear curve.
  • It is unusual to use a value greater than 3 or 4.

We can also use polynomial terms in a logistic regression to predict a binary response:

\[\Pr(y_i=1\mid x_i) = \frac{e^{\beta_0 + \beta_1x_i + \beta_2x^2_i + \beta_3x^3_i+ \dotsc+ \beta_dx^d_i}}{1+ e^{\beta_0 + \beta_1x_i + \beta_2x^2_i + \beta_3x^3_i+ \dotsc+ \beta_dx^d_i}}\]
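
As a minimal sketch (simulated data and hypothetical variable names; the lab below fits the same kinds of models to the Wage data), both versions are ordinary lm()/glm() calls on powers of the predictor:

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)          # non-linear signal plus noise
fit.poly  <- lm(y ~ poly(x, 4))             # degree-4 polynomial regression
y.bin     <- as.numeric(y > 0)              # a binary response for the logistic version
fit.logit <- glm(y.bin ~ poly(x, 4), family = binomial)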

Step Functions


  • Cut the range of \(X\) into bins and fit a different constant in each bin.
  • Converts a continuous variable into an ordered categorical variable.

For example, create cutpoints \(c_1, c_2,\dotsc,c_K\) and \(K+1\) new variables:

\[\begin{matrix} C_0(X) & = & I(X<c_1), \\ C_1(X) & = & I(c_1\leq X<c_2), \\ C_2(X) & = & I(c_2\leq X<c_3), \\ & \vdots & \\ C_{K-1}(X) & = & I(c_{K-1}\leq X<c_K), \\ C_K(X) & = & I(c_K\leq X) \\ \end{matrix}\]

\(I(\dotsm)\) is an indicator function. It equals 1 if the condition is true and 0 otherwise. These functions create dummy variables.

The linear model becomes:

\[y_i = \beta_0+\beta_1C_1(x_i)+\beta_2C_2(x_i)+\dotsc+\beta_KC_K(x_i)+\epsilon_i\]

  • For any given value of \(X\), at most one of \(C_1,C_2,\dotsc,C_K\) can be non-zero.
  • When \(X<c_1\), all of the predictors are zero. \(\beta_0\) can be interpreted as the mean value of \(Y\) for \(X<c_1\).
  • The model can also be extended to Logistic Regression.
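
A small sketch of the dummy-variable construction (simulated data and hypothetical cutpoints; the lab below does the same thing with cut() on age):

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
x.binned <- cut(x, breaks = c(-Inf, 3, 6, Inf), right = FALSE)   # bins X < 3, 3 <= X < 6, X >= 6
fit.step  <- lm(y ~ x.binned)                          # lm() expands the factor into dummy variables
fit.step2 <- lm(y ~ I(x >= 3 & x < 6) + I(x >= 6))     # same fit built from explicit indicators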

Basis Functions


  • Basis functions are transformations applied to a variable \(X\).
  • Polynomial regression and step functions are special cases of the basis function approach.

The linear model becomes:

\[y_i = \beta_0+\beta_1b_1(x_i)+\beta_2b_2(x_i)+\dotsc+\beta_Kb_K(x_i)+\epsilon_i\]

where \(b_1,b_2,\dotsc,b_K\) are functions that are fixed and known.

  • For polynomial regression the functions are \(b_j(x_i) = x^j_i\)
  • For step functions they are \(b_j(x_i)=I(c_j\leq x_i < c_{j+1})\)
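
Any fixed, known functions can play the role of \(b_1,\dotsc,b_K\); a sketch with an arbitrary, hypothetical basis matrix:

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
B <- cbind(b1 = x, b2 = x^2, b3 = cos(x))   # columns are fixed, known functions of x
fit.basis <- lm(y ~ B)                      # still an ordinary linear model in the basis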

Regression Splines


  • Regression splines are more flexible than polynomials or step functions.
  • You divide the range of \(X\) into \(K\) distinct regions and fit a polynomial to each region.
  • Each polynomial is constrained so that they join smoothly at the boundaries.

Piecewise Polynomials

  • Fit separate low-degree polynomials over different regions of \(X\).
  • The points where the coefficients change are called knots.

A piecewise cubic polynomial with a single knot at \(c\):

\[ y_i = \begin{cases} \beta_{01}+\beta_{11}x_i+\beta_{21}x^2_i +\beta_{31}x^3_i+\epsilon_i & \quad \text{if } x_i <c;\\ \beta_{02}+\beta_{12}x_i+\beta_{22}x^2_i +\beta_{32}x^3_i+\epsilon_i & \quad \text{if } x_i \geq c. \end{cases} \]

  • With \(K\) knots we would fit \(K+1\) cubic polynomials.
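
A sketch of such an unconstrained piecewise cubic (simulated data, hypothetical knot at \(c=5\)). Because the two cubics are fit separately, their predictions generally disagree at the knot, which motivates the constraints discussed next:

set.seed(1)
d <- data.frame(x = runif(200, 0, 10))
d$y <- sin(d$x) + rnorm(200, sd = 0.3)
c.knot <- 5
left  <- lm(y ~ poly(x, 3, raw = TRUE), data = d, subset = x <  c.knot)
right <- lm(y ~ poly(x, 3, raw = TRUE), data = d, subset = x >= c.knot)
predict(left,  newdata = data.frame(x = c.knot))   # the two pieces typically
predict(right, newdata = data.frame(x = c.knot))   # give different values at the knot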

Constraints and Splines

  • Unconstrained piecewise polynomials will usually “jump” at the knots, producing a discontinuous fit.
  • We can add constraints to make the fitted curve continuous.
  • For a cubic spline, if we also require that the first and second derivatives are continuous at the knots, the fit will be very smooth.
  • In general, a degree-\(d\) spline is a piecewise degree-\(d\) polynomial with continuous derivatives up to degree \(d-1\) at each knot.
  • A linear spline, for example, only requires continuity of the fitted function itself at each knot.

The Spline Basis Representation

We can use the basis model to represent a regression spline. A cubic spline with \(K\) knots can be modeled as:

\[y_i = \beta_0+\beta_1b_1(x_i)+\beta_2b_2(x_i)+\dotsc+\beta_{K+3}b_{K+3}(x_i)+\epsilon_i\]

To represent a cubic spline we start off with a basis for a cubic polynomial (\(x,x^2,x^3\)) and add one truncated power basis function per knot. A truncated power basis function is represented as:

\[ h(x,\xi) = (x-\xi)^3_+= \begin{cases} (x-\xi)^3 & \quad \text{if } x >\xi;\\ 0 & \quad \text{otherwise} . \end{cases} \]

where \(\xi\) is the knot.

For a cubic spline with \(K\) knots we can perform a least squares regression with an intercept and \(3+K\) predictors.

\[X,X^2,X^3,h(X,\xi_1),h(X,\xi_2),\dotsc,h(X,\xi_K)\]

where \(\xi_1,\dotsc,\xi_K\) are the knots.
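
A sketch of this construction by hand (simulated data, hypothetical knots), which gives the same kind of fit that bs() produces in the lab below:

h <- function(x, xi) pmax(x - xi, 0)^3       # the truncated power basis (x - xi)^3_+
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
knots <- c(2.5, 5, 7.5)                      # K = 3 hypothetical knots
fit.tp <- lm(y ~ x + I(x^2) + I(x^3) +
               h(x, knots[1]) + h(x, knots[2]) + h(x, knots[3]))   # intercept plus 3 + K predictors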

  • A natural spline adds the constraint that the fit must be linear at the boundaries (where \(X\) is smaller than the smallest knot or larger than the largest knot). This produces more stable estimates at the boundaries.

Choosing the Knots

  1. Specify the desired degrees of freedom and have the software place the corresponding number of knots at uniform quantiles of the data.

  2. Try out different numbers of knots and see which produces the best-looking curve.

  3. Use cross-validation (a sketch follows this list).

    • Choose a number of knots
    • Apply K-fold cross-validation
    • Calculate the RSS on the held-out folds
    • Repeat for different numbers of knots and keep the number with the lowest cross-validated RSS.
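
A sketch of this procedure, assuming the ISLR Wage data used in the lab below, 10 folds, and an arbitrary grid of df values for bs():

library(ISLR)
library(splines)
set.seed(1)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(Wage)))
cv.rss <- sapply(3:10, function(df) {
  mean(sapply(1:k, function(j) {
    fit  <- lm(wage ~ bs(age, df = df), data = Wage[folds != j, ])   # fit on k-1 folds
    pred <- predict(fit, newdata = Wage[folds == j, ])               # predict the held-out fold
    sum((Wage$wage[folds == j] - pred)^2)                            # held-out RSS
  }))
})
(3:10)[which.min(cv.rss)]   # degrees of freedom with the lowest cross-validated RSS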

Comparison to Polynomial Regression

Splines often outperform polynomial regression.

  • Polynomials must use a high degree to achieve a flexible fit, which can produce erratic behavior, especially near the boundaries.
  • Splines increase flexibility by adding knots while keeping the degree fixed and low; see the sketch below.
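
A sketch of the comparison, assuming the ISLR Wage data used in the lab below: a degree-15 polynomial versus a natural cubic spline with comparable flexibility (df = 15).

library(ISLR)
library(splines)
age.grid <- seq(min(Wage$age), max(Wage$age))
fit.poly15 <- lm(wage ~ poly(age, 15), data = Wage)      # very high-degree polynomial
fit.ns15   <- lm(wage ~ ns(age, df = 15), data = Wage)   # natural spline, similar flexibility
plot(Wage$age, Wage$wage, col = "darkgrey", cex = 0.5)
lines(age.grid, predict(fit.poly15, data.frame(age = age.grid)), col = "red",  lwd = 2)
lines(age.grid, predict(fit.ns15,   data.frame(age = age.grid)), col = "blue", lwd = 2)
# The polynomial tends to oscillate near the boundaries of age; the spline stays stable.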

Smoothing Splines


We would like to find a function \(g(x)\) that fits the data well, i.e., that makes \(RSS = \sum^n_{i=1}(y_i-g(x_i))^2\) small. However, we don’t want to overfit the data.

To find a function \(g\) that makes the \(RSS\) small and is also smooth (does not overfit), we can minimize:

\[\sum^n_{i=1}(y_i-g(x_i))^2 + \lambda \int g''(t)^2dt\]

  • \(\lambda\) is a non-negative tuning parameter.

  • The function \(g\) is known as a smoothing spline.

  • \(\sum^n_{i=1}(y_i-g(x_i))^2\) is a loss function that encourages \(g\) to fit the data well.

  • \(\lambda \int g''(t)^2dt\) is a penalty term that penalizes the variability in \(g\).

    • \(g'(t)\) measures the slope of a function at \(t\).
    • \(\int g''(t)^2dt\) is a measure of the total change in \(g'(t)\) over its entire range.
    • If \(g\) is smooth, then \(\int g''(t)^2dt\) will be small.
    • If \(g\) is jumpy, then \(\int g''(t)^2dt\) will be large.
    • \(\lambda \int g''(t)^2dt\) therefore encourages \(g\) to be smooth.
    • The larger the value of \(\lambda\), the smoother \(g\) will be.
    • When \(\lambda=0\) then \(g\) will exactly interpolate the observations.
    • When \(\lambda=\infty\) then \(g\) will be perfectly smooth. It will be the least squares line.
    • \(\lambda\) controls the bias-variance trade-off.

The function \(g(x)\) that minimizes \(\sum^n_{i=1}(y_i-g(x_i))^2 + \lambda \int g''(t)^2dt\) is a natural cubic spline with knots at the unique values of \(x_1,\dotsc, x_n\): a piecewise cubic polynomial with continuous first and second derivatives at each knot, and linear beyond the extreme knots.

The value of \(\lambda\) is typically chosen to minimize the leave-one-out cross-validated RSS, which can be computed very efficiently for smoothing splines.
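
A small sketch (simulated data; the lab below repeats this with the Wage data): smooth.spline() with cv = TRUE performs this selection automatically.

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
fit.ss <- smooth.spline(x, y, cv = TRUE)   # cv = TRUE requests leave-one-out cross-validation
fit.ss$lambda                              # selected smoothing parameter
fit.ss$df                                  # corresponding effective degrees of freedom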

Non-Linear Modeling


Polynomials and Step Functions


There are several ways to generate a polynomial regression. Interestingly, they do not all produce the same coefficient estimates, but the fitted values will all be the same.

We can use the Wage dataset for examples.

1. Orthogonal Polynomials

With poly() you specify the degree. By default it outputs orthogonal polynomials: each column is a linear combination of the variables age, age^2, age^3, and age^4, constructed so that the columns are mutually orthogonal.

> library(ISLR)
> attach(Wage)
> fit=lm(wage~poly(age,4),data=Wage)
> coef(summary(fit))
                Estimate Std. Error    t value     Pr(>|t|)
(Intercept)    111.70361  0.7287409 153.283015 0.000000e+00
poly(age, 4)1  447.06785 39.9147851  11.200558 1.484604e-28
poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
poly(age, 4)3  125.52169 39.9147851   3.144742 1.678622e-03
poly(age, 4)4  -77.91118 39.9147851  -1.951938 5.103865e-02

2. Raw Values

We can also use poly() to obtain age, age^2, age^3, and age^4 directly by specifying raw=TRUE.

> fit2=lm(wage~poly(age,4,raw=T),data=Wage)
> coef(summary(fit2))
                            Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)            -1.841542e+02 6.004038e+01 -3.067172 0.0021802539
poly(age, 4, raw = T)1  2.124552e+01 5.886748e+00  3.609042 0.0003123618
poly(age, 4, raw = T)2 -5.638593e-01 2.061083e-01 -2.735743 0.0062606446
poly(age, 4, raw = T)3  6.810688e-03 3.065931e-03  2.221409 0.0263977518
poly(age, 4, raw = T)4 -3.203830e-05 1.641359e-05 -1.951938 0.0510386498

3. Wrapper Function

We can use the wrapper function I().

> fit2a=lm(wage~age+I(age^2)+I(age^3)+I(age^4),data=Wage)
> coef(fit2a)
  (Intercept)           age      I(age^2)      I(age^3)      I(age^4) 
-1.841542e+02  2.124552e+01 -5.638593e-01  6.810688e-03 -3.203830e-05 

4. Collection of vectors

cbind() is a more compact version.

> fit2b=lm(wage~cbind(age,age^2,age^3,age^4),data=Wage)
> coef(fit2b)
                       (Intercept) cbind(age, age^2, age^3, age^4)age 
                     -1.841542e+02                       2.124552e+01 
   cbind(age, age^2, age^3, age^4)    cbind(age, age^2, age^3, age^4) 
                     -5.638593e-01                       6.810688e-03 
   cbind(age, age^2, age^3, age^4) 
                     -3.203830e-05 

Predictions

We can now create a grid of values for age and generate predictions and confidence intervals with the predict() function.

> agelims=range(age)
> age.grid=seq(from=agelims[1],to=agelims[2])
> preds=predict(fit,newdata=list(age=age.grid),se=TRUE)
> se.bands=cbind(preds$fit+2*preds$se.fit,
+                preds$fit-2*preds$se.fit)

Plot

We can plot the data, predictions, and confidence interval.

> library(ggplot2)
> ggplot()+
+   geom_point(mapping = aes(x=age,y=wage),
+              col="darkgrey", cex=0.5)+
+   scale_x_continuous(breaks=seq(round(agelims[1],-1),
+                             round(agelims[2],-1),10))+
+   geom_line(mapping = aes(x=age.grid,y=preds$fit),lwd=1)+
+   geom_line(mapping = aes(x=age.grid,y=se.bands[,1]),
+             lwd=1, lty=3, col="blue")+
+   geom_line(mapping = aes(x=age.grid,y=se.bands[,2]),
+             lwd=1, lty=3, col="blue")+
+   ggtitle("Degree-4 Polynomial")+
+   theme_classic()+
+   theme(plot.title = element_text(hjust = 0.5))

Fitted Values

Although fit and fit2 have different coefficient estimates, we can see that they return the same fitted values.

  • Find the maximum absolute difference in predicted values.
  • Use near() (from dplyr) to check that it is sufficiently close to 0.
> library(dplyr)
> preds2=predict(fit2,newdata=list(age=age.grid),se=TRUE)
> near(max(abs(preds$fit-preds2$fit)),0)
[1] TRUE

Polynomial Degree Selection

We can use the anova() function to select the polynomial degree. It tests the null hypothesis that a model \(\mathcal{M}_1\) is sufficient to explain the data against the alternative hypothesis that a more complex model \(\mathcal{M}_2\) is required. The predictors in \(\mathcal{M}_1\) must be a subset of those in \(\mathcal{M}_2\).

> fit.1=lm(wage~age,data=Wage)
> fit.2=lm(wage~poly(age,2),data=Wage)
> fit.3=lm(wage~poly(age,3),data=Wage)
> fit.4=lm(wage~poly(age,4),data=Wage)
> fit.5=lm(wage~poly(age,5),data=Wage)
> anova(fit.1,fit.2,fit.3,fit.4,fit.5)
Analysis of Variance Table

Model 1: wage ~ age
Model 2: wage ~ poly(age, 2)
Model 3: wage ~ poly(age, 3)
Model 4: wage ~ poly(age, 4)
Model 5: wage ~ poly(age, 5)
  Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
1   2998 5022216                                    
2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
3   2996 4777674  1     15756   9.8888  0.001679 ** 
4   2995 4771604  1      6070   3.8098  0.051046 .  
5   2994 4770322  1      1283   0.8050  0.369682    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The p-value comparing Model 1 to Model 2 is essentially zero, so we reject the null hypothesis that the linear Model 1 is sufficient.
  • The p-value comparing Model 2 to Model 3 is also well below 0.05, so Model 2 is not sufficient.
  • The p-value comparing Model 3 to Model 4 is just above 0.05 (about 5%), so we fail to reject the null hypothesis: Model 3 is sufficient.
  • The p-value comparing Model 4 to Model 5 is well above 0.05, so the degree-5 term is unnecessary.

We could have obtained the same p-values directly from the coefficient summary of the degree-5 fit, because poly() creates orthogonal polynomials.

> coef(summary(fit.5))
                Estimate Std. Error     t value     Pr(>|t|)
(Intercept)    111.70361  0.7287647 153.2780243 0.000000e+00
poly(age, 5)1  447.06785 39.9160847  11.2001930 1.491111e-28
poly(age, 5)2 -478.31581 39.9160847 -11.9830341 2.367734e-32
poly(age, 5)3  125.52169 39.9160847   3.1446392 1.679213e-03
poly(age, 5)4  -77.91118 39.9160847  -1.9518743 5.104623e-02
poly(age, 5)5  -35.81289 39.9160847  -0.8972045 3.696820e-01

The squares of the t-statistics are equal to the F-statistics from the anova() function, as the quick check below shows.
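
For example, squaring the t-statistic for the quadratic term in the table above reproduces the F-statistic for the Model 1 versus Model 2 comparison:

(-11.9830341)^2   # approximately 143.59, matching the anova() F value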

However, the ANOVA approach is more flexible: it also works when the models contain other terms.

> fit.1=lm(wage~education+age,data=Wage)
> fit.2=lm(wage~education+poly(age,2),data=Wage)
> fit.3=lm(wage~education+poly(age,3),data=Wage)
> anova(fit.1,fit.2,fit.3)
Analysis of Variance Table

Model 1: wage ~ education + age
Model 2: wage ~ education + poly(age, 2)
Model 3: wage ~ education + poly(age, 3)
  Res.Df     RSS Df Sum of Sq        F Pr(>F)    
1   2994 3867992                                 
2   2993 3725395  1    142597 114.6969 <2e-16 ***
3   2992 3719809  1      5587   4.4936 0.0341 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Prediction

We can predict whether an individual earns more than $250,000 per year (wage is recorded in thousands of dollars, so this corresponds to wage>250).

  • I() creates a binary response. TRUE = 1 and FALSE = 0.
  • family="binomial" fits a logistic regression.
> fit=glm(I(wage>250)~poly(age,4),data=Wage,family=binomial)

After we fit the model we can use predict()

> preds=predict(fit,newdata=list(age=age.grid),se.fit=TRUE)

By default we get predictions for the logit:

\[\log\left(\frac{\Pr(Y=1\mid X)}{1-\Pr(Y=1\mid X)}\right)=X\beta\]

In order to obtain predictions and confidence intervals on the probability scale, apply the transformation:

\[\Pr(Y=1\mid X)=\frac{\exp(X\beta)}{1+\exp(X\beta)}\]

> pfit=exp(preds$fit)/(1+exp(preds$fit))
> se.bands.logit = cbind(preds$fit+2*preds$se.fit, 
+                        preds$fit-2*preds$se.fit)
> se.bands = exp(se.bands.logit)/(1+exp(se.bands.logit))

You could instead use type="response" to get the probabilities directly, but the corresponding confidence intervals would not be sensible (they can include negative probabilities).

> preds=predict(fit,newdata=list(age=age.grid),
+               type="response",se.fit=TRUE)

Prediction Plot

> plot(age,I(wage>250),xlim=agelims,type="n",ylim=c(0,.2))
> points(jitter(age), I((wage>250)/5),cex=.5,
+        pch="|",col="darkgrey")
> lines(age.grid,pfit,lwd=2, col="blue")
> matlines(age.grid,se.bands,lwd=1,col="blue",lty=3)

  • wage values above 250K are at the top in gray
  • wage values below 250K are at the bottom in gray
  • jitter() scatters the values a bit so that they don’t all overlap
  • The predictions and confidence intervals are in blue.

Step Function

You can fit a step function with cut()

> table(cut(age,4))

(17.9,33.5]   (33.5,49]   (49,64.5] (64.5,80.1] 
        750        1399         779          72 
> fit=lm(wage~cut(age,4),data=Wage)
> coef(summary(fit))
                        Estimate Std. Error   t value     Pr(>|t|)
(Intercept)            94.158392   1.476069 63.789970 0.000000e+00
cut(age, 4)(33.5,49]   24.053491   1.829431 13.148074 1.982315e-38
cut(age, 4)(49,64.5]   23.664559   2.067958 11.443444 1.040750e-29
cut(age, 4)(64.5,80.1]  7.640592   4.987424  1.531972 1.256350e-01
  • cut() automatically picks the cutpoints, but you can use the breaks argument to set them manually.
  • lm() treats the cut() factor as a set of dummy variables.
  • The age<33.5 category is left out, so the intercept represents the average salary of those under 33.5 years of age.

Splines


Model

  • We can use the splines library.
  • bs() generates a matrix of basis functions with the specified set of knots. By default cubic splines are produced.
  • bs() also has a degree argument if a cubic spline is not desired.
  • Here the knots have been specified at 25, 40, and 60.
> library(splines)
> fit=lm(wage~bs(age,knots=c(25,40,60)),data=Wage)
> pred=predict(fit,newdata=list(age=age.grid),se.fit = TRUE)

Plot

> plot(age,wage,col="gray")
> lines(age.grid,pred$fit,lwd=2)
> lines(age.grid,pred$fit+2*pred$se,lty="dashed", col="blue")
> lines(age.grid,pred$fit-2*pred$se,lty="dashed", col="blue")

  • A cubic spline with three knots has 7 degrees of freedom (an intercept plus 6 basis functions).
  • We could instead use the df option to produce a spline with knots at uniform quantiles of the data.
> dim(bs(age,knots=c(25,40,60)))
[1] 3000    6
> dim(bs(age,df=6))
[1] 3000    6
> attr(bs(age,df=6),"knots")
  25%   50%   75% 
33.75 42.00 51.00 
  • Here the knots are at ages 33.8, 42.0, and 51.0.

Natural Spline

  • You can fit a natural spline with ns().
  • You could specify the knots with the knots option.
> fit2=lm(wage~ns(age,df=4),data=Wage)
> pred2=predict(fit2,newdata=list(age=age.grid),se.fit = TRUE)
> plot(age,wage,col="gray")
> lines(age.grid, pred2$fit,col="red",lwd=2)

Smoothing Spline

  • You can fit a smoothing spline with the smooth.spline() function.
  • If we specify df=16 it will determine the value of \(\lambda\) that leads to 16 degrees of freedom.
  • Cross-validation (cv=TRUE) selects a \(\lambda\) that yields about 6.8 effective degrees of freedom.
> plot(age,wage,xlim=agelims,cex=.5,col="darkgrey")
> title("Smoothing Spline")
> fit=smooth.spline(age,wage,df=16)
> fit2=smooth.spline(age,wage,cv=TRUE)
> fit2$df
[1] 6.794596
> lines(fit,col="red",lwd=2)
> lines(fit2,col="blue",lwd=2)
> legend("topright",legend=c("16 DF","6.8 DF"),
+        col=c("red","blue"),lty=1,lwd=2,cex=.8)

Local Regression

  • You can perform a local regression with the loess() function.
  • Here the spans are 0.2 and 0.5 (each neighborhood consists of 20% or 50% of the observations).
  • The larger the span the smoother the fit.
  • The locfit library could also be used.
> plot(age,wage,xlim=agelims,cex=.5,col="darkgrey")
> title("Local Regression")
> fit=loess(wage~age,span=.2,data=Wage)
> fit2=loess(wage~age,span=.5,data=Wage)
> lines(age.grid,predict(fit,data.frame(age=age.grid)),
+       col="red",lwd=2)
> lines(age.grid,predict(fit2,data.frame(age=age.grid)),
+       col="blue",lwd=2)
> legend("topright",legend=c("Span=0.2","Span=0.5"),
+        col=c("red","blue"),lty=1,lwd=2,cex=.8)