Non-Linear Methods


The linearity assumption is often inadequate: very few real-world trends are perfectly linear. The standard linear model can, however, be improved upon with

  1. Polynomial regression
  2. Step functions
  3. Splines
  4. Local regression
  5. Generalized additive models

This report focuses on approaches 4 and 5.

Local Regression


Local Regression involves computing the fit at a target point \(x_0\) using only nearby training observations. The algorithm is as follows:

Local Regression at \(X=x_0\)


  1. Gather the fraction \(s=k/n\) of training points whose \(x_i\) are closest to \(x_0\).

  2. Assign a weight \(K_{i0} = K(x_i,x_0)\) to each point in this neighborhood, so that the point furthest from \(x_0\) has a weight of zero and the closest has the highest weight. Everything outside of the span \(s\) gets a weight of zero.

  3. Fit a weighted least squares regression of the \(y_i\) on the \(x_i\) using the weights from step 2, by finding the \(\hat\beta_0\) and \(\hat\beta_1\) that minimize

\[\sum_{i=1}^nK_{i0}(y_i-\beta_0-\beta_1x_i)^2\]

  4. The fitted value at \(x_0\) is given by \(\hat f(x_0)=\hat\beta_0+\hat\beta_1x_0\).

  • We must choose the weighting function \(K\) (one common choice is shown after this list), whether to fit a constant, linear, or quadratic regression, and the span \(s\).

  • The span \(s\) is the most important choice. It plays a role similar to the \(\lambda\) parameter in smoothing splines: as the span increases, so does the smoothness of the fit. A very large span will use most or all of the training observations.

  • The span can be chosen by cross-validation or specified directly.
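
For example, the tricube kernel (the default weighting in R's loess() function) gives each point in the neighborhood the weight

\[K_{i0}=\left(1-\left(\frac{|x_i-x_0|}{d_0}\right)^3\right)^3,\]

where \(d_0\) is the distance from \(x_0\) to the furthest point in the neighborhood; points outside the neighborhood get a weight of zero.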

It is also possible to fit a model that is global in some variables but local in another, like time. Higher dimensions are also possible - we could fit a model that is local in a pair of variables, \(X_1\) and \(X_2\).
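
As a minimal sketch, base R's loess() function implements this algorithm. Here we fit wage as a local linear function of age on the Wage data (the span and degree values are illustrative choices):

> library(ISLR)
> fit.lo=loess(wage~age,data=Wage,span=0.7,degree=1)
> predict(fit.lo,newdata=data.frame(age=40))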

Generalized Additive Models


Generalized additive models (GAMs) allow non-linear functions of each variable while maintaining additivity.

Instead of the standard multiple linear regression model:

\[y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\dots+\beta_px_{ip}+\epsilon_i\]

we write the model with a smooth non-linear function \(f_j(x_{ij})\) in place of each linear component \(\beta_jx_{ij}\):

\[y_i=\beta_0+f_1(x_{i1})+f_2(x_{i2})+\dots+f_p(x_{ip})+\epsilon_i\]

We can calculate a separate function for each \(X_j\) and add together their contributions.

For example, with

\[\text{wage}=\beta_0+f_1(\text{year})+f_2(\text{age})+f_3(\text{education})+\epsilon\]

we could use a linear function of year, a natural spline of age, and a separate constant for each level of education. We then fit the model with least squares.
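
A minimal sketch of this particular specification, with a linear term for year, a natural spline for age, and education as a factor (the 5 degrees of freedom for age are an illustrative choice):

> library(ISLR)
> library(splines)
> fit=lm(wage~year+ns(age,5)+education,data=Wage)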

With smoothing splines, least squares cannot be used; instead, the gam() function uses backfitting. This method updates the fit for each predictor in turn while holding the others fixed, with each update fit to a partial residual.

Example:
A partial residual for \(X_3\) has the form \(r_i=y_i-f_1(x_{i1})-f_2(x_{i2})\). If we know \(f_1\) and \(f_2\), we can fit \(f_3\) by treating this residual as the response in a non-linear regression on \(X_3\).
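
As a toy illustration of backfitting (a sketch of the idea, not gam()'s actual internals), we can alternate smoothing-spline fits to partial residuals for two predictors; the degrees of freedom here are assumptions:

> library(ISLR)
> y=Wage$wage; x1=Wage$year; x2=Wage$age
> f1=rep(0,length(y)); f2=rep(0,length(y))
> for (it in 1:10) {
+   r1=y-mean(y)-f2                             # partial residual for x1
+   f1=predict(smooth.spline(x1,r1,df=4),x1)$y
+   f1=f1-mean(f1)                              # center each fitted function
+   r2=y-mean(y)-f1                             # partial residual for x2
+   f2=predict(smooth.spline(x2,r2,df=5),x2)$y
+   f2=f2-mean(f2)
+ }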

Pros and Cons of GAMs

  • Pro - Since we can fit a non-linear function to each \(X_j\) we don’t have to manually try transformations on each variable.
  • Pro - Non-linear fits can lead to better predictions.
  • Pro - Since the model is additive we can assess the effect of each \(X_j\) on \(Y\) while holding the others constant.
  • Pro - The smoothness of the function \(f_j\) can be summarized by degrees of freedom.
  • Con - The model is restricted to be additive, so interactions can be missed. However, interactions can be added manually, either as predictors of the form \(X_j \times X_k\) or as functions of the form \(f_{jk}(X_j,X_k)\); see the sketch below.
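
For instance, a simple parametric interaction term could be added like this (an illustrative sketch; the lab below instead uses lo() to fit a two-dimensional local regression term):

> library(ISLR); library(gam)
> gam.i=gam(wage~s(year,4)+s(age,5)+education+I(year*age),data=Wage)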

GAMs for Classification


When \(Y\) is qualitative, we can use the form:

\[\log \left(\frac{p(X)}{1-p(X)}\right)=\beta_0+f_1(X_1)+f_2(X_2)+\dots+f_p(X_p)\]

For example, if we wanted to predict the probability that an individual's income exceeds $250,000 (using the Wage dataset):

\[\log \left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1\times\text{year}+f_2(\text{age})+f_3(\text{education})\]

where \(p(X)=\Pr(\text{wage}>250\mid\text{year},\text{age},\text{education})\).

Non-Linear Modeling


Generalized Additive Models


Natural Spline GAM

We can use lm() to fit a GAM using natural spline functions of year and age while treating education as a qualitative predictor.

> library(ISLR)
> library(splines)
> gam1=lm(wage~ns(year,4)+ns(age,5)+education,data=Wage)

Smoothing Spline GAM

To fit a GAM with smoothing-spline terms we need the gam library and its s() function. Here we fit year with 4 degrees of freedom and age with 5 degrees of freedom.

> library(gam)
> gam.m3=gam(wage~s(year,4)+s(age,5)+education,data=Wage)

Smoothing Spline Plot

plot() automatically recognizes that gam.m3 is a gam object and invokes the plot.Gam() method.

> par(mfrow=c(1,3))
> plot(gam.m3, se=TRUE,col="blue")

Natural Spline Plot

For an object of class lm we must call plot.Gam() directly:

> par(mfrow=c(1,3))
> plot.Gam(gam1, se=TRUE, col="red")

Plot Output

  • The left-hand panel indicates that wage tends to increase with year when holding age and education fixed.
  • The middle panel indicates that wage tends to be highest for intermediate levels of age when holding year and education fixed.
  • The right-hand panel indicates that wage tends to increase with education when holding year and age fixed.

ANOVA test

We can perform an ANOVA test to see which of these models is best:

  • \(\mathcal{M}_1\) - a GAM that excludes year.
  • \(\mathcal{M}_2\) - a GAM that uses a linear function of year.
  • \(\mathcal{M}_3\) - a GAM that uses a spline function of year.
> gam.m1=gam(wage~s(age,5)+education,data=Wage)
> gam.m2=gam(wage~year+s(age,5)+education,data=Wage)
> anova(gam.m1,gam.m2,gam.m3,test="F")
Analysis of Deviance Table

Model 1: wage ~ s(age, 5) + education
Model 2: wage ~ year + s(age, 5) + education
Model 3: wage ~ s(year, 4) + s(age, 5) + education
  Resid. Df Resid. Dev Df Deviance       F    Pr(>F)    
1      2990    3711731                                  
2      2989    3693842  1  17889.2 14.4771 0.0001447 ***
3      2986    3689770  3   4071.1  1.0982 0.3485661    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA shows compelling evidence that a model including year is better than one that excludes it (p-value 0.0001). However, there is no evidence that a non-linear function of year is needed (p-value 0.349), so \(\mathcal{M}_2\) is preferred.

Model Summary

We can also view the model summary with summary():

> summary(gam.m3)

Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
Deviance Residuals:
    Min      1Q  Median      3Q     Max 
-119.43  -19.70   -3.33   14.17  213.48 

(Dispersion Parameter for gaussian family taken to be 1235.69)

    Null Deviance: 5222086 on 2999 degrees of freedom
Residual Deviance: 3689770 on 2986 degrees of freedom
AIC: 29887.75 

Number of Local Scoring Iterations: NA 

Anova for Parametric Effects
             Df  Sum Sq Mean Sq F value    Pr(>F)    
s(year, 4)    1   27162   27162  21.981 2.877e-06 ***
s(age, 5)     1  195338  195338 158.081 < 2.2e-16 ***
education     4 1069726  267432 216.423 < 2.2e-16 ***
Residuals  2986 3689770    1236                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova for Nonparametric Effects
            Npar Df Npar F  Pr(F)    
(Intercept)                          
s(year, 4)        3  1.086 0.3537    
s(age, 5)         4 32.380 <2e-16 ***
education                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Null hypothesis - the relationship between the predictor and the response is linear

  • Alternative hypothesis - the relationship between the predictor and the response is non-linear

  • Here we can see that a linear function is adequate for year (p-value 0.35 - fail to reject the null).

  • A non-linear function is required for age (very low p-value - reject the Null).

Predictions

We can use the predict() function to make predictions; here we predict on the training set.

> preds=predict(gam.m2,newdata=Wage)
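
As a quick illustrative check (not part of the original lab), we can compute the in-sample mean squared error of these predictions:

> mean((Wage$wage-preds)^2)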

Local Regression


We can also use local regression within a GAM via the lo() function. Here age is fit by local regression with a span of 0.7.

> gam.lo=gam(wage~s(year,df=4)+lo(age,span=0.7)+education,
+            data=Wage)
> 
> par(mfrow=c(1,3))
> plot.Gam(gam.lo, se=TRUE, col="green")

Interactions

We can also fit interactions. Here lo() creates a single term that is local in both year and age, with education entering as a separate term.

> gam.lo.i=gam(wage~lo(year,age,span=0.5)+education,data=Wage)

Plot

The akima package is needed to plot the resulting two-dimensional surface.

> library(akima)
> par(mfrow=c(1,2))
> plot(gam.lo.i)

Logistic Regression GAM


To fit a logistic regression GAM we use the I() function to construct the binary response variable on the fly and set family=binomial.

> gam.lr=gam(I(wage>250)~year+s(age,df=5)+education,
+            family=binomial,data=Wage)
> par(mfrow=c(1,3))
> plot(gam.lr,se=TRUE,col="green")

The confidence band for the <HS education level is very wide because there are no individuals with less than a high-school degree earning more than $250,000:

> table(Wage$education,I(Wage$wage>250))
                    
                     FALSE TRUE
  1. < HS Grad         268    0
  2. HS Grad           966    5
  3. Some College      643    7
  4. College Grad      663   22
  5. Advanced Degree   381   45

It therefore makes more sense to refit the logistic regression excluding that category.

> gam.lr.s=gam(I(wage>250)~year+s(age,df=5)+education,
+              family=binomial,data=Wage,
+              subset=(education!="1. < HS Grad"))
> par(mfrow=c(1,3))
> plot(gam.lr.s,se=TRUE,col="green")