The linearity assumption is often inadequate: very few real trends are perfectly linear. It can, however, be improved upon with several extensions. This report focuses on two of them: local regression and generalized additive models (GAMs).
Local regression computes the fit at a target point \(x_0\) using only nearby training observations. The algorithm is as follows:
Local Regression at \(X=x_0\)
1. Gather the fraction \(s=k/n\) of training points whose \(x_i\) are closest to \(x_0\).
2. Assign a weight \(K_{i0} = K(x_i,x_0)\) to each point in this neighborhood, so that the point furthest from \(x_0\) has a weight of zero and the closest has the highest weight. Every point outside the span \(s\) gets a weight of zero.
3. Fit a weighted least squares regression of the \(y_i\) on the \(x_i\) using the weights from step 2, by finding the \(\hat\beta_0\) and \(\hat\beta_1\) that minimize
\[\sum_{i=1}^nK_{i0}(y_i-\beta_0-\beta_1x_i)^2\]
4. The fitted value at \(x_0\) is then \(\hat f(x_0)=\hat\beta_0+\hat\beta_1x_0\).
We must choose the weighting function \(K\), whether to fit a constant, linear, or quadratic regression, and the span \(s\).
The span \(s\) is the most important choice; it plays a role similar to the tuning parameter \(\lambda\) in smoothing splines. As the span increases, so does the smoothness of the fit, and a very large span will use most or all of the training observations. The span can be chosen by cross-validation or specified directly.
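As a rough sketch, R's built-in loess() function implements this procedure; the data below are simulated purely for illustration, and the span argument plays the role of \(s\).

set.seed(1)
x <- runif(100, 0, 10)                             # simulated predictor
y <- sin(x) + rnorm(100, sd = 0.3)                 # simulated response

# degree = 1 fits a local linear regression; span controls smoothness
fit.small <- loess(y ~ x, span = 0.2, degree = 1)  # wigglier fit
fit.large <- loess(y ~ x, span = 0.7, degree = 1)  # smoother fit

# Fitted value at a target point x0 = 5
predict(fit.large, newdata = data.frame(x = 5))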
It is also possible to fit a model that is global in some variables but local in another, like time. Higher dimensions are also possible - we could fit a model that is local in a pair of variables, \(X_1\) and \(X_2\).
Generalized Additive Models (GAMs) allow non-linear functions of each variable while maintaining additivity.
Instead of the standard multiple linear regression model:
\[y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\dots+\beta_px_{ip}+\epsilon_i\]
We can replace each linear component \(\beta_jx_{ij}\) with a non-linear function \(f_j(x_{ij})\):
\[y_i=\beta_0+f_1(x_{i1})+f_2(x_{i2})+\dots+f_p(x_{ip})+\epsilon_i\]
We can calculate a separate function for each \(X_j\) and add together their contributions.
For example, with
\[\text{wage}=\beta_0+f_1(\text{year})+f_2(\text{age})+f_3(\text{education})+\epsilon\]
we could use a linear function of year, a natural spline of age, and a separate constant for each level of education. We then fit the model with least squares.
With a smoothing spline, least squares can't be used; instead, the gam() function uses backfitting. This method updates the fit for each predictor in turn while holding the others fixed, fitting each variable to a partial residual.
Example:
A partial residual for \(X_3\) has the form \(r_i=y_i-f_1(x_{i1})-f_2(x_{i2})\). If we know \(f_1\) and \(f_2\), we can fit \(f_3\) by treating this residual as the response in a non-linear regression on \(X_3\).
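A minimal backfitting sketch with two smoothing spline terms is shown below; the data are simulated and the names (f1, f2) are hypothetical. gam() performs these updates internally.

set.seed(1)
n  <- 200
x1 <- runif(n); x2 <- runif(n)
y  <- sin(2 * pi * x1) + (x2 - 0.5)^2 + rnorm(n, sd = 0.1)

f1 <- rep(0, n); f2 <- rep(0, n)
for (iter in 1:20) {
  # Fit f1 to the partial residual that removes f2's contribution
  f1 <- predict(smooth.spline(x1, y - mean(y) - f2, df = 4), x1)$y
  f1 <- f1 - mean(f1)               # center each function for identifiability
  # Fit f2 to the partial residual that removes f1's contribution
  f2 <- predict(smooth.spline(x2, y - mean(y) - f1, df = 4), x2)$y
  f2 <- f2 - mean(f2)
}
beta0 <- mean(y)                    # the intercept absorbs the means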
When \(Y\) is qualitative we can use the logistic form:
\[\log \left(\frac{p(X)}{1-p(X)}\right)=\beta_0+f_1(X_1)+f_2(X_2)+\dots+f_p(X_p)\]
For example, to predict the probability that an individual's income exceeds $250,000 (using the Wage dataset):
\[\log \left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1\times\text{year}+f_2(\text{age})+f_3(\text{education})\]
where \(p(X)=\Pr(\text{wage}>250\mid\text{year},\text{age},\text{education})\).
Natural Spline GAM
We can use lm() to fit a GAM with natural spline functions of year and age, treating education as a qualitative predictor.
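For example, as in the standard ISLR lab code (the ISLR package supplies the Wage data and the splines package supplies ns()):
> library(ISLR)
> library(splines)
> gam1=lm(wage~ns(year,4)+ns(age,5)+education,data=Wage)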
Smoothing Spline GAM
To fit smoothing splines we need the gam library and its s() function. Here we fit year with 4 degrees of freedom and age with 5 degrees of freedom.
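For example, as in the lab (gam.m3 is the fit referenced in the plots and ANOVA below):
> library(gam)
> gam.m3=gam(wage~s(year,4)+s(age,5)+education,data=Wage)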
Smoothing Spline Plot
plot() automatically recognizes that gam.m3 is a fitted gam object and invokes the plot.Gam() method (plot.gam() in older versions of the gam package).
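For example:
> par(mfrow=c(1,3))
> plot(gam.m3,se=TRUE,col="blue")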
Natural Spline Plot
For an object of class lm we must call plot.Gam() directly.
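For example, plotting the lm() fit from above:
> par(mfrow=c(1,3))
> plot.Gam(gam1,se=TRUE,col="red")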
Plot Output
wage tends to increase with year, holding age and education fixed.
wage tends to be highest for intermediate levels of age, holding year and education fixed.
wage tends to increase with education, holding year and age fixed.
ANOVA test
We can perform an ANOVA test to see which of these models is best: \(\mathcal{M}_1\), a GAM that excludes year; \(\mathcal{M}_2\), a GAM that uses a linear function of year; or \(\mathcal{M}_3\), a GAM that uses a smoothing spline function of year.
> gam.m1=gam(wage~s(age,5)+education,data=Wage)
> gam.m2=gam(wage~year+s(age,5)+education,data=Wage)
> anova(gam.m1,gam.m2,gam.m3,test="F")
Analysis of Deviance Table
Model 1: wage ~ s(age, 5) + education
Model 2: wage ~ year + s(age, 5) + education
Model 3: wage ~ s(year, 4) + s(age, 5) + education
Resid. Df Resid. Dev Df Deviance F Pr(>F)
1 2990 3711731
2 2989 3693842 1 17889.2 14.4771 0.0001447 ***
3 2986 3689770 3 4071.1 1.0982 0.3485661
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we can see that a GAM without year (\(\mathcal{M}_1\)) is inadequate: the p-value of 0.00014 shows that adding a linear function of year helps. However, the spline of year in \(\mathcal{M}_3\) offers no significant improvement (p-value 0.349), so \(\mathcal{M}_2\) is preferred.
Model Summary
We can also view a summary of the smoothing spline fit:
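> summary(gam.m3)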
Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
Deviance Residuals:
Min 1Q Median 3Q Max
-119.43 -19.70 -3.33 14.17 213.48
(Dispersion Parameter for gaussian family taken to be 1235.69)
Null Deviance: 5222086 on 2999 degrees of freedom
Residual Deviance: 3689770 on 2986 degrees of freedom
AIC: 29887.75
Number of Local Scoring Iterations: NA
Anova for Parametric Effects
Df Sum Sq Mean Sq F value Pr(>F)
s(year, 4) 1 27162 27162 21.981 2.877e-06 ***
s(age, 5) 1 195338 195338 158.081 < 2.2e-16 ***
education 4 1069726 267432 216.423 < 2.2e-16 ***
Residuals 2986 3689770 1236
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova for Nonparametric Effects
Npar Df Npar F Pr(F)
(Intercept)
s(year, 4) 3 1.086 0.3537
s(age, 5) 4 32.380 <2e-16 ***
education
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null hypothesis: the relationship between the predictor and the response is linear.
Alternative hypothesis: the relationship between the predictor and the response is non-linear.
Here we can see that year is adequately represented by a linear function (p-value 0.35, so we fail to reject the null).
A non-linear function is required for age (very small p-value, so we reject the null).
Predictions
We can use the predict() function to make predictions.
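For example, generating predictions on the training set from the preferred model \(\mathcal{M}_2\), as in the lab:
> preds=predict(gam.m2,newdata=Wage)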
We can also use local regression terms with the lo() function. Here the span is set to 0.7.
> gam.lo=gam(wage~s(year,df=4)+lo(age,span=0.7)+education,
+ data=Wage)
>
> par(mfrow=c(1,3))
> plot.Gam(gam.lo, se=TRUE, col="green")
Interactions
We can also fit interactions. Here an interaction between year and age is one term and education is another.
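As in the lab code, lo() accepts several variables to create an interaction term fit by local regression:
> gam.lo.i=gam(wage~lo(year,age,span=0.5)+education,data=Wage)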
Plot
The akima package can be used to generate the plot.
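For example:
> library(akima)
> plot(gam.lo.i)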
To fit a logistic regression GAM we can use the I() function to create a binary response variable, together with family=binomial.
> gam.lr=gam(I(wage>250)~year+s(age,df=5)+education,
+ family=binomial,data=Wage)
> par(mfrow=c(1,3))
> plot(gam.lr,se=TRUE,col="green")
The confidence interval for level <HS is very wide: there are no individuals with less than a HS degree making more than $250,000.
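We can confirm this with a contingency table (this assumes the Wage data frame is attached, as in the lab, so that education and wage are visible directly):
> table(education,I(wage>250))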
FALSE TRUE
1. < HS Grad 268 0
2. HS Grad 966 5
3. Some College 643 7
4. College Grad 663 22
5. Advanced Degree 381 45
Fitting a logistic regression without that category makes more sense.
> gam.lr.s=gam(I(wage>250)~year+s(age,df=5)+education,
+ family=binomial,data=Wage,
+ subset=(education!="1. < HS Grad"))
> par(mfrow=c(1,3))
> plot(gam.lr.s,se=TRUE,col="green")