The linearity assumption is often inadequate: very few real trends follow this pattern perfectly. However, it can be improved upon with extensions such as:

1. Polynomial regression
2. Step functions
3. Splines (regression, natural, and smoothing splines) and local regression

This report focuses on 1 to 3.
The standard linear model:
\[y_i = \beta_0 + \beta_1x_i + \epsilon_i\]
becomes:
\[y_i = \beta_0 + \beta_1x_i + \beta_2x^2_i + \beta_3x^3_i+ \dotsc+ \beta_dx^d_i + \epsilon_i\]
We can also use logistic regression to predict a binary response:
\[\Pr(y_i = 1\mid x_i) = \frac{e^{\beta_0 + \beta_1x_i + \beta_2x^2_i + \beta_3x^3_i+ \dotsb+ \beta_dx^d_i}}{1+ e^{\beta_0 + \beta_1x_i + \beta_2x^2_i + \beta_3x^3_i+ \dotsb+ \beta_dx^d_i}}\]
Step functions offer another approach: break the range of \(X\) into bins and fit a constant within each bin. For example, create cutpoints \(c_1, c_2,\dotsc,c_K\) and \(K+1\) new variables:
\[\begin{matrix} C_0(X) & = & I(X<c_1), \\ C_1(X) & = & I(c_1\leq X<c_2), \\ C_2(X) & = & I(c_2\leq X<c_3), \\ & \vdots & \\ C_{K-1}(X) & = & I(c_{K-1}\leq X<c_K), \\ C_K(X) & = & I(c_K\leq X) \\ \end{matrix}\]
\(I(\cdot)\) is an indicator function: it equals 1 if the condition is true and 0 otherwise. These functions create dummy variables.
The linear model becomes:
\[y_i = \beta_0+\beta_1C_1(x_i)+\beta_2C_2(x_i)+\dotsc+\beta_KC_K(x_i)+\epsilon_i\]
More generally, in the basis function approach, the linear model becomes:
\[y_i = \beta_0+\beta_1b_1(x_i)+\beta_2b_2(x_i)+\dotsc+\beta_Kb_K(x_i)+\epsilon_i\]
where \(b_1,b_2,\dotsc,b_K\) are functions that are fixed and known.
Regression splines fit separate polynomials over regions of \(X\) delimited by knots. For example, a piecewise cubic polynomial with a single knot at \(c\) takes the form:
\[ y_i = \begin{cases} \beta_{01}+\beta_{11}x_i+\beta_{21}x^2_i +\beta_{31}x^3_i+\epsilon_i & \quad \text{if } x_i <c;\\ \beta_{02}+\beta_{12}x_i+\beta_{22}x^2_i +\beta_{32}x^3_i+\epsilon_i & \quad \text{if } x_i \geq c. \end{cases} \]
We can use the basis function model above to represent a regression spline. A cubic spline with \(K\) knots can be modeled as:
\[y_i = \beta_0+\beta_1b_1(x_i)+\beta_2b_2(x_i)+\dotsc+\beta_{K+3}b_{K+3}(x_i)+\epsilon_i\]
To represent a cubic spline we start off with a basis for a cubic polynomial (\(x,x^2,x^3\)) and add one truncated power basis function per knot. A truncated power basis function is represented as:
\[ h(x,\xi) = (x-\xi)^3_+= \begin{cases} (x-\xi)^3 & \quad \text{if } x >\xi;\\ 0 & \quad \text{otherwise} . \end{cases} \]
where \(\xi\) is the knot.
For a cubic spline with \(K\) knots, we can perform least squares regression with an intercept and \(3+K\) predictors:
\[X,X^2,X^3,h(X,\xi_1),h(X,\xi_2),\dotsc,h(X,\xi_K)\]
where \(\xi_1,\dotsc,\xi_K\) are the knots.
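A minimal sketch of this construction in R, assuming generic data vectors x and y (the names h, knots, and B are illustrative, not from the original):
> # truncated power basis: (x - xi)^3 for x > xi, 0 otherwise
> h <- function(x, xi) pmax((x - xi)^3, 0)
> knots <- c(25, 40, 60)                   # K = 3 example knots
> B <- sapply(knots, function(k) h(x, k))  # one basis column per knot
> fit <- lm(y ~ x + I(x^2) + I(x^3) + B)   # intercept plus 3 + K predictors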
How many knots should we use, and where should they be placed? Common strategies are:

- Specify the desired degrees of freedom and have the software place the corresponding number of knots at uniform quantiles of the data.
- Try out different numbers of knots and see which produces the best looking curve.
- Use cross-validation.
Splines often give better results than polynomial regression: they add flexibility by increasing the number of knots while keeping the degree fixed, which generally produces more stable estimates, especially near the boundaries of the data.
We would like to find a function \(g(x)\) that fits the data well, i.e., that makes \(RSS = \sum^n_{i=1}(y_i-g(x_i))^2\) small. However, without any constraint on \(g\) we could always drive the RSS to zero by interpolating every observation, overfitting the data.
To find a function \(g\) that minimizes \(RSS\) and is also smooth (does not overfit), we can minimize:
\[\sum^n_{i=1}(y_i-g(x_i))^2 + \lambda \int g''(t)^2dt\]
\(\lambda\) is a non-negative tuning parameter: the larger its value, the smoother \(g\) will be.
The function \(g\) is known as a smoothing spline.
\(\sum^n_{i=1}(y_i-g(x_i))^2\) is a loss function that encourages \(g\) to fit the data well.
\(\lambda \int g''(t)^2dt\) is a penalty term that penalizes the variability in \(g\).
The function \(g(x)\) that minimizes \(\sum^n_{i=1}(y_i-g(x_i))^2 + \lambda \int g''(t)^2dt\) is a natural cubic spline: a piecewise cubic polynomial with knots at the unique values of \(x_1,\dotsc, x_n\), continuous first and second derivatives at each knot, and linear behavior outside the extreme knots.
The value of \(\lambda\) is typically chosen by leave-one-out cross-validation, picking the \(\lambda\) that minimizes the cross-validated RSS; for smoothing splines this can be computed very efficiently.
There are several ways to fit a polynomial regression in R. Interestingly, they do not all produce the same coefficient estimates, but the fitted values will all be the same. We use the Wage dataset from the ISLR package for the examples.
1. Orthogonal Polynomials
With poly() you can specify the degree. By default it outputs orthogonal polynomials: each column of the basis is a linear combination of the variables age, age^2, age^3, and age^4, constructed so that the columns are mutually orthogonal.
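A sketch of the fit that produces the coefficient table below, assuming the Wage data from the ISLR package is loaded and attached (so that age and wage are visible):
> library(ISLR)
> attach(Wage)
> fit=lm(wage~poly(age,4),data=Wage)
> coef(summary(fit))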
Estimate Std. Error t value Pr(>|t|)
(Intercept) 111.70361 0.7287409 153.283015 0.000000e+00
poly(age, 4)1 447.06785 39.9147851 11.200558 1.484604e-28
poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
poly(age, 4)3 125.52169 39.9147851 3.144742 1.678622e-03
poly(age, 4)4 -77.91118 39.9147851 -1.951938 5.103865e-02
2. Raw Values
We can also use poly() to obtain age, age^2, age^3, and age^4 directly by specifying raw=TRUE.
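A sketch of the corresponding call (fit2 is the name reused in the fitted-values comparison below):
> fit2=lm(wage~poly(age,4,raw=T),data=Wage)
> coef(summary(fit2))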
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.841542e+02 6.004038e+01 -3.067172 0.0021802539
poly(age, 4, raw = T)1 2.124552e+01 5.886748e+00 3.609042 0.0003123618
poly(age, 4, raw = T)2 -5.638593e-01 2.061083e-01 -2.735743 0.0062606446
poly(age, 4, raw = T)3 6.810688e-03 3.065931e-03 2.221409 0.0263977518
poly(age, 4, raw = T)4 -3.203830e-05 1.641359e-05 -1.951938 0.0510386498
3. Wrapper Function
We can use the wrapper function I(); the ^ symbol has a special meaning inside model formulas, so terms like age^2 must be protected.
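A sketch of this version (fit2a is an assumed name):
> fit2a=lm(wage~age+I(age^2)+I(age^3)+I(age^4),data=Wage)
> coef(fit2a)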
(Intercept) age I(age^2) I(age^3) I(age^4)
-1.841542e+02 2.124552e+01 -5.638593e-01 6.810688e-03 -3.203830e-05
4. Collection of vectors
cbind() is a more compact version: building the matrix of powers inside the formula also protects the ^ operator, so the I() wrappers are unnecessary.
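A sketch (fit2b is an assumed name):
> fit2b=lm(wage~cbind(age,age^2,age^3,age^4),data=Wage)
> coef(fit2b)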
(Intercept)                         -1.841542e+02
cbind(age, age^2, age^3, age^4)age   2.124552e+01
cbind(age, age^2, age^3, age^4)     -5.638593e-01
cbind(age, age^2, age^3, age^4)      6.810688e-03
cbind(age, age^2, age^3, age^4)     -3.203830e-05
Predictions
We can now create a grid of values for age and generate predictions and confidence intervals with the predict() function.
> agelims=range(age)
> age.grid=seq(from=agelims[1],to=agelims[2])
> preds=predict(fit,newdata=list(age=age.grid),se=TRUE)
> se.bands=cbind(preds$fit+2*preds$se.fit,
+ preds$fit-2*preds$se.fit)

Plot
We can plot the data, predictions, and confidence interval.
> library(ggplot2)
> ggplot()+
+ geom_point(mapping = aes(x=age,y=wage),
+ col="darkgrey", cex=0.5)+
+ scale_x_continuous(breaks=seq(round(agelims[1],-1),
+ round(agelims[2],-1),10))+
+ geom_line(mapping = aes(x=age.grid,y=preds$fit),lwd=1)+
+ geom_line(mapping = aes(x=age.grid,y=se.bands[,1]),
+ lwd=1, lty=3, col="blue")+
+ geom_line(mapping = aes(x=age.grid,y=se.bands[,2]),
+ lwd=1, lty=3, col="blue")+
+ ggtitle("Degree-4 Polynomial")+
+ theme_classic()+
+ theme(plot.title = element_text(hjust = 0.5))

Fitted Values
Although fit and fit2 have different coefficient estimates, we can see that they return the same fitted values. We can use near() from dplyr to check that the maximum absolute difference is sufficiently close to 0.
> library(dplyr)
> preds2=predict(fit2,newdata=list(age=age.grid),se=TRUE)
> near(max(abs(preds$fit-preds2$fit)),0)
[1] TRUE
Polynomial Degree Selection
We can use the anova() function to select the polynomial degree. It tests the null hypothesis that a model \(\mathcal{M}_1\) is sufficient to explain the data against the alternative hypothesis that a more complex model \(\mathcal{M}_2\) is required. The predictors in \(\mathcal{M}_1\) must be a subset of those in \(\mathcal{M}_2\).
> fit.1=lm(wage~age,data=Wage)
> fit.2=lm(wage~poly(age,2),data=Wage)
> fit.3=lm(wage~poly(age,3),data=Wage)
> fit.4=lm(wage~poly(age,4),data=Wage)
> fit.5=lm(wage~poly(age,5),data=Wage)
> anova(fit.1,fit.2,fit.3,fit.4,fit.5)
Analysis of Variance Table
Model 1: wage ~ age
Model 2: wage ~ poly(age, 2)
Model 3: wage ~ poly(age, 3)
Model 4: wage ~ poly(age, 4)
Model 5: wage ~ poly(age, 5)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2998 5022216
2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
3 2996 4777674 1 15756 9.8888 0.001679 **
4 2995 4771604 1 6070 3.8098 0.051046 .
5 2994 4770322 1 1283 0.8050 0.369682
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- The p-value comparing model 1 to model 2 is essentially zero, so we reject the null hypothesis that model 1 is sufficient.
- The p-value comparing model 2 to model 3 is also below 0.05, so model 2 is not sufficient either.
- The p-value comparing model 3 to model 4 is just above 0.05, so we fail to reject the null hypothesis: model 3 is sufficient.
- The p-value comparing model 4 to model 5 is well above 0.05, so we again fail to reject the null hypothesis: model 4 is sufficient.

We could have obtained the same p-values with the poly() function (since it creates orthogonal polynomials).
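A sketch of the call that would produce the degree-5 coefficient table below, using fit.5 from above:
> coef(summary(fit.5))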
Estimate Std. Error t value Pr(>|t|)
(Intercept) 111.70361 0.7287647 153.2780243 0.000000e+00
poly(age, 5)1 447.06785 39.9160847 11.2001930 1.491111e-28
poly(age, 5)2 -478.31581 39.9160847 -11.9830341 2.367734e-32
poly(age, 5)3 125.52169 39.9160847 3.1446392 1.679213e-03
poly(age, 5)4 -77.91118 39.9160847 -1.9518743 5.104623e-02
poly(age, 5)5 -35.81289 39.9160847 -0.8972045 3.696820e-01
The squares of the t-statistics are equal to the F-statistics from the anova() function; for example, \((-1.9519)^2 \approx 3.81\), the F-statistic for the comparison of model 3 to model 4.
However, the ANOVA method is more flexible: it works whether or not the polynomials are orthogonal, and it also works when the models contain other terms.
> fit.1=lm(wage~education+age,data=Wage)
> fit.2=lm(wage~education+poly(age,2),data=Wage)
> fit.3=lm(wage~education+poly(age,3),data=Wage)
> anova(fit.1,fit.2,fit.3)
Analysis of Variance Table
Model 1: wage ~ education + age
Model 2: wage ~ education + poly(age, 2)
Model 3: wage ~ education + poly(age, 3)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2994 3867992
2 2993 3725395 1 142597 114.6969 <2e-16 ***
3 2992 3719809 1 5587 4.4936 0.0341 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Prediction
We can predict whether an individual earns more than $250,000 per year. Wrapping the condition in I() creates a binary response (TRUE = 1 and FALSE = 0), and family="binomial" fits a logistic regression. After we fit the model we can use predict() as before.
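A sketch of the fit and prediction calls (reusing the fit and preds names from the earlier examples):
> fit=glm(I(wage>250)~poly(age,4),data=Wage,family="binomial")
> preds=predict(fit,newdata=list(age=age.grid),se=TRUE)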
By default we get predictions for the logit:
\[\log \left(\frac{\Pr(Y=1\mid X)}{1-\Pr(Y=1\mid X)}\right)=X\beta\]
In order to obtain probabilities and confidence intervals for them, use the transformation:
\[\Pr(Y=1\mid X)=\frac{\exp(X\beta)}{1+\exp(X\beta)}\]
> pfit=exp(preds$fit)/(1+exp(preds$fit))
> se.bands.logit = cbind(preds$fit+2*preds$se.fit,
+ preds$fit-2*preds$se.fit)
> se.bands = exp(se.bands.logit)/(1+exp(se.bands.logit))
You could use type="response" to get the probabilities directly, but the corresponding standard errors would not make sense (the resulting intervals could include negative probabilities).
Prediction Plot
> plot(age,I(wage>250),xlim=agelims,type="n",ylim=c(0,.2))
> points(jitter(age), I((wage>250)/5),cex=.5,
+ pch="|",col="darkgrey")
> lines(age.grid,pfit,lwd=2, col="blue")
> matlines(age.grid,se.bands,lwd=1,col="blue",lty=3)

The wage values above $250K are drawn as gray marks at the top of the plot, and those below $250K at the bottom; jitter() scatters the values slightly so that they do not all overlap.

Step Function
You can fit a step function with cut(), which bins a numeric variable into a factor.
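A sketch of the binning call behind the counts below:
> table(cut(age,4))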
(17.9,33.5] (33.5,49] (49,64.5] (64.5,80.1]
750 1399 779 72
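And a sketch of the step-function fit itself (reusing the name fit):
> fit=lm(wage~cut(age,4),data=Wage)
> coef(summary(fit))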
Estimate Std. Error t value Pr(>|t|)
(Intercept) 94.158392 1.476069 63.789970 0.000000e+00
cut(age, 4)(33.5,49] 24.053491 1.829431 13.148074 1.982315e-38
cut(age, 4)(49,64.5] 23.664559 2.067958 11.443444 1.040750e-29
cut(age, 4)(64.5,80.1] 7.640592 4.987424 1.531972 1.256350e-01
cut() automatically picks the cutpoints, but you can use the breaks argument to set them manually. lm() then creates dummy variables from the resulting factor. The age<33.5 category is left out, so the intercept represents the average salary of those under 33.5 years of age.

Regression Spline Model
We use the splines library. bs() generates a matrix of basis functions with the specified set of knots; by default, cubic splines are produced. bs() also has a degree argument if a cubic spline is not desired.
> library(splines)
> fit=lm(wage~bs(age,knots=c(25,40,60)),data=Wage)
> pred=predict(fit,newdata=list(age=age.grid),se.fit = TRUE)

Plot
> plot(age,wage,col="gray")
> lines(age.grid,pred$fit,lwd=2)
> lines(age.grid,pred$fit+2*pred$se,lty="dashed", col="blue")
> lines(age.grid,pred$fit-2*pred$se,lty="dashed", col="blue")

We can also use the df argument to produce a spline with knots at uniform quantiles of the data.
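For example, both bases below have 3000 rows and 6 columns; the dim() calls are a sketch inferred to match this output:
> dim(bs(age,knots=c(25,40,60)))
[1] 3000 6
> dim(bs(age,df=6))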
[1] 3000 6
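With df=6 the knots are placed at the quartiles of age; an attr() call (again a sketch) displays them:
> attr(bs(age,df=6),"knots")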
25% 50% 75%
33.75 42.00 51.00
Natural Spline
A natural spline can be fit with the ns() function; knots can be supplied directly through its knots option, or chosen via df as before.
> fit2=lm(wage~ns(age,df=4),data=Wage)
> pred2=predict(fit2,newdata=list(age=age.grid),se.fit = TRUE)
> plot(age,wage,col="gray")
> lines(age.grid, pred2$fit,col="red",lwd=2)

Smoothing Spline
A smoothing spline can be fit with the smooth.spline() function. If we specify df=16, it determines the value of \(\lambda\) that leads to 16 degrees of freedom; with cv=TRUE it instead selects \(\lambda\) by cross-validation.
> plot(age,wage,xlim=agelims,cex=.5,col="darkgrey")
> title("Smoothing Spline")
> fit=smooth.spline(age,wage,df=16)
> fit2=smooth.spline(age,wage,cv=TRUE)
> fit2$df
[1] 6.794596
> lines(fit,col="red",lwd=2)
> lines(fit2,col="blue",lwd=2)
> legend("topright",legend=c("16 DF","6.8 DF"),
+ col=c("red","blue"),lty=1,lwd=2,cex=.8)Local Regression
Local regression can be performed with the loess() function; the locfit library could also be used. The span argument sets the fraction of observations used for each local fit: the larger the span, the smoother the curve.
> plot(age,wage,xlim=agelims,cex=.5,col="darkgrey")
> title("Local Regression")
> fit=loess(wage~age,span=.2,data=Wage)
> fit2=loess(wage~age,span=.5,data=Wage)
> lines(age.grid,predict(fit,data.frame(age=age.grid)),
+ col="red",lwd=2)
> lines(age.grid,predict(fit2,data.frame(age=age.grid)),
+ col="blue",lwd=2)
> legend("topright",legend=c("Span=0.2","Span=0.5"),
+ col=c("red","blue"),lty=1,lwd=2,cex=.8)