Fit with linear regression and orthogonal polynomials:
library(ISLR)
fit <- lm(wage ~ poly(age, 4), data = Wage)
coef(summary(fit))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 111.70361 0.7287409 153.283015 0.000000e+00
poly(age, 4)1 447.06785 39.9147851 11.200558 1.484604e-28
poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
poly(age, 4)3 125.52169 39.9147851 3.144742 1.678622e-03
poly(age, 4)4 -77.91118 39.9147851 -1.951938 5.103865e-02
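As a quick check (an aside, not part of the lab): the columns produced by poly() are orthonormal, so their cross-product matrix is the identity:
round(crossprod(poly(Wage$age, 4)), 10)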
fit2 <- lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)
coef(summary(fit2))
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.841542e+02 6.004038e+01 -3.067172 0.0021802539
poly(age, 4, raw = TRUE)1 2.124552e+01 5.886748e+00 3.609042 0.0003123618
poly(age, 4, raw = TRUE)2 -5.638593e-01 2.061083e-01 -2.735743 0.0062606446
poly(age, 4, raw = TRUE)3 6.810688e-03 3.065931e-03 2.221409 0.0263977518
poly(age, 4, raw = TRUE)4 -3.203830e-05 1.641359e-05 -1.951938 0.0510386498
coef(fit2) # same coefficients as reported by coef(summary(fit2)) above
(Intercept) poly(age, 4, raw = TRUE)1 poly(age, 4, raw = TRUE)2 poly(age, 4, raw = TRUE)3
-1.841542e+02 2.124552e+01 -5.638593e-01 6.810688e-03
poly(age, 4, raw = TRUE)4
-3.203830e-05
Fit the raw polynomials with another form of the formula, using I() to protect each power:
fit2a <- lm(wage ~ age + I(age^2) + I(age^3) + I(age^4), data=Wage)
coef(fit2a)
(Intercept) age I(age^2) I(age^3) I(age^4)
-1.841542e+02 2.124552e+01 -5.638593e-01 6.810688e-03 -3.203830e-05
Fit with a formula based on cbind():
fit2b <- lm(wage ~ cbind(age, age^2, age^3, age^4), data=Wage)
coef(fit2b)
(Intercept) cbind(age, age^2, age^3, age^4)age cbind(age, age^2, age^3, age^4)
-1.841542e+02 2.124552e+01 -5.638593e-01
cbind(age, age^2, age^3, age^4) cbind(age, age^2, age^3, age^4)
6.810688e-03 -3.203830e-05
Predict wage over a grid of age values:
age.range <- range(Wage$age)
age.grid <- seq(from = age.range[1], to = age.range[2])
preds <- predict(fit, list(age = age.grid), se = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
Plot the data and add the fit from the degree-4 polynomial:
par(mfrow = c(1, 2), mar = c(4.5, 4.5, 1, 1), oma = c(0, 0, 4, 0))
plot(Wage$age, Wage$wage, xlim = age.range, cex = 0.5, col = "darkgrey")
title("Degree-4 Polynomial ", outer = T)
lines(age.grid, preds$fit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)
Whether we produce an orthogonal set of basis functions with poly() or a non-orthogonal one with I(x^n), the predictions are the same, because both bases span the same space of degree-4 polynomials:
preds2 <- predict(fit2, list(age = age.grid), se = TRUE)
max(abs(preds$fit - preds2$fit))
[1] 1.627143e-12
Find the best model with ANOVA:
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
Analysis of Variance Table
Model 1: wage ~ age
Model 2: wage ~ poly(age, 2)
Model 3: wage ~ poly(age, 3)
Model 4: wage ~ poly(age, 4)
Model 5: wage ~ poly(age, 5)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2998 5022216
2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
3 2996 4777674 1 15756 9.8888 0.001679 **
4 2995 4771604 1 6070 3.8098 0.051046 .
5 2994 4770322 1 1283 0.8050 0.369682
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Hence, either a cubic or a quartic polynomial appears to provide a reasonable fit to the data; lower- or higher-order models are not justified.
The same p-values can be computed directly from a single poly() fit:
coef(summary(fit.5))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 111.70361 0.7287647 153.2780243 0.000000e+00
poly(age, 5)1 447.06785 39.9160847 11.2001930 1.491111e-28
poly(age, 5)2 -478.31581 39.9160847 -11.9830341 2.367734e-32
poly(age, 5)3 125.52169 39.9160847 3.1446392 1.679213e-03
poly(age, 5)4 -77.91118 39.9160847 -1.9518743 5.104623e-02
poly(age, 5)5 -35.81289 39.9160847 -0.8972045 3.696820e-01
Notice that the p-values are the same.
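In fact, the square of each t-statistic equals the corresponding F-statistic in the ANOVA table above; for example, for the degree-5 term:
(-0.8972045)^2  # matches the F value 0.8050 for Model 5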
The ANOVA method works whether or not we used orthogonal polynomials; it also works when we have other terms in the model as well. For example, we can use anova() to compare these three models:
fit.1 <- lm(wage ~ education + age, data = Wage)
fit.2 <- lm(wage ~ education + poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ education + poly(age, 3), data = Wage)
anova(fit.1, fit.2, fit.3)
Analysis of Variance Table
Model 1: wage ~ education + age
Model 2: wage ~ education + poly(age, 2)
Model 3: wage ~ education + poly(age, 3)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2994 3867992
2 2993 3725395 1 142597 114.6969 <2e-16 ***
3 2992 3719809 1 5587 4.4936 0.0341 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Predicting whether an individual earns more than $250,000 per year:
fit <- glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = 'binomial')
preds <- predict(fit, list(age = age.grid), se = TRUE)
pfit <- exp(preds$fit) / (1 + exp(preds$fit))
se.bands.logit <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
se.bands <- exp(se.bands.logit) / (1 + exp(se.bands.logit))
plot(Wage$age, I(Wage$wage > 250), xlim = age.range, type = 'n', ylim = c(0, 0.2))
points(jitter(Wage$age), I((Wage$wage > 250) / 5), cex = 0.5, pch = '|', col = 'darkgray')
lines(age.grid, pfit, lwd = 2, col = 'blue')
matlines(age.grid, se.bands, lwd = 1, col = 'blue', lty = 3)
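Equivalently (a sketch of an alternative, not the book's approach), one can predict directly on the probability scale with type = 'response'; the fitted probabilities agree, though standard-error bands computed on that scale could fall outside [0, 1], which is why the logit-scale bands above are preferred:
preds.resp <- predict(fit, list(age = age.grid), type = 'response', se = TRUE)
max(abs(pfit - preds.resp$fit))  # essentially zero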
Fit with a step function:
table(cut(Wage$age, 4))
(17.9,33.5] (33.5,49] (49,64.5] (64.5,80.1]
750 1399 779 72
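cut() picked these breakpoints automatically; the breaks argument lets us choose our own cutpoints instead (the values below are an arbitrary illustration):
table(cut(Wage$age, breaks = c(17, 25, 40, 65, 81)))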
fit <- lm(wage ~ cut(age, 4), data = Wage)
coef(summary(fit))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 94.158392 1.476069 63.789970 0.000000e+00
cut(age, 4)(33.5,49] 24.053491 1.829431 13.148074 1.982315e-38
cut(age, 4)(49,64.5] 23.664559 2.067958 11.443444 1.040750e-29
cut(age, 4)(64.5,80.1] 7.640592 4.987424 1.531972 1.256350e-01
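A quick sketch (not shown above) of plotting this step-function fit with its standard-error bands:
preds.step <- predict(fit, list(age = age.grid), se = TRUE)
# re-evaluating cut(age, 4) on age.grid reproduces the same breakpoints,
# because age.grid spans the same range as Wage$age
plot(Wage$age, Wage$wage, xlim = age.range, cex = 0.5, col = 'darkgrey')
lines(age.grid, preds.step$fit, lwd = 2, col = 'blue')
matlines(age.grid, cbind(preds.step$fit + 2 * preds.step$se.fit,
                         preds.step$fit - 2 * preds.step$se.fit),
         lwd = 1, col = 'blue', lty = 3)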
Fit wage on age with a cubic regression spline (bs()), then with a natural spline (ns()) with 4 degrees of freedom:
library(splines)
fit <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
preds <- predict(fit, list(age = age.grid), se = TRUE)
plot(Wage$age, Wage$wage, col = 'gray')
lines(age.grid, preds$fit, lwd = 2)
lines(age.grid, preds$fit + 2 * preds$se, lty = 'dashed')
lines(age.grid, preds$fit - 2 * preds$se, lty = 'dashed')
fit2 <- lm(wage ~ ns(age, df = 4), data = Wage)
preds2 <- predict(fit2, list(age = age.grid), se = TRUE)
lines(age.grid, preds2$fit, col = 'red', lwd = 2)
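With df = 4, ns() chooses the knots automatically at quantiles of age; they can be inspected through the 'knots' attribute:
attr(ns(Wage$age, df = 4), "knots")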
Fit with a smoothing spline:
plot(Wage$age, Wage$wage, xlim = age.range, cex = 0.5, col = 'darkgray')
title("Smoothing Spline")
fit <- smooth.spline(Wage$age, Wage$wage, df = 16)
fit2 <- smooth.spline(Wage$age, Wage$wage, cv = TRUE)
Warning message:
In smooth.spline(Wage$age, Wage$wage, cv = TRUE) :
  cross-validation with non-unique 'x' values seems doubtful
fit2$df
[1] 6.794596
lines(fit, col = 'red', lwd = 2)
lines(fit2, col = 'blue', lwd = 2)
legend("topright", legend = c("16 DF", "6.8 DF"), col = c('red', 'blue'), lty = 1, lwd =2, cex = 0.8)
Fit with local regression:
plot(Wage$age, Wage$wage, xlim = age.range, cex = 0.5, col = 'darkgray')
title('Local Regression')
fit <- loess(wage ~ age, span = .2, data = Wage)
fit2 <- loess(wage ~ age, span = .5, data = Wage)
lines(age.grid, predict(fit, data.frame(age = age.grid)), col = 'red', lwd = 2)
lines(age.grid, predict(fit2, data.frame(age = age.grid)), col = 'blue', lwd = 2)
legend('topright', legend = c('Span = 0.2', 'Span = 0.5'), col = c('red', 'blue'), lty = 1, lwd = 2, cex = .8)
A note on the bs() function: bs(x, knots = knots, degree = degree, intercept = TRUE, Boundary.knots = range(x)) returns a matrix with length(x) rows and length(knots) + degree + 1 columns. Here x is a numeric vector, for example seq(0, 3, by = 0.01) (it need not be sorted), and knots is a set of real numbers, called knots, lying within the range covered by x (for example c(0.5, 1.7) when range(x) is [0, 3]). Each column of the matrix is a spline function; this set of functions is uniquely determined by range(x), the knots, and the spline degree, and every spline has continuous derivatives up to order degree - 1 at the knots. The following code plots the 5 quadratic splines (each piece of the form \(ax^2 + bx + c\)) on [0, 3] with knots at 0.5 and 1.7:
degree <- 2
x <- sample(seq(0, 3, by = 0.01)) # shuffle x to demonstrate the order (of x) is irrelevant
iknots <- c(0.5, 1.7)
par(mfrow = c(2, 3))
ybs <- bs(x, knots = iknots, degree = degree, intercept = TRUE)
ncol(ybs) == degree + length(iknots) + 1
[1] TRUE
for (i in 1 : ncol(ybs)) {
plot(x, ybs[, i], ylab = 'y', main = paste('bs: deg =', degree, 'i = ', i))
}
Changing degree to 1 shows clearly that the curves remain continuous at 0.5 and 1.7; as degree rises, derivatives of higher and higher order stay continuous at the knots, so the curves look increasingly smooth.
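For example, here is the same demo rerun with degree = 1 (a small sketch along the lines just described); the four basis functions are piecewise linear, continuous but kinked at the knots:
ybs1 <- bs(x, knots = iknots, degree = 1, intercept = TRUE)
par(mfrow = c(2, 2))  # 1 + length(iknots) + 1 = 4 basis functions
for (i in 1 : ncol(ybs1)) {
  plot(x, ybs1[, i], ylab = 'y', main = paste('bs: deg = 1, i =', i))
}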
The functions generated by bs() are called B-spline basis functions; a basis function plays the same role as a basis vector in a vector space. The functions of each degree are defined by the Cox-de Boor recursion formula. Its base case (degree 0: \(N_{i,0}(u)\)) is the set of step functions on the subintervals separated by the knots, and each recursion step raises the degree in the variable u by one, so that, for example, \(N_{i,3}(u)\) is a cubic polynomial in u. The link above walks through the hand calculation of the degree-0 to degree-2 basis functions on [0, 3] with knots at 1 and 2.
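For reference, the recursion is usually written as follows (standard statement, with \(t_i\) denoting the padded knot sequence and 0/0 read as 0): \[ N_{i,0}(u) = \begin{cases} 1 & t_i \le u < t_{i+1} \\ 0 & \text{otherwise,} \end{cases} \qquad N_{i,d}(u) = \frac{u - t_i}{t_{i+d} - t_i} N_{i,d-1}(u) + \frac{t_{i+d+1} - u}{t_{i+d+1} - t_{i+1}} N_{i+1,d-1}(u) \]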
The following code (based on "A very short note on B-splines") implements the Cox-de Boor recursion formula in R:
basis <- function(x, degree, i, knots) {
  if (degree == 0) {
    # base case: indicator function of the i-th knot interval
    if ((x < knots[i + 1]) & (x >= knots[i]))
      y <- 1
    else
      y <- 0
  } else {
    # recursion; a vanishing denominator (repeated knots) is treated as 0
    if ((knots[degree + i] - knots[i]) == 0) {
      temp1 <- 0
    } else {
      temp1 <- (x - knots[i]) / (knots[degree + i] - knots[i])
    }
    if ((knots[i + degree + 1] - knots[i + 1]) == 0) {
      temp2 <- 0
    } else {
      temp2 <- (knots[i + degree + 1] - x) / (knots[i + degree + 1] - knots[i + 1])
    }
    y <- temp1 * basis(x, (degree - 1), i, knots) +
         temp2 * basis(x, (degree - 1), (i + 1), knots)
  }
  return(y)
}
newbs <- function(x, degree, inner.knots, Boundary.knots) {
  Boundary.knots <- sort(Boundary.knots)
  # pad each boundary knot (degree + 1) times, as bs() does
  knots <- c(rep(Boundary.knots[1], (degree + 1)),
             sort(inner.knots),
             rep(Boundary.knots[2], (degree + 1)))
  np <- degree + length(inner.knots) + 1
  s <- rep(0, np)
  if (x == Boundary.knots[2]) {
    # the right boundary point belongs to the last basis function
    s[np] <- 1
  } else {
    for (i in 1 : np)
      s[i] <- basis(x, degree, i, knots)
  }
  return(s)
}
lines <- degree + length(iknots) + 1
par(mfrow = c(2, 3))
y <- sapply(x, newbs, degree = degree, inner.knots = iknots, Boundary.knots = range(x))
for (i in 1 : lines) {
plot(x, y[i, ], ylab = 'y', main = paste('newbs: deg =', degree, 'i = ', i))
}
We can verify that this implementation matches the output of bs():
lines == ncol(ybs)
[1] TRUE
max(abs(t(ybs) - y)) < 1e-10 # demonstrate y and ybs agree
[1] TRUE
Why does the set of functions generated by bs() have length(knots) + degree + 1 columns; in other words, why is its degrees of freedom length(knots) + degree + 1?
A single piece of a degree-\(d\) spline has \(d + 1\) degrees of freedom: \(\beta_0 + \beta_1 x + \dots + \beta_d x^d\). \(K\) knots divide the domain into \(K + 1\) pieces, for \((d + 1) \times (K + 1)\) parameters in total, while each knot imposes \(d\) constraints (equality of the derivatives of order 0 through \(d - 1\)). The final degrees of freedom are the total parameters minus the total constraints: \[ (K + 1) \times (d + 1) - K \times d = K + d + 1 \] In the demo above, \(K = 2\) and \(d = 2\), giving \(3 \times 3 - 2 \times 2 = 5\) basis functions.
This is why Section 7.4.2 says that a cubic spline has \(K + 4\) degrees of freedom:
In general, a cubic spline with K knots uses a total of 4 + K degrees of freedom.
and why Section 7.4.3 says:
… we perform least squares regression with an intercept and \(3 + K\) predictors, of the form \(X, X^2, X^3, h(X, \xi_1), h(X, \xi_2), \dots, h(X, \xi_K)\), where \(\xi_1, \dots, \xi_K\) are the knots.
Here \(K\) is length(iknots), the degree is 3, and the remaining 1 is the intercept.
Fit with natural splines and linear regression:
gam1 <- lm(wage ~ ns(year, 4) + ns(age, 5) + education, data = Wage)
summary(gam1)
Call:
lm(formula = wage ~ ns(year, 4) + ns(age, 5) + education, data = Wage)
Residuals:
Min 1Q Median 3Q Max
-120.513 -19.608 -3.583 14.112 214.535
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 46.949 4.704 9.980 < 2e-16 ***
ns(year, 4)1 8.625 3.466 2.488 0.01289 *
ns(year, 4)2 3.762 2.959 1.271 0.20369
ns(year, 4)3 8.127 4.211 1.930 0.05375 .
ns(year, 4)4 6.806 2.397 2.840 0.00455 **
ns(age, 5)1 45.170 4.193 10.771 < 2e-16 ***
ns(age, 5)2 38.450 5.076 7.575 4.78e-14 ***
ns(age, 5)3 34.239 4.383 7.813 7.69e-15 ***
ns(age, 5)4 48.678 10.572 4.605 4.31e-06 ***
ns(age, 5)5 6.557 8.367 0.784 0.43328
education2. HS Grad 10.983 2.430 4.520 6.43e-06 ***
education3. Some College 23.473 2.562 9.163 < 2e-16 ***
education4. College Grad 38.314 2.547 15.042 < 2e-16 ***
education5. Advanced Degree 62.554 2.761 22.654 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 35.16 on 2986 degrees of freedom
Multiple R-squared: 0.293, Adjusted R-squared: 0.2899
F-statistic: 95.2 on 13 and 2986 DF, p-value: < 2.2e-16
par(mfrow = c(1,3))
library(gam)
plot.Gam(gam1, se = TRUE, col = 'red')
Note that the function name is plot.Gam(), not plot.gam().
Fit with smoothing splines using gam():
gam.m3 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
par(mfrow = c(1,3))
plot(gam.m3, se = TRUE, col = 'blue')
Compare three nested models with ANOVA:
gam.m1 <- gam(wage ~ s(age, 5) + education, data = Wage)
gam.m2 <- gam(wage ~ year + s(age, 5) + education, data = Wage)
anova(gam.m1, gam.m2, gam.m3, test = "F")
Analysis of Deviance Table
Model 1: wage ~ s(age, 5) + education
Model 2: wage ~ year + s(age, 5) + education
Model 3: wage ~ s(year, 4) + s(age, 5) + education
Resid. Df Resid. Dev Df Deviance F Pr(>F)
1 2990 3711731
2 2989 3693842 1 17889.2 14.4771 0.0001447 ***
3 2986 3689770 3 4071.1 1.0982 0.3485661
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Based on the p-values, adding a linear term in year improves on gam.m1 (p ≈ 0.0001), while a non-linear function of year is not needed (p ≈ 0.35); so gam.m2 is preferred. To test whether there is a non-linear relationship between a feature and the response:
summary(gam.m3)
Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
Deviance Residuals:
Min 1Q Median 3Q Max
-119.43 -19.70 -3.33 14.17 213.48
(Dispersion Parameter for gaussian family taken to be 1235.69)
Null Deviance: 5222086 on 2999 degrees of freedom
Residual Deviance: 3689770 on 2986 degrees of freedom
AIC: 29887.75
Number of Local Scoring Iterations: 2
Anova for Parametric Effects
Df Sum Sq Mean Sq F value Pr(>F)
s(year, 4) 1 27162 27162 21.981 2.877e-06 ***
s(age, 5) 1 195338 195338 158.081 < 2.2e-16 ***
education 4 1069726 267432 216.423 < 2.2e-16 ***
Residuals 2986 3689770 1236
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Anova for Nonparametric Effects
Npar Df Npar F Pr(F)
(Intercept)
s(year, 4) 3 1.086 0.3537
s(age, 5) 4 32.380 <2e-16 ***
education
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that the output here differs from the book (top of page 296): here we judge the non-linear relationship by the p-values in the section Anova for Nonparametric Effects, which the book labels DF for Terms and F-values for Nonparametric Effects.
Make predictions on the training set:
preds <- predict(gam.m2, Wage)
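As a quick sanity check (my own aside, not from the lab), the training-set mean squared error of these predictions:
mean((Wage$wage - preds)^2)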
Use local regression in a GAM:
gam.lo <- gam(wage ~ s(year, df = 4) + lo(age, span = 0.7) + education, data = Wage)
plot.Gam(gam.lo, se = TRUE, col = 'green')
Add an interaction term with lo() inside the gam() call:
gam.lo.i <- gam(wage ~ lo(year, age, span = 0.5) + education, data = Wage)
liv too small. (Discovered by lowesd)
lv too small. (Discovered by lowesd)
liv too small. (Discovered by lowesd)
lv too small. (Discovered by lowesd)
library(akima)
plot(gam.lo.i)
Fit a logistic regression GAM:
gam.lr <- gam(I(wage > 250) ~ year + s(age, df = 5) + education, family = binomial, data = Wage)
par(mfrow = c(1,1))
plot(gam.lr, se = TRUE, col = 'green')
See the relationship between a qualitative feature and the response:
table(Wage$education, I(Wage$wage > 250))
FALSE TRUE
1. < HS Grad 268 0
2. HS Grad 966 5
3. Some College 643 7
4. College Grad 663 22
5. Advanced Degree 381 45
Fit a logistic regression GAM using all but a selected category:
gam.lr.s <- gam(I(wage > 250) ~ year + s(age, df = 5) + education, family = binomial, data = Wage, subset = (education != "1. < HS Grad"))
plot(gam.lr.s, se = TRUE, col = 'green')