Solutions to the exercises from Chapter 7

Conceptual

Q1. It was mentioned in the chapter that a cubic regression spline with one knot at \(\xi\) can be obtained using a basis of the form \(x\), \(x^2\), \(x^3\), \((x - \xi)^3_+\), where \((x - \xi)^3_+ = (x - \xi)^3\) if \(x > \xi\) and equals \(0\) otherwise. We will now show that a function of the form \[f(x) = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \beta_4(x - \xi)^3_+\] is indeed a cubic regression spline, regardless of the values of \(\beta_0,\beta_1,\beta_2,\beta_3,\beta_4\).

  1. Find a cubic polynomial \[f_1(x) = a_1 + b_1x + c_1x^2 + d_1x^3\] such that \(f(x) = f_1(x)\) for all \(x\le\xi\). Express \(a_1,b_1,c_1,d_1\) in terms of \(\beta_0,\beta_1,\beta_2,\beta_3,\beta_4\).

For \(x\le\xi\), we have \[f(x) = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3,\] so we take \(a_1 = \beta_0\), \(b_1 = \beta_1\), \(c_1 = \beta_2\) and \(d_1 = \beta_3\).

  1. Find a cubic polynomial \[f_2(x) = a_2 + b_2x + c_2x^2 + d_2x^3\] such that \(f(x) = f_2(x)\) for all \(x>\xi\). Express \(a_2,b_2,c_2,d_2\) in terms of \(\beta_0,\beta_1,\beta_2,\beta_3,\beta_4\). We have now established that \(f(x)\) is a piecewise polynomial.

For \(x>\xi\), we have \[f(x) = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \beta_4(x - \xi)^3 = (\beta_0 - \beta_4\xi^3) + (\beta_1 + 3\xi^2\beta_4)x + (\beta_2 - 3\beta_4\xi)x^2 + (\beta_3 + \beta_4)x^3,\] so we take \(a_2 = \beta_0 - \beta_4\xi^3\), \(b_2 = \beta_1 + 3\xi^2\beta_4\), \(c_2 = \beta_2 - 3\beta_4\xi\) and \(d_2 = \beta_3 + \beta_4\).

  1. Show that \(f_1(\xi) = f_2(\xi)\). That is, \(f(x)\) is continuous at \(\xi\).

We have immediately that \[f_1(\xi) = \beta_0 + \beta_1\xi + \beta_2\xi^2 + \beta_3\xi^3\] and \[f_2(\xi) = (\beta_0 - \beta_4\xi^3) + (\beta_1 + 3\xi^2\beta_4)\xi + (\beta_2 - 3\beta_4\xi)\xi^2 + (\beta_3 + \beta_4)\xi^3 = \beta_0 + \beta_1\xi + \beta_2\xi^2 + \beta_3\xi^3.\]

  1. Show that \(f_1'(\xi) = f_2'(\xi)\). That is, \(f'(x)\) is continuous at \(\xi\).

We also have immediately that \[f_1'(\xi) = \beta_1 + 2\beta_2\xi + 3\beta_3\xi^2\] and \[f_2'(\xi) = \beta_1 + 3\xi^2\beta_4 + 2(\beta_2 - 3\beta_4\xi)\xi + 3(\beta_3 + \beta_4)\xi^2 = \beta_1 + 2\beta_2\xi + 3\beta_3\xi^2.\]

  1. Show that \(f_1''(\xi) = f_2''(\xi)\). That is, \(f''(x)\) is continuous at \(\xi\). Therefore, \(f(x)\) is indeed a cubic spline.

We finally have that \[f_1''(\xi) = 2\beta_2 + 6\beta_3\xi\] and \[f_2''(\xi) = 2(\beta_2 - 3\beta_4\xi) + 6(\beta_3 + \beta_4)\xi = 2\beta_2 + 6\beta_3\xi.\]
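
As a quick numerical sanity check of the results above, we can pick arbitrary coefficients and a knot and verify in R that \(f_1\), \(f_2\) and their first two derivatives agree at \(\xi\) (the coefficient values below are arbitrary; all three differences should be essentially zero).

beta <- c(1, -2, 0.5, 3, -1.5) # arbitrary beta_0, ..., beta_4
xi <- 2
f1 <- function(x) beta[1] + beta[2] * x + beta[3] * x^2 + beta[4] * x^3
f2 <- function(x) (beta[1] - beta[5] * xi^3) + (beta[2] + 3 * beta[5] * xi^2) * x +
    (beta[3] - 3 * beta[5] * xi) * x^2 + (beta[4] + beta[5]) * x^3
f1p <- function(x) beta[2] + 2 * beta[3] * x + 3 * beta[4] * x^2
f2p <- function(x) (beta[2] + 3 * beta[5] * xi^2) + 2 * (beta[3] - 3 * beta[5] * xi) * x +
    3 * (beta[4] + beta[5]) * x^2
c(f1(xi) - f2(xi), # continuity of f
  f1p(xi) - f2p(xi), # continuity of f'
  (2 * beta[3] + 6 * beta[4] * xi) - (2 * (beta[3] - 3 * beta[5] * xi) + 6 * (beta[4] + beta[5]) * xi)) # continuity of f''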

Q2. Suppose that a curve \(\hat{g}\) is computed to smoothly fit a set of \(n\) points using the following formula \[\hat{g} = \arg\min_g\Biggl(\sum_{i=1}^n(y_i - g(x_i))^2 + \lambda\int[g^{(m)}(x)]^2dx\Biggr),\] where \(g^{(m)}\) represents the \(m\)th derivative of \(g\) (and \(g^{(0)} = g\)). Provide example sketches of \(\hat{g}\) in each of the following scenarios.

  1. \(\lambda = \infty\), \(m = 0\).

In this case \(\hat{g} = 0\) because a large smoothing parameter forces \(g^{(0)}(x)\rightarrow 0\).

  1. \(\lambda = \infty\), \(m = 1\).

In this case \(\hat{g} = c\) because a large smoothing parameter forces \(g^{(1)}(x)\rightarrow 0\).

  1. \(\lambda = \infty\), \(m = 2\).

In this case \(\hat{g} = cx + d\) because a large smoothing parameter forces \(g^{(2)}(x)\rightarrow 0\).

  1. \(\lambda = \infty\), \(m = 3\).

In this case \(\hat{g} = cx^2 + dx + e\) because a large smoothing parameter forces \(g^{(3)}(x)\rightarrow 0\).

  1. \(\lambda = 0\), \(m = 3\).

The penalty term plays no role, so in this case \(\hat{g}\) is the curve that interpolates all of the observations exactly (an interpolating spline); a small simulated illustration of the two extremes is sketched below.
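
As an illustrative sketch of the two extremes, we use smooth.spline(), which penalizes the second derivative (the \(m = 2\) case), on simulated data; the exact shapes depend on the seed. A very large \(\lambda\) gives essentially a straight line, while a \(\lambda\) near zero gives a wiggly curve that nearly interpolates the points.

set.seed(1)
x <- sort(runif(50))
y <- sin(2 * pi * x) + rnorm(50, sd = 0.2)
plot(x, y, col = "darkgrey")
lines(smooth.spline(x, y, lambda = 1e6), col = "red", lwd = 2) # lambda very large: essentially linear
lines(smooth.spline(x, y, all.knots = TRUE, lambda = 1e-10), col = "blue", lwd = 2) # lambda near 0: nearly interpolating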

Q3. Suppose we fit a curve with basis functions \(b_1(X) = X\), \(b_2(X) = (X - 1)^2I(X\ge 1)\). We fit the linear regression model \[Y = \beta_0 + \beta_1b_1(X) + \beta_2b_2(X) + \varepsilon,\] and obtain coefficient estimates \(\hat{\beta}_0 = 1\), \(\hat{\beta}_1 = 1\), \(\hat{\beta}_2 = -2\). Sketch the estimated curve between \(X = -2\) and \(X = 2\). Note the intercepts, slopes, and other relevant information.

x <- seq(-2, 2, length.out = 200)
y <- 1 + x - 2 * (x - 1)^2 * (x >= 1) # f(x) = 1 + x - 2 (x - 1)^2 I(x >= 1)
plot(x, y, type = "l")

The curve is linear between \(-2\) and \(1\), where \(y = 1 + x\) (intercept \(1\), slope \(1\)), and quadratic between \(1\) and \(2\), where \(y = 1 + x - 2(x - 1)^2\). The two pieces join continuously at the knot \(x = 1\), where \(y = 2\).

Q4. Suppose we fit a curve with basis functions \(b_1(X) = I(0\le X\le 2) - (X - 1)I(1\le X\le 2)\), \(b_2(X) = (X - 3)I(3\le X\le 4) + I(4\le X\le 5)\). We fit the linear regression model \[Y = \beta_0 + \beta_1b_1(X) + \beta_2b_2(X) + \varepsilon,\] and obtain coefficient estimates \(\hat{\beta}_0 = 1\), \(\hat{\beta}_1 = 1\), \(\hat{\beta}_2 = 3\). Sketch the estimated curve between \(X = -2\) and \(X = 2\). Note the intercepts, slopes, and other relevant information.

x <- seq(-2, 2, length.out = 200)
b1 <- (x >= 0 & x <= 2) - (x - 1) * (x >= 1 & x <= 2)
b2 <- (x - 3) * (x >= 3 & x <= 4) + (x >= 4 & x <= 5) # identically zero on [-2, 2]
y <- 1 + 1 * b1 + 3 * b2
plot(x, y, type = "l")

The curve is constant and equal to \(1\) between \(-2\) and \(0\), jumps to the constant value \(2\) between \(0\) and \(1\), and is linear between \(1\) and \(2\), where \(y = 3 - x\) (slope \(-1\)).

Q5. Consider two curves, \(\hat{g}_1\) and \(\hat{g}_2\), defined by \[\hat{g}_1 = \arg\min_g\Biggl(\sum_{i=1}^n(y_i - g(x_i))^2 + \lambda\int[g^{(3)}(x)]^2dx\Biggr)\] \[\hat{g}_2 = \arg\min_g\Biggl(\sum_{i=1}^n(y_i - g(x_i))^2 + \lambda\int[g^{(4)}(x)]^2dx\Biggr)\] where \(g^{(m)}\) represents the \(m\)th derivative of \(g\).

  1. As \(\lambda\rightarrow\infty\), will \(\hat{g}_1\) or \(\hat{g}_2\) have the smaller training RSS?

As \(\lambda\rightarrow\infty\), the penalty term dominates, so \(\hat{g}_1\) is forced to satisfy \(g^{(3)} = 0\) (a polynomial of degree at most \(2\)), while \(\hat{g}_2\) is forced to satisfy \(g^{(4)} = 0\) (a polynomial of degree at most \(3\)). Being the more flexible of the two, \(\hat{g}_2\) is expected to have the smaller training RSS.

  1. As \(\lambda\rightarrow\infty\), will \(\hat{g}_1\) or \(\hat{g}_2\) have the smaller test RSS?

As mentioned above, \(\hat{g}_2\) is more flexible, so it is more likely to overfit the data; we would therefore expect \(\hat{g}_1\) to have the smaller test RSS.

  1. For \(\lambda = 0\), will \(\hat{g}_1\) or \(\hat{g}_2\) have the smaller training and test RSS?

If \(\lambda = 0\), the penalty terms vanish and both criteria reduce to minimizing the RSS, so \(\hat{g}_1 = \hat{g}_2\) and they have the same training and test RSS.

Applied

Q6. In this exercise, you will further analyze the “Wage” data set considered throughout this chapter.

  1. Perform polynomial regression to predict “wage” using “age”. Use cross-validation to select the optimal degree \(d\) for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

We will perform K-fold cross-validation with \(K = 10\).

library(ISLR)
library(boot)
set.seed(1)
deltas <- rep(NA, 10)
for (i in 1:10) {
    fit <- glm(wage ~ poly(age, i), data = Wage)
    deltas[i] <- cv.glm(Wage, fit, K = 10)$delta[1]
}
plot(1:10, deltas, xlab = "Degree", ylab = "Test MSE", type = "l")
d.min <- which.min(deltas)
points(which.min(deltas), deltas[which.min(deltas)], col = "red", cex = 2, pch = 20)

We may see that \(d = 4\) is the optimal degree for the polynomial. We now use ANOVA to test the null hypothesis that a model \(\mathcal{M}_1\) is sufficient to explain the data against the alternative hypothesis that a more complex model \(\mathcal{M}_2\) is required.

fit1 <- lm(wage ~ age, data = Wage)
fit2 <- lm(wage ~ poly(age, 2), data = Wage)
fit3 <- lm(wage ~ poly(age, 3), data = Wage)
fit4 <- lm(wage ~ poly(age, 4), data = Wage)
fit5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit1, fit2, fit3, fit4, fit5)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 ** 
## 4   2995 4771604  1      6070   3.8098  0.051046 .  
## 5   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We may see, by examining the p-values, that either a cubic or a quartic polynomial appears to provide a reasonable fit to the data, while lower- or higher-order models are not justified. It remains to plot the resulting polynomial fit to the data.

plot(wage ~ age, data = Wage, col = "darkgrey")
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
fit <- lm(wage ~ poly(age, 3), data = Wage)
preds <- predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)

  1. Fit a step function to predict “wage” using “age”, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

We will perform K-fold cross-validation with \(K = 10\).

cvs <- rep(NA, 10)
for (i in 2:10) {
    Wage$age.cut <- cut(Wage$age, i)
    fit <- glm(wage ~ age.cut, data = Wage)
    cvs[i] <- cv.glm(Wage, fit, K = 10)$delta[1]
}
plot(2:10, cvs[-1], xlab = "Cuts", ylab = "Test MSE", type = "l")
d.min <- which.min(cvs)
points(which.min(cvs), cvs[which.min(cvs)], col = "red", cex = 2, pch = 20)

We may see that the error is minimum for 8 cuts. Now, we fit the entire data with a step function using 8 cuts and plot it.

plot(wage ~ age, data = Wage, col = "darkgrey")
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
fit <- glm(wage ~ cut(age, 8), data = Wage)
preds <- predict(fit, data.frame(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)

Q7. The “Wage” data set contains a number of other features not explored in this chapter, such as marital status (“maritl”), job class (“jobclass”), and others. Explore the relationships between some of these other predictors and “wage”, and use non-linear fitting techniques in order to fit flexible models to the data. Create plots of the results obtained, and write a summary of your findings.

set.seed(1)
summary(Wage$maritl)
## 1. Never Married       2. Married       3. Widowed      4. Divorced 
##              648             2074               19              204 
##     5. Separated 
##               55
summary(Wage$jobclass)
##  1. Industrial 2. Information 
##           1544           1456
par(mfrow = c(1, 2))
plot(Wage$maritl, Wage$wage)
plot(Wage$jobclass, Wage$wage)

We may conclude that married individuals earn more money on average, and that workers in information jobs earn more on average than those in industrial jobs. We will now fit GAMs to predict “wage” using a local regression of “year”, a smoothing spline of “age”, and the qualitative predictors “education”, “jobclass” and “maritl”.

library(gam)
## Loading required package: splines
## Loaded gam 1.09.1
fit0 <- gam(wage ~ lo(year, span = 0.7) + s(age, 5) + education, data = Wage)
fit1 <- gam(wage ~ lo(year, span = 0.7) + s(age, 5) + education + jobclass, data = Wage)
fit2 <- gam(wage ~ lo(year, span = 0.7) + s(age, 5) + education + maritl, data = Wage)
fit3 <- gam(wage ~ lo(year, span = 0.7) + s(age, 5) + education + jobclass + maritl, data = Wage)
anova(fit0, fit1, fit2, fit3)
## Analysis of Deviance Table
## 
## Model 1: wage ~ lo(year, span = 0.7) + s(age, 5) + education
## Model 2: wage ~ lo(year, span = 0.7) + s(age, 5) + education + jobclass
## Model 3: wage ~ lo(year, span = 0.7) + s(age, 5) + education + maritl
## Model 4: wage ~ lo(year, span = 0.7) + s(age, 5) + education + jobclass + 
##     maritl
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1    2987.1    3691855                          
## 2    2986.1    3679689  1    12166 0.0014637 ** 
## 3    2983.1    3597526  3    82163  9.53e-15 ***
## 4    2982.1    3583675  1    13852 0.0006862 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA indicates that adding “jobclass” and adding “maritl” each produce a significant improvement, so model “fit3”, which includes both, is preferred.

par(mfrow = c(3, 3))
plot(fit3, se = T, col = "blue")

Q8. Fit some of the non-linear models investigated in this chapter to the “Auto” data set. Is there evidence for non-linear relationships in this data set? Create some informative plots to justify your answer.

set.seed(1)
pairs(Auto)

We may see that “mpg” is negatively correlated with “cylinders”, “displacement”, “horsepower” and “weight”. We begin by performing polynomial regression of “mpg” on “displacement”.

deltas <- rep(NA, 15)
for (i in 1:15) {
    fit <- glm(mpg ~ poly(displacement, i), data = Auto)
    deltas[i] <- cv.glm(Auto, fit, K = 10)$delta[1]
}
plot(1:15, deltas, xlab = "Degree", ylab = "Test MSE", type = "l")
d.min <- which.min(deltas)
points(which.min(deltas), deltas[which.min(deltas)], col = "red", cex = 2, pch = 20)

We may see that \(d = 11\) is the optimal degree for the polynomial. Now we use step functions.

cvs <- rep(NA, 10)
for (i in 2:10) {
    Auto$dis.cut <- cut(Auto$displacement, i)
    fit <- glm(mpg ~ dis.cut, data = Auto)
    cvs[i] <- cv.glm(Auto, fit, K = 10)$delta[1]
}
plot(2:10, cvs[-1], xlab = "Cuts", ylab = "Test MSE", type = "l")
d.min <- which.min(cvs)
points(which.min(cvs), cvs[which.min(cvs)], col = "red", cex = 2, pch = 20)

We may see that the error is minimum for 9 cuts. Now we use spline functions.

library(splines)
cvs <- rep(NA, 10)
for (i in 3:10) {
    fit <- glm(mpg ~ ns(displacement, df = i), data = Auto)
    cvs[i] <- cv.glm(Auto, fit, K = 10)$delta[1]
}
plot(3:10, cvs[-c(1, 2)], xlab = "Cuts", ylab = "Test MSE", type = "l")
d.min <- which.min(cvs)
points(which.min(cvs), cvs[which.min(cvs)], col = "red", cex = 2, pch = 20)

We may see that the error is minimum for 9 degrees of freedom. It remains to use GAM.

fit <- gam(mpg ~ s(displacement, 4) + s(horsepower, 4), data = Auto)
summary(fit)
## 
## Call: gam(formula = mpg ~ s(displacement, 4) + s(horsepower, 4), data = Auto)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2982  -2.1592  -0.4394   2.1247  17.0946 
## 
## (Dispersion Parameter for gaussian family taken to be 15.3543)
## 
##     Null Deviance: 23818.99 on 391 degrees of freedom
## Residual Deviance: 5880.697 on 382.9999 degrees of freedom
## AIC: 2194.05 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##                     Df  Sum Sq Mean Sq F value  Pr(>F)    
## s(displacement, 4)   1 15254.9 15254.9 993.524 < 2e-16 ***
## s(horsepower, 4)     1  1038.4  1038.4  67.632 3.1e-15 ***
## Residuals          383  5880.7    15.4                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                    Npar Df Npar F     Pr(F)    
## (Intercept)                                    
## s(displacement, 4)       3 13.613 1.863e-08 ***
## s(horsepower, 4)         3 15.606 1.349e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Q9. This question uses the variables “dis” (the weighted mean of distances to five Boston employment centers) and “nox” (nitrogen oxides concentration in parts per 10 million) from the “Boston” data. We will treat “dis” as the predictor and “nox” as the response.

  1. Use the “poly()” function to fit a cubic polynomial regression to predict “nox” using “dis”. Report the regression output, and plot the resulting data and polynomial fits.
library(MASS)
set.seed(1)
fit <- lm(nox ~ poly(dis, 3), data = Boston)
summary(fit)
## 
## Call:
## lm(formula = nox ~ poly(dis, 3), data = Boston)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.121130 -0.040619 -0.009738  0.023385  0.194904 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.554695   0.002759 201.021  < 2e-16 ***
## poly(dis, 3)1 -2.003096   0.062071 -32.271  < 2e-16 ***
## poly(dis, 3)2  0.856330   0.062071  13.796  < 2e-16 ***
## poly(dis, 3)3 -0.318049   0.062071  -5.124 4.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared:  0.7148, Adjusted R-squared:  0.7131 
## F-statistic: 419.3 on 3 and 502 DF,  p-value: < 2.2e-16
dislims <- range(Boston$dis)
dis.grid <- seq(from = dislims[1], to = dislims[2], by = 0.1)
preds <- predict(fit, list(dis = dis.grid))
plot(nox ~ dis, data = Boston, col = "darkgrey")
lines(dis.grid, preds, col = "red", lwd = 2)

We may conclude that all polynomial terms are significant.
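
Optionally, we can also overlay approximate 95% confidence bands around the cubic fit; this is just a rough sketch using the standard errors returned by predict() (the \(\pm 2\) SE bands are approximate).

preds.se <- predict(fit, list(dis = dis.grid), se.fit = TRUE)
plot(nox ~ dis, data = Boston, col = "darkgrey")
lines(dis.grid, preds.se$fit, col = "red", lwd = 2)
lines(dis.grid, preds.se$fit + 2 * preds.se$se.fit, col = "red", lty = 2)
lines(dis.grid, preds.se$fit - 2 * preds.se$se.fit, col = "red", lty = 2)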

  1. Plot the polynomial fits for a range of different polynomial degrees (say, from 1 to 10), and report the associated residual sum of squares.
rss <- rep(NA, 10)
for (i in 1:10) {
    fit <- lm(nox ~ poly(dis, i), data = Boston)
    rss[i] <- sum(fit$residuals^2)
}
plot(1:10, rss, xlab = "Degree", ylab = "RSS", type = "l")

The training RSS decreases monotonically with the degree of the polynomial, so it is minimized by the polynomial of degree 10.

  1. Perform cross-validation or another approach to select the optimal degree for the polynomial, and explain your results.
deltas <- rep(NA, 10)
for (i in 1:10) {
    fit <- glm(nox ~ poly(dis, i), data = Boston)
    deltas[i] <- cv.glm(Boston, fit, K = 10)$delta[1]
}
plot(1:10, deltas, xlab = "Degree", ylab = "Test MSE", type = "l")

We may see that a polynomial of degree 4 minimizes the test MSE.
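
As a short follow-up sketch, we can refit the CV-selected degree-4 polynomial and overlay it on the data (the selected degree can vary slightly with the random folds).

fit.d4 <- lm(nox ~ poly(dis, 4), data = Boston)
plot(nox ~ dis, data = Boston, col = "darkgrey")
lines(dis.grid, predict(fit.d4, list(dis = dis.grid)), col = "blue", lwd = 2)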

  1. Use the “bs()” function to fit a regression spline to predict “nox” using “dis”. Report the output for the fit using four degrees of freedom. How did you choose the knots? Plot the resulting fit.
fit <- lm(nox ~ bs(dis, knots = c(4, 7, 11)), data = Boston)
summary(fit)
## 
## Call:
## lm(formula = nox ~ bs(dis, knots = c(4, 7, 11)), data = Boston)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.124567 -0.040355 -0.008702  0.024740  0.192920 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    0.73926    0.01331  55.537  < 2e-16 ***
## bs(dis, knots = c(4, 7, 11))1 -0.08861    0.02504  -3.539  0.00044 ***
## bs(dis, knots = c(4, 7, 11))2 -0.31341    0.01680 -18.658  < 2e-16 ***
## bs(dis, knots = c(4, 7, 11))3 -0.26618    0.03147  -8.459 3.00e-16 ***
## bs(dis, knots = c(4, 7, 11))4 -0.39802    0.04647  -8.565  < 2e-16 ***
## bs(dis, knots = c(4, 7, 11))5 -0.25681    0.09001  -2.853  0.00451 ** 
## bs(dis, knots = c(4, 7, 11))6 -0.32926    0.06327  -5.204 2.85e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06185 on 499 degrees of freedom
## Multiple R-squared:  0.7185, Adjusted R-squared:  0.7151 
## F-statistic: 212.3 on 6 and 499 DF,  p-value: < 2.2e-16
preds <- predict(fit, list(dis = dis.grid))
plot(nox ~ dis, data = Boston, col = "darkgrey")
lines(dis.grid, preds, col = "red", lwd = 2)

We placed the knots at \(dis = 4\), \(7\) and \(11\), which split the range of “dis” (roughly \(1\) to \(12\)) into approximately equal intervals. From the summary above we may conclude that all terms in the spline fit are significant.

  1. Now fit a regression spline for a range of degrees of freedom, and plot the resulting fits and report the resulting RSS. Describe the results obtained.
rss <- rep(NA, 16)
for (i in 3:16) {
    fit <- lm(nox ~ bs(dis, df = i), data = Boston)
    rss[i] <- sum(fit$residuals^2)
}
plot(3:16, rss[-c(1, 2)], xlab = "Degrees of freedom", ylab = "RSS", type = "l")

We may see that RSS decreases until 14 and then slightly increases after that.

  1. Perform cross-validation or another approach in order to select the best degrees of freedom for a regression spline on this data. Describe your results.
cv <- rep(NA, 16)
for (i in 3:16) {
    fit <- glm(nox ~ bs(dis, df = i), data = Boston)
    cv[i] <- cv.glm(Boston, fit, K = 10)$delta[1]
}
## Warning in bs(dis, degree = 3L, knots = numeric(0), Boundary.knots =
## c(1.1296, : some 'x' values beyond boundary knots may cause
## ill-conditioned bases
## (This warning is repeated, with varying knot locations, for every choice of
## degrees of freedom and every cross-validation fold: some folds contain
## values of 'dis' outside the boundary knots of the corresponding training
## fold. The remaining repetitions are omitted.)
plot(3:16, cv[-c(1, 2)], xlab = "Degrees of freedom", ylab = "Test MSE", type = "l")

Test MSE is minimum for 10 degrees of freedom.
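
As a short follow-up sketch, we can refit the regression spline with the CV-selected degrees of freedom and plot it (the selected value may change with the cross-validation seed).

fit.df10 <- lm(nox ~ bs(dis, df = 10), data = Boston)
plot(nox ~ dis, data = Boston, col = "darkgrey")
lines(dis.grid, predict(fit.df10, list(dis = dis.grid)), col = "red", lwd = 2)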

Q10. This question relates to the “College” data set.

  1. Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
library(leaps)
set.seed(1)
attach(College)
train <- sample(length(Outstate), length(Outstate) / 2)
test <- -train
College.train <- College[train, ]
College.test <- College[test, ]
fit <- regsubsets(Outstate ~ ., data = College.train, nvmax = 17, method = "forward")
fit.summary <- summary(fit)
par(mfrow = c(1, 3))
plot(fit.summary$cp, xlab = "Number of variables", ylab = "Cp", type = "l")
min.cp <- min(fit.summary$cp)
std.cp <- sd(fit.summary$cp)
abline(h = min.cp + 0.2 * std.cp, col = "red", lty = 2)
abline(h = min.cp - 0.2 * std.cp, col = "red", lty = 2)
plot(fit.summary$bic, xlab = "Number of variables", ylab = "BIC", type='l')
min.bic <- min(fit.summary$bic)
std.bic <- sd(fit.summary$bic)
abline(h = min.bic + 0.2 * std.bic, col = "red", lty = 2)
abline(h = min.bic - 0.2 * std.bic, col = "red", lty = 2)
plot(fit.summary$adjr2, xlab = "Number of variables", ylab = "Adjusted R2", type = "l", ylim = c(0.4, 0.84))
max.adjr2 <- max(fit.summary$adjr2)
std.adjr2 <- sd(fit.summary$adjr2)
abline(h = max.adjr2 + 0.2 * std.adjr2, col = "red", lty = 2)
abline(h = max.adjr2 - 0.2 * std.adjr2, col = "red", lty = 2)

Cp, BIC and adjusted \(R^2\) show that size 6 is the minimum subset size for which the scores are within 0.2 standard deviations of the optimum.
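
The same rule can be applied programmatically; the sketch below reuses the quantities computed above (the 0.2-standard-deviation threshold is a heuristic choice, not a standard criterion).

cp.ok <- which(fit.summary$cp <= min.cp + 0.2 * std.cp)
bic.ok <- which(fit.summary$bic <= min.bic + 0.2 * std.bic)
adjr2.ok <- which(fit.summary$adjr2 >= max.adjr2 - 0.2 * std.adjr2)
min(Reduce(intersect, list(cp.ok, bic.ok, adjr2.ok))) # smallest size satisfying all three criteria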

fit <- regsubsets(Outstate ~ ., data = College, method = "forward")
coeffs <- coef(fit, id = 6)
names(coeffs)
## [1] "(Intercept)" "PrivateYes"  "Room.Board"  "PhD"         "perc.alumni"
## [6] "Expend"      "Grad.Rate"
  1. Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
fit <- gam(Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate, df = 2), data=College.train)
par(mfrow = c(2, 3))
plot(fit, se = T, col = "blue")

  1. Evaluate the model obtained on the test set, and explain the results obtained.
preds <- predict(fit, College.test)
err <- mean((College.test$Outstate - preds)^2)
err
## [1] 3745460
tss <- mean((College.test$Outstate - mean(College.test$Outstate))^2)
test.r2 <- 1 - err / tss
test.r2
## [1] 0.7696916

We obtain a test \(R^2\) of about \(0.77\) using the GAM with 6 predictors.
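
For comparison, a rough sketch of the test \(R^2\) of an ordinary linear model with the same six predictors is shown below (we do not report the number here, since it depends on the particular train/test split).

fit.lm <- lm(Outstate ~ Private + Room.Board + PhD + perc.alumni + Expend + Grad.Rate, data = College.train)
preds.lm <- predict(fit.lm, College.test)
1 - mean((College.test$Outstate - preds.lm)^2) / tss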

  1. For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(fit)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, 
##     df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate, 
##     df = 2), data = College.train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -4977.74 -1184.52    58.33  1220.04  7688.30 
## 
## (Dispersion Parameter for gaussian family taken to be 3300711)
## 
##     Null Deviance: 6221998532 on 387 degrees of freedom
## Residual Deviance: 1231165118 on 373 degrees of freedom
## AIC: 6941.542 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##                         Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private                  1 1779433688 1779433688 539.106 < 2.2e-16 ***
## s(Room.Board, df = 2)    1 1221825562 1221825562 370.171 < 2.2e-16 ***
## s(PhD, df = 2)           1  382472137  382472137 115.876 < 2.2e-16 ***
## s(perc.alumni, df = 2)   1  328493313  328493313  99.522 < 2.2e-16 ***
## s(Expend, df = 5)        1  416585875  416585875 126.211 < 2.2e-16 ***
## s(Grad.Rate, df = 2)     1   55284580   55284580  16.749 5.232e-05 ***
## Residuals              373 1231165118    3300711                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                        Npar Df  Npar F     Pr(F)    
## (Intercept)                                         
## Private                                             
## s(Room.Board, df = 2)        1  3.5562   0.06010 .  
## s(PhD, df = 2)               1  4.3421   0.03786 *  
## s(perc.alumni, df = 2)       1  1.9158   0.16715    
## s(Expend, df = 5)            4 16.8636 1.016e-12 ***
## s(Grad.Rate, df = 2)         1  3.7208   0.05450 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA shows strong evidence of a non-linear relationship between “Outstate” and “Expend”, and moderately strong evidence (p-values close to 0.05) of non-linear relationships between “Outstate” and “Grad.Rate” or “PhD”.