Applied Exercises of Chapter 7

Question 6

In this exercise, you will further analyze the Wage data set considered throughout this chapter.

(a)

Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

library(ISLR)
library(boot)
set.seed(1)
degree <- 10
cv.errs <- rep(NA, degree)
for (i in 1:degree) {
  fit <- glm(wage ~ poly(age, i), data = Wage)  # glm() with the default gaussian family fits the same model as lm(), but works with cv.glm()
  cv.errs[i] <- cv.glm(Wage, fit)$delta[1]      # LOOCV estimate of the test MSE
}

Plot the test MSE against the polynomial degree:

plot(1:degree, cv.errs, xlab = 'Degree', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], col = 'red', cex = 2, pch = 19)

The test MSE is minimized at degree 9, but the test MSE at degree 4 is already nearly as small. The comparison by ANOVA (anova(fit.1, fit.2, fit.3, fit.4, fit.5), page 290, Section 7.8.1) likewise suggests that a cubic or quartic polynomial is sufficient.
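
For reference, this is the nested-models ANOVA from Section 7.8.1 (a sketch reproducing the book's code on the data loaded above):

fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)  # sequential F-tests of each degree against the next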

Predict with the degree-3 model and plot the fit:

plot(wage ~ age, data = Wage, col = "darkgrey")
age.range <- range(Wage$age)
age.grid <- seq(from = age.range[1], to = age.range[2])
fit <- lm(wage ~ poly(age, 3), data = Wage)
preds <- predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)

(b)

Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

cv.errs <- rep(NA, degree)
for (i in 2:degree) {                # a step function needs at least 2 intervals
  Wage$age.cut <- cut(Wage$age, i)   # store the factor in Wage so cv.glm() can refit it
  fit <- glm(wage ~ age.cut, data = Wage)
  cv.errs[i] <- cv.glm(Wage, fit)$delta[1]
}
plot(2:degree, cv.errs[-1], xlab = 'Cuts', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], col = 'red', cex = 2, pch = 19)

So 8 cuts produce the minimum test MSE.

Predict with the 8-cut step function and plot the fit:

plot(wage ~ age, data = Wage, col = "darkgrey")
fit <- glm(wage ~ cut(age, 8), data = Wage)
preds <- predict(fit, data.frame(age = age.grid))  # both data.frame and list work here
lines(age.grid, preds, col = "red", lwd = 2)

Understand the cut() function:

res <- cut(c(1,5,2,3,8), 2)
res
[1] (0.993,4.5] (4.5,8.01]  (0.993,4.5] (0.993,4.5] (4.5,8.01] 
Levels: (0.993,4.5] (4.5,8.01]
length(res)
[1] 5
class(res[1])
[1] "factor"

cut(x, k) bins a continuous quantitative variable into a discrete qualitative one, dividing the range of x evenly into k intervals. Each interval is called a level. The output of cut(x, k) is a factor vector of the same length as x; each element is the level into which the corresponding input element falls.
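
As an aside, cut() also accepts explicit breakpoints and custom level labels, which is handy when the bins must stay fixed across data sets:

res1 <- cut(c(1, 5, 2, 3, 8), breaks = c(0, 4, 10))          # bin edges chosen by hand
res2 <- cut(c(1, 5, 2, 3, 8), 2, labels = c('low', 'high'))  # two bins with named levels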

Question 7

The Wage data set contains a number of other features not explored in this chapter, such as marital status (maritl), job class (jobclass), and others. Explore the relationships between some of these other predictors and wage, and use non-linear fitting techniques in order to fit flexible models to the data. Create plots of the results obtained, and write a summary of your findings.

See the introduction to regression on qualitative predictors in Sections 3.3.1 and 3.6.6.
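
As a quick refresher, contrasts() shows the dummy-variable coding that lm() and gam() will use for a factor (illustration only, assuming Wage is loaded):

contrasts(Wage$jobclass)  # one dummy column for the two-level factor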

Use summary() to inspect the qualitative predictors:

set.seed(1)
summary(Wage$maritl)
1. Never Married       2. Married       3. Widowed      4. Divorced     5. Separated 
             648             2074               19              204               55 
# table(Wage$maritl) gives the same counts as summary()
summary(Wage$jobclass)
 1. Industrial 2. Information 
          1544           1456 
par(mfrow = c(1, 2))
plot(Wage$maritl, Wage$wage)   # factor on the x-axis: boxplots of wage per level
plot(Wage$jobclass, Wage$wage)

Fit wage on multiple predictors with GAM:

library(gam)
fit1 <- gam(wage ~ lo(year, span = 0.7) + s(age, 5) + education, data = Wage)
fit2 <- gam(wage ~ lo(year, span = 0.7) + s(age, 5) + education + jobclass, data = Wage)
fit3 <- gam(wage ~ lo(year, span = 0.7) + s(age, 5) + education + maritl, data = Wage)
fit4 <- gam(wage ~ lo(year, span = 0.7) + s(age, 5) + education + jobclass + maritl, data = Wage)
anova(fit1, fit2, fit3, fit4)
Analysis of Deviance Table

Model 1: wage ~ lo(year, span = 0.7) + s(age, 5) + education
Model 2: wage ~ lo(year, span = 0.7) + s(age, 5) + education + jobclass
Model 3: wage ~ lo(year, span = 0.7) + s(age, 5) + education + maritl
Model 4: wage ~ lo(year, span = 0.7) + s(age, 5) + education + jobclass + 
    maritl
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1    2987.1    3691855                          
2    2986.1    3679689  1    12166 0.0014637 ** 
3    2983.1    3597526  3    82163  9.53e-15 ***
4    2982.1    3583675  1    13852 0.0006862 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

So model fit4 fits best. (Strictly speaking, fit2 and fit3 are not nested, so the middle comparison in the table should be read with care; but each of jobclass and maritl significantly improves on fit1, and fit4 improves on both, so keeping both variables is justified.)
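
Since fit2 and fit3 are not nested, an information criterion gives a cleaner comparison; assuming the fits inherit glm behavior (as objects from the gam package do), AIC() applies directly:

AIC(fit1, fit2, fit3, fit4)  # lower AIC is better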

Plot the model:

par(mfrow = c(2, 3))
plot(fit4, se = T, col = "blue")

Question 8

Fit some of the non-linear models investigated in this chapter to the Auto data set. Is there evidence for non-linear relationships in this data set? Create some informative plots to justify your answer.

There is no direct way to answer whether nonlinear relationships exist among arbitrary features of a data set. We have to choose one feature as the response and some others as predictors, and then investigate whether the relationships between them are linear.

Get the overall relationship between all pairs of the Auto data set:

set.seed(1)
pairs(Auto)

From the plot above we can see that, when using mpg as the response, there are clear relationships between it and cylinders, displacement, horsepower, and weight. Now we will find out whether those relationships are nonlinear.

fit <- lm(mpg ~ poly(cylinders, 2) + poly(displacement, 5) + poly(horsepower, 5) + poly(weight, 5), data = Auto)
summary(fit)

Call:
lm(formula = mpg ~ poly(cylinders, 2) + poly(displacement, 5) + 
    poly(horsepower, 5) + poly(weight, 5), data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.793  -2.219  -0.183   1.841  17.030 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)             23.4459     0.1937 121.072  < 2e-16 ***
poly(cylinders, 2)1     27.1932    18.4258   1.476 0.140834    
poly(cylinders, 2)2     -0.5902     7.4794  -0.079 0.937143    
poly(displacement, 5)1 -43.8830    19.8805  -2.207 0.027897 *  
poly(displacement, 5)2  16.5805     9.8437   1.684 0.092942 .  
poly(displacement, 5)3  12.7002     7.8850   1.611 0.108095    
poly(displacement, 5)4 -13.1163     5.7039  -2.300 0.022024 *  
poly(displacement, 5)5   2.4590     4.9780   0.494 0.621607    
poly(horsepower, 5)1   -62.5295    12.1728  -5.137 4.51e-07 ***
poly(horsepower, 5)2    21.5799     6.4347   3.354 0.000879 ***
poly(horsepower, 5)3    -8.4355     6.7254  -1.254 0.210526    
poly(horsepower, 5)4     0.9338     4.4089   0.212 0.832378    
poly(horsepower, 5)5     8.5955     4.5355   1.895 0.058841 .  
poly(weight, 5)1       -53.8275    12.9345  -4.162 3.93e-05 ***
poly(weight, 5)2         6.3627     7.0359   0.904 0.366406    
poly(weight, 5)3        -3.1785     5.4532  -0.583 0.560333    
poly(weight, 5)4        -2.0484     4.6025  -0.445 0.656527    
poly(weight, 5)5         1.9338     4.1948   0.461 0.645062    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.834 on 374 degrees of freedom
Multiple R-squared:  0.7692,    Adjusted R-squared:  0.7587 
F-statistic: 73.31 on 17 and 374 DF,  p-value: < 2.2e-16

The results show that there is no significant relationship between mpg and cylinders once the other predictors are included. There is a weak linear relationship (p-value 0.028) between mpg and displacement. There is a strong relationship between mpg and horsepower, with both the linear and the quadratic terms significant. There is a strong linear relationship between mpg and weight.

Test the degree with ANOVA:

anv1 <- gam(mpg ~ displacement + horsepower + weight, data = Auto)
anv2 <- gam(mpg ~ displacement + s(horsepower, 2) + weight, data = Auto)
anv3 <- gam(mpg ~ s(displacement, 5) + s(horsepower, 5) + s(weight, 5), data = Auto)
anova(anv1, anv2, anv3, test = 'F')
Analysis of Deviance Table

Model 1: mpg ~ displacement + horsepower + weight
Model 2: mpg ~ displacement + s(horsepower, 2) + weight
Model 3: mpg ~ s(displacement, 5) + s(horsepower, 5) + s(weight, 5)
  Resid. Df Resid. Dev       Df Deviance       F    Pr(>F)    
1       388     6980.0                                        
2       387     6145.6  0.99991   834.46 57.3356 2.879e-13 ***
3       376     5472.8 10.99990   672.78  4.2021 6.952e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
summary(anv3)

Call: gam(formula = mpg ~ s(displacement, 5) + s(horsepower, 5) + s(weight, 
    5), data = Auto)
Deviance Residuals:
     Min       1Q   Median       3Q      Max 
-10.5671  -2.0665  -0.2317   1.8421  16.3984 

(Dispersion Parameter for gaussian family taken to be 14.5553)

    Null Deviance: 23818.99 on 391 degrees of freedom
Residual Deviance: 5472.787 on 376.0002 degrees of freedom
AIC: 2179.87 

Number of Local Scoring Iterations: 2 

Anova for Parametric Effects
                    Df  Sum Sq Mean Sq  F value    Pr(>F)    
s(displacement, 5)   1 15397.9 15397.9 1057.889 < 2.2e-16 ***
s(horsepower, 5)     1   946.6   946.6   65.038 9.935e-15 ***
s(weight, 5)         1   400.7   400.7   27.528 2.592e-07 ***
Residuals          376  5472.8    14.6                       
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Anova for Nonparametric Effects
                   Npar Df Npar F     Pr(F)    
(Intercept)                                    
s(displacement, 5)       4 5.5978  0.000219 ***
s(horsepower, 5)         4 9.8615 1.349e-07 ***
s(weight, 5)             4 1.1977  0.311372    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
par(mfrow=c(1,3))
plot.Gam(anv3, se=TRUE, col="red")

According to the plots of anv3, try smoothing splines with 3 degrees of freedom for displacement and horsepower, and a linear term for weight:

anv4 <- gam(mpg ~ s(displacement, 3) + s(horsepower, 3) + weight, data = Auto)
anova(anv4, anv3, test = 'F')
Analysis of Deviance Table

Model 1: mpg ~ s(displacement, 3) + s(horsepower, 3) + weight
Model 2: mpg ~ s(displacement, 5) + s(horsepower, 5) + s(weight, 5)
  Resid. Df Resid. Dev     Df Deviance     F  Pr(>F)  
1       384     5688.9                                
2       376     5472.8 7.9999   216.12 1.856 0.06572 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
par(mfrow=c(1,3))
plot(anv4, se=TRUE, col="red")

The p-value (0.066) shows that the more flexible anv3 is not significantly better, so model anv4 is good enough.

Compare their test MSE:

lm1 <- glm(mpg ~ displacement + horsepower + weight, data = Auto)
lm2 <- glm(mpg ~ poly(displacement, 3) + poly(horsepower, 3) + weight, data = Auto)
lm3 <- glm(mpg ~ poly(displacement, 5) + poly(horsepower, 5) + poly(weight, 5), data = Auto)
cv.glm(Auto, lm1, K = 10)$delta[1]
[1] 18.37228
cv.glm(Auto, lm2, K = 10)$delta[1]
[1] 15.62762
cv.glm(Auto, lm3, K = 10)$delta[1]
[1] 15.57547

The results also suggest that model lm2 (analogous to anv4) is good enough.

So the conclusions about the relationships with mpg are: mpg ~ displacement, cubic; mpg ~ horsepower, cubic; mpg ~ weight, linear.

Question 9

This question uses the variables dis (the weighted mean of distances to five Boston employment centers) and nox (nitrogen oxides concentration in parts per 10 million) from the Boston data. We will treat dis as the predictor and nox as the response.

(a)

Use the poly() function to fit a cubic polynomial regression to predict nox using dis. Report the regression output, and plot the resulting data and polynomial fits.

library(MASS)
fit <- lm(nox ~ poly(dis, 3), data = Boston)
summary(fit)

Call:
lm(formula = nox ~ poly(dis, 3), data = Boston)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.121130 -0.040619 -0.009738  0.023385  0.194904 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.554695   0.002759 201.021  < 2e-16 ***
poly(dis, 3)1 -2.003096   0.062071 -32.271  < 2e-16 ***
poly(dis, 3)2  0.856330   0.062071  13.796  < 2e-16 ***
poly(dis, 3)3 -0.318049   0.062071  -5.124 4.27e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.06207 on 502 degrees of freedom
Multiple R-squared:  0.7148,    Adjusted R-squared:  0.7131 
F-statistic: 419.3 on 3 and 502 DF,  p-value: < 2.2e-16
dis.grid <- seq(min(Boston$dis), max(Boston$dis), by = 0.1)
preds <- predict(fit, list(dis = dis.grid), se = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)  # approximate 95% bands
plot(nox ~ dis, data = Boston, col = "darkgrey")
lines(dis.grid, preds$fit, lwd = 2, col = "red")
matlines(dis.grid, se.bands, lwd = 1, col = "red", lty = 3)

Another way to plot the fit:

preds <- predict(fit, list(dis = dis.grid))
plot(nox ~ dis, data = Boston, col = "darkgrey")
lines(dis.grid, preds, col = "red", lwd = 2)

Yet another way to plot the fit:

plot(Boston$dis, Boston$nox)
points(Boston$dis, fit$fitted.values, col = 'red')

(b)

Plot the polynomial fits for a range of different polynomial degrees (say, from 1 to 10), and report the associated residual sum of squares.

rss <- rep(NA, 10)
for (i in 1:10) {
  fit <- lm(nox ~ poly(dis, i), data = Boston)
  rss[i] <- sum(fit$residuals ^ 2)
}
plot(1:10, rss, type = 'l', xlab = "Degree", ylab = "RSS")

(c)

Perform cross-validation or another approach to select the optimal degree for the polynomial, and explain your results.

testMSE <- rep(NA, 10)
for (i in 1:10) {
  fit <- glm(nox ~ poly(dis, i), data = Boston)
  testMSE[i] <- cv.glm(Boston, fit, K = 10)$delta[1]
}
plot(1:10, testMSE, type = 'l', xlab = "Degree", ylab = "Test MSE")
points(which.min(testMSE), testMSE[which.min(testMSE)], col = 'red', pch = 19)

So the optimal degree is 3.

(d)

Use the bs() function to fit a regression spline to predict nox using dis. Report the output for the fit using four degrees of freedom. How did you choose the knots? Plot the resulting fit.

library(splines)
dof <- 4
fit <- lm(nox ~ bs(dis, df = dof), data = Boston)
attr(bs(Boston$dis, df = dof), "knots")
    50% 
3.20745 
preds <- predict(fit, list(dis = dis.grid), se = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)  # approximate 95% bands
plot(nox ~ dis, data = Boston, col = "darkgrey")
lines(dis.grid, preds$fit, lwd = 2, col = "red")
matlines(dis.grid, se.bands, lwd = 1, col = "red", lty = 3)

For bs(), the degrees of freedom (the df argument, stored in dof above) equal the number of knots plus the degree of the piecewise polynomial; with the default degree of 3 and df = 4, there is one knot, which bs() places at the median of dis (3.207).
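
Equivalently, the knot can be supplied explicitly instead of via df (a sketch; fit.knot is our name, and the knot location comes from the attr() output above):

fit.knot <- lm(nox ~ bs(dis, knots = median(Boston$dis)), data = Boston)  # one interior knot at the median, as with df = 4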

(e)

Now fit a regression spline for a range of degrees of freedom, and plot the resulting fits and report the resulting RSS. Describe the results obtained.

res <- c()
df.range <- 3:16
for (dof in df.range) {
  fit <- lm(nox ~ bs(dis, df = dof), data = Boston)
  res <- c(res, sum(fit$residuals ^ 2))
}
plot(df.range, res, type = 'l', xlab = 'degree of freedom', ylab = 'RSS')

From the results, df = 10 is good enough.

(f)

Perform cross-validation or another approach in order to select the best degrees of freedom for a regression spline on this data. Describe your results.

res <- c()
for (dof in df.range) {
  fit <- glm(nox ~ bs(dis, df = dof), data = Boston)
  testMSE <- cv.glm(Boston, fit, K = 10)$delta[1]
  res <- c(res, testMSE)
}
(cv.glm() emits many repeated warnings here — "some 'x' values beyond boundary knots may cause ill-conditioned bases" — because each CV fold can contain dis values outside the knot range of the training folds. They do not prevent the computation.)
plot(df.range, res, type = 'l', xlab = 'degree of freedom', ylab = 'Test MSE')
points(which.min(res) + 2, res[which.min(res)], col = 'red', pch = 19)  # + 2 maps the index into df.range (which starts at 3) back to df

According to test MSE, df = 8 is a good choice.

Question 10

This question relates to the College data set.

(a)

Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

library(ISLR)
library(leaps)
train <- sample(1:nrow(College), nrow(College) / 2)  # no seed is set, so the split (and the outputs below) varies between runs
test <- -train
fit <- regsubsets(Outstate ~ ., data = College, subset = train, method = 'forward')
fit.summary <- summary(fit)
fit.summary
Subset selection object
Call: regsubsets.formula(Outstate ~ ., data = College, subset = train, 
    method = "forward")
17 Variables  (and intercept)
            Forced in Forced out
PrivateYes      FALSE      FALSE
Apps            FALSE      FALSE
Accept          FALSE      FALSE
Enroll          FALSE      FALSE
Top10perc       FALSE      FALSE
Top25perc       FALSE      FALSE
F.Undergrad     FALSE      FALSE
P.Undergrad     FALSE      FALSE
Room.Board      FALSE      FALSE
Books           FALSE      FALSE
Personal        FALSE      FALSE
PhD             FALSE      FALSE
Terminal        FALSE      FALSE
S.F.Ratio       FALSE      FALSE
perc.alumni     FALSE      FALSE
Expend          FALSE      FALSE
Grad.Rate       FALSE      FALSE
1 subsets of each size up to 8
Selection Algorithm: forward
         PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
1  ( 1 ) " "        " "  " "    " "    " "       " "       " "         " "        
2  ( 1 ) " "        " "  " "    " "    " "       " "       " "         " "        
3  ( 1 ) " "        " "  " "    " "    " "       " "       " "         " "        
4  ( 1 ) "*"        " "  " "    " "    " "       " "       " "         " "        
5  ( 1 ) "*"        " "  " "    " "    " "       " "       " "         " "        
6  ( 1 ) "*"        " "  " "    " "    " "       " "       " "         " "        
7  ( 1 ) "*"        " "  " "    " "    " "       " "       " "         " "        
8  ( 1 ) "*"        " "  " "    " "    " "       " "       " "         " "        
         Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
1  ( 1 ) "*"        " "   " "      " " " "      " "       " "         " "    " "      
2  ( 1 ) "*"        " "   " "      " " " "      " "       "*"         " "    " "      
3  ( 1 ) "*"        " "   " "      " " " "      " "       "*"         "*"    " "      
4  ( 1 ) "*"        " "   " "      " " " "      " "       "*"         "*"    " "      
5  ( 1 ) "*"        " "   " "      " " "*"      " "       "*"         "*"    " "      
6  ( 1 ) "*"        " "   " "      " " "*"      " "       "*"         "*"    "*"      
7  ( 1 ) "*"        " "   " "      " " "*"      "*"       "*"         "*"    "*"      
8  ( 1 ) "*"        " "   "*"      " " "*"      "*"       "*"         "*"    "*"      

List the coefficients of the model with 6 predictors:

coef(fit, id = 6)
  (Intercept)    PrivateYes    Room.Board      Terminal   perc.alumni        Expend 
-4558.4951554  2562.9868495     1.0250967    46.2938069    56.3086762     0.1719931 
    Grad.Rate 
   30.5888513 

(b)

Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

gam.mod <- gam(Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), data = College, subset = train)
par(mfrow = c(2,3))
plot(gam.mod, se = TRUE, col = 'blue')

Based on the shapes of the fitted curves, Expend and Grad.Rate appear strongly nonlinear in their relationships with Outstate.

(c)

Evaluate the model obtained on the test set, and explain the results obtained.

preds <- predict(gam.mod, College[test, ])
RSS <- sum((College[test, ]$Outstate - preds)^2) # based on equation (3.16)
TSS <- sum((College[test, ]$Outstate - mean(College[test, ]$Outstate))^2)
1 - (RSS / TSS)   # test R^2 = 1 - RSS/TSS
[1] 0.6902148

The \(R^2\) statistic on the test set is about 0.69.

(d)

For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(gam.mod)

Call: gam(formula = Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 
    5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), 
    data = College, subset = train)
Deviance Residuals:
     Min       1Q   Median       3Q      Max 
-6544.24 -1019.64    69.71  1164.80  7946.32 

(Dispersion Parameter for gaussian family taken to be 3220454)

    Null Deviance: 6601480944 on 387 degrees of freedom
Residual Deviance: 1162581189 on 360.9992 degrees of freedom
AIC: 6943.304 

Number of Local Scoring Iterations: 2 

Anova for Parametric Effects
                   Df     Sum Sq    Mean Sq F value    Pr(>F)    
Private             1 1555118447 1555118447 482.888 < 2.2e-16 ***
s(Room.Board, 5)    1 1430972478 1430972478 444.339 < 2.2e-16 ***
s(Terminal, 5)      1  437128457  437128457 135.735 < 2.2e-16 ***
s(perc.alumni, 5)   1  281540520  281540520  87.423 < 2.2e-16 ***
s(Expend, 5)        1  451236800  451236800 140.116 < 2.2e-16 ***
s(Grad.Rate, 5)     1   81627274   81627274  25.346 7.577e-07 ***
Residuals         361 1162581189    3220454                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Anova for Nonparametric Effects
                  Npar Df  Npar F   Pr(F)    
(Intercept)                                  
Private                                      
s(Room.Board, 5)        4  1.7036 0.14858    
s(Terminal, 5)          4  1.6748 0.15522    
s(perc.alumni, 5)       4  2.0589 0.08569 .  
s(Expend, 5)            4 24.5778 < 2e-16 ***
s(Grad.Rate, 5)         4  3.2353 0.01256 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

“Anova for Nonparametric Effects” shows Expend has a strong non-linear relationship with Outstate. Grad.Rate and perc.alumni have moderate non-linear relationships with Outstate, which coincides with the result of (b).

Question 11

In Section 7.7, it was mentioned that GAMs are generally fit using a backfitting approach. The idea behind backfitting is actually quite simple. We will now explore backfitting in the context of multiple linear regression.

Suppose that we would like to perform multiple linear regression, but we do not have software to do so. Instead, we only have software to perform simple linear regression. Therefore, we take the following iterative approach: we repeatedly hold all but one coefficient estimate fixed at its current value, and update only that coefficient estimate using a simple linear regression. The process is continued until convergence—that is, until the coefficient estimates stop changing.

We now try this out on a toy example.

(a) & (b)

Generate a response \(Y\) and two predictors \(X_1\) and \(X_2\), with \(n = 100\). Initialize \(\hat \beta_1\) to take on a value of your choice. It does not matter what value you choose.

set.seed(1)
y <- rnorm(100)
x1 <- rnorm(100)
x2 <- rnorm(100)
beta1 <- 3.27  # arbitrary starting value; note y is simulated independently of x1 and x2, so the true coefficients are essentially 0

(c)

Keeping the \(\hat \beta_1\) fixed, fit the model: \[ Y - \hat\beta_1 X_1 = \beta_0 + \beta_2 X_2 + \epsilon \]

a <- y - beta1 * x1
beta2 <- lm(a ~ x2)$coef[2]

(d)

Keeping the \(\hat \beta_2\) fixed, fit the model: \[ Y - \hat\beta_2 X_2 = \beta_0 + \beta_1 X_1 + \epsilon \]

a <- y - beta2 * x2
beta1 <- lm(a ~ x1)$coef[2]

(e)

Write a for loop to repeat (c) and (d) 1,000 times. Report the estimates of \(\hat \beta_0\), \(\hat \beta_1\), and \(\hat \beta_2\) at each iteration of the for loop. Create a plot in which each of these values is displayed, with \(\hat \beta_0\), \(\hat \beta_1\), and \(\hat \beta_2\) each shown in a different color.

iter <- 10   # the exercise asks for 1,000 iterations, but the estimates converge within a few
df <- data.frame(0.0, 0.27, 0.0)   # starting values for beta0, beta1, beta2
names(df) <- c('beta0', 'beta1', 'beta2')
for (i in 1:iter) {
  beta1 <- df[nrow(df), 2]
  a <- y - beta1 * x1
  beta2 <- lm(a ~ x2)$coef[2]
  a <- y - beta2 * x2
  beta1 <- lm(a ~ x1)$coef[2]
  beta0 <- lm(a ~ x1)$coef[1]
  print(beta0)
  print(beta1)
  print(beta2)
  df[nrow(df) + 1,] <- list(beta0, beta1, beta2)
}
(Intercept) 
  0.1080682 
         x1 
0.000584017 
        x2 
0.02835083 
(Intercept) 
    0.10841 
           x1 
-7.708576e-05 
        x2 
0.01599065 
(Intercept) 
  0.1084108 
         x1 
-7.8708e-05 
        x2 
0.01596032 
(Intercept) 
  0.1084108 
           x1 
-7.871198e-05 
        x2 
0.01596025 
(Intercept) 
  0.1084108 
           x1 
-7.871199e-05 
        x2 
0.01596025 
(Intercept) 
  0.1084108 
           x1 
-7.871199e-05 
        x2 
0.01596025 
(Intercept) 
  0.1084108 
           x1 
-7.871199e-05 
        x2 
0.01596025 
(Intercept) 
  0.1084108 
           x1 
-7.871199e-05 
        x2 
0.01596025 
(Intercept) 
  0.1084108 
           x1 
-7.871199e-05 
        x2 
0.01596025 
(Intercept) 
  0.1084108 
           x1 
-7.871199e-05 
        x2 
0.01596025 
plot(df$beta0, col = 'red', type = 'l', ylim = range(df), ylab = 'coefficient estimate')  # ylim covers all three series
lines(df$beta1, col = 'blue')
lines(df$beta2, col = 'green')

(f)

Compare your answer in (e) to the results of simply performing multiple linear regression to predict \(Y\) using \(X_1\) and \(X_2\). Use the abline() function to overlay those multiple linear regression coefficient estimates on the plot obtained in (e).

plot(df$beta0, col = 'red', type = 'l', ylim = range(df), ylab = 'coefficient estimate')  # ylim covers all three series
lines(df$beta1, col = 'blue')
lines(df$beta2, col = 'green')
res <- coef(lm(y ~ x1 + x2))
abline(h = res[1], col = 'darkred', lty = 2)
abline(h = res[2], col = 'darkblue', lty = 2)
abline(h = res[3], col = 'darkgreen', lty = 2)

The coefficients from the backfitting iterations and from multiple regression agree to within numerical precision.

(g)

On this data set, how many backfitting iterations were required in order to obtain a “good” approximation to the multiple regression coefficient estimates?

For this data set, about 3 iterations are enough to converge; from the printed estimates above, the values are stable to several decimal places by the third pass.
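
One way to make “good approximation” concrete (a small sketch using the df of estimates built in (e); the 1e-6 threshold is an arbitrary choice of ours):

deltas <- apply(abs(diff(as.matrix(df))), 1, max)  # largest coefficient change per iteration
which(deltas < 1e-6)[1]                            # first iteration where every change is below 1e-6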

Question 12

This problem is a continuation of the previous exercise. In a toy example with \(p = 100\), show that one can approximate the multiple linear regression coefficient estimates by repeatedly performing simple linear regression in a backfitting procedure. How many backfitting iterations are required in order to obtain a “good” approximation to the multiple regression coefficient estimates? Create a plot to justify your answer.

Step 1: give \(\hat \beta_1, \dots, \hat \beta_{99}\) arbitrary initial values.

Step 2: fit the model: \[ Y - \hat\beta_1 X_1 - \dots - \hat\beta_{99} X_{99} = \beta_0 + \beta_{100} X_{100} + \epsilon \] So we have the values of \(\hat \beta_0\) and \(\hat \beta_{100}\).

Step 3: with this new \(\hat\beta_{100}\), fit the model: \[ Y - \hat\beta_1 X_1 - \dots - \hat\beta_{98} X_{98} - \hat\beta_{100}X_{100} = \beta_0 + \beta_{99} X_{99} + \epsilon \]

Repeat the above step, cycling through the predictors, until fitting the model: \[ Y - \hat\beta_2 X_2 - \dots - \hat\beta_{100}X_{100} = \beta_0 + \beta_{1} X_{1} + \epsilon \]

So we have new values of \(\hat \beta_0\) and \(\hat \beta_1\). With this \(\hat \beta_1\), fit the model in Step 2 again.

Repeat this loop until \(\hat\beta_0, \dots, \hat\beta_{100}\) converge.
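
A minimal sketch of this procedure in R (our own illustration: the data are simulated with n = 1000 and p = 100, and all variable names here are arbitrary):

set.seed(1)
n <- 1000; p <- 100
X <- matrix(rnorm(n * p), n, p)
beta.true <- rnorm(p, sd = 2)
y <- drop(X %*% beta.true + rnorm(n))

beta <- rep(0, p)                # arbitrary initial values
lm.coef <- coef(lm(y ~ X))[-1]   # full multiple-regression estimates, for reference
n.iter <- 10
errs <- rep(NA, n.iter)
for (it in 1:n.iter) {
  for (j in 1:p) {                          # one backfitting sweep over all predictors
    a <- y - drop(X[, -j] %*% beta[-j])     # partial residual, holding the other coefficients fixed
    beta[j] <- lm(a ~ X[, j])$coef[2]       # simple-regression update of beta_j
  }
  errs[it] <- max(abs(beta - lm.coef))      # distance from the lm() solution after this sweep
}
plot(1:n.iter, errs, type = 'b', xlab = 'Iteration',
     ylab = 'Max |beta - lm() estimate|')

On data like this the gap typically collapses within a handful of sweeps, mirroring the \(p = 2\) case in Question 11.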
