Assignment 6

library(tidyverse)
library(openintro)

Exercise 6

In this exercise, you will further analyze the `Wage` data set considered throughout this chapter.

\((a)\) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree \(d\) for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

library(ISLR)
library(boot)

## 
## Attaching package: 'boot'

## The following object is masked from 'package:openintro':
## 
##     salinity

attach(Wage)
set.seed(15)
cv.error <- rep(0,5)

for (i in 1:5){
glm.fit <- glm(wage ~ poly(age,i),data=Wage)
cv.error[i]<- cv.glm(Wage,glm.fit,K=10)$delta[1]
}
cv.error

## [1] 1676.192 1599.786 1596.256 1593.755 1595.106

plot(cv.error, type="b", xlab="Degree", ylab="Test MSE")
points(which.min(cv.error), cv.error[4], col="red", pch=20, cex=2)

The optimal degree chosen by cross-validation is 4.

fit_1 <- lm(wage ~ age, data=Wage)
fit_2 <- lm(wage ~ poly(age, 2), data=Wage) 
fit_3 <- lm(wage ~ poly(age, 3), data=Wage) 
fit_4 <- lm(wage ~ poly(age, 4), data=Wage) 
fit_5 <- lm(wage ~ poly(age, 5), data=Wage) 
anova(fit_1, fit_2, fit_3, fit_4, fit_5)

## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 ** 
## 4   2995 4771604  1      6070   3.8098  0.051046 .  
## 5   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value indicates that either a quadratic or cubic fit provide a reasonable fit to the data. A degree fit of 4 is sufficient as well but the previously mentioned degrees are better suited for this model.

age_lim <- range(age)
age_grid <- seq(from=age_lim[1], to=age_lim[2])
preds <- predict(fit_4, newdata=list(age=age_grid),se=TRUE)
se_bands <- cbind(preds$fit+2*preds$se.fit,preds$fit-2*preds$se.fit)

plot(age, wage, xlim=age_lim, cex=.5, col="darkgrey")
lines(age_grid, preds$fit, lwd=2, col="blue")
matlines(age_grid, se_bands, lwd=1, col="blue", lty=3)

\((b)\) Fit a step function to predict `wage` using `age`, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

set.seed(2)

cv.errors <- rep(NA, 10)

for(i in 2:10){
  Wage$age.cut <- cut(Wage$age,i)
  glm.fit <- glm(wage ~ age.cut, data=Wage)
  cv.errors[i] <- cv.glm(Wage, glm.fit, K=10)$delta[1]
}
cv.errors

##  [1]       NA 1734.448 1681.623 1635.869 1632.864 1621.643 1613.148 1601.850
##  [9] 1610.852 1603.607

plot(2:10, cv.errors[-1], type="b", xlab="Number of cuts", ylab="Test MSE")
points(which.min(cv.errors), cv.errors[which.min(cv.errors)], col="red", pch=20, cex=2)

The optimal number of cuts is 8.

fit_step <- glm(wage ~ cut(age, 8), data=Wage)
preds <- predict(fit_step, data.frame(age = age_grid))
plot(age, wage, col="darkgray")
lines(age_grid, preds, col="darkgreen", lwd=2)

Exercise 10

This question relates to the `College` data set.

\((a)\) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

attach(College)

set.seed(5)
test_sample <- sample(1:nrow(College), nrow(College)/4)
train <- College[-test_sample, ]
test <- College[test_sample, ]

library(leaps)

fwd <- regsubsets(Outstate ~ ., data=train, nvmax=17, method='forward')
fwd_sum <- summary(fwd)

par(mfrow=c(2,2))
plot(fwd_sum$cp ,xlab="Number of Variables ", ylab="Cp",
type="b")
points(which.min(fwd_sum$cp), fwd_sum$cp[which.min(fwd_sum$cp)], col="red", 
       cex=2, pch=20)

plot(fwd_sum$bic ,xlab="Number of Variables ", 
ylab="BIC",type="b")
points(which.min(fwd_sum$bic), fwd_sum$bic[which.min(fwd_sum$bic)], col="red", 
       cex=2, pch=20)

plot(fwd_sum$adjr2 ,xlab="Number of Variables ", 
ylab="Adjusted R^2^",type="b")
points(which.max(fwd_sum$adjr2), fwd_sum$adjr2[which.max(fwd_sum$adjr2)], 
       col="red", cex=2, pch=20)

which.min(fwd_sum$cp)

## [1] 13

which.min(fwd_sum$bic)

## [1] 11

which.max(fwd_sum$adjr2)

## [1] 14

test_matrix <- model.matrix(Outstate~., data=test)

val.errors <- rep(NA,17)
for(i in 1:17){
 coefi <- coef(fwd,id=i)
 pred <- test_matrix[,names(coefi)]%*%coefi
 val.errors[i] <- mean((test$Outstate-pred)^2) 
}

which.min(val.errors)

## [1] 6

plot(val.errors, type='b')
points(which.min(val.errors), val.errors[12], col='red', pch=20, cex=2)

The best model selected using the forward stepwise method is one that has 12 variables, but by looking at the plot, we see that a simpler model could be enough, as the test MSE doesn’t decrease significantly after 6 variables.

fwd_full <- regsubsets(Outstate ~ ., data=College, nvmax=17, method='forward')
coef(fwd_full, 6)

##   (Intercept)    PrivateYes    Room.Board           PhD   perc.alumni 
## -3553.2345268  2768.6347025     0.9679086    35.5283359    48.4221031 
##        Expend     Grad.Rate 
##     0.2210255    29.7119093

\((b)\) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

library(gam)

## Loading required package: splines

## Loading required package: foreach

## 
## Attaching package: 'foreach'

## The following objects are masked from 'package:purrr':
## 
##     accumulate, when

## Loaded gam 1.22-5

library(splines)

gam_fit <- gam(Outstate ~ Private + s(Room.Board, 3) + s(Terminal, 3) + s(perc.alumni, 3) + s(Expend, 3) + s(Grad.Rate, 3), data=train)

par(mfrow=c(2,3))
plot(gam_fit, se=TRUE, col="blue")

\((c)\) Evaluate the model obtained on the test set, and explain the results obtained.

preds <- predict(gam_fit, newdata = test)
error <- mean((test$Outstate-preds)^2)

val.errors[6]-error

## [1] 388542.3

The GAM model does better than a simple linear model with 6 variables.

\((d)\) For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(gam_fit)

## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 3) + s(Terminal, 
##     3) + s(perc.alumni, 3) + s(Expend, 3) + s(Grad.Rate, 3), 
##     data = train)
## Deviance Residuals:
##       Min        1Q    Median        3Q       Max 
## -7067.838 -1138.361     4.964  1256.279  8163.609 
## 
## (Dispersion Parameter for gaussian family taken to be 3646032)
## 
##     Null Deviance: 9466747723 on 582 degrees of freedom
## Residual Deviance: 2063655284 on 566.0003 degrees of freedom
## AIC: 10481.86 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                    Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private             1 2545651515 2545651515 698.198 < 2.2e-16 ***
## s(Room.Board, 3)    1 1826513428 1826513428 500.959 < 2.2e-16 ***
## s(Terminal, 3)      1  507722090  507722090 139.253 < 2.2e-16 ***
## s(perc.alumni, 3)   1  341093757  341093757  93.552 < 2.2e-16 ***
## s(Expend, 3)        1  802720318  802720318 220.163 < 2.2e-16 ***
## s(Grad.Rate, 3)     1  141392931  141392931  38.780 9.253e-10 ***
## Residuals         566 2063655284    3646032                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                   Npar Df Npar F   Pr(F)    
## (Intercept)                                 
## Private                                     
## s(Room.Board, 3)        2  2.506 0.08252 .  
## s(Terminal, 3)          2  1.620 0.19880    
## s(perc.alumni, 3)       2  0.997 0.36953    
## s(Expend, 3)            2 45.483 < 2e-16 ***
## s(Grad.Rate, 3)         2  3.191 0.04188 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model suggests a strong non-linear relationship between “Outstate” and “Expend”, and a moderate non-linear relationship between “Outstate” and “Grad.rate”.

LS0tCnRpdGxlOiAiQXNzaWdubWVudCA2IgphdXRob3I6ICJSYW5pIE1pc3JhIgpkYXRlOiAiYHIgU3lzLkRhdGUoKWAiCm91dHB1dDogb3BlbmludHJvOjpsYWJfcmVwb3J0Ci0tLQoKYGBge3IgbG9hZC1wYWNrYWdlcywgbWVzc2FnZT1GQUxTRX0KbGlicmFyeSh0aWR5dmVyc2UpCmxpYnJhcnkob3BlbmludHJvKQpgYGAKCiMjIEV4ZXJjaXNlIDYKIyMjIyMgSW4gdGhpcyBleGVyY2lzZSwgeW91IHdpbGwgZnVydGhlciBhbmFseXplIHRoZSBgV2FnZWAgZGF0YSBzZXQgY29uc2lkZXJlZCB0aHJvdWdob3V0IHRoaXMgY2hhcHRlci4KCiMjIyMjICQoYSkkIFBlcmZvcm0gcG9seW5vbWlhbCByZWdyZXNzaW9uIHRvIHByZWRpY3Qgd2FnZSB1c2luZyBhZ2UuIFVzZSBjcm9zcy12YWxpZGF0aW9uIHRvIHNlbGVjdCB0aGUgb3B0aW1hbCBkZWdyZWUgJGQkIGZvciB0aGUgcG9seW5vbWlhbC4gV2hhdCBkZWdyZWUgd2FzIGNob3NlbiwgYW5kIGhvdyBkb2VzIHRoaXMgY29tcGFyZSB0byB0aGUgcmVzdWx0cyBvZiBoeXBvdGhlc2lzIHRlc3RpbmcgdXNpbmcgQU5PVkE/IE1ha2UgYSBwbG90IG9mIHRoZSByZXN1bHRpbmcgcG9seW5vbWlhbCBmaXQgdG8gdGhlIGRhdGEuCgpgYGB7cn0KbGlicmFyeShJU0xSKQpsaWJyYXJ5KGJvb3QpCmF0dGFjaChXYWdlKQpzZXQuc2VlZCgxNSkKY3YuZXJyb3IgPC0gcmVwKDAsNSkKCmZvciAoaSBpbiAxOjUpewpnbG0uZml0IDwtIGdsbSh3YWdlIH4gcG9seShhZ2UsaSksZGF0YT1XYWdlKQpjdi5lcnJvcltpXTwtIGN2LmdsbShXYWdlLGdsbS5maXQsSz0xMCkkZGVsdGFbMV0KfQpjdi5lcnJvcgoKcGxvdChjdi5lcnJvciwgdHlwZT0iYiIsIHhsYWI9IkRlZ3JlZSIsIHlsYWI9IlRlc3QgTVNFIikKcG9pbnRzKHdoaWNoLm1pbihjdi5lcnJvciksIGN2LmVycm9yWzRdLCBjb2w9InJlZCIsIHBjaD0yMCwgY2V4PTIpCmBgYAoKVGhlIG9wdGltYWwgZGVncmVlIGNob3NlbiBieSBjcm9zcy12YWxpZGF0aW9uIGlzIDQuCgpgYGB7cn0KZml0XzEgPC0gbG0od2FnZSB+IGFnZSwgZGF0YT1XYWdlKQpmaXRfMiA8LSBsbSh3YWdlIH4gcG9seShhZ2UsIDIpLCBkYXRhPVdhZ2UpIApmaXRfMyA8LSBsbSh3YWdlIH4gcG9seShhZ2UsIDMpLCBkYXRhPVdhZ2UpIApmaXRfNCA8LSBsbSh3YWdlIH4gcG9seShhZ2UsIDQpLCBkYXRhPVdhZ2UpIApmaXRfNSA8LSBsbSh3YWdlIH4gcG9seShhZ2UsIDUpLCBkYXRhPVdhZ2UpIAphbm92YShmaXRfMSwgZml0XzIsIGZpdF8zLCBmaXRfNCwgZml0XzUpCmBgYAoKVGhlIHAtdmFsdWUgaW5kaWNhdGVzIHRoYXQgZWl0aGVyIGEgcXVhZHJhdGljIG9yIGN1YmljIGZpdCBwcm92aWRlIGEgcmVhc29uYWJsZSAKZml0IHRvIHRoZSBkYXRhLiBBIGRlZ3JlZSBmaXQgb2YgNCBpcyBzdWZmaWNpZW50IGFzIHdlbGwgYnV0IHRoZSBwcmV2aW91c2x5IAptZW50aW9uZWQgZGVncmVlcyBhcmUgYmV0dGVyIHN1aXRlZCBmb3IgdGhpcyBtb2RlbC4gCgpgYGB7cn0KYWdlX2xpbSA8LSByYW5nZShhZ2UpCmFnZV9ncmlkIDwtIHNlcShmcm9tPWFnZV9saW1bMV0sIHRvPWFnZV9saW1bMl0pCnByZWRzIDwtIHByZWRpY3QoZml0XzQsIG5ld2RhdGE9bGlzdChhZ2U9YWdlX2dyaWQpLHNlPVRSVUUpCnNlX2JhbmRzIDwtIGNiaW5kKHByZWRzJGZpdCsyKnByZWRzJHNlLmZpdCxwcmVkcyRmaXQtMipwcmVkcyRzZS5maXQpCgpwbG90KGFnZSwgd2FnZSwgeGxpbT1hZ2VfbGltLCBjZXg9LjUsIGNvbD0iZGFya2dyZXkiKQpsaW5lcyhhZ2VfZ3JpZCwgcHJlZHMkZml0LCBsd2Q9MiwgY29sPSJibHVlIikKbWF0bGluZXMoYWdlX2dyaWQsIHNlX2JhbmRzLCBsd2Q9MSwgY29sPSJibHVlIiwgbHR5PTMpCmBgYAoKIyMjIyMgJChiKSQgRml0IGEgc3RlcCBmdW5jdGlvbiB0byBwcmVkaWN0IGB3YWdlYCB1c2luZyBgYWdlYCwgYW5kIHBlcmZvcm0gY3Jvc3MtdmFsaWRhdGlvbiB0byBjaG9vc2UgdGhlIG9wdGltYWwgbnVtYmVyIG9mIGN1dHMuIE1ha2UgYSBwbG90IG9mIHRoZSBmaXQgb2J0YWluZWQuCgpgYGB7cn0Kc2V0LnNlZWQoMikKCmN2LmVycm9ycyA8LSByZXAoTkEsIDEwKQoKZm9yKGkgaW4gMjoxMCl7CiAgV2FnZSRhZ2UuY3V0IDwtIGN1dChXYWdlJGFnZSxpKQogIGdsbS5maXQgPC0gZ2xtKHdhZ2UgfiBhZ2UuY3V0LCBkYXRhPVdhZ2UpCiAgY3YuZXJyb3JzW2ldIDwtIGN2LmdsbShXYWdlLCBnbG0uZml0LCBLPTEwKSRkZWx0YVsxXQp9CmN2LmVycm9ycwoKcGxvdCgyOjEwLCBjdi5lcnJvcnNbLTFdLCB0eXBlPSJiIiwgeGxhYj0iTnVtYmVyIG9mIGN1dHMiLCB5bGFiPSJUZXN0IE1TRSIpCnBvaW50cyh3aGljaC5taW4oY3YuZXJyb3JzKSwgY3YuZXJyb3JzW3doaWNoLm1pbihjdi5lcnJvcnMpXSwgY29sPSJyZWQiLCBwY2g9MjAsIGNleD0yKQpgYGAKClRoZSBvcHRpbWFsIG51bWJlciBvZiBjdXRzIGlzIDguCgpgYGB7cn0KZml0X3N0ZXAgPC0gZ2xtKHdhZ2UgfiBjdXQoYWdlLCA4KSwgZGF0YT1XYWdlKQpwcmVkcyA8LSBwcmVkaWN0KGZpdF9zdGVwLCBkYXRhLmZyYW1lKGFnZSA9IGFnZV9ncmlkKSkKcGxvdChhZ2UsIHdhZ2UsIGNvbD0iZGFya2dyYXkiKQpsaW5lcyhhZ2VfZ3JpZCwgcHJlZHMsIGNvbD0iZGFya2dyZWVuIiwgbHdkPTIpCmBgYAoKXG5ld3BhZ2UgCgojIyBFeGVyY2lzZSAxMAojIyMjIyBUaGlzIHF1ZXN0aW9uIHJlbGF0ZXMgdG8gdGhlIGBDb2xsZWdlYCBkYXRhIHNldC4KCiMjIyMjICQoYSkkIFNwbGl0IHRoZSBkYXRhIGludG8gYSB0cmFpbmluZyBzZXQgYW5kIGEgdGVzdCBzZXQuIFVzaW5nIG91dC1vZi1zdGF0ZSB0dWl0aW9uIGFzIHRoZSByZXNwb25zZSBhbmQgdGhlIG90aGVyIHZhcmlhYmxlcyBhcyB0aGUgcHJlZGljdG9ycywgcGVyZm9ybSBmb3J3YXJkIHN0ZXB3aXNlIHNlbGVjdGlvbiBvbiB0aGUgdHJhaW5pbmcgc2V0IGluIG9yZGVyIHRvIGlkZW50aWZ5IGEgc2F0aXNmYWN0b3J5IG1vZGVsIHRoYXQgdXNlcyBqdXN0IGEgc3Vic2V0IG9mIHRoZSBwcmVkaWN0b3JzLgoKYGBge3J9CmF0dGFjaChDb2xsZWdlKQoKc2V0LnNlZWQoNSkKdGVzdF9zYW1wbGUgPC0gc2FtcGxlKDE6bnJvdyhDb2xsZWdlKSwgbnJvdyhDb2xsZWdlKS80KQp0cmFpbiA8LSBDb2xsZWdlWy10ZXN0X3NhbXBsZSwgXQp0ZXN0IDwtIENvbGxlZ2VbdGVzdF9zYW1wbGUsIF0KCmxpYnJhcnkobGVhcHMpCgpmd2QgPC0gcmVnc3Vic2V0cyhPdXRzdGF0ZSB+IC4sIGRhdGE9dHJhaW4sIG52bWF4PTE3LCBtZXRob2Q9J2ZvcndhcmQnKQpmd2Rfc3VtIDwtIHN1bW1hcnkoZndkKQoKcGFyKG1mcm93PWMoMiwyKSkKcGxvdChmd2Rfc3VtJGNwICx4bGFiPSJOdW1iZXIgb2YgVmFyaWFibGVzICIsIHlsYWI9IkNwIiwKdHlwZT0iYiIpCnBvaW50cyh3aGljaC5taW4oZndkX3N1bSRjcCksIGZ3ZF9zdW0kY3Bbd2hpY2gubWluKGZ3ZF9zdW0kY3ApXSwgY29sPSJyZWQiLCAKICAgICAgIGNleD0yLCBwY2g9MjApCgpwbG90KGZ3ZF9zdW0kYmljICx4bGFiPSJOdW1iZXIgb2YgVmFyaWFibGVzICIsIAp5bGFiPSJCSUMiLHR5cGU9ImIiKQpwb2ludHMod2hpY2gubWluKGZ3ZF9zdW0kYmljKSwgZndkX3N1bSRiaWNbd2hpY2gubWluKGZ3ZF9zdW0kYmljKV0sIGNvbD0icmVkIiwgCiAgICAgICBjZXg9MiwgcGNoPTIwKQoKcGxvdChmd2Rfc3VtJGFkanIyICx4bGFiPSJOdW1iZXIgb2YgVmFyaWFibGVzICIsIAp5bGFiPSJBZGp1c3RlZCBSXjJeIix0eXBlPSJiIikKcG9pbnRzKHdoaWNoLm1heChmd2Rfc3VtJGFkanIyKSwgZndkX3N1bSRhZGpyMlt3aGljaC5tYXgoZndkX3N1bSRhZGpyMildLCAKICAgICAgIGNvbD0icmVkIiwgY2V4PTIsIHBjaD0yMCkKCndoaWNoLm1pbihmd2Rfc3VtJGNwKQp3aGljaC5taW4oZndkX3N1bSRiaWMpCndoaWNoLm1heChmd2Rfc3VtJGFkanIyKQoKdGVzdF9tYXRyaXggPC0gbW9kZWwubWF0cml4KE91dHN0YXRlfi4sIGRhdGE9dGVzdCkKCnZhbC5lcnJvcnMgPC0gcmVwKE5BLDE3KQpmb3IoaSBpbiAxOjE3KXsKIGNvZWZpIDwtIGNvZWYoZndkLGlkPWkpCiBwcmVkIDwtIHRlc3RfbWF0cml4WyxuYW1lcyhjb2VmaSldJSolY29lZmkKIHZhbC5lcnJvcnNbaV0gPC0gbWVhbigodGVzdCRPdXRzdGF0ZS1wcmVkKV4yKSAKfQoKd2hpY2gubWluKHZhbC5lcnJvcnMpCgpwbG90KHZhbC5lcnJvcnMsIHR5cGU9J2InKQpwb2ludHMod2hpY2gubWluKHZhbC5lcnJvcnMpLCB2YWwuZXJyb3JzWzEyXSwgY29sPSdyZWQnLCBwY2g9MjAsIGNleD0yKQpgYGAKClRoZSBiZXN0IG1vZGVsIHNlbGVjdGVkIHVzaW5nIHRoZSBmb3J3YXJkIHN0ZXB3aXNlIG1ldGhvZCBpcyBvbmUgdGhhdCBoYXMgMTIgCnZhcmlhYmxlcywgYnV0IGJ5IGxvb2tpbmcgYXQgdGhlIHBsb3QsIHdlIHNlZSB0aGF0IGEgc2ltcGxlciBtb2RlbCBjb3VsZCBiZSAKZW5vdWdoLCBhcyB0aGUgdGVzdCBNU0UgZG9lc24ndCBkZWNyZWFzZSBzaWduaWZpY2FudGx5IGFmdGVyIDYgdmFyaWFibGVzLgoKYGBge3J9CmZ3ZF9mdWxsIDwtIHJlZ3N1YnNldHMoT3V0c3RhdGUgfiAuLCBkYXRhPUNvbGxlZ2UsIG52bWF4PTE3LCBtZXRob2Q9J2ZvcndhcmQnKQpjb2VmKGZ3ZF9mdWxsLCA2KQpgYGAKCgojIyMjIyAkKGIpJCBGaXQgYSBHQU0gb24gdGhlIHRyYWluaW5nIGRhdGEsIHVzaW5nIG91dC1vZi1zdGF0ZSB0dWl0aW9uIGFzIHRoZSByZXNwb25zZSBhbmQgdGhlIGZlYXR1cmVzIHNlbGVjdGVkIGluIHRoZSBwcmV2aW91cyBzdGVwIGFzIHRoZSBwcmVkaWN0b3JzLiBQbG90IHRoZSByZXN1bHRzLCBhbmQgZXhwbGFpbiB5b3VyIGZpbmRpbmdzLgoKYGBge3J9CmxpYnJhcnkoZ2FtKQpsaWJyYXJ5KHNwbGluZXMpCgpnYW1fZml0IDwtIGdhbShPdXRzdGF0ZSB+IFByaXZhdGUgKyBzKFJvb20uQm9hcmQsIDMpICsgcyhUZXJtaW5hbCwgMykgKyBzKHBlcmMuYWx1bW5pLCAzKSArIHMoRXhwZW5kLCAzKSArIHMoR3JhZC5SYXRlLCAzKSwgZGF0YT10cmFpbikKCnBhcihtZnJvdz1jKDIsMykpCnBsb3QoZ2FtX2ZpdCwgc2U9VFJVRSwgY29sPSJibHVlIikKYGBgCgojIyMjIyAkKGMpJCBFdmFsdWF0ZSB0aGUgbW9kZWwgb2J0YWluZWQgb24gdGhlIHRlc3Qgc2V0LCBhbmQgZXhwbGFpbiB0aGUgcmVzdWx0cyBvYnRhaW5lZC4KCmBgYHtyfQpwcmVkcyA8LSBwcmVkaWN0KGdhbV9maXQsIG5ld2RhdGEgPSB0ZXN0KQplcnJvciA8LSBtZWFuKCh0ZXN0JE91dHN0YXRlLXByZWRzKV4yKQoKdmFsLmVycm9yc1s2XS1lcnJvcgpgYGAKClRoZSBHQU0gbW9kZWwgZG9lcyBiZXR0ZXIgdGhhbiBhIHNpbXBsZSBsaW5lYXIgbW9kZWwgd2l0aCA2IHZhcmlhYmxlcy4KCiMjIyMjICQoZCkkIEZvciB3aGljaCB2YXJpYWJsZXMsIGlmIGFueSwgaXMgdGhlcmUgZXZpZGVuY2Ugb2YgYSBub24tbGluZWFyIHJlbGF0aW9uc2hpcCB3aXRoIHRoZSByZXNwb25zZT8KCmBgYHtyfQpzdW1tYXJ5KGdhbV9maXQpCmBgYAoKVGhlIG1vZGVsIHN1Z2dlc3RzIGEgc3Ryb25nIG5vbi1saW5lYXIgcmVsYXRpb25zaGlwIGJldHdlZW4gIk91dHN0YXRlIiBhbmQgCiJFeHBlbmQiLCBhbmQgYSBtb2RlcmF0ZSBub24tbGluZWFyIHJlbGF0aW9uc2hpcCBiZXR3ZWVuICJPdXRzdGF0ZSIgYW5kIAoiR3JhZC5yYXRlIi4g

Assignment 6

Rani Misra

2025-04-14

Exercise 6

In this exercise, you will further analyze the `Wage` data set considered throughout this chapter.

\((a)\) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree \(d\) for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

\((b)\) Fit a step function to predict `wage` using `age`, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

Exercise 10

This question relates to the `College` data set.

\((a)\) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

\((b)\) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

\((c)\) Evaluate the model obtained on the test set, and explain the results obtained.

\((d)\) For which variables, if any, is there evidence of a non-linear relationship with the response?

Assignment 6

Rani Misra

2025-04-14

Exercise 6

In this exercise, you will further analyze the Wage data set considered throughout this chapter.

\((a)\) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree \(d\) for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

\((b)\) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

Exercise 10

This question relates to the College data set.

\((a)\) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

\((b)\) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

\((c)\) Evaluate the model obtained on the test set, and explain the results obtained.

\((d)\) For which variables, if any, is there evidence of a non-linear relationship with the response?

In this exercise, you will further analyze the `Wage` data set considered throughout this chapter.

\((b)\) Fit a step function to predict `wage` using `age`, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

This question relates to the `College` data set.