- Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
library(ISLR)
library(boot)
set.seed(1)
degree <- 10
cv.errs <- rep(NA, degree)
for (i in 1:degree) {
fit <- glm(wage ~ poly(age, i), data = Wage)
cv.errs[i] <- cv.glm(Wage, fit)$delta[1]
}
plot(1:degree, cv.errs, xlab = 'Degree', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], col = 'red', cex = 2, pch = 19)
The minimum of test MSE at the degree 9. But test MSE of degree 4 is small enough. The comparison by ANOVA suggests degree 4 is enough.
plot(wage ~ age, data = Wage, col = "darkgrey")
age.range <- range(Wage$age)
age.grid <- seq(from = age.range[1], to = age.range[2])
fit <- lm(wage ~ poly(age, 3), data = Wage)
preds <- predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)
- Fit a step function to predict wage using age, and perform crossvalidation to choose the optimal number of cuts. Make a plot of the fit obtained.
cv.errs <- rep(NA, degree)
for (i in 2:degree) {
Wage$age.cut <- cut(Wage$age, i)
fit <- glm(wage ~ age.cut, data = Wage)
cv.errs[i] <- cv.glm(Wage, fit)$delta[1]
}
plot(2:degree, cv.errs[-1], xlab = 'Cuts', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], col = 'red', cex = 2, pch = 19)
8 cuts produce minimum test MSE
plot(wage ~ age, data = Wage, col = "darkgrey")
fit <- glm(wage ~ cut(age, 8), data = Wage)
preds <- predict(fit, list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)
res <- cut(c(1,5,2,3,8), 2)
res
## [1] (0.993,4.5] (4.5,8.01] (0.993,4.5] (0.993,4.5] (4.5,8.01]
## Levels: (0.993,4.5] (4.5,8.01]
length(res)
## [1] 5
class(res[1])
## [1] "factor"
cut(x,k) turns continuous quantitative variable into a discrete qualitative variable by dividing the range of x evenly into k intervals. each interval is a level. The output is a vector with the same length of x. each element of output is a level where the corresponding input element falls in.
- Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
library(ISLR)
set.seed(12345)
train <- sample(nrow(College) * 0.7)
train_set <- College[train, ]
test_set <- College[-train, ]
plot_metric <- function(metric, yaxis_label, reverse = FALSE) {
plot(metric, xlab = "Number of Variables", ylab = yaxis_label, xaxt = "n", type = "l")
axis(side = 1, at = 1:length(metric))
if (reverse) {
metric_1se <- max(metric) - (sd(metric) / sqrt(length(metric)))
min_subset <- which(metric > metric_1se)
} else {
metric_1se <- min(metric) + (sd(metric) / sqrt(length(metric)))
min_subset <- which(metric < metric_1se)
}
abline(h = metric_1se, col = "red", lty = 2)
abline(v = min_subset[1], col = "green", lty = 2)
}
par(mfrow=c(1, 3))
- Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
calc_mse <- function(y, y_hat) {
return(mean((y - y_hat)^2))
}
calc_rmse <- function(y, y_hat) {
return(sqrt(calc_mse(y, y_hat)))
}
calc_r2 <- function(y, y_hat) {
y_bar <- mean(y)
rss <- sum((y - y_hat)^2)
tss <- sum((y - y_bar)^2)
return(1 - (rss / tss))
}
- Evaluate the model obtained on the test set, and explain the results obtained.
test RMSE of 1984.3845506 and R-squared of 0.7614328 using GAM with 6 predictors
- For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
Non-parametric Anova test shows a strong evidence of non-linear relationship between response and Expend and PhD, and a moderately strong non-linear relationship between response and Room.Board and Grad.Rate.