regression

#6. In this exercise, you will further analyze the Wage data set considered throughout this chapter. (a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

library (ISLR2)
library(boot)
library(ggplot2)
library(caret)

## Warning: package 'caret' was built under R version 4.2.2

## Loading required package: lattice

## Warning: package 'lattice' was built under R version 4.2.2

## 
## Attaching package: 'lattice'

## The following object is masked from 'package:boot':
## 
##     melanoma

library(data.table)
library(leaps)
library(glmnet)

## Loading required package: Matrix

## Warning: package 'Matrix' was built under R version 4.2.2

## Loaded glmnet 4.1-4

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:data.table':
## 
##     between, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data("Wage")
set.seed(1)
degree = 10
cv.errs = rep(NA, degree)
for (i in 1:degree) {
  fit = glm(wage ~ poly(age, i), data = Wage)
  cv.errs[i] <- cv.glm(Wage, fit, K=10)$delta[2]
}
plot(1:degree, cv.errs, xlab = 'Degree', ylab = 'Test MSE', type = 'l')
deg.min = which.min(cv.errs)
points(deg.min, cv.errs[deg.min], col = 'red', cex = 2, pch = 19)

set.seed(1)
anova.fit1=lm(wage~poly(age,1,raw=TRUE),data=Wage)
anova.fit2=lm(wage~poly(age,2,raw=TRUE),data=Wage)
anova.fit3=lm(wage~poly(age,3,raw=TRUE),data=Wage)
anova.fit4=lm(wage~poly(age,4,raw=TRUE),data=Wage)
anova.fit5=lm(wage~poly(age,5,raw=TRUE),data=Wage)
anova.fit6=lm(wage~poly(age,6,raw=TRUE),data=Wage)
anova.fit7=lm(wage~poly(age,7,raw=TRUE),data=Wage)
anova(anova.fit1,anova.fit2,anova.fit3,anova.fit4,anova.fit5,anova.fit6,anova.fit7)

## Analysis of Variance Table
## 
## Model 1: wage ~ poly(age, 1, raw = TRUE)
## Model 2: wage ~ poly(age, 2, raw = TRUE)
## Model 3: wage ~ poly(age, 3, raw = TRUE)
## Model 4: wage ~ poly(age, 4, raw = TRUE)
## Model 5: wage ~ poly(age, 5, raw = TRUE)
## Model 6: wage ~ poly(age, 6, raw = TRUE)
## Model 7: wage ~ poly(age, 7, raw = TRUE)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.6926 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8956  0.001673 ** 
## 4   2995 4771604  1      6070   3.8125  0.050966 .  
## 5   2994 4770322  1      1283   0.8055  0.369516    
## 6   2993 4766389  1      3932   2.4697  0.116165    
## 7   2992 4763834  1      2555   1.6049  0.205311    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The minimum of test MSE is at degree 6. But test MSE of degree 4 is small enough. The comparison by ANOVA suggest degree 4 is enough.

ggplot(Wage, aes(age,wage)) + 
  geom_point(color="orange") + 
  stat_smooth(method = "lm", formula = y ~ poly(x, 6), size = 1)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.

(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot ofthe fit obtained.

set.seed(1)

cvs = rep(NA, 10)
for (i in 2:10) {
  Wage$cut.point = cut(Wage$age, i)
  lm.fit = glm(wage~cut.point, data=Wage)
  cvs[i] = cv.glm(Wage, lm.fit, K=10)$delta[2]
}

cvs.data <- data.table(c(2,3,4,5,6,7,8,9,10),cvs[-1])

plot(2:degree, cv.errs[-1], xlab = 'Cuts', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], col = 'red', cex = 2, pch = 19)

In fitting the step function age was cut up into 20 intervals which was used by the cross-validation function to select best cut for the model. For this model the optimal number of cuts is 9. The model will now be used to predict wage with an eight interval step function of age.

ggplot(Wage, aes(x = age, y = wage)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ cut(x, 8)") + 
  labs(title = "Wage Dataset: Step Function (Eight Intervals)")

#10. This question relates to the College data set. (a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors

data(College)
set.seed(234)
train = sample(1: nrow(College), nrow(College)/2)
test = -train
fit = regsubsets(Outstate ~ ., data = College, subset = train, method = 'forward')
fit.sum <- summary(fit)
fit.sum

## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = College, subset = train, 
##     method = "forward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
##          PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 2  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 3  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 4  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 5  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 6  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 7  ( 1 ) "*"        " "  "*"    " "    " "       " "       " "        
## 8  ( 1 ) "*"        " "  "*"    "*"    " "       " "       " "        
##          P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 ) " "         " "        " "   " "      " " " "      " "      
## 2  ( 1 ) " "         " "        " "   " "      " " " "      " "      
## 3  ( 1 ) " "         " "        " "   " "      " " "*"      " "      
## 4  ( 1 ) " "         "*"        " "   " "      " " "*"      " "      
## 5  ( 1 ) " "         "*"        " "   " "      " " "*"      " "      
## 6  ( 1 ) " "         "*"        " "   " "      " " "*"      " "      
## 7  ( 1 ) " "         "*"        " "   " "      " " "*"      " "      
## 8  ( 1 ) " "         "*"        " "   " "      " " "*"      " "      
##          perc.alumni Expend Grad.Rate
## 1  ( 1 ) " "         "*"    " "      
## 2  ( 1 ) " "         "*"    " "      
## 3  ( 1 ) " "         "*"    " "      
## 4  ( 1 ) " "         "*"    " "      
## 5  ( 1 ) "*"         "*"    " "      
## 6  ( 1 ) "*"         "*"    "*"      
## 7  ( 1 ) "*"         "*"    "*"      
## 8  ( 1 ) "*"         "*"    "*"

coef(fit, id = 6)

##   (Intercept)    PrivateYes    Room.Board      Terminal   perc.alumni 
## -4520.0323039  3004.9423636     0.7806096    52.5433032    49.3604312 
##        Expend     Grad.Rate 
##     0.2040293    31.5931149

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

library(gam)

## Warning: package 'gam' was built under R version 4.2.2

## Loading required package: splines

## Loading required package: foreach

## Loaded gam 1.20.2

gam.mod1 = gam(Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), data = College, subset = train)
par(mfrow = c(2,3))
plot(gam.mod1, se = TRUE)

Expend and Grad.Rate are strong non-linear with outstate (c) Evaluate the model obtained on the test set, and explain the results obtained.

pred1 = predict(gam.mod1, College[test, ])
RSS = sum((College[test, ]$Outstate - pred1)^2) 
TSS = sum((College[test, ]$Outstate - mean(College[test, ]$Outstate)) ^ 2)
1 - (RSS / TSS)

## [1] 0.7851594

The R squared statistic on test set is 0.785 (d) For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(gam.mod1)

## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 
##     5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), 
##     data = College, subset = train)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -6794.5 -1048.6    23.7  1154.1  7935.7 
## 
## (Dispersion Parameter for gaussian family taken to be 3517766)
## 
##     Null Deviance: 6153965471 on 387 degrees of freedom
## Residual Deviance: 1269914652 on 361.0003 degrees of freedom
## AIC: 6977.565 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                    Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private             1 1445378795 1445378795 410.880 < 2.2e-16 ***
## s(Room.Board, 5)    1 1133006579 1133006579 322.081 < 2.2e-16 ***
## s(Terminal, 5)      1  397611450  397611450 113.029 < 2.2e-16 ***
## s(perc.alumni, 5)   1  250757162  250757162  71.283 7.626e-16 ***
## s(Expend, 5)        1  580108202  580108202 164.908 < 2.2e-16 ***
## s(Grad.Rate, 5)     1   92636987   92636987  26.334 4.701e-07 ***
## Residuals         361 1269914652    3517766                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                   Npar Df  Npar F     Pr(F)    
## (Intercept)                                    
## Private                                        
## s(Room.Board, 5)        4  1.8840  0.112635    
## s(Terminal, 5)          4  1.0227  0.395421    
## s(perc.alumni, 5)       4  0.8109  0.518837    
## s(Expend, 5)            4 14.8281 3.153e-11 ***
## s(Grad.Rate, 5)         4  3.5294  0.007667 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova for Nonparametric Effects shows Expend has strong non-linear relationshop with the Outstate. Grad.Rate and PhD have moderate non-linear relationship with the Outstate. This matched what we saw in part b

regression

Richard Rivas

2022-11-05