In this exercise, you will further analyze the Wage data set considered throughout this chapter.
(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
Wage <- ISLR::Wage
k <- 10
degree <- 5
# assign consecutive rows to k folds (note: rows are not shuffled first)
folds <- cut(seq(1, nrow(Wage)), breaks = k, labels = FALSE)
mse <- matrix(data = NA, nrow = k, ncol = degree)
# test MSE for every fold (rows) and polynomial degree (columns)
for (i in 1:k) {
  testIndexes <- which(folds == i, arr.ind = TRUE)
  testData <- Wage[testIndexes, ]
  trainData <- Wage[-testIndexes, ]
  for (j in 1:degree) {
    fit.train <- lm(wage ~ poly(age, j), data = trainData)
    fit.test <- predict(fit.train, newdata = testData)
    mse[i, j] <- mean((fit.test - testData$wage)^2)
  }
}
colMeans(mse)
## [1] 1675.674 1600.327 1595.312 1593.936 1594.480
According to our cross-validation, the estimated test MSE for each model is:

- Degree 1: 1675.674
- Degree 2: 1600.327
- Degree 3: 1595.312
- Degree 4: 1593.936
- Degree 5: 1594.480
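As a quick programmatic check (reusing the `mse` matrix computed above), we can extract the degree with the smallest average cross-validation error:

# degree with the smallest average 10-fold CV error (degree 4, per the values above)
which.min(colMeans(mse))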
Taking these results as the basis for choosing the degree of our model, we would pick the quartic (degree-4) model, since it has the lowest cross-validation MSE, although the improvement over the cubic model is very small. Now, let's run an ANOVA for the degrees:
poly_wage_1 <- lm(wage ~ age, data = Wage)
poly_wage_2 <- lm(wage ~ poly(age,2), data = Wage)
poly_wage_3 <- lm(wage ~ poly(age,3), data = Wage)
poly_wage_4 <- lm(wage ~ poly(age,4), data = Wage)
poly_wage_5 <- lm(wage ~ poly(age,5), data = Wage)
anova(poly_wage_1, poly_wage_2, poly_wage_3, poly_wage_4, poly_wage_5)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-values for the comparisons of models 1 vs. 2 and 2 vs. 3 are very low (close to 0), indicating that the linear and quadratic fits are insufficient: adding the quadratic and cubic terms significantly improves the model. The comparison of models 3 vs. 4 has a p-value of about 0.05, so the quartic term is borderline, while the degree-5 term has a p-value that is far too high (~0.37), meaning the quintic model is not justified by the data. Therefore, the best option would be to employ either a cubic or quartic fit, which is aligned with our cross-validation result.
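Because `poly()` produces orthogonal polynomials, an equivalent check is to look at the coefficient t-tests from a single degree-5 fit on the full Wage data: the squares of the t-statistics equal the F-statistics from the sequential ANOVA above. A small sketch:

# t-tests for the orthogonal polynomial terms; their squared t-statistics
# match the F-statistics in the ANOVA table above
coef(summary(lm(wage ~ poly(age, 5), data = Wage)))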
We now see the fit of our model:
ggplot(Wage, aes(x = age, y = wage)) +
  geom_point() +
  stat_smooth(method = 'lm', formula = y ~ poly(x, 4), size = 1) +
  xlab('Age') +
  ylab('Wage')
(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
require(boot)
## Loading required package: boot
## Warning: package 'boot' was built under R version 4.1.2
set.seed(2)
cv.errors <- rep(NA, 10)
# 10-fold CV error for step functions of age with 2 to 10 cuts
for (i in 2:10) {
  Wage$age.cut <- cut(Wage$age, i)
  glm.fit <- glm(wage ~ age.cut, data = Wage)
  cv.errors[i] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
cv.errors
## [1] NA 1734.448 1681.623 1635.869 1632.864 1621.643 1613.148 1601.850
## [9] 1610.852 1603.607
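As a quick check (reusing the `cv.errors` vector above), the number of cuts with the smallest estimated error is:

# which.min() skips the NA in position 1; here it returns 8
which.min(cv.errors)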
According to our cross-validation errors, the smallest error is obtained with 8 cuts. We therefore fit the corresponding step function and plot it:
age_lim <- range(Wage$age)
age_grid <- seq(from = age_lim[1], to = age_lim[2])
fit_step <- glm(wage ~ cut(age, 8), data = Wage)
# predictions over a grid spanning the observed range of age
preds_wage <- predict(fit_step, newdata = data.frame(age = age_grid))
plot(Wage$age, Wage$wage, col = "grey")
lines(age_grid, preds_wage, col = "red", lwd = 2)
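Optionally, approximate ±2 standard-error bands can be added around the step fit. Below is a small sketch reusing `fit_step` and `age_grid` from above; `se.fit = TRUE` requests the pointwise standard errors:

# predict again, this time requesting standard errors
preds_se <- predict(fit_step, newdata = data.frame(age = age_grid), se.fit = TRUE)
plot(Wage$age, Wage$wage, col = "grey")
lines(age_grid, preds_se$fit, col = "red", lwd = 2)
# approximate 95% pointwise confidence bands
lines(age_grid, preds_se$fit + 2 * preds_se$se.fit, col = "red", lty = 2)
lines(age_grid, preds_se$fit - 2 * preds_se$se.fit, col = "red", lty = 2)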
This question relates to the College data set. (a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward step-wise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
college <- ISLR::College
training_sample_college <- sample(nrow(college), 0.6*nrow(college))
training_data <- college[training_sample_college, ]
test_data <- college[-training_sample_college, ]
# nvmax = 17: with Outstate as the response, at most 17 predictors are available
fwd_college <- leaps::regsubsets(Outstate ~ ., data = training_data, nvmax = 17,
                                 method = 'forward')
fwd_sum <- summary(fwd_college)
which.max(fwd_sum$adjr2)
## [1] 15
plot(fwd_sum$adjr2, xlab = "Number of Variables",
     ylab = "Adjusted R^2", type = "b")
points(which.max(fwd_sum$adjr2), fwd_sum$adjr2[which.max(fwd_sum$adjr2)],
       col = "red", cex = 2, pch = 20)
According to the adjusted R^2 values, the model that explains the most variance in the training data is the one with 15 predictors. However, the adjusted R^2 barely improves once roughly 6 predictors are included, so we can use a much simpler model with little loss of fit. With 6 predictors, we obtain the following coefficients:
coef(fwd_college, 6)
## (Intercept) PrivateYes Room.Board Terminal perc.alumni
## -3659.4598628 2514.8526117 1.0601301 31.7798197 47.7665719
## Expend Grad.Rate
## 0.2210067 28.6794928
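Adjusted R^2 is only one possible selection criterion. As a cross-check (a sketch reusing `fwd_sum` from above), we could also look at the model sizes preferred by Cp and BIC, with BIC typically favouring smaller models:

# model sizes minimizing Cp and BIC (results depend on the random training split)
which.min(fwd_sum$cp)
which.min(fwd_sum$bic)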
(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
We now fit a GAM on the training data using the six features selected above, with smoothing splines for the quantitative predictors; the choice of degrees of freedom for each smooth term is explained in the comments below:
require(gam)
## Loading required package: gam
## Warning: package 'gam' was built under R version 4.1.3
## Loading required package: splines
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 4.1.2
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loaded gam 1.20.1
require(splines)
gam_college <- gam(Outstate ~ Private + s(Room.Board, 3) + s(Terminal, 3) +
                     s(perc.alumni, 3) + s(Expend, 3) + s(Grad.Rate, 3),
                   data = training_data)
# Private is binary, so it enters linearly and uses only 1 degree of freedom.
# With at most 17 predictors available, the model has 16 df to spend (17 - 1);
# after the 1 df for Private, the remaining 15 df are split evenly across the
# 5 smooth terms, i.e. 3 df per term.
par(mfrow=c(2,3))
plot(gam_college, se=TRUE, col="blue")
For each quantitative predictor, the plot shows the fitted smooth together with bands of roughly ±2 standard errors, i.e. approximate 95% pointwise confidence intervals for the fitted function. The bands are fairly narrow over most of the range of each variable, so the fits are estimated with reasonable precision, and the curvature of several of the smooths (most clearly Expend) suggests that some of these relationships are non-linear.
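Above we fixed 3 degrees of freedom for every smooth term. If we instead wanted cross-validation to pick a common number of degrees of freedom, a minimal sketch could look like the following (assumptions: the `training_data` split and `gam` package loaded above; the fold count, seed, and df grid are illustrative choices, not part of the original analysis):

set.seed(3)
k_folds <- sample(rep(1:5, length.out = nrow(training_data)))
df_grid <- 2:5
cv_err <- sapply(df_grid, function(d) {
  # build the GAM formula with d degrees of freedom for every smooth term
  fmla <- as.formula(paste0(
    "Outstate ~ Private + s(Room.Board,", d, ") + s(Terminal,", d, ") + ",
    "s(perc.alumni,", d, ") + s(Expend,", d, ") + s(Grad.Rate,", d, ")"))
  # 5-fold CV estimate of the test MSE for this choice of df
  mean(sapply(1:5, function(k) {
    fit <- gam(fmla, data = training_data[k_folds != k, ])
    pred <- predict(fit, newdata = training_data[k_folds == k, ])
    mean((pred - training_data$Outstate[k_folds == k])^2)
  }))
})
df_grid[which.min(cv_err)]  # df value with the lowest estimated test error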
(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(gam_college)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 3) + s(Terminal,
## 3) + s(perc.alumni, 3) + s(Expend, 3) + s(Grad.Rate, 3),
## data = training_data)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6856.98 -1136.57 82.44 1255.83 7999.16
##
## (Dispersion Parameter for gaussian family taken to be 3529962)
##
## Null Deviance: 7413326032 on 465 degrees of freedom
## Residual Deviance: 1584951847 on 448.9997 degrees of freedom
## AIC: 8366.921
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 1925352510 1925352510 545.432 < 2.2e-16 ***
## s(Room.Board, 3) 1 1663138399 1663138399 471.149 < 2.2e-16 ***
## s(Terminal, 3) 1 354763176 354763176 100.501 < 2.2e-16 ***
## s(perc.alumni, 3) 1 318188164 318188164 90.139 < 2.2e-16 ***
## s(Expend, 3) 1 598013591 598013591 169.411 < 2.2e-16 ***
## s(Grad.Rate, 3) 1 74129105 74129105 21.000 5.96e-06 ***
## Residuals 449 1584951847 3529962
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board, 3) 2 2.086 0.1254
## s(Terminal, 3) 2 0.757 0.4696
## s(perc.alumni, 3) 2 0.891 0.4110
## s(Expend, 3) 2 38.176 4.441e-16 ***
## s(Grad.Rate, 3) 2 1.783 0.1693
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the "Anova for Nonparametric Effects" table, only Expend shows strong evidence of a non-linear relationship with Outstate (p-value ≈ 4.4e-16). The remaining smooth terms, Room.Board and Grad.Rate among them, have p-values well above 0.05, so there is little evidence of non-linearity for those variables.
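As a further check on the Expend result, we can compare a GAM in which Expend enters linearly against the GAM above in which it is a smoothing spline, using an F-test for nested models (a sketch reusing `training_data` and `gam_college`; the other terms are kept unchanged):

# same model as gam_college, except Expend enters as a linear term
gam_expend_lin <- gam(Outstate ~ Private + s(Room.Board, 3) + s(Terminal, 3) +
                        s(perc.alumni, 3) + Expend + s(Grad.Rate, 3),
                      data = training_data)
anova(gam_expend_lin, gam_college, test = "F")

A small p-value in this comparison would again point to a non-linear effect of Expend.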