Exercise 7.9.1

We consider a function of the form \(f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x - \xi)^3_+\), a cubic with a truncated-power term at the knot \(\xi\).

  1. Find a cubic polynomial

\[f_1(x) = a_1 + b_1x + c_1x^2 + d_1x^3\] such that \(f(x) = f_1(x)\) for all \(x \leq \xi\). Express \(a_1, b_1, c_1, d_1\) in terms of \(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4\).

The \(+\) subscript denotes the positive part, \((x)_+ = \operatorname{max}\{0,x\}\). Consequently, when \(x < \xi\) the quantity inside the parentheses is negative and the term vanishes: \[ \beta_4 (x - \xi)^3_+ = 0 \]

We thus find the following:

\[ f(x) = f_1(x) \newline \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 = a_1 + b_1x + c_1x^2 + d_1x^3 \]

We thus find that this holds when the corresponding coefficients are equal: \[ \beta_0 = a_1, \quad \beta_1 = b_1, \quad \beta_2 = c_1, \quad \beta_3 = d_1 \]

  1. Find a cubic polynomial

\[ f_2(x) = a_2 + b_2x + c_2x^2 + d_2x^3 \]

such that \(f(x) = f_2(x)\) for all \(x > \xi\). Express \(a_2, b_2, c_2, d_2\) in terms of \(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4\). We have now established that \(f(x)\) is a piecewise polynomial.

In this case, since \(x > \xi\), we cannot set the \(\beta_4(x - \xi)^3_+\) term to zero as we did in the previous part. Recalling that \((a - b)^3 = (a - b)(a^2 - 2ab + b^2)\), we can substitute as follows:

\[ f(x) = f_2(x) \newline \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4(x - \xi)^3_+ = a_2 + b_2x + c_2x^2 + d_2x^3 \newline \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4(x - \xi)(x^2 - 2x\xi + \xi^2) = a_2 + b_2x + c_2x^2 + d_2x^3 \newline \]

Expanding \(\beta_4(x - \xi)(x^2 - 2x\xi + \xi^2) = \beta_4(x^3 - 3x^2\xi + 3x\xi^2 - \xi^3)\) and collecting terms by power of \(x\), we reach the following:

\[ a_2 + b_2x + c_2x^2 + d_2x^3 = (\beta_0 - \beta_4 \xi^3) + (\beta_1 + 3\beta_4\xi^2)x + (\beta_2 - 3\beta_4 \xi)x^2 + (\beta_3 + \beta_4)x^3 \newline \]

We thus find the following values of the coefficients:

\[ a_2 = \beta_0 - \beta_4 \xi^3 \newline b_2 = \beta_1 + 3\beta_4\xi^2 \newline c_2 = \beta_2 - 3\beta_4 \xi \newline d_2 = \beta_3 + \beta_4 \newline \]

  1. Show that \(f_1(\xi) = f_2(\xi)\). That is, \(f(x)\) is continuous at \(\xi\).

\[ f_1(\xi) = f_2(\xi) \newline \beta_0 + \beta_1 \xi + \beta_2 \xi^2 + \beta_3 \xi^3 = (\beta_0 - \beta_4 \xi^3) + (\beta_1 + 3 \beta_4 \xi^2)\xi + (\beta_2 - 3\beta_4 \xi)\xi^2 + (\beta_3 + \beta_4)\xi^3 \newline \beta_0 + \beta_1 \xi + \beta_2 \xi^2 + \beta_3 \xi^3 = \beta_0 + \beta_1 \xi + \beta_2 \xi^2 + \beta_3 \xi^3 \]

Thus, \(f_1(\xi) = f_2(\xi)\), and \(f(x)\) is continuous at \(\xi\).

  1. Show that \(f_1'(\xi) = f_2'(\xi)\). That is, \(f'(x)\) is continuous at \(\xi\).

We can find the derivatives of each as follows:

\[ f_1'(x) = b_1 + 2c_1x + 3d_1x^2 \newline f_2'(x) = b_2 + 2c_2x + 3d_2x^2 \newline \]

We can then evaluate both derivatives at \(\xi\), substitute the coefficient values found above, set them equal, and simplify:

\[ \beta_1 + 2\beta_2 \xi + 3\beta_3 \xi ^2 = (\beta_1 + 3\beta_4 \xi^2) + 2(\beta_2 - 3\beta_4 \xi)\xi + 3(\beta_3 + \beta_4)\xi^2 \newline \beta_1 + 2\beta_2 \xi + 3\beta_3 \xi ^2 = \beta_1 + 2\beta_2 \xi + 3\beta_3 \xi ^2 \newline \]

Thus, we find that \(f_1'(\xi) = f_2'(\xi)\).

  1. Show that \(f_1''(\xi) = f_2''(\xi)\). That is, \(f''(x)\) is continuous at \(\xi\).

We can find the second derivatives as follows:

\[ f_1''(x) = 2c_1 + 6d_1x \newline f_2''(x) = 2c_2 + 6d_2x \newline \]

We can then evaluate both second derivatives at \(\xi\), substitute the coefficient values found above, set them equal, and simplify:

\[ 2\beta_2 + 6\beta_3 \xi = 2(\beta_2 - 3\beta_4\xi) + 6(\beta_3 + \beta_4) \xi \newline 2\beta_2 + 6\beta_3 \xi = 2\beta_2 + 6\beta_3 \xi \]

Thus, we find that \(f_1''(\xi) = f_2''(\xi)\).
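
As a quick numerical sanity check of the algebra above, we can pick arbitrary values for the coefficients and the knot and verify in R that \(f_1\) and \(f_2\), together with their first and second derivatives, agree at \(\xi\) (the values below are purely illustrative):

# Arbitrary coefficients beta_0, ..., beta_4 and knot location
beta <- c(2, -1, 0.5, 0.3, 1.7)
xi <- 1.5

f1 <- function(x) beta[1] + beta[2]*x + beta[3]*x^2 + beta[4]*x^3
f2 <- function(x) (beta[1] - beta[5]*xi^3) + (beta[2] + 3*beta[5]*xi^2)*x +
  (beta[3] - 3*beta[5]*xi)*x^2 + (beta[4] + beta[5])*x^3

# Differences of the functions and of their first and second derivatives at x = xi
c(f1(xi) - f2(xi),
  (beta[2] + 2*beta[3]*xi + 3*beta[4]*xi^2) -
    ((beta[2] + 3*beta[5]*xi^2) + 2*(beta[3] - 3*beta[5]*xi)*xi + 3*(beta[4] + beta[5])*xi^2),
  (2*beta[3] + 6*beta[4]*xi) - (2*(beta[3] - 3*beta[5]*xi) + 6*(beta[4] + beta[5])*xi))
# all three differences are zero (up to floating-point error)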

Exercise 7.9.6

In this exercise, you will further analyze the Wage data set considered throughout this chapter.

head(Wage)
##        year age           maritl     race       education             region
## 231655 2006  18 1. Never Married 1. White    1. < HS Grad 2. Middle Atlantic
## 86582  2004  24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
## 161300 2003  45       2. Married 1. White 3. Some College 2. Middle Atlantic
## 155159 2003  43       2. Married 3. Asian 4. College Grad 2. Middle Atlantic
## 11443  2005  50      4. Divorced 1. White      2. HS Grad 2. Middle Atlantic
## 376662 2008  54       2. Married 1. White 4. College Grad 2. Middle Atlantic
##              jobclass         health health_ins  logwage      wage
## 231655  1. Industrial      1. <=Good      2. No 4.318063  75.04315
## 86582  2. Information 2. >=Very Good      2. No 4.255273  70.47602
## 161300  1. Industrial      1. <=Good     1. Yes 4.875061 130.98218
## 155159 2. Information 2. >=Very Good     1. Yes 5.041393 154.68529
## 11443  2. Information      1. <=Good     1. Yes 4.318063  75.04315
## 376662 2. Information 2. >=Very Good     1. Yes 4.845098 127.11574
  1. Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
library(caret)  # for createFolds(); the Wage data (ISLR/ISLR2) and ggplot2 are assumed loaded in a setup chunk

set.seed(123)

k <- 10

max_degree <- 10

folds <- createFolds(Wage$wage, k=k, list = F)

results <- as.data.frame(matrix(nrow=k, ncol=max_degree)) # initialize a matrix for error

# cross validation
for(i in 1:k){
  test_idx <- which(folds == i)  # row indices of the observations in fold i
  train_data <- Wage[-test_idx, ]
  test_data <- Wage[test_idx, ]
  
  # get mse for each model
  for(d in 1:max_degree){
    model <- lm(wage ~ poly(age, d), data=train_data)
    preds <- predict(model, newdata=test_data)
    results[i,d] <- mean((preds - test_data$wage)^2)
  }
  
}

# Compute the mean MSE for each polynomial degree across all folds
mean_errors <- colMeans(results, na.rm = TRUE)

# Find the index (degree) of the best performing polynomial degree
best_degree <- which.min(mean_errors)

# Print results
cat("Best polynomial degree:", best_degree, "\n")
## Best polynomial degree: 10
cat("Average MSE for the best degree:", mean_errors[best_degree], "\n")
## Average MSE for the best degree: 1058.158
# Plot mean MSE for each polynomial degree
plot(1:max_degree, mean_errors, type = "b", pch = 19, col = "blue",
     xlab = "Polynomial Degree", ylab = "Mean MSE",
     main = "Cross-Validation Error for Polynomial Degrees")

As we can see, the best performing degree is 10 in this instance.

We can plot the best degree polynomial here:

# Fit the best polynomial regression model using the entire dataset
best_model <- lm(wage ~ poly(age, best_degree), data = Wage)

# Create a sequence of age values for smooth plotting
age_grid <- data.frame(age = seq(min(Wage$age), max(Wage$age), length.out = 100))

# Generate predictions using the best model
age_grid$predicted_wage <- predict(best_model, newdata = age_grid)

# Plot the data and fitted polynomial regression curve
ggplot(Wage, aes(x = age, y = wage)) +
  geom_point(alpha = 0.5) +  # Scatter plot of actual data
  geom_line(data = age_grid, aes(x = age, y = predicted_wage), color = "blue", linewidth = 1.2) +  # Fitted curve
  labs(title = paste("Polynomial Regression (Degree =", best_degree, ")"),
       x = "Age",
       y = "Wage") +
  theme_minimal()

We can compare this with ANOVA:

set.seed(123)
# Fitting polynomial regression models from degree 1 to 10
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
fit.6 <- lm(wage ~ poly(age, 6), data = Wage)
fit.7 <- lm(wage ~ poly(age, 7), data = Wage)
fit.8 <- lm(wage ~ poly(age, 8), data = Wage)
fit.9 <- lm(wage ~ poly(age, 9), data = Wage)
fit.10 <- lm(wage ~ poly(age, 10), data = Wage)

# Performing ANOVA to compare models
anova(fit.1, fit.2, fit.3, fit.4, fit.5, fit.6, fit.7, fit.8, fit.9, fit.10)
## Analysis of Variance Table
## 
## Model  1: wage ~ age
## Model  2: wage ~ poly(age, 2)
## Model  3: wage ~ poly(age, 3)
## Model  4: wage ~ poly(age, 4)
## Model  5: wage ~ poly(age, 5)
## Model  6: wage ~ poly(age, 6)
## Model  7: wage ~ poly(age, 7)
## Model  8: wage ~ poly(age, 8)
## Model  9: wage ~ poly(age, 9)
## Model 10: wage ~ poly(age, 10)
##    Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1    2998 5022216                                    
## 2    2997 4793430  1    228786 143.7638 < 2.2e-16 ***
## 3    2996 4777674  1     15756   9.9005  0.001669 ** 
## 4    2995 4771604  1      6070   3.8143  0.050909 .  
## 5    2994 4770322  1      1283   0.8059  0.369398    
## 6    2993 4766389  1      3932   2.4709  0.116074    
## 7    2992 4763834  1      2555   1.6057  0.205199    
## 8    2991 4763707  1       127   0.0796  0.777865    
## 9    2990 4756703  1      7004   4.4014  0.035994 *  
## 10   2989 4756701  1         3   0.0017  0.967529    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using ANOVA, the increases from degree 1 to 2 and from degree 2 to 3 are statistically significant, meaning these added degrees help explain the relationship between wage and age. The step from degree 3 to 4 is only borderline (p ≈ 0.051), and no further increase is significant until the step from degree 8 to 9. Given these results, the ANOVA suggests that the most practical fit is the degree-3 polynomial, which likely provides enough flexibility without excessive variance. Looking at the cross-validation plot above, a degree-3 polynomial performs similarly to the best-performing degree of 10, which suggests that the apparent edge of the degree-10 fit may be due to overfitting.

  1. Fit a step function to predict wage using age, and perform cross validation to choose the optimal number of cuts. Make a plot of the fit obtained.

We will try from 2 to 10 cuts, and compare with cross validation:

set.seed(123)

k <- 10

max_cuts <- 10

folds <- createFolds(Wage$wage, k=k, list = F)

results <- as.data.frame(matrix(nrow=k, ncol=max_cuts)) # initialize a matrix for error

# cross validation
for (i in 1:k) {
  test_idx <- which(folds == i)  # Get test indices
  train_data <- Wage[-test_idx, ]
  test_data <- Wage[test_idx, ]
  
  # Get MSE for each number of equal-width cuts
  for (c in 2:max_cuts) {  # Ensure at least 2 intervals for cut()
    
    # Compute equal-width breakpoints based on the full range of age
    breaks <- seq(min(Wage$age), max(Wage$age), length.out = c + 1)
    
    # Apply the same breakpoints to both train and test data
    train_data$age_cut <- cut(train_data$age, breaks = breaks, include.lowest = TRUE)
    test_data$age_cut <- cut(test_data$age, breaks = breaks, include.lowest = TRUE)
    
    model <- lm(wage ~ age_cut, data = train_data)
    preds <- predict(model, newdata = test_data)
    
    results[i, c] <- mean((preds - test_data$wage)^2)
  }
}

# Compute the mean MSE for each number of cuts across all folds
mean_errors <- colMeans(results, na.rm = TRUE)

# Find the best-performing number of cuts
best_cuts <- which.min(mean_errors)

# Print results
cat("Best number of cuts:", best_cuts, "\n")
## Best number of cuts: 8
cat("Average MSE for the best cuts:", mean_errors[best_cuts], "\n")
## Average MSE for the best cuts: 1600.994
# Plot mean MSE for each cut
plot(2:max_cuts, mean_errors[2:max_cuts], type = "b", pch = 19, col = "blue",
     xlab = "Number of Cuts", ylab = "Mean MSE",
     main = "Cross-Validation Error for Number of Cuts")

We see that the best number of cuts is 8.
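
A plot of the resulting step-function fit can be produced with a sketch along these lines (reusing the equal-width breakpoints from the cross-validation loop above; object names here are illustrative):

# Refit the 8-cut step function on the full data set
breaks <- seq(min(Wage$age), max(Wage$age), length.out = best_cuts + 1)
Wage$age_cut <- cut(Wage$age, breaks = breaks, include.lowest = TRUE)
step_model <- lm(wage ~ age_cut, data = Wage)

# Predict over a fine grid of ages for plotting
age_grid <- data.frame(age = seq(min(Wage$age), max(Wage$age), length.out = 500))
age_grid$age_cut <- cut(age_grid$age, breaks = breaks, include.lowest = TRUE)
age_grid$predicted_wage <- predict(step_model, newdata = age_grid)

# Scatter plot of the data with the fitted step function overlaid
plot(Wage$age, Wage$wage, col = "grey", pch = 16, cex = 0.5,
     xlab = "Age", ylab = "Wage", main = paste("Step Function Fit with", best_cuts, "Cuts"))
lines(age_grid$age, age_grid$predicted_wage, col = "blue", lwd = 2)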

Exercise 7.9.7

The Wage data set contains a number of other features not explored in this chapter, such as marital status (maritl), job class (jobclass), and others. Explore the relationships between some of these other predictors and wage, and use non-linear fitting techniques in order to fit flexible models to the data. Create plots of the results obtained, and write a summary of your findings.

We will ignore year, as it is not a characteristic of the worker, and logwage, as it is a transformation of the response. We will also ignore region, since only one region (Middle Atlantic) is present in this dataset, as the counts below confirm.
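
A one-line check is enough to see this (a sketch; the original code chunk is not shown):

table(Wage$region)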

## 
##        1. New England    2. Middle Atlantic 3. East North Central 
##                     0                  3000                     0 
## 4. West North Central     5. South Atlantic 6. East South Central 
##                     0                     0                     0 
## 7. West South Central           8. Mountain            9. Pacific 
##                     0                     0                     0

The variables we will focus on in this problem are maritl, race, education, jobclass, health, and health_ins.

Let’s first focus on maritl vs wage:

boxplot(wage ~ factor(maritl), data=Wage, main="Wage by Marital Status",
        xlab=" ", ylab="Wage", col=cm.colors(5), las=2, cex.axis=0.8)

We can see that married workers appear to have a slightly higher wage, and their distribution has many more outliers.

Now we can create a boxplot for race:

boxplot(wage ~ factor(race), data=Wage, main="Wage by Race",
        xlab=" ", ylab="Wage", col=cm.colors(4), las=2)

We see that White and Asian workers appear to have slightly higher wages than Black workers or workers of other races.

Next is education:

boxplot(wage ~ factor(education), data=Wage, main="Wage by Education",
        xlab=" ", ylab="Wage", col=cm.colors(5), las=2,cex.axis=0.6)

From here we see that higher levels of education are associated with higher wages, with wage increasing at each successive education level.

Now for jobclass:

boxplot(wage ~ factor(jobclass), data=Wage, main="Wage by Job Class",
        xlab=" ", ylab="Wage", col=cm.colors(2), las=1)

Workers in information jobs appear to have a slightly higher wage than those in industrial jobs.

Now for health:

boxplot(wage ~ factor(health), data=Wage, main="Wage by Health",
        xlab=" ", ylab="Wage", col=cm.colors(2), las=1)

We can see that healthier workers have a slightly higher wage.

Lastly, for health_ins:

boxplot(wage ~ factor(health_ins), data=Wage, main="Wage by Health Insurance",
        xlab=" ", ylab="Wage", col=cm.colors(2), las=1)

From this we can see that workers with health insurance have a clearly higher wage.

We will now use non-linear models with these variables to predict wage. We will fit two new models: the first adds education and health_ins to the polynomial-in-age model from the previous question, as those seem to have the clearest association with wage, and the second adds all of the variables above.
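
The summaries shown below can be reproduced with fits along these lines (the formulas are taken from the Call lines in the output; the object names are illustrative):

model_edu_ins <- lm(wage ~ poly(age, best_degree) + education + health_ins, data = Wage)
model_all <- lm(wage ~ poly(age, best_degree) + maritl + race + education +
                  jobclass + health + health_ins, data = Wage)
# summary() of these two models, and of the age-only model (best_model) fit in the
# previous exercise, is shown below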

The model containing education and health_ins:

## 
## Call:
## lm(formula = wage ~ poly(age, best_degree) + education + health_ins, 
##     data = Wage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.034  -18.979   -3.411   13.533  210.123 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   94.210      2.246  41.937  < 2e-16 ***
## poly(age, best_degree)1      306.007     34.979   8.748  < 2e-16 ***
## poly(age, best_degree)2     -328.794     34.878  -9.427  < 2e-16 ***
## poly(age, best_degree)3       57.389     34.526   1.662 0.096575 .  
## poly(age, best_degree)4       10.554     34.577   0.305 0.760219    
## poly(age, best_degree)5      -59.364     34.558  -1.718 0.085941 .  
## poly(age, best_degree)6       49.447     34.485   1.434 0.151716    
## poly(age, best_degree)7       60.198     34.474   1.746 0.080880 .  
## poly(age, best_degree)8      -15.077     34.476  -0.437 0.661907    
## poly(age, best_degree)9      -41.138     34.518  -1.192 0.233446    
## poly(age, best_degree)10      21.125     34.537   0.612 0.540806    
## education2. HS Grad            8.395      2.400   3.497 0.000477 ***
## education3. Some College      19.285      2.542   7.586 4.37e-14 ***
## education4. College Grad      33.436      2.533  13.200  < 2e-16 ***
## education5. Advanced Degree   57.263      2.750  20.823  < 2e-16 ***
## health_ins2. No              -16.907      1.428 -11.843  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.44 on 2984 degrees of freedom
## Multiple R-squared:  0.322,  Adjusted R-squared:  0.3186 
## F-statistic:  94.5 on 15 and 2984 DF,  p-value: < 2.2e-16

The model containing all variables (but year and region):

## 
## Call:
## lm(formula = wage ~ poly(age, best_degree) + maritl + race + 
##     education + jobclass + health + health_ins, data = Wage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -108.222  -18.869   -3.539   13.814  210.673 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   80.6493     2.8269  28.529  < 2e-16 ***
## poly(age, best_degree)1      221.4768    39.7418   5.573 2.73e-08 ***
## poly(age, best_degree)2     -239.0219    36.1688  -6.609 4.58e-11 ***
## poly(age, best_degree)3       19.7752    34.3694   0.575 0.565083    
## poly(age, best_degree)4       18.9340    34.0647   0.556 0.578373    
## poly(age, best_degree)5      -56.5120    34.1273  -1.656 0.097844 .  
## poly(age, best_degree)6       22.2221    34.0107   0.653 0.513558    
## poly(age, best_degree)7       58.4591    33.9462   1.722 0.085153 .  
## poly(age, best_degree)8      -18.1509    33.9107  -0.535 0.592513    
## poly(age, best_degree)9      -43.2739    33.9460  -1.275 0.202485    
## poly(age, best_degree)10      29.5938    33.9694   0.871 0.383720    
## maritl2. Married              13.1756     1.8234   7.226 6.30e-13 ***
## maritl3. Widowed              -1.0106     7.9921  -0.126 0.899380    
## maritl4. Divorced             -0.1244     2.9415  -0.042 0.966275    
## maritl5. Separated             7.6217     4.8681   1.566 0.117534    
## race2. Black                  -4.4729     2.1425  -2.088 0.036909 *  
## race3. Asian                  -2.3087     2.5944  -0.890 0.373599    
## race4. Other                  -5.2105     5.6520  -0.922 0.356656    
## education2. HS Grad            7.9729     2.3702   3.364 0.000779 ***
## education3. Some College      18.4320     2.5220   7.309 3.45e-13 ***
## education4. College Grad      31.0826     2.5484  12.197  < 2e-16 ***
## education5. Advanced Degree   53.6361     2.8079  19.102  < 2e-16 ***
## jobclass2. Information         3.3955     1.3202   2.572 0.010160 *  
## health2. >=Very Good           6.3416     1.4175   4.474 7.97e-06 ***
## health_ins2. No              -16.3431     1.4119 -11.575  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.84 on 2975 degrees of freedom
## Multiple R-squared:  0.3476, Adjusted R-squared:  0.3424 
## F-statistic: 66.06 on 24 and 2975 DF,  p-value: < 2.2e-16

And for comparison, the polynomial age model:

## 
## Call:
## lm(formula = wage ~ poly(age, best_degree), data = Wage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -100.38  -24.45   -4.97   15.49  199.61 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               111.7036     0.7283 153.369  < 2e-16 ***
## poly(age, best_degree)1   447.0679    39.8924  11.207  < 2e-16 ***
## poly(age, best_degree)2  -478.3158    39.8924 -11.990  < 2e-16 ***
## poly(age, best_degree)3   125.5217    39.8924   3.147  0.00167 ** 
## poly(age, best_degree)4   -77.9112    39.8924  -1.953  0.05091 .  
## poly(age, best_degree)5   -35.8129    39.8924  -0.898  0.36940    
## poly(age, best_degree)6    62.7077    39.8924   1.572  0.11607    
## poly(age, best_degree)7    50.5498    39.8924   1.267  0.20520    
## poly(age, best_degree)8   -11.2547    39.8924  -0.282  0.77787    
## poly(age, best_degree)9   -83.6918    39.8924  -2.098  0.03599 *  
## poly(age, best_degree)10    1.6240    39.8924   0.041  0.96753    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.89 on 2989 degrees of freedom
## Multiple R-squared:  0.08912,    Adjusted R-squared:  0.08607 
## F-statistic: 29.24 on 10 and 2989 DF,  p-value: < 2.2e-16

We can compare the adjusted \(R^2\) values to account for the differing numbers of predictors. Both extended models achieve a substantially higher adjusted \(R^2\) than the age-only polynomial, with the model using all variables the highest at 0.3424; it also has the lowest residual standard error, 33.84. Accounting for these additional variables therefore yields a clearly better model.

Exercise 7.9.10

This question relates to the College data set.

college <- College
  1. Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
library(caret)

set.seed(123)
train_test <- createDataPartition(college$Outstate, times = 1, p = 0.8, list=F) # indices of training set
train_data <- college[train_test,]
test_data <- college[-train_test,]  # held-out test set for part (c)

null <- lm(Outstate ~ 1, data=train_data)
all <- lm(Outstate ~ ., data=train_data)

forward <- step(null, direction = 'forward', scope = formula(all), trace = 0)
forward$call
## lm(formula = Outstate ~ Expend + Private + Room.Board + perc.alumni + 
##     PhD + Grad.Rate + S.F.Ratio + Personal + Terminal + F.Undergrad + 
##     Accept + Apps + Top10perc + Enroll, data = train_data)

Using forward stepwise selection, we see that we can use Expend + Private + Room.Board + perc.alumni + PhD + Grad.Rate + S.F.Ratio + Personal + Terminal + F.Undergrad + Accept + Apps + Top10perc + Enroll to predict Outstate. This is a total of 14 variables, so forward stepwise selection eliminates 3 of the 17 candidate predictors.

  1. Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.22-5
gam_model <- gam(Outstate ~ 
    Private +  # Include categorical variable as-is
    s(Expend) + s(Room.Board) + s(perc.alumni) + 
    s(PhD) + s(Grad.Rate) + s(S.F.Ratio) + s(Personal) + 
    s(Terminal) + s(F.Undergrad) + s(Accept) + s(Apps) + 
    s(Top10perc) + s(Enroll), 
    data = train_data)

par(mfrow=c(2,2))  # show four panels per figure
plot(gam_model, se=T, col='blue', scale=0)  # scale=0 (the default) lets each panel use its own y-axis range

Looking at the plots, the variables Expend, PhD, Grad.Rate, S.F.Ratio, Personal, Top10perc, and Enroll appear to show some evidence of a non-linear relationship with the response.

  1. Evaluate the model obtained on the test set, and explain the results obtained.
## GAM MSE: 3364854
## GAM R2: 0.7844119
## Stepwise Regression MSE: 3598986
## Stepwise Regression R2: 0.769411

We see that the performance is similar, with the GAM doing slightly better than the stepwise linear regression: its test MSE is about 6.5% lower and its test \(R^2\) is about 0.015 higher.
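
For reference, the test-set metrics above could be computed with a sketch along these lines (using the held-out test_data from the split in part (a); the exact code used is not shown):

gam_preds <- predict(gam_model, newdata = test_data)
step_preds <- predict(forward, newdata = test_data)

gam_mse <- mean((test_data$Outstate - gam_preds)^2)
step_mse <- mean((test_data$Outstate - step_preds)^2)

# Test-set R^2 computed as 1 - RSS/TSS
tss <- sum((test_data$Outstate - mean(test_data$Outstate))^2)
gam_r2 <- 1 - sum((test_data$Outstate - gam_preds)^2) / tss
step_r2 <- 1 - sum((test_data$Outstate - step_preds)^2) / tss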

  1. For which variables, if any, is there evidence of a non-linear relationship with the response?

As explained in part (b), the variables Expend, PhD, Grad.Rate, S.F.Ratio, Personal, Top10perc, and Enroll show evidence of a non-linear relationship based on the plots. We can look at the ANOVA for non-parametric effects below for further evidence:

summary(gam_model)
## 
## Call: gam(formula = Outstate ~ Private + s(Expend) + s(Room.Board) + 
##     s(perc.alumni) + s(PhD) + s(Grad.Rate) + s(S.F.Ratio) + s(Personal) + 
##     s(Terminal) + s(F.Undergrad) + s(Accept) + s(Apps) + s(Top10perc) + 
##     s(Enroll), data = train_data)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -6660.75 -1061.88    39.09  1157.13  7347.73 
## 
## (Dispersion Parameter for gaussian family taken to be 3056278)
## 
##     Null Deviance: 10155674653 on 622 degrees of freedom
## Residual Deviance: 1739021423 on 568.9997 degrees of freedom
## AIC: 11124.59 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                 Df     Sum Sq    Mean Sq  F value    Pr(>F)    
## Private          1 2769764849 2769764849 906.2541 < 2.2e-16 ***
## s(Expend)        1 2964900567 2964900567 970.1016 < 2.2e-16 ***
## s(Room.Board)    1  460884042  460884042 150.7991 < 2.2e-16 ***
## s(perc.alumni)   1  172202199  172202199  56.3438 2.361e-13 ***
## s(PhD)           1   59043260   59043260  19.3187 1.320e-05 ***
## s(Grad.Rate)     1   85916813   85916813  28.1116 1.641e-07 ***
## s(S.F.Ratio)     1   10049722   10049722   3.2882  0.070305 .  
## s(Personal)      1   30237464   30237464   9.8936  0.001745 ** 
## s(Terminal)      1   12726158   12726158   4.1639  0.041754 *  
## s(F.Undergrad)   1   28484009   28484009   9.3198  0.002373 ** 
## s(Accept)        1  103109412  103109412  33.7369 1.050e-08 ***
## s(Apps)          1    5061375    5061375   1.6561  0.198660    
## s(Top10perc)     1     582274     582274   0.1905  0.662652    
## s(Enroll)        1   22824450   22824450   7.4681  0.006476 ** 
## Residuals      569 1739021423    3056278                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                Npar Df  Npar F     Pr(F)    
## (Intercept)                                 
## Private                                     
## s(Expend)            3 25.5850 1.554e-15 ***
## s(Room.Board)        3  0.9495 0.4163201    
## s(perc.alumni)       3  1.5824 0.1925380    
## s(PhD)               3  3.6749 0.0121149 *  
## s(Grad.Rate)         3  3.2413 0.0217872 *  
## s(S.F.Ratio)         3  5.4614 0.0010481 ** 
## s(Personal)          3  5.6025 0.0008628 ***
## s(Terminal)          3  1.7095 0.1639260    
## s(F.Undergrad)       3  1.5038 0.2125030    
## s(Accept)            3  6.9335 0.0001370 ***
## s(Apps)              3  5.1229 0.0016704 ** 
## s(Top10perc)         3  1.4598 0.2245227    
## s(Enroll)            3  1.5761 0.1940774    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the ANOVA for non-parametric effects, there is evidence of a non-linear relationship for Expend, PhD, Grad.Rate, S.F.Ratio, Personal, Accept, and Apps. This agrees with some of the conclusions drawn from the plots, with a couple of exceptions: Top10perc and Enroll do not show significant non-linear effects after all, while Accept and Apps do. In fact, the parametric ANOVA suggests that Top10perc does not even have a statistically significant linear effect, whereas Enroll has a significant linear (but not non-linear) relationship.

We can thus conclude that there is evidence of non-linear relationships with the response for the following variables: Expend, PhD, Grad.Rate, S.F.Ratio, Personal, Accept, and Apps.