HW 3: Moving Beyond Linearity

1. Exercise 7.9.1:

Problem Statement

It was mentioned in the chapter that a cubic regression spline with one knot at \(\xi\) can be obtained using a basis of the form \(x, x^2, x^3, (x - \xi)_+^3\), where \((x - \xi)_+^3 = (x - \xi)^3\) if \(x > \xi\) and equals 0 otherwise. We will now show that a function of the form:

\[ f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x - \xi)_+^3 \]

is indeed a cubic regression spline, regardless of the values of \(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4\).

(a) Find a cubic polynomial

Define:

\[ f_1(x) = a_1 + b_1 x + c_1 x^2 + d_1 x^3 \]

such that \(f(x) = f_1(x)\) for all \(x \leq \xi\). Express \(a_1, b_1, c_1, d_1\) in terms of \(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4\).

Ans:

Since \(f(x)\) takes the form:

\[ f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x - \xi)_+^3 \]

For \(x \leq \xi\), the term \((x - \xi)_+^3 = 0\), so we obtain:

\[ f_1(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 \]

Thus,

\[ a_1 = \beta_0, \quad b_1 = \beta_1, \quad c_1 = \beta_2, \quad d_1 = \beta_3 \]

(b) Find a cubic polynomial

Define:

\[ f_2(x) = a_2 + b_2 x + c_2 x^2 + d_2 x^3 \]

such that \(f(x) = f_2(x)\) for all \(x > \xi\). Express \(a_2, b_2, c_2, d_2\) in terms of \(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4\).

Ans:

For \(x > \xi\), the function remains:

\[ f_2(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x - \xi)^3 \]

So:

\[ a_2 = \beta_0, \quad b_2 = \beta_1, \quad c_2 = \beta_2, \quad d_2 = \beta_3 + \beta_4 \]

Thus, \(f(x)\) is a piecewise polynomial.

(c) Show that \(f_1(\xi) = f_2(\xi)\).

That is, \(f(x)\) is continuous at \(\xi\).

Ans:

Since \(f_1(\xi)\) and \(f_2(\xi)\) should be equal,

\[ \beta_0 + \beta_1 \xi + \beta_2 \xi^2 + \beta_3 \xi^3 = \beta_0 + \beta_1 \xi + \beta_2 \xi^2 + \beta_3 \xi^3 + \beta_4 (\xi - \xi)^3 \]

Since \((\xi - \xi)^3 = 0\), equality holds, proving continuity.

(d) Show that \(f_1'(\xi) = f_2'(\xi)\).

That is, \(f(x)\) is continuously differentiable at \(\xi\).

Ans:

First derivative:

\[ f_1'(x) = \beta_1 + 2\beta_2 x + 3\beta_3 x^2 \]

\[ f_2'(x) = \beta_1 + 2\beta_2 x + 3\beta_3 x^2 + 3\beta_4 (x - \xi)^2 \]

Evaluating at \(x = \xi\),

\[ \beta_1 + 2\beta_2 \xi + 3\beta_3 \xi^2 = \beta_1 + 2\beta_2 \xi + 3\beta_3 \xi^2 + 3\beta_4 (0) \]

So the equality holds, proving \(f(x)\) is differentiable.

(e) Show that \(f_1''(\xi) = f_2''(\xi)\).

That is, \(f(x)\) is continuously twice differentiable at \(\xi\).

Ans:

Since the function satisfies continuity and differentiability conditions, it follows that \(f(x)\) is a cubic spline.

Second derivative:

\[ f_1''(x) = 2\beta_2 + 6\beta_3 x \]

\[ f_2''(x) = 2\beta_2 + 6\beta_3 x + 6\beta_4 (x - \xi) \]

At \(x = \xi\),

\[ 2\beta_2 + 6\beta_3 \xi = 2\beta_2 + 6\beta_3 \xi + 6\beta_4 (0) \]

So equality holds, proving \(f(x)\) is twice differentiable.

Thus, \(f(x)\) is a cubic regression spline.

2. Exercise 7.9.6: In this exercise, you will further analyze the Wage data set considered throughout this chapter.

a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

library(ISLR2)
attach(Wage)
library(boot)
## Warning: package 'boot' was built under R version 4.4.2
str(Wage)
## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...
head(Wage)
##        year age           maritl     race       education             region
## 231655 2006  18 1. Never Married 1. White    1. < HS Grad 2. Middle Atlantic
## 86582  2004  24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
## 161300 2003  45       2. Married 1. White 3. Some College 2. Middle Atlantic
## 155159 2003  43       2. Married 3. Asian 4. College Grad 2. Middle Atlantic
## 11443  2005  50      4. Divorced 1. White      2. HS Grad 2. Middle Atlantic
## 376662 2008  54       2. Married 1. White 4. College Grad 2. Middle Atlantic
##              jobclass         health health_ins  logwage      wage
## 231655  1. Industrial      1. <=Good      2. No 4.318063  75.04315
## 86582  2. Information 2. >=Very Good      2. No 4.255273  70.47602
## 161300  1. Industrial      1. <=Good     1. Yes 4.875061 130.98218
## 155159 2. Information 2. >=Very Good     1. Yes 5.041393 154.68529
## 11443  2. Information      1. <=Good     1. Yes 4.318063  75.04315
## 376662 2. Information 2. >=Very Good     1. Yes 4.845098 127.11574
summary(Wage)
##       year           age                     maritl           race     
##  Min.   :2003   Min.   :18.00   1. Never Married: 648   1. White:2480  
##  1st Qu.:2004   1st Qu.:33.75   2. Married      :2074   2. Black: 293  
##  Median :2006   Median :42.00   3. Widowed      :  19   3. Asian: 190  
##  Mean   :2006   Mean   :42.41   4. Divorced     : 204   4. Other:  37  
##  3rd Qu.:2008   3rd Qu.:51.00   5. Separated    :  55                  
##  Max.   :2009   Max.   :80.00                                          
##                                                                        
##               education                     region               jobclass   
##  1. < HS Grad      :268   2. Middle Atlantic   :3000   1. Industrial :1544  
##  2. HS Grad        :971   1. New England       :   0   2. Information:1456  
##  3. Some College   :650   3. East North Central:   0                        
##  4. College Grad   :685   4. West North Central:   0                        
##  5. Advanced Degree:426   5. South Atlantic    :   0                        
##                           6. East South Central:   0                        
##                           (Other)              :   0                        
##             health      health_ins      logwage           wage       
##  1. <=Good     : 858   1. Yes:2083   Min.   :3.000   Min.   : 20.09  
##  2. >=Very Good:2142   2. No : 917   1st Qu.:4.447   1st Qu.: 85.38  
##                                      Median :4.653   Median :104.92  
##                                      Mean   :4.654   Mean   :111.70  
##                                      3rd Qu.:4.857   3rd Qu.:128.68  
##                                      Max.   :5.763   Max.   :318.34  
## 
set.seed(0)
cv.error <- rep(0,10)
for (i in 1:10) {
  glm.fit <- glm(wage ~ poly(age, i), data = Wage)  # Fit polynomial regression
  cv.error[i] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]  # Compute test MSE
}
cv.error
##  [1] 1677.625 1599.790 1595.625 1594.940 1592.908 1594.775 1594.096 1596.071
##  [9] 1593.908 1593.211
plot(cv.error, type = "b", xlab = "Polynomial Degree", ylab = "Test MSE", main = "10-Fold Cross-Validation Error")
optimal_degree <- which.min(cv.error)
points(optimal_degree, cv.error[optimal_degree], col="red")

cat("Optimal polynomial degree selected by cross-validation:", optimal_degree)
## Optimal polynomial degree selected by cross-validation: 5
# ANOVA to compare polynomial regression models for degrees 1 to 10
fit_1 <- lm(wage ~ age, data = Wage)
fit_2 <- lm(wage ~ poly(age, 2), data = Wage)
fit_3 <- lm(wage ~ poly(age, 3), data = Wage)
fit_4 <- lm(wage ~ poly(age, 4), data = Wage)
fit_5 <- lm(wage ~ poly(age, 5), data = Wage)
fit_6 <- lm(wage ~ poly(age, 6), data = Wage)
fit_7 <- lm(wage ~ poly(age, 7), data = Wage)
fit_8 <- lm(wage ~ poly(age, 8), data = Wage)
fit_9 <- lm(wage ~ poly(age, 9), data = Wage)
fit_10 <- lm(wage ~ poly(age, 10), data = Wage)
anova(fit_1, fit_2, fit_3, fit_4, fit_5, fit_6, fit_7, fit_8, fit_9, fit_10)
## Analysis of Variance Table
## 
## Model  1: wage ~ age
## Model  2: wage ~ poly(age, 2)
## Model  3: wage ~ poly(age, 3)
## Model  4: wage ~ poly(age, 4)
## Model  5: wage ~ poly(age, 5)
## Model  6: wage ~ poly(age, 6)
## Model  7: wage ~ poly(age, 7)
## Model  8: wage ~ poly(age, 8)
## Model  9: wage ~ poly(age, 9)
## Model 10: wage ~ poly(age, 10)
##    Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1    2998 5022216                                    
## 2    2997 4793430  1    228786 143.7638 < 2.2e-16 ***
## 3    2996 4777674  1     15756   9.9005  0.001669 ** 
## 4    2995 4771604  1      6070   3.8143  0.050909 .  
## 5    2994 4770322  1      1283   0.8059  0.369398    
## 6    2993 4766389  1      3932   2.4709  0.116074    
## 7    2992 4763834  1      2555   1.6057  0.205199    
## 8    2991 4763707  1       127   0.0796  0.777865    
## 9    2990 4756703  1      7004   4.4014  0.035994 *  
## 10   2989 4756701  1         3   0.0017  0.967529    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Plot polynomial fit
x <- seq(min(age), max(age), length.out = 300)
pred <- predict(fit_5, newdata = list(age = x))
plot(age, wage, xlab = "Age", ylab = "Wage", main = "Polynomial Regression Fit (Degree = 5)")
lines(x, pred, col = "red", lwd = 2)

While cross validation chose optimal degree of 5, ANOVA chose 3 with p<0.05 (p.001669). This is likely due to CV minimizing test error while ANOVA analyzes individual term significance. As a result, even if individual terms aren’t significant, CV can favor higher degree models as long as they reduce prediction error.

b) Fit a step function to predict wage using age, and perform cross validation to choose the optimal number of cuts. Make a plot of the fit obtained.

set.seed(0)
num_cuts <- 2:10
cv.errors <- rep(0, length(num_cuts))
for (i in seq_along(num_cuts)) {
  Wage$age_cut <- cut(age, num_cuts[i])  # Create categorical bins
  fit <- glm(wage ~ age_cut, data = Wage)  # Fit step function
  cv.errors[i] <- cv.glm(Wage, fit, K = 10)$delta[1]  # Compute test MSE
}
optimal_cuts <- num_cuts[which.min(cv.errors)]
cat("Optimal number of cuts selected by cross-validation:", optimal_cuts, "\n")
## Optimal number of cuts selected by cross-validation: 8
Wage$age_cut <- cut(age, optimal_cuts)
final_fit <- lm(wage ~ age_cut, data = Wage)
preds <- predict(final_fit, newdata = data.frame(age_cut = cut(x, optimal_cuts)))
plot(age, wage, xlab = "Age", ylab = "Wage", main = paste("Step Function Fit (", optimal_cuts, " Cuts)", sep = ""))
lines(x, preds, col = "red", lwd = 2)

3. Exercise 7.9.7: The Wage data set contains a number of other features not explored in this chapter, such as marital status (maritl), job class (jobclass), and others. Explore the relationships between some of these other predictors and wage, and use non-linear ftting techniques in order to fit fexible models to the data. Create plots of the results obtained, and write a summary of your findings.

library(gam)
## Warning: package 'gam' was built under R version 4.4.3
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.22-5
summary(Wage)
##       year           age                     maritl           race     
##  Min.   :2003   Min.   :18.00   1. Never Married: 648   1. White:2480  
##  1st Qu.:2004   1st Qu.:33.75   2. Married      :2074   2. Black: 293  
##  Median :2006   Median :42.00   3. Widowed      :  19   3. Asian: 190  
##  Mean   :2006   Mean   :42.41   4. Divorced     : 204   4. Other:  37  
##  3rd Qu.:2008   3rd Qu.:51.00   5. Separated    :  55                  
##  Max.   :2009   Max.   :80.00                                          
##                                                                        
##               education                     region               jobclass   
##  1. < HS Grad      :268   2. Middle Atlantic   :3000   1. Industrial :1544  
##  2. HS Grad        :971   1. New England       :   0   2. Information:1456  
##  3. Some College   :650   3. East North Central:   0                        
##  4. College Grad   :685   4. West North Central:   0                        
##  5. Advanced Degree:426   5. South Atlantic    :   0                        
##                           6. East South Central:   0                        
##                           (Other)              :   0                        
##             health      health_ins      logwage           wage       
##  1. <=Good     : 858   1. Yes:2083   Min.   :3.000   Min.   : 20.09  
##  2. >=Very Good:2142   2. No : 917   1st Qu.:4.447   1st Qu.: 85.38  
##                                      Median :4.653   Median :104.92  
##                                      Mean   :4.654   Mean   :111.70  
##                                      3rd Qu.:4.857   3rd Qu.:128.68  
##                                      Max.   :5.763   Max.   :318.34  
##                                                                      
##         age_cut   
##  (41.2,49]  :728  
##  (33.5,41.2]:671  
##  (25.8,33.5]:519  
##  (49,56.8]  :503  
##  (56.8,64.5]:276  
##  (17.9,25.8]:231  
##  (Other)    : 72
gam_age <- gam(wage ~ s(age, 4), data = Wage)

# Wage and Marital Status
plot(maritl, wage)

gam_maritl <- gam(wage ~ s(age, 4) + maritl, data = Wage)
pred_never_married <- predict(gam_maritl, newdata = data.frame(age = x, maritl = "1. Never Married"))
pred_married <- predict(gam_maritl, newdata = data.frame(age = x, maritl = "2. Married"))
plot(age, wage, xlab = "Age", ylab = "Wage", main = "Effect of Age & Marital Status on Wage")
lines(x, pred_never_married, col = "red", lwd = 2, lty = 1)
lines(x, pred_married, col = "blue", lwd = 2, lty = 2) 
legend("topright", legend = c("Never Married", "Married"), col = c("red", "blue"), lty = c(1, 2), lwd = 2)

summary(gam_maritl)
## 
## Call: gam(formula = wage ~ s(age, 4) + maritl, data = Wage)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -102.70  -23.32   -4.30   14.62  208.45 
## 
## (Dispersion Parameter for gaussian family taken to be 1553.718)
## 
##     Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 4647171 on 2991 degrees of freedom
## AIC: 30569.84 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## s(age, 4)    1  199870  199870 128.640 < 2.2e-16 ***
## maritl       4  136614   34154  21.982 < 2.2e-16 ***
## Residuals 2991 4647171    1554                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##             Npar Df Npar F     Pr(F)    
## (Intercept)                             
## s(age, 4)         3  32.71 < 2.2e-16 ***
## maritl                                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Wage and Job Class
plot(jobclass, wage)

gam_jobclass <- gam(wage ~ s(age, 4) + jobclass, data = Wage)
pred_industrial <- predict(gam_jobclass, newdata = data.frame(age = x, jobclass = "1. Industrial"))
pred_information <- predict(gam_jobclass, newdata = data.frame(age = x, jobclass = "2. Information"))
plot(age, wage, xlab = "Age", ylab = "Wage", main = "Effect of Age & Job Class on Wage")
lines(x, pred_industrial, col = "red", lwd = 2, lty = 1)
lines(x, pred_information, col = "blue", lwd = 2, lty = 2) 
legend("topright", legend = c("Industrial", "Information"), col = c("red", "blue"), lty = c(1, 2), lwd = 2)

summary(gam_jobclass)
## 
## Call: gam(formula = wage ~ s(age, 4) + jobclass, data = Wage)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -106.923  -23.348   -5.014   15.968  196.736 
## 
## (Dispersion Parameter for gaussian family taken to be 1537.282)
## 
##     Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 4602621 on 2994 degrees of freedom
## AIC: 30534.94 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## s(age, 4)    1  199870  199870  130.01 < 2.2e-16 ***
## jobclass     1  169973  169973  110.57 < 2.2e-16 ***
## Residuals 2994 4602621    1537                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##             Npar Df Npar F     Pr(F)    
## (Intercept)                             
## s(age, 4)         3 50.129 < 2.2e-16 ***
## jobclass                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Final plots
par(mfrow = c(3,3))
plot(gam_age, se = TRUE, col = "blue")
plot(gam_maritl, se = TRUE, col = "blue")
plot(gam_jobclass, se = TRUE, col = "blue")

When analyzing wage by marital status, never married individuals tend to have lower median wages compared to married individuals. And for wage by job class, industrial jobs tend to have lower median wage compared to information jobs. ANOVA hypothesis testing shows that adding marital status and job class as categorical variables to the GAM significantly improves the smoothing spline for age.

4. Exercise 7.9.10: This question relates to the College data set.

a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

library(leaps)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
## 
##     melanoma
data(College)
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...
head(College)
##                              Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University     Yes 1660   1232    721        23        52
## Adelphi University               Yes 2186   1924    512        16        29
## Adrian College                   Yes 1428   1097    336        22        50
## Agnes Scott College              Yes  417    349    137        60        89
## Alaska Pacific University        Yes  193    146     55        16        44
## Albertson College                Yes  587    479    158        38        62
##                              F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University        2885         537     7440       3300   450
## Adelphi University                  2683        1227    12280       6450   750
## Adrian College                      1036          99    11250       3750   400
## Agnes Scott College                  510          63    12960       5450   450
## Alaska Pacific University            249         869     7560       4120   800
## Albertson College                    678          41    13500       3335   500
##                              Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University     2200  70       78      18.1          12   7041
## Adelphi University               1500  29       30      12.2          16  10527
## Adrian College                   1165  53       66      12.9          30   8735
## Agnes Scott College               875  92       97       7.7          37  19016
## Alaska Pacific University        1500  76       72      11.9           2  10922
## Albertson College                 675  67       73       9.4          11   9727
##                              Grad.Rate
## Abilene Christian University        60
## Adelphi University                  56
## Adrian College                      54
## Agnes Scott College                 59
## Alaska Pacific University           15
## Albertson College                   55
summary(College)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00
set.seed(0)
train_index <- createDataPartition(College$Outstate, p = 0.7, list = FALSE)
train_data <- College[train_index, ]
test_data <- College[-train_index, ]
forward_fit <- regsubsets(Outstate ~ ., data = train_data, nvmax = ncol(train_data) - 1, method = "forward")
summary_forward <- summary(forward_fit)
summary_forward$outmat
##           PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 )  " "        " "  " "    " "    " "       " "       " "        
## 2  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 3  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 4  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 5  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 6  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 7  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 8  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 9  ( 1 )  "*"        " "  "*"    " "    " "       " "       " "        
## 10  ( 1 ) "*"        " "  "*"    " "    " "       " "       "*"        
## 11  ( 1 ) "*"        "*"  "*"    " "    " "       " "       "*"        
## 12  ( 1 ) "*"        "*"  "*"    " "    "*"       " "       "*"        
## 13  ( 1 ) "*"        "*"  "*"    " "    "*"       " "       "*"        
## 14  ( 1 ) "*"        "*"  "*"    " "    "*"       " "       "*"        
## 15  ( 1 ) "*"        "*"  "*"    " "    "*"       " "       "*"        
## 16  ( 1 ) "*"        "*"  "*"    "*"    "*"       " "       "*"        
## 17  ( 1 ) "*"        "*"  "*"    "*"    "*"       "*"       "*"        
##           P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 )  " "         " "        " "   " "      " " " "      " "      
## 2  ( 1 )  " "         " "        " "   " "      " " " "      " "      
## 3  ( 1 )  " "         "*"        " "   " "      " " " "      " "      
## 4  ( 1 )  " "         "*"        " "   " "      " " " "      " "      
## 5  ( 1 )  " "         "*"        " "   " "      "*" " "      " "      
## 6  ( 1 )  " "         "*"        " "   " "      "*" " "      " "      
## 7  ( 1 )  " "         "*"        " "   "*"      "*" " "      " "      
## 8  ( 1 )  " "         "*"        " "   "*"      "*" " "      "*"      
## 9  ( 1 )  " "         "*"        " "   "*"      "*" " "      "*"      
## 10  ( 1 ) " "         "*"        " "   "*"      "*" " "      "*"      
## 11  ( 1 ) " "         "*"        " "   "*"      "*" " "      "*"      
## 12  ( 1 ) " "         "*"        " "   "*"      "*" " "      "*"      
## 13  ( 1 ) " "         "*"        " "   "*"      "*" "*"      "*"      
## 14  ( 1 ) " "         "*"        "*"   "*"      "*" "*"      "*"      
## 15  ( 1 ) "*"         "*"        "*"   "*"      "*" "*"      "*"      
## 16  ( 1 ) "*"         "*"        "*"   "*"      "*" "*"      "*"      
## 17  ( 1 ) "*"         "*"        "*"   "*"      "*" "*"      "*"      
##           perc.alumni Expend Grad.Rate
## 1  ( 1 )  " "         "*"    " "      
## 2  ( 1 )  " "         "*"    " "      
## 3  ( 1 )  " "         "*"    " "      
## 4  ( 1 )  "*"         "*"    " "      
## 5  ( 1 )  "*"         "*"    " "      
## 6  ( 1 )  "*"         "*"    "*"      
## 7  ( 1 )  "*"         "*"    "*"      
## 8  ( 1 )  "*"         "*"    "*"      
## 9  ( 1 )  "*"         "*"    "*"      
## 10  ( 1 ) "*"         "*"    "*"      
## 11  ( 1 ) "*"         "*"    "*"      
## 12  ( 1 ) "*"         "*"    "*"      
## 13  ( 1 ) "*"         "*"    "*"      
## 14  ( 1 ) "*"         "*"    "*"      
## 15  ( 1 ) "*"         "*"    "*"      
## 16  ( 1 ) "*"         "*"    "*"      
## 17  ( 1 ) "*"         "*"    "*"
best_cp <- which.min(summary_forward$cp)
best_bic <- which.min(summary_forward$bic)
best_adjr2 <- which.max(summary_forward$adjr2)
cat("Best model based on Cp:", best_cp) #13
## Best model based on Cp: 13
cat("Best model based on BIC:", best_bic) #12
## Best model based on BIC: 12
cat("Best model based on adjusted R-squared:", best_adjr2) #13
## Best model based on adjusted R-squared: 13
best_model_size <- 13
selected_vars <- names(coef(forward_fit, best_model_size))[-1]  # Exclude intercept
cat("Best Model Size:", best_model_size, "\n")
## Best Model Size: 13
cat("Selected Predictors:", paste(selected_vars, collapse = ", "))
## Selected Predictors: PrivateYes, Apps, Accept, Top10perc, F.Undergrad, Room.Board, Personal, PhD, Terminal, S.F.Ratio, perc.alumni, Expend, Grad.Rate
coef(forward_fit, 13)
##   (Intercept)    PrivateYes          Apps        Accept     Top10perc 
## -1696.0494705  2281.0324069    -0.2745795     0.6908079    23.0013881 
##   F.Undergrad    Room.Board      Personal           PhD      Terminal 
##    -0.1636180     0.9336671    -0.2799977    22.7913716    19.3862731 
##     S.F.Ratio   perc.alumni        Expend     Grad.Rate 
##   -59.5472959    39.7135485     0.1663588    19.3178985

b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your fndings.

gam_model <- gam(Outstate ~ Private + s(Apps) + s(Accept) + s(Top10perc) + s(F.Undergrad) + 
                 s(Room.Board) + s(Personal) + s(PhD) + s(Terminal) + s(S.F.Ratio) + 
                 s(perc.alumni) + s(Expend) + s(Grad.Rate), 
                 data = train_data)

par(mfrow = c(3,3))
plot(gam_model, se = TRUE, col = "blue")

Based on the plots, many of the predictors appear to have non-linear relationships with out of state tuition. In general, the confidence interval is wider at extremes of the predictors, where we expect more variance. For private, Yes is above 0 meaning private schools charge higher tuition while No is below 0 meaning public schools charge less tuition. Out of state tuition appears to increase with number of accepted applications, percentage of students in the 10% of their HS class, room and board costs, percentage of faculty with a PhD or terminal degree, student faculty ratio, percentage of alumni who donate to the school, and instructional expenditure per student. Tuition decreases with number of applications received by the college, number of full time undergraduate students, and personal expenses for students. Graduation rate seems to have increasing effect on tuition up until about 80-90% grduation rate, at which it is associated with decreasing effect on tuition.

c) Evaluate the model obtained on the test set, and explain the results obtained.

test_predictions <- predict(gam_model, newdata = test_data)
mse <- mean((test_predictions - test_data$Outstate)^2)
rmse <- sqrt(mse)
sst <- sum((test_data$Outstate - mean(test_data$Outstate))^2)
sse <- sum((test_data$Outstate - test_predictions)^2)
r_squared <- 1 - (sse / sst)

cat("GAM Model Evaluation Metrics:\n")
## GAM Model Evaluation Metrics:
cat("MSE:", mse, "\n")
## MSE: 3517712
cat("RMSE:", rmse, "\n")
## RMSE: 1875.556
cat("R²:", r_squared, "\n\n")
## R²: 0.7949705
lm_model <- lm(Outstate ~ Private + Apps + Accept + Top10perc + F.Undergrad + 
               Room.Board + Personal + PhD + Terminal + S.F.Ratio + 
               perc.alumni + Expend + Grad.Rate, data = train_data)
lm_predictions <- predict(lm_model, newdata = test_data)
lm_mse <- mean((lm_predictions - test_data$Outstate)^2)
lm_rmse <- sqrt(lm_mse)
lm_r_squared <- 1 - sum((test_data$Outstate - lm_predictions)^2) / sum((test_data$Outstate - mean(test_data$Outstate))^2)

cat("Linear Model Performance:\n")
## Linear Model Performance:
cat("MSE:", lm_mse, "\n")
## MSE: 4210025
cat("RMSE:", lm_rmse, "\n")
## RMSE: 2051.834
cat("R²:", lm_r_squared)
## R²: 0.7546191

The GAM is on average around $1875 off from the true tuition values, and it explains 79.5% of the variance in tuition. On the other hand, the linear model is off by around $2050 and explains 75% of variance in tuition. Overall, both models appear to perform similarly well, with GAM slightly outperforming. I could adjust predictors used, add interaction terms and feature engineer, or adjust degrees of freedom. It’s also possible that the nonlinear relationship among some predictors with out of state tuition are not as significant as we thought.

d) For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(gam_model)
## 
## Call: gam(formula = Outstate ~ Private + s(Apps) + s(Accept) + s(Top10perc) + 
##     s(F.Undergrad) + s(Room.Board) + s(Personal) + s(PhD) + s(Terminal) + 
##     s(S.F.Ratio) + s(perc.alumni) + s(Expend) + s(Grad.Rate), 
##     data = train_data)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -5990.09 -1062.62    70.67  1170.20  4649.34 
## 
## (Dispersion Parameter for gaussian family taken to be 3043346)
## 
##     Null Deviance: 8595716956 on 545 degrees of freedom
## Residual Deviance: 1509500146 on 496.0002 degrees of freedom
## AIC: 9749.985 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                 Df     Sum Sq    Mean Sq  F value    Pr(>F)    
## Private          1 2323483405 2323483405 763.4635 < 2.2e-16 ***
## s(Apps)          1  983626768  983626768 323.2057 < 2.2e-16 ***
## s(Accept)        1  126560887  126560887  41.5861 2.681e-10 ***
## s(Top10perc)     1  755720880  755720880 248.3191 < 2.2e-16 ***
## s(F.Undergrad)   1  295817305  295817305  97.2013 < 2.2e-16 ***
## s(Room.Board)    1  585353023  585353023 192.3386 < 2.2e-16 ***
## s(Personal)      1   22992768   22992768   7.5551  0.006202 ** 
## s(PhD)           1  141201515  141201515  46.3968 2.804e-11 ***
## s(Terminal)      1   18641902   18641902   6.1255  0.013658 *  
## s(S.F.Ratio)     1   90552710   90552710  29.7543 7.752e-08 ***
## s(perc.alumni)   1  111111669  111111669  36.5097 2.986e-09 ***
## s(Expend)        1  387424127  387424127 127.3020 < 2.2e-16 ***
## s(Grad.Rate)     1   25447409   25447409   8.3617  0.004000 ** 
## Residuals      496 1509500146    3043346                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                Npar Df  Npar F     Pr(F)    
## (Intercept)                                 
## Private                                     
## s(Apps)              3  3.7798  0.010578 *  
## s(Accept)            3  4.5449  0.003734 ** 
## s(Top10perc)         3  1.2847  0.278908    
## s(F.Undergrad)       3  2.9724  0.031401 *  
## s(Room.Board)        3  0.7982  0.495266    
## s(Personal)          3  1.8158  0.143309    
## s(PhD)               3  2.5365  0.056062 .  
## s(Terminal)          3  1.1257  0.338134    
## s(S.F.Ratio)         3  4.6471  0.003246 ** 
## s(perc.alumni)       3  1.7353  0.158818    
## s(Expend)            3 18.6839 1.695e-11 ***
## s(Grad.Rate)         3  2.8632  0.036331 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova for nonparametric effects shows a strong evidence of non-linear relationship between Apps, Accept, F.Undergrad, S.F.Ratio, Expend, and Grad.Rate with the response of out of state tuition with p<0.05.