It was mentioned in the chapter that a cubic regression spline with one knot at \(\xi\) can be obtained using a basis of the form \(x, x^2, x^3, (x - \xi)_+^3\), where \((x - \xi)_+^3 = (x - \xi)^3\) if \(x > \xi\) and equals 0 otherwise. We will now show that a function of the form:
\[ f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x - \xi)_+^3 \]
is indeed a cubic regression spline, regardless of the values of \(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4\).
Define:
\[ f_1(x) = a_1 + b_1 x + c_1 x^2 + d_1 x^3 \]
such that \(f(x) = f_1(x)\) for all \(x \leq \xi\). Express \(a_1, b_1, c_1, d_1\) in terms of \(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4\).
Since \(f(x)\) takes the form:
\[ f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x - \xi)_+^3 \]
For \(x \leq \xi\), the term \((x - \xi)_+^3 = 0\), so we obtain:
\[ f_1(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 \]
Thus,
\[ a_1 = \beta_0, \quad b_1 = \beta_1, \quad c_1 = \beta_2, \quad d_1 = \beta_3 \]
Define:
\[ f_2(x) = a_2 + b_2 x + c_2 x^2 + d_2 x^3 \]
such that \(f(x) = f_2(x)\) for all \(x > \xi\). Express \(a_2, b_2, c_2, d_2\) in terms of \(\beta_0, \beta_1, \beta_2, \beta_3, \beta_4\).
For \(x > \xi\), the function remains:
\[ f_2(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x - \xi)^3 \]
So:
\[ a_2 = \beta_0, \quad b_2 = \beta_1, \quad c_2 = \beta_2, \quad d_2 = \beta_3 + \beta_4 \]
Thus, \(f(x)\) is a piecewise polynomial.
That is, \(f(x)\) is continuous at \(\xi\).
Since \(f_1(\xi)\) and \(f_2(\xi)\) should be equal,
\[ \beta_0 + \beta_1 \xi + \beta_2 \xi^2 + \beta_3 \xi^3 = \beta_0 + \beta_1 \xi + \beta_2 \xi^2 + \beta_3 \xi^3 + \beta_4 (\xi - \xi)^3 \]
Since \((\xi - \xi)^3 = 0\), equality holds, proving continuity.
That is, \(f(x)\) is continuously differentiable at \(\xi\).
First derivative:
\[ f_1'(x) = \beta_1 + 2\beta_2 x + 3\beta_3 x^2 \]
\[ f_2'(x) = \beta_1 + 2\beta_2 x + 3\beta_3 x^2 + 3\beta_4 (x - \xi)^2 \]
Evaluating at \(x = \xi\),
\[ \beta_1 + 2\beta_2 \xi + 3\beta_3 \xi^2 = \beta_1 + 2\beta_2 \xi + 3\beta_3 \xi^2 + 3\beta_4 (0) \]
So the equality holds, proving \(f(x)\) is differentiable.
That is, \(f(x)\) is continuously twice differentiable at \(\xi\).
Since the function satisfies continuity and differentiability conditions, it follows that \(f(x)\) is a cubic spline.
Second derivative:
\[ f_1''(x) = 2\beta_2 + 6\beta_3 x \]
\[ f_2''(x) = 2\beta_2 + 6\beta_3 x + 6\beta_4 (x - \xi) \]
At \(x = \xi\),
\[ 2\beta_2 + 6\beta_3 \xi = 2\beta_2 + 6\beta_3 \xi + 6\beta_4 (0) \]
So equality holds, proving \(f(x)\) is twice differentiable.
Thus, \(f(x)\) is a cubic regression spline.
library(ISLR2)
attach(Wage)
library(boot)
## Warning: package 'boot' was built under R version 4.4.2
str(Wage)
## 'data.frame': 3000 obs. of 11 variables:
## $ year : int 2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
## $ age : int 18 24 45 43 50 54 44 30 41 52 ...
## $ maritl : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
## $ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
## $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
## $ region : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ jobclass : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ health : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
## $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
## $ logwage : num 4.32 4.26 4.88 5.04 4.32 ...
## $ wage : num 75 70.5 131 154.7 75 ...
head(Wage)
## year age maritl race education region
## 231655 2006 18 1. Never Married 1. White 1. < HS Grad 2. Middle Atlantic
## 86582 2004 24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
## 161300 2003 45 2. Married 1. White 3. Some College 2. Middle Atlantic
## 155159 2003 43 2. Married 3. Asian 4. College Grad 2. Middle Atlantic
## 11443 2005 50 4. Divorced 1. White 2. HS Grad 2. Middle Atlantic
## 376662 2008 54 2. Married 1. White 4. College Grad 2. Middle Atlantic
## jobclass health health_ins logwage wage
## 231655 1. Industrial 1. <=Good 2. No 4.318063 75.04315
## 86582 2. Information 2. >=Very Good 2. No 4.255273 70.47602
## 161300 1. Industrial 1. <=Good 1. Yes 4.875061 130.98218
## 155159 2. Information 2. >=Very Good 1. Yes 5.041393 154.68529
## 11443 2. Information 1. <=Good 1. Yes 4.318063 75.04315
## 376662 2. Information 2. >=Very Good 1. Yes 4.845098 127.11574
summary(Wage)
## year age maritl race
## Min. :2003 Min. :18.00 1. Never Married: 648 1. White:2480
## 1st Qu.:2004 1st Qu.:33.75 2. Married :2074 2. Black: 293
## Median :2006 Median :42.00 3. Widowed : 19 3. Asian: 190
## Mean :2006 Mean :42.41 4. Divorced : 204 4. Other: 37
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
## Max. :2009 Max. :80.00
##
## education region jobclass
## 1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544
## 2. HS Grad :971 1. New England : 0 2. Information:1456
## 3. Some College :650 3. East North Central: 0
## 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## health health_ins logwage wage
## 1. <=Good : 858 1. Yes:2083 Min. :3.000 Min. : 20.09
## 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447 1st Qu.: 85.38
## Median :4.653 Median :104.92
## Mean :4.654 Mean :111.70
## 3rd Qu.:4.857 3rd Qu.:128.68
## Max. :5.763 Max. :318.34
##
set.seed(0)
cv.error <- rep(0,10)
for (i in 1:10) {
glm.fit <- glm(wage ~ poly(age, i), data = Wage) # Fit polynomial regression
cv.error[i] <- cv.glm(Wage, glm.fit, K = 10)$delta[1] # Compute test MSE
}
cv.error
## [1] 1677.625 1599.790 1595.625 1594.940 1592.908 1594.775 1594.096 1596.071
## [9] 1593.908 1593.211
plot(cv.error, type = "b", xlab = "Polynomial Degree", ylab = "Test MSE", main = "10-Fold Cross-Validation Error")
optimal_degree <- which.min(cv.error)
points(optimal_degree, cv.error[optimal_degree], col="red")
cat("Optimal polynomial degree selected by cross-validation:", optimal_degree)
## Optimal polynomial degree selected by cross-validation: 5
# ANOVA to compare polynomial regression models for degrees 1 to 10
fit_1 <- lm(wage ~ age, data = Wage)
fit_2 <- lm(wage ~ poly(age, 2), data = Wage)
fit_3 <- lm(wage ~ poly(age, 3), data = Wage)
fit_4 <- lm(wage ~ poly(age, 4), data = Wage)
fit_5 <- lm(wage ~ poly(age, 5), data = Wage)
fit_6 <- lm(wage ~ poly(age, 6), data = Wage)
fit_7 <- lm(wage ~ poly(age, 7), data = Wage)
fit_8 <- lm(wage ~ poly(age, 8), data = Wage)
fit_9 <- lm(wage ~ poly(age, 9), data = Wage)
fit_10 <- lm(wage ~ poly(age, 10), data = Wage)
anova(fit_1, fit_2, fit_3, fit_4, fit_5, fit_6, fit_7, fit_8, fit_9, fit_10)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Model 6: wage ~ poly(age, 6)
## Model 7: wage ~ poly(age, 7)
## Model 8: wage ~ poly(age, 8)
## Model 9: wage ~ poly(age, 9)
## Model 10: wage ~ poly(age, 10)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.7638 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.9005 0.001669 **
## 4 2995 4771604 1 6070 3.8143 0.050909 .
## 5 2994 4770322 1 1283 0.8059 0.369398
## 6 2993 4766389 1 3932 2.4709 0.116074
## 7 2992 4763834 1 2555 1.6057 0.205199
## 8 2991 4763707 1 127 0.0796 0.777865
## 9 2990 4756703 1 7004 4.4014 0.035994 *
## 10 2989 4756701 1 3 0.0017 0.967529
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Plot polynomial fit
x <- seq(min(age), max(age), length.out = 300)
pred <- predict(fit_5, newdata = list(age = x))
plot(age, wage, xlab = "Age", ylab = "Wage", main = "Polynomial Regression Fit (Degree = 5)")
lines(x, pred, col = "red", lwd = 2)
While cross validation chose optimal degree of 5, ANOVA chose 3 with p<0.05 (p.001669). This is likely due to CV minimizing test error while ANOVA analyzes individual term significance. As a result, even if individual terms aren’t significant, CV can favor higher degree models as long as they reduce prediction error.
set.seed(0)
num_cuts <- 2:10
cv.errors <- rep(0, length(num_cuts))
for (i in seq_along(num_cuts)) {
Wage$age_cut <- cut(age, num_cuts[i]) # Create categorical bins
fit <- glm(wage ~ age_cut, data = Wage) # Fit step function
cv.errors[i] <- cv.glm(Wage, fit, K = 10)$delta[1] # Compute test MSE
}
optimal_cuts <- num_cuts[which.min(cv.errors)]
cat("Optimal number of cuts selected by cross-validation:", optimal_cuts, "\n")
## Optimal number of cuts selected by cross-validation: 8
Wage$age_cut <- cut(age, optimal_cuts)
final_fit <- lm(wage ~ age_cut, data = Wage)
preds <- predict(final_fit, newdata = data.frame(age_cut = cut(x, optimal_cuts)))
plot(age, wage, xlab = "Age", ylab = "Wage", main = paste("Step Function Fit (", optimal_cuts, " Cuts)", sep = ""))
lines(x, preds, col = "red", lwd = 2)
library(gam)
## Warning: package 'gam' was built under R version 4.4.3
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.22-5
summary(Wage)
## year age maritl race
## Min. :2003 Min. :18.00 1. Never Married: 648 1. White:2480
## 1st Qu.:2004 1st Qu.:33.75 2. Married :2074 2. Black: 293
## Median :2006 Median :42.00 3. Widowed : 19 3. Asian: 190
## Mean :2006 Mean :42.41 4. Divorced : 204 4. Other: 37
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
## Max. :2009 Max. :80.00
##
## education region jobclass
## 1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544
## 2. HS Grad :971 1. New England : 0 2. Information:1456
## 3. Some College :650 3. East North Central: 0
## 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## health health_ins logwage wage
## 1. <=Good : 858 1. Yes:2083 Min. :3.000 Min. : 20.09
## 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447 1st Qu.: 85.38
## Median :4.653 Median :104.92
## Mean :4.654 Mean :111.70
## 3rd Qu.:4.857 3rd Qu.:128.68
## Max. :5.763 Max. :318.34
##
## age_cut
## (41.2,49] :728
## (33.5,41.2]:671
## (25.8,33.5]:519
## (49,56.8] :503
## (56.8,64.5]:276
## (17.9,25.8]:231
## (Other) : 72
gam_age <- gam(wage ~ s(age, 4), data = Wage)
# Wage and Marital Status
plot(maritl, wage)
gam_maritl <- gam(wage ~ s(age, 4) + maritl, data = Wage)
pred_never_married <- predict(gam_maritl, newdata = data.frame(age = x, maritl = "1. Never Married"))
pred_married <- predict(gam_maritl, newdata = data.frame(age = x, maritl = "2. Married"))
plot(age, wage, xlab = "Age", ylab = "Wage", main = "Effect of Age & Marital Status on Wage")
lines(x, pred_never_married, col = "red", lwd = 2, lty = 1)
lines(x, pred_married, col = "blue", lwd = 2, lty = 2)
legend("topright", legend = c("Never Married", "Married"), col = c("red", "blue"), lty = c(1, 2), lwd = 2)
summary(gam_maritl)
##
## Call: gam(formula = wage ~ s(age, 4) + maritl, data = Wage)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -102.70 -23.32 -4.30 14.62 208.45
##
## (Dispersion Parameter for gaussian family taken to be 1553.718)
##
## Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 4647171 on 2991 degrees of freedom
## AIC: 30569.84
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(age, 4) 1 199870 199870 128.640 < 2.2e-16 ***
## maritl 4 136614 34154 21.982 < 2.2e-16 ***
## Residuals 2991 4647171 1554
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(age, 4) 3 32.71 < 2.2e-16 ***
## maritl
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Wage and Job Class
plot(jobclass, wage)
gam_jobclass <- gam(wage ~ s(age, 4) + jobclass, data = Wage)
pred_industrial <- predict(gam_jobclass, newdata = data.frame(age = x, jobclass = "1. Industrial"))
pred_information <- predict(gam_jobclass, newdata = data.frame(age = x, jobclass = "2. Information"))
plot(age, wage, xlab = "Age", ylab = "Wage", main = "Effect of Age & Job Class on Wage")
lines(x, pred_industrial, col = "red", lwd = 2, lty = 1)
lines(x, pred_information, col = "blue", lwd = 2, lty = 2)
legend("topright", legend = c("Industrial", "Information"), col = c("red", "blue"), lty = c(1, 2), lwd = 2)
summary(gam_jobclass)
##
## Call: gam(formula = wage ~ s(age, 4) + jobclass, data = Wage)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -106.923 -23.348 -5.014 15.968 196.736
##
## (Dispersion Parameter for gaussian family taken to be 1537.282)
##
## Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 4602621 on 2994 degrees of freedom
## AIC: 30534.94
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(age, 4) 1 199870 199870 130.01 < 2.2e-16 ***
## jobclass 1 169973 169973 110.57 < 2.2e-16 ***
## Residuals 2994 4602621 1537
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(age, 4) 3 50.129 < 2.2e-16 ***
## jobclass
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Final plots
par(mfrow = c(3,3))
plot(gam_age, se = TRUE, col = "blue")
plot(gam_maritl, se = TRUE, col = "blue")
plot(gam_jobclass, se = TRUE, col = "blue")
When analyzing wage by marital status, never married individuals tend to have lower median wages compared to married individuals. And for wage by job class, industrial jobs tend to have lower median wage compared to information jobs. ANOVA hypothesis testing shows that adding marital status and job class as categorical variables to the GAM significantly improves the smoothing spline for age.
library(leaps)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
data(College)
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
head(College)
## Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University Yes 1660 1232 721 23 52
## Adelphi University Yes 2186 1924 512 16 29
## Adrian College Yes 1428 1097 336 22 50
## Agnes Scott College Yes 417 349 137 60 89
## Alaska Pacific University Yes 193 146 55 16 44
## Albertson College Yes 587 479 158 38 62
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University 2885 537 7440 3300 450
## Adelphi University 2683 1227 12280 6450 750
## Adrian College 1036 99 11250 3750 400
## Agnes Scott College 510 63 12960 5450 450
## Alaska Pacific University 249 869 7560 4120 800
## Albertson College 678 41 13500 3335 500
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University 2200 70 78 18.1 12 7041
## Adelphi University 1500 29 30 12.2 16 10527
## Adrian College 1165 53 66 12.9 30 8735
## Agnes Scott College 875 92 97 7.7 37 19016
## Alaska Pacific University 1500 76 72 11.9 2 10922
## Albertson College 675 67 73 9.4 11 9727
## Grad.Rate
## Abilene Christian University 60
## Adelphi University 56
## Adrian College 54
## Agnes Scott College 59
## Alaska Pacific University 15
## Albertson College 55
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
set.seed(0)
train_index <- createDataPartition(College$Outstate, p = 0.7, list = FALSE)
train_data <- College[train_index, ]
test_data <- College[-train_index, ]
forward_fit <- regsubsets(Outstate ~ ., data = train_data, nvmax = ncol(train_data) - 1, method = "forward")
summary_forward <- summary(forward_fit)
summary_forward$outmat
## PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " " " " " "
## 7 ( 1 ) "*" " " " " " " " " " " " "
## 8 ( 1 ) "*" " " " " " " " " " " " "
## 9 ( 1 ) "*" " " "*" " " " " " " " "
## 10 ( 1 ) "*" " " "*" " " " " " " "*"
## 11 ( 1 ) "*" "*" "*" " " " " " " "*"
## 12 ( 1 ) "*" "*" "*" " " "*" " " "*"
## 13 ( 1 ) "*" "*" "*" " " "*" " " "*"
## 14 ( 1 ) "*" "*" "*" " " "*" " " "*"
## 15 ( 1 ) "*" "*" "*" " " "*" " " "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " "*" " " " "
## 6 ( 1 ) " " "*" " " " " "*" " " " "
## 7 ( 1 ) " " "*" " " "*" "*" " " " "
## 8 ( 1 ) " " "*" " " "*" "*" " " "*"
## 9 ( 1 ) " " "*" " " "*" "*" " " "*"
## 10 ( 1 ) " " "*" " " "*" "*" " " "*"
## 11 ( 1 ) " " "*" " " "*" "*" " " "*"
## 12 ( 1 ) " " "*" " " "*" "*" " " "*"
## 13 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 14 ( 1 ) " " "*" "*" "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " "*" " "
## 2 ( 1 ) " " "*" " "
## 3 ( 1 ) " " "*" " "
## 4 ( 1 ) "*" "*" " "
## 5 ( 1 ) "*" "*" " "
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
## 9 ( 1 ) "*" "*" "*"
## 10 ( 1 ) "*" "*" "*"
## 11 ( 1 ) "*" "*" "*"
## 12 ( 1 ) "*" "*" "*"
## 13 ( 1 ) "*" "*" "*"
## 14 ( 1 ) "*" "*" "*"
## 15 ( 1 ) "*" "*" "*"
## 16 ( 1 ) "*" "*" "*"
## 17 ( 1 ) "*" "*" "*"
best_cp <- which.min(summary_forward$cp)
best_bic <- which.min(summary_forward$bic)
best_adjr2 <- which.max(summary_forward$adjr2)
cat("Best model based on Cp:", best_cp) #13
## Best model based on Cp: 13
cat("Best model based on BIC:", best_bic) #12
## Best model based on BIC: 12
cat("Best model based on adjusted R-squared:", best_adjr2) #13
## Best model based on adjusted R-squared: 13
best_model_size <- 13
selected_vars <- names(coef(forward_fit, best_model_size))[-1] # Exclude intercept
cat("Best Model Size:", best_model_size, "\n")
## Best Model Size: 13
cat("Selected Predictors:", paste(selected_vars, collapse = ", "))
## Selected Predictors: PrivateYes, Apps, Accept, Top10perc, F.Undergrad, Room.Board, Personal, PhD, Terminal, S.F.Ratio, perc.alumni, Expend, Grad.Rate
coef(forward_fit, 13)
## (Intercept) PrivateYes Apps Accept Top10perc
## -1696.0494705 2281.0324069 -0.2745795 0.6908079 23.0013881
## F.Undergrad Room.Board Personal PhD Terminal
## -0.1636180 0.9336671 -0.2799977 22.7913716 19.3862731
## S.F.Ratio perc.alumni Expend Grad.Rate
## -59.5472959 39.7135485 0.1663588 19.3178985
gam_model <- gam(Outstate ~ Private + s(Apps) + s(Accept) + s(Top10perc) + s(F.Undergrad) +
s(Room.Board) + s(Personal) + s(PhD) + s(Terminal) + s(S.F.Ratio) +
s(perc.alumni) + s(Expend) + s(Grad.Rate),
data = train_data)
par(mfrow = c(3,3))
plot(gam_model, se = TRUE, col = "blue")
Based on the plots, many of the predictors appear to have non-linear relationships with out of state tuition. In general, the confidence interval is wider at extremes of the predictors, where we expect more variance. For private, Yes is above 0 meaning private schools charge higher tuition while No is below 0 meaning public schools charge less tuition. Out of state tuition appears to increase with number of accepted applications, percentage of students in the 10% of their HS class, room and board costs, percentage of faculty with a PhD or terminal degree, student faculty ratio, percentage of alumni who donate to the school, and instructional expenditure per student. Tuition decreases with number of applications received by the college, number of full time undergraduate students, and personal expenses for students. Graduation rate seems to have increasing effect on tuition up until about 80-90% grduation rate, at which it is associated with decreasing effect on tuition.
test_predictions <- predict(gam_model, newdata = test_data)
mse <- mean((test_predictions - test_data$Outstate)^2)
rmse <- sqrt(mse)
sst <- sum((test_data$Outstate - mean(test_data$Outstate))^2)
sse <- sum((test_data$Outstate - test_predictions)^2)
r_squared <- 1 - (sse / sst)
cat("GAM Model Evaluation Metrics:\n")
## GAM Model Evaluation Metrics:
cat("MSE:", mse, "\n")
## MSE: 3517712
cat("RMSE:", rmse, "\n")
## RMSE: 1875.556
cat("R²:", r_squared, "\n\n")
## R²: 0.7949705
lm_model <- lm(Outstate ~ Private + Apps + Accept + Top10perc + F.Undergrad +
Room.Board + Personal + PhD + Terminal + S.F.Ratio +
perc.alumni + Expend + Grad.Rate, data = train_data)
lm_predictions <- predict(lm_model, newdata = test_data)
lm_mse <- mean((lm_predictions - test_data$Outstate)^2)
lm_rmse <- sqrt(lm_mse)
lm_r_squared <- 1 - sum((test_data$Outstate - lm_predictions)^2) / sum((test_data$Outstate - mean(test_data$Outstate))^2)
cat("Linear Model Performance:\n")
## Linear Model Performance:
cat("MSE:", lm_mse, "\n")
## MSE: 4210025
cat("RMSE:", lm_rmse, "\n")
## RMSE: 2051.834
cat("R²:", lm_r_squared)
## R²: 0.7546191
The GAM is on average around $1875 off from the true tuition values, and it explains 79.5% of the variance in tuition. On the other hand, the linear model is off by around $2050 and explains 75% of variance in tuition. Overall, both models appear to perform similarly well, with GAM slightly outperforming. I could adjust predictors used, add interaction terms and feature engineer, or adjust degrees of freedom. It’s also possible that the nonlinear relationship among some predictors with out of state tuition are not as significant as we thought.
summary(gam_model)
##
## Call: gam(formula = Outstate ~ Private + s(Apps) + s(Accept) + s(Top10perc) +
## s(F.Undergrad) + s(Room.Board) + s(Personal) + s(PhD) + s(Terminal) +
## s(S.F.Ratio) + s(perc.alumni) + s(Expend) + s(Grad.Rate),
## data = train_data)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5990.09 -1062.62 70.67 1170.20 4649.34
##
## (Dispersion Parameter for gaussian family taken to be 3043346)
##
## Null Deviance: 8595716956 on 545 degrees of freedom
## Residual Deviance: 1509500146 on 496.0002 degrees of freedom
## AIC: 9749.985
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 2323483405 2323483405 763.4635 < 2.2e-16 ***
## s(Apps) 1 983626768 983626768 323.2057 < 2.2e-16 ***
## s(Accept) 1 126560887 126560887 41.5861 2.681e-10 ***
## s(Top10perc) 1 755720880 755720880 248.3191 < 2.2e-16 ***
## s(F.Undergrad) 1 295817305 295817305 97.2013 < 2.2e-16 ***
## s(Room.Board) 1 585353023 585353023 192.3386 < 2.2e-16 ***
## s(Personal) 1 22992768 22992768 7.5551 0.006202 **
## s(PhD) 1 141201515 141201515 46.3968 2.804e-11 ***
## s(Terminal) 1 18641902 18641902 6.1255 0.013658 *
## s(S.F.Ratio) 1 90552710 90552710 29.7543 7.752e-08 ***
## s(perc.alumni) 1 111111669 111111669 36.5097 2.986e-09 ***
## s(Expend) 1 387424127 387424127 127.3020 < 2.2e-16 ***
## s(Grad.Rate) 1 25447409 25447409 8.3617 0.004000 **
## Residuals 496 1509500146 3043346
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Apps) 3 3.7798 0.010578 *
## s(Accept) 3 4.5449 0.003734 **
## s(Top10perc) 3 1.2847 0.278908
## s(F.Undergrad) 3 2.9724 0.031401 *
## s(Room.Board) 3 0.7982 0.495266
## s(Personal) 3 1.8158 0.143309
## s(PhD) 3 2.5365 0.056062 .
## s(Terminal) 3 1.1257 0.338134
## s(S.F.Ratio) 3 4.6471 0.003246 **
## s(perc.alumni) 3 1.7353 0.158818
## s(Expend) 3 18.6839 1.695e-11 ***
## s(Grad.Rate) 3 2.8632 0.036331 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova for nonparametric effects shows a strong evidence of non-linear relationship between Apps, Accept, F.Undergrad, S.F.Ratio, Expend, and Grad.Rate with the response of out of state tuition with p<0.05.