library(rsconnect)
library(ISLR) # Load the ISLR library
data("Wage") # Load the Wage dataset
head("Wage")
## [1] "Wage"
# Assuming you have already loaded the ISLR package and the Wage dataset
# Display the first few rows of the dataset
head(Wage)
## year age maritl race education region
## 231655 2006 18 1. Never Married 1. White 1. < HS Grad 2. Middle Atlantic
## 86582 2004 24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
## 161300 2003 45 2. Married 1. White 3. Some College 2. Middle Atlantic
## 155159 2003 43 2. Married 3. Asian 4. College Grad 2. Middle Atlantic
## 11443 2005 50 4. Divorced 1. White 2. HS Grad 2. Middle Atlantic
## 376662 2008 54 2. Married 1. White 4. College Grad 2. Middle Atlantic
## jobclass health health_ins logwage wage
## 231655 1. Industrial 1. <=Good 2. No 4.318063 75.04315
## 86582 2. Information 2. >=Very Good 2. No 4.255273 70.47602
## 161300 1. Industrial 1. <=Good 1. Yes 4.875061 130.98218
## 155159 2. Information 2. >=Very Good 1. Yes 5.041393 154.68529
## 11443 2. Information 1. <=Good 1. Yes 4.318063 75.04315
## 376662 2. Information 2. >=Very Good 1. Yes 4.845098 127.11574
# Provide a summary for each column in the dataset
summary(Wage)
## year age maritl race
## Min. :2003 Min. :18.00 1. Never Married: 648 1. White:2480
## 1st Qu.:2004 1st Qu.:33.75 2. Married :2074 2. Black: 293
## Median :2006 Median :42.00 3. Widowed : 19 3. Asian: 190
## Mean :2006 Mean :42.41 4. Divorced : 204 4. Other: 37
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
## Max. :2009 Max. :80.00
##
## education region jobclass
## 1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544
## 2. HS Grad :971 1. New England : 0 2. Information:1456
## 3. Some College :650 3. East North Central: 0
## 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## health health_ins logwage wage
## 1. <=Good : 858 1. Yes:2083 Min. :3.000 Min. : 20.09
## 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447 1st Qu.: 85.38
## Median :4.653 Median :104.92
## Mean :4.654 Mean :111.70
## 3rd Qu.:4.857 3rd Qu.:128.68
## Max. :5.763 Max. :318.34
##
# Show the structure of the dataset
str(Wage)
## 'data.frame': 3000 obs. of 11 variables:
## $ year : int 2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
## $ age : int 18 24 45 43 50 54 44 30 41 52 ...
## $ maritl : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
## $ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
## $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
## $ region : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ jobclass : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ health : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
## $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
## $ logwage : num 4.32 4.26 4.88 5.04 4.32 ...
## $ wage : num 75 70.5 131 154.7 75 ...
library(ISLR)
# Load the Wage data
data(Wage)
# Initialize a vector to store the AIC value for each polynomial degree
aic_values <- numeric(10)
for (degree in 1:10) {
  # Fit a polynomial regression of the current degree
  model <- lm(wage ~ poly(age, degree), data = Wage)
  # Store the AIC value
  aic_values[degree] <- AIC(model)
}
# Determine the optimal polynomial degree based on AIC
optimal_degree <- which.min(aic_values)
print(paste("Optimal polynomial degree based on AIC:", optimal_degree))
## [1] "Optimal polynomial degree based on AIC: 4"
# Fit the model with polynomial degree 4
optimal_model <- lm(wage ~ poly(age, 4), data = Wage)
# Summary of the optimal model
summary(optimal_model)
##
## Call:
## lm(formula = wage ~ poly(age, 4), data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.707 -24.626 -4.993 15.217 203.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.7036 0.7287 153.283 < 2e-16 ***
## poly(age, 4)1 447.0679 39.9148 11.201 < 2e-16 ***
## poly(age, 4)2 -478.3158 39.9148 -11.983 < 2e-16 ***
## poly(age, 4)3 125.5217 39.9148 3.145 0.00168 **
## poly(age, 4)4 -77.9112 39.9148 -1.952 0.05104 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.91 on 2995 degrees of freedom
## Multiple R-squared: 0.08626, Adjusted R-squared: 0.08504
## F-statistic: 70.69 on 4 and 2995 DF, p-value: < 2.2e-16
# Create a sequence of ages for plotting the fit
age_grid <- seq(from = min(Wage$age), to = max(Wage$age), by = 1)
# Predict wages using the fitted model across the age grid
predicted_wages <- predict(optimal_model, newdata = data.frame(age = age_grid))
# Plot the actual data points
plot(Wage$age, Wage$wage, xlab = "Age", ylab = "Wage",
col = "gray70", main = "Polynomial Regression Fit (Degree = 4)")
# Add the polynomial regression line
lines(age_grid, predicted_wages, col = "blue", lwd = 2)
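If the uncertainty around the fitted curve is of interest, predict() can also return standard errors. A small sketch that adds approximate 95% bands to the plot above, using the same objects:
preds <- predict(optimal_model, newdata = data.frame(age = age_grid), se.fit = TRUE)
# Dashed lines at the fit plus/minus two standard errors
lines(age_grid, preds$fit + 2 * preds$se.fit, col = "blue", lty = 2)
lines(age_grid, preds$fit - 2 * preds$se.fit, col = "blue", lty = 2)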
# Simple linear regression model
linear_model <- lm(wage ~ age, data = Wage)
# ANOVA comparison
anova_result <- anova(linear_model, optimal_model)
print(anova_result)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 4)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2995 4771604 3 250612 52.434 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both the AIC comparison and the ANOVA F-test lead to the same conclusion: the degree-4 polynomial fits the data significantly better than the simple linear model. The ANOVA approach is more general, though, since it can compare any sequence of nested models directly (see the sketch below).
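For example, a single anova() call can test a whole sequence of nested polynomial fits at once. A sketch on the same data (output not shown):
fit1 <- lm(wage ~ age, data = Wage)
fit2 <- lm(wage ~ poly(age, 2), data = Wage)
fit3 <- lm(wage ~ poly(age, 3), data = Wage)
fit4 <- lm(wage ~ poly(age, 4), data = Wage)
fit5 <- lm(wage ~ poly(age, 5), data = Wage)
# Each row tests whether the next-higher degree adds significant explanatory power
anova(fit1, fit2, fit3, fit4, fit5)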
library(ISLR)
library(ggplot2)
data(Wage)
# Number of bins for the step function; 5 is used here (see the commented
# cross-validation sketch below for one way to choose this value)
optimal_bins <- 5
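# Sketch (not run): one way to choose the number of bins by 10-fold
# cross-validation, assuming the boot package is available; 5 is kept above
# so the results shown below are unchanged.
# library(boot)
# cv_errors <- rep(NA, 9)
# for (k in 2:10) {
#   Wage$age_cut <- cut(Wage$age, k)
#   fit <- glm(wage ~ age_cut, data = Wage)
#   cv_errors[k - 1] <- cv.glm(Wage, fit, K = 10)$delta[1]
# }
# optimal_bins <- which.min(cv_errors) + 1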
# Bin the age variable with the optimal number of bins
Wage$age_bins <- cut(Wage$age, breaks = optimal_bins, include.lowest = TRUE, labels = FALSE)
# Fit the model using the binned age
model_optimal <- glm(wage ~ factor(age_bins), data = Wage)
# Create a summary data frame for plotting
# For each bin, compute the mean wage
summary_data <- aggregate(wage ~ age_bins, data = Wage, mean)
# Add a representative age for each bin to the summary data for plotting
# This could be the midpoint of the bin range, for example
bin_ranges <- cut(Wage$age, breaks = optimal_bins, include.lowest = TRUE)
# Strip all bracket characters (including the "[" on the lowest bin) before taking the midpoint
levels(bin_ranges) <- sapply(levels(bin_ranges), function(b) {
  mean(as.numeric(unlist(strsplit(gsub("\\(|\\[|\\]", "", b), ","))))
})
summary_data$age_rep <- as.numeric(levels(bin_ranges))[summary_data$age_bins]
# Plot
ggplot(Wage, aes(x = age, y = wage)) +
  geom_point(alpha = 0.2) +
  geom_line(data = summary_data, aes(x = age_rep, y = wage), color = "blue", group = 1) +
  ggtitle(paste("Step Function Fit with", optimal_bins, "Bins"))
library(ISLR)
data("College")
head(College)
## Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University Yes 1660 1232 721 23 52
## Adelphi University Yes 2186 1924 512 16 29
## Adrian College Yes 1428 1097 336 22 50
## Agnes Scott College Yes 417 349 137 60 89
## Alaska Pacific University Yes 193 146 55 16 44
## Albertson College Yes 587 479 158 38 62
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University 2885 537 7440 3300 450
## Adelphi University 2683 1227 12280 6450 750
## Adrian College 1036 99 11250 3750 400
## Agnes Scott College 510 63 12960 5450 450
## Alaska Pacific University 249 869 7560 4120 800
## Albertson College 678 41 13500 3335 500
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University 2200 70 78 18.1 12 7041
## Adelphi University 1500 29 30 12.2 16 10527
## Adrian College 1165 53 66 12.9 30 8735
## Agnes Scott College 875 92 97 7.7 37 19016
## Alaska Pacific University 1500 76 72 11.9 2 10922
## Albertson College 675 67 73 9.4 11 9727
## Grad.Rate
## Abilene Christian University 60
## Adelphi University 56
## Adrian College 54
## Agnes Scott College 59
## Alaska Pacific University 15
## Albertson College 55
library(ISLR)
data(College)
# Set the seed for reproducibility
set.seed(123)
# Create a vector of random indices
index <- sample(1:nrow(College), round(0.7 * nrow(College)))
# Split the data into training and test sets
training_set <- College[index, ]
test_set <- College[-index, ]
# Fit an initial model with the intercept only
initial_model <- lm(Outstate ~ 1, data=training_set)
# Use step() for forward selection. Note: with upper = ~ . the scope contains no
# predictors beyond the intercept-only model, so no terms can ever be added;
# this is corrected further down with stepAIC and a proper upper scope.
forward_model <- step(initial_model, scope = list(lower = formula(initial_model), upper = ~ .), direction = "forward", data = training_set)
## Start: AIC=9046.26
## Outstate ~ 1
# Display the "final" model (still just the intercept, since no predictors were in scope)
summary(forward_model)
##
## Call:
## lm(formula = Outstate ~ 1, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7815.7 -3229.4 -495.7 2535.6 11304.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10395.7 174.9 59.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4079 on 543 degrees of freedom
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
# Because the forward selection above had an empty upper scope, formula(forward_model)
# is just Outstate ~ 1, so a GAM fit on it reduces to the intercept-only model
selected_formula <- formula(forward_model)
gam_model <- gam(selected_formula, data = training_set)
# The selection is redone below with MASS::stepAIC and the full set of predictors,
# and the GAM is then refit with smooth terms for the selected variables
library(MASS)
# Set up the full model as the upper scope for the stepwise process
full_model <- lm(Outstate ~ ., data=training_set)
# Perform forward stepwise selection starting from the model with only an intercept
forward_model <- stepAIC(lm(Outstate ~ 1, data=training_set),
scope=list(lower=Outstate ~ 1, upper=formula(full_model)),
direction="forward",
trace=FALSE) # trace=FALSE to reduce verbosity
# Now check the summary of the new forward stepwise selection model
summary(forward_model)
##
## Call:
## lm(formula = Outstate ~ Expend + Private + Room.Board + perc.alumni +
## PhD + Grad.Rate + Top25perc + Personal + Accept + F.Undergrad +
## Apps + Terminal + Top10perc + Enroll, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6498.6 -1327.5 -1.8 1287.5 10012.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.362e+03 7.027e+02 -3.361 0.000832 ***
## Expend 1.968e-01 2.492e-02 7.899 1.64e-14 ***
## PrivateYes 2.739e+03 3.075e+02 8.907 < 2e-16 ***
## Room.Board 7.492e-01 1.031e-01 7.268 1.32e-12 ***
## perc.alumni 4.247e+01 9.307e+00 4.564 6.26e-06 ***
## PhD 1.440e+01 1.077e+01 1.337 0.181662
## Grad.Rate 1.890e+01 6.830e+00 2.767 0.005848 **
## Top25perc 2.888e+00 1.031e+01 0.280 0.779458
## Personal -2.592e-01 1.436e-01 -1.805 0.071608 .
## Accept 8.419e-01 1.567e-01 5.372 1.17e-07 ***
## F.Undergrad -1.221e-01 7.667e-02 -1.593 0.111826
## Apps -2.324e-01 8.801e-02 -2.640 0.008529 **
## Terminal 2.327e+01 1.159e+01 2.008 0.045162 *
## Top10perc 2.613e+01 1.313e+01 1.990 0.047138 *
## Enroll -6.679e-01 4.407e-01 -1.516 0.130183
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2019 on 529 degrees of freedom
## Multiple R-squared: 0.7613, Adjusted R-squared: 0.755
## F-statistic: 120.5 on 14 and 529 DF, p-value: < 2.2e-16
# Fit a GAM on the training data, using smooth terms for a subset of the selected continuous predictors
gam_model <- gam(Outstate ~ s(Expend) + s(Room.Board) + Private + s(perc.alumni) + s(Accept), data=training_set)
# Plot the results for each smooth term
plot(gam_model, pages=1, se=TRUE, col="blue")
The smooths for Expend and Room.Board are indicative of a positive
association with out-of-state tuition. This makes sense as higher
expenditures per student and higher living costs could reflect the
overall cost structure of higher-quality or more expensive institutions,
which might also charge higher tuition.
The smooth for perc.alumni is comparatively flat and does not suggest a strong relationship with tuition. This is somewhat counterintuitive: one might expect institutions with a high percentage of donating alumni to have more resources and therefore room to charge lower tuition, so the flat relationship might warrant further investigation.
The Accept variable presents a more complex relationship. The initial decrease in tuition with more acceptances could suggest that larger institutions (which accept more students) can leverage economies of scale and charge lower tuition. However, the plateau and potential reversal in this trend might reflect a threshold beyond which institutions cannot or do not reduce tuition further.
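These visual impressions can be checked against the approximate significance of the smooth terms reported by summary(): effective degrees of freedom (edf) close to 1 indicate an essentially linear effect, while larger values indicate stronger non-linearity. A minimal check on the fitted model (output not shown):
# Approximate significance and edf of each smooth term
summary(gam_model)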
# Predict on the test set
predicted_tuition <- predict(gam_model, newdata=test_set)
# Calculate Mean Squared Error (MSE)
mse <- mean((test_set$Outstate - predicted_tuition)^2)
print(paste("Mean Squared Error (MSE):", mse))
## [1] "Mean Squared Error (MSE): 3276878.97805376"
# Calculate Root Mean Squared Error (RMSE)
rmse <- sqrt(mse)
print(paste("Root Mean Squared Error (RMSE):", rmse))
## [1] "Root Mean Squared Error (RMSE): 1810.21517451759"
# Calculate Mean Absolute Error (MAE)
mae <- mean(abs(test_set$Outstate - predicted_tuition))
print(paste("Mean Absolute Error (MAE):", mae))
## [1] "Mean Absolute Error (MAE): 1334.10945796333"
# Calculate R-squared
ss_total <- sum((test_set$Outstate - mean(test_set$Outstate))^2)
ss_residuals <- sum((test_set$Outstate - predicted_tuition)^2)
r_squared <- 1 - (ss_residuals / ss_total)
print(paste("R-squared:", r_squared))
## [1] "R-squared: 0.783102075830272"
While the R-squared of roughly 0.78 indicates a strong overall fit, the RMSE (about $1,810) and MAE (about $1,334) show that individual predictions can still be off by a substantial amount. The model captures the broad trends in out-of-state tuition well, but some schools, likely outliers or institutions with characteristics not represented by the predictors, are predicted much less accurately.
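As a point of reference for these error magnitudes, the same test-set metrics can be computed for the linear model chosen by stepAIC. A brief sketch using the objects defined above (output not shown):
# Test-set RMSE for the forward-stepwise linear model, for comparison with the GAM
lm_pred <- predict(forward_model, newdata = test_set)
lm_rmse <- sqrt(mean((test_set$Outstate - lm_pred)^2))
print(paste("Linear model RMSE:", round(lm_rmse, 2)))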
Here is a breakdown of the evidence for non-linearity from the plots (a formal check for Expend is sketched after the list):
Expend (Expenditures per Student): The plot shows a non-linear relationship. Initially, as Expend increases, the out-of-state tuition increases as well, but the rate of increase slows down. This suggests diminishing returns in terms of tuition increase as expenditures per student grow higher.
Room.Board (Cost of Room and Board): The relationship appears mostly linear with a slight curvature at the ends. This could be considered a weak non-linear relationship. It suggests that the tuition increases with room and board costs, but the effect may change slightly at very high or low values.
perc.alumni (Percentage of Alumni Who Donate): The plot is relatively flat with slight fluctuations, suggesting that there is minimal non-linear relationship or perhaps a very weak one. This indicates that the percentage of alumni who donate may not have a strong non-linear impact on out-of-state tuition.
Accept (Number of Accepted Students): The plot for Accept shows some non-linearity, with tuition decreasing as the number of acceptances increases, but then plateauing. This indicates a non-linear effect where increasing the number of acceptances initially reduces tuition, but only up to a certain point, after which the effect levels off.
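The impression of non-linearity in Expend can also be tested more formally by comparing the GAM above to one in which Expend enters linearly. A sketch using the same training data (approximate test, output not shown):
gam_expend_linear <- gam(Outstate ~ Expend + s(Room.Board) + Private +
                           s(perc.alumni) + s(Accept), data = training_set)
# A small p-value favours keeping the smooth term s(Expend)
anova(gam_expend_linear, gam_model, test = "F")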