library(rsconnect)
library(ISLR) # Load the ISLR library
data("Wage") # Load the Wage dataset
head("Wage")
## [1] "Wage"
# Assuming you have already loaded the ISLR package and the Wage dataset
# Display the first few rows of the dataset
head(Wage)
## year age maritl race education region
## 231655 2006 18 1. Never Married 1. White 1. < HS Grad 2. Middle Atlantic
## 86582 2004 24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
## 161300 2003 45 2. Married 1. White 3. Some College 2. Middle Atlantic
## 155159 2003 43 2. Married 3. Asian 4. College Grad 2. Middle Atlantic
## 11443 2005 50 4. Divorced 1. White 2. HS Grad 2. Middle Atlantic
## 376662 2008 54 2. Married 1. White 4. College Grad 2. Middle Atlantic
## jobclass health health_ins logwage wage
## 231655 1. Industrial 1. <=Good 2. No 4.318063 75.04315
## 86582 2. Information 2. >=Very Good 2. No 4.255273 70.47602
## 161300 1. Industrial 1. <=Good 1. Yes 4.875061 130.98218
## 155159 2. Information 2. >=Very Good 1. Yes 5.041393 154.68529
## 11443 2. Information 1. <=Good 1. Yes 4.318063 75.04315
## 376662 2. Information 2. >=Very Good 1. Yes 4.845098 127.11574
# Provide a summary for each column in the dataset
summary(Wage)
## year age maritl race
## Min. :2003 Min. :18.00 1. Never Married: 648 1. White:2480
## 1st Qu.:2004 1st Qu.:33.75 2. Married :2074 2. Black: 293
## Median :2006 Median :42.00 3. Widowed : 19 3. Asian: 190
## Mean :2006 Mean :42.41 4. Divorced : 204 4. Other: 37
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
## Max. :2009 Max. :80.00
##
## education region jobclass
## 1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544
## 2. HS Grad :971 1. New England : 0 2. Information:1456
## 3. Some College :650 3. East North Central: 0
## 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## health health_ins logwage wage
## 1. <=Good : 858 1. Yes:2083 Min. :3.000 Min. : 20.09
## 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447 1st Qu.: 85.38
## Median :4.653 Median :104.92
## Mean :4.654 Mean :111.70
## 3rd Qu.:4.857 3rd Qu.:128.68
## Max. :5.763 Max. :318.34
##
# Show the structure of the dataset
str(Wage)
## 'data.frame': 3000 obs. of 11 variables:
## $ year : int 2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
## $ age : int 18 24 45 43 50 54 44 30 41 52 ...
## $ maritl : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
## $ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
## $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
## $ region : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ jobclass : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ health : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
## $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
## $ logwage : num 4.32 4.26 4.88 5.04 4.32 ...
## $ wage : num 75 70.5 131 154.7 75 ...
library(ISLR)
# Load the Wage data
data(Wage)
# Initialize a vector to store the AIC value for each polynomial degree
aic_values <- numeric(10)
for (degree in 1:10) {
  # Fit a polynomial regression of the current degree
  model <- lm(wage ~ poly(age, degree), data = Wage)
  # Store the AIC value
  aic_values[degree] <- AIC(model)
}
# Determine the optimal polynomial degree based on AIC
optimal_degree <- which.min(aic_values)
print(paste("Optimal polynomial degree based on AIC:", optimal_degree))
## [1] "Optimal polynomial degree based on AIC: 4"
# Fit the model with polynomial degree 4
optimal_model <- lm(wage ~ poly(age, 4), data = Wage)
# Summary of the optimal model
summary(optimal_model)
##
## Call:
## lm(formula = wage ~ poly(age, 4), data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.707 -24.626 -4.993 15.217 203.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.7036 0.7287 153.283 < 2e-16 ***
## poly(age, 4)1 447.0679 39.9148 11.201 < 2e-16 ***
## poly(age, 4)2 -478.3158 39.9148 -11.983 < 2e-16 ***
## poly(age, 4)3 125.5217 39.9148 3.145 0.00168 **
## poly(age, 4)4 -77.9112 39.9148 -1.952 0.05104 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.91 on 2995 degrees of freedom
## Multiple R-squared: 0.08626, Adjusted R-squared: 0.08504
## F-statistic: 70.69 on 4 and 2995 DF, p-value: < 2.2e-16
# Create a sequence of ages for plotting the fit
age_grid <- seq(from = min(Wage$age), to = max(Wage$age), by = 1)
# Predict wages using the fitted model across the age grid
predicted_wages <- predict(optimal_model, newdata = data.frame(age = age_grid))
# Plot the actual data points
plot(Wage$age, Wage$wage, xlab = "Age", ylab = "Wage",
col = "gray70", main = "Polynomial Regression Fit (Degree = 4)")
# Add the polynomial regression line
lines(age_grid, predicted_wages, col = "blue", lwd = 2)
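If the uncertainty around the fitted curve is of interest, predict() can also return standard errors. A small sketch that adds approximate 95% bands to the plot above, using the same objects:
preds <- predict(optimal_model, newdata = data.frame(age = age_grid), se.fit = TRUE)
# Dashed lines at the fit plus/minus two standard errors
lines(age_grid, preds$fit + 2 * preds$se.fit, col = "blue", lty = 2)
lines(age_grid, preds$fit - 2 * preds$se.fit, col = "blue", lty = 2)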
# Simple linear regression model
linear_model <- lm(wage ~ age, data = Wage)
# ANOVA comparison
anova_result <- anova(linear_model, optimal_model)
print(anova_result)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 4)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2995 4771604 3 250612 52.434 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both the AIC comparison and the ANOVA F-test lead to the same conclusion: the degree-4 polynomial fits the data significantly better than the simple linear model. The ANOVA approach is more general, though, since it can compare any sequence of nested models directly (see the sketch below).
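For example, a single anova() call can test a whole sequence of nested polynomial fits at once. A sketch on the same data (output not shown):
fit1 <- lm(wage ~ age, data = Wage)
fit2 <- lm(wage ~ poly(age, 2), data = Wage)
fit3 <- lm(wage ~ poly(age, 3), data = Wage)
fit4 <- lm(wage ~ poly(age, 4), data = Wage)
fit5 <- lm(wage ~ poly(age, 5), data = Wage)
# Each row tests whether the next-higher degree adds significant explanatory power
anova(fit1, fit2, fit3, fit4, fit5)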
library(ISLR)
library(ggplot2)
data(Wage)
# Number of bins for the step function; 5 is used here (see the commented
# cross-validation sketch below for one way to choose this value)
optimal_bins <- 5
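# Sketch (not run): one way to choose the number of bins by 10-fold
# cross-validation, assuming the boot package is available; 5 is kept above
# so the results shown below are unchanged.
# library(boot)
# cv_errors <- rep(NA, 9)
# for (k in 2:10) {
#   Wage$age_cut <- cut(Wage$age, k)
#   fit <- glm(wage ~ age_cut, data = Wage)
#   cv_errors[k - 1] <- cv.glm(Wage, fit, K = 10)$delta[1]
# }
# optimal_bins <- which.min(cv_errors) + 1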
# Bin the age variable with the optimal number of bins
Wage$age_bins <- cut(Wage$age, breaks = optimal_bins, include.lowest = TRUE, labels = FALSE)
# Fit the model using the binned age
model_optimal <- glm(wage ~ factor(age_bins), data = Wage)
# Create a summary data frame for plotting
# For each bin, compute the mean wage
summary_data <- aggregate(wage ~ age_bins, data = Wage, mean)
# Add a representative age for each bin to the summary data for plotting
# This could be the midpoint of the bin range, for example
bin_ranges <- cut(Wage$age, breaks = optimal_bins, include.lowest = TRUE)
# Strip all bracket characters (including the "[" on the lowest bin) before taking the midpoint
levels(bin_ranges) <- sapply(levels(bin_ranges), function(b) {
  mean(as.numeric(unlist(strsplit(gsub("\\(|\\[|\\]", "", b), ","))))
})
summary_data$age_rep <- as.numeric(levels(bin_ranges))[summary_data$age_bins]
# Plot
ggplot(Wage, aes(x = age, y = wage)) +
  geom_point(alpha = 0.2) +
  geom_line(data = summary_data, aes(x = age_rep, y = wage), color = "blue", group = 1) +
  ggtitle(paste("Step Function Fit with", optimal_bins, "Bins"))
library(ISLR)
data("College")
head(College)
## Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University Yes 1660 1232 721 23 52
## Adelphi University Yes 2186 1924 512 16 29
## Adrian College Yes 1428 1097 336 22 50
## Agnes Scott College Yes 417 349 137 60 89
## Alaska Pacific University Yes 193 146 55 16 44
## Albertson College Yes 587 479 158 38 62
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University 2885 537 7440 3300 450
## Adelphi University 2683 1227 12280 6450 750
## Adrian College 1036 99 11250 3750 400
## Agnes Scott College 510 63 12960 5450 450
## Alaska Pacific University 249 869 7560 4120 800
## Albertson College 678 41 13500 3335 500
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University 2200 70 78 18.1 12 7041
## Adelphi University 1500 29 30 12.2 16 10527
## Adrian College 1165 53 66 12.9 30 8735
## Agnes Scott College 875 92 97 7.7 37 19016
## Alaska Pacific University 1500 76 72 11.9 2 10922
## Albertson College 675 67 73 9.4 11 9727
## Grad.Rate
## Abilene Christian University 60
## Adelphi University 56
## Adrian College 54
## Agnes Scott College 59
## Alaska Pacific University 15
## Albertson College 55
library(ISLR)
data(College)
# Set the seed for reproducibility
set.seed(123)
# Create a vector of random indices
index <- sample(1:nrow(College), round(0.7 * nrow(College)))
# Split the data into training and test sets
training_set <- College[index, ]
test_set <- College[-index, ]
# Fit an initial model with the intercept only
initial_model <- lm(Outstate ~ 1, data=training_set)
# Use step() for forward selection. Note: with upper = ~ . the scope contains no
# predictors beyond the intercept-only model, so no terms can ever be added;
# this is corrected further down with stepAIC and a proper upper scope.
forward_model <- step(initial_model, scope = list(lower = formula(initial_model), upper = ~ .), direction = "forward", data = training_set)
## Start: AIC=9046.26
## Outstate ~ 1
# Display the "final" model (still just the intercept, since no predictors were in scope)
summary(forward_model)
##
## Call:
## lm(formula = Outstate ~ 1, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7815.7 -3229.4 -495.7 2535.6 11304.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10395.7 174.9 59.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4079 on 543 degrees of freedom
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
# Because the forward selection above had an empty upper scope, formula(forward_model)
# is just Outstate ~ 1, so a GAM fit on it reduces to the intercept-only model
selected_formula <- formula(forward_model)
gam_model <- gam(selected_formula, data = training_set)
# The selection is redone below with MASS::stepAIC and the full set of predictors,
# and the GAM is then refit with smooth terms for the selected variables
library(MASS)
# Set up the full model as the upper scope for the stepwise process
full_model <- lm(Outstate ~ ., data=training_set)
# Perform forward stepwise selection starting from the model with only an intercept
forward_model <- stepAIC(lm(Outstate ~ 1, data=training_set),
scope=list(lower=Outstate ~ 1, upper=formula(full_model)),
direction="forward",
trace=FALSE) # trace=FALSE to reduce verbosity
# Now check the summary of the new forward stepwise selection model
summary(forward_model)
##
## Call:
## lm(formula = Outstate ~ Expend + Private + Room.Board + perc.alumni +
## PhD + Grad.Rate + Top25perc + Personal + Accept + F.Undergrad +
## Apps + Terminal + Top10perc + Enroll, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6498.6 -1327.5 -1.8 1287.5 10012.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.362e+03 7.027e+02 -3.361 0.000832 ***
## Expend 1.968e-01 2.492e-02 7.899 1.64e-14 ***
## PrivateYes 2.739e+03 3.075e+02 8.907 < 2e-16 ***
## Room.Board 7.492e-01 1.031e-01 7.268 1.32e-12 ***
## perc.alumni 4.247e+01 9.307e+00 4.564 6.26e-06 ***
## PhD 1.440e+01 1.077e+01 1.337 0.181662
## Grad.Rate 1.890e+01 6.830e+00 2.767 0.005848 **
## Top25perc 2.888e+00 1.031e+01 0.280 0.779458
## Personal -2.592e-01 1.436e-01 -1.805 0.071608 .
## Accept 8.419e-01 1.567e-01 5.372 1.17e-07 ***
## F.Undergrad -1.221e-01 7.667e-02 -1.593 0.111826
## Apps -2.324e-01 8.801e-02 -2.640 0.008529 **
## Terminal 2.327e+01 1.159e+01 2.008 0.045162 *
## Top10perc 2.613e+01 1.313e+01 1.990 0.047138 *
## Enroll -6.679e-01 4.407e-01 -1.516 0.130183
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2019 on 529 degrees of freedom
## Multiple R-squared: 0.7613, Adjusted R-squared: 0.755
## F-statistic: 120.5 on 14 and 529 DF, p-value: < 2.2e-16
# Fit a GAM on the training data, using smooth terms for a subset of the selected continuous predictors
gam_model <- gam(Outstate ~ s(Expend) + s(Room.Board) + Private + s(perc.alumni) + s(Accept), data=training_set)
# Plot the results for each smooth term
plot(gam_model, pages=1, se=TRUE, col="blue")
The smooths for Expend and Room.Board are indicative of a positive
association with out-of-state tuition. This makes sense as higher
expenditures per student and higher living costs could reflect the
overall cost structure of higher-quality or more expensive institutions,
which might also charge higher tuition.
The smooth for perc.alumni is comparatively flat and does not suggest a strong relationship with tuition. This is somewhat counterintuitive: one might expect institutions with a high percentage of donating alumni to have more resources and therefore room to charge lower tuition, so the flat relationship might warrant further investigation.
The Accept variable presents a more complex relationship. The initial decrease in tuition with more acceptances could suggest that larger institutions (which accept more students) can leverage economies of scale and charge lower tuition. However, the plateau and potential reversal in this trend might reflect a threshold beyond which institutions cannot or do not reduce tuition further.
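These visual impressions can be checked against the approximate significance of the smooth terms reported by summary(): effective degrees of freedom (edf) close to 1 indicate an essentially linear effect, while larger values indicate stronger non-linearity. A minimal check on the fitted model (output not shown):
# Approximate significance and edf of each smooth term
summary(gam_model)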
# Predict on the test set
predicted_tuition <- predict(gam_model, newdata=test_set)
# Calculate Mean Squared Error (MSE)
mse <- mean((test_set$Outstate - predicted_tuition)^2)
print(paste("Mean Squared Error (MSE):", mse))
## [1] "Mean Squared Error (MSE): 3276878.97805376"
# Calculate Root Mean Squared Error (RMSE)
rmse <- sqrt(mse)
print(paste("Root Mean Squared Error (RMSE):", rmse))
## [1] "Root Mean Squared Error (RMSE): 1810.21517451759"
# Calculate Mean Absolute Error (MAE)
mae <- mean(abs(test_set$Outstate - predicted_tuition))
print(paste("Mean Absolute Error (MAE):", mae))
## [1] "Mean Absolute Error (MAE): 1334.10945796333"
# Calculate R-squared
ss_total <- sum((test_set$Outstate - mean(test_set$Outstate))^2)
ss_residuals <- sum((test_set$Outstate - predicted_tuition)^2)
r_squared <- 1 - (ss_residuals / ss_total)
print(paste("R-squared:", r_squared))
## [1] "R-squared: 0.783102075830272"
While the R-squared of roughly 0.78 indicates a strong overall fit, the RMSE (about $1,810) and MAE (about $1,334) show that individual predictions can still be off by a substantial amount. The model captures the broad trends in out-of-state tuition well, but some schools, likely outliers or institutions with characteristics not represented by the predictors, are predicted much less accurately.
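As a point of reference for these error magnitudes, the same test-set metrics can be computed for the linear model chosen by stepAIC. A brief sketch using the objects defined above (output not shown):
# Test-set RMSE for the forward-stepwise linear model, for comparison with the GAM
lm_pred <- predict(forward_model, newdata = test_set)
lm_rmse <- sqrt(mean((test_set$Outstate - lm_pred)^2))
print(paste("Linear model RMSE:", round(lm_rmse, 2)))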
Here is a breakdown of the evidence for non-linearity from the plots (a formal check for Expend is sketched after the list):
Expend (Expenditures per Student): The plot shows a non-linear relationship. Initially, as Expend increases, the out-of-state tuition increases as well, but the rate of increase slows down. This suggests diminishing returns in terms of tuition increase as expenditures per student grow higher.
Room.Board (Cost of Room and Board): The relationship appears mostly linear with a slight curvature at the ends. This could be considered a weak non-linear relationship. It suggests that the tuition increases with room and board costs, but the effect may change slightly at very high or low values.
perc.alumni (Percentage of Alumni Who Donate): The plot is relatively flat with slight fluctuations, suggesting that there is minimal non-linear relationship or perhaps a very weak one. This indicates that the percentage of alumni who donate may not have a strong non-linear impact on out-of-state tuition.
Accept (Number of Accepted Students): The plot for Accept shows some non-linearity, with tuition decreasing as the number of acceptances increases, but then plateauing. This indicates a non-linear effect where increasing the number of acceptances initially reduces tuition, but only up to a certain point, after which the effect levels off.
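The impression of non-linearity in Expend can also be tested more formally by comparing the GAM above to one in which Expend enters linearly. A sketch using the same training data (approximate test, output not shown):
gam_expend_linear <- gam(Outstate ~ Expend + s(Room.Board) + Private +
                           s(perc.alumni) + s(Accept), data = training_set)
# A small p-value favours keeping the smooth term s(Expend)
anova(gam_expend_linear, gam_model, test = "F")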