Please do the following problems from the textbook ISLR, or as written otherwise. (Use set.seed(702) to replicate the results.)
#1. Question 7.9.6 pg 299
Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
##Question 6 part a)
I will load the Wage data set found in the ISLR library, set the seed to 702, and fit polynomial models of degree d = 1 through 20 to determine the best polynomial fit. I will use the boot library to perform cross-validation with K = 10 folds.
library(ISLR)
data(Wage)
library(boot)
set.seed(702)
#10-fold CV MSE for polynomial degrees 1 through 20
cv.error <- rep(NA, 20)
for (i in 1:20) {
  poly.fit <- glm(wage ~ poly(age, i), data=Wage)
  cv.error[i] <- cv.glm(Wage, poly.fit, K=10)$delta[1]
}
library(data.table)
track <- data.table(seq(1:20), cv.error, keep.rownames=TRUE)
cat("The Min cross validation MSE is: ", min(cv.error))
## The Min cross validation MSE is: 1593.817
Now I will plot the results using both base R plot() and ggplot() to visualize which polynomial degree produces the lowest cross-validation MSE.
library(tidyverse)
plot(x=1:20, y=cv.error, xlab='Degree Polynomial',ylab='CV MSE', type='b', main='CV MSE vs Degree of Polynomial. Wage Vs Age')
points( x=which.min(cv.error), y=min(cv.error), col='green', pch=18)
axis(1, at = seq(1,20, by=1))
ggplot(track, aes(V1, cv.error)) +
  geom_line() +
  geom_point(colour='blue') +
  geom_point(data=data.frame(x=which.min(cv.error), y=min(cv.error)), aes(x, y), colour='red') +
  labs(title='CV MSE vs Degree of Polynomial', x='Degree Polynomial', y='CV MSE') +
  scale_x_continuous(breaks=seq(from=1, to=20, by=1))
It appears that increasing the degree of the polynomial from 1 up to 9 reduces the CV MSE, and the degree-9 polynomial has the lowest CV MSE.
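As a quick numeric check (a minimal sketch reusing the cv.error vector computed above), the chosen degree can also be reported programmatically:
#Degree with the smallest 10-fold CV MSE
cat("The degree chosen by cross-validation is: ", which.min(cv.error))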
Now I will fit 20 polynomial linear regression models, with wage explained by polynomials of age of degree 1 through 20, and run an ANOVA to compare them.
#Fit polynomial regressions of degree 1 through 20 and compare the nested
#models with a sequential ANOVA
fits <- lapply(1:20, function(d) {
  lm(as.formula(paste0("wage ~ poly(age, ", d, ")")), data=Wage)
})
do.call(anova, fits)
After performing the ANOVA on the nested polynomial models, it appears that degrees 2, 3, 4 and 9 are all statistically significant. Degrees 3 and 4 are more strongly significant than degree 9, but 9 is still significant. This is roughly consistent with the cross-validation result: CV chooses degree 9, while the ANOVA suggests a lower degree (3 or 4) already captures most of the relationship.
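The sequential p-values can also be pulled straight out of the ANOVA table; a minimal sketch reusing the fits list created above (row d tests whether the degree-d term improves on the degree d-1 model, so the first entry is NA):
#Extract the Pr(>F) column from the sequential ANOVA table
anova.table <- do.call(anova, fits)
round(anova.table[["Pr(>F)"]], 4)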
#Making a plot of the degree-9 polynomial model on the scatter plot of wage vs age using the plot() command
plot(wage ~ age, data=Wage, col='blue', ylab='Wage',xlab='Age',main='Wage vs Age with 9 degree polynomial')
poly.fit <- glm(wage ~ poly(age,9), data=Wage)
agelims <- range(Wage$age)
age.grid <- seq(from=agelims[1], to=agelims[2])
pred <- predict(poly.fit, data.frame(age=age.grid))
lines(age.grid, pred, col='red', lwd=4)
#Making a data table so I can overlay the degree-9 polynomial model in ggplot using the geom_line() command
l.fit <- data.table(seq(from=min(Wage$age), to=max(Wage$age)), pred, keep.rownames = TRUE)
#Making a plot of the degree-9 polynomial model on the scatter plot of wage vs age using the ggplot() command
ggplot(data=Wage, aes(x=age, y=wage)) +
geom_point(col='blue') +
geom_line(data=l.fit, aes(V1, pred), size=1.5, col='red') +
labs(x='Age',y='Wage',title='Wage vs Age with 9 degree polynomial')
##Question 6 b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
Now I will fit multiple step functions, using cross-validation with K=10 to determine the number of cuts that produces the lowest cross-validation MSE. I will output graphs using plot() and ggplot() to show which number of cuts produces the lowest MSE.
#Setting the seed
set.seed(702)
#Cross validation K=10
cv.error <- rep(NA, 10)
for (i in 2:10) {
  Wage$cut.age <- cut(Wage$age, i)
  cut.fit <- glm(wage ~ cut.age, data=Wage)
  cv.error[i] <- cv.glm(Wage, cut.fit, K=10)$delta[1]
}
#Plotting the CV MSE vs the number of cuts using the plot() command
plot(2:10, cv.error[-1], xlab="No Cuts", ylab='CV MSE', main='Determine No. Cuts vs CV MSE',type='b')
points(x=which.min(cv.error), y=cv.error[which.min(cv.error)], col='green', pch=18, cex=3)
axis(1, at = seq(1,10, by=1))
#Plotting the CV MSE vs the number of cuts using the ggplot() command.
track <- data.table(seq(1:10), cv.error, keep.rownames=TRUE)
ggplot(track, aes(V1, cv.error)) +
  geom_line() + geom_point(colour='blue') +
  geom_point(data=data.frame(x=which.min(cv.error), y=cv.error[which.min(cv.error)]), aes(x, y), colour='red', size=3.0) +
  labs(title='Determining No. Cuts Vs. CV MSE', x='No. Cuts', y='CV MSE') +
  scale_x_continuous(breaks=seq(from=1, to=10, by=1), limits=c(2,10))
It appears that 8 cuts gives the lowest CV MSE. Now I will fit a step function with 8 cuts and overlay it on the wage vs age scatter plot to see what it looks like.
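As a quick check (a sketch reusing the cv.error vector from the loop above, whose index is the number of cuts; entry 1 is NA and is ignored by which.min):
#Number of cuts with the smallest 10-fold CV MSE
cat("The number of cuts chosen by cross-validation is: ", which.min(cv.error))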
#Making a plot of the 8-cut step function on the scatter plot of wage vs age using the plot() command
plot(wage ~ age, data=Wage, col='blue', ylab='Wage',xlab='Age',main='Wage vs Age with Cut Fit Model')
cut.fit <- glm(wage ~ cut(age,8), data=Wage)
agelims <- range(Wage$age)
age.grid <- seq(from=agelims[1], to=agelims[2])
pred <- predict(cut.fit, data.frame(age=age.grid))
lines(age.grid, pred, col='red', lwd=4)
#Making a data table so I can overlay the 8-cut step function in ggplot using the geom_line() command
l.fit <- data.table(seq(from=min(Wage$age), to=max(Wage$age)), pred, keep.rownames = TRUE)
#Making a plot of the 8-cut step function on the scatter plot of wage vs age using the ggplot() command
ggplot(data=Wage, aes(x=age, y=wage)) +
geom_point(col='blue') +
geom_line(data=l.fit, aes(V1, pred), size=1.5, col='red') +
labs(x='Age',y='Wage',title='Wage vs Age with Cut Fit Models')
The degree-9 polynomial looks smoother and more wavy; however, it tends to get a little wild near the upper age bracket. The step (cut) model is choppier, but it does not show that wild behavior near the upper ages.
#2. Question 7.9.9 pg 299 This question uses the variables dis (the weighted mean of distances to five Boston employment centers) and nox (nitrogen oxides concentration in parts per 10 million) from the Boston data. We will treat dis as the predictor and nox as the response.
##Question 9 a) Use the poly() function to fit a cubic polynomial regression to predict nox using dis. Report the regression output, and plot the resulting data and polynomial fits.
I will load the Boston data set found in the MASS library. Then I will fit a cubic polynomial regression model to predict nox explained by dis. I will then output the model results and plot the model over the scatter plot of nox vs dis using both the plot() and ggplot() commands.
##Fitting cubic model and summary
library(MASS)
data(Boston)
set.seed(702)
poly.fit <- lm(nox ~ poly(dis,3), data=Boston)
summary(poly.fit)
##
## Call:
## lm(formula = nox ~ poly(dis, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.121130 -0.040619 -0.009738 0.023385 0.194904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.554695 0.002759 201.021 < 2e-16 ***
## poly(dis, 3)1 -2.003096 0.062071 -32.271 < 2e-16 ***
## poly(dis, 3)2 0.856330 0.062071 13.796 < 2e-16 ***
## poly(dis, 3)3 -0.318049 0.062071 -5.124 4.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7131
## F-statistic: 419.3 on 3 and 502 DF, p-value: < 2.2e-16
All the terms in this cubic regression model have p-values that are highly statistically significant. The residual standard error is low (on 502 degrees of freedom), and the R-squared values are relatively high at about 0.71, so the model explains a good share of the variance in nox.
##Plots of the model
#Plot command
xlim <- range(Boston$dis)
x.grid <- seq(from=xlim[1], to=xlim[2], by=0.1)
pred <- predict(poly.fit, list(dis = x.grid))
plot(nox ~ dis, data=Boston, col='blue', xlab='weighted mean of distances to five Boston employment centres' ,ylab='Nitrogen Oxide Concentration', main='Cubic Polynomial Fit')
lines(x.grid, pred, col='red', lwd=3)
#ggplot
x <- Boston$dis
y <- Boston$nox
df<- data.frame (
x=x,
y=y
)
ggplot(data=df, aes(x, y)) +
geom_point(col='blue') +
geom_smooth(method="lm", se=TRUE, fill=NA, formula = y ~ poly(x,3, raw=TRUE), colour='red', size=1.5) +
labs(x='weighted mean of distances to five Boston employment centres' ,y='Nitrogen Oxide Concentration', title='Cubic Polynomial Fit')
The cubic polynomial looks like a good fit to the scatter plot.
##Question 9 b) Plot the polynomial fits for a range of different polynomial degrees (say, from 1 to 10), and report the associated residual sum of squares.
##Fitting 10 different degree polynomials
I will fit polynomials of degree 1 through 10 and report the residual sum of squares for each degree, plotting RSS against degree with both the plot() and ggplot() commands.
set.seed(702)
rss <- rep(NA, 10)
for (i in 1:10) {
  poly.fit <- lm(nox ~ poly(dis, i), data=Boston)
  rss[i] <- sum(poly.fit$residuals^2)
}
plot(1:10, rss, xlab='Degree of Polynomial', ylab='Residual Sum of Squares', type='b', main='Degree of Polynomial vs RSS')
axis(1, at = seq(1,10, by=1))
rss <- data.table(seq(1:10), rss, keep.rownames = TRUE)
ggplot(rss, aes(V1, rss)) +
geom_line() +
scale_x_continuous(breaks=c(1:10)) +
labs(x='Degree of Polynomial', y='Residual Sum of Squares', title='Degree of Polynomial vs RSS')
##Residual sum of squares output for each polynomial degree
rss
##Question 9 c)
Now I will perform cross-validation and select the optimal polynomial degree, i.e. the one with the lowest CV MSE.
set.seed(702)
cv.error <- rep(NA,10)
for (i in 1:10) {
  poly.fit <- glm(nox ~ poly(dis, i), data=Boston)
  cv.error[i] <- cv.glm(Boston, poly.fit, K=10)$delta[1]
}
plot(x=1:10, y=cv.error,xlab='Degree Polynomial',ylab='CV MSE', type='b', main='CV MSE vs Degree of Polynomial')
points(x=which.min(cv.error), y=min(cv.error), col='red', pch=18)
axis(1, at = seq(1,10,by=1))
The x-axis of the graph shows the degree of the polynomial from 1 to 10, and the y-axis shows the cross-validation (CV) MSE for the corresponding degree. The degree-7 polynomial has the highest CV MSE, while the degree-4 polynomial has the lowest CV MSE.
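A quick numeric confirmation (a sketch reusing the cv.error vector from the loop above):
#Degree with the smallest 10-fold CV MSE and its value
cat("Degree chosen by CV: ", which.min(cv.error), " with CV MSE: ", min(cv.error))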
##Question 9 d) Use the bs() function to fit a regression spline to predict nox using dis. Report the output for the fit using four degrees of freedom. How did you choose the knots? Plot the resulting fit.
##bs() to fit regression splines
I will load the splines library to use the bs() command with a linear regression model to fit regression splines. This function generates the entire matrix of basis functions for splines with the specified knots. The range of the dis variable is 1.1296 to 12.1265, so I could choose 3 knots roughly evenly spaced within that range. However, looking at the scatter plot of nox vs dis, the curve changes around dis = 4, maybe 7, and near the tail end around 10. Therefore, I will place the knots at c(4,7,10).
#install.packages('splines')
library(splines)
fit <- lm(nox ~ bs(dis, df=4, knots=c(4,7,10)), data=Boston)
summary(fit)
##
## Call:
## lm(formula = nox ~ bs(dis, df = 4, knots = c(4, 7, 10)), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.124592 -0.040324 -0.008719 0.024749 0.192906
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.73918 0.01332 55.510 < 2e-16
## bs(dis, df = 4, knots = c(4, 7, 10))1 -0.08839 0.02506 -3.526 0.00046
## bs(dis, df = 4, knots = c(4, 7, 10))2 -0.31363 0.01684 -18.625 < 2e-16
## bs(dis, df = 4, knots = c(4, 7, 10))3 -0.27037 0.02791 -9.686 < 2e-16
## bs(dis, df = 4, knots = c(4, 7, 10))4 -0.37989 0.03801 -9.995 < 2e-16
## bs(dis, df = 4, knots = c(4, 7, 10))5 -0.28983 0.06615 -4.381 1.44e-05
## bs(dis, df = 4, knots = c(4, 7, 10))6 -0.32971 0.06324 -5.214 2.71e-07
##
## (Intercept) ***
## bs(dis, df = 4, knots = c(4, 7, 10))1 ***
## bs(dis, df = 4, knots = c(4, 7, 10))2 ***
## bs(dis, df = 4, knots = c(4, 7, 10))3 ***
## bs(dis, df = 4, knots = c(4, 7, 10))4 ***
## bs(dis, df = 4, knots = c(4, 7, 10))5 ***
## bs(dis, df = 4, knots = c(4, 7, 10))6 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06185 on 499 degrees of freedom
## Multiple R-squared: 0.7185, Adjusted R-squared: 0.7151
## F-statistic: 212.3 on 6 and 499 DF, p-value: < 2.2e-16
All the terms in the fitted spline appear to be statistically significant. Note that because knots are supplied explicitly, bs() uses them and the df=4 argument is effectively ignored; with 3 interior knots and cubic pieces the basis has length(knots) + 3 = 6 terms, which is why six coefficients appear in the output.
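For comparison, when no knots are supplied bs() places them at quantiles of the predictor; a minimal sketch showing where df=4 on its own would put the single interior knot:
#With df=4, degree 3 and no intercept, bs() uses one interior knot at the median of dis,
#stored in the "knots" attribute of the basis matrix
attr(bs(Boston$dis, df=4), "knots")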
##Plots of the model
#plot() command of model overlain on the scatter plot
pred <- predict(fit, list(dis=x.grid))
plot(nox ~ dis, data=Boston, col='blue',xlab='weighted mean of distances to five Boston employment centres' ,ylab='Nitrogen Oxide Concentration', main='bs() fit regression splines; df=4')
lines(x.grid, pred, col='red',lwd=3)
ggplot(Boston, aes(x=dis,y=nox)) +
geom_point(col='blue') +
stat_smooth(method='lm', formula = y ~ bs(x, df=4, knots=c(4,7,10)), col='red') +
labs(x='weighted mean of distances to five Boston employment centres' ,y='Nitrogen Oxide Concentration', title='bs() fit regression splines; df=4')
##Question 9 e) Now fit a regression spline for a range of degrees of freedom, and plot the resulting fits and report the resulting RSS. Describe the results obtained.
Now I will fit 18 different spline models, one for each number of degrees of freedom from 3 to 20, and report the RSS for each. There will be two plots, one in base R and one in ggplot, showing the degrees of freedom on the x-axis and the residual sum of squares on the y-axis.
#set the seed
set.seed(702)
#fit spline models with degrees of freedom from 3 to 20
rss <- rep(NA, 20)
for (i in 3:20) {
  fit <- lm(nox ~ bs(dis, df=i), data=Boston)
  rss[i] <- sum(fit$residuals^2)
}
#plot the RSS vs degrees of freedom using the plot() command
plot(3:20, rss[-c(1,2)], xlab='Degrees of Freedom', ylab='RSS', type='b', main='Spline Model - RSS vs Degrees of Freedom')
axis(1, at = seq(1,20,by=1))
#plot the RSS vs degrees of freedom using the ggplot() command
rss <- data.table(seq(1:20), rss, keep.rownames = TRUE)
ggplot(rss, aes(V1, rss)) +
  geom_line() +
  scale_x_continuous(breaks=c(3:20), limits=c(3,20)) +
  labs(x='Degrees of freedom', y='Residual Sum of Squares', title='Spline Model - RSS vs Degrees of Freedom')
Generally speaking, there appears to be a negative relationship between RSS and degrees of freedom: as the degrees of freedom increase, the training RSS tends to decrease.
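A quick check of where the training RSS bottoms out (a sketch reusing the rss data.table built above; column V1 holds the degrees of freedom and rows 1-2 are NA):
#Degrees of freedom with the smallest training RSS
rss$V1[which.min(rss$rss)]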
##Question 9 f) Perform cross-validation or another approach in order to select the best degrees of freedom for a regression spline on this data. Describe your results.
Now I will use cross-validation with K=10 folds and compute the CV MSE for spline models over a range of degrees of freedom.
set.seed(702)
#Helper: fit a spline with df=i and return its 10-fold CV error
#(bs() warns and bumps the df up to 3 when i < 3)
cv.spline.fun <- function(i) {
  fit <- glm(nox ~ bs(dis, df=i), data=Boston)
  cv.glm(Boston, fit, K=10)$delta[2]
}
cv.err <- sapply(1:20, cv.spline.fun)
df <- seq(1:20)
dt <- data.table(df, cv.err)
ggplot(dt, aes(df, cv.err)) +
geom_point() +
geom_line() +
scale_x_continuous(breaks=seq(1:20)) +
labs(x='Degrees of Freedom', y='CV MSE', title='Spline - CV MSE vs Degrees of Freedom')
plot(cv.err, type='b',xlab='Degrees of Freedom', ylab='CV MSE', main='Spline - CV MSE vs Degrees of Freedom')
axis(1, at = seq(1,20,by=1))
It appears that 10 degrees of freedom produces the lowest CV MSE. The CV MSE jumps up and down as the degrees of freedom change, though over a fairly small range; both too few and too many degrees of freedom perform somewhat worse.
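As a quick check (a sketch reusing the cv.err vector computed above, which is indexed by degrees of freedom):
#Degrees of freedom with the smallest 10-fold CV MSE
cat("Degrees of freedom chosen by CV: ", which.min(cv.err))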
##plotting all the models
#install.packages('gridExtra')
library('gridExtra')
#One panel per degrees-of-freedom value, arranged four to a page
make.panel <- function(d) {
  ggplot(Boston, aes(x=dis, y=nox)) +
    geom_point(colour='black') +
    stat_smooth(method='lm', formula = y ~ bs(x, df=d)) +
    labs(x='dis', y='nox', title=paste('Degrees of Freedom', d))
}
panels <- lapply(1:12, make.panel)
grid.arrange(grobs=panels[1:4], ncol=2, nrow=2)
grid.arrange(grobs=panels[5:8], ncol=2, nrow=2)
grid.arrange(grobs=panels[9:12], ncol=2, nrow=2)
It appears that the more degrees of freedom, the more wild and squiggly the fit becomes. Just looking at the graphs, I would say fits with roughly 3-5 degrees of freedom appear to generalize to the data better than the higher ones.
#3. Question 7.9.10 pg 300
##Question 10 a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
I will load the College data set found in the ISLR library. After loading it, I will split the data into 70% training and 30% testing. The training data will be used to perform forward stepwise selection (via the regsubsets() command from the leaps library) to determine which variables are best for predicting out-of-state tuition. After fitting, the BIC, Cp, and adjusted R2 values will be plotted against the number of variables in the model. These graphs provide insight into how many variables are best suited for the selected model.
library(ISLR)
data(College)
library(leaps)
set.seed(702)
index <- sample(x=nrow(College), size=0.70*nrow(College))
training <- College[index,]
testing <- College[-index,]
forward.step <- regsubsets(Outstate ~ ., data=training, method='forward', nvmax = 17)
forward.summary <- summary(forward.step)
par(mfrow=c(2, 2))
plot(forward.summary$bic, type='b', xlab='No. of Variables', ylab='BIC', main='BIC Vs No. Of Variables For Model Selection')
axis(1, at = seq(1,17,by=1))
plot(forward.summary$cp, type='b', xlab='No. of Variables', ylab='CP', main='CP Vs No. Of Variables For Model Selection')
axis(1, at = seq(1,17,by=1))
plot(forward.summary$adjr2, type='b', xlab='No. of Variables', ylab='AdjR2', main='AdjR2 Vs No. Of Variables For Model Selection')
axis(1, at = seq(1,17,by=1))
##Data Frame Of above graphs
df <- data.frame(BIC=c(forward.summary$bic),
CP=c(forward.summary$cp),
AdjR2=c(forward.summary$adjr2))
df
According to BIC, the best number of variables for the model is 9. According to the adjusted R2 and Cp values, the model generally keeps improving as more variables are added. Putting these three criteria together, I would say anywhere from 6 to 9 variables would be reasonable. In this case, I am going to select 9, since that has the best BIC score; adding many more variables risks overfitting and adds noise.
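These choices can also be read off programmatically (a sketch reusing the forward.summary object from above):
#Model sizes preferred by each criterion
cat("Best size by BIC: ", which.min(forward.summary$bic), "\n")
cat("Best size by Cp: ", which.min(forward.summary$cp), "\n")
cat("Best size by Adjusted R2: ", which.max(forward.summary$adjr2), "\n")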
##Coefficient names for the 9 variables
Now I will extract the coefficient names and estimates for the 9-variable model from the forward stepwise fit.
coef.names <- coef(forward.step, id=9)
coef.names
## (Intercept) PrivateYes Top10perc F.Undergrad Room.Board
## -3.109430e+03 2.476767e+03 2.077670e+01 -4.553011e-02 1.008290e+00
## Personal Terminal perc.alumni Expend Grad.Rate
## -2.691567e-01 3.995176e+01 4.527029e+01 1.722600e-01 2.191977e+01
##Question 10 b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
Now I will fit a GAM on the training data to predict Outstate tuition. First I will look at scatter plots of Outstate against each of the predictors used in the model.
par(mfrow=c(2, 2))
plot(Outstate ~ Private, data=training)
plot(Outstate ~ Accept, data=training)
plot(Outstate ~ F.Undergrad, data=training)
plot(Outstate ~ Room.Board, data=training)
par(mfrow=c(2, 2))
plot(Outstate ~ Personal, data=training)
plot(Outstate ~ PhD, data=training)
plot(Outstate ~ perc.alumni, data=training)
plot(Outstate ~ Expend, data=training)
plot(Outstate ~ Grad.Rate, data=training)
#GAM Model
Now I fit a GAM model to the training data and output the summary graphs and data.
#install.packages('gam')
library('gam')
gam.fit <- gam(Outstate ~ Private + Accept+ F.Undergrad + Room.Board + Personal + PhD + perc.alumni + Expend + Grad.Rate, data=training)
par(mfrow=c(3, 3))
plot(gam.fit, se=TRUE)
Looking over the data, some of the variables might fit better if they are entered into the GAM as smoothing splines with a small number of degrees of freedom. Let's try this.
After exploring a little further, I decided to use splines with df=2 for Accept, Room.Board, Expend and Grad.Rate; splines with df=3 for Personal and PhD; and no splines (linear terms) for Private, F.Undergrad and perc.alumni.
This GAM produced a lower AIC score than the all-linear fit.
gam.fit2 <- gam(Outstate ~ Private + s(Accept, df=2) + F.Undergrad + s(Room.Board, df=2) + s(Personal, df=3) + s(PhD, df=3) + perc.alumni + s(Expend, df=2) + s(Grad.Rate, df=2), data=training)
par(mfrow=c(3, 3))
plot(gam.fit2, se=TRUE)
summary(gam.fit2)
##
## Call: gam(formula = Outstate ~ Private + s(Accept, df = 2) + F.Undergrad +
## s(Room.Board, df = 2) + s(Personal, df = 3) + s(PhD, df = 3) +
## perc.alumni + s(Expend, df = 2) + s(Grad.Rate, df = 2), data = training)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5625.51 -1154.12 75.55 1151.01 4896.64
##
## (Dispersion Parameter for gaussian family taken to be 3345949)
##
## Null Deviance: 8285293553 on 542 degrees of freedom
## Residual Deviance: 1756622409 on 524.9998 degrees of freedom
## AIC: 9718.293
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 2392315095 2392315095 714.989 < 2.2e-16 ***
## s(Accept, df = 2) 1 515095608 515095608 153.946 < 2.2e-16 ***
## F.Undergrad 1 269905535 269905535 80.666 < 2.2e-16 ***
## s(Room.Board, df = 2) 1 1150898265 1150898265 343.968 < 2.2e-16 ***
## s(Personal, df = 3) 1 49163618 49163618 14.694 0.0001418 ***
## s(PhD, df = 3) 1 589908258 589908258 176.305 < 2.2e-16 ***
## perc.alumni 1 405075923 405075923 121.065 < 2.2e-16 ***
## s(Expend, df = 2) 1 402143581 402143581 120.188 < 2.2e-16 ***
## s(Grad.Rate, df = 2) 1 44479132 44479132 13.293 0.0002929 ***
## Residuals 525 1756622409 3345949
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Accept, df = 2) 1 3.410 0.065368 .
## F.Undergrad
## s(Room.Board, df = 2) 1 1.274 0.259574
## s(Personal, df = 3) 2 4.803 0.008567 **
## s(PhD, df = 3) 2 2.426 0.089337 .
## perc.alumni
## s(Expend, df = 2) 1 46.828 2.173e-11 ***
## s(Grad.Rate, df = 2) 1 1.452 0.228815
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
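To back up the AIC comparison mentioned above, the two GAM fits can be compared directly (a minimal sketch, assuming the fitted gam objects carry an aic component in the same way glm fits do):
#AIC of the all-linear GAM versus the GAM with smoothing splines
cat("Linear-terms GAM AIC: ", gam.fit$aic, "\n")
cat("Spline GAM AIC: ", gam.fit2$aic, "\n")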
##Question 10 c) Evaluate the model obtained on the test set, and explain the results obtained.
Now I will use the model fitted on the training data to predict on the testing data set and obtain the test MSE.
pred <- predict(gam.fit2, testing)
error.rate <- mean((testing$Outstate - pred)^2)
cat('The GAM model MSE is: ', error.rate)
## The GAM model MSE is: 3967901
##Test R2 value of the model
r2 <- 1 - error.rate/mean(((testing$Outstate - mean(testing$Outstate)))^2)
cat('The R2 of the GAM mode is: ', r2)
## The R2 of the GAM mode is: 0.7827326
This test R2 is better than the adjusted R2 values we saw when selecting the number of variables using BIC, Cp and adjusted R2 in part a). The MSE looks large in absolute terms, but the R2 means the model explains roughly 78% of the variance in out-of-state tuition on the test set using 9 predictors.
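To make the comparison with part a) concrete (a sketch reusing the forward.summary object from that part):
#Largest training adjusted R2 across the forward stepwise models
cat("Best training Adjusted R2 from part a): ", max(forward.summary$adjr2))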
##Question 10 d) For which variables, if any, is there evidence of a non-linear relationship with the response?
I will output the summary of the model again to determine whether there is any evidence of a non-linear relationship.
summary(gam.fit2)
##
## Call: gam(formula = Outstate ~ Private + s(Accept, df = 2) + F.Undergrad +
## s(Room.Board, df = 2) + s(Personal, df = 3) + s(PhD, df = 3) +
## perc.alumni + s(Expend, df = 2) + s(Grad.Rate, df = 2), data = training)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5625.51 -1154.12 75.55 1151.01 4896.64
##
## (Dispersion Parameter for gaussian family taken to be 3345949)
##
## Null Deviance: 8285293553 on 542 degrees of freedom
## Residual Deviance: 1756622409 on 524.9998 degrees of freedom
## AIC: 9718.293
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 2392315095 2392315095 714.989 < 2.2e-16 ***
## s(Accept, df = 2) 1 515095608 515095608 153.946 < 2.2e-16 ***
## F.Undergrad 1 269905535 269905535 80.666 < 2.2e-16 ***
## s(Room.Board, df = 2) 1 1150898265 1150898265 343.968 < 2.2e-16 ***
## s(Personal, df = 3) 1 49163618 49163618 14.694 0.0001418 ***
## s(PhD, df = 3) 1 589908258 589908258 176.305 < 2.2e-16 ***
## perc.alumni 1 405075923 405075923 121.065 < 2.2e-16 ***
## s(Expend, df = 2) 1 402143581 402143581 120.188 < 2.2e-16 ***
## s(Grad.Rate, df = 2) 1 44479132 44479132 13.293 0.0002929 ***
## Residuals 525 1756622409 3345949
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Accept, df = 2) 1 3.410 0.065368 .
## F.Undergrad
## s(Room.Board, df = 2) 1 1.274 0.259574
## s(Personal, df = 3) 2 4.803 0.008567 **
## s(PhD, df = 3) 2 2.426 0.089337 .
## perc.alumni
## s(Expend, df = 2) 1 46.828 2.173e-11 ***
## s(Grad.Rate, df = 2) 1 1.452 0.228815
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA for nonparametric effects shows that Expend has a strongly significant non-linear relationship with the response. Personal shows the next strongest evidence of non-linearity (p ≈ 0.009), while Accept and PhD are only marginally significant at the 0.1 level. Room.Board and Grad.Rate show no real evidence of a non-linear relationship.