Question 6

In this exercise , further analyze the wage dataset considered throughout the chapter

data("Wage")
str(Wage)
## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...

Question 6a

Perform polynomial regression to predict wage using age. Use cross validation to select the optimal degree d for the polynomial.what degree was chosen and how does this compare to the results of the hypothesis testing using Anova? Make a plot of the resulting polynomial fit to the data

attach(Wage)
set.seed(10)
cv.wage.lm.error= rep(0,15)
for (i in 1:15){
wage.lm.fit=glm(wage~poly(age,i), data=Wage)
cv.wage.lm.error[i]= cv.glm(Wage,wage.lm.fit,K=10)$delta[1]
}

Polynomial Regression Using Cross-validation

plot the errors to see which degree suits best

plot(cv.wage.lm.error,xlab = "Degree Of Polynomial",ylab="Cross Validation Error",type = "b",col="red")

As per the above plot the cross validation error drops rapidly at the quadratic polynomial regression. After that there is no much changes in the cross validation error. On an eyeball view Degree 4 polynomial regression shows the minimum validation error and it provides a reasonable fit to the data

find out the Minimum Cross Validation error from the above models

min(cv.wage.lm.error)
## [1] 1592.464
which.min(cv.wage.lm.error)
## [1] 9

As per the cross validation Polynomial degree 9 yields the lowest cross validation error

summary(glm(wage~poly(age,9),data = Wage))
## 
## Call:
## glm(formula = wage ~ poly(age, 9), data = Wage)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -100.380   -24.436    -4.999    15.488   199.552  
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    111.7036     0.7282 153.395  < 2e-16 ***
## poly(age, 9)1  447.0679    39.8857  11.209  < 2e-16 ***
## poly(age, 9)2 -478.3158    39.8857 -11.992  < 2e-16 ***
## poly(age, 9)3  125.5217    39.8857   3.147  0.00167 ** 
## poly(age, 9)4  -77.9112    39.8857  -1.953  0.05087 .  
## poly(age, 9)5  -35.8129    39.8857  -0.898  0.36932    
## poly(age, 9)6   62.7077    39.8857   1.572  0.11601    
## poly(age, 9)7   50.5498    39.8857   1.267  0.20512    
## poly(age, 9)8  -11.2547    39.8857  -0.282  0.77783    
## poly(age, 9)9  -83.6918    39.8857  -2.098  0.03596 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1590.871)
## 
##     Null deviance: 5222086  on 2999  degrees of freedom
## Residual deviance: 4756703  on 2990  degrees of freedom
## AIC: 30642
## 
## Number of Fisher Scoring iterations: 2

From the summary detail, the coefficients value till the 4 degree polynomial is significant. Rest of the coefficients has the large p value which makes the coefficient insignificant. Based on the p Value significance cubic or degree 4 polynomial regression fits the data well.

Polynomial Regression using Anova Comparision

wage.lm.fit1=lm(wage~age,data = Wage)
wage.lm.fit2=lm(wage~poly(age,2), data = Wage)
wage.lm.fit3=lm(wage~poly(age,3), data = Wage)
wage.lm.fit4=lm(wage~poly(age,4), data = Wage)
wage.lm.fit5=lm(wage~poly(age,5), data = Wage)
wage.lm.fit6=lm(wage~poly(age,6), data = Wage)

Compare the Models using ANOVA

  • Null Hypothesis: Polynomial degrees of age doesn’t have significant impact on the wages
  • Alternative hypothesis : Polynomial degrees of age does have significant impact on the wages
anova(wage.lm.fit1,wage.lm.fit2,wage.lm.fit3,wage.lm.fit4,wage.lm.fit5,wage.lm.fit6)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Model 6: wage ~ poly(age, 6)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.6636 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8936  0.001675 ** 
## 4   2995 4771604  1      6070   3.8117  0.050989 .  
## 5   2994 4770322  1      1283   0.8054  0.369565    
## 6   2993 4766389  1      3932   2.4692  0.116201    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As per the comparison of p-value for different models, till degree-4 polynomial regression, the p-value is less than or equal to the threshold value of 0.05,which makes us to conclude that the cubic or degree 4 polynomial regression provides a perfect fit to the data

Inference from Cross-validation and Anova Comparision:

Cross-validation and Anova Comparision makes degree-4 polynominal regression is a bit fit model for the wage data

predict the wage using the degree 4 fit

age.lims=range(age)
age.grid= seq(from=age.lims[1],to=age.lims[2])
wage.lm4.predict=predict(wage.lm.fit4,newdata = list(age=age.grid),se=TRUE)
se.bounds=cbind(wage.lm4.predict$fit+2*wage.lm4.predict$se.fit,wage.lm4.predict$fit-2*wage.lm4.predict$se.fit)

Plot the prediction

plot(age,wage,xlim=age.lims,cex=.5,col='darkgrey')
title(("Degree-4 Polynomial Regression"))
lines(age.grid,wage.lm4.predict$fit,lwd=2,col='blue')
matlines(age.grid,se.bounds,lwd = 2,col = 'red',ltype=3)

Question 6b

Fit a step function to predict wage using age and perform cross validation to choose the optimal number of cuts .Make a plot of the fit obtained.

Step function using cross-Validation

set.seed(10)
cv.wage.step.error= rep(0,25)
for (i in 2:25){
  Wage$age.cut=cut(Wage$age,i)
wage.step.fit=glm(wage~age.cut, data=Wage)
cv.wage.step.error[i]= cv.glm(Wage,wage.step.fit,K=10)$delta[1]
}

plot the errors to see which cut suits best

plot(cv.wage.step.error,ylab="Cross Validation Error",type = "b",col="red")

As per the above plot the cross validation error improves significantly when the cut is 3. After that there is no much changes in the cross validation error. On an eyeball step 8 function shows the maximum validation error and it provides a reasonable fit to the data

fit the data Using cut 8

wage.step8.fit=lm(wage~cut(age,8),data=Wage)
summary(wage.step8.fit)
## 
## Call:
## lm(formula = wage ~ cut(age, 8), data = Wage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -99.697 -24.552  -5.307  15.417 198.560 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              76.282      2.630  29.007  < 2e-16 ***
## cut(age, 8)(25.8,33.5]   25.833      3.161   8.172 4.44e-16 ***
## cut(age, 8)(33.5,41.2]   40.226      3.049  13.193  < 2e-16 ***
## cut(age, 8)(41.2,49]     43.501      3.018  14.412  < 2e-16 ***
## cut(age, 8)(49,56.8]     40.136      3.177  12.634  < 2e-16 ***
## cut(age, 8)(56.8,64.5]   44.102      3.564  12.373  < 2e-16 ***
## cut(age, 8)(64.5,72.2]   28.948      6.042   4.792 1.74e-06 ***
## cut(age, 8)(72.2,80.1]   15.224      9.781   1.556     0.12    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.97 on 2992 degrees of freedom
## Multiple R-squared:  0.08467,    Adjusted R-squared:  0.08253 
## F-statistic: 39.54 on 7 and 2992 DF,  p-value: < 2.2e-16

Predict wage using the fitted model

wage.step8.predict = predict(wage.step8.fit, newdata = list(age=age.grid),se=TRUE)
wage.step8.se.bands= cbind(wage.step8.predict$fit+2*wage.step8.predict$se.fit,wage.step8.predict$fit-2*wage.step8.predict$se.fit)

Plot the fitted Model

plot(age,wage,xlim=age.lims,cex=0.5,col='darkgrey')
lines(age.grid,wage.step8.predict$fit,lwd=2,col='blue')
matlines(age.grid,wage.step8.se.bands,lwd=2,col='red',lty = 3)

Question 10

This Question relates to college data set

Question 10a

Split the data into training set and test set. Using out of state tuition as the response and other variables as the predictors.Perform forward step wise selection on the training set in order to identify a satisfactory model that uses just the subset of the predictors

Split the data set into training and test set.

data(College)
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...
trainIndex <- createDataPartition(College$Outstate, p=0.8, list=FALSE)
college.x.train=College[trainIndex,]
college.x.test=College[-trainIndex,]
attach(College)
college.regfit.fwd=regsubsets(Outstate~., data=College[trainIndex,], nvmax=18, method ="forward")
college.fwd.summary=summary(college.regfit.fwd)

Select the Best Modelusing the R2, Adjusted r2, Cp and BIC score

plot(college.fwd.summary$rsq,xlab="Number of Predictor Variables",ylab = "R Square")

plot(college.fwd.summary$adjr2,xlab="Number of Predictor Variables",ylab = "Adjusted R Square")

plot(college.fwd.summary$cp,xlab="Number of Predictor Variables",ylab = "CP")

plot(college.fwd.summary$bic,xlab="Number of Predictor Variables",ylab = "bic")

As per the Mallow CP and BIC plot, the reduction in cp and bic score is significant un-till the model with 6 after that the reduction of the score is not significant.so Based on the plot, model with 6 predictors best fit the data.

Find the MSE using the Test dataset inorder to get the best model

colg.test.mat = model.matrix(Outstate~.,data=College[-trainIndex,])
Colg.val.errors.fwd=rep(NA,17)
colg.val.acc.fwd= rep(0,17)
for (i in 1:17){
  colg.coefi.fwd= coef(college.regfit.fwd,id=i)
  colg.pred.fwd=colg.test.mat[,names(colg.coefi.fwd)]%*%colg.coefi.fwd
  colg.val.acc.fwd[i]=(cor(College$Outstate[-trainIndex],colg.pred.fwd)^2)
  Colg.val.errors.fwd[i]=mean((College$Outstate[-trainIndex]-colg.pred.fwd)^2)
}
which.min(Colg.val.errors.fwd)
## [1] 11

Based on the Minimum MSE , model with 11 predictors is the best model to fit the data

Find out the accuracy for model with 6 and 11 predictors

colg.val.acc.fwd[11]
## [1] 0.7643334
colg.val.acc.fwd[6]
## [1] 0.7522147

Find out the MSE for the model with six and eleven predictors

Colg.val.errors.fwd[11]
## [1] 3860707
Colg.val.errors.fwd[6]
## [1] 4040752

Inference from forward stepwise selection method.

  • As per the BIC and CP plot , model six predictors will be the best model to fit the data.
  • Based on the Validation error model with eleven predictors will be the best model to fit the data.
  • But there is no much significant difference in the validation error and accuracy between the model with six and eleven predictors.

Find out the predictors of the Model with six predictors

coef(college.regfit.fwd, 6)
##   (Intercept)    PrivateYes    Room.Board           PhD   perc.alumni 
## -3383.3535050  2724.2382301     0.9732230    33.3829715    47.5918212 
##        Expend     Grad.Rate 
##     0.2176194    30.9152889

Question 10b

Fit the GAM on the training data, using out of state tuition as the response and the features selected in the previous step as the predictors.Plot the results and explain your findings.

Find the GAM Regression

college.gam.fit=gam(Outstate~Private+s(perc.alumni,4)+s(PhD,5)+Room.Board+Expend+s(Grad.Rate,5),data = college.x.train)
plot(college.gam.fit,se=TRUE,col="#1c9099")

Inference about the GAM plot

  • As per the Plots, The out of state tuition fees is high for the private institution
  • Out of state tuition fees is in increasing trend for percent alumni who donate,instructional expenditure,percent of PhD faculty,Room and Boarding cost increases.
  • Whereas if the institution graduation rate increases, the out of state tuition get into decreasing trend.

Question 10c

Evaluate the model obtained on the test set and explain the results obtained

college.gam.predict=predict(college.gam.fit,data=college.x.test)
mean((college.x.test$Outstate-college.gam.predict)^2)
## Warning in college.x.test$Outstate - college.gam.predict: longer object length
## is not a multiple of shorter object length
## [1] 26792209
summary(college.gam.fit)
## 
## Call: gam(formula = Outstate ~ Private + s(perc.alumni, 4) + s(PhD, 
##     5) + Room.Board + Expend + s(Grad.Rate, 5), data = college.x.train)
## Deviance Residuals:
##    Min     1Q Median     3Q    Max 
##  -7425  -1200   -104   1241   9741 
## 
## (Dispersion Parameter for gaussian family taken to be 3991118)
## 
##     Null Deviance: 10070574121 on 622 degrees of freedom
## Residual Deviance: 2414626616 on 605.0001 degrees of freedom
## AIC: 11257.07 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                    Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private             1 3048132122 3048132122  763.73 < 2.2e-16 ***
## s(perc.alumni, 4)   1 1215609180 1215609180  304.58 < 2.2e-16 ***
## s(PhD, 5)           1 1335088400 1335088400  334.51 < 2.2e-16 ***
## Room.Board          1 1063779767 1063779767  266.54 < 2.2e-16 ***
## Expend              1  423144334  423144334  106.02 < 2.2e-16 ***
## s(Grad.Rate, 5)     1  126396930  126396930   31.67 2.798e-08 ***
## Residuals         605 2414626616    3991118                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                   Npar Df Npar F    Pr(F)    
## (Intercept)                                  
## Private                                      
## s(perc.alumni, 4)       3 0.8511 0.466339    
## s(PhD, 5)               4 5.3620 0.000301 ***
## Room.Board                                   
## Expend                                       
## s(Grad.Rate, 5)         4 3.4981 0.007771 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the plot , graduation rate and PhD doesn’t possess any linear relationship with response, whereas summary , it seems all the variables are significant in the GAM model

Question 10d

For which variables, is any is there evidence of the non-linear relationship with the response

college.gam.fit1= gam(Outstate~Private+s(perc.alumni,4)+s(Room.Board,4)+s(Expend,4)+s(PhD,4),data = college.x.train)
college.gam.fit2= gam(Outstate~Private+s(perc.alumni,4)+s(Room.Board,4)+s(Expend,4)+s(Grad.Rate),data = college.x.train)
college.gam.fit3= gam(Outstate~Private+s(perc.alumni,4)+s(Room.Board,4)+s(Expend,4)+s(Grad.Rate)+s(PhD),data = college.x.train)
college.gam.fit4= gam(Outstate~Private+s(perc.alumni,4)+s(Room.Board,4)+s(Expend,4)+Grad.Rate+s(PhD),data = college.x.train)
college.gam.fit5= gam(Outstate~Private+s(perc.alumni,4)+s(Room.Board,4)+s(Expend,4)+s(Grad.Rate)+PhD,data = college.x.train)

Comparing the gam model using ANOVA function

anova(college.gam.fit1,college.gam.fit2,college.gam.fit3,college.gam.fit,college.gam.fit4,college.gam.fit5)
## Analysis of Deviance Table
## 
## Model 1: Outstate ~ Private + s(perc.alumni, 4) + s(Room.Board, 4) + s(Expend, 
##     4) + s(PhD, 4)
## Model 2: Outstate ~ Private + s(perc.alumni, 4) + s(Room.Board, 4) + s(Expend, 
##     4) + s(Grad.Rate)
## Model 3: Outstate ~ Private + s(perc.alumni, 4) + s(Room.Board, 4) + s(Expend, 
##     4) + s(Grad.Rate) + s(PhD)
## Model 4: Outstate ~ Private + s(perc.alumni, 4) + s(PhD, 5) + Room.Board + 
##     Expend + s(Grad.Rate, 5)
## Model 5: Outstate ~ Private + s(perc.alumni, 4) + s(Room.Board, 4) + s(Expend, 
##     4) + Grad.Rate + s(PhD)
## Model 6: Outstate ~ Private + s(perc.alumni, 4) + s(Room.Board, 4) + s(Expend, 
##     4) + s(Grad.Rate) + PhD
##   Resid. Df Resid. Dev          Df   Deviance Pr(>Chi)    
## 1       605 2219829154                                    
## 2       605 2111305891 -8.7084e-05  108523262             
## 3       601 2068081013  3.9999e+00   43224878  0.01363 *  
## 4       605 2414626616 -4.0003e+00 -346545603  < 2e-16 ***
## 5       604 2091657851  1.0005e+00  322968765  < 2e-16 ***
## 6       604 2088202493 -8.7084e-05    3455358             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As per the Anova comparison Model 4 and model 5 is best to fit the data. It conclude that the PhD predictor possess non linear relationship with the response.