In this exercise , further analyze the wage dataset considered throughout the chapter
data("Wage")
str(Wage)
## 'data.frame': 3000 obs. of 11 variables:
## $ year : int 2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
## $ age : int 18 24 45 43 50 54 44 30 41 52 ...
## $ maritl : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
## $ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
## $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
## $ region : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ jobclass : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ health : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
## $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
## $ logwage : num 4.32 4.26 4.88 5.04 4.32 ...
## $ wage : num 75 70.5 131 154.7 75 ...
Perform polynomial regression to predict wage using age. Use cross validation to select the optimal degree d for the polynomial.what degree was chosen and how does this compare to the results of the hypothesis testing using Anova? Make a plot of the resulting polynomial fit to the data
attach(Wage)
set.seed(10)
cv.wage.lm.error= rep(0,15)
for (i in 1:15){
wage.lm.fit=glm(wage~poly(age,i), data=Wage)
cv.wage.lm.error[i]= cv.glm(Wage,wage.lm.fit,K=10)$delta[1]
}
plot the errors to see which degree suits best
plot(cv.wage.lm.error,xlab = "Degree Of Polynomial",ylab="Cross Validation Error",type = "b",col="red")
As per the above plot the cross validation error drops rapidly at the quadratic polynomial regression. After that there is no much changes in the cross validation error. On an eyeball view Degree 4 polynomial regression shows the minimum validation error and it provides a reasonable fit to the data
min(cv.wage.lm.error)
## [1] 1592.464
which.min(cv.wage.lm.error)
## [1] 9
As per the cross validation Polynomial degree 9 yields the lowest cross validation error
summary(glm(wage~poly(age,9),data = Wage))
##
## Call:
## glm(formula = wage ~ poly(age, 9), data = Wage)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -100.380 -24.436 -4.999 15.488 199.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.7036 0.7282 153.395 < 2e-16 ***
## poly(age, 9)1 447.0679 39.8857 11.209 < 2e-16 ***
## poly(age, 9)2 -478.3158 39.8857 -11.992 < 2e-16 ***
## poly(age, 9)3 125.5217 39.8857 3.147 0.00167 **
## poly(age, 9)4 -77.9112 39.8857 -1.953 0.05087 .
## poly(age, 9)5 -35.8129 39.8857 -0.898 0.36932
## poly(age, 9)6 62.7077 39.8857 1.572 0.11601
## poly(age, 9)7 50.5498 39.8857 1.267 0.20512
## poly(age, 9)8 -11.2547 39.8857 -0.282 0.77783
## poly(age, 9)9 -83.6918 39.8857 -2.098 0.03596 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1590.871)
##
## Null deviance: 5222086 on 2999 degrees of freedom
## Residual deviance: 4756703 on 2990 degrees of freedom
## AIC: 30642
##
## Number of Fisher Scoring iterations: 2
From the summary detail, the coefficients value till the 4 degree polynomial is significant. Rest of the coefficients has the large p value which makes the coefficient insignificant. Based on the p Value significance cubic or degree 4 polynomial regression fits the data well.
wage.lm.fit1=lm(wage~age,data = Wage)
wage.lm.fit2=lm(wage~poly(age,2), data = Wage)
wage.lm.fit3=lm(wage~poly(age,3), data = Wage)
wage.lm.fit4=lm(wage~poly(age,4), data = Wage)
wage.lm.fit5=lm(wage~poly(age,5), data = Wage)
wage.lm.fit6=lm(wage~poly(age,6), data = Wage)
Compare the Models using ANOVA
anova(wage.lm.fit1,wage.lm.fit2,wage.lm.fit3,wage.lm.fit4,wage.lm.fit5,wage.lm.fit6)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Model 6: wage ~ poly(age, 6)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.6636 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8936 0.001675 **
## 4 2995 4771604 1 6070 3.8117 0.050989 .
## 5 2994 4770322 1 1283 0.8054 0.369565
## 6 2993 4766389 1 3932 2.4692 0.116201
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As per the comparison of p-value for different models, till degree-4 polynomial regression, the p-value is less than or equal to the threshold value of 0.05,which makes us to conclude that the cubic or degree 4 polynomial regression provides a perfect fit to the data
Cross-validation and Anova Comparision makes degree-4 polynominal regression is a bit fit model for the wage data
age.lims=range(age)
age.grid= seq(from=age.lims[1],to=age.lims[2])
wage.lm4.predict=predict(wage.lm.fit4,newdata = list(age=age.grid),se=TRUE)
se.bounds=cbind(wage.lm4.predict$fit+2*wage.lm4.predict$se.fit,wage.lm4.predict$fit-2*wage.lm4.predict$se.fit)
plot(age,wage,xlim=age.lims,cex=.5,col='darkgrey')
title(("Degree-4 Polynomial Regression"))
lines(age.grid,wage.lm4.predict$fit,lwd=2,col='blue')
matlines(age.grid,se.bounds,lwd = 2,col = 'red',ltype=3)
Fit a step function to predict wage using age and perform cross validation to choose the optimal number of cuts .Make a plot of the fit obtained.
set.seed(10)
cv.wage.step.error= rep(0,25)
for (i in 2:25){
Wage$age.cut=cut(Wage$age,i)
wage.step.fit=glm(wage~age.cut, data=Wage)
cv.wage.step.error[i]= cv.glm(Wage,wage.step.fit,K=10)$delta[1]
}
plot the errors to see which cut suits best
plot(cv.wage.step.error,ylab="Cross Validation Error",type = "b",col="red")
As per the above plot the cross validation error improves significantly when the cut is 3. After that there is no much changes in the cross validation error. On an eyeball step 8 function shows the maximum validation error and it provides a reasonable fit to the data
wage.step8.fit=lm(wage~cut(age,8),data=Wage)
summary(wage.step8.fit)
##
## Call:
## lm(formula = wage ~ cut(age, 8), data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -99.697 -24.552 -5.307 15.417 198.560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.282 2.630 29.007 < 2e-16 ***
## cut(age, 8)(25.8,33.5] 25.833 3.161 8.172 4.44e-16 ***
## cut(age, 8)(33.5,41.2] 40.226 3.049 13.193 < 2e-16 ***
## cut(age, 8)(41.2,49] 43.501 3.018 14.412 < 2e-16 ***
## cut(age, 8)(49,56.8] 40.136 3.177 12.634 < 2e-16 ***
## cut(age, 8)(56.8,64.5] 44.102 3.564 12.373 < 2e-16 ***
## cut(age, 8)(64.5,72.2] 28.948 6.042 4.792 1.74e-06 ***
## cut(age, 8)(72.2,80.1] 15.224 9.781 1.556 0.12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.97 on 2992 degrees of freedom
## Multiple R-squared: 0.08467, Adjusted R-squared: 0.08253
## F-statistic: 39.54 on 7 and 2992 DF, p-value: < 2.2e-16
wage.step8.predict = predict(wage.step8.fit, newdata = list(age=age.grid),se=TRUE)
wage.step8.se.bands= cbind(wage.step8.predict$fit+2*wage.step8.predict$se.fit,wage.step8.predict$fit-2*wage.step8.predict$se.fit)
plot(age,wage,xlim=age.lims,cex=0.5,col='darkgrey')
lines(age.grid,wage.step8.predict$fit,lwd=2,col='blue')
matlines(age.grid,wage.step8.se.bands,lwd=2,col='red',lty = 3)
This Question relates to college data set
Split the data into training set and test set. Using out of state tuition as the response and other variables as the predictors.Perform forward step wise selection on the training set in order to identify a satisfactory model that uses just the subset of the predictors
data(College)
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
trainIndex <- createDataPartition(College$Outstate, p=0.8, list=FALSE)
college.x.train=College[trainIndex,]
college.x.test=College[-trainIndex,]
attach(College)
college.regfit.fwd=regsubsets(Outstate~., data=College[trainIndex,], nvmax=18, method ="forward")
college.fwd.summary=summary(college.regfit.fwd)
plot(college.fwd.summary$rsq,xlab="Number of Predictor Variables",ylab = "R Square")
plot(college.fwd.summary$adjr2,xlab="Number of Predictor Variables",ylab = "Adjusted R Square")
plot(college.fwd.summary$cp,xlab="Number of Predictor Variables",ylab = "CP")
plot(college.fwd.summary$bic,xlab="Number of Predictor Variables",ylab = "bic")
As per the Mallow CP and BIC plot, the reduction in cp and bic score is significant un-till the model with 6 after that the reduction of the score is not significant.so Based on the plot, model with 6 predictors best fit the data.
colg.test.mat = model.matrix(Outstate~.,data=College[-trainIndex,])
Colg.val.errors.fwd=rep(NA,17)
colg.val.acc.fwd= rep(0,17)
for (i in 1:17){
colg.coefi.fwd= coef(college.regfit.fwd,id=i)
colg.pred.fwd=colg.test.mat[,names(colg.coefi.fwd)]%*%colg.coefi.fwd
colg.val.acc.fwd[i]=(cor(College$Outstate[-trainIndex],colg.pred.fwd)^2)
Colg.val.errors.fwd[i]=mean((College$Outstate[-trainIndex]-colg.pred.fwd)^2)
}
which.min(Colg.val.errors.fwd)
## [1] 11
Based on the Minimum MSE , model with 11 predictors is the best model to fit the data
colg.val.acc.fwd[11]
## [1] 0.7643334
colg.val.acc.fwd[6]
## [1] 0.7522147
Colg.val.errors.fwd[11]
## [1] 3860707
Colg.val.errors.fwd[6]
## [1] 4040752
Find out the predictors of the Model with six predictors
coef(college.regfit.fwd, 6)
## (Intercept) PrivateYes Room.Board PhD perc.alumni
## -3383.3535050 2724.2382301 0.9732230 33.3829715 47.5918212
## Expend Grad.Rate
## 0.2176194 30.9152889
Fit the GAM on the training data, using out of state tuition as the response and the features selected in the previous step as the predictors.Plot the results and explain your findings.
college.gam.fit=gam(Outstate~Private+s(perc.alumni,4)+s(PhD,5)+Room.Board+Expend+s(Grad.Rate,5),data = college.x.train)
plot(college.gam.fit,se=TRUE,col="#1c9099")
Evaluate the model obtained on the test set and explain the results obtained
college.gam.predict=predict(college.gam.fit,data=college.x.test)
mean((college.x.test$Outstate-college.gam.predict)^2)
## Warning in college.x.test$Outstate - college.gam.predict: longer object length
## is not a multiple of shorter object length
## [1] 26792209
summary(college.gam.fit)
##
## Call: gam(formula = Outstate ~ Private + s(perc.alumni, 4) + s(PhD,
## 5) + Room.Board + Expend + s(Grad.Rate, 5), data = college.x.train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7425 -1200 -104 1241 9741
##
## (Dispersion Parameter for gaussian family taken to be 3991118)
##
## Null Deviance: 10070574121 on 622 degrees of freedom
## Residual Deviance: 2414626616 on 605.0001 degrees of freedom
## AIC: 11257.07
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 3048132122 3048132122 763.73 < 2.2e-16 ***
## s(perc.alumni, 4) 1 1215609180 1215609180 304.58 < 2.2e-16 ***
## s(PhD, 5) 1 1335088400 1335088400 334.51 < 2.2e-16 ***
## Room.Board 1 1063779767 1063779767 266.54 < 2.2e-16 ***
## Expend 1 423144334 423144334 106.02 < 2.2e-16 ***
## s(Grad.Rate, 5) 1 126396930 126396930 31.67 2.798e-08 ***
## Residuals 605 2414626616 3991118
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(perc.alumni, 4) 3 0.8511 0.466339
## s(PhD, 5) 4 5.3620 0.000301 ***
## Room.Board
## Expend
## s(Grad.Rate, 5) 4 3.4981 0.007771 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the plot , graduation rate and PhD doesn’t possess any linear relationship with response, whereas summary , it seems all the variables are significant in the GAM model
For which variables, is any is there evidence of the non-linear relationship with the response
college.gam.fit1= gam(Outstate~Private+s(perc.alumni,4)+s(Room.Board,4)+s(Expend,4)+s(PhD,4),data = college.x.train)
college.gam.fit2= gam(Outstate~Private+s(perc.alumni,4)+s(Room.Board,4)+s(Expend,4)+s(Grad.Rate),data = college.x.train)
college.gam.fit3= gam(Outstate~Private+s(perc.alumni,4)+s(Room.Board,4)+s(Expend,4)+s(Grad.Rate)+s(PhD),data = college.x.train)
college.gam.fit4= gam(Outstate~Private+s(perc.alumni,4)+s(Room.Board,4)+s(Expend,4)+Grad.Rate+s(PhD),data = college.x.train)
college.gam.fit5= gam(Outstate~Private+s(perc.alumni,4)+s(Room.Board,4)+s(Expend,4)+s(Grad.Rate)+PhD,data = college.x.train)
Comparing the gam model using ANOVA function
anova(college.gam.fit1,college.gam.fit2,college.gam.fit3,college.gam.fit,college.gam.fit4,college.gam.fit5)
## Analysis of Deviance Table
##
## Model 1: Outstate ~ Private + s(perc.alumni, 4) + s(Room.Board, 4) + s(Expend,
## 4) + s(PhD, 4)
## Model 2: Outstate ~ Private + s(perc.alumni, 4) + s(Room.Board, 4) + s(Expend,
## 4) + s(Grad.Rate)
## Model 3: Outstate ~ Private + s(perc.alumni, 4) + s(Room.Board, 4) + s(Expend,
## 4) + s(Grad.Rate) + s(PhD)
## Model 4: Outstate ~ Private + s(perc.alumni, 4) + s(PhD, 5) + Room.Board +
## Expend + s(Grad.Rate, 5)
## Model 5: Outstate ~ Private + s(perc.alumni, 4) + s(Room.Board, 4) + s(Expend,
## 4) + Grad.Rate + s(PhD)
## Model 6: Outstate ~ Private + s(perc.alumni, 4) + s(Room.Board, 4) + s(Expend,
## 4) + s(Grad.Rate) + PhD
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 605 2219829154
## 2 605 2111305891 -8.7084e-05 108523262
## 3 601 2068081013 3.9999e+00 43224878 0.01363 *
## 4 605 2414626616 -4.0003e+00 -346545603 < 2e-16 ***
## 5 604 2091657851 1.0005e+00 322968765 < 2e-16 ***
## 6 604 2088202493 -8.7084e-05 3455358
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As per the Anova comparison Model 4 and model 5 is best to fit the data. It conclude that the PhD predictor possess non linear relationship with the response.