This exercise demonstrates methods for selecting the best linear model from an array of available options. We will use the Hitters data set from the ISLR package.
library(ISLR)
##
## Attaching package: 'ISLR'
##
## The following objects are masked _by_ '.GlobalEnv':
##
## Carseats, Hitters
fix(Hitters)
In best subset selection, for each model size k we fit all (p choose k) possible models containing exactly k of the p predictors and keep the best one, i.e. the one with the smallest RSS (equivalently the largest R-squared). This gives one candidate model per size, and the candidates can then be compared using statistics such as R-squared, adjusted R-squared, Cp, AIC and BIC.
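As a quick check of the combinatorics, we can count how many models this involves; the 19 below is the number of predictors in Hitters after dummy coding (a small illustration, not part of the selection itself).
choose(19, 0:19)        # number of candidate models of each size k = 0, ..., 19
sum(choose(19, 0:19))   # total number of models examined, 2^19 = 524288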
The regsubsets() function fits these models for every combination of variables. It is available in the leaps package.
library(leaps)
hitter.full <- regsubsets(Salary~.,Hitters)
summary(hitter.full)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*"
## CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " "*" " " " " " "
## 4 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 5 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 6 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " " " "*" "*" " " " " " "
## 8 ( 1 ) " " "*" " " "*" "*" " " " " " "
By default the regsubsets() function only reports the best models of up to 8 variables. The nvmax argument lets us request models of every size, here up to 19 variables. Let us try that and get the summary statistics.
hitter.full <- regsubsets(Salary~.,Hitters,nvmax=19)
hitter.summary <-summary(hitter.full)
names(hitter.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
Let us look at the various summary statistics of these models and plot their values to find the desired models.
par(mfrow=c(2,2))
plot(hitter.summary$rsq,type="l",xlab="number of variables",ylab="R-Squared")
which.max(hitter.summary$rsq)
## [1] 19
points(19,hitter.summary$rsq[19],col="red",cex=2,pch=20)
plot(hitter.summary$rss,type ="l")
which.min(hitter.summary$rss)
## [1] 19
points(19,hitter.summary$rss[19],col="blue",cex=3,pch=20)
plot(hitter.summary$cp,type="l",col="green")
which.min(hitter.summary$cp)
## [1] 10
points(10,hitter.summary$cp[10],col="red",cex=2,pch=20)
plot(hitter.summary$bic,type="l",col="purple")
which.min(hitter.summary$bic)
## [1] 6
points(6,hitter.summary$bic[6],col="green",cex=2,pch=20)
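Since R-squared and RSS always improve as variables are added, they inevitably point to the full 19-variable model; the adjusted R-squared penalizes model size and is more informative. A minimal sketch along the same lines as the plots above (best.adjr2 is just an illustrative name):
plot(hitter.summary$adjr2, type="l", xlab="number of variables", ylab="Adjusted R-Squared")
best.adjr2 <- which.max(hitter.summary$adjr2)
points(best.adjr2, hitter.summary$adjr2[best.adjr2], col="red", cex=2, pch=20)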
The regsubsets() object also has a built-in plot() method that displays the selected variables of the best model for each number of predictors, ordered by a chosen statistic.
par(mar = rep(2, 4))
plot(hitter.full,scale="r2")
plot(hitter.full,scale="adjr2")
plot(hitter.full,scale="Cp")
plot(hitter.full,scale="bic")
Let us also look at the coefficients of the model with the lowest BIC, the six-variable model.
coef(hitter.full,6)
## (Intercept) AtBat Hits Walks CRBI DivisionW
## 91.5118 -1.8686 7.6044 3.6976 0.6430 -122.9515
## PutOuts
## 0.2643
Let us now use the forward and backward stepwise selection procedures to select a model.
Forward and backward stepwise selection can be carried out with the regsubsets() function by setting method="forward" or method="backward".
hitters.fwd <- regsubsets(Salary~.,data=Hitters,nvmax=19,method="forward")
hit.fwd.sum <- summary(hitters.fwd)
hitters.bwd <- regsubsets(Salary~.,data=Hitters,nvmax=19,method="backward")
hit.bwd.sum <-summary(hitters.bwd)
Let us see which model sizes are best with respect to the various statistics.
names(hit.fwd.sum)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
which.min(hit.fwd.sum$cp)
## [1] 10
which.min(hit.bwd.sum$cp)
## [1] 10
which.min(hit.fwd.sum$bic)
## [1] 6
which.min(hit.bwd.sum$bic)
## [1] 8
Let us get the coefficients for the minima above, for example the 10-variable model chosen by forward stepwise selection.
coef(hitters.fwd,10)
## (Intercept) AtBat Hits Walks CAtBat CRuns
## 162.5354 -2.1687 6.9180 5.7732 -0.1301 1.4082
## CRBI CWalks DivisionW PutOuts Assists
## 0.7743 -0.8308 -112.3801 0.2974 0.2832
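For comparison, the 8-variable model that minimizes BIC under backward stepwise selection could be inspected in the same way (a sketch, not run above).
coef(hitters.bwd, 8)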
We know that irrespective of the results we get on the training set, the real estimate of the error is the one obtained from a validation set or by cross-validation. Let us now attempt to select the best model based on such estimates.
First, let us create training and test sets. After removing rows with missing values, we randomly assign each observation of the Hitters data set to either the training set or the test set.
Hitters <- na.omit(Hitters)
set.seed(1)
train = sample(c(TRUE,FALSE),nrow(Hitters),rep=TRUE)
test = (!train)
After the above we have logical vectors identifying the training and test observations. Let us now find the best model of each size by applying regsubsets() to the training set.
hitters.best <- regsubsets(Salary~.,data=Hitters[train,],nvmax=19)
Let us introduce a new function, model.matrix(), to build the design matrix from the test set.
test.mat <- model.matrix(Salary~.,data=Hitters[test,])
Next, for each model size, we take the best model, predict the outcomes on the test set and compute the test MSE. This will help identify the model size with the smallest test MSE.
errorval <- rep(NA,19)
for (i in 1:19) {
  coef1 <- coef(hitters.best, id=i)
  predict1 <- test.mat[, names(coef1)] %*% coef1
  errorval[i] <- mean((Hitters$Salary[test] - predict1)^2)
}
Let us wrap the above steps in a function so that we can call it again and again.
predict.regsubsets <- function(object, newdata, id, ...) {
  # extract the formula used in the original regsubsets() call
  form <- as.formula(object$call[[2]])
  # build the design matrix for the new data
  mat <- model.matrix(form, newdata)
  # coefficients of the best model with 'id' variables
  coefi <- coef(object, id=id)
  xvars <- names(coefi)
  # predictions: relevant columns of the design matrix times the coefficients
  mat[, xvars] %*% coefi
}
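As a quick usage example, the test predictions for, say, the 10-variable model can now be obtained with a single call (a sketch; pred10 is just an illustrative name).
pred10 <- predict.regsubsets(hitters.best, Hitters[test,], id=10)
mean((Hitters$Salary[test] - pred10)^2)   # should reproduce errorval[10]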
The loop above gives us the test MSE for every model size, from which we can find the minimum. This turns out to be the model with 10 variables. Let us locate and plot it.
which.min(errorval)
## [1] 10
plot(errorval,type="l",col="blue")
points(10,errorval[10],col="red",cex=2,pch=20)
Now that we have seen how prediction works with regsubsets() models, let us use cross-validation to find the best model.
This involves performing best subset selection within each of the k training folds. Let us see this in action.
k = 10
set.seed(1)
folds = sample(1:k,nrow(Hitters),replace=TRUE)
cv.error = matrix(NA,k,19,dimnames=list(NULL,paste(1:19)))
After defining these building blocks, let us run a loop to perform the cross-validation.
for (j in 1:k) {
  best.fit <- regsubsets(Salary~., data=Hitters[folds!=j,], nvmax=19)
  for (i in 1:19) {
    predict <- predict.regsubsets(best.fit, Hitters[folds==j,], id=i)
    cv.error[j,i] <- mean((Hitters$Salary[folds==j] - predict)^2)
  }
}
Now that we have all the errors in a matrix, let us compute the mean error across the folds for each model size.
cv.errors.mean <- apply(cv.error,2,mean)
which.min(cv.errors.mean)
## 11
## 11
As seen from the cross-validation results, the 11-variable model yields the lowest estimated test MSE.
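Since cross-validation picks the 11-variable model, a natural next step is to refit best subset selection on the full data set and report the coefficients of that model (a sketch; reg.best is just an illustrative name).
reg.best <- regsubsets(Salary~., data=Hitters, nvmax=19)
coef(reg.best, 11)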
Ridge regression and lasso models can be fit using the glmnet package. The glmnet() function takes an x matrix of predictors, which must be numeric and quantitative; model.matrix() conveniently converts any qualitative variables into dummy variables. It also takes a y vector of responses. Let us set these up for the Hitters data.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 1.9-8
x = model.matrix(Salary~.,Hitters)[,-1]
y = Hitters$Salary
The glmnet() function has an alpha argument: when alpha = 0 a ridge regression model is fit, and when alpha = 1 a lasso model is fit.
glmnet() chooses a default sequence of lambda values, but here we supply our own grid of lambda values ranging from 10^10 down to 10^-2.
grid = 10^seq(10,-2,length = 100)
ridge.mod = glmnet(x,y,alpha=0,lambda=grid)
dim(coef(ridge.mod))
## [1] 20 100
names(ridge.mod)
## [1] "a0" "beta" "df" "dim" "lambda"
## [6] "dev.ratio" "nulldev" "npasses" "jerr" "offset"
## [11] "call" "nobs"
Let us look at the coefficient estimates for a couple of values of lambda.
ridge.mod$lambda[50]
## [1] 11498
coef(ridge.mod)[,50]
## (Intercept) AtBat Hits HmRun Runs RBI
## 407.356050 0.036957 0.138180 0.524630 0.230702 0.239841
## Walks Years CAtBat CHits CHmRun CRuns
## 0.289619 1.107703 0.003132 0.011654 0.087546 0.023380
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.024138 0.025015 0.085028 -6.215441 0.016483 0.002613
## Errors NewLeagueN
## -0.020503 0.301434
One characteristic of ridge regression is that when lambda is large, the l2 norm of the coefficient vector (excluding the intercept), which measures its distance from the origin, is small; as lambda shrinks, the norm grows. For the i-th value of lambda it can be computed as sqrt(sum(coef(ridge.mod)[-1,i]^2)). Let us check this fact for two values of lambda.
sqrt(sum(coef(ridge.mod)[-1,50]^2))
## [1] 6.361
sqrt(sum(coef(ridge.mod)[-1,60]^2))
## [1] 57.11
We can use the predict() function for many purposes, for example obtaining the coefficients of the model for an arbitrary value of lambda, here lambda = 50.
predict(ridge.mod,s=50,type="coefficients")[1:20,]
## (Intercept) AtBat Hits HmRun Runs RBI
## 4.877e+01 -3.581e-01 1.969e+00 -1.278e+00 1.146e+00 8.038e-01
## Walks Years CAtBat CHits CHmRun CRuns
## 2.716e+00 -6.218e+00 5.448e-03 1.065e-01 6.245e-01 2.215e-01
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 2.187e-01 -1.500e-01 4.593e+01 -1.182e+02 2.502e-01 1.216e-01
## Errors NewLeagueN
## -3.279e+00 -9.497e+00
Let us now split the data into training and test sets using a different approach. Earlier we split the data with a TRUE/FALSE vector; here we instead sample row indices for the training set.
set.seed(1)
train = sample(1:nrow(x),nrow(x)/2)
test=(-train)
y.test=y[test]
Let us now fit a ridge regression model on the training set, using lambda = 4, and evaluate its MSE on the test set.
ridge.mod1 = glmnet(x[train,],y[train],alpha=0,lambda=grid,thresh=1e-12)
ridge.pred1 = predict(ridge.mod1,s=4,newx = x[test,])
mean((ridge.pred1-y.test)^2)
## [1] 101037
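For context, we could compare this with the test MSE of a null model that simply predicts the mean training salary for every player (a sketch).
mean((mean(y[train]) - y.test)^2)   # intercept-only benchmark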
We also learnt that a ridge regression model with lambda = 0 should be equivalent to an ordinary least squares fit. Let us verify this by comparing it with lm().
ridge.pred2 <- predict(ridge.mod1,s=0,newx=x[test,],exact=T)
mean((ridge.pred2-y.test)^2)
## [1] 114783
lm(y~x,subset=train)
##
## Call:
## lm(formula = y ~ x, subset = train)
##
## Coefficients:
## (Intercept) xAtBat xHits xHmRun xRuns
## 299.4285 -2.5403 8.3668 11.6451 -9.0992
## xRBI xWalks xYears xCAtBat xCHits
## 2.4410 9.2344 -22.9367 -0.1815 -0.1160
## xCHmRun xCRuns xCRBI xCWalks xLeagueN
## -1.3389 3.3284 0.0754 -1.0784 59.7607
## xDivisionW xPutOuts xAssists xErrors xNewLeagueN
## -98.8623 0.3409 0.3416 -0.6421 -0.6744
predict(ridge.mod1,s=0,exact=T,type="coefficients")[1:20,]
## (Intercept) AtBat Hits HmRun Runs RBI
## 299.42884 -2.54015 8.36612 11.64401 -9.09878 2.44152
## Walks Years CAtBat CHits CHmRun CRuns
## 9.23404 -22.93584 -0.18161 -0.11561 -1.33837 3.32818
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.07512 -1.07829 59.76529 -98.85997 0.34086 0.34166
## Errors NewLeagueN
## -0.64206 -0.67606
The glmnet package has a built-in function for cross-validation, cv.glmnet(). Let us carry out cross-validation to choose lambda and then find the test MSE at that value.
set.seed(1)
cv.out1 <- cv.glmnet(x[train,],y[train],alpha=0)
plot(cv.out1)
bestlam = cv.out1$lambda.min
bestlam
## [1] 211.7
We can see that the value of lambda giving the smallest cross-validation error is about 212. Let us find the test MSE associated with this value, then refit the ridge regression model on the full data set using this lambda and examine the coefficient estimates.
ridge.pred3 <- predict(ridge.mod1,s=bestlam,newx=x[test,])
mean((ridge.pred3-y.test)^2)
## [1] 96016
out <- glmnet(x,y,alpha=0)
predict(out,type="coefficients",s=bestlam)[1:20,]
## (Intercept) AtBat Hits HmRun Runs RBI
## 9.88487 0.03144 1.00883 0.13928 1.11321 0.87319
## Walks Years CAtBat CHits CHmRun CRuns
## 1.80410 0.13074 0.01114 0.06490 0.45159 0.12900
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.13738 0.02909 27.18228 -91.63411 0.19149 0.04255
## Errors NewLeagueN
## -1.81244 7.21208
Let us now fit a lasso model on the training data.
lasso.mod <- glmnet(x[train,],y[train],alpha=1,lambda=grid)
plot(lasso.mod)
Let us now use cross-validation to find the value of lambda giving the lowest cross-validated MSE, and then compute the test MSE at that value.
cv.out2 <- cv.glmnet(x[train,],y[train],alpha=1)
lowlam <- cv.out2$lambda.min
lasso.pred <- predict(lasso.mod,newx=x[test,],s=lowlam)
mean((lasso.pred-y.test)^2)
## [1] 101918
Let us also look at the lasso coefficients, refitting on the full data set and using the lambda chosen by cross-validation.
out2 <- glmnet(x,y,alpha=1)
lasso.coef <- predict(out2,type="coefficients",s=lowlam)[1:20,]
lasso.coef
## (Intercept) AtBat Hits HmRun Runs RBI
## 52.0558 0.0000 1.7367 0.0000 0.0000 0.0000
## Walks Years CAtBat CHits CHmRun CRuns
## 2.0331 0.0000 0.0000 0.0000 0.0000 0.1938
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.3902 0.0000 0.0000 -75.5811 0.1809 0.0000
## Errors NewLeagueN
## 0.0000 0.0000
We can now see that many of the lasso coefficients are exactly zero. This is the major difference between ridge regression and the lasso: ridge regression retains all the predictors in the model, whereas the lasso performs variable selection, setting the less important coefficients exactly to zero and producing a more parsimonious model.
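To make the sparsity explicit, we can count the zero coefficients and list only the non-zero ones (a sketch using the lasso.coef object above).
sum(lasso.coef == 0)          # number of coefficients shrunk exactly to zero
lasso.coef[lasso.coef != 0]   # the surviving coefficients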
Let us now examine principal component regression (PCR) and partial least squares (PLS) regression.
Principal component regression can be carried out with the pcr() function, which is part of the pls library.
library(pls)
##
## Attaching package: 'pls'
##
## The following object is masked from 'package:stats':
##
## loadings
set.seed(2)
pcr.fit = pcr(Salary~.,data=Hitters,scale=TRUE,validation="CV")
names(pcr.fit)
## [1] "coefficients" "scores" "loadings" "Yloadings"
## [5] "projection" "Xmeans" "Ymeans" "fitted.values"
## [9] "residuals" "Xvar" "Xtotvar" "fit.time"
## [13] "ncomp" "method" "scale" "validation"
## [17] "call" "terms" "model"
pcr.fit$coefficients
## , , 1 comps
##
## Salary
## AtBat 21.13208
## Hits 20.87321
## HmRun 21.77988
## Runs 21.13706
## RBI 25.06280
## Walks 22.26530
## Years 30.11446
## CAtBat 35.21789
## CHits 35.24760
## CHmRun 33.99409
## CRuns 36.04328
## CRBI 36.27081
## CWalks 33.76213
## LeagueN -5.80504
## DivisionW -2.74158
## PutOuts 8.28030
## Assists -0.08969
## Errors -0.83758
## NewLeagueN -4.46644
##
## , , 2 comps
##
## Salary
## AtBat 29.439
## Hits 29.039
## HmRun 26.913
## Runs 29.313
## RBI 31.871
## Walks 27.235
## Years 24.435
## CAtBat 31.043
## CHits 31.289
## CHmRun 31.260
## CRuns 32.314
## CRBI 32.633
## CWalks 29.600
## LeagueN -7.866
## DivisionW -3.535
## PutOuts 11.651
## Assists 3.561
## Errors 3.508
## NewLeagueN -6.146
##
## , , 3 comps
##
## Salary
## AtBat 31.596
## Hits 30.841
## HmRun 21.651
## Runs 28.895
## RBI 30.092
## Walks 28.346
## Years 25.277
## CAtBat 33.077
## CHits 33.388
## CHmRun 29.161
## CRuns 33.604
## CRBI 32.997
## CWalks 30.625
## LeagueN 5.466
## DivisionW -3.930
## PutOuts 12.899
## Assists 13.245
## Errors 12.827
## NewLeagueN 7.110
##
## , , 4 comps
##
## Salary
## AtBat 30.412
## Hits 30.175
## HmRun 30.390
## Runs 30.746
## RBI 35.242
## Walks 33.186
## Years 21.745
## CAtBat 29.700
## CHits 30.285
## CHmRun 31.913
## CRuns 31.041
## CRBI 32.750
## CWalks 29.499
## LeagueN 20.141
## DivisionW -5.514
## PutOuts 23.556
## Assists -6.174
## Errors -2.808
## NewLeagueN 22.589
##
## , , 5 comps
##
## Salary
## AtBat 28.766
## Hits 30.447
## HmRun 25.844
## Runs 33.001
## RBI 33.820
## Walks 35.088
## Years 22.351
## CAtBat 29.015
## CHits 29.786
## CHmRun 30.002
## CRuns 32.069
## CRBI 31.112
## CWalks 31.487
## LeagueN 19.439
## DivisionW -63.204
## PutOuts 17.360
## Assists -5.523
## Errors -6.044
## NewLeagueN 21.743
##
## , , 6 comps
##
## Salary
## AtBat 24.363
## Hits 25.321
## HmRun 16.518
## Runs 24.484
## RBI 26.860
## Walks 33.874
## Years 24.423
## CAtBat 30.534
## CHits 31.618
## CHmRun 27.460
## CRuns 32.498
## CRBI 31.828
## CWalks 33.606
## LeagueN 10.911
## DivisionW -68.868
## PutOuts 74.954
## Assists -3.328
## Errors 3.192
## NewLeagueN 11.960
##
## , , 7 comps
##
## Salary
## AtBat 27.005
## Hits 28.531
## HmRun 4.031
## Runs 29.464
## RBI 18.974
## Walks 47.659
## Years 24.126
## CAtBat 30.832
## CHits 32.112
## CHmRun 21.812
## CRuns 34.054
## CRBI 28.901
## CWalks 37.991
## LeagueN 9.022
## DivisionW -66.069
## PutOuts 74.483
## Assists -3.655
## Errors -6.005
## NewLeagueN 11.401
##
## , , 8 comps
##
## Salary
## AtBat 31.2842
## Hits 34.6957
## HmRun 0.4426
## Runs 31.2803
## RBI 19.0549
## Walks 37.7735
## Years 26.3194
## CAtBat 33.1633
## CHits 35.1968
## CHmRun 17.8551
## CRuns 35.3957
## CRBI 28.8660
## CWalks 33.8181
## LeagueN 8.6066
## DivisionW -66.0226
## PutOuts 75.5157
## Assists -4.8535
## Errors -10.7815
## NewLeagueN 12.4752
##
## , , 9 comps
##
## Salary
## AtBat 30.923
## Hits 32.870
## HmRun 4.219
## Runs 26.206
## RBI 22.941
## Walks 37.138
## Years 26.004
## CAtBat 31.544
## CHits 32.384
## CHmRun 22.832
## CRuns 32.624
## CRBI 30.388
## CWalks 33.742
## LeagueN 8.555
## DivisionW -65.920
## PutOuts 78.304
## Assists 16.091
## Errors -28.836
## NewLeagueN 13.615
##
## , , 10 comps
##
## Salary
## AtBat 45.462
## Hits 45.897
## HmRun -30.836
## Runs 29.425
## RBI 5.746
## Walks 25.064
## Years -25.164
## CAtBat 21.446
## CHits 24.674
## CHmRun 87.780
## CRuns 33.634
## CRBI 58.444
## CWalks 25.593
## LeagueN 10.145
## DivisionW -67.068
## PutOuts 77.713
## Assists 10.070
## Errors -26.041
## NewLeagueN 11.293
##
## , , 11 comps
##
## Salary
## AtBat 42.525
## Hits 45.461
## HmRun -24.736
## Runs 38.844
## RBI -1.598
## Walks 19.742
## Years -30.943
## CAtBat 22.359
## CHits 25.575
## CHmRun 85.373
## CRuns 38.211
## CRBI 55.511
## CWalks 31.342
## LeagueN 27.666
## DivisionW -66.659
## PutOuts 78.785
## Assists 12.698
## Errors -28.331
## NewLeagueN -5.118
##
## , , 12 comps
##
## Salary
## AtBat 44.695
## Hits 48.006
## HmRun -31.358
## Runs 32.078
## RBI 5.716
## Walks 23.639
## Years -23.484
## CAtBat 21.046
## CHits 23.819
## CHmRun 86.932
## CRuns 33.392
## CRBI 58.049
## CWalks 26.806
## LeagueN 36.219
## DivisionW -66.672
## PutOuts 77.911
## Assists 10.634
## Errors -28.138
## NewLeagueN -14.150
##
## , , 13 comps
##
## Salary
## AtBat 41.804
## Hits 48.267
## HmRun -39.242
## Runs 4.424
## RBI 37.644
## Walks 26.615
## Years -67.372
## CAtBat 31.825
## CHits 43.173
## CHmRun 62.986
## CRuns 48.067
## CRBI 60.434
## CWalks 38.898
## LeagueN 36.880
## DivisionW -66.620
## PutOuts 75.142
## Assists 8.196
## Errors -30.678
## NewLeagueN -18.672
##
## , , 14 comps
##
## Salary
## AtBat -43.09
## Hits -10.12
## HmRun -38.87
## Runs 78.32
## RBI 39.72
## Walks 92.57
## Years -91.23
## CAtBat 78.44
## CHits 123.65
## CHmRun 51.94
## CRuns 83.86
## CRBI 118.31
## CWalks -165.39
## LeagueN 41.45
## DivisionW -68.33
## PutOuts 78.46
## Assists 31.95
## Errors -37.21
## NewLeagueN -26.82
##
## , , 15 comps
##
## Salary
## AtBat -82.658
## Hits -8.376
## HmRun -65.282
## Runs 113.125
## RBI 74.012
## Walks 79.455
## Years -86.312
## CAtBat 66.717
## CHits 113.498
## CHmRun 52.643
## CRuns 80.151
## CRBI 123.467
## CWalks -147.199
## LeagueN 38.267
## DivisionW -64.632
## PutOuts 80.503
## Assists 35.048
## Errors -36.151
## NewLeagueN -22.178
##
## , , 16 comps
##
## Salary
## AtBat -298.80
## Hits 296.64
## HmRun 19.60
## Runs 19.73
## RBI -26.52
## Walks 122.88
## Years -96.53
## CAtBat 42.65
## CHits 160.27
## CHmRun 41.29
## CRuns 11.37
## CRBI 184.00
## CWalks -145.30
## LeagueN 28.22
## DivisionW -66.13
## PutOuts 75.89
## Assists 42.46
## Errors -26.03
## NewLeagueN -16.87
##
## , , 17 comps
##
## Salary
## AtBat -347.25
## Hits 354.83
## HmRun 3.01
## Runs -29.28
## RBI 10.58
## Walks 137.57
## Years -63.30
## CAtBat -20.73
## CHits 149.15
## CHmRun 142.73
## CRuns 227.83
## CRBI -27.97
## CWalks -204.32
## LeagueN 30.84
## DivisionW -60.59
## PutOuts 81.31
## Assists 47.01
## Errors -22.95
## NewLeagueN -14.96
##
## , , 18 comps
##
## Salary
## AtBat -287.1639
## Hits 330.3183
## HmRun 35.8569
## Runs -55.7545
## RBI -25.4324
## Walks 133.8275
## Years -15.0312
## CAtBat -425.9233
## CHits 151.1037
## CHmRun -0.3535
## CRuns 452.9583
## CRBI 239.5045
## CWalks -206.9835
## LeagueN 31.7984
## DivisionW -58.5995
## PutOuts 78.7188
## Assists 54.5750
## Errors -22.7108
## NewLeagueN -13.0026
##
## , , 19 comps
##
## Salary
## AtBat -291.65
## Hits 338.47
## HmRun 37.93
## Runs -60.69
## RBI -27.05
## Walks 135.33
## Years -16.73
## CAtBat -391.78
## CHits 86.85
## CHmRun -14.21
## CRuns 481.66
## CRBI 261.19
## CWalks -214.30
## LeagueN 31.31
## DivisionW -58.53
## PutOuts 78.91
## Assists 53.83
## Errors -22.20
## NewLeagueN -12.37
summary(pcr.fit)
## Data: X dimension: 263 19
## Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 452 348.9 352.2 353.5 352.8 350.1 349.1
## adjCV 452 348.7 351.8 352.9 352.1 349.3 348.0
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 349.6 350.9 352.9 353.8 355.0 356.2 363.5
## adjCV 348.5 349.8 351.6 352.3 353.4 354.5 361.6
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 355.2 357.4 347.6 350.1 349.2 352.6
## adjCV 352.8 355.2 345.5 347.6 346.7 349.8
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 38.31 60.16 70.84 79.03 84.29 88.63 92.26
## Salary 40.63 41.58 42.17 43.22 44.90 46.48 46.69
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 94.96 96.28 97.26 97.98 98.65 99.15 99.47
## Salary 46.75 46.86 47.76 47.82 47.85 48.10 50.40
## 15 comps 16 comps 17 comps 18 comps 19 comps
## X 99.75 99.89 99.97 99.99 100.00
## Salary 50.55 53.01 53.85 54.61 54.61
Setting scale=TRUE standardizes each predictor prior to generating the principal components.
From the summary of the model we can also see that 13 principal components explain about 99% of the variance in the predictors.
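The same cumulative percentages of predictor variance can be obtained directly with the explvar() helper from the pls package (a sketch).
cumsum(explvar(pcr.fit))   # cumulative % of X variance explained by 1, 2, ... components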
We can also plot the cross-validated MSEP using the validationplot() function. From the summary above, the smallest cross-validation error actually occurs with 16 components, although the error curve is nearly flat, so models with far fewer components perform almost as well.
validationplot(pcr.fit,val.type="MSEP")
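Instead of reading the minimum off the plot, the cross-validated MSEP values can also be extracted with the MSEP() function and the minimum located programmatically (a sketch; note that the first entry of the returned array corresponds to the intercept-only model).
cv.msep <- MSEP(pcr.fit, estimate="CV")
which.min(cv.msep$val[1, 1, ]) - 1   # subtract 1 because index 1 is the 0-component model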
Let us now carry out this exercise with the train and test sets.
set.seed(1)
pcr.fit2 <- pcr(Salary~.,data=Hitters,subset=train,scale=TRUE,validation="CV")
validationplot(pcr.fit2,val.type="MSEP")
We now see that the model with 7 principal components gives the lowest cross-validated MSEP.
Let us now make predictions on the test set using the 7-component model.
pcr.pred2 <- predict(pcr.fit2,x[test,],ncomp=7)
mean((pcr.pred2-y.test)^2)
## [1] 96556
Finally, let us fit the model on the full data set with 7 principal components.
pcr.fit3 <- pcr(y~x,scale=TRUE,ncomp=7)
summary(pcr.fit3)
## Data: X dimension: 263 19
## Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 7
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 38.31 60.16 70.84 79.03 84.29 88.63 92.26
## y 40.63 41.58 42.17 43.22 44.90 46.48 46.69