This exercise demonstrates methods for selecting the best linear model from an array of available options. We will use the Hitters data set from the ISLR package for this.

library(ISLR)
## 
## Attaching package: 'ISLR'
## 
## The following objects are masked _by_ '.GlobalEnv':
## 
##     Carseats, Hitters
fix(Hitters)

Best Subset Selection

In best subset selection, a separate model is fit for every possible combination of the p predictors: for each model size k, all C(p,k) = p!/(k!(p-k)!) candidate models are fit and the one with the smallest RSS is kept. The best models of each size are then compared using statistics such as R2, adjusted R2, Cp, AIC and BIC.
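
With p = 19 predictors this is already a sizeable search; a quick count in R (simple arithmetic, nothing model-specific):

p <- 19
sum(choose(p,0:p))  # subsets of every size, including the null model
## [1] 524288
2^p                 # the same count, since the subset counts sum to 2^p
## [1] 524288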

The regsubsets() function, available in the leaps package, performs this search and reports the best model of each size.

library(leaps)
hitter.full <- regsubsets(Salary~.,Hitters)
summary(hitter.full)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 7  ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "  
## 8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"  
##          CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 ) "*"  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 ) "*"  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 ) "*"  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 ) " "  " "    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 ) " "  "*"    " "     "*"       "*"     " "     " "    " "

By default the regsubsets() function only reports models of up to eight variables. The nvmax argument raises this limit; setting nvmax = 19 considers all the variables. Let us try that and get the summary statistics.

hitter.full <- regsubsets(Salary~.,Hitters,nvmax=19)
hitter.summary <-summary(hitter.full)
names(hitter.summary)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"

Let us look at the various summary statistics of these models and plot the values to find the desired models.

par(mfrow=c(2,2))
plot(hitter.summary$rsq,type="l",xlab="number of variables",ylab="R-Squared")
which.max(hitter.summary$rsq)
## [1] 19
points(19,hitter.summary$rsq[19],col="red",cex=2,pch=20)

plot(hitter.summary$rss,type="l",xlab="number of variables",ylab="RSS")
which.min(hitter.summary$rss)
## [1] 19
points(19,hitter.summary$rss[19],col="blue",cex=3,pch=20)

plot(hitter.summary$cp,type="l",col="green",xlab="number of variables",ylab="Cp")
which.min(hitter.summary$cp)
## [1] 10
points(10,hitter.summary$cp[10],col="red",cex=2,pch=20)

plot(hitter.summary$bic,type="l",col="purple",xlab="number of variables",ylab="BIC")
which.min(hitter.summary$bic)
## [1] 6
points(6,hitter.summary$bic[6],col="green",cex=2,pch=20)

(Plot: R-squared, RSS, Cp and BIC against the number of variables)
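
Note that R-squared necessarily increases with every added variable, so its maximum at 19 is uninformative; adjusted R-squared penalizes model size and is more useful. A sketch of the corresponding plot (the printed index assumes the standard result for this data, where adjusted R-squared peaks at the 11-variable model):

plot(hitter.summary$adjr2,type="l",xlab="number of variables",ylab="Adjusted R-Squared")
which.max(hitter.summary$adjr2)
## [1] 11
points(11,hitter.summary$adjr2[11],col="red",cex=2,pch=20)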

The regsubsets() object also has a built-in plot() method to display the selected variables for the best model of each size, ranked by a chosen statistic.

par(mar = rep(2, 4))
plot(hitter.full,scale="r2")

(Plot: variable membership ranked by R-squared)

plot(hitter.full,scale="adjr2")

(Plot: variable membership ranked by adjusted R-squared)

plot(hitter.full,scale="Cp")

(Plot: variable membership ranked by Cp)

plot(hitter.full,scale="bic")

(Plot: variable membership ranked by BIC)

Let us also look at the coefficients of the six-variable model, which has the lowest BIC.

coef(hitter.full,6)
## (Intercept)       AtBat        Hits       Walks        CRBI   DivisionW 
##     91.5118     -1.8686      7.6044      3.6976      0.6430   -122.9515 
##     PutOuts 
##      0.2643

Forward and Backward Stepwise Selection

Let us now try the forward and backward stepwise selection procedures for choosing a model.

Both can be carried out with the regsubsets() function by setting method="forward" or method="backward".

hitters.fwd <- regsubsets(Salary~.,data=Hitters,nvmax=19,method="forward")
hit.fwd.sum <- summary(hitters.fwd)


hitters.bwd <- regsubsets(Salary~.,data=Hitters,nvmax=19,method="backward")
hit.bwd.sum <-summary(hitters.bwd)

Let us see which models are best with respect to the various statistics.

names(hit.fwd.sum)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
which.min(hit.fwd.sum$cp)
## [1] 10
which.min(hit.bwd.sum$cp)
## [1] 10
which.min(hit.fwd.sum$bic)
## [1] 6
which.min(hit.bwd.sum$bic)
## [1] 8

Let us get the coefficients of the ten-variable forward-selection model, which minimizes Cp.

coef(hitters.fwd,10)
## (Intercept)       AtBat        Hits       Walks      CAtBat       CRuns 
##    162.5354     -2.1687      6.9180      5.7732     -0.1301      1.4082 
##        CRBI      CWalks   DivisionW     PutOuts     Assists 
##      0.7743     -0.8308   -112.3801      0.2974      0.2832
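
For completeness, the backward-selection model with the lowest BIC has eight variables; its coefficients can be extracted the same way (output not shown here):

coef(hitters.bwd,8)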

Choosing Among Models Using the Validation Set Approach and Cross-Validation

We know that, irrespective of the results we get on the training set, a realistic estimate of the error comes only from held-out data: validation sets, cross-validation and test sets. Let us now attempt to select the best model based on such results.

Let us now create training and test sets from the data by randomly sampling from the Hitters data set.

Hitters <- na.omit(Hitters)  # remove rows with missing values (Salary has NAs)
set.seed(1)
train = sample(c(TRUE,FALSE),nrow(Hitters),rep=TRUE)
test = (!train)

After the above we have logical vectors marking the training and test observations. Let us now find the best models by running regsubsets() on the training set.

hitters.best <- regsubsets(Salary~.,data=Hitters[train,],nvmax=19)

Let us introduce a new function, model.matrix(), to build the design matrix from the test set; it also expands qualitative variables into dummy variables.

test.mat <- model.matrix(Salary~.,data=Hitters[test,])

Next we will take the best model of each size, predict the outcomes on the test set, and calculate the test MSE for each. The best model is then the one with the smallest test MSE.

errorval <- rep(NA,19)
for (i in 1:19) {
  coef1 <- coef(hitters.best,id=i)                        # coefficients of the best i-variable model
  predict1 <- test.mat[,names(coef1)]%*%coef1             # test-set predictions
  errorval[i] <- mean((Hitters$Salary[test]-predict1)^2)  # test MSE
}

Since regsubsets() has no predict() method of its own, let us wrap these steps in a function that we can call again and again.

predict.regsubsets <- function(object,newdata,id,...){
  form = as.formula(object$call[[2]])  # recover the formula from the regsubsets() call
  mat = model.matrix(form,newdata)     # design matrix for the new data
  coefi <- coef(object,id=id)          # coefficients of the best id-variable model
  xvars = names(coefi)
  mat[,xvars]%*%coefi                  # predictions
}
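
Because the function follows R's S3 naming convention (predict.regsubsets), a plain predict() call on any regsubsets object will now dispatch to it. For example, inside the earlier loop we could have written, equivalently:

predict1 <- predict(hitters.best,Hitters[test,],id=i)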

From the above we have the test MSE for every model size. The minimum turns out to be the ten-variable model. Let us confirm this and plot the errors.

which.min(errorval)
## [1] 10
plot(errorval,type="l",col="blue")
points(10,errorval[10],col="red",cex=2,pch=20)

(Plot: test MSE against the number of variables, with the minimum at 10)

Now that we have seen model selection with a single validation set and the regsubsets() function, let us use cross-validation to find the best model.

This involves performing best subset selection within each of the k training folds. Let us see this in action.

k = 10
set.seed(1)

folds = sample(1:k,nrow(Hitters),replace=TRUE)              # assign each observation to a fold
cv.error = matrix(NA,k,19,dimnames=list(NULL,paste(1:19)))  # k x 19 matrix of fold-wise MSEs

With these building blocks in place, let us run a for loop to perform the cross-validation.

for (j in 1:k){
  best.fit <- regsubsets(Salary~.,data=Hitters[folds!=j,],nvmax=19)  # fit on all folds except j
  for(i in 1:19){
    pred <- predict.regsubsets(best.fit,Hitters[folds==j,],id=i)     # predict on held-out fold j
    cv.error[j,i] = mean((Hitters$Salary[folds==j]-pred)^2)
  }
}

Now that we have all the errors in a matrix, let us find the mean values across all the folds.

cv.errors.mean <- apply(cv.error,2,mean)
which.min(cv.errors.mean)
## 11 
## 11

As seen from the cross-validation, the 11-variable model yields the lowest estimated test MSE.
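
The variables chosen within the training folds may differ from those chosen on the full data, so the usual final step is to rerun best subset selection on the complete data set and report its 11-variable model. A sketch (plot and coefficients not shown here):

plot(cv.errors.mean,type="b")
reg.best <- regsubsets(Salary~.,data=Hitters,nvmax=19)
coef(reg.best,11)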

Ridge Regression and Lasso

Ridge regression and lasso models can be fit using the glmnet package. glmnet() takes an x matrix of predictors, which must be numeric; model.matrix() conveniently expands any qualitative variables into dummy variables. The other input is the y vector of responses. Let us set these up for the Hitters data.

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 1.9-8
x = model.matrix(Salary~.,Hitters)[,-1]  # drop the intercept column; factors become dummy variables
y = Hitters$Salary

Ridge Regression

The glmnet() function has an alpha argument. When alpha = 0 a ridge regression model is fit and when alpha = 1, a lasso model is fit.

The glmnet() function has a default sequence of lambda values; however, here we supply our own grid of lambda values ranging from 10^10 down to 10^-2.

grid = 10^seq(10,-2,length = 100)
ridge.mod = glmnet(x,y,alpha=0,lambda=grid)
dim(coef(ridge.mod))
## [1]  20 100
names(ridge.mod)
##  [1] "a0"        "beta"      "df"        "dim"       "lambda"   
##  [6] "dev.ratio" "nulldev"   "npasses"   "jerr"      "offset"   
## [11] "call"      "nobs"

Let us look at the coefficient estimates of the model for a couple of values of lambda.

ridge.mod$lambda[50]
## [1] 11498
coef(ridge.mod)[,50]
## (Intercept)       AtBat        Hits       HmRun        Runs         RBI 
##  407.356050    0.036957    0.138180    0.524630    0.230702    0.239841 
##       Walks       Years      CAtBat       CHits      CHmRun       CRuns 
##    0.289619    1.107703    0.003132    0.011654    0.087546    0.023380 
##        CRBI      CWalks     LeagueN   DivisionW     PutOuts     Assists 
##    0.024138    0.025015    0.085028   -6.215441    0.016483    0.002613 
##      Errors  NewLeagueN 
##   -0.020503    0.301434

One characteristic of ridge regression is that when lambda is large, the coefficient estimates are shrunk toward zero, so the l2 norm, the distance of the coefficient vector from the origin, is small. For the j-th value of lambda the l2 norm is sqrt(sum(coef(ridge.mod)[-1,j]^2)); note that the squaring goes inside the sum. Let us check this for two values of lambda (the 60th lambda in our grid is smaller than the 50th, so its norm should be larger).

sqrt(sum(coef(ridge.mod)[-1,50]^2))
## [1] 6.361
sqrt(sum(coef(ridge.mod)[-1,60]^2))
## [1] 57.11

We can use the predict() function for many purposes, such as obtaining the model coefficients for an arbitrary value of lambda.

predict(ridge.mod,s=50,type="coefficients")[1:20,]
## (Intercept)       AtBat        Hits       HmRun        Runs         RBI 
##   4.877e+01  -3.581e-01   1.969e+00  -1.278e+00   1.146e+00   8.038e-01 
##       Walks       Years      CAtBat       CHits      CHmRun       CRuns 
##   2.716e+00  -6.218e+00   5.448e-03   1.065e-01   6.245e-01   2.215e-01 
##        CRBI      CWalks     LeagueN   DivisionW     PutOuts     Assists 
##   2.187e-01  -1.500e-01   4.593e+01  -1.182e+02   2.502e-01   1.216e-01 
##      Errors  NewLeagueN 
##  -3.279e+00  -9.497e+00

Let us now split the data into training and test sets with a different approach. Earlier we split the data using a logical TRUE/FALSE vector; here we sample row indices instead.

set.seed(1)
train = sample(1:nrow(x),nrow(x)/2)
test=(-train)
y.test=y[test]

Let us now fit a ridge regression model on the training set, with lambda = 4, and evaluate its MSE on the test set.

ridge.mod1 = glmnet(x[train,],y[train],alpha=0,lambda=grid,thresh=1e-12)
ridge.pred1 = predict(ridge.mod1,s=4,newx = x[test,])
mean((ridge.pred1-y.test)^2)
## [1] 101037
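
As a baseline, we could compare this with the test MSE of a null model that predicts the mean of the training responses for every test observation; it should come out far larger than the ridge MSE above. A sketch:

mean((mean(y[train])-y.test)^2)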

We also learnt that a ridge model with lambda = 0 is equivalent to an ordinary least squares fit. Let us verify this against lm().

ridge.pred2 <- predict(ridge.mod1,s=0,newx=x[test,],exact=T)
mean((ridge.pred2-y.test)^2)
## [1] 114783
lm(y~x,subset=train)
## 
## Call:
## lm(formula = y ~ x, subset = train)
## 
## Coefficients:
## (Intercept)       xAtBat        xHits       xHmRun        xRuns  
##    299.4285      -2.5403       8.3668      11.6451      -9.0992  
##        xRBI       xWalks       xYears      xCAtBat       xCHits  
##      2.4410       9.2344     -22.9367      -0.1815      -0.1160  
##     xCHmRun       xCRuns        xCRBI      xCWalks     xLeagueN  
##     -1.3389       3.3284       0.0754      -1.0784      59.7607  
##  xDivisionW     xPutOuts     xAssists      xErrors  xNewLeagueN  
##    -98.8623       0.3409       0.3416      -0.6421      -0.6744
predict(ridge.mod1,s=0,exact=T,type="coefficients")[1:20,]
## (Intercept)       AtBat        Hits       HmRun        Runs         RBI 
##   299.42884    -2.54015     8.36612    11.64401    -9.09878     2.44152 
##       Walks       Years      CAtBat       CHits      CHmRun       CRuns 
##     9.23404   -22.93584    -0.18161    -0.11561    -1.33837     3.32818 
##        CRBI      CWalks     LeagueN   DivisionW     PutOuts     Assists 
##     0.07512    -1.07829    59.76529   -98.85997     0.34086     0.34166 
##      Errors  NewLeagueN 
##    -0.64206    -0.67606
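
One caveat: this session uses glmnet 1.9-8. In later releases of glmnet (2.0 and up), predict() with exact = TRUE refits the model at the requested lambda and therefore also requires the original training data, along the lines of:

ridge.pred2 <- predict(ridge.mod1,s=0,newx=x[test,],exact=TRUE,x=x[train,],y=y[train])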

glmnet has a built-in function for cross-validation, cv.glmnet(), which performs ten-fold cross-validation by default. Let us use it to choose lambda, and then find the test MSE at the chosen value.

set.seed(1)
cv.out1 <- cv.glmnet(x[train,],y[train],alpha=0)
plot(cv.out1)

(Plot: cross-validated MSE against log(lambda))

bestlam = cv.out1$lambda.min
bestlam
## [1] 211.7

We can see that the value of lambda giving the smallest cross-validation error is about 212. Let us find the test MSE associated with it.

Let us then refit the ridge regression on the full data set, using the lambda chosen by cross-validation, and examine the coefficient estimates.

ridge.pred3 <- predict(ridge.mod1,s=bestlam,newx=x[test,])
mean((ridge.pred3-y.test)^2)
## [1] 96016
out <- glmnet(x,y,alpha=0)
predict(out,type="coefficients",s=bestlam)[1:20,]
## (Intercept)       AtBat        Hits       HmRun        Runs         RBI 
##     9.88487     0.03144     1.00883     0.13928     1.11321     0.87319 
##       Walks       Years      CAtBat       CHits      CHmRun       CRuns 
##     1.80410     0.13074     0.01114     0.06490     0.45159     0.12900 
##        CRBI      CWalks     LeagueN   DivisionW     PutOuts     Assists 
##     0.13738     0.02909    27.18228   -91.63411     0.19149     0.04255 
##      Errors  NewLeagueN 
##    -1.81244     7.21208

Lasso model

Let us now fit a Lasso model on the data

lasso.mod <- glmnet(x[train,],y[train],alpha=1,lambda=grid)
plot(lasso.mod)

(Plot: lasso coefficient paths)

Let us now cross-validate to find the lambda giving the lowest error, and then compute the test MSE of the lasso at that lambda.

cv.out2 <- cv.glmnet(x[train,],y[train],alpha=1)
lowlam <- cv.out2$lambda.min
lasso.pred <- predict(lasso.mod,newx=x[test,],s=lowlam)
mean((lasso.pred-y.test)^2)
## [1] 101918

Let us also have a look at the coefficients from a lasso model fit on the full data set.

out2 <- glmnet(x,y,alpha=1)
lasso.coef <- predict(out2,type="coefficients",s=lowlam)[1:20,]
lasso.coef
## (Intercept)       AtBat        Hits       HmRun        Runs         RBI 
##     52.0558      0.0000      1.7367      0.0000      0.0000      0.0000 
##       Walks       Years      CAtBat       CHits      CHmRun       CRuns 
##      2.0331      0.0000      0.0000      0.0000      0.0000      0.1938 
##        CRBI      CWalks     LeagueN   DivisionW     PutOuts     Assists 
##      0.3902      0.0000      0.0000    -75.5811      0.1809      0.0000 
##      Errors  NewLeagueN 
##      0.0000      0.0000
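
A convenient way to pull out just the surviving coefficients:

lasso.coef[lasso.coef!=0]
## (Intercept)        Hits       Walks       CRuns        CRBI   DivisionW 
##     52.0558      1.7367      2.0331      0.1938      0.3902    -75.5811 
##     PutOuts 
##      0.1809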

We can now see that the lasso has set 13 of the 19 coefficient estimates exactly to zero, leaving a model with only six predictors. This is the major difference between ridge regression and the lasso: ridge regression keeps all the coefficients, while the lasso produces a parsimonious model by shrinking the less important coefficients all the way to zero.

PCR and PLS Regression

Let us now examine principal components regression (PCR) and partial least squares (PLS) regression.

Principal components regression can be done with the pcr() function, which is part of the pls library.

library(pls)
## 
## Attaching package: 'pls'
## 
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(2)
pcr.fit = pcr(Salary~.,data=Hitters,scale=TRUE,validation="CV")

names(pcr.fit)
##  [1] "coefficients"  "scores"        "loadings"      "Yloadings"    
##  [5] "projection"    "Xmeans"        "Ymeans"        "fitted.values"
##  [9] "residuals"     "Xvar"          "Xtotvar"       "fit.time"     
## [13] "ncomp"         "method"        "scale"         "validation"   
## [17] "call"          "terms"         "model"
pcr.fit$coefficients
## , , 1 comps
## 
##              Salary
## AtBat      21.13208
## Hits       20.87321
## HmRun      21.77988
## Runs       21.13706
## RBI        25.06280
## Walks      22.26530
## Years      30.11446
## CAtBat     35.21789
## CHits      35.24760
## CHmRun     33.99409
## CRuns      36.04328
## CRBI       36.27081
## CWalks     33.76213
## LeagueN    -5.80504
## DivisionW  -2.74158
## PutOuts     8.28030
## Assists    -0.08969
## Errors     -0.83758
## NewLeagueN -4.46644
## 
## , , 2 comps
## 
##            Salary
## AtBat      29.439
## Hits       29.039
## HmRun      26.913
## Runs       29.313
## RBI        31.871
## Walks      27.235
## Years      24.435
## CAtBat     31.043
## CHits      31.289
## CHmRun     31.260
## CRuns      32.314
## CRBI       32.633
## CWalks     29.600
## LeagueN    -7.866
## DivisionW  -3.535
## PutOuts    11.651
## Assists     3.561
## Errors      3.508
## NewLeagueN -6.146
## 
## , , 3 comps
## 
##            Salary
## AtBat      31.596
## Hits       30.841
## HmRun      21.651
## Runs       28.895
## RBI        30.092
## Walks      28.346
## Years      25.277
## CAtBat     33.077
## CHits      33.388
## CHmRun     29.161
## CRuns      33.604
## CRBI       32.997
## CWalks     30.625
## LeagueN     5.466
## DivisionW  -3.930
## PutOuts    12.899
## Assists    13.245
## Errors     12.827
## NewLeagueN  7.110
## 
## , , 4 comps
## 
##            Salary
## AtBat      30.412
## Hits       30.175
## HmRun      30.390
## Runs       30.746
## RBI        35.242
## Walks      33.186
## Years      21.745
## CAtBat     29.700
## CHits      30.285
## CHmRun     31.913
## CRuns      31.041
## CRBI       32.750
## CWalks     29.499
## LeagueN    20.141
## DivisionW  -5.514
## PutOuts    23.556
## Assists    -6.174
## Errors     -2.808
## NewLeagueN 22.589
## 
## , , 5 comps
## 
##             Salary
## AtBat       28.766
## Hits        30.447
## HmRun       25.844
## Runs        33.001
## RBI         33.820
## Walks       35.088
## Years       22.351
## CAtBat      29.015
## CHits       29.786
## CHmRun      30.002
## CRuns       32.069
## CRBI        31.112
## CWalks      31.487
## LeagueN     19.439
## DivisionW  -63.204
## PutOuts     17.360
## Assists     -5.523
## Errors      -6.044
## NewLeagueN  21.743
## 
## , , 6 comps
## 
##             Salary
## AtBat       24.363
## Hits        25.321
## HmRun       16.518
## Runs        24.484
## RBI         26.860
## Walks       33.874
## Years       24.423
## CAtBat      30.534
## CHits       31.618
## CHmRun      27.460
## CRuns       32.498
## CRBI        31.828
## CWalks      33.606
## LeagueN     10.911
## DivisionW  -68.868
## PutOuts     74.954
## Assists     -3.328
## Errors       3.192
## NewLeagueN  11.960
## 
## , , 7 comps
## 
##             Salary
## AtBat       27.005
## Hits        28.531
## HmRun        4.031
## Runs        29.464
## RBI         18.974
## Walks       47.659
## Years       24.126
## CAtBat      30.832
## CHits       32.112
## CHmRun      21.812
## CRuns       34.054
## CRBI        28.901
## CWalks      37.991
## LeagueN      9.022
## DivisionW  -66.069
## PutOuts     74.483
## Assists     -3.655
## Errors      -6.005
## NewLeagueN  11.401
## 
## , , 8 comps
## 
##              Salary
## AtBat       31.2842
## Hits        34.6957
## HmRun        0.4426
## Runs        31.2803
## RBI         19.0549
## Walks       37.7735
## Years       26.3194
## CAtBat      33.1633
## CHits       35.1968
## CHmRun      17.8551
## CRuns       35.3957
## CRBI        28.8660
## CWalks      33.8181
## LeagueN      8.6066
## DivisionW  -66.0226
## PutOuts     75.5157
## Assists     -4.8535
## Errors     -10.7815
## NewLeagueN  12.4752
## 
## , , 9 comps
## 
##             Salary
## AtBat       30.923
## Hits        32.870
## HmRun        4.219
## Runs        26.206
## RBI         22.941
## Walks       37.138
## Years       26.004
## CAtBat      31.544
## CHits       32.384
## CHmRun      22.832
## CRuns       32.624
## CRBI        30.388
## CWalks      33.742
## LeagueN      8.555
## DivisionW  -65.920
## PutOuts     78.304
## Assists     16.091
## Errors     -28.836
## NewLeagueN  13.615
## 
## , , 10 comps
## 
##             Salary
## AtBat       45.462
## Hits        45.897
## HmRun      -30.836
## Runs        29.425
## RBI          5.746
## Walks       25.064
## Years      -25.164
## CAtBat      21.446
## CHits       24.674
## CHmRun      87.780
## CRuns       33.634
## CRBI        58.444
## CWalks      25.593
## LeagueN     10.145
## DivisionW  -67.068
## PutOuts     77.713
## Assists     10.070
## Errors     -26.041
## NewLeagueN  11.293
## 
## , , 11 comps
## 
##             Salary
## AtBat       42.525
## Hits        45.461
## HmRun      -24.736
## Runs        38.844
## RBI         -1.598
## Walks       19.742
## Years      -30.943
## CAtBat      22.359
## CHits       25.575
## CHmRun      85.373
## CRuns       38.211
## CRBI        55.511
## CWalks      31.342
## LeagueN     27.666
## DivisionW  -66.659
## PutOuts     78.785
## Assists     12.698
## Errors     -28.331
## NewLeagueN  -5.118
## 
## , , 12 comps
## 
##             Salary
## AtBat       44.695
## Hits        48.006
## HmRun      -31.358
## Runs        32.078
## RBI          5.716
## Walks       23.639
## Years      -23.484
## CAtBat      21.046
## CHits       23.819
## CHmRun      86.932
## CRuns       33.392
## CRBI        58.049
## CWalks      26.806
## LeagueN     36.219
## DivisionW  -66.672
## PutOuts     77.911
## Assists     10.634
## Errors     -28.138
## NewLeagueN -14.150
## 
## , , 13 comps
## 
##             Salary
## AtBat       41.804
## Hits        48.267
## HmRun      -39.242
## Runs         4.424
## RBI         37.644
## Walks       26.615
## Years      -67.372
## CAtBat      31.825
## CHits       43.173
## CHmRun      62.986
## CRuns       48.067
## CRBI        60.434
## CWalks      38.898
## LeagueN     36.880
## DivisionW  -66.620
## PutOuts     75.142
## Assists      8.196
## Errors     -30.678
## NewLeagueN -18.672
## 
## , , 14 comps
## 
##             Salary
## AtBat       -43.09
## Hits        -10.12
## HmRun       -38.87
## Runs         78.32
## RBI          39.72
## Walks        92.57
## Years       -91.23
## CAtBat       78.44
## CHits       123.65
## CHmRun       51.94
## CRuns        83.86
## CRBI        118.31
## CWalks     -165.39
## LeagueN      41.45
## DivisionW   -68.33
## PutOuts      78.46
## Assists      31.95
## Errors      -37.21
## NewLeagueN  -26.82
## 
## , , 15 comps
## 
##              Salary
## AtBat       -82.658
## Hits         -8.376
## HmRun       -65.282
## Runs        113.125
## RBI          74.012
## Walks        79.455
## Years       -86.312
## CAtBat       66.717
## CHits       113.498
## CHmRun       52.643
## CRuns        80.151
## CRBI        123.467
## CWalks     -147.199
## LeagueN      38.267
## DivisionW   -64.632
## PutOuts      80.503
## Assists      35.048
## Errors      -36.151
## NewLeagueN  -22.178
## 
## , , 16 comps
## 
##             Salary
## AtBat      -298.80
## Hits        296.64
## HmRun        19.60
## Runs         19.73
## RBI         -26.52
## Walks       122.88
## Years       -96.53
## CAtBat       42.65
## CHits       160.27
## CHmRun       41.29
## CRuns        11.37
## CRBI        184.00
## CWalks     -145.30
## LeagueN      28.22
## DivisionW   -66.13
## PutOuts      75.89
## Assists      42.46
## Errors      -26.03
## NewLeagueN  -16.87
## 
## , , 17 comps
## 
##             Salary
## AtBat      -347.25
## Hits        354.83
## HmRun         3.01
## Runs        -29.28
## RBI          10.58
## Walks       137.57
## Years       -63.30
## CAtBat      -20.73
## CHits       149.15
## CHmRun      142.73
## CRuns       227.83
## CRBI        -27.97
## CWalks     -204.32
## LeagueN      30.84
## DivisionW   -60.59
## PutOuts      81.31
## Assists      47.01
## Errors      -22.95
## NewLeagueN  -14.96
## 
## , , 18 comps
## 
##               Salary
## AtBat      -287.1639
## Hits        330.3183
## HmRun        35.8569
## Runs        -55.7545
## RBI         -25.4324
## Walks       133.8275
## Years       -15.0312
## CAtBat     -425.9233
## CHits       151.1037
## CHmRun       -0.3535
## CRuns       452.9583
## CRBI        239.5045
## CWalks     -206.9835
## LeagueN      31.7984
## DivisionW   -58.5995
## PutOuts      78.7188
## Assists      54.5750
## Errors      -22.7108
## NewLeagueN  -13.0026
## 
## , , 19 comps
## 
##             Salary
## AtBat      -291.65
## Hits        338.47
## HmRun        37.93
## Runs        -60.69
## RBI         -27.05
## Walks       135.33
## Years       -16.73
## CAtBat     -391.78
## CHits        86.85
## CHmRun      -14.21
## CRuns       481.66
## CRBI        261.19
## CWalks     -214.30
## LeagueN      31.31
## DivisionW   -58.53
## PutOuts      78.91
## Assists      53.83
## Errors      -22.20
## NewLeagueN  -12.37
summary(pcr.fit)
## Data:    X dimension: 263 19 
##  Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV             452    348.9    352.2    353.5    352.8    350.1    349.1
## adjCV          452    348.7    351.8    352.9    352.1    349.3    348.0
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       349.6    350.9    352.9     353.8     355.0     356.2     363.5
## adjCV    348.5    349.8    351.6     352.3     353.4     354.5     361.6
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        355.2     357.4     347.6     350.1     349.2     352.6
## adjCV     352.8     355.2     345.5     347.6     346.7     349.8
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         38.31    60.16    70.84    79.03    84.29    88.63    92.26
## Salary    40.63    41.58    42.17    43.22    44.90    46.48    46.69
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         94.96    96.28     97.26     97.98     98.65     99.15     99.47
## Salary    46.75    46.86     47.76     47.82     47.85     48.10     50.40
##         15 comps  16 comps  17 comps  18 comps  19 comps
## X          99.75     99.89     99.97     99.99    100.00
## Salary     50.55     53.01     53.85     54.61     54.61
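
Note that pcr() reports the root mean squared error of prediction (RMSEP); to get the usual MSE we must square it. For instance, the CV value of 352.8 at 4 components corresponds to:

352.8^2
## [1] 124468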

Setting scale=TRUE standardizes each predictor prior to generating the principal components.

From the summary of the model we can also see that the first 13 principal components explain about 99% of the variability in the predictors.

We can also plot the cross-validated MSE using the validationplot() function. From the summary above, the smallest cross-validation error actually occurs with 16 components, but the curve is nearly flat after the first few components, so a model with far fewer components performs almost as well.

validationplot(pcr.fit,val.type="MSEP")

(Plot: cross-validated MSEP against the number of components)

Let us now carry out this exercise with the training and test sets.

set.seed(1)
pcr.fit2 <- pcr(Salary~.,data=Hitters,subset=train,scale=TRUE,validation="CV")
validationplot(pcr.fit2,val.type="MSEP")

(Plot: cross-validated MSEP on the training set against the number of components)

We now see that the model with 7 principal components gives the lowest cross-validated MSE.

Let us now predict on the test set with the 7-component model.

pcr.pred2 <- predict(pcr.fit2,x[test,],ncomp=7)
mean((pcr.pred2-y.test)^2)
## [1] 96556

Let us now fit the model on the full data set with 7 principal components.

pcr.fit3 <- pcr(y~x,scale=TRUE,ncomp=7)
summary(pcr.fit3)
## Data:    X dimension: 263 19 
##  Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 7
## TRAINING: % variance explained
##    1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X    38.31    60.16    70.84    79.03    84.29    88.63    92.26
## y    40.63    41.58    42.17    43.22    44.90    46.48    46.69
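
Finally, the heading above promises PLS regression as well. A minimal sketch with plsr() from the same pls package, following exactly the PCR pattern; the best number of components must be read off the validation plot (in the standard ISLR lab for this split it comes out at M = 2, which is assumed in the ncomp argument below):

set.seed(1)
pls.fit <- plsr(Salary~.,data=Hitters,subset=train,scale=TRUE,validation="CV")
validationplot(pls.fit,val.type="MSEP")
pls.pred <- predict(pls.fit,x[test,],ncomp=2)
mean((pls.pred-y.test)^2)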