PCR and PLS Regression

Principal Components Regression
Partial Least Squares

Principal Components Regression

Principal components regression (PCR) can be performed using the pcr() function, which is part of the pls library. We now apply PCR to the Hitters data, in order to predict Salary. Again, ensure that the missing values have been removed from the data, as described in Section 6.5.

library(ISLR)


Attaching package: 㤼㸱ISLR㤼㸲

The following object is masked _by_ 㤼㸱.GlobalEnv㤼㸲:

    Hitters

Hitters=na.omit(Hitters)
library(pls)


Attaching package: 㤼㸱pls㤼㸲

The following object is masked from 㤼㸱package:stats㤼㸲:

    loadings

set.seed(2)
pcr.fit=pcr(Salary~., data=Hitters, scale=TRUE, validation ="CV")

The syntax for the pcr() function is similar to that for lm(), with a few additional options. Setting scale=TRUE has the effect of standardizing each predictor, using (6.6), prior to generating the principal components, so that the scale on which each variable is measured will not have an effect. Setting validation="CV" causes pcr() to compute the ten-fold cross-validation error for each possible value of M, the number of principal components used. The resulting fit can be examined using summary().

summary(pcr.fit)

Data:   X dimension: 263 19 
    Y dimension: 263 1
Fit method: svdpc
Number of components considered: 19

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps  16 comps  17 comps  18 comps
CV             452    351.9    353.2    355.0    352.8    348.4    343.6    345.5    347.7    349.6     351.4     352.1     353.5     358.2     349.7     349.4     339.9     341.6     339.2
adjCV          452    351.6    352.7    354.4    352.1    347.6    342.7    344.7    346.7    348.5     350.1     350.7     352.0     356.5     348.0     347.7     338.2     339.7     337.2
       19 comps
CV        339.6
adjCV     337.6

TRAINING: % variance explained
        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
X         38.31    60.16    70.84    79.03    84.29    88.63    92.26    94.96    96.28     97.26     97.98     98.65     99.15     99.47     99.75     99.89     99.97     99.99    100.00
Salary    40.63    41.58    42.17    43.22    44.90    46.48    46.69    46.75    46.86     47.76     47.82     47.85     48.10     50.40     50.55     53.01     53.85     54.61     54.61

The CV score is provided for each possible number of components, ranging from M = 0 onwards. (We have printed the CV output only up to M = 4.) Note that pcr() reports the root mean squared error; in order to obtain the usual MSE, we must square this quantity. For instance, a root mean squared error of 352.8 corresponds to an MSE of \(352.8^2 = 124,468\).

One can also plot the cross-validation scores using the validationplot() function. Using val.type="MSEP" will cause the cross-validation MSE to be plotted.

validationplot(pcr.fit,val.type="MSEP")

We see that the smallest cross-validation error occurs when M = 16 components are used. This is barely fewer than M = 19, which amounts to simply performing least squares, because when all of the components are used in PCR no dimension reduction occurs. However, from the plot we also see that the cross-validation error is roughly the same when only one component is included in the model. This suggests that a model that uses just a small number of components might suffice.

The summary() function also provides the percentage of variance explained in the predictors and in the response using different numbers of components. This concept is discussed in greater detail in Chapter 10. Briefly, we can think of this as the amount of information about the predictors or the response that is captured using M principal components. For example, setting M = 1 only captures 38.31% of all the variance, or information, in the predictors. In contrast, using M = 6 increases the value to 88.63%. If we were to use all M = p = 19 components, this would increase to 100%.

We now perform PCR on the training data and evaluate its test set performance.

set.seed(1)
pcr.fit=pcr(Salary~., data=Hitters, subset=train, scale=TRUE, validation ="CV")
validationplot(pcr.fit, val.type="MSEP")

Now we find that the lowest cross-validation error occurs when M = 7 component are used. We compute the test MSE as follows.

pcr.pred=predict(pcr.fit,x[test ,],ncomp =7)
mean((pcr.pred-y.test)^2)

[1] 140751.3

This test set MSE is competitive with the results obtained using ridge regression and the lasso. However, as a result of the way PCR is implemented, the final model is more difficult to interpret because it does not perform any kind of variable selection or even directly produce coefficient estimates.

Finally, we fit PCR on the full data set, using M = 7, the number of components identified by cross-validation.

pcr.fit=pcr(y~x,scale=TRUE,ncomp=7)
summary(pcr.fit)

Data:   X dimension: 263 19 
    Y dimension: 263 1
Fit method: svdpc
Number of components considered: 7
TRAINING: % variance explained
   1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
X    38.31    60.16    70.84    79.03    84.29    88.63    92.26
y    40.63    41.58    42.17    43.22    44.90    46.48    46.69

Partial Least Squares

We implement partial least squares (PLS) using the plsr() function, also plsr() in the pls library. The syntax is just like that of the pcr() function.

set.seed(1)
pls.fit=plsr(Salary~.,data=Hitters,subset=train,scale=TRUE,validation ="CV")
summary (pls.fit)

Data:   X dimension: 131 19 
    Y dimension: 131 1
Fit method: kernelpls
Number of components considered: 19

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps  16 comps  17 comps  18 comps
CV           428.3    325.5    329.9    328.8    339.0    338.9    340.1    339.0    347.1    346.4     343.4     341.5     345.4     356.4     348.4     349.1     350.0     344.2     344.5
adjCV        428.3    325.0    328.2    327.2    336.6    336.1    336.6    336.2    343.4    342.8     340.2     338.3     341.8     351.1     344.2     345.0     345.9     340.4     340.6
       19 comps
CV        345.0
adjCV     341.1

TRAINING: % variance explained
        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
X         39.13    48.80    60.09    75.07    78.58    81.12    88.21    90.71    93.17     96.05     97.08     97.61     97.97     98.70     99.12     99.61     99.70     99.95    100.00
Salary    46.36    50.72    52.23    53.03    54.07    54.77    55.05    55.66    55.95     56.12     56.47     56.68     57.37     57.76     58.08     58.17     58.49     58.56     58.62

validationplot(pls.fit,val.type="MSEP")

The lowest cross-validation error occurs when only M = 2 partial least squares directions are used. We now evaluate the corresponding test set MSE.

pls.pred=predict(pls.fit,x[test,],ncomp =2)
mean((pls.pred -y.test)^2)

[1] 145367.7

The test MSE is comparable to, but slightly higher than, the test MSE obtained using ridge regression, the lasso, and PCR. Finally, we perform PLS using the full data set, using M = 2, the number of components identified by cross-validation.

pls.fit=plsr(Salary~.,data=Hitters,scale=TRUE,ncomp=2)
summary(pls.fit)

Data:   X dimension: 263 19 
    Y dimension: 263 1
Fit method: kernelpls
Number of components considered: 2
TRAINING: % variance explained
        1 comps  2 comps
X         38.08    51.03
Salary    43.05    46.40

Notice that the percentage of variance in Salary that the two-component PLS fit explains, 46.40%, is almost as much as that explained using the final seven-component model PCR fit, 46.69%. This is because PCR only attempts to maximize the amount of variance explained in the predictors, while PLS searches for directions that explain variance in both the predictors and the response.

LS0tDQp0aXRsZTogIlBDUiBhbmQgUExTIFJlZ3Jlc3Npb24iDQpvdXRwdXQ6IA0KICBodG1sX25vdGVib29rOg0KICAgIHRvYzogdHJ1ZQ0KLS0tDQoNCiMjIFByaW5jaXBhbCBDb21wb25lbnRzIFJlZ3Jlc3Npb24NClByaW5jaXBhbCBjb21wb25lbnRzIHJlZ3Jlc3Npb24gKFBDUikgY2FuIGJlIHBlcmZvcm1lZCB1c2luZyB0aGUgYHBjcigpYCBmdW5jdGlvbiwgd2hpY2ggaXMgcGFydCBvZiB0aGUgYHBsc2AgbGlicmFyeS4gV2Ugbm93IGFwcGx5IFBDUiB0byB0aGUgYEhpdHRlcnNgIGRhdGEsIGluIG9yZGVyIHRvIHByZWRpY3QgYFNhbGFyeWAuIEFnYWluLCBlbnN1cmUgdGhhdCB0aGUgbWlzc2luZyB2YWx1ZXMgaGF2ZSBiZWVuIHJlbW92ZWQgZnJvbSB0aGUgZGF0YSwgYXMgZGVzY3JpYmVkIGluIFNlY3Rpb24gNi41Lg0KDQpgYGB7ciBwbHN9DQpsaWJyYXJ5KElTTFIpDQpIaXR0ZXJzPW5hLm9taXQoSGl0dGVycykNCmxpYnJhcnkocGxzKQ0Kc2V0LnNlZWQoMikNCnBjci5maXQ9cGNyKFNhbGFyeX4uLCBkYXRhPUhpdHRlcnMsIHNjYWxlPVRSVUUsIHZhbGlkYXRpb24gPSJDViIpDQpgYGANCg0KVGhlIHN5bnRheCBmb3IgdGhlIGBwY3IoKWAgZnVuY3Rpb24gaXMgc2ltaWxhciB0byB0aGF0IGZvciBgbG0oKWAsIHdpdGggYSBmZXcgYWRkaXRpb25hbCBvcHRpb25zLiBTZXR0aW5nIGBzY2FsZT1UUlVFYCBoYXMgdGhlIGVmZmVjdCBvZiBzdGFuZGFyZGl6aW5nIGVhY2ggcHJlZGljdG9yLCB1c2luZyAoNi42KSwgcHJpb3IgdG8gZ2VuZXJhdGluZyB0aGUgcHJpbmNpcGFsIGNvbXBvbmVudHMsIHNvIHRoYXQgdGhlIHNjYWxlIG9uIHdoaWNoIGVhY2ggdmFyaWFibGUgaXMgbWVhc3VyZWQgd2lsbCBub3QgaGF2ZSBhbiBlZmZlY3QuIFNldHRpbmcgYHZhbGlkYXRpb249IkNWImAgY2F1c2VzIGBwY3IoKWAgdG8gY29tcHV0ZSB0aGUgdGVuLWZvbGQgY3Jvc3MtdmFsaWRhdGlvbiBlcnJvciBmb3IgZWFjaCBwb3NzaWJsZSB2YWx1ZSBvZiBNLCB0aGUgbnVtYmVyIG9mIHByaW5jaXBhbCBjb21wb25lbnRzIHVzZWQuIFRoZSByZXN1bHRpbmcgZml0IGNhbiBiZSBleGFtaW5lZCB1c2luZyBgc3VtbWFyeSgpYC4NCg0KYGBge3IgcGxzdW1tYXJ5fQ0Kc3VtbWFyeShwY3IuZml0KQ0KYGBgDQoNClRoZSBDViBzY29yZSBpcyBwcm92aWRlZCBmb3IgZWFjaCBwb3NzaWJsZSBudW1iZXIgb2YgY29tcG9uZW50cywgcmFuZ2luZyBmcm9tICpNKiA9IDAgb253YXJkcy4gKFdlIGhhdmUgcHJpbnRlZCB0aGUgQ1Ygb3V0cHV0IG9ubHkgdXAgdG8gKk0qID0gNC4pIE5vdGUgdGhhdCBgcGNyKClgIHJlcG9ydHMgdGhlICpyb290IG1lYW4gc3F1YXJlZCBlcnJvcio7IGluIG9yZGVyIHRvIG9idGFpbiB0aGUgdXN1YWwgTVNFLCB3ZSBtdXN0IHNxdWFyZSB0aGlzIHF1YW50aXR5LiBGb3IgaW5zdGFuY2UsIGEgcm9vdCBtZWFuIHNxdWFyZWQgZXJyb3Igb2YgMzUyLjggY29ycmVzcG9uZHMgdG8gYW4gTVNFIG9mICQzNTIuOF4yID0gMTI0LDQ2OCQuIA0KDQpPbmUgY2FuIGFsc28gcGxvdCB0aGUgY3Jvc3MtdmFsaWRhdGlvbiBzY29yZXMgdXNpbmcgdGhlIGB2YWxpZGF0aW9ucGxvdCgpYCBmdW5jdGlvbi4gVXNpbmcgYHZhbC50eXBlPSJNU0VQImAgd2lsbCBjYXVzZSB0aGUgY3Jvc3MtdmFsaWRhdGlvbiBNU0UgdG8gYmUgcGxvdHRlZC4NCg0KYGBge3IgdmFsaWRhdGlvbnBsb3R9DQp2YWxpZGF0aW9ucGxvdChwY3IuZml0LHZhbC50eXBlPSJNU0VQIikNCmBgYA0KDQpXZSBzZWUgdGhhdCB0aGUgc21hbGxlc3QgY3Jvc3MtdmFsaWRhdGlvbiBlcnJvciBvY2N1cnMgd2hlbiBNID0gMTYgY29tcG9uZW50cyBhcmUgdXNlZC4gVGhpcyBpcyBiYXJlbHkgZmV3ZXIgdGhhbiAqTSogPSAxOSwgd2hpY2ggYW1vdW50cyB0byBzaW1wbHkgcGVyZm9ybWluZyBsZWFzdCBzcXVhcmVzLCBiZWNhdXNlIHdoZW4gYWxsIG9mIHRoZSBjb21wb25lbnRzIGFyZSB1c2VkIGluIFBDUiBubyBkaW1lbnNpb24gcmVkdWN0aW9uIG9jY3Vycy4gSG93ZXZlciwgZnJvbSB0aGUgcGxvdCB3ZSBhbHNvIHNlZSB0aGF0IHRoZSBjcm9zcy12YWxpZGF0aW9uIGVycm9yIGlzIHJvdWdobHkgdGhlIHNhbWUgd2hlbiBvbmx5IG9uZSBjb21wb25lbnQgaXMgaW5jbHVkZWQgaW4gdGhlIG1vZGVsLiBUaGlzIHN1Z2dlc3RzIHRoYXQgYSBtb2RlbCB0aGF0IHVzZXMganVzdCBhIHNtYWxsIG51bWJlciBvZiBjb21wb25lbnRzIG1pZ2h0IHN1ZmZpY2UuDQoNClRoZSBgc3VtbWFyeSgpYCBmdW5jdGlvbiBhbHNvIHByb3ZpZGVzIHRoZSBwZXJjZW50YWdlIG9mIHZhcmlhbmNlIGV4cGxhaW5lZCBpbiB0aGUgcHJlZGljdG9ycyBhbmQgaW4gdGhlIHJlc3BvbnNlIHVzaW5nIGRpZmZlcmVudCBudW1iZXJzIG9mIGNvbXBvbmVudHMuIFRoaXMgY29uY2VwdCBpcyBkaXNjdXNzZWQgaW4gZ3JlYXRlciBkZXRhaWwgaW4gQ2hhcHRlciAxMC4gQnJpZWZseSwgd2UgY2FuIHRoaW5rIG9mIHRoaXMgYXMgdGhlIGFtb3VudCBvZiBpbmZvcm1hdGlvbiBhYm91dCB0aGUgcHJlZGljdG9ycyBvciB0aGUgcmVzcG9uc2UgdGhhdCBpcyBjYXB0dXJlZCB1c2luZyAqTSogcHJpbmNpcGFsIGNvbXBvbmVudHMuIEZvciBleGFtcGxlLCBzZXR0aW5nICpNKiA9IDEgb25seSBjYXB0dXJlcyAzOC4zMSUgb2YgYWxsIHRoZSB2YXJpYW5jZSwgb3IgaW5mb3JtYXRpb24sIGluIHRoZSBwcmVkaWN0b3JzLiBJbiBjb250cmFzdCwgdXNpbmcgTSA9IDYgaW5jcmVhc2VzIHRoZSB2YWx1ZSB0byA4OC42MyUuIElmIHdlIHdlcmUgdG8gdXNlIGFsbCAqTSogPSAqcCogPSAxOSBjb21wb25lbnRzLCB0aGlzIHdvdWxkIGluY3JlYXNlIHRvIDEwMCUuDQoNCldlIG5vdyBwZXJmb3JtIFBDUiBvbiB0aGUgdHJhaW5pbmcgZGF0YSBhbmQgZXZhbHVhdGUgaXRzIHRlc3Qgc2V0IHBlcmZvcm1hbmNlLg0KDQpgYGB7ciBwY3J9DQpzZXQuc2VlZCgxKQ0KcGNyLmZpdD1wY3IoU2FsYXJ5fi4sIGRhdGE9SGl0dGVycywgc3Vic2V0PXRyYWluLCBzY2FsZT1UUlVFLCB2YWxpZGF0aW9uID0iQ1YiKQ0KdmFsaWRhdGlvbnBsb3QocGNyLmZpdCwgdmFsLnR5cGU9Ik1TRVAiKQ0KYGBgDQoNCk5vdyB3ZSBmaW5kIHRoYXQgdGhlIGxvd2VzdCBjcm9zcy12YWxpZGF0aW9uIGVycm9yIG9jY3VycyB3aGVuICpNKiA9IDcgY29tcG9uZW50IGFyZSB1c2VkLiBXZSBjb21wdXRlIHRoZSB0ZXN0IE1TRSBhcyBmb2xsb3dzLg0KDQpgYGB7ciBwY3Jtc2V9DQpwY3IucHJlZD1wcmVkaWN0KHBjci5maXQseFt0ZXN0ICxdLG5jb21wID03KQ0KbWVhbigocGNyLnByZWQteS50ZXN0KV4yKQ0KYGBgDQoNClRoaXMgdGVzdCBzZXQgTVNFIGlzIGNvbXBldGl0aXZlIHdpdGggdGhlIHJlc3VsdHMgb2J0YWluZWQgdXNpbmcgcmlkZ2UgcmVncmVzc2lvbiBhbmQgdGhlIGxhc3NvLiBIb3dldmVyLCBhcyBhIHJlc3VsdCBvZiB0aGUgd2F5IFBDUiBpcyBpbXBsZW1lbnRlZCwgdGhlIGZpbmFsIG1vZGVsIGlzIG1vcmUgZGlmZmljdWx0IHRvIGludGVycHJldCBiZWNhdXNlIGl0IGRvZXMgbm90IHBlcmZvcm0gYW55IGtpbmQgb2YgdmFyaWFibGUgc2VsZWN0aW9uIG9yIGV2ZW4gZGlyZWN0bHkgcHJvZHVjZSBjb2VmZmljaWVudCBlc3RpbWF0ZXMuDQoNCkZpbmFsbHksIHdlIGZpdCBQQ1Igb24gdGhlIGZ1bGwgZGF0YSBzZXQsIHVzaW5nICpNKiA9IDcsIHRoZSBudW1iZXIgb2YgY29tcG9uZW50cyBpZGVudGlmaWVkIGJ5IGNyb3NzLXZhbGlkYXRpb24uDQoNCmBgYHtyIHBjcjd9DQpwY3IuZml0PXBjcih5fngsc2NhbGU9VFJVRSxuY29tcD03KQ0Kc3VtbWFyeShwY3IuZml0KQ0KYGBgDQoNCiMjIFBhcnRpYWwgTGVhc3QgU3F1YXJlcw0KV2UgaW1wbGVtZW50IHBhcnRpYWwgbGVhc3Qgc3F1YXJlcyAoUExTKSB1c2luZyB0aGUgYHBsc3IoKWAgZnVuY3Rpb24sIGFsc28gYHBsc3IoKWAgaW4gdGhlIGBwbHNgIGxpYnJhcnkuIFRoZSBzeW50YXggaXMganVzdCBsaWtlIHRoYXQgb2YgdGhlIGBwY3IoKWAgZnVuY3Rpb24uDQoNCmBgYHtyIHBsc3J9DQpzZXQuc2VlZCgxKQ0KcGxzLmZpdD1wbHNyKFNhbGFyeX4uLGRhdGE9SGl0dGVycyxzdWJzZXQ9dHJhaW4sc2NhbGU9VFJVRSx2YWxpZGF0aW9uID0iQ1YiKQ0Kc3VtbWFyeSAocGxzLmZpdCkNCnZhbGlkYXRpb25wbG90KHBscy5maXQsdmFsLnR5cGU9Ik1TRVAiKQ0KYGBgDQoNClRoZSBsb3dlc3QgY3Jvc3MtdmFsaWRhdGlvbiBlcnJvciBvY2N1cnMgd2hlbiBvbmx5ICpNKiA9IDIgcGFydGlhbCBsZWFzdCBzcXVhcmVzIGRpcmVjdGlvbnMgYXJlIHVzZWQuIFdlIG5vdyBldmFsdWF0ZSB0aGUgY29ycmVzcG9uZGluZyB0ZXN0IHNldCBNU0UuDQoNCmBgYHtyIHBsc3J2YWxpZH0NCnBscy5wcmVkPXByZWRpY3QocGxzLmZpdCx4W3Rlc3QsXSxuY29tcCA9MikNCm1lYW4oKHBscy5wcmVkIC15LnRlc3QpXjIpDQpgYGANCg0KVGhlIHRlc3QgTVNFIGlzIGNvbXBhcmFibGUgdG8sIGJ1dCBzbGlnaHRseSBoaWdoZXIgdGhhbiwgdGhlIHRlc3QgTVNFIG9idGFpbmVkIHVzaW5nIHJpZGdlIHJlZ3Jlc3Npb24sIHRoZSBsYXNzbywgYW5kIFBDUi4gRmluYWxseSwgd2UgcGVyZm9ybSBQTFMgdXNpbmcgdGhlIGZ1bGwgZGF0YSBzZXQsIHVzaW5nICpNKiA9IDIsIHRoZSBudW1iZXIgb2YgY29tcG9uZW50cyBpZGVudGlmaWVkIGJ5IGNyb3NzLXZhbGlkYXRpb24uDQoNCmBgYHtyIHBsc3IyfQ0KcGxzLmZpdD1wbHNyKFNhbGFyeX4uLGRhdGE9SGl0dGVycyxzY2FsZT1UUlVFLG5jb21wPTIpDQpzdW1tYXJ5KHBscy5maXQpDQpgYGANCg0KTm90aWNlIHRoYXQgdGhlIHBlcmNlbnRhZ2Ugb2YgdmFyaWFuY2UgaW4gU2FsYXJ5IHRoYXQgdGhlIHR3by1jb21wb25lbnQgUExTIGZpdCBleHBsYWlucywgNDYuNDAlLCBpcyBhbG1vc3QgYXMgbXVjaCBhcyB0aGF0IGV4cGxhaW5lZCB1c2luZyB0aGUgZmluYWwgc2V2ZW4tY29tcG9uZW50IG1vZGVsIFBDUiBmaXQsIDQ2LjY5JS4gVGhpcyBpcyBiZWNhdXNlIFBDUiBvbmx5IGF0dGVtcHRzIHRvIG1heGltaXplIHRoZSBhbW91bnQgb2YgdmFyaWFuY2UgZXhwbGFpbmVkIGluIHRoZSBwcmVkaWN0b3JzLCB3aGlsZSBQTFMgc2VhcmNoZXMgZm9yIGRpcmVjdGlvbnMgdGhhdCBleHBsYWluIHZhcmlhbmNlIGluIGJvdGggdGhlIHByZWRpY3RvcnMgYW5kIHRoZSByZXNwb25zZS4=