Dimension Reduction Methods
Subset selection and shrinkage methods help control model variance while using the original predictors \(X_1,X_2,\dotsc,X_p\). An alternative approach, referred to as dimension reduction, is to transform the predictors before fitting a model.
Let \(Z_1,Z_2,\dotsc,Z_m\) represent \(M<p\) linear combinations of the original \(p\) predictors.
\[Z_m=\sum_{j=1}^p\phi_{jm}X_j\] for some constants \(\phi_{1m},\phi_{2m},\dotsc,\phi_{pm},\space m=1,\dotsc,M\)
Once these combinations are constructed, we can fit the linear regression model
\[y_i=\theta_0+\sum_{m=1}^M\theta_mz_{im}+\epsilon_i,\space\space i=1,\dotsc,n\]
Instead of estimating \(\beta_1,\dotsc,\beta_p\) we estimate \(\theta_1,\dotsc,\theta_M\).
If \(M<p\), the dimension of the problem has been reduced from \(p+1\) to \(M+1\) (the number of predictors plus the intercept).
This transformation constrains the estimated coefficients, which can increase their bias, but when \(p\) is large relative to \(n\) it often produces a significant decrease in variance.
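As a minimal sketch of this idea (simulated data and an arbitrary, hypothetical choice of the constants \(\phi_{jm}\); not how the \(Z_m\) would be chosen in practice), we can build the \(Z_m\) by hand and regress the response on them:

> set.seed(1)                        #make replicable
> n=100; p=5; M=2                    #dimensions
> X=matrix(rnorm(n*p),n,p)           #simulated predictors
> y=drop(X%*%rnorm(p))+rnorm(n)      #simulated response
> phi=matrix(rnorm(p*M),p,M)         #hypothetical constants phi_jm
> Z=X%*%phi                          #each column is Z_m = sum_j phi_jm * X_j
> fit=lm(y~Z)                        #regress y on the M new predictors
> coef(fit)                          #theta_0, theta_1, ..., theta_M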
Principal Components Regression
FIGURE 1: The blue solid line represents the first principal component and the black solid line represents the second principal component
Principal component analysis (PCA) is a popular dimension reduction technique. The first principal component is the direction along which the observations vary the most. In Figure 1 the first principal component is represented by the blue line. If we projected the 100 observations onto this line, the projected values would have the largest possible variance.
The line is given by the formula:
\[Z_1=0.839\times(pop-\overline{pop})+0.544\times(ad-\overline{ad})\]
Here \(\phi_{11}=0.839\) and \(\phi_{21}=0.544\) are the principal component loadings. Out of every possible linear combination of pop and ad such that \(\phi_{11}^2+\phi_{21}^2=1\), this particular combination yields the highest variance. The principal component scores for each observation are \[z_{i1}=0.839\times(pop_i-\overline{pop})+0.544\times(ad_i-\overline{ad})\]
We can also interpret the first principal component as the line that is closest to the data. In this case, the line minimizes the sum of squared distances between each point and the line.
The values of \(Z_1\) can be thought of as single-number summaries of both pop and ad. If
\[z_{i1}=0.839\times(pop_i-\overline{pop})+0.544\times(ad_i-\overline{ad})<0\]
then this indicates a below average population size and below average ad spending. If pop and ad have a strong linear relationship then a single-number summary will work well.
The second principal component, \(Z_2\), is a linear combination of the variables that is uncorrelated with \(Z_1\) and has the largest variance subject to this constraint. It is represented by the black line in Figure 1. The zero-correlation condition means that it must be perpendicular to the first principal component.
The line is given by the formula:
\[Z_2=0.544\times(pop-\overline{pop})-0.839\times(ad-\overline{ad})\]
The second principal component scores are much closer to zero and so they capture far less information.
If there were more predictors then additional principal components could be constructed. Each would be designed to maximize variance subject to the constraint of being uncorrelated with the preceding components.
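The loadings and scores described above can be computed with R's prcomp() function. Below is a small sketch using simulated pop and ad values (the loadings will not match the 0.839 and 0.544 in Figure 1, which come from the actual advertising data):

> set.seed(1)                     #make replicable
> pop=rnorm(100,mean=50,sd=10)    #simulated population sizes
> ad=5+0.5*pop+rnorm(100,sd=3)    #ad spending correlated with pop
> pca=prcomp(cbind(pop,ad))       #PCA (centers each variable by default)
> pca$rotation                    #loadings (phi_11, phi_21) and (phi_12, phi_22)
> head(pca$x)                     #principal component scores z_i1 and z_i2
> colSums(pca$rotation^2)         #each loading vector satisfies the unit-norm constraint
> cor(pca$x[,1],pca$x[,2])        #the two sets of scores are uncorrelated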
Principal Components Regression Approach
Principal components regression (PCR) begins by constructing \(Z_1,\dotsc,Z_M\) and continues by using these new predictors to fit a linear regression model. We assume that the directions in which \(X_1,\dotsc,X_p\) show the most variation are the directions that are associated with \(Y\).
If this assumption holds and the principal components capture most of the important information relative to the response, then PCR will outperform a least squares model, since fitting with \(M<p\) predictors mitigates overfitting.
It is important to note that PCR is not a feature selection method: each principal component is a linear combination of all \(p\) original predictors.
The number of principal components, \(M\), is typically chosen by cross-validation. The predictors are typically standardized before generating the components so that all variables are on the same scale; otherwise, high-variance variables would play a larger role.
Partial Least Squares
PCR combines the predictors in an unsupervised way: the response, \(Y\), is not used to generate the principal components. Partial least squares (PLS) is a supervised alternative that attempts to find directions that help explain both the response and the predictors.
It is computed as follows: after standardizing the \(p\) predictors, the first direction \(Z_1\) is found by setting each \(\phi_{j1}\) equal to the coefficient from the simple linear regression of \(Y\) onto \(X_j\), so the heaviest weights go to the variables most strongly related to the response. Subsequent directions are found by regressing each variable on the directions already computed, taking the residuals, and repeating the same procedure on this orthogonalized data.
\(M\) is usually chosen by cross-validation.
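The following is a minimal sketch of the first step of this procedure on simulated, standardized data: each weight \(\phi_{j1}\) is the coefficient from the simple regression of \(Y\) onto \(X_j\), so predictors strongly related to the response receive the largest weights.

> set.seed(1)                                 #make replicable
> X=scale(matrix(rnorm(100*4),100,4))         #standardized predictors
> y=X[,1]+0.5*X[,2]+rnorm(100)                #response driven mainly by X1 and X2
> #phi_j1 = slope from the simple regression of y onto X_j
> phi1=apply(X,2,function(xj) coef(lm(y~xj))[2])
> Z1=X%*%phi1                                 #first PLS direction
> phi1                                        #largest weights on the first two predictors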
PCR Example
We can apply PCR to the Hitters dataset in order to predict Salary. First, we must remove any missing values.
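The output below can be produced with commands along these lines (a sketch, assuming the Hitters data comes from the ISLR package):

> library(ISLR)                 #Hitters dataset
> names(Hitters)                #the 20 variables
> dim(Hitters)                  #322 rows, 20 columns
> sum(is.na(Hitters$Salary))    #59 players are missing Salary
> Hitters=na.omit(Hitters)      #drop rows with missing values
> dim(Hitters)                  #263 rows remain
> sum(is.na(Hitters))           #no missing values left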
[1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
[7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
[13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
[19] "Salary" "NewLeague"
[1] 322 20
[1] 59
[1] 263 20
[1] 0
Next, we can use the pls package to fit the model. The pcr() function allows us to standardize the data and perform 10-fold cross-validation for each possible value of \(M\).
> library(pls) #load library
> set.seed(2) #make replicable
> #Fit PCR model
> #Standardize data with scale=TRUE
> #10 fold cross-validation for each value of M
> pcr.fit=pcr(Salary~.,data=Hitters, scale=TRUE,
+ validation="CV")We can view the resulting fit with summary. The CV score is provided for every value of \(M\) and represents the root mean squared error. To obtain the MSE we must square it.14 MSE of \(352.8^2=124,468\)
The summary() function also provides the percentage of variance explained in the predictors and in the response. For example, \(M=1\) captures \(38.31\%\) of the variance in the predictors and \(40.63\%\) of the variance in the response.
Data: X dimension: 263 19
Y dimension: 263 1
Fit method: svdpc
Number of components considered: 19
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
CV 452 348.9 352.2 353.5 352.8 350.1 349.1
adjCV 452 348.7 351.8 352.9 352.1 349.3 348.0
7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
CV 349.6 350.9 352.9 353.8 355.0 356.2 363.5
adjCV 348.5 349.8 351.6 352.3 353.4 354.5 361.6
14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
CV 355.2 357.4 347.6 350.1 349.2 352.6
adjCV 352.8 355.2 345.5 347.6 346.7 349.8
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
X 38.31 60.16 70.84 79.03 84.29 88.63 92.26 94.96
Salary 40.63 41.58 42.17 43.22 44.90 46.48 46.69 46.75
9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
X 96.28 97.26 97.98 98.65 99.15 99.47 99.75
Salary 46.86 47.76 47.82 47.85 48.10 50.40 50.55
16 comps 17 comps 18 comps 19 comps
X 99.89 99.97 99.99 100.00
Salary 53.01 53.85 54.61 54.61
We can see from the plot that the smallest root MSE occurs when \(M=16\). However, it is not much smaller than when \(M=1\), so a simpler model could be used. Also, \(M=19\) is equivalent to using all \(p\) predictors and amounts to the least squares fit, so PCR provides only a modest improvement over linear regression here.
FIGURE 2: Plot of Root MSE by the number of components
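A plot like Figure 2 can be produced with the pls package's validationplot() function (a sketch; val.type="RMSEP" plots the root mean squared error and is the default, while "MSEP" would plot the MSE):

> #Plot cross-validated root MSE against the number of components
> validationplot(pcr.fit,val.type="RMSEP")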
We can also perform PCR on a training set and then evaluate the model on a test set.
> set.seed(1) # make replicable
> #training set = half the data
> train=sample(1:nrow(Hitters),nrow(Hitters)/2)
> #test set will be remaining rows
> test=(-train)
> #response variable
> y=Hitters$Salary
> #response for test set
> y.test=y[test]
>
> set.seed(1) #make replicable
> #perform PCR on training data
> pcr.fit=pcr(Salary~.,data=Hitters,subset=train,
+ scale=TRUE, validation="CV")
> #View MSE plot
> validationplot(pcr.fit,val.type = "MSEP")

FIGURE 3: Plot of Root MSE by number of components for the training set
We can see from the plot that the lowest CV error occurs at \(M=7\).
Using the predict() function we find that the test MSE is \(96556\).
> #Predict on test set with M=7
> pcr.pred=predict(pcr.fit,Hitters[test,],ncomp=7)
> #MSE value
> mean((pcr.pred-y.test)^2)
[1] 96556.22
Finally, we can fit the full model using \(M=7\).
> #PCR on full dataset with M=7
> pcr.fit=pcr(Salary~.,data=Hitters,
+ scale=TRUE, ncomp=7)
> #view resulting fit
> summary(pcr.fit)
Data: X dimension: 263 19
Y dimension: 263 1
Fit method: svdpc
Number of components considered: 7
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
X 38.31 60.16 70.84 79.03 84.29 88.63 92.26
Salary 40.63 41.58 42.17 43.22 44.90 46.48 46.69
PLS Example
We can apply PLS with the plsr() function.
> set.seed(1)#make replicable
> #Fit PLS model on training set
> pls.fit=plsr(Salary~.,data=Hitters,subset=train,
+ scale=TRUE, validation="CV")
> #view model results
> summary(pls.fit)
Data: X dimension: 131 19
Y dimension: 131 1
Fit method: kernelpls
Number of components considered: 19
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
CV 464.6 394.2 391.5 393.1 395.0 415.0 424.0
adjCV 464.6 393.4 390.2 391.1 392.9 411.5 418.8
7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
CV 424.5 415.8 404.6 407.1 412.0 414.4 410.3
adjCV 418.9 411.4 400.7 402.2 407.2 409.3 405.6
14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
CV 406.2 408.6 410.5 408.8 407.8 410.2
adjCV 401.8 403.9 405.6 404.1 403.2 405.5
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
X 38.12 53.46 66.05 74.49 79.33 84.56 87.09 90.74
Salary 33.58 38.96 41.57 42.43 44.04 45.59 47.05 47.53
9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
X 92.55 93.94 97.23 97.88 98.35 98.85 99.11
Salary 48.42 49.68 50.04 50.54 50.78 50.92 51.04
16 comps 17 comps 18 comps 19 comps
X 99.43 99.78 99.99 100.00
Salary 51.11 51.15 51.16 51.18
FIGURE 4: Root MSE by PLS component
The lowest CV score occurs when \(M=2\).
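A plot like Figure 4 can be generated the same way as before (a sketch using the pls.fit object above; val.type controls whether the root MSE or the MSE is shown):

> #Plot cross-validated root MSE by number of PLS components
> validationplot(pls.fit,val.type="RMSEP")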
The test set MSE of \(101417\) is higher than it was for PCR.
> #predictions for the test set with M=2
> pls.pred=predict(pls.fit,Hitters[test,],ncomp=2)
> #MSE value
> mean((pls.pred-y.test)^2)
[1] 101417.5
Finally, we perform PLS using the full dataset with \(M=2\).
> #PLS model on full data with M=2
> pls.fit=plsr(Salary~.,data=Hitters,scale=TRUE,
+ ncomp=2)
> #View results
> summary(pls.fit)
Data: X dimension: 263 19
Y dimension: 263 1
Fit method: kernelpls
Number of components considered: 2
TRAINING: % variance explained
1 comps 2 comps
X 38.08 51.03
Salary 43.05 46.40