Dimension Reduction Methods
Subset selection and shrinkage methods help control model variance while using the original predictors \(X_1,X_2,\dotsc,X_p\). An alternative approach, referred to as dimension reduction, is to transform the predictors before fitting a model.
Let \(Z_1,Z_2,\dotsc,Z_m\) represent \(M<p\) linear combinations of the original \(p\) predictors.
\[Z_m=\sum_{j=1}^p\phi_{jm}X_j\] for some constants \(\phi_{1m},\phi_{2m},\dotsc,\phi_{pm},\space m=1,\dotsc,M\)
Once these combinations are constructed, we can fit the linear regression model
\[y_i=\theta_0+\sum_{m=1}^M\theta_mz_{im}+\epsilon_i,\space\space i=1,\dotsc,n\]
Instead of estimating \(\beta_1,\dotsc,\beta_p\) we estimate \(\theta_1,\dotsc,\theta_M\).
If \(M<p\), the dimension of the problem has been reduced from \(p+1\) to \(M+1\) (the number of predictors plus the intercept).
This transformation constrains the estimated coefficients, which can increase their bias, but when \(p\) is large relative to \(n\) it often produces a significant decrease in variance.
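As a minimal sketch of this idea (simulated data and an arbitrary, hypothetical choice of the constants \(\phi_{jm}\); not how the \(Z_m\) would be chosen in practice), we can build the \(Z_m\) by hand and regress the response on them:

> set.seed(1)                        #make replicable
> n=100; p=5; M=2                    #dimensions
> X=matrix(rnorm(n*p),n,p)           #simulated predictors
> y=drop(X%*%rnorm(p))+rnorm(n)      #simulated response
> phi=matrix(rnorm(p*M),p,M)         #hypothetical constants phi_jm
> Z=X%*%phi                          #each column is Z_m = sum_j phi_jm * X_j
> fit=lm(y~Z)                        #regress y on the M new predictors
> coef(fit)                          #theta_0, theta_1, ..., theta_M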
Principal Components Regression
FIGURE 1: The blue solid line represents the first principal component and the black solid line represents the second principal component
Principal component analysis (PCA) is a popular dimension reduction technique. The first principal component is the direction along which the observations vary the most. In Figure 1 the first principal component is represented by the blue line. If we projected the 100 observations onto this line, the projected values would have the largest possible variance.
The line is given by the formula:
\[Z_1=0.839\times(pop-\overline{pop})+0.544\times(ad-\overline{ad})\]
Here \(\phi_{11}=0.839\) and \(\phi_{21}=0.544\) are the principal component loadings. Out of every possible linear combination of pop and ad such that \(\phi_{11}^2+\phi_{21}^2=1\), this particular combination yields the highest variance. The principal component scores for each observation are \[z_{i1}=0.839\times(pop_i-\overline{pop})+0.544\times(ad_i-\overline{ad})\]
We can also interpret the first principal component as the line that is closest to the data. In this case, the line minimizes the sum of squared distances between each point and the line.
The values of \(Z_1\) can be thought of as single-number summaries of both pop and ad. If
\[z_{i1}=0.839\times(pop_i-\overline{pop})+0.544\times(ad_i-\overline{ad})<0\]
then this indicates a below average population size and below average ad spending. If pop and ad have a strong linear relationship then a single-number summary will work well.
The second principal component, \(Z_2\), is a linear combination of the variables that is uncorrelated with \(Z_1\) and has the largest variance subject to this constraint. It is represented by the black line in Figure 1. The zero-correlation condition means that it must be perpendicular to the first principal component.
The line is given by the formula:
\[Z_2=0.544\times(pop-\overline{pop})-0.839\times(ad-\overline{ad})\]
The second principal component scores are much closer to zero and so they capture far less information.
If there were more predictors then additional principal components could be constructed. Each would be designed to maximize variance subject to the constraint of being uncorrelated with the preceding components.
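The loadings and scores described above can be computed with R's prcomp() function. Below is a small sketch using simulated pop and ad values (the loadings will not match the 0.839 and 0.544 in Figure 1, which come from the actual advertising data):

> set.seed(1)                     #make replicable
> pop=rnorm(100,mean=50,sd=10)    #simulated population sizes
> ad=5+0.5*pop+rnorm(100,sd=3)    #ad spending correlated with pop
> pca=prcomp(cbind(pop,ad))       #PCA (centers each variable by default)
> pca$rotation                    #loadings (phi_11, phi_21) and (phi_12, phi_22)
> head(pca$x)                     #principal component scores z_i1 and z_i2
> colSums(pca$rotation^2)         #each loading vector satisfies the unit-norm constraint
> cor(pca$x[,1],pca$x[,2])        #the two sets of scores are uncorrelated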
Principal Components Regression Approach
Principal components regression (PCR) begins by constructing \(Z_1,\dotsc,Z_M\) and continues by using these new predictors to fit a linear regression model. We assume that the directions in which \(X_1,\dotsc,X_p\) show the most variation are the directions that are associated with \(Y\).
If this assumption holds and the principal components capture most of the important information relative to the response, then PCR will outperform a least squares model, since fitting with \(M<p\) predictors mitigates overfitting.
It is important to note that PCR is not a feature selection method: each principal component is a linear combination of all \(p\) original predictors.
The number of principal components, \(M\), is typically chosen by cross-validation. The predictors are typically standardized before generating the components so that all variables are on the same scale; otherwise, high-variance variables would play a larger role.
Partial Least Squares
PCR combines the predictors in an unsupervised way: the response, \(Y\), is not used to generate the principal components. Partial least squares (PLS) is a supervised alternative that attempts to find directions that help explain both the response and the predictors.
It is computed as follows: after standardizing the \(p\) predictors, the first direction \(Z_1\) is found by setting each \(\phi_{j1}\) equal to the coefficient from the simple linear regression of \(Y\) onto \(X_j\), so the heaviest weights go to the variables most strongly related to the response. Subsequent directions are found by regressing each variable on the directions already computed, taking the residuals, and repeating the same procedure on this orthogonalized data.
\(M\) is usually chosen by cross-validation.
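The following is a minimal sketch of the first step of this procedure on simulated, standardized data: each weight \(\phi_{j1}\) is the coefficient from the simple regression of \(Y\) onto \(X_j\), so predictors strongly related to the response receive the largest weights.

> set.seed(1)                                 #make replicable
> X=scale(matrix(rnorm(100*4),100,4))         #standardized predictors
> y=X[,1]+0.5*X[,2]+rnorm(100)                #response driven mainly by X1 and X2
> #phi_j1 = slope from the simple regression of y onto X_j
> phi1=apply(X,2,function(xj) coef(lm(y~xj))[2])
> Z1=X%*%phi1                                 #first PLS direction
> phi1                                        #largest weights on the first two predictors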
PCR Example
We can apply PCR to the Hitters dataset in order to predict Salary. First, we must remove any missing values.
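The output below can be produced with commands along these lines (a sketch, assuming the Hitters data comes from the ISLR package):

> library(ISLR)                 #Hitters dataset
> names(Hitters)                #the 20 variables
> dim(Hitters)                  #322 rows, 20 columns
> sum(is.na(Hitters$Salary))    #59 players are missing Salary
> Hitters=na.omit(Hitters)      #drop rows with missing values
> dim(Hitters)                  #263 rows remain
> sum(is.na(Hitters))           #no missing values left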
[1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
[7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
[13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
[19] "Salary" "NewLeague"
[1] 322 20
[1] 59
[1] 263 20
[1] 0
Next, we can use the pls package to fit the model. The pcr() function allows us to standardize the data and perform 10-fold cross-validation for each possible value of \(M\).
> library(pls) #load library
> set.seed(2) #make replicable
> #Fit PCR model
> #Standardize data with scale=TRUE
> #10 fold cross-validation for each value of M
> pcr.fit=pcr(Salary~.,data=Hitters, scale=TRUE,
+ validation="CV")We can view the resulting fit with summary. The CV score is provided for every value of \(M\) and represents the root mean squared error. To obtain the MSE we must square it.14 MSE of \(352.8^2=124,468\)
The summary() function also provides the percentage of variance explained in the predictors and in the response. For example, \(M=1\) captures \(38.31\%\) of the variance in the predictors and \(40.63\%\) of the variance in the response.
Data: X dimension: 263 19
Y dimension: 263 1
Fit method: svdpc
Number of components considered: 19
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
CV 452 348.9 352.2 353.5 352.8 350.1 349.1
adjCV 452 348.7 351.8 352.9 352.1 349.3 348.0
7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
CV 349.6 350.9 352.9 353.8 355.0 356.2 363.5
adjCV 348.5 349.8 351.6 352.3 353.4 354.5 361.6
14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
CV 355.2 357.4 347.6 350.1 349.2 352.6
adjCV 352.8 355.2 345.5 347.6 346.7 349.8
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
X 38.31 60.16 70.84 79.03 84.29 88.63 92.26 94.96
Salary 40.63 41.58 42.17 43.22 44.90 46.48 46.69 46.75
9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
X 96.28 97.26 97.98 98.65 99.15 99.47 99.75
Salary 46.86 47.76 47.82 47.85 48.10 50.40 50.55
16 comps 17 comps 18 comps 19 comps
X 99.89 99.97 99.99 100.00
Salary 53.01 53.85 54.61 54.61
We can see from the plot that the smallest root MSE occurs when \(M=16\). However, it is not much smaller than when \(M=1\), so a simpler model could be used. Also, \(M=19\) is equivalent to using all \(p\) predictors and amounts to the least squares fit, so PCR provides only a modest improvement over linear regression here.
FIGURE 2: Plot of Root MSE by the number of components
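A plot like Figure 2 can be produced with the pls package's validationplot() function (a sketch; val.type="RMSEP" plots the root mean squared error and is the default, while "MSEP" would plot the MSE):

> #Plot cross-validated root MSE against the number of components
> validationplot(pcr.fit,val.type="RMSEP")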
We can also perform PCR on a training set and then evaluate the model on a test set.
> set.seed(1) # make replicable
> #training set = half the data
> train=sample(1:nrow(Hitters),nrow(Hitters)/2)
> #test set will be remaining rows
> test=(-train)
> #response variable
> y=Hitters$Salary
> #response for test set
> y.test=y[test]
>
> set.seed(1) #make replicable
> #perform PCR on training data
> pcr.fit=pcr(Salary~.,data=Hitters,subset=train,
+ scale=TRUE, validation="CV")
> #View MSE plot
> validationplot(pcr.fit,val.type = "MSEP")

FIGURE 3: Plot of Root MSE by number of components for the training set
We can see from the plot that the lowest CV error occurs at \(M=7\).
Using the predict() function we find that the test MSE is \(96556\).
> #Predict on test set with M=7
> pcr.pred=predict(pcr.fit,Hitters[test,],ncomp=7)
> #MSE value
> mean((pcr.pred-y.test)^2)
[1] 96556.22
Finally, we can fit the full model using \(M=7\).
> #PCR on full dataset with M=7
> pcr.fit=pcr(Salary~.,data=Hitters,
+ scale=TRUE, ncomp=7)
> #view resulting fit
> summary(pcr.fit)
Data: X dimension: 263 19
Y dimension: 263 1
Fit method: svdpc
Number of components considered: 7
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
X 38.31 60.16 70.84 79.03 84.29 88.63 92.26
Salary 40.63 41.58 42.17 43.22 44.90 46.48 46.69
PLS Example
We can apply PLS with the plsr() function.
> set.seed(1)#make replicable
> #Fit PLS model on training set
> pls.fit=plsr(Salary~.,data=Hitters,subset=train,
+ scale=TRUE, validation="CV")
> #view model results
> summary(pls.fit)
Data: X dimension: 131 19
Y dimension: 131 1
Fit method: kernelpls
Number of components considered: 19
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
CV 464.6 394.2 391.5 393.1 395.0 415.0 424.0
adjCV 464.6 393.4 390.2 391.1 392.9 411.5 418.8
7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
CV 424.5 415.8 404.6 407.1 412.0 414.4 410.3
adjCV 418.9 411.4 400.7 402.2 407.2 409.3 405.6
14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
CV 406.2 408.6 410.5 408.8 407.8 410.2
adjCV 401.8 403.9 405.6 404.1 403.2 405.5
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
X 38.12 53.46 66.05 74.49 79.33 84.56 87.09 90.74
Salary 33.58 38.96 41.57 42.43 44.04 45.59 47.05 47.53
9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
X 92.55 93.94 97.23 97.88 98.35 98.85 99.11
Salary 48.42 49.68 50.04 50.54 50.78 50.92 51.04
16 comps 17 comps 18 comps 19 comps
X 99.43 99.78 99.99 100.00
Salary 51.11 51.15 51.16 51.18
FIGURE 4: Root MSE by PLS component
The lowest CV score occurs when \(M=2\).
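A plot like Figure 4 can be generated the same way as before (a sketch using the pls.fit object above; val.type controls whether the root MSE or the MSE is shown):

> #Plot cross-validated root MSE by number of PLS components
> validationplot(pls.fit,val.type="RMSEP")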
The test set MSE of \(101417\) is higher than it was for PCR.
> #predictions for the test set with M=2
> pls.pred=predict(pls.fit,Hitters[test,],ncomp=2)
> #MSE value
> mean((pls.pred-y.test)^2)
[1] 101417.5
Finally, we perform PLS using the full dataset with \(M=2\).
> #PLS model on full data with M=2
> pls.fit=plsr(Salary~.,data=Hitters,scale=TRUE,
+ ncomp=2)
> #View results
> summary(pls.fit)
Data: X dimension: 263 19
Y dimension: 263 1
Fit method: kernelpls
Number of components considered: 2
TRAINING: % variance explained
1 comps 2 comps
X 38.08 51.03
Salary 43.05 46.40