For these examples I will again be using the performance data of LPGA golfers from 2009, available at http://www.stat.ufl.edu/~winner/data/lpga2009.dat. The data set has 146 rows and contains the following variables:

  1. Golfer id
  2. average drive (yards)
  3. percent of fairways hit
  4. percent of greens reached in regulation
  5. average putts per round
  6. percent of sand saves (2 shots to hole)
  7. prize winnings ($1000s)
  8. ln(prize)
  9. tournaments played in
  10. green in regulation putts per hole
  11. completed tournaments
  12. average percentile in tournaments (high is good)
  13. rounds completed
  14. average strokes per round

We will again use the natural log of prize winnings ‘ln(prize)’ as our dependent variable. The rest of the variables, except for Golfer id and prize winnings, will be used as independent variables. The first method we will use here is Principal Components Regression, available through the pls library.

First, we will load the pls library and prepare the downloaded data for use in this example. This includes providing names for each of the columns in the data set.

library(pls)
## 
## Attaching package: 'pls'
## 
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(1234)
lpga <- read.table('lpga2009.dat',header=F)
names(lpga) <- c('ID','AvgDrv','PerFair','PerGreen','AvgPutts','PerSandSav','Prize','lnPrize','Tours','PuttsPH','CompTours','AvgPercentile','RoundComp','AvgStrokes')

The data is now copied to another object, the Golfer id and prize winnings columns are removed, and the remaining data set is split into training and test sets.

train <- sample(146,100)
golf <- lpga[,-c(1,7)] # columns for Golfer id and prize winnings are removed.
golf.train <- golf[train,] # training set
golf.test <- golf[-train,] # test set

We can now use the pcr() function from the pls library to fit a model on the training data, using all of the remaining variables to predict the natural log of prize winnings in thousands of dollars. The data will be scaled, and ten-fold cross-validation will be applied via validation='CV' to estimate the error for each number of principal components used. The results can be examined via summary() or plotted through validationplot(); setting val.type='MSEP' plots the cross-validated mean squared error of prediction.

pcr.golf.train <- pcr(lnPrize~.,data=golf.train,scale=TRUE,validation='CV')
validationplot(pcr.golf.train,val.type='MSEP')

summary(pcr.golf.train)
## Data:    X dimension: 100 11 
##  Y dimension: 100 1
## Fit method: svdpc
## Number of components considered: 11
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           1.286   0.4291   0.4342   0.4330   0.4344   0.4327   0.4251
## adjCV        1.286   0.4283   0.4330   0.4317   0.4332   0.4312   0.4228
##        7 comps  8 comps  9 comps  10 comps  11 comps
## CV      0.4251   0.4322   0.4301    0.4334    0.4618
## adjCV   0.4228   0.4295   0.4272    0.4297    0.4562
## 
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X          53.39    69.72    81.03    89.51    95.83    97.64    98.70
## lnPrize    89.38    89.46    89.60    89.79    90.03    90.98    91.06
##          8 comps  9 comps  10 comps  11 comps
## X          99.44    99.76     99.91    100.00
## lnPrize    91.06    91.33     91.67     91.69

As the plot shows, a great deal of error is eliminated by the first principal component, with little to be gained after it. The summary tells the same story: beyond the first component there is only a modest improvement in cross-validation error or in the percent of variance explained in the dependent variable.
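
Rather than reading the minimum off the summary table by eye, the cross-validated errors can also be queried programmatically. The lines below are a minimal sketch (output not shown), using the RMSEP() helper from the pls library and, if your version of pls provides it, selectNcomp() for a more parsimonious choice.

cv.rmsep <- RMSEP(pcr.golf.train,estimate='CV')$val[1,1,] # cross-validated RMSEP for 0 to 11 components
which.min(cv.rmsep) - 1 # subtract 1 for the intercept-only entry
selectNcomp(pcr.golf.train,method='onesigma') # smallest model within one standard error of the minimum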

One may note that the cross-validation error is smallest with six components. However, the small reduction in error may not be worth the five additional principal components compared with a simpler model. To find out, we will compute mean squared errors on the test data for principal components regression models using one and six components.

pcr.golf.pred <- predict(pcr.golf.train,golf.test,ncomp=1) # using one component
mean((pcr.golf.pred - golf.test$lnPrize)^2)
## [1] 0.3132283
pcr.golf.pred2 <- predict(pcr.golf.train,golf.test,ncomp=6) # using six components
mean((pcr.golf.pred2 - golf.test$lnPrize)^2)
## [1] 0.2947868

As you can see, the six-component model offers only a small reduction in test mean squared error over the one-component model. This is also evident when summary() is used on one- and six-component models fit to the whole data set.

pcr.golf.final1 <- pcr(lnPrize~.,data=golf,scale=TRUE,ncomp=1)
pcr.golf.final2 <- pcr(lnPrize~.,data=golf,scale=TRUE,ncomp=6)
summary(pcr.golf.final1)
## Data:    X dimension: 146 11 
##  Y dimension: 146 1
## Fit method: svdpc
## Number of components considered: 1
## TRAINING: % variance explained
##          1 comps
## X          53.34
## lnPrize    89.25
summary(pcr.golf.final2)
## Data:    X dimension: 146 11 
##  Y dimension: 146 1
## Fit method: svdpc
## Number of components considered: 6
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X          53.34    68.69    80.78    89.27    95.70    97.80
## lnPrize    89.25    89.26    89.26    89.27    89.27    90.62
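
Before moving on, it can be useful to see what these fits imply for the individual predictors. The lines below are a minimal sketch (output not shown) using the coef() method that the pls library provides for fitted models; they print the regression coefficients implied by the one- and six-component fits (note that scale=TRUE was used, which affects the scale of the coefficients).

round(coef(pcr.golf.final1,ncomp=1),3) # coefficients implied by the one-component model
round(coef(pcr.golf.final2,ncomp=6),3) # coefficients implied by the six-component model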

We will also examine the data using Partial Least Squares. The method is also available from the pls library, via the plsr() function, and the syntax is similar to pcr(). We will again fit a model using the natural log of the prize winnings in thousands of dollars as the dependent variable. The summary() function is then used to show the cross-validation error and the percent of variance explained.

set.seed(5678)
pls.golf <- plsr(lnPrize~.,data=golf.train,scale=TRUE,validation='CV')
summary(pls.golf)
## Data:    X dimension: 100 11 
##  Y dimension: 100 1
## Fit method: kernelpls
## Number of components considered: 11
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           1.286   0.4342   0.4377   0.4330   0.4276   0.4262   0.4300
## adjCV        1.286   0.4330   0.4353   0.4308   0.4252   0.4238   0.4278
##        7 comps  8 comps  9 comps  10 comps  11 comps
## CV      0.4367   0.4347   0.4371    0.4452    0.4648
## adjCV   0.4330   0.4314   0.4335    0.4408    0.4589
## 
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X          53.38    63.63    73.77    81.20    89.43    96.12    98.06
## lnPrize    89.66    90.46    90.87    91.19    91.28    91.36    91.59
##          8 comps  9 comps  10 comps  11 comps
## X          98.93    99.19     99.86    100.00
## lnPrize    91.65    91.68     91.69     91.69

From the summary, we can see that the cross-validation error is minimized with five components. With this in mind, we predict on the test data using five components and compute the test mean squared error. A five-component model is then fit to the whole data set and its summary examined.

pls.golf.pred <- predict(pls.golf,golf.test,ncomp=5)
mean((pls.golf.pred - golf.test$lnPrize)^2)
## [1] 0.285122
pls.golf.final <- plsr(lnPrize~.,data=golf,scale=TRUE,ncomp=5)
summary(pls.golf.final)
## Data:    X dimension: 146 11 
##  Y dimension: 146 1
## Fit method: kernelpls
## Number of components considered: 5
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps
## X          53.34    57.70    69.61    79.79    87.25
## lnPrize    89.37    90.57    90.78    90.82    90.92
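
The same kind of inspection applies to the final partial least squares model. As a closing sketch (output not shown), the lines below print the regression coefficients implied by the five-component fit and its predictions for the first few golfers in the full data set, using the standard coef() and predict() methods for pls model objects.

round(coef(pls.golf.final,ncomp=5),3) # coefficients implied by the five-component fit
predict(pls.golf.final,golf,ncomp=5)[1:6,1,1] # predicted ln(prize) for the first six golfers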