For these examples I will be using, again, the performance data of LPGA golfers from 2009, available at http://www.stat.ufl.edu/~winner/data/lpga2009.dat. The data set is made up of 146 rows and 14 columns; the variables are named when the data are read in below.
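If the file is not already in the working directory, it can be fetched with base R's download.file(); a minimal sketch, assuming the URL above is still live:

download.file('http://www.stat.ufl.edu/~winner/data/lpga2009.dat',
              destfile='lpga2009.dat') # save locally for read.table() below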
We will again use the natural log of prize winnings, ‘lnPrize’, as our dependent variable. The remaining variables, except for golfer ID and raw prize winnings, will be used as independent variables. The first method we will use here is principal components regression (PCR), available through the pls library.
First, we will load the pls library and prepare the downloaded data for use in this example. This includes providing names for each of the columns in the data set.
library(pls)
##
## Attaching package: 'pls'
##
## The following object is masked from 'package:stats':
##
## loadings
set.seed(1234)
lpga <- read.table('lpga2009.dat',header=F)
names(lpga) <- c('ID','AvgDrv','PerFair','PerGreen','AvgPutts','PerSandSav','Prize','lnPrize','Tours','PuttsPH','CompTours','AvgPercentile','RoundComp','AvgStrokes')
The data set is now copied to another object, the golfer ID and raw prize winnings columns are removed, and the remaining columns are split into training and test sets.
train <- sample(146,100)
golf <- lpga[,-c(1,7)] # columns for Golfer id and prize winnings are removed.
golf.train <- golf[train,] # training set
golf.test <- golf[-train,] # test set
We can now use the pcr() function from the pls library to fit a model on the training data, using all of the other variables to explain the natural log of prize money (in thousands of dollars). The data will be scaled, and ten-fold cross-validation will be applied via validation=‘CV’ to estimate the error at each number of principal components. The results can be examined via summary() or plotted through validationplot(). Passing val.type=‘MSEP’ to validationplot() plots the cross-validation scores as mean squared error of prediction.
pcr.golf.train <- pcr(lnPrize~.,data=golf.train,scale=TRUE,validation='CV')
validationplot(pcr.golf.train,val.type='MSEP')
summary(pcr.golf.train)
## Data: X dimension: 100 11
## Y dimension: 100 1
## Fit method: svdpc
## Number of components considered: 11
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 1.286 0.4291 0.4342 0.4330 0.4344 0.4327 0.4251
## adjCV 1.286 0.4283 0.4330 0.4317 0.4332 0.4312 0.4228
## 7 comps 8 comps 9 comps 10 comps 11 comps
## CV 0.4251 0.4322 0.4301 0.4334 0.4618
## adjCV 0.4228 0.4295 0.4272 0.4297 0.4562
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 53.39 69.72 81.03 89.51 95.83 97.64 98.70
## lnPrize 89.38 89.46 89.60 89.79 90.03 90.98 91.06
## 8 comps 9 comps 10 comps 11 comps
## X 99.44 99.76 99.91 100.00
## lnPrize 91.06 91.33 91.67 91.69
As the plot shows, the bulk of the error is eliminated by the first principal component, and little is gained afterward. The summary tells the same story: beyond the first component, neither the cross-validation error nor the percentage of variance explained in the dependent variable improves by much.
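Rather than reading the minimum off the plot, the cross-validated MSEP values can be extracted numerically. A short sketch: MSEP() is part of pls, while selectNcomp() assumes a reasonably recent version of the package.

cv.msep <- MSEP(pcr.golf.train)$val['CV',1,] # CV estimates, incl. intercept-only model
which.min(cv.msep) - 1 # subtract 1 for the intercept entry
selectNcomp(pcr.golf.train,method='onesigma') # one-standard-error rule; favors fewer components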
One may note that the cross-validation error is lowest with six components. However, the small improvement in error may not be worth the added complexity of five extra components. To find out, we will compute mean squared errors on the test data for principal components regression models using one and six components.
pcr.golf.pred <- predict(pcr.golf.train,golf.test,ncomp=1) # using one component
mean((pcr.golf.pred - golf.test$lnPrize)^2)
## [1] 0.3132283
pcr.golf.pred2 <- predict(pcr.golf.train,golf.test,ncomp=6) # using six components
mean((pcr.golf.pred2 - golf.test$lnPrize)^2)
## [1] 0.2947868
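To quantify the difference, the relative reduction in test error can be computed from the two predictions above; a quick sketch:

mse1 <- mean((pcr.golf.pred - golf.test$lnPrize)^2) # one-component test MSE
mse6 <- mean((pcr.golf.pred2 - golf.test$lnPrize)^2) # six-component test MSE
(mse1 - mse6)/mse1 # roughly 0.06, i.e. about a 6% reduction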
As you can see, the gain in test mean squared error from the six-component model is modest, roughly a 6% reduction. The similarity of the two models is also evident when summary() is applied to one- and six-component fits on the whole data set.
pcr.golf.final1 <- pcr(lnPrize~.,data=golf,scale=TRUE,ncomp=1)
pcr.golf.final2 <- pcr(lnPrize~.,data=golf,scale=TRUE,ncomp=6)
summary(pcr.golf.final1)
## Data: X dimension: 146 11
## Y dimension: 146 1
## Fit method: svdpc
## Number of components considered: 1
## TRAINING: % variance explained
## 1 comps
## X 53.34
## lnPrize 89.25
summary(pcr.golf.final2)
## Data: X dimension: 146 11
## Y dimension: 146 1
## Fit method: svdpc
## Number of components considered: 6
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 53.34 68.69 80.78 89.27 95.70 97.80
## lnPrize 89.25 89.26 89.26 89.27 89.27 90.62
We will also examine the data using partial least squares (PLS). This method is also available from the pls library, via the plsr() function, and its syntax is similar to pcr(). We will again fit a model using the natural log of prize winnings in thousands of dollars as the dependent variable. The summary() function is then used to show the cross-validation error and the percentage of variance explained.
set.seed(5678)
pls.golf <- plsr(lnPrize~.,data=golf.train,scale=TRUE,validation='CV')
summary(pls.golf)
## Data: X dimension: 100 11
## Y dimension: 100 1
## Fit method: kernelpls
## Number of components considered: 11
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 1.286 0.4342 0.4377 0.4330 0.4276 0.4262 0.4300
## adjCV 1.286 0.4330 0.4353 0.4308 0.4252 0.4238 0.4278
## 7 comps 8 comps 9 comps 10 comps 11 comps
## CV 0.4367 0.4347 0.4371 0.4452 0.4648
## adjCV 0.4330 0.4314 0.4335 0.4408 0.4589
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 53.38 63.63 73.77 81.20 89.43 96.12 98.06
## lnPrize 89.66 90.46 90.87 91.19 91.28 91.36 91.59
## 8 comps 9 comps 10 comps 11 comps
## X 98.93 99.19 99.86 100.00
## lnPrize 91.65 91.68 91.69 91.69
From the summary, we can see that the minimum cross-validation error occurs with five components. With this in mind, we construct a predictive model with five components and measure it against the test data. Finally, a five-component model is fitted to the whole data set and its summary is examined.
pls.golf.pred <- predict(pls.golf,golf.test,ncomp=5)
mean((pls.golf.pred - golf.test$lnPrize)^2)
## [1] 0.285122
pls.golf.final <- plsr(lnPrize~.,data=golf,scale=TRUE,ncomp=5)
summary(pls.golf.final)
## Data: X dimension: 146 11
## Y dimension: 146 1
## Fit method: kernelpls
## Number of components considered: 5
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 53.34 57.70 69.61 79.79 87.25
## lnPrize 89.37 90.57 90.78 90.82 90.92
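To see how the individual predictors contribute to the final five-component fit, the regression coefficients (on the scaled predictors) can be inspected with coef(); a brief sketch, output omitted:

round(coef(pls.golf.final,ncomp=5),3) # coefficients for the 5-component model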