Joel Correa da Rosa
December 02, 2015
Models for predicting outcomes in classification and regression problems have been developed in many different areas, and there is a growing overlap between them.
The general goal is to build a model that can accurately (better than random) predict out-of-sample cases.
It is not sufficient to have a good model: one must show that it works on unseen data and does better than random prediction.
Classic statistical inference plays a minor role here.
Let's simulate some data with a weak linear relationship between Y and X.
set.seed(123)
# a standard normal predictor
X<-rnorm(100,0,1)
# an outcome only weakly correlated with X
Y<-0.5+0.1*X+rnorm(100,0,2)
# generating indexes for training
ind.train<-sample(1:100,85)
# training sets
Xtrain<-X[ind.train]
Ytrain<-Y[ind.train]
# test sets
Xtest<-X[-ind.train]
Ytest<-Y[-ind.train]
# Data frames for training and test
Dtrain = cbind.data.frame(X= Xtrain, Y = Ytrain)
Dtest = cbind.data.frame(X= Xtest, Y = Ytest)
# fitting a linear regression
fit.lm<-lm(Y~X,data=Dtrain)
summary(fit.lm)
Call:
lm(formula = Y ~ X, data = Dtrain)
Residuals:
Min 1Q Median 3Q Max
-3.9442 -1.5268 -0.0528 1.2107 6.3753
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.45785 0.21772 2.103 0.0385 *
X -0.04619 0.22930 -0.201 0.8409
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.996 on 83 degrees of freedom
Multiple R-squared: 0.0004886, Adjusted R-squared: -0.01155
F-statistic: 0.04057 on 1 and 83 DF, p-value: 0.8409
# prediction in training set
pred.lm.train<-predict(fit.lm,newdata=Dtrain)
# predictions in test set
pred.lm.test<-predict(fit.lm,newdata=Dtest)
# let's evaluate the quality of the predictions
# in-sample: regress observed on predicted without an intercept;
# a slope near 1 with a significant fit indicates useful predictions
summary(lm(Ytrain ~ -1+pred.lm.train))
Call:
lm(formula = Ytrain ~ -1 + pred.lm.train)
Residuals:
Min 1Q Median 3Q Max
-3.9442 -1.5268 -0.0528 1.2107 6.3753
Coefficients:
Estimate Std. Error t value Pr(>|t|)
pred.lm.train 1.0000 0.4727 2.116 0.0373 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.984 on 84 degrees of freedom
Multiple R-squared: 0.05059, Adjusted R-squared: 0.03929
F-statistic: 4.476 on 1 and 84 DF, p-value: 0.03733
# out-of-sample
summary(lm(Ytest ~ -1+pred.lm.test))
Call:
lm(formula = Ytest ~ -1 + pred.lm.test)
Residuals:
Min 1Q Median 3Q Max
-2.0969 -1.1133 0.3382 0.4633 2.6611
Coefficients:
Estimate Std. Error t value Pr(>|t|)
pred.lm.test -1.3630 0.7369 -1.85 0.0856 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.304 on 14 degrees of freedom
Multiple R-squared: 0.1964, Adjusted R-squared: 0.139
F-statistic: 3.421 on 1 and 14 DF, p-value: 0.08559
Application of a decision tree to model the relationship between X and Y.
require(rpart)
# fitting a regression tree
fit.rp<-rpart(Y~X,data=Dtrain)
# prediction in training set
pred.rp.train<-predict(fit.rp,newdata=Dtrain)
# predictions in test set
pred.rp.test<-predict(fit.rp,newdata=Dtest)
# let's evaluate the quality of the predictions
# in-sample
summary(lm(Ytrain ~ -1+pred.rp.train))
Call:
lm(formula = Ytrain ~ -1 + pred.rp.train)
Residuals:
Min 1Q Median 3Q Max
-4.0729 -1.4916 -0.1019 1.2215 4.9288
Coefficients:
Estimate Std. Error t value Pr(>|t|)
pred.rp.train 1.0000 0.1962 5.098 2.09e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.78 on 84 degrees of freedom
Multiple R-squared: 0.2363, Adjusted R-squared: 0.2272
F-statistic: 25.98 on 1 and 84 DF, p-value: 2.092e-06
# out-of-sample
summary(lm(Ytest ~ -1+pred.rp.test))
Call:
lm(formula = Ytest ~ -1 + pred.rp.test)
Residuals:
Min 1Q Median 3Q Max
-1.9756 -1.3492 0.1331 0.6140 2.1433
Coefficients:
Estimate Std. Error t value Pr(>|t|)
pred.rp.test -0.7114 0.3240 -2.196 0.0455 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.255 on 14 degrees of freedom
Multiple R-squared: 0.2562, Adjusted R-squared: 0.203
F-statistic: 4.822 on 1 and 14 DF, p-value: 0.04546
A 'good model' can perform very badly on new samples, and a 'bad model' can perform very well on new samples. This is usually related to overfitting, as the sketch below illustrates.
good = high explanation of the variation in the training data
bad = low explanation of the variation in the training data
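As a quick illustration (our own sketch, reusing the simulated split above), a very flexible model can explain the training data well and still predict poorly out of sample:

# a minimal sketch of overfitting: a high-degree polynomial
fit.poly <- lm(Y ~ poly(X, 15), data = Dtrain)
# in-sample R-squared is inflated by the flexible fit
summary(fit.poly)$r.squared
# out-of-sample mean squared error, typically worse than the linear model's
mean((Ytest - predict(fit.poly, newdata = Dtest))^2)
mean((Ytest - predict(fit.lm, newdata = Dtest))^2)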
caret (classification and regression training)
caret is a powerful framework for fitting predictive models for classification and regression in R.
require(caret)
The characteristics of the available algorithms can be found here:
http://rpackages.ianhowson.com/cran/caret/man/models.html
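The available methods can also be browsed from within R with caret's getModelInfo function:

# list the model codes accepted by train()
models <- names(getModelInfo())
length(models)
head(models)
# tuning parameters of a specific method, e.g. rpart
getModelInfo('rpart', regex = FALSE)[[1]]$parameters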
Download data from Agresti.
require(caret)
require(foreign)
cdata <- read.dta("http://www.ats.ucla.edu/stat/data/crime.dta")
summary(cdata)
sid state crime murder
Min. : 1.0 Length:51 Min. : 82.0 Min. : 1.600
1st Qu.:13.5 Class :character 1st Qu.: 326.5 1st Qu.: 3.900
Median :26.0 Mode :character Median : 515.0 Median : 6.800
Mean :26.0 Mean : 612.8 Mean : 8.727
3rd Qu.:38.5 3rd Qu.: 773.0 3rd Qu.:10.350
Max. :51.0 Max. :2922.0 Max. :78.500
pctmetro pctwhite pcths poverty
Min. : 24.00 Min. :31.80 Min. :64.30 Min. : 8.00
1st Qu.: 49.55 1st Qu.:79.35 1st Qu.:73.50 1st Qu.:10.70
Median : 69.80 Median :87.60 Median :76.70 Median :13.10
Mean : 67.39 Mean :84.12 Mean :76.22 Mean :14.26
3rd Qu.: 83.95 3rd Qu.:92.60 3rd Qu.:80.10 3rd Qu.:17.40
Max. :100.00 Max. :98.50 Max. :86.60 Max. :26.40
single
Min. : 8.40
1st Qu.:10.05
Median :10.90
Mean :11.33
3rd Qu.:12.05
Max. :22.10
For our data analysis below, we will use the crime dataset that appears in Statistical Methods for Social Sciences, Third Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997). The variables are state id (sid), state name (state), violent crimes per 100,000 people (crime), murders per 1,000,000 (murder), the percent of the population living in metropolitan areas (pctmetro), the percent of the population that is white (pctwhite), percent of population with a high school education or above (pcths), percent of population living under poverty line (poverty), and percent of population that are single parents (single). It has 51 observations. We are going to exclude sid, state and murder.
Create data partition, 80% training, 20% testing for predicting crime.
# subsetting to exclude state id, state name and murder
mydata<-subset(cdata,select=-c(sid,state,murder))
inTrain<-createDataPartition(mydata$crime,
p=.8,list=TRUE,times=10)
mydata.train<-mydata[inTrain[[1]],]
mydata.test<-mydata[-inTrain[[1]],]
Create bootstrap resamples of the data with createResample.
# create data resamples
inResample<-createResample(mydata$crime,
list=TRUE,
times=10)
mydata.resample<-mydata[inResample[[1]],]
The createFolds function is very useful for performing cross-validation.
# creating folds
myFolds<-createFolds(mydata$crime,
list=TRUE,
returnTrain=TRUE
)
mydata.Fold<-mydata
Note: for time series analysis there is a function called createTimeSlices.
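A minimal sketch of createTimeSlices (the window sizes here are illustrative, not from the original analysis):

# rolling-origin slices for a length-100 series:
# train on 20 consecutive points, forecast the next 5
slices <- createTimeSlices(1:100, initialWindow = 20,
                           horizon = 5, fixedWindow = TRUE)
str(slices$train[[1]])
str(slices$test[[1]])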
Resamples, folds or partitions can be used as elements to tune parameters in some of the available models.
Methods for tuning the algorithms:
fitControl<-trainControl(
method = "repeatedcv",
number = 10,
repeats = 1
)
Example: CART (Classification and Regression Trees). This algorithm has one tuning parameter, the complexity parameter (cp) of the tree.
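As a sketch, an explicit grid of cp values (chosen arbitrarily here) can be passed through tuneGrid instead of letting train pick a default grid:

# supplying our own grid of complexity parameters
cpGrid <- data.frame(cp = seq(0, 0.3, by = 0.05))
trainRpGrid <- train(crime ~ ., data = mydata.train,
                     method = 'rpart', metric = 'RMSE',
                     tuneGrid = cpGrid)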
Case 1: Based on the traditional partitioning into training/test sets
trainRp1 <-train(crime~.,
data=mydata.train,
method='rpart',
metric='RMSE'
)
trainRp1
CART
43 samples
5 predictors
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 43, 43, 43, 43, 43, 43, ...
Resampling results across tuning parameters:
cp RMSE Rsquared RMSE SD Rsquared SD
0.0000000 414.9420 0.3280516 164.4986 0.1422958
0.1354936 443.2821 0.2542748 163.1864 0.1341698
0.3580494 452.6743 0.2392352 158.3217 0.1214761
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.
Case 2: Based on resampled data.
trainRp2 <-train(crime~.,
data=mydata.resample,
method='rpart',
metric ='RMSE'
)
trainRp2
CART
51 samples
5 predictors
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 51, 51, 51, 51, 51, 51, ...
Resampling results across tuning parameters:
cp RMSE Rsquared RMSE SD Rsquared SD
0.01613366 370.5856 0.4375553 148.2835 0.1865361
0.11139495 380.5858 0.4032198 159.8199 0.2395796
0.41249488 392.3272 0.3320403 147.0896 0.2276589
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.01613366.
Case 3: Based on the 10 created folds
fitControl<-trainControl(method = "repeatedcv",
number = 10,
repeats = 1,
index = myFolds,
savePredictions = TRUE)
trainRp3 <-train(crime~.,
data=mydata,
method='rpart',
metric='RMSE',
trControl=fitControl
)
trainRp3
CART
51 samples
5 predictors
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times)
Summary of sample sizes: 47, 46, 45, 46, 45, 46, ...
Resampling results across tuning parameters:
cp RMSE Rsquared RMSE SD Rsquared SD
0.03311338 321.4761 0.5597865 228.7972 0.25321420
0.15841943 336.9384 0.5024808 242.2787 0.27447612
0.33294737 387.6255 0.1674825 215.4202 0.09272875
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.03311338.
A great amount of information is stored in the object generated by the train command.
# renaming the object
rx<-trainRp3
# best tuning
rx$bestTune
cp
1 0.03311338
# results for different tuning parameters
rx$results
cp RMSE Rsquared RMSESD RsquaredSD
1 0.03311338 321.4761 0.5597865 228.7972 0.25321420
2 0.15841943 336.9384 0.5024808 242.2787 0.27447612
3 0.33294737 387.6255 0.1674825 215.4202 0.09272875
# predictions for each observation in each fold
head(rx$pred)
pred obs rowIndex cp Resample
1 390.4167 1206 9 0.03311338 Fold01
2 390.4167 496 16 0.03311338 Fold01
3 390.4167 138 30 0.03311338 Fold01
4 390.4167 627 31 0.03311338 Fold01
5 407.4828 1206 9 0.15841943 Fold01
6 407.4828 496 16 0.15841943 Fold01
# check that the rows held out of the first fold are exactly the ones predicted in Fold01
setdiff(1:dim(mydata.Fold)[1],myFolds[[1]])
[1] 9 16 30 31
A useful feature of caret is its measure of variable importance.
vip.rp<-varImp(trainRp3)
vip.rp
rpart variable importance
Overall
pctmetro 100.00
single 84.92
pctwhite 66.42
poverty 49.24
pcths 0.00
Information about variable importance: http://topepo.github.io/caret/varimp.html
Let's see what is inside the final model and look at the configuration of the final tree.
final.Tree<-trainRp3$finalModel
final.Tree
n= 51
node), split, n, deviance, yval
* denotes terminal node
1) root 51 9728475.0 612.8431
2) pctwhite>=88.65 24 688536.0 345.5417 *
3) pctwhite< 88.65 27 5800869.0 850.4444
6) pctmetro< 84.75 20 809479.8 709.1000 *
7) pctmetro>=84.75 7 3450209.0 1254.2860 *
#plot(final.Tree)
#text(final.Tree)
fitControl<-trainControl(method = "repeatedcv",
number = 10,
repeats = 1,
index = myFolds,
savePredictions = TRUE,
preProcOptions = c('center','scale'))
# note: preProcOptions only passes options on to preProcess();
# to actually center and scale, supply preProcess = c('center','scale')
# to train() (hence 'No pre-processing' in the output below)
trainMARS <-train(crime~.,
data=mydata,
method='earth',
metric='RMSE',
trControl=fitControl
)
trainMARS
Multivariate Adaptive Regression Spline
51 samples
5 predictors
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times)
Summary of sample sizes: 47, 46, 45, 46, 45, 46, ...
Resampling results across tuning parameters:
nprune RMSE Rsquared RMSE SD Rsquared SD
2 350.3875 0.4072159 232.7888 0.2685322
8 246.8368 0.6964006 200.6891 0.2582312
14 246.8368 0.6964006 200.6891 0.2582312
Tuning parameter 'degree' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 8 and degree = 1.
fitControl<-trainControl(method = "repeatedcv",
number = 10,
repeats = 1,
index = myFolds,
savePredictions = TRUE,
preProcOptions = c('center','scale'))
trainGLM <-train(crime~.,
data=mydata,
method='glm',
metric='RMSE',
trControl=fitControl
)
trainGLM
Generalized Linear Model
51 samples
5 predictors
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times)
Summary of sample sizes: 47, 46, 45, 46, 45, 46, ...
Resampling results
RMSE Rsquared RMSE SD Rsquared SD
231.1326 0.7169405 97.47757 0.2534948
results <- resamples(list(GLM=trainGLM, CART=trainRp3, MARS=trainMARS),metric='RMSE')
summary(results)
Call:
summary.resamples(object = results)
Models: GLM, CART, MARS
Number of resamples: 10
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
GLM 126.20 146.4 212.5 231.1 309.6 409.4 0
CART 126.80 187.1 227.8 321.5 384.8 895.6 0
MARS 62.94 148.9 204.8 246.8 228.4 785.0 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
GLM 0.286200 0.5851 0.7917 0.7169 0.9224 0.9832 0
CART 0.004113 0.5720 0.6172 0.5598 0.6754 0.8862 1
MARS 0.224100 0.5713 0.7766 0.6964 0.8789 0.9547 0
bwplot(results,metric='RMSE')
bwplot(results,metric='Rsquared')
dotplot(results,metric='RMSE')
dotplot(results,metric='Rsquared')
Load the Boston housing data from the MASS package.
require(MASS)
data(Boston)
head(Boston)
crim zn indus chas nox rm age dis rad tax ptratio black
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
lstat medv
1 4.98 24.0
2 9.14 21.6
3 4.03 34.7
4 2.94 33.4
5 5.33 36.2
6 5.21 28.7
The outcome variable is medv, the median value of housing in a Boston district. Extract 20% of the data for testing and fit the models with the remaining 80%. Compare the predictions in the test sample across the different algorithms.
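One possible sketch of this exercise (the model choices here are ours, for illustration):

# 80/20 split on medv, then compare two algorithms on the test set
inTrainB <- createDataPartition(Boston$medv, p = 0.8, list = FALSE)
boston.train <- Boston[inTrainB, ]
boston.test  <- Boston[-inTrainB, ]
fitB.glm <- train(medv ~ ., data = boston.train, method = 'glm')
fitB.rp  <- train(medv ~ ., data = boston.train, method = 'rpart')
# out-of-sample RMSE for each model
sqrt(mean((boston.test$medv - predict(fitB.glm, boston.test))^2))
sqrt(mean((boston.test$medv - predict(fitB.rp,  boston.test))^2))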
require(kernlab)
data(spam)
help(spam)
indtrain<-createDataPartition(spam$type,p=0.7,times=1,list=FALSE)
spam.train<-spam[indtrain,]
spam.test<-spam[-indtrain,]
fitControl<-trainControl(method = "repeatedcv",
number = 10,
repeats = 1,
savePredictions = TRUE,
preProcOptions = c('center','scale'))
spam.glm<-train(type~.,data=spam.train,method='glm')
spam.glm
Generalized Linear Model
3222 samples
57 predictors
2 classes: 'nonspam', 'spam'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 3222, 3222, 3222, 3222, 3222, 3222, ...
Resampling results
Accuracy Kappa Accuracy SD Kappa SD
0.9166912 0.8251432 0.01707908 0.03282495
Predicted Values
pred.spam<-predict(spam.glm,newdata=spam.test,type='raw')
Confusion Matrix
cM<-table(pred.spam,spam.test$type)
cM
pred.spam nonspam spam
nonspam 793 55
spam 43 488
confusionMatrix(cM)
Confusion Matrix and Statistics
pred.spam nonspam spam
nonspam 793 55
spam 43 488
Accuracy : 0.9289
95% CI : (0.9141, 0.9419)
No Information Rate : 0.6062
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8506
Mcnemar's Test P-Value : 0.2665
Sensitivity : 0.9486
Specificity : 0.8987
Pos Pred Value : 0.9351
Neg Pred Value : 0.9190
Prevalence : 0.6062
Detection Rate : 0.5751
Detection Prevalence : 0.6149
Balanced Accuracy : 0.9236
'Positive' Class : nonspam
require(pROC)
pred.spam<-predict(spam.glm,newdata=spam.test,type='prob')
rocobj <- plot.roc(ifelse(spam.test$type=='spam',1,0),pred.spam[,'spam'])
plot(rocobj)
Call:
plot.roc.default(x = ifelse(spam.test$type == "spam", 1, 0), predictor = pred.spam[, "spam"])
Data: pred.spam[, "spam"] in 836 controls (ifelse(spam.test$type == "spam", 1, 0) 0) < 543 cases (ifelse(spam.test$type == "spam", 1, 0) 1).
Area under the curve: 0.9765
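A suggested operating point can be extracted from the fitted curve with pROC's coords function (shown as a sketch; the 'best' rule defaults to Youden's index):

# threshold maximizing sensitivity + specificity (Youden's J)
coords(rocobj, 'best', ret = c('threshold', 'sensitivity', 'specificity'))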
require(pls)
require(glmnet)
require(randomForest)
All these models can be applied to regression and classification (binary or multiclass) problems.
\( \lambda \): magnitude of the penalty
\( \alpha \): elastic net mixing parameter, ranging from 0 to 1. In particular, \( \alpha = 1 \) gives the LASSO and \( \alpha = 0 \) gives ridge regression.
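For reference, the elastic net penalty combining the two parameters can be written, in the standard glmnet parameterization, as

\[ \lambda \left[ \frac{(1-\alpha)}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right] \]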
Dimension reduction by estimating scores for latent factors (in both Y and X), followed by regression.
Instead of creating factors to explain the variation in \( X'X \), it creates factors to explain the variation in \( X'Y \).
It is an ensemble technique that builds a multitude of trees and predicts the class by majority vote, or the quantitative outcome by averaging the predictions of the individual trees.
It is a successful example of how combining weak predictors can generate good models. Be careful: it can easily overfit.
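A minimal randomForest sketch on illustrative data (iris, not from the original):

require(randomForest)
# 500 trees vote on the species of each iris flower
fit.iris <- randomForest(Species ~ ., data = iris, ntree = 500)
# out-of-bag confusion matrix
fit.iris$confusion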
Downloading the leukemia dataset from the plsgenomics package.
require('plsgenomics')
data(leukemia)
# matrix of gene expressions
X<-leukemia$X
# cancer status
Y<-factor(leukemia$Y)
# number of markers
dimP<-dim(leukemia$X)[2]
dimP
[1] 3051
# number of observations
dimN<-dim(leukemia$X)[1]
dimN
[1] 38
help("leukemia")
require('pls')
require('plsgenomics')
require('randomForest')
myControl<-trainControl(method = "repeatedcv",
number = 10,
repeats = 1,
savePredictions =TRUE)
# glmnet
fit.glm<-train(X,Y,method='glmnet',tuneGrid=data.frame(alpha=0,lambda=1),trControl = myControl)
# pls
fit.pls<-train(X,Y,method='pls',trControl=myControl)
# random forest
fit.rf<-train(X,Y,method='rf',tuneGrid = data.frame(mtry=2),trControl=myControl)
results <- resamples(list(GLM=fit.glm, PLS=fit.pls, RF=fit.rf))
summary(results)
Call:
summary.resamples(object = results)
Models: GLM, PLS, RF
Number of resamples: 10
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
GLM 1.0000 1.00 1.00 1.0000 1 1 0
PLS 1.0000 1.00 1.00 1.0000 1 1 0
RF 0.6667 0.75 0.75 0.8417 1 1 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
GLM 1 1 1.00 1.00 1 1 0
PLS 1 1 1.00 1.00 1 1 0
RF 0 0 0.25 0.45 1 1 0
bwplot(results)
dotplot(results)
The estimated tuning parameters:
# pls tuning parameters
fit.pls$bestTune
ncomp
1 1
# glmnet tuning parameter
fit.glm$bestTune
alpha lambda
1 0 1
# random forests tuning parameter
fit.rf$bestTune
mtry
1 2
coef<-fit.pls$finalModel$coefficients
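To see which genes drive the PLS model, the coefficients can be ranked by absolute size (a sketch, assuming the usual pls layout of the coefficient array: predictors x responses x components):

# coefficients for the first response and the selected component
cf <- coef[, 1, 1]
# the ten genes with the largest absolute PLS coefficients
head(sort(abs(cf), decreasing = TRUE), 10)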
Download the data from the link:
load('hiv.rda')
# training data
Data.Train <-cbind.data.frame(X = hiv.train$x , Y = hiv.train$y)
# test data
Data.Test <-cbind.data.frame(X = hiv.test$x , Y = hiv.test$y)
hivControl<-trainControl(method = "repeatedcv",
number = 10,
repeats = 1,
savePredictions =TRUE)
# glmnet
fit.glm<-train(Y~.,data=Data.Train,method='glmnet',tuneLength=10,trControl =hivControl)
# pls
fit.pls<-train(Y~.,data=Data.Train,method='pls',trControl=hivControl)
# random forest
fit.rf<-train(Y~.,data=Data.Train,method='rf',tuneGrid = data.frame(mtry=c(2,5,10,20,30)),trControl=hivControl)
pred.test.glm<-predict(fit.glm,newdata=Data.Test)
pred.test.pls<-predict(fit.pls,newdata=Data.Test)
pred.test.rf<-predict(fit.rf,newdata=Data.Test)
MatrixPredictions<-cbind.data.frame(
Pred.PLS = pred.test.pls ,
Pred.GLM = pred.test.glm ,
Pred.RF = pred.test.rf,
Y = Data.Test$Y
)
head(MatrixPredictions)
Pred.PLS Pred.GLM Pred.RF Y
1 0.6743904 0.4441626 0.6362682 0.2552725
2 2.3104745 2.4364440 2.0835473 2.3010300
3 2.1292015 2.0730297 2.1793279 2.3010300
4 0.6642276 0.4737561 0.7195138 0.3802112
5 2.0439181 2.0577978 2.0885126 2.3010300
6 1.9862238 2.0197845 2.2231039 2.3010300
Note that the predictions from all methods are highly correlated:
cor(MatrixPredictions)
Pred.PLS Pred.GLM Pred.RF Y
Pred.PLS 1.0000000 0.9795299 0.9614987 0.9480400
Pred.GLM 0.9795299 1.0000000 0.9694709 0.9669947
Pred.RF 0.9614987 0.9694709 1.0000000 0.9765041
Y 0.9480400 0.9669947 0.9765041 1.0000000
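Correlation alone does not measure accuracy, so a test-set RMSE comparison (using a small helper defined here) makes the methods directly comparable:

# helper: root-mean-squared error against the observed Y
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
sapply(MatrixPredictions[, c('Pred.PLS', 'Pred.GLM', 'Pred.RF')],
       rmse, obs = MatrixPredictions$Y)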