Joel Correa da Rosa
December 02, 2015
Models for predicting outcomes in classification and regression problems have been developed in many different areas, and there is a growing overlap between them.
The general goal is to build a model that can accurately (better than random) predict out-of-sample cases.
It is not sufficient to have a good model: one must show that it works on unseen data and does better than random prediction.
Classic statistical inference plays a minor role here.
Let's simulate some data with a weak linear relationship between Y and X.
set.seed(123)
# a standard normal predictor
X<-rnorm(100,0,1)
# an outcome only weakly correlated with X
Y<-0.5+0.1*X+rnorm(100,0,2)
# generating indexes for training
ind.train<-sample(1:100,85)
# training sets
Xtrain<-X[ind.train]
Ytrain<-Y[ind.train]
# test sets
Xtest<-X[-ind.train]
Ytest<-Y[-ind.train]
# Data frames for training and test
Dtrain = cbind.data.frame(X= Xtrain, Y = Ytrain)
Dtest = cbind.data.frame(X= Xtest, Y = Ytest)
# fitting a linear regression
fit.lm<-lm(Y~X,data=Dtrain)
summary(fit.lm)
Call:
lm(formula = Y ~ X, data = Dtrain)
Residuals:
Min 1Q Median 3Q Max
-3.9442 -1.5268 -0.0528 1.2107 6.3753
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.45785 0.21772 2.103 0.0385 *
X -0.04619 0.22930 -0.201 0.8409
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.996 on 83 degrees of freedom
Multiple R-squared: 0.0004886, Adjusted R-squared: -0.01155
F-statistic: 0.04057 on 1 and 83 DF, p-value: 0.8409
# prediction in training set
pred.lm.train<-predict(fit.lm,newdata=Dtrain)
# predictions in test set
pred.lm.test<-predict(fit.lm,newdata=Dtest)
# let's evaluate the quality of the predictions
# in-sample: regress observed on predicted without an intercept;
# a slope near 1 with a significant fit indicates useful predictions
summary(lm(Ytrain ~ -1+pred.lm.train))
Call:
lm(formula = Ytrain ~ -1 + pred.lm.train)
Residuals:
Min 1Q Median 3Q Max
-3.9442 -1.5268 -0.0528 1.2107 6.3753
Coefficients:
Estimate Std. Error t value Pr(>|t|)
pred.lm.train 1.0000 0.4727 2.116 0.0373 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.984 on 84 degrees of freedom
Multiple R-squared: 0.05059, Adjusted R-squared: 0.03929
F-statistic: 4.476 on 1 and 84 DF, p-value: 0.03733
# out-of-sample
summary(lm(Ytest ~ -1+pred.lm.test))
Call:
lm(formula = Ytest ~ -1 + pred.lm.test)
Residuals:
Min 1Q Median 3Q Max
-2.0969 -1.1133 0.3382 0.4633 2.6611
Coefficients:
Estimate Std. Error t value Pr(>|t|)
pred.lm.test -1.3630 0.7369 -1.85 0.0856 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.304 on 14 degrees of freedom
Multiple R-squared: 0.1964, Adjusted R-squared: 0.139
F-statistic: 3.421 on 1 and 14 DF, p-value: 0.08559
Application of a decision tree to model the relationship between X and Y.
require(rpart)
# fitting a regression tree
fit.rp<-rpart(Y~X,data=Dtrain)
# prediction in training set
pred.rp.train<-predict(fit.rp,newdata=Dtrain)
# predictions in test set
pred.rp.test<-predict(fit.rp,newdata=Dtest)
# let's evaluate the quality of the predictions
# in-sample
summary(lm(Ytrain ~ -1+pred.rp.train))
Call:
lm(formula = Ytrain ~ -1 + pred.rp.train)
Residuals:
Min 1Q Median 3Q Max
-4.0729 -1.4916 -0.1019 1.2215 4.9288
Coefficients:
Estimate Std. Error t value Pr(>|t|)
pred.rp.train 1.0000 0.1962 5.098 2.09e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.78 on 84 degrees of freedom
Multiple R-squared: 0.2363, Adjusted R-squared: 0.2272
F-statistic: 25.98 on 1 and 84 DF, p-value: 2.092e-06
# out-of-sample
summary(lm(Ytest ~ -1+pred.rp.test))
Call:
lm(formula = Ytest ~ -1 + pred.rp.test)
Residuals:
Min 1Q Median 3Q Max
-1.9756 -1.3492 0.1331 0.6140 2.1433
Coefficients:
Estimate Std. Error t value Pr(>|t|)
pred.rp.test -0.7114 0.3240 -2.196 0.0455 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.255 on 14 degrees of freedom
Multiple R-squared: 0.2562, Adjusted R-squared: 0.203
F-statistic: 4.822 on 1 and 14 DF, p-value: 0.04546
A 'good model' can perform very badly on new samples, and a 'bad model' can perform very well on new samples. This is usually related to overfitting, as the sketch below illustrates.
good = high explanation of the variation in the training data
bad = low explanation of the variation in the training data
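As a quick illustration (our own sketch, reusing the simulated split above), a very flexible model can explain the training data well and still predict poorly out of sample:

# a minimal sketch of overfitting: a high-degree polynomial
fit.poly <- lm(Y ~ poly(X, 15), data = Dtrain)
# in-sample R-squared is inflated by the flexible fit
summary(fit.poly)$r.squared
# out-of-sample mean squared error, typically worse than the linear model's
mean((Ytest - predict(fit.poly, newdata = Dtest))^2)
mean((Ytest - predict(fit.lm, newdata = Dtest))^2)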
caret (classification and regression training)
caret is a powerful framework for fitting predictive models for classification and regression in R.
require(caret)
The characteristics of the available algorithms can be found here:
http://rpackages.ianhowson.com/cran/caret/man/models.html
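The available methods can also be browsed from within R with caret's getModelInfo function:

# list the model codes accepted by train()
models <- names(getModelInfo())
length(models)
head(models)
# tuning parameters of a specific method, e.g. rpart
getModelInfo('rpart', regex = FALSE)[[1]]$parameters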
Download data from Agresti.
require(caret)
require(foreign)
cdata <- read.dta("http://www.ats.ucla.edu/stat/data/crime.dta")
summary(cdata)
sid state crime murder
Min. : 1.0 Length:51 Min. : 82.0 Min. : 1.600
1st Qu.:13.5 Class :character 1st Qu.: 326.5 1st Qu.: 3.900
Median :26.0 Mode :character Median : 515.0 Median : 6.800
Mean :26.0 Mean : 612.8 Mean : 8.727
3rd Qu.:38.5 3rd Qu.: 773.0 3rd Qu.:10.350
Max. :51.0 Max. :2922.0 Max. :78.500
pctmetro pctwhite pcths poverty
Min. : 24.00 Min. :31.80 Min. :64.30 Min. : 8.00
1st Qu.: 49.55 1st Qu.:79.35 1st Qu.:73.50 1st Qu.:10.70
Median : 69.80 Median :87.60 Median :76.70 Median :13.10
Mean : 67.39 Mean :84.12 Mean :76.22 Mean :14.26
3rd Qu.: 83.95 3rd Qu.:92.60 3rd Qu.:80.10 3rd Qu.:17.40
Max. :100.00 Max. :98.50 Max. :86.60 Max. :26.40
single
Min. : 8.40
1st Qu.:10.05
Median :10.90
Mean :11.33
3rd Qu.:12.05
Max. :22.10
For our data analysis below, we will use the crime dataset that appears in Statistical Methods for Social Sciences, Third Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997). The variables are state id (sid), state name (state), violent crimes per 100,000 people (crime), murders per 1,000,000 (murder), the percent of the population living in metropolitan areas (pctmetro), the percent of the population that is white (pctwhite), percent of population with a high school education or above (pcths), percent of population living under poverty line (poverty), and percent of population that are single parents (single). It has 51 observations. We are going to exclude sid, state and murder.
Create data partition, 80% training, 20% testing for predicting crime.
# subsetting to exclude state id, state name and murder
mydata<-subset(cdata,select=-c(sid,state,murder))
inTrain<-createDataPartition(mydata$crime,
p=.8,list=TRUE,times=10)
mydata.train<-mydata[inTrain[[1]],]
mydata.test<-mydata[-inTrain[[1]],]
Create bootstrap resamples of the data with createResample.
# create data resamples
inResample<-createResample(mydata$crime,
list=TRUE,
times=10)
mydata.resample<-mydata[inResample[[1]],]
The createFolds function is very useful for performing cross-validation.
# creating folds
myFolds<-createFolds(mydata$crime,
list=TRUE,
returnTrain=TRUE
)
mydata.Fold<-mydata
Note: for time series analysis there is a function called createTimeSlices.
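A minimal sketch of createTimeSlices (the window sizes here are illustrative, not from the original analysis):

# rolling-origin slices for a length-100 series:
# train on 20 consecutive points, forecast the next 5
slices <- createTimeSlices(1:100, initialWindow = 20,
                           horizon = 5, fixedWindow = TRUE)
str(slices$train[[1]])
str(slices$test[[1]])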
Resamples, folds or partitions can be used as elements to tune parameters in some of the available models.
Methods for tuning the algorithms:
fitControl<-trainControl(
method = "repeatedcv",
number = 10,
repeats = 1
)
Example: CART (Classification and Regression Trees). This algorithm has one tuning parameter, the complexity parameter (cp) of the tree.
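As a sketch, an explicit grid of cp values (chosen arbitrarily here) can be passed through tuneGrid instead of letting train pick a default grid:

# supplying our own grid of complexity parameters
cpGrid <- data.frame(cp = seq(0, 0.3, by = 0.05))
trainRpGrid <- train(crime ~ ., data = mydata.train,
                     method = 'rpart', metric = 'RMSE',
                     tuneGrid = cpGrid)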
Case 1: Based on the traditional partitioning into training/test sets
trainRp1 <-train(crime~.,
data=mydata.train,
method='rpart',
metric='RMSE'
)
trainRp1
CART
43 samples
5 predictors
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 43, 43, 43, 43, 43, 43, ...
Resampling results across tuning parameters:
cp RMSE Rsquared RMSE SD Rsquared SD
0.0000000 414.9420 0.3280516 164.4986 0.1422958
0.1354936 443.2821 0.2542748 163.1864 0.1341698
0.3580494 452.6743 0.2392352 158.3217 0.1214761
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.
Case 2: Based on resampled data.
trainRp2 <-train(crime~.,
data=mydata.resample,
method='rpart',
metric ='RMSE'
)
trainRp2
CART
51 samples
5 predictors
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 51, 51, 51, 51, 51, 51, ...
Resampling results across tuning parameters:
cp RMSE Rsquared RMSE SD Rsquared SD
0.01613366 370.5856 0.4375553 148.2835 0.1865361
0.11139495 380.5858 0.4032198 159.8199 0.2395796
0.41249488 392.3272 0.3320403 147.0896 0.2276589
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.01613366.
Case 3: Based on the 10 created folds
fitControl<-trainControl(method = "repeatedcv",
number = 10,
repeats = 1,
index = myFolds,
savePredictions = TRUE)
trainRp3 <-train(crime~.,
data=mydata,
method='rpart',
metric='RMSE',
trControl=fitControl
)
trainRp3
CART
51 samples
5 predictors
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times)
Summary of sample sizes: 47, 46, 45, 46, 45, 46, ...
Resampling results across tuning parameters:
cp RMSE Rsquared RMSE SD Rsquared SD
0.03311338 321.4761 0.5597865 228.7972 0.25321420
0.15841943 336.9384 0.5024808 242.2787 0.27447612
0.33294737 387.6255 0.1674825 215.4202 0.09272875
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.03311338.
A great amount of information is stored in the object generated by the train command.
# renaming the object
rx<-trainRp3
# best tuning
rx$bestTune
cp
1 0.03311338
# results for different tuning parameters
rx$results
cp RMSE Rsquared RMSESD RsquaredSD
1 0.03311338 321.4761 0.5597865 228.7972 0.25321420
2 0.15841943 336.9384 0.5024808 242.2787 0.27447612
3 0.33294737 387.6255 0.1674825 215.4202 0.09272875
# predictions for each observation in each fold
head(rx$pred)
pred obs rowIndex cp Resample
1 390.4167 1206 9 0.03311338 Fold01
2 390.4167 496 16 0.03311338 Fold01
3 390.4167 138 30 0.03311338 Fold01
4 390.4167 627 31 0.03311338 Fold01
5 407.4828 1206 9 0.15841943 Fold01
6 407.4828 496 16 0.15841943 Fold01
# check that the rows held out of the first fold are exactly the ones predicted in Fold01
setdiff(1:dim(mydata.Fold)[1],myFolds[[1]])
[1] 9 16 30 31
A useful feature of caret is its measure of variable importance.
vip.rp<-varImp(trainRp3)
vip.rp
rpart variable importance
Overall
pctmetro 100.00
single 84.92
pctwhite 66.42
poverty 49.24
pcths 0.00
Information about variable importance: http://topepo.github.io/caret/varimp.html
Let's see what is inside the final model and look at the configuration of the final tree.
final.Tree<-trainRp3$finalModel
final.Tree
n= 51
node), split, n, deviance, yval
* denotes terminal node
1) root 51 9728475.0 612.8431
2) pctwhite>=88.65 24 688536.0 345.5417 *
3) pctwhite< 88.65 27 5800869.0 850.4444
6) pctmetro< 84.75 20 809479.8 709.1000 *
7) pctmetro>=84.75 7 3450209.0 1254.2860 *
#plot(final.Tree)
#text(final.Tree)
fitControl<-trainControl(method = "repeatedcv",
number = 10,
repeats = 1,
index = myFolds,
savePredictions = TRUE,
preProcOptions = c('center','scale'))
# note: preProcOptions only passes options on to preProcess();
# to actually center and scale, supply preProcess = c('center','scale')
# to train() (hence 'No pre-processing' in the output below)
trainMARS <-train(crime~.,
data=mydata,
method='earth',
metric='RMSE',
trControl=fitControl
)
trainMARS
Multivariate Adaptive Regression Spline
51 samples
5 predictors
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times)
Summary of sample sizes: 47, 46, 45, 46, 45, 46, ...
Resampling results across tuning parameters:
nprune RMSE Rsquared RMSE SD Rsquared SD
2 350.3875 0.4072159 232.7888 0.2685322
8 246.8368 0.6964006 200.6891 0.2582312
14 246.8368 0.6964006 200.6891 0.2582312
Tuning parameter 'degree' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 8 and degree = 1.
fitControl<-trainControl(method = "repeatedcv",
number = 10,
repeats = 1,
index = myFolds,
savePredictions = TRUE,
preProcOptions = c('center','scale'))
trainGLM <-train(crime~.,
data=mydata,
method='glm',
metric='RMSE',
trControl=fitControl
)
trainGLM
Generalized Linear Model
51 samples
5 predictors
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 1 times)
Summary of sample sizes: 47, 46, 45, 46, 45, 46, ...
Resampling results
RMSE Rsquared RMSE SD Rsquared SD
231.1326 0.7169405 97.47757 0.2534948
results <- resamples(list(GLM=trainGLM, CART=trainRp3, MARS=trainMARS),metric='RMSE')
summary(results)
Call:
summary.resamples(object = results)
Models: GLM, CART, MARS
Number of resamples: 10
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
GLM 126.20 146.4 212.5 231.1 309.6 409.4 0
CART 126.80 187.1 227.8 321.5 384.8 895.6 0
MARS 62.94 148.9 204.8 246.8 228.4 785.0 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
GLM 0.286200 0.5851 0.7917 0.7169 0.9224 0.9832 0
CART 0.004113 0.5720 0.6172 0.5598 0.6754 0.8862 1
MARS 0.224100 0.5713 0.7766 0.6964 0.8789 0.9547 0
bwplot(results,metric='RMSE')
bwplot(results,metric='Rsquared')
dotplot(results,metric='RMSE')
dotplot(results,metric='Rsquared')
Load the Boston housing data from the MASS package.
require(MASS)
data(Boston)
head(Boston)
crim zn indus chas nox rm age dis rad tax ptratio black
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
lstat medv
1 4.98 24.0
2 9.14 21.6
3 4.03 34.7
4 2.94 33.4
5 5.33 36.2
6 5.21 28.7
The outcome variable is medv, the median value of housing in a Boston district. Extract 20% of the data for testing and fit the models with the remaining 80%. Compare the predictions in the test sample across the different algorithms.
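One possible sketch of this exercise (the model choices here are ours, for illustration):

# 80/20 split on medv, then compare two algorithms on the test set
inTrainB <- createDataPartition(Boston$medv, p = 0.8, list = FALSE)
boston.train <- Boston[inTrainB, ]
boston.test  <- Boston[-inTrainB, ]
fitB.glm <- train(medv ~ ., data = boston.train, method = 'glm')
fitB.rp  <- train(medv ~ ., data = boston.train, method = 'rpart')
# out-of-sample RMSE for each model
sqrt(mean((boston.test$medv - predict(fitB.glm, boston.test))^2))
sqrt(mean((boston.test$medv - predict(fitB.rp,  boston.test))^2))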
require(kernlab)
data(spam)
help(spam)
indtrain<-createDataPartition(spam$type,p=0.7,times=1,list=FALSE)
spam.train<-spam[indtrain,]
spam.test<-spam[-indtrain,]
fitControl<-trainControl(method = "repeatedcv",
number = 10,
repeats = 1,
savePredictions = TRUE,
preProcOptions = c('center','scale'))
spam.glm<-train(type~.,data=spam.train,method='glm')
spam.glm
Generalized Linear Model
3222 samples
57 predictors
2 classes: 'nonspam', 'spam'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 3222, 3222, 3222, 3222, 3222, 3222, ...
Resampling results
Accuracy Kappa Accuracy SD Kappa SD
0.9166912 0.8251432 0.01707908 0.03282495
Predicted Values
pred.spam<-predict(spam.glm,newdata=spam.test,type='raw')
Confusion Matrix
cM<-table(pred.spam,spam.test$type)
cM
pred.spam nonspam spam
nonspam 793 55
spam 43 488
confusionMatrix(cM)
Confusion Matrix and Statistics
pred.spam nonspam spam
nonspam 793 55
spam 43 488
Accuracy : 0.9289
95% CI : (0.9141, 0.9419)
No Information Rate : 0.6062
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8506
Mcnemar's Test P-Value : 0.2665
Sensitivity : 0.9486
Specificity : 0.8987
Pos Pred Value : 0.9351
Neg Pred Value : 0.9190
Prevalence : 0.6062
Detection Rate : 0.5751
Detection Prevalence : 0.6149
Balanced Accuracy : 0.9236
'Positive' Class : nonspam
require(pROC)
pred.spam<-predict(spam.glm,newdata=spam.test,type='prob')
rocobj <- plot.roc(ifelse(spam.test$type=='spam',1,0),pred.spam[,'spam'])
plot(rocobj)
Call:
plot.roc.default(x = ifelse(spam.test$type == "spam", 1, 0), predictor = pred.spam[, "spam"])
Data: pred.spam[, "spam"] in 836 controls (ifelse(spam.test$type == "spam", 1, 0) 0) < 543 cases (ifelse(spam.test$type == "spam", 1, 0) 1).
Area under the curve: 0.9765
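A suggested operating point can be extracted from the fitted curve with pROC's coords function (shown as a sketch; the 'best' rule defaults to Youden's index):

# threshold maximizing sensitivity + specificity (Youden's J)
coords(rocobj, 'best', ret = c('threshold', 'sensitivity', 'specificity'))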
require(pls)
require(glmnet)
require(randomForest)
All these models can be applied to regression and classification (binary or multiclass) problems.
\( \lambda \): magnitude of the penalty
\( \alpha \): elastic net mixing parameter, ranging from 0 to 1. In particular, \( \alpha = 1 \) gives the LASSO and \( \alpha = 0 \) gives ridge regression.
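For reference, the elastic net penalty combining the two parameters can be written, in the standard glmnet parameterization, as

\[ \lambda \left[ \frac{(1-\alpha)}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right] \]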
Dimension reduction by estimating scores for latent factors (in both Y and X), followed by regression.
Instead of creating factors to explain the variation in \( X'X \), it creates factors to explain the variation in \( X'Y \).
It is an ensemble technique that builds a multitude of trees and predicts the class by majority vote, or the quantitative outcome by averaging the predictions of the individual trees.
It is a successful example of how combining weak predictors can generate good models. Be careful: it can easily overfit.
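A minimal randomForest sketch on illustrative data (iris, not from the original):

require(randomForest)
# 500 trees vote on the species of each iris flower
fit.iris <- randomForest(Species ~ ., data = iris, ntree = 500)
# out-of-bag confusion matrix
fit.iris$confusion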
Downloading the leukemia dataset from the plsgenomics package.
require('plsgenomics')
data(leukemia)
# matrix of gene expressions
X<-leukemia$X
# cancer status
Y<-factor(leukemia$Y)
# number of markers
dimP<-dim(leukemia$X)[2]
dimP
[1] 3051
# number of observations
dimN<-dim(leukemia$X)[1]
dimN
[1] 38
help("leukemia")
require('pls')
require('plsgenomics')
require('randomForest')
myControl<-trainControl(method = "repeatedcv",
number = 10,
repeats = 1,
savePredictions =TRUE)
# glmnet
fit.glm<-train(X,Y,method='glmnet',tuneGrid=data.frame(alpha=0,lambda=1),trControl = myControl)
# pls
fit.pls<-train(X,Y,method='pls',trControl=myControl)
# random forest
fit.rf<-train(X,Y,method='rf',tuneGrid = data.frame(mtry=2),trControl=myControl)
results <- resamples(list(GLM=fit.glm, PLS=fit.pls, RF=fit.rf))
summary(results)
Call:
summary.resamples(object = results)
Models: GLM, PLS, RF
Number of resamples: 10
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
GLM 1.0000 1.00 1.00 1.0000 1 1 0
PLS 1.0000 1.00 1.00 1.0000 1 1 0
RF 0.6667 0.75 0.75 0.8417 1 1 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
GLM 1 1 1.00 1.00 1 1 0
PLS 1 1 1.00 1.00 1 1 0
RF 0 0 0.25 0.45 1 1 0
bwplot(results)
dotplot(results)
The estimated tuning parameters:
# pls tuning parameters
fit.pls$bestTune
ncomp
1 1
# glmnet tuning parameter
fit.glm$bestTune
alpha lambda
1 0 1
# random forests tuning parameter
fit.rf$bestTune
mtry
1 2
coef<-fit.pls$finalModel$coefficients
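To see which genes drive the PLS model, the coefficients can be ranked by absolute size (a sketch, assuming the usual pls layout of the coefficient array: predictors x responses x components):

# coefficients for the first response and the selected component
cf <- coef[, 1, 1]
# the ten genes with the largest absolute PLS coefficients
head(sort(abs(cf), decreasing = TRUE), 10)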
Download the data from the link:
load('hiv.rda')
# training data
Data.Train <-cbind.data.frame(X = hiv.train$x , Y = hiv.train$y)
# test data
Data.Test <-cbind.data.frame(X = hiv.test$x , Y = hiv.test$y)
hivControl<-trainControl(method = "repeatedcv",
number = 10,
repeats = 1,
savePredictions =TRUE)
# glmnet
fit.glm<-train(Y~.,data=Data.Train,method='glmnet',tuneLength=10,trControl =hivControl)
# pls
fit.pls<-train(Y~.,data=Data.Train,method='pls',trControl=hivControl)
# random forest
fit.rf<-train(Y~.,data=Data.Train,method='rf',tuneGrid = data.frame(mtry=c(2,5,10,20,30)),trControl=hivControl)
pred.test.glm<-predict(fit.glm,newdata=Data.Test)
pred.test.pls<-predict(fit.pls,newdata=Data.Test)
pred.test.rf<-predict(fit.rf,newdata=Data.Test)
MatrixPredictions<-cbind.data.frame(
Pred.PLS = pred.test.pls ,
Pred.GLM = pred.test.glm ,
Pred.RF = pred.test.rf,
Y = Data.Test$Y
)
head(MatrixPredictions)
Pred.PLS Pred.GLM Pred.RF Y
1 0.6743904 0.4441626 0.6362682 0.2552725
2 2.3104745 2.4364440 2.0835473 2.3010300
3 2.1292015 2.0730297 2.1793279 2.3010300
4 0.6642276 0.4737561 0.7195138 0.3802112
5 2.0439181 2.0577978 2.0885126 2.3010300
6 1.9862238 2.0197845 2.2231039 2.3010300
Note that the predictions from all methods are highly correlated:
cor(MatrixPredictions)
Pred.PLS Pred.GLM Pred.RF Y
Pred.PLS 1.0000000 0.9795299 0.9614987 0.9480400
Pred.GLM 0.9795299 1.0000000 0.9694709 0.9669947
Pred.RF 0.9614987 0.9694709 1.0000000 0.9765041
Y 0.9480400 0.9669947 0.9765041 1.0000000
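Correlation alone does not measure accuracy, so a test-set RMSE comparison (using a small helper defined here) makes the methods directly comparable:

# helper: root-mean-squared error against the observed Y
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
sapply(MatrixPredictions[, c('Pred.PLS', 'Pred.GLM', 'Pred.RF')],
       rmse, obs = MatrixPredictions$Y)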