Multivariate Adaptive Regression Splines beat every competitor except Gradient Boosted Trees, and even then the test-set AUCs of the two models were nearly identical.
Single-tree models (CART, ctree) performed the worst, as expected.
Load the CreditCard data set from the AER package. It contains a response vector, card, indicating whether a consumer was declined (no) or approved (yes) for a credit card.
I use the createDataPartition function from the caret package to split the data frame into training and test sets.
Some predictor variables were removed to prevent complete separation between the approvals and the declines.
library(AER)
## Loading required package: car
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Loading required package: sandwich
## Loading required package: survival
## Loading required package: splines
data(CreditCard)
CreditCard$Class <- CreditCard$card
CreditCard <- subset(CreditCard, select=-c(card, expenditure, share))
set.seed(1984)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:survival':
##
## cluster
training <- createDataPartition(CreditCard$Class, p = 0.6, list=FALSE)
trainData <- CreditCard[training,]
testData <- CreditCard[-training,]
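As a quick sanity check (not in the original analysis), createDataPartition samples within each level of Class, so the yes/no proportions should be nearly identical in the two splits.
# Verify that the stratified split preserved the class proportions
prop.table(table(trainData$Class))
prop.table(table(testData$Class))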
Fit a glm model (logistic regression) and score the test data set.
glmModel <- glm(Class~ . , data=trainData, family=binomial)
pred.glmModel <- predict(glmModel, newdata=testData, type="response")
Calculate the AUC for the test data set.
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc.glmModel <- pROC::roc(testData$Class, pred.glmModel)
auc.glmModel <- pROC::auc(roc.glmModel)
Fit a glm using a boosting algorithm (as opposed to MLE). Unlike the glm function, glmboost will perform variable selection. After fitting the model, score the test data set and measure the AUC.
fitControl <- trainControl(method = "repeatedcv",
                           number = 5,
                           repeats = 10,
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using
                           ## the following function
                           summaryFunction = twoClassSummary)
set.seed(2014)
glmBoostModel <- train(Class ~ ., data=trainData, method = "glmboost", metric="ROC",
                       trControl = fitControl, tuneLength=5, center=TRUE,
                       family=Binomial(link = "logit"))
## Loading required package: mboost
## Loading required package: parallel
## Loading required package: stabs
## This is mboost 2.4-0. See 'package?mboost' and the NEWS file
## for a complete list of changes.
##
##
## Attaching package: 'mboost'
##
## The following object is masked from 'package:ggplot2':
##
## %+%
pred.glmBoostModel <- as.vector(predict(glmBoostModel, newdata=testData, type="prob")[,"yes"])
roc.glmBoostModel <- pROC::roc(testData$Class, pred.glmBoostModel)
auc.glmBoostModel <- pROC::auc(roc.glmBoostModel)
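Because glmboost performs variable selection, it is worth checking which predictors survived. A minimal sketch (not in the original post), relying on caret's bestTune element and the standard coef method for mboost objects:
# Tuning values (mstop, prune) selected by repeated CV
glmBoostModel$bestTune
# Coefficients of the predictors retained by boosting
coef(glmBoostModel$finalModel)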
Try CART, conditional inference tree, elastic net, multivariate adaptive regression splines, boosted trees, and random forest.
# CART
set.seed(2014)
cartModel <- train(Class ~ ., data=trainData, method = "rpart", metric="ROC", trControl = fitControl, tuneLength=5)
## Loading required package: rpart
pred.cartModel <- as.vector(predict(cartModel, newdata=testData, type="prob")[,"yes"])
roc.cartModel <- pROC::roc(testData$Class, pred.cartModel)
auc.cartModel <- pROC::auc(roc.cartModel)
# Conditional Inference Tree
set.seed(2014)
partyModel <- train(Class ~ ., data=trainData, method = "ctree", metric="ROC", trControl = fitControl, tuneLength=5)
## Loading required package: party
## Loading required package: grid
## Loading required package: strucchange
## Loading required package: modeltools
## Loading required package: stats4
pred.partyModel <- as.vector(predict(partyModel, newdata=testData, type="prob")[,"yes"])
roc.partyModel <- pROC::roc(testData$Class, pred.partyModel)
auc.partyModel <- pROC::auc(roc.partyModel)
# Elastic Net
set.seed(2014)
eNetModel <- train(Class ~ ., data=trainData, method = "glmnet", metric="ROC", trControl = fitControl, family="binomial", tuneLength=5)
## Loading required package: glmnet
## Loading required package: Matrix
## Loaded glmnet 1.9-8
##
##
## Attaching package: 'glmnet'
##
## The following object is masked from 'package:pROC':
##
## auc
pred.eNetModel <- as.vector(predict(eNetModel, newdata=testData, type="prob")[,"yes"])
roc.eNetModel <- pROC::roc(testData$Class, pred.eNetModel)
auc.eNetModel <- pROC::auc(roc.eNetModel)
# Earth
set.seed(2014)
earthModel <- train(Class ~ ., data=trainData, method = "earth", glm=list(family=binomial), metric="ROC", trControl = fitControl, tuneLength=5)
## Loading required package: earth
## Loading required package: plotmo
## Loading required package: plotrix
pred.earthModel <- as.vector(predict(earthModel, newdata=testData, type="prob")[,"yes"])
roc.earthModel <- pROC::roc(testData$Class, pred.earthModel)
auc.earthModel <- pROC::auc(roc.earthModel)
# Boosted Trees
set.seed(2014)
gbmModel <- train(Class ~ ., data=trainData, method = "gbm", metric="ROC", trControl = fitControl, verbose=FALSE, tuneLength=5)
## Loading required package: gbm
## Loaded gbm 2.1
## Loading required package: plyr
##
## Attaching package: 'plyr'
##
## The following object is masked from 'package:modeltools':
##
## empty
pred.gbmModel <- as.vector(predict(gbmModel, newdata=testData, type="prob")[,"yes"])
roc.gbmModel <- pROC::roc(testData$Class, pred.gbmModel)
auc.gbmModel <- pROC::auc(roc.gbmModel)
# Random Forest
set.seed(2014)
rfModel <- train(Class ~ ., data=trainData, method = "rf", metric="ROC", trControl = fitControl, verbose=FALSE, tuneLength=5)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
pred.rfModel <- as.vector(predict(rfModel, newdata=testData, type="prob")[,"yes"])
roc.rfModel <- pROC::roc(testData$Class, pred.rfModel)
auc.rfModel <- pROC::auc(roc.rfModel)
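Before turning to the test-set AUC, one could also compare the models on their cross-validated ROC with caret's resamples function (a sketch, not part of the original post; glmModel is excluded because it was fit with glm rather than train):
# Collect the repeated-CV results from each caret model
resamps <- resamples(list(glmboost = glmBoostModel, cart = cartModel,
                          ctree = partyModel, glmnet = eNetModel,
                          earth = earthModel, gbm = gbmModel, rf = rfModel))
summary(resamps)
# Dotplot of the cross-validated ROC for each model
dotplot(resamps, metric = "ROC")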
Plot the test-set AUC for each model.
test.auc <- data.frame(model=c("glm","glmboost","gbm","glmnet","earth","cart","ctree","rForest"),
                       auc=c(auc.glmModel, auc.glmBoostModel, auc.gbmModel, auc.eNetModel,
                             auc.earthModel, auc.cartModel, auc.partyModel, auc.rfModel))
test.auc <- test.auc[order(test.auc$auc, decreasing=TRUE),]
test.auc$model <- factor(test.auc$model, levels=test.auc$model)
test.auc
## model auc
## 3 gbm 0.8064419
## 5 earth 0.7974908
## 2 glmboost 0.7923004
## 1 glm 0.7917616
## 8 rForest 0.7841884
## 4 glmnet 0.7791223
## 7 ctree 0.7704716
## 6 cart 0.7473996
library(ggplot2)
theme_set(theme_gray(base_size = 18))
ggplot(test.auc, aes(x=model, y=auc)) + geom_bar(stat="identity", fill="light blue")
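To back up the claim that the gbm and MARS test-set AUCs are essentially tied, one could run DeLong's test for paired ROC curves (a sketch using pROC::roc.test, not in the original post):
# DeLong's test for two correlated (paired) ROC curves on the same test set
pROC::roc.test(roc.gbmModel, roc.earthModel, method = "delong")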
Plot each model's resampled performance across its tuning parameter grid from the repeated CV.
plot(cartModel)
plot(partyModel)
plot(eNetModel)
plot(glmBoostModel)
plot(gbmModel)
plot(earthModel)
plot(rfModel)
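A quick way to list the tuning parameters that repeated CV settled on for every model (a small sketch, not in the original post):
# bestTune holds the winning tuning parameter combination for each train object
caretModels <- list(cart = cartModel, ctree = partyModel, glmnet = eNetModel,
                    glmboost = glmBoostModel, earth = earthModel,
                    gbm = gbmModel, rf = rfModel)
lapply(caretModels, function(m) m$bestTune)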
The test-set performance of the Multivariate Adaptive Regression Splines (MARS) model was nearly tied with that of the Gradient Boosted Trees (gbm) model. Unfortunately, gbm models are more difficult to implement and far more difficult to interpret.
Repeated CV chose an additive MARS model (degree=1) with knots in some of the predictor variables. MARS can also fit non-additive models via interactions between hinge functions of the predictors.
The hinge functions transform the predictors to capture non-linear relationships between them and the log odds of approval.
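As an illustrative sketch (not in the original post), a hinge term h(c - x) is simply max(0, c - x), so it is linear below the knot and exactly zero above it; h(x - c) mirrors this on the other side.
# Hinge functions used by MARS, e.g. h(3 - reports) and h(months - 228)
hinge_left  <- function(x, knot) pmax(0, knot - x)
hinge_right <- function(x, knot) pmax(0, x - knot)
hinge_left(c(0, 1, 2, 3, 6, 8), 3)    # 3 2 1 0 0 0 -- flat once reports >= 3
hinge_right(c(120, 228, 300), 228)    # 0 0 72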
# Tuning parms chosen via repeated CV
earthModel$bestTune
## nprune degree
## 5 14 1
# Coefficients of the MARS model
summary(earthModel)
## Call: earth(x="[792, 9]-too-long-to-display", y="[792, 1]-too-long-to-display",
## glm=structure(list(family=function(link="logit")
## {
## linktemp<-substitute(link)
## if(!is.character(linktemp))
## linktemp<-deparse(linktemp)
## okLinks<-c("logit", "probit", "cloglog", "cauchit",
## "log")...
##
## GLM coefficients
## yes
## (Intercept) -4.4270847
## h(3-reports) 2.4643180
## h(2.656-income) -1.8421880
## owneryes 0.8318101
## selfempyes -1.2516007
## h(2-dependents) 0.5637676
## h(months-228) -0.0126266
## h(3-active) -0.9309330
##
## Earth selected 8 of 16 terms, and 7 of 9 predictors
## Importance: reports, active, income, dependents, months, selfempyes, ...
## Number of terms at each degree of interaction: 1 7 (additive model)
## Earth GCV 0.1033357 RSS 78.77113 GRSq 0.4084182 RSq 0.4291738
##
## GLM null.deviance 844.0365 (791 dof) deviance 482.9423 (784 dof) iters 6
One benefit of using the earth package to build MARS models is the ability to visually inspect the partial effect of each predictor variable in the model.
For example, increasing the number of derogatory reports in a consumer's credit file from 1 to 2 reduces the log odds of approval. However, there is no effect on the log odds if the number of derogatory reports increases from 6 to 8.
plotmo(earthModel$finalModel, type="link")
##
## grid: reports age income owneryes selfempyes dependents months
## 0 31.75 3 0 0 0 30
## majorcards active
## 1 6
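A quick worked check of the derogatory-reports example above (a sketch using the h(3-reports) coefficient reported by summary(earthModel)):
# Contribution of the h(3 - reports) term to the linear predictor (log odds)
beta.reports <- 2.4643180
contrib <- beta.reports * pmax(0, 3 - c(1, 2, 6, 8))
names(contrib) <- paste0("reports=", c(1, 2, 6, 8))
contrib
# Going from 1 to 2 reports drops the log odds by about 2.46; the term is
# already zero at 6 reports, so moving from 6 to 8 changes nothing.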