The way the twang package gets weights from a chosen estimand made me curious about how to manually fit a propensity score (PS) model using gradient boosting, here with the caret and xgboost packages.
library(tidyverse)
library(caret)
library(xgboost)
library(pROC)
# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
df <- na.omit(PimaIndiansDiabetes2)
Description from ?mlbench::PimaIndiansDiabetes2:

A data frame with 768 observations on 9 variables.

Usage
data(PimaIndiansDiabetes)
data(PimaIndiansDiabetes2)

Format
pregnant  Number of times pregnant
glucose   Plasma glucose concentration (glucose tolerance test)
pressure  Diastolic blood pressure (mm Hg)
triceps   Triceps skin fold thickness (mm)
insulin   2-Hour serum insulin (mu U/ml)
mass      Body mass index (weight in kg/(height in m)^2)
pedigree  Diabetes pedigree function
age       Age (years)
diabetes  Class variable (test for diabetes)
Details
The data set PimaIndiansDiabetes2 contains a corrected version of the original data set. While the UCI repository index claims that there are no missing values, closer inspection of the data shows several physical impossibilities, e.g., blood pressure or body mass index of 0.
In PimaIndiansDiabetes2, all zero values of glucose, pressure, triceps, insulin and mass have been set to NA, see also Wahba et al (1995) and Ripley (1996).
Source
Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
Donor of database: Vincent Sigillito (vgs@aplcen.apl.jhu.edu)
These data have been taken from the UCI Repository Of Machine Learning Databases at
ftp://ftp.ics.uci.edu/pub/machine-learning-databases
http://www.ics.uci.edu/~mlearn/MLRepository.html
and were converted to R format by Friedrich Leisch.
# Inspect the data
sample_n(df, 3)
# Split the data into training and test set
set.seed(123)
training.samples <- df$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- df[training.samples, ]
test.data <- df[-training.samples, ]
logit.model <- glm(diabetes ~., data = train.data, family = binomial)
summary(logit.model)
##
## Call:
## glm(formula = diabetes ~ ., family = binomial, data = train.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5832 -0.6544 -0.3292 0.6248 2.5968
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.053e+01 1.440e+00 -7.317 2.54e-13 ***
## pregnant 1.005e-01 6.127e-02 1.640 0.10092
## glucose 3.710e-02 6.486e-03 5.719 1.07e-08 ***
## pressure -3.876e-04 1.383e-02 -0.028 0.97764
## triceps 1.418e-02 1.998e-02 0.710 0.47800
## insulin 5.940e-04 1.508e-03 0.394 0.69371
## mass 7.997e-02 3.180e-02 2.515 0.01190 *
## pedigree 1.329e+00 4.823e-01 2.756 0.00585 **
## age 2.718e-02 2.020e-02 1.346 0.17840
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 398.80 on 313 degrees of freedom
## Residual deviance: 267.18 on 305 degrees of freedom
## AIC: 285.18
##
## Number of Fisher Scoring iterations: 5
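Since the original motivation was propensity score weighting, here is a minimal sketch of turning the fitted probabilities into inverse-probability weights for the ATE, assuming diabetes plays the role of the treatment indicator (these are the estimand-specific weights that twang would otherwise compute for us):
# Fitted propensity scores on the training data
ps.logit <- predict(logit.model, type = "response")
# Treatment indicator: 1 = pos, 0 = neg (assumption: diabetes acts as the "treatment")
z <- as.numeric(train.data$diabetes == "pos")
# ATE weights: 1/ps for the treated, 1/(1 - ps) for the controls
w.ate <- z / ps.logit + (1 - z) / (1 - ps.logit)
summary(w.ate)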
# Helper that makes class predictions at different probability cutoffs.
get_logistic_pred <- function(mod, data, pos = "pos", neg = "neg", cut = 0.5) {
  probs <- predict(mod, newdata = data, type = "response")
  ifelse(probs > cut, pos, neg)
}
test_pred_50 <- get_logistic_pred(logit.model, data = test.data,
                                  pos = "pos", neg = "neg", cut = 0.5)
test_tab_50 = table(predicted = test_pred_50, actual = test.data$diabetes)
test_con_mat_50 = confusionMatrix(test_tab_50, positive = "pos")
test_con_mat_50
## Confusion Matrix and Statistics
##
## actual
## predicted neg pos
## neg 44 11
## pos 8 15
##
## Accuracy : 0.7564
## 95% CI : (0.646, 0.8465)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.05651
##
## Kappa : 0.4356
##
## Mcnemar's Test P-Value : 0.64636
##
## Sensitivity : 0.5769
## Specificity : 0.8462
## Pos Pred Value : 0.6522
## Neg Pred Value : 0.8000
## Prevalence : 0.3333
## Detection Rate : 0.1923
## Detection Prevalence : 0.2949
## Balanced Accuracy : 0.7115
##
## 'Positive' Class : pos
##
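Because get_logistic_pred() takes a cutoff argument, we can also trade sensitivity against specificity by moving the cutoff away from 0.5; a sketch (the 0.3 and 0.7 values are arbitrary illustrative choices):
test_pred_30 <- get_logistic_pred(logit.model, data = test.data,
                                  pos = "pos", neg = "neg", cut = 0.3)
test_pred_70 <- get_logistic_pred(logit.model, data = test.data,
                                  pos = "pos", neg = "neg", cut = 0.7)
confusionMatrix(table(predicted = test_pred_30, actual = test.data$diabetes),
                positive = "pos")
confusionMatrix(table(predicted = test_pred_70, actual = test.data$diabetes),
                positive = "pos")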
test_prob = predict(logit.model, newdata = test.data, type = "response")
test_roc = roc(test.data$diabetes ~ test_prob, plot = TRUE, print.auc = TRUE)
## Setting levels: control = neg, case = pos
## Setting direction: controls < cases
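For completeness, the AUC and an ROC-optimal threshold can be extracted from the roc object; a sketch using pROC's auc() and coords() (where "best" defaults to Youden's J):
auc(test_roc)
coords(test_roc, "best", ret = c("threshold", "sensitivity", "specificity"))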
# Fit the model on the training set
set.seed(123)
xgb.model <- train(
diabetes ~., data = train.data, method = "xgbTree",
trControl = trainControl("cv", number = 50)
)
Note:
I checked and found that 50-fold CV gives the highest accuracy here; choosing fewer or more folds leads to accuracy below that of the logistic model above.
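To see which hyperparameters caret selected and how the candidates performed, the train object stores the tuning results; a quick look using the standard caret accessors:
# Best hyperparameter combination found by cross-validation
xgb.model$bestTune
# CV accuracy for each candidate combination, best first
xgb.model$results %>% arrange(desc(Accuracy)) %>% head()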
This is how we get each individual's predicted probability of being pos for diabetes:
p.xgb.prob <- pull(predict(xgb.model, type="prob"), pos)
Getting the accuracy automatically from confusionMatrix():
test_pred_gbm <- predict(xgb.model, test.data, type="raw")
test_tab_gbm = table(predicted = test_pred_gbm, actual = test.data$diabetes)
test_con_mat_gbm = confusionMatrix(test_tab_gbm, positive = "pos")
test_con_mat_gbm
## Confusion Matrix and Statistics
##
## actual
## predicted neg pos
## neg 43 7
## pos 9 19
##
## Accuracy : 0.7949
## 95% CI : (0.6884, 0.878)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.009224
##
## Kappa : 0.5472
##
## Mcnemar's Test P-Value : 0.802587
##
## Sensitivity : 0.7308
## Specificity : 0.8269
## Pos Pred Value : 0.6786
## Neg Pred Value : 0.8600
## Prevalence : 0.3333
## Detection Rate : 0.2436
## Detection Prevalence : 0.3590
## Balanced Accuracy : 0.7788
##
## 'Positive' Class : pos
##
Manually choosing a cut-off of 50% (the same cut-off as for the logistic model above) produces the same results as type = "raw" in the predict function:
test_predp_gbm <- pull(predict(xgb.model, test.data, type="prob"), pos)
test_pred_50gbm <- ifelse(test_predp_gbm > 0.5, "pos", "neg")
test_tab_50gbm <- table(predicted = test_pred_50gbm, actual = test.data$diabetes)
test_con_mat_50gbm = confusionMatrix(test_tab_50gbm, positive = "pos")
test_con_mat_50gbm
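As a quick sanity check (assuming no probabilities fall exactly on 0.5), the manually thresholded labels should match the type = "raw" labels:
identical(as.character(test_pred_gbm), test_pred_50gbm)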
We compare the probability for each individual between the two methods:
p.logit <- predict(logit.model, newdata = train.data, type = "response")
p.xgb.prob <- pull(predict(xgb.model, newdata = train.data, type = "prob"), pos)
df.prob <- cbind(train.data["diabetes"], p.logit, p.xgb.prob)
head(df.prob)
The individual probabilities from the two methods are different!
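One way to see how the two models disagree is to plot the paired probabilities; a ggplot2 sketch (points on the dashed diagonal would mean the two methods agree):
ggplot(df.prob, aes(x = p.logit, y = p.xgb.prob, color = diabetes)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Logistic regression probability", y = "xgbTree probability")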
method = 'xgbTree': https://topepo.github.io/caret/train-models-by-tag.html#boosting
trControl = trainControl(): https://topepo.github.io/caret/using-your-own-model-in-train.html and https://stats.stackexchange.com/questions/44343/in-caret-what-is-the-real-difference-between-cv-and-repeatedcv
glm(): https://daviddalpiaz.github.io/r4sl/logistic-regression.html#logistic-regression-with-glm