1 Introduction

In this paper, we consider one of the DrivenData competition problems, Blood Donation Prediction (https://www.drivendata.org/competitions/2/warm-up-predict-blood-donations/). In this problem we need to predict the probability that a donor who has made at least one donation in the past will donate blood again.

1.1 Libraries

library(caret) # grid search, cross-validation and data partitioning
library(anomalize) # library for removing outliers
library(Metrics) # library for calculating logLoss
library(dplyr) # library for data engineering
library(xgboost) # library for XGBoost

1.2 Data import

To evaluate our model, we are going to use the data from the train.csv file as our dataset.

setwd("~/Dropbox/RProjects/bloodProject/")
dataset <-read.csv("train.csv")

2 Data Engineering

Let's look at the structure of our dataset.

str(dataset)
'data.frame':   576 obs. of  6 variables:
 $ X                          : int  619 664 441 160 358 335 47 164 736 436 ...
 $ Months.since.Last.Donation : int  2 0 1 2 1 4 2 1 5 0 ...
 $ Number.of.Donations        : int  50 13 16 20 24 4 7 12 46 3 ...
 $ Total.Volume.Donated..c.c..: int  12500 3250 4000 5000 6000 1000 1750 3000 11500 750 ...
 $ Months.since.First.Donation: int  98 28 35 45 77 4 14 35 98 4 ...
 $ Made.Donation.in.March.2007: int  1 1 1 1 0 0 1 0 1 0 ...
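
One detail visible in the output above: Total.Volume.Donated..c.c.. appears to be exactly 250 c.c. per donation, which would make it perfectly collinear with Number.of.Donations. A quick check (not part of the original analysis) could confirm this:

all(dataset$Total.Volume.Donated..c.c.. == 250 * dataset$Number.of.Donations) # TRUE if volume is a fixed multiple of the donation count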

We can derive some additional features from the existing variables: the total period over which a donor has been donating and the average number of months per donation.

dataset$TotalDonationPeriod <-dataset$Months.since.First.Donation - dataset$Months.since.Last.Donation
dataset$SingleDonationPeriod <-dataset$TotalDonationPeriod/dataset$Number.of.Donations
dataset$Months.since.First.Donation <- NULL

We further divide our dataset into train and test datasets.

index <- createDataPartition(dataset$Made.Donation.in.March.2007, p=0.7,list=F)
train <- dataset[index,] # train dataset
ts <- dataset[-index,] # test dataset
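
Note that no seed is set before the partition, so the split (and the numbers reported below) can vary between runs. For reproducibility, a seed could be set first; a minimal sketch (the seed value is arbitrary and was not used in the original run):

set.seed(123) # fix the random split so the partition is reproducible
index <- createDataPartition(dataset$Made.Donation.in.March.2007, p=0.7, list=F)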

Transforming the columns from integer to numeric format.

train[1:6] <- lapply(train[1:6], as.numeric)

Let’s see if there are outliers present in our predictors in the train dataset.

boxplot(train$Number.of.Donations)

boxplot(train$Total.Volume.Donated..c.c..)

boxplot(train$Months.since.Last.Donation)

We remove outliers from Months.since.Last.Donation using the iqr() function from anomalize (alpha controls how wide the "normal" band is, and max_anoms caps the proportion of observations that can be flagged).

train$Anomaly <- iqr(train$Months.since.Last.Donation, alpha = 0.07, max_anoms = 0.2)
table(iqr(train$Months.since.Last.Donation, alpha = 0.07, max_anoms = 0.2))

 No Yes 
403   1 
train <- train[train$Anomaly!="Yes",]
train$Anomaly <- NULL
boxplot(train$Months.since.Last.Donation)
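
The same flag-and-filter steps are repeated for each variable below, so they could be wrapped in a small helper. This is only a sketch (the function name is hypothetical and it is not used in the runs that follow):

# Flag outliers with anomalize::iqr() and drop the flagged rows
drop_iqr_outliers <- function(df, column, alpha = 0.07, max_anoms = 0.2) {
  flags <- iqr(df[[column]], alpha = alpha, max_anoms = max_anoms)
  df[flags != "Yes", , drop = FALSE]
}
# Example: train <- drop_iqr_outliers(train, "Number.of.Donations")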

Now we remove outliers from the Number.of.Donations variable.

train$Anomaly <- iqr(train$Number.of.Donations, alpha = 0.07, max_anoms = 0.2)
table(iqr(train$Number.of.Donations, alpha = 0.07, max_anoms = 0.2))

 No Yes 
393  10 
train <- train[train$Anomaly!="Yes",]
train$Anomaly <- NULL
boxplot(train$Number.of.Donations)

Remove outliers from the Total.Volume.Donated..c.c.. variable.

train$Anomaly <- iqr(train$Total.Volume.Donated..c.c.., alpha = 0.07, max_anoms = 0.2)
table(iqr(train$Total.Volume.Donated..c.c.., alpha = 0.07, max_anoms = 0.2))

 No 
393 
train <- train[train$Anomaly!="Yes",]
train$Anomaly <- NULL
boxplot(train$Total.Volume.Donated..c.c..)

We check whether there are any NAs in our dataset.

sum(is.na(train))
[1] 0

Now we remove the identifier column X, which we do not need for modelling.

train$X <- NULL

3 Model 1: Rpart using Grid Search with Cross Validation

We are going to use the train() function from the caret package to find and use the best hyperparameters for our method. As the metric we are going to use the log loss function, and as the method we are going to use rpart (CART).

We are also going to use cross-validation to tune our model.
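
For reference, the binary log loss that Metrics::logLoss computes is essentially the following; a minimal sketch, shown only to make the metric explicit (the clipping constant eps is our own safeguard):

# Binary log loss: heavily penalizes confident but wrong probability estimates
log_loss <- function(actual, predicted, eps = 1e-15) {
  p <- pmin(pmax(predicted, eps), 1 - eps) # clip probabilities away from 0 and 1
  -mean(actual * log(p) + (1 - actual) * log(1 - p))
}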

# Recode the target as a factor with valid level names ("yes"/"no"), as caret's classProbs requires
train$Made.Donation.in.March.2007 <- ifelse(train$Made.Donation.in.March.2007 == 1, "yes", "no")
train$Made.Donation.in.March.2007 <- as.factor(train$Made.Donation.in.March.2007)

We pass "cv" to the method argument of trainControl() to specify that we are going to use cross-validation.

fitControl <- trainControl(method = "cv",number = 10, classProbs = TRUE, summaryFunction = mnLogLoss)
set.seed(7) # for grid search we need to remove effect of randomization
fit_rpart2 <- train(Made.Donation.in.March.2007~.,
                    data=train,
                    method="rpart",
                    trControl = fitControl, 
                    metric="logLoss",
                    maximize = FALSE
)
print(fit_rpart2)
CART 

393 samples
  5 predictors
  2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 353, 354, 353, 354, 354, 354, ... 
Resampling results across tuning parameters:

  cp          logLoss  
  0.00000000  0.6050988
  0.01098901  0.5818781
  0.02930403  0.5286188

logLoss was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.02930403.
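
By default caret chooses a small set of cp values itself; an explicit grid could be supplied through the tuneGrid argument instead. A sketch with illustrative values (not used for the results above):

# Hypothetical explicit grid for the rpart complexity parameter cp
rpartGrid <- expand.grid(cp = seq(0, 0.05, by = 0.005))
# fit_rpart2 <- train(Made.Donation.in.March.2007 ~ ., data = train, method = "rpart",
#                     trControl = fitControl, metric = "logLoss", maximize = FALSE,
#                     tuneGrid = rpartGrid)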

3.1 Result

Now we are going to use our model to get the predicted probability values.

predict <- predict(fit_rpart2,ts,type="prob")
head(predict)
         no       yes
1 0.7684478 0.2315522
2 0.7684478 0.2315522
3 0.7684478 0.2315522
4 0.7684478 0.2315522
5 0.7684478 0.2315522
6 0.7684478 0.2315522
predictedGridSearch <- predict[,2]
logLoss(ts$Made.Donation.in.March.2007, predictedGridSearch)
[1] 0.5563002

4 Model 2: XGBoost

For XGBoost we need to supply our predictors as a numeric matrix and our dependent variable as a vector. But before we do that, we need to transform the values in Made.Donation.in.March.2007 back to numeric.

ts$X <- NULL # the X column was already removed from train earlier
train$Made.Donation.in.March.2007 <- ifelse(train$Made.Donation.in.March.2007 == "yes", 1, 0)
str(ts)
'data.frame':   172 obs. of  6 variables:
 $ Months.since.Last.Donation : int  2 1 4 2 5 2 4 1 2 2 ...
 $ Number.of.Donations        : int  20 24 4 7 46 10 6 14 8 12 ...
 $ Total.Volume.Donated..c.c..: int  5000 6000 1000 1750 11500 2500 1500 3500 2000 3000 ...
 $ Made.Donation.in.March.2007: int  1 0 0 1 1 1 0 0 1 1 ...
 $ TotalDonationPeriod        : int  43 76 0 12 93 26 10 57 26 45 ...
 $ SingleDonationPeriod       : num  2.15 3.17 0 1.71 2.02 ...
str(train)
'data.frame':   393 obs. of  6 variables:
 $ Months.since.Last.Donation : num  0 1 1 0 1 2 2 2 2 2 ...
 $ Number.of.Donations        : num  13 16 12 3 13 6 5 14 15 6 ...
 $ Total.Volume.Donated..c.c..: num  3250 4000 3000 750 3250 1500 1250 3500 3750 1500 ...
 $ Made.Donation.in.March.2007: num  1 1 0 0 0 1 1 1 1 1 ...
 $ TotalDonationPeriod        : num  28 34 34 4 46 13 9 46 47 13 ...
 $ SingleDonationPeriod       : num  2.15 2.12 2.83 1.33 3.54 ...
tr_matrix <- data.matrix(select(train,-c(Made.Donation.in.March.2007)))
ts_matrix <- data.matrix(select(ts,-c(Made.Donation.in.March.2007)))
train_target <- train$Made.Donation.in.March.2007
test_target <- ts$Made.Donation.in.March.2007

Creating xgb.DMatrix objects for train and test that bundle the feature matrix with the target variable.

dtrain <- xgb.DMatrix(data=tr_matrix,label=train_target)
ctest <- xgb.DMatrix(data=ts_matrix,label=test_target)

Creating a watchlist so that we can monitor the intermediate evaluation results during training.

watchlist = list(train=dtrain,test=ctest)

Training XGBoost with our parameters.

bst <- xgb.train(data=dtrain,
                 booster="gbtree",
                 watchlist = watchlist,
                 nrounds = 30000,
                 objective = "binary:logistic",
                 eval_metric="logloss",
                 maximize = F,
                 early_stopping_rounds = 10,
                 max_depth=10,
                 subsample=0.7,
                 colsample_bytree=0.7,
                 lambda=0.01,
                 alpha=0.0001
                 
)
[1] train-logloss:0.551250  test-logloss:0.585536 
Multiple eval metrics are present. Will use test_logloss for early stopping.
Will train until test_logloss hasn't improved in 10 rounds.

[2] train-logloss:0.489601  test-logloss:0.540814 
[3] train-logloss:0.433033  test-logloss:0.514940 
[4] train-logloss:0.396171  test-logloss:0.505360 
[5] train-logloss:0.375179  test-logloss:0.505753 
[6] train-logloss:0.353535  test-logloss:0.519028 
[7] train-logloss:0.336738  test-logloss:0.526354 
[8] train-logloss:0.325858  test-logloss:0.536506 
[9] train-logloss:0.317087  test-logloss:0.536040 
[10]    train-logloss:0.307157  test-logloss:0.534515 
[11]    train-logloss:0.300113  test-logloss:0.550196 
[12]    train-logloss:0.292402  test-logloss:0.547536 
[13]    train-logloss:0.283911  test-logloss:0.542848 
[14]    train-logloss:0.280164  test-logloss:0.563901 
Stopping. Best iteration:
[4] train-logloss:0.396171  test-logloss:0.505360
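
The test log loss stops improving after round 4, which suggests the model overfits quickly with these settings. The learning rate eta is left at its default of 0.3 above; a lower eta combined with more rounds is a common variant. A hedged sketch with illustrative values (not part of the reported run):

# Hypothetical slower-learning variant of the model above
bst_slow <- xgb.train(data = dtrain, watchlist = watchlist,
                      objective = "binary:logistic", eval_metric = "logloss",
                      nrounds = 2000, early_stopping_rounds = 50,
                      eta = 0.05, max_depth = 4,
                      subsample = 0.7, colsample_bytree = 0.7)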

4.1 Result

pred <- predict(bst,ctest)
logLoss(ts$Made.Donation.in.March.2007,pred)
[1] 0.5053602

5 Conclusion

Comparing the log loss values of the XGBoost model (0.505) and the rpart model tuned with grid search and cross-validation (0.556), we can see that XGBoost gave us the better result, since a lower log loss is better.

 
