In this paper, we consider one of the DrivenData competition problems - Blood Donation Prediction (https://www.drivendata.org/competitions/2/warm-up-predict-blood-donations/). In this problem we need to predict the probability that a donor who has previously made at least one donation will donate blood again.
library(caret) # library used for grid search
library(anomalize) # library for removing outliers
library(Metrics) # library for calculating logLoss
library(dplyr) # library for data engineering
library(xgboost) # library for XGBoost
To evaluate our model, we are going to use the data from the train.csv file as our dataset.
setwd("~/Dropbox/RProjects/bloodProject/")
dataset <-read.csv("train.csv")
Looking at the structure of our train dataset.
str(dataset)
'data.frame': 576 obs. of 6 variables:
$ X : int 619 664 441 160 358 335 47 164 736 436 ...
$ Months.since.Last.Donation : int 2 0 1 2 1 4 2 1 5 0 ...
$ Number.of.Donations : int 50 13 16 20 24 4 7 12 46 3 ...
$ Total.Volume.Donated..c.c..: int 12500 3250 4000 5000 6000 1000 1750 3000 11500 750 ...
$ Months.since.First.Donation: int 98 28 35 45 77 4 14 35 98 4 ...
$ Made.Donation.in.March.2007: int 1 1 1 1 0 0 1 0 1 0 ...
We can see that we can derive some further information by performing a few manipulations on the existing variables.
dataset$TotalDonationPeriod <-dataset$Months.since.First.Donation - dataset$Months.since.Last.Donation
dataset$SingleDonationPeriod <-dataset$TotalDonationPeriod/dataset$Number.of.Donations
dataset$Months.since.First.Donation <- NULL
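A quick sanity check on the engineered features (a minimal sketch; nothing below changes the data, and since every donor in this dataset has made at least one donation, SingleDonationPeriod is always defined):
# Inspect the derived features
summary(dataset[, c("TotalDonationPeriod", "SingleDonationPeriod")])
head(dataset[, c("TotalDonationPeriod", "SingleDonationPeriod")])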
We further divide our dataset into train and test datasets.
index <- createDataPartition(dataset$Made.Donation.in.March.2007, p=0.7,list=F)
train <- dataset[index,] # train dataset
ts <- dataset[-index,] # test dataset
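createDataPartition tries to preserve the outcome distribution across the splits; a quick check of the class balance (not part of the modelling pipeline itself):
prop.table(table(train$Made.Donation.in.March.2007)) # share of donors/non-donors in train
prop.table(table(ts$Made.Donation.in.March.2007))    # share of donors/non-donors in test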
Transforming data from int to numeric format.
train[1:6] <- lapply(train[1:6], as.numeric)
Let’s see if there are outliers present in our predictors in the train dataset.
boxplot(train$Months.since.Last.Donation)
boxplot(train$Number.of.Donations)
boxplot(train$Total.Volume.Donated..c.c..)
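anomalize::iqr() flags observations that fall outside an interquartile-range band around the quartiles. As a rough illustration of the idea, here is the conventional 1.5 * IQR boxplot rule in base R (a sketch only; this is not the exact band iqr() derives from alpha and max_anoms, and x, q and band are temporary helper names):
# Conventional boxplot rule: flag values further than 1.5 * IQR from the quartiles
x <- train$Months.since.Last.Donation
q <- quantile(x, c(0.25, 0.75))
band <- 1.5 * (q[2] - q[1])
table(x < q[1] - band | x > q[2] + band) # TRUE = would be flagged as an outlier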
We remove outliers from Months.since.Last.Donation.
train$Anomaly <- iqr(train$Months.since.Last.Donation, alpha = 0.07, max_anoms = 0.2)
table(train$Anomaly)
No Yes
403 1
train <- train[train$Anomaly!="Yes",]
train$Anomaly <- NULL
boxplot(train$Months.since.Last.Donation)
Now we remove outliers from Number.of.Donations variable.
train$Anomaly <- iqr(train$Number.of.Donations, alpha = 0.07, max_anoms = 0.2)
table(train$Anomaly)
No Yes
393 10
train <- train[train$Anomaly!="Yes",]
train$Anomaly <- NULL
boxplot(train$Number.of.Donations)
Finally, we remove outliers from the variable Total.Volume.Donated..c.c..
train$Anomaly <- iqr(train$Total.Volume.Donated..c.c.., alpha = 0.07, max_anoms = 0.2)
table(train$Anomaly)
No
393
train <- train[train$Anomaly!="Yes",]
train$Anomaly <- NULL
boxplot(train$Total.Volume.Donated..c.c..)
We check if we have NAs in our dataset.
sum(is.na(train))
[1] 0
Now we remove variables we do not need.
train$X <- NULL
We are going to use the train function from the caret library to find and use the best hyperparameters for our method. As a metric we are going to use the LogLoss function, and as a method we are going to use rpart. We are also going to use cross-validation to tune our model.
train$Made.Donation.in.March.2007 <- ifelse(train$Made.Donation.in.March.2007 == 1, "yes", "no")
train$Made.Donation.in.March.2007 <- as.factor(train$Made.Donation.in.March.2007) # caret needs a factor outcome with valid level names
We pass "cv" to the method parameter of trainControl to specify that we are going to use cross-validation.
fitControl <- trainControl(method = "cv",number = 10, classProbs = TRUE, summaryFunction = mnLogLoss)
set.seed(7) # for grid search we need to remove effect of randomization
fit_rpart2 <- train(Made.Donation.in.March.2007 ~ .,
                    data = train,
                    method = "rpart",
                    trControl = fitControl,
                    metric = "logLoss",
                    maximize = FALSE)
print(fit_rpart2)
CART
393 samples
5 predictors
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 353, 354, 353, 354, 354, 354, ...
Resampling results across tuning parameters:
cp logLoss
0.00000000 0.6050988
0.01098901 0.5818781
0.02930403 0.5286188
logLoss was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.02930403.
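Here caret tuned cp over its default grid of three values. If we wanted to search an explicit grid instead, we could pass one via tuneGrid; the cp values and object names below are purely illustrative, not the ones used above:
# Hypothetical explicit grid for the rpart complexity parameter cp
cpGrid <- expand.grid(cp = seq(0.001, 0.05, by = 0.005))
fit_rpart_grid <- train(Made.Donation.in.March.2007 ~ .,
                        data = train,
                        method = "rpart",
                        trControl = fitControl,
                        tuneGrid = cpGrid,
                        metric = "logLoss",
                        maximize = FALSE)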
Now we are going to use our model to get the predicted probability values.
pred_rpart <- predict(fit_rpart2, ts, type = "prob") # store probabilities without shadowing base R's predict()
head(pred_rpart)
no yes
1 0.7684478 0.2315522
2 0.7684478 0.2315522
3 0.7684478 0.2315522
4 0.7684478 0.2315522
5 0.7684478 0.2315522
6 0.7684478 0.2315522
predictedGridSearch <- pred_rpart[, 2] # predicted probability of class "yes"
logLoss(ts$Made.Donation.in.March.2007, predictedGridSearch)
[1] 0.5563002
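For reference, this is the standard binary log loss, -mean(y * log(p) + (1 - y) * log(1 - p)); computing it by hand should reproduce the value above (the clipping step below only matters for probabilities at exactly 0 or 1):
# Manual binary log loss; clip probabilities away from 0 and 1 to avoid log(0)
y <- ts$Made.Donation.in.March.2007
p <- pmin(pmax(predictedGridSearch, 1e-15), 1 - 1e-15)
-mean(y * log(p) + (1 - y) * log(1 - p))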
For XGBoost we need to present our predictors as a numeric matrix and our dependent variable as a vector. But before we do that, we need to transform the values in Made.Donation.in.March.2007 back to numeric.
ts$X <- NULL # drop the ID column from the test set (it was already removed from train above)
train$Made.Donation.in.March.2007 <- ifelse (train$Made.Donation.in.March.2007=="yes", 1, 0)
train$Made.Donation.in.March.2007 <- as.numeric(train$Made.Donation.in.March.2007)
str(ts)
'data.frame': 172 obs. of 6 variables:
$ Months.since.Last.Donation : int 2 1 4 2 5 2 4 1 2 2 ...
$ Number.of.Donations : int 20 24 4 7 46 10 6 14 8 12 ...
$ Total.Volume.Donated..c.c..: int 5000 6000 1000 1750 11500 2500 1500 3500 2000 3000 ...
$ Made.Donation.in.March.2007: int 1 0 0 1 1 1 0 0 1 1 ...
$ TotalDonationPeriod : int 43 76 0 12 93 26 10 57 26 45 ...
$ SingleDonationPeriod : num 2.15 3.17 0 1.71 2.02 ...
str(train)
'data.frame': 393 obs. of 6 variables:
$ Months.since.Last.Donation : num 0 1 1 0 1 2 2 2 2 2 ...
$ Number.of.Donations : num 13 16 12 3 13 6 5 14 15 6 ...
$ Total.Volume.Donated..c.c..: num 3250 4000 3000 750 3250 1500 1250 3500 3750 1500 ...
$ Made.Donation.in.March.2007: num 1 1 0 0 0 1 1 1 1 1 ...
$ TotalDonationPeriod : num 28 34 34 4 46 13 9 46 47 13 ...
$ SingleDonationPeriod : num 2.15 2.12 2.83 1.33 3.54 ...
tr_matrix <- data.matrix(select(train,-c(Made.Donation.in.March.2007)))
ts_matrix <- data.matrix(select(ts,-c(Made.Donation.in.March.2007)))
train_target <- train$Made.Donation.in.March.2007
test_target <- ts$Made.Donation.in.March.2007
Creating two xgb.DMatrix objects, for train and test, that store our feature matrix and the target variable.
dtrain <- xgb.DMatrix(data=tr_matrix,label=train_target)
ctest <- xgb.DMatrix(data=ts_matrix,label=test_target)
Creating a watchlist so we can see the intermediate results during training.
watchlist = list(train=dtrain,test=ctest)
Training the XGBoost model with our parameters.
bst <- xgb.train(data = dtrain,
                 booster = "gbtree",             # tree-based booster
                 watchlist = watchlist,          # report train and test logloss each round
                 nrounds = 30000,
                 objective = "binary:logistic",
                 eval_metric = "logloss",
                 maximize = FALSE,
                 early_stopping_rounds = 10,     # stop if test logloss has not improved for 10 rounds
                 max_depth = 10,
                 subsample = 0.7,                # row subsampling per tree
                 colsample_bytree = 0.7,         # column subsampling per tree
                 lambda = 0.01,                  # L2 regularization
                 alpha = 0.0001)                 # L1 regularization
[1] train-logloss:0.551250 test-logloss:0.585536
Multiple eval metrics are present. Will use test_logloss for early stopping.
Will train until test_logloss hasn't improved in 10 rounds.
[2] train-logloss:0.489601 test-logloss:0.540814
[3] train-logloss:0.433033 test-logloss:0.514940
[4] train-logloss:0.396171 test-logloss:0.505360
[5] train-logloss:0.375179 test-logloss:0.505753
[6] train-logloss:0.353535 test-logloss:0.519028
[7] train-logloss:0.336738 test-logloss:0.526354
[8] train-logloss:0.325858 test-logloss:0.536506
[9] train-logloss:0.317087 test-logloss:0.536040
[10] train-logloss:0.307157 test-logloss:0.534515
[11] train-logloss:0.300113 test-logloss:0.550196
[12] train-logloss:0.292402 test-logloss:0.547536
[13] train-logloss:0.283911 test-logloss:0.542848
[14] train-logloss:0.280164 test-logloss:0.563901
Stopping. Best iteration:
[4] train-logloss:0.396171 test-logloss:0.505360
pred <- predict(bst,ctest)
logLoss(ts$Made.Donation.in.March.2007,pred)
[1] 0.5053602
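Note that early stopping above monitors the held-out test set, so the choice of the number of rounds is informed by the same data we evaluate on. A sketch of picking nrounds by cross-validation on the training data alone with xgb.cv, reusing the same (assumed) parameters as above:
# 5-fold cross-validation on the training data only; cv$best_iteration then
# gives a choice of nrounds that does not look at the test set
cv <- xgb.cv(data = dtrain,
             nfold = 5,
             nrounds = 30000,
             objective = "binary:logistic",
             eval_metric = "logloss",
             maximize = FALSE,
             early_stopping_rounds = 10,
             max_depth = 10,
             subsample = 0.7,
             colsample_bytree = 0.7,
             lambda = 0.01,
             alpha = 0.0001,
             verbose = 0)
cv$best_iteration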
Comparing the LogLoss values of the XGBoost model and the rpart model tuned with grid search and cross-validation, we can see that XGBoost gave us the better (lower) result: 0.505 versus 0.556 on the test set.
A work by YOUR NAME
YOUREMAIL@gmail.com