For machine learning practice, I decided to run through a project from the Driven Data competition website. Specifically, I worked on the “Predict Blood Donations” competition, which comes from their warm-up collection and uses a simple dataset from the UCI Machine Learning Repository.
To get started, I load in the following packages.
# Load packages
library(readr)
library(caret)
library(dplyr)
library(car)
Next, I load the data and rename a few features (mainly so they fit on the HTML page here).
# Load data
raw.data <- read_csv("data/dd-train.csv")
names(raw.data) <- make.names(names(raw.data)) # Make names more R friendly
colnames(raw.data)[1] <- "id" # rename the column that has the ID
colnames(raw.data)[6] <- "Class" # rename the column that has our class (to make it fit on the page)
Next, I split the training data into a train set and a validation set.
# Split data into train set and validation set (80:20 split)
set.seed(7)
validationIndex <- createDataPartition(raw.data$Class, p=0.80, list=FALSE)
validation.set <- raw.data[-validationIndex,]
train.set <- raw.data[validationIndex,]
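Since createDataPartition samples with the aim of preserving the outcome distribution, the class balance should be roughly the same in both pieces. A quick check:
# Class proportions should be similar in the two splits
prop.table(table(train.set$Class))
prop.table(table(validation.set$Class))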
Let’s first take a look at some summary statistics.
dim(train.set)
## [1] 461 6
summary(train.set)
## id Months.since.Last.Donation Number.of.Donations
## Min. : 0.0 Min. : 0.000 Min. : 1.000
## 1st Qu.:194.0 1st Qu.: 2.000 1st Qu.: 2.000
## Median :386.0 Median : 6.000 Median : 4.000
## Mean :384.6 Mean : 9.488 Mean : 5.586
## 3rd Qu.:577.0 3rd Qu.:14.000 3rd Qu.: 7.000
## Max. :747.0 Max. :74.000 Max. :46.000
## Total.Volume.Donated..c.c.. Months.since.First.Donation Class
## Min. : 250 Min. : 2.00 Min. :0.0000
## 1st Qu.: 500 1st Qu.:16.00 1st Qu.:0.0000
## Median : 1000 Median :28.00 Median :0.0000
## Mean : 1396 Mean :34.88 Mean :0.2473
## 3rd Qu.: 1750 3rd Qu.:50.00 3rd Qu.:0.0000
## Max. :11500 Max. :98.00 Max. :1.0000
It would be a good idea to visualize the data so we can get a better sense of what it looks like. We will mostly use base graphics, since we want something quick and dirty; there is no need for fancy graphics here.
par(mfrow=c(2,2))
for(i in 2:5) {
  hist(train.set[[i]], main=names(train.set)[i]) # [[i]] extracts a plain numeric vector (works for tibbles too)
}
par(mfrow=c(2,2))
for(i in 2:5) {
  plot(density(train.set[[i]]), main=names(train.set)[i])
}
jittered_x <- sapply(train.set[,2:5], jitter)
pairs(jittered_x, labels=names(train.set)[2:5], col=train.set$Class+1)
It is a good idea to check whether any of the features are correlated, since highly correlated features may hurt our final model’s performance.
cor(train.set[,2:5]) # Check for correlations
## Months.since.Last.Donation Number.of.Donations
## Months.since.Last.Donation 1.0000000 -0.1554319
## Number.of.Donations -0.1554319 1.0000000
## Total.Volume.Donated..c.c.. -0.1554319 1.0000000
## Months.since.First.Donation 0.1830686 0.6228027
## Total.Volume.Donated..c.c..
## Months.since.Last.Donation -0.1554319
## Number.of.Donations 1.0000000
## Total.Volume.Donated..c.c.. 1.0000000
## Months.since.First.Donation 0.6228027
## Months.since.First.Donation
## Months.since.Last.Donation 0.1830686
## Number.of.Donations 0.6228027
## Total.Volume.Donated..c.c.. 0.6228027
## Months.since.First.Donation 1.0000000
Here we see that Number.of.Donations correlates perfectly with Total.Volume.Donated..c.c.. (each donation is a fixed 250 c.c.), which makes sense, so let’s remove the total volume feature.
train.set$Total.Volume.Donated..c.c.. <- NULL # remove correlated column
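For reference, caret can automate this kind of pruning with findCorrelation(); a minimal sketch, run on the original four predictor columns:
# Flag predictors whose pairwise correlation exceeds the cutoff
highCorr <- findCorrelation(cor(raw.data[, 2:5]), cutoff=0.9)
names(raw.data)[2:5][highCorr]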
I have decided to create two new features: one for the average number of donations per month of donor tenure, and one for the ratio of months since the last donation to months since the first.
train.set <- train.set %>% mutate(donations.per.month = Number.of.Donations/Months.since.First.Donation)
train.set <- train.set %>% mutate(tenure.ratio = Months.since.Last.Donation/Months.since.First.Donation)
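As a quick sanity check, tenure.ratio should fall between 0 and 1, since the last donation can never precede the first:
# Both engineered features should be finite; tenure.ratio should lie in [0, 1]
summary(train.set[, c("donations.per.month", "tenure.ratio")])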
This competition requires that submissions contain class probabilities, and caret needs the class levels to be valid R variable names when computing them, so we recode the class to Yes/No.
#Recode the class labels to Yes/No (required when using class probs)
required.labels <- train.set['Class']
recoded.labels <- recode(required.labels$Class, "0='No'; 1 = 'Yes'")
train.set$Class <- recoded.labels
train.set$Class <-as.factor(train.set$Class) # Make the class variable a factor
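One caveat: the string-style recode() syntax above comes from the car package (loaded after dplyr, so car::recode masks dplyr::recode). As a sketch, an equivalent base R alternative to the recoding steps above sidesteps the masking question entirely:
# Recode 0/1 directly to a No/Yes factor in base R
train.set$Class <- factor(train.set$Class, levels=c(0, 1), labels=c("No", "Yes"))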
Lastly, I remove the ID column since we won’t use it when creating the model.
train.set$id <- NULL #Remove id column
Now we are ready to start investigating some models.
First we set up our test harness, doing 10-fold cross-validation with 3 repeats. Since this competition is judged on logLoss, we set that as our metric.
# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", summaryFunction=mnLogLoss, number=10, repeats=3,
classProbs=TRUE)
metric <- "logLoss"
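For reference, the binary log loss being minimized is

$$\mathrm{logLoss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i)\right]$$

where $y_i$ is the true class (0 or 1) and $\hat{p}_i$ is the predicted probability of class 1; this is exactly what the LogLoss helper defined later computes.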
As a start, we will train Logistic Regression (glm), Linear Discriminant Analysis (lda), Regularized Logistic Regression (glmnet), Classification and Regression Trees (rpart), and a Support Vector Machine with a Radial Basis Function kernel (svmRadial), all via the caret package.
set.seed(7)
fit.glm <- train(Class~., data=train.set, method="glm", metric=metric, trControl=trainControl) # GLM
set.seed(7)
fit.lda <- train(Class~., data=train.set, method="lda", metric=metric, trControl=trainControl) # LDA
set.seed(7)
fit.glmnet <- train(Class~., data=train.set, method="glmnet", metric=metric,trControl=trainControl) # GLMNET
set.seed(7)
fit.cart <- train(Class~., data=train.set, method="rpart", metric=metric,trControl=trainControl) # CART
set.seed(7)
fit.svm <- train(Class~., data=train.set, method="svmRadial", metric=metric, trControl=trainControl) # SVM
Now let’s compare the performance of the algorithms.
# Compare algorithms
results <- resamples(list(LG=fit.glm, LDA=fit.lda, GLMNET=fit.glmnet, CART=fit.cart, SVM=fit.svm))
summary(results)
dotplot(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: LG, LDA, GLMNET, CART, SVM
## Number of resamples: 30
##
## logLoss
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LG 0.4386 0.4732 0.5120 0.5114 0.5402 0.6013 0
## LDA 0.4404 0.4698 0.5129 0.5121 0.5447 0.6096 0
## GLMNET 0.4502 0.4774 0.5145 0.5098 0.5319 0.5957 0
## CART 0.4284 0.5009 0.5490 0.6112 0.5927 1.3160 0
## SVM 0.4778 0.5109 0.5266 0.5268 0.5476 0.5837 0
The objective here is to minimize the logLoss, so Logistic Regression, Linear Discriminant Analysis, and Regularized Logistic Regression appear to perform best in this baseline, with GLMNET having the best performance (by a very small margin).
We saw earlier that there was some skewness in the data, so let’s see if a Box-Cox transformation will help.
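First, a quick numeric look at the skew (a minimal sketch using the skewness() function from the e1071 package):
# Quantify skewness of each numeric column
library(e1071)
sapply(Filter(is.numeric, train.set), skewness)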
preProcess="BoxCox" # pre-process with a Box-Cox transformation
# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", summaryFunction=mnLogLoss, number=10, repeats=3,
classProbs=TRUE)
metric <- "logLoss"
set.seed(7)
fit.glm <- train(Class~., data=train.set, method="glm", metric=metric, trControl=trainControl, preProc=preProcess) # GLM
set.seed(7)
fit.lda <- train(Class~., data=train.set, method="lda", metric=metric, trControl=trainControl, preProc=preProcess) # LDA
set.seed(7)
fit.glmnet <- train(Class~., data=train.set, method="glmnet", metric=metric,trControl=trainControl, preProc=preProcess) # GLMNET
set.seed(7)
fit.cart <- train(Class~., data=train.set, method="rpart", metric=metric,trControl=trainControl, preProc=preProcess) # CART
set.seed(7)
fit.svm <- train(Class~., data=train.set, method="svmRadial", metric=metric, trControl=trainControl, preProc=preProcess) # SVM
# Compare algorithms
results <- resamples(list(LG=fit.glm, LDA=fit.lda, GLMNET=fit.glmnet, CART=fit.cart, SVM=fit.svm))
summary(results)
dotplot(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: LG, LDA, GLMNET, CART, SVM
## Number of resamples: 30
##
## logLoss
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LG 0.4277 0.4691 0.5042 0.5051 0.5357 0.5896 0
## LDA 0.4445 0.4780 0.5040 0.5068 0.5313 0.5852 0
## GLMNET 0.4361 0.4697 0.5078 0.5047 0.5325 0.5931 0
## CART 0.4284 0.5009 0.5490 0.6112 0.5927 1.3160 0
## SVM 0.4689 0.5041 0.5256 0.5231 0.5452 0.5717 0
Here we see a slight improvement in the scores for GLM, LDA, and GLMNET, with GLMNET once again having the best performance (barely).
Next, let’s try some ensemble methods to see if they do any better.
The methods we will try are Random Forest, Stochastic Gradient Boosting, and C5.0, and we will use the same training control as before.
# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", summaryFunction=mnLogLoss, number=10, repeats=3,
classProbs=TRUE)
metric <- "logLoss"
preProcess = "BoxCox"
set.seed(7)
fit.rf <- train(Class~., data=train.set, method="rf", metric=metric, preProc=preProcess,
trControl=trainControl) # Random Forest
set.seed(7)
fit.gbm <- train(Class~., data=train.set, method="gbm", metric=metric, preProc=preProcess,
trControl=trainControl, verbose=FALSE) # Stochastic Gradient Boosting
set.seed(7)
fit.c50 <- train(Class~., data=train.set, method="C5.0", metric=metric, preProc=preProcess,
trControl=trainControl) # C5.0
# Compare results
ensembleResults <- resamples(list(RF=fit.rf, GBM=fit.gbm, C50=fit.c50))
summary(ensembleResults)
dotplot(ensembleResults)
##
## Call:
## summary.resamples(object = ensembleResults)
##
## Models: RF, GBM, C50
## Number of resamples: 30
##
## logLoss
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.4612 0.5928 0.7128 0.9276 1.2130 2.7900 0
## GBM 0.4237 0.4862 0.5077 0.5080 0.5372 0.5881 0
## C50 0.4285 0.5477 0.5503 0.5453 0.5683 0.5925 0
GBM is the best-performing ensemble method here, and it comes close to the GLMNET algorithm.
From here on we will concentrate on GLMNET, since it is a little faster to run.
To see if we can squeeze a little bit more performance out of the GLMNET, we will tweak the mixing percentage (alpha) and regularization parameter (lambda) and see what happens.
preProcess="BoxCox"
# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", summaryFunction=mnLogLoss, number=10, repeats=3,
classProbs=TRUE)
metric <- "logLoss"
set.seed(7)
grid <- expand.grid(alpha=c((70:100)/100), lambda = c(0.001, 0.0009, 0.0008,0.0007))
fit.glmnet <- train(Class~., data=train.set, method="glmnet", metric=metric, tuneGrid=grid,
preProc=preProcess, trControl=trainControl)
plot(fit.glmnet)
best.alpha <- fit.glmnet$bestTune$alpha
best.lambda <- fit.glmnet$bestTune$lambda
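We can also read the selected parameters and their cross-validated score straight off the fitted object:
fit.glmnet$bestTune # selected alpha and lambda
min(fit.glmnet$results$logLoss) # cross-validated logLoss at the optimum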
So the tuned values that minimize the log loss are a mixing percentage (alpha) of 1 and a regularization parameter (lambda) of 8 × 10^-4 (i.e. 0.0008), which yields a very slight improvement over what we found earlier. We could spend more time tuning these further, but the payoff probably won’t be that great.
Now let’s see what we get when we predict on the hold-out set from earlier.
# Match our training set
validation.set$Total.Volume.Donated..c.c.. <- NULL # remove correlated column
# Redo the feature engineering
validation.set <- validation.set %>% mutate(donations.per.month = Number.of.Donations/Months.since.First.Donation)
validation.set <- validation.set %>% mutate(tenure.ratio = Months.since.Last.Donation/Months.since.First.Donation)
# Recode the class labels to Yes/No (required when using class probs)
required.labels <- validation.set['Class']
recoded.labels <- recode(required.labels$Class, "0='No'; 1 = 'Yes'")
validation.set$Class <- recoded.labels
validation.set$Class <-as.factor(validation.set$Class) # Make the class variable a factor
set.seed(7)
test.pred <- predict(fit.glmnet, newdata=validation.set[, c(2:4, 6:7)], type = "prob") # columns 2:4 and 6:7 are the predictors (drop id and Class)
# Function to calculate logLoss
LogLoss <- function(actual, predicted, eps=0.00001) {
  predicted <- pmin(pmax(predicted, eps), 1-eps) # clamp probabilities away from 0 and 1
  -1/length(actual)*(sum(actual*log(predicted)+(1-actual)*log(1-predicted)))
}
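A quick sanity check of the function: a confident correct prediction should score near zero, while a coin-flip 0.5 guess scores log(2):
LogLoss(c(1, 0), c(0.99999, 0.00001)) # ~0 (predictions are clamped by eps)
LogLoss(c(1, 0), c(0.5, 0.5)) # log(2), about 0.6931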
# To use the LogLoss function we need to recode the class labels back to 0 and 1
required.labels <- validation.set['Class']
recoded.labels <- recode(required.labels$Class, "'No'=0; 'Yes'=1")
validation.set$Class <- recoded.labels
# Now get logLoss on the validation set prediction
ll <- LogLoss(as.numeric(as.character(validation.set$Class)), test.pred$Yes)
So the logLoss on this validation set is 0.3718009, which is pretty good for our purposes.
Now we are ready to make our prediction on the test set and create a file to submit to Driven Data.
##### Predict on Test Set
test.data <- read_csv("data/dd-test.csv")
names(test.data) <- make.names(names(test.data)) # Make names more R friendly
test.data$Total.Volume.Donated..c.c.. <- NULL # remove this column (perfectly correlated with Number.of.Donations)
test.data <- test.data %>% mutate(donations.per.month = Number.of.Donations/Months.since.First.Donation)
test.data <- test.data %>% mutate(tenure.ratio = Months.since.Last.Donation/Months.since.First.Donation)
test.set <- test.data[,-1] # remove id column
# Make predictions
set.seed(7)
predictions <- predict(fit.glmnet, newdata=test.set, type = "prob")
pred.df <- as.data.frame(predictions$Yes)
pred.df$id <- test.data$NA. # the blank ID header in the csv became "NA." after make.names
pred.df <- pred.df[c(2,1)]
# Prepare data for writing to csv using the format from the submission format file
submission_format <- read.csv("data/BloodDonationSubmissionFormat.csv", check.names=FALSE)
colnames(pred.df) <- colnames(submission_format)
write.csv(pred.df, file="submission4.csv", row.names=FALSE )
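As a final check before uploading, we can read the file back and confirm the header matches the official format:
# Verify the written file's header matches the submission format file
out <- read.csv("submission4.csv", check.names=FALSE)
identical(colnames(out), colnames(submission_format))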
This gives a score of 0.4490 on Driven Data, which yields a rank of 165 out of roughly 1,000 participants.
Not bad.
By the way, the workflow I have used here is based on a template from Machine Learning Mastery, so check that site out for good information on getting started in machine learning.