For machine learning practice, I decided to run through a project from the Driven Data competition website. Specifically, I worked on the “Predict Blood Donations” competition, which comes from their warm-up collection and uses a simple dataset from the UCI Machine Learning Repository.
To get started, I load in the following packages.
# Load packages
library(readr)
library(caret)
library(dplyr)
library(car)
Next, I load the data and rename a few features (mainly so they fit on the HTML page here).
# Load data
raw.data <- read_csv("data/dd-train.csv")
names(raw.data) <- make.names(names(raw.data)) # Make names more R friendly
colnames(raw.data)[1] <- "id" # rename the column that has the ID
colnames(raw.data)[6] <- "Class" # rename the column that has our class (to make it fit on the page)
Next, I split the training data into a train set and a validation set.
# Split data into train set and validation set (80:20 split)
set.seed(7)
validationIndex <- createDataPartition(raw.data$Class, p=0.80, list=FALSE)
validation.set <- raw.data[-validationIndex,]
train.set <- raw.data[validationIndex,]
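Since createDataPartition samples with the aim of preserving the outcome distribution, the class balance should be roughly the same in both pieces. A quick check:
# Class proportions should be similar in the two splits
prop.table(table(train.set$Class))
prop.table(table(validation.set$Class))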
Let’s first take a look at some summary statistics.
dim(train.set)
## [1] 461 6
summary(train.set)
## id Months.since.Last.Donation Number.of.Donations
## Min. : 0.0 Min. : 0.000 Min. : 1.000
## 1st Qu.:194.0 1st Qu.: 2.000 1st Qu.: 2.000
## Median :386.0 Median : 6.000 Median : 4.000
## Mean :384.6 Mean : 9.488 Mean : 5.586
## 3rd Qu.:577.0 3rd Qu.:14.000 3rd Qu.: 7.000
## Max. :747.0 Max. :74.000 Max. :46.000
## Total.Volume.Donated..c.c.. Months.since.First.Donation Class
## Min. : 250 Min. : 2.00 Min. :0.0000
## 1st Qu.: 500 1st Qu.:16.00 1st Qu.:0.0000
## Median : 1000 Median :28.00 Median :0.0000
## Mean : 1396 Mean :34.88 Mean :0.2473
## 3rd Qu.: 1750 3rd Qu.:50.00 3rd Qu.:0.0000
## Max. :11500 Max. :98.00 Max. :1.0000
It would be a good idea to visualize the data so we can get a better sense of what it looks like. We will mostly use base graphics, since we want something quick and dirty; there is no need for fancy graphics here.
par(mfrow=c(2,2))
for(i in 2:5) {
  hist(train.set[[i]], main=names(train.set)[i]) # [[i]] extracts a plain numeric vector (works for tibbles too)
}
par(mfrow=c(2,2))
for(i in 2:5) {
  plot(density(train.set[[i]]), main=names(train.set)[i])
}
jittered_x <- sapply(train.set[,2:5], jitter)
pairs(jittered_x, labels=names(train.set)[2:5], col=train.set$Class+1)
It is a good idea to check whether any of the features are correlated, since highly correlated features may hurt our final model’s performance.
cor(train.set[,2:5]) # Check for correlations
## Months.since.Last.Donation Number.of.Donations
## Months.since.Last.Donation 1.0000000 -0.1554319
## Number.of.Donations -0.1554319 1.0000000
## Total.Volume.Donated..c.c.. -0.1554319 1.0000000
## Months.since.First.Donation 0.1830686 0.6228027
## Total.Volume.Donated..c.c..
## Months.since.Last.Donation -0.1554319
## Number.of.Donations 1.0000000
## Total.Volume.Donated..c.c.. 1.0000000
## Months.since.First.Donation 0.6228027
## Months.since.First.Donation
## Months.since.Last.Donation 0.1830686
## Number.of.Donations 0.6228027
## Total.Volume.Donated..c.c.. 0.6228027
## Months.since.First.Donation 1.0000000
Here we see that Number.of.Donations correlates perfectly with Total.Volume.Donated..c.c.. (each donation is a fixed 250 c.c.), which makes sense, so let’s remove the total volume feature.
train.set$Total.Volume.Donated..c.c.. <- NULL # remove correlated column
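For reference, caret can automate this kind of pruning with findCorrelation(); a minimal sketch, run on the original four predictor columns:
# Flag predictors whose pairwise correlation exceeds the cutoff
highCorr <- findCorrelation(cor(raw.data[, 2:5]), cutoff=0.9)
names(raw.data)[2:5][highCorr]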
I have decided to create two new features: one for the average number of donations per month of donor tenure, and one for the ratio of months since the last donation to months since the first.
train.set <- train.set %>% mutate(donations.per.month = Number.of.Donations/Months.since.First.Donation)
train.set <- train.set %>% mutate(tenure.ratio = Months.since.Last.Donation/Months.since.First.Donation)
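As a quick sanity check, tenure.ratio should fall between 0 and 1, since the last donation can never precede the first:
# Both engineered features should be finite; tenure.ratio should lie in [0, 1]
summary(train.set[, c("donations.per.month", "tenure.ratio")])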
This competition requires that submissions contain class probabilities, and caret needs the class levels to be valid R variable names when computing them, so we recode the class to Yes/No.
#Recode the class labels to Yes/No (required when using class probs)
required.labels <- train.set['Class']
recoded.labels <- recode(required.labels$Class, "0='No'; 1 = 'Yes'")
train.set$Class <- recoded.labels
train.set$Class <-as.factor(train.set$Class) # Make the class variable a factor
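One caveat: the string-style recode() syntax above comes from the car package (loaded after dplyr, so car::recode masks dplyr::recode). As a sketch, an equivalent base R alternative to the recoding steps above sidesteps the masking question entirely:
# Recode 0/1 directly to a No/Yes factor in base R
train.set$Class <- factor(train.set$Class, levels=c(0, 1), labels=c("No", "Yes"))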
Lastly, I remove the ID column since we won’t use it when creating the model.
train.set$id <- NULL #Remove id column
Now we are ready to start investigating some models.
First we set up our test harness, doing 10-fold cross-validation with 3 repeats. Since this competition is judged on logLoss, we set that as our metric.
# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", summaryFunction=mnLogLoss, number=10, repeats=3,
classProbs=TRUE)
metric <- "logLoss"
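For reference, the binary log loss being minimized is

$$\mathrm{logLoss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i)\right]$$

where $y_i$ is the true class (0 or 1) and $\hat{p}_i$ is the predicted probability of class 1; this is exactly what the LogLoss helper defined later computes.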
As a start, we will train Logistic Regression (glm), Linear Discriminant Analysis (lda), Regularized Logistic Regression (glmnet), Classification and Regression Trees (rpart), and a Support Vector Machine with a Radial Basis Function kernel (svmRadial), all via the caret package.
set.seed(7)
fit.glm <- train(Class~., data=train.set, method="glm", metric=metric, trControl=trainControl) # GLM
set.seed(7)
fit.lda <- train(Class~., data=train.set, method="lda", metric=metric, trControl=trainControl) # LDA
set.seed(7)
fit.glmnet <- train(Class~., data=train.set, method="glmnet", metric=metric,trControl=trainControl) # GLMNET
set.seed(7)
fit.cart <- train(Class~., data=train.set, method="rpart", metric=metric,trControl=trainControl) # CART
set.seed(7)
fit.svm <- train(Class~., data=train.set, method="svmRadial", metric=metric, trControl=trainControl) # SVM
Now let’s compare the performance of the algorithms.
# Compare algorithms
results <- resamples(list(LG=fit.glm, LDA=fit.lda, GLMNET=fit.glmnet, CART=fit.cart, SVM=fit.svm))
summary(results)
dotplot(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: LG, LDA, GLMNET, CART, SVM
## Number of resamples: 30
##
## logLoss
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LG 0.4386 0.4732 0.5120 0.5114 0.5402 0.6013 0
## LDA 0.4404 0.4698 0.5129 0.5121 0.5447 0.6096 0
## GLMNET 0.4502 0.4774 0.5145 0.5098 0.5319 0.5957 0
## CART 0.4284 0.5009 0.5490 0.6112 0.5927 1.3160 0
## SVM 0.4778 0.5109 0.5266 0.5268 0.5476 0.5837 0
The objective here is to minimize the logLoss, so Logistic Regression, Linear Discriminant Analysis, and Regularized Logistic Regression appear to perform best in this baseline, with GLMNET having the best performance (by a very small margin).
We saw earlier that there was some skewness in the data, so let’s see if a Box-Cox transformation will help.
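First, a quick numeric look at the skew (a minimal sketch using the skewness() function from the e1071 package):
# Quantify skewness of each numeric column
library(e1071)
sapply(Filter(is.numeric, train.set), skewness)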
preProcess="BoxCox" # pre-process with a Box-Cox transformation
# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", summaryFunction=mnLogLoss, number=10, repeats=3,
classProbs=TRUE)
metric <- "logLoss"
set.seed(7)
fit.glm <- train(Class~., data=train.set, method="glm", metric=metric, trControl=trainControl, preProc=preProcess) # GLM
set.seed(7)
fit.lda <- train(Class~., data=train.set, method="lda", metric=metric, trControl=trainControl, preProc=preProcess) # LDA
set.seed(7)
fit.glmnet <- train(Class~., data=train.set, method="glmnet", metric=metric,trControl=trainControl, preProc=preProcess) # GLMNET
set.seed(7)
fit.cart <- train(Class~., data=train.set, method="rpart", metric=metric,trControl=trainControl, preProc=preProcess) # CART
set.seed(7)
fit.svm <- train(Class~., data=train.set, method="svmRadial", metric=metric, trControl=trainControl, preProc=preProcess) # SVM
# Compare algorithms
results <- resamples(list(LG=fit.glm, LDA=fit.lda, GLMNET=fit.glmnet, CART=fit.cart, SVM=fit.svm))
summary(results)
dotplot(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: LG, LDA, GLMNET, CART, SVM
## Number of resamples: 30
##
## logLoss
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LG 0.4277 0.4691 0.5042 0.5051 0.5357 0.5896 0
## LDA 0.4445 0.4780 0.5040 0.5068 0.5313 0.5852 0
## GLMNET 0.4361 0.4697 0.5078 0.5047 0.5325 0.5931 0
## CART 0.4284 0.5009 0.5490 0.6112 0.5927 1.3160 0
## SVM 0.4689 0.5041 0.5256 0.5231 0.5452 0.5717 0
Here we see a slight improvement in the scores for GLM, LDA, and GLMNET, with GLMNET once again having the best performance (barely).
Next, let’s try some ensemble methods to see if they do any better.
The methods we will try are Random Forest, Stochastic Gradient Boosting, and C5.0, and we will use the same training control as before.
# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", summaryFunction=mnLogLoss, number=10, repeats=3,
classProbs=TRUE)
metric <- "logLoss"
preProcess = "BoxCox"
set.seed(7)
fit.rf <- train(Class~., data=train.set, method="rf", metric=metric, preProc=preProcess,
trControl=trainControl) # Random Forest
set.seed(7)
fit.gbm <- train(Class~., data=train.set, method="gbm", metric=metric, preProc=preProcess,
trControl=trainControl, verbose=FALSE) # Stochastic Gradient Boosting
set.seed(7)
fit.c50 <- train(Class~., data=train.set, method="C5.0", metric=metric, preProc=preProcess,
trControl=trainControl) # C5.0
# Compare results
ensembleResults <- resamples(list(RF=fit.rf, GBM=fit.gbm, C50=fit.c50))
summary(ensembleResults)
dotplot(ensembleResults)
##
## Call:
## summary.resamples(object = ensembleResults)
##
## Models: RF, GBM, C50
## Number of resamples: 30
##
## logLoss
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.4612 0.5928 0.7128 0.9276 1.2130 2.7900 0
## GBM 0.4237 0.4862 0.5077 0.5080 0.5372 0.5881 0
## C50 0.4285 0.5477 0.5503 0.5453 0.5683 0.5925 0
GBM is the best-performing ensemble method here, and it comes close to the GLMNET algorithm.
From here on we will concentrate on GLMNET, since it is a little faster to run.
To see if we can squeeze a little bit more performance out of the GLMNET, we will tweak the mixing percentage (alpha) and regularization parameter (lambda) and see what happens.
preProcess="BoxCox"
# 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", summaryFunction=mnLogLoss, number=10, repeats=3,
classProbs=TRUE)
metric <- "logLoss"
set.seed(7)
grid <- expand.grid(alpha=c((70:100)/100), lambda = c(0.001, 0.0009, 0.0008,0.0007))
fit.glmnet <- train(Class~., data=train.set, method="glmnet", metric=metric, tuneGrid=grid,
preProc=preProcess, trControl=trainControl)
plot(fit.glmnet)
best.alpha <- fit.glmnet$bestTune$alpha
best.lambda <- fit.glmnet$bestTune$lambda
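We can also read the selected parameters and their cross-validated score straight off the fitted object:
fit.glmnet$bestTune # selected alpha and lambda
min(fit.glmnet$results$logLoss) # cross-validated logLoss at the optimum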
So the tuned values that minimize the log loss are a mixing percentage (alpha) of 1 and a regularization parameter (lambda) of 8 × 10^-4 (i.e. 0.0008), which yields a very slight improvement over what we found earlier. We could spend more time tuning these further, but the payoff probably won’t be that great.
Now let’s see what we get when we predict on the hold-out set from earlier.
# Match our training set
validation.set$Total.Volume.Donated..c.c.. <- NULL # remove correlated column
# Redo the feature engineering
validation.set <- validation.set %>% mutate(donations.per.month = Number.of.Donations/Months.since.First.Donation)
validation.set <- validation.set %>% mutate(tenure.ratio = Months.since.Last.Donation/Months.since.First.Donation)
# Recode the class labels to Yes/No (required when using class probs)
required.labels <- validation.set['Class']
recoded.labels <- recode(required.labels$Class, "0='No'; 1 = 'Yes'")
validation.set$Class <- recoded.labels
validation.set$Class <-as.factor(validation.set$Class) # Make the class variable a factor
set.seed(7)
test.pred <- predict(fit.glmnet, newdata=validation.set[, c(2:4, 6:7)], type = "prob") # columns 2:4 and 6:7 are the predictors (drop id and Class)
# Function to calculate logLoss
LogLoss <- function(actual, predicted, eps=0.00001) {
  predicted <- pmin(pmax(predicted, eps), 1-eps) # clamp probabilities away from 0 and 1
  -1/length(actual)*(sum(actual*log(predicted)+(1-actual)*log(1-predicted)))
}
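A quick sanity check of the function: a confident correct prediction should score near zero, while a coin-flip 0.5 guess scores log(2):
LogLoss(c(1, 0), c(0.99999, 0.00001)) # ~0 (predictions are clamped by eps)
LogLoss(c(1, 0), c(0.5, 0.5)) # log(2), about 0.6931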
# To use the LogLoss function we need to recode the class labels back to 0 and 1
required.labels <- validation.set['Class']
recoded.labels <- recode(required.labels$Class, "'No'=0; 'Yes'=1")
validation.set$Class <- recoded.labels
# Now get logLoss on the validation set prediction
ll <- LogLoss(as.numeric(as.character(validation.set$Class)), test.pred$Yes)
So the logLoss on this validation set is 0.3718009, which is pretty good for our purposes.
Now we are ready to make our prediction on the test set and create a file to submit to Driven Data.
##### Predict on Test Set
test.data <- read_csv("data/dd-test.csv")
names(test.data) <- make.names(names(test.data)) # Make names more R friendly
test.data$Total.Volume.Donated..c.c.. <- NULL # remove this column (perfectly correlated with Number.of.Donations)
test.data <- test.data %>% mutate(donations.per.month = Number.of.Donations/Months.since.First.Donation)
test.data <- test.data %>% mutate(tenure.ratio = Months.since.Last.Donation/Months.since.First.Donation)
test.set <- test.data[,-1] # remove id column
# Make predictions
set.seed(7)
predictions <- predict(fit.glmnet, newdata=test.set, type = "prob")
pred.df <- as.data.frame(predictions$Yes)
pred.df$id <- test.data$NA. # the blank ID header in the csv became "NA." after make.names
pred.df <- pred.df[c(2,1)]
# Prepare data for writing to csv using the format from the submission format file
submission_format <- read.csv("data/BloodDonationSubmissionFormat.csv", check.names=FALSE)
colnames(pred.df) <- colnames(submission_format)
write.csv(pred.df, file="submission4.csv", row.names=FALSE )
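As a final check before uploading, we can read the file back and confirm the header matches the official format:
# Verify the written file's header matches the submission format file
out <- read.csv("submission4.csv", check.names=FALSE)
identical(colnames(out), colnames(submission_format))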
This gives a score of 0.4490 on Driven Data, which yields a rank of 165 out of roughly 1,000 participants.
Not bad.
By the way, the workflow I have used here is based on a template from Machine Learning Mastery, so check that site out for good information on getting started in machine learning.