Setup:

Here I am installing the necessary packages and loading the required libraries.

# Load standard libraries
library(tidyverse)
library(caTools)
library(pROC)
library(randomForest)

In this problem set we will use the dataset from Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming, “Knowledge discovery on RFM model using Bernoulli sequence,”Expert Systems with Applications, 2008 (doi:10.1016/j.eswa.2008.07.018). This dataset is currently being used for a competition on http://www.DrivenData.org. Information on the dataset and variables can be found at: https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.names.

# Load data called 'TransfusionData.csv'
donorData <- read.csv('TransfusionData.csv')
1a
str(donorData)
## 'data.frame':    748 obs. of  5 variables:
##  $ Recency..months.                          : num  2 0 1 2 1 4 2 1 2 5 ...
##  $ Frequency..times.                         : int  50 13 16 20 24 4 7 12 9 46 ...
##  $ Monetary..c.c..blood.                     : int  12500 3250 4000 5000 6000 1000 1750 3000 2250 11500 ...
##  $ Time..months.                             : num  98 28 35 45 77 4 14 35 22 98 ...
##  $ whether.he.she.donated.blood.in.March.2007: int  1 1 1 1 0 0 1 0 1 1 ...
1b

Preparing the data for easier processing.

donorData$whether.he.she.donated.blood.in.March.2007 <- as.factor(donorData$whether.he.she.donated.blood.in.March.2007)
n <- c("Recency", "Frequency", "Monetary", "Time", "donatedMar07")
colnames(donorData) <- n
1c

Providing some basic summary statistics to become more familiar with the dataset.

summary(donorData)
##     Recency         Frequency         Monetary          Time      
##  Min.   : 0.000   Min.   : 1.000   Min.   :  250   Min.   : 2.00  
##  1st Qu.: 2.750   1st Qu.: 2.000   1st Qu.:  500   1st Qu.:16.00  
##  Median : 7.000   Median : 4.000   Median : 1000   Median :28.00  
##  Mean   : 9.507   Mean   : 5.515   Mean   : 1379   Mean   :34.28  
##  3rd Qu.:14.000   3rd Qu.: 7.000   3rd Qu.: 1750   3rd Qu.:50.00  
##  Max.   :74.000   Max.   :50.000   Max.   :12500   Max.   :98.00  
##  donatedMar07
##  0:570       
##  1:178       
##              
##              
##              
## 

2

Evaluating the performance of a few different statistical learning methods. We will fit a particular statistical learning method on a set of observations and measure its performance on a set of observations.

2a

Discussing the advantages of using a training/test split when evaluating statistical models.

  • This method allows to curb overfitting of the data because only training data set is used for building the model. When eventually the model is run on the test data to predict the outcomes, this method avoids overfitting and is much closer to the real world applications.
2b

Splitting our data into a and set based on an 80-20 split, in other words, 80% of the observations will be in the training set.

# code adapted from https://rpubs.com/ID_Tech/S1 AND https://stackoverflow.com/a/31634462

# Set seed for reproducibility
set.seed(112718)

#splits the data in the ratio mentioned in SplitRatio. After splitting marks these rows as logical 
# TRUE and the the remaining are marked as logical FALSE
sample <- sample.split(donorData, SplitRatio = .8)

# creates a training dataset named train with rows which are marked as TRUE
donor_train <- subset(donorData, sample == TRUE)

# creates a training dataset named test with rows which are marked as FALSE
donor_test  <- subset(donorData, sample == FALSE)

3

In this problem our goal is to predict whether someone will donate blood in March 2007. First considering a simple logistic regression model for whether an individual donated blood in March 2007 based on frequency of donations.

3a

Fitting the model described above using the function in R.

model <- glm(donatedMar07 ~ Frequency, donor_train, family = "binomial")
3b

Describing the model summary.

summary(model)
## 
## Call:
## glm(formula = donatedMar07 ~ Frequency, family = "binomial", 
##     data = donor_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0932  -0.6946  -0.6181  -0.5941   1.9086  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.73127    0.14910 -11.611  < 2e-16 ***
## Frequency    0.08644    0.01798   4.806 1.54e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 641.23  on 597  degrees of freedom
## Residual deviance: 615.30  on 596  degrees of freedom
## AIC: 619.3
## 
## Number of Fisher Scoring iterations: 4
  • A logistic model is defined as the log of odds follows a linear relation.
  • Odds = Probability / (1-Probability)
  • In our case, Probability refers to the Probability of Donation.
  • Therefore, essentially our model tells us that log of probability of blood donation given the frequency of donation is equal to an intercept plus slope times the frequncy of donation.

  • Now, we can interpret the above model as:
  • log (odds of a person donating blood given the frequency of donations) = 1.731 - 0.087 * Frequency

  • Intuition:
  • Log odds of a person donating blood if the frequency of donation is zero = 1.731
  • Log odds of a person donating blood for a unit increase in frequency = -0.087

  • If we take the exponent of the intercept and the slope:

exp(model$coefficients)
## (Intercept)   Frequency 
##   0.1770586   1.0902858
  • This implies:
  • Log odds ratio of a person donating blood if the frequency of donation is zero = 5.65
  • Log odds ratio of a person donating blood for a unit increase in frequency = 0.917
  • That is to say that there is a 92% increase in the probability of donating blood a unit increase in frequency of the blood donations in the past.

4

Next, considering the performance of this model.

4a

Predicting donations in March 2007 for each observation in your test set using the model fit in 3a. Saving these predictions as y_hat.

y_hat <- predict(model, donor_test, type = "response")
4b

Using a threshold of 0.4 to classify predictions. Using a confusion matrix, determining the number of false positives on the test data.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
#results[1,1] <- confusionMatrix(testing$classe,predict.rpart)$overall[1]

#The confusion matrix is an mxm, where m is the number of classes to be predicted. For binary classification problems, the number of classes is 2, thus the confusion matrix will have 2 rows and columns.

# It is needed to choose a decision threshold t in order to label the instances as positives or negatives. If the probability assigned to the instance by the classifier is greater than t, it is labeled as positive and if the probability is lower than the decision threshold, it is labeled as negative.

#https://www.neuraldesigner.com/blog/methods-binary-classification

#In our example, the threshold is given as 0.4. Therefore, any probability above 0.4 is considered to be 1 otherwise 0.

#Translating the prediction probabilities based on threshold value into actual value
predictedY_hatValues <- y_hat
predictedY_hatValues[predictedY_hatValues > 0.4] = 1
predictedY_hatValues[predictedY_hatValues <= 0.4] = 0

#confusion matrix only works with "factor" type. Hence we need to coerce vectors into factor type
confusionMatrix(
  factor(donor_test$donatedMar07, levels=0:1), 
  factor(predictedY_hatValues, levels=0:1)
  )#$overall[[1]]
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 100   8
##          1  38   4
##                                           
##                Accuracy : 0.6933          
##                  95% CI : (0.6129, 0.7659)
##     No Information Rate : 0.92            
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0271          
##  Mcnemar's Test P-Value : 1.904e-05       
##                                           
##             Sensitivity : 0.72464         
##             Specificity : 0.33333         
##          Pos Pred Value : 0.92593         
##          Neg Pred Value : 0.09524         
##              Prevalence : 0.92000         
##          Detection Rate : 0.66667         
##    Detection Prevalence : 0.72000         
##       Balanced Accuracy : 0.52899         
##                                           
##        'Positive' Class : 0               
## 
#table(donor_test$donatedMar07, y_hat > 0.4)
  • False positives (FP) are values that are actually negatives and are classified as positives.
  • As above, FPs = 38
  • Therefore, as per our model, we classified 38 people as donors, but they were actually non-donors.
4c

Calculating the accuracy rate of your y_hat predictions.

cat("Accuracy = ", confusionMatrix(
  factor(donor_test$donatedMar07, levels=0:1), 
  factor(predictedY_hatValues, levels=0:1)
  )$overall[[1]] *100 ,"%")
## Accuracy =  69.33333 %
4d

Using the function to plot the ROC curve for this model.

ROC <- roc(donatedMar07~Frequency, data = donor_test)
summary(ROC)
##                    Length Class  Mode     
## percent              1    -none- logical  
## sensitivities       22    -none- numeric  
## specificities       22    -none- numeric  
## thresholds          22    -none- numeric  
## direction            1    -none- character
## cases               42    -none- numeric  
## controls           108    -none- numeric  
## fun.sesp             1    -none- function 
## auc                  1    auc    numeric  
## call                 3    -none- call     
## original.predictor 150    -none- numeric  
## original.response  150    factor numeric  
## predictor          150    -none- numeric  
## response           150    factor numeric  
## levels               2    -none- character
ROC
## 
## Call:
## roc.formula(formula = donatedMar07 ~ Frequency, data = donor_test)
## 
## Data: Frequency in 108 controls (donatedMar07 0) < 42 cases (donatedMar07 1).
## Area under the curve: 0.6183
plot(ROC)

* Sensitivity is the true positive rate and specificity is the true negative rate. * Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. * Since, the black curve is above the grey line, this means that the model has a higher true positive rate. That is to say that prediction from our model is more accurate. * Larger the area under the black curve (AUC), better the model is. The current AUC is about 62% which corresponds to a satisfactory model.

5

Suppose we use the data to construct a new predictor variable based on a donor’s average number of months between donations.

5a

Why might this be an interesting variable to help predict whether an individual will donate in March 2007?

  • This can be interesting in the sense that higher a donor’s average number of months between donations, lesser is the probability of him/her donating blood in March 2007
5b

A function to add this predictor to the full dataset.

addNewColumn <- function(){
 
  Data2 <- donorData
  Data2$month_span_between_donations <- 
    (donorData$Time - donorData$Recency) / donorData$Frequency
  
  return(Data2)
}
donorData <- addNewColumn()

Rerunning the train test split

# creates a training dataset named train with rows which are marked as TRUE
donor_train = subset(donorData, sample == TRUE)
# creates a training dataset named test with rows which are marked as FALSE
donor_test  = subset(donorData, sample == FALSE)
5c

Fitting a second logistic regression model including only this new feature.Comparing this model to the model in 3a?

model2 <- glm(donatedMar07 ~ month_span_between_donations, donor_train, family = "binomial")
summary(model2)
## 
## Call:
## glm(formula = donatedMar07 ~ month_span_between_donations, family = "binomial", 
##     data = donor_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7905  -0.7638  -0.6997  -0.4727   2.2152  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -1.00297    0.13284  -7.550 4.34e-14 ***
## month_span_between_donations -0.05592    0.02478  -2.257    0.024 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 641.23  on 597  degrees of freedom
## Residual deviance: 635.42  on 596  degrees of freedom
## AIC: 639.42
## 
## Number of Fisher Scoring iterations: 4
  • As compared to the Frequency variable in model1, the new variable in model2 is not as good a predictor and as significant. This can be ascertained on the nasis of their corresponding p-values.
5d

Repeating 4a and 4b for this new model.Interpreting this new confusion matrix.

y_hat2 <- predict(model2, donor_test, type = "response")

#Translating the prediction probabilities based on threshold value into actual value
predictedY_hatValues2 <- y_hat2
predictedY_hatValues2[predictedY_hatValues2 > 0.4] = 1
predictedY_hatValues2[predictedY_hatValues2 <= 0.4] = 0

#confusion matrix only works with "factor" type. Hence we need to coerce vectors into factor type
confusionMatrix(
  factor(donor_test$donatedMar07, levels=0:1), 
  factor(predictedY_hatValues2, levels=0:1)
  )#$overall[[1]]
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 108   0
##          1  42   0
##                                           
##                Accuracy : 0.72            
##                  95% CI : (0.6409, 0.7902)
##     No Information Rate : 1               
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0               
##  Mcnemar's Test P-Value : 2.509e-10       
##                                           
##             Sensitivity : 0.72            
##             Specificity :   NA            
##          Pos Pred Value :   NA            
##          Neg Pred Value :   NA            
##              Prevalence : 1.00            
##          Detection Rate : 0.72            
##    Detection Prevalence : 0.72            
##       Balanced Accuracy :   NA            
##                                           
##        'Positive' Class : 0               
## 
cat("Accuracy = ", confusionMatrix(
  factor(donor_test$donatedMar07, levels=0:1), 
  factor(predictedY_hatValues2, levels=0:1)
  )$overall[[1]] *100 ,"%")
## Accuracy =  72 %
5e

Use the function to fit a multiple logistic regression model with monetary and recency as predictors and makeing predictions for the test set.

model3 <- glm(donatedMar07 ~ Monetary+Recency, donor_train, family = "binomial")
summary(model3)
## 
## Call:
## glm(formula = donatedMar07 ~ Monetary + Recency, family = "binomial", 
##     data = donor_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1202  -0.7967  -0.4855  -0.2475   2.6460  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.014e-01  1.989e-01  -3.526 0.000422 ***
## Monetary     2.804e-04  7.403e-05   3.787 0.000152 ***
## Recency     -1.234e-01  1.934e-02  -6.382 1.75e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 641.23  on 597  degrees of freedom
## Residual deviance: 561.53  on 595  degrees of freedom
## AIC: 567.53
## 
## Number of Fisher Scoring iterations: 5
y_hat3 <- predict(model3, donor_test, type = "response")
5f

Calculating the accuracy rate of your y_hat3 predictions.

predictedY_hatValues3 <- y_hat3
predictedY_hatValues3[predictedY_hatValues3 > 0.4] = 1
predictedY_hatValues3[predictedY_hatValues3 <= 0.4] = 0

cat("Accuracy = ", confusionMatrix(
  factor(donor_test$donatedMar07, levels=0:1), 
  factor(predictedY_hatValues3, levels=0:1)
  )$overall[[1]] *100 ,"%")
## Accuracy =  70.66667 %
5f

Creating a correlation matrix for the donorData (without the donated March 2007 variable).

dat <- donorData[, -5]
cor(dat)
##                                  Recency   Frequency    Monetary      Time
## Recency                       1.00000000 -0.18274547 -0.18274547 0.1606181
## Frequency                    -0.18274547  1.00000000  1.00000000 0.6349403
## Monetary                     -0.18274547  1.00000000  1.00000000 0.6349403
## Time                          0.16061809  0.63494027  0.63494027 1.0000000
## month_span_between_donations -0.05144134  0.03177855  0.03177855 0.5924734
##                              month_span_between_donations
## Recency                                       -0.05144134
## Frequency                                      0.03177855
## Monetary                                       0.03177855
## Time                                           0.59247339
## month_span_between_donations                   1.00000000
  • From the above matrix it is clear that Monetary i.e., the volume of blood donated is positvely correlated with Frequency, that is the frequency of blood donations.

  • It is to be kept in mind that correlation does not neccesarily mean causation.

5g

We have a very limited set of variables in this datset from which to build a model. If you could add in new data or create other calculated variables to aid in model building what would you add and why?

I’d add the following two varibales:

  • Haemoglobin: A measure of the haemoglobin will determine if a person will be donating blood or not

  • Alcohol Consumed in last 48 hours: This will be a binary variable (1 for consumption of alcohol and 0 for non-consumption of alcohol). I’d include this variable because a person who has consumed alcohol recently is less likely to donate blood.

6

Another very popular classifier used in data science is called a .

6a

Using the function to fit a random forest model with monetary and recency as predictors. Making predictions for the test set using the random forest model.

model.rf <- randomForest::randomForest(donatedMar07 ~ Recency + Monetary, data = donor_test)
y_hat4 <- predict(model.rf, donor_test)

6b

Comparing the accuracy of each of the models from this problem set using ROC curves.

#Translating the prediction probabilities based on threshold value into actual value
predictedY_hatValues4 <- y_hat4
predictedY_hatValues4[predictedY_hatValues4 > 0.4] = 1
## Warning in Ops.factor(predictedY_hatValues4, 0.4): '>' not meaningful for
## factors
predictedY_hatValues4[predictedY_hatValues4 <= 0.4] = 0
## Warning in Ops.factor(predictedY_hatValues4, 0.4): '<=' not meaningful for
## factors
#confusion matrix only works with "factor" type. Hence we need to coerce vectors into factor type
confusionMatrix(
  factor(donor_test$donatedMar07, levels=0:1), 
  factor(predictedY_hatValues4, levels=0:1)
  )#$overall[[1]]
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 103   5
##          1  21  21
##                                           
##                Accuracy : 0.8267          
##                  95% CI : (0.7564, 0.8835)
##     No Information Rate : 0.8267          
##     P-Value [Acc > NIR] : 0.552140        
##                                           
##                   Kappa : 0.5135          
##  Mcnemar's Test P-Value : 0.003264        
##                                           
##             Sensitivity : 0.8306          
##             Specificity : 0.8077          
##          Pos Pred Value : 0.9537          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.8267          
##          Detection Rate : 0.6867          
##    Detection Prevalence : 0.7200          
##       Balanced Accuracy : 0.8192          
##                                           
##        'Positive' Class : 0               
## 
cat("Accuracy = ", confusionMatrix(
  factor(donor_test$donatedMar07, levels=0:1), 
  factor(predictedY_hatValues4, levels=0:1)
  )$overall[[1]] *100 ,"%")
## Accuracy =  82.66667 %
results <- data.frame(matrix(nrow = 2,ncol = 1))
row.names(results) <- c("Logistic Regression", "Random Forest")
colnames(results) <- c("Accuracy")
results[1,1] <- confusionMatrix(
  factor(donor_test$donatedMar07, levels=0:1), 
  factor(predictedY_hatValues, levels=0:1)
  )$overall[[1]]
results[2,1] <- confusionMatrix(
  factor(donor_test$donatedMar07, levels=0:1), 
  factor(predictedY_hatValues4, levels=0:1)
  )$overall[[1]]

results
ROC2 <- roc(donor_test$donatedMar07, model.rf$votes[,2])

plot(ROC, main = paste("ROC Curve for Logistic Regression AUC = ", auc(ROC)))

plot(ROC2, main = paste("ROC Curve for Random Forest AUC = ", auc(ROC2)))

* From above it is clear that the accuracy of a random forest is more than that of the logistic regression glm model.