Here I am installing the necessary packages and loading the required libraries.
# Load standard libraries
library(tidyverse)
library(caTools)
library(pROC)
library(randomForest)
In this problem set we will use the dataset from Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming, “Knowledge discovery on RFM model using Bernoulli sequence,”Expert Systems with Applications, 2008 (doi:10.1016/j.eswa.2008.07.018). This dataset is currently being used for a competition on http://www.DrivenData.org. Information on the dataset and variables can be found at: https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.names.
# Load data called 'TransfusionData.csv'
donorData <- read.csv('TransfusionData.csv')
str(donorData)
## 'data.frame': 748 obs. of 5 variables:
## $ Recency..months. : num 2 0 1 2 1 4 2 1 2 5 ...
## $ Frequency..times. : int 50 13 16 20 24 4 7 12 9 46 ...
## $ Monetary..c.c..blood. : int 12500 3250 4000 5000 6000 1000 1750 3000 2250 11500 ...
## $ Time..months. : num 98 28 35 45 77 4 14 35 22 98 ...
## $ whether.he.she.donated.blood.in.March.2007: int 1 1 1 1 0 0 1 0 1 1 ...
Preparing the data for easier processing.
donorData$whether.he.she.donated.blood.in.March.2007 <- as.factor(donorData$whether.he.she.donated.blood.in.March.2007)
n <- c("Recency", "Frequency", "Monetary", "Time", "donatedMar07")
colnames(donorData) <- n
Providing some basic summary statistics to become more familiar with the dataset.
summary(donorData)
## Recency Frequency Monetary Time
## Min. : 0.000 Min. : 1.000 Min. : 250 Min. : 2.00
## 1st Qu.: 2.750 1st Qu.: 2.000 1st Qu.: 500 1st Qu.:16.00
## Median : 7.000 Median : 4.000 Median : 1000 Median :28.00
## Mean : 9.507 Mean : 5.515 Mean : 1379 Mean :34.28
## 3rd Qu.:14.000 3rd Qu.: 7.000 3rd Qu.: 1750 3rd Qu.:50.00
## Max. :74.000 Max. :50.000 Max. :12500 Max. :98.00
## donatedMar07
## 0:570
## 1:178
##
##
##
##
Evaluating the performance of a few different statistical learning methods. We will fit a particular statistical learning method on a set of observations and measure its performance on a set of observations.
Discussing the advantages of using a training/test split when evaluating statistical models.
Splitting our data into a and set based on an 80-20 split, in other words, 80% of the observations will be in the training set.
# code adapted from https://rpubs.com/ID_Tech/S1 AND https://stackoverflow.com/a/31634462
# Set seed for reproducibility
set.seed(112718)
#splits the data in the ratio mentioned in SplitRatio. After splitting marks these rows as logical
# TRUE and the the remaining are marked as logical FALSE
sample <- sample.split(donorData, SplitRatio = .8)
# creates a training dataset named train with rows which are marked as TRUE
donor_train <- subset(donorData, sample == TRUE)
# creates a training dataset named test with rows which are marked as FALSE
donor_test <- subset(donorData, sample == FALSE)
In this problem our goal is to predict whether someone will donate blood in March 2007. First considering a simple logistic regression model for whether an individual donated blood in March 2007 based on frequency of donations.
Fitting the model described above using the function in R.
model <- glm(donatedMar07 ~ Frequency, donor_train, family = "binomial")
Describing the model summary.
summary(model)
##
## Call:
## glm(formula = donatedMar07 ~ Frequency, family = "binomial",
## data = donor_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0932 -0.6946 -0.6181 -0.5941 1.9086
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.73127 0.14910 -11.611 < 2e-16 ***
## Frequency 0.08644 0.01798 4.806 1.54e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 641.23 on 597 degrees of freedom
## Residual deviance: 615.30 on 596 degrees of freedom
## AIC: 619.3
##
## Number of Fisher Scoring iterations: 4
Therefore, essentially our model tells us that log of probability of blood donation given the frequency of donation is equal to an intercept plus slope times the frequncy of donation.
log (odds of a person donating blood given the frequency of donations) = 1.731 - 0.087 * Frequency
Log odds of a person donating blood for a unit increase in frequency = -0.087
If we take the exponent of the intercept and the slope:
exp(model$coefficients)
## (Intercept) Frequency
## 0.1770586 1.0902858
Next, considering the performance of this model.
Predicting donations in March 2007 for each observation in your test set using the model fit in 3a. Saving these predictions as y_hat.
y_hat <- predict(model, donor_test, type = "response")
Using a threshold of 0.4 to classify predictions. Using a confusion matrix, determining the number of false positives on the test data.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
#results[1,1] <- confusionMatrix(testing$classe,predict.rpart)$overall[1]
#The confusion matrix is an mxm, where m is the number of classes to be predicted. For binary classification problems, the number of classes is 2, thus the confusion matrix will have 2 rows and columns.
# It is needed to choose a decision threshold t in order to label the instances as positives or negatives. If the probability assigned to the instance by the classifier is greater than t, it is labeled as positive and if the probability is lower than the decision threshold, it is labeled as negative.
#https://www.neuraldesigner.com/blog/methods-binary-classification
#In our example, the threshold is given as 0.4. Therefore, any probability above 0.4 is considered to be 1 otherwise 0.
#Translating the prediction probabilities based on threshold value into actual value
predictedY_hatValues <- y_hat
predictedY_hatValues[predictedY_hatValues > 0.4] = 1
predictedY_hatValues[predictedY_hatValues <= 0.4] = 0
#confusion matrix only works with "factor" type. Hence we need to coerce vectors into factor type
confusionMatrix(
factor(donor_test$donatedMar07, levels=0:1),
factor(predictedY_hatValues, levels=0:1)
)#$overall[[1]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 100 8
## 1 38 4
##
## Accuracy : 0.6933
## 95% CI : (0.6129, 0.7659)
## No Information Rate : 0.92
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0271
## Mcnemar's Test P-Value : 1.904e-05
##
## Sensitivity : 0.72464
## Specificity : 0.33333
## Pos Pred Value : 0.92593
## Neg Pred Value : 0.09524
## Prevalence : 0.92000
## Detection Rate : 0.66667
## Detection Prevalence : 0.72000
## Balanced Accuracy : 0.52899
##
## 'Positive' Class : 0
##
#table(donor_test$donatedMar07, y_hat > 0.4)
Calculating the accuracy rate of your y_hat predictions.
cat("Accuracy = ", confusionMatrix(
factor(donor_test$donatedMar07, levels=0:1),
factor(predictedY_hatValues, levels=0:1)
)$overall[[1]] *100 ,"%")
## Accuracy = 69.33333 %
Using the function to plot the ROC curve for this model.
ROC <- roc(donatedMar07~Frequency, data = donor_test)
summary(ROC)
## Length Class Mode
## percent 1 -none- logical
## sensitivities 22 -none- numeric
## specificities 22 -none- numeric
## thresholds 22 -none- numeric
## direction 1 -none- character
## cases 42 -none- numeric
## controls 108 -none- numeric
## fun.sesp 1 -none- function
## auc 1 auc numeric
## call 3 -none- call
## original.predictor 150 -none- numeric
## original.response 150 factor numeric
## predictor 150 -none- numeric
## response 150 factor numeric
## levels 2 -none- character
ROC
##
## Call:
## roc.formula(formula = donatedMar07 ~ Frequency, data = donor_test)
##
## Data: Frequency in 108 controls (donatedMar07 0) < 42 cases (donatedMar07 1).
## Area under the curve: 0.6183
plot(ROC)
* Sensitivity is the true positive rate and specificity is the true negative rate. * Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. * Since, the black curve is above the grey line, this means that the model has a higher true positive rate. That is to say that prediction from our model is more accurate. * Larger the area under the black curve (AUC), better the model is. The current AUC is about 62% which corresponds to a satisfactory model.
Suppose we use the data to construct a new predictor variable based on a donor’s average number of months between donations.
Why might this be an interesting variable to help predict whether an individual will donate in March 2007?
A function to add this predictor to the full dataset.
addNewColumn <- function(){
Data2 <- donorData
Data2$month_span_between_donations <-
(donorData$Time - donorData$Recency) / donorData$Frequency
return(Data2)
}
donorData <- addNewColumn()
Rerunning the train test split
# creates a training dataset named train with rows which are marked as TRUE
donor_train = subset(donorData, sample == TRUE)
# creates a training dataset named test with rows which are marked as FALSE
donor_test = subset(donorData, sample == FALSE)
Fitting a second logistic regression model including only this new feature.Comparing this model to the model in 3a?
model2 <- glm(donatedMar07 ~ month_span_between_donations, donor_train, family = "binomial")
summary(model2)
##
## Call:
## glm(formula = donatedMar07 ~ month_span_between_donations, family = "binomial",
## data = donor_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7905 -0.7638 -0.6997 -0.4727 2.2152
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.00297 0.13284 -7.550 4.34e-14 ***
## month_span_between_donations -0.05592 0.02478 -2.257 0.024 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 641.23 on 597 degrees of freedom
## Residual deviance: 635.42 on 596 degrees of freedom
## AIC: 639.42
##
## Number of Fisher Scoring iterations: 4
Repeating 4a and 4b for this new model.Interpreting this new confusion matrix.
y_hat2 <- predict(model2, donor_test, type = "response")
#Translating the prediction probabilities based on threshold value into actual value
predictedY_hatValues2 <- y_hat2
predictedY_hatValues2[predictedY_hatValues2 > 0.4] = 1
predictedY_hatValues2[predictedY_hatValues2 <= 0.4] = 0
#confusion matrix only works with "factor" type. Hence we need to coerce vectors into factor type
confusionMatrix(
factor(donor_test$donatedMar07, levels=0:1),
factor(predictedY_hatValues2, levels=0:1)
)#$overall[[1]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 108 0
## 1 42 0
##
## Accuracy : 0.72
## 95% CI : (0.6409, 0.7902)
## No Information Rate : 1
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
## Mcnemar's Test P-Value : 2.509e-10
##
## Sensitivity : 0.72
## Specificity : NA
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : 1.00
## Detection Rate : 0.72
## Detection Prevalence : 0.72
## Balanced Accuracy : NA
##
## 'Positive' Class : 0
##
cat("Accuracy = ", confusionMatrix(
factor(donor_test$donatedMar07, levels=0:1),
factor(predictedY_hatValues2, levels=0:1)
)$overall[[1]] *100 ,"%")
## Accuracy = 72 %
Use the function to fit a multiple logistic regression model with monetary and recency as predictors and makeing predictions for the test set.
model3 <- glm(donatedMar07 ~ Monetary+Recency, donor_train, family = "binomial")
summary(model3)
##
## Call:
## glm(formula = donatedMar07 ~ Monetary + Recency, family = "binomial",
## data = donor_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1202 -0.7967 -0.4855 -0.2475 2.6460
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.014e-01 1.989e-01 -3.526 0.000422 ***
## Monetary 2.804e-04 7.403e-05 3.787 0.000152 ***
## Recency -1.234e-01 1.934e-02 -6.382 1.75e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 641.23 on 597 degrees of freedom
## Residual deviance: 561.53 on 595 degrees of freedom
## AIC: 567.53
##
## Number of Fisher Scoring iterations: 5
y_hat3 <- predict(model3, donor_test, type = "response")
Calculating the accuracy rate of your y_hat3 predictions.
predictedY_hatValues3 <- y_hat3
predictedY_hatValues3[predictedY_hatValues3 > 0.4] = 1
predictedY_hatValues3[predictedY_hatValues3 <= 0.4] = 0
cat("Accuracy = ", confusionMatrix(
factor(donor_test$donatedMar07, levels=0:1),
factor(predictedY_hatValues3, levels=0:1)
)$overall[[1]] *100 ,"%")
## Accuracy = 70.66667 %
Creating a correlation matrix for the donorData (without the donated March 2007 variable).
dat <- donorData[, -5]
cor(dat)
## Recency Frequency Monetary Time
## Recency 1.00000000 -0.18274547 -0.18274547 0.1606181
## Frequency -0.18274547 1.00000000 1.00000000 0.6349403
## Monetary -0.18274547 1.00000000 1.00000000 0.6349403
## Time 0.16061809 0.63494027 0.63494027 1.0000000
## month_span_between_donations -0.05144134 0.03177855 0.03177855 0.5924734
## month_span_between_donations
## Recency -0.05144134
## Frequency 0.03177855
## Monetary 0.03177855
## Time 0.59247339
## month_span_between_donations 1.00000000
From the above matrix it is clear that Monetary i.e., the volume of blood donated is positvely correlated with Frequency, that is the frequency of blood donations.
It is to be kept in mind that correlation does not neccesarily mean causation.
We have a very limited set of variables in this datset from which to build a model. If you could add in new data or create other calculated variables to aid in model building what would you add and why?
I’d add the following two varibales:
Haemoglobin: A measure of the haemoglobin will determine if a person will be donating blood or not
Alcohol Consumed in last 48 hours: This will be a binary variable (1 for consumption of alcohol and 0 for non-consumption of alcohol). I’d include this variable because a person who has consumed alcohol recently is less likely to donate blood.
Another very popular classifier used in data science is called a .
Using the function to fit a random forest model with monetary and recency as predictors. Making predictions for the test set using the random forest model.
model.rf <- randomForest::randomForest(donatedMar07 ~ Recency + Monetary, data = donor_test)
y_hat4 <- predict(model.rf, donor_test)
Comparing the accuracy of each of the models from this problem set using ROC curves.
#Translating the prediction probabilities based on threshold value into actual value
predictedY_hatValues4 <- y_hat4
predictedY_hatValues4[predictedY_hatValues4 > 0.4] = 1
## Warning in Ops.factor(predictedY_hatValues4, 0.4): '>' not meaningful for
## factors
predictedY_hatValues4[predictedY_hatValues4 <= 0.4] = 0
## Warning in Ops.factor(predictedY_hatValues4, 0.4): '<=' not meaningful for
## factors
#confusion matrix only works with "factor" type. Hence we need to coerce vectors into factor type
confusionMatrix(
factor(donor_test$donatedMar07, levels=0:1),
factor(predictedY_hatValues4, levels=0:1)
)#$overall[[1]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 103 5
## 1 21 21
##
## Accuracy : 0.8267
## 95% CI : (0.7564, 0.8835)
## No Information Rate : 0.8267
## P-Value [Acc > NIR] : 0.552140
##
## Kappa : 0.5135
## Mcnemar's Test P-Value : 0.003264
##
## Sensitivity : 0.8306
## Specificity : 0.8077
## Pos Pred Value : 0.9537
## Neg Pred Value : 0.5000
## Prevalence : 0.8267
## Detection Rate : 0.6867
## Detection Prevalence : 0.7200
## Balanced Accuracy : 0.8192
##
## 'Positive' Class : 0
##
cat("Accuracy = ", confusionMatrix(
factor(donor_test$donatedMar07, levels=0:1),
factor(predictedY_hatValues4, levels=0:1)
)$overall[[1]] *100 ,"%")
## Accuracy = 82.66667 %
results <- data.frame(matrix(nrow = 2,ncol = 1))
row.names(results) <- c("Logistic Regression", "Random Forest")
colnames(results) <- c("Accuracy")
results[1,1] <- confusionMatrix(
factor(donor_test$donatedMar07, levels=0:1),
factor(predictedY_hatValues, levels=0:1)
)$overall[[1]]
results[2,1] <- confusionMatrix(
factor(donor_test$donatedMar07, levels=0:1),
factor(predictedY_hatValues4, levels=0:1)
)$overall[[1]]
results
ROC2 <- roc(donor_test$donatedMar07, model.rf$votes[,2])
plot(ROC, main = paste("ROC Curve for Logistic Regression AUC = ", auc(ROC)))
plot(ROC2, main = paste("ROC Curve for Random Forest AUC = ", auc(ROC2)))
* From above it is clear that the accuracy of a random forest is more than that of the logistic regression glm model.