R (Recency - months since last donation)
F (Frequency - total number of donation)
M (Monetary - total blood donated in c.c.)
T (Time - months since first donation) and
a binary variable representing whether he/she donated blood in March 2007 (1 stand for donating blood; 0 stands for not donating blood).
R (Recency - months since last donation)
F (Frequency - total number of donation)
M (Monetary - total blood donated in c.c.)
T (Time - months since first donation)
Dataset [URL] (https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center)
trainset <- read.csv("trainblood.csv", header =T, na.strings=c("","NA"))
testset <- read.csv("testblood.csv", header =T, na.strings=c("","NA"))
#remove X since it is the donor
trainset1 <-trainset [c(-1)]
str(trainset1)
'data.frame': 576 obs. of 5 variables:
$ months_since_last_donations : int 2 0 1 2 1 4 2 1 5 0 ...
$ number_of_donations : int 50 13 16 20 24 4 7 12 46 3 ...
$ volume_donated : int 12500 3250 4000 5000 6000 1000 1750 3000 11500 750 ...
$ months_since_first_donations: int 98 28 35 45 77 4 14 35 98 4 ...
$ donation_made_March_2007 : int 1 1 1 1 0 0 1 0 1 0 ...
no_model<-glm(donation_made_March_2007~1,data = trainset1, family=binomial)
summary(no_model)
Call:
glm(formula = donation_made_March_2007 ~ 1, family = binomial,
data = trainset1)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7401 -0.7401 -0.7401 -0.7401 1.6905
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.15497 0.09762 -11.83 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 634.29 on 575 degrees of freedom
Residual deviance: 634.29 on 575 degrees of freedom
AIC: 636.29
Number of Fisher Scoring iterations: 4
full_model <-glm(donation_made_March_2007~.,data = trainset1, family=binomial)
summary(full_model)
Call:
glm(formula = donation_made_March_2007 ~ ., family = binomial,
data = trainset1)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5102 -0.8079 -0.5273 -0.2427 2.5545
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.585643 0.201818 -2.902 0.00371 **
months_since_last_donations -0.091026 0.018955 -4.802 1.57e-06 ***
number_of_donations 0.129921 0.029102 4.464 8.03e-06 ***
volume_donated NA NA NA NA
months_since_first_donations -0.018797 0.006588 -2.853 0.00433 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 634.29 on 575 degrees of freedom
Residual deviance: 556.61 on 572 degrees of freedom
AIC: 564.61
Number of Fisher Scoring iterations: 5
Observe that the no_model AIC = 636.29, while the full_model AIC is 564.61
Observe the full_model on variable volumn_donated is NA, so we should remove the independent variable manually. Look for a better efficient way on next command run which it will do variable selection for best model.
# shows last equation when run this model - iterative
logistic_steps = step(no_model, scope=list(lower=formula(no_model), upper=formula(full_model)),direction="both",family=binomial)
Start: AIC=636.29
donation_made_March_2007 ~ 1
Df Deviance AIC
+ months_since_last_donations 1 582.69 586.69
+ number_of_donations 1 609.32 613.32
+ volume_donated 1 609.32 613.32
<none> 634.29 636.29
+ months_since_first_donations 1 634.06 638.06
Step: AIC=586.69
donation_made_March_2007 ~ months_since_last_donations
Df Deviance AIC
+ number_of_donations 1 565.56 571.56
+ volume_donated 1 565.56 571.56
<none> 582.69 586.69
+ months_since_first_donations 1 582.07 588.07
- months_since_last_donations 1 634.29 636.29
Step: AIC=571.56
donation_made_March_2007 ~ months_since_last_donations + number_of_donations
Df Deviance AIC
+ months_since_first_donations 1 556.61 564.61
<none> 565.56 571.56
- number_of_donations 1 582.69 586.69
- months_since_last_donations 1 609.32 613.32
Step: AIC=564.61
donation_made_March_2007 ~ months_since_last_donations + number_of_donations +
months_since_first_donations
Df Deviance AIC
<none> 556.61 564.61
- months_since_first_donations 1 565.56 571.56
- number_of_donations 1 582.07 588.07
- months_since_last_donations 1 583.03 589.03
The step command runs from no model to full model and attempts to generate a best model. Look at how the AIC numbers starts from 636.9 and slowly drops to last AIC = 564.61. If you have many independent variables, it will helps you select the variables efficiently.
So the best model is below.
Select_model<-glm(donation_made_March_2007 ~ months_since_last_donations + number_of_donations + months_since_first_donations ,data=trainset1,
family=binomial)
summary(Select_model)
Call:
glm(formula = donation_made_March_2007 ~ months_since_last_donations +
number_of_donations + months_since_first_donations, family = binomial,
data = trainset1)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5102 -0.8079 -0.5273 -0.2427 2.5545
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.585643 0.201818 -2.902 0.00371 **
months_since_last_donations -0.091026 0.018955 -4.802 1.57e-06 ***
number_of_donations 0.129921 0.029102 4.464 8.03e-06 ***
months_since_first_donations -0.018797 0.006588 -2.853 0.00433 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 634.29 on 575 degrees of freedom
Residual deviance: 556.61 on 572 degrees of freedom
AIC: 564.61
Number of Fisher Scoring iterations: 5
suppressWarnings(suppressMessages(library(car)))
vif(Select_model)
months_since_last_donations number_of_donations months_since_first_donations
1.084934 2.121902 2.203637
# vif value not greater than 5, so no relationship between independent variables.
str(testset) # earlier you open the test csv file
'data.frame': 200 obs. of 5 variables:
$ X : int 659 276 263 303 83 500 530 244 249 728 ...
$ months_since_last_donations : int 2 21 4 11 4 3 4 14 23 14 ...
$ number_of_donations : int 12 7 1 11 12 21 2 1 2 4 ...
$ volume_donated : int 3000 1750 250 2750 3000 5250 500 250 500 1000 ...
$ months_since_first_donations: int 52 38 4 38 34 42 4 14 87 64 ...
# remove un-necessary columns since you know the equation above
testset1 <-testset [c(-1,-4)]
str(testset1)
'data.frame': 200 obs. of 3 variables:
$ months_since_last_donations : int 2 21 4 11 4 3 4 14 23 14 ...
$ number_of_donations : int 12 7 1 11 12 21 2 1 2 4 ...
$ months_since_first_donations: int 52 38 4 38 34 42 4 14 87 64 ...
#create a new variable called results and append to testset1
testset1$results <- predict(Select_model, testset1, type="response")
#you use the View(testset1) to check its probability on the testset dataset
# sample result
head(testset1)