This exercise is based on the Charles Book Club case from the book Data Mining for Business Analytics (Shmueli et al.). In the scenario, a book club wants to mail an offer for a book called The Art History of Florence to its members. Mailing every member would be wasteful, since most members will not buy the book, so the goal of the exercise is to identify and mail only those members with a high probability of actually buying it.
Load the data and look at the variables.
# dplyr/ggplot2 are used for the EDA, caret for preprocessing and evaluation,
# and FNN (called as FNN::knn below) for k-nearest neighbours
library(dplyr)
library(ggplot2)
library(caret)

book_club = read.csv('./CharlesBookClub.csv')
str(book_club)
## 'data.frame': 4000 obs. of 24 variables:
## $ Seq. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ID. : int 25 29 46 47 51 60 61 79 81 90 ...
## $ Gender : int 1 0 1 1 1 1 1 1 1 1 ...
## $ M : int 297 128 138 228 257 145 190 187 252 240 ...
## $ R : int 14 8 22 2 10 6 16 14 10 6 ...
## $ F : int 2 2 7 1 1 2 1 1 1 3 ...
## $ FirstPurch : int 22 10 56 2 10 12 16 14 10 20 ...
## $ ChildBks : int 0 0 2 0 0 0 0 1 0 0 ...
## $ YouthBks : int 1 0 1 0 0 0 0 0 0 0 ...
## $ CookBks : int 1 0 2 0 0 0 0 0 0 1 ...
## $ DoItYBks : int 0 0 0 0 0 0 0 0 0 0 ...
## $ RefBks : int 0 0 1 0 0 0 0 0 0 0 ...
## $ ArtBks : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GeogBks : int 0 0 1 0 0 0 1 0 0 0 ...
## $ ItalCook : int 0 0 1 0 0 0 0 0 0 0 ...
## $ ItalAtlas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ItalArt : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Florence : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Related.Purchase: int 0 0 2 0 0 0 1 0 0 0 ...
## $ Mcode : int 5 4 4 5 5 4 4 4 5 5 ...
## $ Rcode : int 4 3 4 1 3 2 4 4 3 2 ...
## $ Fcode : int 2 2 3 1 1 2 1 1 1 3 ...
## $ Yes_Florence : int 0 0 0 0 0 0 0 0 0 0 ...
## $ No_Florence : int 1 1 1 1 1 1 1 1 1 1 ...
# Convert categorical variables to factor
book_club$Florence = as.factor(book_club$Florence)
book_club$Gender = as.factor(ifelse(book_club$Gender == 1, "F", "M"))
book_club$Mcode = as.factor(book_club$Mcode)
book_club$Rcode = as.factor(book_club$Rcode)
book_club$Fcode = as.factor(book_club$Fcode)
Shuffle data and split into train/validation
book_club = book_club[sample(nrow(book_club)), ]  # sample() on a data frame would shuffle columns, not rows
train_idx = 1:ceiling(.6*nrow(book_club))
train = book_club[train_idx, ]
validation = book_club[-train_idx, ]
gender_n = train %>%
group_by(Gender) %>%
summarise(gender_n = n())
ggplot(train %>%
group_by(Gender, Florence) %>%
summarise(gender_florence_n = n()) %>%
inner_join(gender_n, by = "Gender") %>%
mutate(fraction = gender_florence_n / gender_n), aes(Gender, fraction, fill = Florence)) + geom_bar(stat = "identity")
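The same fraction-of-buyers-by-group plot is repeated for several predictors below. A small helper along these lines (a sketch, assuming reasonably recent dplyr and ggplot2 for the .data pronoun) would capture the pattern once:
# Helper sketch for the repeated "fraction of Florence buyers per group" plots;
# `var` is the name of a grouping column, passed as a string
plot_florence_fraction = function(data, var) {
  counts = data %>%
    group_by(.data[[var]], Florence) %>%
    summarise(n = n()) %>%
    group_by(.data[[var]]) %>%
    mutate(fraction = n / sum(n)) %>%
    ungroup()
  ggplot(counts, aes(.data[[var]], fraction, fill = Florence)) +
    geom_bar(stat = "identity")
}
# e.g. plot_florence_fraction(train, "Gender")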
Rcode_n = train %>%
group_by(Rcode) %>%
summarise(Rcode_n = n())
ggplot(train %>%
group_by(Rcode, Florence) %>%
summarise(Rcode_florence_n = n()) %>%
inner_join(Rcode_n, by = "Rcode") %>%
mutate(fraction = Rcode_florence_n / Rcode_n), aes(Rcode, fraction, fill = Florence)) + geom_bar(stat = "identity")
Mcode_n = train %>%
group_by(Mcode) %>%
summarise(Mcode_n = n())
ggplot(train %>%
group_by(Mcode, Florence) %>%
summarise(Mcode_florence_n = n()) %>%
inner_join(Mcode_n, by = "Mcode") %>%
mutate(fraction = Mcode_florence_n / Mcode_n), aes(Mcode, fraction, fill = Florence)) + geom_bar(stat = "identity")
Fcode_n = train %>%
group_by(Fcode) %>%
summarise(Fcode_n = n())
ggplot(train %>%
group_by(Fcode, Florence) %>%
summarise(Fcode_florence_n = n()) %>%
inner_join(Fcode_n, by = "Fcode") %>%
mutate(fraction = Fcode_florence_n / Fcode_n), aes(Fcode, fraction, fill = Florence)) + geom_bar(stat = "identity")
ggplot(train) + geom_boxplot(aes(Florence, M, group = Florence))
ggplot(train) + geom_boxplot(aes(Florence, F, group = Florence))
ggplot(train) + geom_boxplot(aes(Florence, R, group = Florence))
ArtBks_n = train %>%
group_by(ArtBks) %>%
summarise(ArtBks_n = n())
ggplot(train %>%
group_by(ArtBks, Florence) %>%
summarise(ArtBks_florence_n = n()) %>%
inner_join(ArtBks_n, by = "ArtBks") %>%
mutate(fraction = ArtBks_florence_n / ArtBks_n), aes(ArtBks, fraction, fill = Florence)) + geom_bar(stat = "identity")
ItalArt_n = train %>%
group_by(ItalArt) %>%
summarise(ItalArt_n = n())
ggplot(train %>%
group_by(ItalArt, Florence) %>%
summarise(ItalArt_florence_n = n()) %>%
inner_join(ItalArt_n, by = "ItalArt") %>%
mutate(fraction = ItalArt_florence_n / ItalArt_n), aes(ItalArt, fraction, fill = Florence)) + geom_bar(stat = "identity")
Related.Purchase_n = train %>%
group_by(Related.Purchase) %>%
summarise(Related.Purchase_n = n())
ggplot(train %>%
group_by(Related.Purchase, Florence) %>%
summarise(Related.Purchase_florence_n = n()) %>%
inner_join(Related.Purchase_n, by = "Related.Purchase") %>%
mutate(fraction = Related.Purchase_florence_n / Related.Purchase_n), aes(Related.Purchase, fraction, fill = Florence)) + geom_bar(stat = "identity")
ggplot(train) + geom_boxplot(aes(Florence, FirstPurch, group = Florence))
Response rate for training data as a whole
train_response_rate = sum(train$Florence == 1) / nrow(train)
train_response_rate
## [1] 0.09083333
Response rate for each RFM category
response_rates_by_rfm_group = train %>%
group_by(Rcode, Fcode, Mcode) %>%
summarise(response_rate = mean(Florence == 1), n = n()) %>%
arrange(desc(response_rate))
Groups with response rate > overall response rate
rfm_groups_response_rate_gt_mean = response_rates_by_rfm_group %>%
filter(response_rate > train_response_rate)
Validation response rate using the RFM groups whose training response rate exceeds the overall rate
validation_response_rate_gt_mean = validation %>%
inner_join(rfm_groups_response_rate_gt_mean, by = c("Rcode", "Fcode", "Mcode"))
mean(validation_response_rate_gt_mean$Florence == 1)
## [1] 0.05643994
Segment data based on response rate
response_rates_by_rfm_group = response_rates_by_rfm_group %>%
mutate(segment = ifelse(response_rate > 2 * train_response_rate, 1, ifelse(response_rate > train_response_rate, 2, 3)))
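As a quick sanity check on the segmentation, the size and average response rate of each segment can be summarised (a sketch, assuming dplyr as above):
# How many RFM combinations fall in each segment, how many training members they
# cover, and their member-weighted average training response rate
response_rates_by_rfm_group %>%
  group_by(segment) %>%
  summarise(n_groups = n(),
            members = sum(n),
            avg_response_rate = weighted.mean(response_rate, n))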
Gains curve
pred = validation %>%
inner_join(response_rates_by_rfm_group, by = c("Rcode", "Fcode", "Mcode")) %>%
arrange(desc(response_rate)) %>%
select(Florence, response_rate)
lift = lift(relevel(as.factor(Florence), ref="1") ~ response_rate, data = pred)
ggplot(lift)
We can see from the gains plot that the RFM model does worse than the baseline of uniformly distributed buyers. In the optimal case the top 10% of cases would contain all of the buyers; with the current model we would have to sift through essentially all of the cases to find them.
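The same conclusion can be read off numerically by sorting the validation cases by their RFM response rate and accumulating the buyers (a sketch using the pred data frame built above):
# Manual cumulative gains: what share of actual buyers is captured in the top
# fraction of validation cases, ranked by the RFM group response rate?
gains = pred %>%
  arrange(desc(response_rate)) %>%
  mutate(case_pct = row_number() / n(),
         buyer_pct = cumsum(Florence == 1) / sum(Florence == 1))
gains %>% filter(case_pct >= .1) %>% slice(1)  # capture rate at a 10% mailing depth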
features = c("Rcode", "Fcode", "Mcode", "FirstPurch", "Related.Purchase")
# Convert the factor codes back to numeric, since kNN needs numeric inputs to compute distances
train.knn = train
train.knn$Rcode = as.numeric(as.character(train.knn$Rcode))
train.knn$Fcode = as.numeric(as.character(train.knn$Fcode))
train.knn$Mcode = as.numeric(as.character(train.knn$Mcode))
train.knn$Related.Purchase = as.numeric(as.character(train.knn$Related.Purchase))
validation.knn = validation
validation.knn$Rcode = as.numeric(as.character(validation$Rcode))
validation.knn$Fcode = as.numeric(as.character(validation$Fcode))
validation.knn$Mcode = as.numeric(as.character(validation$Mcode))
validation.knn$Related.Purchase = as.numeric(as.character(validation$Related.Purchase))
# Normalization
preprocess = preProcess(train.knn[, features], method = c("center", "scale"))
features_train = predict(preprocess, train.knn[, features])
features_validation = predict(preprocess, validation.knn[, features])
ks = 1:14
for(k in ks) {
pred = FNN::knn(train = features_train, test = features_validation, cl = relevel(train$Florence, ref = "1"), k = k)
print(k)
print(confusionMatrix(data = pred, reference = validation$Florence, positive = "1"))
}
## [1] 1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1358 108
## 1 122 12
##
## Accuracy : 0.8562
## 95% CI : (0.8381, 0.8731)
## No Information Rate : 0.925
## P-Value [Acc > NIR] : 1.0000
##
## Kappa : 0.0167
## Mcnemar's Test P-Value : 0.3913
##
## Sensitivity : 0.10000
## Specificity : 0.91757
## Pos Pred Value : 0.08955
## Neg Pred Value : 0.92633
## Prevalence : 0.07500
## Detection Rate : 0.00750
## Detection Prevalence : 0.08375
## Balanced Accuracy : 0.50878
##
## 'Positive' Class : 1
##
## [1] 2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1462 119
## 1 18 1
##
## Accuracy : 0.9144
## 95% CI : (0.8996, 0.9276)
## No Information Rate : 0.925
## P-Value [Acc > NIR] : 0.9494
##
## Kappa : -0.0062
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.008333
## Specificity : 0.987838
## Pos Pred Value : 0.052632
## Neg Pred Value : 0.924731
## Prevalence : 0.075000
## Detection Rate : 0.000625
## Detection Prevalence : 0.011875
## Balanced Accuracy : 0.498086
##
## 'Positive' Class : 1
##
## [1] 3
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1441 115
## 1 39 5
##
## Accuracy : 0.9038
## 95% CI : (0.8882, 0.9178)
## No Information Rate : 0.925
## P-Value [Acc > NIR] : 0.9992
##
## Kappa : 0.0216
## Mcnemar's Test P-Value : 1.506e-09
##
## Sensitivity : 0.041667
## Specificity : 0.973649
## Pos Pred Value : 0.113636
## Neg Pred Value : 0.926093
## Prevalence : 0.075000
## Detection Rate : 0.003125
## Detection Prevalence : 0.027500
## Balanced Accuracy : 0.507658
##
## 'Positive' Class : 1
##
## [1] 4
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1467 119
## 1 13 1
##
## Accuracy : 0.9175
## 95% CI : (0.9029, 0.9305)
## No Information Rate : 0.925
## P-Value [Acc > NIR] : 0.8813
##
## Kappa : -8e-04
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.008333
## Specificity : 0.991216
## Pos Pred Value : 0.071429
## Neg Pred Value : 0.924968
## Prevalence : 0.075000
## Detection Rate : 0.000625
## Detection Prevalence : 0.008750
## Balanced Accuracy : 0.499775
##
## 'Positive' Class : 1
##
## [1] 5
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1458 116
## 1 22 4
##
## Accuracy : 0.9138
## 95% CI : (0.8989, 0.927)
## No Information Rate : 0.925
## P-Value [Acc > NIR] : 0.9582
##
## Kappa : 0.0289
## Mcnemar's Test P-Value : 2.439e-15
##
## Sensitivity : 0.03333
## Specificity : 0.98514
## Pos Pred Value : 0.15385
## Neg Pred Value : 0.92630
## Prevalence : 0.07500
## Detection Rate : 0.00250
## Detection Prevalence : 0.01625
## Balanced Accuracy : 0.50923
##
## 'Positive' Class : 1
##
## [1] 6
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1473 117
## 1 7 3
##
## Accuracy : 0.9225
## 95% CI : (0.9083, 0.9351)
## No Information Rate : 0.925
## P-Value [Acc > NIR] : 0.6693
##
## Kappa : 0.035
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.025000
## Specificity : 0.995270
## Pos Pred Value : 0.300000
## Neg Pred Value : 0.926415
## Prevalence : 0.075000
## Detection Rate : 0.001875
## Detection Prevalence : 0.006250
## Balanced Accuracy : 0.510135
##
## 'Positive' Class : 1
##
## [1] 7
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1471 116
## 1 9 4
##
## Accuracy : 0.9219
## 95% CI : (0.9076, 0.9346)
## No Information Rate : 0.925
## P-Value [Acc > NIR] : 0.7026
##
## Kappa : 0.0462
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.033333
## Specificity : 0.993919
## Pos Pred Value : 0.307692
## Neg Pred Value : 0.926906
## Prevalence : 0.075000
## Detection Rate : 0.002500
## Detection Prevalence : 0.008125
## Balanced Accuracy : 0.513626
##
## 'Positive' Class : 1
##
## [1] 8
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1474 120
## 1 6 0
##
## Accuracy : 0.9212
## 95% CI : (0.907, 0.934)
## No Information Rate : 0.925
## P-Value [Acc > NIR] : 0.7341
##
## Kappa : -0.0072
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.00000
## Specificity : 0.99595
## Pos Pred Value : 0.00000
## Neg Pred Value : 0.92472
## Prevalence : 0.07500
## Detection Rate : 0.00000
## Detection Prevalence : 0.00375
## Balanced Accuracy : 0.49797
##
## 'Positive' Class : 1
##
## [1] 9
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1471 120
## 1 9 0
##
## Accuracy : 0.9194
## 95% CI : (0.9049, 0.9322)
## No Information Rate : 0.925
## P-Value [Acc > NIR] : 0.8171
##
## Kappa : -0.0106
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.000000
## Specificity : 0.993919
## Pos Pred Value : 0.000000
## Neg Pred Value : 0.924576
## Prevalence : 0.075000
## Detection Rate : 0.000000
## Detection Prevalence : 0.005625
## Balanced Accuracy : 0.496959
##
## 'Positive' Class : 1
##
## [1] 10
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1475 120
## 1 5 0
##
## Accuracy : 0.9219
## 95% CI : (0.9076, 0.9346)
## No Information Rate : 0.925
## P-Value [Acc > NIR] : 0.7026
##
## Kappa : -0.006
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.000000
## Specificity : 0.996622
## Pos Pred Value : 0.000000
## Neg Pred Value : 0.924765
## Prevalence : 0.075000
## Detection Rate : 0.000000
## Detection Prevalence : 0.003125
## Balanced Accuracy : 0.498311
##
## 'Positive' Class : 1
##
## [1] 11
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1474 120
## 1 6 0
##
## Accuracy : 0.9212
## 95% CI : (0.907, 0.934)
## No Information Rate : 0.925
## P-Value [Acc > NIR] : 0.7341
##
## Kappa : -0.0072
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.00000
## Specificity : 0.99595
## Pos Pred Value : 0.00000
## Neg Pred Value : 0.92472
## Prevalence : 0.07500
## Detection Rate : 0.00000
## Detection Prevalence : 0.00375
## Balanced Accuracy : 0.49797
##
## 'Positive' Class : 1
##
## [1] 12
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1477 120
## 1 3 0
##
## Accuracy : 0.9231
## 95% CI : (0.909, 0.9357)
## No Information Rate : 0.925
## P-Value [Acc > NIR] : 0.6346
##
## Kappa : -0.0037
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.000000
## Specificity : 0.997973
## Pos Pred Value : 0.000000
## Neg Pred Value : 0.924859
## Prevalence : 0.075000
## Detection Rate : 0.000000
## Detection Prevalence : 0.001875
## Balanced Accuracy : 0.498986
##
## 'Positive' Class : 1
##
## [1] 13
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1476 120
## 1 4 0
##
## Accuracy : 0.9225
## 95% CI : (0.9083, 0.9351)
## No Information Rate : 0.925
## P-Value [Acc > NIR] : 0.6693
##
## Kappa : -0.0049
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 0.9973
## Pos Pred Value : 0.0000
## Neg Pred Value : 0.9248
## Prevalence : 0.0750
## Detection Rate : 0.0000
## Detection Prevalence : 0.0025
## Balanced Accuracy : 0.4986
##
## 'Positive' Class : 1
##
## [1] 14
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1476 120
## 1 4 0
##
## Accuracy : 0.9225
## 95% CI : (0.9083, 0.9351)
## No Information Rate : 0.925
## P-Value [Acc > NIR] : 0.6693
##
## Kappa : -0.0049
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 0.9973
## Pos Pred Value : 0.0000
## Neg Pred Value : 0.9248
## Prevalence : 0.0750
## Detection Rate : 0.0000
## Detection Prevalence : 0.0025
## Balanced Accuracy : 0.4986
##
## 'Positive' Class : 1
##
We can see that accuracy would be about 92% if we always predicted no purchase, since only ~7.5% of the validation cases (and ~9% of the training cases) are purchases. The difficulty is therefore in correctly classifying the purchase cases while still getting the no-purchase cases right. k = 1 has the highest sensitivity (TP / (TP + FN)) and a reasonable specificity (TN / (TN + FP)).
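Rather than eyeballing fourteen confusion matrices, the per-k metrics can be collected into a single table (a sketch, reusing the kNN features and the same confusionMatrix call as above):
# Collect sensitivity and specificity for each k into one data frame
knn_metrics = do.call(rbind, lapply(1:14, function(k) {
  pred_k = FNN::knn(train = features_train, test = features_validation,
                    cl = relevel(train$Florence, ref = "1"), k = k)
  cm = confusionMatrix(data = pred_k, reference = validation$Florence, positive = "1")
  data.frame(k = k,
             sensitivity = unname(cm$byClass["Sensitivity"]),
             specificity = unname(cm$byClass["Specificity"]))
}))
knn_metrics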
features = c("Related.Purchase", "ArtBks", "FirstPurch", "R", "F", "M", "Rcode", "Mcode", "Fcode", "Gender")
features_train = train[, features]
features_validation = validation[, features]
# Dummy-encode the factors; the encoder is fit on the training data and applied to both sets
features_train_dummy = dummyVars(~ ., data = features_train)
features_train = predict(features_train_dummy, features_train)
features_validation = predict(features_train_dummy, features_validation)
# Normalization
preprocess = preProcess(features_train, method = c("center", "scale"))
features_train = predict(preprocess, features_train)
features_validation = predict(preprocess, features_validation)
# Response should be numeric
train$Florence = as.numeric(as.character(train$Florence))
Model 1: full set of predictors
train_data = data.frame(cbind(features_train, Florence = train$Florence))
features_validation = data.frame(features_validation)
model1 = glm(Florence ~ ., data = train_data, family = "binomial")
# Drop the NA coefficients caused by the redundant base-level dummy columns
odds = na.omit(exp(coef(model1)))
round(data.frame(summary(model1)$coefficients, odds = odds), 5)
## Estimate Std..Error z.value Pr...z.. odds
## (Intercept) -2.54475 0.08469 -30.04685 0.00000 0.07849
## Related.Purchase 0.28198 0.09537 2.95654 0.00311 1.32575
## ArtBks 0.24720 0.07719 3.20240 0.00136 1.28043
## FirstPurch -0.18996 0.18639 -1.01918 0.30812 0.82699
## R -0.18983 0.16651 -1.14006 0.25426 0.82710
## F 0.31190 0.19090 1.63383 0.10230 1.36602
## M -0.11307 0.15156 -0.74600 0.45567 0.89309
## Rcode.1 0.12973 0.10740 1.20794 0.22707 1.13852
## Rcode.2 0.10394 0.11561 0.89903 0.36863 1.10953
## Rcode.3 -0.00986 0.11389 -0.08657 0.93101 0.99019
## Mcode.1 0.03293 0.08058 0.40859 0.68284 1.03347
## Mcode.2 0.00428 0.10614 0.04030 0.96786 1.00429
## Mcode.3 -0.12353 0.13221 -0.93433 0.35013 0.88379
## Mcode.4 -0.01827 0.11813 -0.15467 0.87708 0.98189
## Fcode.1 -0.03072 0.14273 -0.21526 0.82956 0.96974
## Fcode.2 -0.09071 0.12914 -0.70239 0.48244 0.91329
## Gender.F -0.29724 0.06845 -4.34270 0.00001 0.74286
pred = predict(model1, newdata = features_validation, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
confusionMatrix(data = factor(ifelse(pred > .1, 1, 0), levels = levels(validation$Florence)), reference = validation$Florence, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1072 86
## 1 408 34
##
## Accuracy : 0.6912
## 95% CI : (0.668, 0.7138)
## No Information Rate : 0.925
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0034
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.28333
## Specificity : 0.72432
## Pos Pred Value : 0.07692
## Neg Pred Value : 0.92573
## Prevalence : 0.07500
## Detection Rate : 0.02125
## Detection Prevalence : 0.27625
## Balanced Accuracy : 0.50383
##
## 'Positive' Class : 1
##
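The rank-deficient-fit warning above comes from the dummy encoding: every factor level gets its own column, so one column per factor is redundant and its coefficient is NA. A sketch of the usual remedy (not used below; the _fr objects are only illustrative) is to ask dummyVars for a full-rank encoding:
# Full-rank dummy encoding drops one reference level per factor, which removes
# the NA coefficients and the rank-deficiency warning
dummy_full_rank = dummyVars(~ ., data = train[, features], fullRank = TRUE)
features_train_fr = predict(dummy_full_rank, train[, features])
features_validation_fr = predict(dummy_full_rank, validation[, features])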
lift = lift(relevel(validation$Florence, ref="1") ~ pred)
ggplot(lift)
Interpretation: The intercept corresponds to baseline odds of about 0.08 of buying Florence. Among the significant variables, a one-unit increase in related purchases is associated with the odds being multiplied by about 1.33 and each additional art book purchase by about 1.28, while being female (Gender.F) is associated with odds about 0.74 times those of males. The remaining coefficients are not significant at the 5% level.
Apply stepwise variable selection
model2 = step(model1)
## Start: AIC=1360.15
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M +
## Rcode.1 + Rcode.2 + Rcode.3 + Rcode.4 + Mcode.1 + Mcode.2 +
## Mcode.3 + Mcode.4 + Mcode.5 + Fcode.1 + Fcode.2 + Fcode.3 +
## Gender.F + Gender.M
##
##
## Step: AIC=1360.15
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M +
## Rcode.1 + Rcode.2 + Rcode.3 + Rcode.4 + Mcode.1 + Mcode.2 +
## Mcode.3 + Mcode.4 + Mcode.5 + Fcode.1 + Fcode.2 + Fcode.3 +
## Gender.F
##
##
## Step: AIC=1360.15
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M +
## Rcode.1 + Rcode.2 + Rcode.3 + Rcode.4 + Mcode.1 + Mcode.2 +
## Mcode.3 + Mcode.4 + Mcode.5 + Fcode.1 + Fcode.2 + Gender.F
##
##
## Step: AIC=1360.15
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M +
## Rcode.1 + Rcode.2 + Rcode.3 + Rcode.4 + Mcode.1 + Mcode.2 +
## Mcode.3 + Mcode.4 + Fcode.1 + Fcode.2 + Gender.F
##
##
## Step: AIC=1360.15
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M +
## Rcode.1 + Rcode.2 + Rcode.3 + Mcode.1 + Mcode.2 + Mcode.3 +
## Mcode.4 + Fcode.1 + Fcode.2 + Gender.F
##
## Df Deviance AIC
## - Mcode.2 1 1326.2 1358.2
## - Rcode.3 1 1326.2 1358.2
## - Mcode.4 1 1326.2 1358.2
## - Fcode.1 1 1326.2 1358.2
## - Mcode.1 1 1326.3 1358.3
## - Fcode.2 1 1326.6 1358.6
## - M 1 1326.7 1358.7
## - Rcode.2 1 1327.0 1359.0
## - Mcode.3 1 1327.0 1359.0
## - FirstPurch 1 1327.2 1359.2
## - R 1 1327.5 1359.5
## - Rcode.1 1 1327.6 1359.6
## <none> 1326.2 1360.2
## - F 1 1328.8 1360.8
## - Related.Purchase 1 1334.7 1366.7
## - ArtBks 1 1336.3 1368.3
## - Gender.F 1 1344.4 1376.4
##
## Step: AIC=1358.15
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M +
## Rcode.1 + Rcode.2 + Rcode.3 + Mcode.1 + Mcode.3 + Mcode.4 +
## Fcode.1 + Fcode.2 + Gender.F
##
## Df Deviance AIC
## - Rcode.3 1 1326.2 1356.2
## - Fcode.1 1 1326.2 1356.2
## - Mcode.4 1 1326.2 1356.2
## - Mcode.1 1 1326.3 1356.3
## - Fcode.2 1 1326.6 1356.6
## - Rcode.2 1 1327.0 1357.0
## - M 1 1327.1 1357.1
## - FirstPurch 1 1327.2 1357.2
## - Mcode.3 1 1327.5 1357.5
## - R 1 1327.5 1357.5
## - Rcode.1 1 1327.6 1357.6
## <none> 1326.2 1358.2
## - F 1 1328.8 1358.8
## - Related.Purchase 1 1334.7 1364.7
## - ArtBks 1 1336.3 1366.3
## - Gender.F 1 1344.4 1374.4
##
## Step: AIC=1356.16
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M +
## Rcode.1 + Rcode.2 + Mcode.1 + Mcode.3 + Mcode.4 + Fcode.1 +
## Fcode.2 + Gender.F
##
## Df Deviance AIC
## - Mcode.4 1 1326.2 1354.2
## - Fcode.1 1 1326.2 1354.2
## - Mcode.1 1 1326.3 1354.3
## - Fcode.2 1 1326.7 1354.7
## - M 1 1327.2 1355.2
## - FirstPurch 1 1327.2 1355.2
## - Mcode.3 1 1327.5 1355.5
## - Rcode.2 1 1328.0 1356.0
## - R 1 1328.0 1356.0
## <none> 1326.2 1356.2
## - F 1 1328.8 1356.8
## - Rcode.1 1 1329.1 1357.1
## - Related.Purchase 1 1334.7 1362.7
## - ArtBks 1 1336.3 1364.3
## - Gender.F 1 1344.4 1372.4
##
## Step: AIC=1354.2
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M +
## Rcode.1 + Rcode.2 + Mcode.1 + Mcode.3 + Fcode.1 + Fcode.2 +
## Gender.F
##
## Df Deviance AIC
## - Fcode.1 1 1326.2 1352.2
## - Mcode.1 1 1326.4 1352.4
## - Fcode.2 1 1326.7 1352.7
## - FirstPurch 1 1327.2 1353.2
## - M 1 1327.4 1353.4
## - Mcode.3 1 1327.6 1353.6
## - Rcode.2 1 1328.0 1354.0
## - R 1 1328.1 1354.1
## <none> 1326.2 1354.2
## - F 1 1328.8 1354.8
## - Rcode.1 1 1329.2 1355.2
## - Related.Purchase 1 1334.7 1360.7
## - ArtBks 1 1336.4 1362.4
## - Gender.F 1 1344.4 1370.4
##
## Step: AIC=1352.24
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M +
## Rcode.1 + Rcode.2 + Mcode.1 + Mcode.3 + Fcode.2 + Gender.F
##
## Df Deviance AIC
## - Mcode.1 1 1326.5 1350.5
## - Fcode.2 1 1326.8 1350.8
## - FirstPurch 1 1327.2 1351.2
## - M 1 1327.4 1351.4
## - Mcode.3 1 1327.7 1351.7
## - Rcode.2 1 1328.0 1352.0
## - R 1 1328.2 1352.2
## <none> 1326.2 1352.2
## - Rcode.1 1 1329.2 1353.2
## - F 1 1329.7 1353.7
## - Related.Purchase 1 1334.7 1358.7
## - ArtBks 1 1336.5 1360.5
## - Gender.F 1 1344.5 1368.5
##
## Step: AIC=1350.46
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M +
## Rcode.1 + Rcode.2 + Mcode.3 + Fcode.2 + Gender.F
##
## Df Deviance AIC
## - Fcode.2 1 1327.2 1349.2
## - FirstPurch 1 1327.5 1349.5
## - M 1 1327.9 1349.9
## - Mcode.3 1 1328.1 1350.1
## - Rcode.2 1 1328.3 1350.3
## - R 1 1328.4 1350.4
## <none> 1326.5 1350.5
## - Rcode.1 1 1329.5 1351.5
## - F 1 1329.9 1351.9
## - Related.Purchase 1 1335.0 1357.0
## - ArtBks 1 1336.7 1358.7
## - Gender.F 1 1344.8 1366.8
##
## Step: AIC=1349.17
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M +
## Rcode.1 + Rcode.2 + Mcode.3 + Gender.F
##
## Df Deviance AIC
## - FirstPurch 1 1328.1 1348.1
## - M 1 1328.7 1348.7
## - Mcode.3 1 1328.8 1348.8
## - Rcode.2 1 1328.9 1348.9
## <none> 1327.2 1349.2
## - R 1 1329.2 1349.2
## - Rcode.1 1 1330.2 1350.2
## - F 1 1331.2 1351.2
## - Related.Purchase 1 1335.7 1355.7
## - ArtBks 1 1337.6 1357.6
## - Gender.F 1 1345.4 1365.4
##
## Step: AIC=1348.14
## Florence ~ Related.Purchase + ArtBks + R + F + M + Rcode.1 +
## Rcode.2 + Mcode.3 + Gender.F
##
## Df Deviance AIC
## - M 1 1329.8 1347.8
## - Mcode.3 1 1329.9 1347.9
## - Rcode.2 1 1330.0 1348.0
## <none> 1328.1 1348.1
## - Rcode.1 1 1331.4 1349.4
## - F 1 1333.0 1351.0
## - R 1 1334.2 1352.2
## - Related.Purchase 1 1336.5 1354.5
## - ArtBks 1 1338.4 1356.4
## - Gender.F 1 1346.3 1364.3
##
## Step: AIC=1347.82
## Florence ~ Related.Purchase + ArtBks + R + F + Rcode.1 + Rcode.2 +
## Mcode.3 + Gender.F
##
## Df Deviance AIC
## - Mcode.3 1 1330.6 1346.6
## - Rcode.2 1 1331.8 1347.8
## <none> 1329.8 1347.8
## - Rcode.1 1 1333.2 1349.2
## - F 1 1333.3 1349.3
## - R 1 1335.7 1351.7
## - Related.Purchase 1 1337.4 1353.4
## - ArtBks 1 1341.1 1357.1
## - Gender.F 1 1347.8 1363.8
##
## Step: AIC=1346.61
## Florence ~ Related.Purchase + ArtBks + R + F + Rcode.1 + Rcode.2 +
## Gender.F
##
## Df Deviance AIC
## - Rcode.2 1 1332.6 1346.6
## <none> 1330.6 1346.6
## - Rcode.1 1 1334.0 1348.0
## - F 1 1334.9 1348.9
## - R 1 1336.5 1350.5
## - Related.Purchase 1 1338.2 1352.2
## - ArtBks 1 1341.9 1355.9
## - Gender.F 1 1348.6 1362.6
##
## Step: AIC=1346.6
## Florence ~ Related.Purchase + ArtBks + R + F + Rcode.1 + Gender.F
##
## Df Deviance AIC
## - Rcode.1 1 1334.5 1346.5
## <none> 1332.6 1346.6
## - F 1 1337.2 1349.2
## - Related.Purchase 1 1340.0 1352.0
## - ArtBks 1 1343.8 1355.8
## - R 1 1347.2 1359.2
## - Gender.F 1 1350.9 1362.9
##
## Step: AIC=1346.46
## Florence ~ Related.Purchase + ArtBks + R + F + Gender.F
##
## Df Deviance AIC
## <none> 1334.5 1346.5
## - F 1 1339.1 1349.1
## - Related.Purchase 1 1341.7 1351.7
## - ArtBks 1 1345.8 1355.8
## - Gender.F 1 1352.8 1362.8
## - R 1 1359.6 1369.6
Stepwise selection finds a smaller set of variables that decreases the AIC; the final model keeps Related.Purchase, ArtBks, R, F, and Gender.F. Its coefficients are shown below.
round(data.frame(summary(model2)$coefficients, odds = exp(coef(model2))), 5)
## Estimate Std..Error z.value Pr...z.. odds
## (Intercept) -2.53377 0.08418 -30.09993 0.00000 0.07936
## Related.Purchase 0.25520 0.09387 2.71862 0.00656 1.29072
## ArtBks 0.25890 0.07630 3.39306 0.00069 1.29551
## R -0.40501 0.08570 -4.72615 0.00000 0.66697
## F 0.18044 0.08283 2.17849 0.02937 1.19775
## Gender.F -0.29681 0.06814 -4.35620 0.00001 0.74318
pred = predict(model2, newdata = features_validation, type = "response")
confusionMatrix(data = factor(ifelse(pred > .1, 1, 0), levels = levels(validation$Florence)), reference = validation$Florence, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1069 84
## 1 411 36
##
## Accuracy : 0.6906
## 95% CI : (0.6673, 0.7132)
## No Information Rate : 0.925
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0099
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.30000
## Specificity : 0.72230
## Pos Pred Value : 0.08054
## Neg Pred Value : 0.92715
## Prevalence : 0.07500
## Detection Rate : 0.02250
## Detection Prevalence : 0.27937
## Balanced Accuracy : 0.51115
##
## 'Positive' Class : 1
##
lift = lift(relevel(as.factor(validation$Florence), ref="1") ~ pred)
ggplot(lift)
Interpretation: All retained variables are significant. A one-unit increase in art book purchases multiplies the odds of buying Florence by about 1.30, related purchases by about 1.29, and frequency by about 1.20. The remaining variables decrease the odds: being female multiplies them by about 0.74, and each additional unit of recency (time since the last purchase) by about 0.67.
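Uncertainty around these odds ratios can be quantified with Wald intervals built from the standard errors shown above (a sketch; confint.default gives the Wald intervals for a glm fit):
# 95% Wald confidence intervals for the model2 odds ratios
round(exp(confint.default(model2)), 3)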
Of all the models, the logistic regression model2 is the best: it reaches a sensitivity of 0.30 at a 0.1 cutoff. A low cutoff was used because we are interested in targeting buyers, and buyers are rare, so their predicted probabilities are low. Lowering the cutoff produces more false positives, but that is an acceptable trade-off here.
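To make that trade-off concrete, a few candidate cutoffs can be swept over the model2 validation predictions (a sketch using the pred vector computed above):
# For each cutoff: how many validation members would be mailed, and what fraction
# of the actual buyers would be reached?
cutoffs = c(.05, .1, .15, .2)
data.frame(cutoff = cutoffs,
           mailed = sapply(cutoffs, function(thr) sum(pred > thr)),
           buyer_capture = sapply(cutoffs, function(thr)
             sum(pred > thr & validation$Florence == 1) / sum(validation$Florence == 1)))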
It is interesting to see that the model coefficients correspond to the EDA. For example, males have an increased chance of buying Florence over females, which shows up as the 0.74 odds for females relative to males. Similarly, the recency distribution and the negative coefficient on R both suggest that members who purchased more recently are more likely to buy Florence. We can also see a noticeable increase in the propensity to buy Florence as the number of art book purchases goes up.
The lift plot also shows that model2 captures about 62% of the buyers when mailing to only 50% of the members, an improvement over the baseline, under which we would have to mail about 62% of the population to capture 62% of the buyers.
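That figure can be checked directly from the validation predictions (a sketch):
# Fraction of validation buyers captured when mailing the top half of members,
# ranked by the model2 predicted probability
depth = ceiling(.5 * length(pred))
top_half = order(pred, decreasing = TRUE)[1:depth]
sum(validation$Florence[top_half] == 1) / sum(validation$Florence == 1)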