This exercise is based on the book Data Mining for Business Analytics by Shmueli et al. (SBPYPL). In this scenario, a book club wants to mail an offer for a book called Art History of Florence to its members. Mailing every member is wasteful, since most members will not buy the book. The goal of the exercise is therefore to find the members with a high probability of actually buying, so that mail is sent only to them.

Table of variables.

# Libraries used throughout (dplyr/ggplot2 for EDA, caret for preprocessing
# and evaluation; FNN is called below via FNN::knn)
library(dplyr)
library(ggplot2)
library(caret)

book_club = read.csv('./CharlesBookClub.csv')
str(book_club)
## 'data.frame':    4000 obs. of  24 variables:
##  $ Seq.            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ID.             : int  25 29 46 47 51 60 61 79 81 90 ...
##  $ Gender          : int  1 0 1 1 1 1 1 1 1 1 ...
##  $ M               : int  297 128 138 228 257 145 190 187 252 240 ...
##  $ R               : int  14 8 22 2 10 6 16 14 10 6 ...
##  $ F               : int  2 2 7 1 1 2 1 1 1 3 ...
##  $ FirstPurch      : int  22 10 56 2 10 12 16 14 10 20 ...
##  $ ChildBks        : int  0 0 2 0 0 0 0 1 0 0 ...
##  $ YouthBks        : int  1 0 1 0 0 0 0 0 0 0 ...
##  $ CookBks         : int  1 0 2 0 0 0 0 0 0 1 ...
##  $ DoItYBks        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RefBks          : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ ArtBks          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GeogBks         : int  0 0 1 0 0 0 1 0 0 0 ...
##  $ ItalCook        : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ ItalAtlas       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ItalArt         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Florence        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Related.Purchase: int  0 0 2 0 0 0 1 0 0 0 ...
##  $ Mcode           : int  5 4 4 5 5 4 4 4 5 5 ...
##  $ Rcode           : int  4 3 4 1 3 2 4 4 3 2 ...
##  $ Fcode           : int  2 2 3 1 1 2 1 1 1 3 ...
##  $ Yes_Florence    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ No_Florence     : int  1 1 1 1 1 1 1 1 1 1 ...

Preprocess

# Convert categorical variables to factor
book_club$Florence = as.factor(book_club$Florence)
book_club$Gender = as.factor(ifelse(book_club$Gender == 1, "F", "M"))
book_club$Mcode = as.factor(book_club$Mcode)
book_club$Rcode = as.factor(book_club$Rcode)
book_club$Fcode = as.factor(book_club$Fcode)

Shuffle data and split into train/validation

set.seed(42)  # illustrative seed for reproducibility
# Shuffle rows; note that sample(book_club) would permute the columns, not the rows
book_club = book_club[sample(nrow(book_club)), ]
train_idx = 1:ceiling(.6*nrow(book_club))
train = book_club[train_idx, ]
validation = book_club[-train_idx, ]
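A plain random split can leave the two sets with somewhat different buyer rates (here about 9.1% in training versus 7.5% in validation, per the output below). A stratified alternative, not used in this analysis, is caret's createDataPartition:

# Hypothetical alternative: stratified split preserving the Florence class ratio
idx = createDataPartition(book_club$Florence, p = 0.6, list = FALSE)
train_strat = book_club[idx, ]        # illustrative names, not used below
validation_strat = book_club[-idx, ]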

Exploratory Data Analysis

# Fraction of Florence buyers within each gender
gender_n = train %>%
  group_by(Gender) %>%
  summarise(gender_n = n())

ggplot(train %>%
  group_by(Gender, Florence) %>%
  summarise(gender_florence_n = n()) %>%
  inner_join(gender_n, by = "Gender") %>%
  mutate(fraction = gender_florence_n / gender_n), aes(Gender, fraction, fill = Florence)) + geom_bar(stat = "identity")
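The same group, count, join, and plot pattern repeats for each categorical predictor below. As a sketch (assuming the dplyr and ggplot2 loaded above), it could be wrapped in a helper:

# Sketch: fraction of Florence buyers within each level of a categorical column.
# `var` is a column name given as a string, e.g. plot_response_fraction(train, "Rcode")
plot_response_fraction = function(data, var) {
  totals = data %>%
    group_by(.data[[var]]) %>%
    summarise(group_n = n())

  data %>%
    group_by(.data[[var]], Florence) %>%
    summarise(group_florence_n = n(), .groups = "drop") %>%
    inner_join(totals, by = var) %>%
    mutate(fraction = group_florence_n / group_n) %>%
    ggplot(aes(.data[[var]], fraction, fill = Florence)) +
    geom_bar(stat = "identity")
}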

Rcode_n = train %>%
  group_by(Rcode) %>%
  summarise(Rcode_n = n())

ggplot(train %>%
  group_by(Rcode, Florence) %>%
  summarise(Rcode_florence_n = n()) %>%
  inner_join(Rcode_n, by = "Rcode") %>%
  mutate(fraction = Rcode_florence_n / Rcode_n), aes(Rcode, fraction, fill = Florence)) + geom_bar(stat = "identity")

Mcode_n = train %>%
  group_by(Mcode) %>%
  summarise(Mcode_n = n())

ggplot(train %>%
  group_by(Mcode, Florence) %>%
  summarise(Mcode_florence_n = n()) %>%
  inner_join(Mcode_n, by = "Mcode") %>%
  mutate(fraction = Mcode_florence_n / Mcode_n), aes(Mcode, fraction, fill = Florence)) + geom_bar(stat = "identity")

Fcode_n = train %>%
  group_by(Fcode) %>%
  summarise(Fcode_n = n())

ggplot(train %>%
  group_by(Fcode, Florence) %>%
  summarise(Fcode_florence_n = n()) %>%
  inner_join(Fcode_n, by = "Fcode") %>%
  mutate(fraction = Fcode_florence_n / Fcode_n), aes(Fcode, fraction, fill = Florence)) + geom_bar(stat = "identity")

ggplot(train) + geom_boxplot(aes(Florence, M, group = Florence))

ggplot(train) + geom_boxplot(aes(Florence, F, group = Florence))

ggplot(train) + geom_boxplot(aes(Florence, R, group = Florence))

ArtBks_n = train %>%
  group_by(ArtBks) %>%
  summarise(ArtBks_n = n())

ggplot(train %>%
  group_by(ArtBks, Florence) %>%
  summarise(ArtBks_florence_n = n()) %>%
  inner_join(ArtBks_n, by = "ArtBks") %>%
  mutate(fraction = ArtBks_florence_n / ArtBks_n), aes(ArtBks, fraction, fill = Florence)) + geom_bar(stat = "identity")

ItalArt_n = train %>%
  group_by(ItalArt) %>%
  summarise(ItalArt_n = n())

ggplot(train %>%
  group_by(ItalArt, Florence) %>%
  summarise(ItalArt_florence_n = n()) %>%
  inner_join(ItalArt_n, by = "ItalArt") %>%
  mutate(fraction = ItalArt_florence_n / ItalArt_n), aes(ItalArt, fraction, fill = Florence)) + geom_bar(stat = "identity")

Related.Purchase_n = train %>%
  group_by(Related.Purchase) %>%
  summarise(Related.Purchase_n = n())

ggplot(train %>%
  group_by(Related.Purchase, Florence) %>%
  summarise(Related.Purchase_florence_n = n()) %>%
  inner_join(Related.Purchase_n, by = "Related.Purchase") %>%
  mutate(fraction = Related.Purchase_florence_n / Related.Purchase_n), aes(Related.Purchase, fraction, fill = Florence)) + geom_bar(stat = "identity")

ggplot(train) + geom_boxplot(aes(Florence, FirstPurch, group = Florence))

Response rate for training data as a whole

train_response_rate = sum(train$Florence == 1) / nrow(train)
train_response_rate
## [1] 0.09083333

RFM Model

Response rate for each RFM category

response_rates_by_rfm_group = train %>%
  group_by(Rcode, Fcode, Mcode) %>%
  summarise(response_rate = mean(Florence == 1), n = n()) %>%
  arrange(desc(response_rate))

Groups with response rate > overall response rate

rfm_groups_response_rate_gt_mean = response_rates_by_rfm_group %>%
  filter(response_rate > train_response_rate)

Validation response rate using RFM groups with a rate above the overall response rate

validation_response_rate_gt_mean = validation %>%
  inner_join(rfm_groups_response_rate_gt_mean, by = c("Rcode", "Fcode",  "Mcode"))

mean(validation_response_rate_gt_mean$Florence == 1)
## [1] 0.05643994
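For comparison, the overall validation buyer rate can be computed the same way as for training:

mean(validation$Florence == 1)

The confusion matrices later in this document show this prevalence to be 0.075, so the selected "above average" RFM groups actually respond below the overall validation rate here; many RFM cells are small, and their training response rates are noisy.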

Segment data based on response rate

response_rates_by_rfm_group = response_rates_by_rfm_group %>%
  mutate(segment = case_when(
    response_rate > 2 * train_response_rate ~ 1,  # well above average
    response_rate > train_response_rate     ~ 2,  # above average
    TRUE                                    ~ 3   # at or below average
  ))

Gains curve

pred = validation %>%
  inner_join(response_rates_by_rfm_group, by = c("Rcode", "Fcode", "Mcode")) %>%
  arrange(desc(response_rate)) %>%
  select(Florence, response_rate)

lift_results = lift(relevel(as.factor(Florence), ref = "1") ~ response_rate, data = pred)
ggplot(lift_results)

We can see from the gains plot that the RFM model does worse than the baseline model of uniformly distributed buyers. The optimal case would be if the top ~10% of cases (roughly the buyer prevalence) contained all the buyers. In the current model, however, we have to sift through nearly all cases in order to find all the buyers.
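The same reading can be checked numerically; a sketch, reusing the pred data frame above (already sorted by descending response_rate):

# Cumulative share of buyers captured vs. cumulative share of cases mailed
gains = pred %>%
  mutate(cum_buyers = cumsum(Florence == 1) / sum(Florence == 1),
         cum_cases  = row_number() / n())
# e.g. the value of cum_buyers where cum_cases is ~0.5 is the share of buyers
# found by mailing the top half of the list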

k-NN Model

features = c("Rcode", "Fcode", "Mcode", "FirstPurch", "Related.Purchase")

# Convert factors back to numeric, since k-NN computes distances on numeric features
train.knn = train
train.knn$Rcode = as.numeric(as.character(train.knn$Rcode))
train.knn$Fcode = as.numeric(as.character(train.knn$Fcode))
train.knn$Mcode = as.numeric(as.character(train.knn$Mcode))
train.knn$Related.Purchase = as.numeric(as.character(train.knn$Related.Purchase))

validation.knn = validation
validation.knn$Rcode = as.numeric(as.character(validation$Rcode))
validation.knn$Fcode = as.numeric(as.character(validation$Fcode))
validation.knn$Mcode = as.numeric(as.character(validation$Mcode))
validation.knn$Related.Purchase = as.numeric(as.character(validation$Related.Purchase))

# Normalization
preprocess = preProcess(train.knn[, features], method = c("center", "scale"))
features_train = predict(preprocess, train.knn[, features])
features_validation = predict(preprocess, validation.knn[, features])

for (k in 1:14) {
  pred = FNN::knn(train = features_train, test = features_validation, cl = relevel(train$Florence, ref = "1"), k = k)
  
  print(k)
  print(confusionMatrix(data = pred, reference = validation$Florence, positive = "1"))
}
## [1] 1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1358  108
##          1  122   12
##                                           
##                Accuracy : 0.8562          
##                  95% CI : (0.8381, 0.8731)
##     No Information Rate : 0.925           
##     P-Value [Acc > NIR] : 1.0000          
##                                           
##                   Kappa : 0.0167          
##  Mcnemar's Test P-Value : 0.3913          
##                                           
##             Sensitivity : 0.10000         
##             Specificity : 0.91757         
##          Pos Pred Value : 0.08955         
##          Neg Pred Value : 0.92633         
##              Prevalence : 0.07500         
##          Detection Rate : 0.00750         
##    Detection Prevalence : 0.08375         
##       Balanced Accuracy : 0.50878         
##                                           
##        'Positive' Class : 1               
##                                           
## [1] 2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1462  119
##          1   18    1
##                                           
##                Accuracy : 0.9144          
##                  95% CI : (0.8996, 0.9276)
##     No Information Rate : 0.925           
##     P-Value [Acc > NIR] : 0.9494          
##                                           
##                   Kappa : -0.0062         
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.008333        
##             Specificity : 0.987838        
##          Pos Pred Value : 0.052632        
##          Neg Pred Value : 0.924731        
##              Prevalence : 0.075000        
##          Detection Rate : 0.000625        
##    Detection Prevalence : 0.011875        
##       Balanced Accuracy : 0.498086        
##                                           
##        'Positive' Class : 1               
##                                           
## [1] 3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1441  115
##          1   39    5
##                                           
##                Accuracy : 0.9038          
##                  95% CI : (0.8882, 0.9178)
##     No Information Rate : 0.925           
##     P-Value [Acc > NIR] : 0.9992          
##                                           
##                   Kappa : 0.0216          
##  Mcnemar's Test P-Value : 1.506e-09       
##                                           
##             Sensitivity : 0.041667        
##             Specificity : 0.973649        
##          Pos Pred Value : 0.113636        
##          Neg Pred Value : 0.926093        
##              Prevalence : 0.075000        
##          Detection Rate : 0.003125        
##    Detection Prevalence : 0.027500        
##       Balanced Accuracy : 0.507658        
##                                           
##        'Positive' Class : 1               
##                                           
## [1] 4
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1467  119
##          1   13    1
##                                           
##                Accuracy : 0.9175          
##                  95% CI : (0.9029, 0.9305)
##     No Information Rate : 0.925           
##     P-Value [Acc > NIR] : 0.8813          
##                                           
##                   Kappa : -8e-04          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.008333        
##             Specificity : 0.991216        
##          Pos Pred Value : 0.071429        
##          Neg Pred Value : 0.924968        
##              Prevalence : 0.075000        
##          Detection Rate : 0.000625        
##    Detection Prevalence : 0.008750        
##       Balanced Accuracy : 0.499775        
##                                           
##        'Positive' Class : 1               
##                                           
## [1] 5
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1458  116
##          1   22    4
##                                          
##                Accuracy : 0.9138         
##                  95% CI : (0.8989, 0.927)
##     No Information Rate : 0.925          
##     P-Value [Acc > NIR] : 0.9582         
##                                          
##                   Kappa : 0.0289         
##  Mcnemar's Test P-Value : 2.439e-15      
##                                          
##             Sensitivity : 0.03333        
##             Specificity : 0.98514        
##          Pos Pred Value : 0.15385        
##          Neg Pred Value : 0.92630        
##              Prevalence : 0.07500        
##          Detection Rate : 0.00250        
##    Detection Prevalence : 0.01625        
##       Balanced Accuracy : 0.50923        
##                                          
##        'Positive' Class : 1              
##                                          
## [1] 6
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1473  117
##          1    7    3
##                                           
##                Accuracy : 0.9225          
##                  95% CI : (0.9083, 0.9351)
##     No Information Rate : 0.925           
##     P-Value [Acc > NIR] : 0.6693          
##                                           
##                   Kappa : 0.035           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.025000        
##             Specificity : 0.995270        
##          Pos Pred Value : 0.300000        
##          Neg Pred Value : 0.926415        
##              Prevalence : 0.075000        
##          Detection Rate : 0.001875        
##    Detection Prevalence : 0.006250        
##       Balanced Accuracy : 0.510135        
##                                           
##        'Positive' Class : 1               
##                                           
## [1] 7
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1471  116
##          1    9    4
##                                           
##                Accuracy : 0.9219          
##                  95% CI : (0.9076, 0.9346)
##     No Information Rate : 0.925           
##     P-Value [Acc > NIR] : 0.7026          
##                                           
##                   Kappa : 0.0462          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.033333        
##             Specificity : 0.993919        
##          Pos Pred Value : 0.307692        
##          Neg Pred Value : 0.926906        
##              Prevalence : 0.075000        
##          Detection Rate : 0.002500        
##    Detection Prevalence : 0.008125        
##       Balanced Accuracy : 0.513626        
##                                           
##        'Positive' Class : 1               
##                                           
## [1] 8
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1474  120
##          1    6    0
##                                         
##                Accuracy : 0.9212        
##                  95% CI : (0.907, 0.934)
##     No Information Rate : 0.925         
##     P-Value [Acc > NIR] : 0.7341        
##                                         
##                   Kappa : -0.0072       
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.00000       
##             Specificity : 0.99595       
##          Pos Pred Value : 0.00000       
##          Neg Pred Value : 0.92472       
##              Prevalence : 0.07500       
##          Detection Rate : 0.00000       
##    Detection Prevalence : 0.00375       
##       Balanced Accuracy : 0.49797       
##                                         
##        'Positive' Class : 1             
##                                         
## [1] 9
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1471  120
##          1    9    0
##                                           
##                Accuracy : 0.9194          
##                  95% CI : (0.9049, 0.9322)
##     No Information Rate : 0.925           
##     P-Value [Acc > NIR] : 0.8171          
##                                           
##                   Kappa : -0.0106         
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.000000        
##             Specificity : 0.993919        
##          Pos Pred Value : 0.000000        
##          Neg Pred Value : 0.924576        
##              Prevalence : 0.075000        
##          Detection Rate : 0.000000        
##    Detection Prevalence : 0.005625        
##       Balanced Accuracy : 0.496959        
##                                           
##        'Positive' Class : 1               
##                                           
## [1] 10
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1475  120
##          1    5    0
##                                           
##                Accuracy : 0.9219          
##                  95% CI : (0.9076, 0.9346)
##     No Information Rate : 0.925           
##     P-Value [Acc > NIR] : 0.7026          
##                                           
##                   Kappa : -0.006          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.000000        
##             Specificity : 0.996622        
##          Pos Pred Value : 0.000000        
##          Neg Pred Value : 0.924765        
##              Prevalence : 0.075000        
##          Detection Rate : 0.000000        
##    Detection Prevalence : 0.003125        
##       Balanced Accuracy : 0.498311        
##                                           
##        'Positive' Class : 1               
##                                           
## [1] 11
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1474  120
##          1    6    0
##                                         
##                Accuracy : 0.9212        
##                  95% CI : (0.907, 0.934)
##     No Information Rate : 0.925         
##     P-Value [Acc > NIR] : 0.7341        
##                                         
##                   Kappa : -0.0072       
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.00000       
##             Specificity : 0.99595       
##          Pos Pred Value : 0.00000       
##          Neg Pred Value : 0.92472       
##              Prevalence : 0.07500       
##          Detection Rate : 0.00000       
##    Detection Prevalence : 0.00375       
##       Balanced Accuracy : 0.49797       
##                                         
##        'Positive' Class : 1             
##                                         
## [1] 12
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1477  120
##          1    3    0
##                                          
##                Accuracy : 0.9231         
##                  95% CI : (0.909, 0.9357)
##     No Information Rate : 0.925          
##     P-Value [Acc > NIR] : 0.6346         
##                                          
##                   Kappa : -0.0037        
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.000000       
##             Specificity : 0.997973       
##          Pos Pred Value : 0.000000       
##          Neg Pred Value : 0.924859       
##              Prevalence : 0.075000       
##          Detection Rate : 0.000000       
##    Detection Prevalence : 0.001875       
##       Balanced Accuracy : 0.498986       
##                                          
##        'Positive' Class : 1              
##                                          
## [1] 13
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1476  120
##          1    4    0
##                                           
##                Accuracy : 0.9225          
##                  95% CI : (0.9083, 0.9351)
##     No Information Rate : 0.925           
##     P-Value [Acc > NIR] : 0.6693          
##                                           
##                   Kappa : -0.0049         
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 0.9973          
##          Pos Pred Value : 0.0000          
##          Neg Pred Value : 0.9248          
##              Prevalence : 0.0750          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0025          
##       Balanced Accuracy : 0.4986          
##                                           
##        'Positive' Class : 1               
##                                           
## [1] 14
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1476  120
##          1    4    0
##                                           
##                Accuracy : 0.9225          
##                  95% CI : (0.9083, 0.9351)
##     No Information Rate : 0.925           
##     P-Value [Acc > NIR] : 0.6693          
##                                           
##                   Kappa : -0.0049         
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 0.9973          
##          Pos Pred Value : 0.0000          
##          Neg Pred Value : 0.9248          
##              Prevalence : 0.0750          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0025          
##       Balanced Accuracy : 0.4986          
##                                           
##        'Positive' Class : 1               
## 

We can see that accuracy would be 92.5% if we always predicted no purchase, since only 7.5% of the validation cases are buyers (the No Information Rate above). The difficulty is therefore in correctly classifying the buyers while still getting the non-buyers right. k = 1 seems to have the highest sensitivity (TP / (TP + FN)) and a reasonable specificity (TN / (TN + FP)).
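Rather than eyeballing fourteen printouts, the per-k metrics can be collected into one table; a sketch using the objects defined above:

# Summarise sensitivity and specificity for each k into a single data frame
knn_metrics = do.call(rbind, lapply(1:14, function(k) {
  pred = FNN::knn(train = features_train, test = features_validation,
                  cl = relevel(train$Florence, ref = "1"), k = k)
  cm = confusionMatrix(data = pred, reference = validation$Florence, positive = "1")
  data.frame(k = k,
             sensitivity = unname(cm$byClass["Sensitivity"]),
             specificity = unname(cm$byClass["Specificity"]))
}))
knn_metrics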

Logistic Regression Model

features = c("Related.Purchase", "ArtBks", "FirstPurch", "R", "F", "M", "Rcode", "Mcode", "Fcode", "Gender")

features_train = train[, features]
features_validation = validation[, features]

# Dummy encoding: fit the encoder on the training data, then apply it to both sets
features_train_dummy = dummyVars(~ ., data = features_train)
features_train = predict(features_train_dummy, features_train)
features_validation = predict(features_train_dummy, features_validation)

# Normalization
preprocess = preProcess(features_train, method = c("center", "scale"))
features_train = predict(preprocess, features_train)
features_validation = predict(preprocess, features_validation)

# Response should be numeric
train$Florence = as.numeric(as.character(train$Florence))

Model 1: full set of predictors

train_data = data.frame(cbind(features_train, Florence = train$Florence))
features_validation = data.frame(features_validation)

model1 = glm(Florence ~ ., data = train_data, family = "binomial")

# Drop NA coefficients: the full dummy encoding plus the raw R/F/M columns makes
# some predictors linearly dependent, so glm returns NA for the redundant levels
odds = na.omit(exp(coef(model1)))

round(data.frame(summary(model1)$coefficients, odds = odds), 5)
##                  Estimate Std..Error   z.value Pr...z..    odds
## (Intercept)      -2.54475    0.08469 -30.04685  0.00000 0.07849
## Related.Purchase  0.28198    0.09537   2.95654  0.00311 1.32575
## ArtBks            0.24720    0.07719   3.20240  0.00136 1.28043
## FirstPurch       -0.18996    0.18639  -1.01918  0.30812 0.82699
## R                -0.18983    0.16651  -1.14006  0.25426 0.82710
## F                 0.31190    0.19090   1.63383  0.10230 1.36602
## M                -0.11307    0.15156  -0.74600  0.45567 0.89309
## Rcode.1           0.12973    0.10740   1.20794  0.22707 1.13852
## Rcode.2           0.10394    0.11561   0.89903  0.36863 1.10953
## Rcode.3          -0.00986    0.11389  -0.08657  0.93101 0.99019
## Mcode.1           0.03293    0.08058   0.40859  0.68284 1.03347
## Mcode.2           0.00428    0.10614   0.04030  0.96786 1.00429
## Mcode.3          -0.12353    0.13221  -0.93433  0.35013 0.88379
## Mcode.4          -0.01827    0.11813  -0.15467  0.87708 0.98189
## Fcode.1          -0.03072    0.14273  -0.21526  0.82956 0.96974
## Fcode.2          -0.09071    0.12914  -0.70239  0.48244 0.91329
## Gender.F         -0.29724    0.06845  -4.34270  0.00001 0.74286
pred = predict(model1, newdata = features_validation, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
confusionMatrix(data = factor(ifelse(pred > .1, 1, 0), levels = levels(validation$Florence)), reference = validation$Florence, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1072   86
##          1  408   34
##                                          
##                Accuracy : 0.6912         
##                  95% CI : (0.668, 0.7138)
##     No Information Rate : 0.925          
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0034         
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.28333        
##             Specificity : 0.72432        
##          Pos Pred Value : 0.07692        
##          Neg Pred Value : 0.92573        
##              Prevalence : 0.07500        
##          Detection Rate : 0.02125        
##    Detection Prevalence : 0.27625        
##       Balanced Accuracy : 0.50383        
##                                          
##        'Positive' Class : 1              
## 
lift_results = lift(relevel(validation$Florence, ref = "1") ~ pred)
ggplot(lift_results)

Interpretation: The intercept corresponds to baseline odds of about 0.08 of buying Florence. Since the predictors were centered and scaled, a 1 unit increase here means one standard deviation. Looking at the significant variables, one standard deviation more related purchases is associated with about 1.33 times the odds of buying Florence, and one standard deviation more art book purchases with about 1.28 times the odds. Gender.F is associated with about 0.74 times the odds (females less likely than males).
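As a reminder of how these odds multipliers combine, a toy calculation using the coefficients above (not model output):

# Combine the intercept with one predictor; features were standardized,
# so "one unit" is one standard deviation
baseline_odds = exp(-2.54475)                     # ~0.078
odds_plus_artbks = baseline_odds * exp(0.24720)   # times the ArtBks odds ratio (~1.28)
odds_plus_artbks / (1 + odds_plus_artbks)         # as a probability, ~0.09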

Apply stepwise variable selection

model2 = step(model1)
## Start:  AIC=1360.15
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M + 
##     Rcode.1 + Rcode.2 + Rcode.3 + Rcode.4 + Mcode.1 + Mcode.2 + 
##     Mcode.3 + Mcode.4 + Mcode.5 + Fcode.1 + Fcode.2 + Fcode.3 + 
##     Gender.F + Gender.M
## 
## 
## Step:  AIC=1360.15
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M + 
##     Rcode.1 + Rcode.2 + Rcode.3 + Rcode.4 + Mcode.1 + Mcode.2 + 
##     Mcode.3 + Mcode.4 + Mcode.5 + Fcode.1 + Fcode.2 + Fcode.3 + 
##     Gender.F
## 
## 
## Step:  AIC=1360.15
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M + 
##     Rcode.1 + Rcode.2 + Rcode.3 + Rcode.4 + Mcode.1 + Mcode.2 + 
##     Mcode.3 + Mcode.4 + Mcode.5 + Fcode.1 + Fcode.2 + Gender.F
## 
## 
## Step:  AIC=1360.15
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M + 
##     Rcode.1 + Rcode.2 + Rcode.3 + Rcode.4 + Mcode.1 + Mcode.2 + 
##     Mcode.3 + Mcode.4 + Fcode.1 + Fcode.2 + Gender.F
## 
## 
## Step:  AIC=1360.15
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M + 
##     Rcode.1 + Rcode.2 + Rcode.3 + Mcode.1 + Mcode.2 + Mcode.3 + 
##     Mcode.4 + Fcode.1 + Fcode.2 + Gender.F
## 
##                    Df Deviance    AIC
## - Mcode.2           1   1326.2 1358.2
## - Rcode.3           1   1326.2 1358.2
## - Mcode.4           1   1326.2 1358.2
## - Fcode.1           1   1326.2 1358.2
## - Mcode.1           1   1326.3 1358.3
## - Fcode.2           1   1326.6 1358.6
## - M                 1   1326.7 1358.7
## - Rcode.2           1   1327.0 1359.0
## - Mcode.3           1   1327.0 1359.0
## - FirstPurch        1   1327.2 1359.2
## - R                 1   1327.5 1359.5
## - Rcode.1           1   1327.6 1359.6
## <none>                  1326.2 1360.2
## - F                 1   1328.8 1360.8
## - Related.Purchase  1   1334.7 1366.7
## - ArtBks            1   1336.3 1368.3
## - Gender.F          1   1344.4 1376.4
## 
## Step:  AIC=1358.15
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M + 
##     Rcode.1 + Rcode.2 + Rcode.3 + Mcode.1 + Mcode.3 + Mcode.4 + 
##     Fcode.1 + Fcode.2 + Gender.F
## 
##                    Df Deviance    AIC
## - Rcode.3           1   1326.2 1356.2
## - Fcode.1           1   1326.2 1356.2
## - Mcode.4           1   1326.2 1356.2
## - Mcode.1           1   1326.3 1356.3
## - Fcode.2           1   1326.6 1356.6
## - Rcode.2           1   1327.0 1357.0
## - M                 1   1327.1 1357.1
## - FirstPurch        1   1327.2 1357.2
## - Mcode.3           1   1327.5 1357.5
## - R                 1   1327.5 1357.5
## - Rcode.1           1   1327.6 1357.6
## <none>                  1326.2 1358.2
## - F                 1   1328.8 1358.8
## - Related.Purchase  1   1334.7 1364.7
## - ArtBks            1   1336.3 1366.3
## - Gender.F          1   1344.4 1374.4
## 
## Step:  AIC=1356.16
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M + 
##     Rcode.1 + Rcode.2 + Mcode.1 + Mcode.3 + Mcode.4 + Fcode.1 + 
##     Fcode.2 + Gender.F
## 
##                    Df Deviance    AIC
## - Mcode.4           1   1326.2 1354.2
## - Fcode.1           1   1326.2 1354.2
## - Mcode.1           1   1326.3 1354.3
## - Fcode.2           1   1326.7 1354.7
## - M                 1   1327.2 1355.2
## - FirstPurch        1   1327.2 1355.2
## - Mcode.3           1   1327.5 1355.5
## - Rcode.2           1   1328.0 1356.0
## - R                 1   1328.0 1356.0
## <none>                  1326.2 1356.2
## - F                 1   1328.8 1356.8
## - Rcode.1           1   1329.1 1357.1
## - Related.Purchase  1   1334.7 1362.7
## - ArtBks            1   1336.3 1364.3
## - Gender.F          1   1344.4 1372.4
## 
## Step:  AIC=1354.2
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M + 
##     Rcode.1 + Rcode.2 + Mcode.1 + Mcode.3 + Fcode.1 + Fcode.2 + 
##     Gender.F
## 
##                    Df Deviance    AIC
## - Fcode.1           1   1326.2 1352.2
## - Mcode.1           1   1326.4 1352.4
## - Fcode.2           1   1326.7 1352.7
## - FirstPurch        1   1327.2 1353.2
## - M                 1   1327.4 1353.4
## - Mcode.3           1   1327.6 1353.6
## - Rcode.2           1   1328.0 1354.0
## - R                 1   1328.1 1354.1
## <none>                  1326.2 1354.2
## - F                 1   1328.8 1354.8
## - Rcode.1           1   1329.2 1355.2
## - Related.Purchase  1   1334.7 1360.7
## - ArtBks            1   1336.4 1362.4
## - Gender.F          1   1344.4 1370.4
## 
## Step:  AIC=1352.24
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M + 
##     Rcode.1 + Rcode.2 + Mcode.1 + Mcode.3 + Fcode.2 + Gender.F
## 
##                    Df Deviance    AIC
## - Mcode.1           1   1326.5 1350.5
## - Fcode.2           1   1326.8 1350.8
## - FirstPurch        1   1327.2 1351.2
## - M                 1   1327.4 1351.4
## - Mcode.3           1   1327.7 1351.7
## - Rcode.2           1   1328.0 1352.0
## - R                 1   1328.2 1352.2
## <none>                  1326.2 1352.2
## - Rcode.1           1   1329.2 1353.2
## - F                 1   1329.7 1353.7
## - Related.Purchase  1   1334.7 1358.7
## - ArtBks            1   1336.5 1360.5
## - Gender.F          1   1344.5 1368.5
## 
## Step:  AIC=1350.46
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M + 
##     Rcode.1 + Rcode.2 + Mcode.3 + Fcode.2 + Gender.F
## 
##                    Df Deviance    AIC
## - Fcode.2           1   1327.2 1349.2
## - FirstPurch        1   1327.5 1349.5
## - M                 1   1327.9 1349.9
## - Mcode.3           1   1328.1 1350.1
## - Rcode.2           1   1328.3 1350.3
## - R                 1   1328.4 1350.4
## <none>                  1326.5 1350.5
## - Rcode.1           1   1329.5 1351.5
## - F                 1   1329.9 1351.9
## - Related.Purchase  1   1335.0 1357.0
## - ArtBks            1   1336.7 1358.7
## - Gender.F          1   1344.8 1366.8
## 
## Step:  AIC=1349.17
## Florence ~ Related.Purchase + ArtBks + FirstPurch + R + F + M + 
##     Rcode.1 + Rcode.2 + Mcode.3 + Gender.F
## 
##                    Df Deviance    AIC
## - FirstPurch        1   1328.1 1348.1
## - M                 1   1328.7 1348.7
## - Mcode.3           1   1328.8 1348.8
## - Rcode.2           1   1328.9 1348.9
## <none>                  1327.2 1349.2
## - R                 1   1329.2 1349.2
## - Rcode.1           1   1330.2 1350.2
## - F                 1   1331.2 1351.2
## - Related.Purchase  1   1335.7 1355.7
## - ArtBks            1   1337.6 1357.6
## - Gender.F          1   1345.4 1365.4
## 
## Step:  AIC=1348.14
## Florence ~ Related.Purchase + ArtBks + R + F + M + Rcode.1 + 
##     Rcode.2 + Mcode.3 + Gender.F
## 
##                    Df Deviance    AIC
## - M                 1   1329.8 1347.8
## - Mcode.3           1   1329.9 1347.9
## - Rcode.2           1   1330.0 1348.0
## <none>                  1328.1 1348.1
## - Rcode.1           1   1331.4 1349.4
## - F                 1   1333.0 1351.0
## - R                 1   1334.2 1352.2
## - Related.Purchase  1   1336.5 1354.5
## - ArtBks            1   1338.4 1356.4
## - Gender.F          1   1346.3 1364.3
## 
## Step:  AIC=1347.82
## Florence ~ Related.Purchase + ArtBks + R + F + Rcode.1 + Rcode.2 + 
##     Mcode.3 + Gender.F
## 
##                    Df Deviance    AIC
## - Mcode.3           1   1330.6 1346.6
## - Rcode.2           1   1331.8 1347.8
## <none>                  1329.8 1347.8
## - Rcode.1           1   1333.2 1349.2
## - F                 1   1333.3 1349.3
## - R                 1   1335.7 1351.7
## - Related.Purchase  1   1337.4 1353.4
## - ArtBks            1   1341.1 1357.1
## - Gender.F          1   1347.8 1363.8
## 
## Step:  AIC=1346.61
## Florence ~ Related.Purchase + ArtBks + R + F + Rcode.1 + Rcode.2 + 
##     Gender.F
## 
##                    Df Deviance    AIC
## - Rcode.2           1   1332.6 1346.6
## <none>                  1330.6 1346.6
## - Rcode.1           1   1334.0 1348.0
## - F                 1   1334.9 1348.9
## - R                 1   1336.5 1350.5
## - Related.Purchase  1   1338.2 1352.2
## - ArtBks            1   1341.9 1355.9
## - Gender.F          1   1348.6 1362.6
## 
## Step:  AIC=1346.6
## Florence ~ Related.Purchase + ArtBks + R + F + Rcode.1 + Gender.F
## 
##                    Df Deviance    AIC
## - Rcode.1           1   1334.5 1346.5
## <none>                  1332.6 1346.6
## - F                 1   1337.2 1349.2
## - Related.Purchase  1   1340.0 1352.0
## - ArtBks            1   1343.8 1355.8
## - R                 1   1347.2 1359.2
## - Gender.F          1   1350.9 1362.9
## 
## Step:  AIC=1346.46
## Florence ~ Related.Purchase + ArtBks + R + F + Gender.F
## 
##                    Df Deviance    AIC
## <none>                  1334.5 1346.5
## - F                 1   1339.1 1349.1
## - Related.Purchase  1   1341.7 1351.7
## - ArtBks            1   1345.8 1355.8
## - Gender.F          1   1352.8 1362.8
## - R                 1   1359.6 1369.6

A smaller set of variables was found that decreases the AIC (from 1360.15 to 1346.46).
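This can be confirmed directly with a quick check:

AIC(model1)  # full model: 1360.15 per the trace above
AIC(model2)  # stepwise-selected model: 1346.46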

Examine the stepwise-selected model, model2, which keeps a subset of these variables.

round(data.frame(summary(model2)$coefficients, odds = exp(coef(model2))), 5)
##                  Estimate Std..Error   z.value Pr...z..    odds
## (Intercept)      -2.53377    0.08418 -30.09993  0.00000 0.07936
## Related.Purchase  0.25520    0.09387   2.71862  0.00656 1.29072
## ArtBks            0.25890    0.07630   3.39306  0.00069 1.29551
## R                -0.40501    0.08570  -4.72615  0.00000 0.66697
## F                 0.18044    0.08283   2.17849  0.02937 1.19775
## Gender.F         -0.29681    0.06814  -4.35620  0.00001 0.74318
pred = predict(model2, newdata = features_validation, type = "response")

confusionMatrix(data = factor(ifelse(pred > .1, 1, 0), levels = levels(validation$Florence)), reference = validation$Florence, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1069   84
##          1  411   36
##                                           
##                Accuracy : 0.6906          
##                  95% CI : (0.6673, 0.7132)
##     No Information Rate : 0.925           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0099          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.30000         
##             Specificity : 0.72230         
##          Pos Pred Value : 0.08054         
##          Neg Pred Value : 0.92715         
##              Prevalence : 0.07500         
##          Detection Rate : 0.02250         
##    Detection Prevalence : 0.27937         
##       Balanced Accuracy : 0.51115         
##                                           
##        'Positive' Class : 1               
## 
lift_results = lift(relevel(as.factor(validation$Florence), ref = "1") ~ pred)
ggplot(lift_results)

Interpretation: All remaining variables are significant. One standard deviation more art book purchases is associated with about 1.30 times the odds of buying Florence, and frequency with about 1.20 times. The other variables decrease the odds, including, somewhat surprisingly, recency (R), with an odds multiplier of about 0.67.

Summary

Of all the models, the logistic regression model2 is the best because it achieves a sensitivity of 0.30 at a 0.1 cutoff. A low cutoff was used since we are interested in targeting buyers, and buyers have a low propensity to buy. Lowering the cutoff produces more false positives, but here a false positive only costs a wasted mailing, while a missed buyer is a lost sale.
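A quick way to see this trade-off; a sketch reusing pred and validation from the model2 evaluation above:

# Sensitivity and fraction of members mailed at a few candidate cutoffs
sapply(c(0.05, 0.1, 0.2, 0.5), function(cutoff) {
  mailed = pred > cutoff
  c(cutoff = cutoff,
    sensitivity = mean(mailed[validation$Florence == 1]),
    mailed_fraction = mean(mailed))
})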

It’s interesting to see that the model coefficient odds correspond to the EDA. For example, males have an increased chance of buying Florence relative to females, which shows up as the 0.74 odds multiplier for Gender.F. Similarly, the recency distributions show that members whose last purchase was longer ago are less likely to buy Florence, matching the 0.67 odds multiplier on R. We can also see a noticeable increase in the propensity to buy Florence as the number of art book purchases goes up.

We can also see from the lift plot that model2 captures about 62% of buyers while mailing only 50% of the population, an improvement over the baseline, under which we would have to mail 62% of the population to capture 62% of buyers.