CBC sent mailings to its club members each month containing the latest offerings. On the surface, CBC appeared very successful: mailing volume was increasing, the book selection was diversifying and growing, and the customer database was expanding. However, bottom-line profits were falling. The declining profits led CBC to revisit its original plan of using database marketing to improve mailing yields and stay profitable.
What is the response rate for the training data customers taken as a whole? What is the response rate for each of the 4 × 5 × 3 = 60 combinations of RFM categories? Which combinations have a response rate in the training data above the overall training response rate?
## [1] 0.08666667
What is the response rate for each of the 4 × 5 × 3 = 60 combinations of RFM categories?
| RFM combination | Response rate | RFM combination | Response rate | RFM combination | Response rate |
|---|---|---|---|---|---|
| 1 1 1 | 0.00000000 | 1 1 2 | 0.50000000 | 1 1 3 | 0.00000000 |
| 1 1 4 | 0.15384615 | 1 1 5 | 0.12000000 | 1 2 2 | 0.25000000 |
| 1 2 3 | 0.00000000 | 1 2 4 | 0.05000000 | 1 2 5 | 0.08695652 |
| 1 3 2 | 1.00000000 | 1 3 3 | 0.50000000 | 1 3 4 | 0.07692308 |
| 1 3 5 | 0.16666667 | 2 1 1 | 0.00000000 | 2 1 2 | 0.12500000 |
| 2 1 3 | 0.10000000 | 2 1 4 | 0.06060606 | 2 1 5 | 0.12121212 |
| 2 2 2 | 0.00000000 | 2 2 3 | 0.08695652 | 2 2 4 | 0.07500000 |
| 2 2 5 | 0.08771930 | 2 3 3 | 0.25000000 | 2 3 4 | 0.15909091 |
| 2 3 5 | 0.14285714 | 3 1 1 | 0.11111111 | 3 1 2 | 0.08695652 |
| 3 1 3 | 0.00000000 | 3 1 4 | 0.05813953 | 3 1 5 | 0.03361345 |
| 3 2 2 | 0.00000000 | 3 2 3 | 0.08108108 | 3 2 4 | 0.07894737 |
| 3 2 5 | 0.03191489 | 3 3 2 | 0.00000000 | 3 3 3 | 0.06250000 |
| 3 3 4 | 0.08108108 | 3 3 5 | 0.17142857 | 4 1 1 | 0.00000000 |
| 4 1 2 | 0.00000000 | 4 1 3 | 0.15384615 | 4 1 4 | 0.10810811 |
| 4 1 5 | 0.06140351 | 4 2 2 | 0.05882353 | 4 2 3 | 0.07407407 |
| 4 2 4 | 0.03361345 | 4 2 5 | 0.07462687 | 4 3 2 | 0.00000000 |
| 4 3 3 | 0.05555556 | 4 3 4 | 0.07377049 | 4 3 5 | 0.07973422 |
Notice that only 51 of the 60 possible combinations appear in this dataset; the remaining combinations have no customers, so their response rates are taken to be 0.
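A minimal sketch of how these per-combination rates, and the above-average combinations asked about next, could be computed (assuming a training data frame `train` with columns Rcode, Fcode, Mcode and the 0/1 outcome Florence; all object names are assumptions, not the original code):

```r
# Overall and per-combination response rates in the training data,
# plus the combinations whose rate exceeds the overall rate.
overall_rate <- mean(train$Florence)                            # overall training response rate

train$combo  <- paste(train$Rcode, train$Fcode, train$Mcode)    # combination label, e.g. "1 1 2"
combo_rates  <- tapply(train$Florence, train$combo, mean)       # response rate per R-F-M combination

above_avg    <- names(combo_rates)[combo_rates > overall_rate]  # above-average combinations
```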
Which combinations have a response rate in the training data above the overall training response rate?
## [1] "1 1 2"
## [1] "1 1 4"
## [1] "1 1 5"
## [1] "1 2 2"
## [1] "1 2 5"
## [1] "1 3 2"
## [1] "1 3 3"
## [1] "1 3 5"
## [1] "2 1 2"
## [1] "2 1 3"
## [1] "2 1 5"
## [1] "2 2 3"
## [1] "2 2 5"
## [1] "2 3 3"
## [1] "2 3 4"
## [1] "2 3 5"
## [1] "3 1 1"
## [1] "3 1 2"
## [1] "3 3 5"
## [1] "4 1 3"
## [1] "4 1 4"
Suppose that we decide to send promotional mail only to the “above-average” RFM combinations identified in part 1. Compute the response rate in the validation data using these combinations.
## [1] "Combination of the validation data 1 1 2 has probability 0"
## [1] "Combination of the validation data 1 1 4 has probability 0.25"
## [1] "Combination of the validation data 1 1 5 has probability 0.06667"
## [1] "Combination of the validation data 1 2 2 has probability 1"
## [1] "Combination of the validation data 1 2 5 has probability 0.0625"
## [1] "Combination of the validation data 1 3 2 has probability 0"
## [1] "Combination of the validation data 1 3 3 has probability 0.05882"
## [1] "Combination of the validation data 1 3 5 has probability 0.5"
## [1] "Combination of the validation data 2 1 2 has probability 0.08333"
## [1] "Combination of the validation data 2 1 3 has probability 0"
## [1] "Combination of the validation data 2 1 5 has probability 0.33333"
## [1] "Combination of the validation data 2 2 3 has probability 0.13636"
## [1] "Combination of the validation data 2 2 5 has probability 0.25"
## [1] "Combination of the validation data 2 3 3 has probability 0.08696"
## [1] "Combination of the validation data 2 3 4 has probability 0.16129"
## [1] "Combination of the validation data 2 3 5 has probability 0.2"
## [1] "Combination of the validation data 3 1 1 has probability 0.04762"
## [1] "Combination of the validation data 3 1 2 has probability 0.09524"
## [1] "Combination of the validation data 3 3 5 has probability 0.17647"
## [1] "Combination of the validation data 4 1 3 has probability 0.06522"
## [1] "Combination of the validation data 4 1 4 has probability 0"
Rework parts 1 and 2 with three segments:
RFM combinations that have response rates that exceed twice the overall response rate
In the output below we see the combinations whose training response rate exceeds twice the overall training response rate, and we then use those combinations to check the validation data.
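A sketch of how all three segments might be formed from the training rates, reusing `combo_rates` and `overall_rate` from the earlier sketch (object names are assumptions):

```r
# Three-way segmentation of the R-F-M combinations by training response rate.
seg1 <- names(combo_rates)[combo_rates > 2 * overall_rate]      # more than twice the overall rate
seg2 <- names(combo_rates)[combo_rates > overall_rate &
                           combo_rates <= 2 * overall_rate]     # above overall, but at most twice it
seg3 <- setdiff(names(combo_rates), c(seg1, seg2))              # everything else
```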
## [1] "1 1 2"
## [1] "1 2 2"
## [1] "1 3 2"
## [1] "1 3 3"
## [1] "2 3 3"
## [1] "Combination of the validation data 1 1 2 has probability 0"
## [1] "Combination of the validation data 1 2 2 has probability 1"
## [1] "Combination of the validation data 1 3 3 has probability 0"
## [1] "Combination of the validation data 1 3 4 has probability 0.05882"
## [1] "Combination of the validation data 2 3 4 has probability 0.08696"
RFM combinations that exceed the overall response rate but do not exceed twice the overall response rate.
In the output below we see the combinations that exceed the overall training response rate but do not exceed twice it, and we then use those combinations to check the validation data.
## [1] "1 1 4"
## [1] "1 1 5"
## [1] "1 2 5"
## [1] "1 3 5"
## [1] "2 1 2"
## [1] "2 1 3"
## [1] "2 1 5"
## [1] "2 2 3"
## [1] "2 2 5"
## [1] "2 3 4"
## [1] "2 3 5"
## [1] "3 1 1"
## [1] "3 1 2"
## [1] "3 3 5"
## [1] "4 1 3"
## [1] "4 1 4"
## [1] "Combination of the validation data 1 1 4 has probability 0.25"
## [1] "Combination of the validation data 1 1 5 has probability 0.06667"
## [1] "Combination of the validation data 1 2 5 has probability 0.0625"
## [1] "Combination of the validation data 2 1 1 has probability 0.5"
## [1] "Combination of the validation data 2 1 3 has probability 0.08333"
## [1] "Combination of the validation data 2 1 4 has probability 0"
## [1] "Combination of the validation data 2 2 2 has probability 0.33333"
## [1] "Combination of the validation data 2 2 4 has probability 0.13636"
## [1] "Combination of the validation data 2 3 3 has probability 0.25"
## [1] "Combination of the validation data 2 3 5 has probability 0.16129"
## [1] "Combination of the validation data 3 1 1 has probability 0.2"
## [1] "Combination of the validation data 3 1 2 has probability 0.04762"
## [1] "Combination of the validation data 3 1 3 has probability 0.09524"
## [1] "Combination of the validation data 4 1 2 has probability 0.17647"
## [1] "Combination of the validation data 4 1 5 has probability 0.06522"
## [1] "Combination of the validation data 4 2 2 has probability 0"
The remaining RFM combinations.
Printed below are the remaining combinations and the response rate of each in the validation data. Recall that 9 of the 60 combinations do not appear in the dataset at all; those are skipped here.
## [1] "1 1 1"
## [1] "1 1 3"
## [1] "1 2 3"
## [1] "1 2 4"
## [1] "1 3 4"
## [1] "2 1 1"
## [1] "2 1 4"
## [1] "2 2 2"
## [1] "2 2 4"
## [1] "3 1 3"
## [1] "3 1 4"
## [1] "3 1 5"
## [1] "3 2 2"
## [1] "3 2 3"
## [1] "3 2 4"
## [1] "3 2 5"
## [1] "3 3 2"
## [1] "3 3 3"
## [1] "3 3 4"
## [1] "4 1 1"
## [1] "4 1 2"
## [1] "4 1 5"
## [1] "4 2 2"
## [1] "4 2 3"
## [1] "4 2 4"
## [1] "4 2 5"
## [1] "4 3 2"
## [1] "4 3 3"
## [1] "4 3 4"
## [1] "4 3 5"
## [1] "Combination of the validation data 1 1 1 has probability 0"
## [1] "Combination of the validation data 1 1 3 has probability 0"
## [1] "Combination of the validation data 1 2 3 has probability 0"
## [1] "Combination of the validation data 1 2 4 has probability 0"
## [1] "Combination of the validation data 1 3 5 has probability 0.16667"
## [1] "Combination of the validation data 2 1 2 has probability 0.33333"
## [1] "Combination of the validation data 2 1 5 has probability 0.125"
## [1] "Combination of the validation data 2 2 3 has probability 0.14286"
## [1] "Combination of the validation data 2 2 5 has probability 0.15"
## [1] "Combination of the validation data 3 1 4 has probability 0.08163"
## [1] "Combination of the validation data 3 1 5 has probability 0.04688"
## [1] "Combination of the validation data 3 2 2 has probability 0.08333"
## [1] "Combination of the validation data 3 2 3 has probability 0.04545"
## [1] "Combination of the validation data 3 2 4 has probability 0.09091"
## [1] "Combination of the validation data 3 2 5 has probability 0.05"
## [1] "Combination of the validation data 3 3 3 has probability 0"
## [1] "Combination of the validation data 3 3 4 has probability 0.08333"
## [1] "Combination of the validation data 3 3 5 has probability 0.10884"
## [1] "Combination of the validation data 4 1 1 has probability 0"
## [1] "Combination of the validation data 4 1 3 has probability 0.04545"
## [1] "Combination of the validation data 4 1 4 has probability 0.02985"
## [1] "Combination of the validation data 4 2 3 has probability 0"
## [1] "Combination of the validation data 4 2 4 has probability 0.06494"
## [1] "Combination of the validation data 4 2 5 has probability 0.06667"
## [1] "Combination of the validation data 4 3 2 has probability 0"
## [1] "Combination of the validation data 4 3 3 has probability 0.07692"
## [1] "Combination of the validation data 4 3 4 has probability 0.09091"
## [1] "Combination of the validation data 4 3 5 has probability 0.07107"
Draw the lift curve showing the number of customers in the validation dataset on the x-axis and the cumulative number of buyers in the validation dataset on the y-axis. The lift curve consists of three points, one for each of the three segments above.
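A minimal sketch of how this three-point lift curve could be drawn, assuming `valid` carries the combo column and the `seg1`/`seg2`/`seg3` vectors from the sketches above:

```r
# Assign each validation customer to a segment, then accumulate customers and buyers
# segment by segment (best segment first) to get the three lift-curve points.
valid$segment <- ifelse(valid$combo %in% seg1, 1,
                        ifelse(valid$combo %in% seg2, 2, 3))

seg_customers <- tapply(valid$Florence, valid$segment, length)  # customers per segment
seg_buyers    <- tapply(valid$Florence, valid$segment, sum)     # buyers per segment

plot(cumsum(seg_customers), cumsum(seg_buyers), type = "b",
     xlab = "# customers contacted (validation)",
     ylab = "Cumulative # buyers (validation)")
abline(0, mean(valid$Florence), lty = 2)                        # random-selection reference line
```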
Accuracy measures (including the RMSE) for the fitted model are shown below, along with additional lift charts for the data.
## ME RMSE MAE MPE MAPE
## Test set -0.005374664 0.2722422 0.1509588 NaN Inf
Use the k-nearest-neighbors approach to classify cases with k = 1, 2, …, 11, using Florence as the outcome variable. Based on the validation set, find the best k. Remember to normalize all five variables.
## k accuracy
## 1 1 0.856250
## 2 2 0.913125
## 3 3 0.900625
## 4 4 0.911250
## 5 5 0.910000
## 6 6 0.910625
## 7 7 0.908750
## 8 8 0.918125
## 9 9 0.915625
## 10 10 0.918750
## 11 11 0.916250
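A minimal sketch of how this accuracy table could be produced with class::knn, assuming matrices `train_X`/`valid_X` of the five predictors and 0/1 outcome vectors `train_y`/`valid_y` (all names are assumptions; the predictors are normalized using the training means and standard deviations):

```r
library(class)

# Normalize the predictors using the training-set means and standard deviations.
train_n <- scale(train_X)
valid_n <- scale(valid_X,
                 center = attr(train_n, "scaled:center"),
                 scale  = attr(train_n, "scaled:scale"))

# Classify the validation cases for k = 1..11 and record the accuracy of each k.
acc <- sapply(1:11, function(k) {
  pred <- knn(train = train_n, test = valid_n, cl = factor(train_y), k = k)
  mean(as.character(pred) == as.character(valid_y))   # validation accuracy for this k
})
data.frame(k = 1:11, accuracy = acc)
```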
It appears that k = 10 has the highest accuracy (0.918750), so we set k = 10 and use that model to create a lift curve.
Looking at the decile-wise lift chart above, we see a large initial lift for the first 1% of the data: among the top 1% of customers there are 6.2 times as many purchases of The Art History of Florence as the baseline. Across roughly the top 4% of the data the lift stays above the baseline of 1: the chart shows about 1.4 times as many books purchased for the next 1%, 1.8 times as many for the 1% after that, and 2.3 times as many for the 1% after that. Beyond this point the lift curve falls below the baseline, meaning we would sell fewer books there than by selecting customers at random.
The k-NN prediction algorithm gives a numerical value, which is a weighted average of the values of the Florence variable for the k nearest neighbors, with weights that are inversely proportional to distance. Using the best k calculated above for k-NN classification, now run a model with k-NN prediction and compute a lift curve for the validation data.
Looking at the new model, we see a lift above the baseline for the first 23% of the data, versus the first 4% previously. The top 1% is expected to sell 1.5 times as many books, the next 10% about 1.8 times as many, and the next 12% about 1.2 times as many. After the 23rd percentile, predicted book sales fall below what we would achieve without a model.
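A minimal sketch of the distance-weighted k-NN prediction described above, written in base R and reusing the assumed normalized matrices and outcome vector from the previous sketch (k = 10):

```r
# Distance-weighted k-NN prediction: for each validation case, average the Florence
# values of its k nearest training neighbors with weights inversely proportional to distance.
knn_predict <- function(train_X, train_y, test_X, k = 10) {
  apply(test_X, 1, function(x) {
    d   <- sqrt(colSums((t(train_X) - x)^2))  # Euclidean distance to every training case
    idx <- order(d)[1:k]                      # indices of the k nearest neighbors
    w   <- 1 / (d[idx] + 1e-9)                # inverse-distance weights (small offset avoids /0)
    sum(w * train_y[idx]) / sum(w)            # weighted average of Florence
  })
}

pred_num <- knn_predict(train_n, train_y, valid_n, k = 10)  # scores used to build the lift curve
```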
Use the training set data of 1800 records to construct three logistic regression models with Florence as the outcome variable and each of the following sets of predictors:
The full set of 15 predictors in the dataset
##
## Call:
## glm(formula = Florence ~ Gender + Mcode + Rcode + Fcode + FirstPurch +
## ChildBks + YouthBks + CookBks + DoItYBks + RefBks + ArtBks +
## GeogBks + ItalCook + ItalAtlas + ItalArt, family = binomial,
## data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1458 -0.4588 -0.3908 -0.3453 2.7169
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.167837 0.578432 -3.748 0.000178 ***
## Gender -0.068650 0.180849 -0.380 0.704244
## Mcode -0.028672 0.100756 -0.285 0.775973
## Rcode -0.213095 0.097834 -2.178 0.029396 *
## Fcode 0.162272 0.156471 1.037 0.299704
## FirstPurch 0.012988 0.009764 1.330 0.183468
## ChildBks -0.117504 0.114495 -1.026 0.304759
## YouthBks 0.014162 0.151281 0.094 0.925418
## CookBks -0.305803 0.115021 -2.659 0.007845 **
## DoItYBks -0.033799 0.139244 -0.243 0.808211
## RefBks 0.002238 0.155645 0.014 0.988528
## ArtBks 0.417539 0.112142 3.723 0.000197 ***
## GeogBks 0.281203 0.100583 2.796 0.005178 **
## ItalCook 0.049818 0.211186 0.236 0.813515
## ItalAtlas -0.170876 0.435269 -0.393 0.694633
## ItalArt 0.511120 0.296715 1.723 0.084962 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1116.7 on 1799 degrees of freedom
## Residual deviance: 1069.1 on 1784 degrees of freedom
## AIC: 1101.1
##
## Number of Fisher Scoring iterations: 5
## ME RMSE MAE MPE MAPE
## Test set 2.463776 2.518779 2.463776 Inf Inf
A subset of predictors that you judge to be the best. I used the predictors with significant z-values: Rcode, CookBks, ArtBks, and GeogBks.
##
## Call:
## glm(formula = Florence ~ Rcode + CookBks + ArtBks + GeogBks,
## family = binomial, data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3099 -0.4590 -0.3962 -0.3561 2.5454
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.96999 0.27194 -7.244 4.35e-13 ***
## Rcode -0.17837 0.08524 -2.093 0.036382 *
## CookBks -0.12904 0.08374 -1.541 0.123309
## ArtBks 0.48601 0.10829 4.488 7.19e-06 ***
## GeogBks 0.34440 0.09383 3.670 0.000242 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1116.7 on 1799 degrees of freedom
## Residual deviance: 1079.3 on 1795 degrees of freedom
## AIC: 1089.3
##
## Number of Fisher Scoring iterations: 5
## ME RMSE MAE MPE MAPE
## Test set 2.432142 2.478631 2.432142 Inf Inf
Only the R, F, and M variables.
##
## Call:
## glm(formula = Florence ~ Mcode + Rcode + Fcode, family = binomial,
## data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5677 -0.4695 -0.4377 -0.3951 2.3528
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.29811 0.48973 -4.693 2.7e-06 ***
## Mcode 0.01479 0.09757 0.152 0.8796
## Rcode -0.16217 0.08296 -1.955 0.0506 .
## Fcode 0.21419 0.10687 2.004 0.0450 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1116.7 on 1799 degrees of freedom
## Residual deviance: 1108.0 on 1796 degrees of freedom
## AIC: 1116
##
## Number of Fisher Scoring iterations: 5
## ME RMSE MAE MPE MAPE
## Test set 2.383853 2.408067 2.383853 Inf Inf
Create a lift chart summarizing the results from the three logistic regression models created above, along with the expected lift for a random selection of an equal number of customers from the validation dataset.
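A minimal sketch of how this cumulative-gains comparison could be drawn, assuming the three fitted glm objects are named `logit_full`, `logit_subset`, and `logit_rfm` and the validation data frame is `valid2` (names are assumptions):

```r
# Cumulative number of actual buyers when customers are contacted in order of
# decreasing predicted probability, for each model.
cum_gains <- function(model, data) {
  p <- predict(model, newdata = data, type = "response")
  cumsum(data$Florence[order(p, decreasing = TRUE)])   # actual buyers, best scores first
}

n <- nrow(valid2)
plot(1:n, cum_gains(logit_full, valid2), type = "l",
     xlab = "# customers contacted", ylab = "Cumulative buyers")
lines(1:n, cum_gains(logit_subset, valid2), col = 2)
lines(1:n, cum_gains(logit_rfm, valid2),    col = 3)
abline(0, sum(valid2$Florence) / n, lty = 2)            # random-selection baseline
legend("topleft", c("Full (15 predictors)", "Subset", "RFM only", "Random"),
       col = c(1, 2, 3, 1), lty = c(1, 1, 1, 2))
```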
If the cutoff criterion for a campaign is a 30% likelihood of a purchase, find the customers in the validation data that would be targeted and count the number of buyers in this set.
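A minimal sketch of how the 30% cutoff could be applied and summarized with caret::confusionMatrix, using the assumed model and data names from above (shown for the full model; the other two models are handled the same way):

```r
library(caret)

# Predicted purchase probabilities on the validation data, cut at 0.30.
p_full   <- predict(logit_full, newdata = valid2, type = "response")
targeted <- which(p_full > 0.30)        # validation customers who would be mailed
length(targeted)                        # how many customers are targeted
sum(valid2$Florence[targeted])          # how many of them actually bought

# Confusion matrix for the 0.30 cutoff, with "1" (buyer) as the positive class.
confusionMatrix(factor(as.numeric(p_full > 0.30), levels = c(0, 1)),
                factor(valid2$Florence,            levels = c(0, 1)),
                positive = "1")
```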
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2021 158
## 1 9 12
##
## Accuracy : 0.9241
## 95% CI : (0.9122, 0.9348)
## No Information Rate : 0.9227
## P-Value [Acc > NIR] : 0.4251
##
## Kappa : 0.1105
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.070588
## Specificity : 0.995567
## Pos Pred Value : 0.571429
## Neg Pred Value : 0.927490
## Prevalence : 0.077273
## Detection Rate : 0.005455
## Detection Prevalence : 0.009545
## Balanced Accuracy : 0.533077
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2021 161
## 1 9 9
##
## Accuracy : 0.9227
## 95% CI : (0.9108, 0.9335)
## No Information Rate : 0.9227
## P-Value [Acc > NIR] : 0.5204
##
## Kappa : 0.0822
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.052941
## Specificity : 0.995567
## Pos Pred Value : 0.500000
## Neg Pred Value : 0.926214
## Prevalence : 0.077273
## Detection Rate : 0.004091
## Detection Prevalence : 0.008182
## Balanced Accuracy : 0.524254
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2030 170
## 1 0 0
##
## Accuracy : 0.9227
## 95% CI : (0.9108, 0.9335)
## No Information Rate : 0.9227
## P-Value [Acc > NIR] : 0.5204
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.00000
## Specificity : 1.00000
## Pos Pred Value : NaN
## Neg Pred Value : 0.92273
## Prevalence : 0.07727
## Detection Rate : 0.00000
## Detection Prevalence : 0.00000
## Balanced Accuracy : 0.50000
##
## 'Positive' Class : 1
##
The list below identifies the customers in the validation data who are predicted to buy the book (predicted probability above 0.30) by the first model, which uses all 15 predictors.
## [1] "Person 41 has a predicted probability of 0.45734 to buy the book"
## [1] "Person 48 has a predicted probability of 0.33687 to buy the book"
## [1] "Person 49 has a predicted probability of 0.36468 to buy the book"
## [1] "Person 51 has a predicted probability of 0.54456 to buy the book"
## [1] "Person 145 has a predicted probability of 0.44901 to buy the book"
## [1] "Person 253 has a predicted probability of 0.30608 to buy the book"
## [1] "Person 266 has a predicted probability of 0.40627 to buy the book"
## [1] "Person 401 has a predicted probability of 0.50213 to buy the book"
## [1] "Person 412 has a predicted probability of 0.45774 to buy the book"
## [1] "Person 425 has a predicted probability of 0.37764 to buy the book"
## [1] "Person 468 has a predicted probability of 0.57547 to buy the book"
## [1] "Person 475 has a predicted probability of 0.31685 to buy the book"
## [1] "Person 522 has a predicted probability of 0.35693 to buy the book"
## [1] "Person 523 has a predicted probability of 0.36226 to buy the book"
## [1] "Person 548 has a predicted probability of 0.34993 to buy the book"
## [1] "Person 599 has a predicted probability of 0.42946 to buy the book"
## [1] "Person 778 has a predicted probability of 0.31181 to buy the book"
## [1] "Person 794 has a predicted probability of 0.37035 to buy the book"
## [1] "Person 808 has a predicted probability of 0.38293 to buy the book"
## [1] "Person 1350 has a predicted probability of 0.30199 to buy the book"
## [1] "Person 1921 has a predicted probability of 0.36539 to buy the book"
The list below identifies the customers in the validation data who are predicted to buy the book by the second model, which uses the Rcode, CookBks, ArtBks, and GeogBks predictors.
## [1] "Person 49 has a predicted probability of 0.30847 to buy the book"
## [1] "Person 51 has a predicted probability of 0.49961 to buy the book"
## [1] "Person 145 has a predicted probability of 0.37757 to buy the book"
## [1] "Person 253 has a predicted probability of 0.33941 to buy the book"
## [1] "Person 401 has a predicted probability of 0.40251 to buy the book"
## [1] "Person 412 has a predicted probability of 0.41138 to buy the book"
## [1] "Person 425 has a predicted probability of 0.33665 to buy the book"
## [1] "Person 468 has a predicted probability of 0.50578 to buy the book"
## [1] "Person 475 has a predicted probability of 0.3477 to buy the book"
## [1] "Person 522 has a predicted probability of 0.36753 to buy the book"
## [1] "Person 702 has a predicted probability of 0.32177 to buy the book"
## [1] "Person 808 has a predicted probability of 0.35331 to buy the book"
## [1] "Person 862 has a predicted probability of 0.33396 to buy the book"
## [1] "Person 888 has a predicted probability of 0.33941 to buy the book"
## [1] "Person 1338 has a predicted probability of 0.30328 to buy the book"
## [1] "Person 1350 has a predicted probability of 0.33396 to buy the book"
## [1] "Person 1383 has a predicted probability of 0.32167 to buy the book"
## [1] "Person 1612 has a predicted probability of 0.30063 to buy the book"
The corresponding list for the third model, which uses the Rcode, Fcode, and Mcode predictors, is empty: no customers are predicted above the 30% cutoff for this model.
Notice that, based on the confusion matrices: the first model with all 15 predictors identifies 21 predicted buyers, of whom only 12 (listed above) actually bought the book; the second model with my chosen predictors identifies 18 predicted buyers, of whom only 9 bought the book; and the third model with just Rcode, Fcode, and Mcode as predictors identifies 0 predicted buyers, and so 0 buyers among them.