Executive Summary

This study examines what drives consumer purchase decisions using three statistical methods: linear regression, logistic regression, and support vector machines (SVM). Gender, how much and how often customers buy, and the types of products they choose all significantly affect the purchase decision. The results show both the complexity of consumer behavior and the value of statistical techniques for predicting it, which in turn makes marketing campaigns more effective. The study also reaffirms that database marketing techniques improve our understanding of customers, help target new and existing customers efficiently, and provide a credible basis for fact-based decision making.

Problem

The Bookbinders Book Club (BBBC) wants to improve how it targets customers with its direct mail program so that it can compete with large bookstores and online sellers. The goal is to predict which books customers will likely buy, using data analysis methods. This study explores different prediction techniques to find out which factors most influence customers’ choices. A second problem in this case study is that people’s tastes and preferences are always shifting. With thousands of new titles published every year, BBBC needs a way to stay ahead of trends and to make meaningful connections between customers and products, so that it can respond to a variety of customer types.

Literature Review

Like other machine learning algorithms, SVM uses mathematical functions to perform prediction and classification tasks. SVM is versatile, effective in high dimensions, and works well with data that is not linearly separable. An SVM separates the data points with a hyperplane. The distance between the hyperplane and the closest points is called the margin, and those closest points are called support vectors. From the scikit-learn documentation, I learned that it is important to find a balance between the smoothness of the decision surface and the weight placed on classifying every training point correctly. The documentation also explains that different kernels suit different datasets and that even custom kernels can be used. SVM is a powerful tool for supervised learning. https://scikit-learn.org/stable/modules/svm.html
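The margin and the two tuning parameters discussed in this report can be written compactly. The block below is the standard soft-margin formulation and the RBF kernel, not something specific to this study; the symbols \(w\), \(b\), \(\xi_i\), \(\phi\), \(C\), and \(\gamma\) are notation introduced here for illustration.

```latex
\[
\min_{w,\,b,\,\xi}\; \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\,\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i, \qquad \xi_i \ge 0
\]
\[
K(x, x') = \exp\!\bigl(-\gamma \lVert x - x' \rVert^{2}\bigr)
\]
```

The margin has width \(2/\lVert w \rVert\), so minimizing \(\lVert w \rVert\) widens it. A larger cost \(C\) penalizes misclassified training points more heavily, and a larger \(\gamma\) makes the RBF decision surface less smooth.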

Methods

This study uses data analysis to predict which books customers will buy, focusing on customer details and past purchases. We used three main techniques: linear regression, logistic regression, and Support Vector Machines (SVMs). The study looks at how well these methods work, how they handle different types of data, and what they assume about how the predictors relate to purchasing behavior.

Linear regression model - We measured how much of the variance in the dependent variable the predictors in the data set explain. Using all the variables, the model explained only about 24% of the variance in “Choice”. This is most likely because linear regression assumes a continuous dependent variable, whereas “Choice” is binary; violating that assumption produces results that are poor or not meaningful. The residual plots give another reason this model is a poor choice: the homoscedasticity and normality-of-residuals assumptions are also violated.
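A small simulation can illustrate why a linear model struggles with a 0/1 response. This is a minimal sketch on simulated data, not the BBBC file; the variable names and the coefficients used to generate the data are invented for illustration.

```r
# Simulated stand-in data (hypothetical; not the BBBC file).
set.seed(1)
n <- 500
frequency <- runif(n, 0, 40)
p_buy  <- plogis(2 - 0.25 * frequency)  # true purchase probability
choice <- rbinom(n, 1, p_buy)           # binary 0/1 response

# Linear regression on a 0/1 outcome: nothing bounds the fitted values.
fit_linear    <- lm(choice ~ frequency)
fitted_linear <- fitted(fit_linear)

# Logistic regression keeps fitted values strictly inside (0, 1).
fit_logit    <- glm(choice ~ frequency, family = binomial)
fitted_logit <- fitted(fit_logit)

range(fitted_linear)  # extends below 0 for this simulation
range(fitted_logit)
```

Fitted values below 0 cannot be read as probabilities, which is one practical symptom of the assumption violation described above.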

Logistic regression - Logistic regression works well when the target variable is binary, like the one used in this study, and is well suited to classification tasks. A logistic model is first fit on a training set, and the fitted model is then used to make predictions on new data, which for this report is the testing set. The model was evaluated by counting its correct and incorrect predictions. At the default 0.5 cutoff the AUC from the ROC curve was about 65%, but because the predictions were skewed toward one class, changing the cutoff could improve performance. Using a cost function, we weighted false negatives three times as heavily as false positives because of the class imbalance in the dataset. We also removed “Last_purchase” and “First_purchase” because they were highly correlated. With the optimal cutoff, the AUC rose to about 72% on the training set, and the model reached about 72% accuracy on the testing set.
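The fit-then-predict workflow described above can be sketched as follows. The data are simulated, the column names only mimic the report's, and the 600/400 split is arbitrary; treat it as an illustration rather than the code used for the study.

```r
# Simulated stand-in data (hypothetical values; not the BBBC file).
set.seed(42)
n <- 1000
dat <- data.frame(
  Gender    = rbinom(n, 1, 0.6),
  Frequency = round(runif(n, 2, 36)),
  P_Art     = rpois(n, 0.4)
)
dat$Choice <- rbinom(n, 1, plogis(-0.3 - 0.8 * dat$Gender -
                                  0.09 * dat$Frequency + 1.2 * dat$P_Art))

train_idx <- sample(n, 600)
train_toy <- dat[train_idx, ]
test_toy  <- dat[-train_idx, ]

fit <- glm(Choice ~ ., family = binomial, data = train_toy)

# type = "response" returns probabilities rather than log-odds.
prob_test <- predict(fit, newdata = test_toy, type = "response")
pred_test <- as.integer(prob_test >= 0.5)

# Rows are predictions, columns are observed outcomes.
conf_mat <- table(Prediction = factor(pred_test, levels = 0:1),
                  Reference  = factor(test_toy$Choice, levels = 0:1))
conf_mat
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

Keeping predictions in the rows and observed outcomes in the columns matters, because sensitivity and specificity are computed per column of observed outcomes.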

SVM - The last model we built was a support vector machine (SVM), which is mostly used for classification. We tried three kernels (linear, polynomial, and RBF) because we did not know in advance whether the data was linearly separable. We set the target variable type to factor and tuned our RBF kernel for the best-performing gamma and cost. The first model used a linear kernel. It reached about 90% accuracy and 96% sensitivity, but its specificity was very poor at 28%. Next, we tried a polynomial kernel. Accuracy and sensitivity both improved slightly, but specificity dropped even further, to about 5%. Our final SVM model used the tuned RBF kernel. Our goal was to find a balance between the smoothness of the decision boundary, controlled by gamma, and correct classification of the training points, controlled by cost. This model's accuracy was about the same as the polynomial model's, with specificity near 20%, and it ended up doing better for our business scenario.

Data

The data had already been cleaned and split into training and testing files before we opened it. The dataset contains information from a book club about its customers and their purchasing habits. All variables are initially read in as numeric; however, exploring the data shows that most of them are categorical in nature. The variable “Observation” is removed because it is only a row index and would weaken the linear regression, logistic regression, and SVM models. There are four continuous variables: “Amount_purchased”, “Frequency”, “First_purchase”, and “Last_purchase”. “First_purchase” may have outliers and may be collinear with “Last_purchase” because the two are highly correlated.
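The preparation steps described above might look like the following in base R. The rows are invented for illustration and only the column names follow the report.

```r
# Toy rows with the report's column names (values are invented).
toy <- data.frame(
  Observation      = 1:6,
  Choice           = c(1, 0, 0, 1, 0, 1),
  Gender           = c(1, 0, 1, 1, 0, 0),
  Amount_purchased = c(113, 418, 336, 180, 320, 268),
  Frequency        = c(8, 6, 18, 16, 2, 4),
  Last_purchase    = c(1, 2, 2, 1, 3, 1),
  First_purchase   = c(8, 12, 36, 16, 6, 4)
)

# Drop the row index; it carries no information about the customer.
toy$Observation <- NULL

# Treat the binary columns as categorical.
toy$Choice <- factor(toy$Choice)
toy$Gender <- factor(toy$Gender)

str(toy)

# A pairwise correlation is a quick first check for collinear predictors.
purchase_cor <- cor(toy$First_purchase, toy$Last_purchase)
purchase_cor
```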

Training sample:

  • 1600 observations
  • 41.03% of the dataset

Testing sample:

  • 2300 observations
  • 58.97% of the dataset
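The split percentages follow directly from the two sample sizes, as this short check shows:

```r
n_train <- 1600
n_test  <- 2300
shares  <- round(100 * c(train = n_train, test = n_test) / (n_train + n_test), 2)
shares
# train  test
# 41.03 58.97
```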

Conclusions

While mailing customers without a predictive model yields the highest profit, it also incurs the highest cost and, in turn, the most risk. Logistic regression is more efficient: it drastically reduces the cost while only slightly reducing the profit, making it a more cost-effective strategy than untargeted marketing. SVM reduces the investment even further, but with a lower return than the logistic model.

The most significant predictor variables are Gender, Frequency, P_Cook, and P_Art.

Negative estimates: Gender, Frequency, P_Cook

Positive estimates: P_Art

Detailed Model Processing and Analysis

Linear regression

Coefficient Review


Call:
lm(formula = Choice ~ . - Observation, data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9603 -0.2462 -0.1161  0.1622  1.0588 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.3642284  0.0307411  11.848  < 2e-16 ***
Gender           -0.1309205  0.0200303  -6.536 8.48e-11 ***
Amount_purchased  0.0002736  0.0001110   2.464   0.0138 *  
Frequency        -0.0090868  0.0021791  -4.170 3.21e-05 ***
Last_purchase     0.0970286  0.0135589   7.156 1.26e-12 ***
First_purchase   -0.0020024  0.0018160  -1.103   0.2704    
P_Child          -0.1262584  0.0164011  -7.698 2.41e-14 ***
P_Youth          -0.0963563  0.0201097  -4.792 1.81e-06 ***
P_Cook           -0.1414907  0.0166064  -8.520  < 2e-16 ***
P_DIY            -0.1352313  0.0197873  -6.834 1.17e-11 ***
P_Art             0.1178494  0.0194427   6.061 1.68e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3788 on 1589 degrees of freedom
Multiple R-squared:  0.2401,    Adjusted R-squared:  0.2353 
F-statistic:  50.2 on 10 and 1589 DF,  p-value: < 2.2e-16

  • (Intercept): 0.3642284, p < 2e-16, indicating a significant baseline level.
  • Gender: -0.1309205, p = 8.48e-11, showing a negative effect.
  • Amount_purchased: 0.0002736, p = 0.0138, slightly positive and significant.
  • Frequency: -0.0090868, p = 3.21e-05, negatively associated.
  • Last_purchase: 0.0970286, p = 1.26e-12, significantly positive.
  • First_purchase: -0.0020024, p = 0.2704, not significantly impacting.
  • P_Child: -0.1262584, p = 2.41e-14, strongly negative.
  • P_Youth: -0.0963563, p = 1.81e-06, negatively affecting.
  • P_Cook: -0.1414907, p < 2e-16, significantly negative.
  • P_DIY: -0.1352313, p = 1.17e-11, negative impact.
  • P_Art: 0.1178494, p = 1.68e-09, positively associated.

Summary Statistics

  • R-Squared: 24.01% of variance explained by the model.
  • F-statistic: Supports overall significance, p < 2.2e-16.
  • A linear model does not appear to be a good predictor for this data.

Linear Model Plots

  • Residuals vs Fitted plot shows a pattern, suggesting non-linearity.
  • Q-Q plot indicates that the residuals are not normally distributed.
  • Scale-Location plot points to heteroscedasticity: the residuals have non-constant variance.
  • Residuals vs Leverage plot identifies a few outliers.
  • Together, these plots show that the linear model is not a good fit for the data.

MSE - Linear model

Training MSE: 0.1424887

Testing MSE : 0.092551
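Training and testing MSE can be computed with a one-line helper. This is a minimal sketch on simulated data; the data frame, split, and the `mse` helper are illustrative names, not the objects used in the study.

```r
# Simulated stand-in data (hypothetical; not the BBBC file).
set.seed(7)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + x))
sim <- data.frame(x = x, y = y)
train_sim <- sim[1:200, ]
test_sim  <- sim[201:300, ]

fit_lm <- lm(y ~ x, data = train_sim)

# Mean squared error: average squared gap between observed and predicted.
mse <- function(actual, predicted) mean((actual - predicted)^2)

train_mse <- mse(train_sim$y, predict(fit_lm, newdata = train_sim))
test_mse  <- mse(test_sim$y,  predict(fit_lm, newdata = test_sim))
c(train = train_mse, test = test_mse)
```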

Logistic regression

Collinearity Check

          Gender Amount_purchased        Frequency    Last_purchase 
        1.023359         1.232172         2.490447        17.706670 
  First_purchase          P_Child          P_Youth           P_Cook 
        9.247748         2.992269         1.761546         3.229097 
           P_DIY            P_Art 
        1.992698         1.938089 

  • Removed Observation since it is only an index marker.
  • Removed Last_purchase and First_purchase from the models due to multicollinearity.

          Gender Amount_purchased        Frequency          P_Child 
        1.020217         1.213528         1.015899         1.215500 
         P_Youth           P_Cook            P_DIY            P_Art 
        1.081019         1.228798         1.179821         1.229491 
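The idea behind these variance inflation factors can be reproduced by hand: regress each predictor on the others and compute 1 / (1 - R²). The sketch below uses simulated predictors and a hypothetical helper named `vif_manual`; it is not the function that produced the output above.

```r
# Manual VIF: regress each predictor on the others, VIF_j = 1 / (1 - R^2_j).
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    r2 <- summary(lm(reformulate(others, response = v),
                     data = predictors))$r.squared
    1 / (1 - r2)
  })
}

# Simulated predictors (hypothetical values; not the BBBC file).
set.seed(3)
n <- 400
first <- runif(n, 2, 90)
X <- data.frame(
  First_purchase = first,
  Last_purchase  = first / 8 + rnorm(n, sd = 0.5),  # built to track First_purchase
  Gender         = rbinom(n, 1, 0.5)
)
vifs <- vif_manual(X)
round(vifs, 2)
```

In this simulation the two purchase variables inflate each other's VIF while Gender stays near 1, which mirrors the pattern in the collinearity check above.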

Coefficient Review


Call:
glm(formula = Choice ~ . - Observation - Last_purchase - First_purchase, 
    family = "binomial", data = train_data)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -0.286380   0.202966  -1.411  0.15825    
Gender           -0.811948   0.134579  -6.033 1.61e-09 ***
Amount_purchased  0.002406   0.000771   3.120  0.00181 ** 
Frequency        -0.088625   0.010385  -8.534  < 2e-16 ***
P_Child          -0.194796   0.072207  -2.698  0.00698 ** 
P_Youth          -0.031928   0.109605  -0.291  0.77082    
P_Cook           -0.292392   0.072998  -4.005 6.19e-05 ***
P_DIY            -0.279282   0.108094  -2.584  0.00977 ** 
P_Art             1.245842   0.099062  12.576  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1799.5  on 1599  degrees of freedom
Residual deviance: 1445.0  on 1591  degrees of freedom
AIC: 1463

Number of Fisher Scoring iterations: 5

  • Gender (Significant): Males are less likely to make the purchase compared to females.
  • Amount_purchased (Significant): Spending more money is associated with a higher likelihood of purchasing.
  • Frequency (Significant): Buying more frequently is associated with a lower likelihood of purchasing.
  • P_Child (Significant): A preference for child-related products is associated with a lower likelihood of purchasing.
  • P_Youth: A preference for youth-related products does not significantly affect the likelihood of purchasing.
  • P_Cook (Significant): A preference for cooking-related products is associated with a lower likelihood of purchasing.
  • P_DIY (Significant): A preference for DIY-related products is associated with a lower likelihood of purchasing.
  • P_Art (Significant): A preference for art-related products increases the likelihood of purchasing, indicating a strong positive association.
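Because logistic coefficients are on the log-odds scale, exponentiating them gives odds ratios that are easier to interpret. The estimates below are copied from the glm summary; the vector names are illustrative.

```r
# Estimates copied from the glm summary in this section.
log_odds <- c(Gender = -0.811948, Amount_purchased = 0.002406,
              Frequency = -0.088625, P_Child = -0.194796,
              P_Youth = -0.031928, P_Cook = -0.292392,
              P_DIY = -0.279282, P_Art = 1.245842)

# exp() turns a log-odds coefficient into an odds ratio per one-unit increase.
odds_ratio <- exp(log_odds)
round(odds_ratio, 3)
```

For example, a one-unit increase in P_Art multiplies the odds of purchase by roughly 3.5, while the Gender odds ratio of about 0.44 means the odds are less than half for the group coded 1, holding the other predictors fixed.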

Logistic model - Standard .5 p-cut

This is the base model before adjusting the probability cutoff (p-cut), which is set to 0.5 here. It leads to the following results on the training data:

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1133   67
         1  258  142
                                          
               Accuracy : 0.7969          
                 95% CI : (0.7763, 0.8163)
    No Information Rate : 0.8694          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.3558          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8145          
            Specificity : 0.6794          
         Pos Pred Value : 0.9442          
         Neg Pred Value : 0.3550          
             Prevalence : 0.8694          
         Detection Rate : 0.7081          
   Detection Prevalence : 0.7500          
      Balanced Accuracy : 0.7470          
                                          
       'Positive' Class : 0               
                                          
Area under the curve: 0.6496
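The headline statistics in this output follow directly from the four cell counts. The sketch below recomputes them in base R, treating class 0 as the positive class as the output does; the object names are illustrative.

```r
# Counts copied from the confusion matrix above (filled column by column).
cm <- matrix(c(1133, 258, 67, 142), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # correct 0s / all reference 0s
specificity <- cm["1", "1"] / sum(cm[, "1"])  # correct 1s / all reference 1s
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced), 4)
```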

Optimal p-cut calculation

# Weighted misclassification cost.
# obs: observed response (0/1); pred.p: predicted probability; pcut: cutoff.
costfunc <- function(obs, pred.p, pcut) {
    weight1 <- 3   # weight for "true = 1 but pred = 0" (FN)
    weight0 <- 1   # weight for "true = 0 but pred = 1" (FP)
    c1 <- (obs == 1) & (pred.p < pcut)    # flags each FN
    c0 <- (obs == 0) & (pred.p >= pcut)   # flags each FP
    mean(weight1 * c1 + weight0 * c0)     # average weighted misclassification cost
}

# Candidate cutoffs from 0.01 to 1 in steps of 0.01
p.seq <- seq(0.01, 1, 0.01)

# pred_prob.train is assumed to hold the fitted training-set probabilities
mean_error <- rep(0, length(p.seq))
for (i in seq_along(p.seq)) {
    mean_error[i] <- costfunc(obs = train_data$Choice, pred.p = pred_prob.train, pcut = p.seq[i])
}

# Plot X axis: pcut | Y axis: associated cost
plot(p.seq, mean_error, xlab = "P Cut", ylab = "Cost")

# which.min() returns the first cutoff that attains the lowest cost
optimal_pcut <- p.seq[which.min(mean_error)]
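The same cutoff search can be tried on simulated data to see how the weights drive the result. This is a hedged sketch: the probabilities are invented, and `weighted_cost` is an illustrative variant of the cost function with the weights exposed as arguments.

```r
# Simulated outcomes and predicted probabilities (hypothetical).
set.seed(11)
obs_toy  <- rbinom(500, 1, 0.25)
prob_toy <- plogis(rnorm(500, mean = ifelse(obs_toy == 1, 0, -1.5)))

weighted_cost <- function(obs, pred.p, pcut, w_fn = 3, w_fp = 1) {
  fn <- (obs == 1) & (pred.p <  pcut)   # missed buyers
  fp <- (obs == 0) & (pred.p >= pcut)   # wasted mailers
  mean(w_fn * fn + w_fp * fp)
}

cutoffs  <- seq(0.01, 1, by = 0.01)
costs    <- sapply(cutoffs, function(p) weighted_cost(obs_toy, prob_toy, p))
best_cut <- cutoffs[which.min(costs)]
best_cut
```

For well-calibrated probabilities, the break-even cutoff under these weights is w_fp / (w_fp + w_fn) = 1 / (1 + 3) = 0.25, so an empirical optimum close to that value is what one would expect.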

Logistic model - Optimal p-cut

After running the cost function with weights of 3 (false negatives) and 1 (false positives), the optimal p-cut is 0.24. This leads to the following results:

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 869 331
         1 116 284
                                          
               Accuracy : 0.7206          
                 95% CI : (0.6979, 0.7425)
    No Information Rate : 0.6156          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3682          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.8822          
            Specificity : 0.4618          
         Pos Pred Value : 0.7242          
         Neg Pred Value : 0.7100          
             Prevalence : 0.6156          
         Detection Rate : 0.5431          
   Detection Prevalence : 0.7500          
      Balanced Accuracy : 0.6720          
                                          
       'Positive' Class : 0               
                                          
Area under the curve: 0.7171

Logistic model - Testing

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1524  572
         1   63  141
                                          
               Accuracy : 0.7239          
                 95% CI : (0.7051, 0.7421)
    No Information Rate : 0.69            
    P-Value [Acc > NIR] : 0.000209        
                                          
                  Kappa : 0.1967          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9603          
            Specificity : 0.1978          
         Pos Pred Value : 0.7271          
         Neg Pred Value : 0.6912          
             Prevalence : 0.6900          
         Detection Rate : 0.6626          
   Detection Prevalence : 0.9113          
      Balanced Accuracy : 0.5790          
                                          
       'Positive' Class : 0               
                                          
Area under the curve: 0.7091

SVM

We create the SVM formula and tune gamma and cost to account for the data imbalance.

form1 <- as.factor(Choice) ~ .-Observation-First_purchase-Last_purchase
tuned <- tune.svm(form1, data = train_data, gamma = seq(.01, .1, by = .01), cost = seq(.1, 1, by = .1))
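The size of the search grid passed to tune.svm can be checked in base R. The sketch below only enumerates the candidate pairs; it does not run the tuning itself.

```r
gamma_grid <- seq(0.01, 0.1, by = 0.01)
cost_grid  <- seq(0.1, 1, by = 0.1)

# Every (gamma, cost) combination that the grid search evaluates.
grid <- expand.grid(gamma = gamma_grid, cost = cost_grid)
nrow(grid)        # 100 candidate pairs
range(grid$cost)  # the grid stops at cost = 1
```

The fitted models below all report cost = 1, which sits on the upper edge of this grid, so a wider cost range may be worth trying. If tune.svm is using its default 10-fold cross-validation (an assumption about the defaults, not something shown in the output), the search fits roughly 1000 models.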

SVM Models

Linear


Call:
svm(formula = form1, data = train_data, gamma = tuned$best.parameters$gamma, 
    cost = tuned$best.parameters$cost, type = "C-classification", 
    kernel = "linear")


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  1 

Number of Support Vectors:  754
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 2015  146
         1   81   58
                                          
               Accuracy : 0.9013          
                 95% CI : (0.8884, 0.9132)
    No Information Rate : 0.9113          
    P-Value [Acc > NIR] : 0.9559          
                                          
                  Kappa : 0.2869          
                                          
 Mcnemar's Test P-Value : 2.159e-05       
                                          
            Sensitivity : 0.9614          
            Specificity : 0.2843          
         Pos Pred Value : 0.9324          
         Neg Pred Value : 0.4173          
             Prevalence : 0.9113          
         Detection Rate : 0.8761          
   Detection Prevalence : 0.9396          
      Balanced Accuracy : 0.6228          
                                          
       'Positive' Class : 0               
                                          
Area under the curve: 0.6228

Polynomial


Call:
svm(formula = form1, data = train_data, gamma = tuned$best.parameters$gamma, 
    cost = tuned$best.parameters$cost, type = "C-classification", 
    kernel = "polynomial")


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  polynomial 
       cost:  1 
     degree:  3 
     coef.0:  0 

Number of Support Vectors:  783
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 2074  193
         1   22   11
                                          
               Accuracy : 0.9065          
                 95% CI : (0.8939, 0.9181)
    No Information Rate : 0.9113          
    P-Value [Acc > NIR] : 0.8013          
                                          
                  Kappa : 0.0699          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.98950         
            Specificity : 0.05392         
         Pos Pred Value : 0.91487         
         Neg Pred Value : 0.33333         
             Prevalence : 0.91130         
         Detection Rate : 0.90174         
   Detection Prevalence : 0.98565         
      Balanced Accuracy : 0.52171         
                                          
       'Positive' Class : 0               
                                          
Area under the curve: 0.5217

Radial


Call:
svm(formula = form1, data = train_data, gamma = tuned$best.parameters$gamma, 
    cost = tuned$best.parameters$cost, type = "C-classification", 
    kernel = "radial")


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 

Number of Support Vectors:  785
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 2055  164
         1   41   40
                                          
               Accuracy : 0.9109          
                 95% CI : (0.8985, 0.9222)
    No Information Rate : 0.9113          
    P-Value [Acc > NIR] : 0.5477          
                                          
                  Kappa : 0.2425          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.9804          
            Specificity : 0.1961          
         Pos Pred Value : 0.9261          
         Neg Pred Value : 0.4938          
             Prevalence : 0.9113          
         Detection Rate : 0.8935          
   Detection Prevalence : 0.9648          
      Balanced Accuracy : 0.5883          
                                          
       'Positive' Class : 0               
                                          
Area under the curve: 0.5883
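The three SVM confusion matrices can be compared side by side by recomputing sensitivity, specificity, and balanced accuracy from their counts. The column names below are illustrative; class 0 is treated as the positive class, as in the output.

```r
# Counts copied from the three test-set confusion matrices above.
svm_counts <- data.frame(
  kernel = c("linear", "polynomial", "radial"),
  n00 = c(2015, 2074, 2055),  # predicted 0, observed 0
  n01 = c(146, 193, 164),     # predicted 0, observed 1
  n10 = c(81, 22, 41),        # predicted 1, observed 0
  n11 = c(58, 11, 40)         # predicted 1, observed 1
)

svm_counts$sensitivity <- with(svm_counts, n00 / (n00 + n10))
svm_counts$specificity <- with(svm_counts, n11 / (n11 + n01))
svm_counts$balanced    <- with(svm_counts, (sensitivity + specificity) / 2)

svm_counts[, c("kernel", "sensitivity", "specificity", "balanced")]
```

The balanced accuracy for each kernel matches the reported area under the curve (0.6228, 0.5217, 0.5883). That is what one would expect if the ROC curve was built from hard class labels rather than continuous scores, since a single-threshold ROC curve has an area of (sensitivity + specificity) / 2.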

Results

Comparison of Models

Model Comparison Table
Model                 R_Squared   Accuracy   Sensitivity   Specificity
Linear Regression     0.2401      NA         NA            NA
Logistic Regression   NA          0.7239     0.9603        0.1978
SVM (Radial)          NA          0.9109     0.9804        0.1961
SVM (Linear)          NA          0.9013     0.9614        0.2843
SVM (Polynomial)      NA          0.9065     0.9895        0.0539

Model Comparison Review

  • Linear Regression has an R-squared of 24.01%, so it explains only about a quarter of the variance in the dependent variable.
  • Logistic Regression has an accuracy of 0.7239 and high sensitivity (0.9603) but low specificity (0.1978).
  • The SVM Radial model is highly accurate (0.9109) and sensitive (0.9804) but, like Logistic Regression, struggles with specificity (0.1961).
  • The Linear and Polynomial SVMs also have high accuracy, with the Linear variant offering the best balance between sensitivity (0.9614) and specificity (0.2843).

Profit Equations

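\(\text{Expected Response Rate} = \dfrac{\text{Predicted Positives}}{\text{Testing Data Size}}\)

\(\text{Targeted Customers} = \text{Expected Response Rate} \times \text{Data Size}\)

\(\text{Expected Sales} = \dfrac{\text{True Positives}}{\text{Targeted Customers}} \times \text{Data Size}\)

\(\text{Expected Profit} = (\text{Expected Sales} \times \text{Revenue}) - (\text{Targeted Customers} \times \text{Mailing Cost})\)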
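The profit equations can be checked with a short function. The constants below are assumptions rather than values stated in the text: the mailing cost and per-sale revenue are back-solved from the projection table, and the denominator of 2300 is the testing-sample size, which is the value that reproduces the table's response rates. The function name `project_profit` is illustrative.

```r
# Assumed constants (back-solved from the projection table; illustrative).
mailing_cost <- 0.65    # 32500 / 50000 mailers in the "No Model" row
unit_revenue <- 10.20   # value that reproduces the reported profits
data_size    <- 50000   # customers in the projected mailing
test_size    <- 2300    # observations in the testing sample

project_profit <- function(pred_pos, true_pos) {
  response_rate <- pred_pos / test_size
  targeted      <- response_rate * data_size
  sales         <- (true_pos / targeted) * data_size
  profit        <- sales * unit_revenue - targeted * mailing_cost
  c(response_rate = response_rate, targeted = targeted,
    cost = targeted * mailing_cost, profit = profit)
}

# Logistic regression on the testing sample: 204 and 141 are the row total
# and the diagonal cell for class 1 in the testing confusion matrix.
logit_proj <- project_profit(pred_pos = 204, true_pos = 141)
round(logit_proj, 3)
```

Under these assumptions the function returns about 4434.78 targeted customers, a cost of about 2882.61, and a profit of about 13332.39, matching the Logistic Regression row of the projection table.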

Profit Projections

Projection Comparison Table
Model                 Response_Rate   Mailers_Sent   Cost       Profit
No Model              9.03%           50000.00       32500.00   13553.00
Logistic Regression   8.87%           4434.78        2882.61    13332.39
SVM (Radial)          3.52%           1760.87        1144.57    10440.62
SVM (Linear)          6.04%           3021.74        1964.13    7824.93
SVM (Polynomial)      1.43%           717.39         466.30     7353.70

Projection Review

  • The No Model approach leads in profit and response rate, but at the highest cost.
  • Logistic Regression achieves substantial profit with significantly lower costs and a marginally lower response rate compared to “No Model.” This provides similar profit with significantly less risk.
  • The SVM (Radial) strategy drastically reduces costs by targeting fewer customers and yields the highest profit among the SVM models, though its reduced response rate keeps its profit below that of Logistic Regression.
  • The SVM (Linear) and SVM (Polynomial) models trade off cost, response rate, and profit differently; the Polynomial version shows the lowest profit because of its minimal response rate.