This study examines what drives consumer choices using three statistical methods: linear regression, logistic regression, and support vector machines (SVM). We found that gender, how much and how often people buy, and the types of products they choose significantly affect purchase decisions. The results show both the complexity of consumer behavior and the value of statistical techniques for predicting these choices and making marketing campaigns more effective. The study also reaffirms that database marketing techniques improve our understanding of customers and help target new and existing customers efficiently, and that they provide a credible basis for fact-based decision making.
The Bookbinders Book Club (BBBC) is trying to improve how it targets customers with its direct mail program so that it can compete with big bookstores and online sellers. The goal is to predict which books customers are likely to buy, using data analysis methods. This study explores different prediction techniques to find out which factors matter most in influencing customers’ choices. Another problem presented in this case study is that people’s tastes and preferences are always shifting. With thousands of new titles published every year, BBBC needs a way to stay ahead of trends and to make meaningful connections between customers and products, so that it can respond to a variety of customer types.
Like other machine learning algorithms, SVM uses mathematical functions to perform prediction or classification tasks. SVM is versatile, effective in high dimensions, and works well with data that is not linearly separable. In more than two dimensions, an SVM uses a hyperplane to separate the data points. The distance between the hyperplane and the closest points is called the margin, and those closest points are called support vectors. From the scikit-learn documentation, I learned that it is very important to find a balance between the smoothness of the decision surface and the weight placed on classifying every training point correctly. The documentation also explains that different kernels can work better for different datasets, and that even custom kernels can be used. SVM is a powerful tool for supervised learning. https://scikit-learn.org/stable/modules/svm.html
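To make the role of the kernel concrete, the sketch below computes the RBF (radial) kernel, \(K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)\), in base R. The helper name `rbf_kernel` and the two sample points are assumptions made for this illustration and are not part of the study's code. It suggests why gamma matters: a larger gamma makes similarity fall off faster with distance, which tends to give a less smooth decision surface.

```r
# RBF kernel: similarity between two points decays with squared distance.
# `rbf_kernel` is a hypothetical helper written for this illustration.
rbf_kernel <- function(x, y, gamma) {
  exp(-gamma * sum((x - y)^2))
}

a <- c(0, 0)
b <- c(1, 1)  # squared distance from a is 2

k_small_gamma <- rbf_kernel(a, b, gamma = 0.01)  # similarity stays high
k_large_gamma <- rbf_kernel(a, b, gamma = 1)     # similarity drops sharply

print(c(small_gamma = k_small_gamma, large_gamma = k_large_gamma))
```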
This study uses data analysis to predict which books customers will buy, focusing on customer details and past purchases. We used three main techniques: linear regression, logistic regression, and Support Vector Machines (SVMs). The study looks at how well these methods work and how they handle different types of data and their assumptions about how factors relate to purchasing behavior.
Linear regression model - We measured how much of the variance in the dependent variable can be explained using the variables in the data set as predictors. The model explained only about 24% of the variance in “Choice” using all the variables. This is most likely because linear regression assumes a continuous dependent variable, while “Choice” is binary and many of the predictors are categorical in nature. Violating that assumption produces results that are poor or not meaningful. The residual plots give another reason this model is not a good choice: the homoscedasticity and normality-of-residuals assumptions are also violated.
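The limits of linear regression on a binary target can be seen on a small example. The sketch below uses invented toy data (not the BBBC data) to fit `lm()` to a 0/1 response, recompute R-squared from its definition, and show that fitted values can leave the [0, 1] range, so they cannot be read as probabilities.

```r
# Toy data, invented for illustration: a 0/1 response and one numeric predictor.
toy <- data.frame(
  x = 1:10,
  y = c(0, 0, 0, 1, 0, 1, 1, 1, 1, 1)
)

fit <- lm(y ~ x, data = toy)

# R-squared from its definition: 1 - RSS / TSS
rss <- sum(residuals(fit)^2)
tss <- sum((toy$y - mean(toy$y))^2)
r_squared <- 1 - rss / tss

# Fitted values are not constrained to the [0, 1] interval
fitted_range <- range(fitted(fit))

print(r_squared)
print(fitted_range)
```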
Logistic regression - Logistic regression models work well when the target variable is binary, like the one we are using for this study, and they suit classification tasks. First, a logistic model is fit using a training set, and then the fitted model is used to make predictions on new data, which for this report is the testing set. The model was evaluated by counting its correct and incorrect predictions. The AUC we got after plotting a ROC curve was 67%, but because the predictions were skewed toward one class, reweighting the two kinds of misclassification could improve model performance. Using the cost function, more weight was placed on false negatives (a weight of 3, against 1 for false positives) because of the proportion of positives in the dataset. We also removed “Last_purchase” and “First_purchase” because they were so highly correlated. The model did improve, increasing accuracy to 79%.
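The fit-then-predict workflow described above can be sketched in base R. The example below runs on simulated data, so the variable names, sample sizes, and coefficients are assumptions for illustration and do not reproduce the study's model.

```r
set.seed(42)  # simulated data for illustration only; not the BBBC data
n <- 400
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-1 + 1.5 * sim$x))

# Split into training and testing sets
train_idx <- sample(seq_len(n), size = 300)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit on the training set, then predict probabilities on the testing set
fit <- glm(y ~ x, family = "binomial", data = train)
pred_prob <- predict(fit, newdata = test, type = "response")

# Classify with a 0.5 cutoff and tabulate correct and incorrect predictions
pred_class <- factor(as.integer(pred_prob >= 0.5), levels = c(0, 1))
actual     <- factor(test$y, levels = c(0, 1))
conf <- table(Predicted = pred_class, Actual = actual)
print(conf)
```

The diagonal of `conf` holds the correct predictions and the off-diagonal cells hold the two kinds of error.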
SVM - The last model we built was a support vector machine (SVM), which is mostly used for classification. We tried three kernels (linear, polynomial, and RBF) to cover the cases where the data is or is not linearly separable. We set the target variable type to factor and tuned our RBF kernel for the best-performing gamma and cost. The first model used a linear kernel. It reached about 90% accuracy and 96% sensitivity, but specificity was very poor at 28%. Next, we tried a polynomial kernel. Accuracy and sensitivity both improved, but specificity dropped even further, to about 5%. Our final SVM model used the tuned RBF kernel. Our goal was to balance the smoothness of the decision boundary, controlled by gamma, against classifying the training points correctly, controlled by cost. This model's accuracy was about the same as the polynomial model's, with specificity of about 20%, and it ended up doing better for our business scenario.
The data was cleaned and split before the file was opened. The dataset contains information from a book club about its customers and their purchasing habits. All variables are initially read in as numeric; however, exploring the data shows that most of them are categorical in nature. The variable “Observation” will be removed because it is an identifier rather than a predictor and would reduce the power of the linear regression, logistic regression, and SVM models. There are four continuous variables: “Amount_purchased”, “Frequency”, “First_purchase”, and “Last_purchase”. “First_purchase” may have outliers and may be multicollinear with “Last_purchase” because the two are highly correlated.
Training sample:
Testing sample:
Targeting without a predictive model yields the highest profit, but it also incurs the highest cost and therefore the most risk. Logistic regression is more efficient: it drastically reduces the cost while only slightly reducing the profit, making it more cost-effective than untargeted mailing. The SVM models reduce the investment even further, but with a lower return than the logistic model.
The most significant predictor variables are Gender, Frequency, P_Cook, P_Art.
Negative estimates:
Positive estimates:
Call:
lm(formula = Choice ~ . - Observation, data = train_data)
Residuals:
Min 1Q Median 3Q Max
-0.9603 -0.2462 -0.1161 0.1622 1.0588
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3642284 0.0307411 11.848 < 2e-16 ***
Gender -0.1309205 0.0200303 -6.536 8.48e-11 ***
Amount_purchased 0.0002736 0.0001110 2.464 0.0138 *
Frequency -0.0090868 0.0021791 -4.170 3.21e-05 ***
Last_purchase 0.0970286 0.0135589 7.156 1.26e-12 ***
First_purchase -0.0020024 0.0018160 -1.103 0.2704
P_Child -0.1262584 0.0164011 -7.698 2.41e-14 ***
P_Youth -0.0963563 0.0201097 -4.792 1.81e-06 ***
P_Cook -0.1414907 0.0166064 -8.520 < 2e-16 ***
P_DIY -0.1352313 0.0197873 -6.834 1.17e-11 ***
P_Art 0.1178494 0.0194427 6.061 1.68e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3788 on 1589 degrees of freedom
Multiple R-squared: 0.2401, Adjusted R-squared: 0.2353
F-statistic: 50.2 on 10 and 1589 DF, p-value: < 2.2e-16
Training MSE: 0.1424887
Testing MSE : 0.092551
Gender Amount_purchased Frequency Last_purchase
1.023359 1.232172 2.490447 17.706670
First_purchase P_Child P_Youth P_Cook
9.247748 2.992269 1.761546 3.229097
P_DIY P_Art
1.992698 1.938089
Gender Amount_purchased Frequency P_Child
1.020217 1.213528 1.015899 1.215500
P_Youth P_Cook P_DIY P_Art
1.081019 1.228798 1.179821 1.229491
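The two blocks of values above appear to be variance inflation factors (VIFs) before and after dropping “Last_purchase” and “First_purchase”. As a hedged illustration of what such a value measures, the sketch below computes VIFs from their definition in base R on invented toy predictors. `vif_manual` is a hypothetical helper written for this example, not a function from any package.

```r
# Toy predictors, invented for illustration: x2 is nearly a copy of x1.
toy <- data.frame(x1 = 1:20)
toy$x2 <- toy$x1 + sin(1:20)
toy$x3 <- cos(2.3 * (1:20))

# VIF for a predictor: regress it on the other predictors, then 1 / (1 - R^2).
# `vif_manual` is a hypothetical helper, not a function from any package.
vif_manual <- function(data) {
  vapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  }, numeric(1))
}

vifs <- vif_manual(toy)
print(round(vifs, 2))
```

Predictors that are close to linear combinations of the others, like `x1` and `x2` here, get large values, which is the pattern the first block shows for “Last_purchase” and “First_purchase”.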
Call:
glm(formula = Choice ~ . - Observation - Last_purchase - First_purchase,
family = "binomial", data = train_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.286380 0.202966 -1.411 0.15825
Gender -0.811948 0.134579 -6.033 1.61e-09 ***
Amount_purchased 0.002406 0.000771 3.120 0.00181 **
Frequency -0.088625 0.010385 -8.534 < 2e-16 ***
P_Child -0.194796 0.072207 -2.698 0.00698 **
P_Youth -0.031928 0.109605 -0.291 0.77082
P_Cook -0.292392 0.072998 -4.005 6.19e-05 ***
P_DIY -0.279282 0.108094 -2.584 0.00977 **
P_Art 1.245842 0.099062 12.576 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1799.5 on 1599 degrees of freedom
Residual deviance: 1445.0 on 1591 degrees of freedom
AIC: 1463
Number of Fisher Scoring iterations: 5
This is the base model prior to adjusting the p-cut (the probability cutoff above which a prediction is classified as 1). The p-cut is currently set to 0.5, which leads to the following results:
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1133 67
1 258 142
Accuracy : 0.7969
95% CI : (0.7763, 0.8163)
No Information Rate : 0.8694
P-Value [Acc > NIR] : 1
Kappa : 0.3558
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8145
Specificity : 0.6794
Pos Pred Value : 0.9442
Neg Pred Value : 0.3550
Prevalence : 0.8694
Detection Rate : 0.7081
Detection Prevalence : 0.7500
Balanced Accuracy : 0.7470
'Positive' Class : 0
Area under the curve: 0.6496
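The reported statistics can be recomputed directly from the confusion matrix above. The sketch below does this in base R, treating class 0 as the positive class, as the output states; the object names are chosen for illustration.

```r
# Confusion matrix for the base model (p-cut = 0.5), copied from the output above.
# Rows are predictions, columns are the reference classes.
conf <- matrix(c(1133, 258, 67, 142), nrow = 2,
               dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# Class 0 is the positive class in the reported output.
accuracy    <- sum(diag(conf)) / sum(conf)
sensitivity <- conf["0", "0"] / sum(conf[, "0"])  # positives correctly identified
specificity <- conf["1", "1"] / sum(conf[, "1"])  # negatives correctly identified
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced_accuracy), 4))
```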
# Define a cost function: "obs" is the observed response,
# "pred.p" is the predicted probability, and "pcut" is the threshold.
costfunc = function(obs, pred.p, pcut){
  weight1 = 3 # weight for "true=1 but pred=0" (FN)
  weight0 = 1 # weight for "true=0 but pred=1" (FP)
  c1 = (obs==1)&(pred.p<pcut) # flags each "true=1 but pred=0" case (FN)
  c0 = (obs==0)&(pred.p>=pcut) # flags each "true=0 but pred=1" case (FP)
  cost = mean(weight1*c1 + weight0*c0) # weighted misclassification rate
  return(cost)
} # end of the function
# define a sequence of candidate cutoffs from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01)
mean_error = rep(0, length(p.seq))
for(i in seq_along(p.seq)){
  mean_error[i] = costfunc(obs = train_data$Choice, pred.p = pred_prob.train, pcut = p.seq[i])
} # end of the loop
# Plot X axis: pcut | Y axis: associated cost
plot(p.seq, mean_error, xlab = "P Cut", ylab = "Cost")
# keep the first cutoff that attains the minimum cost
optimal_pcut = p.seq[which(mean_error==min(mean_error))]
optimal_pcut <- optimal_pcut[1]
After running the cost function with weights of 3 and 1, the optimal p-cut is 0.24. This leads to the following results:
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 869 331
1 116 284
Accuracy : 0.7206
95% CI : (0.6979, 0.7425)
No Information Rate : 0.6156
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3682
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.8822
Specificity : 0.4618
Pos Pred Value : 0.7242
Neg Pred Value : 0.7100
Prevalence : 0.6156
Detection Rate : 0.5431
Detection Prevalence : 0.7500
Balanced Accuracy : 0.6720
'Positive' Class : 0
Area under the curve: 0.7171
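How a weighted cost moves the cutoff can be checked on a small self-contained example. The sketch below uses invented observations and probabilities with the same 3-to-1 weighting; `weighted_cost` and all of the data are assumptions made for illustration, not the study's objects.

```r
# Invented observations and predicted probabilities, for illustration only.
obs    <- c(0, 0, 0, 0, 0, 0, 1, 1, 0, 1)
pred_p <- c(0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.28, 0.45, 0.60, 0.80)

# Weighted misclassification cost: a missed 1 (FN) costs 3, a false alarm (FP) costs 1.
weighted_cost <- function(obs, pred_p, pcut, w_fn = 3, w_fp = 1) {
  fn <- (obs == 1) & (pred_p < pcut)
  fp <- (obs == 0) & (pred_p >= pcut)
  mean(w_fn * fn + w_fp * fp)
}

cutoffs <- seq(0.01, 1, by = 0.01)
costs <- vapply(cutoffs, function(p) weighted_cost(obs, pred_p, p), numeric(1))

best_cutoff  <- cutoffs[which.min(costs)]  # first cutoff with the lowest cost
cost_at_half <- weighted_cost(obs, pred_p, 0.5)

print(c(best_cutoff = best_cutoff, min_cost = min(costs), cost_at_half = cost_at_half))
```

Because missed positives are penalized more heavily than false alarms, the lowest-cost cutoff on this toy data falls below 0.5, the same direction of movement as the 0.24 reported above.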
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1524 572
1 63 141
Accuracy : 0.7239
95% CI : (0.7051, 0.7421)
No Information Rate : 0.69
P-Value [Acc > NIR] : 0.000209
Kappa : 0.1967
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9603
Specificity : 0.1978
Pos Pred Value : 0.7271
Neg Pred Value : 0.6912
Prevalence : 0.6900
Detection Rate : 0.6626
Detection Prevalence : 0.9113
Balanced Accuracy : 0.5790
'Positive' Class : 0
Area under the curve: 0.7091
form1 <- as.factor(Choice) ~ .-Observation-First_purchase-Last_purchase
tuned <- tune.svm(form1, data = train_data, gamma = seq(.01, .1, by = .01), cost = seq(.1, 1, by = .1))
Call:
svm(formula = form1, data = train_data, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost, type = "C-classification",
kernel = "linear")
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 1
Number of Support Vectors: 754
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 2015 146
1 81 58
Accuracy : 0.9013
95% CI : (0.8884, 0.9132)
No Information Rate : 0.9113
P-Value [Acc > NIR] : 0.9559
Kappa : 0.2869
Mcnemar's Test P-Value : 2.159e-05
Sensitivity : 0.9614
Specificity : 0.2843
Pos Pred Value : 0.9324
Neg Pred Value : 0.4173
Prevalence : 0.9113
Detection Rate : 0.8761
Detection Prevalence : 0.9396
Balanced Accuracy : 0.6228
'Positive' Class : 0
Area under the curve: 0.6228
Call:
svm(formula = form1, data = train_data, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost, type = "C-classification",
kernel = "polynomial")
Parameters:
SVM-Type: C-classification
SVM-Kernel: polynomial
cost: 1
degree: 3
coef.0: 0
Number of Support Vectors: 783
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 2074 193
1 22 11
Accuracy : 0.9065
95% CI : (0.8939, 0.9181)
No Information Rate : 0.9113
P-Value [Acc > NIR] : 0.8013
Kappa : 0.0699
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.98950
Specificity : 0.05392
Pos Pred Value : 0.91487
Neg Pred Value : 0.33333
Prevalence : 0.91130
Detection Rate : 0.90174
Detection Prevalence : 0.98565
Balanced Accuracy : 0.52171
'Positive' Class : 0
Area under the curve: 0.5217
Call:
svm(formula = form1, data = train_data, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost, type = "C-classification",
kernel = "radial")
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
Number of Support Vectors: 785
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 2055 164
1 41 40
Accuracy : 0.9109
95% CI : (0.8985, 0.9222)
No Information Rate : 0.9113
P-Value [Acc > NIR] : 0.5477
Kappa : 0.2425
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9804
Specificity : 0.1961
Pos Pred Value : 0.9261
Neg Pred Value : 0.4938
Prevalence : 0.9113
Detection Rate : 0.8935
Detection Prevalence : 0.9648
Balanced Accuracy : 0.5883
'Positive' Class : 0
Area under the curve: 0.5883
Model Comparison Table

| Model | R_Squared | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|
| Linear Regression | 0.2401 | NA | NA | NA |
| Logistic Regression | NA | 0.7239 | 0.9603 | 0.1978 |
| SVM (Radial) | NA | 0.9109 | 0.9804 | 0.1961 |
| SVM (Linear) | NA | 0.9013 | 0.9614 | 0.2843 |
| SVM (Polynomial) | NA | 0.9065 | 0.9895 | 0.0539 |
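\(\text{Expected Response Rate} = \dfrac{\text{Predicted Positives}}{\text{Training Data Size}}\)

\(\text{Targeted Customers} = \text{Expected Response Rate} \times \text{Data Size}\)

\(\text{Expected Sales} = \dfrac{\text{True Positives}}{\text{Targeted Customers}} \times \text{Data Size}\)

\(\text{Expected Profit} = (\text{Expected Sales} \times \text{Revenue}) - (\text{Targeted Customers} \times \text{Mailing Cost})\)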
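As a worked example, the sketch below applies these four formulas to the logistic regression model in base R. The counts come from that model's confusion matrix, and the response rate in the projection table matches dividing by the 2300 customers in that matrix. Two inputs are assumptions that the text does not state: a mailing cost of 0.65 per mailer, implied by the no-model row (32500 / 50000), and a revenue of 10.20 per sale, back-solved from the table's profit figures.

```r
# Counts read off the logistic regression confusion matrix in this report.
predicted_pos <- 63 + 141   # customers the model flags as buyers
true_pos      <- 141        # flagged customers who actually bought
n_scored      <- 2300       # total customers in that confusion matrix
market_size   <- 50000      # mailers sent with no model, from the projection table

# ASSUMPTIONS: neither value is stated in the text.
mailing_cost <- 0.65        # per mailer, implied by 32500 / 50000
revenue      <- 10.20       # per sale, back-solved from the reported profits

response_rate  <- predicted_pos / n_scored
targeted       <- response_rate * market_size
expected_sales <- (true_pos / targeted) * market_size
total_cost     <- targeted * mailing_cost
profit         <- expected_sales * revenue - total_cost

print(round(c(response_rate = response_rate, targeted = targeted,
              cost = total_cost, profit = profit), 4))
```

Under these assumptions the results agree with the Logistic Regression row of the projection table; the same steps with each SVM model's counts should give the other rows.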
Projection Comparison Table

| Model | Response_Rate | Mailers_Sent | Cost | Profit |
|---|---|---|---|---|
| No Model | 9.03 % | 50000.00 | 32500.00 | 13553.00 |
| Logistic Regression | 8.87 % | 4434.78 | 2882.61 | 13332.39 |
| SVM (Radial) | 3.52 % | 1760.87 | 1144.57 | 10440.62 |
| SVM (Linear) | 6.04 % | 3021.74 | 1964.13 | 7824.93 |
| SVM (Polynomial) | 1.43 % | 717.39 | 466.30 | 7353.70 |