As marketing analysts, we want to increase sales by targeting customers with certain characteristics. Historical data on 400 customers who have been prospected gives us each customer's gender, age category, and salary category, as well as whether they bought our product.
Read the data.
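library(dplyr)     # glimpse(), mutate(), select(), %>%
library(e1071)     # naiveBayes()
library(partykit)  # ctree(), ctree_control()
library(caret)     # confusionMatrix(), train(), trainControl(), varImp()
CB <- read.csv("Customer_Behaviour.csv")
CB
glimpse(CB)
## Rows: 400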
## Columns: 4
## $ Gender <chr> "Male", "Male", "Female", "Female", "Male", "Male", "Female"…
## $ Age <chr> "< 30", "30-50", "< 30", "< 30", "< 30", "< 30", "< 30", "30…
## $ Salary <chr> "Low", "Low", "Medium", "Medium", "Medium", "Medium", "Mediu…
## $ Purchased <chr> "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "No",…
Check for missing values.
anyNA(CB)
## [1] FALSE
colSums(is.na(CB))
##    Gender       Age    Salary Purchased
##         0         0         0         0
All four variables are categorical but were read in as character strings, so we convert each of them to a factor.
CB <- CB %>%
  mutate(Gender = as.factor(Gender),
         Age = as.factor(Age),
         Salary = as.factor(Salary),
         Purchased = as.factor(Purchased))
CB
glimpse(CB)
## Rows: 400
## Columns: 4
## $ Gender <fct> Male, Male, Female, Female, Male, Male, Female, Female, Male…
## $ Age <fct> < 30, 30-50, < 30, < 30, < 30, < 30, < 30, 30-50, < 30, 30-5…
## $ Salary <fct> Low, Low, Medium, Medium, Medium, Medium, Medium, High, Low,…
## $ Purchased <fct> No, No, No, No, No, No, No, Yes, No, No, No, No, No, No, No,…
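as.factor() sorts the levels as strings, which is presumably why the Age levels appear as < 30, > 50, 30-50 in later output. If a natural order is preferred for display, the levels can be set explicitly. A minimal sketch on a small made-up vector (the values are illustrative, not rows from the dataset):

```r
# Illustrative values only; not taken from Customer_Behaviour.csv
age_raw <- c("30-50", "< 30", "> 50", "< 30")

# Default conversion: levels are sorted as strings (locale-dependent)
age_default <- as.factor(age_raw)
levels(age_default)

# Explicit level order from youngest to oldest
age_ordered <- factor(age_raw, levels = c("< 30", "30-50", "> 50"))
levels(age_ordered)
```

For unordered factors like these, the level order only changes how tables and model summaries are displayed, not the values stored in the data.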
From the business question, we are going to build a predictive model
to classify “whether the client buys our product or not”
(Purchased = Yes / No).
levels(CB$Purchased)
## [1] "No" "Yes"
The levels above show that the target variable has two categories, “No” and “Yes”.
Check the class proportions of the target variable.
prop.table(table(CB$Purchased))
##
##     No    Yes
## 0.6425 0.3575
table(CB$Purchased)
##
##  No Yes
## 257 143
The two classes split roughly 64% / 36%. This is a mild imbalance, so we don’t really need additional pre-processing to balance the proportions of the two target classes.
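The class proportions also give a useful reference point: a model that always predicts the majority class is already right about 64% of the time on the full data. A minimal sketch, using the counts printed by table(CB$Purchased) above:

```r
# Class counts as printed by table(CB$Purchased)
purchased_counts <- c(No = 257, Yes = 143)

# Accuracy of a trivial model that always predicts the most frequent class
baseline_accuracy <- max(purchased_counts) / sum(purchased_counts)
baseline_accuracy
```

Any model we build should beat this baseline to be worth using.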
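Split the data into a training set (85%) and a test set (15%). RNGkind(sample.kind = "Rounding") selects the sampling algorithm R used before version 3.6.0, so that the seed reproduces the same split.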
RNGkind(sample.kind = "Rounding")
set.seed(417)
# index sampling
index <- sample(x = nrow(CB), size = nrow(CB)*0.85)
# splitting
CB_train <- CB[index , ]
CB_test <- CB[-index , ]
Check the dimensions.
dim(CB_train)
## [1] 340   4
dim(CB_test)
## [1] 60  4
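The split relies on negative indexing: CB[-index, ] keeps every row that was not sampled, so the two sets cannot overlap. A minimal sketch on a hypothetical stand-in data frame (toy and its id column are made up for illustration, and the seed is arbitrary):

```r
set.seed(1)  # arbitrary seed for this sketch only

# Hypothetical stand-in for CB: 400 rows with an id column
toy <- data.frame(id = 1:400)

# Sample 85% of the row numbers, then take the complement for testing
index <- sample(x = nrow(toy), size = nrow(toy) * 0.85)
toy_train <- toy[index, , drop = FALSE]
toy_test  <- toy[-index, , drop = FALSE]

nrow(toy_train)
nrow(toy_test)

# No row appears in both sets
length(intersect(toy_train$id, toy_test$id))
```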
Remove the target variable from the test set.
CB_test_Val <- CB_test %>% select(-Purchased)
dim(CB_test_Val)
## [1] 60  3
Check the class proportions of the target variable in the training set.
prop.table(table(CB_train$Purchased))
##
##   No  Yes
## 0.65 0.35
table(CB_train$Purchased)
##
##  No Yes
## 221 119
The proportions in the training set (65% / 35%) closely match those of the full data.
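Naive Bayes is sensitive to scarcity: if a predictor level never occurs together with a class in the training data, its conditional probability is zero, and that zero wipes out the whole product for that class. To avoid this, we apply Laplace smoothing (laplace = 1) when we build the model.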
naive_model <- naiveBayes(Purchased ~ .,
                          data = CB_train,
                          laplace = 1)
naive_model
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## No Yes
## 0.65 0.35
##
## Conditional probabilities:
## Gender
## Y Female Male
## No 0.4887892 0.5112108
## Yes 0.5454545 0.4545455
##
## Age
## Y < 30 > 50 30-50
## No 0.37946429 0.01785714 0.60267857
## Yes 0.01639344 0.30327869 0.68032787
##
## Salary
## Y High Low Medium
## No 0.08482143 0.20982143 0.70535714
## Yes 0.52459016 0.25409836 0.22131148
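The smallest entries in the Age table above can be reproduced by hand, which shows what laplace = 1 does. The sketch below assumes the usual add-one formula, (count + laplace) / (class total + laplace × number of levels). The raw counts of 3 and 1 are not printed anywhere in the output; they are inferred from the smoothed probabilities and the class totals of 221 and 119.

```r
# Add-one (Laplace) smoothing for one cell of a conditional probability table
laplace_prob <- function(count, class_total, n_levels, laplace = 1) {
  (count + laplace) / (class_total + laplace * n_levels)
}

# P(Age > 50 | No): inferred raw count 3 out of 221 "No" rows, 3 Age levels
p_old_no <- laplace_prob(count = 3, class_total = 221, n_levels = 3)

# P(Age < 30 | Yes): inferred raw count 1 out of 119 "Yes" rows
p_young_yes <- laplace_prob(count = 1, class_total = 119, n_levels = 3)

round(c(p_old_no, p_young_yes), 8)

# A level never seen with a class still gets a small non-zero probability
laplace_prob(count = 0, class_total = 221, n_levels = 3)
```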
naive_pred <- predict(naive_model, newdata = CB_test_Val)
naive_pred
## [1] No No No No No No No Yes No No No Yes No No No No No Yes No
## [20] No No No No No No No No No No No No No No No No No No Yes
## [39] Yes Yes Yes No Yes Yes Yes No No No Yes Yes Yes No Yes No Yes Yes No
## [58] No Yes No
## Levels: No Yes
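To see how one of these predictions comes about, the class scores can be computed by hand from the printed model. The sketch below uses a hypothetical customer profile (Male, Age 30-50, Salary High), chosen only for illustration; the probabilities are copied from the naive_model output above.

```r
# Probabilities copied from the printed naive_model output
prior       <- c(No = 0.65,       Yes = 0.35)
gender_male <- c(No = 0.5112108,  Yes = 0.4545455)
age_30_50   <- c(No = 0.60267857, Yes = 0.68032787)
salary_high <- c(No = 0.08482143, Yes = 0.52459016)

# Naive Bayes score: prior times the product of per-feature likelihoods
unnormalized <- prior * gender_male * age_30_50 * salary_high

# Normalize so the two class scores sum to 1
posterior <- unnormalized / sum(unnormalized)
posterior

# Predicted class is the one with the larger posterior
names(which.max(posterior))
```

For this profile the score for Yes is the larger one, driven mostly by the much higher likelihood of a High salary among buyers.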
(conf_mat_naive <- table(naive_pred, CB_test$Purchased))
##
## naive_pred No Yes
##        No  34   9
##        Yes  2  15
In the confusion matrix, rows are predictions and columns are actual values. Of the 17 customers the Naive Bayes model predicted would buy, 15 did and 2 did not. Of the 43 customers it predicted would not buy, 34 did not and 9 did. What is the resulting accuracy? Let’s see below.
(nb_cm <- confusionMatrix(conf_mat_naive))
## Confusion Matrix and Statistics
##
##
## naive_pred No Yes
## No 34 9
## Yes 2 15
##
## Accuracy : 0.8167
## 95% CI : (0.6956, 0.9048)
## No Information Rate : 0.6
## P-Value [Acc > NIR] : 0.0002826
##
## Kappa : 0.5985
##
## Mcnemar's Test P-Value : 0.0704404
##
## Sensitivity : 0.9444
## Specificity : 0.6250
## Pos Pred Value : 0.7907
## Neg Pred Value : 0.8824
## Prevalence : 0.6000
## Detection Rate : 0.5667
## Detection Prevalence : 0.7167
## Balanced Accuracy : 0.7847
##
## 'Positive' Class : No
##
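The statistics above can be recomputed by hand from the four cells of the confusion matrix. One detail worth noticing is that the positive class is "No", so the reported sensitivity describes how well non-buyers are found. The sketch below recomputes the figures in base R and also shows recall for "Yes", which equals the reported specificity. (caret's confusionMatrix() also accepts a positive argument to switch the positive class.)

```r
# Confusion matrix copied from the output above: rows = predicted, columns = actual
cm <- matrix(c(34, 2, 9, 15), nrow = 2,
             dimnames = list(Predicted = c("No", "Yes"),
                             Actual    = c("No", "Yes")))

accuracy <- sum(diag(cm)) / sum(cm)

# Recall (sensitivity) per class: correct predictions / actual members of the class
recall_no  <- cm["No", "No"]   / sum(cm[, "No"])
recall_yes <- cm["Yes", "Yes"] / sum(cm[, "Yes"])

# Precision per class: correct predictions / all predictions of that class
precision_no  <- cm["No", "No"]   / sum(cm["No", ])
precision_yes <- cm["Yes", "Yes"] / sum(cm["Yes", ])

round(c(accuracy = accuracy,
        recall_no = recall_no, recall_yes = recall_yes,
        precision_no = precision_no, precision_yes = precision_yes), 4)
```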
dt_model <- ctree(Purchased ~ .,
                  CB_train,
                  control = ctree_control(mincriterion = 0.95,
                                          minsplit = 20,
                                          minbucket = 7))
plot(dt_model, type = "simple")
dt_model
##
## Model formula:
## Purchased ~ Gender + Age + Salary
##
## Fitted party:
## [1] root
## | [2] Salary in High
## | | [3] Age < 30: No (n = 7, err = 14.3%)
## | | [4] Age > 50, 30-50: Yes (n = 74, err = 16.2%)
## | [5] Salary in Low, Medium
## | | [6] Age < 30, 30-50
## | | | [7] Age < 30: No (n = 78, err = 0.0%)
## | | | [8] Age in 30-50
## | | | | [9] Salary in Low: No (n = 42, err = 50.0%)
## | | | | [10] Salary in Medium: No (n = 119, err = 13.4%)
## | | [11] Age > 50: Yes (n = 20, err = 5.0%)
##
## Number of inner nodes: 5
## Number of terminal nodes: 6
The model above is built with the default parameters. To get a simpler model, we prune the tree by tightening its stopping criteria, because a simpler decision tree is less prone to overfitting and tends to generalize better.
To do that, we will change the following parameters:
mincriterion (the 1 - p-value a test must exceed for a split to be made): increase the value
minsplit (the minimum number of observations in a node for it to be considered for splitting): increase the value
minbucket (the minimum number of observations in a terminal node): increase the value
dt_model_prun <- ctree(Purchased ~ .,
                       CB_train,
                       control = ctree_control(mincriterion = 0.97,
                                               minsplit = 60,
                                               minbucket = 21))
plot(dt_model_prun, type = "simple")
dt_model_prun
##
## Model formula:
## Purchased ~ Gender + Age + Salary
##
## Fitted party:
## [1] root
## | [2] Salary in High
## | | [3] Age < 30, > 50: Yes (n = 26, err = 30.8%)
## | | [4] Age in 30-50: Yes (n = 55, err = 18.2%)
## | [5] Salary in Low, Medium
## | | [6] Age < 30: No (n = 78, err = 0.0%)
## | | [7] Age > 50, 30-50
## | | | [8] Salary in Low: Yes (n = 52, err = 42.3%)
## | | | [9] Salary in Medium: No (n = 129, err = 20.2%)
##
## Number of inner nodes: 4
## Number of terminal nodes: 5
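Compared with the first tree, the pruned tree is simpler: 4 inner nodes and 5 terminal nodes instead of 5 and 6.
With the model trained, we can apply it directly to the test data. Note that the predictions below come from dt_model, the tree built with the default parameters.
dt_pred <- predict(dt_model, CB_test)
dt_pred
## 4 12 19 23 25 30 32 43 47 48 61 65 73 90 94 98 102 104 105 118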
## No No No No No No No Yes No No No Yes No No No No No Yes No No
## 126 128 129 145 165 167 168 169 174 177 181 184 218 226 232 238 247 248 275 279
## No No No No No No No No No No No No No No No No No Yes Yes Yes
## 293 295 300 303 312 313 323 327 329 332 341 344 348 352 357 362 373 379 380 392
## Yes No Yes Yes Yes No No No Yes Yes Yes No Yes No Yes Yes No No Yes No
## Levels: No Yes
(conf_matrix_dt <- table(dt_pred, CB_test$Purchased))
##
## dt_pred No Yes
##     No  34   9
##     Yes  2  15
In the confusion matrix, rows are predictions and columns are actual values. Of the 17 customers the Decision Tree model predicted would buy, 15 did and 2 did not. Of the 43 customers it predicted would not buy, 34 did not and 9 did. What is the resulting accuracy? Let’s see below.
(dt_cm <- confusionMatrix(conf_matrix_dt))
## Confusion Matrix and Statistics
##
##
## dt_pred No Yes
## No 34 9
## Yes 2 15
##
## Accuracy : 0.8167
## 95% CI : (0.6956, 0.9048)
## No Information Rate : 0.6
## P-Value [Acc > NIR] : 0.0002826
##
## Kappa : 0.5985
##
## Mcnemar's Test P-Value : 0.0704404
##
## Sensitivity : 0.9444
## Specificity : 0.6250
## Pos Pred Value : 0.7907
## Neg Pred Value : 0.8824
## Prevalence : 0.6000
## Detection Rate : 0.5667
## Detection Prevalence : 0.7167
## Balanced Accuracy : 0.7847
##
## 'Positive' Class : No
##
With a random forest, we are not strictly required to split the data into training and test sets, because the out-of-bag (OOB) error already serves as an estimate of accuracy on unseen examples. It is still possible to hold out a regular test set, as we do here.
We will build a Random Forest model on the training set using 5-fold cross-validation, repeated 3 times.
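Repeating 5-fold cross-validation 3 times yields 15 resamples, each training on four fifths of the 340 training rows. The base R sketch below only illustrates that bookkeeping with an arbitrary seed; it is not how caret builds its folds. caret balances the classes within folds, which is presumably why its reported training sizes vary slightly around 272.

```r
set.seed(1)  # arbitrary seed for this sketch only

n <- 340       # rows in the training set
k <- 5         # folds
repeats <- 3   # repetitions

# For each repetition, shuffle fold labels and record each fold's training size
train_sizes <- unlist(lapply(seq_len(repeats), function(r) {
  fold_id <- sample(rep(seq_len(k), length.out = n))
  vapply(seq_len(k), function(f) sum(fold_id != f), integer(1))
}))

length(train_sizes)  # number of resamples
train_sizes          # training rows used in each resample
```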
set.seed(417)
ctrl <- trainControl(method = "repeatedcv",
                     number = 5,   # k-fold
                     repeats = 3)  # repetition
(CB_forest <- train(Purchased ~ .,
                    data = CB_train,
                    method = "rf",  # random forest
                    trControl = ctrl))
## Random Forest
##
## 340 samples
## 3 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 272, 272, 272, 271, 273, 272, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8283868 0.6181232
## 3 0.8303476 0.6237286
## 5 0.8273918 0.6174792
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
saveRDS(CB_forest, "CB_forest.RDS")
From the model summary, the optimal number of variables considered for splitting at each tree node (mtry) is 3. Because the formula interface expands the three factor predictors into five dummy variables, mtry can range up to 5. We can also inspect the importance of each variable used in our random forest with varImp().
varImp(CB_forest)
## rf variable importance
##
## Overall
## SalaryMedium 100.00
## Age> 50 82.73
## Age30-50 32.63
## SalaryLow 20.59
## GenderMale 0.00
The OOB error (plotted here against the number of trees, and reported in the summary below) was estimated from our CB_train dataset.
plot(CB_forest$finalModel)
legend("topright", colnames(CB_forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)
And we can see the final model as follows.
CB_forest$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 19.41%
## Confusion matrix:
## No Yes class.error
## No 191 30 0.1357466
## Yes 36 83 0.3025210
With mtry = 3, the OOB confusion matrix (rows are actual classes, columns are predictions) shows that 83 of the 119 actual buyers were classified correctly and 36 were misclassified as non-buyers. Likewise, 191 of the 221 actual non-buyers were classified correctly and 30 were misclassified as buyers.
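The OOB error rate and the class.error column follow directly from these four counts. A minimal sketch that recomputes them from the numbers printed in the final model summary:

```r
# OOB confusion matrix copied from the summary: rows = actual, columns = predicted
oob_cm <- matrix(c(191, 36, 30, 83), nrow = 2,
                 dimnames = list(Actual    = c("No", "Yes"),
                                 Predicted = c("No", "Yes")))

# Overall OOB error: share of rows that fall off the diagonal
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)

# Per-class error: misclassified rows divided by the actual class size
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)

round(oob_error, 4)
round(class_error, 7)
```

The per-class figures make the asymmetry visible: the forest misses buyers more often than it misses non-buyers.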
Let’s test our random forest model on our CB_test dataset.
forest_pred <- predict(CB_forest, CB_test_Val)
forest_pred
## [1] No No No No No No No Yes No No No Yes No No No No No Yes No
## [20] No No No No Yes No No No No No No No No No No No No No Yes
## [39] Yes Yes Yes No Yes Yes Yes No No No Yes Yes Yes No Yes No Yes Yes No
## [58] No Yes No
## Levels: No Yes
(conf_matrix_forest <- table(forest_pred, CB_test$Purchased))
##
## forest_pred No Yes
##         No  33   9
##         Yes  3  15
In the confusion matrix, rows are predictions and columns are actual values. Of the 18 customers the Random Forest model predicted would buy, 15 did and 3 did not. Of the 42 customers it predicted would not buy, 33 did not and 9 did. What is the resulting accuracy? Let’s see below.
(rf_cm <- confusionMatrix(conf_matrix_forest))
## Confusion Matrix and Statistics
##
##
## forest_pred No Yes
## No 33 9
## Yes 3 15
##
## Accuracy : 0.8
## 95% CI : (0.6767, 0.8922)
## No Information Rate : 0.6
## P-Value [Acc > NIR] : 0.0008097
##
## Kappa : 0.5652
##
## Mcnemar's Test P-Value : 0.1489147
##
## Sensitivity : 0.9167
## Specificity : 0.6250
## Pos Pred Value : 0.7857
## Neg Pred Value : 0.8333
## Prevalence : 0.6000
## Detection Rate : 0.5500
## Detection Prevalence : 0.7000
## Balanced Accuracy : 0.7708
##
## 'Positive' Class : No
##
Given the business question, the most relevant metrics are
accuracy and sensitivity/recall, because we
want to predict whether or not a customer will buy the product.
# tibble() replaces the deprecated data_frame()
(eval_nb <- tibble(Accuracy = nb_cm$overall[1],
                   Recall = nb_cm$byClass[1],
                   Specificity = nb_cm$byClass[2],
                   Precision = nb_cm$byClass[3]))
(eval_dt <- tibble(Accuracy = dt_cm$overall[1],
                   Recall = dt_cm$byClass[1],
                   Specificity = dt_cm$byClass[2],
                   Precision = dt_cm$byClass[3]))
(eval_rf <- tibble(Accuracy = rf_cm$overall[1],
                   Recall = rf_cm$byClass[1],
                   Specificity = rf_cm$byClass[2],
                   Precision = rf_cm$byClass[3]))
Based on the evaluation above, Naive Bayes and Decision Tree produce identical results on the test set (accuracy 0.8167, recall 0.9444), slightly better than Random Forest (accuracy 0.8000, recall 0.9167).
We therefore choose the Naive Bayes and Decision Tree models to answer future business questions.
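The comparison can also be laid out as a single table. The sketch below rebuilds the three test-set confusion matrices from the counts printed earlier and computes the same four metrics with "No" as the positive class, as in the reports above. The helper names make_cm and cm_metrics are made up for this illustration.

```r
# Build a 2x2 confusion matrix: rows = predicted, columns = actual
make_cm <- function(no_no, yes_no, no_yes, yes_yes) {
  matrix(c(no_no, yes_no, no_yes, yes_yes), nrow = 2,
         dimnames = list(Predicted = c("No", "Yes"),
                         Actual    = c("No", "Yes")))
}

# Metrics with "No" as the positive class, matching the reports above
cm_metrics <- function(cm) {
  c(Accuracy    = sum(diag(cm)) / sum(cm),
    Recall      = cm["No", "No"]   / sum(cm[, "No"]),
    Specificity = cm["Yes", "Yes"] / sum(cm[, "Yes"]),
    Precision   = cm["No", "No"]   / sum(cm["No", ]))
}

# Counts copied from the three confusion matrices on the test set
cms <- list(naive_bayes   = make_cm(34, 2, 9, 15),
            decision_tree = make_cm(34, 2, 9, 15),
            random_forest = make_cm(33, 3, 9, 15))

# One row per model, one column per metric
comparison <- t(sapply(cms, cm_metrics))
round(comparison, 4)
```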