1. DATA INTRODUCTION

As marketing analysts, we want to increase sales by targeting customers with certain characteristics. From the historical data of 400 customers who have been prospected, we have each customer's gender, age range, and salary category, as well as whether they bought our product.

1.1. Data Preparation

Read the data.

library(dplyr)  # glimpse(), mutate(), select(), %>%

CB <- read.csv("Customer_Behaviour.csv")
CB
glimpse(CB)
## Rows: 400
## Columns: 4
## $ Gender    <chr> "Male", "Male", "Female", "Female", "Male", "Male", "Female"…
## $ Age       <chr> "< 30", "30-50", "< 30", "< 30", "< 30", "< 30", "< 30", "30…
## $ Salary    <chr> "Low", "Low", "Medium", "Medium", "Medium", "Medium", "Mediu…
## $ Purchased <chr> "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "No",…
  • Gender: Gender (Male, Female)
  • Age: Age range (< 30, 30-50, > 50)
  • Salary: Customer Salary Category (Low, Medium, High)
  • Purchased: Whether the client buys our product or not (Yes, No)

Check for missing values.

anyNA(CB)
## [1] FALSE
colSums(is.na(CB))
##    Gender       Age    Salary Purchased 
##         0         0         0         0

1.2. Data Preprocessing

All four variables are categorical but were read in as character strings, so we convert each of them to a factor.

CB <- CB %>% 
  mutate(Gender = as.factor(Gender),
         Age = as.factor(Age),
         Salary = as.factor(Salary),
         Purchased = as.factor(Purchased))
CB
glimpse(CB)
## Rows: 400
## Columns: 4
## $ Gender    <fct> Male, Male, Female, Female, Male, Male, Female, Female, Male…
## $ Age       <fct> < 30, 30-50, < 30, < 30, < 30, < 30, < 30, 30-50, < 30, 30-5…
## $ Salary    <fct> Low, Low, Medium, Medium, Medium, Medium, Medium, High, Low,…
## $ Purchased <fct> No, No, No, No, No, No, No, Yes, No, No, No, No, No, No, No,…
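
as.factor() orders the levels by sorting the labels, which is why later outputs list Age as < 30, > 50, 30-50 and Salary as High, Low, Medium. If a natural order is preferred for tables and plots, the levels can be set explicitly. A minimal sketch on toy vectors (the vectors are illustrative stand-ins, not the real columns):

```r
# Toy vectors standing in for the Age and Salary columns (illustrative values).
age    <- c("< 30", "30-50", "> 50", "30-50")
salary <- c("Low", "Medium", "High", "Medium")

# Explicit levels fix the order, independent of how the labels sort.
age_f    <- factor(age,    levels = c("< 30", "30-50", "> 50"))
salary_f <- factor(salary, levels = c("Low", "Medium", "High"))

levels(age_f)
levels(salary_f)
table(age_f)
```

The same factor() calls could be used inside mutate() in place of as.factor(); the level names must match the labels in the data exactly, otherwise the unmatched values become NA.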

2. DATA ANALYSIS

2.1. Exploratory

Following the business question, we are going to build a predictive model to classify “whether the client buys our product or not” (Purchased = Yes / No).

levels(CB$Purchased)
## [1] "No"  "Yes"

From the level above, it can be seen that the target variable consists of two categories, namely “No” and “Yes”.

Check the class proportions of the target variable.

prop.table(table(CB$Purchased))
## 
##     No    Yes 
## 0.6425 0.3575
table(CB$Purchased)
## 
##  No Yes 
## 257 143

The two classes are split roughly 64:36. This is only a mild imbalance, so we do not need additional pre-processing to rebalance the target classes.

2.2. Cross Validation

Split the data into a training set (85%) and a test set (15%).

RNGkind(sample.kind = "Rounding")
set.seed(417)

# index sampling
index <- sample(x = nrow(CB), size = nrow(CB)*0.85)

# splitting
CB_train <- CB[index , ]

CB_test <- CB[-index , ]

Check dimension.

dim(CB_train)
## [1] 340   4
dim(CB_test)
## [1] 60  4
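
The split logic can be checked on a small example. The sketch below uses a hypothetical 20-row data frame (not the customer data) to show that sampling row numbers and then using a negative index yields two disjoint sets that together cover every row:

```r
# Toy data frame; 20 rows stand in for the 400 customers.
set.seed(1)
toy <- data.frame(id = 1:20, y = rep(c("No", "Yes"), times = c(13, 7)))

# Draw 85% of the row numbers without replacement.
idx <- sample(x = nrow(toy), size = round(nrow(toy) * 0.85))

toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]   # negative index = every row not in idx

nrow(toy_train)                               # 17
nrow(toy_test)                                # 3
length(intersect(toy_train$id, toy_test$id))  # 0, the sets are disjoint
```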

Drop the target variable from the test set.

CB_test_Val <- CB_test %>% select(-Purchased)
dim(CB_test_Val)
## [1] 60  3

Check the class proportions of the target variable in the training set.

prop.table(table(CB_train$Purchased))
## 
##   No  Yes 
## 0.65 0.35
table(CB_train$Purchased)
## 
##  No Yes 
## 221 119

The training set keeps roughly the same 65:35 split as the full data.

3. NAIVE BAYES MODEL

3.1. Build the model

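A known weakness of the Naive Bayes model is skewness due to scarcity: if a predictor level never occurs together with a class in the training data, its conditional probability is zero, and that zero forces the whole posterior for the class to zero. To guard against this, we apply Laplace smoothing (laplace = 1) when we build the model.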
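
As a rough illustration of what the smoothing does, the sketch below recomputes the Age probabilities for class "No". The counts are not printed anywhere in this report; they are back-calculated from the conditional probabilities in the model output, under the assumption that `laplace` is added to every cell count before normalising.

```r
# Class "No" has 221 training rows and Age has 3 levels.
# Counts are back-calculated from the printed model output (assumption:
# `laplace` is added to each cell count, and laplace * levels to the total).
age_counts_no <- c("< 30" = 84, "> 50" = 3, "30-50" = 134)
laplace <- 1

raw      <- age_counts_no / sum(age_counts_no)
smoothed <- (age_counts_no + laplace) /
            (sum(age_counts_no) + laplace * length(age_counts_no))

round(raw, 4)
round(smoothed, 4)

# A level that was never observed would still get a small non-zero probability.
(0 + laplace) / (sum(age_counts_no) + laplace * length(age_counts_no))
```

The smoothed values 85/224, 4/224 and 135/224 agree with the Age row for "No" in the fitted model.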

library(e1071)  # naiveBayes()

naive_model <- naiveBayes(Purchased ~ .,
                          data = CB_train,
                          laplace = 1)
naive_model
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##   No  Yes 
## 0.65 0.35 
## 
## Conditional probabilities:
##      Gender
## Y        Female      Male
##   No  0.4887892 0.5112108
##   Yes 0.5454545 0.4545455
## 
##      Age
## Y           < 30       > 50      30-50
##   No  0.37946429 0.01785714 0.60267857
##   Yes 0.01639344 0.30327869 0.68032787
## 
##      Salary
## Y           High        Low     Medium
##   No  0.08482143 0.20982143 0.70535714
##   Yes 0.52459016 0.25409836 0.22131148

3.2. Predict the model with test data set

naive_pred <- predict(naive_model, newdata = CB_test_Val)
naive_pred
##  [1] No  No  No  No  No  No  No  Yes No  No  No  Yes No  No  No  No  No  Yes No 
## [20] No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  Yes
## [39] Yes Yes Yes No  Yes Yes Yes No  No  No  Yes Yes Yes No  Yes No  Yes Yes No 
## [58] No  Yes No 
## Levels: No Yes

3.3. Confusion Matrix

(conf_mat_naive <- table(naive_pred, CB_test$Purchased))
##           
## naive_pred No Yes
##        No  34   9
##        Yes  2  15

The confusion matrix (rows are predictions, columns are actual classes) shows that of the 17 customers the Naive Bayes model predicted would buy, 15 actually bought and 2 did not. Of the 43 customers it predicted would not buy, 34 were correct and 9 went on to buy. What accuracy does this give? Let’s see below.

library(caret)  # confusionMatrix(), train(), trainControl(), varImp()

(nb_cm <- confusionMatrix(conf_mat_naive))
## Confusion Matrix and Statistics
## 
##           
## naive_pred No Yes
##        No  34   9
##        Yes  2  15
##                                           
##                Accuracy : 0.8167          
##                  95% CI : (0.6956, 0.9048)
##     No Information Rate : 0.6             
##     P-Value [Acc > NIR] : 0.0002826       
##                                           
##                   Kappa : 0.5985          
##                                           
##  Mcnemar's Test P-Value : 0.0704404       
##                                           
##             Sensitivity : 0.9444          
##             Specificity : 0.6250          
##          Pos Pred Value : 0.7907          
##          Neg Pred Value : 0.8824          
##              Prevalence : 0.6000          
##          Detection Rate : 0.5667          
##    Detection Prevalence : 0.7167          
##       Balanced Accuracy : 0.7847          
##                                           
##        'Positive' Class : No              
## 
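
The statistics in this output can be reproduced by hand from the four cell counts, which also makes clear what each one measures. Because the positive class is "No", Sensitivity is the share of actual non-buyers that were caught, and Specificity is the share of actual buyers that were caught. A minimal sketch, with the matrix typed in from the output above:

```r
# Test-set confusion matrix copied from the output above
# (rows = predicted class, columns = actual class).
cm <- matrix(c(34, 2, 9, 15), nrow = 2,
             dimnames = list(predicted = c("No", "Yes"),
                             actual    = c("No", "Yes")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["No", "No"]   / sum(cm[, "No"])   # recall for "No"
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # recall for "Yes"
ppv         <- cm["No", "No"]   / sum(cm["No", ])   # precision for "No"
npv         <- cm["Yes", "Yes"] / sum(cm["Yes", ])  # precision for "Yes"

round(c(Accuracy = accuracy, Sensitivity = sensitivity,
        Specificity = specificity, PPV = ppv, NPV = npv), 4)
```

If buyers are the class of interest, confusionMatrix() accepts a `positive = "Yes"` argument, which swaps the roles of these pairs.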

4. DECISION TREE MODEL

4.1. Build the model.

library(partykit)  # ctree(), ctree_control()

dt_model <- ctree(Purchased ~ .,
                  CB_train,
                  control = ctree_control(mincriterion = 0.95,
                                          minsplit = 20,
                                          minbucket = 7))
plot(dt_model, type="simple")

dt_model
## 
## Model formula:
## Purchased ~ Gender + Age + Salary
## 
## Fitted party:
## [1] root
## |   [2] Salary in High
## |   |   [3] Age < 30: No (n = 7, err = 14.3%)
## |   |   [4] Age > 50, 30-50: Yes (n = 74, err = 16.2%)
## |   [5] Salary in Low, Medium
## |   |   [6] Age < 30, 30-50
## |   |   |   [7] Age < 30: No (n = 78, err = 0.0%)
## |   |   |   [8] Age in 30-50
## |   |   |   |   [9] Salary in Low: No (n = 42, err = 50.0%)
## |   |   |   |   [10] Salary in Medium: No (n = 119, err = 13.4%)
## |   |   [11] Age > 50: Yes (n = 20, err = 5.0%)
## 
## Number of inner nodes:    5
## Number of terminal nodes: 6

The model above is built with the default control parameters (mincriterion = 0.95, minsplit = 20, minbucket = 7). To get a simpler tree, which is less prone to overfitting, we prune it by tightening the stopping rules. Strictly speaking this is pre-pruning, because the limits are applied while the tree is being grown.

To do that, we will change the following parameters:

  • mincriterion: increase the value
  • minsplit: increase the value
  • minbucket: increase the value
dt_model_prun <- ctree(Purchased ~ .,
                  CB_train,
                  control = ctree_control(mincriterion=0.97,
                                             minsplit=60,
                                             minbucket=21))
plot(dt_model_prun, type="simple")

dt_model_prun
## 
## Model formula:
## Purchased ~ Gender + Age + Salary
## 
## Fitted party:
## [1] root
## |   [2] Salary in High
## |   |   [3] Age < 30, > 50: Yes (n = 26, err = 30.8%)
## |   |   [4] Age in 30-50: Yes (n = 55, err = 18.2%)
## |   [5] Salary in Low, Medium
## |   |   [6] Age < 30: No (n = 78, err = 0.0%)
## |   |   [7] Age > 50, 30-50
## |   |   |   [8] Salary in Low: Yes (n = 52, err = 42.3%)
## |   |   |   [9] Salary in Medium: No (n = 129, err = 20.2%)
## 
## Number of inner nodes:    4
## Number of terminal nodes: 5

From the result above, the pruned tree has 4 inner nodes and 5 terminal nodes, down from 5 and 6 in the first model.

4.2. Predict the model with test data set

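Once a tree has been fitted on the training set, it can be applied directly to the test data. Note that the call below uses dt_model, the tree fitted with the default controls, so the results that follow describe that tree rather than dt_model_prun.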

dt_pred <- predict(dt_model, CB_test)
dt_pred
##   4  12  19  23  25  30  32  43  47  48  61  65  73  90  94  98 102 104 105 118 
##  No  No  No  No  No  No  No Yes  No  No  No Yes  No  No  No  No  No Yes  No  No 
## 126 128 129 145 165 167 168 169 174 177 181 184 218 226 232 238 247 248 275 279 
##  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No  No Yes Yes Yes 
## 293 295 300 303 312 313 323 327 329 332 341 344 348 352 357 362 373 379 380 392 
## Yes  No Yes Yes Yes  No  No  No Yes Yes Yes  No Yes  No Yes Yes  No  No Yes  No 
## Levels: No Yes

4.3. Confusion Matrix

(conf_matrix_dt <- table(dt_pred, CB_test$Purchased))
##        
## dt_pred No Yes
##     No  34   9
##     Yes  2  15

The confusion matrix shows that of the 17 customers the Decision Tree predicted would buy, 15 actually bought and 2 did not. Of the 43 customers it predicted would not buy, 34 were correct and 9 went on to buy. These counts are identical to the Naive Bayes results. What accuracy does this give? Let’s see below.

(dt_cm <- confusionMatrix(conf_matrix_dt))
## Confusion Matrix and Statistics
## 
##        
## dt_pred No Yes
##     No  34   9
##     Yes  2  15
##                                           
##                Accuracy : 0.8167          
##                  95% CI : (0.6956, 0.9048)
##     No Information Rate : 0.6             
##     P-Value [Acc > NIR] : 0.0002826       
##                                           
##                   Kappa : 0.5985          
##                                           
##  Mcnemar's Test P-Value : 0.0704404       
##                                           
##             Sensitivity : 0.9444          
##             Specificity : 0.6250          
##          Pos Pred Value : 0.7907          
##          Neg Pred Value : 0.8824          
##              Prevalence : 0.6000          
##          Detection Rate : 0.5667          
##    Detection Prevalence : 0.7167          
##       Balanced Accuracy : 0.7847          
##                                           
##        'Positive' Class : No              
## 

5. RANDOM FOREST MODEL

With a random forest, we are not strictly required to split the dataset into train and test sets. Each tree is trained on a bootstrap sample, and the rows left out of that sample (the out-of-bag, or OOB, rows) give an estimate of the accuracy on unseen examples. It is still possible to keep a regular hold-out test set alongside it, as we do here.
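
To give a feel for how many rows end up out-of-bag, the sketch below draws one bootstrap sample of the same size as the training set (340 rows) and counts the rows that were never drawn. It only illustrates the sampling idea and is not how randomForest is implemented internally.

```r
set.seed(417)
n <- 340  # size of the training set used in this report

# One bootstrap sample: n row numbers drawn with replacement.
in_bag <- sample(n, size = n, replace = TRUE)
oob    <- setdiff(seq_len(n), in_bag)  # rows this tree would never see

length(oob) / n   # observed out-of-bag share for this one draw
(1 - 1 / n)^n     # expected share, close to exp(-1), about 0.37
```

So each tree is evaluated on roughly a third of the training rows that it did not see, and the OOB estimate aggregates those evaluations over all trees.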

5.1. Build the model

We will fit a Random Forest model on the training set using 5-fold cross-validation, repeated 3 times.
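
As a rough picture of what this resampling scheme implies, the sketch below assigns the 340 training rows to 5 folds, three times over, and counts how many rows are used for fitting in each resample. It uses plain random folds as a simplifying assumption; caret builds its own folds, stratified by class, so its fold sizes can differ by a row or two.

```r
set.seed(417)
n       <- 340  # training rows
k       <- 5    # folds
repeats <- 3

# For each repeat, shuffle fold labels 1..k across the rows.
fold_ids <- replicate(repeats, sample(rep(seq_len(k), length.out = n)))

# Each (repeat, fold) pair holds one fold out and trains on the rest.
train_sizes <- as.vector(apply(fold_ids, 2,
                               function(f) n - tabulate(f, nbins = k)))

length(train_sizes)  # 15 resamples in total
unique(train_sizes)  # 272 rows used for fitting in each
```

In the actual model fit, the trainControl() settings below are what request this scheme.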

set.seed(417)
 
ctrl <- trainControl(method = "repeatedcv",
                     number = 5, # k-fold
                     repeats = 3) # repetition

(CB_forest <- train(Purchased ~ .,
                    data = CB_train,
                    method = "rf", # random forest
                    trControl = ctrl))
## Random Forest 
## 
## 340 samples
##   3 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 272, 272, 272, 271, 273, 272, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8283868  0.6181232
##   3     0.8303476  0.6237286
##   5     0.8273918  0.6174792
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
saveRDS(CB_forest, "CB_forest.RDS")

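From the model summary, the best cross-validated accuracy is obtained with mtry = 3, that is, three variables considered for splitting at each tree node. The candidate values go up to 5 because the formula interface expands the three factors into five dummy columns, as the importance table below shows. We can inspect the importance of each variable used in our random forest with varImp().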

varImp(CB_forest)
## rf variable importance
## 
##              Overall
## SalaryMedium  100.00
## Age> 50        82.73
## Age30-50       32.63
## SalaryLow      20.59
## GenderMale      0.00

The OOB error reported in the final model summary below is computed from our CB_train dataset only. The plot shows how the error rates change as trees are added.

plot(CB_forest$finalModel)
legend("topright", colnames(CB_forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)

And we can see the final model as follows.

CB_forest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 19.41%
## Confusion matrix:
##      No Yes class.error
## No  191  30   0.1357466
## Yes  36  83   0.3025210

With mtry = 3, the OOB confusion matrix (rows are actual classes, columns are predictions) shows that 83 of the 119 buyers were predicted correctly and 36 were missed. Similarly, 191 of the 221 non-buyers were predicted correctly and 30 were wrongly predicted as buyers.
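
The error figures in that summary follow directly from the four counts. A short sketch, with the matrix typed in from the output above:

```r
# OOB confusion matrix copied from the output above
# (rows = actual class, columns = predicted class).
oob_cm <- matrix(c(191, 36, 30, 83), nrow = 2,
                 dimnames = list(actual    = c("No", "Yes"),
                                 predicted = c("No", "Yes")))

class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)
oob_error   <- 1 - sum(diag(oob_cm)) / sum(oob_cm)

round(class_error, 4)  # No: 0.1357, Yes: 0.3025
round(oob_error, 4)    # 0.1941
```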

5.2. Predict the model with test data set

Let’s test our random forest model on our CB_test dataset.

forest_pred <- predict(CB_forest, CB_test_Val)
forest_pred
##  [1] No  No  No  No  No  No  No  Yes No  No  No  Yes No  No  No  No  No  Yes No 
## [20] No  No  No  No  Yes No  No  No  No  No  No  No  No  No  No  No  No  No  Yes
## [39] Yes Yes Yes No  Yes Yes Yes No  No  No  Yes Yes Yes No  Yes No  Yes Yes No 
## [58] No  Yes No 
## Levels: No Yes

5.3. Confusion Matrix

(conf_matrix_forest <- table(forest_pred, CB_test$Purchased))
##            
## forest_pred No Yes
##         No  33   9
##         Yes  3  15

The confusion matrix shows that of the 18 customers the Random Forest predicted would buy, 15 actually bought and 3 did not. Of the 42 customers it predicted would not buy, 33 were correct and 9 went on to buy. What accuracy does this give? Let’s see below.

(rf_cm <- confusionMatrix(conf_matrix_forest))
## Confusion Matrix and Statistics
## 
##            
## forest_pred No Yes
##         No  33   9
##         Yes  3  15
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.6767, 0.8922)
##     No Information Rate : 0.6             
##     P-Value [Acc > NIR] : 0.0008097       
##                                           
##                   Kappa : 0.5652          
##                                           
##  Mcnemar's Test P-Value : 0.1489147       
##                                           
##             Sensitivity : 0.9167          
##             Specificity : 0.6250          
##          Pos Pred Value : 0.7857          
##          Neg Pred Value : 0.8333          
##              Prevalence : 0.6000          
##          Detection Rate : 0.5500          
##    Detection Prevalence : 0.7000          
##       Balanced Accuracy : 0.7708          
##                                           
##        'Positive' Class : No              
## 

6. CONCLUSION

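Given the business question, the most relevant metrics are accuracy and sensitivity (recall), because we want to predict whether or not a customer will buy the product. Keep in mind that confusionMatrix() used "No" as the positive class, so the recall and precision values below describe non-buyers.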

# tibble() replaces the deprecated data_frame(); metrics are picked by name.
(eval_nb <- tibble(Accuracy = nb_cm$overall["Accuracy"],
                   Recall = nb_cm$byClass["Sensitivity"],
                   Specificity = nb_cm$byClass["Specificity"],
                   Precision = nb_cm$byClass["Pos Pred Value"]))
(eval_dt <- tibble(Accuracy = dt_cm$overall["Accuracy"],
                   Recall = dt_cm$byClass["Sensitivity"],
                   Specificity = dt_cm$byClass["Specificity"],
                   Precision = dt_cm$byClass["Pos Pred Value"]))
(eval_rf <- tibble(Accuracy = rf_cm$overall["Accuracy"],
                   Recall = rf_cm$byClass["Sensitivity"],
                   Specificity = rf_cm$byClass["Specificity"],
                   Precision = rf_cm$byClass["Pos Pred Value"]))

Based on the evaluation above, Naive Bayes and Decision Tree give identical results on the test set (accuracy 0.8167, sensitivity/recall 0.9444), and both are better than Random Forest (accuracy 0.8000, sensitivity/recall 0.9167).

We therefore conclude that either the Naive Bayes or the Decision Tree model can be used to answer future business questions.
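
Since the business interest is in buyers, it may also help to look at the same test results with "Yes" as the positive class. The sketch below recomputes accuracy, recall and precision from the three confusion matrices reported in sections 3.3, 4.3 and 5.3; the helper names make_cm() and yes_metrics() are introduced here for illustration only.

```r
# Test-set confusion matrices copied from sections 3.3, 4.3 and 5.3
# (rows = predicted class, columns = actual class).
make_cm <- function(v) {
  matrix(v, nrow = 2,
         dimnames = list(predicted = c("No", "Yes"), actual = c("No", "Yes")))
}
cms <- list(naive_bayes   = make_cm(c(34, 2, 9, 15)),
            decision_tree = make_cm(c(34, 2, 9, 15)),
            random_forest = make_cm(c(33, 3, 9, 15)))

# Metrics with "Yes" (a purchase) treated as the positive class.
yes_metrics <- function(cm) {
  c(accuracy  = sum(diag(cm)) / sum(cm),
    recall    = cm["Yes", "Yes"] / sum(cm[, "Yes"]),
    precision = cm["Yes", "Yes"] / sum(cm["Yes", ]))
}

comparison <- round(t(sapply(cms, yes_metrics)), 4)
comparison
```

Seen this way, all three models find the same 15 of the 24 buyers in the test set (recall 0.625). Naive Bayes and Decision Tree keep their edge through higher accuracy and higher precision on predicted buyers.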