1 Objective

Over the years, marketing researchers have pondered the reasoning behind consumer behavior, conducting study after study to the present day as these behaviors continue to change. Even when research pointed toward solutions that did not maximize results, marketers were still asking what the key determinants of consumer behavior are. Consumer behavior refers to an individual's buying habits, including the social trends, frequency patterns, and background factors that influence the decision to buy something. The data used here comes from a fictional company that sells a fictional product.

2 Data Preparation

The data contains details about 400 clients of a company: a unique ID, gender, age, and estimated salary. In addition, we have information about the buying decision - whether the customer decided to buy the product or not.

#Load Library
library(dplyr)
library(e1071)
library(rsample)
library(caret)
library(ROCR)
library(partykit)
library(randomForest)
#Import csv to data frame
cb <- read.csv("Customer_Behaviour.csv")
head(cb)
  • User.ID : Unique user id, available for each customer

  • Gender : Gender of the customer

  • Age : Age of the customer

  • EstimatedSalary : Estimated salary of customer

  • Purchased : 0 (did not purchase), 1 (purchased)

2.1 Data Wrangling

glimpse(cb)
#> Rows: 400
#> Columns: 5
#> $ User.ID         <int> 15624510, 15810944, 15668575, 15603246, 15804002, 1572~
#> $ Gender          <chr> "Male", "Male", "Female", "Female", "Male", "Male", "F~
#> $ Age             <int> 19, 35, 26, 27, 19, 27, 27, 32, 25, 35, 26, 26, 20, 32~
#> $ EstimatedSalary <int> 19000, 20000, 43000, 57000, 76000, 58000, 84000, 15000~
#> $ Purchased       <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ~
#Classify Age: <= 25 into Young, <= 45 into Adult, the rest into Middle Age
cb$Age[cb$Age <= 25] = "Young"
cb$Age[(cb$Age > 25 & cb$Age <=45)] = "Adult"
cb$Age[(cb$Age != "Young") & (cb$Age != "Adult")] = "Middle Age"
#Classify EstimatedSalary: <= 50000 into Low, <= 90000 into Moderate, the rest into High
cb$EstimatedSalary[cb$EstimatedSalary <= 50000] = "Low"
cb$EstimatedSalary[(cb$EstimatedSalary > 50000 & cb$EstimatedSalary <=90000)] = "Moderate"
cb$EstimatedSalary[(cb$EstimatedSalary != "Low") & (cb$EstimatedSalary != "Moderate")] = "High"
#Changing data types accordingly and removing unnecessary data

cb_clean <- cb %>% 
  mutate(Gender = as.factor(Gender),
         Age = as.factor(Age),
         EstimatedSalary = as.factor(EstimatedSalary),
         Purchased = as.factor(Purchased)) %>% 
  select(-User.ID) 
#Checking missing value
colSums(is.na(cb_clean))
#>          Gender             Age EstimatedSalary       Purchased 
#>               0               0               0               0
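Note that assigning labels such as "Young" into the numeric Age column coerces the whole column to character, so the later comparisons against 25, 45, 50000, and 90000 are performed on strings; this happens to produce the intended labels for this particular data set, but it relies on string-comparison quirks. A more robust way to express the same binning, sketched below with dplyr (already loaded) and the hypothetical name cb_alt, applies the cut-offs to the numeric columns before any coercion:

#Sketch only: same breakpoints as above, applied to the numeric columns directly
cb_alt <- read.csv("Customer_Behaviour.csv") %>% 
  mutate(Age = case_when(Age <= 25 ~ "Young",
                         Age <= 45 ~ "Adult",
                         TRUE ~ "Middle Age"),
         EstimatedSalary = case_when(EstimatedSalary <= 50000 ~ "Low",
                                     EstimatedSalary <= 90000 ~ "Moderate",
                                     TRUE ~ "High"))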

2.2 Cross-Validation

#Splitting the data into 80% train and 20% test
RNGkind(sample.kind = "Rounding")
set.seed(1234)

index <- sample(nrow(cb_clean), nrow(cb_clean)*0.8)

cb_train <- cb_clean[index,]
cb_test <- cb_clean[-index,]
#Checking target proportion
prop.table(table(cb_train$Purchased))
#> 
#>       0       1 
#> 0.65625 0.34375

Based on the target proportion, the training data is imbalanced at roughly 65% to 35%, so we use downsampling to balance the classes.

set.seed(1234)

cb_train_down <- downSample(x = cb_train %>% 
                              select(-Purchased),
                            y = cb_train$Purchased,
                            yname = "Purchased")
#Checking target proportion after sampling
prop.table(table(cb_train_down$Purchased))
#> 
#>   0   1 
#> 0.5 0.5
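The split above samples rows completely at random. Since rsample is already loaded, an equivalent stratified split that preserves the target proportion in both sets could be sketched as follows (hypothetical objects cb_train_strat and cb_test_strat; the rest of the report keeps the base-R split above):

#Sketch only: stratified 80/20 split with rsample
set.seed(1234)
cb_split <- initial_split(cb_clean, prop = 0.8, strata = Purchased)
cb_train_strat <- training(cb_split)
cb_test_strat <- testing(cb_split)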

3 Naive Bayes Model

After splitting the data into training and test sets and downsampling the training data, let us build our first model: Naive Bayes. There are several advantages to using this model, for example:

  • The model is relatively fast to train
  • It produces probabilistic predictions
  • It can handle irrelevant features

3.1 Model Fitting

#Building naive bayes model
model_nb <- naiveBayes(formula = Purchased ~ ., data = cb_train_down)

3.2 Model Interpretation

#Inspecting model
model_nb
#> 
#> Naive Bayes Classifier for Discrete Predictors
#> 
#> Call:
#> naiveBayes.default(x = X, y = Y, laplace = laplace)
#> 
#> A-priori probabilities:
#> Y
#>   0   1 
#> 0.5 0.5 
#> 
#> Conditional probabilities:
#>    Gender
#> Y      Female      Male
#>   0 0.5000000 0.5000000
#>   1 0.5545455 0.4454545
#> 
#>    Age
#> Y        Adult Middle Age      Young
#>   0 0.79090909 0.05454545 0.15454545
#>   1 0.40909091 0.59090909 0.00000000
#> 
#>    EstimatedSalary
#> Y         High        Low   Moderate
#>   0 0.07272727 0.31818182 0.60909091
#>   1 0.51818182 0.28181818 0.20000000
  • Among customers who purchased, 55.5% were Female and 44.5% were Male

  • Among customers who purchased, 40.9% were Adult, 59.1% were Middle Age, and none were Young - a zero probability addressed by the Laplace smoothing sketch below

  • Among customers who purchased, 51.8% had a High salary, 20% a Moderate salary, and 28.2% a Low salary
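The zero conditional probability for Young customers in the purchasing class means the model multiplies in a zero whenever it scores a Young customer, forcing the predicted probability of purchase to zero regardless of the other features. A common remedy is Laplace smoothing, which naiveBayes() supports through its laplace argument; a sketch (the rest of the report keeps the unsmoothed model):

#Sketch only: add-one Laplace smoothing to avoid zero conditional probabilities
model_nb_laplace <- naiveBayes(formula = Purchased ~ ., data = cb_train_down, laplace = 1)
model_nb_laplace$tables$Age #Young now gets a small non-zero probability for class 1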

3.3 Model Evaluation

#Predicting the model to the data test
pred_naive <- predict(model_nb, newdata = cb_test, type = "class")
#Evaluating the model with confusion matrix
confusionMatrix(data = pred_naive, reference = cb_test$Purchased, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 42  2
#>          1  5 31
#>                                          
#>                Accuracy : 0.9125         
#>                  95% CI : (0.828, 0.9641)
#>     No Information Rate : 0.5875         
#>     P-Value [Acc > NIR] : 1.021e-10      
#>                                          
#>                   Kappa : 0.8219         
#>                                          
#>  Mcnemar's Test P-Value : 0.4497         
#>                                          
#>             Sensitivity : 0.9394         
#>             Specificity : 0.8936         
#>          Pos Pred Value : 0.8611         
#>          Neg Pred Value : 0.9545         
#>              Prevalence : 0.4125         
#>          Detection Rate : 0.3875         
#>    Detection Prevalence : 0.4500         
#>       Balanced Accuracy : 0.9165         
#>                                          
#>        'Positive' Class : 1              
#> 

Because both target classes are equally important, we focus on the Accuracy metric. For the Naive Bayes model, the accuracy is 0.9125.

4 Decision Tree

Decision Tree is a fairly simple tree-based model with robust and powerful predictive performance. It produces a visualization in the form of a decision tree that can be interpreted easily.

Additional Decision Tree characteristics:

  • It makes no independence assumption about the predictor variables, so it can cope with multicollinearity.
  • It is robust to outliers in numeric predictors.

4.1 Model Fitting

#Building decision tree model
model_dt <- ctree(formula = Purchased ~.,
                  data = cb_train_down)

4.2 Model Interpretation

#Plotting the decision tree model
plot(model_dt, type = "simple")

  • The first and most informative split in the data is on Age, separating Adult and Young customers from Middle Age customers.

  • Middle Age customers are likely to purchase the product, with only an 8.5% error rate across 71 observations.

  • Adult and Young customers are split further by Estimated Salary. Those with a High salary are likely to purchase the product (17.5% error across 40 observations), while those with a Low or Moderate salary are likely not to purchase (11% error across 109 observations).
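To make these rules concrete, the fitted tree can be used to score a single hypothetical customer (a minimal sketch; the example values are made up for illustration):

#Sketch only: predicting purchase for one made-up customer
new_customer <- data.frame(Gender = factor("Female", levels = levels(cb_train_down$Gender)),
                           Age = factor("Middle Age", levels = levels(cb_train_down$Age)),
                           EstimatedSalary = factor("Low", levels = levels(cb_train_down$EstimatedSalary)))
predict(model_dt, newdata = new_customer, type = "response") #predicted class (0 or 1)
predict(model_dt, newdata = new_customer, type = "prob") #class probabilities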

4.3 Model Evaluation

#Predicting the model to the data test
pred_dt <- predict(model_dt, newdata = cb_test, type = "response")
#Evaluating the model with confusion matrix
confusionMatrix(data = pred_dt, reference = cb_test$Purchased, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 42  2
#>          1  5 31
#>                                          
#>                Accuracy : 0.9125         
#>                  95% CI : (0.828, 0.9641)
#>     No Information Rate : 0.5875         
#>     P-Value [Acc > NIR] : 1.021e-10      
#>                                          
#>                   Kappa : 0.8219         
#>                                          
#>  Mcnemar's Test P-Value : 0.4497         
#>                                          
#>             Sensitivity : 0.9394         
#>             Specificity : 0.8936         
#>          Pos Pred Value : 0.8611         
#>          Neg Pred Value : 0.9545         
#>              Prevalence : 0.4125         
#>          Detection Rate : 0.3875         
#>    Detection Prevalence : 0.4500         
#>       Balanced Accuracy : 0.9165         
#>                                          
#>        'Positive' Class : 1              
#> 

Because both target classes are equally important, we focus on the Accuracy metric. For the Decision Tree model, the accuracy is 0.9125.

5 Random Forest

The last model to build is Random Forest. The following are among the advantages of the random forest model:

  • It reduces overfitting, since it aggregates many decision trees
  • Automatic feature selection
  • It generates an unbiased estimate of the out-of-bag error (inspected after the model output below)

5.1 Model Fitting

#Training the random forest with repeated cross-validation and saving it to an RDS file
set.seed(1234)

ctrl <- trainControl(method = "repeatedcv",
                     number = 3, 
                     repeats = 3)
 
model_rf <- train(Purchased ~ .,
            data = cb_train,
            method = "rf", 
            trControl = ctrl)
 
saveRDS(model_rf, "model_rf.RDS")
#Reading rds file
model_rf <- readRDS("model_rf.RDS")
#Inspecting the model
model_rf
#> Random Forest 
#> 
#> 320 samples
#>   3 predictor
#>   2 classes: '0', '1' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (3 fold, repeated 3 times) 
#> Summary of sample sizes: 214, 213, 213, 214, 213, 213, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>   2     0.8990576  0.7775716
#>   3     0.8990576  0.7775716
#>   5     0.8990576  0.7775716
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
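The out-of-bag error mentioned among the advantages, and the relative importance of each predictor, can be inspected from the fitted object; a short sketch using the caret object trained above:

#Sketch only: inspecting the selected final model and variable importance
model_rf$finalModel #prints the OOB estimate of error rate and the confusion matrix
varImp(model_rf) #caret's variable importance for the tuned model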

5.2 Model Prediction

#Predicting the model to the data test
pred_rf <- predict(model_rf, newdata = cb_test, type = "raw")

5.3 Model Evaluation

#Evaluating the model with confusion matrix
confusionMatrix(data = pred_rf, reference = cb_test$Purchased, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 42  2
#>          1  5 31
#>                                          
#>                Accuracy : 0.9125         
#>                  95% CI : (0.828, 0.9641)
#>     No Information Rate : 0.5875         
#>     P-Value [Acc > NIR] : 1.021e-10      
#>                                          
#>                   Kappa : 0.8219         
#>                                          
#>  Mcnemar's Test P-Value : 0.4497         
#>                                          
#>             Sensitivity : 0.9394         
#>             Specificity : 0.8936         
#>          Pos Pred Value : 0.8611         
#>          Neg Pred Value : 0.9545         
#>              Prevalence : 0.4125         
#>          Detection Rate : 0.3875         
#>    Detection Prevalence : 0.4500         
#>       Balanced Accuracy : 0.9165         
#>                                          
#>        'Positive' Class : 1              
#> 

Because both target classes are equally important, we focus on the Accuracy metric. For the Random Forest model, the accuracy is 0.9125.

6 Conclusion

The business aim is to predict customer behaviour - whether, given the available variables/factors, a customer is going to buy the product or not - so both target classes are equally important. From the model evaluation, the metric to focus on is Accuracy. All three models achieve the same accuracy of 0.9125, i.e. 91.25% of the test customers are classified correctly. If the accuracy were lacking, we could follow up with an ROC/AUC evaluation, but this result is more than enough to conclude that the models predict customer behaviour well.
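As a sketch of that ROC/AUC follow-up, using the ROCR package loaded at the start and the Naive Bayes model as an example (the same idea applies to the other two models):

#Sketch only: ROC curve and AUC for the Naive Bayes model on the test set
prob_naive <- predict(model_nb, newdata = cb_test, type = "raw")[, "1"] #P(Purchased = 1)
pred_rocr <- prediction(predictions = prob_naive, labels = cb_test$Purchased)
plot(performance(pred_rocr, "tpr", "fpr")) #ROC curve
performance(pred_rocr, "auc")@y.values[[1]] #area under the curve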