Over the years, marketing researchers have pondered the reasoning behind consumer behavior, conducting study after study as these behaviors continue to change. Even when research pointed toward solutions that didn't maximize results, marketers kept asking what the key determinant of consumer behavior was. Consumer behavior refers to an individual's buying habits, including social trends, frequency patterns, and background factors that influence the decision to buy something. The data used here is from a fictional company that sells a fictional product.
The data describes 400 clients of the company: a unique ID, the gender, the age, and the estimated salary of each customer. Besides this, it records the buying decision, i.e. whether the customer decided to buy the product or not.
#Load libraries
library(dplyr)
library(e1071)
library(rsample)
library(caret)
library(ROCR)
library(partykit)
library(randomForest)

#Import csv to data frame
cb <- read.csv("Customer_Behaviour.csv")
head(cb)

User.ID : Unique user ID, available for each customer
Gender : Gender of the customer
Age : Age of the customer
EstimatedSalary : Estimated salary of the customer
Purchased : 0 (Not Purchased), 1 (Purchased)
glimpse(cb)

#> Rows: 400
#> Columns: 5
#> $ User.ID <int> 15624510, 15810944, 15668575, 15603246, 15804002, 1572~
#> $ Gender <chr> "Male", "Male", "Female", "Female", "Male", "Male", "F~
#> $ Age <int> 19, 35, 26, 27, 19, 27, 27, 32, 25, 35, 26, 26, 20, 32~
#> $ EstimatedSalary <int> 19000, 20000, 43000, 57000, 76000, 58000, 84000, 15000~
#> $ Purchased <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ~
#Classify Age: <= 25 into Young, 26-45 into Adult, the rest into Middle Age
cb$Age[cb$Age <= 25] = "Young"
cb$Age[cb$Age > 25 & cb$Age <= 45] = "Adult"
cb$Age[(cb$Age != "Young") & (cb$Age != "Adult")] = "Middle Age"

#Classify EstimatedSalary: <= 50000 into Low, <= 90000 into Moderate, the rest into High
cb$EstimatedSalary[cb$EstimatedSalary <= 50000] = "Low"
cb$EstimatedSalary[cb$EstimatedSalary > 50000 & cb$EstimatedSalary <= 90000] = "Moderate"
cb$EstimatedSalary[(cb$EstimatedSalary != "Low") & (cb$EstimatedSalary != "Moderate")] = "High"

#Changing data types accordingly and removing unnecessary data
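A side note on the block above: the first assignment coerces the whole column to character, so the later numeric comparisons only work here by lexicographic coincidence. A safer equivalent is to bin the still-numeric columns with cut(); a minimal sketch, meant to replace the assignments above rather than follow them:

#Binning the raw numeric columns with cut() (a sketch)
cb$Age <- as.character(cut(cb$Age,
                           breaks = c(-Inf, 25, 45, Inf),
                           labels = c("Young", "Adult", "Middle Age")))
cb$EstimatedSalary <- as.character(cut(cb$EstimatedSalary,
                                       breaks = c(-Inf, 50000, 90000, Inf),
                                       labels = c("Low", "Moderate", "High")))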
cb_clean <- cb %>%
mutate(Gender = as.factor(Gender),
Age = as.factor(Age),
EstimatedSalary = as.factor(EstimatedSalary),
Purchased = as.factor(Purchased)) %>%
select(-User.ID)

#Checking missing value
colSums(is.na(cb_clean))

#> Gender Age EstimatedSalary Purchased
#> 0 0 0 0
#Splitting data train 80% and data test 20%
RNGkind(sample.kind = "Rounding")
set.seed(1234)
index <- sample(nrow(cb_clean), nrow(cb_clean)*0.8)
cb_train <- cb_clean[index,]
cb_test <- cb_clean[-index, ]
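As a side note, the rsample package loaded at the top provides a stratified split that keeps the class ratio similar in the train and test sets; a minimal sketch of that alternative (not used in the rest of this post):

#Stratified 80/20 split with rsample (a sketch, not used below)
set.seed(1234)
cb_split <- initial_split(cb_clean, prop = 0.8, strata = Purchased)
cb_train_strat <- training(cb_split)
cb_test_strat <- testing(cb_split)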
#Checking target proportion
prop.table(table(cb_train$Purchased))

#>
#> 0 1
#> 0.65625 0.34375
Based on the target proportion, the train data is imbalanced, roughly 65% to 35%, so we use downsampling to balance the classes.
set.seed(1234)
cb_train_down <- downSample(x = cb_train %>% select(-Purchased),
                            y = cb_train$Purchased,
                            yname = "Purchased")

#Checking target proportion after sampling
prop.table(table(cb_train_down$Purchased))

#>
#> 0 1
#> 0.5 0.5
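For reference, caret also provides upSample(), which balances the classes by duplicating minority-class rows instead of discarding majority-class rows; a sketch of that alternative (not used further here):

#Alternative: upsampling the minority class (a sketch, not used below)
set.seed(1234)
cb_train_up <- upSample(x = cb_train %>% select(-Purchased),
                        y = cb_train$Purchased,
                        yname = "Purchased")
prop.table(table(cb_train_up$Purchased))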
After splitting the data into train and test sets and downsampling the train data, let us build our first model, Naive Bayes. It is simple, fast to train, and handles categorical predictors naturally.
#Building naive bayes model
model_nb <- naiveBayes(formula = Purchased ~ ., data = cb_train_down)

#Inspecting model
model_nb

#>
#> Naive Bayes Classifier for Discrete Predictors
#>
#> Call:
#> naiveBayes.default(x = X, y = Y, laplace = laplace)
#>
#> A-priori probabilities:
#> Y
#> 0 1
#> 0.5 0.5
#>
#> Conditional probabilities:
#> Gender
#> Y Female Male
#> 0 0.5000000 0.5000000
#> 1 0.5545455 0.4454545
#>
#> Age
#> Y Adult Middle Age Young
#> 0 0.79090909 0.05454545 0.15454545
#> 1 0.40909091 0.59090909 0.00000000
#>
#> EstimatedSalary
#> Y High Low Moderate
#> 0 0.07272727 0.31818182 0.60909091
#> 1 0.51818182 0.28181818 0.20000000
Reading the conditional probabilities: among customers who purchased the product, 0.44 were Male and 0.55 Female; 0.40 were Adult, 0.59 Middle Age, and none Young; 0.51 had a High salary, 0.28 a Low salary, and 0.2 a Moderate salary. Note these are probabilities of each attribute given the purchase decision, not the other way around.
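To make the mechanics concrete, here is a minimal sketch of how Naive Bayes combines the table above for one hypothetical profile, a Male, Adult customer with a High salary: multiply the prior by each conditional probability, then normalize.

#Manual posterior for a hypothetical Male, Adult, High-salary customer,
#using the printed a-priori and conditional probabilities
prior   <- c(no = 0.5,        yes = 0.5)
p_male  <- c(no = 0.5,        yes = 0.4454545)
p_adult <- c(no = 0.79090909, yes = 0.40909091)
p_high  <- c(no = 0.07272727, yes = 0.51818182)

unnorm <- prior * p_male * p_adult * p_high
unnorm / sum(unnorm)   #about 0.23 for "no" and 0.77 for "yes"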
#Predicting the model on the data test
pred_naive <- predict(model_nb, newdata = cb_test, type = "class")

#Evaluating the model with confusion matrix
confusionMatrix(data = pred_naive, reference = cb_test$Purchased, positive = "1")

#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 42 2
#> 1 5 31
#>
#> Accuracy : 0.9125
#> 95% CI : (0.828, 0.9641)
#> No Information Rate : 0.5875
#> P-Value [Acc > NIR] : 1.021e-10
#>
#> Kappa : 0.8219
#>
#> Mcnemar's Test P-Value : 0.4497
#>
#> Sensitivity : 0.9394
#> Specificity : 0.8936
#> Pos Pred Value : 0.8611
#> Neg Pred Value : 0.9545
#> Prevalence : 0.4125
#> Detection Rate : 0.3875
#> Detection Prevalence : 0.4500
#> Balanced Accuracy : 0.9165
#>
#> 'Positive' Class : 1
#>
Because both classes are equally important, we focus on the Accuracy metric. From the confusion matrix, the Naive Bayes model reaches (42 + 31) / 80 = 0.9125.
Decision Tree is a fairly simple tree-based model with robust, powerful predictive performance. It produces a visualization in the form of a tree that can be interpreted easily.
#Building decision tree model
model_dt <- ctree(formula = Purchased ~ .,
                  data = cb_train_down)

#Plotting the decision tree model
plot(model_dt, type = "simple")

The first and most informative split is on Age, separating Middle Age customers from Adult and Young ones. A Middle Age customer is likely to purchase the product, with only an 8.5% error rate over 71 observations. Adult and Young customers are split further by EstimatedSalary: a customer with a High salary is likely to purchase the product, with a 17.5% error rate over 40 observations, while a customer with a Low or Moderate salary is likely not to purchase, with an 11% error rate over 109 observations.
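To see how a single profile flows through those splits, we can predict for one hypothetical customer (a sketch; the values are chosen purely for illustration):

#Routing one hypothetical customer through the fitted tree
new_customer <- data.frame(
  Gender          = factor("Female",     levels = levels(cb_train_down$Gender)),
  Age             = factor("Middle Age", levels = levels(cb_train_down$Age)),
  EstimatedSalary = factor("High",       levels = levels(cb_train_down$EstimatedSalary))
)
predict(model_dt, newdata = new_customer, type = "response")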
#Predicting the model on the data test
pred_dt <- predict(model_dt, newdata = cb_test, type = "response")

#Evaluating the model with confusion matrix
confusionMatrix(data = pred_dt, reference = cb_test$Purchased, positive = "1")

#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 42 2
#> 1 5 31
#>
#> Accuracy : 0.9125
#> 95% CI : (0.828, 0.9641)
#> No Information Rate : 0.5875
#> P-Value [Acc > NIR] : 1.021e-10
#>
#> Kappa : 0.8219
#>
#> Mcnemar's Test P-Value : 0.4497
#>
#> Sensitivity : 0.9394
#> Specificity : 0.8936
#> Pos Pred Value : 0.8611
#> Neg Pred Value : 0.9545
#> Prevalence : 0.4125
#> Detection Rate : 0.3875
#> Detection Prevalence : 0.4500
#> Balanced Accuracy : 0.9165
#>
#> 'Positive' Class : 1
#>
Because both classes are equally important, we again focus on Accuracy. The Decision Tree model also reaches 0.9125.
The last model to build is Random Forest, an ensemble of decision trees that typically improves stability and reduces overfitting compared with a single tree.
#Training the random forest with 3-fold cross-validation repeated 3 times,
#then saving the fitted model to an RDS file
set.seed(1234)
ctrl <- trainControl(method = "repeatedcv",
                     number = 3,
                     repeats = 3)
model_rf <- train(Purchased ~ .,
                  data = cb_train,
                  method = "rf",
                  trControl = ctrl)
saveRDS(model_rf, "model_rf.RDS")

#Reading the RDS file
model_rf <- readRDS("model_rf.RDS")

#Inspecting the model
model_rf

#> Random Forest
#>
#> 320 samples
#> 3 predictor
#> 2 classes: '0', '1'
#>
#> No pre-processing
#> Resampling: Cross-Validated (3 fold, repeated 3 times)
#> Summary of sample sizes: 214, 213, 213, 214, 213, 213, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.8990576 0.7775716
#> 3 0.8990576 0.7775716
#> 5 0.8990576 0.7775716
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
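Since the business question is also about which factors drive the decision, it can be worth inspecting variable importance on the tuned forest; a minimal sketch using caret (output omitted):

#Variable importance of the tuned random forest (a sketch)
varImp(model_rf)
plot(varImp(model_rf))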
#Predicting the model on the data test
pred_rf <- predict(model_rf, newdata = cb_test, type = "raw")

#Evaluating the model with confusion matrix
confusionMatrix(data = pred_rf, reference = cb_test$Purchased, positive = "1")

#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 42 2
#> 1 5 31
#>
#> Accuracy : 0.9125
#> 95% CI : (0.828, 0.9641)
#> No Information Rate : 0.5875
#> P-Value [Acc > NIR] : 1.021e-10
#>
#> Kappa : 0.8219
#>
#> Mcnemar's Test P-Value : 0.4497
#>
#> Sensitivity : 0.9394
#> Specificity : 0.8936
#> Pos Pred Value : 0.8611
#> Neg Pred Value : 0.9545
#> Prevalence : 0.4125
#> Detection Rate : 0.3875
#> Detection Prevalence : 0.4500
#> Balanced Accuracy : 0.9165
#>
#> 'Positive' Class : 1
#>
Because both classes are equally important, we focus on Accuracy once more. The Random Forest model likewise reaches 0.9125.
The business aim is to predict customer behaviour: given each variable/factor, whether the customer is going to buy the product or not. Since both outcomes matter, the metric to focus on is Accuracy. All three models achieve the same accuracy of 0.9125, i.e. they predict customer behaviour correctly 91.25% of the time on the test set. If the accuracy were lacking, we could follow up with an ROC/AUC evaluation, but this result is more than enough to conclude that the models predict customer behaviour well.
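For completeness, here is a minimal sketch of that ROC/AUC check for the Naive Bayes model, using the ROCR package loaded at the top; the same pattern applies to the other models:

#ROC curve and AUC for the Naive Bayes model (a sketch)
prob_naive <- predict(model_nb, newdata = cb_test, type = "raw")[, "1"]
pred_obj <- prediction(predictions = prob_naive, labels = cb_test$Purchased)
plot(performance(pred_obj, "tpr", "fpr"))   #ROC curve
performance(pred_obj, "auc")@y.values[[1]]  #AUC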