The Notebook is divided into tabbed sections as seen below.
our business problem is:
The business goal is determined together with other department. This part should be the continuation of the background problem.
On the Data Understanding phase, we will gather, describe and explore the data to make sure it fits the business goal.
The deliverable or result of this phase should include:
Answers to these research questions will appear in green checkboxes
The collected data consists of 40,000 distinct customers with 14 variables. The description of each column/variable can be seen below:
Variable | Description |
---|---|
flag | Whether the customer has bought the target product or not |
gender | Gender of the customer |
education | Education background of customer |
house_val | Value of the residence the customer lives in |
age | Age of the customer by group |
online | Whether the customer had online shopping experience or not |
customer_psy | Variable describing consumer psychology based on the area of residence |
marriage | Marriage status of the customer |
children | Whether the customer has children or not |
occupation | Career information of the customer |
mortgage | Housing Loan Information of customers |
house_own | Whether the customer owns a house or not |
region | Information on the area in which the customer are located |
fam_income | Family income Information of the customer(A means the lowest, and L means the highest) |
Load packages and import data.
# load required packages
library(pacman)
library(tidymodels)
library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(randomForest)
library(odbc)
library(DBI)
library(RSQLite)
::p_load(tidyverse,scales,ggthemes, ggplot2, gridExtra, DT, corrplot, broom)
pacman
# import the sales dataset
<- read.csv("sales.csv")
df_sales dim(df_sales)
#> [1] 40000 14
There are 40000 cases in the dataset and 14 covariates.
# look at the data
datatable(df_sales, options = list(pageLength = 5))
We can also see try to get the information about the structure of the data including:
glimpse(df_sales)
#> Rows: 40,000
#> Columns: 14
#> $ flag <chr> "Y", "N", "N", "Y", "Y", "Y", "Y", "N", "N", "Y", "N", "Y~
#> $ gender <chr> "M", "F", "M", "M", "F", "F", "M", "F", "F", "M", "M", "F~
#> $ education <chr> "4. Grad", "3. Bach", "2. Some College", "2. Some College~
#> $ house_val <int> 756460, 213171, 111147, 354151, 117087, 248694, 2000000, ~
#> $ age <chr> "1_Unk", "7_>65", "2_<=25", "2_<=25", "1_Unk", "6_<=65", ~
#> $ online <chr> "N", "N", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "N~
#> $ customer_psy <chr> "B", "E", "C", "B", "J", "B", "A", "C", "G", "C", "C", "F~
#> $ marriage <chr> "", "", "", "Single", "Married", "Married", "Married", "M~
#> $ child <chr> "U", "U", "Y", "U", "Y", "N", "U", "Y", "Y", "U", "Y", "U~
#> $ occupation <chr> "Professional", "Professional", "Professional", "Sales/Se~
#> $ mortgage <chr> "1Low", "1Low", "1Low", "1Low", "1Low", "2Med", "1Low", "~
#> $ house_owner <chr> "", "Owner", "Owner", "", "", "Owner", "", "Owner", "Rent~
#> $ region <chr> "Midwest", "Northeast", "Midwest", "West", "South", "West~
#> $ fam_income <chr> "L", "G", "J", "L", "H", "G", "C", "I", "D", "G", "E", "E~
Click on the tabs for different aspects of the Exploratory Data Analysis and cleaning.
We can check the full summary for customer_psy
and
fam_income
column since they contain many categories.
table("Customer Psychology" = df_sales$customer_psy)
#> Customer Psychology
#> A B C D E F G H I J U
#> 1427 8197 7830 2353 6650 4058 3951 958 2262 2187 127
table("Family Income" = df_sales$fam_income)
#> Family Income
#> A B C D E F G H I J K L U
#> 2274 2169 2687 4582 8432 6641 4224 2498 1622 1614 1487 1617 153
There are some interesting finding from the summary. For example, the
gender
column consists of 3 categories: F (Female), M
(Male), and U (Unknown). The child
column is similar, with
additional value of U (Unknown) and 0 (zero) even though the column
should only be Yes or No. The marriage
and
education
column contain empty values. This is not
surprising, since the sales team are not instructed to fulfill each
column with pre-determined values. However, this means that the incoming
data quality is not good and require future standardization in the
future. This also show us that we need to cleanse and prepare the data
before we do any analysis so that all relevant information can be
captured.
On this process, we handle the data based on the problem we find during the data understanding phase. Based on our finding, we will do the following process:
education
,
house_owner
and marriage
into explicit
Unknownage
category by removing the index (1_Unkn
into Unknown, 2_<=25 into <=25, etc.)mortgage
category by removing the
index<- df_sales |>
df_clean mutate(
flag = ifelse(flag == "Y", "Yes", "No"),
gender = case_when(gender == "F" ~ "Female",
== "M" ~ "Male",
gender TRUE ~ "Unknown"),
child = case_when(child == "Y" ~ "Yes",
== "N" ~ "No",
child TRUE ~ "Unknown"),
education = str_remove_all(education, "[0-9.]") |>
str_trim(),
education = ifelse(education == "", "Unknown", education),
age = substr(age, 3, nchar(age)),
age = ifelse(age == "Unk", "Unknown", age),
age = case_when(age == "<=35" ~ "26-35",
== "<=45" ~ "36-45",
age == "<=55" ~ "46-55",
age == "<=65" ~ "56-65",
age TRUE ~ age
),fam_income = ifelse(fam_income == "U", "Unknown", fam_income),
customer_psy = ifelse(customer_psy == "U", "Unknown", customer_psy),
online = ifelse(online == "Y", "Yes", "No"),
marriage = ifelse(marriage == "", "Unknown", marriage),
mortgage = substr(mortgage, 2, nchar(mortgage)),
house_owner = ifelse(house_owner == "", "Unknown", house_owner)
)
<-df_clean |>
df_cleanmutate_all(function(x) ifelse(x == "Unknown", NA, x)) |>
drop_na() |>
mutate_if(is.character, as.factor) |>
mutate(
flag = factor(flag, levels = c("Yes", "No"))
)
<-dbConnect(RSQLite::SQLite(), ":memory:")
sales_comcopy_to(sales_com,df_sales)
SELECT *, CASE WHEN flag='Y' THEN 'Yes'
ELSE 'No'
END AS flag_clean,
CASE WHEN gender = 'F' THEN 'Female'
WHEN gender = 'M' THEN 'Male'
ELSE 'Unknown'
END AS gender_clean,
CASE WHEN child = 'Y' THEN 'Yes'
WHEN child = 'N' THEN 'No'
ELSE 'Unknown'
END AS child_clean,
CASE WHEN LTRIM(RTRIM(SUBSTR(education,3,13)))='' THEN 'Unknown'
ELSE LTRIM(RTRIM(SUBSTR(education,3,13))) END AS education_clean,
CASE WHEN SUBSTR(age,3,6) = '<=35' THEN '26-35'
WHEN SUBSTR(age,3,6) = '<=45' THEN '36-45'
WHEN SUBSTR(age,3,6) = '<=55' THEN '46-55'
WHEN SUBSTR(age,3,6) = '<=65' THEN '56-65'
ELSE SUBSTR(age,3,6) END AS age_clean,
CASE WHEN fam_income = 'U' THEN 'Unknown'
ELSE fam_income
END AS fam_income_clean,
CASE WHEN customer_psy = 'U' THEN 'Unknown'
ELSE customer_psy
END AS customer_psy_clean,
CASE WHEN online = 'Y' THEN 'Yes'
ELSE 'No' END AS online_clean,
CASE WHEN marriage = '' THEN 'Unknown'
ELSE marriage END AS marriage_clean,
SUBSTR(mortgage, 2, 4) AS mortgage_clean,
CASE WHEN house_owner = '' THEN 'Unknown'
ELSE house_owner END AS house_owner_clean
FROM df_sales;
Do some final touches on data cleaned with sql
<-df_clean_sql |>
df_clean_sqlselect(-flag,-gender,-child,-education,-fam_income,-customer_psy,
-online,-marriage,-mortgage,-house_owner) |>
relocate(flag_clean,gender_clean,child_clean,
education_clean,fam_income_clean,
customer_psy_clean,online_clean,|>
marriage_clean,mortgage_clean,house_owner_clean)mutate_all(function(x) ifelse(x == "Unknown", NA, x)) |>
drop_na() |>
mutate_if(is.character, as.factor) |>
mutate(
flag = factor(flag_clean, levels = c("Yes", "No"))
)
<-dbConnect(RSQLite::SQLite(), ":memory:")
sales_comcopy_to(sales_com,df_clean_sql)
Click on the tabs for different aspects of the Exploratory Data Analysis and visualisation.
Here we will do visualization to see whether there are any difference between customer who buy our product and who don’t. To visualize a distribution, we can use histogram. The x-axis is the house valuation while the y-axis show the frequency or the number of customer with certain house valuation.
From the histogram, most of our customer has house valuation less than 2,500,000. Some customers are outlier and has house valuation greater than 2,500,000. Their frequency is low and they cannot be seen on the histogram. The distribution for people who buy and not buy are quite similar, therefore we cannot simply decide if a customer will buy our product based on their house valuation.
|>
df_clean_sql ggplot(aes(x = house_val, fill = flag_clean)) +
geom_histogram(color = "white", alpha = 0.75) +
theme(legend.position = "top") +
::scale_fill_stata()+
ggthemesscale_x_continuous(labels = number_format(big.mark = ",")) +
scale_y_continuous(labels = number_format(big.mark = ",")) +
labs(x = "House Valuation", y = "Frequency",
fill = "Flag",
title = "House Valuation Distribution")
We can cut and remove the outlier to see the distribution better.
|>
df_clean_sql ggplot(aes(x = house_val, fill = flag_clean)) +
geom_histogram(color = "white", alpha = 0.75) +
theme(legend.position = "top") +
::scale_fill_stata()+
ggthemesscale_x_continuous(labels = number_format(big.mark = ","), limits = c(0, 2*1e6)) +
scale_y_continuous(labels = number_format(big.mark = ",")) +
labs(x = "House Valuation", y = "Frequency",
fill = "Flag",
title = "House Valuation Distribution")
We will see if the education level can be a great indicator to decide if a customer has high probability to buy our product. The color of each block represent the frequency of people that fell in that category, with brighter color indicate higher frequency.
Based on the heatmap, people with higher education level (Bach and Grad) are more likely to buy our product. Therefore, education level may be a great indicator to check potential customer.
|>
df_clean_sql count(flag_clean, education_clean) |>
ggplot(aes(education_clean, flag_clean, fill = n)) +
geom_tile(color = "white") +
geom_text(aes(label = number(n, big.mark = ","))) +
scale_fill_binned(low = "firebrick", high = "lightyellow",
label = number_format(big.mark = ",")) +
theme(legend.position = "top",
legend.key.width = unit(15, "mm")) +
labs(x = "Education",
y = "Flag",
fill = "Frequency")
We will do the same thing here with the occupation/job. The one that stands out is the professional occupation that has a very high frequency of people who buy our product.
|>
df_clean_sql count(flag_clean, occupation) |>
ggplot(aes(occupation, flag_clean, fill = n)) +
geom_tile(color = "white") +
geom_text(aes(label = number(n, big.mark = ","))) +
scale_fill_binned(low = "purple", high = "lightyellow",
label = number_format(big.mark = ",")) +
theme(legend.position = "top",
legend.key.width = unit(15, "mm")) +
labs(x = "Occupation",
y = "Flag",
fill = "Frequency")
In this section we will use the tidymodels
to model our
data using random forest
, logistic regression
and decision tree
The cross-validation step is where we will split our data into 2 separate dataset: training dataset and testing dataset.
set.seed(123)
# Split the dataset
<-initial_split(df_clean,
td_split strata = flag,
prop=0.8)
<- training(td_split)
train <- testing(td_split) test
<- rpart(flag~., train)
model_tree
fancyRpartPlot(model_tree, sub = NULL, main = "Predictive Rule from Decision Tree\n")
The next model is Random Forest. In short, Random Forest is a collection of decision tree that together make a single decision.
set.seed(1234)
<- randomForest(flag ~ ., data = train,
model_rf ntree = 500,
nodesize = 1,
mtry = 2,
importance = T
)
$importance |>
model_rfas.data.frame() |>
rownames_to_column("var") |>
ggplot(aes(x = MeanDecreaseGini,
fill = MeanDecreaseGini,
y = var |> reorder(MeanDecreaseGini))) +
geom_col() +
scale_fill_binned(low = "firebrick", high = "lightyellow")+
labs(x = "Gini Decrease",
fill = "Gini Decrease",
y = "Predictors",
title = "Variable Importance")
In classification problem, we evaluate model by looking at how many of their predictions are correct. This can be plotted into something called Confusion Matrix.
The matrix is divided into four area:
For example, here is the confusion matrix from the decision tree after doing prediction to the testing dataset.
<- predict(model_tree,test, type = "class")
pred_test <- predict(model_tree,test, type = "prob")
pred_prob
<- as.data.frame(pred_prob) |>
df_prediction_tree mutate(prediction = pred_test,
truth =test$flag)
conf_mat(df_prediction_tree, truth = truth, estimate = prediction)
#> Truth
#> Prediction Yes No
#> Yes 1707 881
#> No 240 492
If we define positive as Yes:
Next, we can start doing evaluation using 3 different metrics: accuracy, recall, and precision. Those metrics are pretty general and complement each other. There are more evaluation metrics but we will not be discussed it here.
Accuracy simply tell us how many prediction is true compared to the total dataset.
\[ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \]
1728 + 508) / (1728 + 508+ 865 + 219)
(#> [1] 0.673494
From all data in testing dataset, only 67% of them are correctly predicted as buy/not buy.
\[ Accuracy = \frac{1728 + 508}{1728 + 508+ 865 + 219} = 0.673 = 67.3\% \]
Recall/sensitivity only concerns how many customers that actually buy can correctly be predicted. The metric don’t care about the customer that don’t actually buy our product.
\[ Recall = \frac{TP}{TP + FN} \]
1728 / (1728 + 219)
#> [1] 0.8875193
From all customer that actually buy our product, 89% of them are correctly predicted as buy and 16% as not buy.
\[ Recall = \frac{1728}{1728 + 219} = 0.8875 = 88.75\% \]
Precision only concern on how many positive prediction that are actually correct. The metric don’t care about customer that is predicted not buy.
\[ Precision = \frac{TP}{TP + FP} \]
1728 / (1728 + 865)
#> [1] 0.6664096
From all customer that is predicted to buy, only 67% of them that are actually buy our product.
\[ Precision = \frac{1728}{1728 + 865} = 0.6664 = 66.64\% \]
Here is the recap of the evaluation metrics for Random Forest. The model is slightly better than the Decision Tree.
<- predict(model_rf,test)
pred_test <- predict(model_rf,test, type = "prob")
pred_prob
<- as.data.frame(pred_prob) |>
df_prediction_rf mutate(prediction = pred_test,
truth =test$flag)
conf_mat(df_prediction_rf, truth = truth, estimate = prediction)
#> Truth
#> Prediction Yes No
#> Yes 1575 625
#> No 372 748
\[ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \]
1582 + 732) / (1582 + 732 + 641 + 365)
(#> [1] 0.696988
69% accuracy
<- recipes::recipe(flag ~ ., data=train) |>
train_rcp step_zv(all_predictors()) |>
step_dummy(all_nominal_predictors())
# Prep and bake the recipe so we can view this as a seperate data frame
<- train_rcp |>
training_df prep() |>
juice()
initialise the model
<- parsnip::logistic_reg() |>
lr_mod set_engine('glm')
add to workflow
<-
lr_wf workflow() |>
add_model(lr_mod) |>
add_recipe(train_rcp)
<-
lr_fit |>
lr_wf fit(data=train)
Extracting the fitted data
I want to pull the data fits into a tibble I can explore. This can be done below:
<- lr_fit |>
lr_fitted extract_fit_parsnip() |>
tidy()
I will visualise this via a bar chart to observe my significant features:
<- lr_fitted |>
lr_fitted_add mutate(Significance = ifelse(p.value < 0.05,
"Significant", "Insignificant")) |>
arrange(desc(p.value))
#Create a ggplot object to visualise significance
<- lr_fitted_add |>
plot ggplot(mapping = aes(x=fct_reorder(term,p.value), y=p.value, fill=Significance)) +
geom_col() +
scale_fill_calc()+
theme(axis.text.x = element_text(face="bold",
color="#0070BA",
size=8,
angle=90)) +
labs(y="P value",
x="Terms",
title="P value significance chart",
subtitle="A chart to represent the significant variables in the model",
caption="Produced by Bongani")
::ggplotly(plot) plotly
Use fitted logistic regression model to predict on test set
We are going to now append our predictions from our model we created
as a baseline to append to the predictions we already have in the
predictions
data frame:
<- predict(lr_fit, test, type='prob')
testing_lr_fit_probs <- predict(lr_fit, test)
testing_lr_fit_class
<- cbind(test, testing_lr_fit_probs, testing_lr_fit_class)
predictions
<- predictions |>
predictions ::mutate(log_reg_model_pred=.pred_class,
dplyrlog_reg_model_prob=.pred_Yes) |>
::select(everything(), -c(.pred_class, .pred_No)) |>
dplyr::mutate(log_reg_class_custom = ifelse(log_reg_model_prob >0.7,"Yes","No")) |>
dplyr::select(-.pred_Yes)
dplyr
# Get a head view of the finalised data
head(predictions)
#> flag gender education house_val age online customer_psy marriage child
#> 1 No Female Some College 136729 36-45 Yes C Married Yes
#> 2 Yes Male Some College 325081 46-55 Yes C Married Yes
#> 3 No Male Grad 0 46-55 Yes G Single Yes
#> 4 No Female Grad 248520 >65 Yes D Married No
#> 5 No Female <HS 0 36-45 No G Single Yes
#> 6 No Female Some College 654177 36-45 No E Single Yes
#> occupation mortgage house_owner region fam_income log_reg_model_pred
#> 1 Blue Collar Low Owner Midwest G No
#> 2 Sales/Service Low Owner Midwest J Yes
#> 3 Professional Low Owner Northeast D Yes
#> 4 Professional Low Owner Northeast E No
#> 5 Sales/Service Low Renter South A No
#> 6 Sales/Service Low Renter West D No
#> log_reg_model_prob log_reg_class_custom
#> 1 0.4826369 No
#> 2 0.7762933 Yes
#> 3 0.7164619 Yes
#> 4 0.4133618 No
#> 5 0.1384484 No
#> 6 0.3503392 No
The default caret
confusion_matrix function saves
everything as a string and doesn’t allow you to work with the values
from the output.
First, I will evaluate my baseline model using the package:
<- ConfusionTableR::binary_class_cm(
cm_lr #Here you will have to cast to factor type as the tool expects factors
train_labels = as.factor(predictions$log_reg_class_custom),
truth_labels = as.factor(predictions$flag),
positive='Yes', mode='everything'
)
# View the confusion matrix native
$confusion_matrix
cm_lr#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Yes No
#> Yes 1011 250
#> No 936 1123
#>
#> Accuracy : 0.6428
#> 95% CI : (0.6262, 0.6591)
#> No Information Rate : 0.5864
#> P-Value [Acc > NIR] : 1.755e-11
#>
#> Kappa : 0.314
#>
#> Mcnemar's Test P-Value : < 2.2e-16
#>
#> Sensitivity : 0.5193
#> Specificity : 0.8179
#> Pos Pred Value : 0.8017
#> Neg Pred Value : 0.5454
#> Precision : 0.8017
#> Recall : 0.5193
#> F1 : 0.6303
#> Prevalence : 0.5864
#> Detection Rate : 0.3045
#> Detection Prevalence : 0.3798
#> Balanced Accuracy : 0.6686
#>
#> 'Positive' Class : Yes
#>
The baseline model performs pretty well. You can see this is the result of fixing our imbalance. Let’s work with it in a row wise fashion, as we can extract some metrics we my be interested in:
# Get record level confusion matrix for logistic regression model
<- cm_lr$record_level_cm
cm_rl_log_reg <- tibble(
accuracy_frame Accuracy=cm_rl_log_reg$Accuracy,
Kappa=cm_rl_log_reg$Kappa,
Precision=cm_rl_log_reg$Precision,
Recall=cm_rl_log_reg$Recall
)
cm_rl_log_reg#> Pred_Yes_Ref_Yes Pred_No_Ref_Yes Pred_Yes_Ref_No Pred_No_Ref_No Accuracy
#> 1 1011 936 250 1123 0.6427711
#> Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
#> 1 0.3140479 0.6262006 0.6590897 0.5864458 1.755412e-11
#> McnemarPValue Sensitivity Specificity Pos.Pred.Value Neg.Pred.Value Precision
#> 1 4.906889e-88 0.5192604 0.817917 0.8017446 0.5454104 0.8017446
#> Recall F1 Prevalence Detection.Rate Detection.Prevalence
#> 1 0.5192604 0.6302993 0.5864458 0.3045181 0.3798193
#> Balanced.Accuracy cm_ts
#> 1 0.6685887 2023-12-20 20:13:52