Introduction


The Notebook is divided into tabbed sections as seen below.

Purpose

Our business problem is:

  • Profit is stagnant because the number of sales is flat
  • There are a lot of customer leads, but we can’t reach all of them
  • We need to know which customer leads should be prioritized
  • We need lead scoring so that we can target potential buyers efficiently

The business goals are determined together with the other departments and follow on from the background problem:

  • Gain insight on what drives people to buy our product
  • Increase profit by 10% year over year
  • Reduce annual marketing cost by 10%

In the Data Understanding phase, we gather, describe, and explore the data to make sure it fits the business goal.

The deliverable or result of this phase should include:

  • Data description
  • Early data exploration report
  • Data quality report

    Answers to these research questions will appear in green checkboxes

The Automotive Sales dataset:

The collected data consists of 40,000 distinct customers with 14 variables. The description of each column/variable can be seen below:

Variable      Description
flag          Whether the customer has bought the target product or not
gender        Gender of the customer
education     Education background of the customer
house_val     Value of the residence the customer lives in
age           Age group of the customer
online        Whether the customer has online shopping experience or not
customer_psy  Consumer psychology classification based on the area of residence
marriage      Marriage status of the customer
child         Whether the customer has children or not
occupation    Career information of the customer
mortgage      Housing loan information of the customer
house_owner   Whether the customer owns a house or not
region        Area in which the customer is located
fam_income    Family income information of the customer (A is the lowest, L is the highest)

Set up Environment


Load packages and import data.

# load required packages in one go (pacman installs any that are missing);
# ConfusionTableR and plotly are used later in the evaluation section
library(pacman)
pacman::p_load(tidyverse, tidymodels, caret, rpart, rpart.plot, rattle,
               randomForest, odbc, DBI, RSQLite, scales, ggthemes,
               gridExtra, DT, corrplot, broom, ConfusionTableR, plotly)

# import the sales dataset
df_sales <- read.csv("sales.csv")
dim(df_sales)
#> [1] 40000    14


There are 40,000 rows in the dataset and 14 columns: 13 predictors plus the flag target.

# look at the data
datatable(df_sales, options = list(pageLength = 5))

We can also inspect the structure of the data:

glimpse(df_sales)
#> Rows: 40,000
#> Columns: 14
#> $ flag         <chr> "Y", "N", "N", "Y", "Y", "Y", "Y", "N", "N", "Y", "N", "Y~
#> $ gender       <chr> "M", "F", "M", "M", "F", "F", "M", "F", "F", "M", "M", "F~
#> $ education    <chr> "4. Grad", "3. Bach", "2. Some College", "2. Some College~
#> $ house_val    <int> 756460, 213171, 111147, 354151, 117087, 248694, 2000000, ~
#> $ age          <chr> "1_Unk", "7_>65", "2_<=25", "2_<=25", "1_Unk", "6_<=65", ~
#> $ online       <chr> "N", "N", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "N~
#> $ customer_psy <chr> "B", "E", "C", "B", "J", "B", "A", "C", "G", "C", "C", "F~
#> $ marriage     <chr> "", "", "", "Single", "Married", "Married", "Married", "M~
#> $ child        <chr> "U", "U", "Y", "U", "Y", "N", "U", "Y", "Y", "U", "Y", "U~
#> $ occupation   <chr> "Professional", "Professional", "Professional", "Sales/Se~
#> $ mortgage     <chr> "1Low", "1Low", "1Low", "1Low", "1Low", "2Med", "1Low", "~
#> $ house_owner  <chr> "", "Owner", "Owner", "", "", "Owner", "", "Owner", "Rent~
#> $ region       <chr> "Midwest", "Northeast", "Midwest", "West", "South", "West~
#> $ fam_income   <chr> "L", "G", "J", "L", "H", "G", "C", "I", "D", "G", "E", "E~

Compare some data cleaning procedures in R and SQL


Click on the tabs for different aspects of the Exploratory Data Analysis and cleaning.

Initial exploration

We can check the full frequency tables for the customer_psy and fam_income columns, since they contain many categories.

table("Customer Psychology" = df_sales$customer_psy)
#> Customer Psychology
#>    A    B    C    D    E    F    G    H    I    J    U 
#> 1427 8197 7830 2353 6650 4058 3951  958 2262 2187  127
table("Family Income" = df_sales$fam_income)
#> Family Income
#>    A    B    C    D    E    F    G    H    I    J    K    L    U 
#> 2274 2169 2687 4582 8432 6641 4224 2498 1622 1614 1487 1617  153

There are some interesting findings in the summary. For example, the gender column consists of three categories: F (Female), M (Male), and U (Unknown). The child column is similar, with the additional values U (Unknown) and 0 (zero), even though it should only contain Yes or No. The marriage and education columns contain empty values. This is not surprising, since the sales team was never instructed to fill each column with pre-determined values. However, it means that the quality of the incoming data is poor and will require standardization in the future. It also shows that we need to cleanse and prepare the data before any analysis so that all relevant information can be captured.
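Before cleansing, we can quantify these issues by counting the empty strings and "U" codes in each column. Below is a minimal sketch using dplyr’s across() helper (the exact counts will vary by column):

# count empty strings and "U" codes in every character column
df_sales |> 
  summarise(across(where(is.character), ~ sum(.x %in% c("", "U")))) |> 
  glimpse()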

Data Cleansing in R

In this step, we handle the issues found during the data understanding phase. Based on our findings, we will:

  • Change missing/empty values in education, house_owner and marriage into an explicit Unknown
  • Turn the U value in every categorical column into an explicit Unknown
  • Cleanse the age categories by removing the index prefix (1_Unk becomes Unknown, 2_<=25 becomes <=25, etc.)
  • Cleanse the mortgage categories by removing the index prefix
# recode categories and make missing values explicit
df_clean <- df_sales |> 
  mutate(
    flag = ifelse(flag == "Y", "Yes", "No"),
    gender = case_when(gender == "F" ~ "Female",
                       gender == "M" ~ "Male",
                       TRUE ~ "Unknown"),    
    child = case_when(child == "Y" ~ "Yes",
                      child == "N" ~ "No",
                      TRUE ~ "Unknown"),    
    education = str_remove_all(education, "[0-9.]") |> 
      str_trim(),
    education = ifelse(education == "", "Unknown", education),
    age = substr(age, 3, nchar(age)),
    age = ifelse(age == "Unk", "Unknown", age),
    age = case_when(age == "<=35" ~ "26-35",
                    age == "<=45" ~ "36-45",
                    age == "<=55" ~ "46-55",
                    age == "<=65" ~ "56-65",
                    TRUE ~ age
                    ),
    fam_income = ifelse(fam_income == "U", "Unknown", fam_income),
    customer_psy = ifelse(customer_psy == "U", "Unknown", customer_psy),
    online = ifelse(online == "Y", "Yes", "No"),
    marriage = ifelse(marriage == "", "Unknown", marriage),
    mortgage = substr(mortgage, 2, nchar(mortgage)),
    house_owner = ifelse(house_owner == "", "Unknown", house_owner)
  )

# treat explicit Unknowns as missing values, drop those rows,
# and convert the remaining character columns to factors
df_clean <- df_clean |>
  mutate_all(function(x) ifelse(x == "Unknown", NA, x)) |> 
  drop_na() |> 
  mutate_if(is.character, as.factor) |> 
  mutate(
    flag = factor(flag, levels = c("Yes", "No"))
  )
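After cleansing, we can spot-check that the recoding worked, for example that the index prefixes are gone and the target levels are ordered as intended. A quick sketch (output not shown here):

# spot-check the cleaned categories
levels(df_clean$age)
levels(df_clean$mortgage)
table(df_clean$flag)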

Data Cleansing in SQL

We copy the raw data into an in-memory SQLite database and run the cleaning query there, collecting the result back into R as df_clean_sql.

sales_com <- dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(sales_com, df_sales)

df_clean_sql <- dbGetQuery(sales_com, "
SELECT *, CASE WHEN flag = 'Y' THEN 'Yes'
               ELSE 'No'
               END AS flag_clean,
          CASE WHEN gender = 'F' THEN 'Female'
               WHEN gender = 'M' THEN 'Male'
               ELSE 'Unknown'
               END AS gender_clean,
          CASE WHEN child = 'Y' THEN 'Yes'
               WHEN child = 'N' THEN 'No'
               ELSE 'Unknown' 
               END AS child_clean,
               CASE WHEN LTRIM(RTRIM(SUBSTR(education,3,13)))='' THEN 'Unknown'
                    ELSE LTRIM(RTRIM(SUBSTR(education,3,13))) END AS education_clean,
               CASE WHEN SUBSTR(age,3,6) = '<=35' THEN '26-35'
                    WHEN SUBSTR(age,3,6) = '<=45' THEN '36-45'
                    WHEN SUBSTR(age,3,6) = '<=55' THEN '46-55'
                    WHEN SUBSTR(age,3,6) = '<=65' THEN '56-65'
                    ELSE SUBSTR(age,3,6) END AS age_clean,
               CASE WHEN fam_income = 'U' THEN 'Unknown'
                    ELSE fam_income
                    END AS fam_income_clean,
               CASE WHEN customer_psy = 'U' THEN 'Unknown'
                    ELSE customer_psy
                    END AS customer_psy_clean,
               CASE WHEN online = 'Y' THEN 'Yes'
                    ELSE 'No' END AS online_clean,
               CASE WHEN marriage = '' THEN 'Unknown'
                    ELSE marriage END AS marriage_clean,
               SUBSTR(mortgage, 2, 4) AS mortgage_clean,
               CASE WHEN house_owner = '' THEN 'Unknown'
                    ELSE house_owner END AS house_owner_clean
FROM df_sales;
")

Do some final touches on the data cleaned with SQL:

df_clean_sql <- df_clean_sql |> 
  select(-flag, -gender, -child, -education, -fam_income, -customer_psy,
         -online, -marriage, -mortgage, -house_owner) |> 
  relocate(flag_clean, gender_clean, child_clean,
           education_clean, fam_income_clean,
           customer_psy_clean, online_clean,
           marriage_clean, mortgage_clean, house_owner_clean) |>
  mutate_all(function(x) ifelse(x == "Unknown", NA, x)) |> 
  drop_na() |> 
  mutate_if(is.character, as.factor) |> 
  mutate(
    flag = factor(flag_clean, levels = c("Yes", "No"))
  )

# copy the cleaned data into a fresh in-memory database for any later SQL queries
sales_com <- dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(sales_com, df_clean_sql)
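As a sanity check, we can compare the shape of the two cleaned data frames. The two routes are not expected to match exactly, e.g. the SQL version keeps both age and age_clean:

# compare the result of the R route and the SQL route
dim(df_clean)
dim(df_clean_sql)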

Exploratory Analysis


Click on the tabs for different aspects of the Exploratory Data Analysis and visualisation.

House Valuation Distribution

Here we visualize the data to see whether there is any difference between customers who buy our product and those who don’t. To visualize a distribution, we can use a histogram: the x-axis shows the house valuation, while the y-axis shows the frequency, i.e. the number of customers with a given house valuation.

Insights

From the histogram, most of our customers have house valuations below 2,500,000. A few customers are outliers with house valuations above 2,500,000; their frequency is so low that they are invisible on the histogram. The distributions for buyers and non-buyers are quite similar, so we cannot decide whether a customer will buy our product from house valuation alone.

df_clean_sql |> 
  ggplot(aes(x = house_val, fill = flag_clean)) +
  geom_histogram(color = "white", alpha = 0.75) +
  theme(legend.position = "top") +
  ggthemes::scale_fill_stata()+
  scale_x_continuous(labels = number_format(big.mark = ",")) +
  scale_y_continuous(labels = number_format(big.mark = ",")) +
  labs(x = "House Valuation", y = "Frequency",
       fill = "Flag",
       title = "House Valuation Distribution")

We can limit the x-axis to remove the outliers and see the distribution better.

df_clean_sql |> 
  ggplot(aes(x = house_val, fill = flag_clean)) +
  geom_histogram(color = "white", alpha = 0.75) +
  theme(legend.position = "top") +
   ggthemes::scale_fill_stata()+
  scale_x_continuous(labels = number_format(big.mark = ","), limits = c(0, 2*1e6)) +
  scale_y_continuous(labels = number_format(big.mark = ",")) +
  labs(x = "House Valuation", y = "Frequency",
       fill = "Flag",
       title = "House Valuation Distribution")

Education Level

We will see whether education level can be a good indicator of a customer’s probability of buying our product. The color of each block represents the frequency of people who fall into that category, with brighter colors indicating higher frequency.

Based on the heatmap, people with higher education levels (Bach and Grad) are more likely to buy our product. Therefore, education level may be a useful indicator when screening potential customers.

df_clean_sql |> 
  count(flag_clean, education_clean) |> 
  ggplot(aes(education_clean, flag_clean, fill = n)) +
  geom_tile(color = "white") +
  geom_text(aes(label = number(n, big.mark = ","))) +
  scale_fill_binned(low = "firebrick", high = "lightyellow",
                    label = number_format(big.mark = ",")) +
  theme(legend.position = "top",
        legend.key.width = unit(15, "mm")) +
  labs(x = "Education", 
       y = "Flag", 
       fill = "Frequency")

Occupation

We will do the same with occupation. The one that stands out is the Professional occupation, which has a very high frequency of people who buy our product.

df_clean_sql |> 
  count(flag_clean, occupation) |> 
  ggplot(aes(occupation, flag_clean, fill = n)) +
  geom_tile(color = "white") +
  geom_text(aes(label = number(n, big.mark = ","))) +
  scale_fill_binned(low = "purple", high = "lightyellow",
                    label = number_format(big.mark = ",")) +
  theme(legend.position = "top",
        legend.key.width = unit(15, "mm")) +
  labs(x = "Occupation", 
       y = "Flag", 
       fill = "Frequency")


Modeling process


In this section we will use tidymodels and related packages to model our data with a decision tree, a random forest, and logistic regression.

Model Development

Train-Test Split

Create train and test data

In this step, we split our data into two separate datasets: a training dataset and a testing dataset. The split is stratified on flag so that both datasets keep the same class proportions.

  • Training Dataset: the dataset used to train the machine learning model
  • Testing Dataset: the dataset used to evaluate the performance of the model
set.seed(123)
# split the dataset 80/20, stratified on the target
td_split <- initial_split(df_clean,
                          strata = flag,
                          prop = 0.8)

train <- training(td_split)
test <- testing(td_split)
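Because the split is stratified on flag, both sets should keep roughly the same class balance. A quick check:

# class proportions should be similar in train and test
prop.table(table(train$flag))
prop.table(table(test$flag))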

Model Fitting

Decision Tree

We start with a decision tree, fitted with rpart using all predictors:

# fit a decision tree predicting flag from every other variable
model_tree <- rpart(flag ~ ., data = train)

fancyRpartPlot(model_tree, sub = NULL, main = "Predictive Rule from Decision Tree\n")

Random Forest

The next model is the Random Forest. In short, a Random Forest is an ensemble of decision trees whose individual votes are combined into a single prediction.

set.seed(1234)
model_rf <- randomForest(flag ~ ., data = train,
                         ntree = 500,
                         nodesize = 1,
                         mtry = 2,
                         importance = T
                         )

model_rf$importance |> 
  as.data.frame() |> 
  rownames_to_column("var") |> 
  ggplot(aes(x = MeanDecreaseGini,
             fill = MeanDecreaseGini,
             y = var |> reorder(MeanDecreaseGini))) +
  geom_col() +
  scale_fill_binned(low = "firebrick", high = "lightyellow")+
  labs(x = "Gini Decrease",
       fill = "Gini Decrease",
       y = "Predictors",
       title = "Variable Importance")

Model Evaluation

In classification problems, we evaluate a model by counting how many of its predictions are correct. These counts can be arranged into what is called a confusion matrix.

The matrix is divided into four areas:

  • True Positive (TP): the model predicts the customer will buy, and the prediction is correct (the customer buys)
  • False Positive (FP): the model predicts the customer will buy, and the prediction is incorrect (the customer does not buy)
  • True Negative (TN): the model predicts the customer will not buy, and the prediction is correct (the customer does not buy)
  • False Negative (FN): the model predicts the customer will not buy, and the prediction is incorrect (the customer buys)

For example, here is the confusion matrix of the decision tree’s predictions on the testing dataset.

pred_test <- predict(model_tree,test, type = "class")
pred_prob <- predict(model_tree,test, type = "prob")

df_prediction_tree <- as.data.frame(pred_prob) |> 
  mutate(prediction = pred_test,
         truth =test$flag) 

conf_mat(df_prediction_tree, truth = truth, estimate = prediction)
#>           Truth
#> Prediction  Yes   No
#>        Yes 1728  865
#>        No   219  508

If we define positive as Yes:

  • True Positive (TP): 1728
  • False Positive (FP): 865
  • True Negative (TN): 508
  • False Negative (FN): 219

Next, we can evaluate the model using three different metrics: accuracy, recall, and precision. These metrics are quite general and complement each other. There are more evaluation metrics, but we will not discuss them here.

  • Accuracy

Accuracy simply tells us how many predictions are correct relative to the whole testing dataset.

\[ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \]

(1728 + 508) / (1728 + 508+ 865 + 219)
#> [1] 0.673494

Of all the data in the testing dataset, 67.3% are correctly predicted as buy/not buy.

\[ Accuracy = \frac{1728 + 508}{1728 + 508+ 865 + 219} = 0.673 = 67.3\% \]

  • Sensitivity/Recall

Recall (or sensitivity) concerns only how many of the customers who actually buy are correctly predicted. The metric does not care about customers who do not buy our product.

\[ Recall = \frac{TP}{TP + FN} \]

1728 / (1728 + 219)
#> [1] 0.8875193

Of all the customers who actually buy our product, 88.75% are correctly predicted as buyers; the remaining 11.25% are wrongly predicted as non-buyers.

\[ Recall = \frac{1728}{1728 + 219} = 0.8875 = 88.75\% \]

  • Precision

Precision concerns only how many of the positive predictions are actually correct. The metric does not care about customers who are predicted not to buy.

\[ Precision = \frac{TP}{TP + FP} \]

1728 / (1728 + 865)
#> [1] 0.6664096

Of all the customers predicted to buy, only 66.64% actually buy our product.

\[ Precision = \frac{1728}{1728 + 865} = 0.6664 = 66.64\% \]
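Since we are already in the tidymodels ecosystem, the same three metrics can be read off the confusion matrix with yardstick’s summary() method. A sketch using the df_prediction_tree frame built above:

# accuracy, sensitivity (recall) and precision from yardstick
conf_mat(df_prediction_tree, truth = truth, estimate = prediction) |> 
  summary() |> 
  filter(.metric %in% c("accuracy", "sens", "precision"))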

Random Forest

Here is a recap of the evaluation metrics for the Random Forest. This model is slightly better than the Decision Tree.

pred_test <- predict(model_rf,test)
pred_prob <- predict(model_rf,test, type = "prob")

df_prediction_rf <- as.data.frame(pred_prob) |> 
  mutate(prediction = pred_test,
         truth =test$flag)

conf_mat(df_prediction_rf, truth = truth, estimate = prediction)
#>           Truth
#> Prediction  Yes   No
#>        Yes 1582  641
#>        No   365  732

\[ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \]

(1582 + 732) / (1582 + 732 + 641 + 365)
#> [1] 0.696988

The Random Forest reaches roughly 69.7% accuracy.
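For completeness, our arithmetic from the matrix above also gives the Random Forest’s recall and precision:

1582 / (1582 + 365) # recall
#> [1] 0.8125321
1582 / (1582 + 641) # precision
#> [1] 0.7116509

Compared with the decision tree, the Random Forest trades some recall (88.75% down to 81.25%) for better precision (66.64% up to 71.17%).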

Logistic Regression

train_rcp <- recipes::recipe(flag ~ ., data=train) |> 
  step_zv(all_predictors()) |> 
  step_dummy(all_nominal_predictors())

# Prep and bake the recipe so we can view this as a separate data frame
training_df <- train_rcp |> 
  prep() |> 
  juice()

Initialise the model

lr_mod <- parsnip::logistic_reg() |> 
  set_engine('glm')

Add to workflow

lr_wf <-
  workflow() |> 
  add_model(lr_mod) |> 
  add_recipe(train_rcp)
lr_fit <- 
  lr_wf |> 
  fit(data=train)

Extracting the fitted data

I want to pull the fitted model’s coefficient estimates into a tibble I can explore. This can be done below:

lr_fitted <- lr_fit |> 
  extract_fit_parsnip() |> 
  tidy()

I will visualise this via a bar chart to observe my significant features:


lr_fitted_add <- lr_fitted  |> 
  mutate(Significance = ifelse(p.value < 0.05, 
                               "Significant", "Insignificant")) |> 
  arrange(desc(p.value)) 
#Create a ggplot object to visualise significance
plot <- lr_fitted_add |> 
  ggplot(mapping = aes(x=fct_reorder(term,p.value), y=p.value, fill=Significance)) +
  geom_col() + 
  scale_fill_calc()+
  theme(axis.text.x = element_text(face="bold", 
                                   color="#0070BA", 
                                   size=8, 
                                   angle=90)) + 
  labs(y="P value", 
       x="Terms",
       title="P value significance chart",
       subtitle="A chart to represent the significant variables in the model",
       caption="Produced by Bongani")

plotly::ggplotly(plot) 

Use the fitted logistic regression model to predict on the test set

We now generate class and probability predictions from our baseline model and append them to the test set, building a predictions data frame:

# probability and class predictions on the test set
testing_lr_fit_probs <- predict(lr_fit, test, type = 'prob')
testing_lr_fit_class <- predict(lr_fit, test)

predictions <- cbind(test, testing_lr_fit_probs, testing_lr_fit_class)

predictions <- predictions |> 
  dplyr::mutate(log_reg_model_pred = .pred_class,
                log_reg_model_prob = .pred_Yes) |> 
  dplyr::select(everything(), -c(.pred_class, .pred_No)) |> 
  # label a lead "Yes" only when the predicted probability exceeds 0.7
  dplyr::mutate(log_reg_class_custom = ifelse(log_reg_model_prob > 0.7, "Yes", "No")) |> 
  dplyr::select(-.pred_Yes)


# Get a head view of the finalised data
head(predictions)
#>   flag gender    education house_val   age online customer_psy marriage child
#> 1   No Female Some College    136729 36-45    Yes            C  Married   Yes
#> 2  Yes   Male Some College    325081 46-55    Yes            C  Married   Yes
#> 3   No   Male         Grad         0 46-55    Yes            G   Single   Yes
#> 4   No Female         Grad    248520   >65    Yes            D  Married    No
#> 5   No Female          <HS         0 36-45     No            G   Single   Yes
#> 6   No Female Some College    654177 36-45     No            E   Single   Yes
#>      occupation mortgage house_owner    region fam_income log_reg_model_pred
#> 1   Blue Collar      Low       Owner   Midwest          G                 No
#> 2 Sales/Service      Low       Owner   Midwest          J                Yes
#> 3  Professional      Low       Owner Northeast          D                Yes
#> 4  Professional      Low       Owner Northeast          E                 No
#> 5 Sales/Service      Low      Renter     South          A                 No
#> 6 Sales/Service      Low      Renter      West          D                 No
#>   log_reg_model_prob log_reg_class_custom
#> 1          0.4826369                   No
#> 2          0.7762933                  Yes
#> 3          0.7164619                  Yes
#> 4          0.4133618                   No
#> 5          0.1384484                   No
#> 6          0.3503392                   No
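As a quick check on the stricter cut-off, we can cross-tabulate the default class predictions against the custom 0.7-threshold classes. A sketch (output not shown):

# cross-tabulate default predictions vs the 0.7 cut-off classes
table(default = predictions$log_reg_model_pred,
      custom = predictions$log_reg_class_custom)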

Evaluating the models with the ConfusionTableR package

The caret confusionMatrix() output is designed for printing and doesn’t make it easy to work with the individual values; ConfusionTableR wraps it and additionally returns the results in a tidy, row-wise form.

Evaluate Logistic Regression baseline

First, I will evaluate my baseline model using the package:

cm_lr <- ConfusionTableR::binary_class_cm(
  #Here you will have to cast to factor type as the tool expects factors
  train_labels = as.factor(predictions$log_reg_class_custom),
  truth_labels = as.factor(predictions$flag),
  positive='Yes', mode='everything'
  )

# View the confusion matrix native
cm_lr$confusion_matrix
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  Yes   No
#>        Yes 1011  250
#>        No   936 1123
#>                                           
#>                Accuracy : 0.6428          
#>                  95% CI : (0.6262, 0.6591)
#>     No Information Rate : 0.5864          
#>     P-Value [Acc > NIR] : 1.755e-11       
#>                                           
#>                   Kappa : 0.314           
#>                                           
#>  Mcnemar's Test P-Value : < 2.2e-16       
#>                                           
#>             Sensitivity : 0.5193          
#>             Specificity : 0.8179          
#>          Pos Pred Value : 0.8017          
#>          Neg Pred Value : 0.5454          
#>               Precision : 0.8017          
#>                  Recall : 0.5193          
#>                      F1 : 0.6303          
#>              Prevalence : 0.5864          
#>          Detection Rate : 0.3045          
#>    Detection Prevalence : 0.3798          
#>       Balanced Accuracy : 0.6686          
#>                                           
#>        'Positive' Class : Yes             
#> 

The baseline model performs reasonably well; the high precision partly reflects the stricter 0.7 probability cut-off we applied. Let’s work with the output in a row-wise fashion, as we can extract some metrics we may be interested in:

# Get record level confusion matrix for logistic regression model
cm_rl_log_reg <- cm_lr$record_level_cm
accuracy_frame <- tibble(
  Accuracy=cm_rl_log_reg$Accuracy,
  Kappa=cm_rl_log_reg$Kappa,
  Precision=cm_rl_log_reg$Precision,
  Recall=cm_rl_log_reg$Recall
)
cm_rl_log_reg
#>   Pred_Yes_Ref_Yes Pred_No_Ref_Yes Pred_Yes_Ref_No Pred_No_Ref_No  Accuracy
#> 1             1011             936             250           1123 0.6427711
#>       Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue
#> 1 0.3140479     0.6262006     0.6590897    0.5864458   1.755412e-11
#>   McnemarPValue Sensitivity Specificity Pos.Pred.Value Neg.Pred.Value Precision
#> 1  4.906889e-88   0.5192604    0.817917      0.8017446      0.5454104 0.8017446
#>      Recall        F1 Prevalence Detection.Rate Detection.Prevalence
#> 1 0.5192604 0.6302993  0.5864458      0.3045181            0.3798193
#>   Balanced.Accuracy               cm_ts
#> 1         0.6685887 2023-12-20 20:13:52
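Finally, accuracy_frame condenses the headline metrics; with default tibble printing it should look roughly like this (values taken from the record-level output above):

accuracy_frame
#> # A tibble: 1 × 4
#>   Accuracy Kappa Precision Recall
#>      <dbl> <dbl>     <dbl>  <dbl>
#> 1    0.643 0.314     0.802  0.519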