Student Info

Matric No: 17127309

Name: Md Azrul Syaffiq Bin Md Suhaimi

Course: WQD7005 DATA MINING

Introduction

This analysis looks at e-commerce transactions generated using synthetic data generation, spanning 24 months across different acquisition platforms (Facebook, Instagram, TikTok, X, Outlet & Online Ads). The dataset contains 26 columns, which are listed in the structure check under Data Import and Pre-processing below.

Objectives

The primary objective is to analyze customer behavior and transactions to gain actionable insights. This involves:

  1. Segmenting customers using clustering techniques to identify distinct groups for targeted marketing.

  2. Predicting customer churn using machine learning models to understand factors contributing to customer attrition.

  3. Visualizing key trends in transactions, such as monthly patterns, to aid in strategic decision-making.

Methodology

We use R (in RStudio) for this exploration. The analysis consists of four main steps:

  1. Data Import and Pre-processing
  2. Decision Tree Analysis
  3. Advanced Customer Segmentation using PCA, K-means and DBSCAN
  4. Ensemble Methods for Churn Prediction using Bagging, XGBoost and Random Forest

Data Source

You may download the data source and Rmd file from the list below

———————–

1. Data Import and Pre-processing

———————–

Set the working directory.

Change this path to match your own working directory.

setwd("/Users/azrulsyaffiq/Desktop/Study/UM/Semester I/WQD7005 DATA MINING/AA2")

Load necessary libraries.

Essential libraries for tasks such as data pre-processing (dplyr), machine learning (caret, randomForest, xgboost), visualization (ggplot2), and clustering (dbscan)

library(dplyr) # For efficient data manipulation and import.
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr) # For reading the CSV dataset efficiently.
library(caret) # For splitting datasets, model training, and evaluation.
## Loading required package: ggplot2
## Loading required package: lattice
library(rpart) # To build decision tree models.
library(rpart.plot) # To visualize decision tree models.
library(cluster) # For clustering algorithms, e.g. K-means.
library(factoextra) # For visualizing clustering results and Principal Component Analysis (PCA).
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(randomForest) # For ensemble methods
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(xgboost) # For ensemble methods
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
library(ROCR) # Visualize and evaluate the performance of classification models (ROC & AUC)
library(ggplot2) # For creating clean and customizable visualizations.
library(dbscan) # To perform density-based clustering (DBSCAN).
## 
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
## 
##     as.dendrogram

Import the dataset.

Reads the CSV dataset into R as a dataframe for analysis

ecommerce_data <- read_csv("/Users/azrulsyaffiq/Desktop/Study/UM/Semester I/WQD7005 DATA MINING/AA2/ecommerce_combined_dataset_with_acquisition.csv")
## Rows: 5000 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): TransactionID, CustomerID, ProductID, ProductName, Category, Payme...
## dbl (13): Date, Quantity, PricePerUnit, TotalAmount, Age, TotalPurchases, To...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Check the structure of the dataset

str(ecommerce_data)
## spc_tbl_ [5,000 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ TransactionID         : chr [1:5000] "T00001" "T00002" "T00003" "T00004" ...
##  $ CustomerID            : chr [1:5000] "C0371" "C0263" "C0017" "C0471" ...
##  $ Date                  : num [1:5000] 19787 20053 19963 20065 19758 ...
##  $ ProductID             : chr [1:5000] "P183" "P339" "P881" "P378" ...
##  $ ProductName           : chr [1:5000] "Aweufwkxrz" NA "Pmyxjiw" "Aupvxa" ...
##  $ Category              : chr [1:5000] "Electronics" "Clothing" "Clothing" "Accessories" ...
##  $ Quantity              : num [1:5000] 2 2 1 2 4 3 4 2 5 1 ...
##  $ PricePerUnit          : num [1:5000] 168.4 259.9 23.1 205.8 417.9 ...
##  $ PaymentMethod         : chr [1:5000] "Apple Pay" "PayPal" "Credit Card" "Credit Card" ...
##  $ Status                : chr [1:5000] "Canceled" "Canceled" "Completed" "Pending" ...
##  $ TotalAmount           : num [1:5000] 336.7 519.9 23.1 411.7 1671.5 ...
##  $ Age                   : num [1:5000] 64 47 40 62 22 20 47 NA 58 32 ...
##  $ Gender                : chr [1:5000] NA "Female" "Female" "Female" ...
##  $ Location              : chr [1:5000] "New York" "San Jose" "Philadelphia" "Dallas" ...
##  $ MembershipLevel       : chr [1:5000] "Platinum" "Bronze" "Gold" "Gold" ...
##  $ TotalPurchases        : num [1:5000] 58 90 1 56 74 57 42 86 15 57 ...
##  $ TotalSpent            : num [1:5000] 2742 860 7867 536 5402 ...
##  $ FavoriteCategory      : chr [1:5000] "Home Goods" "Home Goods" "Clothing" "Electronics" ...
##  $ LastPurchaseDate      : num [1:5000] 19805 19889 19927 19939 20008 ...
##  $ WebsiteClickRate      : num [1:5000] 6.26 6.01 6.67 7.58 1.82 3.68 7.66 2.01 1.44 5.1 ...
##  $ TimeSpentOnSite       : num [1:5000] 29.9 6.4 17 24.9 25.8 ...
##  $ SocialMediaEngagement : chr [1:5000] "Low" "Medium" "Medium" "Low" ...
##  $ AdClickHistory        : num [1:5000] 8.24 1.04 1.16 3.96 2.55 ...
##  $ CustomerSentimentScore: num [1:5000] 0.89 0.36 0.62 0.5 0.9 0.34 0.52 0.82 0.84 0.4 ...
##  $ Churn                 : num [1:5000] 0 1 0 0 1 0 1 0 0 0 ...
##  $ AcquisitionSource     : chr [1:5000] "Outlet" "Online Ads" "Social Media - Facebook" "Outlet" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   TransactionID = col_character(),
##   ..   CustomerID = col_character(),
##   ..   Date = col_double(),
##   ..   ProductID = col_character(),
##   ..   ProductName = col_character(),
##   ..   Category = col_character(),
##   ..   Quantity = col_double(),
##   ..   PricePerUnit = col_double(),
##   ..   PaymentMethod = col_character(),
##   ..   Status = col_character(),
##   ..   TotalAmount = col_double(),
##   ..   Age = col_double(),
##   ..   Gender = col_character(),
##   ..   Location = col_character(),
##   ..   MembershipLevel = col_character(),
##   ..   TotalPurchases = col_double(),
##   ..   TotalSpent = col_double(),
##   ..   FavoriteCategory = col_character(),
##   ..   LastPurchaseDate = col_double(),
##   ..   WebsiteClickRate = col_double(),
##   ..   TimeSpentOnSite = col_double(),
##   ..   SocialMediaEngagement = col_character(),
##   ..   AdClickHistory = col_double(),
##   ..   CustomerSentimentScore = col_double(),
##   ..   Churn = col_double(),
##   ..   AcquisitionSource = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Summary of missing values in each column

colSums(is.na(ecommerce_data))
##          TransactionID             CustomerID                   Date 
##                      0                      0                      0 
##              ProductID            ProductName               Category 
##                      0                    250                      0 
##               Quantity           PricePerUnit          PaymentMethod 
##                      0                      0                    250 
##                 Status            TotalAmount                    Age 
##                    250                      0                    259 
##                 Gender               Location        MembershipLevel 
##                    244                    245                    269 
##         TotalPurchases             TotalSpent       FavoriteCategory 
##                      0                      0                      0 
##       LastPurchaseDate       WebsiteClickRate        TimeSpentOnSite 
##                      0                      0                      0 
##  SocialMediaEngagement         AdClickHistory CustomerSentimentScore 
##                      0                      0                      0 
##                  Churn      AcquisitionSource 
##                      0                      0

Here you can see that there are 7 columns with missing values:

  • ProductName

  • PaymentMethod

  • Status

  • Age

  • Gender

  • Location

  • MembershipLevel

We need to handle these columns.

Identify columns with any missing values

missing_cols <- names(which(colSums(is.na(ecommerce_data)) > 0))
print(paste("Columns with missing values:", paste(missing_cols, collapse = ", "))) # Columns with missing values: ProductName, PaymentMethod, Status, Age, Gender, Location, MembershipLevel
## [1] "Columns with missing values: ProductName, PaymentMethod, Status, Age, Gender, Location, MembershipLevel"
# Proportion of missing values in each column
colMeans(is.na(ecommerce_data)) * 100
##          TransactionID             CustomerID                   Date 
##                   0.00                   0.00                   0.00 
##              ProductID            ProductName               Category 
##                   0.00                   5.00                   0.00 
##               Quantity           PricePerUnit          PaymentMethod 
##                   0.00                   0.00                   5.00 
##                 Status            TotalAmount                    Age 
##                   5.00                   0.00                   5.18 
##                 Gender               Location        MembershipLevel 
##                   4.88                   4.90                   5.38 
##         TotalPurchases             TotalSpent       FavoriteCategory 
##                   0.00                   0.00                   0.00 
##       LastPurchaseDate       WebsiteClickRate        TimeSpentOnSite 
##                   0.00                   0.00                   0.00 
##  SocialMediaEngagement         AdClickHistory CustomerSentimentScore 
##                   0.00                   0.00                   0.00 
##                  Churn      AcquisitionSource 
##                   0.00                   0.00
# Count total missing values in the dataset
total_missing <- sum(is.na(ecommerce_data))
print(paste("Total missing values:", total_missing)) # Total missing values: 1,767
## [1] "Total missing values: 1767"
# Check for NaN values
nan_values <- colSums(is.nan(as.matrix(ecommerce_data)))
print(paste("NaN values per column:", nan_values)) # No NaN values
##  [1] "NaN values per column: 0" "NaN values per column: 0"
##  [3] "NaN values per column: 0" "NaN values per column: 0"
##  [5] "NaN values per column: 0" "NaN values per column: 0"
##  [7] "NaN values per column: 0" "NaN values per column: 0"
##  [9] "NaN values per column: 0" "NaN values per column: 0"
## [11] "NaN values per column: 0" "NaN values per column: 0"
## [13] "NaN values per column: 0" "NaN values per column: 0"
## [15] "NaN values per column: 0" "NaN values per column: 0"
## [17] "NaN values per column: 0" "NaN values per column: 0"
## [19] "NaN values per column: 0" "NaN values per column: 0"
## [21] "NaN values per column: 0" "NaN values per column: 0"
## [23] "NaN values per column: 0" "NaN values per column: 0"
## [25] "NaN values per column: 0" "NaN values per column: 0"

Handle missing values.

  1. We replace missing values: numerical columns (e.g. Age) with the median, and categorical columns (ProductName, PaymentMethod, Status, Gender, Location & MembershipLevel) with “Others”. The chunk below imputes Age, ProductName, PaymentMethod and Status; a sketch handling the remaining categorical columns follows the glimpse output.

  2. We also convert the ‘Date’ column into a proper Date format.

ecommerce_data$Age[is.na(ecommerce_data$Age)] <- median(ecommerce_data$Age, na.rm = TRUE)
ecommerce_data$ProductName[is.na(ecommerce_data$ProductName)] <- "Others"
ecommerce_data$PaymentMethod[is.na(ecommerce_data$PaymentMethod)] <- "Others"
ecommerce_data$Status[is.na(ecommerce_data$Status)] <- "Others"
ecommerce_data$Date <- as.Date(ecommerce_data$Date, format = "%Y-%m-%d")

# Glimpse of the data to check that the structure is well formatted
glimpse(ecommerce_data) 
## Rows: 5,000
## Columns: 26
## $ TransactionID          <chr> "T00001", "T00002", "T00003", "T00004", "T00005…
## $ CustomerID             <chr> "C0371", "C0263", "C0017", "C0471", "C0030", "C…
## $ Date                   <date> 2024-03-05, 2024-11-26, 2024-08-28, 2024-12-08…
## $ ProductID              <chr> "P183", "P339", "P881", "P378", "P969", "P534",…
## $ ProductName            <chr> "Aweufwkxrz", "Others", "Pmyxjiw", "Aupvxa", "C…
## $ Category               <chr> "Electronics", "Clothing", "Clothing", "Accesso…
## $ Quantity               <dbl> 2, 2, 1, 2, 4, 3, 4, 2, 5, 1, 4, 5, 4, 2, 5, 5,…
## $ PricePerUnit           <dbl> 168.36, 259.94, 23.13, 205.83, 417.87, 325.80, …
## $ PaymentMethod          <chr> "Apple Pay", "PayPal", "Credit Card", "Credit C…
## $ Status                 <chr> "Canceled", "Canceled", "Completed", "Pending",…
## $ TotalAmount            <dbl> 336.72, 519.88, 23.13, 411.66, 1671.48, 977.40,…
## $ Age                    <dbl> 64, 47, 40, 62, 22, 20, 47, 41, 58, 32, 28, 25,…
## $ Gender                 <chr> NA, "Female", "Female", "Female", NA, "Male", "…
## $ Location               <chr> "New York", "San Jose", "Philadelphia", "Dallas…
## $ MembershipLevel        <chr> "Platinum", "Bronze", "Gold", "Gold", "Silver",…
## $ TotalPurchases         <dbl> 58, 90, 1, 56, 74, 57, 42, 86, 15, 57, 50, 70, …
## $ TotalSpent             <dbl> 2742.03, 860.06, 7867.30, 535.75, 5401.86, 7334…
## $ FavoriteCategory       <chr> "Home Goods", "Home Goods", "Clothing", "Electr…
## $ LastPurchaseDate       <dbl> 19805, 19889, 19927, 19939, 20008, 20082, 19838…
## $ WebsiteClickRate       <dbl> 6.26, 6.01, 6.67, 7.58, 1.82, 3.68, 7.66, 2.01,…
## $ TimeSpentOnSite        <dbl> 29.93, 6.40, 17.02, 24.88, 25.80, 10.83, 19.24,…
## $ SocialMediaEngagement  <chr> "Low", "Medium", "Medium", "Low", "High", "High…
## $ AdClickHistory         <dbl> 8.24, 1.04, 1.16, 3.96, 2.55, 3.50, 8.17, 15.94…
## $ CustomerSentimentScore <dbl> 0.89, 0.36, 0.62, 0.50, 0.90, 0.34, 0.52, 0.82,…
## $ Churn                  <dbl> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0,…
## $ AcquisitionSource      <chr> "Outlet", "Online Ads", "Social Media - Faceboo…
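
The glimpse above shows that Gender, Location and MembershipLevel still contain NA values, since the chunk above only imputes Age, ProductName, PaymentMethod and Status. A minimal sketch completing the imputation described earlier (replacing the remaining categorical columns with “Others”):

# Impute the remaining categorical columns with "Others", per the handling described above
ecommerce_data$Gender[is.na(ecommerce_data$Gender)] <- "Others"
ecommerce_data$Location[is.na(ecommerce_data$Location)] <- "Others"
ecommerce_data$MembershipLevel[is.na(ecommerce_data$MembershipLevel)] <- "Others"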

Save the final dataset

write.csv(ecommerce_data, "ecommerce_final_dataset.csv", row.names = FALSE)

Add synthetic data using random sampling.

Generate synthetic examples by sampling 80% of the original rows (producing exact duplicates) and appending them to the dataframe.

set.seed(123)
synthetic_x <- ecommerce_data %>% sample_frac(0.8)  # Sample 80% of the dataset
ecommerce_data <- bind_rows(ecommerce_data, synthetic_x)

# Feature engineering: Recency, Frequency, Monetary (RFM). Computes the RFM metrics:
#   Recency: Time since the customer’s last purchase using LastPurchaseDate column.
#   Frequency: Total number of purchases made by the customer using TotalPurchases column.
#   Monetary: Total amount spent by the customer using TotalSpent column.
ecommerce_data <- ecommerce_data %>%
  mutate(Recency = max(Date) - LastPurchaseDate,
         Frequency = TotalPurchases,
         Monetary = TotalSpent)

Normalize Features.

Standardize selected numeric columns (WebsiteClickRate, TimeSpentOnSite, AdClickHistory) to have a mean of 0 and a standard deviation of 1.

ecommerce_data <- ecommerce_data %>%
  mutate(across(c(WebsiteClickRate, TimeSpentOnSite, AdClickHistory), scale))

Visualize the distribution of key features (Recency & Frequency) using bar charts of binned values.

# Ensure Recency is numeric
ecommerce_data <- ecommerce_data %>%
  mutate(Recency = as.numeric(Recency))  # Convert Recency to numeric if it's a Date

# Segment the Recency variable
ecommerce_data <- ecommerce_data %>%
  mutate(Recency_Binned = cut(Recency, breaks = seq(0, max(Recency, na.rm = TRUE), by = 10), right = FALSE))

# Create the bar chart for Recency
ggplot(ecommerce_data, aes(x = Recency_Binned)) +
  geom_bar(fill = "blue", alpha = 0.7) +
  theme_minimal() +
  ggtitle("Distribution of Time since the customer’s last purchase") +
  xlab("Last Purchase Segment (in Days)") +
  ylab("Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Segment the Frequency variable
ecommerce_data <- ecommerce_data %>%
  mutate(Frequency_Binned = cut(Frequency, breaks = seq(0, max(Frequency, na.rm = TRUE), by = 10), right = FALSE))

# Create the bar chart for Frequency
ggplot(ecommerce_data, aes(x = Frequency_Binned)) +
  geom_bar(fill = "orange", alpha = 0.7) +
  theme_minimal() +
  ggtitle("Distribution of Total number of purchases made by the customer") +
  xlab("No. of Purchase Segment") +
  ylab("Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Findings from Recency Chart:

  1. The majority of last purchases fall in the 260–270 day segment.

  2. The fewest transactions come from customers whose last purchase was 240–250 days ago.

Findings from Frequency Chart:

  1. Most customers made between 70 and 80 purchases.

  2. Overall, however, the counts are fairly consistent across all purchase segments.

———————–

2. Decision Tree Analysis

———————–

Prepare data for decision tree.

Here we split the data into 70% training and 30% testing datasets using stratified sampling (createDataPartition) to maintain class balance.

set.seed(123)
trainIndex <- createDataPartition(ecommerce_data$Churn, p = 0.7, list = FALSE)
training <- ecommerce_data[trainIndex, ]
testing <- ecommerce_data[-trainIndex, ]

Build the decision tree model.

This model predicts Churn from the specified predictors (Recency, Frequency, Monetary, Age, Gender & AcquisitionSource).

tree_model <- rpart(Churn ~ Recency + Frequency + Monetary + Age + Gender  + AcquisitionSource, 
                    data = training, method = "class")

# Visualize the decision tree
rpart.plot(tree_model, main = "Decision Tree for Customer Churn")

Here’s how to interpret the Decision Tree:

Root Node:

  • The top-most node represents the first decision point and is chosen based on the feature that provides the best split (e.g., highest information gain or Gini index reduction).

Internal Nodes:

  • Each internal node represents a decision based on a feature value.

  • The tree branches into subsets depending on whether the condition is true (left branch) or false (right branch).

Leaf Nodes:

  • The terminal nodes represent final predictions, showing the class label (e.g., Churn = 1 or Churn = 0) or the average value (for regression trees).

  • They also display the proportion of data points in the node and the majority class.

Example:

  1. Based on the decision tree, customers whose Monetary spending is above RM792 (the right branch from the root) and whose Age is 31 or older are likely to churn (the exact thresholds can be confirmed with the rule printout below).
  2. Likewise, customers who spend between RM460 and RM7,922 are likely to churn.
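
To confirm the split thresholds read off the plot, the fitted tree’s rules can also be printed as text (a quick sketch using rpart.plot’s rpart.rules(); this printout is not part of the original output):

# Print each root-to-leaf path of the fitted tree as a readable rule
rpart.rules(tree_model)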

Predict and evaluate.

Here we are using confusionMatrix to evaluate the model’s performance.

tree_predictions <- predict(tree_model, testing, type = "class")
tree_predictions <- as.factor(tree_predictions)
testing$Churn <- as.factor(testing$Churn)
levels(tree_predictions) <- levels(testing$Churn)

confusionMatrix(tree_predictions, testing$Churn) # Compute the confusion matrix to evaluate the model
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  684  209
##          1  683 1124
##                                           
##                Accuracy : 0.6696          
##                  95% CI : (0.6515, 0.6874)
##     No Information Rate : 0.5063          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3421          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5004          
##             Specificity : 0.8432          
##          Pos Pred Value : 0.7660          
##          Neg Pred Value : 0.6220          
##              Prevalence : 0.5063          
##          Detection Rate : 0.2533          
##    Detection Prevalence : 0.3307          
##       Balanced Accuracy : 0.6718          
##                                           
##        'Positive' Class : 0               
## 

Explanation of the Model’s Performance:

  • Accuracy: 66.96% of predictions were correct, better than the baseline accuracy (No Information Rate) of 50.63%.

  • Confidence Interval: 65.15%–68.74%.

  • Sensitivity (Recall for Class 0): 50.04% of actual class 0 instances were correctly identified.

  • Specificity: 84.32% of actual class 1 instances were correctly identified.

  • Predictive Values:

  • Positive Predictive Value (PPV): 76.60% of predicted class 0 instances were correct.

  • Negative Predictive Value (NPV): 62.20% of predicted class 1 instances were correct.

  • Balanced Accuracy: the average of sensitivity and specificity, 67.18%, providing a balanced view of model performance.

  • Kappa: 0.3421 indicates fair agreement between predictions and actual values, showing room for improvement.

  • McNemar’s Test: the very small p-value (< 2.2e-16) indicates that false positives and false negatives are imbalanced, suggesting a bias in error types.

  • Prevalence: class 0 makes up 50.63% of the test set.
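
As a quick sanity check, these headline figures can be reproduced by hand from the confusion matrix counts shown above:

# Recompute the headline metrics from the confusion matrix counts
# (class 0 is the 'Positive' class, as reported by confusionMatrix)
tp <- 684; fn <- 683; fp <- 209; tn <- 1124
(tp + tn) / (tp + fn + fp + tn)  # Accuracy ~ 0.6696
tp / (tp + fn)                   # Sensitivity ~ 0.5004
tn / (tn + fp)                   # Specificity ~ 0.8432
tp / (tp + fp)                   # Pos Pred Value ~ 0.7660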

Key Insights:

  1. The model performs clearly better than the baseline, but it is biased towards predicting churn (class 1) and misses roughly half of the actual class 0 (non-churn) cases.

  2. Specificity is higher than sensitivity, meaning the model identifies class 1 (churn) instances more reliably than class 0 (non-churn) instances.

———————–

3. Advanced Customer Segmentation

———————–

Apply PCA (optional).

PCA reduces the dimensionality of the selected features; we then visualize the variance explained by the principal components.

pca <- prcomp(ecommerce_data %>% select(WebsiteClickRate, TimeSpentOnSite, AdClickHistory, Recency, Frequency, Monetary), scale. = TRUE)
fviz_eig(pca) # Visualize variance explained

pca <- data.frame(pca$x[, 1:2])

# Visualize PCA results. Here we plot the first two principal components to visualize data clusters.
ggplot(pca, aes(x = PC1, y = PC2)) +
  geom_point(alpha = 0.5) +
  ggtitle("PCA: First Two Main Components") +
  theme_minimal()

Insights from the PCA:

  1. The PCA scatter plot displays the data projected onto its first two principal components (PC1 and PC2), which capture the largest variance in the dataset.

  2. The points are spread relatively evenly across the space, with no clearly defined clusters or patterns.

  3. This suggests that the data may not have strong inherent groupings or separations in these two dimensions, though some localized densities may indicate subtle relationships.

  4. PCA has reduced the dimensionality of the dataset while retaining the most significant information, allowing for visualization and further analysis like clustering or pattern recognition.
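
The scree plot above shows the variance explained per component; the cumulative proportion for the first two components can also be read off numerically. An illustrative check (prcomp is re-run here because the pca object was overwritten with the component scores):

# Cumulative proportion of variance explained by PC1 and PC2 (illustrative check)
pca_full <- prcomp(ecommerce_data %>%
                     select(WebsiteClickRate, TimeSpentOnSite, AdClickHistory,
                            Recency, Frequency, Monetary), scale. = TRUE)
summary(pca_full)$importance["Cumulative Proportion", 1:2]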

K-means clustering.

In this section, we perform K-means clustering with 3 clusters and visualize the results.

set.seed(123)
kmeans_model <- kmeans(pca, centers = 3, nstart = 20)
fviz_cluster(kmeans_model, data = pca, geom = "point", ellipse.type = "norm") +
  ggtitle("K-means Clustering Visualization")

Insights from the K-means Clustering:

  1. The K-means clustering visualization reveals three distinct clusters in the data, represented by red, green, and blue groups.

  2. The clusters are well-separated with some overlap, particularly between the red and green clusters, suggesting some similarity in the data points within these regions.

  3. Each cluster forms a cohesive group with a unique distribution along the first two principal components (PC1 and PC2).

  4. The blue cluster appears to be denser and more concentrated in the lower-right region, while the red and green clusters are spread more widely across the plot.

  5. These clusters indicate distinct patterns or groups in the data that can be further analyzed for meaningful characteristics.
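
The choice of 3 clusters is not tuned above; a common check is the elbow heuristic on the within-cluster sum of squares (an optional sketch, not part of the original analysis):

# Elbow heuristic for choosing the number of clusters on the PCA scores
fviz_nbclust(pca, kmeans, method = "wss") +
  ggtitle("Elbow Method for Choosing k")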

DBSCAN clustering

dbscan_model <- dbscan(pca, eps = 0.5, minPts = 5)
fviz_cluster(list(data = pca, cluster = dbscan_model$cluster), geom = "point") +
  ggtitle("DBSCAN Clustering Visualization")

Insights from the DBSCAN Clustering:

  1. The DBSCAN clustering visualization highlights the distribution of data points across different clusters.

  2. The majority of the data points belong to Cluster 1, represented by red points, which forms the densest and most cohesive group in the dataset.

  3. This cluster likely represents a core segment of the data with shared characteristics. In contrast, Clusters 2, 3, and 4, represented by green, blue, and purple markers, consist of only a few data points scattered across the plot.

  4. These smaller clusters may represent outliers, isolated groups, or unique patterns distinct from the main population.

  5. Overall, the visualization indicates that the dataset is predominantly homogeneous, with a few distinct groups or anomalies standing out.
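
Similarly, eps = 0.5 is a tuning choice; a common heuristic is to look for the “knee” in the k-nearest-neighbour distance plot (an optional sketch, not part of the original analysis):

# k-NN distance plot to guide the choice of eps (k set to the minPts used above)
kNNdistplot(pca, k = 5)
abline(h = 0.5, lty = 2)  # the eps value used in the DBSCAN call above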

———————–

4. Ensemble Methods for Churn Prediction

———————–

Prepare data

X <- ecommerce_data %>% select(Recency, Frequency, Monetary, Age, WebsiteClickRate, TimeSpentOnSite)
y <- ecommerce_data$Churn

Split data into training and testing

trainIndex <- createDataPartition(y, p = 0.7, list = FALSE)
trainX <- X[trainIndex, ]
trainY <- y[trainIndex]
testX <- X[-trainIndex, ]
testY <- y[-trainIndex]

Train models

Bagging

Trains a random forest model as a bagging-style ensemble and evaluates its performance on the testing set. (As configured, this is essentially the same as the Random Forest model below; a pure bagging ensemble would set mtry to the number of predictors so that every tree considers all features at each split.)

set.seed(123)
bagging_model <- randomForest(as.factor(trainY) ~ ., data = trainX, ntree = 100)
bagging_preds <- predict(bagging_model, testX, type = "class")
bagging_perf <- confusionMatrix(bagging_preds, as.factor(testY)) # Evaluate the performance for Bagging

XGBoost

Trains an XGBoost model for churn prediction and calculates its AUC score.

xgb_model <- xgboost(data = as.matrix(trainX), label = trainY, nrounds = 100, objective = "binary:logistic")
## [1]  train-logloss:0.610956 
## [2]  train-logloss:0.533930 
## [3]  train-logloss:0.491357 
## [4]  train-logloss:0.470933 
## [5]  train-logloss:0.431798 
## [6]  train-logloss:0.386619 
## [7]  train-logloss:0.360939 
## [8]  train-logloss:0.352648 
## [9]  train-logloss:0.323156 
## [10] train-logloss:0.311377 
## [11] train-logloss:0.302198 
## [12] train-logloss:0.286147 
## [13] train-logloss:0.278287 
## [14] train-logloss:0.273546 
## [15] train-logloss:0.239328 
## [16] train-logloss:0.228914 
## [17] train-logloss:0.222505 
## [18] train-logloss:0.204664 
## [19] train-logloss:0.201474 
## [20] train-logloss:0.196152 
## [21] train-logloss:0.184131 
## [22] train-logloss:0.176562 
## [23] train-logloss:0.172646 
## [24] train-logloss:0.161192 
## [25] train-logloss:0.147320 
## [26] train-logloss:0.143877 
## [27] train-logloss:0.140864 
## [28] train-logloss:0.139251 
## [29] train-logloss:0.135405 
## [30] train-logloss:0.126635 
## [31] train-logloss:0.115278 
## [32] train-logloss:0.110535 
## [33] train-logloss:0.103966 
## [34] train-logloss:0.098960 
## [35] train-logloss:0.092608 
## [36] train-logloss:0.090653 
## [37] train-logloss:0.083912 
## [38] train-logloss:0.081647 
## [39] train-logloss:0.074761 
## [40] train-logloss:0.072686 
## [41] train-logloss:0.067194 
## [42] train-logloss:0.064362 
## [43] train-logloss:0.059403 
## [44] train-logloss:0.058290 
## [45] train-logloss:0.056726 
## [46] train-logloss:0.054810 
## [47] train-logloss:0.051780 
## [48] train-logloss:0.050248 
## [49] train-logloss:0.046934 
## [50] train-logloss:0.042309 
## [51] train-logloss:0.040579 
## [52] train-logloss:0.039959 
## [53] train-logloss:0.037855 
## [54] train-logloss:0.037293 
## [55] train-logloss:0.034895 
## [56] train-logloss:0.031749 
## [57] train-logloss:0.030693 
## [58] train-logloss:0.029882 
## [59] train-logloss:0.029198 
## [60] train-logloss:0.028648 
## [61] train-logloss:0.027186 
## [62] train-logloss:0.026215 
## [63] train-logloss:0.025452 
## [64] train-logloss:0.024891 
## [65] train-logloss:0.024384 
## [66] train-logloss:0.023792 
## [67] train-logloss:0.022692 
## [68] train-logloss:0.022073 
## [69] train-logloss:0.021298 
## [70] train-logloss:0.020429 
## [71] train-logloss:0.019225 
## [72] train-logloss:0.018689 
## [73] train-logloss:0.017497 
## [74] train-logloss:0.017025 
## [75] train-logloss:0.016574 
## [76] train-logloss:0.016189 
## [77] train-logloss:0.015591 
## [78] train-logloss:0.014930 
## [79] train-logloss:0.014446 
## [80] train-logloss:0.014291 
## [81] train-logloss:0.013558 
## [82] train-logloss:0.012979 
## [83] train-logloss:0.012579 
## [84] train-logloss:0.012350 
## [85] train-logloss:0.012209 
## [86] train-logloss:0.011859 
## [87] train-logloss:0.011788 
## [88] train-logloss:0.011654 
## [89] train-logloss:0.011388 
## [90] train-logloss:0.011232 
## [91] train-logloss:0.010708 
## [92] train-logloss:0.010356 
## [93] train-logloss:0.009802 
## [94] train-logloss:0.009641 
## [95] train-logloss:0.009256 
## [96] train-logloss:0.008844 
## [97] train-logloss:0.008631 
## [98] train-logloss:0.008399 
## [99] train-logloss:0.008264 
## [100]    train-logloss:0.008021
xgb_preds <- predict(xgb_model, as.matrix(testX))
xgb_perf <- performance(prediction(xgb_preds, testY), measure = "auc") # Evaluate the performance for XGBoost
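
The same ROCR objects can also be used to draw the ROC curve behind this AUC (a sketch; the plot is not part of the original output):

# Sketch: plot the ROC curve for the XGBoost predictions
xgb_rocr <- prediction(xgb_preds, testY)
plot(performance(xgb_rocr, "tpr", "fpr"), main = "ROC Curve - XGBoost")
abline(a = 0, b = 1, lty = 2)  # random-classifier reference line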

Random Forest Model

Trains a random forest model and evaluates its performance using a confusion matrix.

rf_model <- randomForest(as.factor(trainY) ~ ., data = trainX, ntree = 100)
rf_preds <- predict(rf_model, testX, type = "class")
rf_perf <- confusionMatrix(rf_preds, as.factor(testY)) # Evaluate the performance for random forest

# Compare models. Here we print the performance result for Bagging, XGBoost, and Random Forest models.
print("Bagging Performance:")
## [1] "Bagging Performance:"
print(bagging_perf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1360    0
##          1    0 1340
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9986, 1)
##     No Information Rate : 0.5037     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.5037     
##          Detection Rate : 0.5037     
##    Detection Prevalence : 0.5037     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 
print("XGBoost AUC:")
## [1] "XGBoost AUC:"
print(xgb_perf@y.values[[1]])
## [1] 1
print("Random Forest Performance:")
## [1] "Random Forest Performance:"
print(rf_perf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1360    0
##          1    0 1340
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9986, 1)
##     No Information Rate : 0.5037     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.5037     
##          Detection Rate : 0.5037     
##    Detection Prevalence : 0.5037     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 

Performance Evaluation:

  1. The results show that all three models (Bagging, XGBoost, and Random Forest) achieved perfect performance with an accuracy of 100%. This indicates that the models correctly classified every instance in both classes (0 and 1), as reflected by the confusion matrices, where there are no false positives or false negatives. Both sensitivity and specificity are also 100%, meaning the models excelled at identifying both positive (class 0) and negative (class 1) instances.

  2. Additionally, the balanced accuracy, positive predictive value (PPV), and negative predictive value (NPV) are all 1, demonstrating flawless performance across all metrics. The AUC score for XGBoost is also 1, further confirming its perfect discrimination between the classes.

  3. However, this level of performance strongly suggests overfitting or train/test leakage. In particular, the synthetic augmentation step duplicated 80% of the rows before the train/test split, so many test rows are exact copies of rows the models were trained on; a quick duplication check follows below. Perfect results are rare in real-world scenarios, so these findings should be validated on a separate, truly unseen dataset to ensure the models generalize rather than memorize the data.
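
One way to verify the suspected leakage (an illustrative check, not part of the original analysis) is to count how many test rows are exact duplicates of training rows, which is expected here because 80% of the rows were duplicated before the split:

# Illustrative leakage check: share of test rows that also appear verbatim in the training set
train_keys <- do.call(paste, c(trainX, list(trainY)))
test_keys  <- do.call(paste, c(testX, list(testY)))
mean(test_keys %in% train_keys)  # a large proportion indicates train/test duplication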

Visualize feature importance from Random Forest

importance <- importance(rf_model)
varImpPlot(rf_model, main = "Feature Importance from Random Forest")

Findings:

  1. The feature importance plot generated from the Random Forest model illustrates how influential each feature is in predicting the target variable, as measured by the Mean Decrease Gini metric.

  2. The most significant feature is “Monetary,” which demonstrates the highest importance, suggesting that the total amount spent by customers is the strongest predictor in the model.

  3. Following this, “TimeSpentOnSite” and “WebsiteClickRate” also show notable contributions, indicating that customer interactions with the website, such as time spent browsing and engagement with links, are key indicators for the prediction.

  4. Features such as “Recency,” “Age,” and “Frequency” are relatively less important, although they still contribute to the overall predictive power of the model.

  5. This visualization highlights the critical role of spending behavior and online engagement in understanding customer behavior and outcomes.