Clustering and Dimension Reduction for Consumer Behavior

Customer Personality Analysis is all about getting to know a company’s ideal customers on a deeper level. By understanding who they are, what they like, and how they behave, businesses can better tailor their products to fit the unique needs of different customer groups.

In this project, we dive into this analysis, helping businesses see beyond the numbers. With these insights, companies can adjust their products and marketing strategies to better connect with specific customer segments. For example, instead of trying to sell a new product to everyone, the company can focus on the group most likely to love it, making their efforts more targeted and effective.

Loading Necessary Libraries

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
## Loading required package: lattice
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Loading the Dataset

Initial Data Exploration

Since we need clustering to summarize and understand customer segments, we first have to clean the data so that it is suitable for clustering. This unsupervised task will help us identify distinct groups within our customer base.

df <- read.csv("/Users/majid/Documents/usupervised learning/clustring project/marketing_campaign.csv", sep = "\t")

# Observations and Column Types
cat("Number of observations:", nrow(df), "\n")
## Number of observations: 2240
cat("Number of columns:", ncol(df), "\n")
## Number of columns: 29
cat("Missing values in 'Income':", sum(is.na(df$Income)), "\n")
## Missing values in 'Income': 24
cat("Unique values in 'Z_CostContact':", length(unique(df$Z_CostContact)), "\n")
## Unique values in 'Z_CostContact': 1
cat("Unique values in 'Z_Revenue':", length(unique(df$Z_Revenue)), "\n")
## Unique values in 'Z_Revenue': 1

Data Cleaning Steps

Initial data exploration reveals the following key points:

Observations: the dataset consists of 2,240 observations across 29 columns.

Missing values: 24 missing values are present in the 'Income' column.

Column types: most columns are numeric, while three are categorical: 'Marital_Status', 'Education', and 'Dt_Customer' (which should be converted to a date type).

Outliers: outliers are present in some numerical columns.

Duplicates: no fully duplicated records were detected (after reducing to the selected features in Step 7, 185 duplicate rows do appear and are removed in Step 8).

# Step 1: Handling Missing Values
# Fill missing values in 'Income' with the median
df$Income[is.na(df$Income)] <- median(df$Income, na.rm = TRUE)

# Step 2: Removing rows with any remaining missing values
df <- df %>% drop_na()
cat("Rows after removing missing values:", nrow(df), "\n")
## Rows after removing missing values: 2240
# Step 3: Removing Unnecessary Columns
df <- df %>% select(-ID, -Z_CostContact, -Z_Revenue)

# Step 4: Converting Date Column and Calculate Days as Client
df$Dt_Customer <- dmy(df$Dt_Customer)
latest_date <- max(df$Dt_Customer)
df$Days_is_client <- as.numeric(latest_date - df$Dt_Customer)

# Step 5: Standardizing Categorical Variables
# Standardize 'Marital_Status'
df$Marital_Status <- recode(df$Marital_Status,
                            "Married" = "Partner",
                            "Together" = "Partner",
                            "Divorced" = "Single",
                            "Widow" = "Single",
                            "Alone" = "Single",
                            "YOLO" = "Single",
                            "Absurd" = "Single")

# Standardize 'Education'
df$Education <- recode(df$Education,
                       "PhD" = "Postgraduate",
                       "Master" = "Postgraduate",
                       "2n Cycle" = "Graduate",
                       "Graduation" = "Graduate",
                       "Basic" = "Undergraduate")

# Step 6: Combining Columns for Feature Reduction
df$Kids <- df$Kidhome + df$Teenhome
df$Expenses <- df$MntWines + df$MntFruits + df$MntMeatProducts +
               df$MntFishProducts + df$MntSweetProducts + df$MntGoldProds
df$TotalAcceptedCmp <- df$AcceptedCmp1 + df$AcceptedCmp2 +
                       df$AcceptedCmp3 + df$AcceptedCmp4 + df$AcceptedCmp5
df$TotalNumPurchases <- df$NumWebPurchases + df$NumCatalogPurchases +
                        df$NumStorePurchases + df$NumDealsPurchases

# Step 7: Selecting Necessary Columns
df <- df %>% select(Education, Marital_Status, Income, Kids, Days_is_client, 
                    Recency, Expenses, TotalNumPurchases, TotalAcceptedCmp, 
                    Complain, Response)

# Step 8: Remove Duplicates
df <- df %>% distinct()
cat("Rows after removing duplicates:", nrow(df), "\n")
## Rows after removing duplicates: 2055
# Step 9: Detect and Remove Outliers
numerical_columns <- df %>% select(where(is.numeric)) %>% colnames()
z_scores <- scale(df[numerical_columns])
df <- df[rowSums(abs(z_scores) > 3) == 0, ]  # keep rows with no |z| > 3
cat("Rows after removing outliers:", nrow(df), "\n")
## Rows after removing outliers: 1972
# Step 10: Encode Categorical Variables
# One-hot encode categorical variables
dummies <- dummyVars(~ Education + Marital_Status, data = df)
df_encoded <- predict(dummies, newdata = df) %>% as.data.frame()
cat("Rows after encoding categorical variables:", nrow(df_encoded), "\n")
## Rows after encoding categorical variables: 1972
# Step 11: Combine Numerical and Encoded Data
features_scaled <- scale(cbind(df %>% select(where(is.numeric)), df_encoded))
cat("Rows in features_scaled:", nrow(features_scaled), "\n")
## Rows in features_scaled: 1972
# Step 12: Check for NA, NaN, or Inf Values and Handle Them
# Convert features_scaled to a plain numeric matrix
features_scaled_matrix <- as.matrix(features_scaled)

# Check for NA, NaN, or Inf values and handle them globally
features_scaled_matrix <- apply(features_scaled_matrix, 2, function(x) {
  x[is.na(x) | is.nan(x)] <- mean(x[!is.na(x) & !is.nan(x)], na.rm = TRUE)  # Replace NA or NaN with column mean
  x[is.infinite(x)] <- max(x[is.finite(x)], na.rm = TRUE)  # Replace Inf with max finite value in the column
  return(x)
})

# # Diagnostic checks
# cat("Number of NA after handling:", sum(is.na(features_scaled_matrix)), "\n")
# cat("Number of NaN after handling:", sum(is.nan(features_scaled_matrix)), "\n")
# cat("Number of Inf after handling:", sum(is.infinite(features_scaled_matrix)), "\n")

# Convert back to a data frame for clustering
features_scaled <- as.data.frame(features_scaled_matrix)

# Final dataset shape
cat("Final shape of the dataset:", dim(features_scaled), "\n")
## Final shape of the dataset: 1972 14

During the data cleaning and feature engineering process, several key steps were undertaken to enhance the dataset quality, including handling missing values, removing unnecessary columns, standardizing categorical variables, and dealing with outliers. This prepares the dataset for further analysis and clustering.

Exploratory Data Analysis

# 1. Histograms for Numerical Columns
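The chunk that produced these histograms is not echoed in the rendered output; a minimal sketch of one way to generate them (assuming the same df, and avoiding the deprecated aes_string() interface) might be:

# Histogram for each numerical column (the bin count is illustrative)
numerical_columns <- df %>% select(where(is.numeric)) %>% colnames()

for (column in numerical_columns) {
  p <- ggplot(df, aes(x = .data[[column]])) +
    geom_histogram(bins = 30, fill = "#327D7C", color = "white") +
    labs(title = paste("Distribution of", column), x = column, y = "Count") +
    theme_minimal()
  print(p)
}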

Count Plots for Categorical Columns

# the color palette
custom_palette <- c("#327D7C", "#E2504A", "#F0C808")

# Plot countplots for each categorical column
categorical_columns <- df %>% select(where(is.character)) %>% colnames()
binary_columns <- df %>% select(where(~ n_distinct(.) == 2)) %>% colnames()

for (column in unique(c(categorical_columns, binary_columns))) {
  p <- ggplot(df, aes(x = factor(.data[[column]]), fill = factor(.data[[column]]))) +
    geom_bar() +
    scale_fill_manual(values = custom_palette) +
    labs(title = paste("Countplot of", column), x = column, y = "Count", fill = column) +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5))

  print(p)
  ggsave(paste0("countplot_", column, ".png"), plot = p, width = 8, height = 5)
}


Correlation Heatmap

# Custom diverging color palette (low, mid, high)
colors <- c("#E2504A", "#EDEBEC", "#327D7C")

# Correlation matrix of the numeric features
corr_df <- cor(df %>% select(where(is.numeric)), use = "pairwise.complete.obs")
## Warning in cor(df %>% select(where(is.numeric)), use =
## "pairwise.complete.obs"): the standard deviation is zero
# Plotting the heatmap
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
corr_melted <- melt(corr_df)
p <- ggplot(corr_melted, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = colors[1], mid = colors[2], high = colors[3], midpoint = 0, limits = c(-1, 1)) +
  theme_minimal() +
  labs(title = "Correlation Heatmap", x = "", y = "") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p)

ggsave("correlation_heatmap.png", plot = p, width = 16, height = 14)

Key Findings from the Visualizations

Income Distribution:

After removing outliers, income follows an approximately normal distribution, suggesting that most customers earn close to the average, with fewer customers earning significantly more or less (a quick sanity check appears after this list).

Days as Client & Recency:

Both features exhibit a fairly uniform distribution, indicating that customers joined the company at varying times and last interacted with it across a wide range of recency values.

Expenses Distribution:

Expenses follow a roughly exponential distribution: most customers spend relatively little, and the number of customers falls off quickly as spending increases.

Total Number of Purchases:

This feature follows an approximately binomial (roughly bell-shaped) distribution, reflecting common purchasing behaviors among customers.

Countplot Insights:

The majority of customers are graduates with one child, have a partner, have not complained in the last two years, and have never accepted offers in campaigns, indicating a specific customer profile that the company caters to.

Correlated Features:

Income, expenses, and the total number of purchases are the most correlated features, suggesting that higher income is closely linked with higher spending and a greater number of purchases.
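A quick way to sanity-check these distributional claims (a sketch; the Shapiro-Wilk test requires n ≤ 5000, which holds here):

# Normality check for Income and a skew check for Expenses
shapiro.test(df$Income)   # a large p-value would be consistent with normality
summary(df$Expenses)      # mean well above the median indicates right skew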

Data Preprocessing

1. Encoding Categorical Variables

## Categorical columns identified: Education, Marital_Status

2. Remove Near-Zero Variance Variables

dummies <- dummyVars(~ ., data = df)
df_encoded <- predict(dummies, newdata = df) %>%
  as.data.frame()

# Remove near-zero variance variables (nearZeroVar also flags rare binaries such as Complain)
nzv <- nearZeroVar(df_encoded, saveMetrics = TRUE)
zero_variance_columns <- rownames(nzv[nzv$nzv, ])

if (length(zero_variance_columns) > 0) {
  df_encoded <- df_encoded %>% select(-all_of(zero_variance_columns))
  cat("Removed zero variance columns:", paste(zero_variance_columns, collapse = ", "), "\n")
}
}
## Removed zero variance columns: EducationUndergraduate, Complain

3. Scaling

# Identify the numerical columns for scaling
numerical_columns <- df_encoded %>% select(where(is.numeric)) %>% colnames()

# Instantiate the scaler
scaler <- preProcess(df_encoded[numerical_columns], method = c("center", "scale"))

# Fit and transform the data
features_scaled <- predict(scaler, df_encoded[numerical_columns])

# Dimensions of the scaled data
cat("Scaled data dimensions:", dim(features_scaled), "\n")
## Scaled data dimensions: 1972 12

Summary of the Scaled Data

##  EducationGraduate EducationPostgraduate Marital_StatusPartner
##  Min.   :-1.1987   Min.   :-0.7917       Min.   :-1.3329      
##  1st Qu.:-1.1987   1st Qu.:-0.7917       1st Qu.:-1.3329      
##  Median : 0.8338   Median :-0.7917       Median : 0.7499      
##  Mean   : 0.0000   Mean   : 0.0000       Mean   : 0.0000      
##  3rd Qu.: 0.8338   3rd Qu.: 1.2625       3rd Qu.: 0.7499      
##  Max.   : 0.8338   Max.   : 1.2625       Max.   : 0.7499      
##  Marital_StatusSingle     Income             Kids          Days_is_client     
##  Min.   :-0.7499      Min.   :-2.4468   Min.   :-1.31008   Min.   :-1.739540  
##  1st Qu.:-0.7499      1st Qu.:-0.7829   1st Qu.:-1.31008   1st Qu.:-0.855534  
##  Median :-0.7499      Median : 0.0117   Median : 0.03548   Median :-0.006098  
##  Mean   : 0.0000      Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.000000  
##  3rd Qu.: 1.3329      3rd Qu.: 0.7902   3rd Qu.: 0.03548   3rd Qu.: 0.868031  
##  Max.   : 1.3329      Max.   : 3.1226   Max.   : 2.72660   Max.   : 1.712528  
##     Recency             Expenses       TotalNumPurchases  TotalAcceptedCmp 
##  Min.   :-1.687855   Min.   :-0.9933   Min.   :-1.94429   Min.   :-0.4549  
##  1st Qu.:-0.860825   1st Qu.:-0.8843   1st Qu.:-1.01766   1st Qu.:-0.4549  
##  Median : 0.000664   Median :-0.3641   Median : 0.04135   Median :-0.4549  
##  Mean   : 0.000000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.862153   3rd Qu.: 0.7346   3rd Qu.: 0.83561   3rd Qu.:-0.4549  
##  Max.   : 1.723642   Max.   : 3.0692   Max.   : 2.68888   Max.   : 3.5144  
##     Response      
##  Min.   :-0.3948  
##  1st Qu.:-0.3948  
##  Median :-0.3948  
##  Mean   : 0.0000  
##  3rd Qu.:-0.3948  
##  Max.   : 2.5319

Summary of Preprocessing Steps

Encoding: Categorical variables were converted to one-hot encoded format using dummyVars.

Zero Variance Removal: Variables with zero variance were identified and removed.

Scaling: Numerical features were standardized to have a mean of 0 and a standard deviation of 1.
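Concretely, each feature x is transformed as z = (x − mean(x)) / sd(x); for example, the EducationPostgraduate maximum of about 1.26 in the summary above simply means that value sits 1.26 standard deviations above the column mean.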

Clustering with KMeans

Brief explanation of clustering with K-Means (a minimal from-scratch sketch follows the steps below):

Initialization: The algorithm starts by randomly selecting a predefined number of cluster centers (centroids).

Assignment: Each data point is then assigned to the nearest centroid, forming initial clusters.

Update: The centroids are recalculated as the average position of all points within each cluster.

Iteration: The process of assignment and updating continues until the centroids stabilize and no longer change significantly.

Final Output: The result is a set of clusters where data points in the same cluster are more similar to each other than to those in other clusters.
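To make these steps concrete, here is a minimal from-scratch sketch of the algorithm in R. It is illustrative only: the helper name simple_kmeans and the convergence tolerance are my own choices, it assumes no cluster ever becomes empty, and the analysis below uses the built-in kmeans().

# Minimal K-Means sketch (use stats::kmeans() in practice)
simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  centroids <- X[sample(nrow(X), k), , drop = FALSE]   # 1. random initialization
  for (iter in seq_len(max_iter)) {
    # 2. assignment: squared Euclidean distance to each centroid
    d <- sapply(seq_len(k), function(j)
      rowSums((X - matrix(centroids[j, ], nrow(X), ncol(X), byrow = TRUE))^2))
    clusters <- max.col(-d)                            # nearest centroid per point
    # 3. update: centroids become the mean of their assigned points
    new_centroids <- sapply(seq_len(ncol(X)),
                            function(j) tapply(X[, j], clusters, mean))
    # 4. iterate until the centroids stop moving
    if (max(abs(new_centroids - centroids)) < 1e-8) break
    centroids <- new_centroids
  }
  list(cluster = clusters, centers = centroids)
}

# Example usage: set.seed(123); simple_kmeans(features_scaled, k = 2)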

K-Means Clustering Analysis

1. Determine Optimal Number of Clusters using the Elbow Method

I will plot an elbow curve to determine the optimal number of clusters by examining the total within-cluster sum of squares (WSS) for different cluster counts.
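Formally, for a given k, WSS(k) = Σ_j Σ_{x ∈ C_j} ||x − μ_j||², where μ_j is the centroid of cluster C_j. The "elbow" is the value of k beyond which adding clusters yields only marginal reductions in WSS.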

library(ggplot2)

# Calculate total within-cluster sum of squares for k = 1 to 10
wss <- numeric(10)
for (k in 1:10) {
  set.seed(123)
  kmeans_model <- kmeans(features_scaled, centers = k, nstart = 20)
  wss[k] <- kmeans_model$tot.withinss
}

# Plot the Elbow Method
elbow_plot <- ggplot(data.frame(k = 1:10, wss = wss), aes(x = k, y = wss)) +
  geom_line() +
  geom_point() +
  labs(title = "Elbow Method for Optimal Clusters",
       x = "Number of Clusters",
       y = "Within-cluster Sum of Squares") +
  theme_minimal()

print(elbow_plot)

Silhouette Analysis

Another method for choosing the number of clusters is silhouette analysis. We compute average silhouette widths for different cluster counts to evaluate clustering quality; for interpretability, the silhouettes below are computed on the two key features, Income and Expenses.
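For each point i, let a(i) be the mean distance to the other points in its own cluster and b(i) the smallest mean distance to the points of any other cluster. The silhouette width is then s(i) = (b(i) − a(i)) / max(a(i), b(i)), which ranges from −1 (likely misassigned) through 0 (on a cluster boundary) to +1 (well separated).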
# Load necessary libraries
library(factoextra)
library(ggplot2)
library(cluster)  # for silhouette()

# Extract the two features (Income and Expenses)
features_subset <- features_scaled[, c("Income", "Expenses")]

# Calculate average silhouette widths for k = 2 to 10
silhouette_scores <- numeric(10)  # To store average silhouette widths

for (k in 2:10) {
  set.seed(123)
  kmeans_model <- kmeans(features_subset, centers = k, nstart = 20)
  silhouette_k <- silhouette(kmeans_model$cluster, dist(features_subset))
  silhouette_scores[k] <- mean(silhouette_k[, 3])  # Average silhouette width
}


silhouette_plot <- ggplot(data.frame(k = 2:10, silhouette_scores = silhouette_scores[2:10]), 
                           aes(x = k, y = silhouette_scores)) +
  geom_line() +
  geom_point() +
  labs(title = "Silhouette Analysis for Optimal Number of Clusters",
       x = "Number of Clusters",
       y = "Average Silhouette Width") +
  theme_minimal()

print(silhouette_plot)

Choosing the Number of Clusters

Both the elbow curve and the silhouette analysis suggest that 2 clusters provide the best solution.

K-Means with the Chosen Number of Clusters

1. Initialize and Fit the KMeans Algorithm
library(factoextra)
library(ggplot2)

# Extract the two features (Income and Expenses)
features_subset <- features_scaled[, c("Income", "Expenses")]

# Perform KMeans clustering with 2 clusters
set.seed(123)  
km_result <- eclust(features_subset, FUNcluster = "kmeans", k = 2, nstart = 20)

df$Cluster <- as.factor(km_result$cluster)

# Calculate summary statistics for each cluster
cluster_summary_kmeans <- df %>%
  group_by(Cluster) %>%
  summarise(across(where(is.numeric), list(mean = mean, median = median)))


print(cluster_summary_kmeans)
## # A tibble: 2 × 19
##   Cluster Income_mean Income_median Kids_mean Kids_median Days_is_client_mean
##   <fct>         <dbl>         <dbl>     <dbl>       <dbl>               <dbl>
## 1 1            37600.        38012.     1.23            1                336.
## 2 2            70556.        70346.     0.595           1                376.
## # ℹ 13 more variables: Days_is_client_median <dbl>, Recency_mean <dbl>,
## #   Recency_median <dbl>, Expenses_mean <dbl>, Expenses_median <dbl>,
## #   TotalNumPurchases_mean <dbl>, TotalNumPurchases_median <dbl>,
## #   TotalAcceptedCmp_mean <dbl>, TotalAcceptedCmp_median <dbl>,
## #   Complain_mean <dbl>, Complain_median <dbl>, Response_mean <dbl>,
## #   Response_median <dbl>

Silhouette Plot for K-Means

library(cluster)


# Calculate the silhouette scores (distances are computed over the full scaled
# feature space, even though the clusters were fit on Income and Expenses only)
silhouette_scores <- silhouette(km_result$cluster, dist(features_scaled))


silhouette_plot <- fviz_silhouette(silhouette_scores, 
                                   palette = "jco", 
                                   ggtheme = theme_minimal()) +
  labs(title = "Silhouette Plot for KMeans Clustering",
       x = "Silhouette Width",
       y = "Cluster") +
  theme(plot.title = element_text(hjust = 0.5))
##   cluster size ave.sil.width
## 1       1 1174          0.22
## 2       2  798          0.13
print(silhouette_plot)

1. K-Means Clustering Plot (Income vs Expenses)

Plot Description:

This plot shows the results of K-Means clustering with 2 clusters based on two key features: Income (scaled) and Expenses (scaled). Each point represents a customer, and the clusters are outlined with ellipses.

Interpretation:

Cluster 1 (Red): customers with lower income and lower expenses. These are likely budget-conscious, price-sensitive customers who are less likely to make high-value purchases. Actionable insight: target this group with discounts, promotions, or budget-friendly products to increase their spending.

Cluster 2 (Green): customers with higher income and higher expenses. These are high-value customers who spend more and are likely to be loyal and responsive to premium offers.

2. Cluster Summary Statistics (K-Means)

Table Description:

The table provides summary statistics (mean and median) for each cluster across key features such as Income, Kids, Days as Client, Recency, Expenses, and Total Purchases.

Interpretation:

Cluster 1: lower income (mean ≈ $37,600), more kids (mean = 1.23), slightly shorter tenure (mean ≈ 336 days), lower spending, and fewer purchases. These are low-income customers with larger families who spend less. Actionable insight: offer family-oriented deals or bulk discounts to encourage higher spending.

Cluster 2: higher income (mean ≈ $70,556), fewer kids (mean ≈ 0.60), slightly longer tenure (mean ≈ 376 days), higher spending, and more purchases. These are high-income customers who spend more and make frequent purchases.

3. Silhouette Plot for K-Means Clustering

Plot Description:

The silhouette plot shows the silhouette width for each customer in the two clusters. The average silhouette width is 0.18.

Interpretation:

Cluster 1: average silhouette width 0.22 (moderate). Customers in this cluster are reasonably well grouped but may have some overlap with Cluster 2.

Cluster 2: average silhouette width 0.13 (low). Customers in this cluster are less distinct and may overlap with Cluster 1.

CLARA Method for Clustering

CLARA (Clustering Large Applications) extends k-medoids (PAM) to larger datasets by repeatedly drawing samples, running PAM on each sample, and keeping the medoids that give the best overall fit.

# Perform CLARA clustering with 2 clusters
set.seed(123)  
clara_result <- eclust(features_subset, FUNcluster = "clara", k = 2)

df$Cluster <- as.factor(clara_result$clustering)

# Calculate summary statistics for each cluster
cluster_summary_clara <- df %>%
  group_by(Cluster) %>%
  summarise(across(where(is.numeric), list(mean = mean, median = median)))


print(cluster_summary_clara)
## # A tibble: 2 × 19
##   Cluster Income_mean Income_median Kids_mean Kids_median Days_is_client_mean
##   <fct>         <dbl>         <dbl>     <dbl>       <int>               <dbl>
## 1 1            70208.         70038     0.608           1                376.
## 2 2            37190.         37509     1.23            1                335.
## # ℹ 13 more variables: Days_is_client_median <dbl>, Recency_mean <dbl>,
## #   Recency_median <int>, Expenses_mean <dbl>, Expenses_median <int>,
## #   TotalNumPurchases_mean <dbl>, TotalNumPurchases_median <int>,
## #   TotalAcceptedCmp_mean <dbl>, TotalAcceptedCmp_median <int>,
## #   Complain_mean <dbl>, Complain_median <int>, Response_mean <dbl>,
## #   Response_median <int>

Silhouette Plot for CLARA

library(cluster)

# Calculate the silhouette scores (distances over the full scaled feature space, as above)
silhouette_scores <- silhouette(clara_result$clustering, dist(features_scaled))

#silhouette scores
silhouette_plot <- fviz_silhouette(silhouette_scores, 
                                   palette = "jco", 
                                   ggtheme = theme_minimal()) +
  labs(title = "Silhouette Plot for Clara Clustering",
       x = "Silhouette Width",
       y = "Cluster") +
  theme(plot.title = element_text(hjust = 0.5))
##   cluster size ave.sil.width
## 1       1  821          0.13
## 2       2 1151          0.22
print(silhouette_plot)

1. CLARA Clustering Plot (Income vs Expenses)

Plot Description:

This plot shows the results of CLARA clustering with 2 clusters based on two key features: Income (scaled) and Expenses (scaled). Each point represents a customer, and the clusters are outlined with ellipses.

Interpretation:

Cluster 1 (Red): Characteristics: Customers in this cluster have higher income and higher expenses. Insight: These are high-value customers who spend more. They are likely to be responsive to premium offers and have a higher lifetime value. Actionable Insight: Focus on retention strategies for this group, such as loyalty programs, exclusive offers, or personalized recommendations for high-end products.

Cluster 2 (Green): Characteristics: Customers in this cluster have lower income and lower expenses. Insight: These are budget-conscious customers who spend less. They may be price-sensitive and less likely to engage in high-value purchases. Actionable Insight: Target this group with discounts, promotions, or budget-friendly products to increase their spending.

2. Cluster Summary Statistics (CLARA)

Table Description:

The table provides summary statistics (mean and median) for each cluster across key features such as Income, Kids, Days as Client, Recency, Expenses, and Total Purchases.

Interpretation:

Cluster 1: Income: Higher income (mean = $70,208). Kids: Fewer kids (mean = 0.608). Days as Client: Moderate duration (mean = 376 days). Expenses: Higher spending. Total Purchases: More purchases. Insight: These are high-income, high-spending customers with fewer kids. They are likely to be newer customers who are highly engaged. Actionable Insight: Focus on upselling and cross-selling premium products to this group. Offer exclusive early access to new products or services.

Cluster 2: Income: Lower income (mean = $37,190). Kids: More kids (mean = 1.23). Days as Client: Shorter duration (mean = 335 days). Expenses: Lower spending. Total Purchases: Fewer purchases. Insight: These are low-income, budget-conscious customers with larger families. They spend less and have been with the company for a shorter time. Actionable Insight: Offer family-oriented deals or bulk discounts to encourage higher spending. Use targeted marketing to increase engagement.

3. Silhouette Plot for CLARA Clustering

Plot Description:

The silhouette plot shows the silhouette width for each customer in the two clusters. The average silhouette width is 0.18.

Interpretation:

Cluster 1: average silhouette width 0.13 (low). Customers in this cluster are less distinct and may overlap with Cluster 2.

Cluster 2: average silhouette width 0.22 (moderate). Customers in this cluster are reasonably well grouped but may have some overlap with Cluster 1.

Hierarchical Dendrogram

# Set CRAN mirror
options(repos = c(CRAN = "https://cloud.r-project.org"))


if (!require("ggdendro", quietly = TRUE)) {
  install.packages("ggdendro")
  library(ggdendro)
}


# Perform hierarchical clustering
# Using the scaled features (features_scaled) from your previous steps
#  distance matrix
dist_matrix <- dist(features_scaled, method = "euclidean")

# Perform hierarchical clustering using the complete linkage method
hc <- hclust(dist_matrix, method = "complete")


dendrogram_plot <- ggdendrogram(hc, rotate = TRUE, theme_dendro = FALSE) +
  labs(title = "Hierarchical Clustering Dendrogram",
       x = "Data Points",
       y = "Height (Distance)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        axis.text.x = element_blank(),  # Hide x-axis labels for clarity
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12))


print(dendrogram_plot)

Hierarchical Dendrogram interpretation

Height (Y-axis): The height ranges from 0 (most similar) to a maximum value (most dissimilar). The height at which clusters merge indicates their similarity.

Clusters: The dendrogram shows multiple clusters forming at different heights. For example, if you cut the dendrogram at a height of 5, you might get 3 clusters (see the sketch below).

Cluster Separation: The dendrogram shows some clear branches, indicating that the data points naturally form distinct clusters. However, there may also be some overlap or closely spaced branches, suggesting that some clusters are not well-separated.
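As a quick check of this interpretation, we can cut the tree at an illustrative height (the value 5 is an example, not a tuned choice) and count the resulting cluster sizes:

# Cut the dendrogram at an illustrative height and inspect cluster sizes
clusters_at_h <- cutree(hc, h = 5)
table(clusters_at_h)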

Hierarchical Dendrogram for k = 4

# Cut the dendrogram to define 4 clusters
clusters <- cutree(hc, k = 4)


df$Cluster <- as.factor(clusters)


head(df)
##      Education Marital_Status Income Kids Days_is_client Recency Expenses
## 1     Graduate         Single  58138    0            663      58     1617
## 2     Graduate         Single  46344    2            113      38       27
## 3     Graduate        Partner  71613    0            312      26      776
## 4     Graduate        Partner  26646    1            139      26       53
## 5 Postgraduate        Partner  58293    1            161      94      422
## 6 Postgraduate        Partner  62513    1            293      16      716
##   TotalNumPurchases TotalAcceptedCmp Complain Response Cluster
## 1                25                0        0        1       1
## 2                 6                0        0        0       2
## 3                21                0        0        0       3
## 4                 8                0        0        0       2
## 5                19                0        0        0       3
## 6                22                0        0        0       3

Interpretation:

Cluster assignments in the first six rows:

Cluster 1: row 1. Cluster 2: rows 2 and 4. Cluster 3: rows 3, 5, and 6. (Cluster 4 does not appear among these six rows.)

Cluster characteristics:

Cluster 1: high-value, loyal, engaged customers who respond to campaigns.

Cluster 2: low-engagement, newer customers who spend less and do not respond to campaigns.

Cluster 3: high-value customers who make frequent purchases but are less responsive to marketing campaigns; they may be loyal but not easily influenced by promotions.

Actionable Insights for stakeholders:

Cluster 1 (High-Value, Loyal Customers):

Action: Focus on retaining these customers by offering exclusive deals or loyalty rewards. Goal: Increase their lifetime value and encourage repeat purchases.

Cluster 2 (Low-Engagement, Newer Customers):

Action: Develop targeted marketing campaigns to re-engage these customers (e.g., personalized offers or discounts). Goal: Increase their spending and encourage them to respond to campaigns.

Cluster 3 (High-Value, Frequent Purchasers):

Action: Investigate why these customers are not responding to campaigns and tailor marketing efforts to their preferences. Goal: Increase their responsiveness to campaigns while maintaining their loyalty.

Dimension Reduction

Dimensionality reduction serves three purposes here: it reduces noise and irrelevant features, improves clustering performance, and makes the data easy to visualize in 2D or 3D space.

PCA on the Scaled Data

Since I have already scaled the data (features_scaled), I apply PCA to it directly.

# Perform PCA on the scaled data (centering and scaling again is redundant here, but harmless)
pca_result <- prcomp(features_scaled, center = TRUE, scale. = TRUE)

# Summary of PCA
summary(pca_result)
## Importance of components:
##                          PC1   PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.742 1.428 1.3983 1.1116 1.03244 0.95188 0.89184
## Proportion of Variance 0.253 0.170 0.1629 0.1030 0.08883 0.07551 0.06628
## Cumulative Proportion  0.253 0.423 0.5860 0.6889 0.77778 0.85328 0.91957
##                            PC8     PC9    PC10    PC11      PC12
## Standard deviation     0.72904 0.51418 0.35202 0.21309 2.499e-16
## Proportion of Variance 0.04429 0.02203 0.01033 0.00378 0.000e+00
## Cumulative Proportion  0.96386 0.98589 0.99622 1.00000 1.000e+00

Scree Plot

The scree plot helps us determine the number of principal components to retain by showing the proportion of variance explained by each component.

# Scree plot
fviz_eig(pca_result, barfill = "#327D7C", barcolor = "#327D7C", linecolor = "#E2504A") +
  labs(title = "Scree Plot", x = "Principal Components", y = "Percentage of Explained Variance") +
  theme_minimal()

Interpretation:

PC1 (First Principal Component):

Explains the largest proportion of variance in the dataset. PC1 likely captures the most significant patterns in customer behavior, such as income, expenses, and purchasing habits. The high variance explained by PC1 suggests that it is the most important component for distinguishing between customer segments.

PC2 (Second Principal Component):

Explains the second-largest proportion of variance. PC2 might capture secondary patterns, such as differences in family size (number of kids) or customer loyalty (days as a client).

Subsequent Components (PC3, PC4, etc.):

These components explain progressively smaller amounts of variance. They may capture more subtle or less significant patterns in the data.

Correlation Circle

The correlation circle shows the contribution of each variable to the principal components.

fviz_pca_var(pca_result, col.var = "contrib", gradient.cols = c("#E2504A", "#327D7C", "#F0C808")) +
  labs(title = "Correlation Circle", x = "PC1", y = "PC2") +
  theme_minimal()

Interpretation of the Correlation Circle

PC1 (X-axis):

Variables like Income, Expenses, and TotalNumPurchases are likely to have strong contributions to PC1. These variables are probably highly correlated with each other, indicating that customers with higher incomes tend to spend more and make more purchases.

PC2 (Y-axis): Variables like Kids, Days_is_client, and Recency might contribute more to PC2. These variables could represent secondary patterns, such as family size or customer loyalty.

Correlated Variables:

Variables that are close to each other in the plot (e.g., Income and Expenses) are positively correlated. Variables that are far apart or in opposite directions (e.g., Income and Kids) might be negatively correlated or unrelated.

Less Influential Variables:

Variables with short arrows (e.g., Complain, Response) contribute less to the principal components and may be less important for clustering.

Individual Contributions

We can also visualize the contributions of individual variables to the principal components.

# Contributions of variables to PC1
PC1 <- fviz_contrib(pca_result, choice = "var", axes = 1, fill = "#327D7C", color = "#327D7C") +
  labs(title = "Contributions to PC1") +
  theme_minimal()

# Contributions of variables to PC2
PC2 <- fviz_contrib(pca_result, choice = "var", axes = 2, fill = "#327D7C", color = "#327D7C") +
  labs(title = "Contributions to PC2") +
  theme_minimal()

# Contributions of variables to PC3
PC3 <- fviz_contrib(pca_result, choice = "var", axes = 3, fill = "#327D7C", color = "#327D7C") +
  labs(title = "Contributions to PC3") +
  theme_minimal()


grid.arrange(PC1, PC2, PC3, ncol = 3)

Interpretation of Contributions to PC1:

Income: likely the most significant contributor to PC1, indicating that income is a key driver of customer behavior and segmentation.

Expenses: also a major contributor, suggesting that spending habits are closely tied to income levels.

TotalNumPurchases: another important contributor, reflecting purchase frequency, which is likely correlated with income and expenses.

Other variables: Kids, Days_is_client, and Recency contribute comparatively little to PC1 and are less influential in defining it.

Actionable Insight for stakeholders:

PC1 primarily represents spending power and purchasing behavior; it can be used to identify high-income, high-spending customers who make frequent purchases. A small illustration follows.
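As a sketch of how this might be used (the top-10 cutoff is arbitrary, and the sign of a principal component is arbitrary, so check the loadings before reading direction into the scores):

# Rank customers by their PC1 score and inspect the extremes
scores <- data.frame(df, PC1_score = pca_result$x[, 1])
head(scores[order(-scores$PC1_score),
            c("Income", "Expenses", "TotalNumPurchases", "PC1_score")], 10)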

Contributions to PC2:

Kids: likely the most significant contributor to PC2, indicating that family size (number of kids) is a key driver of customer behavior.

Days_is_client: another major contributor, reflecting customer loyalty (how long a customer has been with the company).

Recency: also contributes to PC2, indicating how recently a customer has made a purchase.

Other variables: Income and Expenses contribute comparatively little to PC2 and are less influential in defining it.

Actionable Insight for stakeholders:

PC2 primarily represents family-oriented or loyalty-related customer behavior. PC2 can be used to identify family-oriented customers or long-term loyal customers.

Contributions to PC3

Recency: Likely the most significant contributor to PC3. Indicates that recent purchasing behavior is a key driver of customer behavior.

TotalAcceptedCmp: Another major contributor to PC3. Reflects the number of marketing campaigns a customer has accepted.

Complain: Also contributes to PC3. Indicates whether a customer has complained in the past.

Other Variables: Variables like Income and Expenses have smaller contributions to PC3. These variables are less influential in defining PC3.

Actionable Insight for stakeholders:

PC3 primarily represents customer engagement or responsiveness to marketing campaigns. PC3 can be used to identify customers who are more responsive to marketing efforts or who have recently engaged with the company.

Interpret the PCA Results

The PCA summary gives the proportion of variance explained by each principal component; typically, we retain the components that explain most of the variance.

# Eigenvalues and variance explained
eig.val <- get_eigenvalue(pca_result)
print(eig.val)
##          eigenvalue variance.percent cumulative.variance.percent
## Dim.1  3.035945e+00     2.529954e+01                    25.29954
## Dim.2  2.040532e+00     1.700443e+01                    42.30397
## Dim.3  1.955230e+00     1.629358e+01                    58.59755
## Dim.4  1.235684e+00     1.029737e+01                    68.89492
## Dim.5  1.065940e+00     8.882831e+00                    77.77775
## Dim.6  9.060714e-01     7.550595e+00                    85.32834
## Dim.7  7.953849e-01     6.628207e+00                    91.95655
## Dim.8  5.315057e-01     4.429214e+00                    96.38577
## Dim.9  2.643816e-01     2.203180e+00                    98.58894
## Dim.10 1.239195e-01     1.032663e+00                    99.62161
## Dim.11 4.540709e-02     3.783924e-01                   100.00000
## Dim.12 6.243780e-32     5.203150e-31                   100.00000
Key Observations from the PCA Results:

PC1 (Dim.1): Eigenvalue: 3.036 Variance Explained: 25.30% Cumulative Variance: 25.30% Interpretation: PC1 explains the largest proportion of variance in the dataset. It likely captures the most significant patterns in customer behavior, such as income, expenses, and purchasing habits.

PC2 (Dim.2): Eigenvalue: 2.041 Variance Explained: 17.00% Cumulative Variance: 42.30% Interpretation: PC2 explains the second-largest proportion of variance. It likely captures secondary patterns, such as family size (kids) and customer loyalty (days as a client).

PC3 (Dim.3): Eigenvalue: 1.955 Variance Explained: 16.29% Cumulative Variance: 58.60% Interpretation: PC3 explains additional variance and may capture patterns related to customer engagement or responsiveness to marketing campaigns.

Subsequent Components (PC4 to PC12): These components explain progressively smaller amounts of variance. For example, PC4 explains 10.30% of the variance, while PC12 explains essentially none: its eigenvalue is zero because the two marital-status dummies always sum to one, so after centering they are perfectly collinear.
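This collinearity is easy to verify directly (a quick sketch using the encoded data from the preprocessing step):

# The Partner and Single dummies always sum to 1, so they are perfectly anticorrelated
cor(df_encoded$Marital_StatusPartner, df_encoded$Marital_StatusSingle)  # exactly -1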

Considering these results, I decided to retain PC1 through PC4, which together explain about 69% of the variance.

Code for the 3D Scatter Plot

library(plotly)

# Take the first three principal components (PC1, PC2, PC3)
pca_data_3d <- as.data.frame(pca_result$x[, 1:3])

# Cluster labels from the two-cluster K-Means fit above
pca_data_3d$Cluster <- as.factor(km_result$cluster)

# Build a 3D scatter plot of PC1, PC2, and PC3
pca_3d_plot <- plot_ly(pca_data_3d, x = ~PC1, y = ~PC2, z = ~PC3, color = ~Cluster, 
                       colors = "Set2", marker = list(size = 5)) %>%
  add_markers() %>%
  layout(scene = list(
    xaxis = list(title = "Income/Spending (PC1)"),
    yaxis = list(title = "Family Size/Loyalty (PC2)"),
    zaxis = list(title = "Customer Engagement (PC3)")
  ),
  title = "3D Scatter Plot of PCA (PC1, PC2, PC3)"
  )


print(pca_3d_plot)

Key Insights from the Plot

Income/Spending (PC1):

Customers with higher PC1 values are high-income, high-spending customers. Customers with lower PC1 values are low-income, low-spending customers.

Family Size/Loyalty (PC2):

Customers with higher PC2 values are family-oriented or long-term loyal customers. Customers with lower PC2 values are single or newer customers.

Customer Engagement (PC3):

Customers with higher PC3 values are highly engaged or responsive to marketing campaigns. Customers with lower PC3 values are low-engagement customers.

Cluster Separation:

The clusters are well separated in the 3D space, indicating that the clustering algorithm has effectively grouped customers based on their behavior.

Using PCA for Clustering

Perform PCA and Retain PC1 to PC4

The principal components will now be used for clustering instead of the original features. This can help reduce noise and improve clustering performance.

# PCA on the scaled data
pca_result <- prcomp(features_scaled, center = TRUE, scale. = TRUE)

# the first four principal components (PC1 to PC4)
pca_data <- as.data.frame(pca_result$x[, 1:4])


head(pca_data)
##          PC1        PC2        PC3         PC4
## 1  2.7916659 -2.3933550 -0.7937757 -1.02702623
## 2 -1.8899101 -2.0794464 -0.2657500  0.37888782
## 3  1.0800238  0.8260433 -1.5778536  0.11975645
## 4 -2.0041943  0.5877798 -1.1174749 -0.54495631
## 5 -0.0105194  1.6396498  1.2958842  1.59216825
## 6  0.6926063  1.6493463  1.3224121  0.03686012

Perform CLARA Clustering on the PCA Data

# Defensive conversion (prcomp scores are already numeric, so this is effectively a no-op)
pca_data$PC1 <- as.numeric(pca_data$PC1)
pca_data$PC2 <- as.numeric(pca_data$PC2)
pca_data$PC3 <- as.numeric(pca_data$PC3)
pca_data$PC4 <- as.numeric(pca_data$PC4)

# Perform CLARA clustering on the PCA data (PC1 to PC4)
set.seed(123)  # For reproducibility
clara_result <- clara(pca_data[, 1:4], k = 4, samples = 50, sampsize = 1000)


pca_data$Cluster <- as.factor(clara_result$clustering)


head(pca_data)
##          PC1        PC2        PC3         PC4 Cluster
## 1  2.7916659 -2.3933550 -0.7937757 -1.02702623       1
## 2 -1.8899101 -2.0794464 -0.2657500  0.37888782       1
## 3  1.0800238  0.8260433 -1.5778536  0.11975645       2
## 4 -2.0041943  0.5877798 -1.1174749 -0.54495631       3
## 5 -0.0105194  1.6396498  1.2958842  1.59216825       4
## 6  0.6926063  1.6493463  1.3224121  0.03686012       4

Visualize the Clusters

#  KMeans clustering on the PCA data (PC1 and PC2)
set.seed(123)  # For reproducibility
km_result <- eclust(pca_data[, 1:2], FUNcluster = "kmeans", k = 4, nstart = 20)  # 4 clusters

#  KMeans clustering on the PCA data (PC3 and PC4)
set.seed(123)  # For reproducibility
km_result_pc3_pc4 <- eclust(pca_data[, 3:4], FUNcluster = "kmeans", k = 4, nstart = 20)  # 4 clusters

# Add cluster assignments to the PCA data
pca_data$Cluster_PC3_PC4 <- as.factor(km_result_pc3_pc4$cluster)

# View the clustered PCA data
head(pca_data)
##          PC1        PC2        PC3         PC4 Cluster Cluster_PC3_PC4
## 1  2.7916659 -2.3933550 -0.7937757 -1.02702623       1               3
## 2 -1.8899101 -2.0794464 -0.2657500  0.37888782       1               2
## 3  1.0800238  0.8260433 -1.5778536  0.11975645       2               2
## 4 -2.0041943  0.5877798 -1.1174749 -0.54495631       3               3
## 5 -0.0105194  1.6396498  1.2958842  1.59216825       4               4
## 6  0.6926063  1.6493463  1.3224121  0.03686012       4               4

Key Insights from the Plot

Cluster Separation: The clusters are well-separated, indicating that the KMeans algorithm has effectively grouped customers based on their behavior. The confidence ellipses show that the clusters have distinct boundaries.

Income/Spending (PC1): Customers with higher PC1 values (right side of the plot) are likely high-income, high-spending customers. Customers with lower PC1 values (left side of the plot) are likely low-income, low-spending customers.

Family Size/Loyalty (PC2): Customers with higher PC2 values (top of the plot) are likely family-oriented or long-term loyal customers. Customers with lower PC2 values (bottom of the plot) are likely single or newer customers.

Customer Engagement (PC3): Customers with higher PC3 values are more responsive to marketing campaigns or highly engaged. Customers with lower PC3 values are less engaged and may require re-engagement strategies.

Purchase Frequency/Preferences (PC4): Customers with higher PC4 values are frequent purchasers or have specific product preferences. Customers with lower PC4 values are infrequent purchasers or have general preferences.

Interpret the Clusters

# Add the K-Means (PC1 and PC2) cluster assignments to the original dataset
df$Cluster <- as.factor(km_result$cluster)

# View the original data with cluster assignments
head(df)
##      Education Marital_Status Income Kids Days_is_client Recency Expenses
## 1     Graduate         Single  58138    0            663      58     1617
## 2     Graduate         Single  46344    2            113      38       27
## 3     Graduate        Partner  71613    0            312      26      776
## 4     Graduate        Partner  26646    1            139      26       53
## 5 Postgraduate        Partner  58293    1            161      94      422
## 6 Postgraduate        Partner  62513    1            293      16      716
##   TotalNumPurchases TotalAcceptedCmp Complain Response Cluster
## 1                25                0        0        1       1
## 2                 6                0        0        0       4
## 3                21                0        0        0       3
## 4                 8                0        0        0       2
## 5                19                0        0        0       2
## 6                22                0        0        0       3
# Add cluster assignments to the PCA data
pca_data$Cluster_PC3_PC4 <- as.factor(km_result_pc3_pc4$cluster)


head(pca_data)
##          PC1        PC2        PC3         PC4 Cluster Cluster_PC3_PC4
## 1  2.7916659 -2.3933550 -0.7937757 -1.02702623       1               3
## 2 -1.8899101 -2.0794464 -0.2657500  0.37888782       1               2
## 3  1.0800238  0.8260433 -1.5778536  0.11975645       2               2
## 4 -2.0041943  0.5877798 -1.1174749 -0.54495631       3               3
## 5 -0.0105194  1.6396498  1.2958842  1.59216825       4               4
## 6  0.6926063  1.6493463  1.3224121  0.03686012       4               4

Calculate Summary Statistics for Each Cluster

cluster_summary <- df %>%
  group_by(Cluster) %>%
  summarise(across(where(is.numeric), list(mean = mean, median = median)))


print(cluster_summary)
## # A tibble: 4 × 19
##   Cluster Income_mean Income_median Kids_mean Kids_median Days_is_client_mean
##   <fct>         <dbl>         <dbl>     <dbl>       <dbl>               <dbl>
## 1 1            68633.        69263      0.517           0                389.
## 2 2            37600.        37351      1.25            1                325.
## 3 3            68944.        69438.     0.654           1                384.
## 4 4            37218.        36997      1.25            1                331.
## # ℹ 13 more variables: Days_is_client_median <dbl>, Recency_mean <dbl>,
## #   Recency_median <dbl>, Expenses_mean <dbl>, Expenses_median <dbl>,
## #   TotalNumPurchases_mean <dbl>, TotalNumPurchases_median <dbl>,
## #   TotalAcceptedCmp_mean <dbl>, TotalAcceptedCmp_median <dbl>,
## #   Complain_mean <dbl>, Complain_median <dbl>, Response_mean <dbl>,
## #   Response_median <dbl>

Insights from the Cluster Summary

1. Clustering on PC1 and PC2 (Income/Spending vs. Family Size/Loyalty):

Clusters 1 and 3 are high-income, high-spending segments (mean income of roughly $68,600 and $68,900), separated mainly along PC2 (family size and tenure), while Clusters 2 and 4 are lower-income, family-oriented segments (mean income of roughly $37,600 and $37,200, with about 1.25 kids on average). Insights: retain the high-income clusters with loyalty programs and premium offers; target the lower-income, family-oriented clusters with family deals, bulk discounts, and budget-friendly products.

2. Clustering on PC3 and PC4 (Engagement vs. Purchase Frequency):

Cluster 1: Highly engaged, frequent purchasers. Insight: Focus on upselling and cross-selling to maximize value. Cluster 2: Frequent purchasers with low campaign responsiveness. Insight: Tailor marketing efforts to their preferences. Cluster 3: Low-engagement, infrequent purchasers. Insight: Develop re-engagement campaigns (e.g., personalized offers). Cluster 4: Highly engaged but infrequent purchasers. Insight: Convert engagement into purchases with targeted promotions.

Silhouette Analysis for the PCA-Based K-Means Clusterings

# Silhouette for the K-Means (PC1/PC2) solution, using the four retained principal components
silhouette_scores <- silhouette(km_result$cluster, dist(pca_data[, 1:4]))

# Plot the silhouette scores
silhouette_plot <- fviz_silhouette(silhouette_scores, 
                                   palette = "jco", 
                                   ggtheme = theme_minimal()) +
  labs(title = "Silhouette Plot for KMeans Clustering",  # Update title
       x = "Silhouette Width",
       y = "Cluster") +
  theme(plot.title = element_text(hjust = 0.5))
##   cluster size ave.sil.width
## 1       1  315          0.24
## 2       2  730          0.35
## 3       3  532          0.21
## 4       4  395          0.31
print(silhouette_plot)

# Silhouette for the K-Means (PC3/PC4) solution, using the four retained principal components
silhouette_scores <- silhouette(km_result_pc3_pc4$cluster, dist(pca_data[, 1:4]))


silhouette_plot <- fviz_silhouette(silhouette_scores, 
                                   palette = "jco", 
                                   ggtheme = theme_minimal()) +
  labs(title = "Silhouette Plot for km_result_pc3_pc4",  # Update title
       x = "Silhouette Width",
       y = "Cluster") +
  theme(plot.title = element_text(hjust = 0.5))
##   cluster size ave.sil.width
## 1       1  244          0.24
## 2       2  829          0.23
## 3       3  362          0.14
## 4       4  537          0.31
print(silhouette_plot)

Key Insights from the Silhouette Plots

Overall Clustering Quality:

The average silhouette width (red dashed line) is moderate, indicating acceptable but not strong overall clustering quality. The lower silhouette scores in some clusters suggest that the clustering could still be improved.

Cluster Separation: clusters with higher silhouette scores (e.g., Clusters 2 and 4 in the first plot) are comparatively well separated and distinct, while clusters with lower scores (e.g., Cluster 3 in both plots) may overlap or be less distinct.

Key Takeaways:

PCA improved clustering by focusing on the most significant patterns (income, spending, family size, loyalty, engagement, and purchase frequency). Four distinct customer segments were identified, each requiring tailored marketing strategies.

Actionable insights for stakeholders: retain high-value customers with premium offers; encourage spending among budget-conscious customers with discounts; re-engage low-engagement customers with personalized campaigns.