Group Members
1. Benjamin Law Seng Yuan 23086626
2. Ang Eng Hooi 23086408
3. Clement Lim Kee Min 23121442
4. Meng Hui Dan 23104917
5. Shi Yan 24057850
Customer Analysis involves a thorough examination of a company’s optimal customer profiles. This analysis facilitates a deeper understanding of customers, enabling businesses to tailor products to meet the distinct needs, behaviors, and concerns of various customer types.
By conducting a Customer Analysis, businesses can refine their products based on the preferences of specific customer segments. Rather than allocating resources to market a new product to the entire customer database, companies can identify the segments most likely to be interested in the product. Subsequently, targeted marketing efforts can be directed toward those particular segments, optimizing resource utilization and increasing the likelihood of successful product adoption.
This analysis explores customer behavior patterns in a retail context, focusing on understanding customer segmentation and channel preferences. The dataset contains customer purchase behavior across different channels (Store, Web, Catalog) along with demographic information and product preferences. Understanding these patterns is crucial for developing targeted marketing strategies and optimizing channel operations.
# Install required packages if not already installed
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")
if (!requireNamespace("corrplot", quietly = TRUE)) install.packages("corrplot")
if (!requireNamespace("lattice", quietly = TRUE)) install.packages("lattice")
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
if (!requireNamespace("caret", quietly = TRUE)) install.packages("caret")
if (!requireNamespace("car", quietly = TRUE)) install.packages("car")
if (!requireNamespace("rpart", quietly = TRUE)) install.packages("rpart")
if (!requireNamespace("rpart.plot", quietly = TRUE)) install.packages("rpart.plot")
if (!requireNamespace("tikzDevice", quietly = TRUE)) install.packages("tikzDevice")
if (!requireNamespace("cluster", quietly = TRUE)) install.packages("cluster")
if (!requireNamespace("factoextra", quietly = TRUE)) install.packages("factoextra")
if (!requireNamespace("naniar", quietly = TRUE)) install.packages("naniar")
if (!requireNamespace("reshape2", quietly = TRUE))install.packages("reshape2")
if (!requireNamespace("data.table", quietly = TRUE))install.packages("data.table")
# Load libraries
library(tidyverse) #A collection of R packages for data manipulation, visualization, and analysis
library(randomForest) #For building random forest models for classification and regression
library(corrplot) #For visualizing correlation matrices
library(lattice) #For creating high-level data visualizations
library(ggplot2) #Part of tidyverse, used for advanced data visualization with a layered grammar of graphics
library(caret) #For machine learning workflows, including data preprocessing, model training, and evaluation
library(car) #Tools for regression analysis, diagnostics, and multicollinearity checks
library(rpart) #For building classification and regression trees
library(rpart.plot) #For visualizing rpart decision trees
library(tikzDevice) #Exports R plots as TikZ/LaTeX figures for integration into LaTeX documents
library(cluster) #Tools for cluster analysis
library(factoextra) #Simplifies visualization and interpretation of clustering and PCA results
library(naniar) #For handling, visualizing, and analyzing missing data
library(reshape2) #For reshaping data between wide and long formats
library(data.table) #High-performance data manipulation and aggregation.
# Read the dataset
customer_data <- read.csv("customer_segmentation.csv")
# Check initial structure and summary
cat("Dataset Dimensions: ", dim(customer_data), "\n")
## Dataset Dimensions: 2240 29
str(customer_data)
## 'data.frame': 2240 obs. of 29 variables:
## $ ID : int 5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
## $ Year_Birth : int 1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
## $ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
## $ Marital_Status : chr "Single" "Single" "Together" "Together" ...
## $ Income : int 58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
## $ Kidhome : int 0 1 0 1 1 0 0 1 1 1 ...
## $ Teenhome : int 0 1 0 0 0 1 1 0 0 1 ...
## $ Dt_Customer : chr "04-09-2012" "08-03-2014" "21-08-2013" "10-02-2014" ...
## $ Recency : int 58 38 26 26 94 16 34 32 19 68 ...
## $ MntWines : int 635 11 426 11 173 520 235 76 14 28 ...
## $ MntFruits : int 88 1 49 4 43 42 65 10 0 0 ...
## $ MntMeatProducts : int 546 6 127 20 118 98 164 56 24 6 ...
## $ MntFishProducts : int 172 2 111 10 46 0 50 3 3 1 ...
## $ MntSweetProducts : int 88 1 21 3 27 42 49 1 3 1 ...
## $ MntGoldProds : int 88 6 42 5 15 14 27 23 2 13 ...
## $ NumDealsPurchases : int 3 2 1 2 5 2 4 2 1 1 ...
## $ NumWebPurchases : int 8 1 8 2 5 6 7 4 3 1 ...
## $ NumCatalogPurchases: int 10 1 2 0 3 4 3 0 0 0 ...
## $ NumStorePurchases : int 4 2 10 4 6 10 7 4 2 0 ...
## $ NumWebVisitsMonth : int 7 5 4 6 5 6 6 8 9 20 ...
## $ AcceptedCmp3 : int 0 0 0 0 0 0 0 0 0 1 ...
## $ AcceptedCmp4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp5 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Z_CostContact : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Z_Revenue : int 11 11 11 11 11 11 11 11 11 11 ...
## $ Response : int 1 0 0 0 0 0 0 0 1 0 ...
summary(customer_data)
## ID Year_Birth Education Marital_Status
## Min. : 0 Min. :1893 Length:2240 Length:2240
## 1st Qu.: 2828 1st Qu.:1959 Class :character Class :character
## Median : 5458 Median :1970 Mode :character Mode :character
## Mean : 5592 Mean :1969
## 3rd Qu.: 8428 3rd Qu.:1977
## Max. :11191 Max. :1996
##
## Income Kidhome Teenhome Dt_Customer
## Min. : 1730 Min. :0.0000 Min. :0.0000 Length:2240
## 1st Qu.: 35303 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median : 51382 Median :0.0000 Median :0.0000 Mode :character
## Mean : 52247 Mean :0.4442 Mean :0.5062
## 3rd Qu.: 68522 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :666666 Max. :2.0000 Max. :2.0000
## NA's :24
## Recency MntWines MntFruits MntMeatProducts
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.:24.00 1st Qu.: 23.75 1st Qu.: 1.0 1st Qu.: 16.0
## Median :49.00 Median : 173.50 Median : 8.0 Median : 67.0
## Mean :49.11 Mean : 303.94 Mean : 26.3 Mean : 166.9
## 3rd Qu.:74.00 3rd Qu.: 504.25 3rd Qu.: 33.0 3rd Qu.: 232.0
## Max. :99.00 Max. :1493.00 Max. :199.0 Max. :1725.0
##
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000
## Median : 12.00 Median : 8.00 Median : 24.00 Median : 2.000
## Mean : 37.53 Mean : 27.06 Mean : 44.02 Mean : 2.325
## 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000
## Max. :259.00 Max. :263.00 Max. :362.00 Max. :15.000
##
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.00 1st Qu.: 3.000
## Median : 4.000 Median : 2.000 Median : 5.00 Median : 6.000
## Mean : 4.085 Mean : 2.662 Mean : 5.79 Mean : 5.317
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.00 3rd Qu.: 7.000
## Max. :27.000 Max. :28.000 Max. :13.00 Max. :20.000
##
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.07277 Mean :0.07455 Mean :0.07277 Mean :0.06429
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## AcceptedCmp2 Complain Z_CostContact Z_Revenue
## Min. :0.00000 Min. :0.000000 Min. :3 Min. :11
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:3 1st Qu.:11
## Median :0.00000 Median :0.000000 Median :3 Median :11
## Mean :0.01339 Mean :0.009375 Mean :3 Mean :11
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:3 3rd Qu.:11
## Max. :1.00000 Max. :1.000000 Max. :3 Max. :11
##
## Response
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1491
## 3rd Qu.:0.0000
## Max. :1.0000
##
# Count missing values in each column
missing_summary <- colSums(is.na(customer_data))
cat("Missing Values Summary:\n")
## Missing Values Summary:
print(missing_summary)
## ID Year_Birth Education Marital_Status
## 0 0 0 0
## Income Kidhome Teenhome Dt_Customer
## 24 0 0 0
## Recency MntWines MntFruits MntMeatProducts
## 0 0 0 0
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## 0 0 0 0
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## 0 0 0 0
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## 0 0 0 0
## AcceptedCmp2 Complain Z_CostContact Z_Revenue
## 0 0 0 0
## Response
## 0
# Remove rows with missing values
df <- na.omit(customer_data)
# Verify no missing values remain
cat("Missing Values After Cleaning:\n")
## Missing Values After Cleaning:
colSums(is.na(df))
## ID Year_Birth Education Marital_Status
## 0 0 0 0
## Income Kidhome Teenhome Dt_Customer
## 0 0 0 0
## Recency MntWines MntFruits MntMeatProducts
## 0 0 0 0
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## 0 0 0 0
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## 0 0 0 0
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## 0 0 0 0
## AcceptedCmp2 Complain Z_CostContact Z_Revenue
## 0 0 0 0
## Response
## 0
# Visualize missing values
gg_miss_var(customer_data) +
theme_minimal() +
labs(title = "Missing Values by Variable",
x = "Variables",
y = "Number of Missing Values")
# Check for duplicate rows
duplicates <- df[duplicated(df), ]
cat("Number of Duplicate Rows: ", nrow(duplicates), "\n")
## Number of Duplicate Rows: 0
# Remove duplicates
df <- df[!duplicated(df), ]
cat("Dataset Dimensions After Removing Duplicates: ", dim(df), "\n")
## Dataset Dimensions After Removing Duplicates: 2216 29
# Identify outliers
numerical_cols <- names(df)[sapply(df, is.numeric)]
categorical_cols <- names(df)[sapply(df, is.character)]
outlier_summary <- data.frame(Variable = character(), Lower_Bound = numeric(), Upper_Bound = numeric(),
Outliers_Count = integer(), stringsAsFactors = FALSE)
for (col in numerical_cols) {
if (sum(!is.na(df[[col]])) > 0) {
Q1 <- quantile(df[[col]], 0.25, na.rm = TRUE)
Q3 <- quantile(df[[col]], 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outlier_count <- sum(df[[col]] < lower_bound | df[[col]] > upper_bound, na.rm = TRUE)
outlier_summary <- rbind(outlier_summary, data.frame(
Variable = col,
Lower_Bound = lower_bound,
Upper_Bound = upper_bound,
Outliers_Count = outlier_count
))
}
}
cat("Outlier Summary:\n")
## Outlier Summary:
print(outlier_summary)
## Variable Lower_Bound Upper_Bound Outliers_Count
## 25% ID -5595.750 16832.250 0
## 25%1 Year_Birth 1932.000 2004.000 3
## 25%2 Income -14525.500 118350.500 8
## 25%3 Kidhome -1.500 2.500 0
## 25%4 Teenhome -1.500 2.500 0
## 25%5 Recency -51.000 149.000 0
## 25%6 MntWines -697.500 1226.500 35
## 25%7 MntFruits -44.500 79.500 246
## 25%8 MntMeatProducts -308.375 556.625 174
## 25%9 MntFishProducts -67.500 120.500 222
## 25%10 MntSweetProducts -47.000 81.000 246
## 25%11 MntGoldProds -61.500 126.500 205
## 25%12 NumDealsPurchases -2.000 6.000 84
## 25%13 NumWebPurchases -4.000 12.000 3
## 25%14 NumCatalogPurchases -6.000 10.000 23
## 25%15 NumStorePurchases -4.500 15.500 0
## 25%16 NumWebVisitsMonth -3.000 13.000 8
## 25%17 AcceptedCmp3 0.000 0.000 163
## 25%18 AcceptedCmp4 0.000 0.000 164
## 25%19 AcceptedCmp5 0.000 0.000 162
## 25%20 AcceptedCmp1 0.000 0.000 142
## 25%21 AcceptedCmp2 0.000 0.000 30
## 25%22 Complain 0.000 0.000 21
## 25%23 Z_CostContact 3.000 3.000 0
## 25%24 Z_Revenue 11.000 11.000 0
## 25%25 Response 0.000 0.000 333
# Cap and floor outliers
cap_floor_outliers <- function(df, cols) {
for (col in cols) {
Q1 <- quantile(df[[col]], 0.25, na.rm = TRUE)
Q3 <- quantile(df[[col]], 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
df[[col]] <- ifelse(df[[col]] < lower_bound, lower_bound, df[[col]])
df[[col]] <- ifelse(df[[col]] > upper_bound, upper_bound, df[[col]])
}
return(df)
}
# Apply the function to the dataset
df <- cap_floor_outliers(df, numerical_cols)
Check whether any outliers remain after applying the cap-and-floor method:
# Outlier Summary after cap floor method
outlier_summary <- data.frame(Variable = character(), Lower_Bound = numeric(), Upper_Bound = numeric(),
Outliers_Count = integer(), stringsAsFactors = FALSE)
for (col in numerical_cols) {
if (sum(!is.na(df[[col]])) > 0) {
Q1 <- quantile(df[[col]], 0.25, na.rm = TRUE)
Q3 <- quantile(df[[col]], 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outlier_count <- sum(df[[col]] < lower_bound | df[[col]] > upper_bound, na.rm = TRUE)
outlier_summary <- rbind(outlier_summary, data.frame(
Variable = col,
Lower_Bound = lower_bound,
Upper_Bound = upper_bound,
Outliers_Count = outlier_count
))
}
}
cat("Outlier Summary:\n")
## Outlier Summary:
print(outlier_summary)
## Variable Lower_Bound Upper_Bound Outliers_Count
## 25% ID -5595.750 16832.250 0
## 25%1 Year_Birth 1932.000 2004.000 0
## 25%2 Income -14525.500 118350.500 0
## 25%3 Kidhome -1.500 2.500 0
## 25%4 Teenhome -1.500 2.500 0
## 25%5 Recency -51.000 149.000 0
## 25%6 MntWines -697.500 1226.500 0
## 25%7 MntFruits -44.500 79.500 0
## 25%8 MntMeatProducts -308.375 556.625 0
## 25%9 MntFishProducts -67.500 120.500 0
## 25%10 MntSweetProducts -47.000 81.000 0
## 25%11 MntGoldProds -61.500 126.500 0
## 25%12 NumDealsPurchases -2.000 6.000 0
## 25%13 NumWebPurchases -4.000 12.000 0
## 25%14 NumCatalogPurchases -6.000 10.000 0
## 25%15 NumStorePurchases -4.500 15.500 0
## 25%16 NumWebVisitsMonth -3.000 13.000 0
## 25%17 AcceptedCmp3 0.000 0.000 0
## 25%18 AcceptedCmp4 0.000 0.000 0
## 25%19 AcceptedCmp5 0.000 0.000 0
## 25%20 AcceptedCmp1 0.000 0.000 0
## 25%21 AcceptedCmp2 0.000 0.000 0
## 25%22 Complain 0.000 0.000 0
## 25%23 Z_CostContact 3.000 3.000 0
## 25%24 Z_Revenue 11.000 11.000 0
## 25%25 Response 0.000 0.000 0
# Summary of the cleaned dataset
summary(df)
## ID Year_Birth Education Marital_Status
## Min. : 0 Min. :1932 Length:2216 Length:2216
## 1st Qu.: 2815 1st Qu.:1959 Class :character Class :character
## Median : 5458 Median :1970 Mode :character Mode :character
## Mean : 5588 Mean :1969
## 3rd Qu.: 8422 3rd Qu.:1977
## Max. :11191 Max. :1996
## Income Kidhome Teenhome Dt_Customer
## Min. : 1730 Min. :0.0000 Min. :0.0000 Length:2216
## 1st Qu.: 35303 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median : 51382 Median :0.0000 Median :0.0000 Mode :character
## Mean : 51874 Mean :0.4418 Mean :0.5054
## 3rd Qu.: 68522 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :118351 Max. :2.0000 Max. :2.0000
## Recency MntWines MntFruits MntMeatProducts
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.:24.00 1st Qu.: 24.0 1st Qu.: 2.00 1st Qu.: 16.0
## Median :49.00 Median : 174.5 Median : 8.00 Median : 68.0
## Mean :49.01 Mean : 303.3 Mean :21.57 Mean :151.3
## 3rd Qu.:74.00 3rd Qu.: 505.0 3rd Qu.:33.00 3rd Qu.:232.2
## Max. :99.00 Max. :1226.5 Max. :79.50 Max. :556.6
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. :0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.0 1st Qu.:1.000
## Median : 12.00 Median : 8.00 Median : 24.5 Median :2.000
## Mean : 32.19 Mean :21.95 Mean : 39.5 Mean :2.216
## 3rd Qu.: 50.00 3rd Qu.:33.00 3rd Qu.: 56.0 3rd Qu.:3.000
## Max. :120.50 Max. :81.00 Max. :126.5 Max. :6.000
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.000 1st Qu.: 3.000
## Median : 4.000 Median : 2.000 Median : 5.000 Median : 6.000
## Mean : 4.068 Mean : 2.633 Mean : 5.801 Mean : 5.301
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.000 3rd Qu.: 7.000
## Max. :12.000 Max. :10.000 Max. :13.000 Max. :13.000
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2
## Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
## 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Median :0 Median :0 Median :0 Median :0 Median :0
## Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
## Complain Z_CostContact Z_Revenue Response
## Min. :0 Min. :3 Min. :11 Min. :0
## 1st Qu.:0 1st Qu.:3 1st Qu.:11 1st Qu.:0
## Median :0 Median :3 Median :11 Median :0
## Mean :0 Mean :3 Mean :11 Mean :0
## 3rd Qu.:0 3rd Qu.:3 3rd Qu.:11 3rd Qu.:0
## Max. :0 Max. :3 Max. :11 Max. :0
cat("Final Dataset Dimensions: ", dim(df), "\n")
## Final Dataset Dimensions: 2216 29
# Compare dataset dimensions before and after cleaning
before_cleaning <- nrow(customer_data)
after_cleaning <- nrow(df)
numeric_data <- customer_data %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
cleaning_summary <- data.frame(
Step = c("Before Cleaning", "After Cleaning"),
Rows = c(before_cleaning, after_cleaning)
)
ggplot(cleaning_summary, aes(x = Step, y = Rows, fill = Step)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(title = "Dataset Size Before and After Cleaning",
x = "Step",
y = "Number of Rows")
# Visualize numeric variable distributions
ggplot(numeric_data, aes(x = Value)) +
geom_histogram(fill = "steelblue", bins = 30) +
facet_wrap(~Variable, scales = "free") +
theme_minimal() +
labs(title = "Distribution of Numeric Variables",
x = "Value",
y = "Frequency")
# Product category analysis
purchase_data <- df %>%
select(MntWines, MntFruits, MntMeatProducts,
MntFishProducts, MntSweetProducts, MntGoldProds) %>%
gather(key = "Category", value = "Amount")
# Spending by category
p1 <- ggplot(purchase_data, aes(x = reorder(Category, -Amount), y = Amount)) +
geom_boxplot(fill = "lightblue") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Purchase Amounts by Category")
# Channel usage patterns
channel_summary <- df %>%
summarise(
Store = sum(NumStorePurchases),
Web = sum(NumWebPurchases),
Catalog = sum(NumCatalogPurchases)
) %>%
gather(key = "Channel", value = "Total_Purchases")
p2 <- ggplot(channel_summary, aes(x = reorder(Channel, -Total_Purchases),
y = Total_Purchases)) +
geom_bar(stat = "identity", fill = "steelblue") +
theme_minimal() +
labs(title = "Total Purchases by Channel")
gridExtra::grid.arrange(p1, p2, ncol = 2)
# Convert categorical variables to factors
df$Education <- as.factor(df$Education)
df$Marital_Status <- as.factor(df$Marital_Status)
# Frequency of categories
education_freq <- table(df$Education)
maristatus_freq <- table(df$Marital_Status)
# Calculate an appropriate y-axis limit
y_max <- max(education_freq) + 100
y1_max <- max(maristatus_freq) + 200
# Visualize using bar plot with adjusted y-axis
barplot(education_freq, main = "Education Distribution", col = "lightblue", ylim = c(0, y_max))
barplot(table(df$Marital_Status),main= "Marital Status",col= "lightblue", ylim= c(0, y1_max))
# Check the structure of the modified dataset
str(df)
## 'data.frame': 2216 obs. of 29 variables:
## $ ID : int 5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
## $ Year_Birth : num 1957 1954 1965 1984 1981 ...
## $ Education : Factor w/ 5 levels "2n Cycle","Basic",..: 3 3 3 3 5 4 3 5 5 5 ...
## $ Marital_Status : Factor w/ 8 levels "Absurd","Alone",..: 5 5 6 6 4 6 3 4 6 6 ...
## $ Income : num 58138 46344 71613 26646 58293 ...
## $ Kidhome : int 0 1 0 1 1 0 0 1 1 1 ...
## $ Teenhome : int 0 1 0 0 0 1 1 0 0 1 ...
## $ Dt_Customer : chr "04-09-2012" "08-03-2014" "21-08-2013" "10-02-2014" ...
## $ Recency : int 58 38 26 26 94 16 34 32 19 68 ...
## $ MntWines : num 635 11 426 11 173 520 235 76 14 28 ...
## $ MntFruits : num 79.5 1 49 4 43 42 65 10 0 0 ...
## $ MntMeatProducts : num 546 6 127 20 118 98 164 56 24 6 ...
## $ MntFishProducts : num 120 2 111 10 46 ...
## $ MntSweetProducts : num 81 1 21 3 27 42 49 1 3 1 ...
## $ MntGoldProds : num 88 6 42 5 15 14 27 23 2 13 ...
## $ NumDealsPurchases : num 3 2 1 2 5 2 4 2 1 1 ...
## $ NumWebPurchases : num 8 1 8 2 5 6 7 4 3 1 ...
## $ NumCatalogPurchases: num 10 1 2 0 3 4 3 0 0 0 ...
## $ NumStorePurchases : int 4 2 10 4 6 10 7 4 2 0 ...
## $ NumWebVisitsMonth : num 7 5 4 6 5 6 6 8 9 13 ...
## $ AcceptedCmp3 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp4 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp5 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Complain : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Z_CostContact : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Z_Revenue : int 11 11 11 11 11 11 11 11 11 11 ...
## $ Response : num 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "na.action")= 'omit' Named int [1:24] 11 28 44 49 59 72 91 92 93 129 ...
## ..- attr(*, "names")= chr [1:24] "11" "28" "44" "49" ...
# Create new features for target variable of Decision Tree
df$Preferred_Channel <- apply(df[, c("NumStorePurchases", "NumWebPurchases", "NumCatalogPurchases")], 1,
function(x) c("Store", "Web", "Catalog")[which.max(x)])
df$Preferred_Channel <- as.factor(df$Preferred_Channel)
df$Preferred_Channel <- as.numeric(as.factor(df$Preferred_Channel))
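Because as.factor() orders levels alphabetically, the numeric coding here is 1 = Catalog, 2 = Store and 3 = Web. A minimal sketch to double-check this mapping and the class balance, recomputing the text labels for comparison:
# Recompute the channel labels and cross-tabulate them against the numeric codes
channel_labels <- apply(df[, c("NumStorePurchases", "NumWebPurchases", "NumCatalogPurchases")], 1,
                        function(x) c("Store", "Web", "Catalog")[which.max(x)])
table(Label = channel_labels, Code = df$Preferred_Channel)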
#Finding correlation using Pearson Coefficient
# Select numerical columns
numerical_columns <- df[, sapply(df, is.numeric)]
# Calculate Pearson correlation for each numerical variable with 'Preferred Channel'
correlations <- sapply(numerical_columns, function(x) cor(x, df$Preferred_Channel, use = "complete.obs", method = "pearson"))
# Calculate the correlation matrix
correlation_matrix <- cor(numerical_columns, use = "complete.obs", method = "pearson")
# Melt the correlation matrix into a long format for ggplot
correlation_melted <- melt(correlation_matrix)
# Create the heatmap using ggplot2
heatmap <- ggplot(data = correlation_melted, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0,
limit = c(-1, 1), space = "Lab",
name = "Pearson\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
labs(title = "Correlation Heatmap", x = "Variables", y = "Variables")
# Print the heatmap
print(heatmap)
The heatmap groups the variables into spending patterns (MntWines, MntMeatProducts, etc.), purchasing channels, and other demographics. Income is positively correlated with the spending variables, suggesting that higher-income customers tend to spend more. Note: the grey boxes are binary features for which no correlation is shown.
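The correlations vector computed above (each numeric variable against the encoded Preferred_Channel) can also be inspected directly; a minimal sketch that sorts it by absolute strength, dropping the NA entries produced by constant (capped) binary columns:
# Rank variables by the absolute strength of their correlation with Preferred_Channel
sort(abs(correlations[!is.na(correlations)]), decreasing = TRUE)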
#Data Partition
#Split the dataset into 70% training and 30% testing sets
set.seed(123)
sample <- sample(1:nrow(df), 0.7 * nrow(df))
train_data <- df[sample, ]
test_data <- df[-sample, ]
# Consumer behavior target variable
df$Preferred_Channel <- as.factor(df$Preferred_Channel)
index_channel <- createDataPartition(df$Preferred_Channel, p = 0.7, list = FALSE)
train_channel <- df[index_channel, ]
test_channel <- df[-index_channel, ]
#Random Forest Model - Consumer Behavior Analysis
rf_channel <- randomForest(Preferred_Channel ~ Income + Kidhome + Teenhome +
MntFruits + MntMeatProducts + MntFishProducts +
MntSweetProducts + MntGoldProds +
NumDealsPurchases + NumWebPurchases +
NumCatalogPurchases + NumStorePurchases,
data = train_channel, importance = TRUE, ntree = 100)
# Feature Importance
importance_channel <- importance(rf_channel)
varImpPlot(rf_channel)
From the feature importance plot, NumStorePurchases, NumWebPurchases and NumCatalogPurchases are highly important: the model's overall accuracy would decrease if these features were removed during training (Mean Decrease Accuracy), and the Mean Decrease Gini shows that these three features also contribute most to the homogeneity of the decision trees in the random forest.
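To rank the predictors numerically rather than reading them off the plot, the importance matrix stored in importance_channel can be ordered by Mean Decrease Gini; a minimal sketch:
# Sort the random forest importance table by MeanDecreaseGini (largest first)
importance_df <- as.data.frame(importance_channel)
importance_df[order(-importance_df$MeanDecreaseGini), c("MeanDecreaseAccuracy", "MeanDecreaseGini")]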
#Decision Tree
tree_channel <- rpart(Preferred_Channel ~ NumStorePurchases + NumWebPurchases + NumCatalogPurchases ,
data = train_data, method = "class")
# Decision Tree Visualization
rpart.plot(tree_channel,
main = "Channel Preference Analysis Decision Tree",
type = 2,
extra = 102,
tweak = 2
)
The fitted tree splits on thresholds such as NumCatalogPurchases >= 8 and NumWebPurchases >= 8 when assigning the preferred channel.
# Predict channel preference on the test set and tabulate against the actual labels
pred_channel <- predict(tree_channel, test_data, type = "class")
conf_matrix_channel <- table(Predicted = pred_channel, Actual = test_data$Preferred_Channel)
print(conf_matrix_channel)
## Actual
## Predicted 1 2 3
## 1 31 0 0
## 2 5 515 3
## 3 1 3 107
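As a quick check on the tree, the overall accuracy can be computed from the confusion matrix above (roughly 0.98 given the counts shown); a minimal sketch:
# Overall accuracy: correctly classified test cases divided by all test cases
accuracy_channel <- sum(diag(conf_matrix_channel)) / sum(conf_matrix_channel)
cat("Decision Tree Accuracy:", round(accuracy_channel, 4), "\n")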
# Select and scale features
features <- df %>%
select(Income, Recency, MntWines, MntFruits, MntMeatProducts,
MntFishProducts, MntSweetProducts, MntGoldProds,
NumDealsPurchases, NumWebPurchases, NumCatalogPurchases,
NumStorePurchases, NumWebVisitsMonth)
scaled_features <- scale(features)
# Correlation analysis
corrplot(cor(features),
method = "color",
type = "upper",
order = "hclust",
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
diag = FALSE)
Feature Scaling before K-means Clustering
Before fitting the model, one further step is needed. K-means is sensitive to variables measured in incomparable units, which can lead to misleading results. To ensure all variables are treated equally, they need to be transformed onto a similar scale so that they can be compared correctly by the distance metric.
There are different ways to do this. The most common is standardization, which consists of subtracting the mean from each feature value and then dividing by its standard deviation. This yields features with a mean of 0 and a standard deviation of 1. The variables can be scaled with the scale() function; since it returns a matrix, the code is cleaner in a base-R style than a tidyverse style.
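As a small sanity check, standardizing a single column by hand should reproduce the corresponding column of scaled_features; a minimal sketch using Income:
# Manual standardization of Income: subtract the mean, divide by the standard deviation
income_z <- (features$Income - mean(features$Income)) / sd(features$Income)
all.equal(as.numeric(scaled_features[, "Income"]), income_z)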
# Calculate metrics for different k values
set.seed(333)
# Calculate WSS for k=1 to k=10
set.seed(333)
k_values <- 1:10
wss_values <- sapply(k_values, function(k) {
kmeans(scaled_features, centers = k, nstart = 25)$tot.withinss
})
# Function to calculate elbow point using maximum curvature
find_elbow_point <- function(x, y) {
# Calculate distances from point to line
nPoints <- length(x)
allCoords <- cbind(x, y)
firstPoint <- allCoords[1,]
lineVec <- allCoords[nPoints,] - firstPoint
lineVecN <- lineVec / sqrt(sum(lineVec * lineVec))
# Vector from point to first point
vecFromFirst <- t(apply(allCoords, 1, function(p) p - firstPoint))
# Distance from point to line
scalarProduct <- vecFromFirst %*% lineVecN
vecFromFirstParallel <- outer(as.vector(scalarProduct), lineVecN)
vecToLine <- vecFromFirst - vecFromFirstParallel
distToLine <- sqrt(rowSums(vecToLine * vecToLine))
# Return the elbow point
return(which.max(distToLine))
}
# Find elbow point
elbow_k <- find_elbow_point(k_values, wss_values)
# Create data frame for plotting
plot_data <- data.frame(
k = k_values,
wss = wss_values
)
# Calculate coordinates for annotation
x_pos <- elbow_k + 0.2
y_pos <- wss_values[elbow_k]
# Create elbow plot with automatic indicator
ggplot(plot_data, aes(x = k, y = wss)) +
geom_line(size = 1) +
geom_point(size = 3) +
geom_vline(xintercept = elbow_k, linetype = "dashed", color = "red", size = 1) +
geom_point(aes(x = elbow_k, y = wss_values[elbow_k]),
color = "red", size = 4) +
annotate("text", x = x_pos, y = y_pos, label = paste("Elbow Point k =", elbow_k), hjust = -0.3, color = "red") +
labs(title = "Elbow Method for Optimal Number of Clusters (k)",
x = "Number of Clusters (k)",
y = "Total Within-Cluster Sum of Squares (WSS)") +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10)
)
# Print elbow point details
print(paste("Optimal Number of Clusters (k) =", elbow_k))
## [1] "Optimal Number of Clusters (k) = 3"
print(paste("Within-Cluster Sum of Squares (WSS) at Elbow Point =", round(wss_values[elbow_k], 2)))
## [1] "Within-Cluster Sum of Squares (WSS) at Elbow Point = 14524.29"
The Elbow Method is used to determine the optimal number of clusters for the k-means clustering algorithm.
Within-Cluster Sum of Squares (WSS) Calculation
Process
The code begins by setting a random seed (333) for reproducibility. It then calculates the Within-Cluster Sum of Squares (WSS) for k values ranging from 1 to 10. The sapply() call iterates through each k value, running k-means clustering with 25 different random starts (nstart = 25) on the scaled feature set. For each k, it stores the total within-cluster sum of squares (tot.withinss), which measures how compact the clusters are; lower WSS values indicate more compact clusters.
Elbow Point Calculation Function
The find_elbow_point() function implements a geometric approach to finding the elbow. It calculates the perpendicular distance from each point of the WSS curve to the straight line drawn from the first point to the last point; the point with the maximum distance from this line is identified as the elbow point. This effectively finds the point of maximum curvature in the WSS curve, which represents the optimal k value.
Elbow Plot Creation
The elbow plot is created with ggplot2 and has several key elements:
- a line plot showing the WSS values for each k
- points marking each k value
- a vertical red dashed line at the calculated elbow point
- a red point highlighting the exact elbow location
- an annotation indicating the optimal k value
Output Details
The chunk concludes by printing two key pieces of information:
1. the optimal number of clusters (k) identified by the algorithm
2. the corresponding Within-Cluster Sum of Squares (WSS) at this elbow point, rounded to two decimal places
These values provide quantitative support for the visual result and are used in the subsequent clustering step.
# Perform clustering
set.seed(333)
km_result <- kmeans(scaled_features, centers = 3, nstart = 25)
# Visualize clusters using PCA
fviz_cluster(km_result, data = scaled_features,
main = "Customer Segments",
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal())
K-means clustering groups the data into clusters of similar observations.
The algorithm is as follows (a minimal sketch of the assignment and update steps follows this list):
1. Choose the number of clusters, K.
2. Select K points at random as the initial centroids.
3. Assign each data point to its closest centroid, forming K clusters.
4. Compute and place the new centroid of each cluster.
5. Reassign each data point to its new closest cluster, repeating until assignments no longer change.
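The assignment and update steps can be written in a few lines of base R; this is an illustrative, single-iteration sketch on the scaled features, not the fitted model used below:
# One illustrative k-means iteration with k = 3
set.seed(1)
k <- 3
centroids <- scaled_features[sample(nrow(scaled_features), k), ]            # step 2: random initial centroids
dists <- as.matrix(dist(rbind(centroids, scaled_features)))[-(1:k), 1:k]    # Euclidean distance of every point to each centroid
assignment <- apply(dists, 1, which.min)                                    # step 3: assign each point to its closest centroid
new_centroids <- apply(scaled_features, 2,
                       function(col) tapply(col, assignment, mean))         # step 4: recompute each cluster's centroid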
Using the kmeans() function to perform the clustering
The dataset containing the scaled features is passed as the first argument, and the centers parameter is set to the optimal number of clusters, 3, as suggested by the elbow plot, so the algorithm will look for 3 clusters. The nstart parameter is set to 25, meaning the algorithm runs 25 times with different initial starting points and keeps the best clustering solution.
# Add clusters to original data
df$Cluster <- as.factor(km_result$cluster)
# Calculate cluster profiles
cluster_profile <- df %>%
group_by(Cluster) %>%
summarise(
Size = n(),
Avg_Income = mean(Income),
Avg_Total_Spent = mean(MntWines + MntFruits + MntMeatProducts +
MntFishProducts + MntSweetProducts + MntGoldProds),
Avg_Web_Visits = mean(NumWebVisitsMonth),
Avg_Purchases = mean(NumWebPurchases + NumCatalogPurchases + NumStorePurchases),
Avg_Recency = mean(Recency)
)
# Visualize cluster sizes
ggplot(cluster_profile, aes(x = Cluster, y = Size)) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_text(aes(label = Size), vjust = -0.5, size = 4) +
theme_minimal() +
labs(title = "Cluster Sizes", x = "Cluster", y = "Number of Customers")
The three clusters and their respective sizes are shown in the bar plot above.
# Visualize key metrics by cluster
cluster_metrics <- cluster_profile %>%
select(-Size) %>%
gather(key = "Metric", value = "Value", -Cluster)
ggplot(cluster_metrics, aes(x = Cluster, y = Value, fill = Metric)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = sprintf("%.1f", Value)), position = position_dodge(width = 0.9),
vjust = -0.5, size = 2.5) +
facet_wrap(~Metric, scales = "free_y") +
theme_minimal() +
labs(title = "Cluster Characteristics",
x = "Cluster",
y = "Value")
Cluster 1: High-Value customers
Cluster 2: Regular Shoppers
Cluster 3: Budget-Conscious Customers
The K-means clustering analysis revealed three distinct customer
segments:
1. High-Value Customers (Cluster 1): This group demonstrates the highest
spending across all product categories, including wines, meat products,
and gold products. They are likely to use multiple channels for
purchases, including online and in-store options. These customers visit
frequently, showing high engagement with the brand, and make frequent
purchases with a high average transaction value.
2. Regular Shoppers (Cluster 2): These customers exhibit moderate
spending across most product categories, with a balanced distribution.
They prefer a mix of online and catalog purchases. Their visit frequency
is regular but not as high as Cluster 1. Their purchasing behavior is
consistent, with a moderate number of transactions.
3. Budget-Conscious Customers (Cluster 3): This segment shows the lowest
spending across all product categories, focusing primarily on essential
items. They prefer in-store purchases and tend to avoid premium
channels. Their visit frequency is lower compared to the other clusters,
and they are characterized as occasional buyers with a low average
transaction value.
# Calculate silhouette score
sil <- silhouette(km_result$cluster, dist(scaled_features))
avg_sil <- mean(sil[,3])
print("Average Silhouette Score:")
## [1] "Average Silhouette Score:"
print(avg_sil)
## [1] 0.2917457
# Plot silhouette
fviz_silhouette(sil) +
theme_minimal() +
ggtitle("Silhouette Plot")
## cluster size ave.sil.width
## 1 1 558 0.10
## 2 2 673 0.20
## 3 3 985 0.47
Observation
Cluster 3 has the highest average silhouette width (0.47), suggesting
it has the best clustering quality among the three clusters.
Cluster 1 has the lowest average silhouette width (0.10), implying that
it may have weaker clustering quality compared to others.
Cluster 2 has a moderate silhouette width (0.20), which indicates some
boundary overlap or less distinct clustering.
# Total Sum of Squares
total_ss <- sum(scale(scaled_features)^2)
# Between Sum of Squares (BSS)
bss <- km_result$betweenss
# Within Sum of Squares (WSS)
wss <- km_result$tot.withinss
# Calculate Explained variance ratio
explained_variance_ratio <- bss / (bss + wss)
# Print metrics
print("K-means Clustering Metrics:")
## [1] "K-means Clustering Metrics:"
print(paste("Total Sum of Squares:", round(total_ss, 2)))
## [1] "Total Sum of Squares: 28795"
print(paste("Between Sum of Squares:", round(bss, 2)))
## [1] "Between Sum of Squares: 14270.71"
print(paste("Within Sum of Squares:", round(wss, 2)))
## [1] "Within Sum of Squares: 14524.29"
print(paste("Explained Variance Ratio:", round(explained_variance_ratio, 4)))
## [1] "Explained Variance Ratio: 0.4956"
Interpretation of Results
Explained Variance Ratio
A value of 0.4956 means that the clusters explain 49.56% of the total variance in the dataset. This is a moderate result, as higher explained variance ratios (closer to 1) indicate better clustering.
High Within-Cluster Variance (WSS):
The WSS (14524.29) is slightly high, suggesting that data points within
clusters are not tightly grouped.
Between-Cluster Variance (BSS):
The BSS (14270.71) is comparable to WSS, indicating that clusters are
moderately well-separated. Ideally, BSS should dominate (e.g., > 70%
of TSS).
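As a quick arithmetic check, the explained variance ratio reported above can be reproduced from the printed BSS and WSS values:
# Explained variance ratio recomputed from the printed sums of squares
bss_printed <- 14270.71
wss_printed <- 14524.29
round(bss_printed / (bss_printed + wss_printed), 4)  # approximately 0.4956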
# Calculate average distance to centroid for each cluster
distances_to_centroid <- numeric(length(km_result$cluster))
for(i in 1:nrow(scaled_features)) {
cluster_id <- km_result$cluster[i]
centroid <- km_result$centers[cluster_id,]
distances_to_centroid[i] <- sqrt(sum((scaled_features[i,] - centroid)^2))
}
avg_dist_by_cluster <- tapply(distances_to_centroid, km_result$cluster, mean)
print("Average Distance to Centroid by Cluster:")
## [1] "Average Distance to Centroid by Cluster:"
print(round(avg_dist_by_cluster, 4))
## 1 2 3
## 2.7564 3.1401 1.6826
# Calculate and display quality metrics
quality_metrics <- data.frame(
Metric = c("Average Silhouette Score", "Between SS / Total SS"),
Value = c(avg_sil, km_result$betweenss / km_result$totss)
)
print(quality_metrics)
## Metric Value
## 1 Average Silhouette Score 0.2917457
## 2 Between SS / Total SS 0.4955969
Silhouette Score Interpretation
Scale:
-1.0 to -0.5: poor clustering, possible misassignment
-0.5 to 0.0: overlapping clusters
0.0 to 0.25: weak structure
0.25 to 0.50: fair structure
0.50 to 0.75: good structure
0.75 to 1.00: strong structure
An average silhouette score of 0.2917 indicates:
- Fair clustering quality: the clusters have some meaningful structure.
- Moderate separation: objects are somewhat well matched to their clusters.
- Some overlap: there may be some overlap between clusters.
- Acceptable for real data: many real-world datasets achieve scores in this range due to natural noise and complexity.
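For a single observation i the silhouette width is s(i) = (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster. A minimal sketch computing this from first principles for the first observation, which should agree with sil[1, 3]:
# Silhouette width of observation 1, computed directly from the distance matrix
d <- as.matrix(dist(scaled_features))
own <- km_result$cluster == km_result$cluster[1]
a <- mean(d[1, own & seq_len(nrow(d)) != 1])                  # mean distance to its own cluster (excluding itself)
b <- min(tapply(d[1, !own], km_result$cluster[!own], mean))   # smallest mean distance to another cluster
(b - a) / max(a, b)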
The between-cluster sum of squares accounts for 49.56% (0.4956) of the total sum of squares, matching the explained variance ratio reported earlier.