Customer Segmentation Cluster Analysis

Load the Dataset

# The file name must match the uploaded CSV exactly, including the space and "(1)"
mydata <- read.csv("customer_segmentation (1).csv")

head(mydata)

##   ID CS_helpful Recommend Come_again All_Products Profesionalism Limitation
## 1  1          2         2          2            2              2          2
## 2  2          1         2          1            1              1          1
## 3  3          2         1          1            1              1          2
## 4  4          3         3          2            4              1          2
## 5  5          2         1          3            5              2          1
## 6  6          1         1          3            2              1          1
##   Online_grocery delivery Pick_up Find_items other_shops Gender Age Education
## 1              2        3       4          1           2      1   2         2
## 2              2        3       3          1           2      1   2         2
## 3              3        3       2          1           3      1   2         2
## 4              3        3       2          2           2      1   3         5
## 5              2        3       1          2           3      2   4         2
## 6              1        2       1          1           4      1   2         5

str(mydata)

## 'data.frame':    22 obs. of  15 variables:
##  $ ID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ CS_helpful    : int  2 1 2 3 2 1 2 1 1 1 ...
##  $ Recommend     : int  2 2 1 3 1 1 1 1 1 1 ...
##  $ Come_again    : int  2 1 1 2 3 3 1 1 1 1 ...
##  $ All_Products  : int  2 1 1 4 5 2 2 2 2 1 ...
##  $ Profesionalism: int  2 1 1 1 2 1 2 1 2 1 ...
##  $ Limitation    : int  2 1 2 2 1 1 1 2 1 1 ...
##  $ Online_grocery: int  2 2 3 3 2 1 2 1 2 3 ...
##  $ delivery      : int  3 3 3 3 3 2 2 1 1 2 ...
##  $ Pick_up       : int  4 3 2 2 1 1 2 2 3 2 ...
##  $ Find_items    : int  1 1 1 2 2 1 1 2 1 1 ...
##  $ other_shops   : int  2 2 3 2 3 4 1 4 1 1 ...
##  $ Gender        : int  1 1 1 1 2 1 1 1 2 2 ...
##  $ Age           : int  2 2 2 3 4 2 2 2 2 2 ...
##  $ Education     : int  2 2 2 5 2 5 3 2 1 2 ...

summary(mydata)

##        ID          CS_helpful      Recommend       Come_again   
##  Min.   : 1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 6.25   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :11.50   Median :1.000   Median :1.000   Median :1.000  
##  Mean   :11.50   Mean   :1.591   Mean   :1.318   Mean   :1.455  
##  3rd Qu.:16.75   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:2.000  
##  Max.   :22.00   Max.   :3.000   Max.   :3.000   Max.   :3.000  
##   All_Products   Profesionalism    Limitation  Online_grocery     delivery    
##  Min.   :1.000   Min.   :1.000   Min.   :1.0   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.250   1st Qu.:1.000   1st Qu.:1.0   1st Qu.:2.000   1st Qu.:2.000  
##  Median :2.000   Median :1.000   Median :1.0   Median :2.000   Median :3.000  
##  Mean   :2.091   Mean   :1.409   Mean   :1.5   Mean   :2.273   Mean   :2.409  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.0   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :3.000   Max.   :4.0   Max.   :3.000   Max.   :3.000  
##     Pick_up        Find_items     other_shops        Gender     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.250   1st Qu.:1.000  
##  Median :2.000   Median :1.000   Median :2.000   Median :1.000  
##  Mean   :2.455   Mean   :1.455   Mean   :2.591   Mean   :1.273  
##  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:3.750   3rd Qu.:1.750  
##  Max.   :5.000   Max.   :3.000   Max.   :5.000   Max.   :2.000  
##       Age          Education    
##  Min.   :2.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000  
##  Median :2.000   Median :2.500  
##  Mean   :2.455   Mean   :3.182  
##  3rd Qu.:3.000   3rd Qu.:5.000  
##  Max.   :4.000   Max.   :5.000

Prepare the Data

# Remove ID column (first column)
cluster_data <- mydata[,-1]

# Keep only numeric columns, since scale() and kmeans() require numeric input
numeric_data <- cluster_data[, sapply(cluster_data, is.numeric)]

# Scale the data
cluster_scaled <- scale(numeric_data)
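
As a minimal sketch of what scale() does in the step above: it centers each column on its mean and divides by its standard deviation, so every variable contributes on a comparable footing to the distance calculations. The three columns below are the first six rows shown in the head(mydata) output; the object names sample_rows, scaled, and manual are illustrative only.

```r
# First six rows of three survey columns, copied from the head(mydata) output
sample_rows <- data.frame(
  CS_helpful   = c(2, 1, 2, 3, 2, 1),
  All_Products = c(2, 1, 1, 4, 5, 2),
  Education    = c(2, 2, 2, 5, 2, 5)
)

# scale() subtracts each column mean and divides by each column sd
scaled <- scale(sample_rows)

# The same calculation done by hand for one column
manual <- (sample_rows$All_Products - mean(sample_rows$All_Products)) /
  sd(sample_rows$All_Products)

print(round(colMeans(scaled), 10))   # column means are 0 after scaling
print(apply(scaled, 2, sd))          # column standard deviations are 1
print(all.equal(as.numeric(scaled[, "All_Products"]), manual))
```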

K-Means Clustering

set.seed(123)

kmeans_result <- kmeans(cluster_scaled, centers = 3, nstart = 25)

# Add cluster labels to original dataset
mydata$Cluster <- as.factor(kmeans_result$cluster)

head(mydata)

##   ID CS_helpful Recommend Come_again All_Products Profesionalism Limitation
## 1  1          2         2          2            2              2          2
## 2  2          1         2          1            1              1          1
## 3  3          2         1          1            1              1          2
## 4  4          3         3          2            4              1          2
## 5  5          2         1          3            5              2          1
## 6  6          1         1          3            2              1          1
##   Online_grocery delivery Pick_up Find_items other_shops Gender Age Education
## 1              2        3       4          1           2      1   2         2
## 2              2        3       3          1           2      1   2         2
## 3              3        3       2          1           3      1   2         2
## 4              3        3       2          2           2      1   3         5
## 5              2        3       1          2           3      2   4         2
## 6              1        2       1          1           4      1   2         5
##   Cluster
## 1       3
## 2       3
## 3       3
## 4       2
## 5       2
## 6       1
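
The analysis above fixes centers = 3 without showing how that number was chosen. One common heuristic, sketched here on simulated data rather than the survey, is to compare the total within-cluster sum of squares across several values of k and look for the point where the improvement levels off. The simulated groups and the names toy, k_values, and wss are assumptions made for illustration.

```r
set.seed(123)

# Simulated data (not the survey): three well-separated groups of 10 points
toy <- rbind(
  matrix(rnorm(20, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2)
)
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1..6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})

# A sharp drop followed by a flat tail suggests a reasonable k
print(data.frame(k = k_values, tot_withinss = round(wss, 3)))
```

On the real data, cluster_scaled would take the place of toy_scaled; with only 22 observations the curve may not show a clear bend, so the result should be read as a guide rather than a rule.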

Number of Observations in Each Cluster

table(mydata$Cluster)

## 
##  1  2  3 
##  6  4 12

Answer:
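The table above shows the number of observations in each cluster: cluster 1 contains 6 customers, cluster 2 contains 4, and cluster 3 contains 12, which accounts for all 22 customers in the dataset. Cluster 3 is the largest group and cluster 2 the smallest.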
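
Cluster sizes are often easier to compare as percentages. The sketch below re-enters the counts from the table output by hand; in the actual session the same result should come from round(100 * prop.table(table(mydata$Cluster)), 1). The names cluster_sizes and cluster_share are illustrative.

```r
# Counts copied from the table(mydata$Cluster) output
cluster_sizes <- c("1" = 6, "2" = 4, "3" = 12)

# Share of all customers in each cluster, in percent
cluster_share <- round(100 * cluster_sizes / sum(cluster_sizes), 1)
print(cluster_share)
```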

Cluster Means

aggregate(numeric_data, by = list(Cluster = mydata$Cluster), mean)

##   Cluster CS_helpful Recommend Come_again All_Products Profesionalism
## 1       1   1.000000      1.00   1.500000     2.166667       1.000000
## 2       2   2.500000      2.00   2.500000     3.250000       2.000000
## 3       3   1.583333      1.25   1.083333     1.666667       1.416667
##   Limitation Online_grocery delivery  Pick_up Find_items other_shops   Gender
## 1   1.166667       1.666667 1.666667 2.000000   1.333333    3.666667 1.166667
## 2   2.000000       2.250000 3.000000 1.250000   2.000000    2.750000 1.750000
## 3   1.500000       2.583333 2.583333 3.083333   1.333333    2.000000 1.166667
##        Age Education
## 1 2.833333  4.500000
## 2 2.750000  3.000000
## 3 2.166667  2.583333

Cluster Medians

aggregate(numeric_data, by = list(Cluster = mydata$Cluster), median)

##   Cluster CS_helpful Recommend Come_again All_Products Profesionalism
## 1       1        1.0         1        1.0          2.0              1
## 2       2        2.5         2        2.5          3.5              2
## 3       3        1.5         1        1.0          2.0              1
##   Limitation Online_grocery delivery Pick_up Find_items other_shops Gender Age
## 1        1.0            1.5        2       2          1         4.0      1 2.5
## 2        1.5            2.5        3       1          2         2.5      2 2.5
## 3        1.0            3.0        3       3          1         2.0      1 2.0
##   Education
## 1       5.0
## 2       2.5
## 3       2.0

Why are means or medians important?

Answer:
Means and medians summarize each cluster by showing the typical value of every variable. Comparing them across clusters shows what distinguishes the groups: for example, cluster 1 has a mean Education of 4.5, compared with 3.0 for cluster 2 and about 2.58 for cluster 3. These profiles help a business tailor its marketing to each customer group.

Should mean or median be used?

Answer:
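The median is preferable when the data contain outliers or are skewed, because it is far less sensitive to extreme values. The mean is useful when the data are roughly symmetric. In this case the median is more reliable: the variables are integer-coded survey responses, the clusters are small (4 to 12 customers), and a few high values (for example, an All_Products rating of 5 when the median is 2) can pull the mean away from the typical customer.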
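
A small illustration of the outlier argument, using made-up ratings rather than the survey data: a single high value shifts the mean noticeably while the median stays put. The vector names are hypothetical.

```r
# Hypothetical ratings for a small cluster, with and without one extreme value
ratings_without_outlier <- c(1, 1, 1, 2)
ratings_with_outlier    <- c(1, 1, 1, 2, 5)

mean_shift   <- mean(ratings_with_outlier) - mean(ratings_without_outlier)
median_shift <- median(ratings_with_outlier) - median(ratings_without_outlier)

print(c(mean_without = mean(ratings_without_outlier),
        mean_with    = mean(ratings_with_outlier)))    # 1.25 and 2
print(c(median_without = median(ratings_without_outlier),
        median_with    = median(ratings_with_outlier))) # 1 and 1
```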

Summary Measures for Each Cluster

Answer:
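Important summary measures include the mean, median, minimum, maximum, and standard deviation, along with the number of observations in each cluster. The mean and median describe a cluster’s typical value, while the minimum, maximum, and standard deviation describe how much its members vary. Together they allow a fair comparison between groups.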
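
One way to obtain several of these measures in a single call is to pass aggregate() a function that returns a named vector. This is a sketch on made-up data; the data frame toy and the helper summarise_var are hypothetical names, not part of the original analysis.

```r
# Toy data (hypothetical values, not the survey): one rating and a cluster label
toy <- data.frame(
  rating  = c(1, 1, 2, 3, 3, 5, 2, 2, 4),
  cluster = factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3))
)

# Return several summary measures at once for a numeric vector
summarise_var <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v), min = min(v), max = max(v))
}

# The rating column of the result is a matrix with one column per measure
cluster_summary <- aggregate(rating ~ cluster, data = toy, FUN = summarise_var)
print(cluster_summary)
```

For the survey data, the same helper could be applied to one variable at a time with mydata$Cluster as the grouping factor.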

Hierarchical Clustering

distance_matrix <- dist(cluster_scaled)

hc_result <- hclust(distance_matrix, method = "ward.D2")

plot(hc_result, main = "Hierarchical Clustering Dendrogram",
     xlab = "", sub = "")

K-Means vs Hierarchical Clustering

Answer:
K-means clustering requires the number of clusters to be chosen before fitting and scales well to large datasets. Hierarchical clustering builds a dendrogram and does not require a predefined number of clusters; the tree can be cut at any level afterward to obtain the desired number of groups. I prefer K-means because it is faster on large data and its cluster centers are easy to interpret.
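
To compare the two methods directly, the dendrogram can be cut into a fixed number of groups with cutree() and cross-tabulated against the K-means labels. The sketch below uses simulated data with three clear groups, not the survey; all object names are illustrative. Cluster numbers are arbitrary in both methods, so agreement shows up as one non-zero cell per row rather than as matching labels.

```r
set.seed(123)

# Simulated data (not the survey): three well-separated groups of 10 points
toy <- rbind(
  matrix(rnorm(20, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2)
)
toy_scaled <- scale(toy)

# Hierarchical clustering, then cut the tree into 3 groups
hc <- hclust(dist(toy_scaled), method = "ward.D2")
hc_clusters <- cutree(hc, k = 3)

# K-means with the same number of clusters
km <- kmeans(toy_scaled, centers = 3, nstart = 25)

# Cross-tabulate the two sets of labels
agreement <- table(Hierarchical = hc_clusters, KMeans = km$cluster)
print(agreement)
```

On the survey data, cutree(hc_result, k = 3) and mydata$Cluster would be the two inputs; with real data the methods may disagree on some customers.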

Advanced Question

Answer:
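We should use numeric_data (instead of mydata or mydata[,-1]) because it contains only the numeric survey variables. By this point mydata also holds the ID column, which is an identifier with no meaningful average, and the Cluster factor, for which mean() and median() are not defined. numeric_data was created after ID was dropped and before Cluster was added, so the cluster summaries are computed on the right variables only.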
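
The point can be checked on a small made-up data frame with the same shape as mydata after clustering (an ID, one rating, and a Cluster factor). The names toy, factor_mean, numeric_only, and cluster_means are hypothetical.

```r
# Toy data frame (hypothetical values) shaped like mydata after clustering
toy <- data.frame(
  ID      = 1:4,
  rating  = c(1, 2, 2, 5),
  Cluster = factor(c(1, 1, 2, 2))
)

# mean() on a factor returns NA (with a warning), so the factor must be excluded
factor_mean <- suppressWarnings(mean(toy$Cluster))
print(factor_mean)

# Keep numeric columns only and drop the identifier
numeric_only <- toy[, sapply(toy, is.numeric) & names(toy) != "ID", drop = FALSE]

# Cluster means computed on the meaningful variables only
cluster_means <- aggregate(numeric_only, by = list(Cluster = toy$Cluster), mean)
print(cluster_means)
```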