Introduction

This report demonstrates a cluster analysis using customer segmentation survey data. The dataset includes variables such as Age, Education, CS_helpful, and Recommend. We will standardize the data, determine the optimal number of clusters using the elbow method, perform k-means clustering, and then interpret the cluster characteristics.

Data Loading and Preprocessing

# Load the dataset from the attached CSV file
survey_data <- read.csv("customer_segmentation.csv", stringsAsFactors = FALSE)

# Display a summary of the data to ensure it loaded correctly
summary(survey_data)
##        ID          CS_helpful      Recommend       Come_again   
##  Min.   : 1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 6.25   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :11.50   Median :1.000   Median :1.000   Median :1.000  
##  Mean   :11.50   Mean   :1.591   Mean   :1.318   Mean   :1.455  
##  3rd Qu.:16.75   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:2.000  
##  Max.   :22.00   Max.   :3.000   Max.   :3.000   Max.   :3.000  
##   All_Products   Profesionalism    Limitation  Online_grocery     delivery    
##  Min.   :1.000   Min.   :1.000   Min.   :1.0   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.250   1st Qu.:1.000   1st Qu.:1.0   1st Qu.:2.000   1st Qu.:2.000  
##  Median :2.000   Median :1.000   Median :1.0   Median :2.000   Median :3.000  
##  Mean   :2.091   Mean   :1.409   Mean   :1.5   Mean   :2.273   Mean   :2.409  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.0   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :3.000   Max.   :4.0   Max.   :3.000   Max.   :3.000  
##     Pick_up        Find_items     other_shops        Gender     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.250   1st Qu.:1.000  
##  Median :2.000   Median :1.000   Median :2.000   Median :1.000  
##  Mean   :2.455   Mean   :1.455   Mean   :2.591   Mean   :1.273  
##  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:3.750   3rd Qu.:1.750  
##  Max.   :5.000   Max.   :3.000   Max.   :5.000   Max.   :2.000  
##       Age          Education    
##  Min.   :2.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000  
##  Median :2.000   Median :2.500  
##  Mean   :2.455   Mean   :3.182  
##  3rd Qu.:3.000   3rd Qu.:5.000  
##  Max.   :4.000   Max.   :5.000
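# Standardize the data (center and scale each variable)
# Note: scale() is applied to every column here, including the ID identifier,
# so ID also appears among the cluster means reported below
survey_data_scaled <- scale(survey_data)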
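
As a minimal sketch of what `scale()` does, the example below standardizes a small made-up data frame and checks that each column ends up with mean 0 and standard deviation 1. The `toy` values are hypothetical and not taken from the survey. The sketch also shows one way to leave an identifier column such as `ID` out of the scaled matrix, on the assumption that an identifier should not influence the distances between respondents.

```r
# Hypothetical toy data; values are illustrative only
toy <- data.frame(
  ID        = 1:5,
  Age       = c(2, 2, 3, 4, 2),
  Education = c(1, 2, 5, 5, 3)
)

# Drop the identifier before scaling (assumption: ID carries no survey information)
toy_vars <- toy[, setdiff(names(toy), "ID")]

# scale() subtracts each column mean and divides by the column standard deviation
toy_scaled <- scale(toy_vars)

# The same z-scores computed by hand for one column
age_z <- (toy$Age - mean(toy$Age)) / sd(toy$Age)

colMeans(toy_scaled)        # approximately 0 for every column
apply(toy_scaled, 2, sd)    # 1 for every column
```

Standardizing matters for k-means because the algorithm uses Euclidean distance, so a variable coded 1 to 5 would otherwise outweigh one coded 1 to 2.
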
# Calculate the total within-cluster sum of squares (WSS) for 1 to 10 clusters
set.seed(123)  # fix the random starts so the elbow curve is reproducible
wss <- numeric(10)
for (i in 1:10) {
  wss[i] <- kmeans(survey_data_scaled, centers = i, nstart = 25)$tot.withinss
}

# Plot the WSS values to visualize the "elbow"
plot(1:10, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within-group Sum of Squares", 
     main = "Elbow Method for Determining Optimal Clusters")
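
To illustrate what an elbow looks like, here is a sketch on synthetic data with three well-separated groups. The data, the seed, and the names `centers_true`, `toy`, and `wss_toy` are assumptions made for this example and are unrelated to the survey.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Synthetic data: three well-separated groups of 20 points in two dimensions
centers_true <- rbind(c(0, 0), c(5, 5), c(10, 0))
toy <- do.call(rbind, lapply(1:3, function(g) {
  cbind(rnorm(20, mean = centers_true[g, 1], sd = 0.5),
        rnorm(20, mean = centers_true[g, 2], sd = 0.5))
}))

# Total within-cluster sum of squares for k = 1..6
wss_toy <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Share of the k = 1 value that remains at each k
round(wss_toy / wss_toy[1], 3)
```

With groups this clearly separated, most of the reduction happens by k = 3 and further clusters add little. Real survey data rarely shows such a sharp bend, so the elbow is best read as a guide rather than a rule.
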

# Perform k-means clustering with 3 clusters
set.seed(123)
clusters <- kmeans(survey_data_scaled, centers = 3, nstart = 25)

# Print the clustering result
print(clusters)
## K-means clustering with 3 clusters of sizes 12, 6, 4
## 
## Cluster means:
##            ID  CS_helpful  Recommend  Come_again All_Products Profesionalism
## 1 -0.02566635 -0.01031923 -0.1054899 -0.50262359  -0.39835424     0.01283318
## 2  0.00000000 -0.80490011 -0.4922862  0.06154575   0.07113469    -0.69299145
## 3  0.07699905  1.23830786  1.0548991  1.41555215   1.08836068     1.00098765
##      Limitation Online_grocery   delivery    Pick_up Find_items other_shops
## 1  1.850372e-17     0.40480555  0.2373423  0.5949772 -0.1806489  -0.4212692
## 2 -4.157397e-01    -0.78986449 -1.0112848 -0.4301040 -0.1806489   0.7669260
## 3  6.236096e-01    -0.02961992  0.8049001 -1.1397755  0.8129201   0.1134186
##       Gender        Age  Education
## 1 -0.2326695 -0.3897897 -0.3688989
## 2 -0.2326695  0.5128812  0.8125115
## 3  1.0470128  0.4000473 -0.1120706
## 
## Clustering vector:
##  [1] 1 1 1 3 3 2 1 2 1 1 2 1 2 1 2 2 1 1 3 3 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 110.31871  48.18146  63.52384
##  (between_SS / total_SS =  29.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
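
The `between_SS / total_SS` figure in the printout can be reproduced from the numbers shown. The sketch below assumes, as the code above implies, that all 15 columns of the 22 rows were standardized with `scale()` and that no column is constant, so each column contributes n - 1 = 21 to the total sum of squares.

```r
# Within-cluster sums of squares copied from the printed result
withinss <- c(110.31871, 48.18146, 63.52384)

n_rows <- 22   # respondents (cluster sizes 12 + 6 + 4)
n_cols <- 15   # columns passed to scale(), including ID

# After scale(), every column has variance 1, so its sum of squares is n - 1
totss <- (n_rows - 1) * n_cols

# Whatever is not within clusters is between clusters
betweenss <- totss - sum(withinss)
ratio <- betweenss / totss

round(100 * ratio, 1)  # 29.5, matching the printed percentage
```

A ratio of 29.5% means that roughly 70% of the variation remains inside the clusters.
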
# Add the cluster results to the original data
survey_data$Cluster <- as.factor(clusters$cluster)

# Scatter plot: Using Age vs. Education as an example
plot(survey_data$Age, survey_data$Education, 
     col = survey_data$Cluster, pch = 19,
     xlab = "Age", ylab = "Education", 
     main = "Cluster Visualization: Age vs Education")

# Add a legend to identify clusters
legend("topright", legend = levels(survey_data$Cluster), 
       col = 1:length(levels(survey_data$Cluster)), pch = 19)

# Aggregate mean values for Age, Education, CS_helpful, and Recommend by cluster
cluster_summary <- aggregate(
  survey_data[, c("Age", "Education", "CS_helpful", "Recommend")],
  by = list(Cluster = survey_data$Cluster),
  FUN = mean
)
print(cluster_summary)
##   Cluster      Age Education CS_helpful Recommend
## 1       1 2.166667  2.583333   1.583333      1.25
## 2       2 2.833333  4.500000   1.000000      1.00
## 3       3 2.750000  3.000000   2.500000      2.00
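
The cluster means printed by `kmeans()` are in standardized units, while the aggregated table above is in the original response codes. The two are linked by the centering and scaling values that `scale()` stores as attributes. The sketch below shows the back-transformation on hypothetical toy data (the `toy` values and the choice of two clusters are assumptions for illustration only).

```r
set.seed(1)  # arbitrary seed for a reproducible sketch

# Hypothetical toy data standing in for two survey columns
toy <- data.frame(
  Age       = c(2, 2, 3, 4, 2, 3, 4, 3),
  Education = c(1, 2, 5, 5, 3, 4, 4, 1)
)
toy_scaled <- scale(toy)
km <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Undo the standardization: multiply by the column SDs, then add the column means
col_sd   <- attr(toy_scaled, "scaled:scale")
col_mean <- attr(toy_scaled, "scaled:center")
centers_original <- sweep(sweep(km$centers, 2, col_sd, `*`), 2, col_mean, `+`)

# Per-cluster means computed directly on the unscaled data
direct_means <- aggregate(toy, by = list(Cluster = km$cluster), FUN = mean)

centers_original
direct_means
```

Both routes give the same per-cluster means, so either can be used to describe clusters in the units respondents actually answered in.
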

Interpretation of the Results

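The k-means solution assigns the 22 respondents to three clusters. The separation is modest: the between-cluster sum of squares accounts for 29.5% of the total. Based on the cluster means on the original response scales:

Cluster 1 (12 respondents): the lowest mean Age (2.17) and Education (2.58) codes, with intermediate CS_helpful (1.58) and Recommend (1.25) scores.

Cluster 2 (6 respondents): the highest mean Age (2.83) and Education (4.50) codes; every member answered 1 on both CS_helpful and Recommend.

Cluster 3 (4 respondents): Age (2.75) and Education (3.00) codes between those of the other two clusters, with the highest CS_helpful (2.50) and Recommend (2.00) scores.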

These insights can guide targeted marketing and service strategies, allowing for communications and initiatives tailored to the differing needs and characteristics of each cluster.