This report walks through a k-means cluster analysis of customer segmentation survey data from 22 respondents. The dataset includes variables such as Age, Education, CS_helpful, and Recommend, each recorded as a numeric code. We standardize the data, use the elbow method to choose the number of clusters, run k-means clustering, and then interpret the characteristics of each cluster.
# Load the dataset from the attached CSV file
survey_data <- read.csv("customer_segmentation.csv", stringsAsFactors = FALSE)
# Display a summary of the data to ensure it loaded correctly
summary(survey_data)
##        ID          CS_helpful      Recommend       Come_again
##  Min.   : 1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000
##  1st Qu.: 6.25   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000
##  Median :11.50   Median :1.000   Median :1.000   Median :1.000
##  Mean   :11.50   Mean   :1.591   Mean   :1.318   Mean   :1.455
##  3rd Qu.:16.75   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:2.000
##  Max.   :22.00   Max.   :3.000   Max.   :3.000   Max.   :3.000
##   All_Products   Profesionalism    Limitation  Online_grocery     delivery
##  Min.   :1.000   Min.   :1.000   Min.   :1.0   Min.   :1.000   Min.   :1.000
##  1st Qu.:1.250   1st Qu.:1.000   1st Qu.:1.0   1st Qu.:2.000   1st Qu.:2.000
##  Median :2.000   Median :1.000   Median :1.0   Median :2.000   Median :3.000
##  Mean   :2.091   Mean   :1.409   Mean   :1.5   Mean   :2.273   Mean   :2.409
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.0   3rd Qu.:3.000   3rd Qu.:3.000
##  Max.   :5.000   Max.   :3.000   Max.   :4.0   Max.   :3.000   Max.   :3.000
##     Pick_up        Find_items     other_shops        Gender
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.250   1st Qu.:1.000
##  Median :2.000   Median :1.000   Median :2.000   Median :1.000
##  Mean   :2.455   Mean   :1.455   Mean   :2.591   Mean   :1.273
##  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:3.750   3rd Qu.:1.750
##  Max.   :5.000   Max.   :3.000   Max.   :5.000   Max.   :2.000
##       Age          Education
##  Min.   :2.000   Min.   :1.000
##  1st Qu.:2.000   1st Qu.:2.000
##  Median :2.000   Median :2.500
##  Mean   :2.455   Mean   :3.182
##  3rd Qu.:3.000   3rd Qu.:5.000
##  Max.   :4.000   Max.   :5.000
# Standardize the data (center and scale each variable)
survey_data_scaled <- scale(survey_data)
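Note that `scale(survey_data)` standardizes every column, so the `ID` column becomes one of the clustering variables (it appears in the cluster means printed later in this report). A respondent identifier says nothing about the customer, so one option is to drop it before scaling. The sketch below illustrates the idea on a small made-up data frame; `toy`, `features`, and `features_scaled` are hypothetical names and the values are invented, not taken from the survey.

```r
# Hypothetical toy data with the same kind of columns as the survey
# (values are invented for illustration only)
toy <- data.frame(
  ID         = 1:6,
  Age        = c(2, 2, 3, 4, 2, 3),
  Education  = c(1, 2, 5, 5, 2, 3),
  CS_helpful = c(1, 1, 2, 3, 1, 2)
)

# Drop the identifier so it cannot influence the distance calculations
features <- toy[, setdiff(names(toy), "ID")]

# Center each remaining column to mean 0 and scale it to standard deviation 1
features_scaled <- scale(features)

# Check the result: column means near 0, standard deviations equal to 1
round(colMeans(features_scaled), 10)
apply(features_scaled, 2, sd)
```

Applied to the survey, the same pattern would pass the ID-free columns to `kmeans()`, which would change the cluster means and possibly the cluster assignments reported here.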
# Calculate the total within-cluster sum of squares (WSS) for 1 to 10 clusters
set.seed(123)  # make the random starts reproducible
wss <- numeric(10)
for (i in 1:10) {
  wss[i] <- kmeans(survey_data_scaled, centers = i, nstart = 25)$tot.withinss
}
# Plot the WSS values to visualize the "elbow"
plot(1:10, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within-group Sum of Squares",
     main = "Elbow Method for Determining Optimal Clusters")
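The elbow is usually picked by eye, but the same information can be read from how much each additional cluster reduces the WSS. The sketch below is a self-contained illustration on simulated data with three artificial groups; `sim`, `wss_sim`, and `pct_drop` are hypothetical names and are not part of the survey analysis.

```r
set.seed(42)

# Simulated data: three artificial groups of 20 points in two dimensions
sim <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 8, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
wss_sim <- sapply(1:6, function(k) {
  kmeans(sim, centers = k, nstart = 25)$tot.withinss
})

# Percentage reduction in WSS gained by moving from k - 1 to k clusters
pct_drop <- 100 * -diff(wss_sim) / head(wss_sim, -1)
names(pct_drop) <- paste0("k=", 2:6)
round(pct_drop, 1)
```

A run of large reductions followed by small ones suggests stopping at the last large one. With only 22 respondents the curve for the survey data may be less clear-cut, so the choice of three clusters remains a judgment call.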
# Perform k-means clustering with 3 clusters
set.seed(123)
clusters <- kmeans(survey_data_scaled, centers = 3, nstart = 25)
# Print the clustering result
print(clusters)
## K-means clustering with 3 clusters of sizes 12, 6, 4
##
## Cluster means:
##            ID  CS_helpful  Recommend  Come_again All_Products Profesionalism
## 1 -0.02566635 -0.01031923 -0.1054899 -0.50262359  -0.39835424     0.01283318
## 2  0.00000000 -0.80490011 -0.4922862  0.06154575   0.07113469    -0.69299145
## 3  0.07699905  1.23830786  1.0548991  1.41555215   1.08836068     1.00098765
##      Limitation Online_grocery   delivery    Pick_up Find_items other_shops
## 1  1.850372e-17     0.40480555  0.2373423  0.5949772 -0.1806489  -0.4212692
## 2 -4.157397e-01    -0.78986449 -1.0112848 -0.4301040 -0.1806489   0.7669260
## 3  6.236096e-01    -0.02961992  0.8049001 -1.1397755  0.8129201   0.1134186
##       Gender        Age  Education
## 1 -0.2326695 -0.3897897 -0.3688989
## 2 -0.2326695  0.5128812  0.8125115
## 3  1.0470128  0.4000473 -0.1120706
##
## Clustering vector:
## [1] 1 1 1 3 3 2 1 2 1 1 2 1 2 1 2 2 1 1 3 3 1 1
##
## Within cluster sum of squares by cluster:
## [1] 110.31871 48.18146 63.52384
## (between_SS / total_SS = 29.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
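The `between_SS / total_SS` figure in the output can be reproduced by hand, which helps in judging how well the clusters separate. The sketch below uses only numbers printed above; the variable names are illustrative. It relies on the fact that after `scale()` each column has variance 1, so each column contributes `n - 1` to the total sum of squares.

```r
# Within-cluster sums of squares, copied from the printed k-means output
withinss <- c(110.31871, 48.18146, 63.52384)

n_obs  <- 22  # length of the clustering vector
n_vars <- 15  # columns in the cluster means table, including ID

# Each standardized column has sum of squares n_obs - 1
totss <- (n_obs - 1) * n_vars

# Whatever is not within clusters lies between them
betweenss <- totss - sum(withinss)

# Share of total variation explained by the cluster assignment, in percent
round(100 * betweenss / totss, 1)
```

The result matches the reported 29.5 %, meaning roughly 70 % of the variation remains inside the clusters.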
# Add the cluster results to the original data
survey_data$Cluster <- as.factor(clusters$cluster)
# Scatter plot: Using Age vs. Education as an example
plot(survey_data$Age, survey_data$Education,
     col = survey_data$Cluster, pch = 19,
     xlab = "Age", ylab = "Education",
     main = "Cluster Visualization: Age vs Education")
# Add a legend to identify clusters
legend("topright", legend = levels(survey_data$Cluster),
       col = seq_along(levels(survey_data$Cluster)), pch = 19)
# Aggregate mean values for Age, Education, CS_helpful, and Recommend by cluster
cluster_summary <- aggregate(
  survey_data[, c("Age", "Education", "CS_helpful", "Recommend")],
  by = list(Cluster = survey_data$Cluster),
  FUN = mean
)
print(cluster_summary)
##   Cluster      Age Education CS_helpful Recommend
## 1       1 2.166667  2.583333   1.583333      1.25
## 2       2 2.833333  4.500000   1.000000      1.00
## 3       3 2.750000  3.000000   2.500000      2.00
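To see which clusters sit above or below the sample as a whole, the cluster means can be compared with the overall means. The sketch below re-enters the table above by hand together with the cluster sizes (12, 6, 4) from the k-means output; `cluster_means`, `sizes`, and `overall` are illustrative names.

```r
# Cluster means copied from the aggregate() output above
cluster_means <- matrix(
  c(2.166667, 2.583333, 1.583333, 1.25,
    2.833333, 4.500000, 1.000000, 1.00,
    2.750000, 3.000000, 2.500000, 2.00),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste("Cluster", 1:3),
                  c("Age", "Education", "CS_helpful", "Recommend"))
)

# Cluster sizes from the printed k-means result
sizes <- c(12, 6, 4)

# Size-weighted average of the cluster means gives the overall mean
overall <- colSums(cluster_means * sizes) / sum(sizes)
round(overall, 3)

# Deviation of each cluster from the overall mean (positive = above average)
round(sweep(cluster_means, 2, overall), 2)
```

The weighted averages agree with the `Mean` row of the `summary()` output (2.455, 3.182, 1.591, 1.318), which confirms that the table and the cluster sizes are consistent with each other.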
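The k-means solution splits the 22 respondents into three clusters of 12, 6, and 4. Between-cluster variation accounts for only 29.5% of the total sum of squares, so the segments overlap considerably and the profiles below are indicative rather than sharply separated. The response coding is not shown in this report, so whether a low CS_helpful or Recommend score is favorable depends on how the survey scale was defined:
Cluster 1:
This is the largest cluster (12 respondents). It has the lowest average
Age (2.17) and Education (2.58), while its CS_helpful (1.58) and
Recommend (1.25) scores sit close to the overall means (1.59 and 1.32).
This suggests a group of younger customers with lower formal education
who give typical ratings for customer support and likelihood to
recommend.
Cluster 2:
Respondents in this cluster (6) have the highest average Age (2.83) and
Education (4.50) and the lowest CS_helpful and Recommend scores: both
means are exactly 1.00, so every member chose the lowest code on both
items. This is the most homogeneous segment on these two questions, not
an intermediate one.
Cluster 3:
This small group (4 respondents) has an above-average Age (2.75), a
mid-range Education (3.00), and the highest CS_helpful (2.50) and
Recommend (2.00) scores. If higher codes mean more favorable answers,
these are the most satisfied and loyal customers; if 1 is the most
favorable answer, they are the least satisfied, and Cluster 2 is the
most satisfied.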
These insights can guide targeted marketing and service strategies, allowing for communications and initiatives tailored to the differing needs and characteristics of each cluster.