Cluster Analysis on Yogurt Survey Data

Data Loading and Preprocessing

# Load the dataset from the attached CSV file
survey_data <- read.csv("customer_segmentation.csv", stringsAsFactors = FALSE)

# Display a summary of the data to ensure it loaded correctly
summary(survey_data)

##        ID          CS_helpful      Recommend       Come_again   
##  Min.   : 1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 6.25   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :11.50   Median :1.000   Median :1.000   Median :1.000  
##  Mean   :11.50   Mean   :1.591   Mean   :1.318   Mean   :1.455  
##  3rd Qu.:16.75   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:2.000  
##  Max.   :22.00   Max.   :3.000   Max.   :3.000   Max.   :3.000  
##   All_Products   Profesionalism    Limitation  Online_grocery     delivery    
##  Min.   :1.000   Min.   :1.000   Min.   :1.0   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.250   1st Qu.:1.000   1st Qu.:1.0   1st Qu.:2.000   1st Qu.:2.000  
##  Median :2.000   Median :1.000   Median :1.0   Median :2.000   Median :3.000  
##  Mean   :2.091   Mean   :1.409   Mean   :1.5   Mean   :2.273   Mean   :2.409  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.0   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :3.000   Max.   :4.0   Max.   :3.000   Max.   :3.000  
##     Pick_up        Find_items     other_shops        Gender     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.250   1st Qu.:1.000  
##  Median :2.000   Median :1.000   Median :2.000   Median :1.000  
##  Mean   :2.455   Mean   :1.455   Mean   :2.591   Mean   :1.273  
##  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:3.750   3rd Qu.:1.750  
##  Max.   :5.000   Max.   :3.000   Max.   :5.000   Max.   :2.000  
##       Age          Education    
##  Min.   :2.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000  
##  Median :2.000   Median :2.500  
##  Mean   :2.455   Mean   :3.182  
##  3rd Qu.:3.000   3rd Qu.:5.000  
##  Max.   :4.000   Max.   :5.000

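# Standardize the data (center and scale each variable)
# Note: this also standardizes the ID column, which is a row identifier rather than
# a survey response, so ID contributes to the distances used by k-means below
survey_data_scaled <- scale(survey_data)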
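As a quick check on what the standardization step does, the sketch below uses a small made-up data frame (`toy_data`, invented for illustration and not taken from the survey file). It drops the identifier column before scaling, which is one possible alternative to the approach above, and confirms that each remaining column ends up with mean 0 and standard deviation 1.

```r
# Toy data, invented for illustration only (not the survey file)
toy_data <- data.frame(
  ID = 1:6,
  Age = c(2, 2, 3, 4, 2, 3),
  Education = c(1, 2, 5, 5, 3, 2)
)

# Drop the identifier column so it does not influence distances
toy_features <- toy_data[, setdiff(names(toy_data), "ID"), drop = FALSE]

# Center each column to mean 0 and scale it to standard deviation 1
toy_scaled <- scale(toy_features)

# Verify the result column by column
col_means <- colMeans(toy_scaled)
col_sds <- apply(toy_scaled, 2, sd)
print(round(col_means, 10))
print(col_sds)
```

Applying the same idea to the survey data would change the clustering results reported here, since those were computed with ID included.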

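# Calculate within-group sum of squares (WSS) for 1 to 10 clusters
# Set the seed first so the random starts, and therefore the elbow plot, are reproducible
set.seed(123)
wss <- numeric(10)
for (i in 1:10) {
  # tot.withinss is the sum of the per-cluster withinss values
  wss[i] <- kmeans(survey_data_scaled, centers = i, nstart = 25)$tot.withinss
}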

# Plot the WSS values to visualize the "elbow"
plot(1:10, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within-group Sum of Squares", 
     main = "Elbow Method for Determining Optimal Clusters")
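An elbow can be easier to judge when WSS is expressed as a share of its value at one cluster. The sketch below illustrates this on synthetic data with three well-separated groups; the data, the seed, and the names `toy`, `toy_wss`, and `pct_remaining` are all assumptions made for the example, not part of the survey analysis.

```r
set.seed(42)  # arbitrary seed, used only to make the sketch repeatable

# Synthetic data: three well-separated groups in two dimensions (not the survey data)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
max_k <- 6
toy_wss <- sapply(1:max_k, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Percentage of the k = 1 sum of squares still unexplained at each k
pct_remaining <- 100 * toy_wss / toy_wss[1]
print(round(pct_remaining, 1))
```

On data like this the percentage drops sharply up to three clusters and flattens afterwards. On small survey samples the curve is often much smoother, so the elbow may be ambiguous.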

# Perform k-means clustering with 3 clusters
set.seed(123)
clusters <- kmeans(survey_data_scaled, centers = 3, nstart = 25)

# Print the clustering result
print(clusters)

## K-means clustering with 3 clusters of sizes 12, 6, 4
## 
## Cluster means:
##            ID  CS_helpful  Recommend  Come_again All_Products Profesionalism
## 1 -0.02566635 -0.01031923 -0.1054899 -0.50262359  -0.39835424     0.01283318
## 2  0.00000000 -0.80490011 -0.4922862  0.06154575   0.07113469    -0.69299145
## 3  0.07699905  1.23830786  1.0548991  1.41555215   1.08836068     1.00098765
##      Limitation Online_grocery   delivery    Pick_up Find_items other_shops
## 1  1.850372e-17     0.40480555  0.2373423  0.5949772 -0.1806489  -0.4212692
## 2 -4.157397e-01    -0.78986449 -1.0112848 -0.4301040 -0.1806489   0.7669260
## 3  6.236096e-01    -0.02961992  0.8049001 -1.1397755  0.8129201   0.1134186
##       Gender        Age  Education
## 1 -0.2326695 -0.3897897 -0.3688989
## 2 -0.2326695  0.5128812  0.8125115
## 3  1.0470128  0.4000473 -0.1120706
## 
## Clustering vector:
##  [1] 1 1 1 3 3 2 1 2 1 1 2 1 2 1 2 2 1 1 3 3 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 110.31871  48.18146  63.52384
##  (between_SS / total_SS =  29.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
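The `between_SS / total_SS` figure in the printout can be reproduced by hand from the within-cluster sums of squares. The sketch below copies those three values from the output above and assumes 22 respondents and 15 standardized columns (including ID), with no constant columns or missing values, so that each column contributes a sum of squares of n - 1.

```r
# Values copied from the printed k-means output above
withinss <- c(110.31871, 48.18146, 63.52384)
n_obs <- 22    # respondents
n_vars <- 15   # standardized columns, including ID

# Each standardized column has sample variance 1, so its sum of squares is n - 1
totss <- (n_obs - 1) * n_vars
tot_withinss <- sum(withinss)
betweenss <- totss - tot_withinss

# Share of total variation explained by the cluster assignment
ratio <- betweenss / totss
print(round(100 * ratio, 1))
```

This gives 29.5, matching the printout: roughly 70% of the variation remains within clusters.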

# Add the cluster results to the original data
survey_data$Cluster <- as.factor(clusters$cluster)

# Scatter plot: Using Age vs. Education as an example
plot(survey_data$Age, survey_data$Education, 
     col = survey_data$Cluster, pch = 19,
     xlab = "Age", ylab = "Education", 
     main = "Cluster Visualization: Age vs Education")

# Add a legend to identify clusters
legend("topright", legend = levels(survey_data$Cluster), 
       col = seq_along(levels(survey_data$Cluster)), pch = 19)

# Aggregate mean values for Age, Education, CS_helpful, and Recommend by cluster
cluster_summary <- aggregate(
  survey_data[, c("Age", "Education", "CS_helpful", "Recommend")],
  by = list(Cluster = survey_data$Cluster),
  FUN = mean
)
print(cluster_summary)

##   Cluster      Age Education CS_helpful Recommend
## 1       1 2.166667  2.583333   1.583333      1.25
## 2       2 2.833333  4.500000   1.000000      1.00
## 3       3 2.750000  3.000000   2.500000      2.00
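The per-cluster means in original units are the standardized cluster centers with the scaling undone. The sketch below checks this on a small made-up data frame (`toy`, invented for the example); it relies on the `scaled:center` and `scaled:scale` attributes that `scale()` attaches to its result.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch repeatable

# Toy data, invented for illustration only (not the survey file)
toy <- data.frame(
  Age = c(2, 2, 3, 4, 2, 3, 4, 3),
  Education = c(1, 2, 5, 5, 3, 2, 4, 4)
)
toy_scaled <- scale(toy)
fit <- kmeans(toy_scaled, centers = 2, nstart = 10)

# Undo the standardization: multiply by column SDs, then add column means
centers_original <- sweep(fit$centers, 2, attr(toy_scaled, "scaled:scale"), "*")
centers_original <- sweep(centers_original, 2, attr(toy_scaled, "scaled:center"), "+")

# The same numbers, obtained by averaging the raw data within each cluster
by_cluster <- aggregate(toy, by = list(Cluster = fit$cluster), FUN = mean)
print(centers_original)
print(by_cluster)
```

Either route gives the same table, so the `aggregate()` call can be extended to every survey column when a full cluster profile in original units is needed.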

Interpretation of the Results

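The k-means analysis of the customer segmentation data produced three clusters of 12, 6, and 4 respondents. The separation is modest: the between-cluster sum of squares accounts for 29.5% of the total, and with only 22 respondents the profiles below should be read as tentative. Age and Education appear to be coded categories, and the direction of the rating scales is not shown in the data, so CS_helpful and Recommend are described in terms of higher or lower scores rather than higher or lower satisfaction.

Cluster 1:
This is the largest cluster (12 respondents). It has the lowest average Age (2.17) and Education (2.58) codes, and its CS_helpful (1.58) and Recommend (1.25) averages fall between those of the other two clusters. It may represent respondents in the lower age and education categories who give moderate scores on customer support and likelihood to recommend.
Cluster 2:
Respondents in this cluster (6) have the highest average Age (2.83) and Education (4.50) codes. Their CS_helpful and Recommend averages are both 1.00, the scale minimum, which means every member gave the lowest code on both items.
Cluster 3:
This group (4 respondents) has the highest CS_helpful (2.50) and Recommend (2.00) averages. Its Age average (2.75) is close to that of Cluster 2, and its Education average (3.00) sits between the other two clusters. Whether these scores reflect higher or lower satisfaction and loyalty depends on how the survey items were coded.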

These insights can guide targeted marketing and service strategies, allowing for communications and initiatives tailored to the differing needs and characteristics of each cluster.

Logan Salcido

04/09/2025
