Grocery Customer Segmentation Analysis

Reflection This analysis applies k-means clustering to a grocery store customer survey dataset in order to identify distinct customer segments based on service satisfaction, product range perceptions, and digital channel usage. Using R, the workflow encompasses data cleaning, descriptive statistics, demographic profiling, correlation analysis, and unsupervised machine learning. Eleven Likert-scale survey items were scaled and submitted to a k-means algorithm, with the optimal number of clusters determined through an elbow plot. The resulting three segments were then visualized via PCA projection and profiled across both behavioral and demographic dimensions, culminating in a summary table that quantifies each segment’s service, channel, and product scores.

Completing this analysis deepened my understanding of the full customer segmentation pipeline, from raw survey data to actionable consumer profiles. Working through the k-means methodology reinforced the importance of feature scaling prior to clustering, as unscaled variables would disproportionately influence distance calculations and distort segment boundaries. Debugging the summarise() error in the summary table further illustrated a critical distinction in how dplyr handles grouped versus ungrouped data frames, specifically that row-level operations such as rowMeans() must be computed before group_by() is called rather than inside summarise(). Taken together, this lab strengthened both my technical proficiency in R and my conceptual understanding of how quantitative segmentation methods can inform retail marketing strategy.

1. Packages

required <- c("tidyverse", "knitr", "kableExtra", "corrplot",
              "cluster", "factoextra", "scales")

new_pkgs <- required[!required %in% rownames(installed.packages())]
if (length(new_pkgs)) install.packages(new_pkgs, repos = "https://cran.r-project.org")

library(tidyverse)
library(knitr)
library(kableExtra)
library(corrplot)
library(cluster)
library(factoextra)
library(scales)

2. Load & Clean Data

df_raw <- read_csv("customer_segmentation.csv", show_col_types = FALSE)

df <- df_raw %>%
  rename(Professionalism = Profesionalism) %>%
  mutate(
    Gender_lbl    = factor(Gender,    levels = 1:2,         labels = c("Male", "Female")),
    Age_lbl       = factor(Age,       levels = 2:4,         labels = c("18-30", "31-45", "46+")),
    Education_lbl = factor(Education, levels = c(1, 2, 3, 5),
                           labels = c("High School", "Some College", "Bachelor's", "Graduate"))
  )

cat("Dataset:", nrow(df), "respondents x", ncol(df_raw), "variables\n")

## Dataset: 22 respondents x 15 variables

glimpse(df)

## Rows: 22
## Columns: 18
## $ ID              <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ CS_helpful      <dbl> 2, 1, 2, 3, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, …
## $ Recommend       <dbl> 2, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Come_again      <dbl> 2, 1, 1, 2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, …
## $ All_Products    <dbl> 2, 1, 1, 4, 5, 2, 2, 2, 2, 1, 2, 2, 1, 2, 4, 2, 2, 1, …
## $ Professionalism <dbl> 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Limitation      <dbl> 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, …
## $ Online_grocery  <dbl> 2, 2, 3, 3, 2, 1, 2, 1, 2, 3, 2, 3, 1, 3, 2, 3, 2, 3, …
## $ delivery        <dbl> 3, 3, 3, 3, 3, 2, 2, 1, 1, 2, 2, 2, 2, 3, 2, 1, 3, 3, …
## $ Pick_up         <dbl> 4, 3, 2, 2, 1, 1, 2, 2, 3, 2, 2, 3, 2, 3, 2, 3, 5, 3, …
## $ Find_items      <dbl> 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, …
## $ other_shops     <dbl> 2, 2, 3, 2, 3, 4, 1, 4, 1, 1, 3, 3, 1, 1, 5, 5, 5, 2, …
## $ Gender          <dbl> 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, …
## $ Age             <dbl> 2, 2, 2, 3, 4, 2, 2, 2, 2, 2, 4, 3, 4, 3, 2, 3, 2, 2, …
## $ Education       <dbl> 2, 2, 2, 5, 2, 5, 3, 2, 1, 2, 5, 1, 5, 5, 5, 5, 1, 5, …
## $ Gender_lbl      <fct> Male, Male, Male, Male, Female, Male, Male, Male, Fema…
## $ Age_lbl         <fct> 18-30, 18-30, 18-30, 31-45, 46+, 18-30, 18-30, 18-30, …
## $ Education_lbl   <fct> Some College, Some College, Some College, Graduate, So…

3. Data Quality

Missing Values

miss <- df %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>%
  filter(Missing > 0)

if (nrow(miss) == 0) {
  cat("No missing values found.\n")
} else {
  miss %>%
    kable(caption = "Variables with Missing Values") %>%
    kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
}

## No missing values found.

Descriptive Statistics

df %>%
  select(where(is.numeric), -ID) %>%
  summarise(across(everything(), list(
    Min  = min,
    Max  = max,
    Mean = ~round(mean(.), 2),
    SD   = ~round(sd(.), 2)
  ))) %>%
  pivot_longer(everything(), names_to = "key", values_to = "val") %>%
  separate(key, into = c("Variable", "Stat"), sep = "_(?=[^_]+$)") %>%
  pivot_wider(names_from = Stat, values_from = val) %>%
  kable(caption = "Descriptive Statistics (numeric variables)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)

Descriptive Statistics (numeric variables)
Variable	Min	Max	Mean	SD
CS_helpful	1	3	1.59	0.73
Recommend	1	3	1.32	0.65
Come_again	1	3	1.45	0.74
All_Products	1	5	2.09	1.06
Professionalism	1	3	1.41	0.59
Limitation	1	4	1.50	0.80
Online_grocery	1	3	2.27	0.77
delivery	1	3	2.41	0.73
Pick_up	1	5	2.45	1.06
Find_items	1	3	1.45	0.67
other_shops	1	5	2.59	1.40
Gender	1	2	1.27	0.46
Age	2	4	2.45	0.74
Education	1	5	3.18	1.62

4. Demographic Profile

Gender Distribution

demo_bar(df, Gender_lbl, "Gender Distribution", c("#4E79A7", "#F28E2B"))

Age Group Distribution

demo_bar(df, Age_lbl, "Age Group Distribution", c("#59A14F", "#EDC948", "#E15759"))

Education Level Distribution

demo_bar(df, Education_lbl, "Education Level Distribution",
         c("#76B7B2", "#FF9DA7", "#9C755F", "#BAB0AC"))

5. Survey Response Distributions

Response Distributions by Item

df %>%
  select(all_of(likert_vars)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Score") %>%
  mutate(Variable = recode(Variable, !!!var_labels)) %>%
  ggplot(aes(x = factor(Score), fill = factor(Score))) +
  geom_bar(show.legend = FALSE) +
  facet_wrap(~Variable, ncol = 3) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title    = "Response Distribution by Survey Item",
    subtitle = "1 = Most favorable / most frequent",
    x = "Score", y = "Count"
  ) +
  theme_minimal(base_size = 11) +
  theme(strip.text = element_text(face = "bold"))

Mean Scores (Ranked)

df %>%
  select(all_of(likert_vars)) %>%
  summarise(across(everything(), mean)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Mean") %>%
  mutate(
    Label = recode(Variable, !!!var_labels),
    Category = case_when(
      Variable %in% c("CS_helpful", "Recommend", "Come_again",
                      "Professionalism", "Find_items") ~ "Service / Experience",
      Variable %in% c("All_Products", "Limitation")    ~ "Product Range",
      TRUE                                              ~ "Shopping Channel"
    )
  ) %>%
  ggplot(aes(x = reorder(Label, Mean), y = Mean, fill = Category)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = round(Mean, 2)), hjust = -0.15, size = 3.8) +
  coord_flip() +
  scale_y_continuous(limits = c(0, 4.2)) +
  scale_fill_brewer(palette = "Set1") +
  labs(
    title    = "Mean Score per Survey Item",
    subtitle = "Lower = more favorable",
    x = NULL, y = "Mean Score", fill = "Category"
  ) +
  theme_minimal(base_size = 12)

6. Correlation Matrix

cor_mat <- df %>%
  select(all_of(likert_vars)) %>%
  cor(use = "pairwise.complete.obs")

colnames(cor_mat) <- var_labels[colnames(cor_mat)]
rownames(cor_mat) <- var_labels[rownames(cor_mat)]

corrplot(
  cor_mat,
  method      = "color",
  type        = "upper",
  order       = "hclust",
  addCoef.col = "black",
  number.cex  = 0.68,
  tl.cex      = 0.75,
  tl.col      = "black",
  col         = colorRampPalette(c("#2166AC", "white", "#D6604D"))(200),
  title       = "Correlation Matrix (hierarchical order)",
  mar         = c(0, 0, 2, 0)
)

7. Customer Segmentation (K-Means)

Scale Features

df_scaled <- df %>%
  select(all_of(likert_vars)) %>%
  scale()

Elbow Plot

set.seed(42)
wss <- sapply(1:8, function(k) {
  kmeans(df_scaled, centers = k, nstart = 25)$tot.withinss
})

tibble(k = 1:8, WSS = wss) %>%
  ggplot(aes(x = k, y = WSS)) +
  geom_line(linewidth = 1, color = "#4E79A7") +
  geom_point(size = 3.5, color = "#E15759") +
  scale_x_continuous(breaks = 1:8) +
  labs(
    title    = "Elbow Plot — Choosing the Number of Clusters",
    subtitle = "Select k at the 'elbow' where WSS stops dropping sharply",
    x = "k (clusters)", y = "Total Within-Cluster SS"
  ) +
  theme_minimal(base_size = 13)

Fit k = 3

set.seed(42)
km <- kmeans(df_scaled, centers = 3, nstart = 50)

df <- df %>%
  mutate(Cluster = factor(km$cluster, labels = c("Segment A", "Segment B", "Segment C")))

cat("Cluster sizes:\n")

## Cluster sizes:

print(table(df$Cluster))

## 
## Segment A Segment B Segment C 
##        10         4         8

PCA Cluster Plot

fviz_cluster(
  km,
  data         = df_scaled,
  geom         = "point",
  ellipse.type = "convex",
  palette      = c("#E15759", "#4E79A7", "#59A14F"),
  ggtheme      = theme_minimal(base_size = 12),
  main         = "Customer Segments — PCA Projection"
)

Cluster Profiles

df %>%
  group_by(Cluster) %>%
  summarise(across(all_of(likert_vars), mean)) %>%
  pivot_longer(-Cluster, names_to = "Variable", values_to = "Mean") %>%
  mutate(Label = recode(Variable, !!!var_labels)) %>%
  ggplot(aes(x = Label, y = Mean, fill = Cluster)) +
  geom_col(position = "dodge", width = 0.7) +
  coord_flip() +
  scale_fill_manual(values = c(
    "Segment A" = "#E15759",
    "Segment B" = "#4E79A7",
    "Segment C" = "#59A14F"
  )) +
  labs(
    title    = "Mean Score per Item by Customer Segment",
    subtitle = "Lower = more favorable",
    x = NULL, y = "Mean Score", fill = "Segment"
  ) +
  theme_minimal(base_size = 11)

Demographic Breakdown by Cluster

stacked_demo(df, Gender_lbl,    "Gender Mix by Segment")

stacked_demo(df, Age_lbl,       "Age Mix by Segment")

stacked_demo(df, Education_lbl, "Education Mix by Segment")

8. Service vs. Channel Preference

df %>%
  mutate(
    Service_Score = rowMeans(select(., CS_helpful, Recommend, Come_again,
                                    Professionalism, Find_items)),
    Channel_Score = rowMeans(select(., Online_grocery, delivery, Pick_up))
  ) %>%
  ggplot(aes(x = Service_Score, y = Channel_Score, color = Cluster)) +
  geom_point(size = 4, alpha = 0.85) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 0.8,
              aes(group = 1), color = "grey40", alpha = 0.15) +
  scale_color_manual(values = c(
    "Segment A" = "#E15759",
    "Segment B" = "#4E79A7",
    "Segment C" = "#59A14F"
  )) +
  labs(
    title    = "Service Satisfaction vs. Digital Channel Usage",
    subtitle = "Lower = more satisfied / heavier channel user (scale: 1–5)",
    x = "Service Score (avg)", y = "Channel Score (avg)", color = "Segment"
  ) +
  theme_minimal(base_size = 13)

9. Summary Table

df %>%
  mutate(
    Service_Score = rowMeans(cbind(CS_helpful, Recommend, Come_again,
                                   Professionalism, Find_items)),
    Channel_Score = rowMeans(cbind(Online_grocery, delivery, Pick_up)),
    Product_Score = rowMeans(cbind(All_Products, Limitation))
  ) %>%
  group_by(Cluster) %>%
  summarise(
    N           = n(),
    Pct_Female  = percent(mean(Gender == 2), 1),
    Avg_Service = round(mean(Service_Score), 2),
    Avg_Channel = round(mean(Channel_Score), 2),
    Avg_Product = round(mean(Product_Score), 2)
  ) %>%
  kable(caption = "Segment Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Segment Summary
Cluster	N	Pct_Female	Avg_Service	Avg_Channel	Avg_Product
Segment A	10	30%	1.16	1.97	1.55
Segment B	4	75%	2.20	2.17	2.62
Segment C	8	0%	1.43	3.00	1.69

10. Session Info

sessionInfo()

## R version 4.5.3 (2026-03-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] scales_1.4.0     factoextra_2.0.0 cluster_2.1.8.2  corrplot_0.95   
##  [5] kableExtra_1.4.0 knitr_1.51       lubridate_1.9.5  forcats_1.0.1   
##  [9] stringr_1.6.0    dplyr_1.2.0      purrr_1.2.1      readr_2.2.0     
## [13] tidyr_1.3.2      tibble_3.3.1     ggplot2_4.0.2    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.56          bslib_0.10.0       ggrepel_0.9.8     
##  [5] rstatix_0.7.3      lattice_0.22-7     tzdb_0.5.0         vctrs_0.7.1       
##  [9] tools_4.5.3        generics_0.1.4     parallel_4.5.3     pkgconfig_2.0.3   
## [13] Matrix_1.7-4       RColorBrewer_1.1-3 S7_0.2.1           lifecycle_1.0.5   
## [17] compiler_4.5.3     farver_2.1.2       textshaping_1.0.4  carData_3.0-6     
## [21] htmltools_0.5.9    sass_0.4.10        yaml_2.3.12        Formula_1.2-5     
## [25] pillar_1.11.1      car_3.1-5          ggpubr_0.6.3       crayon_1.5.3      
## [29] jquerylib_0.1.4    cachem_1.1.0       abind_1.4-8        nlme_3.1-168      
## [33] tidyselect_1.2.1   digest_0.6.39      stringi_1.8.7      splines_4.5.3     
## [37] labeling_0.4.3     fastmap_1.2.0      grid_4.5.3         cli_3.6.5         
## [41] magrittr_2.0.4     broom_1.0.12       withr_3.0.2        backports_1.5.0   
## [45] bit64_4.6.0-1      timechange_0.4.0   rmarkdown_2.30     bit_4.6.0         
## [49] otel_0.2.0         ggsignif_0.6.4     hms_1.1.4          evaluate_1.0.5    
## [53] viridisLite_0.4.3  mgcv_1.9-3         rlang_1.1.7        Rcpp_1.1.1        
## [57] glue_1.8.0         xml2_1.5.2         svglite_2.2.2      rstudioapi_0.18.0 
## [61] vroom_1.7.0        jsonlite_2.0.0     R6_2.6.1           systemfonts_1.3.1