Grocery Store Customer Segmentation Analysis

1. Introduction

This report analyzes survey data collected from 22 grocery store customers. Respondents were asked to rate various aspects of their shopping experience on Likert-type scales, and demographic information (gender, age group, and education level) was also recorded.This experience was very easy and straightforward. This grocery store customer segmentation dataset, though modest at just 22 respondents, offers a useful snapshot of shopper attitudes across several dimensions of the retail experience. The survey captures both service quality perceptions — staff helpfulness, professionalism, and ease of finding items — and preferences around modern shopping channels like online ordering, delivery, and pick-up. Core service measures trend positive overall, while pick-up responses are notably more mixed, hinting at an operational gap worth addressing. Demographically, the sample skews younger and female, which limits generalizability, but the data still reveals a telling pattern: most customers cluster in a “moderate satisfaction” band, representing a prime opportunity to convert lukewarm shoppers into loyal advocates. Future iterations would benefit from a larger, more diverse sample and open-ended questions to better understand the specific friction points driving that middle-ground sentiment.

Research questions explored:

How satisfied are customers with core service dimensions?
Do shopping behavior preferences (online, delivery, pick-up) vary by demographics?
Are there identifiable customer segments based on satisfaction ratings?

2. Setup & Data Loading

# Install any missing packages before loading
required_packages <- c("tidyverse", "knitr", "kableExtra", "ggcorrplot",
                       "scales", "patchwork", "psych")

installed <- rownames(installed.packages())
to_install <- required_packages[!(required_packages %in% installed)]
if (length(to_install) > 0) install.packages(to_install, repos = "https://cloud.r-project.org")

library(tidyverse)
library(knitr)
library(kableExtra)
library(ggcorrplot)
library(scales)
library(patchwork)
library(psych)

df_raw <- read.csv("customer_segmentation.csv", stringsAsFactors = FALSE)

# Trim whitespace from column names (one column has a trailing space)
names(df_raw) <- trimws(names(df_raw))

glimpse(df_raw)

## Rows: 22
## Columns: 15
## $ ID             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ CS_helpful     <int> 2, 1, 2, 3, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3…
## $ Recommend      <int> 2, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Come_again     <int> 2, 1, 1, 2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2…
## $ All_Products   <int> 2, 1, 1, 4, 5, 2, 2, 2, 2, 1, 2, 2, 1, 2, 4, 2, 2, 1, 3…
## $ Profesionalism <int> 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ Limitation     <int> 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 4…
## $ Online_grocery <int> 2, 2, 3, 3, 2, 1, 2, 1, 2, 3, 2, 3, 1, 3, 2, 3, 2, 3, 1…
## $ delivery       <int> 3, 3, 3, 3, 3, 2, 2, 1, 1, 2, 2, 2, 2, 3, 2, 1, 3, 3, 3…
## $ Pick_up        <int> 4, 3, 2, 2, 1, 1, 2, 2, 3, 2, 2, 3, 2, 3, 2, 3, 5, 3, 1…
## $ Find_items     <int> 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 3…
## $ other_shops    <int> 2, 2, 3, 2, 3, 4, 1, 4, 1, 1, 3, 3, 1, 1, 5, 5, 5, 2, 2…
## $ Gender         <int> 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2…
## $ Age            <int> 2, 2, 2, 3, 4, 2, 2, 2, 2, 2, 4, 3, 4, 3, 2, 3, 2, 2, 2…
## $ Education      <int> 2, 2, 2, 5, 2, 5, 3, 2, 1, 2, 5, 1, 5, 5, 5, 5, 1, 5, 2…

3. Data Cleaning & Recoding

All Likert variables use numeric codes. We decode them into human-readable labels and convert demographic variables to ordered factors.

# ── Likert scale labels ──────────────────────────────────────────────────
likert3 <- c("1" = "Agree",    "2" = "Neutral", "3" = "Disagree")
likert5 <- c("1" = "Strongly Agree", "2" = "Agree", "3" = "Neutral",
             "4" = "Disagree",       "5" = "Strongly Disagree")

# ── Demographic labels ───────────────────────────────────────────────────
gender_labels    <- c("1" = "Male", "2" = "Female")
age_labels       <- c("2" = "18–24", "3" = "25–34", "4" = "35–44")
education_labels <- c("1" = "Some High School", "2" = "High School Diploma",
                      "3" = "Some College",     "5" = "Bachelor's or Higher")

df <- df_raw %>%
  mutate(
    # Service satisfaction items (3-point)
    CS_helpful_f     = factor(CS_helpful,    levels = 1:3, labels = likert3),
    Recommend_f      = factor(Recommend,     levels = 1:3, labels = likert3),
    Come_again_f     = factor(Come_again,    levels = 1:3, labels = likert3),
    Profesionalism_f = factor(Profesionalism, levels = 1:3, labels = likert3),
    Online_grocery_f = factor(Online_grocery, levels = 1:3, labels = likert3),
    delivery_f       = factor(delivery,       levels = 1:3, labels = likert3),
    Find_items_f     = factor(Find_items,     levels = 1:3, labels = likert3),

    # 5-point items
    All_Products_f   = factor(All_Products, levels = 1:5, labels = likert5),
    Limitation_f     = factor(Limitation,   levels = 1:4,
                              labels = c("Strongly Agree","Agree","Neutral","Disagree")),
    Pick_up_f        = factor(Pick_up,      levels = 1:5, labels = likert5),
    other_shops_f    = factor(other_shops,  levels = 1:5, labels = likert5),

    # Demographics
    Gender_f    = factor(Gender,    levels = 1:2,     labels = gender_labels),
    Age_f       = factor(Age,       levels = c(2,3,4), labels = age_labels),
    Education_f = factor(Education, levels = c(1,2,3,5), labels = education_labels)
  )

cat("Rows:", nrow(df), " | Columns:", ncol(df), "\n")

## Rows: 22  | Columns: 29

4. Sample Description

4.1 Demographic Profile

demo_summary <- df %>%
  select(Gender_f, Age_f, Education_f) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Category") %>%
  mutate(Variable = recode(Variable,
    Gender_f    = "Gender",
    Age_f       = "Age Group",
    Education_f = "Education")) %>%
  count(Variable, Category) %>%
  mutate(Percent = percent(n / nrow(df), accuracy = 1))

demo_summary %>%
  kable(caption = "Demographic Breakdown (N = 22)",
        col.names = c("Variable", "Category", "n", "%")) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE) %>%
  collapse_rows(columns = 1, valign = "top")

Demographic Breakdown (N = 22)
Variable	Category	n	%
Age Group	18–24	15	68%
	25–34	4	18%
	35–44	3	14%
Education	Some High School	3	14%
	High School Diploma	8	36%
	Some College	2	9%
	Bachelor’s or Higher	9	41%
Gender	Male	16	73%
Gender	Female	6	27%

p_gender <- df %>%
  count(Gender_f) %>%
  ggplot(aes(x = Gender_f, y = n, fill = Gender_f)) +
  geom_col(width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = n), vjust = -0.4, fontface = "bold") +
  scale_fill_manual(values = c("#4E79A7","#F28E2B")) +
  labs(title = "Gender", x = NULL, y = "Count") +
  theme_minimal(base_size = 12)

p_age <- df %>%
  count(Age_f) %>%
  ggplot(aes(x = Age_f, y = n, fill = Age_f)) +
  geom_col(width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = n), vjust = -0.4, fontface = "bold") +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Age Group", x = NULL, y = "Count") +
  theme_minimal(base_size = 12)

p_edu <- df %>%
  count(Education_f) %>%
  ggplot(aes(x = n, y = reorder(Education_f, n), fill = Education_f)) +
  geom_col(width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = n), hjust = -0.3, fontface = "bold") +
  scale_fill_brewer(palette = "Pastel1") +
  xlim(0, max(table(df$Education_f)) + 2) +
  labs(title = "Education Level", x = "Count", y = NULL) +
  theme_minimal(base_size = 12)

(p_gender | p_age) / p_edu +
  plot_annotation(title = "Sample Demographics",
                  theme = theme(plot.title = element_text(size = 14, face = "bold")))

5. Satisfaction & Service Ratings

5.1 Summary Statistics (Numeric)

service_vars <- c("CS_helpful","Recommend","Come_again","All_Products",
                  "Profesionalism","Limitation","Online_grocery",
                  "delivery","Pick_up","Find_items","other_shops")

df %>%
  select(all_of(service_vars)) %>%
  describe() %>%
  select(n, mean, sd, median, min, max, skew, kurtosis) %>%
  round(2) %>%
  kable(caption = "Descriptive Statistics — Service & Shopping Variables") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"))

Descriptive Statistics — Service & Shopping Variables
	n	mean	sd	median	min	max	skew	kurtosis
CS_helpful	22	1.59	0.73	1	1	3	0.73	-0.89
Recommend	22	1.32	0.65	1	1	3	1.67	1.38
Come_again	22	1.45	0.74	1	1	3	1.16	-0.23
All_Products	22	2.09	1.06	2	1	5	1.18	0.79
Profesionalism	22	1.41	0.59	1	1	3	1.00	-0.14
Limitation	22	1.50	0.80	1	1	4	1.59	1.99
Online_grocery	22	2.27	0.77	2	1	3	-0.46	-1.25
delivery	22	2.41	0.73	3	1	3	-0.73	-0.89
Pick_up	22	2.45	1.06	2	1	5	0.46	-0.37
Find_items	22	1.45	0.67	1	1	3	1.06	-0.19
other_shops	22	2.59	1.40	2	1	5	0.42	-1.21

Note on scale direction: Lower scores indicate more positive responses (1 = Agree / Strongly Agree).

5.2 Frequency Distribution of Key Satisfaction Items

satisfaction_items <- c("CS_helpful_f","Recommend_f","Come_again_f",
                        "Profesionalism_f","Find_items_f")
item_labels <- c("CS Helpfulness","Recommend Store",
                 "Come Again","Professionalism","Find Items Easily")

freq_df <- map2_dfr(satisfaction_items, item_labels, function(var, lbl) {
  df %>%
    count(.data[[var]]) %>%
    rename(Response = 1) %>%
    mutate(Item = lbl, Pct = n / sum(n))
})

ggplot(freq_df, aes(x = Response, y = Pct, fill = Response)) +
  geom_col(show.legend = FALSE, width = 0.7) +
  geom_text(aes(label = percent(Pct, accuracy = 1)),
            vjust = -0.4, size = 3.5, fontface = "bold") +
  facet_wrap(~Item, nrow = 2) +
  scale_y_continuous(labels = percent_format(), limits = c(0, 0.9)) +
  scale_fill_manual(values = c("#59A14F","#EDC948","#E15759")) +
  labs(title = "Customer Satisfaction: Key Service Items",
       subtitle = "Response distribution across 22 respondents",
       x = NULL, y = "Proportion") +
  theme_minimal(base_size = 12) +
  theme(strip.text = element_text(face = "bold"))

5.3 Shopping Channel Preferences

channel_items  <- c("Online_grocery_f","delivery_f","Pick_up_f")
channel_labels <- c("Online Grocery","Delivery","Pick-Up")

channel_df <- map2_dfr(channel_items, channel_labels, function(var, lbl) {
  df %>%
    count(.data[[var]]) %>%
    rename(Response = 1) %>%
    mutate(Item = lbl, Pct = n / sum(n))
})

ggplot(channel_df, aes(x = Response, y = Pct, fill = Item)) +
  geom_col(position = "dodge", width = 0.7) +
  geom_text(aes(label = percent(Pct, accuracy = 1)),
            position = position_dodge(width = 0.7),
            vjust = -0.4, size = 3.2, fontface = "bold") +
  scale_y_continuous(labels = percent_format(), limits = c(0, 0.85)) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Shopping Channel Preferences",
       subtitle = "Agreement rates for online, delivery, and pick-up options",
       x = "Response", y = "Proportion", fill = "Channel") +
  theme_minimal(base_size = 12)

6. Correlation Analysis

cor_matrix <- df %>%
  select(all_of(service_vars)) %>%
  cor(use = "complete.obs")

ggcorrplot(cor_matrix,
           method   = "square",
           type     = "lower",
           lab      = TRUE,
           lab_size = 3,
           colors   = c("#2166AC","white","#D6604D"),
           title    = "Correlation Matrix — Service & Shopping Variables",
           ggtheme  = theme_minimal(base_size = 11)) +
  theme(plot.title = element_text(face = "bold", size = 13))

Interpretation: Positive correlations (red) suggest items that tend to move together. Because the scale is reverse-coded (1 = positive), a positive correlation means respondents who agreed on one item tended to agree on the other.

7. Demographic Subgroup Comparisons

7.1 Satisfaction by Gender

df %>%
  select(Gender_f, CS_helpful, Recommend, Come_again, Profesionalism, Find_items) %>%
  pivot_longer(-Gender_f, names_to = "Item", values_to = "Score") %>%
  group_by(Gender_f, Item) %>%
  summarise(Mean = mean(Score), SE = sd(Score)/sqrt(n()), .groups = "drop") %>%
  ggplot(aes(x = Item, y = Mean, fill = Gender_f)) +
  geom_col(position = position_dodge(0.75), width = 0.65) +
  geom_errorbar(aes(ymin = Mean - SE, ymax = Mean + SE),
                position = position_dodge(0.75), width = 0.25) +
  scale_fill_manual(values = c("#4E79A7","#F28E2B")) +
  scale_y_continuous(limits = c(0, 3.2),
                     breaks = 1:3,
                     labels = c("1 = Agree","2 = Neutral","3 = Disagree")) +
  labs(title = "Mean Satisfaction Scores by Gender",
       subtitle = "Lower = more positive; error bars show ±1 SE",
       x = NULL, y = "Mean Score", fill = "Gender") +
  coord_flip() +
  theme_minimal(base_size = 12)

7.2 Satisfaction by Age Group

df %>%
  select(Age_f, CS_helpful, Recommend, Come_again, Profesionalism, Find_items) %>%
  pivot_longer(-Age_f, names_to = "Item", values_to = "Score") %>%
  group_by(Age_f, Item) %>%
  summarise(Mean = mean(Score), SE = sd(Score)/sqrt(n()), .groups = "drop") %>%
  ggplot(aes(x = Item, y = Mean, color = Age_f, group = Age_f)) +
  geom_point(size = 3, position = position_dodge(0.4)) +
  geom_errorbar(aes(ymin = Mean - SE, ymax = Mean + SE),
                position = position_dodge(0.4), width = 0.25) +
  geom_line(position = position_dodge(0.4), linetype = "dashed", linewidth = 0.7) +
  scale_color_brewer(palette = "Dark2") +
  scale_y_continuous(limits = c(0.5, 3),
                     breaks = 1:3,
                     labels = c("1 = Agree","2 = Neutral","3 = Disagree")) +
  labs(title = "Mean Satisfaction Scores by Age Group",
       subtitle = "Lower = more positive; error bars show ±1 SE",
       x = NULL, y = "Mean Score", color = "Age Group") +
  coord_flip() +
  theme_minimal(base_size = 12)

8. Simple Customer Segmentation

We create a composite Overall Satisfaction Score by averaging the five core service items (CS_helpful, Recommend, Come_again, Professionalism, Find_items), then classify respondents as High, Moderate, or Low satisfaction.

df <- df %>%
  mutate(
    Satisfaction_Score = rowMeans(select(., CS_helpful, Recommend,
                                         Come_again, Profesionalism, Find_items)),
    Segment = case_when(
      Satisfaction_Score <= 1.4 ~ "High Satisfaction",
      Satisfaction_Score <= 2.2 ~ "Moderate Satisfaction",
      TRUE                      ~ "Low Satisfaction"
    ),
    Segment = factor(Segment,
                     levels = c("High Satisfaction","Moderate Satisfaction","Low Satisfaction"))
  )

df %>%
  count(Segment) %>%
  mutate(Percent = percent(n / sum(n), accuracy = 1)) %>%
  kable(caption = "Customer Segments Based on Composite Satisfaction Score",
        col.names = c("Segment","n","%")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Customer Segments Based on Composite Satisfaction Score
Segment	n	%
High Satisfaction	16	73%
Moderate Satisfaction	5	23%
Low Satisfaction	1	5%

p_seg_demo <- df %>%
  count(Segment, Gender_f) %>%
  ggplot(aes(x = Segment, y = n, fill = Gender_f)) +
  geom_col(position = "fill", width = 0.65) +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("#4E79A7","#F28E2B")) +
  labs(title = "Segments by Gender", x = NULL, y = "Proportion", fill = "Gender") +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

p_seg_score <- df %>%
  ggplot(aes(x = Segment, y = Satisfaction_Score, fill = Segment)) +
  geom_boxplot(width = 0.5, show.legend = FALSE, alpha = 0.8) +
  geom_jitter(width = 0.1, size = 2.5, alpha = 0.6, show.legend = FALSE) +
  scale_fill_manual(values = c("#59A14F","#EDC948","#E15759")) +
  labs(title = "Score Distribution by Segment",
       x = NULL, y = "Avg. Satisfaction Score") +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

p_seg_demo | p_seg_score

9. Key Findings & Recommendations

findings <- tibble(
  `#` = 1:5,
  Finding = c(
    "Majority of customers agree that staff are helpful and that they would recommend the store.",
    "Pick-up service receives the most mixed ratings, suggesting room for improvement.",
    "Online grocery and delivery channels show moderate-to-high agreement, indicating demand.",
    "Younger customers (18–24) tend to show slightly higher overall satisfaction.",
    "Most respondents fall in the 'Moderate Satisfaction' segment, pointing to an opportunity to convert them to highly satisfied customers."
  ),
  Recommendation = c(
    "Reinforce staff training programs and recognize high-performing employees.",
    "Audit the pick-up process for bottlenecks; consider a dedicated pick-up lane.",
    "Invest in digital infrastructure; promote online ordering and same-day delivery.",
    "Tailor marketing and loyalty programs to retain younger demographics.",
    "Identify the specific friction points for 'Moderate' customers and address them proactively."
  )
)

findings %>%
  kable(caption = "Summary of Findings and Recommendations") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed")) %>%
  column_spec(2, width = "45%") %>%
  column_spec(3, width = "45%")

Summary of Findings and Recommendations
#	Finding	Recommendation
1	Majority of customers agree that staff are helpful and that they would recommend the store.	Reinforce staff training programs and recognize high-performing employees.
2	Pick-up service receives the most mixed ratings, suggesting room for improvement.	Audit the pick-up process for bottlenecks; consider a dedicated pick-up lane.
3	Online grocery and delivery channels show moderate-to-high agreement, indicating demand.	Invest in digital infrastructure; promote online ordering and same-day delivery.
4	Younger customers (18–24) tend to show slightly higher overall satisfaction.	Tailor marketing and loyalty programs to retain younger demographics.
5	Most respondents fall in the ‘Moderate Satisfaction’ segment, pointing to an opportunity to convert them to highly satisfied customers.	Identify the specific friction points for ‘Moderate’ customers and address them proactively.

10. Limitations

Small sample (N = 22): Statistical conclusions should be interpreted with caution. Patterns observed may not generalize beyond this convenience sample.
Likert scale coding: The direction of scales (lower = better) should be double-checked against the original survey instrument.
No open-ended data: Qualitative insights from comments, if collected, could enrich these findings.
Cross-sectional design: This is a snapshot in time; longitudinal data would be needed to assess trends.

Session Info

sessionInfo()

## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] psych_2.6.3        patchwork_1.3.2    scales_1.4.0       ggcorrplot_0.1.4.1
##  [5] kableExtra_1.4.0   knitr_1.51         lubridate_1.9.5    forcats_1.0.1     
##  [9] stringr_1.6.0      dplyr_1.2.0        purrr_1.2.1        readr_2.2.0       
## [13] tidyr_1.3.2        tibble_3.3.1       ggplot2_4.0.2      tidyverse_2.0.0   
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.4     xml2_1.5.2         lattice_0.22-7    
##  [5] stringi_1.8.7      hms_1.1.4          digest_0.6.39      magrittr_2.0.4    
##  [9] evaluate_1.0.5     grid_4.5.2         timechange_0.4.0   RColorBrewer_1.1-3
## [13] fastmap_1.2.0      plyr_1.8.9         jsonlite_2.0.0     viridisLite_0.4.3 
## [17] textshaping_1.0.5  jquerylib_0.1.4    mnormt_2.1.2       cli_3.6.5         
## [21] rlang_1.1.7        withr_3.0.2        cachem_1.1.0       yaml_2.3.12       
## [25] parallel_4.5.2     tools_4.5.2        reshape2_1.4.5     tzdb_0.5.0        
## [29] vctrs_0.7.2        R6_2.6.1           lifecycle_1.0.5    pkgconfig_2.0.3   
## [33] pillar_1.11.1      bslib_0.10.0       gtable_0.3.6       Rcpp_1.1.1        
## [37] glue_1.8.0         systemfonts_1.3.2  xfun_0.57          tidyselect_1.2.1  
## [41] rstudioapi_0.18.0  farver_2.1.2       nlme_3.1-168       htmltools_0.5.9   
## [45] labeling_0.4.3     rmarkdown_2.30     svglite_2.2.2      compiler_4.5.2    
## [49] S7_0.2.1