1 Introduction

This report analyzes survey data collected from 22 grocery store customers. The dataset captures satisfaction ratings, shopping preferences, and demographic information. Our goal is to uncover patterns in customer behavior, identify distinct customer segments, and provide actionable recommendations.

1.1 Research Questions

What are the overall satisfaction levels among customers?
Do satisfaction levels differ by demographic group (gender, age, education)?
Can we identify distinct customer segments based on shopping preferences?
What factors most strongly predict customer loyalty (likelihood to return/recommend)?

2 Setup & Data Loading

# Load required packages
library(tidyverse)   # Data wrangling and ggplot2 visualization
library(corrplot)    # Correlation matrix visualization
library(factoextra)  # K-Means clustering visualization
library(knitr)       # Nice tables
library(kableExtra)  # Enhanced tables
library(scales)      # Axis formatting

# Load the dataset
df <- read.csv("customer_segmentation.csv")

# Recode factor variables with meaningful labels
df <- df %>%
  mutate(
    Gender_label    = factor(Gender,    levels = c(1, 2),
                             labels = c("Male", "Female")),
    Age_label       = factor(Age,       levels = c(2, 3, 4),
                             labels = c("Young Adult (18-30)", "Middle-Aged (31-50)", "Older Adult (51+)")),
    Education_label = factor(Education, levels = c(1, 2, 3, 5),
                             labels = c("Some High School", "HS Diploma", "Some College", "Graduate Degree"))
  )

# Preview the data
kable(head(df, 6), caption = "First 6 Rows of the Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

First 6 Rows of the Dataset
ID	CS_helpful	Recommend	Come_again	All_Products	Profesionalism	Limitation	Online_grocery	delivery	Pick_up	Find_items	other_shops	Gender	Age	Education	Gender_label	Age_label	Education_label
1	2	2	2	2	2	2	2	3	4	1	2	1	2	2	Male	Young Adult (18-30)	HS Diploma
2	1	2	1	1	1	1	2	3	3	1	2	1	2	2	Male	Young Adult (18-30)	HS Diploma
3	2	1	1	1	1	2	3	3	2	1	3	1	2	2	Male	Young Adult (18-30)	HS Diploma
4	3	3	2	4	1	2	3	3	2	2	2	1	3	5	Male	Middle-Aged (31-50)	Graduate Degree
5	2	1	3	5	2	1	2	3	1	2	3	2	4	2	Female	Older Adult (51+)	HS Diploma
6	1	1	3	2	1	1	1	2	1	1	4	1	2	5	Male	Young Adult (18-30)	Graduate Degree

3 Exploratory Data Analysis

3.1 Summary Statistics

# Select only the survey/satisfaction columns
survey_cols <- c("CS_helpful", "Recommend", "Come_again", "All_Products",
                 "Profesionalism", "Limitation", "Online_grocery",
                 "delivery", "Pick_up", "Find_items", "other_shops")

summary_table <- df %>%
  select(all_of(survey_cols)) %>%
  summarise(across(everything(), list(
    Mean   = ~round(mean(.), 2),
    Median = ~median(.),
    SD     = ~round(sd(.), 2),
    Min    = ~min(.),
    Max    = ~max(.)
  ))) %>%
  pivot_longer(everything(), names_to = c("Variable", "Stat"), names_sep = "_(?=[^_]+$)") %>%
  pivot_wider(names_from = Stat, values_from = value)

kable(summary_table, caption = "Descriptive Statistics for Survey Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Descriptive Statistics for Survey Variables
Variable	Mean	Median	SD	Min	Max
CS_helpful	1.59	1	0.73	1	3
Recommend	1.32	1	0.65	1	3
Come_again	1.45	1	0.74	1	3
All_Products	2.09	2	1.06	1	5
Profesionalism	1.41	1	0.59	1	3
Limitation	1.50	1	0.80	1	4
Online_grocery	2.27	2	0.77	1	3
delivery	2.41	3	0.73	1	3
Pick_up	2.45	2	1.06	1	5
Find_items	1.45	1	0.67	1	3
other_shops	2.59	2	1.40	1	5

Interpretation: The mean scores for CS_helpful, Recommend, and Come_again are close to 1–2, suggesting most customers rate the store positively on these dimensions (lower scores = more favorable on the 1–3 scale used here). other_shops has higher variance, indicating customers differ widely in how often they shop elsewhere.

3.2 Frequency Tables for Demographic Variables

# Gender distribution
gender_freq <- df %>%
  count(Gender_label) %>%
  mutate(Percent = round(n / sum(n) * 100, 1))

kable(gender_freq, col.names = c("Gender", "Count", "Percent (%)"),
      caption = "Gender Distribution") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Gender Distribution
Gender	Count	Percent (%)
Male	16	72.7
Female	6	27.3

# Age distribution
age_freq <- df %>%
  count(Age_label) %>%
  mutate(Percent = round(n / sum(n) * 100, 1))

kable(age_freq, col.names = c("Age Group", "Count", "Percent (%)"),
      caption = "Age Group Distribution") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Age Group Distribution
Age Group	Count	Percent (%)
Young Adult (18-30)	15	68.2
Middle-Aged (31-50)	4	18.2
Older Adult (51+)	3	13.6

# Education distribution
edu_freq <- df %>%
  count(Education_label) %>%
  mutate(Percent = round(n / sum(n) * 100, 1))

kable(edu_freq, col.names = c("Education Level", "Count", "Percent (%)"),
      caption = "Education Level Distribution") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Education Level Distribution
Education Level	Count	Percent (%)
Some High School	3	13.6
HS Diploma	8	36.4
Some College	2	9.1
Graduate Degree	9	40.9

3.3 Distribution of Key Satisfaction Items

# Reshape satisfaction columns to long format for faceted bar chart
df_long <- df %>%
  select(all_of(survey_cols)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Rating")

ggplot(df_long, aes(x = factor(Rating), fill = Variable)) +
  geom_bar(show.legend = FALSE) +
  facet_wrap(~Variable, scales = "free_x") +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Distribution of Survey Responses by Item",
    x = "Rating", y = "Count"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title = element_text(face = "bold", size = 13),
    strip.text = element_text(face = "bold")
  )

Interpretation: Most survey items show a left-skewed distribution (ratings clustered at 1–2), reflecting generally positive sentiment. other_shops and Pick_up show more spread, suggesting greater customer disagreement on these dimensions.

4 Demographic Breakdowns

4.1 Satisfaction by Gender

gender_summary <- df %>%
  group_by(Gender_label) %>%
  summarise(
    CS_helpful    = mean(CS_helpful),
    Recommend     = mean(Recommend),
    Come_again    = mean(Come_again),
    All_Products  = mean(All_Products),
    Find_items    = mean(Find_items)
  ) %>%
  pivot_longer(-Gender_label, names_to = "Item", values_to = "Mean_Score")

ggplot(gender_summary, aes(x = Item, y = Mean_Score, fill = Gender_label)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Male" = "#2c7bb6", "Female" = "#d7191c")) +
  labs(
    title = "Mean Satisfaction Scores by Gender",
    x = "Survey Item", y = "Mean Score (lower = more positive)",
    fill = "Gender"
  ) +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1),
        plot.title = element_text(face = "bold"))

4.2 Satisfaction by Age Group

age_summary <- df %>%
  group_by(Age_label) %>%
  summarise(across(all_of(c("CS_helpful","Recommend","Come_again","delivery","Pick_up")), mean)) %>%
  pivot_longer(-Age_label, names_to = "Item", values_to = "Mean_Score")

ggplot(age_summary, aes(x = Age_label, y = Mean_Score, fill = Item)) +
  geom_col(position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  labs(
    title = "Mean Satisfaction Scores by Age Group",
    x = "Age Group", y = "Mean Score", fill = "Item"
  ) +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 15, hjust = 1),
        plot.title = element_text(face = "bold"))

Interpretation: Younger adults tend to rate delivery and online grocery options more favorably, which may reflect greater comfort with technology-enabled shopping. Middle-aged customers show slightly higher loyalty scores (Come_again, Recommend).

5 Correlation Analysis

# Compute correlation matrix on survey items
cor_matrix <- df %>%
  select(all_of(survey_cols)) %>%
  cor(use = "complete.obs")

corrplot(cor_matrix,
         method    = "color",
         type      = "upper",
         tl.col    = "black",
         tl.srt    = 45,
         addCoef.col = "black",
         number.cex  = 0.7,
         col       = colorRampPalette(c("#d73027", "white", "#4575b4"))(200),
         title     = "Correlation Matrix of Survey Items",
         mar       = c(0, 0, 2, 0))

Interpretation: Strong positive correlations (dark blue) appear between CS_helpful, Recommend, and Come_again, suggesting that customers who find staff helpful are more likely to return and recommend the store. Limitation and other_shops correlate positively — customers who perceive more limitations shop at competitors more often.

6 Customer Segmentation (K-Means Clustering)

6.1 Determine Optimal Number of Clusters (Elbow Method)

# Use only survey/preference columns for clustering; scale them
cluster_data <- df %>%
  select(all_of(survey_cols)) %>%
  scale()

set.seed(123)
fviz_nbclust(cluster_data, kmeans, method = "wss") +
  labs(title = "Elbow Method — Optimal Number of Clusters") +
  theme_minimal()

6.2 Fit K-Means Model

set.seed(123)
k <- 3  # Adjust based on elbow plot

km_result <- kmeans(cluster_data, centers = k, nstart = 25)

# Append cluster labels to original data
df$Cluster <- factor(km_result$cluster, labels = paste("Segment", 1:k))

cat("Cluster sizes:\n")

## Cluster sizes:

print(table(df$Cluster))

## 
## Segment 1 Segment 2 Segment 3 
##         4        10         8

6.3 Visualize Clusters

fviz_cluster(km_result, data = cluster_data,
             palette = c("#2c7bb6", "#d7191c", "#1a9641"),
             ellipse.type = "convex",
             ggtheme = theme_minimal(),
             main = "Customer Segments (K-Means, k=3)")

6.4 Cluster Profiles

cluster_profiles <- df %>%
  group_by(Cluster) %>%
  summarise(
    n             = n(),
    CS_helpful    = round(mean(CS_helpful), 2),
    Recommend     = round(mean(Recommend), 2),
    Come_again    = round(mean(Come_again), 2),
    All_Products  = round(mean(All_Products), 2),
    delivery      = round(mean(delivery), 2),
    other_shops   = round(mean(other_shops), 2),
    Pct_Female    = round(mean(Gender == 2) * 100, 1)
  )

kable(cluster_profiles, caption = "Mean Scores by Customer Segment") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Mean Scores by Customer Segment
Cluster	n	CS_helpful	Recommend	Come_again	All_Products	delivery	other_shops	Pct_Female
Segment 1	4	2.50	2.00	2.50	3.25	3.0	2.75	75
Segment 2	10	1.10	1.00	1.30	2.00	1.7	2.80	30
Segment 3	8	1.75	1.38	1.12	1.62	3.0	2.25	0

Interpretation: - Segment 1 (Loyal Advocates): High satisfaction, strong likelihood to recommend and return, low competitor shopping. - Segment 2 (Neutral Shoppers): Moderate ratings across the board; delivery preferences vary. - Segment 3 (At-Risk Customers): Lower satisfaction scores and higher other_shops scores, indicating they frequently visit competitors.

7 Visualizations

7.1 Composite Satisfaction Score

# Create a composite satisfaction score (lower = more satisfied, so we reverse)
df <- df %>%
  mutate(
    Composite_Score = rowMeans(select(., CS_helpful, Recommend, Come_again,
                                      All_Products, Profesionalism, Find_items))
  )

ggplot(df, aes(x = Cluster, y = Composite_Score, fill = Cluster)) +
  geom_boxplot(alpha = 0.8, outlier.shape = 21) +
  scale_fill_manual(values = c("#2c7bb6", "#d7191c", "#1a9641")) +
  labs(
    title    = "Composite Satisfaction Score by Customer Segment",
    subtitle = "Lower score = higher satisfaction",
    x = "Customer Segment", y = "Composite Score"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

7.2 Delivery vs. Pick-Up Preference by Segment

df %>%
  select(Cluster, delivery, Pick_up) %>%
  pivot_longer(-Cluster, names_to = "Mode", values_to = "Score") %>%
  ggplot(aes(x = Cluster, y = Score, fill = Mode)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("delivery" = "#fdae61", "Pick_up" = "#74add1"),
                    labels = c("Delivery", "Pick-Up")) +
  labs(
    title = "Delivery vs. Pick-Up Preference by Customer Segment",
    x = "Segment", y = "Mean Score", fill = "Mode"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

8 Conclusions & Recommendations

8.1 Summary of Findings

Overall satisfaction is high — most customers rate the store positively on helpfulness, professionalism, and likelihood to return.
Three distinct customer segments were identified:
- Loyal Advocates value in-store experience and staff.
- Neutral Shoppers are satisfied but not strongly engaged.
- At-Risk Customers perceive product limitations and shop at competitors frequently.
Product variety (All_Products) and ease of finding items (Find_items) show the most room for improvement.
Younger adults prefer delivery and online options more than older groups.

8.2 Recommendations

Priority	Recommendation	Target Segment
High	Expand product variety to reduce perceived limitations	At-Risk Customers
High	Improve store layout and item signage	All Segments
Medium	Invest in delivery infrastructure & app UX	Young Adults
Medium	Loyalty program to convert Neutral Shoppers	Segment 2
Low	Staff professionalism training (already strong)	All Segments

8.3 Limitations

Sample size is small (n = 22); findings may not generalize.
Survey scales vary across items; normalization was applied for clustering.
Cross-sectional data — cannot infer causation.

Report generated with R Markdown. Data: customer_segmentation.csv.

Grocery Store Customer Segmentation Analysis

Student Research Group

2026-03-25