1 Introduction

This report analyzes survey data collected from 22 grocery store customers. The dataset captures satisfaction ratings, shopping preferences, and demographic information. Our goal is to uncover patterns in customer behavior, identify distinct customer segments, and provide actionable recommendations.

1.1 Research Questions

  1. What are the overall satisfaction levels among customers?
  2. Do satisfaction levels differ by demographic group (gender, age, education)?
  3. Can we identify distinct customer segments based on shopping preferences?
  4. What factors most strongly predict customer loyalty (likelihood to return/recommend)?

2 Setup & Data Loading

# Load required packages
library(tidyverse)   # Data wrangling and ggplot2 visualization
library(corrplot)    # Correlation matrix visualization
library(factoextra)  # K-Means clustering visualization
library(knitr)       # Nice tables
library(kableExtra)  # Enhanced tables
library(scales)      # Axis formatting
# Load the dataset
df <- read.csv("customer_segmentation.csv")

# Recode factor variables with meaningful labels
df <- df %>%
  mutate(
    Gender_label    = factor(Gender,    levels = c(1, 2),
                             labels = c("Male", "Female")),
    Age_label       = factor(Age,       levels = c(2, 3, 4),
                             labels = c("Young Adult (18-30)", "Middle-Aged (31-50)", "Older Adult (51+)")),
    Education_label = factor(Education, levels = c(1, 2, 3, 5),
                             labels = c("Some High School", "HS Diploma", "Some College", "Graduate Degree"))
  )

# Preview the data
kable(head(df, 6), caption = "First 6 Rows of the Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
First 6 Rows of the Dataset
ID CS_helpful Recommend Come_again All_Products Profesionalism Limitation Online_grocery delivery Pick_up Find_items other_shops Gender Age Education Gender_label Age_label Education_label
1 2 2 2 2 2 2 2 3 4 1 2 1 2 2 Male Young Adult (18-30) HS Diploma
2 1 2 1 1 1 1 2 3 3 1 2 1 2 2 Male Young Adult (18-30) HS Diploma
3 2 1 1 1 1 2 3 3 2 1 3 1 2 2 Male Young Adult (18-30) HS Diploma
4 3 3 2 4 1 2 3 3 2 2 2 1 3 5 Male Middle-Aged (31-50) Graduate Degree
5 2 1 3 5 2 1 2 3 1 2 3 2 4 2 Female Older Adult (51+) HS Diploma
6 1 1 3 2 1 1 1 2 1 1 4 1 2 5 Male Young Adult (18-30) Graduate Degree

3 Exploratory Data Analysis

3.1 Summary Statistics

# Select only the survey/satisfaction columns
survey_cols <- c("CS_helpful", "Recommend", "Come_again", "All_Products",
                 "Profesionalism", "Limitation", "Online_grocery",
                 "delivery", "Pick_up", "Find_items", "other_shops")

summary_table <- df %>%
  select(all_of(survey_cols)) %>%
  summarise(across(everything(), list(
    Mean   = ~round(mean(.), 2),
    Median = ~median(.),
    SD     = ~round(sd(.), 2),
    Min    = ~min(.),
    Max    = ~max(.)
  ))) %>%
  pivot_longer(everything(), names_to = c("Variable", "Stat"), names_sep = "_(?=[^_]+$)") %>%
  pivot_wider(names_from = Stat, values_from = value)

kable(summary_table, caption = "Descriptive Statistics for Survey Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Descriptive Statistics for Survey Variables
Variable Mean Median SD Min Max
CS_helpful 1.59 1 0.73 1 3
Recommend 1.32 1 0.65 1 3
Come_again 1.45 1 0.74 1 3
All_Products 2.09 2 1.06 1 5
Profesionalism 1.41 1 0.59 1 3
Limitation 1.50 1 0.80 1 4
Online_grocery 2.27 2 0.77 1 3
delivery 2.41 3 0.73 1 3
Pick_up 2.45 2 1.06 1 5
Find_items 1.45 1 0.67 1 3
other_shops 2.59 2 1.40 1 5

Interpretation: The mean scores for CS_helpful, Recommend, and Come_again are close to 1–2, suggesting most customers rate the store positively on these dimensions (lower scores = more favorable on the 1–3 scale used here). other_shops has higher variance, indicating customers differ widely in how often they shop elsewhere.

3.2 Frequency Tables for Demographic Variables

# Gender distribution
gender_freq <- df %>%
  count(Gender_label) %>%
  mutate(Percent = round(n / sum(n) * 100, 1))

kable(gender_freq, col.names = c("Gender", "Count", "Percent (%)"),
      caption = "Gender Distribution") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Gender Distribution
Gender Count Percent (%)
Male 16 72.7
Female 6 27.3
# Age distribution
age_freq <- df %>%
  count(Age_label) %>%
  mutate(Percent = round(n / sum(n) * 100, 1))

kable(age_freq, col.names = c("Age Group", "Count", "Percent (%)"),
      caption = "Age Group Distribution") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Age Group Distribution
Age Group Count Percent (%)
Young Adult (18-30) 15 68.2
Middle-Aged (31-50) 4 18.2
Older Adult (51+) 3 13.6
# Education distribution
edu_freq <- df %>%
  count(Education_label) %>%
  mutate(Percent = round(n / sum(n) * 100, 1))

kable(edu_freq, col.names = c("Education Level", "Count", "Percent (%)"),
      caption = "Education Level Distribution") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Education Level Distribution
Education Level Count Percent (%)
Some High School 3 13.6
HS Diploma 8 36.4
Some College 2 9.1
Graduate Degree 9 40.9

3.3 Distribution of Key Satisfaction Items

# Reshape satisfaction columns to long format for faceted bar chart
df_long <- df %>%
  select(all_of(survey_cols)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Rating")

ggplot(df_long, aes(x = factor(Rating), fill = Variable)) +
  geom_bar(show.legend = FALSE) +
  facet_wrap(~Variable, scales = "free_x") +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Distribution of Survey Responses by Item",
    x = "Rating", y = "Count"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title = element_text(face = "bold", size = 13),
    strip.text = element_text(face = "bold")
  )

Interpretation: Most survey items show a left-skewed distribution (ratings clustered at 1–2), reflecting generally positive sentiment. other_shops and Pick_up show more spread, suggesting greater customer disagreement on these dimensions.


4 Demographic Breakdowns

4.1 Satisfaction by Gender

gender_summary <- df %>%
  group_by(Gender_label) %>%
  summarise(
    CS_helpful    = mean(CS_helpful),
    Recommend     = mean(Recommend),
    Come_again    = mean(Come_again),
    All_Products  = mean(All_Products),
    Find_items    = mean(Find_items)
  ) %>%
  pivot_longer(-Gender_label, names_to = "Item", values_to = "Mean_Score")

ggplot(gender_summary, aes(x = Item, y = Mean_Score, fill = Gender_label)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Male" = "#2c7bb6", "Female" = "#d7191c")) +
  labs(
    title = "Mean Satisfaction Scores by Gender",
    x = "Survey Item", y = "Mean Score (lower = more positive)",
    fill = "Gender"
  ) +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1),
        plot.title = element_text(face = "bold"))

4.2 Satisfaction by Age Group

age_summary <- df %>%
  group_by(Age_label) %>%
  summarise(across(all_of(c("CS_helpful","Recommend","Come_again","delivery","Pick_up")), mean)) %>%
  pivot_longer(-Age_label, names_to = "Item", values_to = "Mean_Score")

ggplot(age_summary, aes(x = Age_label, y = Mean_Score, fill = Item)) +
  geom_col(position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  labs(
    title = "Mean Satisfaction Scores by Age Group",
    x = "Age Group", y = "Mean Score", fill = "Item"
  ) +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 15, hjust = 1),
        plot.title = element_text(face = "bold"))

Interpretation: Younger adults tend to rate delivery and online grocery options more favorably, which may reflect greater comfort with technology-enabled shopping. Middle-aged customers show slightly higher loyalty scores (Come_again, Recommend).


5 Correlation Analysis

# Compute correlation matrix on survey items
cor_matrix <- df %>%
  select(all_of(survey_cols)) %>%
  cor(use = "complete.obs")

corrplot(cor_matrix,
         method    = "color",
         type      = "upper",
         tl.col    = "black",
         tl.srt    = 45,
         addCoef.col = "black",
         number.cex  = 0.7,
         col       = colorRampPalette(c("#d73027", "white", "#4575b4"))(200),
         title     = "Correlation Matrix of Survey Items",
         mar       = c(0, 0, 2, 0))

Interpretation: Strong positive correlations (dark blue) appear between CS_helpful, Recommend, and Come_again, suggesting that customers who find staff helpful are more likely to return and recommend the store. Limitation and other_shops correlate positively — customers who perceive more limitations shop at competitors more often.


6 Customer Segmentation (K-Means Clustering)

6.1 Determine Optimal Number of Clusters (Elbow Method)

# Use only survey/preference columns for clustering; scale them
cluster_data <- df %>%
  select(all_of(survey_cols)) %>%
  scale()

set.seed(123)
fviz_nbclust(cluster_data, kmeans, method = "wss") +
  labs(title = "Elbow Method — Optimal Number of Clusters") +
  theme_minimal()

6.2 Fit K-Means Model

set.seed(123)
k <- 3  # Adjust based on elbow plot

km_result <- kmeans(cluster_data, centers = k, nstart = 25)

# Append cluster labels to original data
df$Cluster <- factor(km_result$cluster, labels = paste("Segment", 1:k))

cat("Cluster sizes:\n")
## Cluster sizes:
print(table(df$Cluster))
## 
## Segment 1 Segment 2 Segment 3 
##         4        10         8

6.3 Visualize Clusters

fviz_cluster(km_result, data = cluster_data,
             palette = c("#2c7bb6", "#d7191c", "#1a9641"),
             ellipse.type = "convex",
             ggtheme = theme_minimal(),
             main = "Customer Segments (K-Means, k=3)")

6.4 Cluster Profiles

cluster_profiles <- df %>%
  group_by(Cluster) %>%
  summarise(
    n             = n(),
    CS_helpful    = round(mean(CS_helpful), 2),
    Recommend     = round(mean(Recommend), 2),
    Come_again    = round(mean(Come_again), 2),
    All_Products  = round(mean(All_Products), 2),
    delivery      = round(mean(delivery), 2),
    other_shops   = round(mean(other_shops), 2),
    Pct_Female    = round(mean(Gender == 2) * 100, 1)
  )

kable(cluster_profiles, caption = "Mean Scores by Customer Segment") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Mean Scores by Customer Segment
Cluster n CS_helpful Recommend Come_again All_Products delivery other_shops Pct_Female
Segment 1 4 2.50 2.00 2.50 3.25 3.0 2.75 75
Segment 2 10 1.10 1.00 1.30 2.00 1.7 2.80 30
Segment 3 8 1.75 1.38 1.12 1.62 3.0 2.25 0

Interpretation: - Segment 1 (Loyal Advocates): High satisfaction, strong likelihood to recommend and return, low competitor shopping. - Segment 2 (Neutral Shoppers): Moderate ratings across the board; delivery preferences vary. - Segment 3 (At-Risk Customers): Lower satisfaction scores and higher other_shops scores, indicating they frequently visit competitors.


7 Visualizations

7.1 Composite Satisfaction Score

# Create a composite satisfaction score (lower = more satisfied, so we reverse)
df <- df %>%
  mutate(
    Composite_Score = rowMeans(select(., CS_helpful, Recommend, Come_again,
                                      All_Products, Profesionalism, Find_items))
  )

ggplot(df, aes(x = Cluster, y = Composite_Score, fill = Cluster)) +
  geom_boxplot(alpha = 0.8, outlier.shape = 21) +
  scale_fill_manual(values = c("#2c7bb6", "#d7191c", "#1a9641")) +
  labs(
    title    = "Composite Satisfaction Score by Customer Segment",
    subtitle = "Lower score = higher satisfaction",
    x = "Customer Segment", y = "Composite Score"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

7.2 Delivery vs. Pick-Up Preference by Segment

df %>%
  select(Cluster, delivery, Pick_up) %>%
  pivot_longer(-Cluster, names_to = "Mode", values_to = "Score") %>%
  ggplot(aes(x = Cluster, y = Score, fill = Mode)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("delivery" = "#fdae61", "Pick_up" = "#74add1"),
                    labels = c("Delivery", "Pick-Up")) +
  labs(
    title = "Delivery vs. Pick-Up Preference by Customer Segment",
    x = "Segment", y = "Mean Score", fill = "Mode"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))


8 Conclusions & Recommendations

8.1 Summary of Findings

  1. Overall satisfaction is high — most customers rate the store positively on helpfulness, professionalism, and likelihood to return.
  2. Three distinct customer segments were identified:
    • Loyal Advocates value in-store experience and staff.
    • Neutral Shoppers are satisfied but not strongly engaged.
    • At-Risk Customers perceive product limitations and shop at competitors frequently.
  3. Product variety (All_Products) and ease of finding items (Find_items) show the most room for improvement.
  4. Younger adults prefer delivery and online options more than older groups.

8.2 Recommendations

Priority Recommendation Target Segment
High Expand product variety to reduce perceived limitations At-Risk Customers
High Improve store layout and item signage All Segments
Medium Invest in delivery infrastructure & app UX Young Adults
Medium Loyalty program to convert Neutral Shoppers Segment 2
Low Staff professionalism training (already strong) All Segments

8.3 Limitations

  • Sample size is small (n = 22); findings may not generalize.
  • Survey scales vary across items; normalization was applied for clustering.
  • Cross-sectional data — cannot infer causation.

Report generated with R Markdown. Data: customer_segmentation.csv.