1. Introduction

This dataset provides insight into customer satisfaction and shopping behavior, highlighting how individuals interact with grocery services and their likelihood to return or recommend the business. By examining variables such as customer service helpfulness, recommendation ratings, and shopping preferences like delivery or pick-up, clear patterns begin to emerge. For example, higher satisfaction levels tend to align with a greater willingness to recommend the service and return in the future, suggesting that customer experience plays a major role in retention. The dataset also reveals differences in how customers prefer to shop, which can help businesses tailor their services to meet demand more effectively.

In addition, the demographic variables included in the dataset, such as gender, age, and education, offer a deeper understanding of the customer base. These factors can influence shopping habits and satisfaction levels, allowing for more targeted marketing and service improvements. Overall, the dataset shows that both customer experience and convenience are key drivers of behavior. Businesses can use this information to improve customer satisfaction, increase loyalty, and better align their services with the needs and preferences of their customers.

Survey variables include:

Variable Description
CS_helpful Customer service helpfulness rating
Recommend Likelihood to recommend the store
Come_again Likelihood to return
All_Products Satisfaction with product availability
Profesionalism Staff professionalism rating
Limitation Perceived store limitations
Online_grocery Online grocery experience rating
delivery Delivery service satisfaction
Pick_up Pick-up service satisfaction
Find_items Ease of finding items
other_shops Frequency of shopping at other stores
Gender Gender (1 = Male, 2 = Female)
Age Age group (1 = Under 18, 2 = 18–34, 3 = 35–54, 4 = 55+)
Education Education level (1 = No formal, 2 = High school, 3 = Some college, 4 = Bachelor’s, 5 = Graduate)

All satisfaction/rating variables use a 1–5 Likert scale (1 = Strongly Agree / Very Satisfied, 5 = Strongly Disagree / Very Unsatisfied), unless otherwise noted.


2. Setup & Data Loading

# Install any missing packages automatically
required_packages <- c("tidyverse", "psych", "corrplot", "ggplot2",
"scales", "knitr", "reshape2", "cluster",
"factoextra", "RColorBrewer")

for (pkg in required_packages) {
if (!requireNamespace(pkg, quietly = TRUE)) {
install.packages(pkg, repos = "https://cloud.r-project.org")
}
}

library(tidyverse)
library(psych)
library(corrplot)
library(ggplot2)
library(scales)
library(knitr)
library(reshape2)
library(cluster)
library(factoextra)
library(RColorBrewer)
# ── Load dataset ──────────────────────────────────────────────────────────────
# Update this path to wherever you saved the CSV
df_raw <- read.csv("customer_segmentation.csv", stringsAsFactors = FALSE)

# Preview
glimpse(df_raw)
## Rows: 22
## Columns: 15
## $ ID             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ CS_helpful     <int> 2, 1, 2, 3, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3…
## $ Recommend      <int> 2, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Come_again     <int> 2, 1, 1, 2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2…
## $ All_Products   <int> 2, 1, 1, 4, 5, 2, 2, 2, 2, 1, 2, 2, 1, 2, 4, 2, 2, 1, 3…
## $ Profesionalism <int> 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ Limitation     <int> 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 4…
## $ Online_grocery <int> 2, 2, 3, 3, 2, 1, 2, 1, 2, 3, 2, 3, 1, 3, 2, 3, 2, 3, 1…
## $ delivery       <int> 3, 3, 3, 3, 3, 2, 2, 1, 1, 2, 2, 2, 2, 3, 2, 1, 3, 3, 3…
## $ Pick_up        <int> 4, 3, 2, 2, 1, 1, 2, 2, 3, 2, 2, 3, 2, 3, 2, 3, 5, 3, 1…
## $ Find_items     <int> 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 3…
## $ other_shops    <int> 2, 2, 3, 2, 3, 4, 1, 4, 1, 1, 3, 3, 1, 1, 5, 5, 5, 2, 2…
## $ Gender         <int> 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2…
## $ Age            <int> 2, 2, 2, 3, 4, 2, 2, 2, 2, 2, 4, 3, 4, 3, 2, 3, 2, 2, 2…
## $ Education      <int> 2, 2, 2, 5, 2, 5, 3, 2, 1, 2, 5, 1, 5, 5, 5, 5, 1, 5, 2…

3. Data Cleaning & Preparation

df <- df_raw %>%
# Drop the ID column — not analytically useful
select(-ID) %>%
# Recode numeric codes into labeled factors for demographics
mutate(
Gender_f = factor(Gender,
levels = 1:2,
labels = c("Male", "Female")),
Age_f = factor(Age,
levels = 1:4,
labels = c("Under 18", "18–34", "35–54", "55+")),
Education_f = factor(Education,
levels = 1:5,
labels = c("No formal education",
"High school",
"Some college",
"Bachelor's degree",
"Graduate degree"))
)

# Satisfaction / rating columns (numeric, used for analysis)
rating_vars <- c("CS_helpful", "Recommend", "Come_again",
"All_Products", "Profesionalism", "Limitation",
"Online_grocery", "delivery", "Pick_up",
"Find_items", "other_shops")

df_ratings <- df %>% select(all_of(rating_vars))

cat("Dataset dimensions:", nrow(df), "rows ×", ncol(df), "columns\n")
## Dataset dimensions: 22 rows × 17 columns
cat("Missing values per column:\n")
## Missing values per column:
print(colSums(is.na(df)))
##     CS_helpful      Recommend     Come_again   All_Products Profesionalism 
##              0              0              0              0              0 
##     Limitation Online_grocery       delivery        Pick_up     Find_items 
##              0              0              0              0              0 
##    other_shops         Gender            Age      Education       Gender_f 
##              0              0              0              0              0 
##          Age_f    Education_f 
##              0              0

4. Descriptive Statistics

4.1 Summary Statistics

desc_stats <- describe(df_ratings) %>%
as.data.frame() %>%
select(n, mean, sd, median, min, max, skew, kurtosis) %>%
mutate(across(where(is.numeric), ~ round(.x, 2)))

kable(desc_stats,
caption = "Descriptive Statistics – Rating Variables",
col.names = c("N", "Mean", "SD", "Median", "Min", "Max", "Skewness", "Kurtosis"))
Descriptive Statistics – Rating Variables
N Mean SD Median Min Max Skewness Kurtosis
CS_helpful 22 1.59 0.73 1 1 3 0.73 -0.89
Recommend 22 1.32 0.65 1 1 3 1.67 1.38
Come_again 22 1.45 0.74 1 1 3 1.16 -0.23
All_Products 22 2.09 1.06 2 1 5 1.18 0.79
Profesionalism 22 1.41 0.59 1 1 3 1.00 -0.14
Limitation 22 1.50 0.80 1 1 4 1.59 1.99
Online_grocery 22 2.27 0.77 2 1 3 -0.46 -1.25
delivery 22 2.41 0.73 3 1 3 -0.73 -0.89
Pick_up 22 2.45 1.06 2 1 5 0.46 -0.37
Find_items 22 1.45 0.67 1 1 3 1.06 -0.19
other_shops 22 2.59 1.40 2 1 5 0.42 -1.21

4.2 Demographic Profile

# Gender distribution
p1 <- df %>%
count(Gender_f) %>%
mutate(pct = n / sum(n)) %>%
ggplot(aes(x = Gender_f, y = pct, fill = Gender_f)) +
geom_col(width = 0.5, show.legend = FALSE) +
geom_text(aes(label = percent(pct, 1)), vjust = -0.4, size = 4.5) +
scale_y_continuous(labels = percent_format(), limits = c(0, 1)) +
scale_fill_brewer(palette = "Set2") +
labs(title = "Gender Distribution", x = NULL, y = "Proportion") +
theme_minimal(base_size = 13)

# Age distribution
p2 <- df %>%
count(Age_f) %>%
mutate(pct = n / sum(n)) %>%
ggplot(aes(x = Age_f, y = pct, fill = Age_f)) +
geom_col(width = 0.6, show.legend = FALSE) +
geom_text(aes(label = percent(pct, 1)), vjust = -0.4, size = 4) +
scale_y_continuous(labels = percent_format(), limits = c(0, 1)) +
scale_fill_brewer(palette = "Set3") +
labs(title = "Age Group Distribution", x = NULL, y = "Proportion") +
theme_minimal(base_size = 13)

# Education distribution
p3 <- df %>%
count(Education_f) %>%
mutate(pct = n / sum(n)) %>%
ggplot(aes(x = reorder(Education_f, pct), y = pct, fill = Education_f)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = percent(pct, 1)), hjust = -0.15, size = 3.8) +
scale_y_continuous(labels = percent_format(), limits = c(0, 0.8)) +
scale_fill_brewer(palette = "Pastel1") +
coord_flip() +
labs(title = "Education Level Distribution", x = NULL, y = "Proportion") +
theme_minimal(base_size = 12)

print(p1)

print(p2)

print(p3)


5. Satisfaction Ratings Analysis

5.1 Average Ratings per Category

avg_ratings <- df_ratings %>%
summarise(across(everything(), mean)) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Mean") %>%
arrange(Mean)

ggplot(avg_ratings, aes(x = reorder(Variable, Mean), y = Mean, fill = Mean)) +
geom_col() +
geom_text(aes(label = round(Mean, 2)), hjust = -0.15, size = 4) +
scale_fill_gradient(low = "#2ecc71", high = "#e74c3c",
limits = c(1, 5), name = "Score") +
scale_y_continuous(limits = c(0, 5.5)) +
coord_flip() +
labs(title = "Average Satisfaction Score by Category",
subtitle = "Scale: 1 (Strongly Agree / Very Satisfied) → 5 (Strongly Disagree)",
x = NULL, y = "Mean Score") +
theme_minimal(base_size = 13)

5.2 Rating Distributions (Boxplots)

df_ratings_long <- df_ratings %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Score")

ggplot(df_ratings_long, aes(x = reorder(Variable, Score, FUN = median),
y = Score, fill = Variable)) +
geom_boxplot(show.legend = FALSE, alpha = 0.8, outlier.shape = 21) +
stat_summary(fun = mean, geom = "point", shape = 23,
size = 3, fill = "white", color = "black") +
scale_fill_manual(values = colorRampPalette(brewer.pal(9, "Set1"))(11)) +
scale_y_continuous(breaks = 1:5) +
coord_flip() +
labs(title = "Distribution of Satisfaction Scores",
subtitle = "Diamond = mean; box = IQR; whiskers = full range",
x = NULL, y = "Score (1–5)") +
theme_minimal(base_size = 13)

5.3 Frequency Heatmap

freq_table <- df_ratings_long %>%
count(Variable, Score) %>%
group_by(Variable) %>%
mutate(Proportion = n / sum(n)) %>%
ungroup()

ggplot(freq_table, aes(x = factor(Score), y = Variable, fill = Proportion)) +
geom_tile(color = "white", linewidth = 0.5) +
geom_text(aes(label = percent(Proportion, 1)), size = 3.5) +
scale_fill_gradient(low = "#ffffff", high = "#2c7bb6",
labels = percent_format(), name = "Proportion") +
labs(title = "Response Frequency Heatmap",
subtitle = "Proportion of respondents choosing each score level",
x = "Score", y = NULL) +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(size = 11))


6. Correlation Analysis

6.1 Correlation Matrix

cor_matrix <- cor(df_ratings, use = "complete.obs")

corrplot(cor_matrix,
method = "color",
type = "upper",
order = "hclust",
addCoef.col = "black",
number.cex = 0.75,
tl.cex = 0.85,
tl.col = "black",
col = colorRampPalette(c("#d73027","#f7f7f7","#1a9850"))(200),
title = "Pearson Correlation Matrix – Rating Variables",
mar = c(0, 0, 2, 0))

6.2 Top Correlating Pairs

cor_df <- as.data.frame(as.table(cor_matrix)) %>%
rename(Var1 = Var1, Var2 = Var2, Correlation = Freq) %>%
filter(Var1 != Var2) %>%
mutate(pair = pmap_chr(list(Var1, Var2), ~ paste(sort(c(..1, ..2)), collapse = " – "))) %>%
distinct(pair, .keep_all = TRUE) %>%
arrange(desc(abs(Correlation))) %>%
slice_head(n = 10) %>%
select(pair, Correlation) %>%
mutate(Correlation = round(Correlation, 3))

kable(cor_df, caption = "Top 10 Strongest Correlations (by |r|)",
col.names = c("Variable Pair", "r"))
Top 10 Strongest Correlations (by |r|)
Variable Pair r
CS_helpful – Limitation 0.607
CS_helpful – delivery 0.590
All_Products – Find_items 0.539
Come_again – Pick_up -0.521
CS_helpful – Profesionalism 0.514
CS_helpful – Recommend 0.488
Limitation – Find_items 0.443
Come_again – Profesionalism 0.427
Recommend – delivery 0.415
Recommend – Profesionalism 0.391

7. Demographic Subgroup Comparisons

7.1 Mean Scores by Gender

df_with_demos <- bind_cols(df_ratings, df %>% select(Gender_f, Age_f, Education_f))

gender_means <- df_with_demos %>%
group_by(Gender_f) %>%
summarise(across(all_of(rating_vars), mean, .names = "{.col}")) %>%
pivot_longer(-Gender_f, names_to = "Variable", values_to = "Mean")

ggplot(gender_means, aes(x = Variable, y = Mean, fill = Gender_f)) +
geom_col(position = "dodge", alpha = 0.85) +
scale_fill_brewer(palette = "Set2", name = "Gender") +
scale_y_continuous(limits = c(0, 5)) +
coord_flip() +
labs(title = "Average Satisfaction Scores by Gender",
x = NULL, y = "Mean Score (1–5)") +
theme_minimal(base_size = 12)

7.2 Mean Scores by Age Group

age_means <- df_with_demos %>%
group_by(Age_f) %>%
summarise(across(all_of(rating_vars), mean, .names = "{.col}")) %>%
pivot_longer(-Age_f, names_to = "Variable", values_to = "Mean")

ggplot(age_means, aes(x = Variable, y = Mean, color = Age_f, group = Age_f)) +
geom_line(linewidth = 1) +
geom_point(size = 2.5) +
scale_color_brewer(palette = "Dark2", name = "Age Group") +
scale_y_continuous(limits = c(0, 5)) +
coord_flip() +
labs(title = "Average Satisfaction Scores by Age Group",
x = NULL, y = "Mean Score (1–5)") +
theme_minimal(base_size = 12)


8. Customer Segmentation (K-Means Clustering)

set.seed(42)
df_scaled <- scale(df_ratings)

# Elbow method to choose optimal k
fviz_nbclust(df_scaled, kmeans, method = "wss",
k.max = 8,
linecolor = "#e74c3c") +
labs(title = "Elbow Method – Optimal Number of Clusters",
subtitle = "Look for the 'elbow' where WSS improvement flattens",
x = "Number of Clusters (k)", y = "Total Within-Cluster SS") +
theme_minimal(base_size = 13)

# Fit k-means with k = 3 (adjust based on elbow plot)
k <- 3
km <- kmeans(df_scaled, centers = k, nstart = 25, iter.max = 100)

df$Cluster <- factor(km$cluster, labels = paste0("Segment ", 1:k))

cat("Cluster sizes:\n")
## Cluster sizes:
print(table(df$Cluster))
## 
## Segment 1 Segment 2 Segment 3 
##         4        10         8
fviz_cluster(km, data = df_scaled,
geom = "point",
ellipse = TRUE,
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal(base_size = 13)) +
labs(title = "Customer Segments – PCA Projection",
subtitle = "Each point = one customer; ellipses define segment boundaries")

8.1 Segment Profiles

segment_profiles <- df %>%
group_by(Cluster) %>%
summarise(
n = n(),
CS_helpful = round(mean(CS_helpful), 2),
Recommend = round(mean(Recommend), 2),
Come_again = round(mean(Come_again), 2),
All_Products = round(mean(All_Products), 2),
Profesionalism = round(mean(Profesionalism), 2),
Online_grocery = round(mean(Online_grocery), 2),
delivery = round(mean(delivery), 2),
Pick_up = round(mean(Pick_up), 2),
Find_items = round(mean(Find_items), 2),
.groups = "drop"
)

kable(segment_profiles, caption = "Mean Scores per Customer Segment")
Mean Scores per Customer Segment
Cluster n CS_helpful Recommend Come_again All_Products Profesionalism Online_grocery delivery Pick_up Find_items
Segment 1 4 2.50 2.00 2.50 3.25 2.00 2.25 3.0 1.25 2.0
Segment 2 10 1.10 1.00 1.30 2.00 1.20 2.00 1.7 2.20 1.2
Segment 3 8 1.75 1.38 1.12 1.62 1.38 2.62 3.0 3.38 1.5
# Radar-style comparison using a grouped bar chart
segment_long <- segment_profiles %>%
select(-n) %>%
pivot_longer(-Cluster, names_to = "Variable", values_to = "Mean")

ggplot(segment_long, aes(x = Variable, y = Mean, fill = Cluster)) +
geom_col(position = "dodge", alpha = 0.85) +
scale_fill_brewer(palette = "Set1", name = "Segment") +
scale_y_continuous(limits = c(0, 5)) +
coord_flip() +
labs(title = "Customer Segment Profiles",
subtitle = "Mean satisfaction scores per variable, grouped by segment",
x = NULL, y = "Mean Score (1–5)") +
theme_minimal(base_size = 12)


9. Key Findings & Recommendations

1. Overall Satisfaction

Most customers gave ratings between 1 and 2, suggesting generally high satisfaction (on a scale where 1 = best). The mean scores across all rating dimensions were low, indicating the store performs well overall.

2. Highest Satisfaction Area

The item with the lowest (best) mean score is Recommend (M = 1.32), suggesting this is the area customers are most pleased with.

3. Area for Improvement

The item with the highest (worst) mean score is other_shops (M = 2.59). This is the primary opportunity for targeted improvement.

4. Segmentation

K-means clustering identified 3 distinct customer segments. Each segment shows a distinct satisfaction profile. Stores can tailor communication, loyalty programs, and service improvements based on these profiles.

5. Demographic Insights

The majority of respondents fall in the 18–34 age group, with high-school and graduate-level education dominating. Gender representation is uneven; findings should be interpreted with this in mind given the small sample size (n = 22).


10. Limitations