Reflection This analysis applies k-means clustering to a grocery store customer survey dataset in order to identify distinct customer segments based on service satisfaction, product range perceptions, and digital channel usage. Using R, the workflow encompasses data cleaning, descriptive statistics, demographic profiling, correlation analysis, and unsupervised machine learning. Eleven Likert-scale survey items were scaled and submitted to a k-means algorithm, with the optimal number of clusters determined through an elbow plot. The resulting three segments were then visualized via PCA projection and profiled across both behavioral and demographic dimensions, culminating in a summary table that quantifies each segment’s service, channel, and product scores.
Completing this analysis deepened my understanding of the full customer segmentation pipeline, from raw survey data to actionable consumer profiles. Working through the k-means methodology reinforced the importance of feature scaling prior to clustering, as unscaled variables would disproportionately influence distance calculations and distort segment boundaries. Debugging the summarise() error in the summary table further illustrated a critical distinction in how dplyr handles grouped versus ungrouped data frames, specifically that row-level operations such as rowMeans() must be computed before group_by() is called rather than inside summarise(). Taken together, this lab strengthened both my technical proficiency in R and my conceptual understanding of how quantitative segmentation methods can inform retail marketing strategy.
required <- c("tidyverse", "knitr", "kableExtra", "corrplot",
"cluster", "factoextra", "scales")
new_pkgs <- required[!required %in% rownames(installed.packages())]
if (length(new_pkgs)) install.packages(new_pkgs, repos = "https://cran.r-project.org")
library(tidyverse)
library(knitr)
library(kableExtra)
library(corrplot)
library(cluster)
library(factoextra)
library(scales)df_raw <- read_csv("customer_segmentation.csv", show_col_types = FALSE)
df <- df_raw %>%
rename(Professionalism = Profesionalism) %>%
mutate(
Gender_lbl = factor(Gender, levels = 1:2, labels = c("Male", "Female")),
Age_lbl = factor(Age, levels = 2:4, labels = c("18-30", "31-45", "46+")),
Education_lbl = factor(Education, levels = c(1, 2, 3, 5),
labels = c("High School", "Some College", "Bachelor's", "Graduate"))
)
cat("Dataset:", nrow(df), "respondents x", ncol(df_raw), "variables\n")## Dataset: 22 respondents x 15 variables
## Rows: 22
## Columns: 18
## $ ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ CS_helpful <dbl> 2, 1, 2, 3, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, …
## $ Recommend <dbl> 2, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Come_again <dbl> 2, 1, 1, 2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, …
## $ All_Products <dbl> 2, 1, 1, 4, 5, 2, 2, 2, 2, 1, 2, 2, 1, 2, 4, 2, 2, 1, …
## $ Professionalism <dbl> 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Limitation <dbl> 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, …
## $ Online_grocery <dbl> 2, 2, 3, 3, 2, 1, 2, 1, 2, 3, 2, 3, 1, 3, 2, 3, 2, 3, …
## $ delivery <dbl> 3, 3, 3, 3, 3, 2, 2, 1, 1, 2, 2, 2, 2, 3, 2, 1, 3, 3, …
## $ Pick_up <dbl> 4, 3, 2, 2, 1, 1, 2, 2, 3, 2, 2, 3, 2, 3, 2, 3, 5, 3, …
## $ Find_items <dbl> 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, …
## $ other_shops <dbl> 2, 2, 3, 2, 3, 4, 1, 4, 1, 1, 3, 3, 1, 1, 5, 5, 5, 2, …
## $ Gender <dbl> 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, …
## $ Age <dbl> 2, 2, 2, 3, 4, 2, 2, 2, 2, 2, 4, 3, 4, 3, 2, 3, 2, 2, …
## $ Education <dbl> 2, 2, 2, 5, 2, 5, 3, 2, 1, 2, 5, 1, 5, 5, 5, 5, 1, 5, …
## $ Gender_lbl <fct> Male, Male, Male, Male, Female, Male, Male, Male, Fema…
## $ Age_lbl <fct> 18-30, 18-30, 18-30, 31-45, 46+, 18-30, 18-30, 18-30, …
## $ Education_lbl <fct> Some College, Some College, Some College, Graduate, So…
miss <- df %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>%
filter(Missing > 0)
if (nrow(miss) == 0) {
cat("No missing values found.\n")
} else {
miss %>%
kable(caption = "Variables with Missing Values") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
}## No missing values found.
df %>%
select(where(is.numeric), -ID) %>%
summarise(across(everything(), list(
Min = min,
Max = max,
Mean = ~round(mean(.), 2),
SD = ~round(sd(.), 2)
))) %>%
pivot_longer(everything(), names_to = "key", values_to = "val") %>%
separate(key, into = c("Variable", "Stat"), sep = "_(?=[^_]+$)") %>%
pivot_wider(names_from = Stat, values_from = val) %>%
kable(caption = "Descriptive Statistics (numeric variables)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| Variable | Min | Max | Mean | SD |
|---|---|---|---|---|
| CS_helpful | 1 | 3 | 1.59 | 0.73 |
| Recommend | 1 | 3 | 1.32 | 0.65 |
| Come_again | 1 | 3 | 1.45 | 0.74 |
| All_Products | 1 | 5 | 2.09 | 1.06 |
| Professionalism | 1 | 3 | 1.41 | 0.59 |
| Limitation | 1 | 4 | 1.50 | 0.80 |
| Online_grocery | 1 | 3 | 2.27 | 0.77 |
| delivery | 1 | 3 | 2.41 | 0.73 |
| Pick_up | 1 | 5 | 2.45 | 1.06 |
| Find_items | 1 | 3 | 1.45 | 0.67 |
| other_shops | 1 | 5 | 2.59 | 1.40 |
| Gender | 1 | 2 | 1.27 | 0.46 |
| Age | 2 | 4 | 2.45 | 0.74 |
| Education | 1 | 5 | 3.18 | 1.62 |
df %>%
select(all_of(likert_vars)) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Score") %>%
mutate(Variable = recode(Variable, !!!var_labels)) %>%
ggplot(aes(x = factor(Score), fill = factor(Score))) +
geom_bar(show.legend = FALSE) +
facet_wrap(~Variable, ncol = 3) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Response Distribution by Survey Item",
subtitle = "1 = Most favorable / most frequent",
x = "Score", y = "Count"
) +
theme_minimal(base_size = 11) +
theme(strip.text = element_text(face = "bold"))df %>%
select(all_of(likert_vars)) %>%
summarise(across(everything(), mean)) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Mean") %>%
mutate(
Label = recode(Variable, !!!var_labels),
Category = case_when(
Variable %in% c("CS_helpful", "Recommend", "Come_again",
"Professionalism", "Find_items") ~ "Service / Experience",
Variable %in% c("All_Products", "Limitation") ~ "Product Range",
TRUE ~ "Shopping Channel"
)
) %>%
ggplot(aes(x = reorder(Label, Mean), y = Mean, fill = Category)) +
geom_col(width = 0.7) +
geom_text(aes(label = round(Mean, 2)), hjust = -0.15, size = 3.8) +
coord_flip() +
scale_y_continuous(limits = c(0, 4.2)) +
scale_fill_brewer(palette = "Set1") +
labs(
title = "Mean Score per Survey Item",
subtitle = "Lower = more favorable",
x = NULL, y = "Mean Score", fill = "Category"
) +
theme_minimal(base_size = 12)cor_mat <- df %>%
select(all_of(likert_vars)) %>%
cor(use = "pairwise.complete.obs")
colnames(cor_mat) <- var_labels[colnames(cor_mat)]
rownames(cor_mat) <- var_labels[rownames(cor_mat)]
corrplot(
cor_mat,
method = "color",
type = "upper",
order = "hclust",
addCoef.col = "black",
number.cex = 0.68,
tl.cex = 0.75,
tl.col = "black",
col = colorRampPalette(c("#2166AC", "white", "#D6604D"))(200),
title = "Correlation Matrix (hierarchical order)",
mar = c(0, 0, 2, 0)
)set.seed(42)
wss <- sapply(1:8, function(k) {
kmeans(df_scaled, centers = k, nstart = 25)$tot.withinss
})
tibble(k = 1:8, WSS = wss) %>%
ggplot(aes(x = k, y = WSS)) +
geom_line(linewidth = 1, color = "#4E79A7") +
geom_point(size = 3.5, color = "#E15759") +
scale_x_continuous(breaks = 1:8) +
labs(
title = "Elbow Plot — Choosing the Number of Clusters",
subtitle = "Select k at the 'elbow' where WSS stops dropping sharply",
x = "k (clusters)", y = "Total Within-Cluster SS"
) +
theme_minimal(base_size = 13)set.seed(42)
km <- kmeans(df_scaled, centers = 3, nstart = 50)
df <- df %>%
mutate(Cluster = factor(km$cluster, labels = c("Segment A", "Segment B", "Segment C")))
cat("Cluster sizes:\n")## Cluster sizes:
##
## Segment A Segment B Segment C
## 10 4 8
fviz_cluster(
km,
data = df_scaled,
geom = "point",
ellipse.type = "convex",
palette = c("#E15759", "#4E79A7", "#59A14F"),
ggtheme = theme_minimal(base_size = 12),
main = "Customer Segments — PCA Projection"
)df %>%
group_by(Cluster) %>%
summarise(across(all_of(likert_vars), mean)) %>%
pivot_longer(-Cluster, names_to = "Variable", values_to = "Mean") %>%
mutate(Label = recode(Variable, !!!var_labels)) %>%
ggplot(aes(x = Label, y = Mean, fill = Cluster)) +
geom_col(position = "dodge", width = 0.7) +
coord_flip() +
scale_fill_manual(values = c(
"Segment A" = "#E15759",
"Segment B" = "#4E79A7",
"Segment C" = "#59A14F"
)) +
labs(
title = "Mean Score per Item by Customer Segment",
subtitle = "Lower = more favorable",
x = NULL, y = "Mean Score", fill = "Segment"
) +
theme_minimal(base_size = 11)df %>%
mutate(
Service_Score = rowMeans(select(., CS_helpful, Recommend, Come_again,
Professionalism, Find_items)),
Channel_Score = rowMeans(select(., Online_grocery, delivery, Pick_up))
) %>%
ggplot(aes(x = Service_Score, y = Channel_Score, color = Cluster)) +
geom_point(size = 4, alpha = 0.85) +
geom_smooth(method = "lm", se = TRUE, linewidth = 0.8,
aes(group = 1), color = "grey40", alpha = 0.15) +
scale_color_manual(values = c(
"Segment A" = "#E15759",
"Segment B" = "#4E79A7",
"Segment C" = "#59A14F"
)) +
labs(
title = "Service Satisfaction vs. Digital Channel Usage",
subtitle = "Lower = more satisfied / heavier channel user (scale: 1–5)",
x = "Service Score (avg)", y = "Channel Score (avg)", color = "Segment"
) +
theme_minimal(base_size = 13)df %>%
mutate(
Service_Score = rowMeans(cbind(CS_helpful, Recommend, Come_again,
Professionalism, Find_items)),
Channel_Score = rowMeans(cbind(Online_grocery, delivery, Pick_up)),
Product_Score = rowMeans(cbind(All_Products, Limitation))
) %>%
group_by(Cluster) %>%
summarise(
N = n(),
Pct_Female = percent(mean(Gender == 2), 1),
Avg_Service = round(mean(Service_Score), 2),
Avg_Channel = round(mean(Channel_Score), 2),
Avg_Product = round(mean(Product_Score), 2)
) %>%
kable(caption = "Segment Summary") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Cluster | N | Pct_Female | Avg_Service | Avg_Channel | Avg_Product |
|---|---|---|---|---|---|
| Segment A | 10 | 30% | 1.16 | 1.97 | 1.55 |
| Segment B | 4 | 75% | 2.20 | 2.17 | 2.62 |
| Segment C | 8 | 0% | 1.43 | 3.00 | 1.69 |
## R version 4.5.3 (2026-03-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scales_1.4.0 factoextra_2.0.0 cluster_2.1.8.2 corrplot_0.95
## [5] kableExtra_1.4.0 knitr_1.51 lubridate_1.9.5 forcats_1.0.1
## [9] stringr_1.6.0 dplyr_1.2.0 purrr_1.2.1 readr_2.2.0
## [13] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.2 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.56 bslib_0.10.0 ggrepel_0.9.8
## [5] rstatix_0.7.3 lattice_0.22-7 tzdb_0.5.0 vctrs_0.7.1
## [9] tools_4.5.3 generics_0.1.4 parallel_4.5.3 pkgconfig_2.0.3
## [13] Matrix_1.7-4 RColorBrewer_1.1-3 S7_0.2.1 lifecycle_1.0.5
## [17] compiler_4.5.3 farver_2.1.2 textshaping_1.0.4 carData_3.0-6
## [21] htmltools_0.5.9 sass_0.4.10 yaml_2.3.12 Formula_1.2-5
## [25] pillar_1.11.1 car_3.1-5 ggpubr_0.6.3 crayon_1.5.3
## [29] jquerylib_0.1.4 cachem_1.1.0 abind_1.4-8 nlme_3.1-168
## [33] tidyselect_1.2.1 digest_0.6.39 stringi_1.8.7 splines_4.5.3
## [37] labeling_0.4.3 fastmap_1.2.0 grid_4.5.3 cli_3.6.5
## [41] magrittr_2.0.4 broom_1.0.12 withr_3.0.2 backports_1.5.0
## [45] bit64_4.6.0-1 timechange_0.4.0 rmarkdown_2.30 bit_4.6.0
## [49] otel_0.2.0 ggsignif_0.6.4 hms_1.1.4 evaluate_1.0.5
## [53] viridisLite_0.4.3 mgcv_1.9-3 rlang_1.1.7 Rcpp_1.1.1
## [57] glue_1.8.0 xml2_1.5.2 svglite_2.2.2 rstudioapi_0.18.0
## [61] vroom_1.7.0 jsonlite_2.0.0 R6_2.6.1 systemfonts_1.3.1