Customer Segmentation Analysis – Grocery Store Survey

📌 Prompts Used to Generate This Analysis

Prompt 1 (Original):

“I have a CSV dataset about grocery store customers. Can you write me an R Markdown file to analyze it?”

Prompt 2 (Improved):

“I have a CSV dataset from a grocery store customer survey. The dataset has 22 rows and 15 columns. The columns include Likert-scale ratings (1–5) on topics like customer service helpfulness (CS_helpful), willingness to recommend (Recommend), likelihood to return (Come_again), product variety (All_Products), staff professionalism (Profesionalism), store limitations (Limitation), online grocery preference (Online_grocery), delivery preference (delivery), pick-up preference (Pick_up), ease of finding items (Find_items), shopping at other stores (other_shops), and demographic variables (Gender, Age, Education coded as numbers). Please write a complete R Markdown file that: loads and cleans the data, generates summary statistics, creates bar charts for demographic variables (Gender, Age, Education), visualizes the distribution of key service ratings, computes a correlation matrix for the Likert-scale items, and interprets the findings in plain language. Use ggplot2 for all visualizations.”

1. Load Libraries and Data

library(tidyverse)
library(ggplot2)
library(corrplot)
library(knitr)

# Load the dataset
df <- read.csv("customer_segmentation.csv")

# Preview the data
kable(head(df), caption = "First 6 rows of the customer survey dataset")

First 6 rows of the customer survey dataset
ID	CS_helpful	Recommend	Come_again	All_Products	Profesionalism	Limitation	Online_grocery	delivery	Pick_up	Find_items	other_shops	Gender	Age	Education
1	2	2	2	2	2	2	2	3	4	1	2	1	2	2
2	1	2	1	1	1	1	2	3	3	1	2	1	2	2
3	2	1	1	1	1	2	3	3	2	1	3	1	2	2
4	3	3	2	4	1	2	3	3	2	2	2	1	3	5
5	2	1	3	5	2	1	2	3	1	2	3	2	4	2
6	1	1	3	2	1	1	1	2	1	1	4	1	2	5
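
The report loads the file but does not show a cleaning step. Below is a minimal sketch of the kind of check that could precede the analysis. It is rebuilt from only the six preview rows above and a subset of columns, and the names `df_preview`, `likert_items`, `na_counts` and `in_range` are illustrative assumptions, not part of the original script.

```r
# Demo data: only the six preview rows shown above, not the full 22-row file
df_preview <- data.frame(
  CS_helpful = c(2, 1, 2, 3, 2, 1),
  Recommend  = c(2, 2, 1, 3, 1, 1),
  Come_again = c(2, 1, 1, 2, 3, 3),
  Gender     = c(1, 1, 1, 1, 2, 1),
  Age        = c(2, 2, 2, 3, 4, 2),
  Education  = c(2, 2, 2, 5, 2, 5)
)

# Hypothetical list of Likert columns to validate
likert_items <- c("CS_helpful", "Recommend", "Come_again")

# Count missing values per column
na_counts <- colSums(is.na(df_preview))

# TRUE when every non-missing response in a column lies on the 1-5 scale
in_range <- sapply(df_preview[likert_items],
                   function(x) all(x[!is.na(x)] %in% 1:5))

print(na_counts)
print(in_range)
```

On the real data the same two checks would run on `df` with all eleven Likert columns listed in `likert_items`.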

2. Data Overview

# Dimensions
cat("Number of respondents:", nrow(df), "\n")

## Number of respondents: 22

cat("Number of variables:", ncol(df), "\n\n")

## Number of variables: 15

# Summary statistics
summary(df[, 2:12])  # Likert-scale columns only

##    CS_helpful      Recommend       Come_again     All_Products  
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.250  
##  Median :1.000   Median :1.000   Median :1.000   Median :2.000  
##  Mean   :1.591   Mean   :1.318   Mean   :1.455   Mean   :2.091  
##  3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :3.000   Max.   :3.000   Max.   :3.000   Max.   :5.000  
##  Profesionalism    Limitation  Online_grocery     delivery        Pick_up     
##  Min.   :1.000   Min.   :1.0   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.0   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :1.000   Median :1.0   Median :2.000   Median :3.000   Median :2.000  
##  Mean   :1.409   Mean   :1.5   Mean   :2.273   Mean   :2.409   Mean   :2.455  
##  3rd Qu.:2.000   3rd Qu.:2.0   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :3.000   Max.   :4.0   Max.   :3.000   Max.   :3.000   Max.   :5.000  
##    Find_items     other_shops   
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.250  
##  Median :1.000   Median :2.000  
##  Mean   :1.455   Mean   :2.591  
##  3rd Qu.:2.000   3rd Qu.:3.750  
##  Max.   :3.000   Max.   :5.000

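Interpretation: The dataset contains responses from 22 grocery store customers. Every item except delivery (median 3) has a median of 1 or 2, so responses sit at the low end of the 1–5 scale; assuming 1 marks the low end, this points to low satisfaction on several dimensions. All_Products, Pick_up, and other_shops span the full 1–5 range, whereas most other items top out at 3 (Limitation reaches 4).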
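
Spread is easier to compare when it is computed directly rather than read off the quartiles. This is a sketch using base R on the six preview rows only, so the numbers it prints describe that subset and not the 22-row dataset. The names `items` and `spread` are illustrative.

```r
# Demo data: four items from the six preview rows, not the full dataset
items <- data.frame(
  All_Products = c(2, 1, 1, 4, 5, 2),
  delivery     = c(3, 3, 3, 3, 3, 2),
  Pick_up      = c(4, 3, 2, 2, 1, 1),
  other_shops  = c(2, 2, 3, 2, 3, 4)
)

# One row per item: mean, standard deviation, and observed range
spread <- data.frame(
  item = names(items),
  mean = sapply(items, mean),
  sd   = sapply(items, sd),
  min  = sapply(items, min),
  max  = sapply(items, max),
  row.names = NULL
)

# Sort so the most dispersed items come first
spread <- spread[order(-spread$sd), ]
print(spread)
```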

3. Demographic Breakdown

3a. Gender

# Recode gender: 1 = Male, 2 = Female (any other code becomes NA rather than "Female")
df$Gender_label <- factor(df$Gender, levels = c(1, 2), labels = c("Male", "Female"))

ggplot(df, aes(x = Gender_label, fill = Gender_label)) +
  geom_bar(width = 0.5, color = "white") +
  scale_fill_manual(values = c("Male" = "#4A90D9", "Female" = "#E87D7D")) +
  labs(title = "Survey Respondents by Gender",
       x = "Gender", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation: The sample skews male — 16 male respondents vs. 6 female. This is worth noting as a potential limitation in generalizability.
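
Counts quoted in an interpretation can be checked with a small frequency table. The sketch below uses only the Gender codes from the six preview rows and assumes the same 1 = Male, 2 = Female coding as the chart; `gender_codes`, `counts` and `shares` are illustrative names.

```r
# Demo data: Gender codes from the six preview rows only
gender_codes <- c(1, 1, 1, 1, 2, 1)

# Assumed coding, matching the chart: 1 = Male, 2 = Female
gender <- factor(gender_codes, levels = c(1, 2), labels = c("Male", "Female"))

# Frequency table; useNA = "ifany" surfaces any unexpected codes as NA
counts <- table(gender, useNA = "ifany")

# Percentage share of each category
shares <- round(100 * prop.table(counts), 1)

print(counts)
print(shares)
```

The same pattern applies to Age and Education by swapping in their codes and labels.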

3b. Age Group

# Recode age: 1=Under 18, 2=18-34, 3=35-54, 4=55+
# A factor with explicit levels keeps the bars in age order instead of alphabetical order
df$Age_label <- factor(df$Age,
  levels = 1:4,
  labels = c("Under 18", "18–34", "35–54", "55+")
)

ggplot(df, aes(x = Age_label, fill = Age_label)) +
  geom_bar(width = 0.5, color = "white") +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Survey Respondents by Age Group",
       x = "Age Group", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation: The majority of respondents (15 out of 22) fall in the 18–34 age bracket, making this a relatively young sample. No respondents under 18 were recorded.

3c. Education Level

# Recode education: 1=HS or less, 2=Some college, 3=Bachelor's, 4=Master's, 5=Doctorate/Professional
# A factor with explicit levels keeps the bars in order of attainment instead of alphabetical order
df$Edu_label <- factor(df$Education,
  levels = 1:5,
  labels = c("HS or Less", "Some College", "Bachelor's",
             "Master's", "Doctorate/Prof.")
)

ggplot(df, aes(x = Edu_label, fill = Edu_label)) +
  geom_bar(width = 0.5, color = "white") +
  scale_fill_brewer(palette = "Pastel1") +
  labs(title = "Survey Respondents by Education Level",
       x = "Education Level", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 20, hjust = 1),
        legend.position = "none")

Interpretation: Education is bimodal — respondents cluster at either Some College or Doctorate/Professional levels. This could reflect two distinct customer segments shopping at this store.

4. Service Rating Distributions

# Select key service variables
service_vars <- df %>%
  select(CS_helpful, Recommend, Come_again, All_Products, Profesionalism, Find_items)

# Pivot to long format for faceted plot
service_long <- service_vars %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Rating")

ggplot(service_long, aes(x = factor(Rating), fill = Variable)) +
  geom_bar(show.legend = FALSE, color = "white") +
  facet_wrap(~Variable, ncol = 3) +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Distribution of Service Ratings (1 = Low, 5 = High)",
       x = "Rating", y = "Count") +
  theme_minimal()

Interpretation: Most service variables are rated 1–2, indicating that respondents generally rate the store below average on helpfulness, recommendation likelihood, and intent to return. All_Products shows the widest range, with ratings up to 5; the other five items, including Find_items, top out at 3.
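
A frequency table that keeps all five scale points, including those nobody chose, makes it easy to see which items actually reach the top of the scale. This is a sketch on three items from the six preview rows only; `ratings`, `freq` and `top_rating` are illustrative names.

```r
# Demo data: three service items from the six preview rows only
ratings <- data.frame(
  CS_helpful   = c(2, 1, 2, 3, 2, 1),
  All_Products = c(2, 1, 1, 4, 5, 2),
  Find_items   = c(1, 1, 1, 2, 2, 1)
)

# Tabulate each item against the full 1-5 scale so unused ratings show as 0
freq <- sapply(ratings, function(x) table(factor(x, levels = 1:5)))

# Highest rating actually observed for each item
top_rating <- apply(freq > 0, 2, function(z) max(which(z)))

print(freq)
print(top_rating)
```

In the faceted ggplot2 chart, the analogous step would be plotting `factor(Rating, levels = 1:5)` with `scale_x_discrete(drop = FALSE)` so every panel shows the whole scale.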

5. Shopping Preference Ratings

pref_vars <- df %>%
  select(Online_grocery, delivery, Pick_up, other_shops)

pref_long <- pref_vars %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Rating")

ggplot(pref_long, aes(x = factor(Rating), fill = Variable)) +
  geom_bar(show.legend = FALSE, color = "white") +
  facet_wrap(~Variable, ncol = 2) +
  scale_fill_brewer(palette = "Pastel2") +
  labs(title = "Distribution of Shopping Preference Ratings",
       x = "Rating", y = "Count") +
  theme_minimal()

Interpretation: Online_grocery and delivery ratings stay between 1 and 3 (means of about 2.3 and 2.4), indicating moderate interest at most. Pick_up and other_shops both span the full 1–5 range, so pick-up preferences vary considerably and at least some customers also shop at competitors.

6. Correlation Matrix

# Select only Likert-scale variables
likert_cols <- df %>%
  select(CS_helpful, Recommend, Come_again, All_Products,
         Profesionalism, Limitation, Online_grocery,
         delivery, Pick_up, Find_items, other_shops)

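# Rename Profesionalism by name rather than by position (fixes the typo for display)
likert_cols <- likert_cols %>%
  rename(Professionalism = Profesionalism)

# Compute Pearson correlations (the cor() default), using complete rows only
cor_matrix <- cor(likert_cols, use = "complete.obs")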

# Plot
corrplot(cor_matrix,
         method = "color",
         type = "upper",
         tl.cex = 0.8,
         addCoef.col = "black",
         number.cex = 0.7,
         col = colorRampPalette(c("#E87D7D", "white", "#4A90D9"))(200),
         title = "Correlation Matrix of Survey Items",
         mar = c(0, 0, 2, 0))

Interpretation:

CS_helpful, Recommend, and Come_again are positively correlated: customers who find staff helpful also tend to say they would recommend the store and return. This is expected and is consistent with customer service being a driver of loyalty, although a correlation among 22 responses cannot establish causation.
Online_grocery and delivery are moderately correlated, suggesting the same customers who prefer online shopping also prefer delivery.
other_shops has weak or near-zero correlations with most service variables, suggesting shopping at competitors is not strongly tied to satisfaction here.
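
Because Likert responses are ordinal, a rank-based coefficient is a common alternative to the Pearson default used above. The sketch below swaps in Spearman's rho and adds a significance test for one pair. It runs on three items from the six preview rows only, so its output says nothing about the full survey; `likert` and `pair_test` are illustrative names.

```r
# Demo data: three items from the six preview rows only
likert <- data.frame(
  CS_helpful = c(2, 1, 2, 3, 2, 1),
  Recommend  = c(2, 2, 1, 3, 1, 1),
  Come_again = c(2, 1, 1, 2, 3, 3)
)

# Rank-based correlation matrix (Spearman instead of the Pearson default)
rho <- cor(likert, method = "spearman")

# Test one pair; exact = FALSE because tied ranks rule out an exact p-value
pair_test <- cor.test(likert$CS_helpful, likert$Recommend,
                      method = "spearman", exact = FALSE)

print(round(rho, 2))
print(pair_test$p.value)
```

Reporting the p-value or an interval alongside each coefficient helps readers judge how much weight a correlation from a small sample can bear.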

7. Summary & Insights

Finding	Implication
Most ratings cluster at 1–2	Overall satisfaction is low; areas for improvement exist across the board
CS_helpful, Recommend, and Come_again are positively correlated	Customer service training may improve retention and word-of-mouth
Online_grocery ↔︎ delivery correlation	There’s a customer segment preferring digital convenience
Bimodal education distribution	The store may be serving two distinct segments with different needs
Male-skewed sample	Future surveys should target a more balanced gender sample

8. Conclusion

Although the sample is small (n = 22), this analysis points to three patterns: ratings are generally low, perceived staff helpfulness is associated with willingness to recommend and return, and a segment of customers prefers digital shopping options. Future research could expand the sample size and incorporate open-ended questions to capture the “why” behind these ratings.
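
To illustrate why sample size matters here, the following sketch computes an approximate 95% confidence interval for a correlation at n = 22 using the Fisher z-transform. The value `r = 0.5` is a hypothetical input chosen for illustration, not a coefficient taken from this survey.

```r
# Hypothetical inputs: r is an illustrative value, not a result from this survey
r <- 0.5
n <- 22

z  <- atanh(r)          # Fisher z-transform of r
se <- 1 / sqrt(n - 3)   # approximate standard error of z

# Interval on the z scale, back-transformed to the correlation scale
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(ci, 2))
```

Under these assumptions the interval is wide, which is the practical argument for collecting more responses before acting on any single correlation.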

Analysis generated using R with the tidyverse, ggplot2, corrplot, and knitr packages.