Prompt 1 (Original):
“I have a CSV dataset about grocery store customers. Can you write me an R Markdown file to analyze it?”
Prompt 2 (Improved):
“I have a CSV dataset from a grocery store customer survey. The dataset has 22 rows and 15 columns. The columns include Likert-scale ratings (1–5) on topics like customer service helpfulness (CS_helpful), willingness to recommend (Recommend), likelihood to return (Come_again), product variety (All_Products), staff professionalism (Profesionalism), store limitations (Limitation), online grocery preference (Online_grocery), delivery preference (delivery), pick-up preference (Pick_up), ease of finding items (Find_items), shopping at other stores (other_shops), and demographic variables (Gender, Age, Education coded as numbers). Please write a complete R Markdown file that: loads and cleans the data, generates summary statistics, creates bar charts for demographic variables (Gender, Age, Education), visualizes the distribution of key service ratings, computes a correlation matrix for the Likert-scale items, and interprets the findings in plain language. Use ggplot2 for all visualizations.”
library(tidyverse)
library(ggplot2)
library(corrplot)
library(knitr)
# Load the dataset
df <- read.csv("customer_segmentation.csv")
# Preview the data
kable(head(df), caption = "First 6 rows of the customer survey dataset")

| ID | CS_helpful | Recommend | Come_again | All_Products | Profesionalism | Limitation | Online_grocery | delivery | Pick_up | Find_items | other_shops | Gender | Age | Education |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 4 | 1 | 2 | 1 | 2 | 2 |
| 2 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 3 | 3 | 1 | 2 | 1 | 2 | 2 |
| 3 | 2 | 1 | 1 | 1 | 1 | 2 | 3 | 3 | 2 | 1 | 3 | 1 | 2 | 2 |
| 4 | 3 | 3 | 2 | 4 | 1 | 2 | 3 | 3 | 2 | 2 | 2 | 1 | 3 | 5 |
| 5 | 2 | 1 | 3 | 5 | 2 | 1 | 2 | 3 | 1 | 2 | 3 | 2 | 4 | 2 |
| 6 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 4 | 1 | 2 | 5 |
## Number of respondents: 22
## Number of variables: 15
## CS_helpful Recommend Come_again All_Products
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.250
## Median :1.000 Median :1.000 Median :1.000 Median :2.000
## Mean :1.591 Mean :1.318 Mean :1.455 Mean :2.091
## 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :3.000 Max. :3.000 Max. :3.000 Max. :5.000
## Profesionalism Limitation Online_grocery delivery Pick_up
## Min. :1.000 Min. :1.0 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.0 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :1.000 Median :1.0 Median :2.000 Median :3.000 Median :2.000
## Mean :1.409 Mean :1.5 Mean :2.273 Mean :2.409 Mean :2.455
## 3rd Qu.:2.000 3rd Qu.:2.0 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :3.000 Max. :4.0 Max. :3.000 Max. :3.000 Max. :5.000
## Find_items other_shops
## Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.250
## Median :1.000 Median :2.000
## Mean :1.455 Mean :2.591
## 3rd Qu.:2.000 3rd Qu.:3.750
## Max. :3.000 Max. :5.000
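Interpretation: The dataset contains responses from 22 grocery store customers. Most Likert ratings hover near 1–2, suggesting respondents had mixed-to-low satisfaction scores on several dimensions. Pick_up, other_shops, and All_Products show the most spread, each reaching the maximum rating of 5, while delivery and Online_grocery have higher means (about 2.3 to 2.4) but never exceed 3.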
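The respondent count, variable count, and summary table above are the kind of output that `nrow()`, `ncol()`, and `summary()` give. As a minimal sketch, assuming a small made-up data frame `toy` (hypothetical values, not the survey data), the "spread" mentioned in the interpretation could be quantified with a per-item standard deviation and range:

```r
# Toy stand-in for the survey data; values are invented for illustration only
toy <- data.frame(
  CS_helpful = c(1, 2, 1, 3, 2, 1),
  Pick_up    = c(1, 5, 2, 3, 4, 2),
  delivery   = c(2, 3, 3, 1, 3, 2)
)

cat("Number of respondents:", nrow(toy), "\n")
cat("Number of variables:", ncol(toy), "\n")

# Per-item mean, standard deviation, and range to compare spread across items
spread <- data.frame(
  item = names(toy),
  mean = sapply(toy, mean),
  sd   = sapply(toy, sd),
  min  = sapply(toy, min),
  max  = sapply(toy, max),
  row.names = NULL
)
print(spread)
```

Comparing `sd` and the `min`/`max` columns side by side makes it easier to say which items vary most than reading the quartiles alone.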
# Recode gender: 1 = Male, 2 = Female
df$Gender_label <- ifelse(df$Gender == 1, "Male", "Female")
ggplot(df, aes(x = Gender_label, fill = Gender_label)) +
  geom_bar(width = 0.5, color = "white") +
  scale_fill_manual(values = c("Male" = "#4A90D9", "Female" = "#E87D7D")) +
  labs(title = "Survey Respondents by Gender",
       x = "Gender", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation: The sample skews male, with 16 male respondents vs. 6 female. This is worth noting as a potential limitation in generalizability.
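A split like "16 vs. 6" can be checked directly with `table()`. The sketch below assumes the 1 = Male, 2 = Female coding from the recode comment; the vector `gender` is hypothetical, built only to mirror the reported counts. It uses `factor()` rather than `ifelse()` so that any unexpected code becomes `NA` instead of being labeled "Female" by default.

```r
# Hypothetical codes built to mirror the reported split (16 coded 1, 6 coded 2)
gender <- c(rep(1, 16), rep(2, 6))

# factor() maps only the listed codes; any other value becomes NA
gender_label <- factor(gender, levels = c(1, 2), labels = c("Male", "Female"))

# Counts and proportions per category (NA shown only if present)
counts <- table(gender_label, useNA = "ifany")
print(counts)
print(round(prop.table(counts), 2))
```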
# Recode age: 1=Under 18, 2=18-34, 3=35-54, 4=55+
# A factor with explicit levels keeps the bars in age order rather than alphabetical order
df$Age_label <- factor(df$Age,
  levels = 1:4,
  labels = c("Under 18", "18–34", "35–54", "55+")
)
ggplot(df, aes(x = Age_label, fill = Age_label)) +
  geom_bar(width = 0.5, color = "white") +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Survey Respondents by Age Group",
       x = "Age Group", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation: The majority of respondents (15 out of 22) fall in the 18–34 age bracket, making this a relatively young sample. No respondents under 18 were recorded.
# Recode education: 1=HS or less, 2=Some college, 3=Bachelor's, 4=Master's, 5=Doctorate/Professional
# A factor with explicit levels keeps the bars in education order rather than alphabetical order
df$Edu_label <- factor(df$Education,
  levels = 1:5,
  labels = c("HS or Less", "Some College", "Bachelor's", "Master's", "Doctorate/Prof.")
)
ggplot(df, aes(x = Edu_label, fill = Edu_label)) +
  geom_bar(width = 0.5, color = "white") +
  scale_fill_brewer(palette = "Pastel1") +
  labs(title = "Survey Respondents by Education Level",
       x = "Education Level", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 20, hjust = 1),
        legend.position = "none")

Interpretation: Education is bimodal: respondents cluster at either the Some College or the Doctorate/Professional level. This could reflect two distinct customer segments shopping at this store.
# Select key service variables
service_vars <- df %>%
  select(CS_helpful, Recommend, Come_again, All_Products, Profesionalism, Find_items)
# Pivot to long format for faceted plot
service_long <- service_vars %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Rating")
ggplot(service_long, aes(x = factor(Rating), fill = Variable)) +
  geom_bar(show.legend = FALSE, color = "white") +
  facet_wrap(~Variable, ncol = 3) +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Distribution of Service Ratings (1 = Low, 5 = High)",
       x = "Rating", y = "Count") +
  theme_minimal()

Interpretation: Most service variables are rated 1–2, indicating that respondents generally rate the store below average on helpfulness, recommendation likelihood, and intent to return. All_Products shows the most variance, with some respondents rating it up to 5; the other five items, including Find_items, top out at 3 in the summary statistics.
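The statement that most ratings sit at 1–2 can be made concrete as the share of responses at or below 2 for each item. The following is a minimal sketch in base R; the data frame `ratings` holds invented values for illustration and is not the survey data.

```r
# Invented ratings for three service items (illustration only, not the survey data)
ratings <- data.frame(
  CS_helpful   = c(1, 1, 2, 3, 1, 2),
  Recommend    = c(1, 1, 1, 2, 1, 3),
  All_Products = c(2, 5, 1, 2, 4, 2)
)

# Share of responses rated 1 or 2, per item
low_share <- colMeans(ratings <= 2)
print(round(low_share, 2))

# Count of each rating value per item, with all five scale points shown
rating_counts <- sapply(ratings, function(x) table(factor(x, levels = 1:5)))
print(rating_counts)
```

Fixing the factor levels at 1 through 5 keeps empty scale points visible as zeros, which is the same information the faceted bar chart conveys.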
pref_vars <- df %>%
  select(Online_grocery, delivery, Pick_up, other_shops)
pref_long <- pref_vars %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Rating")
ggplot(pref_long, aes(x = factor(Rating), fill = Variable)) +
  geom_bar(show.legend = FALSE, color = "white") +
  facet_wrap(~Variable, ncol = 2) +
  scale_fill_brewer(palette = "Pastel2") +
  labs(title = "Distribution of Shopping Preference Ratings",
       x = "Rating", y = "Count") +
  theme_minimal()

Interpretation: Customers show moderate interest in online grocery and delivery, while other_shops ratings suggest some customers do shop at competitors. Pick-up shows a wide spread, meaning preferences vary considerably.
# Select only Likert-scale variables and fix the Profesionalism typo for display
# (renaming by name is safer than renaming by column position)
likert_cols <- df %>%
  select(CS_helpful, Recommend, Come_again, All_Products,
         Profesionalism, Limitation, Online_grocery,
         delivery, Pick_up, Find_items, other_shops) %>%
  rename(Professionalism = Profesionalism)
# Compute correlation
cor_matrix <- cor(likert_cols, use = "complete.obs")
# Plot
corrplot(cor_matrix,
         method = "color",
         type = "upper",
         tl.cex = 0.8,
         addCoef.col = "black",
         number.cex = 0.7,
         col = colorRampPalette(c("#E87D7D", "white", "#4A90D9"))(200),
         title = "Correlation Matrix of Survey Items",
         mar = c(0, 0, 2, 0))

Interpretation:
| Finding | Implication |
|---|---|
| Most ratings cluster at 1–2 | Overall satisfaction is low; areas for improvement exist across the board |
| CS_helpful → Recommend → Come_again | Customer service training could improve retention and word-of-mouth |
| Online_grocery ↔︎ delivery correlation | There’s a customer segment preferring digital convenience |
| Bimodal education distribution | The store may be serving two distinct segments with different needs |
| Male-skewed sample | Future surveys should target a more balanced gender sample |
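The correlation-based findings in the table can be checked by listing item pairs in order of correlation strength instead of reading them off the plot. The sketch below is one possible approach, not the method used in the report: it swaps in Spearman's rank correlation (often preferred for ordinal Likert items) where the report's `cor()` call uses the Pearson default, and the data frame `items` contains invented values.

```r
# Invented Likert responses (illustration only, not the survey data)
items <- data.frame(
  CS_helpful     = c(1, 2, 1, 3, 2, 1, 2, 3),
  Recommend      = c(1, 2, 1, 3, 1, 1, 2, 2),
  Online_grocery = c(3, 2, 2, 1, 3, 2, 3, 1),
  delivery       = c(3, 2, 3, 1, 3, 2, 2, 1)
)

# Spearman is rank-based, a common choice for ordinal Likert items
rho <- cor(items, method = "spearman", use = "complete.obs")

# Keep each pair once (upper triangle) and sort by absolute correlation
idx <- which(upper.tri(rho), arr.ind = TRUE)
ranked <- data.frame(
  item_a = rownames(rho)[idx[, "row"]],
  item_b = colnames(rho)[idx[, "col"]],
  rho    = rho[idx]
)
ranked <- ranked[order(-abs(ranked$rho)), ]
print(ranked, row.names = FALSE)
```

A ranked list like this makes it straightforward to confirm which pairs actually stand out before drawing implications from them.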
Although the sample is small (n = 22), the analysis shows clear patterns: satisfaction is generally low, customer service helpfulness is the item most closely associated with loyalty (Recommend and Come_again), and a segment of customers prefers digital shopping options. Future research could expand the sample size and add open-ended questions to capture the “why” behind these ratings.
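To illustrate why the small sample matters, the sketch below computes an approximate 95% confidence interval for a Pearson correlation using Fisher's z-transform. The helper `fisher_ci` and the value r = 0.5 are hypothetical and chosen only for illustration; the approximation also assumes roughly bivariate normal data, which Likert items only loosely satisfy.

```r
# Approximate confidence interval for a Pearson correlation via Fisher's z
fisher_ci <- function(r, n, level = 0.95) {
  z  <- atanh(r)                      # Fisher z-transform of r
  se <- 1 / sqrt(n - 3)               # standard error on the z scale
  q  <- qnorm(1 - (1 - level) / 2)    # two-sided normal quantile
  tanh(c(lower = z - q * se, upper = z + q * se))
}

# r = 0.5 is a hypothetical value; n = 22 matches the survey's sample size
ci <- fisher_ci(r = 0.5, n = 22)
print(round(ci, 2))
```

With n = 22, even a moderate correlation of 0.5 has an interval running from about 0.10 to 0.76, so individual coefficients in the matrix are best read as rough indications.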
Analysis generated using R with tidyverse, ggplot2, and corrplot packages.