Reflection

At the start of this analysis, I didn’t know what to expect from Claude AI with regard to the variety of data reports it could produce or how it would choose to analyze the data. I was genuinely impressed by its ability to generate the R Markdown file and to approach the data differently each time it was asked.

This was an effective exercise in using Claude to analyze data. I didn’t realize it was so good at writing the script, and I will keep using it moving forward.

1. Introduction

This report analyzes a grocery shopping survey dataset collected from classmates. The dataset contains 22 responses and 15 variables: a respondent ID, 11 survey items covering customer satisfaction, shopping behaviors, and preferences related to a grocery store experience, and 3 demographic fields.

Survey Scale Reference (unless otherwise noted):

Score	Meaning
1	Strongly Agree / Very Satisfied
2	Agree / Satisfied
3	Neutral
4	Disagree / Dissatisfied
5	Strongly Disagree / Very Dissatisfied
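
Because lower scores mean stronger agreement on this scale, it can help to store responses as an ordered factor so that tables and plots follow the scale order rather than sorting labels alphabetically. The following is a minimal sketch, not part of the original script: the helper name `likert_factor` is hypothetical, and the labels assume the agreement wording from the table above.

```r
# Hypothetical helper (not in the original script): map 1-5 scores to
# ordered agreement labels so that level order follows the survey scale.
likert_levels <- c("Strongly Agree", "Agree", "Neutral",
                   "Disagree", "Strongly Disagree")

likert_factor <- function(score) {
  # Scores outside 1-5 become NA instead of being silently relabelled
  factor(likert_levels[match(score, 1:5)],
         levels = likert_levels, ordered = TRUE)
}

# Example: the first ten CS_helpful values printed by str(df) in Section 3.2
cs_first10 <- c(2, 1, 2, 3, 2, 1, 2, 1, 1, 1)
table(likert_factor(cs_first10))
```

On the full data, a call such as `likert_factor(df$CS_helpful)` would be the equivalent; unused levels stay in the factor, so empty categories still appear in tables.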

2. Setup & Data Loading

# Load required libraries
library(tidyverse)   # attaches ggplot2 and dplyr, so they need no separate call
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(scales)      # percent_format()
library(corrplot)    # corrplot()

# Load the dataset
df <- read.csv("customer_segmentation.csv", stringsAsFactors = FALSE)

# Trim whitespace from column names
colnames(df) <- trimws(colnames(df))

# Preview the data
head(df)

3. Data Overview

3.1 Dataset Dimensions

cat("Number of rows (respondents):", nrow(df), "\n")

## Number of rows (respondents): 22

cat("Number of columns (variables):", ncol(df), "\n")

## Number of columns (variables): 15

3.2 Variable Types

str(df)

## 'data.frame':    22 obs. of  15 variables:
##  $ ID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ CS_helpful    : int  2 1 2 3 2 1 2 1 1 1 ...
##  $ Recommend     : int  2 2 1 3 1 1 1 1 1 1 ...
##  $ Come_again    : int  2 1 1 2 3 3 1 1 1 1 ...
##  $ All_Products  : int  2 1 1 4 5 2 2 2 2 1 ...
##  $ Profesionalism: int  2 1 1 1 2 1 2 1 2 1 ...
##  $ Limitation    : int  2 1 2 2 1 1 1 2 1 1 ...
##  $ Online_grocery: int  2 2 3 3 2 1 2 1 2 3 ...
##  $ delivery      : int  3 3 3 3 3 2 2 1 1 2 ...
##  $ Pick_up       : int  4 3 2 2 1 1 2 2 3 2 ...
##  $ Find_items    : int  1 1 1 2 2 1 1 2 1 1 ...
##  $ other_shops   : int  2 2 3 2 3 4 1 4 1 1 ...
##  $ Gender        : int  1 1 1 1 2 1 1 1 2 2 ...
##  $ Age           : int  2 2 2 3 4 2 2 2 2 2 ...
##  $ Education     : int  2 2 2 5 2 5 3 2 1 2 ...

3.3 Summary Statistics

summary(df[, -1]) # Exclude ID column

##    CS_helpful      Recommend       Come_again     All_Products  
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.250  
##  Median :1.000   Median :1.000   Median :1.000   Median :2.000  
##  Mean   :1.591   Mean   :1.318   Mean   :1.455   Mean   :2.091  
##  3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :3.000   Max.   :3.000   Max.   :3.000   Max.   :5.000  
##  Profesionalism    Limitation  Online_grocery     delivery        Pick_up     
##  Min.   :1.000   Min.   :1.0   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.0   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :1.000   Median :1.0   Median :2.000   Median :3.000   Median :2.000  
##  Mean   :1.409   Mean   :1.5   Mean   :2.273   Mean   :2.409   Mean   :2.455  
##  3rd Qu.:2.000   3rd Qu.:2.0   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :3.000   Max.   :4.0   Max.   :3.000   Max.   :3.000   Max.   :5.000  
##    Find_items     other_shops        Gender           Age       
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :2.000  
##  1st Qu.:1.000   1st Qu.:1.250   1st Qu.:1.000   1st Qu.:2.000  
##  Median :1.000   Median :2.000   Median :1.000   Median :2.000  
##  Mean   :1.455   Mean   :2.591   Mean   :1.273   Mean   :2.455  
##  3rd Qu.:2.000   3rd Qu.:3.750   3rd Qu.:1.750   3rd Qu.:3.000  
##  Max.   :3.000   Max.   :5.000   Max.   :2.000   Max.   :4.000  
##    Education    
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :2.500  
##  Mean   :3.182  
##  3rd Qu.:5.000  
##  Max.   :5.000

3.4 Missing Values Check

missing_counts <- colSums(is.na(df))
missing_df <- data.frame(
  Variable = names(missing_counts),
  Missing = missing_counts
)

kable(missing_df, row.names = FALSE, caption = "Missing Values per Variable") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)

Missing Values per Variable
Variable	Missing
ID	0
CS_helpful	0
Recommend	0
Come_again	0
All_Products	0
Profesionalism	0
Limitation	0
Online_grocery	0
delivery	0
Pick_up	0
Find_items	0
other_shops	0
Gender	0
Age	0
Education	0
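
A zero count of missing values does not rule out invalid entries, such as a mistyped 7 on a 1-5 item. A range check is a natural companion to the missing-value check. The sketch below is illustrative only: `count_out_of_range` is a hypothetical helper, and `toy` is a made-up data frame rather than the survey data.

```r
# Hypothetical helper: count values per column that fall outside [lo, hi].
count_out_of_range <- function(data, cols, lo = 1, hi = 5) {
  vapply(data[cols],
         function(x) sum(x < lo | x > hi, na.rm = TRUE),
         integer(1))
}

# Toy data frame for illustration only (not the survey data);
# the 7 in Recommend is a deliberate out-of-range entry.
toy <- data.frame(CS_helpful = c(2, 1, 3), Recommend = c(1, 7, 2))
out_of_range <- count_out_of_range(toy, c("CS_helpful", "Recommend"))
out_of_range
```

For the survey itself, the same helper could be pointed at the eleven survey columns of `df`; the summary statistics in Section 3.3 already show that no column exceeds 5.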

4. Demographics

4.1 Gender Distribution

# Recode Gender: 1 = Male, 2 = Female
df$Gender_Label <- ifelse(df$Gender == 1, "Male", "Female")

gender_counts <- df %>%
  count(Gender_Label) %>%
  mutate(Percentage = round(n / sum(n) * 100, 1))

kable(gender_counts, col.names = c("Gender", "Count", "Percentage (%)"),
      caption = "Gender Distribution") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Gender Distribution
Gender	Count	Percentage (%)
Female	6	27.3
Male	16	72.7

ggplot(gender_counts, aes(x = Gender_Label, y = n, fill = Gender_Label)) +
  geom_bar(stat = "identity", width = 0.5, color = "white") +
  geom_text(aes(label = paste0(n, " (", Percentage, "%)")), vjust = -0.5, size = 4) +
  scale_fill_manual(values = c("Male" = "#2E86AB", "Female" = "#E84855")) +
  labs(title = "Gender Distribution of Survey Respondents",
       x = "Gender", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

4.2 Age Distribution

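# Recode Age: 1 = Under 18, 2 = 18-25, 3 = 26-35, 4 = 36-50, 5 = 51+
age_labels <- c("1" = "Under 18", "2" = "18–25", "3" = "26–35", "4" = "36–50", "5" = "51+")
# Store as a factor so tables and plots follow age order, not alphabetical order
df$Age_Label <- factor(recode(as.character(df$Age), !!!age_labels),
                       levels = unname(age_labels))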

age_counts <- df %>%
  count(Age_Label) %>%
  arrange(match(Age_Label, age_labels))

ggplot(age_counts, aes(x = Age_Label, y = n, fill = Age_Label)) +
  geom_bar(stat = "identity", color = "white") +
  geom_text(aes(label = n), vjust = -0.5, size = 4) +
  scale_fill_brewer(palette = "Blues") +
  labs(title = "Age Distribution of Survey Respondents",
       x = "Age Group", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

4.3 Education Level

# Recode Education: 1=High School, 2=Some College, 3=Associate's, 4=Bachelor's, 5=Graduate+
edu_labels <- c("1" = "High School", "2" = "Some College",
                "3" = "Associate's", "4" = "Bachelor's", "5" = "Graduate+")
df$Education_Label <- recode(as.character(df$Education), !!!edu_labels)

edu_counts <- df %>%
  count(Education_Label) %>%
  mutate(Percentage = round(n / sum(n) * 100, 1))

ggplot(edu_counts, aes(x = reorder(Education_Label, -n), y = n, fill = Education_Label)) +
  geom_bar(stat = "identity", color = "white") +
  geom_text(aes(label = paste0(n, "\n(", Percentage, "%)")), vjust = -0.3, size = 3.5) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Education Level of Survey Respondents",
       x = "Education Level", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 15, hjust = 1))

5. Customer Satisfaction & Attitudes

5.1 Customer Service Helpfulness

cs_counts <- df %>%
  count(CS_helpful) %>%
  mutate(
    Label = recode(as.character(CS_helpful),
                   "1" = "Strongly Agree", "2" = "Agree",
                   "3" = "Neutral", "4" = "Disagree", "5" = "Strongly Disagree"),
    Percentage = round(n / sum(n) * 100, 1)
  )

# reorder() keeps the bars in scale order instead of alphabetical label order
ggplot(cs_counts, aes(x = reorder(Label, CS_helpful), y = n, fill = factor(CS_helpful))) +
  geom_bar(stat = "identity", color = "white") +
  geom_text(aes(label = paste0(n, " (", Percentage, "%)")), vjust = -0.4, size = 3.5) +
  scale_fill_brewer(palette = "RdYlGn", direction = -1) +
  labs(title = "Customer Service is Helpful",
       x = "Response", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

5.2 Likelihood to Recommend

rec_counts <- df %>%
  count(Recommend) %>%
  mutate(
    Label = recode(as.character(Recommend),
                   "1" = "Strongly Agree", "2" = "Agree",
                   "3" = "Neutral", "4" = "Disagree", "5" = "Strongly Disagree"),
    Percentage = round(n / sum(n) * 100, 1)
  )

# reorder() keeps the bars in scale order instead of alphabetical label order
ggplot(rec_counts, aes(x = reorder(Label, Recommend), y = n, fill = factor(Recommend))) +
  geom_bar(stat = "identity", color = "white") +
  geom_text(aes(label = paste0(n, " (", Percentage, "%)")), vjust = -0.4, size = 3.5) +
  scale_fill_brewer(palette = "RdYlGn", direction = -1) +
  labs(title = "Likelihood to Recommend the Store",
       x = "Response", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

5.3 Likelihood to Return (Come Again)

ca_counts <- df %>%
  count(Come_again) %>%
  mutate(
    Label = recode(as.character(Come_again),
                   "1" = "Strongly Agree", "2" = "Agree",
                   "3" = "Neutral", "4" = "Disagree", "5" = "Strongly Disagree"),
    Percentage = round(n / sum(n) * 100, 1)
  )

# reorder() keeps the bars in scale order instead of alphabetical label order
ggplot(ca_counts, aes(x = reorder(Label, Come_again), y = n, fill = factor(Come_again))) +
  geom_bar(stat = "identity", color = "white") +
  geom_text(aes(label = paste0(n, " (", Percentage, "%)")), vjust = -0.4, size = 3.5) +
  scale_fill_brewer(palette = "RdYlGn", direction = -1) +
  labs(title = "Likelihood to Come Again",
       x = "Response", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

5.4 Satisfaction Scores — Side-by-Side Comparison

# Compute mean scores for key satisfaction variables
sat_vars <- c("CS_helpful", "Recommend", "Come_again", "Profesionalism", "Find_items")
sat_labels <- c("Customer Service", "Recommend", "Come Again", "Professionalism", "Find Items")

sat_means <- colMeans(df[, sat_vars], na.rm = TRUE)
sat_df <- data.frame(
  Variable = sat_labels,
  Mean_Score = round(sat_means, 2)
)

ggplot(sat_df, aes(x = reorder(Variable, Mean_Score), y = Mean_Score, fill = Mean_Score)) +
  geom_bar(stat = "identity", color = "white", width = 0.6) +
  geom_text(aes(label = Mean_Score), hjust = -0.2, size = 4) +
  scale_fill_gradient(low = "#2ECC71", high = "#E74C3C") +
  coord_flip() +
  labs(title = "Average Satisfaction Scores by Category",
       subtitle = "Scale: 1 = Strongly Agree / Very Satisfied → 5 = Strongly Disagree / Very Dissatisfied",
       x = "", y = "Mean Score") +
  theme_minimal() +
  theme(legend.position = "none") +
  ylim(0, 3.5)
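
Means of ordinal scores can hide how the responses are distributed. One common complement is the share of respondents who chose 1 or 2 (strongly agree or agree). The following is a minimal sketch, assuming the 1-5 coding from the scale reference: `agree_share` is a hypothetical helper, and the example vector holds only the first ten CS_helpful values printed by `str(df)`, not the full column.

```r
# Hypothetical helper: proportion of responses at or below a cutoff score.
agree_share <- function(score, cutoff = 2) {
  mean(score <= cutoff, na.rm = TRUE)
}

# First ten CS_helpful values as printed by str(df) in Section 3.2
cs_first10 <- c(2, 1, 2, 3, 2, 1, 2, 1, 1, 1)
agree_share(cs_first10)  # 9 of these 10 values are 1 or 2
```

On the full data the call would be `agree_share(df$CS_helpful)`, and `sapply(df[, sat_vars], agree_share)` would cover all five satisfaction items at once.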

6. Shopping Preferences & Behavior

6.1 Online Grocery Shopping Interest

og_counts <- df %>%
  count(Online_grocery) %>%
  mutate(
    Label = recode(as.character(Online_grocery),
                   "1" = "Strongly Agree", "2" = "Agree",
                   "3" = "Neutral", "4" = "Disagree", "5" = "Strongly Disagree"),
    Percentage = round(n / sum(n) * 100, 1)
  )

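# reorder() keeps the bars in scale order instead of alphabetical label order
ggplot(og_counts, aes(x = reorder(Label, Online_grocery), y = n, fill = factor(Online_grocery))) +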
  geom_bar(stat = "identity", color = "white") +
  geom_text(aes(label = paste0(n, " (", Percentage, "%)")), vjust = -0.4, size = 3.5) +
  scale_fill_brewer(palette = "PuBuGn") +
  labs(title = "Interest in Online Grocery Shopping",
       x = "Response", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

6.2 Delivery vs. Pick-Up Preference

pref_df <- data.frame(
  Category = c(rep("Delivery", nrow(df)), rep("Pick-Up", nrow(df))),
  Score = c(df$delivery, df$Pick_up)
)

ggplot(pref_df, aes(x = factor(Score), fill = Category)) +
  geom_bar(position = "dodge", color = "white") +
  scale_fill_manual(values = c("Delivery" = "#3498DB", "Pick-Up" = "#E67E22")) +
  scale_x_discrete(labels = c("1" = "Strongly\nAgree", "2" = "Agree",
                               "3" = "Neutral", "4" = "Disagree", "5" = "Strongly\nDisagree")) +
  labs(title = "Delivery vs. Pick-Up Preference",
       x = "Response", y = "Count", fill = "Shopping Method") +
  theme_minimal()

6.3 Shopping at Other Stores

os_counts <- df %>%
  count(other_shops) %>%
  mutate(
    Label = recode(as.character(other_shops),
                   "1" = "Strongly Agree", "2" = "Agree",
                   "3" = "Neutral", "4" = "Disagree", "5" = "Strongly Disagree"),
    Percentage = round(n / sum(n) * 100, 1)
  )

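# reorder() keeps the bars in scale order instead of alphabetical label order
ggplot(os_counts, aes(x = reorder(Label, other_shops), y = n, fill = factor(other_shops))) +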
  geom_bar(stat = "identity", color = "white") +
  geom_text(aes(label = paste0(n, " (", Percentage, "%)")), vjust = -0.4, size = 3.5) +
  scale_fill_brewer(palette = "Oranges") +
  labs(title = "Shops at Other Grocery Stores as Well",
       x = "Response", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

7. Correlation Analysis

7.1 Correlation Matrix

# Select numeric survey variables (exclude ID, Gender, Age, Education)
survey_vars <- df[, c("CS_helpful", "Recommend", "Come_again", "All_Products",
                      "Profesionalism", "Limitation", "Online_grocery",
                      "delivery", "Pick_up", "Find_items", "other_shops")]

cor_matrix <- cor(survey_vars, use = "complete.obs")

corrplot(cor_matrix,
         method = "color",
         type = "upper",
         tl.col = "black",
         tl.srt = 45,
         tl.cex = 0.8,
         addCoef.col = "black",
         number.cex = 0.65,
         col = colorRampPalette(c("#E74C3C", "white", "#2980B9"))(200),
         title = "Correlation Matrix — Survey Variables",
         mar = c(0, 0, 2, 0))

7.2 Key Correlations: Recommend vs. Other Variables

cor_with_recommend <- cor_matrix[, "Recommend"]  # reuse the matrix computed in 7.1
cor_df <- data.frame(
  Variable = names(cor_with_recommend),
  Correlation = round(cor_with_recommend, 3)
) %>%
  filter(Variable != "Recommend") %>%
  arrange(desc(abs(Correlation)))

kable(cor_df, row.names = FALSE,
      caption = "Correlation of Variables with 'Likelihood to Recommend'") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Correlation of Variables with ‘Likelihood to Recommend’
Variable	Correlation
CS_helpful	0.488
delivery	0.415
Profesionalism	0.391
Come_again	0.381
Online_grocery	0.297
Pick_up	-0.082
other_shops	-0.060
Limitation	0.046
All_Products	0.025
Find_items	-0.020
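
With only 22 respondents, a correlation needs to be fairly large before it is distinguishable from zero. The sketch below works out the smallest |r| that reaches two-sided p < 0.05 at this sample size and the approximate p-values for the five largest correlations in the table above. It uses the standard t-test for a Pearson correlation, which treats the 1-5 scores as numeric; for ordinal items this is an approximation, so the results are indicative rather than exact.

```r
n   <- 22        # respondents, from Section 3.1
dof <- n - 2     # degrees of freedom for a Pearson correlation test

# Correlations with Recommend, copied from the table above
r_obs <- c(CS_helpful = 0.488, delivery = 0.415, Profesionalism = 0.391,
           Come_again = 0.381, Online_grocery = 0.297)

# Smallest |r| that reaches two-sided p < 0.05 at this sample size
t_crit <- qt(0.975, dof)
r_crit <- t_crit / sqrt(t_crit^2 + dof)

# Per-correlation t statistics and two-sided p-values
t_stat <- r_obs * sqrt(dof / (1 - r_obs^2))
p_val  <- 2 * pt(-abs(t_stat), dof)

round(r_crit, 3)
round(p_val, 3)
```

At n = 22 the threshold is roughly 0.42, so of the values listed only the CS_helpful correlation clears it, and the delivery correlation falls just short. Spearman's rank correlation, available through `cor(..., method = "spearman")`, is a common alternative for ordinal data.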

8. Cross-Tabulations

8.1 Recommend by Gender

crosstab_gender <- df %>%
  count(Gender_Label, Recommend) %>%
  mutate(
    Recommend_Label = recode(as.character(Recommend),
                             "1" = "Strongly Agree", "2" = "Agree",
                             "3" = "Neutral", "4" = "Disagree", "5" = "Strongly Disagree")
  )

# reorder() keeps the bars in scale order instead of alphabetical label order
ggplot(crosstab_gender, aes(x = reorder(Recommend_Label, Recommend), y = n, fill = Gender_Label)) +
  geom_bar(stat = "identity", position = "dodge", color = "white") +
  scale_fill_manual(values = c("Male" = "#2E86AB", "Female" = "#E84855")) +
  labs(title = "Likelihood to Recommend by Gender",
       x = "Response", y = "Count", fill = "Gender") +
  theme_minimal()

8.2 Come Again by Age Group

crosstab_age <- df %>%
  count(Age_Label, Come_again) %>%
  mutate(
    Come_again_Label = recode(as.character(Come_again),
                              "1" = "Strongly Agree", "2" = "Agree",
                              "3" = "Neutral", "4" = "Disagree", "5" = "Strongly Disagree")
  )

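# Order the fill by score so the green-to-red palette follows the agreement scale
ggplot(crosstab_age, aes(x = Age_Label, y = n, fill = reorder(Come_again_Label, Come_again))) +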
  geom_bar(stat = "identity", position = "fill", color = "white") +
  scale_fill_brewer(palette = "RdYlGn", direction = -1) +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Likelihood to Come Again by Age Group (Proportional)",
       x = "Age Group", y = "Proportion", fill = "Response") +
  theme_minimal()

9. Key Findings & Summary

findings <- data.frame(
  Finding = c(
    "Dominant age group",
    "Most common gender",
    "Most common education level",
    "Customer service helpfulness",
    "Likelihood to recommend",
    "Likelihood to return",
    "Online grocery preference",
    "Preferred fulfillment method"
  ),
  Result = c(
    paste0("18–25 (", round(mean(df$Age == 2) * 100, 1), "% of respondents)"),
    paste0("Male (", round(mean(df$Gender == 1) * 100, 1), "%)"),
    "Graduate+ and Some College (tied most common)",
    paste0("Mean score: ", round(mean(df$CS_helpful), 2), " — largely positive"),
    paste0("Mean score: ", round(mean(df$Recommend), 2), " — largely positive"),
    paste0("Mean score: ", round(mean(df$Come_again), 2), " — largely positive"),
    paste0("Mean score: ", round(mean(df$Online_grocery), 2), " — mixed/neutral interest"),
    paste0("Delivery mean: ", round(mean(df$delivery), 2),
           " | Pick-up mean: ", round(mean(df$Pick_up), 2))
  )
)

kable(findings, col.names = c("Finding", "Result"),
      caption = "Summary of Key Findings") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE)

Summary of Key Findings
Finding	Result
Dominant age group	18–25 (68.2% of respondents)
Most common gender	Male (72.7%)
Most common education level	Graduate+ and Some College (tied most common)
Customer service helpfulness	Mean score: 1.59 — largely positive
Likelihood to recommend	Mean score: 1.32 — largely positive
Likelihood to return	Mean score: 1.45 — largely positive
Online grocery preference	Mean score: 2.27 — mixed/neutral interest
Preferred fulfillment method	Delivery mean: 2.41 | Pick-up mean: 2.45

10. Conclusion

Based on analysis of this 22-respondent classmate grocery survey dataset:

Customer satisfaction is generally positive: the majority of respondents agreed that customer service is helpful, would recommend the store, and would return.
Young adults (18–25) make up the largest portion of respondents, and they tend to express positive sentiments about the store.
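Online grocery and delivery preferences are mixed: responses range from strongly agree to neutral, with no respondent scoring above 3 on either item (Section 3.3), while pick-up responses span the full 1 to 5 scale. This mixed interest suggests an opportunity for the store to improve its digital offerings.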
Correlation analysis shows that Come Again and Recommend are moderately positively correlated (r = 0.381), meaning customers who intend to return also tend to recommend the store to others. The strongest correlate of Recommend is CS_helpful (r = 0.488).
The small sample size (n = 22) limits generalizability, but the data provides a useful snapshot of this particular group’s grocery shopping attitudes.
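
To make the sample-size caveat concrete, the sketch below computes exact binomial confidence intervals for two of the reported proportions. It uses only counts that appear in the report (16 of 22 male; 15 of 22 aged 18–25, which matches the 68.2% in the key findings) and assumes the respondents can be treated as a simple random sample, which a classmate survey may not satisfy.

```r
# Counts taken from the report: 16 of 22 male, 15 of 22 aged 18-25
n <- 22

# Exact (Clopper-Pearson) 95% confidence intervals from base R
ci_male  <- binom.test(16, n)$conf.int
ci_young <- binom.test(15, n)$conf.int

round(rbind(male = ci_male, age_18_25 = ci_young), 3)
```

Both intervals are wide at this sample size, which is why the percentages are best read as a description of this group rather than as estimates for grocery shoppers in general.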

Report generated using R Markdown. Dataset: customer_segmentation.csv.

Customer Segmentation Analysis

Grocery Shopping Survey Dataset

Classmate Survey Data

2026-03-24