Reflection

At the start of this analysis, I didn’t know what to expect from Claude AI: how varied its data reports would be, or how it would choose to analyze the data. I was genuinely impressed by its ability to generate the R Markdown file and to process the data differently each time it was asked.

This was an effective exercise in using Claude to analyze data. I didn’t realize how good it was at writing the script, and I will keep using it going forward.


1. Introduction

This report analyzes a grocery shopping survey dataset collected from classmates. The dataset contains 22 responses and 15 variables covering customer satisfaction, shopping behaviors, demographics, and preferences related to a grocery store experience.

Survey Scale Reference (unless otherwise noted):

Score  Meaning
1      Strongly Agree / Very Satisfied
2      Agree / Satisfied
3      Neutral
4      Disagree / Dissatisfied
5      Strongly Disagree / Very Dissatisfied
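
The scale above can be captured once as an ordered factor, so that tables and plots list responses in scale order rather than alphabetically. This is a minimal sketch, assuming the five agreement labels used later in this report; `likert_labels` and `label_likert()` are hypothetical names and are not part of the original script.

```r
# Hypothetical helper (not in the original script): map 1-5 scores
# to agreement labels, keeping the scale order.
likert_labels <- c("Strongly Agree", "Agree", "Neutral",
                   "Disagree", "Strongly Disagree")

label_likert <- function(score) {
  # factor() matches each score against levels 1:5 and attaches the labels
  factor(score, levels = 1:5, labels = likert_labels, ordered = TRUE)
}

# First five CS_helpful responses, as listed in the str() output in Section 3.2
example <- label_likert(c(2, 1, 2, 3, 2))
table(example)
```

Because the result is a factor with all five levels, `table()` also reports the categories nobody chose, which a character label would silently drop.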

2. Setup & Data Loading

# Load required libraries
library(tidyverse)   # attaches ggplot2 and dplyr, so they are not loaded separately
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(scales)      # percent_format()
library(corrplot)    # corrplot()

# Load the dataset
df <- read.csv("customer_segmentation.csv", stringsAsFactors = FALSE)

# Trim whitespace from column names
colnames(df) <- trimws(colnames(df))

# Preview the data
head(df)

3. Data Overview

3.1 Dataset Dimensions

cat("Number of rows (respondents):", nrow(df), "\n")
## Number of rows (respondents): 22
cat("Number of columns (variables):", ncol(df), "\n")
## Number of columns (variables): 15

3.2 Variable Types

str(df)
## 'data.frame':    22 obs. of  15 variables:
##  $ ID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ CS_helpful    : int  2 1 2 3 2 1 2 1 1 1 ...
##  $ Recommend     : int  2 2 1 3 1 1 1 1 1 1 ...
##  $ Come_again    : int  2 1 1 2 3 3 1 1 1 1 ...
##  $ All_Products  : int  2 1 1 4 5 2 2 2 2 1 ...
##  $ Profesionalism: int  2 1 1 1 2 1 2 1 2 1 ...
##  $ Limitation    : int  2 1 2 2 1 1 1 2 1 1 ...
##  $ Online_grocery: int  2 2 3 3 2 1 2 1 2 3 ...
##  $ delivery      : int  3 3 3 3 3 2 2 1 1 2 ...
##  $ Pick_up       : int  4 3 2 2 1 1 2 2 3 2 ...
##  $ Find_items    : int  1 1 1 2 2 1 1 2 1 1 ...
##  $ other_shops   : int  2 2 3 2 3 4 1 4 1 1 ...
##  $ Gender        : int  1 1 1 1 2 1 1 1 2 2 ...
##  $ Age           : int  2 2 2 3 4 2 2 2 2 2 ...
##  $ Education     : int  2 2 2 5 2 5 3 2 1 2 ...

3.3 Summary Statistics

summary(df[, -1]) # Exclude ID column
##    CS_helpful      Recommend       Come_again     All_Products  
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.250  
##  Median :1.000   Median :1.000   Median :1.000   Median :2.000  
##  Mean   :1.591   Mean   :1.318   Mean   :1.455   Mean   :2.091  
##  3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :3.000   Max.   :3.000   Max.   :3.000   Max.   :5.000  
##  Profesionalism    Limitation  Online_grocery     delivery        Pick_up     
##  Min.   :1.000   Min.   :1.0   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.0   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :1.000   Median :1.0   Median :2.000   Median :3.000   Median :2.000  
##  Mean   :1.409   Mean   :1.5   Mean   :2.273   Mean   :2.409   Mean   :2.455  
##  3rd Qu.:2.000   3rd Qu.:2.0   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :3.000   Max.   :4.0   Max.   :3.000   Max.   :3.000   Max.   :5.000  
##    Find_items     other_shops        Gender           Age       
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :2.000  
##  1st Qu.:1.000   1st Qu.:1.250   1st Qu.:1.000   1st Qu.:2.000  
##  Median :1.000   Median :2.000   Median :1.000   Median :2.000  
##  Mean   :1.455   Mean   :2.591   Mean   :1.273   Mean   :2.455  
##  3rd Qu.:2.000   3rd Qu.:3.750   3rd Qu.:1.750   3rd Qu.:3.000  
##  Max.   :3.000   Max.   :5.000   Max.   :2.000   Max.   :4.000  
##    Education    
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :2.500  
##  Mean   :3.182  
##  3rd Qu.:5.000  
##  Max.   :5.000

3.4 Missing Values Check

missing_counts <- colSums(is.na(df))
missing_df <- data.frame(
  Variable = names(missing_counts),
  Missing = missing_counts
)

kable(missing_df, row.names = FALSE, caption = "Missing Values per Variable") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)

Missing Values per Variable

Variable        Missing
ID              0
CS_helpful      0
Recommend       0
Come_again      0
All_Products    0
Profesionalism  0
Limitation      0
Online_grocery  0
delivery        0
Pick_up         0
Find_items      0
other_shops     0
Gender          0
Age             0
Education       0
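
A zero missing-value count does not rule out codes that fall outside the 1–5 scale. The sketch below shows one way such a range check could look. It runs on a small stand-in data frame built from the first five rows printed by str() above, not on the full dataset, and `out_of_range()` is a hypothetical helper.

```r
# Stand-in for df: the first five rows of three survey columns,
# copied from the str() output above
toy <- data.frame(
  CS_helpful = c(2, 1, 2, 3, 2),
  Recommend  = c(2, 2, 1, 3, 1),
  Pick_up    = c(4, 3, 2, 2, 1)
)

# Hypothetical helper: count values outside the lo-hi scale per column
# (an NA also counts as invalid)
out_of_range <- function(data, lo = 1, hi = 5) {
  vapply(data, function(x) sum(is.na(x) | x < lo | x > hi), integer(1))
}

out_of_range(toy)
```

On the real data the same call would be applied to the eleven survey columns only, since Gender, Age, and Education use their own codings.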

4. Demographics

4.1 Gender Distribution

# Recode Gender: 1 = Male, 2 = Female
df$Gender_Label <- ifelse(df$Gender == 1, "Male", "Female")

gender_counts <- df %>%
  count(Gender_Label) %>%
  mutate(Percentage = round(n / sum(n) * 100, 1))

kable(gender_counts, col.names = c("Gender", "Count", "Percentage (%)"),
      caption = "Gender Distribution") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Gender Distribution

Gender  Count  Percentage (%)
Female  6      27.3
Male    16     72.7

ggplot(gender_counts, aes(x = Gender_Label, y = n, fill = Gender_Label)) +
  geom_bar(stat = "identity", width = 0.5, color = "white") +
  geom_text(aes(label = paste0(n, " (", Percentage, "%)")), vjust = -0.5, size = 4) +
  scale_fill_manual(values = c("Male" = "#2E86AB", "Female" = "#E84855")) +
  labs(title = "Gender Distribution of Survey Respondents",
       x = "Gender", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

4.2 Age Distribution

# Recode Age: 1 = Under 18, 2 = 18-25, 3 = 26-35, 4 = 36-50, 5 = 51+
age_labels <- c("1" = "Under 18", "2" = "18–25", "3" = "26–35", "4" = "36–50", "5" = "51+")
# Store the labels as a factor so tables and plots follow age order;
# sorting the rows alone does not change the order ggplot uses on the axis
df$Age_Label <- factor(recode(as.character(df$Age), !!!age_labels),
                       levels = unname(age_labels))

age_counts <- df %>%
  count(Age_Label)

ggplot(age_counts, aes(x = Age_Label, y = n, fill = Age_Label)) +
  geom_bar(stat = "identity", color = "white") +
  geom_text(aes(label = n), vjust = -0.5, size = 4) +
  scale_fill_brewer(palette = "Blues") +
  labs(title = "Age Distribution of Survey Respondents",
       x = "Age Group", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

4.3 Education Level

# Recode Education: 1=High School, 2=Some College, 3=Associate's, 4=Bachelor's, 5=Graduate+
edu_labels <- c("1" = "High School", "2" = "Some College",
                "3" = "Associate's", "4" = "Bachelor's", "5" = "Graduate+")
df$Education_Label <- recode(as.character(df$Education), !!!edu_labels)

edu_counts <- df %>%
  count(Education_Label) %>%
  mutate(Percentage = round(n / sum(n) * 100, 1))

ggplot(edu_counts, aes(x = reorder(Education_Label, -n), y = n, fill = Education_Label)) +
  geom_bar(stat = "identity", color = "white") +
  geom_text(aes(label = paste0(n, "\n(", Percentage, "%)")), vjust = -0.3, size = 3.5) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Education Level of Survey Respondents",
       x = "Education Level", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 15, hjust = 1))


5. Customer Satisfaction & Attitudes

5.1 Customer Service Helpfulness

cs_counts <- df %>%
  count(CS_helpful) %>%
  mutate(
    Label = recode(as.character(CS_helpful),
                   "1" = "Strongly Agree", "2" = "Agree",
                   "3" = "Neutral", "4" = "Disagree", "5" = "Strongly Disagree"),
    Percentage = round(n / sum(n) * 100, 1)
  )

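# Order bars by scale value; a plain character label would sort alphabetically
ggplot(cs_counts, aes(x = reorder(Label, CS_helpful), y = n, fill = factor(CS_helpful))) +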
  geom_bar(stat = "identity", color = "white") +
  geom_text(aes(label = paste0(n, " (", Percentage, "%)")), vjust = -0.4, size = 3.5) +
  scale_fill_brewer(palette = "RdYlGn", direction = -1) +
  labs(title = "Customer Service is Helpful",
       x = "Response", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

5.2 Likelihood to Recommend

rec_counts <- df %>%
  count(Recommend) %>%
  mutate(
    Label = recode(as.character(Recommend),
                   "1" = "Strongly Agree", "2" = "Agree",
                   "3" = "Neutral", "4" = "Disagree", "5" = "Strongly Disagree"),
    Percentage = round(n / sum(n) * 100, 1)
  )

# Order bars by scale value; a plain character label would sort alphabetically
ggplot(rec_counts, aes(x = reorder(Label, Recommend), y = n, fill = factor(Recommend))) +
  geom_bar(stat = "identity", color = "white") +
  geom_text(aes(label = paste0(n, " (", Percentage, "%)")), vjust = -0.4, size = 3.5) +
  scale_fill_brewer(palette = "RdYlGn", direction = -1) +
  labs(title = "Likelihood to Recommend the Store",
       x = "Response", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

5.3 Likelihood to Return (Come Again)

ca_counts <- df %>%
  count(Come_again) %>%
  mutate(
    Label = recode(as.character(Come_again),
                   "1" = "Strongly Agree", "2" = "Agree",
                   "3" = "Neutral", "4" = "Disagree", "5" = "Strongly Disagree"),
    Percentage = round(n / sum(n) * 100, 1)
  )

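# Order bars by scale value; a plain character label would sort alphabetically
ggplot(ca_counts, aes(x = reorder(Label, Come_again), y = n, fill = factor(Come_again))) +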
  geom_bar(stat = "identity", color = "white") +
  geom_text(aes(label = paste0(n, " (", Percentage, "%)")), vjust = -0.4, size = 3.5) +
  scale_fill_brewer(palette = "RdYlGn", direction = -1) +
  labs(title = "Likelihood to Come Again",
       x = "Response", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

5.4 Satisfaction Scores — Side-by-Side Comparison

# Compute mean scores for key satisfaction variables
sat_vars <- c("CS_helpful", "Recommend", "Come_again", "Profesionalism", "Find_items")
sat_labels <- c("Customer Service", "Recommend", "Come Again", "Professionalism", "Find Items")

sat_means <- colMeans(df[, sat_vars], na.rm = TRUE)
sat_df <- data.frame(
  Variable = sat_labels,
  Mean_Score = round(sat_means, 2)
)

ggplot(sat_df, aes(x = reorder(Variable, Mean_Score), y = Mean_Score, fill = Mean_Score)) +
  geom_bar(stat = "identity", color = "white", width = 0.6) +
  geom_text(aes(label = Mean_Score), hjust = -0.2, size = 4) +
  scale_fill_gradient(low = "#2ECC71", high = "#E74C3C") +
  coord_flip() +
  labs(title = "Average Satisfaction Scores by Category",
       subtitle = "Scale: 1 = Strongly Agree / Very Satisfied → 5 = Strongly Disagree / Very Dissatisfied",
       x = "", y = "Mean Score") +
  theme_minimal() +
  theme(legend.position = "none") +
  ylim(0, 3.5)


6. Shopping Preferences & Behavior

6.1 Online Grocery Shopping Interest

og_counts <- df %>%
  count(Online_grocery) %>%
  mutate(
    Label = recode(as.character(Online_grocery),
                   "1" = "Strongly Agree", "2" = "Agree",
                   "3" = "Neutral", "4" = "Disagree", "5" = "Strongly Disagree"),
    Percentage = round(n / sum(n) * 100, 1)
  )

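# Order bars by scale value; a plain character label would sort alphabetically
ggplot(og_counts, aes(x = reorder(Label, Online_grocery), y = n, fill = factor(Online_grocery))) +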
  geom_bar(stat = "identity", color = "white") +
  geom_text(aes(label = paste0(n, " (", Percentage, "%)")), vjust = -0.4, size = 3.5) +
  scale_fill_brewer(palette = "PuBuGn") +
  labs(title = "Interest in Online Grocery Shopping",
       x = "Response", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

6.2 Delivery vs. Pick-Up Preference

pref_df <- data.frame(
  Category = c(rep("Delivery", nrow(df)), rep("Pick-Up", nrow(df))),
  Score = c(df$delivery, df$Pick_up)
)

ggplot(pref_df, aes(x = factor(Score), fill = Category)) +
  geom_bar(position = "dodge", color = "white") +
  scale_fill_manual(values = c("Delivery" = "#3498DB", "Pick-Up" = "#E67E22")) +
  scale_x_discrete(labels = c("1" = "Strongly\nAgree", "2" = "Agree",
                               "3" = "Neutral", "4" = "Disagree", "5" = "Strongly\nDisagree")) +
  labs(title = "Delivery vs. Pick-Up Preference",
       x = "Response", y = "Count", fill = "Shopping Method") +
  theme_minimal()

6.3 Shopping at Other Stores

os_counts <- df %>%
  count(other_shops) %>%
  mutate(
    Label = recode(as.character(other_shops),
                   "1" = "Strongly Agree", "2" = "Agree",
                   "3" = "Neutral", "4" = "Disagree", "5" = "Strongly Disagree"),
    Percentage = round(n / sum(n) * 100, 1)
  )

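# Order bars by scale value; a plain character label would sort alphabetically
ggplot(os_counts, aes(x = reorder(Label, other_shops), y = n, fill = factor(other_shops))) +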
  geom_bar(stat = "identity", color = "white") +
  geom_text(aes(label = paste0(n, " (", Percentage, "%)")), vjust = -0.4, size = 3.5) +
  scale_fill_brewer(palette = "Oranges") +
  labs(title = "Shops at Other Grocery Stores as Well",
       x = "Response", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")


7. Correlation Analysis

7.1 Correlation Matrix

# Select numeric survey variables (exclude ID, Gender, Age, Education)
survey_vars <- df[, c("CS_helpful", "Recommend", "Come_again", "All_Products",
                      "Profesionalism", "Limitation", "Online_grocery",
                      "delivery", "Pick_up", "Find_items", "other_shops")]

cor_matrix <- cor(survey_vars, use = "complete.obs")

corrplot(cor_matrix,
         method = "color",
         type = "upper",
         tl.col = "black",
         tl.srt = 45,
         tl.cex = 0.8,
         addCoef.col = "black",
         number.cex = 0.65,
         col = colorRampPalette(c("#E74C3C", "white", "#2980B9"))(200),
         title = "Correlation Matrix — Survey Variables",
         mar = c(0, 0, 2, 0))
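
The matrix above uses `cor()`'s default Pearson coefficient, which treats the 1–5 codes as interval data. Because the responses are ordinal, a rank-based coefficient is a common cross-check. The sketch below compares the two on the first ten responses listed in the str() output. It is an illustration on a partial excerpt, so its values are not the report's results.

```r
# First ten responses for two items, copied from the str() output in Section 3.2.
# This is only an excerpt; the full 22-row columns would be used in practice.
cs_helpful <- c(2, 1, 2, 3, 2, 1, 2, 1, 1, 1)
recommend  <- c(2, 2, 1, 3, 1, 1, 1, 1, 1, 1)

# Pearson works on the raw codes; Spearman works on their ranks
pearson  <- cor(cs_helpful, recommend, method = "pearson")
spearman <- cor(cs_helpful, recommend, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

For the full matrix, the equivalent change would be passing `method = "spearman"` to the existing `cor(survey_vars, ...)` call.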

7.2 Key Correlations: Recommend vs. Other Variables

cor_with_recommend <- cor(survey_vars, use = "complete.obs")[, "Recommend"]
cor_df <- data.frame(
  Variable = names(cor_with_recommend),
  Correlation = round(cor_with_recommend, 3)
) %>%
  filter(Variable != "Recommend") %>%
  arrange(desc(abs(Correlation)))

kable(cor_df, row.names = FALSE,
      caption = "Correlation of Variables with 'Likelihood to Recommend'") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Correlation of Variables with ‘Likelihood to Recommend’

Variable        Correlation
CS_helpful       0.488
delivery         0.415
Profesionalism   0.391
Come_again       0.381
Online_grocery   0.297
Pick_up         -0.082
other_shops     -0.060
Limitation       0.046
All_Products     0.025
Find_items      -0.020
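
With only 22 respondents, a moderate correlation can arise by chance. As a rough check, the four largest coefficients in the table above can be converted to t statistics with n − 2 degrees of freedom. This sketch uses only the reported r values and n, and it assumes the usual t-test for a Pearson correlation. That test is an approximation for ordinal survey data, and no correction for multiple comparisons is applied.

```r
# Correlations with Recommend, as reported in the table above
r <- c(CS_helpful = 0.488, delivery = 0.415,
       Profesionalism = 0.391, Come_again = 0.381)
n <- 22  # number of respondents

# t statistic for the null hypothesis of zero correlation (df = n - 2)
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(data.frame(r, t_stat, p_value), 3)
```

Under these assumptions, only the CS_helpful correlation falls below the conventional 0.05 threshold. The other three, including Come_again, do not, so they are best read as suggestive rather than established.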

8. Cross-Tabulations

8.1 Recommend by Gender

crosstab_gender <- df %>%
  count(Gender_Label, Recommend) %>%
  mutate(
    Recommend_Label = recode(as.character(Recommend),
                             "1" = "Strongly Agree", "2" = "Agree",
                             "3" = "Neutral", "4" = "Disagree", "5" = "Strongly Disagree")
  )

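# Order responses by scale value; a plain character label would sort alphabetically
ggplot(crosstab_gender, aes(x = reorder(Recommend_Label, Recommend), y = n, fill = Gender_Label)) +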
  geom_bar(stat = "identity", position = "dodge", color = "white") +
  scale_fill_manual(values = c("Male" = "#2E86AB", "Female" = "#E84855")) +
  labs(title = "Likelihood to Recommend by Gender",
       x = "Response", y = "Count", fill = "Gender") +
  theme_minimal()

8.2 Come Again by Age Group

crosstab_age <- df %>%
  count(Age_Label, Come_again) %>%
  mutate(
    Come_again_Label = recode(as.character(Come_again),
                              "1" = "Strongly Agree", "2" = "Agree",
                              "3" = "Neutral", "4" = "Disagree", "5" = "Strongly Disagree")
  )

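# Order the fill levels by scale value so the green-to-red palette runs from
# Strongly Agree to Strongly Disagree instead of following alphabetical order
ggplot(crosstab_age, aes(x = Age_Label, y = n, fill = reorder(Come_again_Label, Come_again))) +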
  geom_bar(stat = "identity", position = "fill", color = "white") +
  scale_fill_brewer(palette = "RdYlGn", direction = -1) +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Likelihood to Come Again by Age Group (Proportional)",
       x = "Age Group", y = "Proportion", fill = "Response") +
  theme_minimal()


9. Key Findings & Summary

findings <- data.frame(
  Finding = c(
    "Dominant age group",
    "Most common gender",
    "Most common education level",
    "Customer service helpfulness",
    "Likelihood to recommend",
    "Likelihood to return",
    "Online grocery preference",
    "Preferred fulfillment method"
  ),
  Result = c(
    paste0("18–25 (", round(mean(df$Age == 2) * 100, 1), "% of respondents)"),
    paste0("Male (", round(mean(df$Gender == 1) * 100, 1), "%)"),
    "Graduate+ and Some College (tied most common)",
    paste0("Mean score: ", round(mean(df$CS_helpful), 2), " — largely positive"),
    paste0("Mean score: ", round(mean(df$Recommend), 2), " — largely positive"),
    paste0("Mean score: ", round(mean(df$Come_again), 2), " — largely positive"),
    paste0("Mean score: ", round(mean(df$Online_grocery), 2), " — mixed/neutral interest"),
    paste0("Delivery mean: ", round(mean(df$delivery), 2),
           " | Pick-up mean: ", round(mean(df$Pick_up), 2))
  )
)

kable(findings, col.names = c("Finding", "Result"),
      caption = "Summary of Key Findings") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE)

Summary of Key Findings

Finding                       Result
Dominant age group            18–25 (68.2% of respondents)
Most common gender            Male (72.7%)
Most common education level   Graduate+ and Some College (tied most common)
Customer service helpfulness  Mean score: 1.59 — largely positive
Likelihood to recommend       Mean score: 1.32 — largely positive
Likelihood to return          Mean score: 1.45 — largely positive
Online grocery preference     Mean score: 2.27 — mixed/neutral interest
Preferred fulfillment method  Delivery mean: 2.41 | Pick-up mean: 2.45

10. Conclusion

Based on analysis of this 22-respondent classmate grocery survey dataset:

  • Customer satisfaction is generally positive: the majority of respondents agreed that customer service is helpful, would recommend the store, and would return.
  • Young adults (18–25) make up the largest portion of respondents, and they tend to express positive sentiments about the store.
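  • Interest in online grocery shopping and delivery is lukewarm rather than divided: responses on both items range only from Strongly Agree to Neutral (maximum score 3), with means of 2.27 and 2.41. Pick-up is the only fulfillment item on which any respondent disagreed (maximum score 5). This may point to an opportunity for the store to improve its digital offerings.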
  • Correlation analysis shows that Come Again and Recommend are positively correlated (r = 0.381), meaning customers who intend to return also tend to recommend the store to others. The strongest correlate of Recommend is customer service helpfulness (r = 0.488).
  • The small sample size (n = 22) limits generalizability, but the data provides a useful snapshot of this particular group’s grocery shopping attitudes.
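
To make the sample-size caveat concrete, the sketch below computes exact 95% confidence intervals for two of the reported proportions. The count of 16 male respondents comes from the gender table. The count of 15 respondents aged 18–25 is inferred from the reported 68.2% of 22. The intervals also assume the classmates behave like a random sample, which a convenience survey does not guarantee.

```r
n <- 22  # number of respondents

# Exact (Clopper-Pearson) 95% intervals from base R's binom.test()
ci_age  <- binom.test(15, n)$conf.int  # 15 of 22 aged 18-25 (inferred from 68.2%)
ci_male <- binom.test(16, n)$conf.int  # 16 of 22 male (from the gender table)

round(rbind(age_18_25 = ci_age, male = ci_male), 3)
```

Both intervals span roughly forty percentage points, which is why the percentages in this report describe this group of respondents rather than grocery shoppers in general.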

Report generated using R Markdown. Dataset: customer_segmentation.csv.