Customer Shopping Behavior Analysis

Understanding Purchasing Patterns in Retail

Introduction

This project is a comprehensive data analysis initiative that involves data wrangling, cleaning, manipulation, preprocessing, exploratory data analysis (EDA), and visualization. The dataset, obtained from Kaggle, contains information on customer shopping behavior across various locations, including products purchased, spending amounts, age groups, and discount usage. This project provides an overview of purchasing patterns, uncovering trends, customer preferences, and insights that can inform business strategies and decision-making.

Aim

To understand what drives customer spending and identify patterns that can help optimize sales strategies.

Business Questions?

  1. Which product categories are purchased the most?

  2. How does purchase amount vary by age group?

  3. Do discounts lead to higher purchase amounts?

  4. What payment methods are most commonly used?

  5. Does purchase frequency relate to customer satisfaction?

  6. Which product categories rely most heavily on discounts to drive purchases?

  7. How does average customer spending vary across different locations?

  8. Which product categories achieve the highest customer satisfaction ratings?

Data Cleaning

Loaded Necessary Libraries

# Load the tidyverse package - tools for cleaning and visualizing
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'tibble' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load skimr - for nice summaries
library(skimr)
# Load ggplot2 - our main graphing tool
library(ggplot2)
# Load plotly - makes graphs interactive
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.3
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
# Load knitr - for nice tables
library(knitr)
## Warning: package 'knitr' was built under R version 4.3.3
library(readr)
library(dplyr)
library(rlang)
## Warning: package 'rlang' was built under R version 4.3.3
## 
## Attaching package: 'rlang'
## 
## The following objects are masked from 'package:purrr':
## 
##     %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
##     flatten_raw, invoke, splice

These packages provide the necessary tools for data cleaning, exploration, visualization, and reporting throughout the analysis.

Importing Dataset

shopping_data <-read_csv("C:/Users/user/Downloads/shopping_trends_updated.csv")
## Rows: 3900 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): Gender, Item Purchased, Category, Location, Size, Color, Season, S...
## dbl  (5): Customer ID, Age, Purchase Amount (USD), Review Rating, Previous P...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#view(shopping_data)
# Quick look
glimpse(shopping_data)
## Rows: 3,900
## Columns: 18
## $ `Customer ID`            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
## $ Age                      <dbl> 55, 19, 50, 21, 45, 46, 63, 27, 26, 57, 53, 3…
## $ Gender                   <chr> "Male", "Male", "Male", "Male", "Male", "Male…
## $ `Item Purchased`         <chr> "Blouse", "Sweater", "Jeans", "Sandals", "Blo…
## $ Category                 <chr> "Clothing", "Clothing", "Clothing", "Footwear…
## $ `Purchase Amount (USD)`  <dbl> 53, 64, 73, 90, 49, 20, 85, 34, 97, 31, 34, 6…
## $ Location                 <chr> "Kentucky", "Maine", "Massachusetts", "Rhode …
## $ Size                     <chr> "L", "L", "S", "M", "M", "M", "M", "L", "L", …
## $ Color                    <chr> "Gray", "Maroon", "Maroon", "Maroon", "Turquo…
## $ Season                   <chr> "Winter", "Winter", "Spring", "Spring", "Spri…
## $ `Review Rating`          <dbl> 3.1, 3.1, 3.1, 3.5, 2.7, 2.9, 3.2, 3.2, 2.6, …
## $ `Subscription Status`    <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye…
## $ `Shipping Type`          <chr> "Express", "Express", "Free Shipping", "Next …
## $ `Discount Applied`       <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye…
## $ `Promo Code Used`        <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye…
## $ `Previous Purchases`     <dbl> 14, 2, 23, 49, 31, 14, 49, 19, 8, 4, 26, 10, …
## $ `Payment Method`         <chr> "Venmo", "Cash", "Credit Card", "PayPal", "Pa…
## $ `Frequency of Purchases` <chr> "Fortnightly", "Fortnightly", "Weekly", "Week…

The dataset was loaded and briefly explored to understand its structure and variables prior to data cleaning.

Renaming Column names

# Create a list of new, cleaner column names
clean_names <- c("Customer_ID", "Age", "Gender", "Item", "Category", 
                 "Spend", "Location", "Size", "Color", "Season", 
                 "Rating", "Subscription", "Shipping", "Discount", 
                 "Promo_Code", "Prev_Purchases", "Payment", "Frequency")

# Apply these new names
colnames(shopping_data) <- clean_names

# Verify
head(shopping_data)
## # A tibble: 6 × 18
##   Customer_ID   Age Gender Item     Category Spend Location   Size  Color Season
##         <dbl> <dbl> <chr>  <chr>    <chr>    <dbl> <chr>      <chr> <chr> <chr> 
## 1           1    55 Male   Blouse   Clothing    53 Kentucky   L     Gray  Winter
## 2           2    19 Male   Sweater  Clothing    64 Maine      L     Maro… Winter
## 3           3    50 Male   Jeans    Clothing    73 Massachus… S     Maro… Spring
## 4           4    21 Male   Sandals  Footwear    90 Rhode Isl… M     Maro… Spring
## 5           5    45 Male   Blouse   Clothing    49 Oregon     M     Turq… Spring
## 6           6    46 Male   Sneakers Footwear    20 Wyoming    M     White Summer
## # ℹ 8 more variables: Rating <dbl>, Subscription <chr>, Shipping <chr>,
## #   Discount <chr>, Promo_Code <chr>, Prev_Purchases <dbl>, Payment <chr>,
## #   Frequency <chr>

Column names were simplified and made descriptive to improve readability and consistency.

Convert Text To Categories

# Convert character columns to factors for easier analysis
shopping_data[sapply(shopping_data, is.character)] <- 
  lapply(shopping_data[sapply(shopping_data, is.character)], as.factor)

# Check structure
str(shopping_data)
## spc_tbl_ [3,900 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Customer_ID   : num [1:3900] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Age           : num [1:3900] 55 19 50 21 45 46 63 27 26 57 ...
##  $ Gender        : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Item          : Factor w/ 25 levels "Backpack","Belt",..: 3 24 12 15 3 21 17 19 5 8 ...
##  $ Category      : Factor w/ 4 levels "Accessories",..: 2 2 2 3 2 3 2 2 4 1 ...
##  $ Spend         : num [1:3900] 53 64 73 90 49 20 85 34 97 31 ...
##  $ Location      : Factor w/ 50 levels "Alabama","Alaska",..: 17 19 21 39 37 50 26 18 48 25 ...
##  $ Size          : Factor w/ 4 levels "L","M","S","XL": 1 1 3 2 2 2 2 1 1 2 ...
##  $ Color         : Factor w/ 25 levels "Beige","Black",..: 8 13 13 13 22 24 8 5 20 17 ...
##  $ Season        : Factor w/ 4 levels "Fall","Spring",..: 4 4 2 2 2 3 1 4 3 2 ...
##  $ Rating        : num [1:3900] 3.1 3.1 3.1 3.5 2.7 2.9 3.2 3.2 2.6 4.8 ...
##  $ Subscription  : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Shipping      : Factor w/ 6 levels "2-Day Shipping",..: 2 2 3 4 3 5 3 3 2 1 ...
##  $ Discount      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Promo_Code    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Prev_Purchases: num [1:3900] 14 2 23 49 31 14 49 19 8 4 ...
##  $ Payment       : Factor w/ 6 levels "Bank Transfer",..: 6 2 3 5 5 6 2 3 6 2 ...
##  $ Frequency     : Factor w/ 7 levels "Annually","Bi-Weekly",..: 4 4 7 7 1 7 6 7 1 6 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Customer ID` = col_double(),
##   ..   Age = col_double(),
##   ..   Gender = col_character(),
##   ..   `Item Purchased` = col_character(),
##   ..   Category = col_character(),
##   ..   `Purchase Amount (USD)` = col_double(),
##   ..   Location = col_character(),
##   ..   Size = col_character(),
##   ..   Color = col_character(),
##   ..   Season = col_character(),
##   ..   `Review Rating` = col_double(),
##   ..   `Subscription Status` = col_character(),
##   ..   `Shipping Type` = col_character(),
##   ..   `Discount Applied` = col_character(),
##   ..   `Promo Code Used` = col_character(),
##   ..   `Previous Purchases` = col_double(),
##   ..   `Payment Method` = col_character(),
##   ..   `Frequency of Purchases` = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
# DATA SUMMARY

# skim() gives a comprehensive summary
skim(shopping_data)
Data summary
Name shopping_data
Number of rows 3900
Number of columns 18
_______________________
Column type frequency:
factor 13
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Gender 0 1 FALSE 2 Mal: 2652, Fem: 1248
Item 0 1 FALSE 25 Blo: 171, Jew: 171, Pan: 171, Shi: 169
Category 0 1 FALSE 4 Clo: 1737, Acc: 1240, Foo: 599, Out: 324
Location 0 1 FALSE 50 Mon: 96, Cal: 95, Ida: 93, Ill: 92
Size 0 1 FALSE 4 M: 1755, L: 1053, S: 663, XL: 429
Color 0 1 FALSE 25 Oli: 177, Yel: 174, Sil: 173, Tea: 172
Season 0 1 FALSE 4 Spr: 999, Fal: 975, Win: 971, Sum: 955
Subscription 0 1 FALSE 2 No: 2847, Yes: 1053
Shipping 0 1 FALSE 6 Fre: 675, Sta: 654, Sto: 650, Nex: 648
Discount 0 1 FALSE 2 No: 2223, Yes: 1677
Promo_Code 0 1 FALSE 2 No: 2223, Yes: 1677
Payment 0 1 FALSE 6 Pay: 677, Cre: 671, Cas: 670, Deb: 636
Frequency 0 1 FALSE 7 Eve: 584, Ann: 572, Qua: 563, Mon: 553

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Customer_ID 0 1 1950.50 1125.98 1.0 975.75 1950.5 2925.25 3900 ▇▇▇▇▇
Age 0 1 44.07 15.21 18.0 31.00 44.0 57.00 70 ▇▇▇▇▇
Spend 0 1 59.76 23.69 20.0 39.00 60.0 81.00 100 ▇▇▇▇▇
Rating 0 1 3.75 0.72 2.5 3.10 3.7 4.40 5 ▇▇▇▇▆
Prev_Purchases 0 1 25.35 14.45 1.0 13.00 25.0 38.00 50 ▇▇▇▇▇

Character columns were converted to categorical factors; the dataset structure and summary were examined.

Grouping of Age

shopping_data <- shopping_data %>%
  mutate(
    Age_Group = cut(Age, 
                    breaks = c(18, 30, 40, 50, 60, 70, 100),
                    labels = c("18-30", "30-40", "40-50", "50-60", "60-70", "70+"),
                    right = FALSE)
  )

# Check the new column
shopping_data %>% count(Age_Group)
## # A tibble: 6 × 2
##   Age_Group     n
##   <fct>     <int>
## 1 18-30       878
## 2 30-40       724
## 3 40-50       739
## 4 50-60       771
## 5 60-70       721
## 6 70+          67

A new variable that groups ages into ranges because analyzing individual ages is too detailed. This helps us see clearer trends in the data.”

VISUALIZATION

ANSWER TO QUESTIONS

2.Spending by age group

head(shopping_data)
## # A tibble: 6 × 19
##   Customer_ID   Age Gender Item     Category Spend Location   Size  Color Season
##         <dbl> <dbl> <fct>  <fct>    <fct>    <dbl> <fct>      <fct> <fct> <fct> 
## 1           1    55 Male   Blouse   Clothing    53 Kentucky   L     Gray  Winter
## 2           2    19 Male   Sweater  Clothing    64 Maine      L     Maro… Winter
## 3           3    50 Male   Jeans    Clothing    73 Massachus… S     Maro… Spring
## 4           4    21 Male   Sandals  Footwear    90 Rhode Isl… M     Maro… Spring
## 5           5    45 Male   Blouse   Clothing    49 Oregon     M     Turq… Spring
## 6           6    46 Male   Sneakers Footwear    20 Wyoming    M     White Summer
## # ℹ 9 more variables: Rating <dbl>, Subscription <fct>, Shipping <fct>,
## #   Discount <fct>, Promo_Code <fct>, Prev_Purchases <dbl>, Payment <fct>,
## #   Frequency <fct>, Age_Group <fct>
# Prepare data: Average spend per age group
age_spend <- shopping_data %>%
  group_by(Age_Group) %>%
  summarise(Avg_Spend = mean(Spend))

# Create Plot
age_plot <- ggplot(age_spend, aes(x = Age_Group, y = Avg_Spend, group = 1)) +
  geom_line(color = "#3498db", size = 1.5) +
  geom_point(color = "#2c3e50", size = 4) +
  labs(title = "Average Spending by Age Group",
       subtitle = "Is there a 'richer' generation?",
       x = "Age Group",
       y = "Average Purchase Amount ($)") +
  theme_minimal() +
  scale_y_continuous(limits = c(0, 100)) # Set range to see real scale
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ggplotly(age_plot)

Surprisingly, spending is almost perfectly flat across all ages. Whether the customer is 25 or 65, they spend roughly $60 per transaction.

Takeaway: We do not have a specific “high-value” age demographic. Our product appeals equally to all generations.

3. Impact of Discounts

# Calculate average spend for Discount vs No Discount
discount_impact <- shopping_data %>%
  group_by(Discount) %>%
  summarise(
    Avg_Spend = mean(Spend),
    Count = n()
  )
# Create Plot
disc_plot <- ggplot(discount_impact, aes(x = Discount, y = Avg_Spend, fill = Discount)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = round(Avg_Spend, 2)), vjust = -0.5, fontface = "bold") +
  labs(title = "Do Discounts Increase Spending?",
       x = "Discount Applied?",
       y = "Average Spend ($)") +
  scale_fill_manual(values = c("No" = "#e74c3c", "Yes" = "#2ecc71")) +
  coord_cartesian(ylim = c(50, 70)) + # Zoom in to see the difference
  theme_minimal()

ggplotly(disc_plot)

Customers with a discount spent an average of $59.28. Customers without a discount spent $60.13. The discount did not lead to higher cart values; in fact, it was slightly lower.

Takeaway: “Discounts do not increase basket size, potentially reducing margin without boosting volume.”

4.Payment Methods List

# Create Plot
pay_plot <- ggplot(shopping_data, aes(x = Payment)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Preferred Payment Methods",
       x = "Method",
       y = "Count") +
  theme_minimal() +
  coord_flip()

ggplotly(pay_plot)

Payment preference is incredibly evenly split. PayPal, Credit Cards, Cash, and Venmo all have very similar usage rates (around 16-17% each).
Takeaway: Customers demand flexibility. Removing any one of these options (like Cash or Venmo) could alienate ~17% of our user base.

5. Frequency vs. Satisfaction

# Order the frequency levels logically
shopping_data$Frequency <- factor(shopping_data$Frequency, 
                                  levels = c("Weekly", "Bi-Weekly", "Fortnightly", 
                                             "Monthly", "Quarterly", "Every 3 Months", "Annually"))

# Calculate average rating
freq_rating <- shopping_data %>%
  group_by(Frequency) %>%
  summarise(Avg_Rating = mean(Rating))

# Create Plot
freq_plot <- ggplot(freq_rating, aes(x = Frequency, y = Avg_Rating)) +
  geom_point(size = 5, color = "purple") +
  geom_segment(aes(x=Frequency, xend=Frequency, y=0, yend=Avg_Rating), color="grey") +
  labs(title = "Customer Satisfaction by Shopping Frequency",
       y = "Average Rating (1-5)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplotly(freq_plot)

Average ratings hover consistently around 3.7 to 3.8 regardless of how often they shop. Even “Annual” shoppers are just as happy as “Weekly” shoppers. Takeaway: Loyalty (frequency) does not equal higher satisfaction. Our frequent shoppers aren’t necessarily “happier,” they just buy more often.

6. Which Categories Rely Most on Discounts?

# Prepare data
disc_cat <- shopping_data %>%
  group_by(Category, Discount) %>%
  summarise(Count = n(), .groups = "drop")

# Plot
disc_cat_plot <- ggplot(disc_cat, 
                        aes(x = reorder(Category, -Count), 
                            y = Count, 
                            fill = Discount)) +
  geom_col(position = "dodge") +
  coord_flip() +
  geom_text(aes(label = Count),
            position = position_dodge(width = 0.9),
            hjust = -0.2,
            size = 4) +
  labs(title = "Which Categories Depend Most on Discounts?",
       subtitle = "Identifying price-sensitive product groups",
       x = "Category",
       y = "Number of Purchases") +
  theme_minimal()

ggplotly(disc_cat_plot)

Some categories show a much higher number of discounted purchases than full-price purchases. These are likely price-sensitive fashion items where customers wait for deals. Categories with fewer discounts indicate strong brand value or necessity purchases.

7. Which Locations Spend the Most? (Geographic Fashion Demand)

loc_spend <- shopping_data %>%
  group_by(Location) %>%
  summarise(Avg_Spend = mean(Spend, na.rm = TRUE)) %>%
  arrange(desc(Avg_Spend)) %>%
  slice_head(n = 10)

loc_plot <- ggplot(loc_spend, 
                   aes(x = reorder(Location, Avg_Spend), 
                       y = Avg_Spend)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  geom_text(aes(label = round(Avg_Spend, 2)),
            hjust = -0.2,
            size = 4,
            fontface = "bold") +
  labs(title = "Top 10 Locations by Average Spending",
       subtitle = "High-value fashion markets",
       x = "Location",
       y = "Average Spend ($)") +
  theme_minimal()

ggplotly(loc_plot)

Some locations show higher average transaction values. These areas may represent urban or fashion-forward markets where consumers spend more on style and trends.

8. Which Categories Have the Highest Customer Ratings?

cat_rating <- shopping_data %>%
  group_by(Category) %>%
  summarise(Avg_Rating = mean(Rating), .groups = "drop")

rating_plot <- ggplot(cat_rating, 
                      aes(x = reorder(Category, Avg_Rating), 
                          y = Avg_Rating)) +
  geom_col(fill = "blue") +
  coord_flip() +
  geom_text(aes(label = round(Avg_Rating, 2)),
            hjust = -0.2,
            size = 4,
            fontface = "bold") +
  labs(title = "Customer Satisfaction by Category",
       subtitle = "Which products delight customers most?",
       x = "Category",
       y = "Average Rating (1–5)") +
  theme_minimal() +
  coord_cartesian(ylim = c(0, 5))
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
ggplotly(rating_plot)

Some categories consistently receive higher ratings, meaning better fit, quality, or style satisfaction. Lower-rated categories may face issues like sizing, durability, or expectations mismatch.

Key Findings

1. Which product categories are purchased the most?

Answer: Clothing (45%) and Accessories (32%). These two categories make up over three-quarters of all sales.

2. How does purchase amount vary by age group?

Answer: It doesn’t. Spending is remarkably consistent (~$60) across all age groups from 18 to 70+.

3. Do discounts lead to higher purchase amounts?

Answer: No. Discounted purchases actually had a slightly lower average value ($59 vs $60).

4. What payment methods are most commonly used?

Answer: There is no dominant method. Usage is split evenly between PayPal, Credit Card, Cash, and Venmo.

5. Does purchase frequency relate to customer satisfaction?

Answer: No. Satisfaction ratings are stable (3.75/5.0) regardless of how often a customer shops.

6. Which categories rely most on discounts?

Some fashion categories are more price-sensitive, indicating where discount strategies influence demand.

7. Which locations spend the most?

Answer: Average spending varies by location. The top 10 locations show noticeably higher transaction values, indicating stronger purchasing power and higher demand for fashion products in these areas.

8. Which categories have the highest customer ratings?

Customer satisfaction varies by category, highlighting quality perception differences.

Recommendations

1. Rethink the Discount Strategy

Since discounts aren’t increasing the average purchase value, stop offering generic discounts. Instead, use “Spend $100, Get $10 Off” promos to force the basket size up.

2. Bundle Accessories with Clothing

Since 77% of sales are Clothing or Accessories, create bundles (e.g., “Shirt + Belt” combos). This leverages the popularity of both categories.

3. Maintain Payment Flexibility

Do not streamline payment options. The even split shows that our customers rely on having their specific preferred method available.

Conclusion

This study reveals that our customer base is highly consistent. We don’t have a specific “niche” age group, and our shoppers behave similarly regardless of discounts or frequency. Key Takeaway: We have a stable, predictable business model. Growth will likely come from increasing the average transaction value (bundling) rather than finding a new demographic.