3. Maintain Payment Flexibility
Do not streamline payment options. The even split shows that our customers rely on having their specific preferred method available.
Understanding Purchasing Patterns in Retail
This project is a comprehensive data analysis initiative that involves data wrangling, cleaning, manipulation, preprocessing, exploratory data analysis (EDA), and visualization. The dataset, obtained from Kaggle, contains information on customer shopping behavior across various locations, including products purchased, spending amounts, age groups, and discount usage. This project provides an overview of purchasing patterns, uncovering trends, customer preferences, and insights that can inform business strategies and decision-making.
To understand what drives customer spending and identify patterns that can help optimize sales strategies.
Which product categories are purchased the most?
How does purchase amount vary by age group?
Do discounts lead to higher purchase amounts?
What payment methods are most commonly used?
Does purchase frequency relate to customer satisfaction?
Which product categories rely most heavily on discounts to drive purchases?
How does average customer spending vary across different locations?
Which product categories achieve the highest customer satisfaction ratings?
# Load the tidyverse package - tools for cleaning and visualizing
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'tibble' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load skimr - for nice summaries
library(skimr)
# Load ggplot2 - our main graphing tool
library(ggplot2)
# Load plotly - makes graphs interactive
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.3
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
# Load knitr - for nice tables
library(knitr)
## Warning: package 'knitr' was built under R version 4.3.3
library(readr)
library(dplyr)
library(rlang)
## Warning: package 'rlang' was built under R version 4.3.3
##
## Attaching package: 'rlang'
##
## The following objects are masked from 'package:purrr':
##
## %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
## flatten_raw, invoke, splice
These packages provide the necessary tools for data cleaning, exploration, visualization, and reporting throughout the analysis.
shopping_data <-read_csv("C:/Users/user/Downloads/shopping_trends_updated.csv")
## Rows: 3900 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): Gender, Item Purchased, Category, Location, Size, Color, Season, S...
## dbl (5): Customer ID, Age, Purchase Amount (USD), Review Rating, Previous P...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#view(shopping_data)
# Quick look
glimpse(shopping_data)
## Rows: 3,900
## Columns: 18
## $ `Customer ID` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
## $ Age <dbl> 55, 19, 50, 21, 45, 46, 63, 27, 26, 57, 53, 3…
## $ Gender <chr> "Male", "Male", "Male", "Male", "Male", "Male…
## $ `Item Purchased` <chr> "Blouse", "Sweater", "Jeans", "Sandals", "Blo…
## $ Category <chr> "Clothing", "Clothing", "Clothing", "Footwear…
## $ `Purchase Amount (USD)` <dbl> 53, 64, 73, 90, 49, 20, 85, 34, 97, 31, 34, 6…
## $ Location <chr> "Kentucky", "Maine", "Massachusetts", "Rhode …
## $ Size <chr> "L", "L", "S", "M", "M", "M", "M", "L", "L", …
## $ Color <chr> "Gray", "Maroon", "Maroon", "Maroon", "Turquo…
## $ Season <chr> "Winter", "Winter", "Spring", "Spring", "Spri…
## $ `Review Rating` <dbl> 3.1, 3.1, 3.1, 3.5, 2.7, 2.9, 3.2, 3.2, 2.6, …
## $ `Subscription Status` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye…
## $ `Shipping Type` <chr> "Express", "Express", "Free Shipping", "Next …
## $ `Discount Applied` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye…
## $ `Promo Code Used` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Ye…
## $ `Previous Purchases` <dbl> 14, 2, 23, 49, 31, 14, 49, 19, 8, 4, 26, 10, …
## $ `Payment Method` <chr> "Venmo", "Cash", "Credit Card", "PayPal", "Pa…
## $ `Frequency of Purchases` <chr> "Fortnightly", "Fortnightly", "Weekly", "Week…
The dataset was loaded and briefly explored to understand its structure and variables prior to data cleaning.
# Create a list of new, cleaner column names
clean_names <- c("Customer_ID", "Age", "Gender", "Item", "Category",
"Spend", "Location", "Size", "Color", "Season",
"Rating", "Subscription", "Shipping", "Discount",
"Promo_Code", "Prev_Purchases", "Payment", "Frequency")
# Apply these new names
colnames(shopping_data) <- clean_names
# Verify
head(shopping_data)
## # A tibble: 6 × 18
## Customer_ID Age Gender Item Category Spend Location Size Color Season
## <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 1 55 Male Blouse Clothing 53 Kentucky L Gray Winter
## 2 2 19 Male Sweater Clothing 64 Maine L Maro… Winter
## 3 3 50 Male Jeans Clothing 73 Massachus… S Maro… Spring
## 4 4 21 Male Sandals Footwear 90 Rhode Isl… M Maro… Spring
## 5 5 45 Male Blouse Clothing 49 Oregon M Turq… Spring
## 6 6 46 Male Sneakers Footwear 20 Wyoming M White Summer
## # ℹ 8 more variables: Rating <dbl>, Subscription <chr>, Shipping <chr>,
## # Discount <chr>, Promo_Code <chr>, Prev_Purchases <dbl>, Payment <chr>,
## # Frequency <chr>
Column names were simplified and made descriptive to improve readability and consistency.
# Convert character columns to factors for easier analysis
shopping_data[sapply(shopping_data, is.character)] <-
lapply(shopping_data[sapply(shopping_data, is.character)], as.factor)
# Check structure
str(shopping_data)
## spc_tbl_ [3,900 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Customer_ID : num [1:3900] 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : num [1:3900] 55 19 50 21 45 46 63 27 26 57 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Item : Factor w/ 25 levels "Backpack","Belt",..: 3 24 12 15 3 21 17 19 5 8 ...
## $ Category : Factor w/ 4 levels "Accessories",..: 2 2 2 3 2 3 2 2 4 1 ...
## $ Spend : num [1:3900] 53 64 73 90 49 20 85 34 97 31 ...
## $ Location : Factor w/ 50 levels "Alabama","Alaska",..: 17 19 21 39 37 50 26 18 48 25 ...
## $ Size : Factor w/ 4 levels "L","M","S","XL": 1 1 3 2 2 2 2 1 1 2 ...
## $ Color : Factor w/ 25 levels "Beige","Black",..: 8 13 13 13 22 24 8 5 20 17 ...
## $ Season : Factor w/ 4 levels "Fall","Spring",..: 4 4 2 2 2 3 1 4 3 2 ...
## $ Rating : num [1:3900] 3.1 3.1 3.1 3.5 2.7 2.9 3.2 3.2 2.6 4.8 ...
## $ Subscription : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Shipping : Factor w/ 6 levels "2-Day Shipping",..: 2 2 3 4 3 5 3 3 2 1 ...
## $ Discount : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Promo_Code : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Prev_Purchases: num [1:3900] 14 2 23 49 31 14 49 19 8 4 ...
## $ Payment : Factor w/ 6 levels "Bank Transfer",..: 6 2 3 5 5 6 2 3 6 2 ...
## $ Frequency : Factor w/ 7 levels "Annually","Bi-Weekly",..: 4 4 7 7 1 7 6 7 1 6 ...
## - attr(*, "spec")=
## .. cols(
## .. `Customer ID` = col_double(),
## .. Age = col_double(),
## .. Gender = col_character(),
## .. `Item Purchased` = col_character(),
## .. Category = col_character(),
## .. `Purchase Amount (USD)` = col_double(),
## .. Location = col_character(),
## .. Size = col_character(),
## .. Color = col_character(),
## .. Season = col_character(),
## .. `Review Rating` = col_double(),
## .. `Subscription Status` = col_character(),
## .. `Shipping Type` = col_character(),
## .. `Discount Applied` = col_character(),
## .. `Promo Code Used` = col_character(),
## .. `Previous Purchases` = col_double(),
## .. `Payment Method` = col_character(),
## .. `Frequency of Purchases` = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
# DATA SUMMARY
# skim() gives a comprehensive summary
skim(shopping_data)
| Name | shopping_data |
| Number of rows | 3900 |
| Number of columns | 18 |
| _______________________ | |
| Column type frequency: | |
| factor | 13 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Gender | 0 | 1 | FALSE | 2 | Mal: 2652, Fem: 1248 |
| Item | 0 | 1 | FALSE | 25 | Blo: 171, Jew: 171, Pan: 171, Shi: 169 |
| Category | 0 | 1 | FALSE | 4 | Clo: 1737, Acc: 1240, Foo: 599, Out: 324 |
| Location | 0 | 1 | FALSE | 50 | Mon: 96, Cal: 95, Ida: 93, Ill: 92 |
| Size | 0 | 1 | FALSE | 4 | M: 1755, L: 1053, S: 663, XL: 429 |
| Color | 0 | 1 | FALSE | 25 | Oli: 177, Yel: 174, Sil: 173, Tea: 172 |
| Season | 0 | 1 | FALSE | 4 | Spr: 999, Fal: 975, Win: 971, Sum: 955 |
| Subscription | 0 | 1 | FALSE | 2 | No: 2847, Yes: 1053 |
| Shipping | 0 | 1 | FALSE | 6 | Fre: 675, Sta: 654, Sto: 650, Nex: 648 |
| Discount | 0 | 1 | FALSE | 2 | No: 2223, Yes: 1677 |
| Promo_Code | 0 | 1 | FALSE | 2 | No: 2223, Yes: 1677 |
| Payment | 0 | 1 | FALSE | 6 | Pay: 677, Cre: 671, Cas: 670, Deb: 636 |
| Frequency | 0 | 1 | FALSE | 7 | Eve: 584, Ann: 572, Qua: 563, Mon: 553 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Customer_ID | 0 | 1 | 1950.50 | 1125.98 | 1.0 | 975.75 | 1950.5 | 2925.25 | 3900 | ▇▇▇▇▇ |
| Age | 0 | 1 | 44.07 | 15.21 | 18.0 | 31.00 | 44.0 | 57.00 | 70 | ▇▇▇▇▇ |
| Spend | 0 | 1 | 59.76 | 23.69 | 20.0 | 39.00 | 60.0 | 81.00 | 100 | ▇▇▇▇▇ |
| Rating | 0 | 1 | 3.75 | 0.72 | 2.5 | 3.10 | 3.7 | 4.40 | 5 | ▇▇▇▇▆ |
| Prev_Purchases | 0 | 1 | 25.35 | 14.45 | 1.0 | 13.00 | 25.0 | 38.00 | 50 | ▇▇▇▇▇ |
Character columns were converted to categorical factors; the dataset structure and summary were examined.
shopping_data <- shopping_data %>%
mutate(
Age_Group = cut(Age,
breaks = c(18, 30, 40, 50, 60, 70, 100),
labels = c("18-30", "30-40", "40-50", "50-60", "60-70", "70+"),
right = FALSE)
)
# Check the new column
shopping_data %>% count(Age_Group)
## # A tibble: 6 × 2
## Age_Group n
## <fct> <int>
## 1 18-30 878
## 2 30-40 724
## 3 40-50 739
## 4 50-60 771
## 5 60-70 721
## 6 70+ 67
A new variable that groups ages into ranges because analyzing individual ages is too detailed. This helps us see clearer trends in the data.”
# Prepare data
category_counts <- shopping_data %>%
count(Category, sort = TRUE)
# Create Plot
cat_plot <- ggplot(category_counts, aes(x = reorder(Category, -n), y = n, fill = Category))+
geom_col(show.legend = FALSE) +
coord_flip() + # Flip to make labels readable
geom_text(aes(label = n), vjust = 0.2, size = 3, fontface = "bold") +
labs(title = "Which Categories Sell the Most?",
x = "Category",
y = "Number of Purchases") +
theme_minimal() +
theme(legend.position = "none") # Hide legend as labels are enough
# Interactive
ggplotly(cat_plot, tooltip = c("x", "y")) %>%
layout(title = list(text = "Which Categories Sell the Most?<br><sup>Clothing dominates the sales</sup>"))
Clothing is the clear leader with over 1,700 purchases (45% of total), followed by Accessories. Outerwear is the smallest category. Takeaway: The business is primarily driven by apparel. Marketing efforts should focus on cross-selling Accessories to the massive base of Clothing buyers.
head(shopping_data)
## # A tibble: 6 × 19
## Customer_ID Age Gender Item Category Spend Location Size Color Season
## <dbl> <dbl> <fct> <fct> <fct> <dbl> <fct> <fct> <fct> <fct>
## 1 1 55 Male Blouse Clothing 53 Kentucky L Gray Winter
## 2 2 19 Male Sweater Clothing 64 Maine L Maro… Winter
## 3 3 50 Male Jeans Clothing 73 Massachus… S Maro… Spring
## 4 4 21 Male Sandals Footwear 90 Rhode Isl… M Maro… Spring
## 5 5 45 Male Blouse Clothing 49 Oregon M Turq… Spring
## 6 6 46 Male Sneakers Footwear 20 Wyoming M White Summer
## # ℹ 9 more variables: Rating <dbl>, Subscription <fct>, Shipping <fct>,
## # Discount <fct>, Promo_Code <fct>, Prev_Purchases <dbl>, Payment <fct>,
## # Frequency <fct>, Age_Group <fct>
# Prepare data: Average spend per age group
age_spend <- shopping_data %>%
group_by(Age_Group) %>%
summarise(Avg_Spend = mean(Spend))
# Create Plot
age_plot <- ggplot(age_spend, aes(x = Age_Group, y = Avg_Spend, group = 1)) +
geom_line(color = "#3498db", size = 1.5) +
geom_point(color = "#2c3e50", size = 4) +
labs(title = "Average Spending by Age Group",
subtitle = "Is there a 'richer' generation?",
x = "Age Group",
y = "Average Purchase Amount ($)") +
theme_minimal() +
scale_y_continuous(limits = c(0, 100)) # Set range to see real scale
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ggplotly(age_plot)
Surprisingly, spending is almost perfectly flat across all ages. Whether the customer is 25 or 65, they spend roughly $60 per transaction.
Takeaway: We do not have a specific “high-value” age demographic. Our product appeals equally to all generations.
# Calculate average spend for Discount vs No Discount
discount_impact <- shopping_data %>%
group_by(Discount) %>%
summarise(
Avg_Spend = mean(Spend),
Count = n()
)
# Create Plot
disc_plot <- ggplot(discount_impact, aes(x = Discount, y = Avg_Spend, fill = Discount)) +
geom_col(width = 0.6) +
geom_text(aes(label = round(Avg_Spend, 2)), vjust = -0.5, fontface = "bold") +
labs(title = "Do Discounts Increase Spending?",
x = "Discount Applied?",
y = "Average Spend ($)") +
scale_fill_manual(values = c("No" = "#e74c3c", "Yes" = "#2ecc71")) +
coord_cartesian(ylim = c(50, 70)) + # Zoom in to see the difference
theme_minimal()
ggplotly(disc_plot)
Customers with a discount spent an average of $59.28. Customers without a discount spent $60.13. The discount did not lead to higher cart values; in fact, it was slightly lower.
Takeaway: “Discounts do not increase basket size, potentially reducing margin without boosting volume.”
# Create Plot
pay_plot <- ggplot(shopping_data, aes(x = Payment)) +
geom_bar(fill = "steelblue") +
labs(title = "Preferred Payment Methods",
x = "Method",
y = "Count") +
theme_minimal() +
coord_flip()
ggplotly(pay_plot)
Payment preference is incredibly evenly split. PayPal, Credit Cards,
Cash, and Venmo all have very similar usage rates (around 16-17%
each).
Takeaway: Customers demand flexibility. Removing any one of these
options (like Cash or Venmo) could alienate ~17% of our user base.
# Order the frequency levels logically
shopping_data$Frequency <- factor(shopping_data$Frequency,
levels = c("Weekly", "Bi-Weekly", "Fortnightly",
"Monthly", "Quarterly", "Every 3 Months", "Annually"))
# Calculate average rating
freq_rating <- shopping_data %>%
group_by(Frequency) %>%
summarise(Avg_Rating = mean(Rating))
# Create Plot
freq_plot <- ggplot(freq_rating, aes(x = Frequency, y = Avg_Rating)) +
geom_point(size = 5, color = "purple") +
geom_segment(aes(x=Frequency, xend=Frequency, y=0, yend=Avg_Rating), color="grey") +
labs(title = "Customer Satisfaction by Shopping Frequency",
y = "Average Rating (1-5)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(freq_plot)
Average ratings hover consistently around 3.7 to 3.8 regardless of how often they shop. Even “Annual” shoppers are just as happy as “Weekly” shoppers. Takeaway: Loyalty (frequency) does not equal higher satisfaction. Our frequent shoppers aren’t necessarily “happier,” they just buy more often.
# Prepare data
disc_cat <- shopping_data %>%
group_by(Category, Discount) %>%
summarise(Count = n(), .groups = "drop")
# Plot
disc_cat_plot <- ggplot(disc_cat,
aes(x = reorder(Category, -Count),
y = Count,
fill = Discount)) +
geom_col(position = "dodge") +
coord_flip() +
geom_text(aes(label = Count),
position = position_dodge(width = 0.9),
hjust = -0.2,
size = 4) +
labs(title = "Which Categories Depend Most on Discounts?",
subtitle = "Identifying price-sensitive product groups",
x = "Category",
y = "Number of Purchases") +
theme_minimal()
ggplotly(disc_cat_plot)
Some categories show a much higher number of discounted purchases than full-price purchases. These are likely price-sensitive fashion items where customers wait for deals. Categories with fewer discounts indicate strong brand value or necessity purchases.
loc_spend <- shopping_data %>%
group_by(Location) %>%
summarise(Avg_Spend = mean(Spend, na.rm = TRUE)) %>%
arrange(desc(Avg_Spend)) %>%
slice_head(n = 10)
loc_plot <- ggplot(loc_spend,
aes(x = reorder(Location, Avg_Spend),
y = Avg_Spend)) +
geom_col(fill = "steelblue") +
coord_flip() +
geom_text(aes(label = round(Avg_Spend, 2)),
hjust = -0.2,
size = 4,
fontface = "bold") +
labs(title = "Top 10 Locations by Average Spending",
subtitle = "High-value fashion markets",
x = "Location",
y = "Average Spend ($)") +
theme_minimal()
ggplotly(loc_plot)
Some locations show higher average transaction values. These areas may represent urban or fashion-forward markets where consumers spend more on style and trends.
cat_rating <- shopping_data %>%
group_by(Category) %>%
summarise(Avg_Rating = mean(Rating), .groups = "drop")
rating_plot <- ggplot(cat_rating,
aes(x = reorder(Category, Avg_Rating),
y = Avg_Rating)) +
geom_col(fill = "blue") +
coord_flip() +
geom_text(aes(label = round(Avg_Rating, 2)),
hjust = -0.2,
size = 4,
fontface = "bold") +
labs(title = "Customer Satisfaction by Category",
subtitle = "Which products delight customers most?",
x = "Category",
y = "Average Rating (1–5)") +
theme_minimal() +
coord_cartesian(ylim = c(0, 5))
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
ggplotly(rating_plot)
Some categories consistently receive higher ratings, meaning better fit, quality, or style satisfaction. Lower-rated categories may face issues like sizing, durability, or expectations mismatch.
Answer: Clothing (45%) and Accessories (32%). These two categories make up over three-quarters of all sales.
Answer: It doesn’t. Spending is remarkably consistent (~$60) across all age groups from 18 to 70+.
Answer: No. Discounted purchases actually had a slightly lower average value ($59 vs $60).
Answer: There is no dominant method. Usage is split evenly between PayPal, Credit Card, Cash, and Venmo.
Answer: No. Satisfaction ratings are stable (3.75/5.0) regardless of how often a customer shops.
Some fashion categories are more price-sensitive, indicating where discount strategies influence demand.
Answer: Average spending varies by location. The top 10 locations show noticeably higher transaction values, indicating stronger purchasing power and higher demand for fashion products in these areas.
Customer satisfaction varies by category, highlighting quality perception differences.
Since discounts aren’t increasing the average purchase value, stop offering generic discounts. Instead, use “Spend $100, Get $10 Off” promos to force the basket size up.
Since 77% of sales are Clothing or Accessories, create bundles (e.g., “Shirt + Belt” combos). This leverages the popularity of both categories.
Do not streamline payment options. The even split shows that our customers rely on having their specific preferred method available.
This study reveals that our customer base is highly consistent. We don’t have a specific “niche” age group, and our shoppers behave similarly regardless of discounts or frequency. Key Takeaway: We have a stable, predictable business model. Growth will likely come from increasing the average transaction value (bundling) rather than finding a new demographic.