This is a retail sales analysis done by me using R Studio. I chose to present this project in this way as a means of communicating my thought process when delivering analysis for stakeholders. The dataset I will be using can be found here: https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset/data.
I also want to state that I will be approaching this analysis from the perspective of a sales analyst, who was asked to analyze the given data for insights to help the marketing department with their next campaign. I will attempt to comment as much code as possible so less technical readers can understand and learn from the code.
Firstly, we want to prepare whichever packages we will be using.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
library(ggthemes)
Next, I want to import our data and assign it to a dataframe. After we assign the dataframe, I also want to get a quick understanding of our data structure by summarizing it. Let’s also change some column names:
# Load the data.table package's fread() function to quickly read a CSV file into a data frame
sales <- fread("C:\\Users\\Dave\\Desktop\\Data Practice Sets\\retail_sales_dataset.csv")
# Display a summary of the sales dataset (e.g., min, max, mean, etc.) for each column as loaded
summary(sales)
## Transaction ID Date Customer ID Gender
## Min. : 1.0 Min. :2023-01-01 Length:1000 Length:1000
## 1st Qu.: 250.8 1st Qu.:2023-04-08 Class :character Class :character
## Median : 500.5 Median :2023-06-29 Mode :character Mode :character
## Mean : 500.5 Mean :2023-07-03
## 3rd Qu.: 750.2 3rd Qu.:2023-10-04
## Max. :1000.0 Max. :2024-01-01
## Age Product Category Quantity Price per Unit
## Min. :18.00 Length:1000 Min. :1.000 Min. : 25.0
## 1st Qu.:29.00 Class :character 1st Qu.:1.000 1st Qu.: 30.0
## Median :42.00 Mode :character Median :3.000 Median : 50.0
## Mean :41.39 Mean :2.514 Mean :179.9
## 3rd Qu.:53.00 3rd Qu.:4.000 3rd Qu.:300.0
## Max. :64.00 Max. :4.000 Max. :500.0
## Total Amount
## Min. : 25
## 1st Qu.: 60
## Median : 135
## Mean : 456
## 3rd Qu.: 900
## Max. :2000
# Rename all columns in the sales dataset for easier coding, replacing the original column names
colnames(sales) <- c("Transaction.ID", "Date", "Customer.ID", "Gender", "Age", "Product.Category", "Quantity", "Price.per.Unit", "Total.Amount")
# Display a summary of the sales dataset again, now with the updated column names
summary(sales)
## Transaction.ID Date Customer.ID Gender
## Min. : 1.0 Min. :2023-01-01 Length:1000 Length:1000
## 1st Qu.: 250.8 1st Qu.:2023-04-08 Class :character Class :character
## Median : 500.5 Median :2023-06-29 Mode :character Mode :character
## Mean : 500.5 Mean :2023-07-03
## 3rd Qu.: 750.2 3rd Qu.:2023-10-04
## Max. :1000.0 Max. :2024-01-01
## Age Product.Category Quantity Price.per.Unit
## Min. :18.00 Length:1000 Min. :1.000 Min. : 25.0
## 1st Qu.:29.00 Class :character 1st Qu.:1.000 1st Qu.: 30.0
## Median :42.00 Mode :character Median :3.000 Median : 50.0
## Mean :41.39 Mean :2.514 Mean :179.9
## 3rd Qu.:53.00 3rd Qu.:4.000 3rd Qu.:300.0
## Max. :64.00 Max. :4.000 Max. :500.0
## Total.Amount
## Min. : 25
## 1st Qu.: 60
## Median : 135
## Mean : 456
## 3rd Qu.: 900
## Max. :2000
Now we can begin making sense of our data by asking questions in a way that our marketing department could find useful. Specifically, we can see the column names are: “Transaction.ID”, “Date”, “Customer.ID”, “Gender”, “Age”, “Product.Category”, “Quantity”, “Price.per.Unit”, and “Total.Amount”
We also noticed earlier that three of our columns are the “character” class; “Gender”, “Customer.ID”, and “Product.Category”, which could be good columns to begin our investigation for useful insights.
Let’s take a closer look at each one.
Gender
# Calculate the number of unique values in the Gender column of the sales dataset
length(unique(sales$Gender))
## [1] 2
# Display the distinct unique values present in the Gender column of the sales dataset
unique(sales$Gender)
## [1] "Male" "Female"
Knowing our data only contains 2 genders, this may be a good way to split our data to see the differences in sales patterns between the two groups.
Customer.ID
# Calculate the number of unique customer IDs in the Customer.ID column of the sales dataset
length(unique(sales$`Customer.ID`))
## [1] 1000
# Display the first 6 values from the Customer.ID column of the sales dataset
head(sales$`Customer.ID`)
## [1] "CUST001" "CUST002" "CUST003" "CUST004" "CUST005" "CUST006"
# Display the last 6 values from the Customer.ID column of the sales dataset
tail(sales$`Customer.ID`)
## [1] "CUST995" "CUST996" "CUST997" "CUST998" "CUST999" "CUST1000"
Here we can see that we have 1000 unique values that begin at “CUST001” and end at “CUST1000”. We either have not been tracking customers as return customers properly or we simply have 1000 unique customers. Since we can’t verify, we will skip over gleaming any details using this column for now.
Product Category
# Calculate the number of unique product categories in the Product.Category column of the sales dataset
length(unique(sales$`Product.Category`))
## [1] 3
# Display the distinct unique values present in the Product.Category column of the sales dataset
unique(sales$`Product.Category`)
## [1] "Beauty" "Clothing" "Electronics"
Now we know our sales are split into 3 distinct categories. Let’s continue our investigation into the rest of our columns in the data.
Transaction ID
# Calculate the number of unique transaction IDs in the Transaction.ID column of the sales dataset
length(unique(sales$`Transaction.ID`))
## [1] 1000
This matches our “Customer.ID” columns in unique entries. As we suspected, each transaction has been tracked as a separate customer. Let’s ignore for now.
Date
# Calculate the number of unique dates in the Date column of the sales dataset
length(unique(sales$`Date`))
## [1] 345
# Sort the entire sales dataset by the Date column, converting dates from string format (e.g., "MM,DD,YYYY") to chronological order from earliest to latest
sales[order(as.Date(sales$`Date`, format="%m,%d,%Y")),]
## Transaction.ID Date Customer.ID Gender Age Product.Category
## <int> <IDat> <char> <char> <int> <char>
## 1: 180 2023-01-01 CUST180 Male 41 Clothing
## 2: 522 2023-01-01 CUST522 Male 46 Beauty
## 3: 559 2023-01-01 CUST559 Female 40 Clothing
## 4: 163 2023-01-02 CUST163 Female 64 Clothing
## 5: 303 2023-01-02 CUST303 Male 19 Electronics
## ---
## 996: 805 2023-12-29 CUST805 Female 30 Beauty
## 997: 908 2023-12-29 CUST908 Male 46 Beauty
## 998: 857 2023-12-31 CUST857 Male 60 Electronics
## 999: 211 2024-01-01 CUST211 Male 42 Beauty
## 1000: 650 2024-01-01 CUST650 Male 55 Electronics
## Quantity Price.per.Unit Total.Amount
## <int> <int> <int>
## 1: 3 300 900
## 2: 3 500 1500
## 3: 4 300 1200
## 4: 3 50 150
## 5: 3 30 90
## ---
## 996: 3 500 1500
## 997: 4 300 1200
## 998: 2 25 50
## 999: 3 500 1500
## 1000: 1 30 30
Here we see we have sales recorded on 345 days beginning on 01/01/2023 and ending on 01/01/2024. This may be useful later to identify seasonality in sales.
Age
# Calculate the number of unique ages in the Age column of the sales dataset
length(unique(sales$`Age`))
## [1] 47
# Sort the entire sales dataset by the Age column, converting age values to numeric and ordering from youngest to oldest
sales[order(as.numeric(sales$`Age`)),]
## Transaction.ID Date Customer.ID Gender Age Product.Category
## <int> <IDat> <char> <char> <int> <char>
## 1: 22 2023-10-15 CUST022 Male 18 Clothing
## 2: 37 2023-05-23 CUST037 Female 18 Beauty
## 3: 58 2023-11-13 CUST058 Male 18 Clothing
## 4: 62 2023-12-27 CUST062 Male 18 Beauty
## 5: 74 2023-11-22 CUST074 Female 18 Beauty
## ---
## 996: 735 2023-10-04 CUST735 Female 64 Clothing
## 997: 758 2023-05-12 CUST758 Male 64 Clothing
## 998: 830 2023-06-22 CUST830 Female 64 Clothing
## 999: 882 2023-06-06 CUST882 Female 64 Electronics
## 1000: 897 2023-09-26 CUST897 Female 64 Electronics
## Quantity Price.per.Unit Total.Amount
## <int> <int> <int>
## 1: 2 50 100
## 2: 3 25 75
## 3: 4 300 1200
## 4: 2 50 100
## 5: 4 500 2000
## ---
## 996: 4 500 2000
## 997: 4 25 100
## 998: 3 50 150
## 999: 2 25 50
## 1000: 2 50 100
We can see we have 47 unique ages of our customers ranging from 18 to 64 years old. This is useful information to further split our data into buckets for age demographics later on.
Quantity, Price.per.Unit and Total.Amount
I have decided to investigate these together because in the previous code chunks, it appears that the Total.Amount column is just the sum of Price.per.Unit x Quantity. Checking this is simple enough;
# Create a new column 'checksum' in the sales dataset by multiplying the Quantity and Price.per.Unit columns
sales$checksum <- sales$Quantity * sales$`Price.per.Unit`
# Create a new column 'isequal' in the sales dataset with TRUE or FALSE values indicating whether Total.Amount equals the checksum column
sales$isequal <- sales$`Total.Amount` == sales$checksum
# Calculate the number of unique values (TRUE/FALSE) in the isequal column of the sales dataset
length(unique(sales$isequal))
## [1] 1
# Display the distinct unique values (TRUE/FALSE) present in the isequal column of the sales dataset
unique(sales$isequal)
## [1] TRUE
Here we see that indeed all of our checksum values we calculated match the Total.Amount column. Unfortunately, because we do not know if there are multiple products with the same Price.per.Unit, so we will have to contain our analysis of the sales data using the Total.Amount.
Please note that within this section, I will stop commenting on the code and will explain what I am attempting to do in text preceding the code.
Now we can begin building some graphical tools to help us answer questions about our data in a more intuitive way using the facts we have extracted from the data so far. Using what we learned earlier, I want to first investigate the difference between genders in our data.
Specifically, I want to build a bar chart that contrasts the total sales amount by each gender for all three product categories in our data. I also want to show the difference between genders in total sales as well.
# Start with the sales dataset
sales %>%
# Group by Product.Category and Gender to aggregate sales data
group_by(Product.Category, Gender) %>%
# Calculate the total sales amount for each group, ignoring NA values; drop grouping structure afterward
summarise(Total = sum(Total.Amount, na.rm = TRUE), .groups = "drop") %>%
# Regroup by Product.Category to enable within-category comparisons
group_by(Product.Category) %>%
# Reframe the data to include Total, Gender, and the difference (Male - Female) for each category
reframe(
Total = Total, # Retain the total sales for each gender
Gender = Gender, # Retain the gender labels
Diff = Total[Gender == "Male"] - Total[Gender == "Female"] # Calculate Male minus Female difference
) %>%
# Begin plotting with ggplot; x-axis is Gender, y-axis is Total, fill color by Gender
ggplot(aes(x = Gender, y = Total, fill = Gender)) +
# Create bars representing total sales for each gender
geom_bar(stat = "identity") +
# Add text labels showing the difference (Male - Female) only above Male bars
geom_text(aes(label = ifelse(Gender == "Male", Diff, "")), vjust = -0.5, size = 3) +
# Facet the plot by Product.Category, wrapping labels to fit within 15 characters per line
facet_wrap(~ Product.Category, labeller = label_wrap_gen(width = 15)) +
# Set plot titles and labels
labs(
title = "Total Amount by Gender Across Product Categories", # Main title
x = "Gender", # X-axis label
y = "Total Amount", # Y-axis label
caption = "Data aggregated by product category and gender" # Caption
) +
# Apply the Economist theme from ggthemes for a clean, professional look
ggthemes::theme_economist() +
# Customize theme elements
theme(
axis.text.x = element_blank(), # Remove x-axis text (Gender labels redundant with fill)
axis.text.y = element_text(size = 8), # Set y-axis text size to 8
axis.title = element_text(size = 12, face = "bold"), # Bold axis titles, size 12
plot.title = element_text(size = 14, face = "bold", hjust = 0.5), # Centered, bold title, size 14
plot.caption = element_text(size = 8, hjust = 0), # Caption size 8, left-aligned
strip.text = element_text(size = 10, face = "bold"), # Bold facet labels, size 10
panel.spacing = unit(1, "lines") # Set spacing between facets to 1 line
) +
# Define custom fill colors for genders: light blue for Male, light pink for Female
scale_fill_manual(values = c("Male" = "#87CEEB", "Female" = "#FFB6C1"))
Now we have some tangible insights we can begin to share with our marketing team. To summarize; Beauty Product Category - Females outspend Males by $6145 Clothing Product Category - Females outspend Males by $6970 Electronics Product Category - Males outspend Females by $3435
Let’s dig a little further to try and build on this information we just gained from above. Let’s build another bar chart, but this time let’s identify the total sales by gender, and the respective breakdown of what percentage of those sales for each product category.
# Calculate total spending by Gender
gender_totals <- sales %>%
group_by(Gender) %>%
summarise(Gender.Total = sum(Total.Amount, na.rm = TRUE), .groups = "drop") # Total sales per gender
# Calculate overall total for the title
grand_total <- sum(sales$Total.Amount, na.rm = TRUE)
# Calculate totals and percentages by Product.Category and Gender, using gender-specific totals
sales_totals <- sales %>%
group_by(Product.Category, Gender) %>%
summarise(Total = sum(Total.Amount, na.rm = TRUE), .groups = "drop") %>% # Per-category and gender totals
left_join(gender_totals, by = "Gender") %>% # Add gender totals to each row
mutate(Percent = Total / Gender.Total * 100) # Calculate percentage of gender's total
# Create a labeller function to combine Gender with its total
gender_labeller <- function(gender) {
totals <- gender_totals$Gender.Total[match(gender, gender_totals$Gender)]
paste(gender, ": $", format(round(totals), big.mark = ","), sep = "")
}
# Plot the data, faceted by Gender with totals in labels
sales_totals %>%
ggplot(aes(x = Product.Category, y = Total, fill = Product.Category)) +
geom_bar(stat = "identity") +
geom_text(aes(label = sprintf("%.1f%%", Percent)), vjust = -0.5, size = 2.5) + # Percentage labels
facet_wrap(~ Gender, labeller = labeller(Gender = gender_labeller)) + # Facet by Gender with custom labels
labs(
title = paste("Total Spending by Product Category (Overall: $",
format(round(grand_total), big.mark = ","), ")", sep = ""),
x = "Product Category",
y = "Total Amount",
caption = "Percentages reflect share of each gender's total sales"
) +
ggthemes::theme_economist() +
theme(
axis.text.x = element_blank(), # No x-axis labels since fill color identifies categories
axis.text.y = element_text(size = 8), # Smaller y-axis text
axis.title = element_text(size = 12, face = "bold"), # Bold axis titles
plot.title = element_text(size = 14, face = "bold", hjust = 0.5), # Centered, bold title
plot.caption = element_text(size = 8, hjust = 0), # Small caption
strip.text = element_text(size = 10, face = "bold"), # Bold facet labels
panel.spacing = unit(1, "lines") # Space between facets
) +
scale_fill_manual(values = c("Clothing" = "#BBE9A5", "Beauty" = "#FFFBAC", "Electronics" = "#FFC79A"))
Now we are starting to get some real insights from our data. We can see that Female spending is higher than Male spending, and it is also more balanced between product categories with the largest difference in spending being 2.8% as opposed to the 5.1% difference in Male spending between Beauty and Electronics.
At this point, depending on our marketing team’s needs, we can make simple recommendations. If they want to target our lowest performing product category for each Gender, we know that the campaign should target Beauty products for both Genders. Let’s build a few more charts so that we can begin building a small deck of recommendations.
# Calculate total spending by Gender and Product Category
sales_totals <- sales %>%
group_by(Product.Category, Gender) %>%
summarise(Total = sum(Total.Amount, na.rm = TRUE), .groups = "drop") %>%
mutate(
Gender_Abbrev = ifelse(Gender == "Female", "F", "M"), # Abbreviate Gender
Category_Gender = paste(Gender_Abbrev, Product.Category, sep = " - ") # Create combined label
) %>%
arrange(desc(Total)) # Sort by Total, largest to smallest
# Calculate grand total for the title
grand_total <- sum(sales$Total.Amount, na.rm = TRUE)
# Plot the data with bars for each Gender-Product Category combination, sorted
ggplot(sales_totals, aes(x = reorder(Category_Gender, -Total), y = Total, fill = Product.Category)) +
geom_bar(stat = "identity", width = 0.7) + # Narrower bars for better spacing
geom_text(aes(label = format(round(Total), big.mark = ",")), vjust = -0.3, size = 3) + # Total sales labels
labs(
title = paste("Spending by Gender and Product Category (Total: $",
format(round(grand_total), big.mark = ","), ")", sep = ""),
x = "Gender - Product Category",
y = "Total Amount",
caption = "Values represent total sales amounts"
) +
ggthemes::theme_economist() +
theme(
axis.text.x = element_text(size = 7, angle = 45, hjust = 1, vjust = 1), # Smaller, aligned x-axis labels
axis.text.y = element_text(size = 7), # Smaller y-axis text
axis.title = element_text(size = 10, face = "bold"), # Slightly smaller bold axis titles
plot.title = element_text(size = 12, face = "bold", hjust = 0.5), # Smaller, centered title
plot.caption = element_text(size = 6, hjust = 0), # Smaller caption
legend.position = "top", # Legend at top
legend.text = element_text(size = 7), # Smaller legend text
legend.title = element_text(size = 8, face = "bold"), # Smaller legend title
panel.grid.major.x = element_blank() # Remove vertical grid lines
) +
scale_fill_manual(values = c("Clothing" = "#BBE9A5", "Beauty" = "#FFFBAC", "Electronics" = "#FFC79A")) +
coord_cartesian(clip = "off") # Prevent clipping of text labels
Here we ranked all of the sales categories of each gender from largest to smallest, with Female Clothing being the largest in sales while Male Beauty Products being the smallest. If you remember earlier, we also have an age column in our dataset. Let’s break down each product category by age demographics of every 10 years starting at 18 until 64 for each gender.
# Calculate female total for percentage denominator first
female_total <- sales %>%
filter(Gender == "Female") %>%
summarise(Total = sum(Total.Amount, na.rm = TRUE)) %>%
pull(Total)
# Calculate total spending by Product Category and Age for Females, then bucket Age into 10-year ranges
sales_totals_age <- sales %>%
filter(Gender == "Female") %>% # Filter for Female only
mutate(
Age = as.numeric(Age), # Ensure Age is numeric
Age_Bucket = cut(Age, breaks = c(17, 27, 37, 47, 57, 64), # Explicit breaks for 5 intervals
labels = c("18-27", "28-37", "38-47", "48-57", "58-64"), # 5 labels
include.lowest = TRUE, right = FALSE) # [17-27), [27-37), etc.
) %>%
group_by(Product.Category, Age_Bucket) %>%
summarise(Total = sum(Total.Amount, na.rm = TRUE), .groups = "drop") %>%
mutate(Percent = Total / female_total * 100) # Calculate percentage of female total
# Calculate grand total for the title (still using full dataset for consistency)
grand_total <- sum(sales$Total.Amount, na.rm = TRUE)
# Plot the data, faceted by Product Category, with shorter and wider vertical bars
ggplot(sales_totals_age, aes(x = Age_Bucket, y = Total, fill = Product.Category)) +
geom_bar(stat = "identity", width = 0.9) + # Wider bars
geom_text(aes(label = sprintf("%.1f%%", Percent)), vjust = -0.3, size = 3) + # Percentage labels
facet_wrap(~ Product.Category, scales = "free_y") + # Facet by Product Category, free y-axis scales
labs(
title = paste("Female Spending by Age Range and Product Category (Total: $",
format(round(grand_total), big.mark = ","), ")", sep = ""),
x = "Age Range",
y = "Total Amount",
caption = "Percentages reflect share of total female sales"
) +
ggthemes::theme_economist() +
theme(
axis.text.x = element_text(size = 7, angle = 45, hjust = 1, vjust = 1), # Smaller, rotated x-axis labels
axis.text.y = element_text(size = 7), # Smaller y-axis text
axis.title = element_text(size = 10, face = "bold"), # Bold axis titles
plot.title = element_text(size = 12, face = "bold", hjust = 0.5), # Centered, bold title
plot.caption = element_text(size = 6, hjust = 0), # Small caption
strip.text = element_text(size = 8, face = "bold"), # Bold facet labels
legend.position = "none", # Remove legend since fill is redundant with facets
panel.grid.major.x = element_blank(), # Remove vertical grid lines
panel.spacing = unit(1, "lines"), # Space between facets
panel.background = element_rect(fill = "#FFE4E9"), # Softer pale pink for plot panels
plot.background = element_rect(fill = "#FFE4E9") # Softer pale pink for entire plot
) +
scale_fill_manual(values = c("Clothing" = "#BBE9A5", "Beauty" = "#FFFBAC", "Electronics" = "#FFC79A")) +
coord_cartesian(clip = "off") # Prevent clipping of text labels
# Calculate male total for percentage denominator first
male_total <- sales %>%
filter(Gender == "Male") %>%
summarise(Total = sum(Total.Amount, na.rm = TRUE)) %>%
pull(Total)
# Calculate total spending by Product Category and Age for Males, then bucket Age into 10-year ranges
sales_totals_age_male <- sales %>%
filter(Gender == "Male") %>% # Filter for Male only
mutate(
Age = as.numeric(Age), # Ensure Age is numeric
Age_Bucket = cut(Age, breaks = c(17, 27, 37, 47, 57, 64), # Explicit breaks for 5 intervals
labels = c("18-27", "28-37", "38-47", "48-57", "58-64"), # 5 labels
include.lowest = TRUE, right = FALSE) # [17-27), [27-37), etc.
) %>%
group_by(Product.Category, Age_Bucket) %>%
summarise(Total = sum(Total.Amount, na.rm = TRUE), .groups = "drop") %>%
mutate(Percent = Total / male_total * 100) # Calculate percentage of male total
# Calculate grand total for the title (still using full dataset for consistency)
grand_total <- sum(sales$Total.Amount, na.rm = TRUE)
# Plot the data for Males, faceted by Product Category, with shorter and wider vertical bars
ggplot(sales_totals_age_male, aes(x = Age_Bucket, y = Total, fill = Product.Category)) +
geom_bar(stat = "identity", width = 0.9) + # Wider bars
geom_text(aes(label = sprintf("%.1f%%", Percent)), vjust = -0.3, size = 3) + # Percentage labels
facet_wrap(~ Product.Category, scales = "free_y") + # Facet by Product Category, free y-axis scales
labs(
title = paste("Male Spending by Age Range and Product Category (Total: $",
format(round(grand_total), big.mark = ","), ")", sep = ""),
x = "Age Range",
y = "Total Amount",
caption = "Percentages reflect share of total male sales"
) +
ggthemes::theme_economist() +
theme(
axis.text.x = element_text(size = 7, angle = 45, hjust = 1, vjust = 1), # Smaller, rotated x-axis labels
axis.text.y = element_text(size = 7), # Smaller y-axis text
axis.title = element_text(size = 10, face = "bold"), # Bold axis titles
plot.title = element_text(size = 12, face = "bold", hjust = 0.5), # Centered, bold title
plot.caption = element_text(size = 6, hjust = 0), # Small caption
strip.text = element_text(size = 8, face = "bold"), # Bold facet labels
legend.position = "none", # Remove legend since fill is redundant with facets
panel.grid.major.x = element_blank(), # Remove vertical grid lines
panel.spacing = unit(1, "lines"), # Space between facets
panel.background = element_rect(fill = "#BCE2F5"), # Softer blue for plot panels
plot.background = element_rect(fill = "#BCE2F5") # Softer blue for entire plot
) +
scale_fill_manual(values = c("Clothing" = "#BBE9A5", "Beauty" = "#FFFBAC", "Electronics" = "#FFC79A")) +
coord_cartesian(clip = "off") # Prevent clipping of text labels
Finally we can start drilling down on recommendations for our marketing team. Specifically, we see spending across all categories for both genders declines after age 57 with drops ranging in 1.4% to 4.2%. The only exception here is Male spending on electronics which increases by 2.2%
Let’s finally quickly sum these:
# Calculate gender-specific totals for percentage denominators
female_total <- sales %>%
filter(Gender == "Female") %>%
summarise(Total = sum(Total.Amount, na.rm = TRUE)) %>%
pull(Total)
male_total <- sales %>%
filter(Gender == "Male") %>%
summarise(Total = sum(Total.Amount, na.rm = TRUE)) %>%
pull(Total)
# --- Female Data ---
# Calculate total spending by Product Category and Age for Females
female_sales <- sales %>%
filter(Gender == "Female") %>%
mutate(
Age = as.numeric(Age),
Age_Bucket = cut(Age, breaks = c(17, 27, 37, 47, 57, 64),
labels = c("18-27", "28-37", "38-47", "48-57", "58-64"),
include.lowest = TRUE, right = FALSE)
) %>%
group_by(Product.Category, Age_Bucket) %>%
summarise(Total = sum(Total.Amount, na.rm = TRUE), .groups = "drop")
# Find the lowest sales total age demographic for each Product Category (Females) and add percentage
female_lowest <- female_sales %>%
group_by(Product.Category) %>%
slice_min(Total, n = 1, with_ties = FALSE) %>% # Get the row with the minimum Total
ungroup() %>%
mutate(Percent = Total / female_total * 100) # Percentage of female total
# --- Male Data ---
# Calculate total spending by Product Category and Age for Males
male_sales <- sales %>%
filter(Gender == "Male") %>%
mutate(
Age = as.numeric(Age),
Age_Bucket = cut(Age, breaks = c(17, 27, 37, 47, 57, 64),
labels = c("18-27", "28-37", "38-47", "48-57", "58-64"),
include.lowest = TRUE, right = FALSE)
) %>%
group_by(Product.Category, Age_Bucket) %>%
summarise(Total = sum(Total.Amount, na.rm = TRUE), .groups = "drop")
# Find the lowest sales total age demographic for each Product Category (Males) and add percentage
male_lowest <- male_sales %>%
group_by(Product.Category) %>%
slice_min(Total, n = 1, with_ties = FALSE) %>% # Get the row with the minimum Total
ungroup() %>%
mutate(Percent = Total / male_total * 100) # Percentage of male total
# --- Print Results ---
cat("Lowest Sales Total Age Demographics by Product Category\n")
## Lowest Sales Total Age Demographics by Product Category
# Print Female Results
cat("Females:\n")
## Females:
for (i in 1:nrow(female_lowest)) {
cat(sprintf("%s: %s with $%s (%.1f%% of female total)\n",
female_lowest$Product.Category[i],
female_lowest$Age_Bucket[i],
format(round(female_lowest$Total[i]), big.mark = ","),
female_lowest$Percent[i]))
}
## Beauty: 58-64 with $8,080 (3.5% of female total)
## Clothing: 28-37 with $13,670 (5.9% of female total)
## Electronics: 58-64 with $9,110 (3.9% of female total)
# Add a separator
cat("\n")
# Print Male Results
cat("Males:\n")
## Males:
for (i in 1:nrow(male_lowest)) {
cat(sprintf("%s: %s with $%s (%.1f%% of male total)\n",
male_lowest$Product.Category[i],
male_lowest$Age_Bucket[i],
format(round(male_lowest$Total[i]), big.mark = ","),
male_lowest$Percent[i]))
}
## Beauty: 58-64 with $7,655 (3.4% of male total)
## Clothing: 58-64 with $10,175 (4.6% of male total)
## Electronics: 28-37 with $10,490 (4.7% of male total)
Based on our latest sales data, here are the lowest-spending demographics we should target in the next campaign to boost sales:
Females:
Beauty: Ages 58-64 ($8,080, 3.5%) – Push anti-aging products via
email/Facebook.
Clothing: Ages 28-37 ($13,670, 5.9%) – Promote workwear on Instagram/TikTok.
Electronics: Ages 58-64 ($9,110, 3.9%) – Highlight easy-to-use gadgets with YouTube tutorials.
Males:
Beauty: Ages 58-64 ($7,655, 3.4%) – Market grooming essentials via
TV/barbershops.
Clothing: Ages 58-64 ($10,175, 4.6%) – Offer comfy casuals through direct mail.
Electronics: Ages 28-37 ($10,490, 4.7%) – Target tech-savvy guys with gaming gear on Twitch/YouTube.