Retail Sales Analysis

This is a retail sales analysis done by me using R Studio. I chose to present this project in this way as a means of communicating my thought process when delivering analysis for stakeholders. The dataset I will be using can be found here: https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset/data.

I also want to state that I will be approaching this analysis from the perspective of a sales analyst, who was asked to analyze the given data for insights to help the marketing department with their next campaign. I will attempt to comment as much code as possible so less technical readers can understand and learn from the code.

Firstly, we want to prepare whichever packages we will be using.

Load Packages

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(data.table)

## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose

library(ggthemes)

Data Exploration

Next, I want to import our data and assign it to a dataframe. After we assign the dataframe, I also want to get a quick understanding of our data structure by summarizing it. Let’s also change some column names:

# Load the data.table package's fread() function to quickly read a CSV file into a data frame
sales <- fread("C:\\Users\\Dave\\Desktop\\Data Practice Sets\\retail_sales_dataset.csv")

# Display a summary of the sales dataset (e.g., min, max, mean, etc.) for each column as loaded
summary(sales)

##  Transaction ID        Date            Customer ID           Gender         
##  Min.   :   1.0   Min.   :2023-01-01   Length:1000        Length:1000       
##  1st Qu.: 250.8   1st Qu.:2023-04-08   Class :character   Class :character  
##  Median : 500.5   Median :2023-06-29   Mode  :character   Mode  :character  
##  Mean   : 500.5   Mean   :2023-07-03                                        
##  3rd Qu.: 750.2   3rd Qu.:2023-10-04                                        
##  Max.   :1000.0   Max.   :2024-01-01                                        
##       Age        Product Category      Quantity     Price per Unit 
##  Min.   :18.00   Length:1000        Min.   :1.000   Min.   : 25.0  
##  1st Qu.:29.00   Class :character   1st Qu.:1.000   1st Qu.: 30.0  
##  Median :42.00   Mode  :character   Median :3.000   Median : 50.0  
##  Mean   :41.39                      Mean   :2.514   Mean   :179.9  
##  3rd Qu.:53.00                      3rd Qu.:4.000   3rd Qu.:300.0  
##  Max.   :64.00                      Max.   :4.000   Max.   :500.0  
##   Total Amount 
##  Min.   :  25  
##  1st Qu.:  60  
##  Median : 135  
##  Mean   : 456  
##  3rd Qu.: 900  
##  Max.   :2000

# Rename all columns in the sales dataset for easier coding, replacing the original column names
colnames(sales) <- c("Transaction.ID", "Date", "Customer.ID", "Gender", "Age", "Product.Category", "Quantity", "Price.per.Unit", "Total.Amount")

# Display a summary of the sales dataset again, now with the updated column names
summary(sales)

##  Transaction.ID        Date            Customer.ID           Gender         
##  Min.   :   1.0   Min.   :2023-01-01   Length:1000        Length:1000       
##  1st Qu.: 250.8   1st Qu.:2023-04-08   Class :character   Class :character  
##  Median : 500.5   Median :2023-06-29   Mode  :character   Mode  :character  
##  Mean   : 500.5   Mean   :2023-07-03                                        
##  3rd Qu.: 750.2   3rd Qu.:2023-10-04                                        
##  Max.   :1000.0   Max.   :2024-01-01                                        
##       Age        Product.Category      Quantity     Price.per.Unit 
##  Min.   :18.00   Length:1000        Min.   :1.000   Min.   : 25.0  
##  1st Qu.:29.00   Class :character   1st Qu.:1.000   1st Qu.: 30.0  
##  Median :42.00   Mode  :character   Median :3.000   Median : 50.0  
##  Mean   :41.39                      Mean   :2.514   Mean   :179.9  
##  3rd Qu.:53.00                      3rd Qu.:4.000   3rd Qu.:300.0  
##  Max.   :64.00                      Max.   :4.000   Max.   :500.0  
##   Total.Amount 
##  Min.   :  25  
##  1st Qu.:  60  
##  Median : 135  
##  Mean   : 456  
##  3rd Qu.: 900  
##  Max.   :2000

Now we can begin making sense of our data by asking questions in a way that our marketing department could find useful. Specifically, we can see the column names are: “Transaction.ID”, “Date”, “Customer.ID”, “Gender”, “Age”, “Product.Category”, “Quantity”, “Price.per.Unit”, and “Total.Amount”

We also noticed earlier that three of our columns are the “character” class; “Gender”, “Customer.ID”, and “Product.Category”, which could be good columns to begin our investigation for useful insights.

Let’s take a closer look at each one.

Gender

# Calculate the number of unique values in the Gender column of the sales dataset
length(unique(sales$Gender))

## [1] 2

# Display the distinct unique values present in the Gender column of the sales dataset
unique(sales$Gender)

## [1] "Male"   "Female"

Knowing our data only contains 2 genders, this may be a good way to split our data to see the differences in sales patterns between the two groups.

Customer.ID

# Calculate the number of unique customer IDs in the Customer.ID column of the sales dataset
length(unique(sales$`Customer.ID`))

## [1] 1000

# Display the first 6 values from the Customer.ID column of the sales dataset
head(sales$`Customer.ID`)

## [1] "CUST001" "CUST002" "CUST003" "CUST004" "CUST005" "CUST006"

# Display the last 6 values from the Customer.ID column of the sales dataset
tail(sales$`Customer.ID`)

## [1] "CUST995"  "CUST996"  "CUST997"  "CUST998"  "CUST999"  "CUST1000"

Here we can see that we have 1000 unique values that begin at “CUST001” and end at “CUST1000”. We either have not been tracking customers as return customers properly or we simply have 1000 unique customers. Since we can’t verify, we will skip over gleaming any details using this column for now.

Product Category

# Calculate the number of unique product categories in the Product.Category column of the sales dataset
length(unique(sales$`Product.Category`))

## [1] 3

# Display the distinct unique values present in the Product.Category column of the sales dataset
unique(sales$`Product.Category`)

## [1] "Beauty"      "Clothing"    "Electronics"

Now we know our sales are split into 3 distinct categories. Let’s continue our investigation into the rest of our columns in the data.

Transaction ID

# Calculate the number of unique transaction IDs in the Transaction.ID column of the sales dataset
length(unique(sales$`Transaction.ID`))

## [1] 1000

This matches our “Customer.ID” columns in unique entries. As we suspected, each transaction has been tracked as a separate customer. Let’s ignore for now.

Date

# Calculate the number of unique dates in the Date column of the sales dataset
length(unique(sales$`Date`))

## [1] 345

# Sort the entire sales dataset by the Date column, converting dates from string format (e.g., "MM,DD,YYYY") to chronological order from earliest to latest
sales[order(as.Date(sales$`Date`, format="%m,%d,%Y")),]

##       Transaction.ID       Date Customer.ID Gender   Age Product.Category
##                <int>     <IDat>      <char> <char> <int>           <char>
##    1:            180 2023-01-01     CUST180   Male    41         Clothing
##    2:            522 2023-01-01     CUST522   Male    46           Beauty
##    3:            559 2023-01-01     CUST559 Female    40         Clothing
##    4:            163 2023-01-02     CUST163 Female    64         Clothing
##    5:            303 2023-01-02     CUST303   Male    19      Electronics
##   ---                                                                    
##  996:            805 2023-12-29     CUST805 Female    30           Beauty
##  997:            908 2023-12-29     CUST908   Male    46           Beauty
##  998:            857 2023-12-31     CUST857   Male    60      Electronics
##  999:            211 2024-01-01     CUST211   Male    42           Beauty
## 1000:            650 2024-01-01     CUST650   Male    55      Electronics
##       Quantity Price.per.Unit Total.Amount
##          <int>          <int>        <int>
##    1:        3            300          900
##    2:        3            500         1500
##    3:        4            300         1200
##    4:        3             50          150
##    5:        3             30           90
##   ---                                     
##  996:        3            500         1500
##  997:        4            300         1200
##  998:        2             25           50
##  999:        3            500         1500
## 1000:        1             30           30

Here we see we have sales recorded on 345 days beginning on 01/01/2023 and ending on 01/01/2024. This may be useful later to identify seasonality in sales.

Age

# Calculate the number of unique ages in the Age column of the sales dataset
length(unique(sales$`Age`))

## [1] 47

# Sort the entire sales dataset by the Age column, converting age values to numeric and ordering from youngest to oldest
sales[order(as.numeric(sales$`Age`)),]

##       Transaction.ID       Date Customer.ID Gender   Age Product.Category
##                <int>     <IDat>      <char> <char> <int>           <char>
##    1:             22 2023-10-15     CUST022   Male    18         Clothing
##    2:             37 2023-05-23     CUST037 Female    18           Beauty
##    3:             58 2023-11-13     CUST058   Male    18         Clothing
##    4:             62 2023-12-27     CUST062   Male    18           Beauty
##    5:             74 2023-11-22     CUST074 Female    18           Beauty
##   ---                                                                    
##  996:            735 2023-10-04     CUST735 Female    64         Clothing
##  997:            758 2023-05-12     CUST758   Male    64         Clothing
##  998:            830 2023-06-22     CUST830 Female    64         Clothing
##  999:            882 2023-06-06     CUST882 Female    64      Electronics
## 1000:            897 2023-09-26     CUST897 Female    64      Electronics
##       Quantity Price.per.Unit Total.Amount
##          <int>          <int>        <int>
##    1:        2             50          100
##    2:        3             25           75
##    3:        4            300         1200
##    4:        2             50          100
##    5:        4            500         2000
##   ---                                     
##  996:        4            500         2000
##  997:        4             25          100
##  998:        3             50          150
##  999:        2             25           50
## 1000:        2             50          100

We can see we have 47 unique ages of our customers ranging from 18 to 64 years old. This is useful information to further split our data into buckets for age demographics later on.

Quantity, Price.per.Unit and Total.Amount

I have decided to investigate these together because in the previous code chunks, it appears that the Total.Amount column is just the sum of Price.per.Unit x Quantity. Checking this is simple enough;

# Create a new column 'checksum' in the sales dataset by multiplying the Quantity and Price.per.Unit columns
sales$checksum <- sales$Quantity * sales$`Price.per.Unit`

# Create a new column 'isequal' in the sales dataset with TRUE or FALSE values indicating whether Total.Amount equals the checksum column
sales$isequal <- sales$`Total.Amount` == sales$checksum

# Calculate the number of unique values (TRUE/FALSE) in the isequal column of the sales dataset
length(unique(sales$isequal))

## [1] 1

# Display the distinct unique values (TRUE/FALSE) present in the isequal column of the sales dataset
unique(sales$isequal)

## [1] TRUE

Here we see that indeed all of our checksum values we calculated match the Total.Amount column. Unfortunately, because we do not know if there are multiple products with the same Price.per.Unit, so we will have to contain our analysis of the sales data using the Total.Amount.

Data Presentation

Please note that within this section, I will stop commenting on the code and will explain what I am attempting to do in text preceding the code.

Now we can begin building some graphical tools to help us answer questions about our data in a more intuitive way using the facts we have extracted from the data so far. Using what we learned earlier, I want to first investigate the difference between genders in our data.

Specifically, I want to build a bar chart that contrasts the total sales amount by each gender for all three product categories in our data. I also want to show the difference between genders in total sales as well.

# Start with the sales dataset
sales %>%
  # Group by Product.Category and Gender to aggregate sales data
  group_by(Product.Category, Gender) %>%
  # Calculate the total sales amount for each group, ignoring NA values; drop grouping structure afterward
  summarise(Total = sum(Total.Amount, na.rm = TRUE), .groups = "drop") %>%
  # Regroup by Product.Category to enable within-category comparisons
  group_by(Product.Category) %>%
  # Reframe the data to include Total, Gender, and the difference (Male - Female) for each category
  reframe(  
    Total = Total,           # Retain the total sales for each gender
    Gender = Gender,         # Retain the gender labels
    Diff = Total[Gender == "Male"] - Total[Gender == "Female"]  # Calculate Male minus Female difference
  ) %>%
  # Begin plotting with ggplot; x-axis is Gender, y-axis is Total, fill color by Gender
  ggplot(aes(x = Gender, y = Total, fill = Gender)) +
  # Create bars representing total sales for each gender
  geom_bar(stat = "identity") +
  # Add text labels showing the difference (Male - Female) only above Male bars
  geom_text(aes(label = ifelse(Gender == "Male", Diff, "")), vjust = -0.5, size = 3) +
  # Facet the plot by Product.Category, wrapping labels to fit within 15 characters per line
  facet_wrap(~ Product.Category, labeller = label_wrap_gen(width = 15)) +
  # Set plot titles and labels
  labs(
    title = "Total Amount by Gender Across Product Categories",  # Main title
    x = "Gender",                                               # X-axis label
    y = "Total Amount",                                         # Y-axis label
    caption = "Data aggregated by product category and gender"  # Caption
  ) +
  # Apply the Economist theme from ggthemes for a clean, professional look
  ggthemes::theme_economist() +
  # Customize theme elements
  theme(
    axis.text.x = element_blank(),           # Remove x-axis text (Gender labels redundant with fill)
    axis.text.y = element_text(size = 8),    # Set y-axis text size to 8
    axis.title = element_text(size = 12, face = "bold"),  # Bold axis titles, size 12
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),  # Centered, bold title, size 14
    plot.caption = element_text(size = 8, hjust = 0),  # Caption size 8, left-aligned
    strip.text = element_text(size = 10, face = "bold"),  # Bold facet labels, size 10
    panel.spacing = unit(1, "lines")         # Set spacing between facets to 1 line
  ) +
  # Define custom fill colors for genders: light blue for Male, light pink for Female
  scale_fill_manual(values = c("Male" = "#87CEEB", "Female" = "#FFB6C1"))

Now we have some tangible insights we can begin to share with our marketing team. To summarize; Beauty Product Category - Females outspend Males by $6145 Clothing Product Category - Females outspend Males by $6970 Electronics Product Category - Males outspend Females by $3435

Let’s dig a little further to try and build on this information we just gained from above. Let’s build another bar chart, but this time let’s identify the total sales by gender, and the respective breakdown of what percentage of those sales for each product category.

# Calculate total spending by Gender
gender_totals <- sales %>%
  group_by(Gender) %>%
  summarise(Gender.Total = sum(Total.Amount, na.rm = TRUE), .groups = "drop")  # Total sales per gender

# Calculate overall total for the title
grand_total <- sum(sales$Total.Amount, na.rm = TRUE)

# Calculate totals and percentages by Product.Category and Gender, using gender-specific totals
sales_totals <- sales %>%
  group_by(Product.Category, Gender) %>%
  summarise(Total = sum(Total.Amount, na.rm = TRUE), .groups = "drop") %>%  # Per-category and gender totals
  left_join(gender_totals, by = "Gender") %>%  # Add gender totals to each row
  mutate(Percent = Total / Gender.Total * 100)  # Calculate percentage of gender's total

# Create a labeller function to combine Gender with its total
gender_labeller <- function(gender) {
  totals <- gender_totals$Gender.Total[match(gender, gender_totals$Gender)]
  paste(gender, ": $", format(round(totals), big.mark = ","), sep = "")
}

# Plot the data, faceted by Gender with totals in labels
sales_totals %>%
  ggplot(aes(x = Product.Category, y = Total, fill = Product.Category)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = sprintf("%.1f%%", Percent)), vjust = -0.5, size = 2.5) +  # Percentage labels
  facet_wrap(~ Gender, labeller = labeller(Gender = gender_labeller)) +  # Facet by Gender with custom labels
  labs(
    title = paste("Total Spending by Product Category (Overall: $", 
                  format(round(grand_total), big.mark = ","), ")", sep = ""),
    x = "Product Category",
    y = "Total Amount",
    caption = "Percentages reflect share of each gender's total sales"
  ) +
  ggthemes::theme_economist() +
  theme(
    axis.text.x = element_blank(),  # No x-axis labels since fill color identifies categories
    axis.text.y = element_text(size = 8),  # Smaller y-axis text
    axis.title = element_text(size = 12, face = "bold"),  # Bold axis titles
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),  # Centered, bold title
    plot.caption = element_text(size = 8, hjust = 0),  # Small caption
    strip.text = element_text(size = 10, face = "bold"),  # Bold facet labels
    panel.spacing = unit(1, "lines")  # Space between facets
  ) +
  scale_fill_manual(values = c("Clothing" = "#BBE9A5", "Beauty" = "#FFFBAC", "Electronics" = "#FFC79A"))

Now we are starting to get some real insights from our data. We can see that Female spending is higher than Male spending, and it is also more balanced between product categories with the largest difference in spending being 2.8% as opposed to the 5.1% difference in Male spending between Beauty and Electronics.

At this point, depending on our marketing team’s needs, we can make simple recommendations. If they want to target our lowest performing product category for each Gender, we know that the campaign should target Beauty products for both Genders. Let’s build a few more charts so that we can begin building a small deck of recommendations.

# Calculate total spending by Gender and Product Category
sales_totals <- sales %>%
  group_by(Product.Category, Gender) %>%
  summarise(Total = sum(Total.Amount, na.rm = TRUE), .groups = "drop") %>%
  mutate(
    Gender_Abbrev = ifelse(Gender == "Female", "F", "M"),  # Abbreviate Gender
    Category_Gender = paste(Gender_Abbrev, Product.Category, sep = " - ")  # Create combined label
  ) %>%
  arrange(desc(Total))  # Sort by Total, largest to smallest

# Calculate grand total for the title
grand_total <- sum(sales$Total.Amount, na.rm = TRUE)

# Plot the data with bars for each Gender-Product Category combination, sorted
ggplot(sales_totals, aes(x = reorder(Category_Gender, -Total), y = Total, fill = Product.Category)) +
  geom_bar(stat = "identity", width = 0.7) +  # Narrower bars for better spacing
  geom_text(aes(label = format(round(Total), big.mark = ",")), vjust = -0.3, size = 3) +  # Total sales labels
  labs(
    title = paste("Spending by Gender and Product Category (Total: $", 
                  format(round(grand_total), big.mark = ","), ")", sep = ""),
    x = "Gender - Product Category",
    y = "Total Amount",
    caption = "Values represent total sales amounts"
  ) +
  ggthemes::theme_economist() +
  theme(
    axis.text.x = element_text(size = 7, angle = 45, hjust = 1, vjust = 1),  # Smaller, aligned x-axis labels
    axis.text.y = element_text(size = 7),  # Smaller y-axis text
    axis.title = element_text(size = 10, face = "bold"),  # Slightly smaller bold axis titles
    plot.title = element_text(size = 12, face = "bold", hjust = 0.5),  # Smaller, centered title
    plot.caption = element_text(size = 6, hjust = 0),  # Smaller caption
    legend.position = "top",  # Legend at top
    legend.text = element_text(size = 7),  # Smaller legend text
    legend.title = element_text(size = 8, face = "bold"),  # Smaller legend title
    panel.grid.major.x = element_blank()  # Remove vertical grid lines
  ) +
  scale_fill_manual(values = c("Clothing" = "#BBE9A5", "Beauty" = "#FFFBAC", "Electronics" = "#FFC79A")) +
  coord_cartesian(clip = "off")  # Prevent clipping of text labels

Here we ranked all of the sales categories of each gender from largest to smallest, with Female Clothing being the largest in sales while Male Beauty Products being the smallest. If you remember earlier, we also have an age column in our dataset. Let’s break down each product category by age demographics of every 10 years starting at 18 until 64 for each gender.

# Calculate female total for percentage denominator first
female_total <- sales %>%
  filter(Gender == "Female") %>%
  summarise(Total = sum(Total.Amount, na.rm = TRUE)) %>%
  pull(Total)

# Calculate total spending by Product Category and Age for Females, then bucket Age into 10-year ranges
sales_totals_age <- sales %>%
  filter(Gender == "Female") %>%  # Filter for Female only
  mutate(
    Age = as.numeric(Age),  # Ensure Age is numeric
    Age_Bucket = cut(Age, breaks = c(17, 27, 37, 47, 57, 64),  # Explicit breaks for 5 intervals
                     labels = c("18-27", "28-37", "38-47", "48-57", "58-64"),  # 5 labels
                     include.lowest = TRUE, right = FALSE)  # [17-27), [27-37), etc.
  ) %>%
  group_by(Product.Category, Age_Bucket) %>%
  summarise(Total = sum(Total.Amount, na.rm = TRUE), .groups = "drop") %>%
  mutate(Percent = Total / female_total * 100)  # Calculate percentage of female total

# Calculate grand total for the title (still using full dataset for consistency)
grand_total <- sum(sales$Total.Amount, na.rm = TRUE)

# Plot the data, faceted by Product Category, with shorter and wider vertical bars
ggplot(sales_totals_age, aes(x = Age_Bucket, y = Total, fill = Product.Category)) +
  geom_bar(stat = "identity", width = 0.9) +  # Wider bars
  geom_text(aes(label = sprintf("%.1f%%", Percent)), vjust = -0.3, size = 3) +  # Percentage labels
  facet_wrap(~ Product.Category, scales = "free_y") +  # Facet by Product Category, free y-axis scales
  labs(
    title = paste("Female Spending by Age Range and Product Category (Total: $", 
                  format(round(grand_total), big.mark = ","), ")", sep = ""),
    x = "Age Range",
    y = "Total Amount",
    caption = "Percentages reflect share of total female sales"
  ) +
  ggthemes::theme_economist() +
  theme(
    axis.text.x = element_text(size = 7, angle = 45, hjust = 1, vjust = 1),  # Smaller, rotated x-axis labels
    axis.text.y = element_text(size = 7),  # Smaller y-axis text
    axis.title = element_text(size = 10, face = "bold"),  # Bold axis titles
    plot.title = element_text(size = 12, face = "bold", hjust = 0.5),  # Centered, bold title
    plot.caption = element_text(size = 6, hjust = 0),  # Small caption
    strip.text = element_text(size = 8, face = "bold"),  # Bold facet labels
    legend.position = "none",  # Remove legend since fill is redundant with facets
    panel.grid.major.x = element_blank(),  # Remove vertical grid lines
    panel.spacing = unit(1, "lines"),  # Space between facets
    panel.background = element_rect(fill = "#FFE4E9"),  # Softer pale pink for plot panels
    plot.background = element_rect(fill = "#FFE4E9")   # Softer pale pink for entire plot
  ) +
  scale_fill_manual(values = c("Clothing" = "#BBE9A5", "Beauty" = "#FFFBAC", "Electronics" = "#FFC79A")) +
  coord_cartesian(clip = "off")  # Prevent clipping of text labels

# Calculate male total for percentage denominator first
male_total <- sales %>%
  filter(Gender == "Male") %>%
  summarise(Total = sum(Total.Amount, na.rm = TRUE)) %>%
  pull(Total)

# Calculate total spending by Product Category and Age for Males, then bucket Age into 10-year ranges
sales_totals_age_male <- sales %>%
  filter(Gender == "Male") %>%  # Filter for Male only
  mutate(
    Age = as.numeric(Age),  # Ensure Age is numeric
    Age_Bucket = cut(Age, breaks = c(17, 27, 37, 47, 57, 64),  # Explicit breaks for 5 intervals
                     labels = c("18-27", "28-37", "38-47", "48-57", "58-64"),  # 5 labels
                     include.lowest = TRUE, right = FALSE)  # [17-27), [27-37), etc.
  ) %>%
  group_by(Product.Category, Age_Bucket) %>%
  summarise(Total = sum(Total.Amount, na.rm = TRUE), .groups = "drop") %>%
  mutate(Percent = Total / male_total * 100)  # Calculate percentage of male total

# Calculate grand total for the title (still using full dataset for consistency)
grand_total <- sum(sales$Total.Amount, na.rm = TRUE)

# Plot the data for Males, faceted by Product Category, with shorter and wider vertical bars
ggplot(sales_totals_age_male, aes(x = Age_Bucket, y = Total, fill = Product.Category)) +
  geom_bar(stat = "identity", width = 0.9) +  # Wider bars
  geom_text(aes(label = sprintf("%.1f%%", Percent)), vjust = -0.3, size = 3) +  # Percentage labels
  facet_wrap(~ Product.Category, scales = "free_y") +  # Facet by Product Category, free y-axis scales
  labs(
    title = paste("Male Spending by Age Range and Product Category (Total: $", 
                  format(round(grand_total), big.mark = ","), ")", sep = ""),
    x = "Age Range",
    y = "Total Amount",
    caption = "Percentages reflect share of total male sales"
  ) +
  ggthemes::theme_economist() +
  theme(
    axis.text.x = element_text(size = 7, angle = 45, hjust = 1, vjust = 1),  # Smaller, rotated x-axis labels
    axis.text.y = element_text(size = 7),  # Smaller y-axis text
    axis.title = element_text(size = 10, face = "bold"),  # Bold axis titles
    plot.title = element_text(size = 12, face = "bold", hjust = 0.5),  # Centered, bold title
    plot.caption = element_text(size = 6, hjust = 0),  # Small caption
    strip.text = element_text(size = 8, face = "bold"),  # Bold facet labels
    legend.position = "none",  # Remove legend since fill is redundant with facets
    panel.grid.major.x = element_blank(),  # Remove vertical grid lines
    panel.spacing = unit(1, "lines"),  # Space between facets
    panel.background = element_rect(fill = "#BCE2F5"),  # Softer blue for plot panels
    plot.background = element_rect(fill = "#BCE2F5")   # Softer blue for entire plot
  ) +
  scale_fill_manual(values = c("Clothing" = "#BBE9A5", "Beauty" = "#FFFBAC", "Electronics" = "#FFC79A")) +
  coord_cartesian(clip = "off")  # Prevent clipping of text labels

Finally we can start drilling down on recommendations for our marketing team. Specifically, we see spending across all categories for both genders declines after age 57 with drops ranging in 1.4% to 4.2%. The only exception here is Male spending on electronics which increases by 2.2%

Let’s finally quickly sum these:

# Calculate gender-specific totals for percentage denominators
female_total <- sales %>%
  filter(Gender == "Female") %>%
  summarise(Total = sum(Total.Amount, na.rm = TRUE)) %>%
  pull(Total)

male_total <- sales %>%
  filter(Gender == "Male") %>%
  summarise(Total = sum(Total.Amount, na.rm = TRUE)) %>%
  pull(Total)

# --- Female Data ---
# Calculate total spending by Product Category and Age for Females
female_sales <- sales %>%
  filter(Gender == "Female") %>%
  mutate(
    Age = as.numeric(Age),
    Age_Bucket = cut(Age, breaks = c(17, 27, 37, 47, 57, 64),
                     labels = c("18-27", "28-37", "38-47", "48-57", "58-64"),
                     include.lowest = TRUE, right = FALSE)
  ) %>%
  group_by(Product.Category, Age_Bucket) %>%
  summarise(Total = sum(Total.Amount, na.rm = TRUE), .groups = "drop")

# Find the lowest sales total age demographic for each Product Category (Females) and add percentage
female_lowest <- female_sales %>%
  group_by(Product.Category) %>%
  slice_min(Total, n = 1, with_ties = FALSE) %>%  # Get the row with the minimum Total
  ungroup() %>%
  mutate(Percent = Total / female_total * 100)  # Percentage of female total

# --- Male Data ---
# Calculate total spending by Product Category and Age for Males
male_sales <- sales %>%
  filter(Gender == "Male") %>%
  mutate(
    Age = as.numeric(Age),
    Age_Bucket = cut(Age, breaks = c(17, 27, 37, 47, 57, 64),
                     labels = c("18-27", "28-37", "38-47", "48-57", "58-64"),
                     include.lowest = TRUE, right = FALSE)
  ) %>%
  group_by(Product.Category, Age_Bucket) %>%
  summarise(Total = sum(Total.Amount, na.rm = TRUE), .groups = "drop")

# Find the lowest sales total age demographic for each Product Category (Males) and add percentage
male_lowest <- male_sales %>%
  group_by(Product.Category) %>%
  slice_min(Total, n = 1, with_ties = FALSE) %>%  # Get the row with the minimum Total
  ungroup() %>%
  mutate(Percent = Total / male_total * 100)  # Percentage of male total

# --- Print Results ---
cat("Lowest Sales Total Age Demographics by Product Category\n")

## Lowest Sales Total Age Demographics by Product Category

# Print Female Results
cat("Females:\n")

## Females:

for (i in 1:nrow(female_lowest)) {
  cat(sprintf("%s: %s with $%s (%.1f%% of female total)\n",
              female_lowest$Product.Category[i],
              female_lowest$Age_Bucket[i],
              format(round(female_lowest$Total[i]), big.mark = ","),
              female_lowest$Percent[i]))
}

## Beauty: 58-64 with $8,080 (3.5% of female total)
## Clothing: 28-37 with $13,670 (5.9% of female total)
## Electronics: 58-64 with $9,110 (3.9% of female total)

# Add a separator
cat("\n")

# Print Male Results
cat("Males:\n")

## Males:

for (i in 1:nrow(male_lowest)) {
  cat(sprintf("%s: %s with $%s (%.1f%% of male total)\n",
              male_lowest$Product.Category[i],
              male_lowest$Age_Bucket[i],
              format(round(male_lowest$Total[i]), big.mark = ","),
              male_lowest$Percent[i]))
}

## Beauty: 58-64 with $7,655 (3.4% of male total)
## Clothing: 58-64 with $10,175 (4.6% of male total)
## Electronics: 28-37 with $10,490 (4.7% of male total)

Recommendation

Based on our latest sales data, here are the lowest-spending demographics we should target in the next campaign to boost sales:

Females:
Beauty: Ages 58-64 ($8,080, 3.5%) – Push anti-aging products via email/Facebook.

Clothing: Ages 28-37 ($13,670, 5.9%) – Promote workwear on Instagram/TikTok.

Electronics: Ages 58-64 ($9,110, 3.9%) – Highlight easy-to-use gadgets with YouTube tutorials.

Males:
Beauty: Ages 58-64 ($7,655, 3.4%) – Market grooming essentials via TV/barbershops.

Clothing: Ages 58-64 ($10,175, 4.6%) – Offer comfy casuals through direct mail.

Electronics: Ages 28-37 ($10,490, 4.7%) – Target tech-savvy guys with gaming gear on Twitch/YouTube.

Sales Analysis

David Kubacki

2025-04-01

Retail Sales Analysis

Load Packages

Data Exploration

Data Presentation

Recommendation