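library(dplyr)      # mutate(), group_by(), summarise(), glimpse()
library(lubridate)  # mdy(), hms(), year(), month(), wday()
library(forcats)    # fct_explicit_na()
library(ggplot2)    # plots later in the report

sales <- read.csv(
  "C:/Users/aleen/Downloads/Prof Zhan Class/Christmas Sales and Trends.csv",
  stringsAsFactors = FALSE
)
sales <- sales |>
  mutate(Date = mdy(Date))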

cat("Date NA count:", sum(is.na(sales$Date)), "\n")
## Date NA count: 0
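sales <- sales |>
  mutate(
    Time = hms(Time),
    # to_logical() is not defined in this section; it is assumed to be a
    # helper from earlier in the report that converts flag columns to TRUE/FALSE
    OnlineOrderFlag  = to_logical(OnlineOrderFlag),
    PromotionApplied = to_logical(PromotionApplied),
    GiftWrap         = to_logical(GiftWrap),
    ReturnFlag       = to_logical(ReturnFlag),
    Category         = as.factor(Category),
    PaymentType      = as.factor(PaymentType),
    Weather          = as.factor(Weather),
    # fct_explicit_na() is deprecated since forcats 1.0.0;
    # fct_na_value_to_level() is its replacement
    Event            = fct_explicit_na(as.factor(Event), na_level = "None"),
    Gender           = as.factor(Gender),
    Location         = as.factor(Location),
    Year        = year(Date),
    Month       = month(Date, label = TRUE, abbr = TRUE),
    DayOfWeek   = wday(Date, label = TRUE, abbr = TRUE),
    Channel     = factor(if_else(OnlineOrderFlag, "Online", "In-store")),
    # Discount as a share of the pre-discount price, assuming TotalPrice
    # is recorded after the discount has been subtracted
    DiscountPct = if_else(
      TotalPrice + DiscountAmount > 0,
      DiscountAmount / (TotalPrice + DiscountAmount),
      0
    )
  )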

glimpse(sales)
## Rows: 10,000
## Columns: 30
## $ TransactionID        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ Date                 <date> 2020-12-24, 2022-11-18, 2019-12-26, 2018-11-13, …
## $ Time                 <Period> 7H 27M 59S, 14H 36M 39S, 20H 23M 50S, 23H 8M 8…
## $ CustomerID           <int> 441, 340, 31, 39, 344, 307, 368, 121, 464, 252, 3…
## $ Age                  <int> 27, 43, 25, 64, 26, 18, 63, 48, 58, 41, 50, 51, 2…
## $ Gender               <fct> Other, Male, Other, Male, Other, Male, Female, Ot…
## $ Location             <fct> City_15, City_13, City_7, City_20, City_10, City_…
## $ StoreID              <int> NA, NA, 92, 100, 90, 79, NA, NA, 87, 54, 47, NA, …
## $ OnlineOrderFlag      <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRU…
## $ ProductID            <int> 106, 816, 508, 710, 687, 589, 158, 829, 987, 470,…
## $ ProductName          <chr> "Toys_Product", "Clothing_Product", "Clothing_Pro…
## $ Category             <fct> Toys, Clothing, Clothing, Toys, Toys, Decorations…
## $ Quantity             <int> 5, 1, 2, 5, 3, 1, 3, 3, 2, 1, 5, 1, 3, 1, 3, 2, 1…
## $ UnitPrice            <dbl> 96.78625, 95.27958, 52.37165, 63.64729, 57.38404,…
## $ TotalPrice           <dbl> 483.93127, 95.27958, 104.74329, 318.23646, 172.15…
## $ PaymentType          <fct> Credit Card, Credit Card, Credit Card, Debit Card…
## $ PromotionApplied     <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TR…
## $ DiscountAmount       <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 0.000000,…
## $ GiftWrap             <lgl> FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE…
## $ ShippingMethod       <chr> "Standard", "Express", "", "", "", "", "Express",…
## $ DeliveryTime         <int> 5, 3, NA, NA, NA, NA, 2, 3, NA, NA, NA, 3, 3, NA,…
## $ Weather              <fct> Snowy, Sunny, Rainy, Rainy, Sunny, Rainy, Sunny, …
## $ Event                <fct> None, None, Christmas Market, None, Christmas Mar…
## $ CustomerSatisfaction <int> 5, 2, 4, 1, 4, 3, 3, 5, 1, 2, 4, 5, 4, 4, 3, 2, 1…
## $ ReturnFlag           <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALS…
## $ Year                 <dbl> 2020, 2022, 2019, 2018, 2020, 2018, 2020, 2020, 2…
## $ Month                <ord> Dec, Nov, Dec, Nov, Dec, Nov, Dec, Dec, Nov, Nov,…
## $ DayOfWeek            <ord> Thu, Fri, Thu, Tue, Sun, Mon, Tue, Thu, Tue, Sat,…
## $ Channel              <fct> Online, Online, In-store, In-store, In-store, In-…
## $ DiscountPct          <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0…
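
The data preparation step calls to_logical(), but its definition does not appear in this section. The sketch below is a hypothetical stand-in, assuming the flag columns arrive as text or 0/1 values; the accepted spellings are an assumption and the author's actual helper may differ.

```r
# Hypothetical helper: an assumed implementation, not the author's original.
# Maps common true/false spellings to logical values; anything else becomes NA.
to_logical <- function(x) {
  if (is.logical(x)) return(x)
  key <- tolower(trimws(as.character(x)))
  out <- rep(NA, length(key))
  out[key %in% c("true", "t", "yes", "y", "1")] <- TRUE
  out[key %in% c("false", "f", "no", "n", "0")] <- FALSE
  out
}

# Unrecognised values are returned as NA so they can be counted and inspected
flags <- to_logical(c("True", "FALSE", "1", "0", "maybe"))
flags
```

Returning NA for unrecognised values, rather than guessing, keeps bad input visible in a later sum(is.na(...)) check.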

Analytical Perspective

For this analysis, I approached the dataset from the perspective of a retailer evaluating holiday shopping behavior. The goal was to understand how customers purchase, respond to promotions, return products, and differ in spending behavior during the Christmas season. By analyzing sales patterns, return behavior, promotional effects, and customer segments, this project aims to identify insights that could support data-driven decision making during one of the most important retail periods of the year.

rev_cat <- sales |>
  group_by(Category) |>
  summarise(
    Revenue = sum(TotalPrice, na.rm = TRUE),
    Transactions = n(),
    AvgRevenuePerItem = Revenue / Transactions,  # per transaction row (line item), not per unit sold
    .groups = "drop"
  ) |>
  arrange(desc(Revenue))

rev_cat
## # A tibble: 5 × 4
##   Category    Revenue Transactions AvgRevenuePerItem
##   <fct>         <dbl>        <int>             <dbl>
## 1 Toys        340313.         2011              169.
## 2 Electronics 336650.         2053              164.
## 3 Food        332607.         1991              167.
## 4 Decorations 323813.         1995              162.
## 5 Clothing    320877.         1950              165.
# Two festive colors recycled per bar
christmas_palette <- c("#C1121F", "#1B5E20")  # Red & Green
cols_use <- rep(christmas_palette, length.out = nrow(rev_cat))
names(cols_use) <- rev_cat$Category

ggplot(rev_cat, aes(x = reorder(Category, Revenue), y = Revenue, fill = Category)) +
  geom_col(width = 0.75) +
  geom_text(
    aes(label = paste0("$", scales::comma(Revenue))),
    hjust = -0.1,
    size = 3.8,
    fontface = "bold"
  ) +
  coord_flip(clip = "off") +
  scale_y_continuous(
    labels = scales::dollar_format(),
    expand = expansion(mult = c(0, 0.2))
  ) +
  scale_fill_manual(values = cols_use, guide = "none") +
  labs(
    title = "🎄 Christmas Revenue by Category",
    subtitle = "Bars show total revenue — labels show exact totals",
    x = "Product Category",
    y = "Revenue"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.margin = margin(10, 50, 10, 10),
    plot.title = element_text(face="bold", size=16)
  )

This chart shows how total revenue is distributed across product categories during the holiday season. Revenue is spread fairly evenly: Toys leads at about $340,313 and Clothing trails at about $320,877, a gap of roughly 6%, so each of the five categories contributes about one fifth of total sales. No single category dominates, which means that differences in revenue alone give little basis for prioritizing one category's inventory, shelf placement, or marketing over another's during peak shopping periods.
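
One way to check how concentrated revenue is across categories is to turn the totals into shares. The sketch below uses only the rounded totals printed in the rev_cat output above; the object name rev_cat_printed is introduced here for illustration.

```r
# Revenue totals copied from the rev_cat output above (rounded to whole dollars)
rev_cat_printed <- data.frame(
  Category = c("Toys", "Electronics", "Food", "Decorations", "Clothing"),
  Revenue  = c(340313, 336650, 332607, 323813, 320877)
)

# Share of total revenue contributed by each category
rev_cat_printed$Share <- rev_cat_printed$Revenue / sum(rev_cat_printed$Revenue)

# Ratio of the largest category to the smallest
spread_ratio <- max(rev_cat_printed$Revenue) / min(rev_cat_printed$Revenue)

print(transform(rev_cat_printed, Share = round(Share, 3)))
round(spread_ratio, 3)
```

On the full data, the same calculation can be applied directly to rev_cat with mutate(Share = Revenue / sum(Revenue)).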

# Calculate return rate by category
return_rate_cat <- sales |>
  group_by(Category) |>
  summarise(
    ReturnRate = mean(as.numeric(ReturnFlag), na.rm = TRUE),
    Transactions = n(),
    .groups = "drop"
  ) |>
  arrange(desc(ReturnRate))

# Plot return rate by category
ggplot(return_rate_cat, aes(x = reorder(Category, ReturnRate), y = ReturnRate, fill = Category)) +
  geom_col(width = 0.7) +
  geom_text(
    aes(label = scales::percent(ReturnRate, accuracy = 0.1)),
    hjust = -0.1,
    size = 3.6,
    fontface = "bold"
  ) +
  coord_flip(clip = "off") +
  scale_y_continuous(
    labels = scales::percent_format(),
    expand = expansion(mult = c(0, 0.25))
  ) +
  scale_fill_manual(values = rep(c("#C1121F", "#1B5E20"), length.out = nrow(return_rate_cat)),
                    guide = "none") +
  labs(
    title = "📦 Return Rate by Product Category",
    subtitle = "Percentage of transactions that were returned",
    x = "Product Category",
    y = "Return Rate"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.margin = margin(10, 50, 10, 10)
  )

This chart compares how often transactions in each category are returned. The exact rates and their ranking should be read from return_rate_cat; whether the gaps between categories (for example, between Electronics or Clothing and Food or Decorations) are larger than chance would produce has not been tested here. If the differences hold up, they would suggest that returns are influenced more by product characteristics than by overall sales volume.
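
Whether return rates really differ by category can be checked with a chi-squared test of independence. The sketch below is a minimal illustration: the function name is hypothetical, and the data frame is simulated stand-in data, not the Christmas sales file, so its numbers say nothing about the real dataset.

```r
# Hypothetical helper: expects a Category column and a logical ReturnFlag column
test_return_by_category <- function(df) {
  tab <- table(df$Category, df$ReturnFlag)
  list(
    rates = prop.table(tab, margin = 1)[, "TRUE"],  # return rate per category
    test  = chisq.test(tab)                         # H0: returns independent of category
  )
}

# Simulated stand-in data (NOT the real sales data), only to show the call
set.seed(1)
toy <- data.frame(
  Category   = sample(c("Toys", "Food", "Clothing"), 600, replace = TRUE),
  ReturnFlag = sample(c(TRUE, FALSE), 600, replace = TRUE)
)

res <- test_return_by_category(toy)
round(res$rates, 3)
res$test$p.value
```

With the report's data the call would be test_return_by_category(sales); a large p-value would mean the category differences in the chart are within sampling noise.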

library(ggrepel)

weather_detail <- sales |>
  filter(!is.na(Weather), !is.na(Channel), !is.na(TotalPrice)) |>
  group_by(Weather, Channel) |>
  summarise(
    Transactions = n(),
    AvgSpend = mean(TotalPrice, na.rm = TRUE),
    SdSpend = sd(TotalPrice, na.rm = TRUE),
    SeSpend = SdSpend / sqrt(Transactions),
    .groups = "drop"
  ) |>
  mutate(
    # shorter label = less overlap
    Label = paste0("n=", Transactions, " | ", scales::dollar(AvgSpend)),
    # push channels slightly apart horizontally
    nudge_x = if_else(Channel == "In-store", -0.25, 0.25),
    # push channels opposite directions vertically
    nudge_y = if_else(Channel == "In-store", -0.35, 0.35)
  )

ggplot(weather_detail, aes(x = Weather, y = AvgSpend, color = Channel, group = Channel)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 4) +
  geom_errorbar(
    aes(ymin = AvgSpend - SeSpend, ymax = AvgSpend + SeSpend),
    width = 0.12,
    linewidth = 0.8
  ) +
  geom_text_repel(
    aes(label = Label),
    # key settings to stop overlap
    max.overlaps = Inf,
    force = 20,
    force_pull = 1,
    box.padding = 0.8,
    point.padding = 0.6,
    min.segment.length = 0,
    segment.alpha = 0.6,
    # nudge so they start separated
    nudge_x = weather_detail$nudge_x,
    nudge_y = weather_detail$nudge_y,
    direction = "both",
    seed = 123,
    size = 3.4,
    fontface = "bold",
    show.legend = FALSE
  ) +
  scale_y_continuous(labels = scales::dollar_format()) +
  scale_color_manual(values = c("Online" = "#1B5E20", "In-store" = "#C1121F")) +
  labs(
    title = "Weather vs Holiday Spending by Channel",
    subtitle = "Avg spend per line item (±1 SE)",
    x = "Weather",
    y = "Average Spend (TotalPrice)",
    color = "Shopping Channel"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),
    legend.position = "right"
  )

This visualization compares average spend per transaction across weather conditions for each shopping channel, with error bars of ±1 standard error. The differences in average spending are relatively modest, which suggests that weather does not dramatically alter how much customers spend per transaction. Whether weather shifts customers between online and in-store shopping cannot be read from average spend; that requires comparing the transaction counts (the n in each label) across weather conditions.
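
The question of whether weather shifts shoppers between channels is about transaction shares, not average spend. A minimal sketch of that calculation follows; the function name is hypothetical and the data are simulated stand-ins, not the real sales file.

```r
# Hypothetical helper: share of transactions per channel within each weather type
channel_share_by_weather <- function(df) {
  prop.table(table(df$Weather, df$Channel), margin = 1)
}

# Simulated stand-in data (NOT the real sales data), only to show the call
set.seed(2)
toy <- data.frame(
  Weather = sample(c("Snowy", "Sunny", "Rainy"), 300, replace = TRUE),
  Channel = sample(c("Online", "In-store"), 300, replace = TRUE)
)

shares <- channel_share_by_weather(toy)
round(shares, 3)  # each row sums to 1
```

Applied to sales, rows with a noticeably higher Online share would support the idea that the weather condition moves purchases online.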

# Keep PromoLabel simple (no funky factor tricks that squeeze the plot)
sales <- sales |>
  mutate(PromoLabel = if_else(PromotionApplied, "Promo Applied", "No Promo"))

# Summary stats
promo_stats <- sales |>
  group_by(PromoLabel) |>
  summarise(
    n = n(),
    mean_spend = mean(TotalPrice, na.rm = TRUE),
    median_spend = median(TotalPrice, na.rm = TRUE),
    .groups = "drop"
  )

# Build a single "key" string to display on the right
promo_key <- paste0(
  "Promo Applied:\n",
  "  n = ", promo_stats$n[promo_stats$PromoLabel == "Promo Applied"], "\n",
  "  Mean = ", scales::dollar(promo_stats$mean_spend[promo_stats$PromoLabel == "Promo Applied"]), "\n",
  "  Median = ", scales::dollar(promo_stats$median_spend[promo_stats$PromoLabel == "Promo Applied"]), "\n\n",
  "No Promo:\n",
  "  n = ", promo_stats$n[promo_stats$PromoLabel == "No Promo"], "\n",
  "  Mean = ", scales::dollar(promo_stats$mean_spend[promo_stats$PromoLabel == "No Promo"]), "\n",
  "  Median = ", scales::dollar(promo_stats$median_spend[promo_stats$PromoLabel == "No Promo"])
)

# Plot
p <- ggplot(sales, aes(x = PromoLabel, y = TotalPrice, fill = PromoLabel)) +
  geom_boxplot(width = 0.6, outlier.alpha = 0.2) +

  # Mean marker (white diamond)
  stat_summary(
    fun = mean,
    geom = "point",
    shape = 23,
    size = 4,
    fill = "white",
    color = "black",
    stroke = 1.2
  ) +

  scale_fill_manual(values = c("Promo Applied" = "#1B5E20",
                               "No Promo" = "#C1121F")) +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(
    title = "💸 Effect of Promotions on Spending",
    subtitle = "Boxplots show spending distribution; white diamond = mean.",
    x = "",
    y = "Line Item Spend (TotalPrice)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold", size = 16),
    plot.margin = margin(10, 260, 10, 10) # extra room on right for the key
  )

# Print plot first
p

# Add the key text in the right margin area
grid::grid.text(
  promo_key,
  x = unit(0.98, "npc"),
  y = unit(0.55, "npc"),
  just = c("right", "center"),
  gp = grid::gpar(fontsize = 11, fontface = "bold")
)

This boxplot compares the distribution of line-item spend for transactions with and without a promotion. Typical spending is similar in the two groups: the means are about $164.60 with a promotion and $166.26 without, as the t-test later in the report confirms. Any difference in spread should be checked against the interquartile range or standard deviation of each group before concluding that promotions amplify differences in customer responses. What the plot does show is that promotions do not uniformly increase spending.
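
Whether promotional purchases vary more than non-promotional ones can be checked by comparing spread statistics per group. The sketch below is one way to do that; the function name is hypothetical and the data are simulated stand-ins, not the real sales file.

```r
# Hypothetical helper: centre and spread of TotalPrice per promotion group
spend_spread_by_promo <- function(df) {
  groups <- split(df$TotalPrice, df$PromotionApplied)
  do.call(rbind, lapply(groups, function(x) {
    x <- x[!is.na(x)]
    data.frame(n = length(x), median = median(x), IQR = IQR(x), SD = sd(x))
  }))
}

# Simulated stand-in data (NOT the real sales data), only to show the call
set.seed(3)
toy <- data.frame(
  PromotionApplied = rep(c(TRUE, FALSE), each = 200),
  TotalPrice       = runif(400, min = 10, max = 500)
)

spread <- spend_spread_by_promo(toy)
spread  # one row per group; row names are "FALSE" and "TRUE"
```

Similar IQR and SD values in the two rows would indicate that promotions leave the spread of spending roughly unchanged.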

# -------------------------------
# 1. Select numeric variables that describe shopping behavior
# -------------------------------
cluster_data <- sales |>
  dplyr::select(
    Age,
    Quantity,
    UnitPrice,
    TotalPrice,
    DiscountAmount,
    CustomerSatisfaction
  ) |>
  na.omit()

# -------------------------------
# 2. Scale data so variables are comparable
# -------------------------------
scaled_data <- scale(cluster_data)

# -------------------------------
# 3. Run k-means clustering
# -------------------------------
set.seed(123)
k <- 4
km <- kmeans(scaled_data, centers = k, nstart = 25)

# -------------------------------
# 4. Create a cluster profile table (FOR INTERPRETATION)
# -------------------------------
cluster_profiles <- cluster_data |>
  mutate(Cluster = factor(km$cluster)) |>
  group_by(Cluster) |>
  summarise(
    AvgAge = round(mean(Age), 1),
    AvgQuantity = round(mean(Quantity), 2),
    AvgUnitPrice = round(mean(UnitPrice), 2),
    AvgTotalSpend = round(mean(TotalPrice), 2),
    AvgDiscount = round(mean(DiscountAmount), 2),
    AvgSatisfaction = round(mean(CustomerSatisfaction), 2),
    Transactions = n(),
    .groups = "drop"
  )

print(cluster_profiles)
## # A tibble: 4 × 8
##   Cluster AvgAge AvgQuantity AvgUnitPrice AvgTotalSpend AvgDiscount
##   <fct>    <dbl>       <dbl>        <dbl>         <dbl>       <dbl>
## 1 1         44.2        1.58         70.4         111.         4.61
## 2 2         43.9        4.13         72.2         295.        29.5 
## 3 3         43.4        2.99         27.4          81.6        3.26
## 4 4         44.1        4.18         73.6         304.         0   
## # ℹ 2 more variables: AvgSatisfaction <dbl>, Transactions <int>
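# -------------------------------
# 5. Assign meaningful names to clusters
# (Names are matched to the cluster_profiles output above; revisit them
#  if the seed, k, or input variables change, since cluster numbers can shift)
# -------------------------------
cluster_names <- c(
  "Low-Commitment Shoppers",   # cluster 1: fewest items (~1.6), ~$111 spend
  "Discount-Driven Shoppers",  # cluster 2: largest avg discount (~$29.50)
  "Budget Item Buyers",        # cluster 3: lowest unit price (~$27), ~$82 spend
  "High-Value Loyalists"       # cluster 4: ~4.2 items, ~$304 spend, no discount
)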

# -------------------------------
# 6. Run PCA for visualization
# -------------------------------
pca <- prcomp(scaled_data)

pca_df <- as.data.frame(pca$x[, 1:2]) |>
  mutate(
    Segment = factor(cluster_names[km$cluster])
  )

# Calculate segment centers
centers <- pca_df |>
  group_by(Segment) |>
  summarise(
    PC1 = mean(PC1),
    PC2 = mean(PC2),
    .groups = "drop"
  )

# -------------------------------
# 7. Plot PCA with named shopper segments
# -------------------------------
ggplot(pca_df, aes(PC1, PC2, color = Segment)) +
  geom_point(alpha = 0.6, size = 2.3) +
  geom_point(
    data = centers,
    aes(PC1, PC2),
    size = 6,
    shape = 4,
    stroke = 2
  ) +
  labs(
    title = "Holiday Shopper Segments (k-means) in PCA Space",
    subtitle = "Each point represents a transaction; colors represent k-means segments",
    x = "Principal Component 1",
    y = "Principal Component 2",
    color = "Shopper Segment"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    legend.position = "right"
  )

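The k-means clustering, projected onto the first two principal components, separates transactions into four groups that differ mainly in quantity purchased, unit price, total spend, and discount amount. Average age is nearly identical across clusters (43.4 to 44.2), so age contributes little to the segmentation. Because each row is a transaction rather than a customer, and k-means returns four clusters whenever k = 4 is requested, these groups are best read as purchase patterns rather than proof of distinct shopper types. Even so, they illustrate the value of looking beyond averages when designing holiday marketing and pricing strategies.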
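
Two checks help judge a segmentation like this: how the within-cluster sum of squares falls as k grows (the elbow heuristic), and which variables load on each principal component. The sketch below illustrates both on simulated stand-in data, not the real sales file; toy_scaled plays the role of scaled_data and its columns are assumptions chosen for the example.

```r
# Simulated stand-in for scaled_data (NOT the real sales data)
set.seed(123)
toy_scaled <- scale(cbind(
  Quantity  = c(rnorm(100, mean = 2), rnorm(100, mean = 5)),
  UnitPrice = c(rnorm(100, mean = 30, sd = 5), rnorm(100, mean = 70, sd = 5)),
  Age       = rnorm(200, mean = 44, sd = 12)
))

# Elbow heuristic: total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})
round(wss, 1)

# PCA loadings: which variables drive PC1 and PC2
pca_toy <- prcomp(toy_scaled)
round(pca_toy$rotation[, 1:2], 2)
```

On the report's data, replacing toy_scaled with scaled_data would show whether k = 4 sits near a bend in the curve, and the loadings would indicate what each PCA axis represents.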

promo_test <- t.test(TotalPrice ~ PromotionApplied, data=sales)
promo_test
## 
##  Welch Two Sample t-test
## 
## data:  TotalPrice by PromotionApplied
## t = 0.71375, df = 9978.6, p-value = 0.4754
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  -2.895269  6.211114
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            166.2618            164.6039

The Welch t-test does not provide evidence of a difference in average spending between promotional and non-promotional transactions (t = 0.71, p = 0.48). Mean line-item spend was $166.26 without a promotion and $164.60 with one, and the 95% confidence interval for the difference (-$2.90 to $6.21) includes zero. In this dataset, promotions are therefore not associated with a higher average purchase value.
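
To make the test output easier to read, the confidence interval and p-value can be reconstructed from the reported means, t statistic, and degrees of freedom. The sketch below uses only numbers printed in the output above; the variable names are introduced here for illustration.

```r
# Values copied from the Welch t-test output above
mean_no_promo <- 166.2618
mean_promo    <- 164.6039
t_stat        <- 0.71375
df_welch      <- 9978.6

diff_means <- mean_no_promo - mean_promo   # difference in means, about $1.66
se_diff    <- diff_means / t_stat          # standard error implied by t
ci         <- diff_means + c(-1, 1) * qt(0.975, df_welch) * se_diff
p_value    <- 2 * pt(-abs(t_stat), df_welch)

round(c(diff = diff_means, lower = ci[1], upper = ci[2], p = p_value), 4)
```

The difference of about $1.66 is roughly 1% of average spend, and the interval spans zero, which is why the test gives no support for a promotion effect on the mean.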

Business Analysis & Managerial Implications

Taken together, the results of this analysis suggest that holiday retail performance is driven more by customer behavior and product characteristics than by seasonality alone. Revenue is spread almost evenly across the five product categories, so no single category drives holiday sales, while any variation in return rates underscores the need to manage post-purchase risk alongside sales growth.

Promotions were not associated with higher average spend per transaction (Welch t-test, p = 0.48), which indicates that broad discounting does not reliably raise purchase value and that targeted promotions may be worth testing instead. Additionally, the cluster profiles show groups of transactions that differ in quantity, unit price, and discount amount, which suggests that pricing and quantity incentives will not affect all purchases in the same way.

From a managerial perspective, retailers can benefit from aligning inventory planning, promotional strategies, and customer engagement efforts with these insights. By tailoring approaches to distinct shopper segments and product categories, retailers can improve profitability, reduce operational strain from returns, and enhance customer satisfaction during the holiday season.