DATA 4030 Final Project

Author

Team Owls

Analyzing The Chocolate Market: Team Owls Final Project for DATA 4030

Team Members:

  • Asa Gardner - Data Science & Analytics Major, Sophmore

  • Jacob Aliff - Data Science & Analytics Major, Senior

  • Moyuk Dey - Industrial Engineering Major, Data Science & Analytics Major Minor, Senior

  • Reid Heilesen - Industrial Engineering Major, Data Science & Analytics Major Minor, Senior

library(tidyverse)
library(ggplot2)
library(scales)
library(dplyr)
library(patchwork)
library(leaflet)
library(qrcode)
library(DT)
library(plotly)
library(knitr)

Introduction

Our data set (“Chocolate Sales Dataset 2023 - 2024”, authored by Thanh Nhan Nguyen) was found on Kaggle (https://www.kaggle.com/datasets/ssssws/chocolate-sales-dataset-2023-2024). Team Owls chose this data set out of a desire to study a unique topic and a shared fondness for chocolate. The data set contains five CSV’s transformed and merged as needed for analysis. These five CVS’s contain synthetic retail transaction data for chocolate sales across multiple stores and companies.

Goals

This project seeks to analyze the stability, decline, or growth of the chocolate market. To do this, Team Owls tracked trends such as sales across the entire period, sales by season, sales by month, and distribution of sales. Additionally, we analyzed impact factors such as day of the week on sales, gender on company loyalty, sales by best and worst performing stores, and sales by product type.

Data

sales <- read_csv("data/sales.csv")
customers <- read_csv("data/customers.csv")
products <- read_csv("data/products.csv")
stores <- read_csv("data/stores.csv")
calendar <- read_csv("data/calendar.csv")

The five CSV files are sales, stores, calendar, customers, and products. Each contains vital information about the chocolate market. Feel free to explore the five CSV files! The three most-used ones were sales, customers, and products. Customers and products are included here as interactive tables for your convenience! Sales is too big to run as an interactive table, so a tibble is provided.

Sales:

# A tibble: 1,000,000 × 11
   order_id    order_date product_id store_id customer_id quantity unit_price
   <chr>       <date>     <chr>      <chr>    <chr>          <dbl>      <dbl>
 1 0RD00000001 2023-01-07 P0080      S093     C040749            5      14.4 
 2 0RD00000002 2023-10-22 P0173      S065     C020161            3      12.0 
 3 0RD00000003 2023-05-07 P0115      S078     C048069            2      10.0 
 4 0RD00000004 2024-06-23 P0186      S088     C047901            2      14.7 
 5 0RD00000005 2024-09-24 P0197      S054     C033950            1      12.3 
 6 0RD00000006 2024-03-29 P0160      S089     C008918            4      13.5 
 7 0RD00000007 2023-02-26 P0062      S024     C002897            1      12.0 
 8 0RD00000008 2023-11-03 P0111      S085     C038072            5       4.62
 9 0RD00000009 2024-10-11 P0135      S029     C003786            4       7.88
10 0RD00000010 2023-12-17 P0069      S056     C043148            3       8.88
# ℹ 999,990 more rows
# ℹ 4 more variables: discount <dbl>, revenue <dbl>, cost <dbl>, profit <dbl>

Customers:

Products:

Transformation

For analysis, several of the CSV’s were transformed, merged, or otherwise altered for clarity and usability.

# Sales transformation
sales <- sales |>
  mutate(
    month = format(order_date, format = "%b"),
    year = year(order_date),
    month_year = as.Date(format(order_date, "%Y-%m-01")),
    season = case_when(
        month %in% c("Dec", "Jan", "Feb") ~ "Winter",
        month %in% c("Mar", "Apr", "May") ~ "Spring",
        month %in% c("Jun", "Jul", "Aug") ~ "Summer",
        month %in% c("Sep", "Oct", "Nov") ~ "Fall"
      )
    )

# Daily sales transformation
daily_sales <- sales |>
  group_by(order_date, month, year, season) |>
  summarise(num_orders = n(), .groups = "drop")

# Monthly sales transformation
monthly_sales <- sales |>
  group_by(month_year) |>
  summarise(
    num_orders = n(),
    product_quantity_sold = sum(quantity)
    )


# Weekly sales merged daily_sales & calendar
weekly_sales <- merge(daily_sales, calendar, 
                      by.x = "order_date", 
                      by.y = "date", 
                      all.x = TRUE) |>
  mutate(day_of_week = case_when(
    day_of_week == 0 ~ "Sunday",
    day_of_week == 1 ~ "Monday",
    day_of_week == 2 ~ "Tuesday",
    day_of_week == 3 ~ "Wednesday",
    day_of_week == 4 ~ "Thursday",
    day_of_week == 5 ~ "Friday",
    day_of_week == 6 ~ "Saturday")
  )

# Customers transformation
customers <- customers |> # DO NOT RUN THIS TWICE OR IT BREAKS
  mutate(
    loyalty_member = case_when(
      loyalty_member == 0 ~ "Not a Loyalty Member",
      loyalty_member == 1 ~ "Loyalty Member"
    )
  )

# Sales transformation
best_stores <- sales |>
  group_by(store_id) |>
  summarise(total_sales = n(), .groups = "drop") |>
  slice_max(total_sales, n = 3) |>
  arrange(desc(total_sales))
worst_stores <- sales |>
  group_by(store_id) |>
  summarise(total_sales = n(), .groups = "drop") |>
  slice_min(total_sales, n = 3) |>
  arrange(total_sales)

# Sales transformation
top_products <- sales |>
  group_by(product_id) |>
  summarise(num_orders = n(), .groups = "drop") |>
  slice_max(num_orders, n = 5) |>
  arrange(desc(num_orders))
bottom_products <- sales |>
  group_by(product_id) |>
  summarise(num_orders = n(), .groups = "drop") |>
  slice_min(num_orders, n = 5)  |>
  arrange(num_orders)

Analysis

Seasonal & Monthly Impact on Sales: What Season Do Customers Buy Chocolate?

# PLOT 1: BAR CHART - ORDER COUNTS BY AGGGREGATED SEASONS
ggplot(sales, aes(x = season, fill = season)) +
  geom_bar() +
  labs(
    title = "Plot 1: Order Counts by Season",
    x = NULL,
    y = "Count of Orders",
    fill = "Season") +
  scale_y_continuous(
    labels = label_number(scale = 1/1000, suffix = "K")
    ) +
  coord_cartesian(
    ylim = c(200000, 250000)
  ) +
  theme(legend.position = "bottom")

Seasonal Sales Patterns:

Chocolate sales across the market do not have a significant rise or drop depending on season. They slightly rise above 250K/month in Spring and Summer and fall slightly below that threshold in Fall and Winter. The fact that these shifts are minuscule indicates that individuals do not significantly buy more or less chocolate at particular periods of the year, making sales fairly consistent and stable year-round.

# PLOT 2: LINE CHART - SALES BY MONTH ACROSS 2023-2024
ggplot(monthly_sales, aes(x = month_year, y = num_orders)) +
  geom_point() +
  geom_line() +
  geom_point(
    data = monthly_sales |>
      filter(num_orders < 40000),
    color = "red2",
    size = 2.5
    ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_date(breaks = "1 month",
               date_labels = "%b '%y") +
  coord_cartesian(
    ylim = c(30000, 45000)
  ) +
  annotate(
    geom = "label",
    x = as.Date("2023-08-01"), y = 38000,
    label = "February is the worst performing \nmonth of both years!",
    size = 3
  ) +
  labs(
    title = "Plot 2: Sales By Month Across 2023-2024",
    x = "Date",
    y = "Sales Per Month"
  )

Monthly Sales Patterns:

The graph shows between around 38,000 and 44,000 chocolate boxes being bought per month in 2023 and 2024. The general pattern shows mild rises and declines without any clear upward or downward trend by the end. The only month with consistently noticeable declines in both 2023 and 2024 is February — which might be able to be explained by the month’s shorter length. However, the market has a very regular pattern of success on a monthly basis.

Daily Sales Analysis: Sales Across the Entire Date Range, the Distribution, & What Day Of the Week Do People Buy Chocolate?

# PLOT 3: SCATTERPLOT: DAILY SALES ACROSS 2023-2024
ggplot(daily_sales, aes(x = order_date, y = num_orders)) +
  geom_point() +
  geom_smooth(method = lm, color = "red") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_date(breaks = "1 month",
               date_labels = "%b '%y") +
  labs(
    title = "Plot 3: Daily Sales 2023 - 2024",
    x = "Date",
    y = "Number of Orders"
  )
`geom_smooth()` using formula = 'y ~ x'

Daily Sales Patterns:

Daily chocolate orders stayed between approx. 1,250 and 1,450 for two years. The trend line is barely negative — showing that demand is fairly constant over time. This means that from a long-term standpoint, the chocolate market can be predicted with very little overall fluctuations. However, fluctuations do vary more significantly day-by-day.

# PLOT 4: SCATTERPLOT: DAILY SALES ACROSS 2023-2024
interactive1 <- ggplot(daily_sales |>
         filter(
           order_date >= as.Date("2023-12-01"),
           order_date <= as.Date("2024-11-30")), 
       aes(x = order_date, y = num_orders)) +
  geom_point(aes(color = season)) +
  geom_smooth(aes(color = season)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_date(breaks = "1 month",
               date_labels = "%b '%y") +
  labs(
    title = "Plot 4: Daily Sales by Season in 2024",
    color = "Season",
    x = "Date",
    y = "Number of Orders"
  )

ggplotly(interactive1)

Daily Sales by Season Patterns:

Early, we ananlyzed season and month impact on sales. This graph briefly goes back to that. Aggregating daily sales by season, we can see trend lines are not so smooth like as seen in the previous graph. However, the lines of best fit never stray far from the center of the distribution, and the disparity can likely be explained by the smaller number of data values making up each season, as well as the wide differences in sales on a day-to-day-basis.

# PLOT 3: HISTOGRAM: DISTRIBUTION OF DAILY SALES

interactive2 <- ggplot(daily_sales, aes(x = num_orders)) +
  geom_histogram(binwidth = 10, color = "gray90", fill = "navy") +
  labs(
    title = "Plot 5: Distribution of Daily Sales",
    x = "Number of Orders Per Day",
    y = "Frequency"
  ) +
  scale_y_continuous(
    breaks = seq(0, 100, by = 20)
  ) +
  scale_x_continuous(
    breaks = seq(1240, 1460, by = 20)
  )

ggplotly(interactive2)

Notes on the Distribution:

Notice the bell curve of the distribution! (sidenote, Asa is writing this part of the analysis — and I am super passionate about statistics!) Bell curves are awesome because they can tell us so much about the data. Our chocolate sales have a normal distribution — unimodal, symmetric. The mean, median, and mode should be roughly the same. The empirical rule also allows estimation of data placement.

Now, why is all this important? If Team Owls or anyone else chose to apply statistical analysis to this data (such as t-tests, ANOVA tests, or linear regression analysis), these models operate under the assumption that the data is normally distributed! Without normal distribution, any findings are less certain.

# PLOT 5: FACETED BOXPLOTS OF DAY-OF-WEEK IMPACT ON NUMBER OF ORDERS 
ggplot(weekly_sales, aes(y = num_orders, fill = day_of_week)) +
  geom_boxplot() +
  facet_wrap(~ day_of_week) +
  theme(legend.position = "none") +
  labs(
    title = "Plot 6: Order Distribution by Day of Week",
    x = NULL,
    y = "Number of Orders") +
  scale_x_date(labels = NULL)

Box Plot Patterns:

Earlier, we observed that sales per day can vary significantly day-to-day. However, order distributions for all seven days of the week are fairly similar in shape with medians and interquartile ranges at approximately similar ranges. The whiskers and outliers are also fairly evenly distributed across all categories. Chocolate demand is relatively the same Monday to Sunday, and weekend vs. weekday has no noticeable disparity either. This indicates that although sales has very high variability from day-to day, day-of-the-week must not be a contributing factor. Each day has a normal distribution.

Factors Analysis: What are the Potential Impacts of Factors Such As Gender, Age, Store, and Chocolate Type?

# PLOT 6: BAR CHART: GENDER COMPARISON TO LOYALTY PROGRAM
ggplot(customers, aes(x = gender, fill = loyalty_member)) +
  geom_bar(position = "fill") +
  theme(legend.position = "none") +
  labs(
    title = "Plot 7: Loyalty Membership by Gender (100% Stacked)",
    x = "Gender of Customer",
    y = "Percentage",
    fill = "Loyalty Member",
    ) +
  annotate(
    geom = "label",
    x = "Female", y = 0.25,
    label = "Not Subscribed to Loyalty Program",
    size = 3
  ) +
    annotate(
    geom = "label",
    x = "Female", y = 0.75,
    label = "Subscribed to Loyalty Program",
    size = 3
  ) +
    annotate(
    geom = "label",
    x = "Male", y = 0.25,
    label = "Not Subscribed to Loyalty Program",
    size = 3
  ) +
    annotate(
    geom = "label",
    x = "Male", y = 0.75,
    label = "Subscribed to Loyalty Program",
    size = 3
  ) +
  scale_y_continuous(labels = scales::percent)

Gender Patterns:

The graph illustrates the percentage of men and women subscribed to company loyalty programs is almost the same. There are roughly equal numbers of members and non-members for any gender.
The balance indicates that there is no need for gender-specific marketing as the reward program is appealing to both males and females. Instead, marketing should focus on how to bring up the subscribed percentage as a whole.

# PLOT 7: FACETED BOXPLOTS: AGE ON LOYALTY MEMBER
ggplot(customers, aes(y = age, fill = loyalty_member)) +
  geom_boxplot() +
  facet_wrap(~loyalty_member) +
  scale_x_continuous(label = NULL) +
  theme(legend.position = "none") +
  labs(
    title = "Plot 8: Age Impact on Loyalty Membership",
    y = "Age"
    )

Age Patterns:

The bulk of loyalty members are between ~30-60 years of age, with ~44 being the median age. The range lines from roughly 20 to 70 illustrate that the program is attractive for a wide variety of age groups, but interest peaks in mid-adulthood. The similar distibutions on both boxplots suggests that, like gender, age is not a determining factor on loyalty subscription.

Demographic characteristics aren’t a strong predictor of someone’s loyalty status. This suggests that variances in purchasing behavior are more likely to be explained by differences in behaviour or purchase patterns than by differences in demography.

# PLOT 8 AND 9:
# BAR CHART: BEST PERFORMING STORE SALE COUNTS
# BAR CHART: WORST PERFORMING STORE SALE COUNTS

plot3 <- ggplot(best_stores, aes(x = reorder(store_id, -total_sales), y = total_sales)) +
  geom_bar(stat = "identity", fill = "red4") +
  labs(
    title = "Plot 9: Sale Counts of \nTop-Performing Stores",
    x = "Store ID",
    y = "Number of Sales"
  ) +
  coord_cartesian(
    ylim = c(9500, 10250)
    ) +
    annotate(
    geom = "label",
    x = "S013", y = 10000,
    label = "City:\nMelbourne",
    size = 2.5,
    alpha = 0.9
  ) +
    annotate(
    geom = "label",
    x = "S033", y = 9950,
    label = "City:\nToronto",
    size = 2.5,
    alpha = 0.9
  ) +
    annotate(
    geom = "label",
    x = "S027", y = 9900,
    label = "City:\nSydney",
    size = 2.5,
    alpha = 0.9
  )

plot4 <- ggplot(worst_stores, aes(x = reorder(store_id, -total_sales), y = total_sales)) +
  geom_bar(stat = "identity", fill = "navy") +
  labs(
    title = "Plot 10: Sale Counts of \nWorst-Performing Stores",
    x = "Store ID",
    y = "Number of Sales"
  ) +
  coord_cartesian(
    ylim = c(9500, 10250)
    ) +
    annotate(
    geom = "label",
    x = "S029", y = 9700,
    label = "City:\nMelbourne",
    size = 2.5,
    alpha = 0.9
  ) +
    annotate(
    geom = "label",
    x = "S057", y = 9650,
    label = "City:\nLondon",
    size = 2.5,
    alpha = 0.9
  ) +
    annotate(
    geom = "label",
    x = "S077", y = 9600,
    label = "  City:  \nParis",
    size = 2.5,
    alpha = 0.9
  )

plot3 + plot4

Best vs. Worst Stores:

Although the sales differences between the top three and bottom three may look prominent in these two graphs, pay attention to the y-axis — which was zoomed in to 9500-10100 (this was done so that differences could be seen at all!). Taking into account the y-axis scaling, we can see that store does not have much of an impact on sales.

# PLOT 10 AND 11:
# BAR CHART - BEST SELLING TYPES OF CHCOLATE
# BAR CHART - WORST SELLING TYPES OF CHCOLATE 

plot1 <- ggplot(top_products, aes(x = reorder(product_id, -num_orders), 
                         y = num_orders)) +
  geom_bar(stat = "identity", fill = "red4") +
  labs(
    title = "Plot 11: Best Selling Chocolate",
    x = "Product ID",
    y = "Number of Orders") +
  coord_cartesian(
    ylim = c(4500, 5200)
  ) +
    annotate(
    geom = "label",
    x = "P0146", y = 4800,
    label = "White Chocolate 50%\nBrand: Mars",
    size = 2.5
  ) +
    annotate(
    geom = "label",
    x = "P0008", y = 4850,
    label = "White Chocolate 90%\nBrand: Godiva",
    size = 2.5
  ) +
    annotate(
    geom = "label",
    x = "P0107", y = 4900,
    label = "Truffle Chocolate 80%\nBrand: Ferrero",
    size = 2.5
  ) +
    annotate(
    geom = "label",
    x = "P0091", y = 4950,
    label = "Dark Chocolate 60%\nBrand: Cadbury",
    size = 2.5
  ) +
    annotate(
    geom = "label",
    x = "P0021", y = 5000,
    label = "Dark Chocolate 50%\nBrand: Cadbury",
    size = 2.5
  )

plot2 <- ggplot(bottom_products, aes(x = reorder(product_id, num_orders), 
                         y = num_orders)) +
  geom_bar(stat = "identity", fill = "navy") +
  labs(
    title = "Plot 12: Worst Selling Chocolate",
    x = "Product ID",
    y = "Number of Orders") +
  coord_cartesian(
    ylim = c(4500, 5200)
  ) +
    annotate(
    geom = "label",
    x = "P0080", y = 4750,
    label = "Praline Chocolate 70%\nBrand: Hershey",
    size = 2.5
  ) +
    annotate(
    geom = "label",
    x = "P0178", y = 4700,
    label = "Praline Chocolate 80%\nBrand: Hershey",
    size = 2.5
  ) +
    annotate(
    geom = "label",
    x = "P0028", y = 4650,
    label = "Dark Chocolate 50%\nBrand: Lindt",
    size = 2.5
  ) +
    annotate(
    geom = "label",
    x = "P0035", y = 4600,
    label = "White Chocolate 70%\nBrand: Hershey",
    size = 2.5
  ) +
    annotate(
    geom = "label",
    x = "P0048", y = 4550,
    label = "Praline Chocolate 90%\nBrand: Cadbury",
    size = 2.5
  )

plot1 / plot2

Best and Worst Selling Chocolates:

Similar to the previous graph, visualization on a normal y-axis range would not have allowed differences to be seen at all — so the scale was zoomed. In reality, there is only about a 250 sales count different between the fifth best chocolate and the fifth worst chocolate. Given this, chocolate brand and percentage does not seem to have a substantial impact on sales per type.

Conclusion

  • Seasonal & Monthly Impact on Sales: What Season Do Customers Buy Chocolate?
    Season and month had little impact on chocolate. The most significant finding was February’s lower monthly sales across both years of interest.

  • Daily Sales Analysis: Sales Across the Entire Date Range, the Distribution, & What Day Of the Week Do People Buy Chocolate?
    The slope of sales remained stable across 2023-2024, and the distribution of daily sales was normal was well. A few sales outliers were observed when faceting by weekday, but nothing significant to note.

  • Factors Analysis: What are the Potential Impacts of Factors Such As Gender, Age, Store, and Chocolate Type?
    Gender and age showed little impact on loyalty programs. Additionally, stores all performe relatively the same, only differing a few hundred total sales between our best and worst stores. Chocolate type had similar findings as a factor.

Interactive Map

  • Asa Gardner - agardn61@students.kennesaw.edu

  • Jacob Aliff - jaliff2@students.kennesaw.edu

  • Moyuk Dey - msurja@students.kennesaw.edu

  • Reid Heilesen - rheilese@students.kennesaw.edu