Introduction

Growing Candy Sales

The Business Problem our Group is trying to solve is in two parts. First, we wanted to identify when candy sales peak during the year. Next, we want to break down these categories by demographic to identify who purchases candy the most. This will allow us to target the correct consumers to ultimately grow sales.

How We Came to Our Solution

We first had to identify where our candy sales peaked. We did so by joining our transactions data with our product data. After identifying these peaks, we wanted to see who was making these purchases. We joined the demographics data to these two previous data sets to identify who made up the bulk of our Candy Sales.

How Will This Analysis Benefit Us?

This Analysis will help us in two ways. First, as candy is highly seasonal, it will allow us to see which periods of the year it spikes on. Secondly, we will be able to see who is purchasing these products the most. These trends will allow us to tailor our marketing strategies to the correct consumers.

Packages Required

completejourney – contains the data we will be using for this analysis

tidyverse – a collection of packages that allow for better data importing, tidying, manipulation, and data visualization

lubridate – makes it easier to work with data and time functions

ggplot2 – To facilitate in making more visually appealing graphics

Data Preperation

library(completejourney)
library(tidyverse)
library(forcats)
library(dplyr)
library(knitr)
library(ggplot2)
transactions <- get_transactions()
dim(transactions)

## [1] 1469307      11

promotions <- get_promotions()
dim(promotions)

## [1] 20940529        5

products

## # A tibble: 92,331 × 7
##    product_id manufacturer_id department    brand  product_category product_type
##    <chr>      <chr>           <chr>         <fct>  <chr>            <chr>       
##  1 25671      2               GROCERY       Natio… FRZN ICE         ICE - CRUSH…
##  2 26081      2               MISCELLANEOUS Natio… <NA>             <NA>        
##  3 26093      69              PASTRY        Priva… BREAD            BREAD:ITALI…
##  4 26190      69              GROCERY       Priva… FRUIT - SHELF S… APPLE SAUCE 
##  5 26355      69              GROCERY       Priva… COOKIES/CONES    SPECIALTY C…
##  6 26426      69              GROCERY       Priva… SPICES & EXTRAC… SPICES & SE…
##  7 26540      69              GROCERY       Priva… COOKIES/CONES    TRAY PACK/C…
##  8 26601      69              DRUG GM       Priva… VITAMINS         VITAMIN - M…
##  9 26636      69              PASTRY        Priva… BREAKFAST SWEETS SW GDS: SW …
## 10 26691      16              GROCERY       Priva… PNT BTR/JELLY/J… HONEY       
## # ℹ 92,321 more rows
## # ℹ 1 more variable: package_size <chr>

demographics

## # A tibble: 801 × 8
##    household_id age   income    home_ownership marital_status household_size
##    <chr>        <ord> <ord>     <ord>          <ord>          <ord>         
##  1 1            65+   35-49K    Homeowner      Married        2             
##  2 1001         45-54 50-74K    Homeowner      Unmarried      1             
##  3 1003         35-44 25-34K    <NA>           Unmarried      1             
##  4 1004         25-34 15-24K    <NA>           Unmarried      1             
##  5 101          45-54 Under 15K Homeowner      Married        4             
##  6 1012         35-44 35-49K    <NA>           Married        5+            
##  7 1014         45-54 15-24K    <NA>           Married        4             
##  8 1015         45-54 50-74K    Homeowner      Unmarried      1             
##  9 1018         45-54 35-49K    Homeowner      Married        5+            
## 10 1020         45-54 25-34K    Homeowner      Married        2             
## # ℹ 791 more rows
## # ℹ 2 more variables: household_comp <ord>, kids_count <ord>

After importing all of the necessary data and tools, we began our analysis

products %>%
  filter(str_detect(product_category, regex("(CANDY - PACKAGED)|(CANDY - CHECKLANE)"))) %>%
  inner_join(transactions, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  mutate(month = month(transaction_timestamp)) %>%
  group_by(month) %>%
  summarise(total_sales = sum(sales_value)) %>%
  ggplot(aes(x = month, y = total_sales)) +
  geom_line(color = "orange") +
  scale_y_continuous("Total Candy Sales",labels = scales::dollar)+
  scale_x_continuous("Month of the Year", breaks = 1:12, labels = month.abb) +
  ggtitle("Candy Sales by Month",
          subtitle = "Spikes in April, October and December")+
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

This chart identifies the spikes in demand. The three major spikes came before Easter, Halloween, and Christmas.

products %>%
  filter(str_detect(product_category,regex("(CANDY - PACKAGED)|(CANDY - CHECKLANE)"))) %>%
  inner_join(transactions, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(income) %>%
  summarise(total_sales = sum(sales_value)) %>%
  ggplot(aes(x = income, y = total_sales)) +
  geom_col(fill = "red", color = "black")+
  labs(x = "Income", y = "Total Sales") +
  ggtitle("Total Candy Sales by Income Range",
          subtitle = "Income Range 50-74k Spends the Most on Candy")+
  scale_y_continuous(labels = scales::dollar_format(prefix = "$")) + 
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

This Chart Breaks down the income range who accounts for the most candy sales. The leading group is individuals making 50-74k.

products %>%
 filter(str_detect(product_category, regex("(CANDY - PACKAGED)|(CANDY - CHECKLANE)"))) %>%
  inner_join(transactions, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(product_type) %>%
  summarise(total_sales = sum(sales_value)) %>%
  arrange(desc(total_sales)) %>%
  top_n(5,total_sales) %>%
  ggplot(aes(x = total_sales,y = product_type)) +
  geom_col(fill = "red",color = "black") +
  scale_y_discrete("Product Type")+
  scale_x_continuous("Total Sales",labels = scales::dollar)+
  ggtitle("Total Sales by Product Type",
          subtitle = "Chocolate is the Top Product")+
    theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

This chart identifies the top 5 candy products. Candy bars and Chocolate are doing well, but a close 5th is chewing gum.

#yearly
products %>%
  inner_join(transactions, "product_id") %>%
  select(product_id, sales_value, transaction_timestamp, product_type) %>%
  filter(str_detect(product_type, regex("^CANDY"))) %>%
  group_by(transaction_timestamp) %>%
  summarize(total_sales = sum(sales_value)) %>%
  filter(total_sales > 20) %>% 
  ggplot(aes(x = transaction_timestamp, y = total_sales)) +
  geom_point() +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Sales greater than $20 year round", 
       subtitle = "We wanted to preface with this graph because it contains all of the outliers that shift the sales
during holidays. The majority of transactions are under $20 and each holiday has their own chunk of
outliers.",
       x = "Date",
       y = "Sales")

This graph shows the amount spent per transaction on candy. As you can see, majority of the purchases are under 20 dollars. However, many of the out-liers come during our sales peak.

products %>%
  filter(str_detect(product_category, regex("(CANDY - PACKAGED)|(CANDY -CHECKLANE)"))) %>%
  inner_join(transactions, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(kids_count,income) %>%
  summarise(quantity_purchased = sum(quantity)) %>%
  ggplot(aes(x=kids_count,y=quantity_purchased)) +
  geom_boxplot(fill = "red", color = "black")+
  labs(x = "Number of Kids in Household", y = "Quantity Purchased")+
  ggtitle("Number of Kids vs the Quanity of Candy Sold",
          subtitle = "Families with More Kids Buy Less Candy")+
     theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

This chart shows the relationship between the number of kids a family has and their candy purchases. Surprisingly, the greater the number of kids you have, the less candy you buy.

products %>%
  filter(str_detect(product_category, regex("(CANDY - PACKAGED)|(CANDY - CHECKLANE)"))) %>%
  inner_join(transactions, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(marital_status,income) %>%
  summarise(quantity_purchased = sum(quantity)) %>%
  ggplot(aes(x=marital_status,y=quantity_purchased)) +
  geom_boxplot(color = "black",fill = "blue")+
    labs(x = "Marital Status", y = "Quantity Purchased")+
  ggtitle("Marital Status vs the Quanity of Candy Sold",
          subtitle = "Both Groups Behave Similarly")+
     theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

This chart shows the influence you marital status has on candy purchases. While unmarried have a higher range, the numbers are overall very similar.

demographics %>%
  filter(household_size %in% c('1', '2', '3', '4', '5+')) %>%
  inner_join(transactions, "household_id") %>%
  inner_join(products, "product_id") %>%
  filter(str_detect(product_category, regex(
    "(CANDY - PACKAGED)|(CANDY - CHECKLANE)"))) %>%
  group_by(household_size, income) %>%
  summarize(total_sales = sum(sales_value)) %>%
  arrange(income) %>%
  ggplot(aes(x = total_sales, y = income, color = household_size)) +
  geom_point(size = 2) +
  labs(x = "Total Sales", y = "Income") +
  scale_x_continuous(labels = scales::dollar_format(prefix = "$"))+
  ggtitle("Sales by Income Range and Household Size",
          subtitle = "25-99k Appear to be our Strongest Demographic")+
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

This chart identifies the spending habits of each income range based on their household size. Consumers with less kids tend to spend more on candy. Furthermore, lower income levels spend more on candy as well.

products %>%
  inner_join(transactions, "product_id") %>%
  select(product_id, sales_value, transaction_timestamp, product_type) %>%
  filter(str_detect(product_type, regex("^CANDY"))) %>%
  filter(week(transaction_timestamp) == 14:18) %>%
  group_by(transaction_timestamp) %>%
  summarize(total_sales = sum(sales_value)) %>%
  ggplot(aes(x = transaction_timestamp, y = total_sales)) +
  geom_smooth() +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Sales from April 1st to May 6th", 
       subtitle = "The graph peaks around Easter and has a second peak after Easter likely when candy goes
on sale",
       x = "Date",
       y = "Sales") +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

This chart identifies the average amount spent on candy transactions around Easter (April 10th). You can see the sales spike around and after the holiday.

products %>%
  inner_join(transactions, "product_id") %>%
  select(product_id, sales_value, transaction_timestamp, product_type) %>%
  filter(str_detect(product_type, regex("^CANDY"))) %>%
  filter(week(transaction_timestamp) == 40:45) %>%
  group_by(transaction_timestamp) %>%
  summarize(total_sales = sum(sales_value)) %>%
  ggplot(aes(x = transaction_timestamp, y = total_sales)) +
  geom_smooth() +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Sales from October 1st to October 31st", 
       subtitle = "The graph gradually increases thoughout the month and has a strong increase during the week 
of halloween",
       x = "Date",
       y = "Sales") +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

This chart shows the average transaction cost for sales during the month of October. Sales skyrocket right before Halloween. They don’t see a similar resurgence in the following weeks like Easter does.

products %>%
  inner_join(transactions, "product_id") %>%
  select(product_id, sales_value, transaction_timestamp, product_type) %>%
  filter(str_detect(product_type, regex("^CANDY"))) %>%
  filter(month(transaction_timestamp) == 12) %>%
  group_by(transaction_timestamp) %>%
  summarize(total_sales = sum(sales_value)) %>%
  ggplot(aes(x = transaction_timestamp, y = total_sales)) +
  geom_smooth() +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Sales from December 1st to December 31st", 
       subtitle = "The graph gradually increases thoughout the month and peaks around christmas with a strong 
drop into the new year",
       x = "Date",
       y = "Sales") +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

This chart shows the average amount spent per transaction in the month of December. This graph is similar to the data we saw in October, as sales decrease after the holiday.

Mid-Term Project - Group 5 - BANA4080

Adam Miklovic, Josh Danszczak, Brady Bowser

2024-02-25

Introduction

Growing Candy Sales

How We Came to Our Solution

How Will This Analysis Benefit Us?

Packages Required

Data Preperation

Summary

How We Answered the Problem

Our Recommendation

Limitations