Introduction

Growing Candy Sales

The Business Problem our Group is trying to solve is in two parts. First, we wanted to identify when candy sales peak during the year. Next, we want to break down these categories by demographic to identify who purchases candy the most. This will allow us to target the correct consumers to ultimately grow sales.

How We Came to Our Solution

We first had to identify where our candy sales peaked. We did so by joining our transactions data with our product data. After identifying these peaks, we wanted to see who was making these purchases. We joined the demographics data to these two previous data sets to identify who made up the bulk of our Candy Sales.

How Will This Analysis Benefit Us?

This Analysis will help us in two ways. First, as candy is highly seasonal, it will allow us to see which periods of the year it spikes on. Secondly, we will be able to see who is purchasing these products the most. These trends will allow us to tailor our marketing strategies to the correct consumers.

Packages Required

completejourney – contains the data we will be using for this analysis

tidyverse – a collection of packages that allow for better data importing, tidying, manipulation, and data visualization

lubridate – makes it easier to work with data and time functions

ggplot2 – To facilitate in making more visually appealing graphics

Data Preperation

library(completejourney)
library(tidyverse)
library(forcats)
library(dplyr)
library(knitr)
library(ggplot2)
transactions <- get_transactions()
dim(transactions)
## [1] 1469307      11
promotions <- get_promotions()
dim(promotions)
## [1] 20940529        5
products
## # A tibble: 92,331 × 7
##    product_id manufacturer_id department    brand  product_category product_type
##    <chr>      <chr>           <chr>         <fct>  <chr>            <chr>       
##  1 25671      2               GROCERY       Natio… FRZN ICE         ICE - CRUSH…
##  2 26081      2               MISCELLANEOUS Natio… <NA>             <NA>        
##  3 26093      69              PASTRY        Priva… BREAD            BREAD:ITALI…
##  4 26190      69              GROCERY       Priva… FRUIT - SHELF S… APPLE SAUCE 
##  5 26355      69              GROCERY       Priva… COOKIES/CONES    SPECIALTY C…
##  6 26426      69              GROCERY       Priva… SPICES & EXTRAC… SPICES & SE…
##  7 26540      69              GROCERY       Priva… COOKIES/CONES    TRAY PACK/C…
##  8 26601      69              DRUG GM       Priva… VITAMINS         VITAMIN - M…
##  9 26636      69              PASTRY        Priva… BREAKFAST SWEETS SW GDS: SW …
## 10 26691      16              GROCERY       Priva… PNT BTR/JELLY/J… HONEY       
## # ℹ 92,321 more rows
## # ℹ 1 more variable: package_size <chr>
demographics
## # A tibble: 801 × 8
##    household_id age   income    home_ownership marital_status household_size
##    <chr>        <ord> <ord>     <ord>          <ord>          <ord>         
##  1 1            65+   35-49K    Homeowner      Married        2             
##  2 1001         45-54 50-74K    Homeowner      Unmarried      1             
##  3 1003         35-44 25-34K    <NA>           Unmarried      1             
##  4 1004         25-34 15-24K    <NA>           Unmarried      1             
##  5 101          45-54 Under 15K Homeowner      Married        4             
##  6 1012         35-44 35-49K    <NA>           Married        5+            
##  7 1014         45-54 15-24K    <NA>           Married        4             
##  8 1015         45-54 50-74K    Homeowner      Unmarried      1             
##  9 1018         45-54 35-49K    Homeowner      Married        5+            
## 10 1020         45-54 25-34K    Homeowner      Married        2             
## # ℹ 791 more rows
## # ℹ 2 more variables: household_comp <ord>, kids_count <ord>

After importing all of the necessary data and tools, we began our analysis

products %>%
  filter(str_detect(product_category, regex("(CANDY - PACKAGED)|(CANDY - CHECKLANE)"))) %>%
  inner_join(transactions, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  mutate(month = month(transaction_timestamp)) %>%
  group_by(month) %>%
  summarise(total_sales = sum(sales_value)) %>%
  ggplot(aes(x = month, y = total_sales)) +
  geom_line(color = "orange") +
  scale_y_continuous("Total Candy Sales",labels = scales::dollar)+
  scale_x_continuous("Month of the Year", breaks = 1:12, labels = month.abb) +
  ggtitle("Candy Sales by Month",
          subtitle = "Spikes in April, October and December")+
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

This chart identifies the spikes in demand. The three major spikes came before Easter, Halloween, and Christmas.

products %>%
  filter(str_detect(product_category,regex("(CANDY - PACKAGED)|(CANDY - CHECKLANE)"))) %>%
  inner_join(transactions, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(income) %>%
  summarise(total_sales = sum(sales_value)) %>%
  ggplot(aes(x = income, y = total_sales)) +
  geom_col(fill = "red", color = "black")+
  labs(x = "Income", y = "Total Sales") +
  ggtitle("Total Candy Sales by Income Range",
          subtitle = "Income Range 50-74k Spends the Most on Candy")+
  scale_y_continuous(labels = scales::dollar_format(prefix = "$")) + 
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

This Chart Breaks down the income range who accounts for the most candy sales. The leading group is individuals making 50-74k.

products %>%
 filter(str_detect(product_category, regex("(CANDY - PACKAGED)|(CANDY - CHECKLANE)"))) %>%
  inner_join(transactions, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(product_type) %>%
  summarise(total_sales = sum(sales_value)) %>%
  arrange(desc(total_sales)) %>%
  top_n(5,total_sales) %>%
  ggplot(aes(x = total_sales,y = product_type)) +
  geom_col(fill = "red",color = "black") +
  scale_y_discrete("Product Type")+
  scale_x_continuous("Total Sales",labels = scales::dollar)+
  ggtitle("Total Sales by Product Type",
          subtitle = "Chocolate is the Top Product")+
    theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

This chart identifies the top 5 candy products. Candy bars and Chocolate are doing well, but a close 5th is chewing gum.

#yearly
products %>%
  inner_join(transactions, "product_id") %>%
  select(product_id, sales_value, transaction_timestamp, product_type) %>%
  filter(str_detect(product_type, regex("^CANDY"))) %>%
  group_by(transaction_timestamp) %>%
  summarize(total_sales = sum(sales_value)) %>%
  filter(total_sales > 20) %>% 
  ggplot(aes(x = transaction_timestamp, y = total_sales)) +
  geom_point() +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Sales greater than $20 year round", 
       subtitle = "We wanted to preface with this graph because it contains all of the outliers that shift the sales
during holidays. The majority of transactions are under $20 and each holiday has their own chunk of
outliers.",
       x = "Date",
       y = "Sales")

This graph shows the amount spent per transaction on candy. As you can see, majority of the purchases are under 20 dollars. However, many of the out-liers come during our sales peak.

products %>%
  filter(str_detect(product_category, regex("(CANDY - PACKAGED)|(CANDY -CHECKLANE)"))) %>%
  inner_join(transactions, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(kids_count,income) %>%
  summarise(quantity_purchased = sum(quantity)) %>%
  ggplot(aes(x=kids_count,y=quantity_purchased)) +
  geom_boxplot(fill = "red", color = "black")+
  labs(x = "Number of Kids in Household", y = "Quantity Purchased")+
  ggtitle("Number of Kids vs the Quanity of Candy Sold",
          subtitle = "Families with More Kids Buy Less Candy")+
     theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

This chart shows the relationship between the number of kids a family has and their candy purchases. Surprisingly, the greater the number of kids you have, the less candy you buy.

products %>%
  filter(str_detect(product_category, regex("(CANDY - PACKAGED)|(CANDY - CHECKLANE)"))) %>%
  inner_join(transactions, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(marital_status,income) %>%
  summarise(quantity_purchased = sum(quantity)) %>%
  ggplot(aes(x=marital_status,y=quantity_purchased)) +
  geom_boxplot(color = "black",fill = "blue")+
    labs(x = "Marital Status", y = "Quantity Purchased")+
  ggtitle("Marital Status vs the Quanity of Candy Sold",
          subtitle = "Both Groups Behave Similarly")+
     theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

This chart shows the influence you marital status has on candy purchases. While unmarried have a higher range, the numbers are overall very similar.

demographics %>%
  filter(household_size %in% c('1', '2', '3', '4', '5+')) %>%
  inner_join(transactions, "household_id") %>%
  inner_join(products, "product_id") %>%
  filter(str_detect(product_category, regex(
    "(CANDY - PACKAGED)|(CANDY - CHECKLANE)"))) %>%
  group_by(household_size, income) %>%
  summarize(total_sales = sum(sales_value)) %>%
  arrange(income) %>%
  ggplot(aes(x = total_sales, y = income, color = household_size)) +
  geom_point(size = 2) +
  labs(x = "Total Sales", y = "Income") +
  scale_x_continuous(labels = scales::dollar_format(prefix = "$"))+
  ggtitle("Sales by Income Range and Household Size",
          subtitle = "25-99k Appear to be our Strongest Demographic")+
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

This chart identifies the spending habits of each income range based on their household size. Consumers with less kids tend to spend more on candy. Furthermore, lower income levels spend more on candy as well.

products %>%
  inner_join(transactions, "product_id") %>%
  select(product_id, sales_value, transaction_timestamp, product_type) %>%
  filter(str_detect(product_type, regex("^CANDY"))) %>%
  filter(week(transaction_timestamp) == 14:18) %>%
  group_by(transaction_timestamp) %>%
  summarize(total_sales = sum(sales_value)) %>%
  ggplot(aes(x = transaction_timestamp, y = total_sales)) +
  geom_smooth() +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Sales from April 1st to May 6th", 
       subtitle = "The graph peaks around Easter and has a second peak after Easter likely when candy goes
on sale",
       x = "Date",
       y = "Sales") +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

This chart identifies the average amount spent on candy transactions around Easter (April 10th). You can see the sales spike around and after the holiday.

products %>%
  inner_join(transactions, "product_id") %>%
  select(product_id, sales_value, transaction_timestamp, product_type) %>%
  filter(str_detect(product_type, regex("^CANDY"))) %>%
  filter(week(transaction_timestamp) == 40:45) %>%
  group_by(transaction_timestamp) %>%
  summarize(total_sales = sum(sales_value)) %>%
  ggplot(aes(x = transaction_timestamp, y = total_sales)) +
  geom_smooth() +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Sales from October 1st to October 31st", 
       subtitle = "The graph gradually increases thoughout the month and has a strong increase during the week 
of halloween",
       x = "Date",
       y = "Sales") +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

This chart shows the average transaction cost for sales during the month of October. Sales skyrocket right before Halloween. They don’t see a similar resurgence in the following weeks like Easter does.

products %>%
  inner_join(transactions, "product_id") %>%
  select(product_id, sales_value, transaction_timestamp, product_type) %>%
  filter(str_detect(product_type, regex("^CANDY"))) %>%
  filter(month(transaction_timestamp) == 12) %>%
  group_by(transaction_timestamp) %>%
  summarize(total_sales = sum(sales_value)) %>%
  ggplot(aes(x = transaction_timestamp, y = total_sales)) +
  geom_smooth() +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Sales from December 1st to December 31st", 
       subtitle = "The graph gradually increases thoughout the month and peaks around christmas with a strong 
drop into the new year",
       x = "Date",
       y = "Sales") +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

This chart shows the average amount spent per transaction in the month of December. This graph is similar to the data we saw in October, as sales decrease after the holiday.

Summary

At the start, our group set out to identify when candy sales peaked and how to increase those sales during those spikes.

How We Answered the Problem

To begin, we started by joining products with demographics and transactions. We created a line chart to identify the peaks of candy sales. These came around Easter, Halloween, and Christmas. Next, we wanted to see which income range was buying the most candy. We created a chart that showed total sales by income range, finding the highest amount spent in the 50-74k bucket. Next, we wanted to see which product type of candy was sold the most. We identified that chocolate was the most common. Next, we broke down how demographics like marital status, number of kids and household size affected these numbers. Some of these stood out, such as families with less kids being more prone to spend money on candy. Lastly, we identified the sales pattern of our candy for each holiday peak. One thing that stood out to us was that more people bought candy after Easter. This is a great time frame for increased marketing and promotions.

Our Recommendation

There are a few key takeaways we believe will help us increase our candy sales next year. The first of which is proper marketing to selected groups. We would increase targeted ads, campaigns, and other promotions to consumers in the income range of 35-99k. The three income ranges in this area were our highest sellers for candy. Furthermore, we should focus our efforts on consumers without children as they buy more candy than consumers without children. Lastly, I would increase marketing about candy sales the weeks after holidays. Like we saw with Easter, many consumers plan on buying cheaper candy for themselves after the holidays. With a focused marketing strategy for the selected groups, we could increase candy sales in this time frame for Christmas and Halloween as well.

Limitations

There are a few limitations to this study. The first of which is that the data we are using is from 2017. This is potentially bad as the data is now seven years old. It is possible that consumer habits have changed since then. The second limitation with our study is that we only based it on one year of data. There are always fluctuations in data year to year. It is possible that some of our insights were based on out-liers rather than consistent patterns. The third limitation is that the product types are broad, the top selling product is candy bars. It is impossible to determine which specific candy bar is doing the best with the context given.