The Business Problem our Group is trying to solve is in two parts. First, we wanted to identify when candy sales peak during the year. Next, we want to break down these categories by demographic to identify who purchases candy the most. This will allow us to target the correct consumers to ultimately grow sales.
We first had to identify where our candy sales peaked. We did so by joining our transactions data with our product data. After identifying these peaks, we wanted to see who was making these purchases. We joined the demographics data to these two previous data sets to identify who made up the bulk of our Candy Sales.
This Analysis will help us in two ways. First, as candy is highly seasonal, it will allow us to see which periods of the year it spikes on. Secondly, we will be able to see who is purchasing these products the most. These trends will allow us to tailor our marketing strategies to the correct consumers.
completejourney – contains the data we will be using for this analysis
tidyverse – a collection of packages that allow for better data importing, tidying, manipulation, and data visualization
lubridate – makes it easier to work with data and time functions
ggplot2 – To facilitate in making more visually appealing graphics
library(completejourney)
library(tidyverse)
library(forcats)
library(dplyr)
library(knitr)
library(ggplot2)
transactions <- get_transactions()
dim(transactions)
## [1] 1469307 11
promotions <- get_promotions()
dim(promotions)
## [1] 20940529 5
products
## # A tibble: 92,331 × 7
## product_id manufacturer_id department brand product_category product_type
## <chr> <chr> <chr> <fct> <chr> <chr>
## 1 25671 2 GROCERY Natio… FRZN ICE ICE - CRUSH…
## 2 26081 2 MISCELLANEOUS Natio… <NA> <NA>
## 3 26093 69 PASTRY Priva… BREAD BREAD:ITALI…
## 4 26190 69 GROCERY Priva… FRUIT - SHELF S… APPLE SAUCE
## 5 26355 69 GROCERY Priva… COOKIES/CONES SPECIALTY C…
## 6 26426 69 GROCERY Priva… SPICES & EXTRAC… SPICES & SE…
## 7 26540 69 GROCERY Priva… COOKIES/CONES TRAY PACK/C…
## 8 26601 69 DRUG GM Priva… VITAMINS VITAMIN - M…
## 9 26636 69 PASTRY Priva… BREAKFAST SWEETS SW GDS: SW …
## 10 26691 16 GROCERY Priva… PNT BTR/JELLY/J… HONEY
## # ℹ 92,321 more rows
## # ℹ 1 more variable: package_size <chr>
demographics
## # A tibble: 801 × 8
## household_id age income home_ownership marital_status household_size
## <chr> <ord> <ord> <ord> <ord> <ord>
## 1 1 65+ 35-49K Homeowner Married 2
## 2 1001 45-54 50-74K Homeowner Unmarried 1
## 3 1003 35-44 25-34K <NA> Unmarried 1
## 4 1004 25-34 15-24K <NA> Unmarried 1
## 5 101 45-54 Under 15K Homeowner Married 4
## 6 1012 35-44 35-49K <NA> Married 5+
## 7 1014 45-54 15-24K <NA> Married 4
## 8 1015 45-54 50-74K Homeowner Unmarried 1
## 9 1018 45-54 35-49K Homeowner Married 5+
## 10 1020 45-54 25-34K Homeowner Married 2
## # ℹ 791 more rows
## # ℹ 2 more variables: household_comp <ord>, kids_count <ord>
After importing all of the necessary data and tools, we began our analysis
products %>%
filter(str_detect(product_category, regex("(CANDY - PACKAGED)|(CANDY - CHECKLANE)"))) %>%
inner_join(transactions, by = "product_id") %>%
inner_join(demographics, by = "household_id") %>%
mutate(month = month(transaction_timestamp)) %>%
group_by(month) %>%
summarise(total_sales = sum(sales_value)) %>%
ggplot(aes(x = month, y = total_sales)) +
geom_line(color = "orange") +
scale_y_continuous("Total Candy Sales",labels = scales::dollar)+
scale_x_continuous("Month of the Year", breaks = 1:12, labels = month.abb) +
ggtitle("Candy Sales by Month",
subtitle = "Spikes in April, October and December")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "italic"))
This chart identifies the spikes in demand. The three major spikes came before Easter, Halloween, and Christmas.
products %>%
filter(str_detect(product_category,regex("(CANDY - PACKAGED)|(CANDY - CHECKLANE)"))) %>%
inner_join(transactions, by = "product_id") %>%
inner_join(demographics, by = "household_id") %>%
group_by(income) %>%
summarise(total_sales = sum(sales_value)) %>%
ggplot(aes(x = income, y = total_sales)) +
geom_col(fill = "red", color = "black")+
labs(x = "Income", y = "Total Sales") +
ggtitle("Total Candy Sales by Income Range",
subtitle = "Income Range 50-74k Spends the Most on Candy")+
scale_y_continuous(labels = scales::dollar_format(prefix = "$")) +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "italic"))
This Chart Breaks down the income range who accounts for the most candy sales. The leading group is individuals making 50-74k.
products %>%
filter(str_detect(product_category, regex("(CANDY - PACKAGED)|(CANDY - CHECKLANE)"))) %>%
inner_join(transactions, by = "product_id") %>%
inner_join(demographics, by = "household_id") %>%
group_by(product_type) %>%
summarise(total_sales = sum(sales_value)) %>%
arrange(desc(total_sales)) %>%
top_n(5,total_sales) %>%
ggplot(aes(x = total_sales,y = product_type)) +
geom_col(fill = "red",color = "black") +
scale_y_discrete("Product Type")+
scale_x_continuous("Total Sales",labels = scales::dollar)+
ggtitle("Total Sales by Product Type",
subtitle = "Chocolate is the Top Product")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "italic"))
This chart identifies the top 5 candy products. Candy bars and Chocolate are doing well, but a close 5th is chewing gum.
#yearly
products %>%
inner_join(transactions, "product_id") %>%
select(product_id, sales_value, transaction_timestamp, product_type) %>%
filter(str_detect(product_type, regex("^CANDY"))) %>%
group_by(transaction_timestamp) %>%
summarize(total_sales = sum(sales_value)) %>%
filter(total_sales > 20) %>%
ggplot(aes(x = transaction_timestamp, y = total_sales)) +
geom_point() +
scale_y_continuous(labels = scales::dollar) +
labs(title = "Sales greater than $20 year round",
subtitle = "We wanted to preface with this graph because it contains all of the outliers that shift the sales
during holidays. The majority of transactions are under $20 and each holiday has their own chunk of
outliers.",
x = "Date",
y = "Sales")
This graph shows the amount spent per transaction on candy. As you can see, majority of the purchases are under 20 dollars. However, many of the out-liers come during our sales peak.
products %>%
filter(str_detect(product_category, regex("(CANDY - PACKAGED)|(CANDY -CHECKLANE)"))) %>%
inner_join(transactions, by = "product_id") %>%
inner_join(demographics, by = "household_id") %>%
group_by(kids_count,income) %>%
summarise(quantity_purchased = sum(quantity)) %>%
ggplot(aes(x=kids_count,y=quantity_purchased)) +
geom_boxplot(fill = "red", color = "black")+
labs(x = "Number of Kids in Household", y = "Quantity Purchased")+
ggtitle("Number of Kids vs the Quanity of Candy Sold",
subtitle = "Families with More Kids Buy Less Candy")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "italic"))
This chart shows the relationship between the number of kids a family has and their candy purchases. Surprisingly, the greater the number of kids you have, the less candy you buy.
products %>%
filter(str_detect(product_category, regex("(CANDY - PACKAGED)|(CANDY - CHECKLANE)"))) %>%
inner_join(transactions, by = "product_id") %>%
inner_join(demographics, by = "household_id") %>%
group_by(marital_status,income) %>%
summarise(quantity_purchased = sum(quantity)) %>%
ggplot(aes(x=marital_status,y=quantity_purchased)) +
geom_boxplot(color = "black",fill = "blue")+
labs(x = "Marital Status", y = "Quantity Purchased")+
ggtitle("Marital Status vs the Quanity of Candy Sold",
subtitle = "Both Groups Behave Similarly")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "italic"))
This chart shows the influence you marital status has on candy purchases. While unmarried have a higher range, the numbers are overall very similar.
demographics %>%
filter(household_size %in% c('1', '2', '3', '4', '5+')) %>%
inner_join(transactions, "household_id") %>%
inner_join(products, "product_id") %>%
filter(str_detect(product_category, regex(
"(CANDY - PACKAGED)|(CANDY - CHECKLANE)"))) %>%
group_by(household_size, income) %>%
summarize(total_sales = sum(sales_value)) %>%
arrange(income) %>%
ggplot(aes(x = total_sales, y = income, color = household_size)) +
geom_point(size = 2) +
labs(x = "Total Sales", y = "Income") +
scale_x_continuous(labels = scales::dollar_format(prefix = "$"))+
ggtitle("Sales by Income Range and Household Size",
subtitle = "25-99k Appear to be our Strongest Demographic")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "italic"))
This chart identifies the spending habits of each income range based on their household size. Consumers with less kids tend to spend more on candy. Furthermore, lower income levels spend more on candy as well.
products %>%
inner_join(transactions, "product_id") %>%
select(product_id, sales_value, transaction_timestamp, product_type) %>%
filter(str_detect(product_type, regex("^CANDY"))) %>%
filter(week(transaction_timestamp) == 14:18) %>%
group_by(transaction_timestamp) %>%
summarize(total_sales = sum(sales_value)) %>%
ggplot(aes(x = transaction_timestamp, y = total_sales)) +
geom_smooth() +
scale_y_continuous(labels = scales::dollar) +
labs(title = "Sales from April 1st to May 6th",
subtitle = "The graph peaks around Easter and has a second peak after Easter likely when candy goes
on sale",
x = "Date",
y = "Sales") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "italic"))
This chart identifies the average amount spent on candy transactions around Easter (April 10th). You can see the sales spike around and after the holiday.
products %>%
inner_join(transactions, "product_id") %>%
select(product_id, sales_value, transaction_timestamp, product_type) %>%
filter(str_detect(product_type, regex("^CANDY"))) %>%
filter(week(transaction_timestamp) == 40:45) %>%
group_by(transaction_timestamp) %>%
summarize(total_sales = sum(sales_value)) %>%
ggplot(aes(x = transaction_timestamp, y = total_sales)) +
geom_smooth() +
scale_y_continuous(labels = scales::dollar) +
labs(title = "Sales from October 1st to October 31st",
subtitle = "The graph gradually increases thoughout the month and has a strong increase during the week
of halloween",
x = "Date",
y = "Sales") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "italic"))
This chart shows the average transaction cost for sales during the month of October. Sales skyrocket right before Halloween. They don’t see a similar resurgence in the following weeks like Easter does.
products %>%
inner_join(transactions, "product_id") %>%
select(product_id, sales_value, transaction_timestamp, product_type) %>%
filter(str_detect(product_type, regex("^CANDY"))) %>%
filter(month(transaction_timestamp) == 12) %>%
group_by(transaction_timestamp) %>%
summarize(total_sales = sum(sales_value)) %>%
ggplot(aes(x = transaction_timestamp, y = total_sales)) +
geom_smooth() +
scale_y_continuous(labels = scales::dollar) +
labs(title = "Sales from December 1st to December 31st",
subtitle = "The graph gradually increases thoughout the month and peaks around christmas with a strong
drop into the new year",
x = "Date",
y = "Sales") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "italic"))
This chart shows the average amount spent per transaction in the month of December. This graph is similar to the data we saw in October, as sales decrease after the holiday.
At the start, our group set out to identify when candy sales peaked and how to increase those sales during those spikes.
To begin, we started by joining products with demographics and transactions. We created a line chart to identify the peaks of candy sales. These came around Easter, Halloween, and Christmas. Next, we wanted to see which income range was buying the most candy. We created a chart that showed total sales by income range, finding the highest amount spent in the 50-74k bucket. Next, we wanted to see which product type of candy was sold the most. We identified that chocolate was the most common. Next, we broke down how demographics like marital status, number of kids and household size affected these numbers. Some of these stood out, such as families with less kids being more prone to spend money on candy. Lastly, we identified the sales pattern of our candy for each holiday peak. One thing that stood out to us was that more people bought candy after Easter. This is a great time frame for increased marketing and promotions.
There are a few key takeaways we believe will help us increase our candy sales next year. The first of which is proper marketing to selected groups. We would increase targeted ads, campaigns, and other promotions to consumers in the income range of 35-99k. The three income ranges in this area were our highest sellers for candy. Furthermore, we should focus our efforts on consumers without children as they buy more candy than consumers without children. Lastly, I would increase marketing about candy sales the weeks after holidays. Like we saw with Easter, many consumers plan on buying cheaper candy for themselves after the holidays. With a focused marketing strategy for the selected groups, we could increase candy sales in this time frame for Christmas and Halloween as well.
There are a few limitations to this study. The first of which is that the data we are using is from 2017. This is potentially bad as the data is now seven years old. It is possible that consumer habits have changed since then. The second limitation with our study is that we only based it on one year of data. There are always fluctuations in data year to year. It is possible that some of our insights were based on out-liers rather than consistent patterns. The third limitation is that the product types are broad, the top selling product is candy bars. It is impossible to determine which specific candy bar is doing the best with the context given.