Introduction
Business Question: The business problem we are addressing is twofold. First, we aim to identify when ice cream sales peak during the year to understand the seasonality of demand. Second, we want to break down these sales by key demographic groups to determine who buys the most ice cream. By identifying these patterns, we can better target the right consumers, tailoring marketing strategies that will ultimately boost ice cream sales and drive growth for Regork.
Growing Ice Cream Sales My business challenge is deeply understanding the factors influencing ice cream sales. Our goal is to identify the periods when ice cream sales reach their highest levels throughout the year and to analyze them by demographic segments to uncover who buys ice cream most often. With this knowledge, we can target appropriate consumers more effectively, ultimately boosting ice cream sales.
How This Analysis Will Benefit Us This analysis will provide us with two key advantages. First, since ice cream sales exhibit significant seasonal trends, understanding the times of increased demand will help us optimize our inventory management and marketing strategies during these peak periods, leading to cost savings and increased revenue. Second, by pinpointing the demographic profiles of our main customers, we can customize our marketing initiatives to reach and engage these target groups effectively. These insights will enable us to design targeted promotions, seasonal campaigns, and product offerings that resonate with our customer base, ultimately leading to increased ice cream sales.
completejourney- Data used for this analysis.
tidyverse- Collection of R packages for data science and analysis.
dplyr- Data manipulation and transformation using a tidy syntax.
knitr- Dynamic report generation in R using R Markdown.
ggplot2- Data visualization based on the Grammar of Graphics.
forcats- Tools for working with categorical variables (factors).
lubridate- Date and time manipulation and parsing in R.
library(completejourney)
## Welcome to the completejourney package! Learn more about these data
## sets at http://bit.ly/completejourney.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(forcats)
library(dplyr)
library(knitr)
transactions <- get_transactions()
dim(transactions)
## [1] 1469307 11
promotions <- get_promotions()
dim(promotions)
## [1] 20940529 5
products
## # A tibble: 92,331 × 7
## product_id manufacturer_id department brand product_category product_type
## <chr> <chr> <chr> <fct> <chr> <chr>
## 1 25671 2 GROCERY Natio… FRZN ICE ICE - CRUSH…
## 2 26081 2 MISCELLANEOUS Natio… <NA> <NA>
## 3 26093 69 PASTRY Priva… BREAD BREAD:ITALI…
## 4 26190 69 GROCERY Priva… FRUIT - SHELF S… APPLE SAUCE
## 5 26355 69 GROCERY Priva… COOKIES/CONES SPECIALTY C…
## 6 26426 69 GROCERY Priva… SPICES & EXTRAC… SPICES & SE…
## 7 26540 69 GROCERY Priva… COOKIES/CONES TRAY PACK/C…
## 8 26601 69 DRUG GM Priva… VITAMINS VITAMIN - M…
## 9 26636 69 PASTRY Priva… BREAKFAST SWEETS SW GDS: SW …
## 10 26691 16 GROCERY Priva… PNT BTR/JELLY/J… HONEY
## # ℹ 92,321 more rows
## # ℹ 1 more variable: package_size <chr>
demographics
## # A tibble: 801 × 8
## household_id age income home_ownership marital_status household_size
## <chr> <ord> <ord> <ord> <ord> <ord>
## 1 1 65+ 35-49K Homeowner Married 2
## 2 1001 45-54 50-74K Homeowner Unmarried 1
## 3 1003 35-44 25-34K <NA> Unmarried 1
## 4 1004 25-34 15-24K <NA> Unmarried 1
## 5 101 45-54 Under 15K Homeowner Married 4
## 6 1012 35-44 35-49K <NA> Married 5+
## 7 1014 45-54 15-24K <NA> Married 4
## 8 1015 45-54 50-74K Homeowner Unmarried 1
## 9 1018 45-54 35-49K Homeowner Married 5+
## 10 1020 45-54 25-34K Homeowner Married 2
## # ℹ 791 more rows
## # ℹ 2 more variables: household_comp <ord>, kids_count <ord>
products %>%
filter(str_detect(product_category, regex("ICE CREAM/MILK/SHERBTS"))) %>%
inner_join(transactions, by = "product_id") %>%
inner_join(demographics, by = "household_id") %>%
mutate(month = month(transaction_timestamp)) %>%
group_by(month) %>%
summarise(total_sales = sum(sales_value, na.rm = TRUE)) %>%
ggplot(aes(x = month, y = total_sales)) +
geom_line(color = "red") +
scale_y_continuous("Total Ice Cream Sales", labels = scales::dollar) +
scale_x_continuous("Month of the Year", breaks = 1:12, labels = month.abb) +
ggtitle("Ice Cream Sales by Month",
subtitle = "Highest Sales are in July") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "italic"))
The chart above shows that the highest Ice Cream sales happen on July 4th. Next I wanted to see what exact day where ice sales are the highest.
#July sales of ice cream
products %>%
inner_join(transactions, by = "product_id") %>%
select(product_id, sales_value, transaction_timestamp, product_type) %>%
filter(str_detect(product_type, regex("^ICE CREAM"))) %>% # Focus on ice cream
filter(month(transaction_timestamp) == 7) %>% # Filter for July
mutate(transaction_timestamp = as.Date(transaction_timestamp)) %>% # Convert to Date
group_by(transaction_timestamp) %>% # Group by transaction timestamp
summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>% # Summarize total sales by day
ggplot(aes(x = transaction_timestamp, y = total_sales)) +
geom_line(color = "blue") + # Use lines to represent sales
scale_y_continuous(labels = scales::dollar) + # Format y-axis as dollar values
scale_x_date(date_breaks = "1 day", date_labels = "%d") + # Show all days on the x-axis
labs(title = "Ice Cream Sales from July 1st to July 31st",
subtitle = "Daily Sales Trends for Ice Cream in July",
x = "Day of July",
y = "Sales ($)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
What this shows is that July 4th is where the most ice cream is sold in the entire year.
Next I Wanted to What are the biggest money maker of Ice Cream. Maybe we have a large amount being sold in a particular product.
ice_cream_data <- products %>%
filter(str_detect(product_category, regex("ICE CREAM/MILK/SHERBTS", ignore_case = TRUE)))
ice_cream_product_types <- ice_cream_data %>%
select(product_id, product_type)
print(ice_cream_product_types)
## # A tibble: 815 × 2
## product_id product_type
## <chr> <chr>
## 1 26738 TRADITIONAL
## 2 26941 TRADITIONAL
## 3 27030 TRADITIONAL
## 4 27158 TRADITIONAL
## 5 27404 TRADITIONAL
## 6 27522 TRADITIONAL
## 7 27558 TRADITIONAL
## 8 27686 TRADITIONAL
## 9 27762 TRADITIONAL
## 10 27764 TRADITIONAL
## # ℹ 805 more rows
#chart
ice_cream_sales <- ice_cream_data %>%
inner_join(transactions, by = "product_id") %>%
inner_join(demographics, by = "household_id") %>%
group_by(product_type) %>%
summarise(total_sales = sum(sales_value, na.rm = TRUE))
# Ice Cream sales
print(ice_cream_sales)
## # A tibble: 6 × 2
## product_type total_sales
## <chr> <dbl>
## 1 PAILS 1214.
## 2 PREMIUM 14841.
## 3 PREMIUM PINTS 1857.
## 4 QUARTS 303.
## 5 SUPER PREMIUM PINTS 3404.
## 6 TRADITIONAL 4658.
#chart of Ice Cream sales per product
ice_cream_sales %>%
ggplot(aes(x = total_sales, y = product_type)) +
geom_col(fill = "blue", color = "black") +
scale_y_discrete("Product Type") +
scale_x_continuous("Total Sales", labels = scales::dollar) +
ggtitle("Total Sales by Product Type for ICE CREAM/MILK/SHERBTS",
subtitle = "Product Types and Sales") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "italic"))
#Pails refers <-- to larger containers of ice cream, often found in bulk sizes (e.g., 3 gallons or more).
#Premium <-- This category generally includes ice cream made with higher quality ingredients, usually containing more milk fat (often 14% or more). Premium ice creams are richer and creamier than regular ice creams.
#Premium Pints <-- Similar to the premium category, but sold in pint-sized containers (16 oz). These are convenient for individual consumption or small households and are often marketed as high-quality options.
#Quarts <-- Ice cream sold in quart-sized containers (32 oz). This size is popular for family consumption, offering a good balance between quantity and convenience.
#Super Premium Pints <-- This is a higher tier than regular premium ice creams, with even more milk fat (often 16% or higher) and more intense flavors. These are typically marketed as gourmet or artisanal products and are sold in pint-sized containers.
#Traditional <-- This category typically refers to regular ice cream that doesn't fall into the premium or super-premium categories. It usually contains lower milk fat (10% to 12%) and is made with standard ingredients
Charts and graphs is total sales per ice cream Product
In the next Chart I wanted to know what is the largest buyers per Income Level
products %>%
inner_join(transactions, by = "product_id") %>%
inner_join(demographics, by = "household_id") %>%
filter(str_detect(product_type, regex("^ICE CREAM"))) %>%
group_by(income, product_type) %>%
summarise(total_sales = sum(sales_value, na.rm = TRUE)) %>%
ggplot(aes(x = income, y = total_sales, fill = product_type)) +
geom_bar(stat = "identity", position = "stack") +
scale_y_continuous(labels = scales::dollar) +
labs(title = "Ice Cream Sales by Income Level and Product Type",
x = "Income Level",
y = "Total Sales ($)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability
## `summarise()` has grouped output by 'income'. You can override using the
## `.groups` argument.
Next I wanted to combine monthly Ice Cream sales, Income Level and time of year. I wanted to see if a particular group stands out during July 4th.
# Join datasets and filter for ice cream products
monthly_income_ice_cream_sales <- transactions %>%
inner_join(products, by = "product_id") %>%
left_join(demographics, by = "household_id") %>%
filter(str_detect(product_type, regex("^ICE CREAM")))
# Calculate total sales by month and income, removing NA values
monthly_income_sales <- monthly_income_ice_cream_sales %>%
mutate(transaction_month = floor_date(transaction_timestamp, "month") %>% as.Date()) %>% # Convert to Date
filter(!is.na(income) & !is.na(sales_value)) %>% # Remove NA values for income and sales_value
group_by(transaction_month, income) %>%
summarise(total_sales = sum(sales_value, na.rm = TRUE), .groups = "drop") %>%
arrange(transaction_month, income)
# View the results
print(monthly_income_sales)
## # A tibble: 119 × 3
## transaction_month income total_sales
## <date> <ord> <dbl>
## 1 2017-01-01 Under 15K 20.6
## 2 2017-01-01 15-24K 7.88
## 3 2017-01-01 25-34K 9.04
## 4 2017-01-01 35-49K 32.7
## 5 2017-01-01 50-74K 59.7
## 6 2017-01-01 75-99K 24.2
## 7 2017-01-01 100-124K 7.87
## 8 2017-01-01 125-149K 46.5
## 9 2017-01-01 150-174K 16.6
## 10 2017-01-01 175-199K 3.98
## # ℹ 109 more rows
# Define a custom color palette for distinct income levels
custom_colors <- c(
"Under 15K" = "red",
"15-24K" = "orange",
"25-34K" = "yellow",
"35-49K" = "green",
"50-74K" = "blue",
"75-99K" = "purple",
"100-124K" = "cyan",
"125-149K" = "magenta",
"150-174K" = "brown",
"175-199K" = "pink"
)
# Visualize total sales by income over time with improved readability and distinct colors
ggplot(monthly_income_sales, aes(x = transaction_month, y = total_sales, color = income)) +
geom_line(size = 1.2) + # Thicker lines for better visibility
labs(title = "Monthly Ice Cream Sales by Income",
x = "Month",
y = "Total Sales ($)",
color = "Income Level") +
theme_minimal(base_size = 15) + # Increase base font size
scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") + # Format x-axis labels
scale_color_manual(values = custom_colors) + # Apply custom colors
theme(axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels
panel.grid.major = element_line(color = "grey80"), # Add grid lines for clarity
panel.grid.minor = element_blank()) # Remove minor grid lines
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The graph above shows that 50-74k is the largest buyers of ice cream during July.
Lastly, I wanted to know if households with kids in the 50-74k are purchasing their weight.
income_children_data <- transactions %>%
inner_join(products, by = "product_id") %>%
left_join(demographics, by = "household_id") %>%
filter(!is.na(income) & !is.na(kids_count)) # Ensure no NA values in income or kids_count
# Count the number of households for each income and kids count, scaled per 1000
income_kids_count <- income_children_data %>%
group_by(income, kids_count) %>%
summarise(household_count = n() / 1000, .groups = "drop") %>% # Divide by 1000
arrange(income, kids_count)
# View the results
print(income_kids_count)
## # A tibble: 45 × 3
## income kids_count household_count
## <ord> <ord> <dbl>
## 1 Under 15K 0 38.4
## 2 Under 15K 1 12.8
## 3 Under 15K 2 7.43
## 4 Under 15K 3+ 5.44
## 5 15-24K 0 40.3
## 6 15-24K 1 12.3
## 7 15-24K 2 7.53
## 8 15-24K 3+ 1.42
## 9 25-34K 0 47.1
## 10 25-34K 1 15.4
## # ℹ 35 more rows
# Plot the results as a bar graph per 1000 households
ggplot(income_kids_count, aes(x = income, y = household_count, fill = as.factor(kids_count))) +
geom_bar(stat = "identity", position = "dodge") + # Dodge position for side-by-side bars
labs(title = "Household Count by Income and Number of Children",
x = "Income Level",
y = "Household Count (per 1000)",
fill = "Number of Children") +
theme_minimal(base_size = 15) + # Increase base font size
theme(axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels
panel.grid.major = element_line(color = "grey80"), # Add grid lines for clarity
panel.grid.minor = element_blank()) # Remove minor grid lines
This shows that in every income level children in the household is not as high as they could. Especially in 50-74k range.
In these next 2 graphs I want to look at the percentages of household makeup and the purchases of ice cream.
# Join datasets to get household counts
household_data <- transactions %>%
inner_join(products, by = "product_id") %>%
left_join(demographics, by = "household_id") %>%
filter(!is.na(kids_count)) # Ensure no NA values in kids_count
# Calculate total households and households with children
household_summary <- household_data %>%
summarise(
total_households = n(),
households_with_children = sum(kids_count > 0, na.rm = TRUE), # Count households with children
households_without_children = total_households - households_with_children # Calculate households without children
) %>%
mutate(
percentage_with_children = (households_with_children / total_households) * 100,
percentage_without_children = (households_without_children / total_households) * 100
)
# Prepare data for plotting
plot_data <- data.frame(
category = c("With Children", "Without Children"),
percentage = c(household_summary$percentage_with_children, household_summary$percentage_without_children)
)
# Create bar graph with percentages inside the bars
p <- ggplot(plot_data, aes(x = category, y = percentage, fill = category)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste0(round(percentage, 1), "%")),
vjust = 1.5, # Position text inside the bars
size = 4, # Reduce text size to fit
color = "white") + # White text to contrast with bar color
scale_y_continuous(limits = c(0, max(plot_data$percentage) * 1.2)) + # Increase y-axis limits for more space
labs(title = "Percentage of Households with and without Children",
x = "Household Category",
y = "Percentage (%)") +
theme_minimal(base_size = 15) +
theme(legend.position = "none") # Remove legend if not needed
# Display the plot
print(p)
# Calculate the total quantity purchased grouped by the number of kids
ice_cream_data <- products %>%
filter(str_detect(product_category, regex("ICE CREAM|MILK|SHERBETS", ignore_case = TRUE))) %>%
inner_join(transactions, by = "product_id") %>%
inner_join(demographics, by = "household_id") %>%
group_by(kids_count) %>%
summarise(quantity_purchased = sum(quantity, na.rm = TRUE)) %>%
ungroup()
# Calculate percentages
ice_cream_data <- ice_cream_data %>%
mutate(percentage = (quantity_purchased / sum(quantity_purchased)) * 100)
# Create a pie chart with percentages
ggplot(ice_cream_data, aes(x = "", y = quantity_purchased, fill = factor(kids_count))) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
geom_text(aes(label = paste0(round(percentage, 1), "%")),
position = position_stack(vjust = 0.5), # Center labels on the segments
color = "white", size = 4) + # Change the color and size of the text
labs(title = "Ice Cream Purchases by Number of Kids in Household",
fill = "Number of Kids") +
theme_void() + # Remove axes and background
theme(plot.title = element_text(face = "bold", size = 16))
What these graphs show is that generally it is equal between households with kids and without. However it seems that households with children are under purchasing ice cream by 1 percent. Considering that kids love ice cream there might be some room to improve by more than 1 percent.
These show that households with children ARE in the minority but we want to know if households with kids are purchasing their weight.
In the next chart I wanted to show the Number of kids vs Quantity of ice cream sold.
products %>%
filter(str_detect(product_category, regex("(ICE CREAM)"))) %>%
inner_join(transactions, by = "product_id") %>%
inner_join(demographics, by = "household_id") %>%
group_by(kids_count, income) %>%
summarise(quantity_purchased = sum(quantity)) %>%
ggplot(aes(x = kids_count, y = quantity_purchased)) +
geom_boxplot(fill = "blue", color = "black") +
labs(x = "Number of Kids in Household", y = "Quantity Purchased") +
ggtitle("Number of Kids vs the Quantity of Ice Cream Sold",
subtitle = "Exploring Ice Cream Sales by Household Size") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "italic"))
## `summarise()` has grouped output by 'kids_count'. You can override using the
## `.groups` argument.
In the end there are 4 themes that we get out of this. 1: July 4th is the day with the most ice cream purchases. 2: 50-74k is the income range where people buy the most ice cream. 3: Premium products are the most purchased ice cream. 4: Households without kids and in the 50-74k salary range, buy the most ice cream.
How I Answered the Problem To address the challenge of understanding ice cream sales dynamics, I implemented a comprehensive analytical approach. This involved examining our transaction data alongside product data, which allowed me to identify sales peaks throughout the year and reveal patterns in consumer purchasing behavior. By integrating demographic data, I could analyze the characteristics of our ice cream buyers and uncover which demographic segments contributed most significantly to sales. This multi-faceted analysis highlighted trends in purchasing behavior and provided insights into the specific consumer groups we should target, ultimately guiding our strategies to increase ice cream sales effectively.
Recommendations Based on our analysis of ice cream sales and consumer demographics, I recommend implementing a targeted marketing strategy that leverages the identified peak sales period on July 4th and focuses on key demographic groups. The strategy should emphasize the $50K to $74K income range, where we see the highest volume of ice cream purchases. To effectively engage this demographic, we should promote our premium ice cream products, as they have shown to be the most popular among consumers. This can be achieved through promotional campaigns that highlight unique flavors and limited-time offers specifically around the Fourth of July. Additionally, we must intensify our marketing efforts towards households with kids, who often purchase ice cream as a treat for their children, especially during summer celebrations. By aligning our marketing efforts with the preferences of our target demographics and the trends observed during peak purchase times, we can significantly boost ice cream sales and foster brand loyalty.
Conclusion Our analysis underscores a substantial opportunity for Regork to enhance ice cream sales, particularly on July 4th, the day with the highest ice cream purchases of the year. Households with an income between $50K and $74K emerge as the most active buyers, representing a significant share of total sales. This demographic often seeks affordable yet enjoyable premium treats to share with family and friends during festive occasions, making them a key target for our marketing initiatives. Given these insights, it is essential that Regork strategically focuses its marketing efforts on this income demographic leading up to July 4th. In summary, a focused marketing strategy targeting the $50K to $74K income demographic, families with kids, and emphasizing our premium products will enable Regork to capitalize on the increased demand for ice cream during this critical time.