Midterm Project

Introduction

Introduction

Business Question: The business problem we are addressing is twofold. First, we aim to identify when ice cream sales peak during the year to understand the seasonality of demand. Second, we want to break down these sales by key demographic groups to determine who buys the most ice cream. By identifying these patterns, we can better target the right consumers, tailoring marketing strategies that will ultimately boost ice cream sales and drive growth for Regork.

Growing Ice Cream Sales My business challenge is deeply understanding the factors influencing ice cream sales. Our goal is to identify the periods when ice cream sales reach their highest levels throughout the year and to analyze them by demographic segments to uncover who buys ice cream most often. With this knowledge, we can target appropriate consumers more effectively, ultimately boosting ice cream sales.

How This Analysis Will Benefit Us This analysis will provide us with two key advantages. First, since ice cream sales exhibit significant seasonal trends, understanding the times of increased demand will help us optimize our inventory management and marketing strategies during these peak periods, leading to cost savings and increased revenue. Second, by pinpointing the demographic profiles of our main customers, we can customize our marketing initiatives to reach and engage these target groups effectively. These insights will enable us to design targeted promotions, seasonal campaigns, and product offerings that resonate with our customer base, ultimately leading to increased ice cream sales.

Packages Required

completejourney- Data used for this analysis.

tidyverse- Collection of R packages for data science and analysis.

dplyr- Data manipulation and transformation using a tidy syntax.

knitr- Dynamic report generation in R using R Markdown.

ggplot2- Data visualization based on the Grammar of Graphics.

forcats- Tools for working with categorical variables (factors).

lubridate- Date and time manipulation and parsing in R.

Preperation

library(completejourney)

## Welcome to the completejourney package! Learn more about these data
## sets at http://bit.ly/completejourney.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(forcats)
library(dplyr)
library(knitr)

transactions <- get_transactions()
dim(transactions)

## [1] 1469307      11

promotions <- get_promotions()
dim(promotions)

## [1] 20940529        5

products

## # A tibble: 92,331 × 7
##    product_id manufacturer_id department    brand  product_category product_type
##    <chr>      <chr>           <chr>         <fct>  <chr>            <chr>       
##  1 25671      2               GROCERY       Natio… FRZN ICE         ICE - CRUSH…
##  2 26081      2               MISCELLANEOUS Natio… <NA>             <NA>        
##  3 26093      69              PASTRY        Priva… BREAD            BREAD:ITALI…
##  4 26190      69              GROCERY       Priva… FRUIT - SHELF S… APPLE SAUCE 
##  5 26355      69              GROCERY       Priva… COOKIES/CONES    SPECIALTY C…
##  6 26426      69              GROCERY       Priva… SPICES & EXTRAC… SPICES & SE…
##  7 26540      69              GROCERY       Priva… COOKIES/CONES    TRAY PACK/C…
##  8 26601      69              DRUG GM       Priva… VITAMINS         VITAMIN - M…
##  9 26636      69              PASTRY        Priva… BREAKFAST SWEETS SW GDS: SW …
## 10 26691      16              GROCERY       Priva… PNT BTR/JELLY/J… HONEY       
## # ℹ 92,321 more rows
## # ℹ 1 more variable: package_size <chr>

demographics

## # A tibble: 801 × 8
##    household_id age   income    home_ownership marital_status household_size
##    <chr>        <ord> <ord>     <ord>          <ord>          <ord>         
##  1 1            65+   35-49K    Homeowner      Married        2             
##  2 1001         45-54 50-74K    Homeowner      Unmarried      1             
##  3 1003         35-44 25-34K    <NA>           Unmarried      1             
##  4 1004         25-34 15-24K    <NA>           Unmarried      1             
##  5 101          45-54 Under 15K Homeowner      Married        4             
##  6 1012         35-44 35-49K    <NA>           Married        5+            
##  7 1014         45-54 15-24K    <NA>           Married        4             
##  8 1015         45-54 50-74K    Homeowner      Unmarried      1             
##  9 1018         45-54 35-49K    Homeowner      Married        5+            
## 10 1020         45-54 25-34K    Homeowner      Married        2             
## # ℹ 791 more rows
## # ℹ 2 more variables: household_comp <ord>, kids_count <ord>

Ice Cream Sales

products %>%
  filter(str_detect(product_category, regex("ICE CREAM/MILK/SHERBTS"))) %>%  
  inner_join(transactions, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  mutate(month = month(transaction_timestamp)) %>%
  group_by(month) %>%
  summarise(total_sales = sum(sales_value, na.rm = TRUE)) %>%  
  ggplot(aes(x = month, y = total_sales)) +
  geom_line(color = "red") +
  scale_y_continuous("Total Ice Cream Sales", labels = scales::dollar) +  
  scale_x_continuous("Month of the Year", breaks = 1:12, labels = month.abb) +
  ggtitle("Ice Cream Sales by Month",
          subtitle = "Highest Sales are in July") +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

The chart above shows that the highest Ice Cream sales happen on July 4th. Next I wanted to see what exact day where ice sales are the highest.

#July sales of ice cream

products %>%
  inner_join(transactions, by = "product_id") %>%
  select(product_id, sales_value, transaction_timestamp, product_type) %>%
  filter(str_detect(product_type, regex("^ICE CREAM"))) %>%  # Focus on ice cream
  filter(month(transaction_timestamp) == 7) %>%  # Filter for July
  mutate(transaction_timestamp = as.Date(transaction_timestamp)) %>%  # Convert to Date
  group_by(transaction_timestamp) %>%  # Group by transaction timestamp
  summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>%  # Summarize total sales by day
  ggplot(aes(x = transaction_timestamp, y = total_sales)) +
  geom_line(color = "blue") +  # Use lines to represent sales
  scale_y_continuous(labels = scales::dollar) +  # Format y-axis as dollar values
  scale_x_date(date_breaks = "1 day", date_labels = "%d") +  # Show all days on the x-axis
  labs(title = "Ice Cream Sales from July 1st to July 31st", 
       subtitle = "Daily Sales Trends for Ice Cream in July",
       x = "Day of July",
       y = "Sales ($)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

What this shows is that July 4th is where the most ice cream is sold in the entire year.

Next I Wanted to What are the biggest money maker of Ice Cream. Maybe we have a large amount being sold in a particular product.

ice_cream_data <- products %>%
  filter(str_detect(product_category, regex("ICE CREAM/MILK/SHERBTS", ignore_case = TRUE)))

ice_cream_product_types <- ice_cream_data %>%
  select(product_id, product_type)

print(ice_cream_product_types)

## # A tibble: 815 × 2
##    product_id product_type
##    <chr>      <chr>       
##  1 26738      TRADITIONAL 
##  2 26941      TRADITIONAL 
##  3 27030      TRADITIONAL 
##  4 27158      TRADITIONAL 
##  5 27404      TRADITIONAL 
##  6 27522      TRADITIONAL 
##  7 27558      TRADITIONAL 
##  8 27686      TRADITIONAL 
##  9 27762      TRADITIONAL 
## 10 27764      TRADITIONAL 
## # ℹ 805 more rows

 #chart 
ice_cream_sales <- ice_cream_data %>%
  inner_join(transactions, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(product_type) %>%
  summarise(total_sales = sum(sales_value, na.rm = TRUE))
# Ice Cream sales 
print(ice_cream_sales)

## # A tibble: 6 × 2
##   product_type        total_sales
##   <chr>                     <dbl>
## 1 PAILS                     1214.
## 2 PREMIUM                  14841.
## 3 PREMIUM PINTS             1857.
## 4 QUARTS                     303.
## 5 SUPER PREMIUM PINTS       3404.
## 6 TRADITIONAL               4658.

#chart of Ice Cream sales per product 

ice_cream_sales %>%
  ggplot(aes(x = total_sales, y = product_type)) +
  geom_col(fill = "blue", color = "black") +
  scale_y_discrete("Product Type") +
  scale_x_continuous("Total Sales", labels = scales::dollar) +
  ggtitle("Total Sales by Product Type for ICE CREAM/MILK/SHERBTS",
          subtitle = "Product Types and Sales") +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

#Pails refers <-- to larger containers of ice cream, often found in bulk sizes (e.g., 3 gallons or more).
#Premium <--  This category generally includes ice cream made with higher quality ingredients, usually containing more milk fat (often 14% or more). Premium ice creams are richer and creamier than regular ice creams.
#Premium Pints <-- Similar to the premium category, but sold in pint-sized containers (16 oz). These are convenient for individual consumption or small households and are often marketed as high-quality options.
#Quarts <-- Ice cream sold in quart-sized containers (32 oz). This size is popular for family consumption, offering a good balance between quantity and convenience.
#Super Premium Pints <-- This is a higher tier than regular premium ice creams, with even more milk fat (often 16% or higher) and more intense flavors. These are typically marketed as gourmet or artisanal products and are sold in pint-sized containers.
#Traditional  <-- This category typically refers to regular ice cream that doesn't fall into the premium or super-premium categories. It usually contains lower milk fat (10% to 12%) and is made with standard ingredients

Charts and graphs is total sales per ice cream Product

In the next Chart I wanted to know what is the largest buyers per Income Level

products %>%
  inner_join(transactions, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  filter(str_detect(product_type, regex("^ICE CREAM"))) %>%
  group_by(income, product_type) %>%
  summarise(total_sales = sum(sales_value, na.rm = TRUE)) %>%
  ggplot(aes(x = income, y = total_sales, fill = product_type)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Ice Cream Sales by Income Level and Product Type",
       x = "Income Level", 
       y = "Total Sales ($)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for better readability

## `summarise()` has grouped output by 'income'. You can override using the
## `.groups` argument.

Next I wanted to combine monthly Ice Cream sales, Income Level and time of year. I wanted to see if a particular group stands out during July 4th.

# Join datasets and filter for ice cream products
monthly_income_ice_cream_sales <- transactions %>%
  inner_join(products, by = "product_id") %>%
  left_join(demographics, by = "household_id") %>%
  filter(str_detect(product_type, regex("^ICE CREAM")))

# Calculate total sales by month and income, removing NA values
monthly_income_sales <- monthly_income_ice_cream_sales %>%
  mutate(transaction_month = floor_date(transaction_timestamp, "month") %>% as.Date()) %>%  # Convert to Date
  filter(!is.na(income) & !is.na(sales_value)) %>%  # Remove NA values for income and sales_value
  group_by(transaction_month, income) %>%
  summarise(total_sales = sum(sales_value, na.rm = TRUE), .groups = "drop") %>%
  arrange(transaction_month, income)

# View the results
print(monthly_income_sales)

## # A tibble: 119 × 3
##    transaction_month income    total_sales
##    <date>            <ord>           <dbl>
##  1 2017-01-01        Under 15K       20.6 
##  2 2017-01-01        15-24K           7.88
##  3 2017-01-01        25-34K           9.04
##  4 2017-01-01        35-49K          32.7 
##  5 2017-01-01        50-74K          59.7 
##  6 2017-01-01        75-99K          24.2 
##  7 2017-01-01        100-124K         7.87
##  8 2017-01-01        125-149K        46.5 
##  9 2017-01-01        150-174K        16.6 
## 10 2017-01-01        175-199K         3.98
## # ℹ 109 more rows

# Define a custom color palette for distinct income levels
custom_colors <- c(
  "Under 15K" = "red",
  "15-24K" = "orange",
  "25-34K" = "yellow",
  "35-49K" = "green",
  "50-74K" = "blue",
  "75-99K" = "purple",
  "100-124K" = "cyan",
  "125-149K" = "magenta",
  "150-174K" = "brown",
  "175-199K" = "pink"
)

# Visualize total sales by income over time with improved readability and distinct colors
ggplot(monthly_income_sales, aes(x = transaction_month, y = total_sales, color = income)) +
  geom_line(size = 1.2) +  # Thicker lines for better visibility
  labs(title = "Monthly Ice Cream Sales by Income",
       x = "Month",
       y = "Total Sales ($)",
       color = "Income Level") +
  theme_minimal(base_size = 15) +  # Increase base font size
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +  # Format x-axis labels
  scale_color_manual(values = custom_colors) +  # Apply custom colors
  theme(axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels
        panel.grid.major = element_line(color = "grey80"),  # Add grid lines for clarity
        panel.grid.minor = element_blank())  # Remove minor grid lines

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The graph above shows that 50-74k is the largest buyers of ice cream during July.

Lastly, I wanted to know if households with kids in the 50-74k are purchasing their weight.

income_children_data <- transactions %>%
  inner_join(products, by = "product_id") %>%
  left_join(demographics, by = "household_id") %>%
  filter(!is.na(income) & !is.na(kids_count))  # Ensure no NA values in income or kids_count

# Count the number of households for each income and kids count, scaled per 1000
income_kids_count <- income_children_data %>%
  group_by(income, kids_count) %>%
  summarise(household_count = n() / 1000, .groups = "drop") %>%  # Divide by 1000
  arrange(income, kids_count)

# View the results
print(income_kids_count)

## # A tibble: 45 × 3
##    income    kids_count household_count
##    <ord>     <ord>                <dbl>
##  1 Under 15K 0                    38.4 
##  2 Under 15K 1                    12.8 
##  3 Under 15K 2                     7.43
##  4 Under 15K 3+                    5.44
##  5 15-24K    0                    40.3 
##  6 15-24K    1                    12.3 
##  7 15-24K    2                     7.53
##  8 15-24K    3+                    1.42
##  9 25-34K    0                    47.1 
## 10 25-34K    1                    15.4 
## # ℹ 35 more rows

# Plot the results as a bar graph per 1000 households
ggplot(income_kids_count, aes(x = income, y = household_count, fill = as.factor(kids_count))) +
  geom_bar(stat = "identity", position = "dodge") +  # Dodge position for side-by-side bars
  labs(title = "Household Count by Income and Number of Children",
       x = "Income Level",
       y = "Household Count (per 1000)",
       fill = "Number of Children") +
  theme_minimal(base_size = 15) +  # Increase base font size
  theme(axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels
        panel.grid.major = element_line(color = "grey80"),  # Add grid lines for clarity
        panel.grid.minor = element_blank())  # Remove minor grid lines

This shows that in every income level children in the household is not as high as they could. Especially in 50-74k range.

Household Makeup

In these next 2 graphs I want to look at the percentages of household makeup and the purchases of ice cream.

# Join datasets to get household counts
household_data <- transactions %>%
  inner_join(products, by = "product_id") %>%
  left_join(demographics, by = "household_id") %>%
  filter(!is.na(kids_count))  # Ensure no NA values in kids_count

# Calculate total households and households with children
household_summary <- household_data %>%
  summarise(
    total_households = n(),
    households_with_children = sum(kids_count > 0, na.rm = TRUE),  # Count households with children
    households_without_children = total_households - households_with_children  # Calculate households without children
  ) %>%
  mutate(
    percentage_with_children = (households_with_children / total_households) * 100,
    percentage_without_children = (households_without_children / total_households) * 100
  )

# Prepare data for plotting
plot_data <- data.frame(
  category = c("With Children", "Without Children"),
  percentage = c(household_summary$percentage_with_children, household_summary$percentage_without_children)
)

# Create bar graph with percentages inside the bars
p <- ggplot(plot_data, aes(x = category, y = percentage, fill = category)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            vjust = 1.5,  # Position text inside the bars
            size = 4,  # Reduce text size to fit
            color = "white") +  # White text to contrast with bar color
  scale_y_continuous(limits = c(0, max(plot_data$percentage) * 1.2)) +  # Increase y-axis limits for more space
  labs(title = "Percentage of Households with and without Children",
       x = "Household Category",
       y = "Percentage (%)") +
  theme_minimal(base_size = 15) +
  theme(legend.position = "none")  # Remove legend if not needed

# Display the plot
print(p)

# Calculate the total quantity purchased grouped by the number of kids
ice_cream_data <- products %>%
  filter(str_detect(product_category, regex("ICE CREAM|MILK|SHERBETS", ignore_case = TRUE))) %>%
  inner_join(transactions, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(kids_count) %>%
  summarise(quantity_purchased = sum(quantity, na.rm = TRUE)) %>%
  ungroup()

# Calculate percentages
ice_cream_data <- ice_cream_data %>%
  mutate(percentage = (quantity_purchased / sum(quantity_purchased)) * 100)

# Create a pie chart with percentages
ggplot(ice_cream_data, aes(x = "", y = quantity_purchased, fill = factor(kids_count))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            position = position_stack(vjust = 0.5),  # Center labels on the segments
            color = "white", size = 4) +  # Change the color and size of the text
  labs(title = "Ice Cream Purchases by Number of Kids in Household",
       fill = "Number of Kids") +
  theme_void() +  # Remove axes and background
  theme(plot.title = element_text(face = "bold", size = 16))

What these graphs show is that generally it is equal between households with kids and without. However it seems that households with children are under purchasing ice cream by 1 percent. Considering that kids love ice cream there might be some room to improve by more than 1 percent.

These show that households with children ARE in the minority but we want to know if households with kids are purchasing their weight.

In the next chart I wanted to show the Number of kids vs Quantity of ice cream sold.

products %>%
  filter(str_detect(product_category, regex("(ICE CREAM)"))) %>%
  inner_join(transactions, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(kids_count, income) %>%
  summarise(quantity_purchased = sum(quantity)) %>%
  ggplot(aes(x = kids_count, y = quantity_purchased)) +
  geom_boxplot(fill = "blue", color = "black") +
  labs(x = "Number of Kids in Household", y = "Quantity Purchased") +
  ggtitle("Number of Kids vs the Quantity of Ice Cream Sold",
          subtitle = "Exploring Ice Cream Sales by Household Size") +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic"))

## `summarise()` has grouped output by 'kids_count'. You can override using the
## `.groups` argument.

In the end there are 4 themes that we get out of this. 1: July 4th is the day with the most ice cream purchases. 2: 50-74k is the income range where people buy the most ice cream. 3: Premium products are the most purchased ice cream. 4: Households without kids and in the 50-74k salary range, buy the most ice cream.

Conclusion and recomendations

How I Answered the Problem To address the challenge of understanding ice cream sales dynamics, I implemented a comprehensive analytical approach. This involved examining our transaction data alongside product data, which allowed me to identify sales peaks throughout the year and reveal patterns in consumer purchasing behavior. By integrating demographic data, I could analyze the characteristics of our ice cream buyers and uncover which demographic segments contributed most significantly to sales. This multi-faceted analysis highlighted trends in purchasing behavior and provided insights into the specific consumer groups we should target, ultimately guiding our strategies to increase ice cream sales effectively.

Recommendations Based on our analysis of ice cream sales and consumer demographics, I recommend implementing a targeted marketing strategy that leverages the identified peak sales period on July 4th and focuses on key demographic groups. The strategy should emphasize the $50K to $74K income range, where we see the highest volume of ice cream purchases. To effectively engage this demographic, we should promote our premium ice cream products, as they have shown to be the most popular among consumers. This can be achieved through promotional campaigns that highlight unique flavors and limited-time offers specifically around the Fourth of July. Additionally, we must intensify our marketing efforts towards households with kids, who often purchase ice cream as a treat for their children, especially during summer celebrations. By aligning our marketing efforts with the preferences of our target demographics and the trends observed during peak purchase times, we can significantly boost ice cream sales and foster brand loyalty.

Conclusion Our analysis underscores a substantial opportunity for Regork to enhance ice cream sales, particularly on July 4th, the day with the highest ice cream purchases of the year. Households with an income between $50K and $74K emerge as the most active buyers, representing a significant share of total sales. This demographic often seeks affordable yet enjoyable premium treats to share with family and friends during festive occasions, making them a key target for our marketing initiatives. Given these insights, it is essential that Regork strategically focuses its marketing efforts on this income demographic leading up to July 4th. In summary, a focused marketing strategy targeting the $50K to $74K income demographic, families with kids, and emphasizing our premium products will enable Regork to capitalize on the increased demand for ice cream during this critical time.