Midterm Project - BANA 4080

INTRODUCTION

As part of the Regork team, our main goal is to find ways for the company to grow, specifically by increasing sales of packaged candy between major holidays. Our grocery chain sells a lot of candy around events like Halloween and Christmas, but sales slow noticeably between these occasions. The CEO should care about this gap because it represents revenue we are not currently capturing. If we strategically boost candy sales during non-holiday periods, we can increase profit and make our customers happier.

To solve this challenge, we analyzed Regork’s past transaction data from the completejourney package, joined with product, household demographic, and coupon redemption data. We compared packaged candy sales over the year and across household groups defined by number of kids and income, looking for patterns in who buys candy and when. We used ggplot to visualize current sales patterns and the household characteristics associated with them.

Our goal is to give the CEO useful insights for a plan to increase candy sales between major holidays. We suggest special promotions timed to specific periods and aimed at specific customer groups, placing candy in the store where customers are more likely to buy it, and exploring partnerships or events to offer exclusive candy options. By using these data-driven ideas, we can take advantage of opportunities to make more money and keep our customers happy all year round. This approach matches what our company aims for: to keep growing and keep our customers satisfied.

PACKAGES REQUIRED

The following R packages are required in order to run the code in this R project:

library(tidyverse) # packages to perform data manipulation and visualization
library(completejourney) # relational data containing household transaction information
library(lubridate) # package to handle date manipulation

DATA TIDY

Data Sets Used

The four data sets which were used for this analysis:

Transactions: household_id, store_id, basket_id, product_id, quantity, sales_value, retail_disc, coupon_disc, coupon_match_disc, week, transaction_timestamp

Products: product_id, manufacturer_id, department, brand, product_category, product_type, package_size

Demographics: household_id, age, income, home_ownership, marital_status, household_size, household_comp, kids_count

Coupon_Redemptions: household_id, coupon_upc, campaign_id, redemption_date

Data Manipulation

The following R code loads the full transaction data with get_transactions() from the completejourney package and stores it in a data frame named ‘transactions’. This data frame is the starting point for all further analysis and processing.

transactions <- get_transactions()

Candy Transactions: This combines transaction data with product details and demographic information to focus on packaged candy. We join transactions to products on product_id, keep rows whose product_category matches "candy - packaged" (a case-insensitive regular expression), and then join the result to demographics on household_id. Because both joins are inner joins, purchases by households without a demographic record are excluded from the result.

candyData <- transactions %>%
  inner_join(products, by = "product_id") %>%
  filter(str_detect(product_category, regex("candy - packaged", ignore_case = TRUE))) %>%
  inner_join(demographics, by = "household_id")
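Since the inner join to demographics silently drops households that have no demographic record, it can be worth checking how much candy revenue falls out of the analysis. The sketch below shows one way to do that with anti_join(). The toy_* tables and their values are made up for illustration and are not Regork data; with the real data, the same steps would run on transactions, products, and demographics.

```r
library(dplyr)
library(stringr)

# Toy stand-ins for the completejourney tables (made-up values, illustration only)
toy_transactions <- tibble(
  household_id = c("1", "1", "2", "3"),
  product_id   = c("10", "11", "10", "12"),
  sales_value  = c(2.50, 1.00, 2.50, 3.25)
)
toy_products <- tibble(
  product_id       = c("10", "11", "12"),
  product_category = c("CANDY - PACKAGED", "SOFT DRINKS", "CANDY - PACKAGED")
)
toy_demographics <- tibble(
  household_id = c("1", "2"),
  kids_count   = c("0", "1")
)

# Same join-and-filter pattern as candyData, stopping before the demographics join
toy_candy <- toy_transactions %>%
  inner_join(toy_products, by = "product_id") %>%
  filter(str_detect(product_category, regex("candy - packaged", ignore_case = TRUE)))

# Candy rows whose household has no demographic record;
# an inner join to demographics would drop exactly these rows
toy_unmatched <- toy_candy %>%
  anti_join(toy_demographics, by = "household_id")

# Share of candy sales dollars that the demographics join would discard
toy_share_dropped <- sum(toy_unmatched$sales_value) / sum(toy_candy$sales_value)
```

A large dropped share would mean the demographic breakdowns describe only part of Regork’s candy customers.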

Coupon Data: This combines coupon redemption data with demographic information to show how coupon use varies across demographic groups. We use an inner join between coupon_redemptions and demographics on household_id. Note that these redemptions cover all coupons, not only candy coupons.

couponData <- coupon_redemptions %>%
  inner_join(demographics, by = "household_id")

Product Categories: We join transactions, products, and demographics to determine how frequently items in each product category are bought by households with varying numbers of kids. The result is summarized by grouping on product category and kids count and counting the transaction rows in each group.

productCatergories <- transactions %>%
  inner_join(products, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(product_category, kids_count) %>%
  summarize(timesBought = n())

EXPLORATORY DATA ANALYSIS

Distribution of candy sales by Kid Count

The visualization of candy sales shows a clear pattern: households without children consistently lead in packaged candy purchases during holiday periods (April, October, and rising steadily through December). Even in these households, candy sales surge around Easter, Halloween, and Christmas. This suggests that individuals and couples without children actively take part in buying candy during celebratory times.

The data not only reveals the diverse nature of consumer preferences but also presents a strategic opportunity for retailers and marketers. By recognizing the notable spike in candy sales among households without children during holidays, businesses can tailor their marketing strategies to specifically target this demographic during seasonal promotions.

candyData %>%
  # collapse each timestamp to its calendar date
  mutate(date = date(transaction_timestamp)) %>%
  group_by(date, kids_count) %>%
  # total candy sales per day within each kids_count group
  summarize(sales = sum(sales_value), .groups = "drop") %>%
  ggplot(aes(x = date, y = sales)) +
    geom_point(color = 'darkblue') +
    facet_wrap(~kids_count) +
    ggtitle("Sales of candy throughout the year broken down by number of kids in the household") +
    scale_x_date("Month (2017)", date_breaks = "1 month", date_labels = "%b") +
    scale_y_continuous("Total sales of candy each day", labels = scales::dollar) +
    theme(panel.spacing.x = unit(0.5, "cm"))
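Daily points make the holiday peaks visible, but the between-holiday slowdown is easier to quantify on monthly totals. The following is a minimal sketch of that aggregation using floor_date() from lubridate. The toy_candy table and the share_of_peak column are illustrative assumptions with made-up values; with the real data, candyData would take the place of toy_candy.

```r
library(dplyr)
library(lubridate)

# Toy candy transactions (made-up values, illustration only)
toy_candy <- tibble(
  transaction_timestamp = as.POSIXct(
    c("2017-02-10 09:00:00", "2017-02-20 17:30:00",
      "2017-07-04 12:00:00", "2017-10-30 18:45:00", "2017-10-31 08:15:00"),
    tz = "UTC"
  ),
  sales_value = c(3.00, 2.00, 1.50, 6.00, 4.00)
)

toy_monthly <- toy_candy %>%
  # round every timestamp down to the first day of its month
  mutate(month = floor_date(as_date(transaction_timestamp), unit = "month")) %>%
  group_by(month) %>%
  summarize(sales = sum(sales_value), .groups = "drop") %>%
  # express each month relative to the best month to highlight slow periods
  mutate(share_of_peak = sales / max(sales)) %>%
  arrange(month)
```

Months with a low share_of_peak are the candidates for the non-holiday promotions discussed in the introduction.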

Total Sales by Kids Count

The bar plot of total packaged candy sales by number of kids in the household shows that households with zero children (0 kids) have the highest total sales, which points to strong candy purchasing among individuals or couples without children. As the number of kids increases, total sales decline, suggesting that households with more children spend less in total on packaged candy. This inverse relationship may be influenced by factors such as budget constraints, dietary considerations, or alternative snack preferences within larger families. Because these are totals rather than per-household averages, the result also depends on how many households fall into each group.

candyData %>%
  group_by(kids_count) %>%
  summarize(sales = sum(sales_value)) %>%
  ggplot(aes(x = kids_count, y = sales)) +
    geom_bar(stat = 'identity', fill = 'darkblue') +
    ggtitle("Total sales in household by the number of kids") +
    scale_x_discrete("Number of Kids") +
    scale_y_continuous("Total sales", labels = scales::dollar)
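Total sales mix two things: how many households are in each group and how much each household spends. One way to separate them is to divide total sales by the number of distinct households in each group. The sketch below does this on a made-up toy_candy table (illustration only, not Regork results); the column names total_sales, households, and sales_per_hh are our own. In the real demographics table kids_count is an ordered factor rather than a character column, but the grouping works the same way.

```r
library(dplyr)

# Toy candy purchases (made-up values, illustration only)
toy_candy <- tibble(
  household_id = c("1", "1", "2", "3", "4"),
  kids_count   = c("0", "0", "0", "0", "2"),
  sales_value  = c(2.00, 1.00, 3.00, 3.00, 6.00)
)

toy_per_household <- toy_candy %>%
  group_by(kids_count) %>%
  summarize(
    total_sales  = sum(sales_value),
    # counts only households with at least one candy purchase
    households   = n_distinct(household_id),
    sales_per_hh = total_sales / households,
    .groups = "drop"
  )
```

In this toy example the zero-kid group has the larger total but the smaller per-household figure, which is the kind of reversal the real data should be checked for before targeting a group.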

Total Coupon Redemptions by Kids Count

Coupon redemption and candy sales data together show patterns that can inform targeted marketing strategies. The first plot, coupon redemptions by number of kids, suggests a relationship between the number of kids in a household and its propensity to use coupons.

In the second plot, which shows total candy sales by kids count and income, households in the 35k to 99k income range account for the most candy sales, so marketing efforts should reflect the preferences of consumers in this range. The third plot, which breaks coupon redemptions down by kids count and income, points the same way as the first: households with zero kids record both the most coupon redemptions and the highest total candy sales. This suggests coupon offers are a promising way to reach this group, with the caveat that the redemption data covers all coupons, not only candy coupons.

couponData %>%
  group_by(kids_count) %>%
  summarize(totalRedemptions = n()) %>%
  ggplot(aes(x = kids_count, y = totalRedemptions)) +
    geom_bar(stat = 'identity', fill = 'darkgreen') +
    ggtitle("Total number of coupons used in household by the number of kids") +
    scale_x_discrete("Number of Kids") +
    scale_y_continuous("Total redemptions")

candyData %>%
  group_by(kids_count, income) %>%
  summarize(sales = sum(sales_value)) %>%
  ggplot(aes(x = kids_count, y = sales)) +
    geom_bar(stat = 'identity', fill = 'darkblue') +
    facet_wrap(~income) + 
    ggtitle("Total sales in household by the number of kids and income") +
    scale_x_discrete("Number of Kids") +
    scale_y_continuous("Total sales", labels = scales::dollar)

couponData %>%
  group_by(kids_count, income) %>%
  summarize(totalRedemptions = n()) %>%
  ggplot(aes(x = kids_count, y = totalRedemptions)) +
    geom_bar(stat = 'identity', fill = 'darkgreen') +
    facet_wrap(~income) + 
    ggtitle("Total number of coupons used in household by the number of kids and income") +
    scale_x_discrete("Number of Kids") +
    scale_y_continuous("Total redemptions")
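Redemption totals, like sales totals, grow with the number of households in a group. A rough way to compare propensity to use coupons is redemptions per household, using the demographics table as the denominator so that households with no redemptions still count. The sketch below assumes made-up toy_demographics and toy_redemptions tables (illustration only); the column names redemptions and redemptions_per_hh are our own.

```r
library(dplyr)

# Toy stand-ins for demographics and coupon_redemptions (made-up values)
toy_demographics <- tibble(
  household_id = c("1", "2", "3", "4"),
  kids_count   = c("0", "0", "0", "1")
)
toy_redemptions <- tibble(
  household_id = c("1", "1", "2", "4", "4", "4"),
  coupon_upc   = c("a", "b", "a", "a", "b", "c")
)

toy_rate <- toy_demographics %>%
  # left join keeps households that never redeemed a coupon
  left_join(count(toy_redemptions, household_id, name = "redemptions"),
            by = "household_id") %>%
  # households with no match get zero redemptions instead of NA
  mutate(redemptions = coalesce(redemptions, 0L)) %>%
  group_by(kids_count) %>%
  summarize(
    households         = n(),
    total_redemptions  = sum(redemptions),
    redemptions_per_hh = total_redemptions / households,
    .groups = "drop"
  )
```

Here both groups have the same total, yet the per-household rate differs, so totals alone would hide the difference in coupon use.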

Top 3 Product Categories by Kids Count

The top product categories by number of kids show what each household group buys most often. In households without children, the most frequently purchased categories are baked bread, milk products, and soft drinks, which account for the highest number of transactions. This suggests that households without kids prioritize essentials like bread and milk while also favoring soft drinks. Recognizing these popular choices can help retailers and marketers tailor their strategies to the needs and preferences of households without kids, for example by pairing candy promotions with the items these households already buy.

top_3_types <- productCatergories %>%
  group_by(kids_count) %>%
  top_n(3, timesBought) %>%
  arrange(kids_count)

top_3_types %>%
  ggplot(aes(x = kids_count, y = timesBought, fill = product_category)) +
    geom_bar(position = 'dodge', stat = 'identity') +
    ggtitle("Top three product categories households are buying by number of kids") +
    scale_x_discrete("Number of Kids") +
    scale_y_continuous("Number of transactions")
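In current dplyr, top_n() is superseded by slice_max(), which also returns the rows of each group in descending order and so makes it easy to attach a rank. The following sketch shows the same top-three selection on a made-up toy_counts table (illustration only); the category labels and the rank column are hypothetical.

```r
library(dplyr)

# Toy category counts per kids_count group (made-up values, illustration only)
toy_counts <- tibble(
  kids_count       = c("0", "0", "0", "0", "1", "1", "1", "1"),
  product_category = c("A", "B", "C", "D", "A", "B", "C", "D"),
  timesBought      = c(50, 40, 30, 10, 5, 20, 15, 12)
)

toy_top3 <- toy_counts %>%
  group_by(kids_count) %>%
  # keep the three largest counts in each group (ties are kept by default)
  slice_max(timesBought, n = 3) %>%
  # rows arrive sorted from largest to smallest, so row_number() is the rank
  mutate(rank = row_number()) %>%
  ungroup()
```

With the real data, productCatergories would replace toy_counts, and the rank column could be used to label or order the bars.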

SUMMARY

The primary challenge faced by the Regork team centers around a noticeable slowdown in candy sales during periods between major holidays, despite robust sales during events like Halloween and Christmas. Recognizing this as an opportunity to increase profits, the team’s overarching goal is to strategically boost candy sales during non-holiday times, ensuring sustained growth for the company.

To address this challenge, the Regork team analyzed past transaction data joined with product, household demographic, and coupon redemption data. The analysis compared packaged candy sales over the year and across household groups defined by number of kids and income. Visualization with ggplot made it possible to observe current sales patterns and to identify household characteristics associated with candy sales, both around and between major holidays.

One intriguing insight derived from the analysis is the consistent lead in packaged candy purchases by households without children during holiday periods, indicating unexpected consumer behavior. Additionally, the examination of total sales by kids count revealed that households with zero children exhibited the highest sales. This suggests a potential market inclination among individuals or couples without children, highlighting the importance of understanding diverse consumer preferences.

The implications of these insights are significant, proposing targeted strategies to tap into the market potential of households without children, especially during holidays. Recommendations include implementing special promotions, strategic product placement, and exploring partnerships or events to capitalize on this market segment and increase candy sales between major holidays.

However, the analysis has its limitations, such as potential oversights in external factors influencing sales and the static nature of the analysis. To enhance accuracy and provide more dynamic insights, future improvements could involve incorporating real-time data, conducting customer surveys, and employing advanced machine learning models for predictive analysis. Additionally, exploring customer feedback and preferences could offer a deeper understanding of consumer behavior, contributing to more informed strategic decisions.