Understanding customer behavior is crucial for businesses to optimize marketing strategies and improve sales performance. Retailers often struggle to effectively target the right customer segments, time promotions, and identify preferred products. By examining factors like customer demographics, spending behavior, and discount effectiveness, businesses can make more informed decisions to enhance customer engagement and drive sales.
This analysis seeks to address the question: How can businesses optimize their marketing strategies and product offerings by understanding the purchasing behavior, preferences, and responses to discounts of their target customers? The insights gained will guide better-targeted promotions, improved sales forecasting, and resource allocation.
We use the Complete Journey Dataset, consisting of three datasets:
Transactions: Sales transactions with product, price, discount, and timestamp details.
Products: Product information, including brand, category, and size.
Demographics: Household information such as age, income, and family composition.
library(ggplot2)
library(dplyr)
library(tidyverse)
library(lubridate) # year() and month(); attached automatically only by tidyverse >= 2.0.0
library(completejourney)
library(scales)
library(gridExtra)
ggplot2 – Creates visualizations to analyze sales trends and customer behavior.
dplyr – Filters, groups, and summarizes transactional data for insights.
tidyverse – Provides a cohesive set of tools for data wrangling and visualization.
completejourney – Accesses retail sales and discount data for analysis.
scales – Formats sales and discount values in dollar amounts for clarity.
gridExtra – Arranges multiple plots for easy comparison of trends.
transactions <- get_transactions()
products
## # A tibble: 92,331 × 7
## product_id manufacturer_id department brand product_category product_type
## <chr> <chr> <chr> <fct> <chr> <chr>
## 1 25671 2 GROCERY Natio… FRZN ICE ICE - CRUSH…
## 2 26081 2 MISCELLANEOUS Natio… <NA> <NA>
## 3 26093 69 PASTRY Priva… BREAD BREAD:ITALI…
## 4 26190 69 GROCERY Priva… FRUIT - SHELF S… APPLE SAUCE
## 5 26355 69 GROCERY Priva… COOKIES/CONES SPECIALTY C…
## 6 26426 69 GROCERY Priva… SPICES & EXTRAC… SPICES & SE…
## 7 26540 69 GROCERY Priva… COOKIES/CONES TRAY PACK/C…
## 8 26601 69 DRUG GM Priva… VITAMINS VITAMIN - M…
## 9 26636 69 PASTRY Priva… BREAKFAST SWEETS SW GDS: SW …
## 10 26691 16 GROCERY Priva… PNT BTR/JELLY/J… HONEY
## # ℹ 92,321 more rows
## # ℹ 1 more variable: package_size <chr>
demographics
## # A tibble: 801 × 8
## household_id age income home_ownership marital_status household_size
## <chr> <ord> <ord> <ord> <ord> <ord>
## 1 1 65+ 35-49K Homeowner Married 2
## 2 1001 45-54 50-74K Homeowner Unmarried 1
## 3 1003 35-44 25-34K <NA> Unmarried 1
## 4 1004 25-34 15-24K <NA> Unmarried 1
## 5 101 45-54 Under 15K Homeowner Married 4
## 6 1012 35-44 35-49K <NA> Married 5+
## 7 1014 45-54 15-24K <NA> Married 4
## 8 1015 45-54 50-74K Homeowner Unmarried 1
## 9 1018 45-54 35-49K Homeowner Married 5+
## 10 1020 45-54 25-34K Homeowner Married 2
## # ℹ 791 more rows
## # ℹ 2 more variables: household_comp <ord>, kids_count <ord>
merged_data <- transactions %>%
  left_join(products, by = "product_id") %>%
  left_join(demographics, by = "household_id")
merged_data <- merged_data %>%
  mutate(
    transaction_date = as.Date(transaction_timestamp),
    year = year(transaction_date),
    month = month(transaction_date, label = TRUE, abbr = FALSE), # Full month name
    weekday = weekdays(transaction_date) # Extract weekday name
  )
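Because `left_join()` keeps every transaction row, households with no demographic record end up with `NA` in the demographic columns after the merge. The sketch below uses small made-up tables (`toy_transactions` and `toy_demographics` are hypothetical stand-ins, not the real data) to show one way to measure that coverage before analyzing by age:

```r
library(dplyr)

# Toy stand-ins for the real tables (hypothetical values)
toy_transactions <- data.frame(
  household_id = c("1", "1", "2", "3"),
  sales_value  = c(2.50, 4.00, 1.25, 3.00)
)
toy_demographics <- data.frame(
  household_id = c("1", "2"),
  age          = c("45-54", "25-34")
)

# Same join pattern as the merge above
toy_merged <- toy_transactions %>%
  left_join(toy_demographics, by = "household_id")

# Share of transaction rows that carry demographic information
coverage <- mean(!is.na(toy_merged$age))

# Households present in transactions but missing from demographics
unmatched <- toy_transactions %>%
  anti_join(toy_demographics, by = "household_id") %>%
  distinct(household_id)
```

Running the same two checks on the real merged table would indicate how much of the transaction data the demographic analyses actually cover.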
# Remove rows with N/A values in the 'age' column
age_total_spending <- merged_data %>%
filter(!is.na(age)) %>%
group_by(age) %>%
summarize(total_spending = sum(sales_value, na.rm = TRUE)) %>%
arrange(desc(total_spending))
# Plot total spending by age with dollar formatting
ggplot(age_total_spending, aes(x = age, y = total_spending, fill = as.factor(age))) +
geom_bar(stat = "identity") +
scale_y_continuous(labels = scales::dollar_format()) + # Format y-axis as dollars
labs(title = "Total Spending by Age", x = "Age", y = "Total Spending ($)") +
theme_minimal() +
theme(legend.position = "none")
The first visualization highlights total spending by different age groups. By analyzing the total spending for each age group, businesses can identify the most lucrative segments. The results show that the 45-54 age group is among the highest spenders, underlining the importance of this demographic group for businesses looking to maximize revenue.
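Total spending per age group also depends on how many households fall in each group, so a large total can come from many households rather than from high spending per household. As a hedged sketch on made-up data (the `toy` table and its values are hypothetical), a per-household average can be placed next to the total:

```r
library(dplyr)

# Hypothetical merged rows: three households aged 45-54, two aged 25-34
toy <- data.frame(
  household_id = c("1", "1", "2", "5", "3", "4"),
  age          = c("45-54", "45-54", "45-54", "45-54", "25-34", "25-34"),
  sales_value  = c(10, 20, 30, 30, 40, 40)
)

age_per_household <- toy %>%
  filter(!is.na(age)) %>%
  group_by(age) %>%
  summarize(
    total_spending = sum(sales_value, na.rm = TRUE),
    households = n_distinct(household_id),
    spending_per_household = total_spending / households
  ) %>%
  arrange(desc(spending_per_household))
```

In this toy example the 45-54 group has the larger total, but the 25-34 group spends more per household, which is the kind of difference a totals-only chart cannot show.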
# Filter the data for the 45-54 age group
age_group_45_54 <- merged_data %>%
filter(age == "45-54")
# Marital status distribution for 45-54 age group (percentage), removing NA values
marital_status_distribution <- age_group_45_54 %>%
filter(!is.na(marital_status)) %>%
count(marital_status) %>%
mutate(percentage = n / sum(n) * 100)
# Pie chart for marital status distribution
p1 <- ggplot(marital_status_distribution, aes(x = "", y = percentage, fill = marital_status)) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
labs(title = "Marital Status Distribution (45-54 Age Group)", fill = "Marital Status") +
theme_void() +
theme(legend.position = "right", plot.margin = margin(10, 10, 10, 10))
# Household size distribution for 45-54 age group (percentage), removing NA values
household_size_distribution <- age_group_45_54 %>%
filter(!is.na(household_size)) %>%
count(household_size) %>%
mutate(percentage = n / sum(n) * 100)
# Pie chart for household size distribution
p2 <- ggplot(household_size_distribution, aes(x = "", y = percentage, fill = as.factor(household_size))) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
labs(title = "Household Size Distribution (45-54 Age Group)", fill = "Household Size") +
theme_void() +
theme(legend.position = "right", plot.margin = margin(10, 10, 10, 10))
# Arrange the two plots side by side with adjusted size
grid.arrange(p1, p2, ncol = 2, widths = c(1, 1))
For the 45-54 age group, we analyze the distribution of marital status and household size, which provides insight into the types of consumers within this segment. Two pie charts show the percentage of purchase records from married versus unmarried households and the distribution across household sizes. Understanding these patterns helps businesses tailor their product offerings. For example, married couples with children may have different purchasing needs than singles or married couples without children.
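The percentages in these charts come from `count()` applied to the merged transaction rows, so households that shop more often carry more weight. A minimal sketch on made-up data (the `toy` table is hypothetical) contrasts that row-level share with a household-level share in which each household counts once:

```r
library(dplyr)

# Hypothetical rows: household 1 has three transactions, household 2 has one
toy <- data.frame(
  household_id   = c("1", "1", "1", "2"),
  marital_status = c("Married", "Married", "Married", "Unmarried")
)

# Row-level share: every transaction row counts once
row_share <- toy %>%
  count(marital_status) %>%
  mutate(percentage = n / sum(n) * 100)

# Household-level share: every household counts once
household_share <- toy %>%
  distinct(household_id, marital_status) %>%
  count(marital_status) %>%
  mutate(percentage = n / sum(n) * 100)
```

Which version is appropriate depends on the question: row-level shares describe where purchases come from, while household-level shares describe who the customers are.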
# Filter the data for Married people with household size of 2
married_household_2 <- merged_data %>%
filter(marital_status == "Married" & household_size == 2)
# Calculate total sales for each product category
category_sales_married_2 <- married_household_2 %>%
group_by(product_category) %>%
summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>%
arrange(desc(total_sales))
# Select top 10 product categories
top_10_categories_married_2 <- category_sales_married_2[1:10,]
# Plot top 10 product categories for Married people with household size of 2
ggplot(top_10_categories_married_2, aes(x = reorder(product_category, total_sales), y = total_sales, fill = product_category)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Top 10 Product Categories for Married People with Household Size of 2", x = "Product Category", y = "Total Sales") +
scale_y_continuous(labels = label_dollar()) + # Format y-axis as dollar values
theme_minimal() +
theme(legend.position = "none")
Focusing on married households with a household size of 2, we identify the top 10 product categories by total sales. Note that the filter for this chart does not restrict age; the 45-54 restriction is applied in the next step. This analysis helps businesses understand the product preferences of this demographic, allowing them to target product offerings that align with its interests. For instance, product categories like “Coupon/Misc”, “Beef”, and “Soft Drinks” are among the most popular, which can inform product development and promotional efforts.
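Selecting the top categories by row position works once the table is sorted, but it silently keeps a missing category if one ranks highly. As an alternative sketch, assuming dplyr 1.0.0 or later and using made-up data (`toy_sales` is hypothetical), `slice_max()` can pick the top rows after dropping `NA` categories:

```r
library(dplyr)

# Hypothetical sales rows, including one with a missing category
toy_sales <- data.frame(
  product_category = c("BEEF", "SOFT DRINKS", NA, "CHEESE", "BEEF"),
  sales_value      = c(12, 8, 20, 5, 3)
)

top_2 <- toy_sales %>%
  filter(!is.na(product_category)) %>%   # drop uncategorized products
  group_by(product_category) %>%
  summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>%
  slice_max(total_sales, n = 2, with_ties = FALSE)
```

Here `with_ties = FALSE` guarantees exactly two rows even if two categories have equal totals.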
# Filter the data for 45-54 age group, married individuals, and household size of 2
filtered_data <- merged_data %>%
filter(age == "45-54", marital_status == "Married", household_size == 2)
# Filter for the top 3 product categories: "Coupon/Misc", "Beef", and "Soft Drinks"
top_3_categories_filtered <- filtered_data %>%
filter(product_category %in% c("COUPON/MISC ITEMS", "BEEF", "SOFT DRINKS"))
# Calculate total sales for each product type within these categories
product_type_sales <- top_3_categories_filtered %>%
group_by(product_type) %>%
summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>%
arrange(desc(total_sales))
# Remove the top product type (first row) from the top 10 list
top_9_product_types <- product_type_sales[2:10,]
# Plot the remaining top 9 product types from the top 3 categories for the specified demographic group
ggplot(top_9_product_types, aes(x = reorder(product_type, total_sales), y = total_sales, fill = product_type)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Top 9 Product Types from Top 3 Categories (45-54 Age, Married, Household Size 2)",
x = "Product Type", y = "Total Sales") +
scale_y_continuous(labels = label_dollar()) + # Format y-axis as dollar values
theme_minimal() +
theme(legend.position = "none")
We delve deeper into the top product categories and rank the product types within “Coupon/Misc”, “Beef”, and “Soft Drinks.” The chart shows the product types ranked second through tenth by total sales; the top-ranked product type is removed in the code before plotting. This reveals more granular preferences within each category and allows businesses to optimize inventory and promotions by focusing on the best-selling product types.
# Calculate total discount (sum of all discount columns) for each transaction
age_group_45_54 <- age_group_45_54 %>%
mutate(total_discount = retail_disc + coupon_disc + coupon_match_disc)
# Summarize average discount and average sales for the age group
discount_impact_45_54 <- age_group_45_54 %>%
group_by(product_category) %>%
summarize(avg_discount = mean(total_discount, na.rm = TRUE),
avg_sales = mean(sales_value, na.rm = TRUE))
# Scatter plot to visualize the relationship between average discount and average sales
ggplot(discount_impact_45_54, aes(x = avg_discount, y = avg_sales)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", color = "red") +
  labs(title = "Impact of Discounts (45-54 Age Group)",
       x = "Average Discount", y = "Average Sales Value") +
scale_x_continuous(labels = scales::dollar_format()) + # Format x-axis (average discount) as dollars
scale_y_continuous(labels = scales::dollar_format()) + # Format y-axis (average sales) as dollars
theme_minimal()
For the 45-54 age group, the scatter plot compares each product category's average discount with its average sales value and shows no strong correlation between the two. There is no clear upward trend, so categories with higher average discounts do not consistently have higher average sales. This suggests that while discounts may still attract some buyers, they are not the primary driver of sales for this demographic. Because the comparison is made across product categories rather than for individual purchases, it shows association rather than cause. Businesses looking to boost sales among this group should consider alternative strategies, such as personalized promotions, product bundling, or enhancing product value, rather than relying heavily on discounting.
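A visual reading of the scatter plot can be backed by a number. The sketch below uses made-up category-level averages (`toy_discount_impact` and its values are hypothetical, not results from the dataset) to compute the Pearson correlation and the R-squared of the same linear fit that `geom_smooth(method = "lm")` draws:

```r
# Hypothetical category-level averages (illustrative numbers only)
toy_discount_impact <- data.frame(
  avg_discount = c(0.10, 0.25, 0.40, 0.55, 0.70),
  avg_sales    = c(3.2, 2.1, 3.9, 2.6, 3.4)
)

# Pearson correlation between average discount and average sales value
r <- cor(toy_discount_impact$avg_discount, toy_discount_impact$avg_sales)

# Linear model matching the fitted line in the plot
fit <- lm(avg_sales ~ avg_discount, data = toy_discount_impact)
r_squared <- summary(fit)$r.squared
```

For a single predictor, `r_squared` equals `r^2`, so either value can be reported alongside the plot; applying the same calls to `discount_impact_45_54` would quantify the relationship described above.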
# Count the purchases by weekday and month for the 45-54 age group
weekday_sales <- age_group_45_54 %>%
count(weekday) %>%
arrange(desc(n))
month_sales <- age_group_45_54 %>%
count(month) %>%
arrange(desc(n))
# Plot weekday sales distribution (bar chart)
weekday_plot <- ggplot(weekday_sales, aes(x = reorder(weekday, -n), y = n, fill = weekday)) +
geom_bar(stat = "identity") +
labs(title = "Weekday Purchase Distribution (45-54 Age Group)", x = "Weekday", y = "Number of Purchases") +
theme_minimal() +
theme(legend.position = "none")
# Plot month sales distribution (bar chart)
month_plot <- ggplot(month_sales, aes(x = reorder(month, -n), y = n, fill = month)) +
geom_bar(stat = "identity") +
labs(title = "Monthly Purchase Distribution (45-54 Age Group)", x = "Month", y = "Number of Purchases") +
theme_minimal() +
theme(legend.position = "none")
# Combine the two plots side by side
grid.arrange(weekday_plot, month_plot, ncol = 2)
The weekday plot shows that the 45-54 age group makes the most purchases on Sundays, Saturdays, and Fridays. This suggests that weekends, particularly Sunday and Saturday, are crucial for driving sales within this demographic. These shoppers probably buy more when they have leisure time to browse and make purchasing decisions, and Fridays may see a spike as consumers prepare for the weekend. For businesses, focusing marketing efforts and promotions on these days could help maximize sales.
The month plot shows that December, August, and October have the most purchases for this age group. December, a peak holiday shopping month, sees a natural increase in purchases due to seasonal promotions and gift buying. August could be linked to back-to-school shopping or end-of-summer sales, while October likely ties into fall promotions and the lead-up to holiday shopping. Businesses should tailor their campaigns and inventory planning to these months to capitalize on the higher purchasing periods.
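The weekday chart orders its bars by purchase count. When a calendar order is easier to read, the weekday can be stored as a factor with explicit levels. The sketch below is one possible approach on made-up dates (`toy_dates` is hypothetical); it derives the weekday from `as.POSIXlt()` so the result does not depend on the system locale, unlike `weekdays()`:

```r
# Hypothetical transaction dates (2017-01-01 was a Sunday)
toy_dates <- as.Date(c("2017-01-01", "2017-01-02", "2017-01-07", "2017-01-08"))

# $wday is 0 for Sunday through 6 for Saturday, independent of locale
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
wday_index <- as.POSIXlt(toy_dates)$wday

# Factor levels fix the display order to Monday through Sunday
weekday <- factor(
  day_names[wday_index + 1],
  levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")
)

weekday_counts <- table(weekday)
```

With a factor like this on the x-axis, ggplot2 keeps the level order without a call to `reorder()`.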
This analysis examines the purchasing behavior of households in the 45-54 age group, with a closer look at those that are married and have a household size of 2. The 45-54 age group was among the highest spenders, making it a prime target for marketing efforts. The analysis draws on total spending by age, marital status, household size, and product category preferences. It identified “Coupon/Misc Items,” “Beef,” and “Soft Drinks” as the top three product categories, each with significant sales volume. The relationship between discounts and sales was also explored: there was no strong correlation between the two, suggesting that discounts alone are not a significant driver of purchases in this demographic. Finally, purchase counts by day of the week and by month show that specific days and months see higher purchase volumes, which matters for timing marketing campaigns and managing inventory. Together, these findings describe which products this demographic group buys, how discounts relate to its purchases, and when it is most likely to buy.
Based on these insights, several strategic recommendations can be made. First, businesses should prioritize and tailor their offerings in the top-selling product categories, “Coupon/Misc Items,” “Beef,” and “Soft Drinks,” to meet the preferences of this target demographic. Expanding product variety within these categories or introducing complementary products could drive additional sales. Second, rather than relying on blanket discounts, businesses should focus on targeted promotions such as loyalty rewards, exclusive offers, or product bundling to encourage higher spending. Special promotional events or time-limited offers could further boost sales in these categories. Third, knowing when purchases peak by weekday and month allows businesses to align advertising, stock levels, and promotions with those periods, supporting both sales volume and customer satisfaction. In sum, businesses should use these insights to refine their marketing strategies, optimize product offerings, and design promotions that are data-driven, well timed, and targeted to maximize the lifetime value of this key demographic.