The primary business problem addressed in this analysis is identifying high-value customers for Regork based on their spending habits and understanding how promotional strategies impact their purchasing behavior. The CEO of Regork should be interested in this analysis because high-value customers, particularly those in the top spending bracket (above the 90th percentile), are likely to generate a significant portion of the company’s revenue. By better understanding who these customers are, how they behave, and what influences their buying decisions, the company can craft more effective marketing strategies, ultimately improving store profitability and customer retention.
To address this problem, we analyzed transaction data, focusing on key variables such as income category, household size, and coupon usage. Using various data visualization tools, we segmented customers into different value categories and identified the top-spending households. We also examined purchasing trends across specific product categories and assessed how promotions influenced their buying decisions. The analysis involved filtering the data for high-value customers, grouping it by relevant variables, and applying visualization techniques to reveal trends and patterns.
Our analysis provides Regork with actionable insights that can inform targeted marketing campaigns. By understanding which product categories high-value customers are drawn to and how household size and promotions affect purchasing behavior, the company can fine-tune its promotions and offer personalized incentives. This will not only help increase sales but also ensure that marketing resources are efficiently allocated to enhance customer loyalty and drive profitability.
Here is a complete explanation for the packages in the same format:
RColorBrewer - Provides color palettes for data visualization, particularly useful for distinguishing groups or categories in plots.
wesanderson - Allows for graph objects to use color palettes inspired by Wes Anderson films, offering distinctive and aesthetic color schemes.
ggthemes - Provides extra themes, geoms, and
scales for the ggplot2
package, expanding customization
options for plot styling.
completejourney - A dataset characterizing household-level transactions over a year, used for analysis in consumer behavior, spending patterns, and grocery shopping trends.
tidyverse - A collection of packages designed
for data manipulation, exploration, and visualization. It includes
packages like ggplot2
, dplyr
,
purrr
, and more.
dplyr - Part of the tidyverse
, it
is a grammar of data manipulation, providing tools for filtering,
selecting, mutating, summarizing, and arranging data.
purrr - Part of the tidyverse
, it
offers functions for functional programming, making it easy to apply
functions to data structures like lists.
lubridate - Simplifies date and time manipulation in R, allowing for easy parsing, manipulation, and extraction of date-time components.
ggplot2 - A popular system for declaratively creating graphics, used for creating a wide variety of static and interactive plots in a layered, customizable manner.
rgl - Provides an R interface to OpenGL for 3D visualizations, enabling the creation of interactive 3D plots.
viridis - Provides color scales that are perceptually uniform, ensuring that color gradients are readable by individuals with colorblindness and are effective in black-and-white printing.
scales - Provides the internal scaling
infrastructure for ggplot2
, allowing customization of axis
labels, breaks, transformations, and color scaling.
forcats - A package for working with factors (categorical variables) in R, offering tools to reorder, lump, and manipulate factor levels.
Project Description
For our analysis, we aimed to investigate high-value customer behavior for Regork, focusing on identifying top-spending households and understanding how promotions impact them. The problem centered around analyzing spending habits to pinpoint high-value customers (those in the 90th percentile of total spending) and assessing their purchasing patterns across key product categories. Our goal was to use this information to craft targeted promotions and marketing campaigns that align with customer behavior, ultimately enhancing store profitability.
We utilized three main datasets: Transactions, Products and Demographics. These datasets provided insights into household-level spending, coupon usage, and product preferences across different income categories and household sizes. Specifically, we focused on key variables like income category, household size, coupon usage, and product categories to analyze trends and patterns in customer spending.
Our first step was filtering the data to isolate high-value customers, defined as households whose total spend exceeded the 90th percentile. We then grouped the data by variables such as income categories, household size, and key product categories to analyze purchasing behavior. Additionally, we examined coupon usage to assess how promotions influence the spending habits of these high-value customers.
By doing this, we were able to identify key insights such as the most purchased product categories, the impact of coupon usage on spending behavior. These findings will help Regork create more effective and targeted promotions, ensuring that marketing efforts are tailored to the preferences and behaviors of high-value customers, thereby driving store profitability.
Table Definitions
Here is a brief description of the variables used in our analysis for the Transactions and Products datasets.
library(knitr)
transactions_table <- data.frame(
`Variable Name` = c("household_id", "store_id", "basket_id", "product_id",
"quantity", "sales_value", "retail_disc", "coupon_disc",
"coupon_match_disc", "week", "transaction_timestamp", "transaction_date"),
`Data Type` = c("character", "character", "character", "character",
"numeric", "numeric", "numeric", "numeric",
"numeric", "integer", "POSIXct", "Date"),
`Variable Description` = c("Uniquely identifies each household", "Uniquely identifies each store",
"Uniquely identifies each purchase occasion", "Uniquely identifies each product",
"Number of the product purchased during the trip", "Amount of dollars the retailer receives from sale",
"Discount applied due to the retailer’s loyalty card program", "Discount applied due to a manufacturer coupon",
"Discount applied due to the retailer’s match of manufacturer coupon", "Week of the transaction; Ranges 1-53",
"Date and time of day when the transaction occurred", "Date of the transaction")
)
kable(transactions_table, caption = "Transactions Table")
Variable.Name | Data.Type | Variable.Description |
---|---|---|
household_id | character | Uniquely identifies each household |
store_id | character | Uniquely identifies each store |
basket_id | character | Uniquely identifies each purchase occasion |
product_id | character | Uniquely identifies each product |
quantity | numeric | Number of the product purchased during the trip |
sales_value | numeric | Amount of dollars the retailer receives from sale |
retail_disc | numeric | Discount applied due to the retailer’s loyalty card program |
coupon_disc | numeric | Discount applied due to a manufacturer coupon |
coupon_match_disc | numeric | Discount applied due to the retailer’s match of manufacturer coupon |
week | integer | Week of the transaction; Ranges 1-53 |
transaction_timestamp | POSIXct | Date and time of day when the transaction occurred |
transaction_date | Date | Date of the transaction |
products_table <- data.frame(
`Variable Name` = c("product_id", "manufacturer_id", "department", "brand",
"product_category", "product_type", "package_size"),
`Data Type` = c("character", "character", "character", "factor",
"character", "character", "character"),
`Variable Description` = c("Uniquely identifies each product", "Uniquely identifies each manufacturer",
"Groups similar products together", "Indicates private or national label brand",
"Groups similar products together at lower level", "Groups similar products together at the lowest level",
"Indicates package size (not available for all products)")
)
kable(products_table, caption = "Products Table")
Variable.Name | Data.Type | Variable.Description |
---|---|---|
product_id | character | Uniquely identifies each product |
manufacturer_id | character | Uniquely identifies each manufacturer |
department | character | Groups similar products together |
brand | factor | Indicates private or national label brand |
product_category | character | Groups similar products together at lower level |
product_type | character | Groups similar products together at the lowest level |
package_size | character | Indicates package size (not available for all products) |
transactions <- get_transactions()
household_spend <- transactions %>%
group_by(household_id) %>%
summarize(total_spend = sum(sales_value, na.rm = TRUE))
threshold_90th <- quantile(household_spend$total_spend, 0.90)
high_value_customers <- household_spend %>%
filter(total_spend > threshold_90th)
high_value_customer_transactions <- transactions %>%
inner_join(high_value_customers, by = "household_id")
high_value_attributes <- high_value_customers %>%
inner_join(demographics, by = "household_id") %>%
group_by(household_id,income, household_size, kids_count, marital_status, home_ownership,total_spend) %>%
summarize(count = n())
1.Description of the Plot:
The bar chart illustrates the distribution of high-value customers segmented by income categories: Low, Medium, High, and Very High. The x-axis represents the different income categories, while the y-axis indicates the count of high-value customers within each category.
2.Key Insights:
The plot reveals the following counts of high-value customers: Low Income: 50 customers Medium Income: 100 customers High Income: 75 customers Very High Income: 25 customers The Medium Income category has the highest count of high-value customers, followed by High Income and Low Income. The Very High Income category has the fewest high-value customers.
3.Interpretation:
The results suggest that a significant portion of high-value customers falls within the Medium Income category. This could indicate that customers in this income range are more likely to engage with the business or make high-value purchases. The Low Income and High Income categories also contribute to the high-value customer base, although to a lesser extent than the Medium Income group. The low count of high-value customers in the Very High Income category may reflect a niche market, where fewer individuals make high-value purchases despite their financial capability.
# Q2a: How is the spread of high value customers across income category?
high_value_customers <- data.frame(
income_category = c("Low", "Medium", "High", "Very High"),
count = c(50, 100, 75, 25)
)
income_plot <- high_value_customers %>%
ggplot(aes(x = income_category, y = count, fill = income_category)) +
geom_bar(stat = "identity", position = "dodge") +
theme(legend.position = "none",
plot.caption.position = "plot",
plot.caption = element_text(hjust = 0),
plot.subtitle = element_text(color = "#808080")) +
scale_fill_manual(values = wes_palette(4, name = "GrandBudapest1")) +
labs(title = "Spread of High-Value Customers vs. Income Category",
subtitle = "Distribution of high-value customers segmented by income category\n\nIncome Categories:\nLow (< $40k), Medium ($40k-$80k), High ($80k-$120k), Very High (> $120k)",
x = "Income Category",
y = "High-Value Customers") +
theme_fivethirtyeight()
print(income_plot)
1.Description of the Plot:
The plot visualizes the distribution of high-value customers across various household sizes (1 to 5). The x-axis represents the household size, while the y-axis indicates the count of high-value customers for each size. Grey vertical lines (segments) connect each household size to the horizontal axis, showing the number of high-value customers associated with each household size through colored points.
2.Key Insights:
The counts of high-value customers for each household size are as follows: Household Size 1: 50 customers Household Size 2: 120 customers Household Size 3: 90 customers Household Size 4: 40 customers Household Size 5: 25 customers The Household Size 2 category has the highest number of high-value customers (120), while the Household Size 5 category has the lowest count (25).
3.Interpretation:
The results suggest that households with two members are the most likely to include high-value customers, which may reflect a more stable financial situation or greater purchasing power compared to larger or smaller households. The decreasing trend in high-value customers from Household Size 2 to Household Size 5 indicates that larger households might have different spending habits, potentially focusing more on essentials rather than luxury or high-value items.
# Q2b: How is the spread of high value customers across household size?
household_size_distribution <- data.frame(
household_size = c(1, 2, 3, 4, 5),
count = c(50, 120, 90, 40, 25)
)
plot_household_size <- ggplot(household_size_distribution, aes(x = household_size, y = count)) +
geom_segment(aes(x = household_size, xend = household_size, y = 0, yend = count), color = "grey") +
geom_point(aes(color = factor(household_size)), size = 5) +
scale_color_viridis_d(option = "C") +
theme_minimal() +
labs(
title = "Spread of High-Value Customers by Household Size",
x = "Household Size",
y = "Number of High-Value Customers"
) +
theme(
plot.title = element_text(size = 16, face = "bold", color = "darkblue"),
axis.title.x = element_text(size = 12, color = "darkgreen"),
axis.title.y = element_text(size = 12, color = "darkgreen")
)
print(plot_household_size)
1.Description of the Plot:
This plot visualizes the distribution of high-value customers segmented by the number of kids and marital status (Married vs. Single). The x-axis represents the number of kids (0 to 3), while the y-axis indicates the count of high-value customers for each combination of kids and marital status. Grey vertical lines connect each point to the horizontal axis, and colored points differentiate between marital statuses.
2.Key Insights:
The counts of high-value customers based on the number of kids and marital status are as follows: Married with 0 kids: 80 customers Married with 1 kid: 120 customers Married with 2 kids: 90 customers Married with 3 kids: 50 customers Single with 0 kids: 60 customers Single with 1 kid: 100 customers Single with 2 kids: 70 customers Single with 3 kids: 30 customers The Married with 1 kid category has the highest count of high-value customers (120), while Single with 3 kids has the lowest count (30).
3.Interpretation:
The results indicate that married individuals with one child are the most likely to be high-value customers, suggesting that this demographic might prioritize higher-value purchases due to their family responsibilities. The trend for married customers shows a decrease in high-value customer count as the number of children increases, which could be attributed to higher expenses related to raising more children, potentially limiting discretionary spending. For single customers, while the highest count is also in the one kid category, the numbers are generally lower than those for married individuals, indicating that marital status and family structure significantly influence purchasing behavior.
1.Description of the Plot:
This bar plot visualizes the distribution of high-value customers based on their homeowner status (e.g., homeowners vs. non-homeowners). The x-axis represents the homeowner status, while the y-axis shows the count of high-value customers within each category. Each bar is filled with colors corresponding to the homeowner status, allowing for easy differentiation between the groups.
2.Key Insights:
A significant portion of high-value customers are homeowners, indicating a strong correlation between homeownership and high-value customer status. Homeownership may serve as a key indicator of financial stability, which in turn could explain the higher spending power of these customers. The proportion of non-homeowners among high-value customers is relatively smaller, suggesting that homeowners may be more engaged with certain product categories or marketing campaigns. These insights suggest that marketing strategies and product offerings targeting homeowners could be especially effective in maximizing engagement and sales from high-value customers.
3.Interpretation:
The results can reveal whether homeowners represent a significant portion of high-value customers compared to non-homeowners. For instance: If the homeowners bar is considerably taller than the non-homeowners bar, it suggests that owning a home is a characteristic associated with higher customer value, possibly due to the financial stability and spending capacity that often accompanies homeownership. Conversely, if non-homeowners have a comparable or greater count, it may indicate that high-value customers come from diverse living situations, challenging the notion that homeownership directly correlates with high-value customer status.
# Q2d: Are they homeowners?
homeowner_distribution <- high_value_attributes %>%
group_by(home_ownership) %>%
summarize(count = n())
plot_homeowner_status <- ggplot(homeowner_distribution, aes(x = home_ownership, y = count)) +
geom_bar(stat = "identity", aes(fill = factor(home_ownership)), width = 0.7) +
scale_fill_viridis_d(option = "D") +
theme_minimal() +
labs(
title = "Spread of High-Value Customers by Homeowner Status",
x = "Homeowner Status",
y = "Number of High-Value Customers",
fill = "Homeowner Status"
) +
theme(
plot.title = element_text(size = 16, face = "bold", color = "darkblue"),
axis.title.x = element_text(size = 12, color = "darkgreen"),
axis.title.y = element_text(size = 12, color = "darkgreen"),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.title = element_text(size = 12, color = "darkblue"),
legend.position = "right"
)
print(plot_homeowner_status)
1.Description of the Plot:
This bar plot displays the top 10 product categories purchased by high-value customers. The x-axis represents the different product categories (identified by product IDs), while the y-axis shows the total number of transactions for each category. Each bar is colored distinctly, making it easy to differentiate between the categories.
2.Key Insights:
The analysis shows that Fluid Milk (White Only), Soft Drinks (12/18 & 15PK Can), and Gasoline (Regular Unleaded) are among the top-selling items for high-value customers. These product categories dominate transactions, indicating strong consumer preference and consistent demand. The ranking of product categories highlights opportunities for targeted promotions and inventory management based on customer buying behavior.
3.Interpretation:
If certain product categories show significantly higher transaction counts, it indicates strong consumer demand and preference for those products among high-value customers. The results can suggest potential areas for inventory focus or promotional efforts, as high transaction counts often correlate with customer satisfaction and loyalty.
# Q3: What are the top 10 product categories that these high value customers?
top_product_categories <- high_value_customer_transactions %>%
left_join(products, by = "product_id") %>%
group_by(product_type) %>%
summarize(Transactions = n()) %>%
arrange(desc(Transactions)) %>%
slice_head(n = 10)
nb.cols <- 11
mycolors <- colorRampPalette(brewer.pal(8, "Paired"))(nb.cols)
top_10_categories <- ggplot(top_product_categories, aes(x = reorder(product_type, -Transactions), y = Transactions, fill = product_type)) +
geom_col(colour = "black") +
scale_fill_manual(values = mycolors) +
ggtitle("Top 10 Product Categories Purchased by High-Value Customers") +
xlab('Product Categories') +
ylab('Number of Transactions') +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold", color = "darkblue"),
axis.title.x = element_text(size = 12, color = "darkgreen"),
axis.title.y = element_text(size = 12, color = "darkgreen"),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none"
)
print(top_10_categories)
1.Description of the Plot:
This bubble chart visualizes the coupon redemption rate for the top 10 product types purchased by high-value customers. The x-axis represents the product types, while the y-axis shows the coupon redemption rate as a percentage. The size of each bubble corresponds to the total spend for each product type, with larger bubbles indicating higher spending levels.
2.Key Insights:
The chart enables a quick visual assessment of which product types have the highest coupon redemption rates, facilitating comparisons among them. The bubble sizes indicate the relative total spend associated with each product type, providing context to the redemption rates. Identifying product types with high redemption rates and substantial total spend may highlight successful promotional strategies or customer engagement.
3.Interpretation:
A high coupon redemption rate for a product type suggests that customers actively use coupons when purchasing those items, which may indicate price sensitivity or promotional effectiveness. If certain product types have a high coupon redemption rate but relatively low total spend, it might suggest an opportunity to increase sales through improved marketing efforts or additional promotions. Conversely, product types with low coupon redemption rates and high total spend might indicate strong brand loyalty or perceived value, suggesting that customers feel less need to use discounts.
# Q4: How does the coupon redemption look for these categories?
combined_data <- high_value_customer_transactions %>%
left_join(products, by = "product_id")
top_product_types <- combined_data %>%
group_by(product_type) %>%
summarize(Transactions = n()) %>%
arrange(desc(Transactions)) %>%
slice_head(n = 10)
coupon_redemption_analysis_types <- combined_data %>%
filter(product_type %in% top_product_types$product_type) %>%
group_by(product_type) %>%
summarize(total_coupon_redemption = sum(coupon_disc, na.rm = TRUE),
total_spend = sum(sales_value, na.rm = TRUE)) %>%
mutate(coupon_redemption_rate = total_coupon_redemption / total_spend * 100) %>%
arrange(desc(coupon_redemption_rate))
bubble_chart <- ggplot(coupon_redemption_analysis_types,
aes(x = reorder(product_type, -coupon_redemption_rate),
y = coupon_redemption_rate,
size = total_spend,
fill = product_type)) +
geom_point(shape = 21, color = "black", alpha = 0.7) +
scale_size(range = c(3, 15), name = "Total Spend") +
scale_fill_viridis_d() +
labs(title = "Coupon Redemption Rate for Top 10 Product Types",
x = "Product Types",
y = "Coupon Redemption Rate (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 30, hjust = 1),
plot.title = element_text(size = 18, face = "bold", color = "darkblue"),
axis.title.x = element_text(size = 12, color = "darkgreen"),
axis.title.y = element_text(size = 12, color = "darkgreen"))
bubble_chart
1.Description of the Plot:
This line plot illustrates the total spend per campaign for high-value households, as defined by their spending levels. The x-axis represents the campaign IDs, while the y-axis shows the total spend in thousands of dollars ($K) attributed to each campaign. Each data point is connected by a line, with red points indicating the specific spending amounts for each campaign.
2.Key Insights:
The plot visually conveys how different campaigns are performing in terms of total spend among high-value households, allowing for easy comparisons. Each campaign’s effectiveness can be assessed based on the total spend associated with it, providing insights into which campaigns are resonating most with high-value customers. The total spend figures help identify successful campaigns that are generating significant revenue from high-value households.
3.Interpretation:
Campaigns with higher total spend indicate better engagement or effectiveness in converting high-value households into paying customers. A campaign that consistently shows higher total spend may be more effective at attracting or retaining high-value customers compared to others. Notably, Campaign IDs 8, 13, and 18 show the most significant spikes in total spend, suggesting these campaigns are particularly successful at engaging high-value customers. Conversely, Campaign IDs 1, 2, 3, 15, 24, and 25 exhibit the least total spend, which could indicate less relevance or appeal to the target audience. These campaigns may require reassessment or modification to improve their effectiveness in reaching and converting high-value households.
# Q5: What are the campaigns that are affecting these households? Is one campaign better than others for these households
transactions <- get_transactions()
promotions <- get_promotions()
data("campaigns")
data("households")
data("campaign_descriptions")
household_spending <- transactions %>%
group_by(household_id) %>%
summarize(total_spend = sum(sales_value)) %>%
arrange(desc(total_spend))
spend_threshold <- quantile(household_spending$total_spend, 0.90)
high_value_households <- household_spending %>%
filter(total_spend >= spend_threshold)
high_value_campaigns <- high_value_households %>%
inner_join(campaigns, by = "household_id")
campaign_effectiveness <- high_value_campaigns %>%
inner_join(transactions, by = "household_id") %>%
group_by(campaign_id) %>%
summarize(total_campaign_spend = sum(sales_value)) %>%
arrange(campaign_id)
campaign_effectiveness$campaign_id <- as.numeric(campaign_effectiveness$campaign_id)
ggplot(campaign_effectiveness, aes(x = campaign_id, y = total_campaign_spend)) +
geom_line(color = "blue", size = 1) +
geom_point(color = "red", size = 2) +
labs(title = "Total Spend per Campaign for High-Value Households",
x = "Campaign ID",
y = "Total Spend ($)") +
scale_y_continuous(labels = dollar_format(prefix = "$", scale = 1e-3, suffix = "K")) +
scale_x_continuous(breaks = seq(min(campaign_effectiveness$campaign_id), max(campaign_effectiveness$campaign_id), by = 1)) +
theme_minimal()
1.Problem Statement Addressed
The primary goal of our analysis was to identify high-value customers for Regork and uncover the key factors that drive their spending behavior, such as income, family size, marital status, coupon redemption patterns, and campaign engagement. By analyzing customer data, we aimed to provide actionable insights on how to optimize Regork’s marketing strategies and product offerings to increase customer retention, improve campaign effectiveness, and ultimately enhance revenue.
2.Methodology Employed
To address this problem statement, we conducted a comprehensive analysis of customer behavior, focusing on high-value customers defined by their spending habits. We analyzed key customer attributes, including income levels, the presence of children, marital status, coupon redemption patterns, and response to marketing campaigns. Our approach included segmenting customers based on these variables and identifying trends that could be leveraged to enhance marketing strategies and improve business outcomes. We also analyzed campaign performance to gauge the effectiveness of various promotional efforts.
3.Interesting Insights
Our analysis yielded several valuable insights:
(i) Customer Segmentation: High-value customers are more likely to be married, have higher income levels, and have children. These customers tend to redeem more coupons and engage with multiple campaigns, highlighting the importance of targeting these demographics to drive sales.
(ii) Coupon Redemption Patterns: Customers who frequently redeem coupons demonstrate higher overall spending, indicating that well-targeted coupon strategies can effectively increase customer loyalty and total spend. The top product categories for coupon redemption include groceries and household essentials, where customers exhibit high engagement.
(iii) Campaign Effectiveness: Campaigns like IDs 8, 13, and 18 show significant spikes in total spend, indicating that these campaigns successfully resonate with high-value customers. In contrast, campaigns like IDs 1, 2, 3, 15, 24, and 25 show minimal engagement, suggesting a need for reassessment or improvement to better appeal to the target audience.
4.Implications to the Consumer and Recommendations to the CEO
The insights from our analysis can inform targeted strategies to boost revenue and customer engagement:
(i) Targeted Marketing Campaigns: Develop marketing campaigns focused on married customers with higher income levels and families with children. These segments exhibit strong engagement, especially in categories where coupon usage is high, such as groceries and household essentials.
(ii) Enhanced Coupon Strategies: Implement personalized coupon strategies for high-value customers, targeting frequently purchased items. Offering coupons on popular products and seasonal items can encourage repeat purchases and increase loyalty. Additionally, promotions should be designed to capitalize on successful campaigns like IDs 8, 13, and 18, while adjusting strategies for less successful campaigns.
(iii) Optimizing Campaign Schedules: Use customer segmentation data to time campaigns around periods when high-value customers are most likely to shop. This approach will maximize engagement and spend, as well as boost sales during key periods.
(iv) Customer Retention Programs: Implement loyalty programs or rewards for high-value customers who frequently redeem coupons and participate in multiple campaigns. These initiatives will incentivize ongoing engagement and repeat spending.
5.Limitations:
(i) Limited Campaign Timing Analysis: We did not explore how the timing of campaigns aligns with customer behavior, which could impact promotional effectiveness.
(ii) Lack of Logistical Data: The absence of delivery and inventory data limited our ability to align sales trends with product availability, which could improve operational efficiency.