This is a real story. A man discovered his daughter was pregnant after Target mailed him some coupons for baby items such as cots, maternity wear, and baby clothes based on her shopping habits. Target had robust demographics information about its customers. With customers’ demographic data, a business can predict and expand their market based on customers’ shopping habits.
However, not having customer demographics can significantly impact a business in several ways, leading to missed opportunities and inefficient marketing strategies. On the other hand, when businesses have robust customer demographics, it is easier to understand their audience and personalize product offerings, services, and marketing efforts.
In our analysis, we will identify that there is a tremendous amount of missing demographic information in the transactions data. This leaders to inefficient marketing to those customers. We refer to customers with full demographic information as “members” and those missing some demographic information as “guests.” Our goal is to address if the business leaders should allocate more resources to improve the completeness of our customers’ demographic information. We will also make suggestions on how to improve the collection and utilization of customer demographics data and predict customers more precisely.
The completejourney dataset is a dataset characterizing household level transactions over one year from a group of 2,469 households who are frequent shoppers at a grocery store.
Tidyverse is an R programming package that helps to transform and better present data. It assists with data import, tidying, manipulation, and data visualization.
Knitr is a general-purpose package for dynamic report generation in R.
ggplot2 is a package implements a grammar of graphics.
gridExtra package provides a number of user-level functions to work with “grid” graphics, notably to arrange multiple grid-based plots on a page, and draw tables.
library(tidyverse)
library(completejourney)
library(ggplot2)
library(gridExtra)
library(knitr)
We will utilize various datasets from the
completejourney package to conduct an analysis and provide
recommendations to the business. Our initial step will be to determine
the number of “members” and “guests” within our datasets. Households
with all fields (age, income, home ownership, marital status, household
size, household composition, and kids count) displaying “NA” will be
classified as “guests,” while households with populated fields will be
identified as “members.”
#join the demographics table to see if each transaction pairs with demographics information.
transactions <- get_transactions()
products <- products
promotions <- get_promotions()
demographics_gen <- demographics
transactions_demo <- transactions %>%
left_join(demographics, by = "household_id")
#assign new value "membership" if all values are "NA" among 'age', 'income', 'home_ownership', 'marital_status', 'household_size', 'household_comp', and 'kids_count'.
transactions_demo <- transactions_demo %>%
mutate(membership = if_else(is.na(age), "guest", "member"))
How many unique “members” and “guests” are in our dataset?
#Unique number of households group by membership
unique_hshld <- transactions_demo %>%
select(household_id, membership) %>%
group_by(membership) %>%
distinct(household_id) %>%
count(membership)
We are going to first take a look at our current “member” demographics from different key variables.
#member demographcis breakdown by each variable.
member_graph_1 <- demographics_gen %>%
ggplot(aes(age, y = after_stat(count/sum(count)), fill = age)) +
geom_bar()+
scale_y_continuous(labels = scales::percent) +
theme(legend.position = "none") +
ggtitle("Age Breakdown of 'Members'",
subtitle = "More members within the age group of 45-54") +
xlab("Age Groups") +
ylab("Percentage")
member_graph_2 <- demographics_gen %>%
ggplot(aes(income, y = after_stat(count/sum(count)), fill = income)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 55, hjust = 1))+
scale_y_continuous(labels = scales::percent) +
theme(legend.position = "none") +
ggtitle("Income Breakdown of 'Members'",
subtitle = "Approx. Half of the Customers' Income range between 35-99K") +
xlab("Income Groups") +
ylab("Percentage")
member_graph_3 <- demographics_gen %>%
filter(!is.na(marital_status)) %>%
ggplot(aes(marital_status, y = after_stat(count/sum(count)), fill = marital_status)) +
geom_bar() +
scale_y_continuous(labels = scales::percent) +
theme(legend.position = "none") +
ggtitle("Marital Status Breakdown of 'Members'") +
xlab("Marital Status") +
ylab("Percentage")
member_graph_4 <- demographics_gen %>%
ggplot(aes(household_size, y = after_stat(count/sum(count)), fill = household_size)) +
geom_bar() +
scale_y_continuous(labels = scales::percent) +
theme(legend.position = "none") +
ggtitle("Household Size Breakdown of 'Members'",
subtitle = "Most of the 'Members' with household size of two") +
xlab("Household Size") +
ylab("Percentage")
member_graph_5 <- demographics_gen %>%
ggplot(aes(household_comp, y = after_stat(count/sum(count)), fill = household_comp)) +
geom_bar() +
scale_y_continuous(labels = scales::percent) +
theme(legend.position = "none") +
ggtitle("Household Composition Breakdown of 'Members'",
subtitle = "Most of the customers are adutls without kids") +
xlab("Household Composition") +
ylab("Percentage")
member_graph_6 <- demographics_gen %>%
ggplot(aes(kids_count, y = after_stat(count/sum(count)), fill = kids_count)) +
geom_bar() +
scale_y_continuous(labels = scales::percent) +
theme(legend.position = "none") +
ggtitle("Kids Counts Breakdown of 'Members'",
subtitle = "Most of the customers do not have kids") +
xlab("Kids Count") +
ylab("Percentage")
Do “members” redeem more coupons than “members”?
Then, let’s look at store preferences between “members” and “guests”.
#Total customer count for each store.
total_customers_per_store <- transactions_demo %>%
select(store_id, membership, household_id) %>%
group_by(store_id, household_id, membership) %>%
summarise(total_customers = n()) %>%
group_by(store_id) %>%
summarise(total_customers = n()) %>%
arrange(desc(total_customers))
# Stores with the most customers. Top 20 with the most customers
store_graph_1 <- transactions_demo %>%
select(store_id, membership, household_id) %>%
group_by(store_id, household_id, membership) %>%
summarise(count_group = n()) %>%
group_by(store_id, membership) %>%
summarise(count_membership = n()) %>%
left_join(total_customers_per_store, by = "store_id") %>%
arrange(desc(total_customers)) %>%
head(40) %>%
ggplot(aes(x = reorder(store_id, desc(total_customers)), y = count_membership, fill = membership)) +
geom_col() +
theme(axis.text.x = element_text(angle = 55, hjust = 1)) +
ggtitle("Top 20 Stores: Numbers of Frequent Shoppers",
subtitle = "More guests shop at the stores than members") +
xlab("Store ID") +
ylab("Number of Customers")
# Top 10 stores broken down by members and guest. Bars unstacked.
store_graph_2 <- transactions_demo %>%
select(store_id, membership, household_id) %>%
group_by(store_id, household_id, membership) %>%
summarise(count_group = n()) %>%
group_by(store_id, membership) %>%
summarise(count_membership = n()) %>%
left_join(total_customers_per_store, by = "store_id") %>%
arrange(desc(total_customers)) %>%
head(20) %>%
ggplot(aes(x = reorder(store_id, desc(count_membership)), y = count_membership, fill = membership)) +
geom_col(position = 'dodge') +
theme(axis.text.x = element_text(angle = 55, hjust = 1)) +
ggtitle("Top 10 Stores: Numbers of Frequent Shoppers") +
xlab("Store ID") +
ylab("Number of Customers")
Next, we are going to discover store generated revenues based on the number difference between members and guests.
#Total sales per store in thousands.
total_sales_per_store <- transactions_demo %>%
mutate(total_sale_value = sales_value) %>%
group_by(store_id) %>%
summarise(total_sales_per_store_kilo = sum(total_sale_value) / 1000)
#Total number of members per store.
total_memeber_per_store <- transactions_demo %>%
select(store_id, membership, household_id) %>%
group_by(store_id, household_id, membership) %>%
summarise(total_customers = n()) %>%
filter(membership == "member") %>%
group_by(store_id) %>%
summarise(total_members = n()) %>%
arrange(desc(total_members))
#Total number of guests per store.
total_guest_per_store <- transactions_demo %>%
select(store_id, membership, household_id) %>%
group_by(store_id, household_id, membership) %>%
summarise(total_customers = n()) %>%
filter(membership == "guest") %>%
group_by(store_id) %>%
summarise(total_guest = n()) %>%
arrange(desc(total_guest))
#Total revenue summary by members
revenue_summary <- total_sales_per_store %>%
left_join(total_customers_per_store, by = "store_id") %>%
left_join(total_memeber_per_store, by = "store_id") %>%
left_join(total_guest_per_store, by = "store_id") %>%
mutate(total_members = ifelse(is.na(total_members), 0, total_members)) %>%
mutate(total_guest = ifelse(is.na(total_guest), 0, total_guest)) %>%
arrange(desc(total_sales_per_store_kilo)) %>%
head(20)
Next, we will take a look at the demographic impact among those stores.
#Top 10 stores
total_10_stores <- total_sales_per_store %>%
arrange(desc(total_sales_per_store_kilo)) %>%
head(10) %>%
select(store_id)
#Top 10 stores revenue generated by customers from different age groups.
sales_store_demo <- transactions_demo %>%
filter(membership == "member") %>% # Turn me off to see some interesting plots.
filter(store_id %in% total_10_stores$store_id) %>%
group_by(store_id,
household_id,
age, income,
home_ownership,
marital_status,
household_size,
household_comp,
kids_count) %>%
summarise(total_sales_per_store_kilo = sum(sales_value)/1000)
#age
revenue_by_age_graph <- sales_store_demo %>%
ggplot(aes(x = factor(store_id, level = total_10_stores$store_id), y = total_sales_per_store_kilo, fill = age)) +
geom_col() +
scale_y_continuous(labels = scales::dollar_format()) +
xlab("Store ID") +
ylab("Total Sales by Store (thousands)") +
theme(legend.position = "bottom") +
ggtitle("Total Sales by Store - Breakdown by Age")
#income
revenue_by_income_graph <- sales_store_demo %>%
ggplot(aes(x = factor(store_id, level = total_10_stores$store_id), y = total_sales_per_store_kilo, fill = income)) +
geom_col() +
scale_y_continuous(labels = scales::dollar_format()) +
xlab("Store ID") +
ylab("Total Sales by Store (thousands)") +
theme(legend.position = "bottom") +
ggtitle("Total Sales by Store - Breakdown by Income")
#home ownership
revenue_by_ho_graph <- sales_store_demo %>%
ggplot(aes(x = factor(store_id, level = total_10_stores$store_id), y = total_sales_per_store_kilo, fill = home_ownership)) +
geom_col() +
scale_y_continuous(labels = scales::dollar_format()) +
xlab("Store ID") +
ylab("Total Sales by Store (thousands)") +
theme(legend.position = "bottom") +
ggtitle("Total Sales by Store - Breakdown by Home Ownership")
# marital status
revenue_by_ms_graph <- sales_store_demo %>%
ggplot(aes(x = factor(store_id, level = total_10_stores$store_id), y = total_sales_per_store_kilo, fill = marital_status)) +
geom_col() +
scale_y_continuous(labels = scales::dollar_format()) +
xlab("Store ID") +
ylab("Total Sales by Store (thousands)") +
theme(legend.position = "bottom") +
ggtitle("Total Sales by Store - Breakdown by Marital Status")
# household size
revenue_by_hshldsize_graph <- sales_store_demo %>%
ggplot(aes(x = factor(store_id, level = total_10_stores$store_id), y = total_sales_per_store_kilo, fill = household_size)) +
geom_col() +
scale_y_continuous(labels = scales::dollar_format()) +
xlab("Store ID") +
ylab("Total Sales by Store (thousands)") +
theme(legend.position = "bottom") +
ggtitle("Total Sales by Store - Breakdown by Household Size")
# household comp
revenue_by_hshldcomp_graph <- sales_store_demo %>%
ggplot(aes(x = factor(store_id, level = total_10_stores$store_id), y = total_sales_per_store_kilo, fill = household_comp)) +
geom_col() +
scale_y_continuous(labels = scales::dollar_format()) +
xlab("Store ID") +
ylab("Total Sales by Store (thousands)") +
theme(legend.position = "bottom") +
ggtitle("Total Sales by Store - Breakdown by Household Comp")
# kids count
revenue_by_kids_graph <- sales_store_demo %>%
ggplot(aes(x = factor(store_id, level = total_10_stores$store_id), y = total_sales_per_store_kilo, fill = kids_count)) +
geom_col() +
scale_y_continuous(labels = scales::dollar_format()) +
xlab("Store ID") +
ylab("Total Sales by Store (thousands)") +
theme(legend.position = "bottom") +
ggtitle("Total Sales by Store - Breakdown by Kids Count")
Lastly, we are discovering the top selling product categories among members to predict guests shopping habits.
#Total 10 stores with the most revenue
total_10_stores <- total_sales_per_store %>%
arrange(desc(total_sales_per_store_kilo)) %>%
head(10) %>%
select(store_id)
#Top product sales among the top 10 stores
top_product_sales <- transactions_demo %>%
filter(store_id %in% total_10_stores$store_id) %>%
mutate(total_sale_value = sales_value) %>%
left_join(products, by = "product_id") %>%
select(total_sale_value,product_category, membership) %>%
group_by(membership, product_category) %>%
summarize(total_sale_value = sum(total_sale_value)) %>%
arrange(desc(total_sale_value)) %>%
head(10)
top_product_sales_graph <- top_product_sales %>%
mutate(total_sale_value = paste0('$', round(total_sale_value,0))) %>%
rename(
Membership = membership,
"Product Category" = product_category,
"Total_Sales" = total_sale_value
)
#Top product generates the most revenue in COUPON/MISC ITEMS
COUPON_MISC_product_sales <- transactions_demo %>%
filter(store_id %in% total_10_stores$store_id) %>%
mutate(total_sale_value = sales_value) %>%
left_join(products, by = "product_id") %>%
filter(product_category == "COUPON/MISC ITEMS") %>%
select(total_sale_value, product_type, membership) %>%
group_by(membership, product_type) %>%
summarize(total_sale_value = sum(total_sale_value)) %>%
arrange(desc(total_sale_value)) %>%
head(10)
COUPON_MISC_product_sales_graph <- COUPON_MISC_product_sales %>%
mutate(total_sale_value = paste0('$', round(total_sale_value,0))) %>%
rename(
Membership = membership,
"Product Type" = product_type,
"Total Sales" = total_sale_value
)
# GASOLINE-REG UNLEADED filtered out. It is skewing the results.
# Product Category
top_product_sales_no_gas <- transactions_demo %>%
filter(store_id %in% total_10_stores$store_id) %>%
mutate(total_sale_value = sales_value) %>%
left_join(products, by = "product_id") %>%
filter(product_type != "GASOLINE-REG UNLEADED") %>%
select(total_sale_value, product_category, membership) %>%
group_by(membership, product_category) %>%
summarize(total_sale_value = sum(total_sale_value)) %>%
arrange(desc(total_sale_value)) %>%
head(20)
#Revenue generated at top 10 stores from top 10 product categories. Includes COUPON/MISC.
product_sales_graph_1 <- ggplot(top_product_sales, aes(x = membership, y = total_sale_value)) +
geom_col(fill = "#69b3a2") +
scale_y_continuous(labels = scales::dollar_format()) +
xlab("Membership") +
ylab("Total Sales Value") +
ggtitle("Total Sales Include Gasoline by Members & Guests")
#Revenue generated at top 10 stores without Gasoline.
product_sales_graph_2 <- ggplot(top_product_sales_no_gas, aes(x = product_category, y = total_sale_value, fill = membership)) +
geom_col() +
scale_y_continuous(labels = scales::dollar_format()) +
theme(axis.text.x = element_text(angle = 55, hjust = 1)) +
xlab("Membership") +
ylab("Total Sales Value") +
ggtitle("Total Sales Per Product Category Exclude Gasoline")
Total demographics missing from the
transactions_demo.
missing_demo <- transactions_demo %>%
select(age, income, home_ownership, marital_status, household_size, household_comp, kids_count)
kable(colSums(is.na(missing_demo)), col.names = c("Demographics variable", "Count"), caption = "Number of Missing Demographics")
| Demographics variable | Count |
|---|---|
| age | 640457 |
| income | 640457 |
| home_ownership | 861364 |
| marital_status | 781889 |
| household_size | 640457 |
| household_comp | 640457 |
| kids_count | 640457 |
Distinct count of members and guests.
kable(unique_hshld, caption = "Unique number of members and guests", col.names = c("Customer Type", "Count"))
| Customer Type | Count |
|---|---|
| guest | 1668 |
| member | 801 |
Although the number of “guests” is twice that of the “members”, “members” still generate more revenue during the year.
#Total sales revenue gained. Members VS. Guests.
df1 <- transactions_demo %>%
group_by(membership) %>%
summarize(total_member_revenue = sum(sales_value))
ggplot(df1, aes(membership, label = round(total_member_revenue, 0), total_member_revenue, fill = membership)) +
geom_col() +
ggtitle("Total Revenue Generated by Members and Guests") +
ylab("Total Revenue ($)") +
xlab("Membership Type") +
scale_y_continuous(labels = scales::dollar) +
geom_text(nudge_y = 1)
The age breakdown shows a noticeable dip among the 55-65 age group. The income breakdown shows the lack of data in the higher income groups. The marital status shows an even split between married and unmarried. The household size, composition, kids count breakdown show there is a lack of data for households with children.
grid.arrange(member_graph_1, member_graph_2, member_graph_3, member_graph_4, member_graph_5, member_graph_6)
coupon_basket <- transactions_demo %>%
mutate(coupon_logic = case_when(coupon_disc == 0 ~ "Not Redeemed",
coupon_disc > 0 ~ "Redeemed"
)) %>%
select(basket_id, membership, coupon_logic) %>%
distinct_all()
group_no <- coupon_basket %>%
filter(coupon_logic == 'Not Redeemed')
group_yes <- coupon_basket %>%
filter(coupon_logic == 'Redeemed')
duplicates_rows <- bind_rows(group_no, group_yes) %>%
group_by(basket_id) %>%
summarise(n = n()) %>%
filter(n > 1) %>%
select(basket_id)
group_no <- group_no %>%
filter(! group_no$basket_id %in% duplicates_rows$basket_id)
coupon_used_pct <- bind_rows(group_yes, group_no)
coupon_used_pct %>%
ggplot(aes(x = membership, y = after_stat(count/sum(count)), fill = coupon_logic)) +
geom_bar() +
xlab("Membership") +
ylab("Percentage (Times of Redeeming a Coupon)") +
scale_y_continuous(labels = scales::percent) +
ggtitle("Do 'Members' Use More Coupons?")
coupon_redeem <- transactions_demo %>%
select(retail_disc, coupon_disc, membership) %>%
group_by(membership) %>%
summarize(
total_retail_amt = sum(retail_disc),
total_manufact_amt = sum(coupon_disc)
) %>%
pivot_longer(cols = -membership,
names_to = "Coupon_Type",
values_to = "Total_Amounts",
values_transform = list(total = as.integer)
)
coupon_redeem %>%
ggplot(aes(fill = Coupon_Type, y = Total_Amounts, x = membership)) +
geom_bar(position = "stack", stat = "identity") +
scale_y_continuous(labels = scales::dollar) +
theme(legend.position = "bottom") +
ggtitle("Coupon Redeemed in Dollars") +
xlab("Total Amounts") +
ylab("Membership")
grid.arrange(store_graph_1, store_graph_2)
revenue_summary %>%
ggplot(aes(x = reorder(store_id, desc(total_customers)), group = 1)) +
geom_line(aes(x = reorder(store_id, desc(total_customers)), y = total_sales_per_store_kilo, colour = "Total Sales by Store (thousand)")) +
geom_line(aes(x = reorder(store_id, desc(total_customers)), y = total_customers, colour = "Total Customers")) +
geom_line(aes(x = reorder(store_id, desc(total_customers)), y = total_members, colour = "Total Members")) +
geom_line(aes(x = reorder(store_id, desc(total_customers)), y = total_guest, colour = "Total Guest")) +
scale_color_manual(name = "Legend", values = c("Total Sales by Store (thousand)" = "orange",
"Total Customers" = "red",
"Total Members" = "green",
"Total Guest" = "black")) +
xlab("Store ID") +
ylab(" ") +
annotate("text", x = 2, y = 152, label = "$148") +
annotate("text", x = 12, y = 130, label = "$121") +
annotate("text", x = 19, y = 100, label = "$87") +
theme(axis.text.x = element_text(angle = 55, hjust = 1), legend.position = "bottom") +
ggtitle("Total Store Generated Revenue from Members and Guests",
subtitle = "Store 375, 406, 429 generate the most revenue")
grid.arrange(revenue_by_age_graph, revenue_by_income_graph, revenue_by_ho_graph, revenue_by_ms_graph, revenue_by_hshldsize_graph, revenue_by_hshldcomp_graph, revenue_by_kids_graph, ncol = 2)
kable(top_product_sales_graph)
| Membership | Product Category | Total_Sales |
|---|---|---|
| member | COUPON/MISC ITEMS | $69035 |
| guest | COUPON/MISC ITEMS | $41900 |
| member | BEEF | $18307 |
| member | SOFT DRINKS | $17414 |
| guest | SOFT DRINKS | $13250 |
| guest | BEEF | $12684 |
| member | FLUID MILK PRODUCTS | $11693 |
| member | FRZN MEAT/MEAT DINNERS | $11609 |
| member | CHEESE | $11565 |
| member | BAG SNACKS | $9131 |
kable(COUPON_MISC_product_sales_graph)
| Membership | Product Type | Total Sales |
|---|---|---|
| member | GASOLINE-REG UNLEADED | $66661 |
| guest | GASOLINE-REG UNLEADED | $40845 |
| member | OUTSIDE VENDORS GIFT CARDS | $815 |
| member | ELECTRONIC GIFT CARDS ACTIVATI | $405 |
| member | MEAT SUPPLIES | $359 |
| guest | MEAT SUPPLIES | $207 |
| guest | MISCELLANEOUS H & B AIDS | $189 |
| guest | AMERICAN EXPRESS GIFT CARD | $174 |
| member | MASTERCARD GIFT CARD | $162 |
| member | PRODUCE DEPT KEY RING | $154 |
grid.arrange(product_sales_graph_1, product_sales_graph_2, ncol = 2)
Not having customer demographics can negatively impact business outcomes. Our goal is to address if the business leaders should allocate more resources to improve the completeness of our customers’ demographic information.
We utilized various datasets from the completejourney
package, separated the customers into two groups based on whether they
have the demographic data available, analyzed transactions, coupon
usage, current “members” demographics breakdown, store preferences, and
top selling products.
It can be see that in most stores, “guests” outnumber “members”. Even though “members” generate more revenue. The data shows there are several categories that contain a lack of demographic data. When looking at the top selling stores, it can be see that revenue scales with the number of “members” that visit the store.
Better demographics data will allow for targeted marketing to these new “members”. This should increase revenue and increase customer loyalty. It is suggested that the business leaders support the collection of obtaining more complete demographics. This can be completed by launching a campaign that allows guest to unlock store discounts by completing the membership registration process.
Assumption and Limitation
- We assumed that the ability to target “guests” is limited due to lack of demographics data.
- We assumed that the demographics accross store would be similar.
- It appears that most stores have a base level of revenue that is not dependent on number of customers. The high performing stores do show scaling with “members” but our study does not take into account the difference between the high performing stores and the stores with baseline revenue.
- A certain percentage of “guests” will be natually resistant to provide demographics.
Citation
- Abdel Wahab, H., Diaa, N. M., & Ahmed Nagaty, S. (2023). Demographic characteristics and consumer decision-making styles: Do they impact fashion product involvement? Cogent Business & Management, 10(2), 2208430.
- Ahn, J. (2020). Effectiveness of demographic characteristics in understanding Malaysian customers’ perceived value of the integrated resort sector. Journal of Vacation Marketing, 26(2), 195-210.
- Gajanova, L., Nadanyiova, M., & Moravcikova, D. (2019). The use of demographic and psychographic segmentation to creating marketing strategy of brand loyalty. Scientific annals of economics and business, 66(1), 65-84.
- Hill, K. (2012). How target figured out a teen girl was pregnant before her father did. Forbes, Inc, 7, 4-1.
- Kelley, K. (2016). Understanding your customers: How demographics and psychographics can help. Penn State University Extension.
- Patel, J., & Bansal, A. (2018). Effect of demographic variables on e-marketing strategies: A review. International Journal of Academic Research and Development, 3(1), 311-321.