Introduction

This is a real story. A man discovered his daughter was pregnant after Target mailed him some coupons for baby items such as cots, maternity wear, and baby clothes based on her shopping habits. Target had robust demographics information about its customers. With customers’ demographic data, a business can predict and expand their market based on customers’ shopping habits.

However, not having customer demographics can significantly impact a business in several ways, leading to missed opportunities and inefficient marketing strategies. On the other hand, when businesses have robust customer demographics, it is easier to understand their audience and personalize product offerings, services, and marketing efforts.

In our analysis, we will identify that there is a tremendous amount of missing demographic information in the transactions data. This leaders to inefficient marketing to those customers. We refer to customers with full demographic information as “members” and those missing some demographic information as “guests.” Our goal is to address if the business leaders should allocate more resources to improve the completeness of our customers’ demographic information. We will also make suggestions on how to improve the collection and utilization of customer demographics data and predict customers more precisely.

Exploratory

Packages

  • The completejourney dataset is a dataset characterizing household level transactions over one year from a group of 2,469 households who are frequent shoppers at a grocery store.

  • Tidyverse is an R programming package that helps to transform and better present data. It assists with data import, tidying, manipulation, and data visualization.

  • Knitr is a general-purpose package for dynamic report generation in R.

  • ggplot2 is a package implements a grammar of graphics.

  • gridExtra package provides a number of user-level functions to work with “grid” graphics, notably to arrange multiple grid-based plots on a page, and draw tables.

library(tidyverse)
library(completejourney)
library(ggplot2)
library(gridExtra)
library(knitr)

Data Preperation and Wrangling

We will utilize various datasets from the completejourney package to conduct an analysis and provide recommendations to the business. Our initial step will be to determine the number of “members” and “guests” within our datasets. Households with all fields (age, income, home ownership, marital status, household size, household composition, and kids count) displaying “NA” will be classified as “guests,” while households with populated fields will be identified as “members.”

#join the demographics table to see if each transaction pairs with demographics information.
transactions <- get_transactions()
products <- products
promotions <- get_promotions()
demographics_gen <- demographics
transactions_demo <- transactions %>%
  left_join(demographics, by = "household_id")
#assign new value "membership" if all values are "NA" among 'age', 'income', 'home_ownership', 'marital_status', 'household_size', 'household_comp', and 'kids_count'.
transactions_demo <- transactions_demo %>%
  mutate(membership = if_else(is.na(age), "guest", "member"))

How many unique “members” and “guests” are in our dataset?

#Unique number of households group by membership
unique_hshld <- transactions_demo %>%
  select(household_id, membership) %>%
  group_by(membership) %>%
  distinct(household_id) %>%
  count(membership)

We are going to first take a look at our current “member” demographics from different key variables.

#member demographcis breakdown by each variable.
member_graph_1 <- demographics_gen %>%
  ggplot(aes(age, y = after_stat(count/sum(count)), fill = age)) +
  geom_bar()+
  scale_y_continuous(labels = scales::percent) +
  theme(legend.position = "none") +
  ggtitle("Age Breakdown of 'Members'",
          subtitle = "More members within the age group of 45-54") +
  xlab("Age Groups") +
  ylab("Percentage")

member_graph_2 <- demographics_gen %>%
  ggplot(aes(income, y = after_stat(count/sum(count)), fill = income)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 55, hjust = 1))+
  scale_y_continuous(labels = scales::percent) +
  theme(legend.position = "none") +
  ggtitle("Income Breakdown of 'Members'",
          subtitle = "Approx. Half of the Customers' Income range between 35-99K") +
  xlab("Income Groups") +
  ylab("Percentage")

member_graph_3 <- demographics_gen %>%
  filter(!is.na(marital_status)) %>%
  ggplot(aes(marital_status, y = after_stat(count/sum(count)), fill = marital_status)) + 
  geom_bar() + 
  scale_y_continuous(labels = scales::percent) +
  theme(legend.position = "none") +
  ggtitle("Marital Status Breakdown of 'Members'") +
  xlab("Marital Status") +
  ylab("Percentage")

member_graph_4 <- demographics_gen %>%
  ggplot(aes(household_size, y = after_stat(count/sum(count)), fill = household_size)) +
  geom_bar() +
  scale_y_continuous(labels = scales::percent) +
  theme(legend.position = "none") +
  ggtitle("Household Size Breakdown of 'Members'",
          subtitle = "Most of the 'Members' with household size of two") +
  xlab("Household Size") +
  ylab("Percentage")

member_graph_5 <- demographics_gen %>%
  ggplot(aes(household_comp, y = after_stat(count/sum(count)), fill = household_comp)) +
  geom_bar() +
  scale_y_continuous(labels = scales::percent) +
  theme(legend.position = "none") +
  ggtitle("Household Composition Breakdown of 'Members'",
          subtitle = "Most of the customers are adutls without kids") +
  xlab("Household Composition") +
  ylab("Percentage")

member_graph_6 <- demographics_gen %>%
  ggplot(aes(kids_count, y = after_stat(count/sum(count)), fill = kids_count)) +
  geom_bar() +
  scale_y_continuous(labels = scales::percent) +
  theme(legend.position = "none") +
  ggtitle("Kids Counts Breakdown of 'Members'",
          subtitle = "Most of the customers do not have kids") +
  xlab("Kids Count") +
  ylab("Percentage")

Do “members” redeem more coupons than “members”?

Then, let’s look at store preferences between “members” and “guests”.

#Total customer count for each store.
total_customers_per_store <- transactions_demo %>%
  select(store_id, membership, household_id) %>%
  group_by(store_id, household_id, membership) %>%
  summarise(total_customers = n()) %>%
  group_by(store_id) %>%
  summarise(total_customers = n()) %>%
  arrange(desc(total_customers))

# Stores with the most customers. Top 20 with the most customers
store_graph_1 <- transactions_demo %>%
  select(store_id, membership, household_id) %>%
  group_by(store_id, household_id, membership) %>%
  summarise(count_group = n()) %>%
  group_by(store_id, membership) %>%
  summarise(count_membership = n()) %>%
  left_join(total_customers_per_store, by = "store_id") %>%
  arrange(desc(total_customers)) %>%
  head(40) %>%
  ggplot(aes(x = reorder(store_id, desc(total_customers)), y = count_membership, fill = membership)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 55, hjust = 1)) +
  ggtitle("Top 20 Stores: Numbers of Frequent Shoppers",
          subtitle = "More guests shop at the stores than members") +
  xlab("Store ID") +
  ylab("Number of Customers")
  
# Top 10 stores broken down by members and guest. Bars unstacked.
store_graph_2 <- transactions_demo %>%
  select(store_id, membership, household_id) %>%
  group_by(store_id, household_id, membership) %>%
  summarise(count_group = n()) %>%
  group_by(store_id, membership) %>%
  summarise(count_membership = n()) %>%
  left_join(total_customers_per_store, by = "store_id") %>%
  arrange(desc(total_customers)) %>%
  head(20) %>%
  ggplot(aes(x = reorder(store_id, desc(count_membership)), y = count_membership, fill = membership)) +
  geom_col(position = 'dodge') +
  theme(axis.text.x = element_text(angle = 55, hjust = 1)) +
  ggtitle("Top 10 Stores: Numbers of Frequent Shoppers") +
  xlab("Store ID") +
  ylab("Number of Customers")

Next, we are going to discover store generated revenues based on the number difference between members and guests.

#Total sales per store in thousands.
total_sales_per_store <- transactions_demo %>%
  mutate(total_sale_value = sales_value) %>%
  group_by(store_id) %>%
  summarise(total_sales_per_store_kilo = sum(total_sale_value) / 1000)

#Total number of members per store.
total_memeber_per_store <- transactions_demo %>%
  select(store_id, membership, household_id) %>%
  group_by(store_id, household_id, membership) %>%
  summarise(total_customers = n()) %>%
  filter(membership == "member") %>%
  group_by(store_id) %>%
  summarise(total_members = n()) %>%
  arrange(desc(total_members))

#Total number of guests per store.
total_guest_per_store <- transactions_demo %>%
  select(store_id, membership, household_id) %>%
  group_by(store_id, household_id, membership) %>%
  summarise(total_customers = n()) %>%
  filter(membership == "guest") %>%
  group_by(store_id) %>%
  summarise(total_guest = n()) %>%
  arrange(desc(total_guest))

#Total revenue summary by members
revenue_summary <- total_sales_per_store %>%
  left_join(total_customers_per_store, by = "store_id") %>%
  left_join(total_memeber_per_store, by = "store_id") %>%
  left_join(total_guest_per_store, by = "store_id") %>%
  mutate(total_members = ifelse(is.na(total_members), 0, total_members)) %>%
  mutate(total_guest = ifelse(is.na(total_guest), 0, total_guest)) %>%
  arrange(desc(total_sales_per_store_kilo)) %>%
  head(20)

Next, we will take a look at the demographic impact among those stores.

#Top 10 stores
total_10_stores <- total_sales_per_store %>%
  arrange(desc(total_sales_per_store_kilo)) %>%
  head(10) %>%
  select(store_id)

#Top 10 stores revenue generated by customers from different age groups.
sales_store_demo <- transactions_demo %>%
  filter(membership == "member") %>% # Turn me off to see some interesting plots.
  filter(store_id %in% total_10_stores$store_id) %>%
  group_by(store_id, 
           household_id, 
           age, income, 
           home_ownership, 
           marital_status, 
           household_size, 
           household_comp, 
           kids_count) %>%
  summarise(total_sales_per_store_kilo = sum(sales_value)/1000)

#age
revenue_by_age_graph <- sales_store_demo %>%
  ggplot(aes(x = factor(store_id, level = total_10_stores$store_id), y = total_sales_per_store_kilo, fill = age)) +
  geom_col() +
  scale_y_continuous(labels = scales::dollar_format()) + 
  xlab("Store ID") +
  ylab("Total Sales by Store (thousands)") +
  theme(legend.position = "bottom") +
  ggtitle("Total Sales by Store - Breakdown by Age")

#income
revenue_by_income_graph <- sales_store_demo %>%
  ggplot(aes(x = factor(store_id, level = total_10_stores$store_id), y = total_sales_per_store_kilo, fill = income)) +
  geom_col() +
  scale_y_continuous(labels = scales::dollar_format()) + 
  xlab("Store ID") +
  ylab("Total Sales by Store (thousands)") +
  theme(legend.position = "bottom") +
  ggtitle("Total Sales by Store - Breakdown by Income")

#home ownership
revenue_by_ho_graph <- sales_store_demo %>%
  ggplot(aes(x = factor(store_id, level = total_10_stores$store_id), y = total_sales_per_store_kilo, fill = home_ownership)) +
  geom_col() +
  scale_y_continuous(labels = scales::dollar_format()) + 
  xlab("Store ID") +
  ylab("Total Sales by Store (thousands)") +
  theme(legend.position = "bottom") +
  ggtitle("Total Sales by Store - Breakdown by Home Ownership")

# marital status
revenue_by_ms_graph <- sales_store_demo %>%
  ggplot(aes(x = factor(store_id, level = total_10_stores$store_id), y = total_sales_per_store_kilo, fill = marital_status)) +
  geom_col() +
  scale_y_continuous(labels = scales::dollar_format()) + 
  xlab("Store ID") +
  ylab("Total Sales by Store (thousands)") +
  theme(legend.position = "bottom") +
  ggtitle("Total Sales by Store - Breakdown by Marital Status")

# household size
revenue_by_hshldsize_graph <- sales_store_demo %>%
  ggplot(aes(x = factor(store_id, level = total_10_stores$store_id), y = total_sales_per_store_kilo, fill = household_size)) +
  geom_col() +
  scale_y_continuous(labels = scales::dollar_format()) + 
  xlab("Store ID") +
  ylab("Total Sales by Store (thousands)") +
  theme(legend.position = "bottom") +
  ggtitle("Total Sales by Store - Breakdown by Household Size")

# household comp
revenue_by_hshldcomp_graph <- sales_store_demo %>%
ggplot(aes(x = factor(store_id, level = total_10_stores$store_id), y = total_sales_per_store_kilo, fill = household_comp)) +
  geom_col() +
  scale_y_continuous(labels = scales::dollar_format()) + 
  xlab("Store ID") +
  ylab("Total Sales by Store (thousands)") +
  theme(legend.position = "bottom") +
  ggtitle("Total Sales by Store - Breakdown by Household Comp")

# kids count
revenue_by_kids_graph <- sales_store_demo %>%
ggplot(aes(x = factor(store_id, level = total_10_stores$store_id), y = total_sales_per_store_kilo, fill = kids_count)) +
  geom_col() +
  scale_y_continuous(labels = scales::dollar_format()) + 
  xlab("Store ID") +
  ylab("Total Sales by Store (thousands)") +
  theme(legend.position = "bottom") +
  ggtitle("Total Sales by Store - Breakdown by Kids Count")

Lastly, we are discovering the top selling product categories among members to predict guests shopping habits.

#Total 10 stores with the most revenue
total_10_stores <- total_sales_per_store %>%
  arrange(desc(total_sales_per_store_kilo)) %>%
  head(10) %>%
  select(store_id)

#Top product sales among the top 10 stores
top_product_sales <- transactions_demo %>%
  filter(store_id %in% total_10_stores$store_id) %>%
  mutate(total_sale_value = sales_value) %>%
  left_join(products, by = "product_id") %>%
  select(total_sale_value,product_category, membership) %>%
  group_by(membership, product_category) %>%
  summarize(total_sale_value = sum(total_sale_value)) %>%
  arrange(desc(total_sale_value)) %>%
  head(10)
top_product_sales_graph <- top_product_sales %>%
  mutate(total_sale_value = paste0('$', round(total_sale_value,0))) %>%
  rename(
    Membership = membership,
     "Product Category" = product_category,
     "Total_Sales" = total_sale_value
  )

#Top product generates the most revenue in COUPON/MISC ITEMS
COUPON_MISC_product_sales <- transactions_demo %>%
  filter(store_id %in% total_10_stores$store_id) %>%
  mutate(total_sale_value = sales_value) %>%
  left_join(products, by = "product_id") %>%
  filter(product_category == "COUPON/MISC ITEMS") %>%
  select(total_sale_value, product_type, membership) %>%
  group_by(membership, product_type) %>%
  summarize(total_sale_value = sum(total_sale_value)) %>%
  arrange(desc(total_sale_value)) %>%
  head(10)
COUPON_MISC_product_sales_graph <- COUPON_MISC_product_sales %>%
  mutate(total_sale_value = paste0('$', round(total_sale_value,0))) %>%
  rename(
    Membership = membership,
     "Product Type" = product_type,
     "Total Sales" = total_sale_value
  )

# GASOLINE-REG UNLEADED filtered out. It is skewing the results. 
# Product Category
top_product_sales_no_gas <- transactions_demo %>%
  filter(store_id %in% total_10_stores$store_id) %>%
  mutate(total_sale_value = sales_value) %>%
  left_join(products, by = "product_id") %>%
  filter(product_type != "GASOLINE-REG UNLEADED") %>%
  select(total_sale_value, product_category, membership) %>%
  group_by(membership, product_category) %>%
  summarize(total_sale_value = sum(total_sale_value)) %>%
  arrange(desc(total_sale_value)) %>%
  head(20)

#Revenue generated at top 10 stores from top 10 product categories. Includes COUPON/MISC.
product_sales_graph_1 <- ggplot(top_product_sales, aes(x = membership, y = total_sale_value)) +
  geom_col(fill = "#69b3a2") + 
  scale_y_continuous(labels = scales::dollar_format()) +
  xlab("Membership") +
  ylab("Total Sales Value") +
  ggtitle("Total Sales Include Gasoline by Members & Guests")

#Revenue generated at top 10 stores without Gasoline.
product_sales_graph_2 <- ggplot(top_product_sales_no_gas, aes(x = product_category, y = total_sale_value, fill = membership)) +
  geom_col() + 
  scale_y_continuous(labels = scales::dollar_format()) +
  theme(axis.text.x = element_text(angle = 55, hjust = 1)) +
  xlab("Membership") +
  ylab("Total Sales Value") +
  ggtitle("Total Sales Per Product Category Exclude Gasoline")

Data Visualization and Analysis

Overview

Total demographics missing from the transactions_demo.

missing_demo <- transactions_demo %>%
  select(age, income, home_ownership, marital_status, household_size, household_comp, kids_count)
kable(colSums(is.na(missing_demo)), col.names = c("Demographics variable", "Count"), caption = "Number of Missing Demographics")
Number of Missing Demographics
Demographics variable Count
age 640457
income 640457
home_ownership 861364
marital_status 781889
household_size 640457
household_comp 640457
kids_count 640457

Distinct count of members and guests.

kable(unique_hshld, caption = "Unique number of members and guests", col.names = c("Customer Type", "Count"))
Unique number of members and guests
Customer Type Count
guest 1668
member 801

Although the number of “guests” is twice that of the “members”, “members” still generate more revenue during the year.

#Total sales revenue gained. Members VS. Guests.
df1 <- transactions_demo %>%
  group_by(membership) %>%
  summarize(total_member_revenue = sum(sales_value))
ggplot(df1, aes(membership, label = round(total_member_revenue, 0),  total_member_revenue, fill = membership)) +
  geom_col() +
  ggtitle("Total Revenue Generated by Members and Guests") +
  ylab("Total Revenue ($)") +
  xlab("Membership Type") + 
  scale_y_continuous(labels = scales::dollar) +
  geom_text(nudge_y = 1)

The below graphs show different distributions of the available categories.

The age breakdown shows a noticeable dip among the 55-65 age group. The income breakdown shows the lack of data in the higher income groups. The marital status shows an even split between married and unmarried. The household size, composition, kids count breakdown show there is a lack of data for households with children.

grid.arrange(member_graph_1, member_graph_2, member_graph_3, member_graph_4, member_graph_5, member_graph_6)

Coupon Usage

The below graph shows how often “members” and “guests” redeem manufacture coupons per transaction.
coupon_basket <- transactions_demo %>%
  mutate(coupon_logic = case_when(coupon_disc == 0 ~ "Not Redeemed",
                   coupon_disc > 0  ~ "Redeemed"
  )) %>%
  select(basket_id, membership, coupon_logic) %>%
  distinct_all()
group_no <- coupon_basket %>%
  filter(coupon_logic == 'Not Redeemed')
group_yes <- coupon_basket %>%
  filter(coupon_logic == 'Redeemed')
duplicates_rows <- bind_rows(group_no, group_yes) %>%
  group_by(basket_id) %>%
  summarise(n = n()) %>%
  filter(n > 1) %>%
  select(basket_id)
group_no <- group_no %>%
  filter(! group_no$basket_id %in% duplicates_rows$basket_id)
coupon_used_pct <- bind_rows(group_yes, group_no)
coupon_used_pct %>%
ggplot(aes(x = membership, y = after_stat(count/sum(count)), fill = coupon_logic)) +
  geom_bar() +
  xlab("Membership") +
  ylab("Percentage (Times of Redeeming a Coupon)") +
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Do 'Members' Use More Coupons?")

coupon_redeem <- transactions_demo %>%
  select(retail_disc, coupon_disc, membership) %>%
  group_by(membership) %>%
  summarize(
    total_retail_amt = sum(retail_disc),
    total_manufact_amt = sum(coupon_disc)
    ) %>%
  pivot_longer(cols = -membership,
               names_to = "Coupon_Type",
               values_to = "Total_Amounts", 
               values_transform = list(total = as.integer)
               )
coupon_redeem %>%
  ggplot(aes(fill = Coupon_Type, y = Total_Amounts, x = membership)) +
  geom_bar(position = "stack", stat = "identity") +
  scale_y_continuous(labels = scales::dollar) +
  theme(legend.position = "bottom") +
  ggtitle("Coupon Redeemed in Dollars") +
  xlab("Total Amounts") +
  ylab("Membership")

Store Proportions

Demographics Impact

The following graphs layout the different revenues generated for each demographic breakdown. Assuming that most stores will have a similar demographic breakdown, these graphs show areas needing more data in certain categories.
grid.arrange(revenue_by_age_graph, revenue_by_income_graph, revenue_by_ho_graph, revenue_by_ms_graph, revenue_by_hshldsize_graph, revenue_by_hshldcomp_graph, revenue_by_kids_graph, ncol = 2)

Product Purchase Trend

When looking at the top 10 product categrories, we can see that “members” not only generate more revenues total, but generate more revenue in each individual category. Gasoline sales are removed from the breakdown graph to highlight the sales of other categories.
kable(top_product_sales_graph)
Membership Product Category Total_Sales
member COUPON/MISC ITEMS $69035
guest COUPON/MISC ITEMS $41900
member BEEF $18307
member SOFT DRINKS $17414
guest SOFT DRINKS $13250
guest BEEF $12684
member FLUID MILK PRODUCTS $11693
member FRZN MEAT/MEAT DINNERS $11609
member CHEESE $11565
member BAG SNACKS $9131
kable(COUPON_MISC_product_sales_graph)
Membership Product Type Total Sales
member GASOLINE-REG UNLEADED $66661
guest GASOLINE-REG UNLEADED $40845
member OUTSIDE VENDORS GIFT CARDS $815
member ELECTRONIC GIFT CARDS ACTIVATI $405
member MEAT SUPPLIES $359
guest MEAT SUPPLIES $207
guest MISCELLANEOUS H & B AIDS $189
guest AMERICAN EXPRESS GIFT CARD $174
member MASTERCARD GIFT CARD $162
member PRODUCE DEPT KEY RING $154
grid.arrange(product_sales_graph_1, product_sales_graph_2, ncol = 2)

Business Suggestions

  • Since “members” use more coupons, with gathering more information from “guests”, we can target more “members” with coupons. Complete demographic information is important for business to develop marketing strategies.
  • Demographic data can help stores to choose what sales will maximize revenue, and how to boost customer engagement and improve their shopping experience.
  • It is suggested that high performing stores should be the first to implement demographics data collection as the effect will be seen more clearly, since the revenue scales with number of members. Outreach to “guests” to encourage them to become “members”.

Summary

Not having customer demographics can negatively impact business outcomes. Our goal is to address if the business leaders should allocate more resources to improve the completeness of our customers’ demographic information.

We utilized various datasets from the completejourney package, separated the customers into two groups based on whether they have the demographic data available, analyzed transactions, coupon usage, current “members” demographics breakdown, store preferences, and top selling products.

It can be see that in most stores, “guests” outnumber “members”. Even though “members” generate more revenue. The data shows there are several categories that contain a lack of demographic data. When looking at the top selling stores, it can be see that revenue scales with the number of “members” that visit the store.

Better demographics data will allow for targeted marketing to these new “members”. This should increase revenue and increase customer loyalty. It is suggested that the business leaders support the collection of obtaining more complete demographics. This can be completed by launching a campaign that allows guest to unlock store discounts by completing the membership registration process.

Assumption and Limitation
  • We assumed that the ability to target “guests” is limited due to lack of demographics data.
  • We assumed that the demographics accross store would be similar.
  • It appears that most stores have a base level of revenue that is not dependent on number of customers. The high performing stores do show scaling with “members” but our study does not take into account the difference between the high performing stores and the stores with baseline revenue.
  • A certain percentage of “guests” will be natually resistant to provide demographics.
Citation
  • Abdel Wahab, H., Diaa, N. M., & Ahmed Nagaty, S. (2023). Demographic characteristics and consumer decision-making styles: Do they impact fashion product involvement? Cogent Business & Management, 10(2), 2208430.
  • Ahn, J. (2020). Effectiveness of demographic characteristics in understanding Malaysian customers’ perceived value of the integrated resort sector. Journal of Vacation Marketing, 26(2), 195-210.
  • Gajanova, L., Nadanyiova, M., & Moravcikova, D. (2019). The use of demographic and psychographic segmentation to creating marketing strategy of brand loyalty. Scientific annals of economics and business, 66(1), 65-84.
  • Hill, K. (2012). How target figured out a teen girl was pregnant before her father did. Forbes, Inc, 7, 4-1.
  • Kelley, K. (2016). Understanding your customers: How demographics and psychographics can help. Penn State University Extension.
  • Patel, J., & Bansal, A. (2018). Effect of demographic variables on e-marketing strategies: A review. International Journal of Academic Research and Development, 3(1), 311-321.