Milk is a common household item that is consumed by the masses. Being an extremely versatile good, milk can combined with a breakfast cereal, used in baking/cooking, or even for enjoying a cold glass for the satisfaction of the consumer. Milk is also a relatively cheap good and is constantly in demand.
After running preliminary analysis on the data frames that were provided it was discovered that out of all the product transactions that occurred during the year 2017, milk was the fourth most purchased product, but it falls significantly lower from the revenue gained than the top three products by a margin of over $150,000.
With milk sales lagging, this analysis aims to provide insight to Regork on who is most likely to consume milk, and how Regork can target coupon campaigns to consumers where milk consumption is lagging based on the household characteristics of those consumers.
The household characteristics include age group, number of children, household size, and level of income.
The required packages to replicate this analysis include:
Importing Data Sets From Complete Journey
demo<-completejourney::demographics
transactions<-get_transactions()
products<-completejourney::products
campaigns<-completejourney::campaigns
campaign_descriptions<-completejourney::campaign_descriptions
coupons<-completejourney::coupons
coupon_redemptions<-completejourney::coupon_redemptions
Joining tables by the “product_id” variable
transactions_products <- transactions %>%
left_join(products, by = "product_id")
Joining tables by the “household_id” variable
complete_data <- transactions_products %>%
inner_join(demographics, by = "household_id")%>%
inner_join(campaigns, by = "household_id")
Joining tables by the “household_id” variable
complete_data<-complete_data%>%
inner_join(campaign_descriptions, by = "campaign_id")
After joining multiple data frames to combine one large table, we have a table with 488,742 observations and 28 variables.
dim(complete_data)
## [1] 4886742 28
A snapshot of the Data
head(complete_data)
## # A tibble: 6 x 28
## household_id store_id basket_id product_id quantity sales_value retail_disc
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 900 330 31198570~ 1095275 1 0.5 0
## 2 900 330 31198570~ 1095275 1 0.5 0
## 3 900 330 31198570~ 1095275 1 0.5 0
## 4 900 330 31198570~ 1095275 1 0.5 0
## 5 900 330 31198570~ 1095275 1 0.5 0
## 6 900 330 31198570~ 1095275 1 0.5 0
## # ... with 21 more variables: coupon_disc <dbl>, coupon_match_disc <dbl>,
## # week <int>, transaction_timestamp <dttm>, manufacturer_id <chr>,
## # department <chr>, brand <fct>, product_category <chr>, product_type <chr>,
## # package_size <chr>, age <ord>, income <ord>, home_ownership <ord>,
## # marital_status <ord>, household_size <ord>, household_comp <ord>,
## # kids_count <ord>, campaign_id <chr>, campaign_type <ord>,
## # start_date <date>, end_date <date>
complete_data%>%
group_by(product_category)%>%
summarize(Spent=sum(sales_value,na.rm = TRUE))%>%
slice_max(order_by=Spent,n=5)
## # A tibble: 5 x 2
## product_category Spent
## <chr> <dbl>
## 1 COUPON/MISC ITEMS 1510881.
## 2 SOFT DRINKS 567578.
## 3 BEEF 544233.
## 4 FLUID MILK PRODUCTS 379666.
## 5 CHEESE 354232.
complete_data%>%
group_by(product_category)%>%
summarize(Quantity=sum(quantity,na.rm = TRUE))%>%
slice_max(order_by=Quantity,n=10)
## # A tibble: 10 x 2
## product_category Quantity
## <chr> <dbl>
## 1 COUPON/MISC ITEMS 600618898
## 2 SOFT DRINKS 268654
## 3 FUEL 229200
## 4 FLUID MILK PRODUCTS 215298
## 5 CHEESE 180084
## 6 BAKED BREAD/BUNS/ROLLS 177111
## 7 YOGURT 163223
## 8 SOUP 159330
## 9 FRZN MEAT/MEAT DINNERS 148716
## 10 VEGETABLES - SHELF STABLE 147811
complete_data%>%
group_by(product_category)%>%
summarize(Sales=sum(sales_value,na.rm = TRUE))%>%
slice_max(order_by=Sales,n=5)
## # A tibble: 5 x 2
## product_category Sales
## <chr> <dbl>
## 1 COUPON/MISC ITEMS 1510881.
## 2 SOFT DRINKS 567578.
## 3 BEEF 544233.
## 4 FLUID MILK PRODUCTS 379666.
## 5 CHEESE 354232.
As shown by the two tables above, milk is the 4th largest item purchased in regards to quantity, and the 4th largest item in regards to price. We can see that milk products fall behind in sales by around $164,000, a significant deficit that we are seeking to close to gap of.
milk_data <- complete_data %>%
filter(grepl("milk", product_type, ignore.case = TRUE))
# Summarize milk purchases by age, income, household size, and kids
milk_summary <- milk_data %>%
group_by(transaction_timestamp, age, income, household_size, kids_count) %>%
summarize(total_spent = sum(sales_value), .groups = 'drop')
# Convert transaction_timestamp to Date format using ymd_hms
milk_summary <- milk_summary %>%
mutate(transaction_timestamp = ymd_hms(transaction_timestamp, quiet = TRUE))
# Remove unknown groups
milk_summary <- milk_summary %>%
filter(!is.na(age) & age != "Unknown",
!is.na(income) & income != "Unknown",
!is.na(household_size) & household_size != "Unknown",
!is.na(kids_count) & kids_count != "Unknown")
# Plot milk purchases over time by age
ggplot(milk_summary, aes(x = transaction_timestamp, y = total_spent, color = as.factor(age))) +
geom_line() +
labs(title = "Milk Purchases Over Time by Age",
x = "Date", y = "Total Sales") +
scale_color_discrete(name = "Age Group") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "right")
We can see that throughout 2017, most milk purchases came from consumers age 25-44, with the peak of milk sales being age group 35-44 in January. Groups that are lagging in milk purchases are 19-24 and 55+. This implies that implementing a marketing campaign targeted toward households with consumers 19-24 and 55+ and would be be effective in increasing milk sales.
# Bar chart for milk purchases by kids count
ggplot(milk_summary, aes(x = as.factor(kids_count), y = total_spent, fill = as.factor(kids_count))) +
geom_bar(stat = "identity", show.legend = FALSE) +
labs(title = "Total Milk Purchases by Kids Count",
x = "Kids Count", y = "Total Sales", fill = "Kid Count")+
theme(
plot.title = element_text(hjust = 0.5))
From the bar chart above, it appears that has the kid count increases, the amount spent by households on milk decreases. For our coupon campaign, we could specifically market towards households with children as there is a larger customer base that we could tap into to boost milk sales.
# Bar chart for milk purchases by household size
ggplot(milk_summary, aes(x = as.factor(household_size), y = total_spent, fill = as.factor(household_size))) +
geom_bar(stat = "identity", show.legend = FALSE) +
labs(title = "Total Milk Purchases by Household Size",
x = "Household Size", y = "Total Sales") +
theme_minimal()+
scale_fill_discrete(name = "")
In the bar chart above, the largest consumers of milk based on household size is two, and decreases significantly as household size increases. This result makes sense as we learned that with more children in the household, the less likely that household is to buy milk. The household size plot reinforces that coupons should be distributed to households with more children.
milk_summary %>%
mutate(income_factor = fct_collapse(income,
"Under 100K" = c("Under 15K", "15-24K", "25-34K", "35-49K", "50-74K", "75-99K"),
"Above 100K" = c("100-124K", "125-149K", "150-174K", "175-199K", "200-249K", "250K+")
)) %>%
mutate(year = floor_date(transaction_timestamp, unit = "year")) %>%
group_by(income_factor, year) %>%
summarize(total_spent = sum(total_spent, na.rm = TRUE), .groups = "drop") %>%
ggplot(aes(x = income_factor, y = total_spent, fill = income_factor)) +
geom_violin(alpha = 0.7, color = "black") +
geom_point(aes(color = income_factor), position = position_dodge(width = 0), size = 3, alpha = 0.8) +
scale_fill_manual(values = c("Under 100K" = "#1f77b4", "Above 100K" = "black"), name = "Income") +
scale_color_manual(values = c("Under 100K" = "#1f77b4", "Above 100K" = "black"), name = "Income") +
scale_y_continuous(labels = scales::comma) +
labs(title = "Distribution of Milk Purchases by Income", x = "Income Bracket", y = "Total Sales") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
The sales of milk among income level is very interesting. We can see that people who make above 100K spend way less on milk than people who makes less than 100K. While coupons may not be utilized by people who make over 100K, if they fit other household characteristics, such as having multiple children, they could still use the milk coupons.
milk_months <- milk_data %>%
mutate(month = floor_date(ymd_hms(transaction_timestamp, quiet = TRUE), "month")) %>%
group_by(month) %>%
summarize(total_volume = sum(quantity), .groups = 'drop')
ggplot(milk_months, aes(x = month, y = total_volume)) +
geom_line(stat = "identity", color = "purple") +
labs(title = "Volume of Milk Purchases by Month",
x = "Month", y = "Total Volume") +
theme_minimal()
From the plot above, we can see that sales of milk by month are all relatively high, with the smallest volumes occurring in February, April, and November. For our coupon campaign, if we market about these three months, we should expect to see the total volume to increase among these months.
Now that we have identified characteristics of households that we should target, we are going to look at which campaign type was successful in 2017.
complete_data %>%
ggplot(aes(x = campaign_type, fill = campaign_type)) +
geom_bar() +
geom_text(
stat = "count",
aes(label = scales::comma(..count..)),
vjust = -0.5,
size = 4
) +
labs(
title = "Campaign Type Distribution",
x = "Campaign Type",
y = "Count"
) +
scale_fill_brewer(palette = "Set2") +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 14), # Center and bold title
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none"
)
The plot above illustrates the count of coupons that were redeemed by households at our stores based on the campaign type that was used to market the coupons. There is minimal difference between type A and type B, so either one of these campaign types should be run to market coupons.
After analyzing the plots above to scan household characteristics and the type of campaign to distribute coupons, the summary tables below identify the households line up with the ideal characteristics.
Age Summary Table:
milk_data%>%
select(household_id,age,sales_value)%>%
group_by(household_id,age)%>%
summarize(sales_value = sum(sales_value,na.rm = TRUE))%>%
arrange((sales_value))
## # A tibble: 751 x 3
## # Groups: household_id [751]
## household_id age sales_value
## <chr> <ord> <dbl>
## 1 1219 19-24 2
## 2 460 65+ 3.6
## 3 621 45-54 3.78
## 4 1604 35-44 3.99
## 5 1499 55-64 4.24
## 6 2390 25-34 4.42
## 7 1076 25-34 4.54
## 8 2179 45-54 4.65
## 9 823 35-44 4.96
## 10 48 19-24 5.58
## # ... with 741 more rows
Kid Count Summary Table:
milk_data%>%
filter(kids_count != 0)%>%
select(household_id,kids_count,sales_value)%>%
group_by(household_id,kids_count)%>%
summarize(sales_value = sum(sales_value,na.rm = TRUE))%>%
arrange((sales_value))
## # A tibble: 261 x 3
## # Groups: household_id [261]
## household_id kids_count sales_value
## <chr> <ord> <dbl>
## 1 1219 1 2
## 2 2390 1 4.42
## 3 2179 1 4.65
## 4 823 1 4.96
## 5 2092 1 5.97
## 6 302 1 9.47
## 7 1120 2 15.5
## 8 71 1 15.8
## 9 1021 3+ 15.9
## 10 378 1 17.6
## # ... with 251 more rows
Household Size Table:
milk_data%>%
select(household_id,household_size,sales_value)%>%
group_by(household_id,household_size)%>%
summarize(sales_value = sum(sales_value,na.rm = TRUE))%>%
arrange((sales_value))
## # A tibble: 751 x 3
## # Groups: household_id [751]
## household_id household_size sales_value
## <chr> <ord> <dbl>
## 1 1219 3 2
## 2 460 1 3.6
## 3 621 1 3.78
## 4 1604 2 3.99
## 5 1499 2 4.24
## 6 2390 2 4.42
## 7 1076 2 4.54
## 8 2179 3 4.65
## 9 823 3 4.96
## 10 48 2 5.58
## # ... with 741 more rows
Household Income Level Table:
milk_data%>%
mutate(income_factor =fct_collapse(income,
"Under 100K"=c("Under 15K","15-24K","25-34K","35-49K","50-74K","75-99K"),
"Above 100K"=c("100-124K","125-149K","150-174K","175-199K","200-249K","250K+")))%>%
group_by(household_id,income_factor)%>%
summarize(sales_value=sum(sales_value))%>%
arrange((sales_value))
## # A tibble: 751 x 3
## # Groups: household_id [751]
## household_id income_factor sales_value
## <chr> <ord> <dbl>
## 1 1219 Under 100K 2
## 2 460 Under 100K 3.6
## 3 621 Under 100K 3.78
## 4 1604 Under 100K 3.99
## 5 1499 Under 100K 4.24
## 6 2390 Under 100K 4.42
## 7 1076 Under 100K 4.54
## 8 2179 Under 100K 4.65
## 9 823 Under 100K 4.96
## 10 48 Under 100K 5.58
## # ... with 741 more rows
The summary tables above show the households that purchased the least amount of milk based on the ideal characteristics in the plots above. Simply put, if a household id is present in the first few rows of the four summary tables above, it is an indication that milk consumption is is low there, and coupons should be distributed around February, April, and November, utilizing campaign Type A or B.
While this analysis was for milk only, with a little tweek in the first code chunk in the “Exploratory Data Analysis” and changing the name of any table from “milk_######” to the product type of interest. For the first code chunk, simply replace “milk” following the grepl function, with the product type of interest.