The goal of the project is to analyze and understand the sales pattern across the different campaigns that will help in improving the revenue for our company. This can be achieved by identifying products that the customers are buying more across different campaigns.
We plan to address the problem by adopting the following steps:
We want to identify the campaigns that generates the maximum revenue.
In order to better understand the driving factors for generating better revenue in the top 3 campaigns we try to understand the sales over the different age group and income in these campaigns.
After understanding what age group and income group is contributing to the revenue in the most popular campaign, we want to figure out what product(s) is making this particular campaign generate better revenue when compared to other top campaigns.
For this analysis, we have used complete journey datasets.
This analysis will help to understand what coupons should be recommended targeting which age and income group. Also it will give us insights that what products should the company target to give coupons on, so we can attract more customers.
The following packages are used in this project :
| Package Name | Purpose |
|---|---|
| dplyr | data manipulation |
| lubridate | parsing of date-times |
| ggplot2 | data visualization |
| gridExtra | functions on grid objects |
| completejourney | transactions related datasets |
library(dplyr)
library(lubridate)
library(ggplot2)
library(gridExtra)
library(completejourney)
The completejourney package provides access to data sets characterizing household level transactions over one year from a group of 2,469 households who are frequent shoppers at a grocery store. For certain households, demographic information as well as direct marketing contact history are included.
The following datasets from complete journey package is used in this project:
transactions
This table contains transaction-level product purchases by households.
| Variable | Data Type | Description |
|---|---|---|
| household_id | character | Uniquely identifies each household |
| store_id | character | Uniquely identifies each store |
| basket_id | character | Uniquely identifies each purchase occasion |
| product_id | character | Uniquely identifies each product |
| quantity | numeric | Number of the product purchased during the trip |
| sales_value | numeric | Amount of dollars the retailer receives from sale |
| retail_disc | numeric | Discount applied due to the retailer’s loyalty card program |
| coupon_disc | numeric | Discount applied due to a manufacturer coupon |
| coupon_match_disc | numeric | Discount applied due to retailer’s match of manufacturer coupon |
| week | integer | Week of the transaction; Ranges 1-53 |
| transaction_timestamp | POSIXct | Date and time of day when the transaction occurred |
demographics
This table contains demographic information for a portion of households.
| Variable | Data Type | Description |
|---|---|---|
| household_id | character | Uniquely identifies each household |
| age | ordered factor | Estimated age range |
| income | ordered factor | Household income |
| home_ownership | ordered factor | Homeowner, renter, etc. |
| marital_status | ordered factor | Marital status (Married, Single, Unknown) |
| household_size | ordered factor | Size of household up to 5+ |
| household_comp | ordered factor | Household composition |
| kid_count | ordered factor | Number of children present up to 3+ |
products
This table contains metatdata regarding the products purchased (brand, description, etc.).
| Variable | Data Type | Description |
|---|---|---|
| product_id | character | Uniquely identifies each product |
| manufacturer_id | character | Uniquely identifies each manufacturer |
| department | character | Groups similar products together |
| brand | factor | Indicates private or national label brand |
| product_category | character | Groups similar products together at lower level |
| product_type | character | Groups similar products together at lowest level |
| package_size | character | Indicates package size (not available for all products) |
campaigns
This table contains identifying information for the marketing campaigns each households participated in.
| Variable | Data Type | Description |
|---|---|---|
| campaign_id | character | Uniquely identifies each campaign; Ranges 1-27 |
| household_id | character | Uniquely identifies each household |
campaign_descriptions
This table contains campaign metadata, specifically the time a campaign was active.
| Variable | Data Type | Description |
|---|---|---|
| campaign_id | character | Uniquely identifies each campaign; Ranges 1-27 |
| campaign_type | ordered factor | Type of campaign (Type A, Type B, Type C) |
| start_date | Date | Start date of campaign |
| end_date | Date | End date of campaign |
coupons
This table contains all the coupons sent to customers as part of a campaign, as well as the products for which each coupon is redeemable.
| Variable | Data Type | Description |
|---|---|---|
| coupon_upc | character | Uniquely identifies each coupon (unique to household and campaign) |
| product_id | character | Uniquely identifies each product |
| campaign_id | character | Uniquely identifies each campaign |
coupon_redemptions
This table contains individual coupon redemptions (household, day, UPC code, campaign).
| Variable | Data Type | Description |
|---|---|---|
| household_id | character | Uniquely identifies each household |
| coupon_upc | character | Uniquely identifies each coupon (unique to household and campaign) |
| campaign_id | character | Uniquely identifies each campaign |
| redemption_date | Date | Date when the coupon was redeemed |
# Getting all the required data sets required for our analysis.
transactions <- get_transactions()
coupons = coupons
coupon_redemptions = coupon_redemptions
demo = demographics
prod = products
campaigns = campaigns
campaign_desc = campaign_descriptions
# Preparing transactions, demo, and products data
tran_demo_prod = transactions %>%
inner_join(products, "product_id") %>%
inner_join(demo, "household_id")
# Preparing the campaign data
campaigns_data = campaign_desc %>%
left_join(campaigns, "campaign_id")
# Preparing the Household and campaign
tran_demo_prod_camp = tran_demo_prod %>%
left_join(campaigns_data, c("household_id")) %>%
left_join(coupon_redemptions, c("household_id", "campaign_id"))
# Preparing the coupons and redemptions
coup_data = coupons %>%
left_join(coupon_redemptions, c("coupon_upc", "campaign_id"))
# Preparing transactions, demo, products and coupon redemptions data
tran_demo_prod_red = tran_demo_prod %>% inner_join(coupon_redemptions, by = "household_id")
# Preparing campaigns Max Min Sales data
camp_max_min <- tran_demo_prod_red %>%
filter(yday(transaction_timestamp) == yday(redemption_date)) %>%
group_by(campaign_id, week) %>%
summarize(total_sale = sum(sales_value, na.rm = T)) %>%
summarize(min = min(total_sale),
max = max(total_sale))
other_camp <- camp_max_min %>%
filter(!(campaign_id %in% c(8,13,18)))
camp_max_min_high <- camp_max_min %>%
filter((campaign_id %in% c(8,13,18)))
ggplot(camp_max_min)+
geom_segment(aes(x=1, xend=2, y=`min`, yend=`max`, color=campaign_id), size=.99, show.legend=T)+
geom_vline(xintercept=1, linetype="dashed", size=.1)+
geom_vline(xintercept=2, linetype="dashed", size=.1)+
labs(x="Campaigns ", y="Sales Value", color = 'Campaign ID',
title = "Campaign Analysis with Sales Value")+
scale_x_continuous(breaks = seq(1, 2, by = 1))+
scale_y_continuous(breaks = seq(0, 15000, by = 5000), labels = scales::dollar)+
geom_text(label="Campaign Start", x=1, y=1*(max(camp_max_min$`min`, camp_max_min$`max`)), hjust=0.2, size=3)+
geom_text(label="Campaign End", x=2, y=1*(max(camp_max_min$`min`, camp_max_min$`max`)), hjust= 0.8, size=3)+
theme(plot.title = element_text(size = 15, face = 'bold'),
axis.text.x = element_text(face = "bold",hjust=1),
axis.text.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
legend.title = element_text(face = "bold"))
Analysing all the campaign trends with respect to sales values
left_label <- paste(other_camp$campaign_id, round(other_camp$`min`),sep=", ")
right_label <- paste(other_camp$campaign_id, round(other_camp$`max`),sep=", ")
plot1 <- ggplot(other_camp) + geom_segment(aes(x=1, xend=2, y=`min`, yend=`max` ), size=.99, show.legend=F) +
geom_vline(xintercept=1, linetype="dashed", size=.1) +
geom_vline(xintercept=2, linetype="dashed", size=.1) +
labs(x="", y="Sales Value", title = " Low Performing Campaigns ") +
scale_x_continuous(breaks = seq(1, 2, by = 1))+
ylim(0,15000)+
theme(plot.title = element_text(size = 15, face = 'bold'),
axis.text.x = element_text(face = "bold",hjust=1),
axis.text.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
legend.title = element_text(face = "bold"))
left_label1 <- paste(camp_max_min_high$campaign_id, round(camp_max_min_high$`min`),sep=", ")
right_label1 <- paste(camp_max_min_high$campaign_id, round(camp_max_min_high$`max`),sep=", ")
plot2<- ggplot(camp_max_min_high) + geom_segment(aes(x=1, xend=2, y=`min`, yend=`max`, color = campaign_id), size=.99, show.legend=F) +
geom_vline(xintercept=1, linetype="dashed", size=.1) +
geom_vline(xintercept=2, linetype="dashed", size=.1) +
labs(x="", y="Sales Value",title = "Top 3 Performing Campaigns ") +
scale_x_continuous(breaks = seq(1, 2, by = 1))+
scale_y_continuous(breaks = seq(0, 15000, by = 5000), labels = scales::dollar)+
geom_text(label=left_label1, y=1.05*camp_max_min_high$`min`, x=rep(1, NROW(camp_max_min_high)), hjust=0.2, size=3.3, dodge = 1)+
geom_text(label=right_label1, y=camp_max_min_high$`max`, x=rep(2, NROW(camp_max_min_high)), hjust=0.7, size=3.3)+
theme(plot.title = element_text(size = 15, face = 'bold'),
axis.text.x = element_text(face = "bold",hjust=1),
axis.text.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
legend.title = element_text(face = "bold"))
grid.arrange( plot2, plot1, ncol = 2)
The campaigns 8,13,18 had highest margin compared to other campaigns
tran_demo_prod_red <- tran_demo_prod%>%inner_join(coupon_redemptions, by = "household_id")
camp <- tran_demo_prod_red%>%filter(yday(transaction_timestamp)==yday(redemption_date))
camp8<- tran_demo_prod_red%>%filter(yday(transaction_timestamp)==yday(redemption_date)) %>% filter(campaign_id==8)
camp18<- tran_demo_prod_red%>%filter(yday(transaction_timestamp)==yday(redemption_date)) %>% filter(campaign_id==18)
camp13<- tran_demo_prod_red%>%filter(yday(transaction_timestamp)==yday(redemption_date)) %>% filter(campaign_id==13)
p1<-camp%>%
group_by( age)%>%
summarize(total_sales = sum(sales_value))%>%
ggplot(aes(age, total_sales))+
geom_point(size=3) +
geom_segment(aes(x = age,
xend = age,
y = 0,
yend = total_sales)) +
scale_y_continuous( labels = scales::dollar)+
labs(title = "All Campaigns Sales by Age",
x = "Age",
y = "Sales")+
theme(plot.title = element_text(size = 10, face = 'bold'),
plot.subtitle = element_text(size = 5, face = 'italic'),
axis.text.x = element_text(face = "bold",hjust=1),
axis.text.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
legend.title = element_text(face = "bold"),
plot.background = element_rect(fill = "white") )
p2<-camp8%>%
group_by( age)%>%
summarize(total_sales = sum(sales_value))%>%
ggplot(aes(age, total_sales))+
geom_point(size=3) +
geom_segment(aes(x = age,
xend = age,
y = 0,
yend = total_sales)) +
scale_y_continuous(labels = scales::dollar)+
labs(title = "Campaign 8 Sales by Age",
x = "Age",
y = "Sales")+
theme(plot.title = element_text(size = 10, face = 'bold'),
plot.subtitle = element_text(size = 5, face = 'italic'),
axis.text.x = element_text(face = "bold",hjust=1),
axis.text.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
legend.title = element_text(face = "bold"),
plot.background = element_rect(fill = "white") )
p3<- camp13%>%
group_by( age)%>%
summarize(total_sales = sum(sales_value))%>%
ggplot(aes(age, total_sales))+
geom_point(size=3) +
geom_segment(aes(x = age,
xend = age,
y = 0,
yend = total_sales)) +
scale_y_continuous( labels = scales::dollar)+
labs(title = "Campaign 13 Sales by Age",
x = "Age",
y = "Sales")+
theme(plot.title = element_text(size = 10, face = 'bold'),
plot.subtitle = element_text(size = 5, face = 'italic'),
axis.text.x = element_text(face = "bold",hjust=1),
axis.text.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
legend.title = element_text(face = "bold"),
plot.background = element_rect(fill = "white") )
p4<-camp18%>%
group_by( age)%>%
summarize(total_sales = sum(sales_value))%>%
ggplot(aes(age, total_sales))+
geom_point(size=3) +
scale_y_continuous( labels = scales::dollar)+
geom_segment(aes(x = age,
xend = age,
y = 0,
yend = total_sales)) +
labs(title = "Campaign 18 Sales by Age",
x = "Age",
y = "Sales")+
theme(plot.title = element_text(size = 10, face = 'bold'),
plot.subtitle = element_text(size = 5, face = 'italic'),
axis.text.x = element_text(face = "bold",hjust=1),
axis.text.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
legend.title = element_text(face = "bold"),
plot.background = element_rect(fill = "white") )
grid.arrange(p1, p2, p3, p4, ncol = 2)
From the above plots it is observed that the 45-54 age group people were highest contributers followed by 35-44 age group and 25-34 age group. But, when it comes to campaign 18 age 25-34 group contributed more than 35-44 age group. It would be interesting to find what would be the driving factor for the 25-34 age group in the campaign 18.
tran_demo_prod_camp%>%
filter(age == '25-34')%>%
filter(campaign_id %in% c(13,8,18))%>%
group_by(campaign_id, age, income)%>%
summarize(total_sales = sum(sales_value))%>%
ggplot(aes(age, total_sales, fill = income))+
geom_col()+facet_grid(~campaign_id)+
scale_x_discrete(breaks = " ")+
scale_y_continuous(breaks = seq(0, 800000, by = 100000), labels = scales::comma)+
labs(title = " Top 3 Campaigns - Sales by Income Levels",
x = "Age Group 25-34",
y = "Sales",
fill = 'Income Levels')+
theme(plot.title = element_text(size = 10, face = 'bold'),
plot.subtitle = element_text(size = 5, face = 'italic'),
axis.text.x = element_text(face = "bold",hjust=1),
axis.text.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
legend.title = element_text(face = "bold"),
plot.background = element_rect(fill = "white"))
From the Top 3 Campaigns - Sales by Income Levels chart, it is observed that people with income level under 15000$ shop more during the campaign 18. Let’s drill down to department level to find what were things which made these people shop more…
tran_demo_prod_red %>%
filter(yday(transaction_timestamp) == yday(redemption_date)) %>%
filter(campaign_id %in% c(8, 13, 18)) %>%
filter(income == 'Under 15K') %>%
filter(age == '25-34') %>%
filter(department != 'GROCERY')%>%
group_by(campaign_id) %>%
ggplot(aes(campaign_id, sales_value, fill = department)) +
geom_bar(stat = "identity")+
labs(title = "Top 3 Campaigns - Sales by Department",
subtitle = "Grocery is excluded to get clear understanding on other departments",
x = "Campaign ID",
y = "Total Sales",
fill = "Departments") +
theme(plot.title = element_text(size = 10, face = 'bold'),
plot.subtitle = element_text(size = 8, face = 'italic'),
axis.text.x = element_text(face = "bold",hjust=1),
axis.text.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
legend.title = element_text(face = "bold"),
plot.background = element_rect(fill = "white"))
Since grocery was top department in all campaigns it was excluded from the above analysis to find which other departments performed better. We can clearly see that Meat, Meat-Packaged, Produce, Seafood-Packaged are sold the mostly in the Campaign 18.
The objective of the analysis is to understand the sales pattern across the different campaigns and improve the revenue for our company. In order to boost the revenue, we analyzed the various datasets in the completejourney package by various data wrangling and visualization techniques such as joining, filtering, and summarizing data affiliated with campaigns, demographics, and products.
After a thorough analysis, we have identified the following insights from the data we considered for our analysis:
First, we identified that, out of all the 27 campaigns, campaigns such as 8, 13, and 18 generate the maximum revenue.
We deep-dived into the transactions associated with Campaigns 8, 13, and 18, and found that under campaign 18, the sale value generated from the customers that belong to the age group ‘25-34’ is higher than that of the’35-44’ age group.
Later, we also analyzed the previous trend with income groups and found that the ‘Under 15k’ income households are the primary driving factor.
On further drilling down, we found that Meat related products such as Meat, Packaged meat, seafood, packaged seafood, and produce are major factors that are contributing to the increased revenue.
Including meat-related coupons similar to that used in campaign 18 in future campaigns will target more customers belonging to the age group of ‘25-34’ and ‘Under 15k’ income household groups will help us increase the revenue.
Limitations
Most of the products were not on display, they were placed in random locations than any other display locations which could have driven the customers. So, we could not do further analysis using the promotions dataset which could have given us the information on which is the right way to promote our products. Hence, using this analysis we can conclude only about which products to focus on to improve the revenue for the other campaigns too.