In the year 2017, there were 27 different campaigns for Regork. There were 39,841 total transactions from 743 unique households that participated in campaigns that year. There are three different types of campaigns:
Type A
Type B
Type C
The purpose of this analysis is to understand which campaigns had the largest success in bringing in revenue, as well as what demographic participated in these successful campaigns. With this data, we will be able to make proper business decisions on:
I will be using the Complete Journey data set to analyze this problem. We only have demographic data for 801 households. This data analysis will only describe households with demographic information.
In this exploratory data analysis, I will be conducting 3 different analyses:
Below are the perquisite libraries that need to be imported to conduct my exploratory data analysis:
tidyverse - Tidy the data
dplyr - Manipulate the data
ggrepel - Label the data
lubridate - Conduct data manipulation
completejourney - Where the data is coming from
DT - Used to display the data on the screen in a scroll able format
library(tidyverse)
library(dplyr)
library(ggrepel)
library(lubridate)
library(completejourney)
library(DT)
This section contains all the procedures followed in getting the data ready for analysis.
Complete Journey Data Set in R
Data Import Code:
library(completejourney)
transactions <- get_transactions()
In order to conduct this analysis, we must join together the following data sets:
Along with that, dropping variables that won’t be of use initially will make the data a bit more clean.
df <- campaigns %>%
inner_join(demographics, by = 'household_id') %>%
inner_join(transactions, by = 'household_id') %>%
inner_join(products, by = 'product_id') %>%
inner_join(campaign_descriptions, by = 'campaign_id') %>%
filter(
as.Date(transaction_timestamp, format = "%Y-%m-%d") > start_date,
as.Date(transaction_timestamp, format = "%Y-%m-%d") < end_date
)
class(df)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
colnames(df)
## [1] "campaign_id" "household_id" "age"
## [4] "income" "home_ownership" "marital_status"
## [7] "household_size" "household_comp" "kids_count"
## [10] "store_id" "basket_id" "product_id"
## [13] "quantity" "sales_value" "retail_disc"
## [16] "coupon_disc" "coupon_match_disc" "week"
## [19] "transaction_timestamp" "manufacturer_id" "department"
## [22] "brand" "product_category" "product_type"
## [25] "package_size" "campaign_type" "start_date"
## [28] "end_date"
Along with that, dropping variables that won’t be of use initially will make the data a bit more clean:
df <- df %>%
select(-c(manufacturer_id, package_size, retail_disc))
datatable(head(df,50))
class_meta <- sapply(df, class)
null_meta <- sapply(df, function(df) sum(is.na(df)))
var_name <- colnames(df)
df_meta <- cbind(var_name, class_meta, null_meta)
rownames(df_meta) <- NULL
colnames(df_meta) <- c('Variable Name', 'Class', 'NULL Vals')
datatable(df_meta)
This section dives into the analysis that solves the questions in the Synopsis.
As mentioned before, there are 3 different Campaign Types: Type A, Type B, Type C. Below shows the amount of campaigns in each Campaign Type, as well has how many transactions users in these campains completed in 2017.
transacs <- df %>%
distinct(basket_id, .keep_all = TRUE) %>%
group_by(campaign_type) %>%
summarise(
Transactions = scales::comma(n())
)
ids <- campaign_descriptions %>%
group_by(campaign_type) %>%
summarise(campaigns = n())
datatable(cbind(ids, transacs)[-3])
Type B had the most campaigns, while Type A had the least. While Type A had the least, Type A still brought in the most total customers that completed a transaction, showing Type A had very efficient campaigns. Now we want to at this data per campaign. Below shows the average amount of households who completed a transaction for each campaign, grouped by each type of campaign:
df_1 <- df %>%
distinct(basket_id, .keep_all = TRUE) %>%
group_by(campaign_id, campaign_type) %>%
summarise(trans = n())
df_1 %>%
group_by(campaign_type) %>%
summarise(avg_hh = round(mean(trans))) %>%
ggplot(aes(x = campaign_type, y = avg_hh)) +
geom_col(aes(fill = campaign_type)) +
labs(
title = 'Average Households Per Campaign by Campaign Type',
x = 'Campaign Type',
y = 'Average Households'
) +
theme(
legend.position = 'none',
plot.title = element_text(size = 15, face = 'bold')
) +
geom_text(aes(label = scales::comma(avg_hh)), vjust = -0.3) +
scale_y_continuous(labels = scales::comma)
Below shows the amount of revenue brought in by customers who participated in campaings by campaign ID:
df %>%
distinct(basket_id, .keep_all = TRUE) %>%
group_by(campaign_id, campaign_type) %>%
summarise(sales = sum(sales_value)) %>%
arrange(desc(sales)) %>%
ggplot(aes(x = fct_reorder(campaign_id, sales), y = sales, fill = campaign_type)) +
geom_col() +
coord_flip() +
labs(
title = "Sales by Campaign",
x = "Campaign ID",
y = "Sales"
) +
scale_fill_manual(name = 'Campaign Type',
values = c("#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
) +
scale_y_continuous(labels = scales::dollar, limits = c(0, 300000)) +
geom_text(aes(label = scales::dollar(sales)), hjust = -0.10, size = 3) +
theme(plot.title = element_text(size = 15, face = 'bold'))
It can be seen that campaign IDs 18, 13, and 8 performed very well and
explains the strong performance for Type A campaigns. I will dive deeper
into these campaign in the next section to see what demographic
contributed to these transactions.
To further dive into these these campaigns, we would like to see what demographic best makes up these campaigns. Below is a Heat Map showing the distribution of total sales by income and age from customers who participated these campaigns:
df %>%
distinct(basket_id, .keep_all = TRUE) %>%
filter(campaign_id %in% c(18, 13, 8)) %>%
group_by(income, age) %>%
summarise(
sales = sum(sales_value)
) %>%
ggplot(aes(x = income, y = age, fill = sales)) +
geom_tile(color = 'black') +
geom_text(aes(label = scales::dollar(sales)), color = 'black', size = 2) +
scale_fill_gradient2(low = "#075AFF",
mid = "#FFFFCC",
high = "#FF0000") +
coord_flip() +
labs(
title = 'Total Sales by Income and Age',
subtitle = 'The data shows that 45-54 year olds making 50-74K a year bring in the
largest amounts of revenue for all income brackets and ages.',
x = 'Income Bracket',
y = 'Age',
fill = 'Total Sales ($)'
) +
theme(plot.title = element_text(face = 'bold', size = 15))
## `summarise()` has grouped output by 'income'. You can override using the
## `.groups` argument.
Below is the top selling product categories for campaigns 18, 13, 8 filtered for 45-54 year olds making 50-74K a year:
df %>%
filter(
campaign_id %in% c(18, 13, 8),
str_detect(income, regex('^50-74K')),
str_detect(age, regex('45-'))
) %>%
group_by(product_category) %>%
summarise(sales = sum(sales_value)) %>%
arrange(desc(sales)) %>%
head(10) %>%
ggplot(aes(x = fct_reorder(product_category, sales), y = sales)) +
geom_col(color = 'black', width = 0.95, fill = '#FF9F47') +
coord_flip() +
labs(
title = 'Top Selling Product Categories',
y = 'Sales',
x = 'Product Category'
) +
theme(
legend.position = 'none',
plot.title = element_text(size = 15, face = 'bold')
) +
scale_y_continuous(labels = scales::dollar_format())
Now we want to look at the Monthly break down of sales among the main demographic for the best performing campaigns. Below shows the by month time-series for 45-54 year olds making 50-74K a year for campaign 13:
df %>%
distinct(basket_id, .keep_all = TRUE) %>%
filter(
campaign_id %in% c(18, 13, 8),
str_detect(income, regex('^50-74K')),
str_detect(age, regex('45-'))
) %>%
mutate(month = month(transaction_timestamp)) %>%
group_by(month, kids_count, campaign_id) %>%
summarise(sales = sum(sales_value)) %>%
arrange(desc(sales)) %>%
ggplot(aes(x = month, y = sales, fill = campaign_id)) +
geom_area(position = 'stack') +
scale_x_continuous(breaks = 1:12, labels = month.abb) +
scale_y_continuous(labels = scales::dollar_format()) +
labs(
title = 'Total Sales by Month for the Top Consumers',
subtitle = 'The data below shows that campaigns 8 and 13 performed very well from
when it was deployed to end. Campaign 18 started slow but became high in revenue by November.',
x = 'Month',
y = 'Total Sales',
fill = 'Campaign'
) +
theme(
plot.title = element_text(size = 15, face = 'bold')
)
Now we have enough data to make insights about the questions we set out to find. The following is a summary of insights from the analysis:
Type A Campaigns had the least amount of campaigns, but brought in the most net revenue out of all the types.
Type A campaigns had the most efficient transactions/campaigns ratio out of any other type.
Type A Campaign ids 18, 13, and 8 were an outlier for bringing in high revenue in 2017. These make up the bin that will be called BPC (Best Performing Campaigns).
45-54 year olds making 50-74K a year made up a majority of the BPC. This demographic makes up the bin that will be called SCD (Strongest Consuming Demographic).
Campaigns 8 and 13 performed very well from when it was deployed to end when looking at the data for the SCD.
Campaign 18 started slow but became high in revenue by November when looking at the data for the SCD.
We can use these insights to make the following recommendations: