The dataset we have chosen contains household-level transactions of 2,500 households who are frequent shoppers at a retailer. The retailer conducted 30 campaigns, broadly categorized into three types, over the span of two years.
We have applied several data wrangling and modeling techniques to break down the dataset and understand consumer behavior. Please navigate through the tabs below to view our project in detail.
Using this dataset, we aim to figure out how the retailer's revenue is affected by its marketing campaigns. While this is our main goal, we also try to determine how each type of marketing campaign affects sales value, and to infer the impact of these campaigns on the buying behavior of customers.
To understand the revenue associated with each type of campaign, we can compute the average revenue earned by the retailer from household purchases under each campaign type and see how it affected sales. Another approach is to examine how the length of a campaign encouraged households to use coupons.
We started by cleaning and preparing the data through proper formatting, so that it is ready and actionable for more complex analyses. The current document details how we imported the data into R and combined multiple datasets to make them more accommodating and easier to access.
This analysis helps identify which types of marketing approaches attract customers, and could also provide direction on how to design and refine the retailer's marketing strategies.
We have chosen to use the following packages, from the thousands of packages available for R.
library(knitr)
library(readr)
library(dplyr)
library(kableExtra)
library(purrr)
library(ggplot2)
library(gridExtra)
Reasons for using these packages in our project are detailed below.
knitr - To use R Markdown more efficiently, with flexibility and speed. knitr makes it easy to set the working directory for the whole document, so that anyone can share and run the R code.
readr - To read the files into R. This package offers optimized functionality compared to the base utils package, which is much slower.
dplyr - Allows us to easily manipulate data in R using simple and efficient syntax. More importantly, it allows function chaining, preventing potential clutter in the code, which in turn makes the code easier to read and write.
kableExtra - The function ‘kable()’ is a lightweight table generator from ‘knitr’. This package simplifies manipulating the HTML or ‘LaTeX’ code generated by ‘kable()’ and allows users to construct complex tables and customize styles using readable syntax.
ggplot2 - Used as a replacement for the plotting functions available in base R. This package provides advanced graphics and offers better, more programmer-friendly visualizations.
gridExtra - To show the plots in a grid format.
For the purposes of our academic project, we have chosen to work on the dataset “Complete Journey customer transactions with marketing campaigns”. Required files for the same can be downloaded from this link.
After downloading the files to a secure location on your computer, set the working directory to the respective folder. The read_csv function imports the data as tibbles, which print more compactly than data frames. All of the tables below are tibbles, and the data may take several minutes to load, given the size of the datasets used in this project.
# Read the eight raw CSV files from the data/ folder, creating a tibble of the same name for each
file_names <- c("campaign_desc", "campaign_table", "causal_data", "coupon",
                "coupon_redempt", "hh_demographic", "product", "transaction_data")
for (i in seq_along(file_names)) {
  file <- paste0("data/", file_names[i], ".csv")
  assign(file_names[i], read_csv(file))
}
The original purpose of this data is to provide information about the purchases of each household. Collectively, the tables imported above shed light on the products purchased by each household, the number of transactions made, product descriptions, brands, coupons applied, coupon discounts, sales values, campaign types, campaign lengths, and so on. The variables of the dataset are explained in the following section.
The original data source has 34 unique variables stored separately in 8 different tables. The 30 campaigns were conducted on 1,584 households, of which 434 households redeemed 556 coupons. We also have details of 2,500 households in total, who shopped for almost a million products. The relational schema of the tables, along with their variables, is displayed below.
Not all of the data is useful in its raw form. Data cleaning for each table is carried out in the next section.
It is essential to clean the data before analyzing it. In this step, we make sure that all the variables are assigned appropriate data types.
For all the tibbles in the dataset, we changed the identifying columns (IDs) from integers to strings, since IDs should serve only as identifiers and should not be used in arithmetic or other numerical operations.
The following code converts the campaign description to a factor. A new variable called duration, giving the number of days each campaign ran, is added to the campaign_desc table.
str(campaign_desc)
campaign_desc$description <- as.factor(campaign_desc$description)
campaign_desc$campaign <- as.character(campaign_desc$campaign)
campaign_desc <- mutate(campaign_desc, duration = end_day - start_day)
str(campaign_table)
campaign_table$description <- as.factor(campaign_table$description)
campaign_table[,2:3] <- map(campaign_table[,2:3],as.character)
The datasets below contain the coupon details: unique coupon IDs and their mapping to the household that used the respective coupon on a particular day.
str(coupon)
coupon[,1:3] <- map(coupon[,1:3],as.character)
str(coupon_redempt)
coupon_redempt[,c(1,3,4)] <- map(coupon_redempt[,c(1,3,4)],as.character)
The hh_demographic table contains details of each household, such as household size, income group, marital status, age group, and other demographic information. All the categorical variables are converted to factors.
str(hh_demographic)
hh_demographic$household_key <- as.character(hh_demographic$household_key)
hh_demographic[,c(2:6,8)] <- map(hh_demographic[,c(2:6,8)],as.factor)
hh_demographic$household_size_desc <- as.integer(
hh_demographic$household_size_desc)
The product table describes the products sold by the retailer that were purchased by this group of households. It includes commodity descriptions, which are strings. The product table also has a column called size, holding the measurement unit of each product in ounces, pounds, or quantity. As this column cannot be generalized, it is left as a character.
str(product)
product[,1:2] <- map(product[,1:2],as.character)
product[,3:4] <- map(product[,3:4],as.factor)
The table below has details of the products sold at each store. It also provides information on how a particular product was displayed in the store and how it was advertised in the mailer. The variables display and mailer are categorical, indicating where each product was displayed and advertised, respectively.
str(causal_data)
causal_data[,1:2] <- map(causal_data[,1:2],as.character)
causal_data[,4:5] <- map(causal_data[,4:5],as.factor)
The transaction table has details of all the transactions made by the households and can be considered central to the schema. The column basket_id refers to a single transaction made by a customer and groups all of its product details. The transaction table also provides the sales value of each product purchased, along with the discounts applied through coupons.
str(transaction_data)
transaction_data[,c(1,2,4,7)] <- map(transaction_data[,c(1,2,4,7)],as.character)
transaction_data$trans_time <- as.character(transaction_data$trans_time)
Moving on from this step, we began analyzing the data, connecting the tables through the columns they share.
Visuals and graphs are a great way to spot patterns, and they help us get a picture of the overall data without extensive analysis. For this project, we defined two functions that we use to create bar plots and box plots.
#Barplots
gg_bar <- function(df, x, colour, xlab = "", ylab = "", main = "", order = FALSE) {
  # Optionally reorder the bars by decreasing frequency
  xr <- if (order) with(df, reorder(x, x, function(a) -length(a))) else x
  ggplot(data = df, aes(x = xr), position = "dodge") +
    geom_bar(fill = colour) +
    scale_y_continuous(labels = scales::comma) +
    labs(title = main, x = xlab, y = ylab) +
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          panel.background = element_blank())
}
#Boxplots
gg_box <- function(df, y, colour, xlab = "", ylab = "", main = "") {
  ggplot(data = df, aes(y = y)) +
    geom_boxplot(fill = colour) +
    labs(title = main, x = xlab, y = ylab) +
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          panel.background = element_blank())
}
These functions make it easy to maintain the same theme for all plots, while letting us modify or restyle every plot in the project from one place.
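As a quick illustration, the helpers can be called as below. This is only a sketch of typical usage (the colour and labels are arbitrary); the exact calls used to render the plots in this report are not shown, and the column names display and sales_value are taken from the cleaned causal_data and transaction_data tables.
# Illustrative calls: an ordered bar plot of display locations and a box plot of sales value
gg_bar(causal_data, causal_data$display, colour = "steelblue",
       xlab = "Display location", ylab = "Number of products",
       main = "Products by in-store display location", order = TRUE)
gg_box(transaction_data, transaction_data$sales_value, colour = "steelblue",
       ylab = "Sales value ($)", main = "Sales value per product")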
The bar graphs below show the mode of promotion for all products. Note that in both bar plots, products that were not on display and products that were not in the ad are suppressed.
From the plots above, it is clear that the largest number of products are displayed at a ‘Secondary Location’, and most products promoted through the mailer are featured on the ‘Interior’ page.
Below are plots summarizing the demographic information of the households who purchase products from our retailer. The four plots show the age, income category, household composition, and number of kids in each household.
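The code that produced these panels is not shown in the rendered report; a minimal sketch of how such a panel could be assembled with gg_bar() and gridExtra is given below. The demographic column names (age_desc, income_desc, hh_comp_desc, kid_category_desc) are assumed here and should be adjusted to the actual names in hh_demographic.
# Sketch: four demographic bar plots arranged in a 2 x 2 grid (column names assumed)
p_age    <- gg_bar(hh_demographic, hh_demographic$age_desc, "steelblue",
                   xlab = "Age group", ylab = "Households", main = "Age")
p_income <- gg_bar(hh_demographic, hh_demographic$income_desc, "steelblue",
                   xlab = "Income category", ylab = "Households", main = "Income")
p_comp   <- gg_bar(hh_demographic, hh_demographic$hh_comp_desc, "steelblue",
                   xlab = "Household composition", ylab = "Households", main = "Composition")
p_kids   <- gg_bar(hh_demographic, hh_demographic$kid_category_desc, "steelblue",
                   xlab = "Number of kids", ylab = "Households", main = "Kids")
grid.arrange(p_age, p_income, p_comp, p_kids, ncol = 2)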
The boxplots below show how the sales value and retail discounts are distributed. For sales values under $5 per transaction line there are no outliers; however, there appear to be many outliers in the retail discount per product.
A closer look at the data reveals that most of the outliers in total retail discount are due to the quantity of each product purchased. For example, if the discount on one apple is $1, then the discount on 10 apples is $10. In this case, the outliers arise from the quantity purchased rather than from a large discount on a single product.
We first inspect how long each type of campaign ran. There are 30 campaigns, belonging to three types: A, B, and C.
summarize(group_by(campaign_desc, campaign_desc$description),
          avg_Time = mean(duration))
## # A tibble: 3 x 2
## `campaign_desc$description` avg_Time
## <fct> <dbl>
## 1 TypeA 47.2
## 2 TypeB 37.6
## 3 TypeC 74.5
Type C campaigns have the highest average duration, at about 74 days.
Interesting find: 87% of transactions have a sales value below 5 dollars.
100 * ((transaction_data %>%
          filter(sales_value < 5) %>%
          nrow()) / nrow(transaction_data))
## [1] 87.09925
Retail discounts should always be recorded as negative values or zero, yet there are 36 records with a positive discount. These abnormal values are likely data entry mistakes and are fixed by multiplying them by -1.
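The correction itself is not shown above; a minimal sketch of the fix, assuming the 36 positive values are simple sign errors, is:
# Flip the sign of the (assumed erroneous) positive retail discounts so that
# every discount is recorded as zero or a negative value
transaction_data <- transaction_data %>%
  mutate(retail_disc = ifelse(retail_disc > 0, -retail_disc, retail_disc))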
Additionally, 97% of all retail discount values lie between -3 and 0; only 3% of purchases received a discount of more than 3 dollars per product.
100 * ((transaction_data %>%
          filter(retail_disc > -3 & retail_disc <= 0) %>%
          nrow()) / nrow(transaction_data))
## [1] 97.0955
A summary of the data also revealed that only two variables have missing values, and neither factors into the final analysis. As there are only 80 missing values out of around 2.5 million observations, we can confidently ignore them in the data of interest.
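A quick way to confirm which variables carry missing values (a sketch; the summary-based check described above is not reproduced here) is to count the NAs per column across all of the imported tables:
# Count missing values per column in every table imported earlier
map(mget(file_names), ~ colSums(is.na(.x)))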
In this section we finalize the data and prepare it for further analysis. Now that all the tables have been cleaned and properly formatted, the datasets of interest are merged and new, comprehensive datasets are created to address the next stage of the analysis.
To get all the records related to transactions and campaigns, transaction_data is merged with campaign_table and campaign_desc. However, merging all of the campaigns with transaction_data at once produced more than 15 million records and took a long time to return results. Since it would be cumbersome to perform complex analysis and build models on such a huge dataset, we instead merged transaction_data separately for each type of campaign.
After merging and splitting the data as shown below, we take the sum and average of the sales value for each of the three campaign types.
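The merge itself is not displayed in this report; the sketch below shows one way it could be done, with illustrative object names (campaign_trans, campaign_trans_A). The campaign metadata is first restricted to one type, joined to the participating households, and then joined to all of those households' transactions, after which the sum and mean of sales_value give the potential revenue and average sale.
# Sketch: merge transactions with campaign details for a single campaign type
campaign_trans <- function(type) {
  campaign_desc %>%
    filter(description == type) %>%
    inner_join(campaign_table, by = c("description", "campaign")) %>%
    inner_join(transaction_data, by = "household_key")
}
campaign_trans_A <- campaign_trans("TypeA")

# Potential revenue and average sales value attributable to Type A campaigns
sum(campaign_trans_A$sales_value)
mean(campaign_trans_A$sales_value)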
| description | campaign | start_day | end_day | duration | household_key | basket_id | day | product_id | quantity | sales_value | store_id | retail_disc | trans_time | week_no | coupon_disc | coupon_match_disc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TypeA | 18 | 587 | 642 | 55 | 1 | 27601281299 | 51 | 825123 | 1 | 3.99 | 436 | 0.00 | 1456 | 8 | 0 | 0 |
| TypeA | 18 | 587 | 642 | 55 | 1 | 27601281299 | 51 | 831447 | 1 | 2.99 | 436 | 0.00 | 1456 | 8 | 0 | 0 |
| TypeA | 18 | 587 | 642 | 55 | 1 | 27601281299 | 51 | 840361 | 1 | 1.09 | 436 | -0.30 | 1456 | 8 | 0 | 0 |
| TypeA | 18 | 587 | 642 | 55 | 1 | 27601281299 | 51 | 845307 | 1 | 3.71 | 436 | -0.62 | 1456 | 8 | 0 | 0 |
Potential Revenue by Campaign A
## [1] 20583817
| description | campaign | start_day | end_day | duration | household_key | basket_id | day | product_id | quantity | sales_value | store_id | retail_disc | trans_time | week_no | coupon_disc | coupon_match_disc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TypeB | 24 | 659 | 719 | 60 | 31 | 28167945894 | 99 | 824396 | 1 | 9.29 | 324 | 0.00 | 1243 | 15 | 0.0 | 0.0 |
| TypeB | 24 | 659 | 719 | 60 | 31 | 28167945894 | 99 | 833919 | 1 | 5.99 | 324 | 0.00 | 1243 | 15 | 0.0 | 0.0 |
| TypeB | 24 | 659 | 719 | 60 | 31 | 28167945894 | 99 | 870426 | 1 | 1.70 | 324 | -1.09 | 1243 | 15 | -0.5 | -0.5 |
| TypeB | 24 | 659 | 719 | 60 | 31 | 28167945894 | 99 | 882190 | 1 | 2.49 | 324 | -0.40 | 1243 | 15 | -0.5 | -0.5 |
Potential Revenue by Campaign B
## [1] 18795388
| description | campaign | start_day | end_day | duration | household_key | basket_id | day | product_id | quantity | sales_value | store_id | retail_disc | trans_time | week_no | coupon_disc | coupon_match_disc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TypeC | 15 | 547 | 708 | 161 | 143 | 28154518085 | 98 | 844165 | 1 | 1.00 | 446 | -0.59 | 2308 | 15 | 0 | 0 |
| TypeC | 15 | 547 | 708 | 161 | 143 | 28154518085 | 98 | 935546 | 1 | 0.99 | 446 | 0.00 | 2308 | 15 | 0 | 0 |
| TypeC | 15 | 547 | 708 | 161 | 143 | 28154518085 | 98 | 992621 | 2 | 1.38 | 446 | 0.00 | 2308 | 15 | 0 | 0 |
| TypeC | 15 | 547 | 708 | 161 | 143 | 28154518085 | 98 | 1139216 | 1 | 7.99 | 446 | -1.50 | 2308 | 15 | 0 | 0 |
Potential Revenue by Campaign C
## [1] 4456140
Interesting find: Campaign type A has the highest potential sales value of all the campaigns, totalling about 20.5 million dollars, which is close to the potential sales value of the Type B and Type C campaigns combined. However, the average sales value per transaction line is around 3 dollars for all three campaign types.
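The table below links redeemed coupons to the products they could be applied to and to those products' descriptions. The join code is not shown in the report; a hedged sketch of how it could be built is:
# Sketch: redeemed coupons joined to the coupon-product mapping and product descriptions
redeemed_products <- coupon_redempt %>%
  inner_join(coupon, by = c("coupon_upc", "campaign")) %>%
  inner_join(product, by = "product_id")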
| household_key | day | coupon_upc | campaign | product_id | manufacturer | department | brand | commodity_desc | sub_commodity_desc | curr_size_of_product |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 421 | 10000085364 | 8 | 100512 | 1765 | GROCERY | National | FLUID MILK PRODUCTS | FLUID MILK WHITE ONLY | 1/2 GAL |
| 1 | 421 | 10000085364 | 8 | 527731 | 1194 | GROCERY | National | REFRGRATD JUICES/DRNKS | DAIRY CASE 100% PURE JUICE - O | 64 OZ |
| 1 | 421 | 10000085364 | 8 | 1054539 | 1194 | GROCERY | National | REFRGRATD JUICES/DRNKS | DAIRY CASE 100% PURE JUICE - O | 64 OZ |
| 1 | 421 | 10000085364 | 8 | 802268 | 1487 | GROCERY | National | YOGURT | YOGURT NOT MULTI-PACKS | 6 OZ |
From the above analysis, it is clear that Type A campaigns yielded the highest revenue, even though Type C campaigns had the longest average duration. In our next steps, we would like to find out whether the duration of a campaign has a significant impact on the revenue generated.
So far we have only dealt with revenues and coupon discounts overall. However, there are three types of coupon discounts provided to the customer, as detailed below.
Normal Coupon Discount - A discount on a product obtained through coupons redeemed by the customer. These are the coupons distributed to customers through the various campaigns.
Coupon Match Discount - This does not affect the sales value of the product. The discount is provided to the customer based on promotions by the manufacturer, and the discounted amount is paid back to the retailer by the manufacturer.
Loyalty Card Discount - This discount applies to customers enrolled in the loyalty card program. The assumption is that this discount has nothing to do with coupon IDs or campaigns.
Based on these definitions, we created two more columns that identify whether a transaction received a coupon discount or a loyalty card discount.
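The exact derivation of these flags is not shown; a minimal sketch, assuming loyalty card discounts appear in retail_disc and coupon discounts in coupon_disc and coupon_match_disc, would be:
# Flag each transaction line with whether a loyalty card discount and/or a coupon discount
# was applied (the column names loyalty and coupon_used are reused later in this report)
transaction_data <- transaction_data %>%
  mutate(loyalty     = ifelse(retail_disc != 0, "Yes", "No"),
         coupon_used = ifelse(coupon_disc != 0 | coupon_match_disc != 0, "Yes", "No"))

# Counts of transaction lines with a loyalty card discount and with a coupon redeemed
sum(transaction_data$loyalty == "Yes")
sum(transaction_data$coupon_used == "Yes")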
The total number of transactions that used a loyalty card is:
## [1] 1303062
Of the roughly 2.5 million transactions that occurred, about 50% obtained a loyalty card discount. This is a very good rate, implying that a substantial share of customers would like to maintain a long-term relationship with the retailer.
With a similar approach, we found the number of transactions that used coupon discounts. The coupon discounts include both the retailer's coupons and the manufacturer's coupons for each product in each transaction.
## [1] 36422
Surprisingly, only 1.4% of the transactions have a valid coupon redeemed. This means the marketing campaigns are clearly not reaching enough customers.
To explore further, we look at coupon use by customers who were not loyalty card customers of the retailer.
## [1] 16373
About 45% of the transactions that redeemed coupons came from households without a loyalty card discount. Since the campaigns appear to have a positive impact on customers who are not regulars, we focus on the demographics of this category. This approach lets us understand which category of customers should be targeted most, and which aspects of the marketing did the most to attract unfamiliar customers.
We used join operations to extract the information of customers who used coupons without a loyalty card. It is important to note that demographic information is not available for all customers. The tables below contain the information on our potential target customers.
# Households that used a coupon at least once without a loyalty card discount
household_only_coupon <- transaction_data %>%
  filter(loyalty == "No" & coupon_used == "Yes") %>%
  select(household_key) %>%
  unique()

# Attach demographics, then keep only households with at least one demographic field available
demographic_only_coupon <- left_join(household_only_coupon,
                                     hh_demographic, by = "household_key")
m <- demographic_only_coupon
demographic_only_coupon <- m[rowSums(is.na(m[, -1])) != ncol(m[, -1]), ]
Below are the details of customers who used coupons at least once without a loyalty card.
Unfortunately, these plots do not say much about this segment of customers, as they closely match the demographic plots of the overall household transactions. Having demographic information for all the households in the dataset would improve the quality of our findings.
Comparing the average spend of households tells us how frequently customers transact with the retailer. Grouping by household, as shown in the code below, provides household-level information.
# Total spend, number of visits (unique baskets), and average spend per visit for each household
avg_spend <- transaction_data %>%
  group_by(household_key) %>%
  summarize(Total_spend = sum(sales_value),
            no_visits = length(unique(basket_id)),
            avg_sales_purchase = sum(sales_value) / length(unique(basket_id)))
avg_spend <- avg_spend[order(-avg_spend$Total_spend), ]
The boxplot below shows the average spending of the households, which ranged from $2 to $1,664. Our most loyal customer visited the retailer about 1,300 times over the span of the data, amounting to a total spend of around $11,000.
Another interesting fact is that our highest-spending customer made purchases worth about $38,000 in just 600 visits, an average spend of $63 per visit.
It is also important to note that only 20% of the households contribute about 50% of the total revenue.
To verify this, we computed the cumulative sum of the total amount spent by each household, ordered from highest to lowest spend.
# Cumulative percentage of total revenue, with households ordered from highest to lowest spend
cumsum <- 100 * cumsum(avg_spend$Total_spend) / sum(avg_spend$Total_spend)
running_totals <- cbind(index = 1:2500, as.data.frame(cumsum))
We can infer the same from the plot shown below.
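The plot can be produced directly from running_totals; the sketch below (styling choices are illustrative) draws the cumulative revenue share against the number of highest-spending households.
# Cumulative share of revenue versus number of households, ordered by total spend
ggplot(running_totals, aes(x = index, y = cumsum)) +
  geom_line() +
  labs(x = "Number of households (ordered by total spend)",
       y = "Cumulative share of revenue (%)")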
The graph shows that about 1,000 households contribute three-fourths of the sales, while the remaining 1,500 account for only the final quarter. From this, we can clearly suggest that concentrating on regular, loyal customers would be much more profitable for the retailer.
Although the dataset does not provide actual calendar dates, we assumed Day 1 to be the beginning of the period in which campaigning began. To examine this further, we grouped all transactions by each day in the dataset.
# Total revenue earned on each day of the study period
day_revenue <- transaction_data %>%
  group_by(day) %>%
  summarize(Revenue = sum(sales_value))
day_revenue$day <- as.integer(day_revenue$day)
With this data ready, a plot of revenue against day shows that revenue increased continuously: it rises throughout the first quarter of the period and becomes more stable towards the end.
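A plot along these lines can be generated as sketched below; geom_smooth() defaults to a loess fit for this number of observations, which produces the message shown underneath.
# Daily revenue over the study period with a loess trend line
ggplot(day_revenue, aes(x = day, y = Revenue)) +
  geom_point(alpha = 0.4) +
  geom_smooth()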
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Although we have performed extensive analysis on this dataset, there is much more scope to the project. To better understand consumer behavior and improve marketing techniques, we would need more demographic data and more details about the coupons.
We tried to identify the number of coupons, and the number of unique coupons, distributed as part of each campaign, but this proved to be beyond the scope of the project. For the same reason, we were unable to determine the average number of coupons redeemed per campaign and per household, or the distinct number of coupons issued for each campaign.
The efficiency of the campaigns is measured here on the basis of revenue, but this could be extended to identify the timeframes of coupon redemption, using the week and day details given for each transaction.
Another interesting direction would be to perform market basket analysis of the products and interpret which kinds of products and brands are purchased most, with or without discounts.