We are studying the data containing household level transactions over two years from a group of 2,500 households who are frequent shoppers at a retailer. Our main objective is to understand customer purchasing behaviour after running different types of campaigns. Here, we would also observe how different demographic attributes of customers affect the company’s sales. This would provide insights to the retailer regarding future campaigns and product demand analysis.
library(tidyverse)#core package containing other packages(ggplot2,dplyr,etc)
library(tm) #framework for text mining
library(stringr) #consistent wrappers for string operations
library(sqldf) #used to run sql commands on dataframes
library(splus2R)#facilitates the conversion of S-PLUS packages to R packages
library(data.table)#Fast aggregation of large data
Data Source
The source of the data is https://www.dropbox.com/sh/7g8npy9k51dhtjm/AACOwFfvFSjw26fWNDZmaSS1a?dl=1
The data set contains customer transaction data. It also includes their household data. It is the data collected over 2 years for 2500 households. The data is stored in 8 tables namely: Household Table, Campaign Table, Campaign Description Table, Transaction Table, Product Table, Coupon Table, Causal Data Table Coupon Redemption Table. They are linked by several primary keys and foreign keys like Household Key, Product ID, Coupon ID, etc. The original data had several missing values in the Product Table in the commodity table. There were 15 missing values in the Commodity Description column and 30607 missing values in the Product size column.
Data Cleaning
#Importing csv files using for loops
file.name<-c("data/campaign_desc.csv","data/campaign_table.csv",
"data/causal_data.csv","data/coupon.csv","data/coupon_redempt.csv",
"data/hh_demographic.csv","data/product.csv","data/transaction_data.csv")
for (i in 1:length(file.name)){
#check it's existence and if exists, get it to the list
if (file.exists(file.name[i]))
f<-fread(file.name[i])
assign(paste0("File",i),f)
}
## Camp_desc table
camp_desc <- File1
camp_desc_arranged <- camp_desc %>% arrange(description)
## camp_table
camp_table <- File2
map(camp_table , ~sum(is.na(.))) ## checking for null values
## coupon
coup <- File4
map(coup , ~sum(is.na(.))) ## checking for null values
## coupon_redempt
coup_red <- File5
map(coup_red , ~sum(is.na(.))) ## checking for null values
#Linking campaign type with description
A <- c(8,13,18,26,30)
B <- c(1,2,4,5,7,9,10,11,12,16,17,19,21,22,23,24,25,28,29)
C <- c(3,6,14,15,20,27)
coup_red <- coup_red %>%
mutate(Coup_cat = ifelse(campaign %in% A , "Type A" , ifelse(campaign %in% B , "Type B" , "Type C")))
## causal_data
cau_da <- File3
map(cau_da , ~sum(is.na(.))) ## checking for null values
## hh_demo
hh_demo2 <- File6
hh_demo3 <- hh_demo2
map(hh_demo2 , ~sum(is.na(.))) ## checking for null values
hh_demo2 <- hh_demo2 %>%
filter(str_detect(hh_comp_desc , "No Kids") == TRUE) %>% mutate(kid_category_desc = "None")
hh_demo4 <- filter(hh_demo3 , str_detect(hh_comp_desc , "No Kids") == FALSE)
hh_demo5 <- bind_rows(hh_demo2 , hh_demo4)
hh_demo5 <- hh_demo5 %>%
arrange(household_key)
fg <- unique(hh_demo5$income_desc)
rt1 <- hh_demo5 %>%
filter(income_desc == "Under 15K") %>%
select(household_key)
rt1 <- as.vector(t(rt1))
rt2 <- hh_demo5 %>%
filter(income_desc == "15-24K") %>%
select(household_key)
rt2 <- as.vector(t(rt2))
rt3 <- hh_demo5 %>%
filter(income_desc == "25-34K") %>%
select(household_key)
rt3 <- as.vector(t(rt3))
rt4 <- hh_demo5 %>%
filter(income_desc == "35-49K") %>%
select(household_key)
rt4 <- as.vector(t(rt4))
rt5 <- hh_demo5 %>%
filter(income_desc == "50-74K") %>%
select(household_key)
rt5 <- as.vector(t(rt5))
rt6 <- hh_demo5 %>%
filter(income_desc == "75-99K") %>%
select(household_key)
rt6 <- as.vector(t(rt6))
rt7 <- hh_demo5 %>%
filter(income_desc == "100-124K") %>%
select(household_key)
rt7 <- as.vector(t(rt7))
rt8 <- hh_demo5 %>%
filter(income_desc == "125-149K") %>%
select(household_key)
rt8 <- as.vector(t(rt8))
rt9 <- hh_demo5 %>%
filter(income_desc == "150-174K") %>%
select(household_key)
rt9 <- as.vector(t(rt9))
rt10 <- hh_demo5 %>%
filter(income_desc == "175-199K") %>%
select(household_key)
rt10 <- as.vector(t(rt10))
rt11 <- hh_demo5 %>%
filter(income_desc == "200-249K") %>%
select(household_key)
rt11 <- as.vector(t(rt11))
rt12 <- hh_demo5 %>%
filter(income_desc == "250K+") %>%
select(household_key)
rt12 <- as.vector(t(rt12))
## transaction
trans_data <- read.csv("data/transaction_data.csv" , na.strings = c("" , "NA"))
map(trans_data , ~sum(is.na(.))) ## checking for null values
trans_data$trans_time <- substr(as.POSIXct(sprintf("%04.0f", trans_data$trans_time), format='%H%M'), 12, 16) #Converting Time
trans_data <- trans_data %>%
mutate(actual_prod_price = (sales_value + retail_disc + coupon_match_disc)/quantity , shelf_price = (sales_value - retail_disc - coupon_match_disc)/quantity)
fr1 <- trans_data %>%
filter(household_key %in% rt1) %>%
select(sales_value)
inc_1 <- sum(fr1)
fr2 <- trans_data %>%
filter(household_key %in% rt2) %>%
select(sales_value)
inc_2 <- sum(fr2)
fr3 <- trans_data %>%
filter(household_key %in% rt3) %>%
select(sales_value)
inc_3 <- sum(fr3)
fr4 <- trans_data %>%
filter(household_key %in% rt4) %>%
select(sales_value)
inc_4 <- sum(fr4)
fr5 <- trans_data %>%
filter(household_key %in% rt5) %>%
select(sales_value)
inc_5 <- sum(fr5)
fr6 <- trans_data %>%
filter(household_key %in% rt6) %>%
select(sales_value)
inc_6 <- sum(fr6)
fr7 <- trans_data %>%
filter(household_key %in% rt7) %>%
select(sales_value)
inc_7 <- sum(fr7)
fr8 <- trans_data %>%
filter(household_key %in% rt8) %>%
select(sales_value)
inc_8 <- sum(fr8)
fr9 <- trans_data %>%
filter(household_key %in% rt9) %>%
select(sales_value)
inc_9 <- sum(fr9)
fr10 <- trans_data %>%
filter(household_key %in% rt10) %>%
select(sales_value)
inc_10 <- sum(fr10)
fr11 <- trans_data %>%
filter(household_key %in% rt11) %>%
select(sales_value)
inc_11 <- sum(fr11)
fr12 <- trans_data %>%
filter(household_key %in% rt12) %>%
select(sales_value)
inc_12 <- sum(fr12)
#Mutating new column for Salary Bracket
trans_data4 <- trans_data %>%
mutate(Inc_cat = ifelse(household_key %in% rt1 , "Under 15K" , ifelse(household_key %in% rt2 , "15-24K" , ifelse(household_key %in% rt3 , "25-34K" , ifelse(household_key %in% rt4 , "35-49K" , ifelse(household_key %in% rt5 , "50-74K" , ifelse(household_key %in% rt6 , "75-99K" , ifelse(household_key %in% rt7 , "100-124K" , ifelse(household_key %in% rt8 , "125-149K" , ifelse(household_key %in% rt9 , "150-174K" , ifelse(household_key %in% rt10 , "175-199K" , ifelse(household_key %in% rt11 , "200-249K" , "250K+"))))))))))))
## Product table
pr3 <- read.csv("data/product.csv" , na.strings = c("" , "NA"))
map(pr3 , ~sum(is.na(.))) ## checking for null values
pr4 <- pr3 %>%
filter(!is.na(department) | !is.na(commodity_desc) ) ## We removed all the rows which had missing values(except in column curr_size)
map(pr4 , ~sum(is.na(.))) ## checking for null values
pr4$curr_size_of_product <- as.character(pr4$curr_size_of_product)
pr5 <- pr4
pr_wt_oz <- pr5 %>%
filter(str_detect(curr_size_of_product , "OZ|OUNCE|LB|GA") == TRUE) %>% mutate(Category = "Weight/Volume")
pr_all <- pr5 %>%
filter(str_detect(curr_size_of_product , "OZ|OUNCE|LB|GA") == FALSE) %>% mutate(Category = "Misc")
pr_null <- pr5 %>%
filter(is.na(curr_size_of_product)) %>%
mutate(Category = "NA")
pr_final <- bind_rows(pr_wt_oz , pr_all , pr_null)
#Using SQL commands
Prod_df <- sqldf("select *, Case when department in ('GROCERY','PASTRY','MEAT-PCKGD'
,'SEAFOOD-PCKGD','PRODUCE','NUTRITION','DELI','MEAT','SEAFOOD','SALAD BAR','GRO BAKERY'
,'FROZEN GROCERY','SPIRITS','RESTAURANT','MEAT-WHSE','DAIRY DELI','CHEF SHOPPE'
,'DELI/SNACK BAR','PORK'
) then 'F'
when department in ('COSMETICS','FLORAL','TRAVEL & LEISUR','MISC SALES TRAN'
,'KIOSK-GAS','ELECT &PLUMBING','GM MERCH EXP','COUP/STR & MFG','GARDEN CENTER','TOYS'
,'CHARITABLE CONT','PROD-WHS SALES','HBC','AUTOMOTIVE','VIDEO RENTAL','CNTRL/STORE SUP'
,'HOUSEWARES','POSTAL CENTER','PHOTO','VIDEO'
) then 'N'
when department in ('DRUG GM','RX','PHARMACY SUPPLY'
) then 'P'
when department = 'NA' then 'NA'
end as DeptCatCd
from pr_final")
Cleaned Data Sets
After going through above discussed cleaning steps we ended up with the following 2 tables(SUMMARY) :
demo_prod <- trans_data %>%
left_join(hh_demo5 , by = "household_key")
demo_prod_final <- demo_prod %>%
left_join(Prod_df , by = "product_id")
require(data.table)
setDT(coup_red); setDT(coup) # convert to data.tables by reference
coup_redeem_prod <- coup_red[coup, mult = "first", on = c("campaign" , "coupon_upc") , nomatch=0L]
head(demo_prod_final)
## household_key basket_id day product_id quantity sales_value store_id
## 1 2375 26984851472 1 1004906 1 1.39 364
## 2 2375 26984851472 1 1033142 1 0.82 364
## 3 2375 26984851472 1 1036325 1 0.99 364
## 4 2375 26984851472 1 1082185 1 1.21 364
## 5 2375 26984851472 1 8160430 1 1.50 364
## 6 2375 26984851516 1 826249 2 1.98 364
## retail_disc trans_time week_no coupon_disc coupon_match_disc
## 1 -0.60 16:31 1 0 0
## 2 0.00 16:31 1 0 0
## 3 -0.30 16:31 1 0 0
## 4 0.00 16:31 1 0 0
## 5 -0.39 16:31 1 0 0
## 6 -0.60 16:42 1 0 0
## actual_prod_price shelf_price age_desc marital_status_code income_desc
## 1 0.79 1.99 <NA> <NA> <NA>
## 2 0.82 0.82 <NA> <NA> <NA>
## 3 0.69 1.29 <NA> <NA> <NA>
## 4 1.21 1.21 <NA> <NA> <NA>
## 5 1.11 1.89 <NA> <NA> <NA>
## 6 0.69 1.29 <NA> <NA> <NA>
## homeowner_desc hh_comp_desc household_size_desc kid_category_desc
## 1 <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA>
## manufacturer department brand commodity_desc
## 1 69 PRODUCE Private POTATOES
## 2 2 PRODUCE National ONIONS
## 3 69 PRODUCE Private VEGETABLES - ALL OTHERS
## 4 2 PRODUCE National TROPICAL FRUIT
## 5 69 PRODUCE Private ORGANICS FRUIT & VEGETABLES
## 6 69 GROCERY Private BAKED BREAD/BUNS/ROLLS
## sub_commodity_desc curr_size_of_product Category DeptCatCd
## 1 POTATOES RUSSET (BULK&BAG) 5 LB Weight/Volume F
## 2 ONIONS SWEET (BULK&BAG) 40 LB Weight/Volume F
## 3 CELERY <NA> NA F
## 4 BANANAS 40 LB Weight/Volume F
## 5 ORGANIC CARROTS 1 LB Weight/Volume F
## 6 HAMBURGER BUNS 12 OZ Weight/Volume F
head(coup_redeem_prod)
## household_key day coupon_upc campaign Coup_cat product_id
## 1: 321 446 10000089064 9 Type B 27754
## 2: 1229 491 10000089073 12 Type B 28897
## 3: 1804 668 52100000076 25 Type B 28929
## 4: 22 628 52100000031 22 Type B 28929
## 5: 158 443 52100000033 9 Type B 28929
## 6: 1804 668 52100000076 25 Type B 29096
#SUMMARY STATISTICS
#Comparing Actual Product Price and Shelf Price
demo_prod_final%>%
select(department,actual_prod_price,shelf_price)%>%
group_by(department)%>%
summarise(MEAN_ACTUAL_PROD=mean(actual_prod_price,na.rm = TRUE),MEAN_SHELF_PRICE=mean(shelf_price,na.rm = TRUE))
## # A tibble: 44 x 3
## department MEAN_ACTUAL_PROD MEAN_SHELF_PRICE
## <fct> <dbl> <dbl>
## 1 AUTOMOTIVE 6.52 6.52
## 2 CHARITABLE CONT 1.62 2.74
## 3 CHEF SHOPPE 2.54 2.76
## 4 CNTRL/STORE SUP 2.50 2.50
## 5 COSMETICS 3.21 5.00
## 6 COUP/STR & MFG 0.801 1.24
## 7 DAIRY DELI 0.922 0.977
## 8 DELI Inf Inf
## 9 DELI/SNACK BAR 3.23 3.23
## 10 DRUG GM NaN NaN
## # ... with 34 more rows
Our final tables contain the following variables:
Analysing Total Expenditure Department wise
#--------Department wise Total Expenditure----------------
dept_ana<-demo_prod_final%>%
select(department,actual_prod_price)%>%
group_by(as.factor(department))%>%
summarise(sum(actual_prod_price,na.rm=T))%>%
magrittr::set_colnames(c("Department","Total Expenditure"))%>%
arrange(desc(`Total Expenditure`))%>%
filter(!(`Total Expenditure`=='Inf'))%>%
top_n(40)
ggplot(dept_ana, aes(x=Department, y=`Total Expenditure`)) +
geom_point(size=3) +
geom_segment(aes(x=Department,
xend=Department,
y=0,
yend=`Total Expenditure`)) +
labs(title="Lollipop Chart",
subtitle="Department Vs Total Expenditure") +
theme(axis.text.x = element_text(angle=65, vjust=0.6))
Analysis of Sales of Customers for different Salary Brackets
#-------------------------Sales based on Salary Bracket-------------------
Sal_Bracket <- c("Under 15K" , "15-24K" , "25-34K" , "35-49K" , "50-74K" , "75-99K" , "100-124K" , "125-149K" , "150-174K" , "175-199K" , "200-249K" , "250K+")
Total_Sales <- c(inc_1 , inc_2 , inc_3 , inc_4 , inc_5 , inc_6 , inc_7 , inc_8 , inc_9 , inc_10 , inc_11 , inc_12)
tbs <- data.frame(Sal_Bracket , Total_Sales)
require(scales)
ggplot(tbs , aes(x = Sal_Bracket , y = Total_Sales , group = 1)) + geom_point() + geom_line(col = "red") + scale_y_continuous(labels = comma) + scale_x_discrete(limits = Sal_Bracket)
#-------------Plot of different categories of campaigns------------------
ggplot(coup_red, aes(campaign, fill = `Coup_cat`)) +
geom_histogram(alpha = 0.5, aes(y = ..density..), position = 'identity') + labs(title = "Density Description Of All The Campaigns" , subtitle = "Grouped By Campagin Type" , x = "Camapign Number" , y = "Density") + scale_fill_manual(values = c("Type A" = "red", "Type B" = "blue", "Type C" = "green"))
Sales Analysis on the basis of Time(every hour)
#---------------------------Time based Sales Analysis------------------------
demo_prod_final_time <- demo_prod_final
demo_prod_final_time$trans_time <- gsub( ":" , "" , demo_prod_final_time$trans_time)
min(demo_prod_final_time$trans_time)
## [1] "0000"
max(demo_prod_final_time$trans_time)
## [1] "2359"
t1 <- demo_prod_final_time %>%
filter(trans_time > 0000 & trans_time < 0100)%>%
select(sales_value)
ts1 <- sum(t1)
t2 <- demo_prod_final_time %>%
filter(trans_time > 0100 & trans_time < 0200)%>%
select(sales_value)
ts2 <- sum(t2)
t3 <- demo_prod_final_time %>%
filter(trans_time > 0200 & trans_time < 0300)%>%
select(sales_value)
ts3 <- sum(t3)
t11 <- demo_prod_final_time %>%
filter(trans_time > 1000 & trans_time < 1100) %>%
select(sales_value)
ts11 <- sum(t11)
t12 <- demo_prod_final_time %>%
filter(trans_time > 1100 & trans_time < 1200) %>%
select(sales_value)
ts12 <- sum(t12)
t13 <- demo_prod_final_time %>%
filter(trans_time > 1200 & trans_time < 1300) %>%
select(sales_value)
ts13 <- sum(t13)
t14 <- demo_prod_final_time %>%
filter(trans_time > 1300 & trans_time < 1400) %>%
select(sales_value)
ts14 <- sum(t14)
t15 <- demo_prod_final_time %>%
filter(trans_time > 1400 & trans_time < 1500) %>%
select(sales_value)
ts15 <- sum(t15)
t16 <- demo_prod_final_time %>%
filter(trans_time > 1500 & trans_time < 1600) %>%
select(sales_value)
ts16 <- sum(t16)
t17 <- demo_prod_final_time %>%
filter(trans_time > 1600 & trans_time < 1700) %>%
select(sales_value)
ts17 <- sum(t17)
t18 <- demo_prod_final_time %>%
filter(trans_time > 1700 & trans_time < 1800) %>%
select(sales_value)
ts18 <- sum(t18)
t19 <- demo_prod_final_time %>%
filter(trans_time > 1800 & trans_time < 1900) %>%
select(sales_value)
ts19 <- sum(t19)
t20 <- demo_prod_final_time %>%
filter(trans_time > 1900 & trans_time < 2000) %>%
select(sales_value)
ts20 <- sum(t20)
t21 <- demo_prod_final_time %>%
filter(trans_time > 2000 & trans_time < 2100) %>%
select(sales_value)
ts21 <- sum(t21)
t22 <- demo_prod_final_time %>%
filter(trans_time > 2100 & trans_time < 2200) %>%
select(sales_value)
ts22 <- sum(t22)
t23 <- demo_prod_final_time %>%
filter(trans_time > 2200 & trans_time < 2300) %>%
select(sales_value)
ts23 <- sum(t23)
t24 <- demo_prod_final_time %>%
filter(trans_time > 2300 & trans_time < 2400) %>%
select(sales_value)
ts24 <- sum(t24)
vals <- c(ts1 , ts2 , ts3 , ts11 , ts12 , ts13 , ts14 , ts15 , ts16 , ts17 , ts18 , ts19 , ts20 , ts21 , ts22 , ts23 , ts24)
timehours <- c("12-1 AM" , "1-2 AM" , "2-3 AM" , "10-11 AM" , "11-12 Noon" , "12-1 PM" , "1-2 PM" , "2-3 PM" , "3-4 PM" , "4-5 PM" , "5-6 PM" , "6-7 PM" , "7-8 PM" , "8-9 PM" , "9-10 PM" , "10-11 PM" , "11-12 Midnight")
tbs1 <- data.frame(timehours , vals)
require(scales)
ggplot(tbs1 , aes(x = timehours , y = vals , group = 1)) + geom_point() + geom_line(col = "red") + scale_y_continuous(labels = comma) + scale_x_discrete(limits = timehours)
Average Expenditure during different parts of the day
#-----------------Time-Slot Analysis of Expenditure(Average)-------------
time1<-demo_prod_final%>%
mutate(TimeBrackets=ifelse(trans_time>=00:00&trans_time<5:00,"Midnight",ifelse(trans_time>=5:00&trans_time<12:00,"Morning",ifelse(trans_time>=12:00&trans_time<19:00,"Afternoon/Evening","Night"))))%>%
select(TimeBrackets,actual_prod_price)%>%
filter(!(actual_prod_price=='Inf'|actual_prod_price=='-Inf'|actual_prod_price=='-NaN'))%>%
group_by(TimeBrackets)%>%
summarise(mean(actual_prod_price,na.rm=T))%>%
magrittr::set_colnames(c("Part of the Day","Average Expenditure"))
time1
## # A tibble: 4 x 2
## `Part of the Day` `Average Expenditure`
## <chr> <dbl>
## 1 Afternoon/Evening 2.03
## 2 Midnight 2.03
## 3 Morning 2.02
## 4 Night 2.02
Effect of Campaigns of the Sales
#---------------------- Before and After the campaign analysis-----------------
demo_prod_final_timeseries <- demo_prod_final %>%
filter(day < 224)
demo_prod_final_duringcampaign <- demo_prod_final %>%
filter(day > 224 & day < 720)
#Before Campaign
ggplot(demo_prod_final_timeseries , aes(x = day , y = sales_value , group = `brand` , colour = `brand`)) + geom_point() + geom_line()
#After Campaign
ggplot(demo_prod_final_duringcampaign , aes(x = day , y = sales_value , group = `brand` , colour = `brand`)) + geom_point() + geom_line()
Trend in Sales w.r.t Marketing Campaigns
#---Expenditure Trend before and after Marketing Campaign------
timeseries<-demo_prod_final%>%
select(day,actual_prod_price)%>%
group_by(day)%>%
summarise(sum(actual_prod_price,na.rm=T))%>%
magrittr::set_colnames(c("Day","Total Expenditure"))%>%
filter(!(`Total Expenditure`=='Inf'))
ggplot(timeseries, aes(x=Day)) +
geom_line(aes(y=`Total Expenditure`),col='Red') +
labs(title="Day vs Total Expenditure Trend",
y="Expenditure") + # title and caption
stat_smooth(aes(y=`Total Expenditure`)) +
theme(axis.text.x = element_text(angle = 90, vjust=0.5), # rotate x axis text
panel.grid.minor = element_blank()) + # turn off minor grid
geom_vline(xintercept = 224,
color = "black", size=1)
The problem statement has been addressed by studying and analysing the effect of company sales before and after campaigns. We also analysed the buying behaviour of customers having different attributes.
We analysed the effect of different campaigns, different attributes(salary bracket), customer sentiment after campaigns were started, and also its effect on different departents and timeslots.
From the graphical analysis, purchasing behaviour doesn’t get affected by the salaries of individuals; we see an abnormally high transaction value between 1 am to 2 am and finally the campaigns positively affected the purchasing behaviour of customers.
Give more focus on departments that have not been positively affected by campaigns, enhance customer base by targeting audience who were not affected by these campaigns as well. Consider increasing the duration of such campaigns as they tend to have positive effects on customer purchasing behaviour
We have not considered the positioning of the products in the stores which can play an instrumental role in driving sales