Final-Project

Objective and Introduction

We are studying the data containing household level transactions over two years from a group of 2,500 households who are frequent shoppers at a retailer. Our main objective is to understand customer purchasing behaviour after running different types of campaigns. Here, we would also observe how different demographic attributes of customers affect the company’s sales. This would provide insights to the retailer regarding future campaigns and product demand analysis.

We would be primarily analysing the purchases during and after the campaigns.
The data would be used from the transaction data, coupon redeem data and the product data.
The approach would be a 4 step procedure i.e. collecting and cleaning the data, the aggregating it by joining tables from the transaction ,product and demographic data.
We would analyse histograms of sale values across different types. We would get to know products on which we can apply discounts to verify customer buying trends
This will help the retailer in formulating strategies for future campaigns, understanding target audience, advertisement types, etc

Packages Used

library(tidyverse)#core package containing other packages(ggplot2,dplyr,etc)
library(tm) #framework for text mining 
library(stringr) #consistent wrappers for string operations
library(sqldf) #used to run sql commands on dataframes
library(splus2R)#facilitates the conversion of S-PLUS packages to R packages
library(data.table)#Fast aggregation of large data

Data Preparation

Data Source

The source of the data is https://www.dropbox.com/sh/7g8npy9k51dhtjm/AACOwFfvFSjw26fWNDZmaSS1a?dl=1

The data set contains customer transaction data. It also includes their household data. It is the data collected over 2 years for 2500 households. The data is stored in 8 tables namely: Household Table, Campaign Table, Campaign Description Table, Transaction Table, Product Table, Coupon Table, Causal Data Table Coupon Redemption Table. They are linked by several primary keys and foreign keys like Household Key, Product ID, Coupon ID, etc. The original data had several missing values in the Product Table in the commodity table. There were 15 missing values in the Commodity Description column and 30607 missing values in the Product size column.

Data Cleaning

#Importing csv files using for loops
file.name<-c("data/campaign_desc.csv","data/campaign_table.csv",
             "data/causal_data.csv","data/coupon.csv","data/coupon_redempt.csv",
             "data/hh_demographic.csv","data/product.csv","data/transaction_data.csv")

for (i in 1:length(file.name)){
  
  #check it's existence and if exists, get it to the list
  if (file.exists(file.name[i]))
    f<-fread(file.name[i])
  assign(paste0("File",i),f)
}
## Camp_desc table

camp_desc <- File1
camp_desc_arranged <- camp_desc %>% arrange(description)


## camp_table

camp_table <- File2
map(camp_table , ~sum(is.na(.))) ## checking for null values

## coupon

coup <- File4
map(coup , ~sum(is.na(.))) ## checking for null values

## coupon_redempt

coup_red <- File5
map(coup_red , ~sum(is.na(.))) ## checking for null values

#Linking campaign type with description
A <- c(8,13,18,26,30)
B <- c(1,2,4,5,7,9,10,11,12,16,17,19,21,22,23,24,25,28,29)
C <- c(3,6,14,15,20,27)

coup_red <- coup_red %>% 
  mutate(Coup_cat = ifelse(campaign %in% A , "Type A" , ifelse(campaign %in% B , "Type B" , "Type C")))

## causal_data

cau_da <- File3
map(cau_da , ~sum(is.na(.))) ## checking for null values

## hh_demo

hh_demo2 <- File6
hh_demo3 <- hh_demo2
map(hh_demo2 , ~sum(is.na(.))) ## checking for null values
hh_demo2 <- hh_demo2 %>% 
  filter(str_detect(hh_comp_desc , "No Kids") == TRUE) %>% mutate(kid_category_desc = "None") 
hh_demo4 <- filter(hh_demo3 , str_detect(hh_comp_desc , "No Kids") == FALSE)
hh_demo5 <- bind_rows(hh_demo2 , hh_demo4)
hh_demo5 <- hh_demo5 %>% 
  arrange(household_key)
fg <- unique(hh_demo5$income_desc)
rt1 <- hh_demo5 %>% 
  filter(income_desc == "Under 15K") %>%
  select(household_key)
rt1 <- as.vector(t(rt1))
rt2 <- hh_demo5 %>% 
  filter(income_desc == "15-24K") %>% 
  select(household_key)
rt2 <- as.vector(t(rt2))
rt3 <- hh_demo5 %>% 
  filter(income_desc == "25-34K") %>% 
  select(household_key)
rt3 <- as.vector(t(rt3))
rt4 <- hh_demo5 %>% 
  filter(income_desc == "35-49K") %>% 
  select(household_key)
rt4 <- as.vector(t(rt4))
rt5 <- hh_demo5 %>% 
  filter(income_desc == "50-74K") %>% 
  select(household_key)
rt5 <- as.vector(t(rt5))
rt6 <- hh_demo5 %>% 
  filter(income_desc == "75-99K") %>% 
  select(household_key)
rt6 <- as.vector(t(rt6))
rt7 <- hh_demo5 %>% 
  filter(income_desc == "100-124K") %>% 
  select(household_key)
rt7 <- as.vector(t(rt7))
rt8 <- hh_demo5 %>% 
  filter(income_desc == "125-149K") %>% 
  select(household_key)
rt8 <- as.vector(t(rt8))
rt9 <- hh_demo5 %>% 
  filter(income_desc == "150-174K") %>% 
  select(household_key)
rt9 <- as.vector(t(rt9))
rt10 <- hh_demo5 %>% 
  filter(income_desc == "175-199K") %>% 
  select(household_key)
rt10 <- as.vector(t(rt10))
rt11 <- hh_demo5 %>% 
  filter(income_desc == "200-249K") %>% 
  select(household_key)
rt11 <- as.vector(t(rt11))
rt12 <- hh_demo5 %>% 
  filter(income_desc == "250K+") %>% 
  select(household_key)
rt12 <- as.vector(t(rt12))


## transaction 
trans_data <- read.csv("data/transaction_data.csv" , na.strings = c("" , "NA"))
map(trans_data , ~sum(is.na(.))) ## checking for null values
trans_data$trans_time <- substr(as.POSIXct(sprintf("%04.0f", trans_data$trans_time), format='%H%M'), 12, 16) #Converting Time
trans_data <- trans_data %>% 
  mutate(actual_prod_price = (sales_value + retail_disc + coupon_match_disc)/quantity , shelf_price = (sales_value - retail_disc - coupon_match_disc)/quantity)
fr1 <- trans_data %>% 
  filter(household_key %in% rt1) %>% 
  select(sales_value)
inc_1 <- sum(fr1)
fr2 <- trans_data %>% 
  filter(household_key %in% rt2) %>% 
  select(sales_value)
inc_2 <- sum(fr2)
fr3 <- trans_data %>% 
  filter(household_key %in% rt3) %>% 
  select(sales_value)
inc_3 <- sum(fr3)
fr4 <- trans_data %>% 
  filter(household_key %in% rt4) %>%
  select(sales_value)
inc_4 <- sum(fr4)
fr5 <- trans_data %>% 
  filter(household_key %in% rt5) %>% 
  select(sales_value)
inc_5 <- sum(fr5)
fr6 <- trans_data %>% 
  filter(household_key %in% rt6) %>% 
  select(sales_value)
inc_6 <- sum(fr6)
fr7 <- trans_data %>% 
  filter(household_key %in% rt7) %>% 
  select(sales_value)
inc_7 <- sum(fr7)
fr8 <- trans_data %>% 
  filter(household_key %in% rt8) %>% 
  select(sales_value)
inc_8 <- sum(fr8)
fr9 <- trans_data %>% 
  filter(household_key %in% rt9) %>% 
  select(sales_value)
inc_9 <- sum(fr9)
fr10 <- trans_data %>% 
  filter(household_key %in% rt10) %>% 
  select(sales_value)
inc_10 <- sum(fr10)
fr11 <- trans_data %>% 
  filter(household_key %in% rt11) %>% 
  select(sales_value)
inc_11 <- sum(fr11)
fr12 <- trans_data %>% 
  filter(household_key %in% rt12) %>% 
  select(sales_value)
inc_12 <- sum(fr12)

#Mutating new column for Salary Bracket
trans_data4 <- trans_data %>% 
  mutate(Inc_cat = ifelse(household_key %in% rt1 , "Under 15K" , ifelse(household_key %in% rt2 , "15-24K" , ifelse(household_key %in% rt3 , "25-34K" , ifelse(household_key %in% rt4 , "35-49K" , ifelse(household_key %in% rt5 , "50-74K" , ifelse(household_key %in% rt6 , "75-99K" , ifelse(household_key %in% rt7 , "100-124K" , ifelse(household_key %in% rt8 , "125-149K" , ifelse(household_key %in% rt9 , "150-174K" , ifelse(household_key %in% rt10 , "175-199K" , ifelse(household_key %in% rt11 , "200-249K" , "250K+"))))))))))))


## Product table
pr3 <- read.csv("data/product.csv" , na.strings = c("" , "NA"))
map(pr3 , ~sum(is.na(.))) ## checking for null values
pr4 <- pr3 %>% 
  filter(!is.na(department) | !is.na(commodity_desc) ) ## We removed all the rows which had missing values(except in column curr_size)
map(pr4 , ~sum(is.na(.))) ## checking for null values
pr4$curr_size_of_product <- as.character(pr4$curr_size_of_product)
pr5 <- pr4
pr_wt_oz <- pr5 %>% 
  filter(str_detect(curr_size_of_product , "OZ|OUNCE|LB|GA") == TRUE) %>%    mutate(Category = "Weight/Volume") 
pr_all <- pr5 %>% 
  filter(str_detect(curr_size_of_product , "OZ|OUNCE|LB|GA") == FALSE) %>% mutate(Category = "Misc")
pr_null <- pr5 %>% 
  filter(is.na(curr_size_of_product))  %>% 
  mutate(Category = "NA") 
pr_final <- bind_rows(pr_wt_oz , pr_all , pr_null)
#Using SQL commands
Prod_df <- sqldf("select *, Case when department in ('GROCERY','PASTRY','MEAT-PCKGD'
                 ,'SEAFOOD-PCKGD','PRODUCE','NUTRITION','DELI','MEAT','SEAFOOD','SALAD BAR','GRO BAKERY'
                 ,'FROZEN GROCERY','SPIRITS','RESTAURANT','MEAT-WHSE','DAIRY DELI','CHEF SHOPPE'
                 ,'DELI/SNACK BAR','PORK'
) then 'F'
                 
                 when department in ('COSMETICS','FLORAL','TRAVEL & LEISUR','MISC SALES TRAN'
                 ,'KIOSK-GAS','ELECT &PLUMBING','GM MERCH EXP','COUP/STR & MFG','GARDEN CENTER','TOYS'
                 ,'CHARITABLE CONT','PROD-WHS SALES','HBC','AUTOMOTIVE','VIDEO RENTAL','CNTRL/STORE SUP'
                 ,'HOUSEWARES','POSTAL CENTER','PHOTO','VIDEO'
                 ) then 'N'
                 
                 when department in ('DRUG GM','RX','PHARMACY SUPPLY'
                 ) then 'P'
                 
                 when department = 'NA' then 'NA'
                 end as DeptCatCd
                 
                 from pr_final")

The rows with missing values in the Commodity Description column were removed as these cannot be analysed.
The bigger chunk of missing values in the Product Size column were coded with ‘NA’ . This column also had huge amounts of messy data with undefined units and unformatted cell entries. So we created a separate column to bifurcate the entries into 2 categories i.e. Weight/Volume column and a Miscellaneous column for clearer understanding and future analysis .
We also used ‘sqldf’ function to categorise the various products into food, non-food and pharmaceutical type.
Additionally we cleaned the Household Table by mutating the Kid’s Description category to ‘None’ for Households with no kids.
The time data which was in numeric form in the ‘Transaction’ table was changed to the Time format using the as.POSIXct function..
We added extra columns in the Transaction data table to compute Actual Product Price and Shelf Price.

Cleaned Data Sets

After going through above discussed cleaning steps we ended up with the following 2 tables(SUMMARY) :

demo_prod_final: The first table is the final table obtained after aggregating Demographic,Product and Transactional data.
coup_redeem_prod: The second table is the final table obtained after aggregating CouponRedeemed and corresponding Product tables. We can get the Product IDs for the coupons redeemed by the customers. This will be required for further analysis. Summary Statistics included for price

demo_prod <- trans_data %>% 
  left_join(hh_demo5 , by = "household_key")
demo_prod_final <- demo_prod %>% 
  left_join(Prod_df , by = "product_id") 
require(data.table)
setDT(coup_red); setDT(coup) # convert to data.tables by reference
coup_redeem_prod <- coup_red[coup, mult = "first", on = c("campaign" , "coupon_upc") , nomatch=0L]
head(demo_prod_final)

##   household_key   basket_id day product_id quantity sales_value store_id
## 1          2375 26984851472   1    1004906        1        1.39      364
## 2          2375 26984851472   1    1033142        1        0.82      364
## 3          2375 26984851472   1    1036325        1        0.99      364
## 4          2375 26984851472   1    1082185        1        1.21      364
## 5          2375 26984851472   1    8160430        1        1.50      364
## 6          2375 26984851516   1     826249        2        1.98      364
##   retail_disc trans_time week_no coupon_disc coupon_match_disc
## 1       -0.60      16:31       1           0                 0
## 2        0.00      16:31       1           0                 0
## 3       -0.30      16:31       1           0                 0
## 4        0.00      16:31       1           0                 0
## 5       -0.39      16:31       1           0                 0
## 6       -0.60      16:42       1           0                 0
##   actual_prod_price shelf_price age_desc marital_status_code income_desc
## 1              0.79        1.99     <NA>                <NA>        <NA>
## 2              0.82        0.82     <NA>                <NA>        <NA>
## 3              0.69        1.29     <NA>                <NA>        <NA>
## 4              1.21        1.21     <NA>                <NA>        <NA>
## 5              1.11        1.89     <NA>                <NA>        <NA>
## 6              0.69        1.29     <NA>                <NA>        <NA>
##   homeowner_desc hh_comp_desc household_size_desc kid_category_desc
## 1           <NA>         <NA>                <NA>              <NA>
## 2           <NA>         <NA>                <NA>              <NA>
## 3           <NA>         <NA>                <NA>              <NA>
## 4           <NA>         <NA>                <NA>              <NA>
## 5           <NA>         <NA>                <NA>              <NA>
## 6           <NA>         <NA>                <NA>              <NA>
##   manufacturer department    brand              commodity_desc
## 1           69    PRODUCE  Private                    POTATOES
## 2            2    PRODUCE National                      ONIONS
## 3           69    PRODUCE  Private     VEGETABLES - ALL OTHERS
## 4            2    PRODUCE National              TROPICAL FRUIT
## 5           69    PRODUCE  Private ORGANICS FRUIT & VEGETABLES
## 6           69    GROCERY  Private      BAKED BREAD/BUNS/ROLLS
##           sub_commodity_desc curr_size_of_product      Category DeptCatCd
## 1 POTATOES RUSSET (BULK&BAG)                 5 LB Weight/Volume         F
## 2    ONIONS SWEET (BULK&BAG)                40 LB Weight/Volume         F
## 3                     CELERY                 <NA>            NA         F
## 4                    BANANAS                40 LB Weight/Volume         F
## 5            ORGANIC CARROTS                 1 LB Weight/Volume         F
## 6             HAMBURGER BUNS                12 OZ Weight/Volume         F

head(coup_redeem_prod)

##    household_key day  coupon_upc campaign Coup_cat product_id
## 1:           321 446 10000089064        9   Type B      27754
## 2:          1229 491 10000089073       12   Type B      28897
## 3:          1804 668 52100000076       25   Type B      28929
## 4:            22 628 52100000031       22   Type B      28929
## 5:           158 443 52100000033        9   Type B      28929
## 6:          1804 668 52100000076       25   Type B      29096

#SUMMARY STATISTICS
#Comparing Actual Product Price and Shelf Price
demo_prod_final%>%
select(department,actual_prod_price,shelf_price)%>%
  group_by(department)%>%
  summarise(MEAN_ACTUAL_PROD=mean(actual_prod_price,na.rm = TRUE),MEAN_SHELF_PRICE=mean(shelf_price,na.rm = TRUE))

## # A tibble: 44 x 3
##    department      MEAN_ACTUAL_PROD MEAN_SHELF_PRICE
##    <fct>                      <dbl>            <dbl>
##  1 AUTOMOTIVE                 6.52             6.52 
##  2 CHARITABLE CONT            1.62             2.74 
##  3 CHEF SHOPPE                2.54             2.76 
##  4 CNTRL/STORE SUP            2.50             2.50 
##  5 COSMETICS                  3.21             5.00 
##  6 COUP/STR & MFG             0.801            1.24 
##  7 DAIRY DELI                 0.922            0.977
##  8 DELI                     Inf              Inf    
##  9 DELI/SNACK BAR             3.23             3.23 
## 10 DRUG GM                  NaN              NaN    
## # ... with 34 more rows

Variable List

Our final tables contain the following variables:

household_key : Uniquely identifies each household
basket_id : Uniquely identifies each purchase occassion
product_id : Uniquely identifies each product
quantity : Number of products purchased during the trip
sales_value : Amount of dollars retailer receives from the sale
store_id : Identifies unique stores
retail_disc : Discount applied due to retailer’s loyalty card program
trans_time : Time of Transaction
week_no : Week of the Transaction
coupon_disc : Discount due to manufacturer coupon
coupon_match_disc : Discount due to retailer’s match of manufacturer’s coupon
actual_prod_price : Actual price of the product after considering the discounts
shelf_price : Price of the product before any discount
age_desc : Estimated Age range
marital_status_code : Marital Status(A-Married, B-Single, C-Unknown)
income_desc : Household Income
homeowner_desc : Homeowner,etc
hh_comp_desc : Household Composition
household_size_desc : Size of Household upto 5+
kid_category_desc : Number of children present upto 3+
manufacturer : Code that links products with same manufacturer together
department : Groups similar products together
brand : Indicates private or national brand
commodity_desc : Groups similar products at a lower level
sub_commodity_desc : Groups similar products at the lowest level
curr_size_of_product : Indicates package size
Category : Categorised by Weight or Volume or None
DeptCatCd : Code based on Product
coupon_upc : Uniquely identifies each coupon
campaign : Uniquely identifies each campaign

Exploratory Data Analysis

The data gathered has been used to answer key questions regarding Customer spending trends over time and Product sales. It will also help us analyse the effect of different Marketing campaigns on Product sales.
We have divided the data using data manipulation techniques like filter, select and mutate to check coupon wise transaction details of customers.

To understand the effect of marketing campaigns we have studied the time period during and after the campaign and analyse the customer purchase data with discount amounts less than 0.
To analyse the customer attributes, we have focussed on the Household table and look at variables like age, income bracket, marital status, etc.
Histogram plots were required to study product sales for different campaigns.
We have 2 tables for study , one concentrates on the demographic attributes , products and purchases while other concentrates on the coupons being redeemed under specific campaigns and the products being covered by it. We are still working on collaborating the two tables to extrapolate better analysis which in turn will help us to make more accurate and beneficial proposals for our customer.

Analysing Total Expenditure Department wise

#--------Department wise Total Expenditure----------------
dept_ana<-demo_prod_final%>%
  select(department,actual_prod_price)%>%
  group_by(as.factor(department))%>%
  summarise(sum(actual_prod_price,na.rm=T))%>%
  magrittr::set_colnames(c("Department","Total Expenditure"))%>%
  arrange(desc(`Total Expenditure`))%>%
  filter(!(`Total Expenditure`=='Inf'))%>%
  top_n(40)


ggplot(dept_ana, aes(x=Department, y=`Total Expenditure`)) + 
  geom_point(size=3) + 
  geom_segment(aes(x=Department, 
                   xend=Department, 
                   y=0, 
                   yend=`Total Expenditure`)) + 
  labs(title="Lollipop Chart", 
       subtitle="Department Vs Total Expenditure") + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

The Lollipop Graph clearly shows spikes in the data for certain departments like Nutrition, Housewares, Store Supplies, etc. as these are necessary products
Marketing campaigns should be tailored to increase sales of products from other departments as well.

Analysis of Sales of Customers for different Salary Brackets

#-------------------------Sales based on Salary Bracket-------------------

Sal_Bracket <- c("Under 15K" , "15-24K" , "25-34K" , "35-49K" , "50-74K" , "75-99K" , "100-124K" , "125-149K" , "150-174K" , "175-199K" , "200-249K" , "250K+")
Total_Sales <- c(inc_1 , inc_2 , inc_3 , inc_4 , inc_5 , inc_6 , inc_7 , inc_8 , inc_9 , inc_10 , inc_11 , inc_12)
tbs <- data.frame(Sal_Bracket , Total_Sales)
require(scales)
ggplot(tbs , aes(x = Sal_Bracket , y = Total_Sales , group = 1)) + geom_point() + geom_line(col = "red") + scale_y_continuous(labels = comma) + scale_x_discrete(limits = Sal_Bracket)

The data has been segregated into different salary brackets.
We can observe that the individuals within the salary bracket of $50-74k have spent the most.
This will be beneficial for us to identify our target audience
It is also interesting to note that individuals/households with high salaries have not spent as much as expected Analysis of Plots for different Campaigns(A,B and C)

#-------------Plot of different categories of campaigns------------------


ggplot(coup_red, aes(campaign, fill = `Coup_cat`)) + 
  geom_histogram(alpha = 0.5, aes(y = ..density..), position = 'identity') + labs(title = "Density Description Of All The Campaigns" , subtitle = "Grouped By Campagin Type" , x = "Camapign Number" , y = "Density") + scale_fill_manual(values = c("Type A" = "red", "Type B" = "blue", "Type C" = "green"))

Campaign B was the clear winner with maximum number of coupons being redeemed from the customers belonging to this category.
A probable reason can be due to the time this marketing campaign was initiated which favored alot of customers

Sales Analysis on the basis of Time(every hour)

#---------------------------Time based Sales Analysis------------------------

demo_prod_final_time <- demo_prod_final
demo_prod_final_time$trans_time <- gsub( ":" , "" , demo_prod_final_time$trans_time)
min(demo_prod_final_time$trans_time)

## [1] "0000"

max(demo_prod_final_time$trans_time)

## [1] "2359"

t1 <- demo_prod_final_time %>% 
  filter(trans_time > 0000 & trans_time < 0100)%>% 
  select(sales_value)
ts1 <- sum(t1)
t2 <- demo_prod_final_time %>% 
  filter(trans_time > 0100 & trans_time < 0200)%>% 
  select(sales_value)
ts2 <- sum(t2)
t3 <- demo_prod_final_time %>% 
  filter(trans_time > 0200 & trans_time < 0300)%>% 
  select(sales_value)
ts3 <- sum(t3)
t11 <- demo_prod_final_time %>% 
  filter(trans_time > 1000 & trans_time < 1100) %>% 
  select(sales_value)
ts11 <- sum(t11)
t12 <- demo_prod_final_time %>% 
  filter(trans_time > 1100 & trans_time < 1200) %>% 
  select(sales_value)
ts12 <- sum(t12)
t13 <- demo_prod_final_time %>% 
  filter(trans_time > 1200 & trans_time < 1300) %>% 
  select(sales_value)
ts13 <- sum(t13)
t14 <- demo_prod_final_time %>% 
  filter(trans_time > 1300 & trans_time < 1400) %>% 
  select(sales_value)
ts14 <- sum(t14)
t15 <- demo_prod_final_time %>% 
  filter(trans_time > 1400 & trans_time < 1500) %>% 
  select(sales_value)
ts15 <- sum(t15)
t16 <- demo_prod_final_time %>% 
  filter(trans_time > 1500 & trans_time < 1600) %>% 
  select(sales_value)
ts16 <- sum(t16)
t17 <- demo_prod_final_time %>% 
  filter(trans_time > 1600 & trans_time < 1700) %>% 
  select(sales_value)
ts17 <- sum(t17)
t18 <- demo_prod_final_time %>% 
  filter(trans_time > 1700 & trans_time < 1800) %>% 
  select(sales_value)
ts18 <- sum(t18)
t19 <- demo_prod_final_time %>% 
  filter(trans_time > 1800 & trans_time < 1900) %>% 
  select(sales_value)
ts19 <- sum(t19)
t20 <- demo_prod_final_time %>% 
  filter(trans_time > 1900 & trans_time < 2000) %>%
  select(sales_value)
ts20 <- sum(t20)
t21 <- demo_prod_final_time %>% 
  filter(trans_time > 2000 & trans_time < 2100) %>% 
  select(sales_value)
ts21 <- sum(t21)
t22 <- demo_prod_final_time %>% 
  filter(trans_time > 2100 & trans_time < 2200) %>% 
  select(sales_value)
ts22 <- sum(t22)
t23 <- demo_prod_final_time %>% 
  filter(trans_time > 2200 & trans_time < 2300) %>% 
  select(sales_value)
ts23 <- sum(t23)
t24 <- demo_prod_final_time %>% 
  filter(trans_time > 2300 & trans_time < 2400) %>% 
  select(sales_value)
ts24 <- sum(t24)
vals <- c(ts1 , ts2 , ts3 , ts11 , ts12 , ts13 , ts14 , ts15 , ts16 , ts17 , ts18 , ts19 , ts20 , ts21 , ts22 , ts23 , ts24)
timehours <- c("12-1 AM" , "1-2 AM" , "2-3 AM"  , "10-11 AM" , "11-12 Noon" , "12-1 PM" , "1-2 PM" , "2-3 PM" , "3-4 PM" , "4-5 PM" , "5-6 PM" , "6-7 PM" , "7-8 PM" , "8-9 PM" , "9-10 PM" , "10-11 PM" , "11-12 Midnight")
tbs1 <- data.frame(timehours , vals)
require(scales)
ggplot(tbs1 , aes(x = timehours , y = vals , group = 1)) + geom_point() + geom_line(col = "red") + scale_y_continuous(labels = comma) + scale_x_discrete(limits = timehours)

We tried to analyse the Total Sale Value based on 1 hour time slots of the day
It was interesting to conclude that maximum transactions occured during midnight between 1 AM and 2 AM and some of the transactions occured during the remaining course of the day.
This can be probably attributed to the fact that the stores shut down in the midnight hours(around 12 AM). The final status of the transaction may get updated in bulk at this time

Average Expenditure during different parts of the day

#-----------------Time-Slot Analysis of Expenditure(Average)-------------
time1<-demo_prod_final%>%
  mutate(TimeBrackets=ifelse(trans_time>=00:00&trans_time<5:00,"Midnight",ifelse(trans_time>=5:00&trans_time<12:00,"Morning",ifelse(trans_time>=12:00&trans_time<19:00,"Afternoon/Evening","Night"))))%>%
  select(TimeBrackets,actual_prod_price)%>%
  filter(!(actual_prod_price=='Inf'|actual_prod_price=='-Inf'|actual_prod_price=='-NaN'))%>%
  group_by(TimeBrackets)%>%
  summarise(mean(actual_prod_price,na.rm=T))%>%
  magrittr::set_colnames(c("Part of the Day","Average Expenditure")) 
time1

## # A tibble: 4 x 2
##   `Part of the Day` `Average Expenditure`
##   <chr>                             <dbl>
## 1 Afternoon/Evening                  2.03
## 2 Midnight                           2.03
## 3 Morning                            2.02
## 4 Night                              2.02

From the table we can that Average Expenditure per shift per product remains same through out the day

Effect of Campaigns of the Sales

#---------------------- Before and After the campaign analysis-----------------

demo_prod_final_timeseries <- demo_prod_final %>% 
  filter(day < 224)
demo_prod_final_duringcampaign <- demo_prod_final %>% 
  filter(day > 224 & day < 720)
#Before Campaign
ggplot(demo_prod_final_timeseries , aes(x = day , y = sales_value , group = `brand` , colour = `brand`)) + geom_point() + geom_line()

#After Campaign
ggplot(demo_prod_final_duringcampaign , aes(x = day , y = sales_value , group = `brand` , colour = `brand`)) + geom_point() + geom_line()

We have divided our data into 2 parts i.e. before the campaigns began and after the first campaign started which is Day 224 according to the data.
Upon close observation, we can clearly conclude that there has been a considerable rise in the total sales value which reinstates the fact that such marketing campaigns do have a positive impact on customer sentiment towards purchase of products.
Additionally, we can also conclude that National brands have a higher Sales value compared to Private and other unknown brands

Trend in Sales w.r.t Marketing Campaigns

#---Expenditure Trend before and after Marketing Campaign------
timeseries<-demo_prod_final%>%
  select(day,actual_prod_price)%>%
  group_by(day)%>%
  summarise(sum(actual_prod_price,na.rm=T))%>%
  magrittr::set_colnames(c("Day","Total Expenditure"))%>%
  filter(!(`Total Expenditure`=='Inf'))


ggplot(timeseries, aes(x=Day)) + 
  geom_line(aes(y=`Total Expenditure`),col='Red') + 
  labs(title="Day vs Total Expenditure Trend", 
       y="Expenditure") +  # title and caption
  stat_smooth(aes(y=`Total Expenditure`)) +
  theme(axis.text.x = element_text(angle = 90, vjust=0.5),  # rotate x axis text
        panel.grid.minor = element_blank()) + # turn off minor grid
  geom_vline(xintercept = 224, 
             color = "black", size=1)

This is the final plot. It shows the trend in total Expenditure
The trendline helps us in analysing our final goal of understanding the importance of marketing campaigns. The average of the total Expenditure post campaigning is higher. Even though it doesnt increase significantly, it shows slight increase towards the end.
The black line represents the start of campaigns
The intial increase in Expenditure can be associated with the fact that customers start spending in order to become eligible for the coupons based on their purchase mechanisms. Such customer behaviour should definitely be taken into account while designing Marketing Strategies

Summary

The problem statement has been addressed by studying and analysing the effect of company sales before and after campaigns. We also analysed the buying behaviour of customers having different attributes.
We analysed the effect of different campaigns, different attributes(salary bracket), customer sentiment after campaigns were started, and also its effect on different departents and timeslots.
From the graphical analysis, purchasing behaviour doesn’t get affected by the salaries of individuals; we see an abnormally high transaction value between 1 am to 2 am and finally the campaigns positively affected the purchasing behaviour of customers.
Give more focus on departments that have not been positively affected by campaigns, enhance customer base by targeting audience who were not affected by these campaigns as well. Consider increasing the duration of such campaigns as they tend to have positive effects on customer purchasing behaviour
We have not considered the positioning of the products in the stores which can play an instrumental role in driving sales