Objective of the project is to analyze the impact of factors such as brand, type of product etc on the mode of payment. Analyzing the price of the product along the same factors also helps the management to deduce insights on the customer behavior.
There are 29 different variables in the dataset with 45898 observations. One sample observation is listed below
setwd("~/Downloads")
The working directory was changed to /Users/Ramya/Downloads inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the the working directory for notebook chunks.
ecommrough<-read.csv('PromotionDataV4.csv')
colnames(ecommrough)
[1] "OrderItemCode" "OrderID" "OrderDate" "Category"
[5] "SubCategory" "Brand" "ProductColor" "ProductSize"
[9] "MRP" "FinalTotalPrice" "VATPercent" "VAT"
[13] "CODCharge" "VendorDiscount" "WebsiteDiscountCode" "WebsiteDiscount"
[17] "HasVendorDiscount" "HasWebsiteDiscount" "CustomerID" "COD"
[21] "ShippingName" "ShippingCity" "ShippingState" "ShippingPincode"
[25] "ShippingAddressType" "BillingCity" "BillingState" "BillingPincode"
[29] "BillingAddressType"
Unique count of different variables of interest are as below:
x<-c("Brand","Product Category","Customers","Orders")
y<-c(length(unique(ecommrough$Brand)),length(unique(ecommrough$SubCategory)),length(unique(ecommrough$CustomerID)),length(unique(ecommrough$OrderID)))
z<-rbind(x,y)
z
[,1] [,2] [,3] [,4]
x "Brand" "Product Category" "Customers" "Orders"
y "10" "58" "31869" "34913"
Different brands have various price points. As depicted by the below graph, some brands on an average have higher MRP
brandprice<-aggregate(ecommrough$MRP,by=list(Brand=ecommrough$Brand),FUN=mean)
Brandplot <- plot(brandprice$Brand,brandprice$x,type="p",xlab="Brands",ylab="MRP",Main="Brands and their Prices",pch=6)
Different Brands and their categories are as follows:
brandprice$Type<-ifelse(brandprice$x>=1200,brandprice$Type<-"High",ifelse(brandprice$x>=1000,brandprice$Type<-"Medium",brandprice$Type<-"Low"))
ecommrough$BrandCategory<-ifelse(ecommrough$Brand=="Athena" | ecommrough$Brand=="MISS CHASE" | ecommrough$Brand=="MR BUTTON",ecommrough$BrandCategory<-2,ifelse(ecommrough$Brand=="FABALLEY" | ecommrough$Brand=="HARPA" | ecommrough$Brand=="THE VANCA",ecommrough$BrandCategory<-1,ecommrough$BrandCategory<-0))
brandprice
Effect of Brand on MRP:
Model 1: \(COD = BRAND1*\beta1 + BRAND2*\beta2+.....BRAND10*\beta10+\epsilon\)
Model 2: \(COD = BrandCategory1*\beta1 + BrandCategory2*\beta2+BrandCategory3*\beta3+\epsilon\)
d1<-ecommrough[,c(6,20)]
Model1<-glm(COD ~.,family=binomial(link='logit'),data=d1)
anova(Model1, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: COD
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 45897 61256
Brand 9 223.76 45888 61032 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
d2<-ecommrough[,c(20,30)]
Model2<-glm(COD ~.,family=binomial(link='logit'),data=d2)
anova(Model2, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: COD
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 45897 61256
BrandCategory 2 136.75 45895 61119 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model2 has lesser deviation and is hence a better model. Hence rather than considering independent brands it is better to consider brand categories (high/medium/low)
z<-cbind(aggregate(ecommrough$COD,by=list(ecommrough$BrandCategory),FUN=sum),aggregate(ecommrough$BrandCategory,by=list(ecommrough$BrandCategory),length)[,2])
z$CODShare<-z[,c(2)]*100/z[,c(3)]
colnames(z)<-c("Brand Category", "COD Purchases", "Total Purchases", "% of COD Purchases")
z
It can be noted that as the value of Brand increases COD preference is also increasing.Hence customers prefer COD for more expensive brands
There are two categories of customer based on the frequency: a) Repeat Customer b)Non-Repeating Customer. Since the data is for a limited period, it is not necessary that a non-repeating customer is actually a new customer.
x<-ecommrough$CustomerID[duplicated(ecommrough$CustomerID)]
ecommrough$RepeatCustomer<-ifelse(is.element(ecommrough$CustomerID,x),1,0)
boxplot(ecommrough$MRP~ecommrough$RepeatCustomer,xlab="Repeat Customer",ylab="MRP",Main=" Prices and Customer Frequency",col="powder blue",ylim=c(0,3000))
Model 3: \(COD = \alpha+RepeatCustomer*\beta +\epsilon\)
Model 4: \(MRP = \alpha+RepeatCustomer*\beta + \epsilon\)
d3<-ecommrough[,c(20,31)]
Model3<-glm(COD ~.,family=binomial(link='logit'),data=d3)
anova(Model3, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: COD
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 45897 61256
RepeatCustomer 1 1.7522 45896 61254 0.1856
d4<-ecommrough[,c(9,31)]
Model4<-lm(MRP~RepeatCustomer,data=d4)
summary(Model4)
Call:
lm(formula = MRP ~ RepeatCustomer, data = d4)
Residuals:
Min 1Q Median 3Q Max
-794.7 -335.4 -94.7 256.3 10064.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1093.670 3.004 364.13 <2e-16 ***
RepeatCustomer -159.306 4.275 -37.27 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 457.9 on 45896 degrees of freedom
Multiple R-squared: 0.02937, Adjusted R-squared: 0.02935
F-statistic: 1389 on 1 and 45896 DF, p-value: < 2.2e-16
Customer Purchase Frequency significantly impacts the price of product purchased. However the choice on mode of payment is independent of customer purchase frequency. The same is indicated by the share of COD in total purchases of repeat v/s new customers.
z1<-cbind(aggregate(ecommrough$COD,by=list(CustomerFrequency=ecommrough$RepeatCustomer),FUN=sum),aggregate(ecommrough$RepeatCustomer,by=list(ecommrough$RepeatCustomer),length)[,2])
z1$CODCustomer<-z1[,c(2)]*100/z1[,c(3)]
z1$CustomerFrequency<-factor(z1$CustomerFrequency,labels = c("Non-Repeat","Repeat"))
colnames(z1)<-c("Customer Type", "COD Purchases", "Total Purchases", "% of COD Purchases")
z1
price<-aggregate(ecommrough[10],by=list(ecommrough$OrderID),FUN=sum)
demand<-aggregate(ecommrough[2],by=list(ecommrough$OrderID),FUN=length)
ecomm<-merge(price,demand, by.x="Group.1",by.y="Group.1")
ecomm$revenue<-ecomm$FinalTotalPrice*ecomm$OrderID
colnames(ecomm)<-c("Order ID", "Price", "Demand", "Revenue")
COD<-aggregate(ecommrough[20],by=list(ecommrough$OrderID),FUN=mean)
ecomm<-merge(ecomm,COD, by.x="Order ID",by.y="Group.1")
library(gplots)
plotmeans(ecomm$Price~ecomm$COD, legends = c("Non-COD","COD"),mean.labels=TRUE)
shapiro.test(ecomm$Price[1:5000])
Shapiro-Wilk normality test
data: ecomm$Price[1:5000]
W = 0.77411, p-value < 2.2e-16
The data is not normal
boxplot(ecomm$Price~ecomm$COD,names = c("Non-COD","COD"),ylab="Price",main="Average Price v/s Payment Options",col=c("gold","grey"),ylim=c(0,3000))
wilcox.test(ecomm$Price~ecomm$COD)
Wilcoxon rank sum test with continuity correction
data: ecomm$Price by ecomm$COD
W = 134560000, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
Hence we can conclude that there is significant difference in the average product price in payments with/without COD. COD payments are of higher value
library(gplots)
plotmeans(ecomm$Revenue~ecomm$COD, legends = c("Non-COD","COD"),mean.labels=TRUE)
shapiro.test(ecomm$Revenue[1:5000])
Shapiro-Wilk normality test
data: ecomm$Revenue[1:5000]
W = 0.24022, p-value < 2.2e-16
The data is not normal
boxplot(ecomm$Revenue~ecomm$COD,names = c("Non-COD","COD"),ylab="Revenue",ylim=c(0,10000),main="Average Revenue v/s Payment Options",col=c("gold","grey"))
wilcox.test(ecomm$Revenue~ecomm$COD)
Wilcoxon rank sum test with continuity correction
data: ecomm$Revenue by ecomm$COD
W = 136670000, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
We can conclude that the average revenue per order for the firm under with/without COD payment option is different.
Model 5: \(COD = \alpha+Price*\beta +\epsilon\)
Model 6: \(COD = \alpha+Revenue*\beta + \epsilon\)
d5<-ecomm[,c(2,5)]
Model5<-glm(COD ~.,family=binomial(link='logit'),data=d5)
anova(Model5, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: COD
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 34912 46513
Price 1 71.334 34911 46442 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
d6<-ecomm[,c(4,5)]
Model6<-glm(COD ~.,family=binomial(link='logit'),data=d6)
anova(Model6, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: COD
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 34912 46513
Revenue 1 4.0788 34911 46509 0.04343 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Deviance of model 2 is significantly lower. However at 1% significance we cannot reject null hypothesis of model 6 that Revenue and COD option are independent. Hence price is a more statistically significant variable than revenue.
Different products have various price points and some products on an average have higher MRP
Different Products and their categories are as follows:
productprice<-aggregate(ecommrough$MRP,by=list(Product=ecommrough$SubCategory),FUN=mean)
productprice$PriceCategory<-ifelse(productprice$x>=1200,brandprice$PriceCategory<-"High",ifelse(productprice$x>=1000,productprice$PriceCategory<-"Medium",productprice$PriceCategory<-"Low"))
productprice$Frequency<-aggregate(ecommrough$SubCategory,by=list(Product=ecommrough$SubCategory),FUN=length)[,2]
productprice$Frequency<-ifelse(productprice$Frequency>=1500,brandprice$Frequency<-"High",ifelse(productprice$Frequency>=800,productprice$Frequency<-"Medium",productprice$Frequency<-"Low"))
productprice
Below two product categories are of both high price and high frequency purchase and hence of high value.
Highproduct<-subset(productprice,productprice$PriceCategory=="High"&productprice$Frequency=="High")
ecommrough$HotProducts<-ifelse(ecommrough$SubCategory=="CASUAL SHIRTS" | ecommrough$SubCategory=="DRESSES",ecommrough$HotProducts<-1,ecommrough$HotProducts<-0)
Highproduct
Effect of high value and frequently purchased products on the payment decision
Model 7:
\(COD = \alpha+ProductCategory*\beta +\epsilon\)
d1<-ecommrough[,c(20,32)]
Model7<-glm(COD ~.,family=binomial(link='logit'),data=d1)
anova(Model7, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: COD
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 45897 61256
HotProducts 1 64.192 45896 61191 1.129e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hence, high value product categories do have a significant impact on the purchase decision
z2<-cbind(aggregate(ecommrough$COD,by=list(HighValueProducts=ecommrough$HotProducts),FUN=sum),aggregate(ecommrough$HotProducts,by=list(ecommrough$HotProducts),length)[,2])
z2$CODCustomer<-z2[,c(2)]*100/z2[,c(3)]
z2$HighValueProducts<-factor(z2$HighValueProducts,labels = c("Low Value","High Value"))
colnames(z2)<-c("Product Class", "COD Purchases", "Total Purchases", "% of COD Purchases")
z2
It can be noted that as the value of Brand increases COD preference is also increasing.
If billing and shipping address aren’t the same, maybe the customer is buying it for someone as a gift and hence wouldn’t prefer COD. Below we test the same hypothesis
Model 8: \(COD = \alpha+AddressType*\beta +\epsilon\)
bs<-ecommrough[,c(20,34)]
model8 <- glm(COD ~.,family=binomial(link='logit'),data=bs)
anova(model8, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: COD
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 45897 61256
address 1 16.191 45896 61239 5.728e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hence whether or not billing and shipping address is the same significantly impacts the payment mode choice.
z3<-cbind(aggregate(ecommrough$COD,by=list(AddressType=ecommrough$address),FUN=sum),aggregate(ecommrough$address,by=list(ecommrough$address),length)[,2])
z3$CODAddress<-z3[,c(2)]*100/z3[,c(3)]
z3$AddressType-factor(z3$AddressType,levels=c(0,1),labels = c("Not Same","Same"))
'-' not meaningful for factors
[1] NA NA
colnames(z3)<-c("Address Type", "COD Purchases", "Total Purchases", "% of COD Purchases")
z3
Contrary to expectation when the address type is not the same, COD is more preferred. This could mean that there are multiple users possibly of the samehousehold using a single account with different delivery addresses.
Model 9: \(COD = \alpha+CityType*\beta +\epsilon\)
metrocity<-ecommrough[,c(20,33)]
model9 <- glm(COD ~.,family=binomial(link='logit'),data=metrocity)
anova(model9, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: COD
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 45897 61256
metro 1 788.61 45896 60467 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
z4<-cbind(aggregate(ecommrough$COD,by=list(MetroCity=ecommrough$metro),FUN=sum),aggregate(ecommrough$metro,by=list(ecommrough$metro),length)[,2])
z4$CODmetro<-z4[,c(2)]*100/z4[,c(3)]
z4$MetroCity<-factor(z4$MetroCity,labels = c("Not Metro","Metro"))
colnames(z4)<-c("City Type", "COD Purchases", "Total Purchases", "% of COD Purchases")
z4
From both the statistical test and the table above it can be clearly observed that the incidence of COD as a choice of payment is much higher in non-metro cities than in metro cities.
Model 10: \(COD = \alpha+HasVendorDiscount*\beta1+HasWebsiteDiscount*\beta2+HasVendorDiscount:HasWebsiteDiscount*\beta3 +\epsilon\)
Discountdata<-ecommrough[,c(17,18,20)]
Discountdata$Interaction<-Discountdata$HasVendorDiscount*Discountdata$HasWebsiteDiscount
model10 <- glm(COD ~.,family=binomial(link='logit'),data=Discountdata)
anova(model10, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: COD
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 45897 61256
HasVendorDiscount 1 0.03 45896 61256 0.86724
HasWebsiteDiscount 1 350.40 45895 60905 < 2e-16 ***
Interaction 1 3.14 45894 60902 0.07657 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hence only website discount alone affects the choice of COD payments.
Based on the above analysis below is the summary of our statiscal inferences:
Model1 : Brand affects the COD payment choice
Model2 : Brand Category(High/Medium/Low cost) affects the COD payment choice more significantly than the effect of the individual brand(Model1)
Model3 : Whether a customer is a repeat customer or not doesn’t affect their COD payment choice
Model4 : Repeat Customers have significantly lesser cart value than first time purchasers
Model5 : Final Price of the product affects the COD payment choices
Model6 : At 1% significance level revenue doesn’t affect COD payment choice
Model7 : High Value Product subcategories(High Costs & High frequency) are purchased more over COD
Model8 : Contrary to expectation, COD is more preferred when billing and shipping address is not the same
Model9 : Whether or not the shipping is to a metro city, significantly impacts the COD payment choice
Model10: Vendor Discount doesn’t affect COD only website discount without any interaction effect with vendor discount affects COD payment choice
A single model is created based on the above inferences and tested for its accuracy.
regressfcn<-ecommrough[,c(30,10,32,33,34,20,18)]
regressfcn$productcategory_metro<-regressfcn$HotProducts*regressfcn$metro
regressfcn$brandcategory_metro<-regressfcn$BrandCategory*regressfcn$metro
regressfcn$productcategory_websitediscount<-regressfcn$HotProducts*regressfcn$HasWebsiteDiscount
regressfcn$brandcategory_websitediscount<-regressfcn$BrandCategory*regressfcn$HasWebsiteDiscount
train <- regressfcn[c(1:15000),]
test <- regressfcn[c(15001:45898),]
model <- glm(COD ~.,family=binomial(link='logit'),data=train)
anova(model, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: COD
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 14999 20265
BrandCategory 1 35.208 14998 20230 2.962e-09 ***
FinalTotalPrice 1 183.351 14997 20046 < 2.2e-16 ***
HotProducts 1 0.618 14996 20046 0.4320
metro 1 290.742 14995 19755 < 2.2e-16 ***
address 1 16.796 14994 19738 4.161e-05 ***
HasWebsiteDiscount 1 85.816 14993 19652 < 2.2e-16 ***
productcategory_metro 1 2.039 14992 19650 0.1533
brandcategory_metro 1 0.107 14991 19650 0.7433
productcategory_websitediscount 1 4.274 14990 19646 0.0387 *
brandcategory_websitediscount 1 0.832 14989 19645 0.3616
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the significance level, the model is re-adjusted as follows: \(COD = \alpha+BrandCategory1*\beta1+BrandCategory2*\beta2+FinalTotalPrice*\beta3+CityType*\beta4+AddressType*\beta5+HasWebsiteDiscount*\beta6 +\epsilon\)
regressfcn<-ecommrough[,c(30,10,33,34,20,18)]
train <- regressfcn[c(1:15000),]
test <- regressfcn[c(15001:45898),]
model <- glm(COD ~.,family=binomial(link='logit'),data=train)
anova(model, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: COD
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 14999 20265
BrandCategory 1 35.208 14998 20230 2.962e-09 ***
FinalTotalPrice 1 183.351 14997 20046 < 2.2e-16 ***
metro 1 290.732 14996 19756 < 2.2e-16 ***
address 1 16.828 14995 19739 4.093e-05 ***
HasWebsiteDiscount 1 85.525 14994 19653 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fitted.results <- predict(model,test,type='response')
install.packages("ROCR")
Error in install.packages : Updating loaded packages
library(ROCR)
pred<-prediction(fitted.results,ecommrough$COD[c(15001:45898)])
plot(performance(pred,"acc"))
roc<-performance(pred,"tpr","fpr")
plot(roc,colorize=T,main="ROC Curve",ylab="Sensitivity",xlab="1-Specificity")
abline(a=0,b=1)
auc<-performance(pred,"auc")
auc<-unlist(slot(auc,"y.values"))
auc<-round(auc,4)
legend(0.6,0.2,auc,title = "AUC")
fitted.results <- predict(model,test,type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != test$COD)
print(paste('Accuracy',1-misClasificError))
[1] "Accuracy 0.623373681144411"
The model can hence with 62% accuracy predict if the customer would opt for COD or electronic mode of payment.
Key Managerial Implications:
High price brand categories are usually bought through COD. Hence by offering some discount and mandating electronic payments, the firm can make more customers pay electronically.
The firm needs to have targetted promotions in Tier-II cities so that these customers shift to electronic payments.
Final Price affects the choice od mode of payment. This is typically because customer wants to see, touch and feel the product before the payment. Adding cutting edge features to website like CV based applications would instill confidence in the customers.
More Website Discounts affect the payments more. The firm can discount on behalf of vendors and ask the vendors to appropriate the share as website discounts impact electronic payments positively.
When billing and shipping address is not the same customer order increasingly through COD implying that multiple users might be using the same account. By giving incentives such as anniversary discounts, personalized suggestions or notifications website can drive more individual accounts increasing their user count as well
High average price product categories which are also bought frequently such as Casual Shirts, Dresses etc do not impact customers COD choices implying customers do not mind spending extra without increasingly preferring COD. Hence these sub categories are of greater significance for the company