DAM Project

0.Initial Analysis

Combine data based on order ID to analyze average price, demand and revenue per customer

Initial Analysis

round(mean(ecomm$Price),2)

[1] 992.28

round(mean(ecomm$Demand),1)

[1] 1.3

round(mean(ecomm$Revenue),2)

[1] 1626.13

plot of chunk unnamed-chunk-3

Initial Analysis

var(ecomm$Price)

[1] 389637.2

var(ecomm$Demand)

[1] 0.5241948

var(ecomm$Revenue)

[1] 8760974

plot of chunk unnamed-chunk-5

1.Pricing

Analysis: Testing if statistcally significant difference exists between price paid per OrderID with/without COD payment option

Pricing (1/2)

plot of chunk unnamed-chunk-6

aggregate(ecomm$Price, by=list(ecomm$COD), FUN=mean)

  Group.1         x
1       0  957.0355
2       1 1014.2809

aggregate(ecomm$Price, by=list(ecomm$COD), FUN=var)

  Group.1        x
1       0 346743.0
2       1 415168.1

Pricing (1/2)

shapiro.test(ecomm$Price[1:5000])


    Shapiro-Wilk normality test

data:  ecomm$Price[1:5000]
W = 0.77411, p-value < 2.2e-16

The data is not normal plot of chunk unnamed-chunk-9

wilcox.test(ecomm$Price~ecomm$COD)


    Wilcoxon rank sum test with continuity correction

data:  ecomm$Price by ecomm$COD
W = 134560000, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

Hence we can conclude that there is significant difference in the average product price in payments with/without COD. COD payments are of higher value

2. Demand

Analysis: Testing if statistcally significant difference exists between no.of products sold with/without COD payment option

Demand

plot of chunk unnamed-chunk-11

aggregate(ecomm$Demand, by=list(ecomm$COD), FUN=mean)

  Group.1        x
1       0 1.323247
2       1 1.309267

aggregate(ecomm$Demand, by=list(ecomm$COD), FUN=var)

  Group.1         x
1       0 0.5333243
2       1 0.5184459

Demand

shapiro.test(ecomm$Demand[1:5000])


    Shapiro-Wilk normality test

data:  ecomm$Demand[1:5000]
W = 0.50564, p-value < 2.2e-16

The data is not normal plot of chunk unnamed-chunk-14

wilcox.test(ecomm$Demand~ecomm$COD)


    Wilcoxon rank sum test with continuity correction

data:  ecomm$Demand by ecomm$COD
W = 145840000, p-value = 0.01286
alternative hypothesis: true location shift is not equal to 0

At 1% significance level, we cannot reject that ordersize with/without COD are same.

3.Revenue

Analysis: Testing if statistcally significant difference exists between average revenue earnt per orderID with/without COD payment option

Revenue

plot of chunk unnamed-chunk-16

aggregate(ecomm$Revenue, by=list(ecomm$COD), FUN=mean)

  Group.1        x
1       0 1586.547
2       1 1650.838

aggregate(ecomm$Revenue, by=list(ecomm$COD), FUN=var)

  Group.1       x
1       0 9784098
2       1 8121215

Revenue

shapiro.test(ecomm$Revenue[1:5000])


    Shapiro-Wilk normality test

data:  ecomm$Revenue[1:5000]
W = 0.24022, p-value < 2.2e-16

The data is not normal plot of chunk unnamed-chunk-19

wilcox.test(ecomm$Revenue~ecomm$COD)


    Wilcoxon rank sum test with continuity correction

data:  ecomm$Revenue by ecomm$COD
W = 136670000, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

We can conclude that the average revenue pe order for the firm under with/without COD payment option is different.

4. Consumer Behavior

Analysis Areas:

Discounts provided
Brands
Product Type
COD Charges
Billing and shipping address
Time of Purchase
Metro v/s Non-Metro

Brands

plot of chunk unnamed-chunk-21

Brand_aov<-aov(Brands$FinalTotalPrice~Brands$Brand)
summary(Brand_aov)

                Df    Sum Sq   Mean Sq F value Pr(>F)    
Brands$Brand     9 1.669e+09 185411790    2268 <2e-16 ***
Residuals    45888 3.751e+09     81737                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Brands

Brands$Discount<-Brands$VendorDiscount+Brands$WebsiteDiscount
Brand_aov1<-aov(Brands$Discount~Brands$Brand)
summary(Brand_aov1)

                Df    Sum Sq   Mean Sq F value Pr(>F)    
Brands$Brand     9 1.148e+09 127513567    2565 <2e-16 ***
Residuals    45888 2.281e+09     49709                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Brands$COD<-factor(Brands$COD)
Brand_aov2<-aov(Brands$FinalTotalPrice~Brands$Brand+Brands$COD+Brands$Brand:Brands$COD)
summary(Brand_aov2)

                           Df    Sum Sq   Mean Sq F value Pr(>F)    
Brands$Brand                9 1.669e+09 185411790 2291.29 <2e-16 ***
Brands$COD                  1 2.686e+07  26862010  331.96 <2e-16 ***
Brands$Brand:Brands$COD     9 1.144e+07   1271426   15.71 <2e-16 ***
Residuals               45878 3.712e+09     80920                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Brands

Top Expensive Brands sold by the brand

  subset.AvgBrandPrice.Group.1..AvgBrandPrice.FinalTotalPrice...
1                                                         ATHENA
2                                                       FABALLEY
3                                                     MISS CHASE
4                                                      MR BUTTON

Products

Products<-ecommrough[c(5,10,14,16,20)]
Products$Discount<-Products$VendorDiscount+Products$WebsiteDiscount
Product_aov<-aov(Products$FinalTotalPrice~Products$SubCategory)
summary(Product_aov)

                        Df    Sum Sq  Mean Sq F value Pr(>F)    
Products$SubCategory    57 2.093e+09 36714202   505.9 <2e-16 ***
Residuals            45840 3.327e+09    72573                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

AvgProdPrice<-aggregate(Products,by=list(Products$SubCategory),FUN=mean)
AvgProdPrice_Top<-data.frame(subset(AvgProdPrice,AvgProdPrice$FinalTotalPrice>2000))
View(AvgProdPrice_Top)
Products$COD<-factor(Products$COD)
Product_aov1<-aov(Products$Discount~Products$SubCategory)
summary(Product_aov1)

                        Df    Sum Sq  Mean Sq F value Pr(>F)    
Products$SubCategory    57 1.397e+09 24500557   552.7 <2e-16 ***
Residuals            45840 2.032e+09    44331                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Products

Product_aov2<-aov(Products$FinalTotalPrice~Products$SubCategory+Products$COD+Products$SubCategory:Products$COD)
summary(Product_aov2)

                                     Df    Sum Sq  Mean Sq F value
Products$SubCategory                 57 2.093e+09 36714202 509.801
Products$COD                          1 1.764e+07 17636017 244.888
Products$SubCategory:Products$COD    49 1.149e+07   234429   3.255
Residuals                         45790 3.298e+09    72017        
                                    Pr(>F)    
Products$SubCategory               < 2e-16 ***
Products$COD                       < 2e-16 ***
Products$SubCategory:Products$COD 1.36e-13 ***
Residuals                                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  subset.AvgProdPrice.Group.1..AvgProdPrice.Discount...1000.
1                                             CASUAL JACKETS
2                                          JACKETS & BLAZERS
3                                                      SUITS

  subset.AvgProdPrice.Group.1..AvgProdPrice.WebsiteDiscount...500.
1                                                   ETHNIC JACKETS
2                                                    FORMAL SHIRTS

Discount

discount<-ecommrough[,c(17,18,20)]
model <- glm(COD ~.,family=binomial(link='logit'),data=discount)
anova(model, test="Chisq")

Analysis of Deviance Table

Model: binomial, link: logit

Response: COD

Terms added sequentially (first to last)

                   Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
NULL                               45897      61256             
HasVendorDiscount   1     0.03     45896      61256   0.8672    
HasWebsiteDiscount  1   350.40     45895      60905   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Billing & Shipping Address

bs<-ecommrough[,c(20,31)]
model <- glm(COD ~.,family=binomial(link='logit'),data=bs)
anova(model, test="Chisq")

Analysis of Deviance Table

Model: binomial, link: logit

Response: COD

Terms added sequentially (first to last)

        Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                    45897      61256              
address  1   16.191     45896      61239 5.728e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Time of Purchase

timing<-ecommrough[,c(20,32)]
model <- glm(COD ~.,family=binomial(link='logit'),data=timing)
anova(model, test="Chisq")

Analysis of Deviance Table

Model: binomial, link: logit

Response: COD

Terms added sequentially (first to last)

     Df Deviance Resid. Df Resid. Dev Pr(>Chi)   
NULL                 45897      61256            
time  1   10.336     45896      61245 0.001305 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Metro

metrocity<-ecommrough[,c(20,30)]
model <- glm(COD ~.,family=binomial(link='logit'),data=metrocity)
anova(model, test="Chisq")

Analysis of Deviance Table

Model: binomial, link: logit

Response: COD

Terms added sequentially (first to last)

      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                  45897      61256              
metro  1   788.61     45896      60467 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6. Regressions

Logistic Regression based on the following factors

Total order cost
Brand
Product
Vendor and Website Discount
Billing and shipping address
Metro v/s Non-Metro

Regressions (1/2)

regressfcn<-ecommrough[,c(10,13,14,16,20,30,31)]
train <- regressfcn[c(1:15000),]
test <- regressfcn[c(15001:45898),]
model <- glm(COD ~.,family=binomial(link='logit'),data=train)
anova(model, test="Chisq")

Analysis of Deviance Table

Model: binomial, link: logit

Response: COD

Terms added sequentially (first to last)

                Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                            14999      20265              
FinalTotalPrice  1     93.5     14998      20171 < 2.2e-16 ***
CODCharge        1   4280.0     14997      15891 < 2.2e-16 ***
VendorDiscount   1     49.2     14996      15842 2.294e-12 ***
WebsiteDiscount  1     16.8     14995      15825 4.133e-05 ***
metro            1    184.9     14994      15640 < 2.2e-16 ***
address          1     25.3     14993      15615 4.907e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Regressions

fitted.results <- predict(model,test,type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)

misClasificError <- mean(fitted.results != test$COD)
print(paste('Accuracy',1-misClasificError))

[1] "Accuracy 0.691468703475953"

Ending

alt text