DAM Project Group 1

Dimensions of the data

promotion.df <- read.csv("PromotionDataV4.csv")
attach(promotion.df)
##Number of Orders and total number of columns
dim(promotion.df)

## [1] 45898    29

Number of Brands, Subcategories and unique order IDs

##Number of Unique Brands
length(unique(Brand))

## [1] 10

##Number of subcategories
length(unique(SubCategory))

## [1] 58

##Number of Order IDs
length(unique(OrderID))

## [1] 34913

Number of Shipping Cities, Pincodes and States

##Number of Shipping Cities
length(unique(ShippingCity))

## [1] 791

##Number of Shipping Pincodes
length(unique(ShippingPincode))

## [1] 2818

##Number of Shipping States
length(unique(ShippingState))

## [1] 32

Number of COD and Electronic orders

##Number of COD orders
cod.df <- subset(promotion.df, COD=='1')
nrow(cod.df)

## [1] 28144

##Number of Electronic payments
nocod.df <- subset(promotion.df, COD=='0')
nrow(nocod.df)

## [1] 17754

COD vs NON-COD distribution

table(COD)

## COD
##     0     1 
## 17754 28144

round(prop.table(table(COD))*100,2)

## COD
##     0     1 
## 38.68 61.32

Vendor Dicount Vs No Vendor Discount

table(HasVendorDiscount)

## HasVendorDiscount
##     0     1 
## 22682 23216

round(prop.table(table(HasVendorDiscount))*100,2)

## HasVendorDiscount
##     0     1 
## 49.42 50.58

Website discount Vs No Website Discount

table(HasWebsiteDiscount)

## HasWebsiteDiscount
##     0     1 
## 29871 16027

round(prop.table(table(HasWebsiteDiscount))*100,2)

## HasWebsiteDiscount
##     0     1 
## 65.08 34.92

Classification of Orders according to COD, Vendor and Website dicount

table(COD,HasVendorDiscount,HasWebsiteDiscount)

## , , HasWebsiteDiscount = 0
## 
##    HasVendorDiscount
## COD     0     1
##   0  4292  6350
##   1  8244 10985
## 
## , , HasWebsiteDiscount = 1
## 
##    HasVendorDiscount
## COD     0     1
##   0  4473  2639
##   1  5673  3242

round(prop.table(table(COD,HasVendorDiscount,HasWebsiteDiscount))*100,2)

## , , HasWebsiteDiscount = 0
## 
##    HasVendorDiscount
## COD     0     1
##   0  9.35 13.84
##   1 17.96 23.93
## 
## , , HasWebsiteDiscount = 1
## 
##    HasVendorDiscount
## COD     0     1
##   0  9.75  5.75
##   1 12.36  7.06

Boxplot of different combinations of COD, Vendor and Website dicount Vs Final Total Price

boxplot(FinalTotalPrice ~ COD+HasVendorDiscount+HasWebsiteDiscount, 
        horizontal = TRUE, xlab = "Final Total Price",
        ylab = "COD + VENDOR + WEBSITE")

This boxplot shows that when a product has COD option, website discount but no vendor discount, the total price is the highest.

Number of orders for each Brand

lens <- as.data.frame(aggregate(FinalTotalPrice, by = list(Brand),length))
lens[order(-lens[,2]),]

##             Group.1    x
## 4             HARPA 7271
## 3        GRITSTONES 6901
## 2          FABALLEY 6899
## 7  MONTEIL & MUNERO 6539
## 5             MEIRA 5422
## 9         THE VANCA 4804
## 1            ATHENA 4140
## 6        MISS CHASE 1678
## 10   TSHIRT COMPANY 1396
## 8         MR BUTTON  848

plot(lens,type ="s")

The above table shows that HARPA (7271) has the greatest number of orders followed by GRITSTONES (6901), FABALLEY (6899) and MONTEIL & MUNERO (6539)

Total revenue per Brand

rev <- as.data.frame(aggregate(FinalTotalPrice, by = list(Brand),sum))
rev[order(-rev[,2]),]

##             Group.1       x
## 2          FABALLEY 6169716
## 4             HARPA 5768717
## 7  MONTEIL & MUNERO 4246537
## 3        GRITSTONES 4089553
## 1            ATHENA 3945855
## 9         THE VANCA 3463846
## 5             MEIRA 2960980
## 8         MR BUTTON 1488278
## 6        MISS CHASE 1431303
## 10   TSHIRT COMPANY 1078742

The Table aboove shows that Faballey has the greatest revenue from all the listed brands

Avergae revenue per Brand per order

avgrev <- as.data.frame(aggregate(FinalTotalPrice, by = list(Brand),mean))
avgrev[order(-avgrev[,2]),]

##             Group.1         x
## 8         MR BUTTON 1755.0447
## 1            ATHENA  953.1052
## 2          FABALLEY  894.2914
## 6        MISS CHASE  852.9814
## 4             HARPA  793.3870
## 10   TSHIRT COMPANY  772.7380
## 9         THE VANCA  721.0337
## 7  MONTEIL & MUNERO  649.4169
## 3        GRITSTONES  592.6030
## 5             MEIRA  546.1047

The above table shows that the average revenue per order for MR BUTTON is highest amongst all other brands

Weekly distribution of Orders on the website

weekly <- weekdays(as.POSIXct(OrderDate, format="%m/%d/%Y %H:%M"))
weekly <- as.data.frame(table(weekly))
weekly[order(-weekly[,2]),]

##      weekly Freq
## 1    Friday 9400
## 4    Sunday 8771
## 3  Saturday 8230
## 5  Thursday 6100
## 2    Monday 5171
## 6   Tuesday 4398
## 7 Wednesday 3828

Thus the above table shows that the last three days of the weeks constitute most of the orders and Tuesdays and Wednesdays have very low order made on the site.

City Wise trends

The top 10 cities in terms of number of NON-COD orders are as

city <- aggregate(ShippingCity, by = list(ShippingCity), length)
#city[order(-city[,2]),][1:10,1:2]
table(ShippingCity,COD)[order(-table(ShippingCity,COD)[,1]),][1:10,]

##             COD
## ShippingCity    0    1
##    Bangalore 2674 2230
##    New Delhi 1999 2586
##    Mumbai    1627 2218
##    Pune      1395 1162
##    Gurgaon   1060  944
##    Hyderabad  999 1000
##    Noida      606  536
##    Kolkata    578 1261
##    Chennai    544  476
##    Thane      416  505

The top 10 cities in terms of number of COD orders are as

table(ShippingCity,COD)[order(-table(ShippingCity,COD)[,2]),][1:10,]

##             COD
## ShippingCity    0    1
##    New Delhi 1999 2586
##    Bangalore 2674 2230
##    Mumbai    1627 2218
##    Kolkata    578 1261
##    Pune      1395 1162
##    Hyderabad  999 1000
##    Gurgaon   1060  944
##    Noida      606  536
##    Thane      416  505
##    Chennai    544  476

HARPA exclusive

Lets try and focus on our most famous brand HARPA. As already stated it has the most number of orders and the second best revenue amongst all the brands

harpa.df <- subset(promotion.df, Brand == "HARPA")
nrow(harpa.df)

## [1] 7271

Subcategory orders amongst Harpa

subs <- as.data.frame(aggregate(harpa.df$SubCategory, by = list(harpa.df$SubCategory), length))
subs[order(-subs[,2]),]

##                     Group.1    x
## 19                     TOPS 4958
## 5                   DRESSES 1111
## 22           WINTER JACKETS  425
## 13                   SHRUGS  148
## 6                  JEGGINGS  132
## 17                 SWEATERS  126
## 11                   SHIRTS   93
## 20                 TROUSERS   71
## 15                   SKIRTS   67
## 4               CAPS & HATS   58
## 21                   TUNICS   26
## 10              SHIFT DRESS   22
## 14             SKATER DRESS    9
## 2          ASYMMETRIC DRESS    7
## 1                    3/4THS    3
## 7                 JUMPSUITS    3
## 18                 T SHIRTS    3
## 3             BODYCON DRESS    2
## 8                  LEGGINGS    2
## 12                   SHORTS    2
## 16           SUMMER JACKETS    2
## 9  OFF SHOULDER/TUBE/HALTER    1

Thus tops have been the highest sold items for Harpa with 4958 orders Now lets again categorized this data for finding out which which is the highest MRP Tops so as to be considered for further analysis

MRP

harpatops <- subset(harpa.df, harpa.df$SubCategory == "TOPS")
subsa <- as.data.frame(aggregate(harpatops$MRP, by = list(harpatops$MRP), length))
subsa[order(-subsa[,2]),]

##    Group.1    x
## 13     999 2931
## 11     899  540
## 14    1099  524
## 10     799  256
## 19    1599  222
## 9      749  171
## 12     949   81
## 18    1499   54
## 6      599   45
## 1      299   29
## 8      699   27
## 3      499   26
## 5      559   18
## 2      349    9
## 16    1299    9
## 7      649    8
## 4      550    5
## 15    1199    2
## 17    1399    1

Thus here we can clearly see that Harpa tops with MRP has the greatest number 2931 of orders by the customers. Now let us take a subset of these tops for further analysis.

harpatops999 <-subset(harpatops, harpatops$MRP == "999")
nrow(harpatops999)

## [1] 2931

Let us find the number of orders for different combinations of vendor and website discount

##Number of orders of each combination
table(harpatops999$HasVendorDiscount,harpatops999$HasWebsiteDiscount)

##    
##        0    1
##   0  388 1040
##   1 1114  389

##Percentage of orders of each combination
round(prop.table(table(harpatops999$HasVendorDiscount,harpatops999$HasWebsiteDiscount))*100,2)

##    
##         0     1
##   0 13.24 35.48
##   1 38.01 13.27

Thus whenever there is one or the other exclusive discount, the orders are high as compared to none or both discounts.

Chi-Square test for dependance of discounts

chisq.test(table(harpatops999$HasVendorDiscount,harpatops999$HasWebsiteDiscount))

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(harpatops999$HasVendorDiscount, harpatops999$HasWebsiteDiscount)
## X-squared = 644.11, df = 1, p-value < 2.2e-16

Thus we reject the null hypothesis that Vendor and website are independent

Normality test

Now let us test the normality of the final price of harpatops999

Let us use Shapiro-Wilk normality test

shapiro.test(harpatops999$FinalTotalPrice)

## 
##  Shapiro-Wilk normality test
## 
## data:  harpatops999$FinalTotalPrice
## W = 0.8724, p-value < 2.2e-16

Thus the value of p is less than 0.05 and hence we can reject the null hypothesis that the data is normal

Apple-Apple comparison

Now that we have only apples left in the basket, let us find the effect of COD, Vendor Discounts and Website discount on the Final price of a MRP 999 Harpa Top

Individual effect of COD, Vendor Dicount and Website discount on Final total price

library(coefplot)

## Loading required package: ggplot2

input <- harpatops999[,c("FinalTotalPrice","HasVendorDiscount","HasWebsiteDiscount","COD",
                         "MRP","CODCharge", "VendorDiscount", "WebsiteDiscount" )]
summary(glm(FinalTotalPrice~ HasVendorDiscount + HasWebsiteDiscount + COD, data = input))

## 
## Call:
## glm(formula = FinalTotalPrice ~ HasVendorDiscount + HasWebsiteDiscount + 
##     COD, data = input)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -466.66   -48.86   -12.66    75.29   316.29  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         973.657      4.867 200.046  < 2e-16 ***
## HasVendorDiscount  -332.710      4.474 -74.366  < 2e-16 ***
## HasWebsiteDiscount -190.530      4.490 -42.435  < 2e-16 ***
## COD                  16.243      3.983   4.078 4.66e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 11419.5)
## 
##     Null deviance: 97590916  on 2930  degrees of freedom
## Residual deviance: 33424881  on 2927  degrees of freedom
## AIC: 35708
## 
## Number of Fisher Scoring iterations: 2

coefplot(glm(FinalTotalPrice~ HasVendorDiscount + HasWebsiteDiscount + COD, data = input))

## Warning: Ignoring unknown aesthetics: xmin, xmax

library(visreg)

## Warning: package 'visreg' was built under R version 3.4.3

visreg(glm(FinalTotalPrice~ HasVendorDiscount + HasWebsiteDiscount + COD, data = input))

Interaction effect of COD, Vendor Dicount and Website discount on Final total price

interact <- glm(FinalTotalPrice~ HasVendorDiscount + HasWebsiteDiscount + COD
            + input$HasVendorDiscount:input$HasWebsiteDiscount + input$HasVendorDiscount:COD
            + input$HasWebsiteDiscount:COD + input$HasVendorDiscount:input$HasWebsiteDiscount:COD,data = input)
summary(interact)

## 
## Call:
## glm(formula = FinalTotalPrice ~ HasVendorDiscount + HasWebsiteDiscount + 
##     COD + input$HasVendorDiscount:input$HasWebsiteDiscount + 
##     input$HasVendorDiscount:COD + input$HasWebsiteDiscount:COD + 
##     input$HasVendorDiscount:input$HasWebsiteDiscount:COD, data = input)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -502.93   -26.57     0.00    22.43   297.22  
## 
## Coefficients:
##                                                      Estimate Std. Error
## (Intercept)                                          1048.950      7.883
## HasVendorDiscount                                    -440.983      8.950
## HasWebsiteDiscount                                   -297.219      8.772
## COD                                                    26.566      9.841
## input$HasVendorDiscount:input$HasWebsiteDiscount      281.778     12.109
## COD:input$HasVendorDiscount                            -2.496     11.334
## COD:input$HasWebsiteDiscount                           -8.422     11.427
## COD:input$HasVendorDiscount:input$HasWebsiteDiscount -105.245     15.901
##                                                      t value Pr(>|t|)    
## (Intercept)                                          133.057  < 2e-16 ***
## HasVendorDiscount                                    -49.270  < 2e-16 ***
## HasWebsiteDiscount                                   -33.884  < 2e-16 ***
## COD                                                    2.700  0.00698 ** 
## input$HasVendorDiscount:input$HasWebsiteDiscount      23.270  < 2e-16 ***
## COD:input$HasVendorDiscount                           -0.220  0.82569    
## COD:input$HasWebsiteDiscount                          -0.737  0.46120    
## COD:input$HasVendorDiscount:input$HasWebsiteDiscount  -6.619 4.29e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 8638.77)
## 
##     Null deviance: 97590916  on 2930  degrees of freedom
## Residual deviance: 25251124  on 2923  degrees of freedom
## AIC: 34894
## 
## Number of Fisher Scoring iterations: 2

coefplot(interact,xmin = -400, xmax = 1200)

## Warning: Ignoring unknown aesthetics: xmin, xmax

Thus we can clearly see from this logistic regression model that having vendor or website discount only has a greater impact on bringing down prices for HARPA tops having MRP 999 rather than having both the discounts. Also COD does not affect the choice of the customers when choosig HARPA tops 999 as it has a low coefficient.

Inferences

More number of people are opting for COD. However in top 10 cities, we can easily see that the ratio is more or less for both COD and electronic payments. Thus clearly indicationg that people from small cities and towns opt for COD much more than their urban counterparts. This can lead to a focus on ease of payment options while issuing a COD parcel. Perhaps some mobile device that can swipe cards or accept wallet or online payments while COD would be more helpful as there is often a problem with the exact amount of change. Also the same would be easy to track.
More number of people purchase from the site on the last three days Friday, Saturday and Sunday of the week. Tuesdays and Wednesdays have the least number of orders placed. Perhaps site could run certain additional promotional campaigns like mid-week sales to attract more customers during this period
Also we can clearly figure out that rathere than COD or Electronic payments, it is the Vendor and Website discounts that make the actual change in the purchasing behavior of the people. This can be stated from the visreg diagram.
The Brand GRITSTONES does not have any Vendor discount given by the vendor. However the sales of the brand are less when there is Website discount probably because customers consider it as Permium
Also customers tend to wear black and blue more often than the other colors probably because of the genuinity of these colors in pictures and reality as compared to the variability in other shades.
Website discounts and Vendor discounts have a better impact when combined with COD facility.
The average revenue per brand of MR BUTTON is considerably high as compared to other brands