DAM Project Submission (AY 2017-18)

Our key findings

31869 customers were unique (Total Transactions 45898)
App05, Happy30, ICICV500JG, NEW30, NEW25, RUSH30, SALE30, STYLE25, STYLE30, WOW30 were the key website discount coupon codes
Tops and dresses were the key subcategories (demand wise)
Maharashtra, Karnatka, Delhi and UP were the top states (Transaction wise)
New Delhi, Bengaluru, Mumbai, Pune, Hyderabad and Kolkata were the top cities (Transaction Wise)
Black was the most demanded color in apparel category; followed by blue
Medium followed by small were the most ordered sizes by customers
Harpa followed by Gritsones, Faballey and Moneteil and Munero were the most ordered brands
Gritstones was the only brand without any vendor discount

Decision Tree Analysis

As most of the data was categorical in nature, we used decision analysis to find out the relationship among variables. We tried to analyse the impact of website discount as well as vendor discount on brand sales (all the possible combinations were tested). We even tried to test if the brand sales exhibits any relationship with mode of payment or not. R packages which are installed to do the analysis are party and partykit

Please click the code button to expand the sections that house the R code used in our analysis.

# DECISION TREES

rm(list=ls());
# Installing package party and partykit to run decision trees
library(party);
library(partykit)
setwd("~/code/DAM/project")
data<-read.csv("data/Data.csv");
str(data);
data$Brand_Numeric <- as.factor(data$Brand_Numeric);
data$Category_Numeric <- as.factor(data$Category_Numeric);
data$SubCategory_Numeric <- as.factor(data$SubCategory_Numeric);
data$ProductColor_Numeric <- as.factor(data$ProductColor_Numeric);
data$ProductSize_Numeric<- as.factor(data$ProductSize_Numeric);
data$ShippingCity_Numeric <- as.factor(data$ShippingCity_Numeric);
data$ShippingState_Numeric <- as.factor(data$ShippingState_Numeric);
data$WebsiteDiscountCode_Numeric <- as.factor(data$WebsiteDiscountCode_Numeric);
data$Website_Vendor <- as.factor (data$Website_Vendor);
data$COD <- as.factor((data$COD));

Website Discount and Brand Sales

We have found out that brand sales are strong associated with website discount as it has come significant using decision tree analysis. The required decision tree is shown below:

## Plotting decision tree to see impact of website discount on Brand
dt1 <- data$Brand ~ data$HasWebsiteDiscount;
dt1_tree <- ctree(dt1, data=data);
plot(dt1_tree);

Vendor Discount with Brand Sales

We have found out that brand sales are strong associated with vendor discount as it has come significant using decision tree analysis. Required decision tree is shown below:

## Plotting decision tree to see impact of vendor discount on Brand
dt2 <-  data$Brand  ~ data$HasVendorDiscount;
dt2_tree <- ctree(dt2, data=data);
plot(dt2_tree);

Vendor or Website Discount with Brand Sales

We have found out that brand sales are strong associated with vendor or website discount as it has come significant using decision tree analysis. Required decision tree is shown below:

## Plotting decision tree to see impact of vendor discount or website on Brand
dt3 <-  data$Brand ~ data$Website_Vendor
dt3_tree <- ctree(dt3, data=data);
plot(dt3_tree);

Vendor & Website Discount with Brand Sales

We have found out that brand sales are strong associated with vendor & website discount as it has come significant using decision tree analysis. We can observe there are two nodes in this model as both the variables are tested in the model and came significant. Required decision tree is shown below:

## Plotting decision tree to see impact of vendor discount and website on Brand
dt4 <- data$Brand  ~ data$HasVendorDiscount + data$HasWebsiteDiscount;
dt4_tree <- ctree(dt4, data=data);
plot(dt4_tree)

Cash on Delivery with Brand Sales

We have found out that brand sales are strong associated with cash on delivery as it has come significant using decision tree analysis. People preference of cash on delivery can be seen from the decision tree. Perhaps data belongs to demonetization period where people preference was strongly related to cash on delivery. Required decision tree is shown below:

## Plotting decision tree to see impact of COD on Brand Sales
dt5 <- data$Brand  ~ data$COD;
dt5_tree <- ctree(dt5, data=data);
plot(dt5_tree)

## Plotting decision tree to see impact of cash on delivery when billing city and shipping city are different
dt6 <- data$COD  ~ data$ShippingCity;
dt6_tree <- ctree(dt6, data=data);
plot(dt6_tree)

Logistic regression

Determining whether an order use COD

These models are tested to find the probability of an event given the independent variables. In this case dependent variable is categorical which matches our requirement of analysis with categorical variable. We made use of package “pscl” and “ROCR” to do the logistic analysis. We tried to assess the impact of brand, final total price, has vendor discount, has website discount &??? if billing address is same as shipping address or not??? on number of transactions related to COD or not. We used logistic regression because our dependent variable i.e. COD is categorical. For validation purposed we have split the data into two parts train (~ 40,000 observations) and test (~6000 observations) data.

#Logistic regression

rm(list=ls());
#install.packages("pscl")#, dependencies = t);
#install.packages("ROCR")#, dependencies = t);

data<-read.csv("data/Log.csv");
## Splitting the data into train and test data for validation and model scoring purposes
train <- data[1:40000,];
test <- data[40001:45898,];
## Scoring the Model

attach(data)

model <- glm(COD ~ Brand + FinalTotalPrice + HasVendorDiscount + HasWebsiteDiscount,
             + Isbillingshippingcitysame,
             family=binomial(link='logit'),data=train);
## Summarizing the model
summary(model);

## 
## Call:
## glm(formula = COD ~ Brand + FinalTotalPrice + HasVendorDiscount + 
##     HasWebsiteDiscount, family = binomial(link = "logit"), data = train, 
##     weights = +Isbillingshippingcitysame)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3024  -1.2858   0.8502   0.9863   1.4663  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            3.255e-01  6.745e-02   4.826 1.40e-06 ***
## BrandFABALLEY         -3.569e-01  4.366e-02  -8.173 3.00e-16 ***
## BrandGRITSTONES       -1.401e-01  5.396e-02  -2.597  0.00941 ** 
## BrandHARPA            -3.679e-01  4.319e-02  -8.520  < 2e-16 ***
## BrandMEIRA            -2.450e-01  5.573e-02  -4.396 1.10e-05 ***
## BrandMISS CHASE       -5.626e-02  8.315e-02  -0.677  0.49864    
## BrandMONTEIL & MUNERO  1.103e-02  4.865e-02   0.227  0.82070    
## BrandMR BUTTON        -6.215e-01  9.980e-02  -6.227 4.74e-10 ***
## BrandTHE VANCA        -3.160e-01  5.063e-02  -6.241 4.36e-10 ***
## BrandTSHIRT COMPANY   -1.863e-01  8.309e-02  -2.242  0.02494 *  
## FinalTotalPrice        6.112e-04  4.645e-05  13.160  < 2e-16 ***
## HasVendorDiscount     -1.514e-02  3.166e-02  -0.478  0.63260    
## HasWebsiteDiscount    -3.464e-01  2.626e-02 -13.192  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 49478  on 36923  degrees of freedom
## Residual deviance: 48777  on 36911  degrees of freedom
## AIC: 48803
## 
## Number of Fisher Scoring iterations: 4

We can see brand chase and Monteil & Munero were not significant in the model i.e. they have no impact in deciding the mode of payment. Vendor discount also has no impact on cash on delivery sales. Significant coefficients can be used to find out log odds ratio of the model. Now we can run the anova() function on the model to analyze the table of deviance. Results from anova are shown below:

## Making Anova table for the model
anova(model, test="Chisq");

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: COD
## 
## Terms added sequentially (first to last)
## 
## 
##                    Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                               36923      49478              
## Brand               9  206.695     36914      49271 < 2.2e-16 ***
## FinalTotalPrice     1  297.791     36913      48974 < 2.2e-16 ***
## HasVendorDiscount   1   22.919     36912      48951  1.69e-06 ***
## HasWebsiteDiscount  1  173.739     36911      48777 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept). The wider this gap, the better. Analyzing the table we can see the drop in deviance when adding each variable one at a time. Adding of brand, final total price, has vendor discount, has website discount &??? if billing address is same as shipping address or not??? reduces the residual variance. Hence significant variables are having impact on reducing residual variances. While no exact equivalent to the R2 of linear regression exists, the McFadden R2 index can be used to assess the model fit.

## Checking the Mcfadden R Square
library(pscl);

## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis

pR2(model);

##           llh       llhNull            G2      McFadden          r2ML 
## -2.438847e+04 -2.473904e+04  7.011434e+02  1.417079e-02  1.737585e-02 
##          r2CU 
##  2.448226e-02

In the steps above, we briefly evaluated the fitting of the model, now we would like to see how the model is doing when predicting y on a new set of data. By setting the parameter type=‘response’'’, R will output probabilities in the form of P(y=1|X). Our decision boundary will be 0.5. If P(y=1|X) > 0.5 then y = 1 otherwise y=0.

## Assessing the predictive ability of the model
fitted.results <- predict(model,newdata = subset(test,select=c(2,3,4,5,6)),type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != test$COD)
print(paste('Accuracy',1-misClasificError))

## [1] "Accuracy 0.630891827738216"

The 0.63 accuracy on the test set is quite a good result. However, keep in mind that this result is somewhat dependent on the manual split of the data that we made earlier. As a last step, we are going to plot the ROC curve and calculate the AUC (area under the curve) which are typical performance measurements for a binary classifier. The ROC is a curve generated by plotting the true positive rate (TPR) against the false positive rate

(FPR) at various threshold settings while the AUC is the area under the ROC curve. As a rule of thumb, a model with good predictive ability should have an AUC closer to 1 (1 is ideal) than to 0.5. In our case the AUC is 0.58 which shows it???s a good model.

## Plotting ROC 
library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

p <- predict(model, newdata=subset(test,select=c(2,3,4,5,6)), type="response")
pr <- prediction(p[1:5898], test$COD)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

## Estiamating AUC
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc

## [1] 0.582904

Number of Unique Customers

There were 31869 unique customers in our dataset. We has around 46000 transactions. It depicts there are repeat purchases done by customers. The R output is shown below:

### Unique Customers
salesData.df <- read.csv('data/Data.csv')
salesData.df$CustomerID <- as.numeric(salesData.df$CustomerID)
unique_customers<-unique(salesData.df$CustomerID)
length(unique_customers)

## [1] 31869

Which are the most and least popular sizes?

We found out that Size M is the most popular. Shirt size 46 is the least popular

### Size Analysis (in descending order)
t = table(salesData.df$ProductSize);
z = as.data.frame(t); 
head(z[order(-z$Freq),c(1,2)])

##    Var1  Freq
## 14    M 12959
## 13    L 11087
## 15    S  9556
## 16   XL  7350
## 18  XXL  2199
## 17   XS   670

head(z[order(z$Freq),c(1,2)])

##    Var1 Freq
## 11   46    1
## 7   3XL   18
## 5    36   21
## 1    28   55
## 4    34   56
## 2    30   71

Assessing subcategory demand

Through this analysis, the best sellers and worst sellers on the store can be determined

The best sellers of the store are…

x <- aggregate ( salesData.df$OrderID ~ salesData.df$SubCategory, salesData.df, length)
x[order(x$`salesData.df$OrderID`, decreasing = TRUE),]

##    salesData.df$SubCategory salesData.df$OrderID
## 55                     TOPS                13014
## 12                  DRESSES                 8869
## 30            POLO T-SHIRTS                 3793
## 33      ROUND NECK T-SHIRTS                 2937
## 38                   SHIRTS                 2224
## 9             CASUAL SHIRTS                 1882
## 63          V NECK T-SHIRTS                 1707
## 17          HENLEY T-SHIRTS                 1621
## 61                     Tops                 1591
## 18       HIGH NECK T SHIRTS                 1114
## 59                 TROUSERS                 1048
## 54                 T SHIRTS                  991
## 42                   SKIRTS                  795
## 48                   Shirts                  673
## 65           WINTER JACKETS                  548
## 64               WAISTCOATS                  345
## 39                   SHORTS                  260
## 21                 JEGGINGS                  247
## 7               CAPS & HATS                  238
## 40                   SHRUGS                  191
## 22                JUMPSUITS                  165
## 10                   CHINOS                  148
## 24                   KURTIS                  137
## 46                 SWEATERS                  127
## 23                   KURTAS                  122
## 60                   TUNICS                  122
## 8            CASUAL JACKETS                   96
## 47              SWEATSHIRTS                   96
## 13           ETHNIC JACKETS                   84
## 35                   SALWAR                   64
## 56              TRACK PANTS                   62
## 27        MANDARIN T-SHIRTS                   51
## 44                    SUITS                   50
## 37              SHIFT DRESS                   44
## 28                NIGHTWEAR                   41
## 6                    CAPRIS                   35
## 19        JACKETS & BLAZERS                   34
## 5                 CAMISOLES                   32
## 41             SKATER DRESS                   32
## 51                   Skirts                   31
## 25                 LEGGINGS                   30
## 49                   Shorts                   26
## 4             BODYCON DRESS                   25
## 32      ROUND NECK SWEATERS                   24
## 50                   Shrugs                   20
## 1                    3/4THS                   18
## 34         ROUND NECK VESTS                   16
## 2          ASYMMETRIC DRESS                   15
## 11            DENIM JACKETS                   12
## 15          FORMAL TROUSERS                   10
## 62          V NECK SWEATERS                    8
## 36                  SCARVES                    7
## 3                 BEACHWEAR                    4
## 26               LOUNGEWEAR                    3
## 29 OFF SHOULDER/TUBE/HALTER                    3
## 45           SUMMER JACKETS                    3
## 14            FORMAL SHIRTS                    2
## 20                    JEANS                    2
## 31          QUILTED JACKETS                    2
## 57               TRACKPANTS                    2
## 16              GSTOP235BLK                    1
## 43                STOCKINGS                    1
## 52           Summer Jackets                    1
## 53                 Sweaters                    1
## 58               TRACKSUITS                    1

Assessing subcategory demand

The worst sellers are…

head(x[order(x$`salesData.df$OrderID`, decreasing = FALSE),])

##    salesData.df$SubCategory salesData.df$OrderID
## 16              GSTOP235BLK                    1
## 43                STOCKINGS                    1
## 52           Summer Jackets                    1
## 53                 Sweaters                    1
## 58               TRACKSUITS                    1
## 14            FORMAL SHIRTS                    2

Brand analysis

Harpa was the highest selling brand followed by Gritstones and Faballey. Mr. Button was the lowest selling brand along with TSHIRT Company.

setwd('~/code/DAM/project')

data<-read.csv("data/Data.csv");
x<- data.frame(data$Brand, data$HasVendorDiscount, data$HasWebsiteDiscount, data$Website_Vendor, data$
                 VendorDiscount, data$WebsiteDiscount);
source("http://pcwww.liv.ac.uk/~william/R/crosstab.r");

crosstab(x, row.vars = "data.Brand");

##                   
## data.Brand            Count  Total %
##   ATHENA            4140.00     9.02
##   FABALLEY          6899.00    15.03
##   GRITSTONES        6901.00    15.04
##   HARPA             7271.00    15.84
##   MEIRA             5422.00    11.81
##   MISS CHASE        1678.00     3.66
##   MONTEIL & MUNERO  6539.00    14.25
##   MR BUTTON          848.00     1.85
##   THE VANCA         4804.00    10.47
##   TSHIRT COMPANY    1396.00     3.04
##   Sum              45898.00   100.00

Except Gritstones every brand has vendor discount. The maximum units with vendor discount are being sold by The Vance followed by Monteil & Munero.

crosstab(x, row.vars = "data.Brand", col.vars = "data.HasVendorDiscount");

##                  data.HasVendorDiscount     0     1   Sum
## data.Brand                                               
## ATHENA                                   1866  2274  4140
## FABALLEY                                 4102  2797  6899
## GRITSTONES                               6901     0  6901
## HARPA                                    3226  4045  7271
## MEIRA                                    3001  2421  5422
## MISS CHASE                                  8  1670  1678
## MONTEIL & MUNERO                         1918  4621  6539
## MR BUTTON                                 340   508   848
## THE VANCA                                 102  4702  4804
## TSHIRT COMPANY                           1218   178  1396
## Sum                                     22682 23216 45898

The maximum units with website discount are being sold by Faballey followed by Harpa.

crosstab(x, row.vars = "data.Brand", col.vars = "data.HasWebsiteDiscount");

##                  data.HasWebsiteDiscount     0     1   Sum
## data.Brand                                                
## ATHENA                                    2538  1602  4140
## FABALLEY                                  2899  4000  6899
## GRITSTONES                                6154   747  6901
## HARPA                                     4087  3184  7271
## MEIRA                                     4485   937  5422
## MISS CHASE                                1281   397  1678
## MONTEIL & MUNERO                          3813  2726  6539
## MR BUTTON                                  221   627   848
## THE VANCA                                 3834   970  4804
## TSHIRT COMPANY                             559   837  1396
## Sum                                      29871 16027 45898

The maximum units with website discount or vendor discount are being sold by Harpa followed by Monteil & Munero. Sales are being driven by promotion which can be clearly seen in the R output.

crosstab(x, row.vars = "data.Brand", col.vars = "data.Website_Vendor");

##                  data.Website_Vendor     0     1   Sum
## data.Brand                                            
## ATHENA                                 710  3430  4140
## FABALLEY                              1232  5667  6899
## GRITSTONES                            6154   747  6901
## HARPA                                  930  6341  7271
## MEIRA                                 2452  2970  5422
## MISS CHASE                               6  1672  1678
## MONTEIL & MUNERO                       515  6024  6539
## MR BUTTON                               61   787   848
## THE VANCA                               56  4748  4804
## TSHIRT COMPANY                         420   976  1396
## Sum                                  12536 33362 45898

Analysing select brands

brand_harpa.df <- subset(salesData.df, Brand == "HARPA")
mytable <- xtabs(~ FinalTotalPrice + HasVendorDiscount + HasWebsiteDiscount, data=brand_harpa.df)
margin.table(mytable, c(2,3))

##                  HasWebsiteDiscount
## HasVendorDiscount    0    1
##                 0  930 2296
##                 1 3157  888

brand_vanca.df <- subset(salesData.df, Brand == "THE VANCA")
mytable2 <- xtabs(~ FinalTotalPrice + HasVendorDiscount + HasWebsiteDiscount, data=brand_vanca.df)
margin.table(mytable2, c(2,3))

##                  HasWebsiteDiscount
## HasVendorDiscount    0    1
##                 0   56   46
##                 1 3778  924

brand_grit.df <- subset(salesData.df, Brand == "GRITSTONES")
mytable3 <- xtabs(~ FinalTotalPrice + HasVendorDiscount + HasWebsiteDiscount, data=brand_grit.df)
margin.table(mytable3, c(2,3))

##                  HasWebsiteDiscount
## HasVendorDiscount    0    1
##                 0 6154  747

Shipments from states and billings to states

Similar to the results of Shipment from states, most Shipments have been billed to the state of Maharashtra followed by Karnataka

x <- aggregate ( salesData.df$OrderID ~ salesData.df$ShippingState, salesData.df, length)
x[order(x$`salesData.df$OrderID`, decreasing = TRUE),1:2]

##    salesData.df$ShippingState salesData.df$OrderID
## 19                         MH                 8669
## 17                         KA                 5762
## 9                          DL                 5392
## 30                         UP                 3959
## 14                         HR                 3003
## 2                          AP                 2840
## 32                         WB                 2357
## 12                         GJ                 1905
## 28                         TN                 1749
## 26                         RJ                 1242
## 21                         MP                 1148
## 23                         OR                 1141
## 24                         PB                 1076
## 4                          AS                  861
## 16                         JK                  651
## 15                         JH                  557
## 6                          CH                  547
## 11                         GA                  523
## 31                         UT                  518
## 5                          BR                  423
## 18                         KL                  371
## 7                          CT                  360
## 13                         HP                  290
## 20                         ML                  275
## 22                         MZ                  121
## 29                         TR                   53
## 25                         PY                   32
## 10                         DN                   23
## 27                         SK                   22
## 1                          AN                   18
## 8                          DD                    9
## 3                          AR                    1

x <- aggregate ( salesData.df$OrderID ~ salesData.df$BillingState, salesData.df, length)
x[order(x$`salesData.df$OrderID`, decreasing = TRUE),1:2]

##    salesData.df$BillingState salesData.df$OrderID
## 18                        MH                 8668
## 16                        KA                 5750
## 8                         DL                 5504
## 31                        UP                 3962
## 13                        HR                 2916
## 2                         AP                 2864
## 33                        WB                 2345
## 11                        GJ                 1879
## 29                        TN                 1728
## 27                        RJ                 1244
## 24                        OR                 1160
## 21                        MP                 1157
## 25                        PB                 1050
## 3                         AS                  855
## 15                        JK                  648
## 14                        JH                  548
## 5                         CH                  543
## 10                        GA                  521
## 32                        UT                  515
## 4                         BR                  442
## 17                        KL                  386
## 6                         CT                  361
## 12                        HP                  297
## 19                        ML                  273
## 22                        MZ                  116
## 30                        TR                   52
## 26                        PY                   36
## 9                         DN                   24
## 28                        SK                   23
## 1                         AN                   16
## 7                         DD                    9
## 20                        MN                    3
## 23                        NL                    3

Shipments from cities

Maximum shipments and billings are at Bangalore city. This information can be utilized to improve the operationl efficiency of the business.

x <- aggregate ( salesData.df$OrderID ~ salesData.df$ShippingCity, salesData.df, length)
head(x[order(x$`salesData.df$OrderID`, decreasing = TRUE),c(1)])

## [1] Bangalore New Delhi Mumbai    Pune      Gurgaon   Hyderabad
## 791 Levels:  Cuttack 110058 400005 AGRA AHMEDABAD AHMEDNAGAR ... wipro technologies, plot no. 2,3 and 4, gautam budha nagar, greater noida

Billings to cities

x <- aggregate ( salesData.df$OrderID ~ salesData.df$BillingCity, salesData.df, length)
head(x[order(x$`salesData.df$OrderID`, decreasing = TRUE),c(1)])

## [1] Bangalore New Delhi Mumbai    Pune      Hyderabad Gurgaon  
## 983 Levels: 190008 AHMEDABAD AHMEDNAGAR AIZAWL ANKLESHWAR ... zirakpur

Corrgrams between price and discounts

There is a slightly negative correlation between the extents of vendor discounts and website discounts.

library(corrgram)
corrgram(salesData.df[,c("FinalTotalPrice","VendorDiscount","WebsiteDiscount")], lower.panel = panel.shade, upper.panel = panel.pie)

Corrgrams between COD and the availability of discounts

There seems to be no correlation between COD and whether the vendor has offered a discount.

corrgram(salesData.df[,c("COD","HasVendorDiscount","HasWebsiteDiscount")], lower.panel = panel.shade, upper.panel = panel.pie)

Brand Analysis

We analysed the brands Harpa, The Vanca and Gritstones to determine the extent to which website and vendor discounts have been offered for their products.

As the Brand ‘HARPA’ is the best one in terms of total number of sales, we studied the number of sales happening in COD mode in presence of vendor discount and website discount respectively. We found out that for ‘HARPA’ the maximum sales in COD mode is happening when vendor discount is provided. Also, sales in COD mode is higher when website discount is not there.

storedata.df <- read.csv('data/Data.csv')
brand_harpa.df <- subset(storedata.df, Brand == "HARPA")
mytable <- xtabs(~ COD + HasVendorDiscount + HasWebsiteDiscount, data=brand_harpa.df)
margin.table(mytable, c(1,2))

##    HasVendorDiscount
## COD    0    1
##   0 1457 1666
##   1 1769 2379

margin.table(mytable3, c(2,3))

##                  HasWebsiteDiscount
## HasVendorDiscount    0    1
##                 0 6154  747

In general, without classifying into COD or non-COD transaction, we find that maximum ???HARPA??? sales are happening in case of only vendor discount and no website discount.

For Vanca, The brand ???THE VANCA??? is the second best in terms of total number of sales, we studied the number of sales happening in COD mode in presence of vendor discount and website discount respectively. We found out that for ???THE VANCA??? the maximum sales in COD mode is happening when vendor discount is provided. Also, sales in COD mode is higher when website discount is not there.

brand_vanca.df <- subset(storedata.df, Brand == "THE VANCA")
mytable2 <- xtabs(~ COD + HasVendorDiscount + HasWebsiteDiscount, data=brand_vanca.df)
margin.table(mytable2, c(2,3))

##                  HasWebsiteDiscount
## HasVendorDiscount    0    1
##                 0   56   46
##                 1 3778  924

For Gritstones,

The brand ???GRITSTONES is ranking third in terms of total number of sales, we studied the number of sales happening in COD mode in presence of vendor discount and website discount respectively. We found out that for ???GRITSTONES???, no vendor is applied. And the maximum sales in happening when COD mode is allowed for transaction versus non-COD mode. Also, sales in COD mode is higher when website discount is not there.

brand_grit.df <- subset(storedata.df, Brand == "GRITSTONES")
mytable3 <- xtabs(~ COD + HasVendorDiscount + HasWebsiteDiscount, data=brand_grit.df)
margin.table(mytable3, c(2,3))

##                  HasWebsiteDiscount
## HasVendorDiscount    0    1
##                 0 6154  747

In General, without considering the impact of COD mode of transaction, the ???GRITSTONES??? brand is sold maximum when there is no discounts ??? be it vendor or website discounts. However, the COD payments is facilitating the sales of the brand.

Hypothesis testing for finding COD’s significance in price

Since p-value < 0.05, we reject the null hypothesis that the average price in the presence and absence of COS is the same.

cyes <- subset( salesData.df, salesData.df$COD == 1)
cno <- subset( salesData.df, salesData.df$COD == 0)
t.test(cyes$FinalTotalPrice, cno$FinalTotalPrice)

## 
##  Welch Two Sample t-test
## 
## data:  cyes$FinalTotalPrice and cno$FinalTotalPrice
## t = 16.261, df = 42172, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  45.24489 57.64667
## sample estimates:
## mean of x mean of y 
##  774.6938  723.2480

Hypothesis testing for finding Metro/Non-metro Significance

Since p-value < 0.05, we reject the null hypothesis that the difference in the average prices in Metro and non-Metro cities is zero

myes <- subset( salesData.df, salesData.df$IsMetro == 1)
mno <- subset( salesData.df, salesData.df$IsMetro == 0)
t.test(cyes$FinalTotalPrice, cno$FinalTotalPrice)

## 
##  Welch Two Sample t-test
## 
## data:  cyes$FinalTotalPrice and cno$FinalTotalPrice
## t = 16.261, df = 42172, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  45.24489 57.64667
## sample estimates:
## mean of x mean of y 
##  774.6938  723.2480

ANOVA Test to consider the impact on Total Sales in absence & presence of COD

From all the three ANOVA model, we can conclude that COD is making a significant difference in sales of the three brands ???HARPA???, ???THE VANCA??? and ???GRITSTONES???.

fit <- aov(FinalTotalPrice ~ COD, data = brand_harpa.df)
summary(fit)

##               Df    Sum Sq Mean Sq F value   Pr(>F)    
## COD            1   3697315 3697315   45.35 1.78e-11 ***
## Residuals   7269 592660094   81533                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

fit <- aov(FinalTotalPrice ~ COD, data = brand_vanca.df)
summary(fit)

##               Df    Sum Sq Mean Sq F value  Pr(>F)   
## COD            1    636134  636134   6.718 0.00957 **
## Residuals   4802 454698021   94689                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

fit <- aov(FinalTotalPrice ~ COD, data = brand_grit.df)
summary(fit)

##               Df   Sum Sq Mean Sq F value Pr(>F)    
## COD            1   665038  665038      87 <2e-16 ***
## Residuals   6899 52738923    7644                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

DAM Project Submission (AY 2017-18)

Shikhar Kohli (PGP32117), Himanshu (PGP32300), Deepak Sharma (PGP32146), Abhishek (PGP32279), Subhra(PGP32204), Rounak Khemka (PGP32199)

12/12/2017

Our key findings

Decision Tree Analysis

Website Discount and Brand Sales

Vendor Discount with Brand Sales

Vendor or Website Discount with Brand Sales

Vendor & Website Discount with Brand Sales

Cash on Delivery with Brand Sales

Logistic regression

Determining whether an order use COD

Number of Unique Customers

Which are the most and least popular sizes?

Assessing subcategory demand

Assessing subcategory demand

Brand analysis

Analysing select brands

Shipments from states and billings to states

Shipments from cities

Billings to cities

Corrgrams between price and discounts

Corrgrams between COD and the availability of discounts

Brand Analysis

Hypothesis testing for finding COD’s significance in price

Hypothesis testing for finding Metro/Non-metro Significance

ANOVA Test to consider the impact on Total Sales in absence & presence of COD