31869 customers were unique (Total Transactions 45898)
App05, Happy30, ICICV500JG, NEW30, NEW25, RUSH30, SALE30, STYLE25, STYLE30, WOW30 were the key website discount coupon codes
Tops and dresses were the key subcategories (demand wise)
Maharashtra, Karnatka, Delhi and UP were the top states (Transaction wise)
New Delhi, Bengaluru, Mumbai, Pune, Hyderabad and Kolkata were the top cities (Transaction Wise)
Black was the most demanded color in apparel category; followed by blue
Medium followed by small were the most ordered sizes by customers
Harpa followed by Gritsones, Faballey and Moneteil and Munero were the most ordered brands
Gritstones was the only brand without any vendor discount
As most of the data was categorical in nature, we used decision analysis to find out the relationship among variables. We tried to analyse the impact of website discount as well as vendor discount on brand sales (all the possible combinations were tested). We even tried to test if the brand sales exhibits any relationship with mode of payment or not. R packages which are installed to do the analysis are party and partykit
Please click the code button to expand the sections that house the R code used in our analysis.
# DECISION TREES
rm(list=ls());
# Installing package party and partykit to run decision trees
library(party);
library(partykit)
setwd("~/code/DAM/project")
data<-read.csv("data/Data.csv");
str(data);
data$Brand_Numeric <- as.factor(data$Brand_Numeric);
data$Category_Numeric <- as.factor(data$Category_Numeric);
data$SubCategory_Numeric <- as.factor(data$SubCategory_Numeric);
data$ProductColor_Numeric <- as.factor(data$ProductColor_Numeric);
data$ProductSize_Numeric<- as.factor(data$ProductSize_Numeric);
data$ShippingCity_Numeric <- as.factor(data$ShippingCity_Numeric);
data$ShippingState_Numeric <- as.factor(data$ShippingState_Numeric);
data$WebsiteDiscountCode_Numeric <- as.factor(data$WebsiteDiscountCode_Numeric);
data$Website_Vendor <- as.factor (data$Website_Vendor);
data$COD <- as.factor((data$COD));
We have found out that brand sales are strong associated with website discount as it has come significant using decision tree analysis. The required decision tree is shown below:
## Plotting decision tree to see impact of website discount on Brand
dt1 <- data$Brand ~ data$HasWebsiteDiscount;
dt1_tree <- ctree(dt1, data=data);
plot(dt1_tree);
We have found out that brand sales are strong associated with vendor discount as it has come significant using decision tree analysis. Required decision tree is shown below:
## Plotting decision tree to see impact of vendor discount on Brand
dt2 <- data$Brand ~ data$HasVendorDiscount;
dt2_tree <- ctree(dt2, data=data);
plot(dt2_tree);
We have found out that brand sales are strong associated with vendor or website discount as it has come significant using decision tree analysis. Required decision tree is shown below:
## Plotting decision tree to see impact of vendor discount or website on Brand
dt3 <- data$Brand ~ data$Website_Vendor
dt3_tree <- ctree(dt3, data=data);
plot(dt3_tree);
We have found out that brand sales are strong associated with vendor & website discount as it has come significant using decision tree analysis. We can observe there are two nodes in this model as both the variables are tested in the model and came significant. Required decision tree is shown below:
## Plotting decision tree to see impact of vendor discount and website on Brand
dt4 <- data$Brand ~ data$HasVendorDiscount + data$HasWebsiteDiscount;
dt4_tree <- ctree(dt4, data=data);
plot(dt4_tree)
We have found out that brand sales are strong associated with cash on delivery as it has come significant using decision tree analysis. People preference of cash on delivery can be seen from the decision tree. Perhaps data belongs to demonetization period where people preference was strongly related to cash on delivery. Required decision tree is shown below:
## Plotting decision tree to see impact of COD on Brand Sales
dt5 <- data$Brand ~ data$COD;
dt5_tree <- ctree(dt5, data=data);
plot(dt5_tree)
## Plotting decision tree to see impact of cash on delivery when billing city and shipping city are different
dt6 <- data$COD ~ data$ShippingCity;
dt6_tree <- ctree(dt6, data=data);
plot(dt6_tree)
These models are tested to find the probability of an event given the independent variables. In this case dependent variable is categorical which matches our requirement of analysis with categorical variable. We made use of package “pscl” and “ROCR” to do the logistic analysis. We tried to assess the impact of brand, final total price, has vendor discount, has website discount &??? if billing address is same as shipping address or not??? on number of transactions related to COD or not. We used logistic regression because our dependent variable i.e. COD is categorical. For validation purposed we have split the data into two parts train (~ 40,000 observations) and test (~6000 observations) data.
#Logistic regression
rm(list=ls());
#install.packages("pscl")#, dependencies = t);
#install.packages("ROCR")#, dependencies = t);
data<-read.csv("data/Log.csv");
## Splitting the data into train and test data for validation and model scoring purposes
train <- data[1:40000,];
test <- data[40001:45898,];
## Scoring the Model
attach(data)
model <- glm(COD ~ Brand + FinalTotalPrice + HasVendorDiscount + HasWebsiteDiscount,
+ Isbillingshippingcitysame,
family=binomial(link='logit'),data=train);
## Summarizing the model
summary(model);
##
## Call:
## glm(formula = COD ~ Brand + FinalTotalPrice + HasVendorDiscount +
## HasWebsiteDiscount, family = binomial(link = "logit"), data = train,
## weights = +Isbillingshippingcitysame)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3024 -1.2858 0.8502 0.9863 1.4663
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.255e-01 6.745e-02 4.826 1.40e-06 ***
## BrandFABALLEY -3.569e-01 4.366e-02 -8.173 3.00e-16 ***
## BrandGRITSTONES -1.401e-01 5.396e-02 -2.597 0.00941 **
## BrandHARPA -3.679e-01 4.319e-02 -8.520 < 2e-16 ***
## BrandMEIRA -2.450e-01 5.573e-02 -4.396 1.10e-05 ***
## BrandMISS CHASE -5.626e-02 8.315e-02 -0.677 0.49864
## BrandMONTEIL & MUNERO 1.103e-02 4.865e-02 0.227 0.82070
## BrandMR BUTTON -6.215e-01 9.980e-02 -6.227 4.74e-10 ***
## BrandTHE VANCA -3.160e-01 5.063e-02 -6.241 4.36e-10 ***
## BrandTSHIRT COMPANY -1.863e-01 8.309e-02 -2.242 0.02494 *
## FinalTotalPrice 6.112e-04 4.645e-05 13.160 < 2e-16 ***
## HasVendorDiscount -1.514e-02 3.166e-02 -0.478 0.63260
## HasWebsiteDiscount -3.464e-01 2.626e-02 -13.192 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 49478 on 36923 degrees of freedom
## Residual deviance: 48777 on 36911 degrees of freedom
## AIC: 48803
##
## Number of Fisher Scoring iterations: 4
We can see brand chase and Monteil & Munero were not significant in the model i.e. they have no impact in deciding the mode of payment. Vendor discount also has no impact on cash on delivery sales. Significant coefficients can be used to find out log odds ratio of the model. Now we can run the anova() function on the model to analyze the table of deviance. Results from anova are shown below:
## Making Anova table for the model
anova(model, test="Chisq");
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: COD
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 36923 49478
## Brand 9 206.695 36914 49271 < 2.2e-16 ***
## FinalTotalPrice 1 297.791 36913 48974 < 2.2e-16 ***
## HasVendorDiscount 1 22.919 36912 48951 1.69e-06 ***
## HasWebsiteDiscount 1 173.739 36911 48777 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept). The wider this gap, the better. Analyzing the table we can see the drop in deviance when adding each variable one at a time. Adding of brand, final total price, has vendor discount, has website discount &??? if billing address is same as shipping address or not??? reduces the residual variance. Hence significant variables are having impact on reducing residual variances. While no exact equivalent to the R2 of linear regression exists, the McFadden R2 index can be used to assess the model fit.
## Checking the Mcfadden R Square
library(pscl);
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
pR2(model);
## llh llhNull G2 McFadden r2ML
## -2.438847e+04 -2.473904e+04 7.011434e+02 1.417079e-02 1.737585e-02
## r2CU
## 2.448226e-02
In the steps above, we briefly evaluated the fitting of the model, now we would like to see how the model is doing when predicting y on a new set of data. By setting the parameter type=‘response’'’, R will output probabilities in the form of P(y=1|X). Our decision boundary will be 0.5. If P(y=1|X) > 0.5 then y = 1 otherwise y=0.
## Assessing the predictive ability of the model
fitted.results <- predict(model,newdata = subset(test,select=c(2,3,4,5,6)),type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != test$COD)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.630891827738216"
The 0.63 accuracy on the test set is quite a good result. However, keep in mind that this result is somewhat dependent on the manual split of the data that we made earlier. As a last step, we are going to plot the ROC curve and calculate the AUC (area under the curve) which are typical performance measurements for a binary classifier. The ROC is a curve generated by plotting the true positive rate (TPR) against the false positive rate
(FPR) at various threshold settings while the AUC is the area under the ROC curve. As a rule of thumb, a model with good predictive ability should have an AUC closer to 1 (1 is ideal) than to 0.5. In our case the AUC is 0.58 which shows it???s a good model.
## Plotting ROC
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
p <- predict(model, newdata=subset(test,select=c(2,3,4,5,6)), type="response")
pr <- prediction(p[1:5898], test$COD)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
## Estiamating AUC
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.582904
There were 31869 unique customers in our dataset. We has around 46000 transactions. It depicts there are repeat purchases done by customers. The R output is shown below:
### Unique Customers
salesData.df <- read.csv('data/Data.csv')
salesData.df$CustomerID <- as.numeric(salesData.df$CustomerID)
unique_customers<-unique(salesData.df$CustomerID)
length(unique_customers)
## [1] 31869
We found out that Size M is the most popular. Shirt size 46 is the least popular
### Size Analysis (in descending order)
t = table(salesData.df$ProductSize);
z = as.data.frame(t);
head(z[order(-z$Freq),c(1,2)])
## Var1 Freq
## 14 M 12959
## 13 L 11087
## 15 S 9556
## 16 XL 7350
## 18 XXL 2199
## 17 XS 670
head(z[order(z$Freq),c(1,2)])
## Var1 Freq
## 11 46 1
## 7 3XL 18
## 5 36 21
## 1 28 55
## 4 34 56
## 2 30 71
Through this analysis, the best sellers and worst sellers on the store can be determined
The best sellers of the store are…
x <- aggregate ( salesData.df$OrderID ~ salesData.df$SubCategory, salesData.df, length)
x[order(x$`salesData.df$OrderID`, decreasing = TRUE),]
## salesData.df$SubCategory salesData.df$OrderID
## 55 TOPS 13014
## 12 DRESSES 8869
## 30 POLO T-SHIRTS 3793
## 33 ROUND NECK T-SHIRTS 2937
## 38 SHIRTS 2224
## 9 CASUAL SHIRTS 1882
## 63 V NECK T-SHIRTS 1707
## 17 HENLEY T-SHIRTS 1621
## 61 Tops 1591
## 18 HIGH NECK T SHIRTS 1114
## 59 TROUSERS 1048
## 54 T SHIRTS 991
## 42 SKIRTS 795
## 48 Shirts 673
## 65 WINTER JACKETS 548
## 64 WAISTCOATS 345
## 39 SHORTS 260
## 21 JEGGINGS 247
## 7 CAPS & HATS 238
## 40 SHRUGS 191
## 22 JUMPSUITS 165
## 10 CHINOS 148
## 24 KURTIS 137
## 46 SWEATERS 127
## 23 KURTAS 122
## 60 TUNICS 122
## 8 CASUAL JACKETS 96
## 47 SWEATSHIRTS 96
## 13 ETHNIC JACKETS 84
## 35 SALWAR 64
## 56 TRACK PANTS 62
## 27 MANDARIN T-SHIRTS 51
## 44 SUITS 50
## 37 SHIFT DRESS 44
## 28 NIGHTWEAR 41
## 6 CAPRIS 35
## 19 JACKETS & BLAZERS 34
## 5 CAMISOLES 32
## 41 SKATER DRESS 32
## 51 Skirts 31
## 25 LEGGINGS 30
## 49 Shorts 26
## 4 BODYCON DRESS 25
## 32 ROUND NECK SWEATERS 24
## 50 Shrugs 20
## 1 3/4THS 18
## 34 ROUND NECK VESTS 16
## 2 ASYMMETRIC DRESS 15
## 11 DENIM JACKETS 12
## 15 FORMAL TROUSERS 10
## 62 V NECK SWEATERS 8
## 36 SCARVES 7
## 3 BEACHWEAR 4
## 26 LOUNGEWEAR 3
## 29 OFF SHOULDER/TUBE/HALTER 3
## 45 SUMMER JACKETS 3
## 14 FORMAL SHIRTS 2
## 20 JEANS 2
## 31 QUILTED JACKETS 2
## 57 TRACKPANTS 2
## 16 GSTOP235BLK 1
## 43 STOCKINGS 1
## 52 Summer Jackets 1
## 53 Sweaters 1
## 58 TRACKSUITS 1
The worst sellers are…
head(x[order(x$`salesData.df$OrderID`, decreasing = FALSE),])
## salesData.df$SubCategory salesData.df$OrderID
## 16 GSTOP235BLK 1
## 43 STOCKINGS 1
## 52 Summer Jackets 1
## 53 Sweaters 1
## 58 TRACKSUITS 1
## 14 FORMAL SHIRTS 2
Harpa was the highest selling brand followed by Gritstones and Faballey. Mr. Button was the lowest selling brand along with TSHIRT Company.
setwd('~/code/DAM/project')
data<-read.csv("data/Data.csv");
x<- data.frame(data$Brand, data$HasVendorDiscount, data$HasWebsiteDiscount, data$Website_Vendor, data$
VendorDiscount, data$WebsiteDiscount);
source("http://pcwww.liv.ac.uk/~william/R/crosstab.r");
crosstab(x, row.vars = "data.Brand");
##
## data.Brand Count Total %
## ATHENA 4140.00 9.02
## FABALLEY 6899.00 15.03
## GRITSTONES 6901.00 15.04
## HARPA 7271.00 15.84
## MEIRA 5422.00 11.81
## MISS CHASE 1678.00 3.66
## MONTEIL & MUNERO 6539.00 14.25
## MR BUTTON 848.00 1.85
## THE VANCA 4804.00 10.47
## TSHIRT COMPANY 1396.00 3.04
## Sum 45898.00 100.00
Except Gritstones every brand has vendor discount. The maximum units with vendor discount are being sold by The Vance followed by Monteil & Munero.
crosstab(x, row.vars = "data.Brand", col.vars = "data.HasVendorDiscount");
## data.HasVendorDiscount 0 1 Sum
## data.Brand
## ATHENA 1866 2274 4140
## FABALLEY 4102 2797 6899
## GRITSTONES 6901 0 6901
## HARPA 3226 4045 7271
## MEIRA 3001 2421 5422
## MISS CHASE 8 1670 1678
## MONTEIL & MUNERO 1918 4621 6539
## MR BUTTON 340 508 848
## THE VANCA 102 4702 4804
## TSHIRT COMPANY 1218 178 1396
## Sum 22682 23216 45898
The maximum units with website discount are being sold by Faballey followed by Harpa.
crosstab(x, row.vars = "data.Brand", col.vars = "data.HasWebsiteDiscount");
## data.HasWebsiteDiscount 0 1 Sum
## data.Brand
## ATHENA 2538 1602 4140
## FABALLEY 2899 4000 6899
## GRITSTONES 6154 747 6901
## HARPA 4087 3184 7271
## MEIRA 4485 937 5422
## MISS CHASE 1281 397 1678
## MONTEIL & MUNERO 3813 2726 6539
## MR BUTTON 221 627 848
## THE VANCA 3834 970 4804
## TSHIRT COMPANY 559 837 1396
## Sum 29871 16027 45898
The maximum units with website discount or vendor discount are being sold by Harpa followed by Monteil & Munero. Sales are being driven by promotion which can be clearly seen in the R output.
crosstab(x, row.vars = "data.Brand", col.vars = "data.Website_Vendor");
## data.Website_Vendor 0 1 Sum
## data.Brand
## ATHENA 710 3430 4140
## FABALLEY 1232 5667 6899
## GRITSTONES 6154 747 6901
## HARPA 930 6341 7271
## MEIRA 2452 2970 5422
## MISS CHASE 6 1672 1678
## MONTEIL & MUNERO 515 6024 6539
## MR BUTTON 61 787 848
## THE VANCA 56 4748 4804
## TSHIRT COMPANY 420 976 1396
## Sum 12536 33362 45898
brand_harpa.df <- subset(salesData.df, Brand == "HARPA")
mytable <- xtabs(~ FinalTotalPrice + HasVendorDiscount + HasWebsiteDiscount, data=brand_harpa.df)
margin.table(mytable, c(2,3))
## HasWebsiteDiscount
## HasVendorDiscount 0 1
## 0 930 2296
## 1 3157 888
brand_vanca.df <- subset(salesData.df, Brand == "THE VANCA")
mytable2 <- xtabs(~ FinalTotalPrice + HasVendorDiscount + HasWebsiteDiscount, data=brand_vanca.df)
margin.table(mytable2, c(2,3))
## HasWebsiteDiscount
## HasVendorDiscount 0 1
## 0 56 46
## 1 3778 924
brand_grit.df <- subset(salesData.df, Brand == "GRITSTONES")
mytable3 <- xtabs(~ FinalTotalPrice + HasVendorDiscount + HasWebsiteDiscount, data=brand_grit.df)
margin.table(mytable3, c(2,3))
## HasWebsiteDiscount
## HasVendorDiscount 0 1
## 0 6154 747
Similar to the results of Shipment from states, most Shipments have been billed to the state of Maharashtra followed by Karnataka
x <- aggregate ( salesData.df$OrderID ~ salesData.df$ShippingState, salesData.df, length)
x[order(x$`salesData.df$OrderID`, decreasing = TRUE),1:2]
## salesData.df$ShippingState salesData.df$OrderID
## 19 MH 8669
## 17 KA 5762
## 9 DL 5392
## 30 UP 3959
## 14 HR 3003
## 2 AP 2840
## 32 WB 2357
## 12 GJ 1905
## 28 TN 1749
## 26 RJ 1242
## 21 MP 1148
## 23 OR 1141
## 24 PB 1076
## 4 AS 861
## 16 JK 651
## 15 JH 557
## 6 CH 547
## 11 GA 523
## 31 UT 518
## 5 BR 423
## 18 KL 371
## 7 CT 360
## 13 HP 290
## 20 ML 275
## 22 MZ 121
## 29 TR 53
## 25 PY 32
## 10 DN 23
## 27 SK 22
## 1 AN 18
## 8 DD 9
## 3 AR 1
x <- aggregate ( salesData.df$OrderID ~ salesData.df$BillingState, salesData.df, length)
x[order(x$`salesData.df$OrderID`, decreasing = TRUE),1:2]
## salesData.df$BillingState salesData.df$OrderID
## 18 MH 8668
## 16 KA 5750
## 8 DL 5504
## 31 UP 3962
## 13 HR 2916
## 2 AP 2864
## 33 WB 2345
## 11 GJ 1879
## 29 TN 1728
## 27 RJ 1244
## 24 OR 1160
## 21 MP 1157
## 25 PB 1050
## 3 AS 855
## 15 JK 648
## 14 JH 548
## 5 CH 543
## 10 GA 521
## 32 UT 515
## 4 BR 442
## 17 KL 386
## 6 CT 361
## 12 HP 297
## 19 ML 273
## 22 MZ 116
## 30 TR 52
## 26 PY 36
## 9 DN 24
## 28 SK 23
## 1 AN 16
## 7 DD 9
## 20 MN 3
## 23 NL 3
Maximum shipments and billings are at Bangalore city. This information can be utilized to improve the operationl efficiency of the business.
x <- aggregate ( salesData.df$OrderID ~ salesData.df$ShippingCity, salesData.df, length)
head(x[order(x$`salesData.df$OrderID`, decreasing = TRUE),c(1)])
## [1] Bangalore New Delhi Mumbai Pune Gurgaon Hyderabad
## 791 Levels: Cuttack 110058 400005 AGRA AHMEDABAD AHMEDNAGAR ... wipro technologies, plot no. 2,3 and 4, gautam budha nagar, greater noida
x <- aggregate ( salesData.df$OrderID ~ salesData.df$BillingCity, salesData.df, length)
head(x[order(x$`salesData.df$OrderID`, decreasing = TRUE),c(1)])
## [1] Bangalore New Delhi Mumbai Pune Hyderabad Gurgaon
## 983 Levels: 190008 AHMEDABAD AHMEDNAGAR AIZAWL ANKLESHWAR ... zirakpur
There is a slightly negative correlation between the extents of vendor discounts and website discounts.
library(corrgram)
corrgram(salesData.df[,c("FinalTotalPrice","VendorDiscount","WebsiteDiscount")], lower.panel = panel.shade, upper.panel = panel.pie)
There seems to be no correlation between COD and whether the vendor has offered a discount.
corrgram(salesData.df[,c("COD","HasVendorDiscount","HasWebsiteDiscount")], lower.panel = panel.shade, upper.panel = panel.pie)
We analysed the brands Harpa, The Vanca and Gritstones to determine the extent to which website and vendor discounts have been offered for their products.
As the Brand ‘HARPA’ is the best one in terms of total number of sales, we studied the number of sales happening in COD mode in presence of vendor discount and website discount respectively. We found out that for ‘HARPA’ the maximum sales in COD mode is happening when vendor discount is provided. Also, sales in COD mode is higher when website discount is not there.
storedata.df <- read.csv('data/Data.csv')
brand_harpa.df <- subset(storedata.df, Brand == "HARPA")
mytable <- xtabs(~ COD + HasVendorDiscount + HasWebsiteDiscount, data=brand_harpa.df)
margin.table(mytable, c(1,2))
## HasVendorDiscount
## COD 0 1
## 0 1457 1666
## 1 1769 2379
margin.table(mytable3, c(2,3))
## HasWebsiteDiscount
## HasVendorDiscount 0 1
## 0 6154 747
In general, without classifying into COD or non-COD transaction, we find that maximum ???HARPA??? sales are happening in case of only vendor discount and no website discount.
For Vanca, The brand ???THE VANCA??? is the second best in terms of total number of sales, we studied the number of sales happening in COD mode in presence of vendor discount and website discount respectively. We found out that for ???THE VANCA??? the maximum sales in COD mode is happening when vendor discount is provided. Also, sales in COD mode is higher when website discount is not there.
brand_vanca.df <- subset(storedata.df, Brand == "THE VANCA")
mytable2 <- xtabs(~ COD + HasVendorDiscount + HasWebsiteDiscount, data=brand_vanca.df)
margin.table(mytable2, c(2,3))
## HasWebsiteDiscount
## HasVendorDiscount 0 1
## 0 56 46
## 1 3778 924
For Gritstones,
The brand ???GRITSTONES is ranking third in terms of total number of sales, we studied the number of sales happening in COD mode in presence of vendor discount and website discount respectively. We found out that for ???GRITSTONES???, no vendor is applied. And the maximum sales in happening when COD mode is allowed for transaction versus non-COD mode. Also, sales in COD mode is higher when website discount is not there.
brand_grit.df <- subset(storedata.df, Brand == "GRITSTONES")
mytable3 <- xtabs(~ COD + HasVendorDiscount + HasWebsiteDiscount, data=brand_grit.df)
margin.table(mytable3, c(2,3))
## HasWebsiteDiscount
## HasVendorDiscount 0 1
## 0 6154 747
In General, without considering the impact of COD mode of transaction, the ???GRITSTONES??? brand is sold maximum when there is no discounts ??? be it vendor or website discounts. However, the COD payments is facilitating the sales of the brand.
Since p-value < 0.05, we reject the null hypothesis that the average price in the presence and absence of COS is the same.
cyes <- subset( salesData.df, salesData.df$COD == 1)
cno <- subset( salesData.df, salesData.df$COD == 0)
t.test(cyes$FinalTotalPrice, cno$FinalTotalPrice)
##
## Welch Two Sample t-test
##
## data: cyes$FinalTotalPrice and cno$FinalTotalPrice
## t = 16.261, df = 42172, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 45.24489 57.64667
## sample estimates:
## mean of x mean of y
## 774.6938 723.2480
Since p-value < 0.05, we reject the null hypothesis that the difference in the average prices in Metro and non-Metro cities is zero
myes <- subset( salesData.df, salesData.df$IsMetro == 1)
mno <- subset( salesData.df, salesData.df$IsMetro == 0)
t.test(cyes$FinalTotalPrice, cno$FinalTotalPrice)
##
## Welch Two Sample t-test
##
## data: cyes$FinalTotalPrice and cno$FinalTotalPrice
## t = 16.261, df = 42172, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 45.24489 57.64667
## sample estimates:
## mean of x mean of y
## 774.6938 723.2480
From all the three ANOVA model, we can conclude that COD is making a significant difference in sales of the three brands ???HARPA???, ???THE VANCA??? and ???GRITSTONES???.
fit <- aov(FinalTotalPrice ~ COD, data = brand_harpa.df)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## COD 1 3697315 3697315 45.35 1.78e-11 ***
## Residuals 7269 592660094 81533
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit <- aov(FinalTotalPrice ~ COD, data = brand_vanca.df)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## COD 1 636134 636134 6.718 0.00957 **
## Residuals 4802 454698021 94689
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit <- aov(FinalTotalPrice ~ COD, data = brand_grit.df)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## COD 1 665038 665038 87 <2e-16 ***
## Residuals 6899 52738923 7644
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1