Zappos Models

Dear Zapponians,

In this document, I used R to build linear and nonlinear models to understand total daily revenue and net daily revenue of the site. As a student with both data analytics and accounting backgrounds, I am interested in anything that helps a company grow from a financial perspective. The dataset given contains limited information, but I approached the problem in different ways and found valuable information. Most importantly, my focus can be extended to a deeper level if more information is provided. My explanation shows below. Thank you so much for reviewing!

Best, Jiafeng (Bill) Xu

Subset Dataset

#subset rows of received orders by excluding cancelled orders (first 9292 rows)
orders = tail(fram, -9292)
head(orders)

##      InvoiceNo StockCode                     Description Quantity
## 9293    581587     22367 CHILDRENS APRON SPACEBOY DESIGN        8
## 9294    581587     22555       PLASTERS IN TIN STRONGMAN       12
## 9295    581587     22556  PLASTERS IN TIN CIRCUS PARADE        12
## 9296    581587     22613     PACK OF 20 SPACEBOY NAPKINS       12
## 9297    581587     22629             SPACEBOY LUNCH BOX        12
## 9298    581587     22631        CIRCUS PARADE LUNCH BOX        12
##          InvoiceDate      Date  Time Year Month Day Hr Min UnitPrice
## 9293 2011/12/9 12:50 2011/12/9 12:50 2011    12   9 12  50      1.95
## 9294 2011/12/9 12:50 2011/12/9 12:50 2011    12   9 12  50      1.65
## 9295 2011/12/9 12:50 2011/12/9 12:50 2011    12   9 12  50      1.65
## 9296 2011/12/9 12:50 2011/12/9 12:50 2011    12   9 12  50      0.85
## 9297 2011/12/9 12:50 2011/12/9 12:50 2011    12   9 12  50      1.95
## 9298 2011/12/9 12:50 2011/12/9 12:50 2011    12   9 12  50      1.95
##      CustomerID Country
## 9293      12680  France
## 9294      12680  France
## 9295      12680  France
## 9296      12680  France
## 9297      12680  France
## 9298      12680  France

Descriptive Analysis

#bar plot of number of orders from Jan to Nov (Dec transactions are incomplete thus excluding from the plot)
orders1 = filter(orders, Month != '12')
df = data.frame(orders1)

p1 <- ggplot(df, aes(Month)) + geom_bar(aes(fill=Country), width = 0.5) + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Histogram on Transaction Numbers By Countries", 
       subtitle="January to November") 
p1

#box plot and dot plot for Unit Prices and Quantities of each transaction
#outliers are not frequent but can be out of three standard deviations, thus in the modeling, I decide to use robust models to reduce outlier effects.
p2 <- ggplot(df, aes(Month, UnitPrice))
p2 + geom_boxplot(aes((reorder(Month, UnitPrice, median)))) + 
  labs(title="Box plot", 
       subtitle="City Mileage grouped by Class of vehicle",
       caption="Source: mpg",
       x="Class of Vehicle",
       y="City Mileage")

p3 <- ggplot(df, aes(Month, Quantity))
p3 + geom_boxplot(aes((reorder(Month, Quantity, median)))) + 
  labs(title="Box plot", 
       subtitle="City Mileage grouped by Class of vehicle",
       caption="Source: mpg",
       x="Class of Vehicle",
       y="City Mileage")

# the new variable for total revenue for each transaction
orders$totalRevenue = orders$Quantity*orders$UnitPrice

ZapRegression = read.csv("ZapRegression.csv", header = TRUE)
str(ZapRegression)

## 'data.frame':    305 obs. of  7 variables:
##  $ total_revenue  : num  3457 6200 6974 7535 7579 ...
##  $ customer_count : int  279 286 448 518 529 534 538 598 609 630 ...
##  $ net_revenue    : num  4821 3457 5384 8608 15750 ...
##  $ customer_count1: int  221 279 353 406 418 439 452 498 537 537 ...
##  $ numcancel      : int  40 49 76 18 63 43 13 5 15 22 ...
##  $ numorder       : int  1980 3918 3378 2033 1639 1897 2479 815 1562 2305 ...
##  $ cancel_ratio   : int  2 1 2 0 3 2 0 0 0 0 ...

# create the training and testing datasets for predictive analysis
# use 60% of the data --- for training 
smp_size = floor(0.6*nrow(ZapRegression))
set.seed(11)
train_ind = sample(seq_len(nrow(ZapRegression)), size = smp_size)
train = ZapRegression[train_ind, ]
test = ZapRegression[-train_ind, ]

par(mfrow = c(2,2))
plot(ZapRegression$total_revenue,ZapRegression$customer_count)
plot(ZapRegression$net_revenue,ZapRegression$customer_count1)
plot(ZapRegression$total_revenue,ZapRegression$numcancel)
plot(ZapRegression$net_revenue,ZapRegression$numcancel)

cor(ZapRegression$total_revenue,ZapRegression$customer_count)

## [1] 0.6426399

cor(ZapRegression$net_revenue,ZapRegression$customer_count1)

## [1] 0.702617

cor(ZapRegression$total_revenue,ZapRegression$numcancel)

## [1] -0.05330082

cor(ZapRegression$net_revenue,ZapRegression$numcancel)

## [1] -0.04758116

Modeling

reg1.lm=lm(total_revenue~customer_count, data = ZapRegression)
summary(reg1.lm)

## 
## Call:
## lm(formula = total_revenue ~ customer_count, data = ZapRegression)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27573  -7688  -3339   2570 167886 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6100.282   2189.436   2.786  0.00567 ** 
## customer_count   16.575      1.135  14.600  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16230 on 303 degrees of freedom
## Multiple R-squared:  0.413,  Adjusted R-squared:  0.411 
## F-statistic: 213.2 on 1 and 303 DF,  p-value: < 2.2e-16

#robust model for the test data decrease outlier effects
reg1_1.lm=rlm(total_revenue~customer_count, data = train)
summary(reg1_1.lm)

## 
## Call: rlm(formula = total_revenue ~ customer_count, data = train)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24921.4  -5028.6   -908.4   5052.3 170516.7 
## 
## Coefficients:
##                Value     Std. Error t value  
## (Intercept)    3490.0311 1454.6771     2.3992
## customer_count   16.5624    0.7732    21.4208
## 
## Residual standard error: 7608 on 181 degrees of freedom

Y_pred = predict(reg1.lm, newdata = test)
summary(Y_pred)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10841   25957   32786   36070   43813   86871

SS.total = sum((test$total_revenue-mean(test$total_revenue))^2)
SS.residual = sum((test$total_revenue-Y_pred)^2)
SS.regression = sum((Y_pred-mean(test$total_revenue))^2)


test.rsq = 1-SS.residual/SS.total
paste("Test R-square",test.rsq)

## [1] "Test R-square 0.520714031838455"

# fraction of variability explained by the model
paste("R-square", SS.regression/SS.total)

## [1] "R-square 0.530262938813604"

reg3.lm=lm(total_revenue~customer_count + numcancel, data = ZapRegression)
summary(reg3.lm)

## 
## Call:
## lm(formula = total_revenue ~ customer_count + numcancel, data = ZapRegression)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28931  -7579  -3549   2590 166136 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7754.535   2397.126   3.235  0.00135 ** 
## customer_count   16.634      1.132  14.688  < 2e-16 ***
## numcancel       -57.694     34.538  -1.670  0.09587 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16180 on 302 degrees of freedom
## Multiple R-squared:  0.4184, Adjusted R-squared:  0.4145 
## F-statistic: 108.6 on 2 and 302 DF,  p-value: < 2.2e-16

reg3_1.lm=rlm(total_revenue~customer_count + numcancel, data = train)
summary(reg3_1.lm)

## 
## Call: rlm(formula = total_revenue ~ customer_count + numcancel, data = train)
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24848  -4754  -1053   4816 170139 
## 
## Coefficients:
##                Value     Std. Error t value  
## (Intercept)    3847.0683 1576.2667     2.4406
## customer_count   16.5753    0.7630    21.7242
## numcancel       -13.8774   21.2568    -0.6528
## 
## Residual standard error: 7135 on 180 degrees of freedom

Y_pred = predict(reg3_1.lm, newdata = test)
summary(Y_pred)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7908   23221   29920   33407   40835   83980

SS.total = sum((test$total_revenue-mean(test$total_revenue))^2)
SS.residual = sum((test$total_revenue-Y_pred)^2)
SS.regression = sum((Y_pred-mean(test$total_revenue))^2)

test.rsq = 1-SS.residual/SS.total
paste("Test R-square",test.rsq)

## [1] "Test R-square 0.507437502714105"

# fraction of variability explained by the model
paste("R-square", SS.regression/SS.total)

## [1] "R-square 0.543129986821479"

# to many zigzags and I consider polynomial model to be a better fit, see poly1.lm
ZapRegression["cc_2"] = ZapRegression$customer_count^2
poly1.lm=lm(total_revenue~customer_count+cc_2, data = ZapRegression)
summary(poly1.lm)

## 
## Call:
## lm(formula = total_revenue ~ customer_count + cc_2, data = ZapRegression)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29795  -7554  -3137   3246 166391 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -5.386e+03  4.349e+03  -1.238  0.21654    
## customer_count  2.911e+01  4.268e+00   6.820 4.97e-11 ***
## cc_2           -2.795e-03  9.185e-04  -3.043  0.00255 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16010 on 302 degrees of freedom
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4267 
## F-statistic: 114.1 on 2 and 302 DF,  p-value: < 2.2e-16

train["cc_2"] = train$customer_count^2
test["cc_2"] = test$customer_count^2
train["c_2"] = train$customer_count1^2
test["c_2"] = test$customer_count1^2
poly1_1.lm=lm(total_revenue~customer_count1+cc_2, data = train)
summary(poly1_1.lm)

## 
## Call:
## lm(formula = total_revenue ~ customer_count1 + cc_2, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29833  -8129  -3902   2649 166552 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -5.208e+01  5.772e+03  -0.009    0.993    
## customer_count1  3.035e+01  7.205e+00   4.212 3.99e-05 ***
## cc_2            -1.379e-03  1.160e-03  -1.189    0.236    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17920 on 180 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3527 
## F-statistic: 50.59 on 2 and 180 DF,  p-value: < 2.2e-16

Y_pred = predict(poly1_1.lm, newdata = test)
summary(Y_pred)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8304   25992   34375   36403   44275   75618

SS.total = sum((test$total_revenue-mean(test$total_revenue))^2)
SS.residual = sum((test$total_revenue-Y_pred)^2)
SS.regression = sum((Y_pred-mean(test$total_revenue))^2)

test.rsq = 1-SS.residual/SS.total
paste("Test R-square",test.rsq)

## [1] "Test R-square 0.516175478946498"

# fraction of variability explained by the model
paste("R-square", SS.regression/SS.total)

## [1] "R-square 0.554531739429007"

# Adj R-square is higher for the test dataset indicates that outliers affect the model quality
reg2.lm=lm(net_revenue~customer_count1, data = ZapRegression)
summary(reg2.lm)

## 
## Call:
## lm(formula = net_revenue ~ customer_count1, data = ZapRegression)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28957  -5666  -1424   3178  74181 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4920.7453  1428.8202   3.444 0.000654 ***
## customer_count1   16.7128     0.9724  17.188  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10470 on 303 degrees of freedom
## Multiple R-squared:  0.4937, Adjusted R-squared:  0.492 
## F-statistic: 295.4 on 1 and 303 DF,  p-value: < 2.2e-16

reg2_1.lm=rlm(net_revenue~customer_count1, data = train)
summary(reg2_1.lm)

## 
## Call: rlm(formula = net_revenue ~ customer_count1, data = train)
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24833  -3901   -384   4402  37541 
## 
## Coefficients:
##                 Value     Std. Error t value  
## (Intercept)     1551.9794 1285.3766     1.2074
## customer_count1   18.6213    0.8979    20.7383
## 
## Residual standard error: 5968 on 181 degrees of freedom

Y_pred = predict(reg2_1.lm, newdata = test)
summary(Y_pred)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6747   18744   24866   27292   33129   64827

SS.total = sum((test$net_revenue-mean(test$net_revenue))^2)
SS.residual = sum((test$net_revenue-Y_pred)^2)
SS.regression = sum((Y_pred-mean(test$net_revenue))^2)

test.rsq = 1-SS.residual/SS.total
paste("Test R-square",test.rsq)

## [1] "Test R-square 0.334971941966514"

# fraction of variability explained by the model
paste("R-square", SS.regression/SS.total)

## [1] "R-square 0.644822434114578"

# 
reg4.lm=lm(net_revenue~customer_count1 + numcancel, data = ZapRegression)
summary(reg4.lm)

## 
## Call:
## lm(formula = net_revenue ~ customer_count1 + numcancel, data = ZapRegression)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -26363  -5431  -1519   2837  73928 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6012.6940  1560.4708   3.853 0.000142 ***
## customer_count1   16.7652     0.9698  17.288  < 2e-16 ***
## numcancel        -38.1534    22.2769  -1.713 0.087795 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10440 on 302 degrees of freedom
## Multiple R-squared:  0.4985, Adjusted R-squared:  0.4952 
## F-statistic: 150.1 on 2 and 302 DF,  p-value: < 2.2e-16

reg4_1.lm=rlm(net_revenue~customer_count1 + numcancel, data = train)

## Warning in rlm.default(x, y, weights, method = method, wt.method =
## wt.method, : 'rlm' failed to converge in 20 steps

summary(reg4_1.lm)

## 
## Call: rlm(formula = net_revenue ~ customer_count1 + numcancel, data = train)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24838.6  -4029.7   -367.9   4404.5  37542.2 
## 
## Coefficients:
##                 Value     Std. Error t value  
## (Intercept)     1520.8701 1439.3870     1.0566
## customer_count1   18.6229    0.9176    20.2948
## numcancel          1.9098   19.1575     0.0997
## 
## Residual standard error: 6130 on 180 degrees of freedom

Y_pred = predict(reg4_1.lm, newdata = test)
summary(Y_pred)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6810   18751   24926   27320   33132   64889

SS.total = sum((test$net_revenue-mean(test$net_revenue))^2)
SS.residual = sum((test$net_revenue-Y_pred)^2)
SS.regression = sum((Y_pred-mean(test$net_revenue))^2)

test.rsq = 1-SS.residual/SS.total
paste("Test R-square",test.rsq)

## [1] "Test R-square 0.333550626797492"

# fraction of variability explained by the model
paste("R-square", SS.regression/SS.total)

## [1] "R-square 0.645312973546001"

ZapRegression["c_2"] = ZapRegression$customer_count1^2
poly2.lm=lm(net_revenue~customer_count1+c_2, data = ZapRegression)
summary(poly2.lm)

## 
## Call:
## lm(formula = net_revenue ~ customer_count1 + c_2, data = ZapRegression)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28490  -5825  -1592   3663  72069 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -6.097e+03  3.088e+03  -1.975   0.0492 *  
## customer_count1  3.255e+01  4.072e+00   7.994 2.80e-14 ***
## c_2             -4.682e-03  1.171e-03  -4.000 7.98e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10220 on 302 degrees of freedom
## Multiple R-squared:  0.5191, Adjusted R-squared:  0.516 
## F-statistic:   163 on 2 and 302 DF,  p-value: < 2.2e-16

poly2_1.lm=lm(net_revenue~customer_count1+c_2, data = train)
summary(poly2_1.lm)

## 
## Call:
## lm(formula = net_revenue ~ customer_count1 + c_2, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -22922  -5884  -1828   3964  33688 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -9.212e+03  3.603e+03  -2.557 0.011386 *  
## customer_count1  3.590e+01  4.813e+00   7.460 3.54e-12 ***
## c_2             -5.104e-03  1.397e-03  -3.654 0.000338 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8951 on 180 degrees of freedom
## Multiple R-squared:  0.6248, Adjusted R-squared:  0.6206 
## F-statistic: 149.9 on 2 and 180 DF,  p-value: < 2.2e-16

Y_pred = predict(poly2_1.lm, newdata = test)
summary(Y_pred)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   408.2 19585.8 27739.2 28554.6 36979.3 53857.6

SS.total = sum((test$net_revenue-mean(test$net_revenue))^2)
SS.residual = sum((test$net_revenue-Y_pred)^2)
SS.regression = sum((Y_pred-mean(test$net_revenue))^2)


test.rsq = 1-SS.residual/SS.total

paste("R-square", SS.regression/SS.total)

## [1] "R-square 0.671435624454648"

poly3.lm=lm(net_revenue~customer_count1+c_2 + total_revenue, data = ZapRegression)
summary(poly3.lm)

## 
## Call:
## lm(formula = net_revenue ~ customer_count1 + c_2 + total_revenue, 
##     data = ZapRegression)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29315  -5740  -1461   3424  71485 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -6.734e+03  3.087e+03  -2.182   0.0299 *  
## customer_count1  3.574e+01  4.338e+00   8.240 5.36e-15 ***
## c_2             -5.142e-03  1.186e-03  -4.337 1.98e-05 ***
## total_revenue   -7.489e-02  3.643e-02  -2.055   0.0407 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10160 on 301 degrees of freedom
## Multiple R-squared:  0.5258, Adjusted R-squared:  0.5211 
## F-statistic: 111.3 on 3 and 301 DF,  p-value: < 2.2e-16

poly3_1.lm=lm(net_revenue~customer_count1+c_2 + total_revenue, data = train)
summary(poly3_1.lm)

## 
## Call:
## lm(formula = net_revenue ~ customer_count1 + c_2 + total_revenue, 
##     data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -23062  -6071  -1784   3733  33814 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -9.775e+03  3.592e+03  -2.721 0.007148 ** 
## customer_count1  3.881e+01  5.035e+00   7.707 8.56e-13 ***
## c_2             -5.515e-03  1.405e-03  -3.924 0.000124 ***
## total_revenue   -6.857e-02  3.732e-02  -1.837 0.067830 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8893 on 179 degrees of freedom
## Multiple R-squared:  0.6317, Adjusted R-squared:  0.6256 
## F-statistic: 102.4 on 3 and 179 DF,  p-value: < 2.2e-16

Y_pred = predict(poly3_1.lm, newdata = test)
summary(Y_pred)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   197.3 19432.7 27063.1 28594.7 36867.1 53785.8

SS.total = sum((test$net_revenue-mean(test$net_revenue))^2)
SS.residual = sum((test$net_revenue-Y_pred)^2)
SS.regression = sum((Y_pred-mean(test$net_revenue))^2)


test.rsq = 1-SS.residual/SS.total

paste("R-square", SS.regression/(SS.total))

## [1] "R-square 0.677051224656376"

Conclusion

poly2 and poly3 (last two models) both have R-square over 0.67. And since the coefficients of c_2 (squared term) are both negative and small, we expect the curve to be concave and flat. In English, when we have more customers buying staff, the growth rate of net revenue are decreasing. Obviously, it is just an estimate based on one-year sales.

Why does it matter? I built significant models to use the number of customers taking orders and cancelling orders to predict total revenues on daily basis (reg1.lm and reg3.lm). Then, using the last two polynomial regression models, we can predict net revenues, which indicates the real accounts receivable. By that, it is much more straightforward to have a picture of cash inflows from operations (with residuals, of course). From an accounting perspective, knowing such information helps the company to more efficiently plan future investments and save costs.

What else can we do? Because of the limit of the dataset, I am not able to further analyze the conversion rate (% of customers who actually take any orders among all daily site visitors). Then, based on visiting patterns in each hour, it is possible for the company to know and predict incoming visitors as long as the site is closely monitored.