Dear Zapponians,
In this document, I used R to build linear and nonlinear models to understand total daily revenue and net daily revenue of the site. As a student with both data analytics and accounting backgrounds, I am interested in anything that helps a company grow from a financial perspective. The dataset given contains limited information, but I approached the problem in different ways and found valuable information. Most importantly, my focus can be extended to a deeper level if more information is provided. My explanation shows below. Thank you so much for reviewing!
Best, Jiafeng (Bill) Xu
#subset rows of received orders by excluding cancelled orders (first 9292 rows)
orders = tail(fram, -9292)
head(orders)
## InvoiceNo StockCode Description Quantity
## 9293 581587 22367 CHILDRENS APRON SPACEBOY DESIGN 8
## 9294 581587 22555 PLASTERS IN TIN STRONGMAN 12
## 9295 581587 22556 PLASTERS IN TIN CIRCUS PARADE 12
## 9296 581587 22613 PACK OF 20 SPACEBOY NAPKINS 12
## 9297 581587 22629 SPACEBOY LUNCH BOX 12
## 9298 581587 22631 CIRCUS PARADE LUNCH BOX 12
## InvoiceDate Date Time Year Month Day Hr Min UnitPrice
## 9293 2011/12/9 12:50 2011/12/9 12:50 2011 12 9 12 50 1.95
## 9294 2011/12/9 12:50 2011/12/9 12:50 2011 12 9 12 50 1.65
## 9295 2011/12/9 12:50 2011/12/9 12:50 2011 12 9 12 50 1.65
## 9296 2011/12/9 12:50 2011/12/9 12:50 2011 12 9 12 50 0.85
## 9297 2011/12/9 12:50 2011/12/9 12:50 2011 12 9 12 50 1.95
## 9298 2011/12/9 12:50 2011/12/9 12:50 2011 12 9 12 50 1.95
## CustomerID Country
## 9293 12680 France
## 9294 12680 France
## 9295 12680 France
## 9296 12680 France
## 9297 12680 France
## 9298 12680 France
#bar plot of number of orders from Jan to Nov (Dec transactions are incomplete thus excluding from the plot)
orders1 = filter(orders, Month != '12')
df = data.frame(orders1)
p1 <- ggplot(df, aes(Month)) + geom_bar(aes(fill=Country), width = 0.5) +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title="Histogram on Transaction Numbers By Countries",
subtitle="January to November")
p1
#box plot and dot plot for Unit Prices and Quantities of each transaction
#outliers are not frequent but can be out of three standard deviations, thus in the modeling, I decide to use robust models to reduce outlier effects.
p2 <- ggplot(df, aes(Month, UnitPrice))
p2 + geom_boxplot(aes((reorder(Month, UnitPrice, median)))) +
labs(title="Box plot",
subtitle="City Mileage grouped by Class of vehicle",
caption="Source: mpg",
x="Class of Vehicle",
y="City Mileage")
p3 <- ggplot(df, aes(Month, Quantity))
p3 + geom_boxplot(aes((reorder(Month, Quantity, median)))) +
labs(title="Box plot",
subtitle="City Mileage grouped by Class of vehicle",
caption="Source: mpg",
x="Class of Vehicle",
y="City Mileage")
# the new variable for total revenue for each transaction
orders$totalRevenue = orders$Quantity*orders$UnitPrice
ZapRegression = read.csv("ZapRegression.csv", header = TRUE)
str(ZapRegression)
## 'data.frame': 305 obs. of 7 variables:
## $ total_revenue : num 3457 6200 6974 7535 7579 ...
## $ customer_count : int 279 286 448 518 529 534 538 598 609 630 ...
## $ net_revenue : num 4821 3457 5384 8608 15750 ...
## $ customer_count1: int 221 279 353 406 418 439 452 498 537 537 ...
## $ numcancel : int 40 49 76 18 63 43 13 5 15 22 ...
## $ numorder : int 1980 3918 3378 2033 1639 1897 2479 815 1562 2305 ...
## $ cancel_ratio : int 2 1 2 0 3 2 0 0 0 0 ...
# create the training and testing datasets for predictive analysis
# use 60% of the data --- for training
smp_size = floor(0.6*nrow(ZapRegression))
set.seed(11)
train_ind = sample(seq_len(nrow(ZapRegression)), size = smp_size)
train = ZapRegression[train_ind, ]
test = ZapRegression[-train_ind, ]
par(mfrow = c(2,2))
plot(ZapRegression$total_revenue,ZapRegression$customer_count)
plot(ZapRegression$net_revenue,ZapRegression$customer_count1)
plot(ZapRegression$total_revenue,ZapRegression$numcancel)
plot(ZapRegression$net_revenue,ZapRegression$numcancel)
cor(ZapRegression$total_revenue,ZapRegression$customer_count)
## [1] 0.6426399
cor(ZapRegression$net_revenue,ZapRegression$customer_count1)
## [1] 0.702617
cor(ZapRegression$total_revenue,ZapRegression$numcancel)
## [1] -0.05330082
cor(ZapRegression$net_revenue,ZapRegression$numcancel)
## [1] -0.04758116
reg1.lm=lm(total_revenue~customer_count, data = ZapRegression)
summary(reg1.lm)
##
## Call:
## lm(formula = total_revenue ~ customer_count, data = ZapRegression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27573 -7688 -3339 2570 167886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6100.282 2189.436 2.786 0.00567 **
## customer_count 16.575 1.135 14.600 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16230 on 303 degrees of freedom
## Multiple R-squared: 0.413, Adjusted R-squared: 0.411
## F-statistic: 213.2 on 1 and 303 DF, p-value: < 2.2e-16
#robust model for the test data decrease outlier effects
reg1_1.lm=rlm(total_revenue~customer_count, data = train)
summary(reg1_1.lm)
##
## Call: rlm(formula = total_revenue ~ customer_count, data = train)
## Residuals:
## Min 1Q Median 3Q Max
## -24921.4 -5028.6 -908.4 5052.3 170516.7
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 3490.0311 1454.6771 2.3992
## customer_count 16.5624 0.7732 21.4208
##
## Residual standard error: 7608 on 181 degrees of freedom
Y_pred = predict(reg1.lm, newdata = test)
summary(Y_pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10841 25957 32786 36070 43813 86871
SS.total = sum((test$total_revenue-mean(test$total_revenue))^2)
SS.residual = sum((test$total_revenue-Y_pred)^2)
SS.regression = sum((Y_pred-mean(test$total_revenue))^2)
test.rsq = 1-SS.residual/SS.total
paste("Test R-square",test.rsq)
## [1] "Test R-square 0.520714031838455"
# fraction of variability explained by the model
paste("R-square", SS.regression/SS.total)
## [1] "R-square 0.530262938813604"
reg3.lm=lm(total_revenue~customer_count + numcancel, data = ZapRegression)
summary(reg3.lm)
##
## Call:
## lm(formula = total_revenue ~ customer_count + numcancel, data = ZapRegression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28931 -7579 -3549 2590 166136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7754.535 2397.126 3.235 0.00135 **
## customer_count 16.634 1.132 14.688 < 2e-16 ***
## numcancel -57.694 34.538 -1.670 0.09587 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16180 on 302 degrees of freedom
## Multiple R-squared: 0.4184, Adjusted R-squared: 0.4145
## F-statistic: 108.6 on 2 and 302 DF, p-value: < 2.2e-16
reg3_1.lm=rlm(total_revenue~customer_count + numcancel, data = train)
summary(reg3_1.lm)
##
## Call: rlm(formula = total_revenue ~ customer_count + numcancel, data = train)
## Residuals:
## Min 1Q Median 3Q Max
## -24848 -4754 -1053 4816 170139
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 3847.0683 1576.2667 2.4406
## customer_count 16.5753 0.7630 21.7242
## numcancel -13.8774 21.2568 -0.6528
##
## Residual standard error: 7135 on 180 degrees of freedom
Y_pred = predict(reg3_1.lm, newdata = test)
summary(Y_pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7908 23221 29920 33407 40835 83980
SS.total = sum((test$total_revenue-mean(test$total_revenue))^2)
SS.residual = sum((test$total_revenue-Y_pred)^2)
SS.regression = sum((Y_pred-mean(test$total_revenue))^2)
test.rsq = 1-SS.residual/SS.total
paste("Test R-square",test.rsq)
## [1] "Test R-square 0.507437502714105"
# fraction of variability explained by the model
paste("R-square", SS.regression/SS.total)
## [1] "R-square 0.543129986821479"
# to many zigzags and I consider polynomial model to be a better fit, see poly1.lm
ZapRegression["cc_2"] = ZapRegression$customer_count^2
poly1.lm=lm(total_revenue~customer_count+cc_2, data = ZapRegression)
summary(poly1.lm)
##
## Call:
## lm(formula = total_revenue ~ customer_count + cc_2, data = ZapRegression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29795 -7554 -3137 3246 166391
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.386e+03 4.349e+03 -1.238 0.21654
## customer_count 2.911e+01 4.268e+00 6.820 4.97e-11 ***
## cc_2 -2.795e-03 9.185e-04 -3.043 0.00255 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16010 on 302 degrees of freedom
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4267
## F-statistic: 114.1 on 2 and 302 DF, p-value: < 2.2e-16
train["cc_2"] = train$customer_count^2
test["cc_2"] = test$customer_count^2
train["c_2"] = train$customer_count1^2
test["c_2"] = test$customer_count1^2
poly1_1.lm=lm(total_revenue~customer_count1+cc_2, data = train)
summary(poly1_1.lm)
##
## Call:
## lm(formula = total_revenue ~ customer_count1 + cc_2, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29833 -8129 -3902 2649 166552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.208e+01 5.772e+03 -0.009 0.993
## customer_count1 3.035e+01 7.205e+00 4.212 3.99e-05 ***
## cc_2 -1.379e-03 1.160e-03 -1.189 0.236
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17920 on 180 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3527
## F-statistic: 50.59 on 2 and 180 DF, p-value: < 2.2e-16
Y_pred = predict(poly1_1.lm, newdata = test)
summary(Y_pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8304 25992 34375 36403 44275 75618
SS.total = sum((test$total_revenue-mean(test$total_revenue))^2)
SS.residual = sum((test$total_revenue-Y_pred)^2)
SS.regression = sum((Y_pred-mean(test$total_revenue))^2)
test.rsq = 1-SS.residual/SS.total
paste("Test R-square",test.rsq)
## [1] "Test R-square 0.516175478946498"
# fraction of variability explained by the model
paste("R-square", SS.regression/SS.total)
## [1] "R-square 0.554531739429007"
# Adj R-square is higher for the test dataset indicates that outliers affect the model quality
reg2.lm=lm(net_revenue~customer_count1, data = ZapRegression)
summary(reg2.lm)
##
## Call:
## lm(formula = net_revenue ~ customer_count1, data = ZapRegression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28957 -5666 -1424 3178 74181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4920.7453 1428.8202 3.444 0.000654 ***
## customer_count1 16.7128 0.9724 17.188 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10470 on 303 degrees of freedom
## Multiple R-squared: 0.4937, Adjusted R-squared: 0.492
## F-statistic: 295.4 on 1 and 303 DF, p-value: < 2.2e-16
reg2_1.lm=rlm(net_revenue~customer_count1, data = train)
summary(reg2_1.lm)
##
## Call: rlm(formula = net_revenue ~ customer_count1, data = train)
## Residuals:
## Min 1Q Median 3Q Max
## -24833 -3901 -384 4402 37541
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 1551.9794 1285.3766 1.2074
## customer_count1 18.6213 0.8979 20.7383
##
## Residual standard error: 5968 on 181 degrees of freedom
Y_pred = predict(reg2_1.lm, newdata = test)
summary(Y_pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6747 18744 24866 27292 33129 64827
SS.total = sum((test$net_revenue-mean(test$net_revenue))^2)
SS.residual = sum((test$net_revenue-Y_pred)^2)
SS.regression = sum((Y_pred-mean(test$net_revenue))^2)
test.rsq = 1-SS.residual/SS.total
paste("Test R-square",test.rsq)
## [1] "Test R-square 0.334971941966514"
# fraction of variability explained by the model
paste("R-square", SS.regression/SS.total)
## [1] "R-square 0.644822434114578"
#
reg4.lm=lm(net_revenue~customer_count1 + numcancel, data = ZapRegression)
summary(reg4.lm)
##
## Call:
## lm(formula = net_revenue ~ customer_count1 + numcancel, data = ZapRegression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26363 -5431 -1519 2837 73928
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6012.6940 1560.4708 3.853 0.000142 ***
## customer_count1 16.7652 0.9698 17.288 < 2e-16 ***
## numcancel -38.1534 22.2769 -1.713 0.087795 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10440 on 302 degrees of freedom
## Multiple R-squared: 0.4985, Adjusted R-squared: 0.4952
## F-statistic: 150.1 on 2 and 302 DF, p-value: < 2.2e-16
reg4_1.lm=rlm(net_revenue~customer_count1 + numcancel, data = train)
## Warning in rlm.default(x, y, weights, method = method, wt.method =
## wt.method, : 'rlm' failed to converge in 20 steps
summary(reg4_1.lm)
##
## Call: rlm(formula = net_revenue ~ customer_count1 + numcancel, data = train)
## Residuals:
## Min 1Q Median 3Q Max
## -24838.6 -4029.7 -367.9 4404.5 37542.2
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 1520.8701 1439.3870 1.0566
## customer_count1 18.6229 0.9176 20.2948
## numcancel 1.9098 19.1575 0.0997
##
## Residual standard error: 6130 on 180 degrees of freedom
Y_pred = predict(reg4_1.lm, newdata = test)
summary(Y_pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6810 18751 24926 27320 33132 64889
SS.total = sum((test$net_revenue-mean(test$net_revenue))^2)
SS.residual = sum((test$net_revenue-Y_pred)^2)
SS.regression = sum((Y_pred-mean(test$net_revenue))^2)
test.rsq = 1-SS.residual/SS.total
paste("Test R-square",test.rsq)
## [1] "Test R-square 0.333550626797492"
# fraction of variability explained by the model
paste("R-square", SS.regression/SS.total)
## [1] "R-square 0.645312973546001"
ZapRegression["c_2"] = ZapRegression$customer_count1^2
poly2.lm=lm(net_revenue~customer_count1+c_2, data = ZapRegression)
summary(poly2.lm)
##
## Call:
## lm(formula = net_revenue ~ customer_count1 + c_2, data = ZapRegression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28490 -5825 -1592 3663 72069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.097e+03 3.088e+03 -1.975 0.0492 *
## customer_count1 3.255e+01 4.072e+00 7.994 2.80e-14 ***
## c_2 -4.682e-03 1.171e-03 -4.000 7.98e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10220 on 302 degrees of freedom
## Multiple R-squared: 0.5191, Adjusted R-squared: 0.516
## F-statistic: 163 on 2 and 302 DF, p-value: < 2.2e-16
poly2_1.lm=lm(net_revenue~customer_count1+c_2, data = train)
summary(poly2_1.lm)
##
## Call:
## lm(formula = net_revenue ~ customer_count1 + c_2, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22922 -5884 -1828 3964 33688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.212e+03 3.603e+03 -2.557 0.011386 *
## customer_count1 3.590e+01 4.813e+00 7.460 3.54e-12 ***
## c_2 -5.104e-03 1.397e-03 -3.654 0.000338 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8951 on 180 degrees of freedom
## Multiple R-squared: 0.6248, Adjusted R-squared: 0.6206
## F-statistic: 149.9 on 2 and 180 DF, p-value: < 2.2e-16
Y_pred = predict(poly2_1.lm, newdata = test)
summary(Y_pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 408.2 19585.8 27739.2 28554.6 36979.3 53857.6
SS.total = sum((test$net_revenue-mean(test$net_revenue))^2)
SS.residual = sum((test$net_revenue-Y_pred)^2)
SS.regression = sum((Y_pred-mean(test$net_revenue))^2)
test.rsq = 1-SS.residual/SS.total
paste("R-square", SS.regression/SS.total)
## [1] "R-square 0.671435624454648"
poly3.lm=lm(net_revenue~customer_count1+c_2 + total_revenue, data = ZapRegression)
summary(poly3.lm)
##
## Call:
## lm(formula = net_revenue ~ customer_count1 + c_2 + total_revenue,
## data = ZapRegression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29315 -5740 -1461 3424 71485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.734e+03 3.087e+03 -2.182 0.0299 *
## customer_count1 3.574e+01 4.338e+00 8.240 5.36e-15 ***
## c_2 -5.142e-03 1.186e-03 -4.337 1.98e-05 ***
## total_revenue -7.489e-02 3.643e-02 -2.055 0.0407 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10160 on 301 degrees of freedom
## Multiple R-squared: 0.5258, Adjusted R-squared: 0.5211
## F-statistic: 111.3 on 3 and 301 DF, p-value: < 2.2e-16
poly3_1.lm=lm(net_revenue~customer_count1+c_2 + total_revenue, data = train)
summary(poly3_1.lm)
##
## Call:
## lm(formula = net_revenue ~ customer_count1 + c_2 + total_revenue,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23062 -6071 -1784 3733 33814
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.775e+03 3.592e+03 -2.721 0.007148 **
## customer_count1 3.881e+01 5.035e+00 7.707 8.56e-13 ***
## c_2 -5.515e-03 1.405e-03 -3.924 0.000124 ***
## total_revenue -6.857e-02 3.732e-02 -1.837 0.067830 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8893 on 179 degrees of freedom
## Multiple R-squared: 0.6317, Adjusted R-squared: 0.6256
## F-statistic: 102.4 on 3 and 179 DF, p-value: < 2.2e-16
Y_pred = predict(poly3_1.lm, newdata = test)
summary(Y_pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 197.3 19432.7 27063.1 28594.7 36867.1 53785.8
SS.total = sum((test$net_revenue-mean(test$net_revenue))^2)
SS.residual = sum((test$net_revenue-Y_pred)^2)
SS.regression = sum((Y_pred-mean(test$net_revenue))^2)
test.rsq = 1-SS.residual/SS.total
paste("R-square", SS.regression/(SS.total))
## [1] "R-square 0.677051224656376"
poly2 and poly3 (last two models) both have R-square over 0.67. And since the coefficients of c_2 (squared term) are both negative and small, we expect the curve to be concave and flat. In English, when we have more customers buying staff, the growth rate of net revenue are decreasing. Obviously, it is just an estimate based on one-year sales.
Why does it matter? I built significant models to use the number of customers taking orders and cancelling orders to predict total revenues on daily basis (reg1.lm and reg3.lm). Then, using the last two polynomial regression models, we can predict net revenues, which indicates the real accounts receivable. By that, it is much more straightforward to have a picture of cash inflows from operations (with residuals, of course). From an accounting perspective, knowing such information helps the company to more efficiently plan future investments and save costs.
What else can we do? Because of the limit of the dataset, I am not able to further analyze the conversion rate (% of customers who actually take any orders among all daily site visitors). Then, based on visiting patterns in each hour, it is possible for the company to know and predict incoming visitors as long as the site is closely monitored.