Alibaba Singles Day (11/11) Sales Predictions

Frank Zheng

2019-11-13 16:48:32

Introduction

This document essentially gathers historic data on previous years Alibaba Taobao sales on Singles Day, fits the data, selects the best model and yields predictions on future sales on Singles Day. The document is designed to be a live document, which means it will continuously gather future sale data, run the model selection and produces new forecast.

This project was inspired by a user who publised his/her predictions on Alibaba Single Day sales back in April 2019 but successfully predicited the sales number in November with minimal error percentage using a relatively simple third degree polynominal model. Screen shots attached.

screenshot1 screenshot2

Load packages

suppressMessages(librarian::shelf(plyr,
                                  dplyr, 
                                  gridExtra, 
                                  grid,
                                  ggplot2,
                                  readxl,
                                  magrittr,
                                  cluster,
                                  fpc,
                                  Hmisc,
                                  devtools,
                                  noncensus, 
                                  ramhiser,
                                  plotly,
                                  rowr,
                                  quiet = TRUE))

Load data

year <- seq(2009, 2018, 1)
sales <- c(0.5, 9.36, 52, 191, 350, 571, 912, 1207, 1682.69, 2135)
sales_df <- data.frame(year, sales)

# view data frame
sales_df
##    year   sales
## 1  2009    0.50
## 2  2010    9.36
## 3  2011   52.00
## 4  2012  191.00
## 5  2013  350.00
## 6  2014  571.00
## 7  2015  912.00
## 8  2016 1207.00
## 9  2017 1682.69
## 10 2018 2135.00
# overview data
plot_ly(sales_df, x= ~year, y=~sales, type = 'scatter', mode = 'lines+markers') %>%
  layout(
    title = "Alibaba Taobao Single Day Sales Over Time ",
    xaxis = list(title = 'Year'), yaxis = list(title ='Sales (in 0.1*billion RMB)'))

Model fitting

set.seed(20)
#fit first degree polynomial equation:
fit  <- lm(sales~year)
#second degree
fit2 <- lm(sales~poly(year,2,raw=TRUE))
#third degree
fit3 <- lm(sales~poly(year,3,raw=TRUE))
#fourth degree
fit4 <- lm(sales~poly(year,4,raw=TRUE))

Model selection

Multicollinearity issues

## 
## Call:
## lm(formula = sales ~ poly(year, 3, raw = TRUE))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.029  -6.476  -2.476  13.599  26.217 
## 
## Coefficients: (1 not defined because of singularities)
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 1.215e+08  3.763e+06   32.30 7.05e-09 ***
## poly(year, 3, raw = TRUE)1 -1.210e+05  3.737e+03  -32.36 6.96e-09 ***
## poly(year, 3, raw = TRUE)2  3.010e+01  9.281e-01   32.43 6.86e-09 ***
## poly(year, 3, raw = TRUE)3         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.33 on 7 degrees of freedom
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9992 
## F-statistic:  5614 on 2 and 7 DF,  p-value: 6.036e-12
## 
## Call:
## lm(formula = sales ~ poly(year, 4, raw = TRUE))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.029  -6.476  -2.476  13.599  26.217 
## 
## Coefficients: (2 not defined because of singularities)
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 1.215e+08  3.763e+06   32.30 7.05e-09 ***
## poly(year, 4, raw = TRUE)1 -1.210e+05  3.737e+03  -32.36 6.96e-09 ***
## poly(year, 4, raw = TRUE)2  3.010e+01  9.281e-01   32.43 6.86e-09 ***
## poly(year, 4, raw = TRUE)3         NA         NA      NA       NA    
## poly(year, 4, raw = TRUE)4         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.33 on 7 degrees of freedom
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9992 
## F-statistic:  5614 on 2 and 7 DF,  p-value: 6.036e-12

Looking at the third and fourth dgree polynimial model summaries, there are NAs in their coefficient estimates. This indicates these variables are linear combinations of other predictors. When predictors are not linearly independent and parameters are undefined, this can only be solved by dropping redundant predictors until all predictors are linearly independent, which leads us to choose the best model from the linear equation or second degree polynomial equation.

Errors

## 
## Call:
## lm(formula = sales ~ year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -258.49 -162.16  -81.56  138.79  358.05 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -476217.66   54379.26  -8.757 2.26e-05 ***
## year            236.87      27.01   8.770 2.24e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 245.3 on 8 degrees of freedom
## Multiple R-squared:  0.9058, Adjusted R-squared:  0.894 
## F-statistic: 76.92 on 1 and 8 DF,  p-value: 2.24e-05
## 
## Call:
## lm(formula = sales ~ poly(year, 2, raw = TRUE))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.029  -6.476  -2.476  13.599  26.217 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 1.215e+08  3.763e+06   32.30 7.05e-09 ***
## poly(year, 2, raw = TRUE)1 -1.210e+05  3.737e+03  -32.36 6.96e-09 ***
## poly(year, 2, raw = TRUE)2  3.010e+01  9.281e-01   32.43 6.86e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.33 on 7 degrees of freedom
## Multiple R-squared:  0.9994, Adjusted R-squared:  0.9992 
## F-statistic:  5614 on 2 and 7 DF,  p-value: 6.036e-12

fit2 (second degree polynomial equation) produces lower residual error and yields better R^2 and adjusted R^2 than fit. Looks like fit2 is the better model to proceed.

ANOVA

## Analysis of Variance Table
## 
## Model 1: sales ~ year
## Model 2: sales ~ poly(year, 2, raw = TRUE)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1      8 481400                                  
## 2      7   3184  1    478217 1051.5 6.863e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Double check models with an Analysis of Variance. The Pr(>F) value is the probability of rejecting the null hypothesis, where one model does not fit better than the other model. We have a very significant p-value, so we can reject the null hypothesis, i.e. fit2 (second degree polynomial equation) provides a better fit than fit (linear equation)

Final equation to use: \[y=ax^3 + bx^2 + cx + d\]

##                (Intercept) poly(year, 2, raw = TRUE)1 
##               1.215345e+08              -1.209560e+05 
## poly(year, 2, raw = TRUE)2 
##               3.009508e+01

Prediction

predicted.intervals <- predict(fit2,data.frame(year=year),interval='confidence',
                               level=0.99)

df <- data.frame(year, sales, predicted = predicted.intervals[,1])

p <- plot_ly(df, x= ~year, y= ~sales, name = 'Actual', type = 'scatter', mode = 'lines+markers') %>%
  add_trace(y =~predicted, name ='Predicted', mode = 'lines+markers') %>%
  layout(
    title = "Actual Alibaba Taobao Single Day Sales vs. Predicted Over Time ",
    xaxis = list(title = 'Year'), yaxis = list(title ='Sales (in 0.1*billion RMB)'))
p
# Predicted values and confidence intervals
year.new <- seq(2009, 2025, 1)
predicted.intervals <- predicted.intervals<- predict(fit2,data.frame(year=year.new),interval='confidence',
                               level=0.99)

df <- cbind.fill(year.new, sales, predicted = predicted.intervals[,1], lowerbound = predicted.intervals[,2], upperbound = predicted.intervals[,3], fill = NA)

colnames(df) <- c("year", "actual","predicted", "lowerbound", "upperbound")
p <- plot_ly(df, x= ~year.new, y= ~predicted, name = 'Predicted', type = 'scatter', mode = 'lines+markers') %>%
  add_trace(y =~actual, name ='Actual', mode = 'lines+markers') %>%
  add_trace(y =~lowerbound, name ='Lowerbound', mode = 'lines+markers') %>%
  add_trace(y =~upperbound, name ='Upperbound', mode = 'lines+markers') %>%
  layout(
    title = "Alibaba Taobao Single Day Predictions ",
    xaxis = list(title = 'Year'), yaxis = list(title ='Sales (in 0.1*billion RMB)'))

p

Forcast results

##    year  actual   predicted lowerbound upperbound
## 1  2009    0.50    6.301091  -52.37678   64.97896
## 2  2010    9.36    2.406000  -36.99917   41.81117
## 3  2011   52.00   58.701061   26.74620   90.65592
## 4  2012  191.00  175.186273  142.19196  208.18059
## 5  2013  350.00  351.861636  316.52094  387.20233
## 6  2014  571.00  588.727151  553.38645  624.06785
## 7  2015  912.00  885.782818  852.78850  918.77713
## 8  2016 1207.00 1243.028636 1211.07377 1274.98350
## 9  2017 1682.69 1660.464606 1621.05944 1699.86977
## 10 2018 2135.00 2138.090727 2079.41286 2196.76860
## 11 2019      NA 2675.907000 2588.13022 2763.68378
## 12 2020      NA 3273.913424 3148.99896 3398.82789
## 13 2021      NA 3932.110000 3762.82125 4101.39875
## 14 2022      NA 4650.496727 4429.94714 4871.04632
## 15 2023      NA 5429.073606 5150.54302 5707.60420
## 16 2024      NA 6267.840636 5924.69533 6610.98594
## 17 2025      NA 7166.797818 6752.45249 7581.14314

Conclusion

It would be extremely difficult to determine whether the sales data from Alibaba Singles Day is the result of an artifact without any additional evidence, just because data fits the model perfectly doesn’t nesessarily mean data is not genuine.However, it is very likely that Alibaba may use a similar model to compose their Single Day Sales outlook ahead, then invest efforts into it, and eventually reach that benchmark or KPI (Key performance indicator).