This document essentially gathers historic data on previous years Alibaba Taobao sales on Singles Day, fits the data, selects the best model and yields predictions on future sales on Singles Day. The document is designed to be a live document, which means it will continuously gather future sale data, run the model selection and produces new forecast.
This project was inspired by a user who publised his/her predictions on Alibaba Single Day sales back in April 2019 but successfully predicited the sales number in November with minimal error percentage using a relatively simple third degree polynominal model. Screen shots attached.
year <- seq(2009, 2018, 1)
sales <- c(0.5, 9.36, 52, 191, 350, 571, 912, 1207, 1682.69, 2135)
sales_df <- data.frame(year, sales)
# view data frame
sales_df## year sales
## 1 2009 0.50
## 2 2010 9.36
## 3 2011 52.00
## 4 2012 191.00
## 5 2013 350.00
## 6 2014 571.00
## 7 2015 912.00
## 8 2016 1207.00
## 9 2017 1682.69
## 10 2018 2135.00
##
## Call:
## lm(formula = sales ~ poly(year, 3, raw = TRUE))
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.029 -6.476 -2.476 13.599 26.217
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.215e+08 3.763e+06 32.30 7.05e-09 ***
## poly(year, 3, raw = TRUE)1 -1.210e+05 3.737e+03 -32.36 6.96e-09 ***
## poly(year, 3, raw = TRUE)2 3.010e+01 9.281e-01 32.43 6.86e-09 ***
## poly(year, 3, raw = TRUE)3 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.33 on 7 degrees of freedom
## Multiple R-squared: 0.9994, Adjusted R-squared: 0.9992
## F-statistic: 5614 on 2 and 7 DF, p-value: 6.036e-12
##
## Call:
## lm(formula = sales ~ poly(year, 4, raw = TRUE))
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.029 -6.476 -2.476 13.599 26.217
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.215e+08 3.763e+06 32.30 7.05e-09 ***
## poly(year, 4, raw = TRUE)1 -1.210e+05 3.737e+03 -32.36 6.96e-09 ***
## poly(year, 4, raw = TRUE)2 3.010e+01 9.281e-01 32.43 6.86e-09 ***
## poly(year, 4, raw = TRUE)3 NA NA NA NA
## poly(year, 4, raw = TRUE)4 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.33 on 7 degrees of freedom
## Multiple R-squared: 0.9994, Adjusted R-squared: 0.9992
## F-statistic: 5614 on 2 and 7 DF, p-value: 6.036e-12
Looking at the third and fourth dgree polynimial model summaries, there are NAs in their coefficient estimates. This indicates these variables are linear combinations of other predictors. When predictors are not linearly independent and parameters are undefined, this can only be solved by dropping redundant predictors until all predictors are linearly independent, which leads us to choose the best model from the linear equation or second degree polynomial equation.
##
## Call:
## lm(formula = sales ~ year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -258.49 -162.16 -81.56 138.79 358.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -476217.66 54379.26 -8.757 2.26e-05 ***
## year 236.87 27.01 8.770 2.24e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 245.3 on 8 degrees of freedom
## Multiple R-squared: 0.9058, Adjusted R-squared: 0.894
## F-statistic: 76.92 on 1 and 8 DF, p-value: 2.24e-05
##
## Call:
## lm(formula = sales ~ poly(year, 2, raw = TRUE))
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.029 -6.476 -2.476 13.599 26.217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.215e+08 3.763e+06 32.30 7.05e-09 ***
## poly(year, 2, raw = TRUE)1 -1.210e+05 3.737e+03 -32.36 6.96e-09 ***
## poly(year, 2, raw = TRUE)2 3.010e+01 9.281e-01 32.43 6.86e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.33 on 7 degrees of freedom
## Multiple R-squared: 0.9994, Adjusted R-squared: 0.9992
## F-statistic: 5614 on 2 and 7 DF, p-value: 6.036e-12
fit2 (second degree polynomial equation) produces lower residual error and yields better R^2 and adjusted R^2 than fit. Looks like fit2 is the better model to proceed.
## Analysis of Variance Table
##
## Model 1: sales ~ year
## Model 2: sales ~ poly(year, 2, raw = TRUE)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8 481400
## 2 7 3184 1 478217 1051.5 6.863e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Double check models with an Analysis of Variance. The Pr(>F) value is the probability of rejecting the null hypothesis, where one model does not fit better than the other model. We have a very significant p-value, so we can reject the null hypothesis, i.e. fit2 (second degree polynomial equation) provides a better fit than fit (linear equation)
Final equation to use: \[y=ax^3 + bx^2 + cx + d\]
## (Intercept) poly(year, 2, raw = TRUE)1
## 1.215345e+08 -1.209560e+05
## poly(year, 2, raw = TRUE)2
## 3.009508e+01
predicted.intervals <- predict(fit2,data.frame(year=year),interval='confidence',
level=0.99)
df <- data.frame(year, sales, predicted = predicted.intervals[,1])
p <- plot_ly(df, x= ~year, y= ~sales, name = 'Actual', type = 'scatter', mode = 'lines+markers') %>%
add_trace(y =~predicted, name ='Predicted', mode = 'lines+markers') %>%
layout(
title = "Actual Alibaba Taobao Single Day Sales vs. Predicted Over Time ",
xaxis = list(title = 'Year'), yaxis = list(title ='Sales (in 0.1*billion RMB)'))
p# Predicted values and confidence intervals
year.new <- seq(2009, 2025, 1)
predicted.intervals <- predicted.intervals<- predict(fit2,data.frame(year=year.new),interval='confidence',
level=0.99)
df <- cbind.fill(year.new, sales, predicted = predicted.intervals[,1], lowerbound = predicted.intervals[,2], upperbound = predicted.intervals[,3], fill = NA)
colnames(df) <- c("year", "actual","predicted", "lowerbound", "upperbound")
p <- plot_ly(df, x= ~year.new, y= ~predicted, name = 'Predicted', type = 'scatter', mode = 'lines+markers') %>%
add_trace(y =~actual, name ='Actual', mode = 'lines+markers') %>%
add_trace(y =~lowerbound, name ='Lowerbound', mode = 'lines+markers') %>%
add_trace(y =~upperbound, name ='Upperbound', mode = 'lines+markers') %>%
layout(
title = "Alibaba Taobao Single Day Predictions ",
xaxis = list(title = 'Year'), yaxis = list(title ='Sales (in 0.1*billion RMB)'))
p## year actual predicted lowerbound upperbound
## 1 2009 0.50 6.301091 -52.37678 64.97896
## 2 2010 9.36 2.406000 -36.99917 41.81117
## 3 2011 52.00 58.701061 26.74620 90.65592
## 4 2012 191.00 175.186273 142.19196 208.18059
## 5 2013 350.00 351.861636 316.52094 387.20233
## 6 2014 571.00 588.727151 553.38645 624.06785
## 7 2015 912.00 885.782818 852.78850 918.77713
## 8 2016 1207.00 1243.028636 1211.07377 1274.98350
## 9 2017 1682.69 1660.464606 1621.05944 1699.86977
## 10 2018 2135.00 2138.090727 2079.41286 2196.76860
## 11 2019 NA 2675.907000 2588.13022 2763.68378
## 12 2020 NA 3273.913424 3148.99896 3398.82789
## 13 2021 NA 3932.110000 3762.82125 4101.39875
## 14 2022 NA 4650.496727 4429.94714 4871.04632
## 15 2023 NA 5429.073606 5150.54302 5707.60420
## 16 2024 NA 6267.840636 5924.69533 6610.98594
## 17 2025 NA 7166.797818 6752.45249 7581.14314
It would be extremely difficult to determine whether the sales data from Alibaba Singles Day is the result of an artifact without any additional evidence, just because data fits the model perfectly doesn’t nesessarily mean data is not genuine.However, it is very likely that Alibaba may use a similar model to compose their Single Day Sales outlook ahead, then invest efforts into it, and eventually reach that benchmark or KPI (Key performance indicator).