In this blog post, I will walk you through a detailed time series analysis of e-commerce sales data. Understanding historical sales trends is essential for any business looking to make informed decisions about inventory, marketing, and customer engagement. By leveraging daily transaction data, we can uncover seasonality, detect long-term trends, and forecast future revenue with greater confidence.
The goal here is to build forecasting models that not only capture past patterns but also provide actionable insights for business planning. We’ll explore two traditional statistical methods, ARIMA and ETS, to model and predict sales dynamics.
About the Data
We will use the Amazon E-commerce Click-stream Transaction Data for this purpose.
The data contains the following variables:

UserID: Identifier for the user.
SessionID: Identifier for the user’s session.
Timestamp: The time at which the event occurred.
EventType: The type of event (e.g., page view, product view, add to cart).
ProductID: Identifier for the product involved in the event.
Amount: The monetary amount associated with the event.
Outcome: The outcome of the event (e.g., success, failure).
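To make the schema concrete, here is a minimal sketch of what a couple of rows might look like. The values below are invented for illustration; they are not drawn from the actual file.

# Illustrative rows only -- invented values, not real records from the dataset
tibble::tribble(
  ~UserID, ~SessionID, ~Timestamp,            ~EventType,  ~ProductID, ~Amount, ~Outcome,
  101,     3,          "2024-01-01 00:01:35", "page_view", "P-001",    NA,      "success",
  101,     3,          "2024-01-01 00:03:12", "purchase",  "P-001",    49.99,   "success"
)

Only events that represent an actual transaction carry a value in Amount; everything else is NA.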
Let’s begin by loading the packages and then the data. For this purpose, we will use pacman to load all the packages.
pacman::p_load(tidyverse, ggplot2, lubridate, gt, gtsummary, caret, tsibble, feasts,
               xgboost, randomForest, pROC, prophet, forecast, timetk, tidyr)
Next, let’s take a glimpse of the data by generating summary statistics for the variables in the dataset.
# Load the dataset and parse the timestamp
data <- read.csv('ecommerce_clickstream_transactions 3.csv')
data <- data %>% mutate(Timestamp = ymd_hms(Timestamp))
summary(data)
UserID SessionID Timestamp
Min. : 1.0 Min. : 1.00 Min. :2024-01-01 00:01:35
1st Qu.: 251.0 1st Qu.: 3.00 1st Qu.:2024-02-21 04:42:25
Median : 501.0 Median : 6.00 Median :2024-04-13 00:51:23
Mean : 500.7 Mean : 5.51 Mean :2024-04-12 21:17:37
3rd Qu.: 751.0 3rd Qu.: 8.00 3rd Qu.:2024-06-03 07:31:54
Max. :1000.0 Max. :10.00 Max. :2024-07-24 10:13:04
EventType ProductID Amount Outcome
Length:74817 Length:74817 Min. : 5.132 Length:74817
Class :character Class :character 1st Qu.:130.934 Class :character
Mode :character Mode :character Median :253.113 Mode :character
Mean :253.190
3rd Qu.:378.832
Max. :499.982
NA's :64135
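The large number of missing values in Amount is expected: most clickstream events (page views, product views, and so on) carry no monetary value. Before aggregating, it is worth confirming which event types actually come with an amount attached; a quick sketch along these lines would do it:

# Count events per type, and how many of each carry a monetary Amount
data %>%
  group_by(EventType) %>%
  summarise(events = n(), with_amount = sum(!is.na(Amount)))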
Having done that, we can generate daily sales by summarizing the transaction data. We first filter the dataset to rows with a monetary amount, then extract the date from the timestamp so sales can be grouped at the daily level. We then group by calendar day and compute total sales for each day by summing all transaction values that occurred on that date. Finally, days with no transactions are filled with zero so the series has no gaps.
daily_sales <- data %>%
  filter(!is.na(Amount)) %>%
  mutate(date = as.Date(Timestamp)) %>%
  group_by(date) %>%
  summarise(sales = sum(Amount)) %>%
  complete(date = seq.Date(min(date), max(date), by = "day"), fill = list(sales = 0)) %>%
  arrange(date)

# frequency sets the seasonal period of the ts object; 10 is used here,
# though for daily data a true weekly cycle would correspond to frequency = 7
ts_data <- ts(daily_sales$sales, frequency = 10)
Let’s begin by visualizing the daily sales over time.
autoplot(ts_data) +
ggtitle("Daily Sales Time Series") +
ylab("Amount")
The daily sales time series shows fluctuations in daily sales amounts over the observation window. Sales exhibit high short-term variability with frequent peaks and troughs, suggesting volatile daily purchasing behavior. While there is no strong upward or downward trend visually, the data reveal occasional sales surges exceeding $17,500 and dips below $10,000. This variability may point to promotional effects, weekday/weekend patterns, or shifting consumer activity, warranting further decomposition and model-based analysis to uncover the underlying structure and seasonality.
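Before decomposing the series, a quick way to probe the suspected weekday/weekend pattern is to average sales by day of week; a minimal sketch using lubridate’s wday():

# Average daily sales by day of week -- a quick check for a weekly cycle
daily_sales %>%
  mutate(weekday = wday(date, label = TRUE, week_start = 1)) %>%
  group_by(weekday) %>%
  summarise(mean_sales = mean(sales))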
decomp <- stl(ts_data, s.window = "periodic")
autoplot(decomp)
To better understand the underlying structure of daily sales, we applied Seasonal-Trend decomposition using Loess (STL). The decomposition plot shows the observed series together with three interpretable components:
Observed Data: Daily sales show strong short-term volatility, consistent with consumer-level transaction data.
Trend Component: Sales remained relatively stable over the observed period, with minor fluctuations and a potential softening near the end.
Seasonal Component: A clear, repeating weekly pattern suggests cyclic consumer behavior, likely driven by weekday vs. weekend dynamics.
Remainder (Residual): Unexplained variation is small and randomly distributed, indicating the model captures the structure well.
This decomposition confirms the presence of strong seasonality and a mild trend.
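Since feasts is already loaded, the visual impression can be quantified as well. The sketch below computes STL-based features on a tsibble version of the daily sales, where trend and seasonal strength values near 1 indicate a strong component:

# STL feature summary: seasonal/trend strength on a scale of 0 to 1
daily_sales %>%
  as_tsibble(index = date) %>%
  features(sales, feat_stl)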
ARIMA Model
With the data’s structure understood, we now turn to forecasting. ARIMA (AutoRegressive Integrated Moving Average) is particularly effective when data exhibit temporal dependence and stochastic trends. To account for the identified weekly pattern, we use seasonal ARIMA with automated parameter selection via auto.arima().
arima_model <- auto.arima(ts_data)
arima_forecast <- forecast(arima_model, h = 30)
autoplot(arima_forecast) + ggtitle("ARIMA Forecast")
The ARIMA model was fitted to the historical daily sales data using auto.arima(), which automatically selected the best-fitting seasonal and non-seasonal parameters. The resulting forecast projects sales for the next 30 days, shown in the shaded blue region.
The dark blue line represents the predicted daily sales.
The shaded bands indicate forecast uncertainty, with the inner and outer bands capturing the 80% and 95% prediction intervals respectively.
The forecast appears relatively stable, reflecting the stationarity captured after accounting for trend and seasonality.
The model successfully smooths out short-term noise while preserving the overall level observed in the historical data.
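Before leaning on these intervals, it is sensible to verify the residuals formally. The forecast package’s checkresiduals() plots the residuals and their ACF and runs a Ljung-Box test; printing the model also reveals the orders auto.arima() chose:

arima_model                 # prints the selected ARIMA specification
checkresiduals(arima_model) # residual plots plus a Ljung-Box test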
The Exponential Smoothing State Space Model (ETS) provides an alternative to ARIMA by directly modeling trend and seasonality as separate components (Error, Trend, Seasonality). It is particularly effective when the series exhibits regular seasonal patterns—as observed in our earlier decomposition.
The model selected was an ETS(A,A,A) configuration (additive error, trend, and seasonality).
Like ARIMA, the ETS forecast for the next 30 days stays within a relatively stable band but with slightly more responsiveness to recent patterns.
Confidence intervals are also shown, gradually widening to reflect forecast uncertainty.
ets_model <- ets(ts_data)
ets_forecast <- forecast(ets_model, h = 30)
autoplot(ets_forecast) + ggtitle("ETS Forecast")
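To confirm the (A,A,A) configuration and inspect the estimated smoothing parameters (alpha, beta, gamma), the fitted object can be summarized directly:

# Show the selected ETS form and its estimated smoothing parameters
summary(ets_model)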
accuracy(arima_forecast)
                     ME     RMSE      MAE       MPE     MAPE      MASE         ACF1
Training set  0.2158205 2215.429 1752.366 -3.057184 14.11246 0.7019937 -0.001717544
accuracy(ets_forecast)
                     ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
Training set  0.5330732 2229.006 1761.505 -3.084051 14.17162 0.7056546 -0.1084056
To evaluate model performance, we compared ARIMA and ETS forecasts using training set accuracy metrics. Both models demonstrated strong performance, with very similar error rates.
ARIMA produced slightly lower values for both the Root Mean Squared Error (RMSE = 2215.43) and Mean Absolute Error (MAE = 1752.37), compared to ETS (RMSE = 2229.01, MAE = 1761.51). These differences are marginal, suggesting both models fit the historical data comparably well.
In terms of bias, ARIMA had a near-zero mean error (ME = 0.22), while ETS showed a slightly larger positive bias (ME = 0.53), though the discrepancy is negligible. The Mean Absolute Percentage Error (MAPE) for both models hovered around 14%, indicating moderately accurate predictions relative to actual sales values.
Finally, residual diagnostics show minimal autocorrelation in both models, with ARIMA displaying near-white noise behavior (ACF1 ≈ 0), and ETS exhibiting a small negative lag-1 autocorrelation (-0.11).
Overall, ARIMA holds a slight edge in accuracy and residual behavior on the training set. However, the difference is minor, and a final decision between the models should ideally be based on out-of-sample performance.
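As a closing illustration, here is a minimal sketch of such an out-of-sample comparison: hold out the final 30 days, refit both models on the remaining history, and score the forecasts against the held-out observations. The 30-day holdout length is an assumption chosen to match the forecast horizon used above.

# Out-of-sample comparison: train on all but the last 30 days, test on the rest
h <- 30
n <- length(ts_data)
train <- window(ts_data, end = time(ts_data)[n - h])
test  <- window(ts_data, start = time(ts_data)[n - h + 1])

arima_fit <- auto.arima(train)
ets_fit   <- ets(train)

# Test-set error metrics for each model
accuracy(forecast(arima_fit, h = h), test)["Test set", c("RMSE", "MAE", "MAPE")]
accuracy(forecast(ets_fit, h = h), test)["Test set", c("RMSE", "MAE", "MAPE")]

The model with the lower test-set RMSE and MAE would then be the stronger candidate for forecasting future sales.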