Table of Contents

  1. Abstract
  2. Data Preprocessing
  3. Exploratory Data Analysis
  4. Initial Evaluation for 2022
  5. ARIMA: Full Data Fitting and Cross-Validation
  6. Prophet: Full Data Fitting and Cross-Validation
  7. Final Verdict: Model Comparison
  8. Conclusion

Abstract

Sales forecasting helps businesses make informed decisions about inventory management and financial projections. Company XYZ, which operates in an environment with variable demand, relies on econometric analysis, historical sales data, and management judgment to forecast future sales. With this project, I aim to enhance the existing methodology by analyzing and comparing the performance of two time series forecasting models: the Auto-Regressive Integrated Moving Average (ARIMA) model and Facebook's Prophet algorithm. ARIMA is a traditional statistical forecasting method that captures autocorrelation, trend, and seasonal patterns in time series data. Prophet, on the other hand, offers a more flexible approach and is particularly useful for handling data with strong seasonal effects and missing values.

Through a comprehensive analysis, this project will preprocess the sales data, conduct exploratory data analysis to understand underlying patterns, fit both models to historical sales data, and assess their forecasting accuracy for 2022 using a hold-out validation approach. Then, by implementing rolling-window cross-validation, I will simulate real-world forecasting conditions for each product, training models on 24-month rolling windows and iteratively forecasting one step ahead.

The models’ performance will be compared based on forecast accuracy to provide a recommendation for Company XYZ’s forecasting needs. The objective is to determine a robust model that not only enhances predictive ability with minimal additional complexity but is also compatible with the company’s existing forecasting process.

Data Preprocessing

library(tidyverse)
library(lubridate)
library(forecast)

sales_data = read_csv("Sales.csv")
start_date = as.Date("2019-01-01")
end_date = as.Date("2022-12-31")
start_year = as.numeric(format(start_date, "%Y"))  # numeric year, required by ts() start

# Check full dataset for missing values

cat("Missing values =",sum(is.na(sales_data$COUNT)))
## Missing values = 0
# Type Conversion

sales_data["DATE"] = as.Date(paste(sales_data$YEAR,sales_data$MONTH,"01",sep="-"))
#sales_data["DATE"] = ceiling_date(sales_data$DATE,"month")-days(1)

p1 = sales_data %>%
     filter(PRODUCT == "P1" & DATE >= start_date & DATE <= end_date) %>%
     arrange(YEAR, MONTH)
p1_ts = ts(p1$COUNT, start = c(start_year, 1), frequency = 12)

p2 = sales_data %>%
     filter(PRODUCT == "P2" & DATE >= start_date & DATE <= end_date) %>%
     arrange(YEAR, MONTH)
p2_ts = ts(p2$COUNT, start = c(start_year, 1), frequency = 12)

p3 = sales_data %>%
     filter(PRODUCT == "P3" & DATE >= start_date & DATE <= end_date) %>%
     arrange(YEAR, MONTH)
p3_ts = ts(p3$COUNT, start = c(start_year, 1), frequency = 12)

p4 = sales_data %>%
     filter(PRODUCT == "P4" & DATE >= start_date & DATE <= end_date) %>%
     arrange(YEAR, MONTH)
p4_ts = ts(p4$COUNT, start = c(start_year, 1), frequency = 12)

# Outlier Identification and Treatment

outliers_p1 = length(tsoutliers(p1_ts)$index)
outliers_p2 = length(tsoutliers(p2_ts)$index)
outliers_p3 = length(tsoutliers(p3_ts)$index)
outliers_p4 = length(tsoutliers(p4_ts)$index)

df_outliers = data.frame(
  Product = c("Product 1", "Product 2", "Product 3", "Product 4"),
  Outliers = c(outliers_p1, outliers_p2, outliers_p3, outliers_p4)
)
colnames(df_outliers) = c("Product", "TS Outliers")
print(df_outliers)
##     Product TS Outliers
## 1 Product 1           0
## 2 Product 2           0
## 3 Product 3           0
## 4 Product 4           0
#No outliers were detected
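
Although no treatment was needed here, a minimal sketch of how flagged points could have been handled with forecast::tsclean(), which replaces outliers (and fills missing values) using a decomposition-based estimate; this step is hypothetical and was not run as part of the pipeline:

# Hypothetical treatment of flagged outliers (not needed for this data)
p1_ts_clean = tsclean(p1_ts)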

Exploratory Data Analysis

library(gridExtra)
library(ggfortify)
library(tseries)
library(lattice)

#Sales Visualization
p1_plot = autoplot(p1_ts, ts.colour = 'blue') + labs(title="Product 1 Sales",x="",y="Sales")
p2_plot = autoplot(p2_ts, ts.colour = 'red') + labs(title="Product 2 Sales",x="",y="Sales")
p3_plot = autoplot(p3_ts, ts.colour = 'black') + labs(title="Product 3 Sales",x="",y="Sales")
p4_plot = autoplot(p4_ts, ts.colour = 'purple') + labs(title="Product 4 Sales",x="",y="Sales")

grid.arrange(p1_plot,p2_plot,p3_plot,p4_plot)

#Product Sales Series Decomposition (additive trend and seasonality)

#Product trend
p1_trend = decompose(p1_ts, type="additive")$trend
p2_trend = decompose(p2_ts, type="additive")$trend
p3_trend = decompose(p3_ts, type="additive")$trend
p4_trend = decompose(p4_ts, type="additive")$trend

#Trend Visualization
p1_trend = autoplot(p1_trend, ts.colour = 'blue') + labs(title="Product 1 Trend",x="",y="Sales")
p2_trend = autoplot(p2_trend, ts.colour = 'red') + labs(title="Product 2 Trend",x="",y="Sales")
p3_trend = autoplot(p3_trend, ts.colour = 'black') + labs(title="Product 3 Trend",x="",y="Sales")
p4_trend = autoplot(p4_trend, ts.colour = 'purple') + labs(title="Product 4 Trend",x="",y="Sales")

grid.arrange(p1_trend,p2_trend,p3_trend,p4_trend)

#Product Seasonality
p1_seasonal = decompose(p1_ts, type="additive")$seasonal
p2_seasonal = decompose(p2_ts, type="additive")$seasonal
p3_seasonal = decompose(p3_ts, type="additive")$seasonal
p4_seasonal = decompose(p4_ts, type="additive")$seasonal

#Seasonality Visualization
p1_season = autoplot(p1_seasonal, ts.colour = 'blue') + labs(title="Product 1 Seasonality",x="",y="Sales")
p2_season = autoplot(p2_seasonal, ts.colour = 'red') + labs(title="Product 2 Seasonality",x="",y="Sales")
p3_season = autoplot(p3_seasonal, ts.colour = 'black') + labs(title="Product 3 Seasonality",x="",y="Sales")
p4_season = autoplot(p4_seasonal, ts.colour = 'purple') + labs(title="Product 4 Seasonality",x="",y="Sales")

grid.arrange(p1_season,p2_season,p3_season,p4_season)
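
A seasonal plot, which overlays each year's monthly sales, is another quick check on how stable the seasonal pattern is; a sketch for Product 1 using forecast::ggseasonplot() (not part of the original analysis):

#Seasonal plot: one line per year, months on the x-axis
ggseasonplot(p1_ts, year.labels = TRUE) + labs(title="Product 1 Seasonal Plot", y="Sales")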

#Augmented Dickey-Fuller Tests (ADF) for Stationarity

adf_p1 = adf.test(p1_ts)
adf_p2 = adf.test(p2_ts)
adf_p3 = adf.test(p3_ts)
adf_p4 = adf.test(p4_ts)

df_adf = data.frame(
  Product = c("Product 1", "Product 2", "Product 3", "Product 4"),
  Statistic = c(adf_p1$statistic, adf_p2$statistic, adf_p3$statistic, adf_p4$statistic),
  P_Value = c(adf_p1$p.value, adf_p2$p.value, adf_p3$p.value, adf_p4$p.value)
)
colnames(df_adf) = c("Product", "ADF T-Stat", "ADF P-Value")
print(df_adf)

#Product Sales Correlation Matrix
ts_data = data.frame(p1 = as.vector(p1_ts),
                      p2 = as.vector(p2_ts),
                      p3 = as.vector(p3_ts),
                      p4 = as.vector(p4_ts))

cor_matrix = cor(ts_data)

#Product Sales Scatter plot Matrix
pairs(ts_data, pch=16, cex=.5)

#Levels of sample correlation matrix
levelplot(cor_matrix)

print(cor_matrix)
##           p1        p2        p3        p4
## p1 1.0000000 0.2351253 0.3663753 0.3755730
## p2 0.2351253 1.0000000 0.3465545 0.3616718
## p3 0.3663753 0.3465545 1.0000000 0.5821849
## p4 0.3755730 0.3616718 0.5821849 1.0000000

We observe a distinct positive trend in sales for Product 1, while sales for Product 2 appear to be declining. The Augmented Dickey-Fuller test results indicate that Products 1, 2, and 3 are not stationary, with p-values exceeding the 0.05 threshold. Product 4's p-value sits marginally above the threshold, making it a borderline case; rather than differencing by hand, I will let auto.arima() determine the most suitable model based on the Akaike Information Criterion (AIC). Additionally, the sales data across all products exhibit moderate correlations, with the strongest relationship between Products 3 and 4, as highlighted by the level plot and the correlation matrix. Other products could therefore be explored as additional regressors in future forecasting models.
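As a sketch of that idea, auto.arima() accepts external regressors through its xreg argument. Here Product 3's sales serve as an illustrative regressor for Product 4; note that producing forecasts then requires future values of the regressor, supplied below by Product 3's own ARIMA forecast (an assumption for illustration, not a validated choice):

# Sketch: Product 3 sales as an external regressor for Product 4
fit_p4_xreg = auto.arima(p4_ts, xreg = as.numeric(p3_ts))
# Future regressor values taken from Product 3's own 12-month forecast
p3_future = forecast(auto.arima(p3_ts), h = 12)$mean
forecast(fit_p4_xreg, xreg = as.numeric(p3_future), h = 12)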

Initial Evaluation for 2022

ARIMA: Fitting and 2022 Forecast

For the initial 2022 evaluation, I used auto.arima() to select the best ARIMA model for each product based on the Akaike Information Criterion (AIC). I then assessed the fitted models by examining residual plots and Ljung-Box statistics.

library(tidyverse)
library(forecast)
library(ggplot2)
library(gridExtra)

#Train
p1_ts_train = window(p1_ts, start=c(2019,1), end=c(2021,12))
p2_ts_train = window(p2_ts, start=c(2019,1), end=c(2021,12))
p3_ts_train = window(p3_ts, start=c(2019,1), end=c(2021,12))
p4_ts_train = window(p4_ts, start=c(2019,1), end=c(2021,12))

#Test
p1_ts_test = window(p1_ts, start=c(2022,1), end=c(2022,12))
p2_ts_test = window(p2_ts, start=c(2022,1), end=c(2022,12))
p3_ts_test = window(p3_ts, start=c(2022,1), end=c(2022,12))
p4_ts_test = window(p4_ts, start=c(2022,1), end=c(2022,12))

# ARIMA model fitting
fit_p1 = forecast::auto.arima(p1_ts_train, ic="aic")
fit_p2 = forecast::auto.arima(p2_ts_train, ic="aic")
fit_p3 = forecast::auto.arima(p3_ts_train, ic="aic")
fit_p4 = forecast::auto.arima(p4_ts_train, ic="aic")

# ARIMA Diagnostics
tsdiag(fit_p1)

tsdiag(fit_p2)

tsdiag(fit_p3)

tsdiag(fit_p4)

# ARIMA forecasts
forecast_p1 = forecast(fit_p1, h=12)
forecast_p2 = forecast(fit_p2, h=12)
forecast_p3 = forecast(fit_p3, h=12)
forecast_p4 = forecast(fit_p4, h=12)

# Plots
plot_p1 = autoplot(forecast_p1) + ggtitle("ARIMA Product 1 Forecast")
plot_p2 = autoplot(forecast_p2) + ggtitle("ARIMA Product 2 Forecast")
plot_p3 = autoplot(forecast_p3) + ggtitle("ARIMA Product 3 Forecast")
plot_p4 = autoplot(forecast_p4) + ggtitle("ARIMA Product 4 Forecast ")

grid.arrange(plot_p1, plot_p2, plot_p3, plot_p4, ncol=2, nrow=2)

The diagnostic checks provided by the tsdiag() function indicate that the time series models for all products are performing well. The standardized residual plots show no discernible pattern, suggesting randomness, which is a good sign that the models are capturing the underlying processes adequately. Furthermore, the autocorrelation function (ACF) of the residuals shows no significant spikes beyond lag 0, implying that there is no meaningful autocorrelation among the residuals. This is reinforced by the Ljung-Box test, whose p-values exceed 0.05 across all lags for each product. The absence of residual autocorrelation suggests that the models are appropriately specified and that there is no leftover structure in the time series that the models have failed to capture.
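The same checks can be reproduced in a single call with forecast::checkresiduals(), which plots the residuals, their ACF, and a histogram, and runs the Ljung-Box test; shown for Product 1 as an illustration:

# Residual plots plus Ljung-Box test in one call
checkresiduals(fit_p1)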

The ARIMA model yields convincing forecasts for Products 1, 3, and 4. However, the forecasting scenario for Product 2 is notably different. The ARIMA model parameters (p, d, and q) are all estimated to be zero, meaning the best-fit model reduces to white noise around the series mean. This results in a 'flat' forecast for 2022, indicating that the model does not anticipate a discernible trend or seasonal pattern in Product 2's future sales. Such a forecast may imply that past sales data provide no reliable information for predicting future sales, or that Product 2's sales are highly unpredictable with the given data and model.
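To probe Product 2's flat forecast, one can inspect the selected specification and, as a what-if, force a differenced model with drift so that a declining trend is permitted; the order below is illustrative rather than a tuned choice:

summary(fit_p2)  # reports the selected (p,d,q), coefficients, and AIC

# Illustrative alternative: first difference with drift to allow a trend
fit_p2_drift = Arima(p2_ts_train, order = c(0,1,1), include.drift = TRUE)
autoplot(forecast(fit_p2_drift, h = 12))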

Prophet: Fitting and 2022 Forecast

For Facebook's Prophet model, I split the data using a 75/25 train-test ratio, fitting Prophet models to the 36-month training sets and forecasting the 12 subsequent months.

library(prophet)

# 75/25 train-test split ratio (36 training months, 12 test months)
p1_train_prophet = head(tibble(ds = p1$DATE, y = p1$COUNT), -12)
p2_train_prophet = head(tibble(ds = p2$DATE, y = p2$COUNT), -12)
p3_train_prophet = head(tibble(ds = p3$DATE, y = p3$COUNT), -12)
p4_train_prophet = head(tibble(ds = p4$DATE, y = p4$COUNT), -12)

# Prophet model fitting
model_prophet_p1 = prophet(p1_train_prophet)
model_prophet_p2 = prophet(p2_train_prophet)
model_prophet_p3 = prophet(p3_train_prophet)
model_prophet_p4 = prophet(p4_train_prophet)

# Prophet forecasts 
future_p1 = make_future_dataframe(model_prophet_p1, periods = 12, freq = 'month')
future_p2 = make_future_dataframe(model_prophet_p2, periods = 12, freq = 'month')
future_p3 = make_future_dataframe(model_prophet_p3, periods = 12, freq = 'month')
future_p4 = make_future_dataframe(model_prophet_p4, periods = 12, freq = 'month')

forecast_prophet_p1 = predict(model_prophet_p1, future_p1)
forecast_prophet_p2 = predict(model_prophet_p2, future_p2)
forecast_prophet_p3 = predict(model_prophet_p3, future_p3)
forecast_prophet_p4 = predict(model_prophet_p4, future_p4)

# Plots
plot_prophet_p1 = plot(model_prophet_p1, forecast_prophet_p1) + ggtitle("Prophet Product 1 Forecast")
plot_prophet_p2 = plot(model_prophet_p2, forecast_prophet_p2) + ggtitle("Prophet Product 2 Forecast")
plot_prophet_p3 = plot(model_prophet_p3, forecast_prophet_p3) + ggtitle("Prophet Product 3 Forecast")
plot_prophet_p4 = plot(model_prophet_p4, forecast_prophet_p4) + ggtitle("Prophet Product 4 Forecast")

grid.arrange(plot_prophet_p1, plot_prophet_p2, plot_prophet_p3, plot_prophet_p4, ncol=2, nrow=2)

The Prophet analysis captured the declining trend in Product 2's sales. Unlike the ARIMA model, Prophet integrated the product's seasonal fluctuations, delivering a forecast that reflects both trend and seasonality rather than a flat projection, and providing a more actionable outlook for management.
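Prophet's fitted components can also be inspected directly, which is useful for confirming that the declining trend and the seasonal cycle were both captured:

# Trend and yearly-seasonality panels for the Product 2 forecast
prophet_plot_components(model_prophet_p2, forecast_prophet_p2)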

Model Comparison for 2022

ARIMA performance varied across products. Prophet exhibited lower MSE values for three of the four products (all but Product 1), suggesting superior hold-out accuracy, with the gap most pronounced for Product 2.

library(tidyverse)

# MSE calculation for each ARIMA model
mse_p1 = mean((as.numeric(forecast_p1$mean) - p1_ts_test)^2)
mse_p2 = mean((as.numeric(forecast_p2$mean) - p2_ts_test)^2)
mse_p3 = mean((as.numeric(forecast_p3$mean) - p3_ts_test)^2)
mse_p4 = mean((as.numeric(forecast_p4$mean) - p4_ts_test)^2)

# Summary
mse_summary_arima = data.frame(
  Model = c("Model P1", "Model P2", "Model P3", "Model P4"),
  MSE_ARIMA = c(mse_p1, mse_p2, mse_p3, mse_p4)
)

# MSE calculation for each Prophet model
mse_prophet_p1 = mean((tail(forecast_prophet_p1$yhat,12)-p1_ts_test)^2)
mse_prophet_p2 = mean((tail(forecast_prophet_p2$yhat,12)-p2_ts_test)^2)
mse_prophet_p3 = mean((tail(forecast_prophet_p3$yhat,12)-p3_ts_test)^2)
mse_prophet_p4 = mean((tail(forecast_prophet_p4$yhat,12)-p4_ts_test)^2)

# Summary
mse_summary_prophet = data.frame(
  Model = c("Model P1", "Model P2", "Model P3", "Model P4"),
  MSE_prophet = c(mse_prophet_p1, mse_prophet_p2, mse_prophet_p3, mse_prophet_p4)
)

MSE = merge(mse_summary_arima, mse_summary_prophet)
print(MSE)
##      Model  MSE_ARIMA MSE_prophet
## 1 Model P1 2145.09708   2235.1186
## 2 Model P2 1960.46528   1404.1495
## 3 Model P3  210.12568    175.9310
## 4 Model P4   69.42676     32.5331

ARIMA: Full Data Fitting and Cross-Validation

I applied rolling-window cross-validation to assess model performance. For each product, I fitted an ARIMA model on a fixed 24-month training window and produced a one-step-ahead forecast, then rolled the window forward by one month and repeated the process, accumulating predictions and forecast errors along the way.

arimaCV = function(ts_obj, window_size, h) {
  len_ts = length(ts_obj)
  predicted = numeric(0)
  actual = numeric(0)
  for(i in seq(window_size, (len_ts - h), by=1)) {
    # Fixed rolling window: keep only the most recent window_size observations,
    # preserving the monthly frequency so auto.arima() can model seasonality
    train = ts(ts_obj[(i - window_size + 1):i], frequency = frequency(ts_obj))
    model_arima = auto.arima(train)
    fc = forecast(model_arima, h=h)
    predicted = c(predicted, fc$mean[h])
    actual = c(actual, ts_obj[i+h])
  }
  result_df = data.frame(predicted = predicted, actual = actual, error = predicted - actual)
  return(result_df)
}

cv_arima_product1 = arimaCV(p1_ts,24,1)
cv_arima_product2 = arimaCV(p2_ts,24,1)
cv_arima_product3 = arimaCV(p3_ts,24,1)
cv_arima_product4 = arimaCV(p4_ts,24,1)
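
As a cross-check, forecast::tsCV() implements the same rolling-origin scheme natively; a minimal sketch for Product 1, assuming a fixed 24-month window (tsCV returns the vector of one-step errors rather than a data frame):

# Equivalent rolling-origin errors via tsCV; window fixes the training length at 24
fc_fun = function(y, h) forecast(auto.arima(y), h = h)
cv_errors_p1 = tsCV(p1_ts, fc_fun, h = 1, window = 24)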

Prophet: Full Data Fitting and Cross-Validation

I assessed the Prophet model using the same rolling-window cross-validation approach. For each product, I trained the model on a fixed 24-month window of data and predicted the subsequent month, shifting the window forward by one month at a time.

library(prophet)
#Prophet cross-validation function to mirror the ARIMA rolling-window CV
prophetCV = function(df, window_size, h, weekly.seasonality=FALSE, daily.seasonality=FALSE) {
  predicted = numeric(0)
  actual = numeric(0)
  for(i in seq(window_size, (nrow(df)-h), by=1)) {
    # Fixed rolling window over the most recent window_size months
    train = df[(i - window_size + 1):i, ]
    model_prophet = prophet(train, weekly.seasonality=weekly.seasonality, daily.seasonality=daily.seasonality)
    # freq = 'month' is required: the default daily frequency would place
    # the forecast dates only h days past the training window
    future = make_future_dataframe(model_prophet, periods = h, freq = 'month')
    fc = predict(model_prophet, future)
    predicted = c(predicted, tail(fc$yhat, n=1))
    actual = c(actual, df$y[i+h])
  }
  result_df = data.frame(predicted = predicted, actual = actual, error = predicted-actual)
  return(result_df)
}

p1_prophet = tibble(ds = p1$DATE, y = p1$COUNT)
p2_prophet = tibble(ds = p2$DATE, y = p2$COUNT)
p3_prophet = tibble(ds = p3$DATE, y = p3$COUNT)
p4_prophet = tibble(ds = p4$DATE, y = p4$COUNT)

cv_prophet_product1 = prophetCV(p1_prophet,24,1)
cv_prophet_product2 = prophetCV(p2_prophet,24,1)
cv_prophet_product3 = prophetCV(p3_prophet,24,1)
cv_prophet_product4 = prophetCV(p4_prophet,24,1)
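
Prophet also ships a built-in rolling-origin utility; a sketch for Product 1 using prophet::cross_validation(), whose initial, period, and horizon arguments are expressed in days, so the 24-month window and one-month horizon are only approximated:

# Built-in alternative: ~24-month initial window, ~1-month horizon, in days
m_p1 = prophet(p1_prophet, weekly.seasonality = FALSE, daily.seasonality = FALSE)
cv_builtin_p1 = cross_validation(m_p1, initial = 730, period = 30, horizon = 30, units = 'days')
performance_metrics(cv_builtin_p1)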

Final Verdict: Model Comparison

I assessed the performance of both models by comparing their MSE, RMSE, MAE, MAPE, MDAPE, and SMAPE.
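For reference, with forecast errors $e_t = \hat{y}_t - y_t$ over $n$ windows, the percentage-based metrics computed below are

$$\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{e_t}{y_t}\right|, \qquad \mathrm{SMAPE} = \frac{200}{n}\sum_{t=1}^{n}\frac{|e_t|}{|\hat{y}_t| + |y_t|},$$

with MDAPE taking the median rather than the mean of the absolute percentage errors; observations with zero actuals are excluded from MAPE and MDAPE.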

custom_metrics = function(df) {
  errors = as.numeric(df$error)

  MSE = mean(errors^2, na.rm = TRUE)
  RMSE = sqrt(MSE)
  MAE = mean(abs(errors), na.rm = TRUE)

  # Percentage errors are only defined where the actuals are non-zero
  non_zero_idx = df$actual != 0
  percentage_errors = 100 * (df$error[non_zero_idx] / df$actual[non_zero_idx])

  MAPE = mean(abs(percentage_errors), na.rm = TRUE)
  MDAPE = median(abs(percentage_errors), na.rm = TRUE)
  SMAPE = mean(200*abs(df$error)/(abs(df$predicted)+abs(df$actual)), na.rm=TRUE)

  results_df = data.frame(MSE, RMSE, MAE, MAPE, MDAPE, SMAPE)
  return(results_df)
}

compare_metrics = function(df_model1, df_model2) {
  metrics_model1 = custom_metrics(df_model1)
  metrics_model2 = custom_metrics(df_model2)
  
  rownames(metrics_model1) = "ARIMA"
  rownames(metrics_model2) = "Prophet"
  
  combined_metrics = rbind(metrics_model1, metrics_model2)
  
  return(combined_metrics)
}

#Product 1
print(compare_metrics(cv_arima_product1,cv_prophet_product1))
##              MSE     RMSE      MAE      MAPE    MDAPE     SMAPE
## ARIMA   3029.483 55.04074 41.66174  8.914872  7.04936  9.065784
## Prophet 5146.205 71.73705 57.23691 12.910377 11.66912 12.516321
#Product 2
print(compare_metrics(cv_arima_product2,cv_prophet_product2))
##              MSE     RMSE      MAE     MAPE    MDAPE    SMAPE
## ARIMA   1586.786 39.83448 32.11254 13.01135 10.24022 11.98177
## Prophet 4055.712 63.68447 51.43117 18.82175 17.30268 18.00102
#Product 3
print(compare_metrics(cv_arima_product3,cv_prophet_product3))
##              MSE     RMSE      MAE     MAPE    MDAPE    SMAPE
## ARIMA   187.9734 13.71034 12.09081 21.38264 19.03707 20.27012
## Prophet 599.0441 24.47538 20.94155 35.93975 32.49941 35.33141
#Product 4
print(compare_metrics(cv_arima_product4,cv_prophet_product4))
##               MSE      RMSE       MAE     MAPE    MDAPE    SMAPE
## ARIMA    79.41272  8.911382  6.923339 14.61295 12.00182 14.28629
## Prophet 174.76606 13.219911 10.710599 22.44037 16.49455 21.19286

Across all four products, the ARIMA models consistently outperformed the Prophet models in forecasting accuracy, posting lower values on every metric. Measures such as MSE and RMSE, which are sensitive to large errors, are significantly lower with ARIMA, as is MAE. Moreover, percentage-based errors like MAPE, MDAPE, and SMAPE also favored ARIMA, suggesting its forecasts were more accurate relative to the scale of the actual data.

Conclusion

When forecasting annual sales for 2022, Prophet outperformed ARIMA for three of the four products. However, ARIMA consistently outperformed Prophet in the one-step-ahead rolling-window cross-validation. These outcomes suggest that while Prophet may be preferable for longer-horizon predictions on some series, ARIMA excels at short-term forecasts on this data. It is therefore important to tailor model selection to the forecasting horizon and the specific characteristics of the data set, and a combination of both approaches is preferable when the long term and the short term must both be forecast. Further models should also be considered, such as a simple trend-based linear regression (sketched below) and more advanced techniques like Recurrent Neural Networks (RNNs), to enhance Company XYZ's forecasts.
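
As a minimal sketch of the suggested trend-based regression, forecast::tslm() fits a linear model with trend and monthly seasonal dummies directly on a ts object; shown for Product 1 only, as an illustration rather than a tuned benchmark:

# Baseline: linear trend plus monthly seasonal dummies
fit_lm_p1 = tslm(p1_ts ~ trend + season)
autoplot(forecast(fit_lm_p1, h = 12))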