Introduction

Just a forecasting exercise I came across while browsing my old materials from school. By no means is this meant to be a in-depth guide on time series analysis. There will be some links at the end of this report for further readings. So ends the preamble.

Background

Let’s assume a scenario where we have some historical sales data. Our company wants to forecast the following year so leadership can better understand for future planning needs How can we approach this? Basically, a simple time series can be assumed of the following components

Overall trend (\(m_t\)) | What we want to capture
Repeating short term variation (\(s_t\)) | E.g., seasonality
The random variation (\(\epsilon_t\)) | background “noise”. Effects we cannot observe

To estimate “m” - first we must isolate it from the effects of “s”. We can do this by first “deseasonalizing” the series then using a linear regression to estimate the trend.

Simulate some historical Demand

Lets create some monthly historical data for 2021 through 2023. Notice the fluctuations month to month. We can see that there is some sort of repeating or “systematic” variation influence in the historical data. For example, the climb starting in January followed by a drop in July before climbing back up again. In addition, there looks to be an upward trend year to year.

# simulate some historical monthly demand
Historical_Demand = data.frame(
  TimePeriod = seq.Date(from = as.Date('2021-01-01'), 
                        to = as.Date('2023-12-01'), 
                        by = 'months'),
  Demand = c(2303,2778,2428,2996,3349,3449,2338,2604.50,2251.50,3246,3542,3331,
             2217,2683.50,2954.50,3235,3659,3794,2589,3087,3112,3700,4193,4129
             ,2674,3112.50,2848.50,3398,3770,3714,3155,3268,3568,4475,5248.50,4591.50),
  Index = seq(1:36)
)

Index	Time Period	Demand
1	2021-01-01	2303
2	2021-02-01	2778
3	2021-03-01	2428
4	2021-04-01	2996
5	2021-05-01	3349
6	2021-06-01	3449
7	2021-07-01	2338
8	2021-08-01	2604
9	2021-09-01	2252
10	2021-10-01	3246
11	2021-11-01	3542
12	2021-12-01	3331
13	2022-01-01	2217
14	2022-02-01	2684
15	2022-03-01	2954
16	2022-04-01	3235
17	2022-05-01	3659
18	2022-06-01	3794
19	2022-07-01	2589
20	2022-08-01	3087
21	2022-09-01	3112
22	2022-10-01	3700
23	2022-11-01	4193
24	2022-12-01	4129
25	2023-01-01	2674
26	2023-02-01	3112
27	2023-03-01	2848
28	2023-04-01	3398
29	2023-05-01	3770
30	2023-06-01	3714
31	2023-07-01	3155
32	2023-08-01	3268
33	2023-09-01	3568
34	2023-10-01	4475
35	2023-11-01	5248
36	2023-12-01	4592

ggplot(data = Historical_Demand, 
       aes(x = TimePeriod, 
           y = Demand, 
           fill = as.factor( year (TimePeriod) ) ) ) +
  
  geom_col() +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(NULL, values = c("#51d6ff", "#8D9EC6", "#A06B9A") ) +
  labs(title = 'Historical Demand 2021-2023',
       x= NULL,
       y = 'Demand') +
  custom_theme

Prep Data for model

Deseasonalize the Demand

The whole point of deseasonalizing the data is to accurately capture the real trend without that cyclical seasonal influence. We do not want that pattern of higher & lower months observed in the earlier section to impact our estimate of the trend Now, lets try to remove the seasonality from the demand series. Apply the following steps

First, calculate a seasonal index for each month (Ex: Jan 2021 / Avg 2021 )
Next, generate take the average SI by month (Ex: Jan 2021 + 2022 + 2023 / 3 = Avg SI Jan)
Lastly, divide each demand record by the corresponding average SI (Ex: Jan 2021 / Avg SI Jan)

This final column, “Deseasonalized”, is what demand would look like with the seasonality removed. The below plot illustrates the difference between the Deseasonalized demand vs actual demand. Obviously, there is still some ups & downs but notice that the Deseasonalized series is flatter

# generate calculations
Historical_Demand = Historical_Demand %>% 
  group_by(year(TimePeriod)) %>%
  # Yearly Average 
  mutate(AvgDemand = mean(Demand)) %>%
  ungroup() %>%
  # ratio of Demand to Average Demand
  mutate(Seasonal_Index = Demand / AvgDemand) %>%
  group_by(month(TimePeriod)) %>%
  # average seasonal index
  mutate(AvgSI = mean(Seasonal_Index)) %>%
  ungroup() %>%
  # Deseasonalize the Demand
  mutate(Deseasonalized = Demand / AvgSI)

Index	Time Period	Year	Month	Demand	Average Demand	Seasonal Index	Average SI	Deasonalized Demand
1	2021-01-01	2021	1	2303	2885	0.7984	0.7355	3131
2	2021-02-01	2021	2	2778	2885	0.9630	0.8779	3164
3	2021-03-01	2021	3	2428	2885	0.8417	0.8409	2887
4	2021-04-01	2021	4	2996	2885	1.0386	0.9852	3041
5	2021-05-01	2021	5	3349	2885	1.1610	1.1030	3036
6	2021-06-01	2021	6	3449	2885	1.1956	1.1232	3071
7	2021-07-01	2021	7	2338	2885	0.8105	0.8213	2847
8	2021-08-01	2021	8	2604	2885	0.9029	0.9130	2853
9	2021-09-01	2021	9	2252	2885	0.7805	0.9022	2496
10	2021-10-01	2021	10	3246	2885	1.1253	1.1596	2799
11	2021-11-01	2021	11	3542	2885	1.2279	1.3145	2694
12	2021-12-01	2021	12	3331	2885	1.1547	1.2237	2722
13	2022-01-01	2022	1	2217	3279	0.6760	0.7355	3014
14	2022-02-01	2022	2	2684	3279	0.8183	0.8779	3057
15	2022-03-01	2022	3	2954	3279	0.9009	0.8409	3514
16	2022-04-01	2022	4	3235	3279	0.9865	0.9852	3284
17	2022-05-01	2022	5	3659	3279	1.1157	1.1030	3317
18	2022-06-01	2022	6	3794	3279	1.1569	1.1232	3378
19	2022-07-01	2022	7	2589	3279	0.7895	0.8213	3152
20	2022-08-01	2022	8	3087	3279	0.9413	0.9130	3381
21	2022-09-01	2022	9	3112	3279	0.9489	0.9022	3450
22	2022-10-01	2022	10	3700	3279	1.1282	1.1596	3191
23	2022-11-01	2022	11	4193	3279	1.2786	1.3145	3190
24	2022-12-01	2022	12	4129	3279	1.2591	1.2237	3374
25	2023-01-01	2023	1	2674	3652	0.7322	0.7355	3635
26	2023-02-01	2023	2	3112	3652	0.8523	0.8779	3546
27	2023-03-01	2023	3	2848	3652	0.7800	0.8409	3388
28	2023-04-01	2023	4	3398	3652	0.9305	0.9852	3449
29	2023-05-01	2023	5	3770	3652	1.0323	1.1030	3418
30	2023-06-01	2023	6	3714	3652	1.0170	1.1232	3307
31	2023-07-01	2023	7	3155	3652	0.8639	0.8213	3841
32	2023-08-01	2023	8	3268	3652	0.8949	0.9130	3579
33	2023-09-01	2023	9	3568	3652	0.9770	0.9022	3955
34	2023-10-01	2023	10	4475	3652	1.2254	1.1596	3859
35	2023-11-01	2023	11	5248	3652	1.4372	1.3145	3993
36	2023-12-01	2023	12	4592	3652	1.2573	1.2237	3752

# overlay deseasonalized values
ggplot(data = Historical_Demand, aes(x = TimePeriod) ) +
  scale_y_continuous(labels = comma) +
  geom_col(aes(y=Demand), fill = 'deepskyblue4') +
  geom_line(aes(y=Deseasonalized), col = 'red', linewidth = 1.5) +
  geom_point(aes(y=Deseasonalized), col = 'black', size = 2.5) +
  labs(title = "Actual Demand against Deseasonalized Demand",
       subtitle = "Still some variation but notice how its flatter",
       x = NULL) +
  custom_theme

Model & Forecast

Now to use this Deseasonalized demand series to generate our forecast. We run a simple linear regression using time period to explain our Deseasonalized demand. Our model returns the following equation: Predicted Demand = 2,750.95 + 28.12 * Time Period. This will give us the Deseasonalized demand predictions for a given month. However, for our forecast we need to add back in the seasonality we removed earlier. To do this, we can refer back to the average SI by month calculated in the prior steps. Multiply the Deseasonalized demand point by its month’s corresponding Average SI.

Regression Results

# estimate trend. Index = Period 1, 2, 3..
m1 = lm(Deseasonalized ~ Index ,data = Historical_Demand)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	2750.95	73.723	37.315	0	2601.13	2900.78
Index	28.12	3.475	8.094	0	21.06	35.18

r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.6583	0.6483	216.6	65.51	0	1	-243.7	493.3	498.1	1594772	34	36

Generate forecast for 2024

# pull model coefficients
modelcoef = as.vector(coef(m1))

# Create a new dataset 
demand_2024 = data.frame(
  TimePeriod = seq.Date(from = as.Date('2024-01-01'), 
                        to = as.Date('2024-12-01'), 
                        by = 'months'),
  Index = seq(from = 37, to = 48, by = 1)) %>%
  # generate demand forecast off model
  mutate(Predicted_Demand = modelcoef[1] + modelcoef[2]*Index) %>%
  # append the AvgSI calculated earlier using the historical demand
  cbind(distinct(Historical_Demand,`month(TimePeriod)`,AvgSI)) %>%
  # add seasonality back to predictions
  mutate(Seasonalized_Demand = Predicted_Demand * AvgSI)

Index	Time Period	Month	Predicted Demand	Average SI	Seasonalized Prediction
37	2024-01-01	1	3792	0.7355	2789
38	2024-02-01	2	3820	0.8779	3353
39	2024-03-01	3	3848	0.8409	3235
40	2024-04-01	4	3876	0.9852	3818
41	2024-05-01	5	3904	1.1030	4306
42	2024-06-01	6	3932	1.1232	4416
43	2024-07-01	7	3960	0.8213	3253
44	2024-08-01	8	3988	0.9130	3641
45	2024-09-01	9	4017	0.9022	3624
46	2024-10-01	10	4045	1.1596	4690
47	2024-11-01	11	4073	1.3145	5354
48	2024-12-01	12	4101	1.2237	5018

# create a plot with actual & forecast
ggplot (
  
  data = rbind( Historical_Demand %>% select(TimePeriod, Index, Demand),
                demand_2024 %>% 
                  select(TimePeriod, Index, Seasonalized_Demand) %>% 
                  rename(Demand = Seasonalized_Demand) ) %>%
    
         mutate( Flag = ifelse(Index >= 37, "Forecast", "Actual Demand") ),
  
  aes(x = TimePeriod, y = Demand, fill = Flag) ) + 
  
  geom_col() +
  scale_fill_manual(NULL, values=c('deepskyblue4','sienna1') ) + 
  scale_y_continuous(labels = comma, breaks = seq(0,5500, 1000)) +
  labs(title =  "Comparing Actual vs Forecast Demand",
       subtitle='Seems legit',
       x = NULL) +
  custom_theme

Model Validation

The forecast generated looks to align pretty well. We’re observing the same patterns we did in the historical data. But just passing the eyeball test isn’t enough. It’s important to gauge how reliable our forecast is. Obviously, no one is waiting for 2024 to end to compare to the actual numbers. What we can do now is repeat the exercise as above but only using a subset of our data, say the first two years. Then use the model returned to predict the demand for 2023 & directly compare to real values.

Create training & validation data subsets

# remove the extra columns created
Historical_Demand = Historical_Demand[,1:3]

# split data set
Training = Historical_Demand %>% filter( year(TimePeriod) < 2023)

# repeat same process on the training data
Training = Training %>% 
  group_by(year(TimePeriod)) %>%
  # Yearly Average 
  mutate(AvgDemand = mean(Demand)) %>%
  ungroup() %>%
  # ratio of Demand to Average Demand
  mutate(Seasonal_Index = Demand / AvgDemand) %>%
  group_by(month(TimePeriod)) %>%
  # average seasonal index
  mutate(AvgSI = mean(Seasonal_Index)) %>%
  ungroup() %>%
  # Deseasonalize the Demand
  mutate(Deseasonalized = Demand / AvgSI)

# rerun regression using the training data subset
m2 = lm(Deseasonalized ~ Index ,data = Training)

# Pull coefficients
modelcoef = as.vector(coef(m2))

New Model Results

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	2775.71	76.862	36.11	0e+00	2616.31	2935.11
Index	24.48	5.379	4.55	2e-04	13.32	35.63

r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.4848	0.4614	182.4	20.71	2e-04	1	-158	321.9	325.5	732088	22	24

Calculate MAPE

One useful metric to calculate is Mean Absolute Percentage Error (MAPE) shown below. Its ease of explanation is why this is a common metric to calculate. In our case, the MAPE returned is 0.0699. This means our forecasted demand is approximately 7% away from the actual values, on average. A lower MAPE suggests a more reliable forecast. However, there is no hard rule on what should be considered a “good” MAPE. In general, I would consider around 10% to be an acceptable result.

\[ Mean\ Absolute\ Percentage\ Error \\ MAPE = \frac{1}{n} \sum_{t = 1}^{n} {| \frac{A_t - F_t}{A_t} | } \]

Validation = Historical_Demand %>% filter( year(TimePeriod) == 2023)

# Now on the validation data set - lets generate the forecast
Validation = Validation %>%
  mutate(Predicted_Demand = modelcoef[1] + modelcoef[2]*Index) %>%
  cbind(distinct(Training, `month(TimePeriod)`, AvgSI  ) ) %>%
  mutate(Seasonalized_Demand = Predicted_Demand * AvgSI) %>%
  # calculate the MAPE
  mutate( Error =  Demand - Seasonalized_Demand,
          ErrorOverActual = Error/Demand,
          AbsVal = abs(ErrorOverActual) )

mean(Validation$AbsVal)

## [1] 0.0699

Time Period	Actual Demand	Predicted Demand	Seasonalized Prediction	Actual - Prediction	Error Over Actual	Absolute Value
2023-01-01	2674	3388	2497	176.64	0.0661	0.0661
2023-02-01	3112	3412	3039	73.49	0.0236	0.0236
2023-03-01	2848	3437	2994	-145.83	-0.0512	0.0512
2023-04-01	3398	3461	3504	-106.42	-0.0313	0.0313
2023-05-01	3770	3486	3968	-197.79	-0.0525	0.0525
2023-06-01	3714	3510	4129	-414.74	-0.1117	0.1117
2023-07-01	3155	3535	2828	327.47	0.1038	0.1038
2023-08-01	3268	3559	3282	-13.74	-0.0042	0.0042
2023-09-01	3568	3583	3099	469.29	0.1315	0.1315
2023-10-01	4475	3608	4065	409.75	0.0916	0.0916
2023-11-01	5248	3632	4552	696.27	0.1327	0.1327
2023-12-01	4592	3657	4413	178.02	0.0388	0.0388

Closing Comments

One assumption to keep in mind when using this particular technique: equal time periods. We are making an assumption that time periods in this series are behaving in a reasonable & consistent manner, including that seasonal component that we extracted out of the demand. If say, some event alters that behavior (maybe lawsuit or supply chain issue) then those abnormalities can impact the quality of our forecast. More robust approaches would be needed to deal with cases like those. I hope short exercise provides some utility to whomever may stumble across this! Keep in mind this is one technique of many that can be employed to answer our earlier problem statement. Below are some additional resources & readings I found interesting for anyone curious.

Thanks for reading! 😄

Further Resources

Resource I Different types of forecast KPI measures
Resource II Time series modeling (like ARIMA, SES, etc.,). With code examples!
Resource III Another Good reading on Time Series
Resource IV Holt-Winters Forecasting - another popular method
Resource V Exponential smoothing technique - again, another popular method)
Resource VI Neat blog with examples

Simple Forecasting Example

Jack Cela

Last updated: 22 September 2024