Introduction

Just a forecasting exercise I came across while browsing my old materials from school. By no means is this meant to be a in-depth guide on time series analysis. There will be some links at the end of this report for further readings. So ends the preamble.

Background

Let’s assume a scenario where we have some historical sales data. Our company wants to forecast the following year so leadership can better understand for future planning needs How can we approach this? Basically, a simple time series can be assumed of the following components

  • Overall trend (\(m_t\)) | What we want to capture
  • Repeating short term variation (\(s_t\)) | E.g., seasonality
  • The random variation (\(\epsilon_t\)) | background “noise”. Effects we cannot observe

To estimate “m” - first we must isolate it from the effects of “s”. We can do this by first “deseasonalizing” the series then using a linear regression to estimate the trend.

Simulate some historical Demand

Lets create some monthly historical data for 2021 through 2023. Notice the fluctuations month to month. We can see that there is some sort of repeating or “systematic” variation influence in the historical data. For example, the climb starting in January followed by a drop in July before climbing back up again. In addition, there looks to be an upward trend year to year.

# simulate some historical monthly demand
Historical_Demand = data.frame(
  TimePeriod = seq.Date(from = as.Date('2021-01-01'), 
                        to = as.Date('2023-12-01'), 
                        by = 'months'),
  Demand = c(2303,2778,2428,2996,3349,3449,2338,2604.50,2251.50,3246,3542,3331,
             2217,2683.50,2954.50,3235,3659,3794,2589,3087,3112,3700,4193,4129
             ,2674,3112.50,2848.50,3398,3770,3714,3155,3268,3568,4475,5248.50,4591.50),
  Index = seq(1:36)
)
Index Time Period Demand
1 2021-01-01 2303
2 2021-02-01 2778
3 2021-03-01 2428
4 2021-04-01 2996
5 2021-05-01 3349
6 2021-06-01 3449
7 2021-07-01 2338
8 2021-08-01 2604
9 2021-09-01 2252
10 2021-10-01 3246
11 2021-11-01 3542
12 2021-12-01 3331
13 2022-01-01 2217
14 2022-02-01 2684
15 2022-03-01 2954
16 2022-04-01 3235
17 2022-05-01 3659
18 2022-06-01 3794
19 2022-07-01 2589
20 2022-08-01 3087
21 2022-09-01 3112
22 2022-10-01 3700
23 2022-11-01 4193
24 2022-12-01 4129
25 2023-01-01 2674
26 2023-02-01 3112
27 2023-03-01 2848
28 2023-04-01 3398
29 2023-05-01 3770
30 2023-06-01 3714
31 2023-07-01 3155
32 2023-08-01 3268
33 2023-09-01 3568
34 2023-10-01 4475
35 2023-11-01 5248
36 2023-12-01 4592


ggplot(data = Historical_Demand, 
       aes(x = TimePeriod, 
           y = Demand, 
           fill = as.factor( year (TimePeriod) ) ) ) +
  
  geom_col() +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(NULL, values = c("#51d6ff", "#8D9EC6", "#A06B9A") ) +
  labs(title = 'Historical Demand 2021-2023',
       x= NULL,
       y = 'Demand') +
  custom_theme 

Prep Data for model

Deseasonalize the Demand

The whole point of deseasonalizing the data is to accurately capture the real trend without that cyclical seasonal influence. We do not want that pattern of higher & lower months observed in the earlier section to impact our estimate of the trend Now, lets try to remove the seasonality from the demand series. Apply the following steps

  • First, calculate a seasonal index for each month (Ex: Jan 2021 / Avg 2021 )
  • Next, generate take the average SI by month (Ex: Jan 2021 + 2022 + 2023 / 3 = Avg SI Jan)
  • Lastly, divide each demand record by the corresponding average SI (Ex: Jan 2021 / Avg SI Jan)

This final column, “Deseasonalized”, is what demand would look like with the seasonality removed. The below plot illustrates the difference between the Deseasonalized demand vs actual demand. Obviously, there is still some ups & downs but notice that the Deseasonalized series is flatter

# generate calculations
Historical_Demand = Historical_Demand %>% 
  group_by(year(TimePeriod)) %>%
  # Yearly Average 
  mutate(AvgDemand = mean(Demand)) %>%
  ungroup() %>%
  # ratio of Demand to Average Demand
  mutate(Seasonal_Index = Demand / AvgDemand) %>%
  group_by(month(TimePeriod)) %>%
  # average seasonal index
  mutate(AvgSI = mean(Seasonal_Index)) %>%
  ungroup() %>%
  # Deseasonalize the Demand
  mutate(Deseasonalized = Demand / AvgSI)
Index Time Period Year Month Demand Average Demand Seasonal Index Average SI Deasonalized Demand
1 2021-01-01 2021 1 2303 2885 0.7984 0.7355 3131
2 2021-02-01 2021 2 2778 2885 0.9630 0.8779 3164
3 2021-03-01 2021 3 2428 2885 0.8417 0.8409 2887
4 2021-04-01 2021 4 2996 2885 1.0386 0.9852 3041
5 2021-05-01 2021 5 3349 2885 1.1610 1.1030 3036
6 2021-06-01 2021 6 3449 2885 1.1956 1.1232 3071
7 2021-07-01 2021 7 2338 2885 0.8105 0.8213 2847
8 2021-08-01 2021 8 2604 2885 0.9029 0.9130 2853
9 2021-09-01 2021 9 2252 2885 0.7805 0.9022 2496
10 2021-10-01 2021 10 3246 2885 1.1253 1.1596 2799
11 2021-11-01 2021 11 3542 2885 1.2279 1.3145 2694
12 2021-12-01 2021 12 3331 2885 1.1547 1.2237 2722
13 2022-01-01 2022 1 2217 3279 0.6760 0.7355 3014
14 2022-02-01 2022 2 2684 3279 0.8183 0.8779 3057
15 2022-03-01 2022 3 2954 3279 0.9009 0.8409 3514
16 2022-04-01 2022 4 3235 3279 0.9865 0.9852 3284
17 2022-05-01 2022 5 3659 3279 1.1157 1.1030 3317
18 2022-06-01 2022 6 3794 3279 1.1569 1.1232 3378
19 2022-07-01 2022 7 2589 3279 0.7895 0.8213 3152
20 2022-08-01 2022 8 3087 3279 0.9413 0.9130 3381
21 2022-09-01 2022 9 3112 3279 0.9489 0.9022 3450
22 2022-10-01 2022 10 3700 3279 1.1282 1.1596 3191
23 2022-11-01 2022 11 4193 3279 1.2786 1.3145 3190
24 2022-12-01 2022 12 4129 3279 1.2591 1.2237 3374
25 2023-01-01 2023 1 2674 3652 0.7322 0.7355 3635
26 2023-02-01 2023 2 3112 3652 0.8523 0.8779 3546
27 2023-03-01 2023 3 2848 3652 0.7800 0.8409 3388
28 2023-04-01 2023 4 3398 3652 0.9305 0.9852 3449
29 2023-05-01 2023 5 3770 3652 1.0323 1.1030 3418
30 2023-06-01 2023 6 3714 3652 1.0170 1.1232 3307
31 2023-07-01 2023 7 3155 3652 0.8639 0.8213 3841
32 2023-08-01 2023 8 3268 3652 0.8949 0.9130 3579
33 2023-09-01 2023 9 3568 3652 0.9770 0.9022 3955
34 2023-10-01 2023 10 4475 3652 1.2254 1.1596 3859
35 2023-11-01 2023 11 5248 3652 1.4372 1.3145 3993
36 2023-12-01 2023 12 4592 3652 1.2573 1.2237 3752


# overlay deseasonalized values
ggplot(data = Historical_Demand, aes(x = TimePeriod) ) +
  scale_y_continuous(labels = comma) +
  geom_col(aes(y=Demand), fill = 'deepskyblue4') +
  geom_line(aes(y=Deseasonalized), col = 'red', linewidth = 1.5) +
  geom_point(aes(y=Deseasonalized), col = 'black', size = 2.5) +
  labs(title = "Actual Demand against Deseasonalized Demand",
       subtitle = "Still some variation but notice how its flatter",
       x = NULL) +
  custom_theme

Model & Forecast

Now to use this Deseasonalized demand series to generate our forecast. We run a simple linear regression using time period to explain our Deseasonalized demand. Our model returns the following equation: Predicted Demand = 2,750.95 + 28.12 * Time Period. This will give us the Deseasonalized demand predictions for a given month. However, for our forecast we need to add back in the seasonality we removed earlier. To do this, we can refer back to the average SI by month calculated in the prior steps. Multiply the Deseasonalized demand point by its month’s corresponding Average SI.

Regression Results

# estimate trend. Index = Period 1, 2, 3..
m1 = lm(Deseasonalized ~ Index ,data = Historical_Demand)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 2750.95 73.723 37.315 0 2601.13 2900.78
Index 28.12 3.475 8.094 0 21.06 35.18
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.6583 0.6483 216.6 65.51 0 1 -243.7 493.3 498.1 1594772 34 36

Generate forecast for 2024

# pull model coefficients
modelcoef = as.vector(coef(m1))

# Create a new dataset 
demand_2024 = data.frame(
  TimePeriod = seq.Date(from = as.Date('2024-01-01'), 
                        to = as.Date('2024-12-01'), 
                        by = 'months'),
  Index = seq(from = 37, to = 48, by = 1)) %>%
  # generate demand forecast off model
  mutate(Predicted_Demand = modelcoef[1] + modelcoef[2]*Index) %>%
  # append the AvgSI calculated earlier using the historical demand
  cbind(distinct(Historical_Demand,`month(TimePeriod)`,AvgSI)) %>%
  # add seasonality back to predictions
  mutate(Seasonalized_Demand = Predicted_Demand * AvgSI)
Index Time Period Month Predicted Demand Average SI Seasonalized Prediction
37 2024-01-01 1 3792 0.7355 2789
38 2024-02-01 2 3820 0.8779 3353
39 2024-03-01 3 3848 0.8409 3235
40 2024-04-01 4 3876 0.9852 3818
41 2024-05-01 5 3904 1.1030 4306
42 2024-06-01 6 3932 1.1232 4416
43 2024-07-01 7 3960 0.8213 3253
44 2024-08-01 8 3988 0.9130 3641
45 2024-09-01 9 4017 0.9022 3624
46 2024-10-01 10 4045 1.1596 4690
47 2024-11-01 11 4073 1.3145 5354
48 2024-12-01 12 4101 1.2237 5018


# create a plot with actual & forecast
ggplot (
  
  data = rbind( Historical_Demand %>% select(TimePeriod, Index, Demand),
                demand_2024 %>% 
                  select(TimePeriod, Index, Seasonalized_Demand) %>% 
                  rename(Demand = Seasonalized_Demand) ) %>%
    
         mutate( Flag = ifelse(Index >= 37, "Forecast", "Actual Demand") ),
  
  aes(x = TimePeriod, y = Demand, fill = Flag) ) + 
  
  geom_col() +
  scale_fill_manual(NULL, values=c('deepskyblue4','sienna1') ) + 
  scale_y_continuous(labels = comma, breaks = seq(0,5500, 1000)) +
  labs(title =  "Comparing Actual vs Forecast Demand",
       subtitle='Seems legit',
       x = NULL) +
  custom_theme

Model Validation

The forecast generated looks to align pretty well. We’re observing the same patterns we did in the historical data. But just passing the eyeball test isn’t enough. It’s important to gauge how reliable our forecast is. Obviously, no one is waiting for 2024 to end to compare to the actual numbers. What we can do now is repeat the exercise as above but only using a subset of our data, say the first two years. Then use the model returned to predict the demand for 2023 & directly compare to real values.

Create training & validation data subsets

# remove the extra columns created
Historical_Demand = Historical_Demand[,1:3]

# split data set
Training = Historical_Demand %>% filter( year(TimePeriod) < 2023)

# repeat same process on the training data
Training = Training %>% 
  group_by(year(TimePeriod)) %>%
  # Yearly Average 
  mutate(AvgDemand = mean(Demand)) %>%
  ungroup() %>%
  # ratio of Demand to Average Demand
  mutate(Seasonal_Index = Demand / AvgDemand) %>%
  group_by(month(TimePeriod)) %>%
  # average seasonal index
  mutate(AvgSI = mean(Seasonal_Index)) %>%
  ungroup() %>%
  # Deseasonalize the Demand
  mutate(Deseasonalized = Demand / AvgSI)

# rerun regression using the training data subset
m2 = lm(Deseasonalized ~ Index ,data = Training)

# Pull coefficients
modelcoef = as.vector(coef(m2))

New Model Results

term estimate std.error statistic p.value conf.low conf.high
(Intercept) 2775.71 76.862 36.11 0e+00 2616.31 2935.11
Index 24.48 5.379 4.55 2e-04 13.32 35.63
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.4848 0.4614 182.4 20.71 2e-04 1 -158 321.9 325.5 732088 22 24

Calculate MAPE

One useful metric to calculate is Mean Absolute Percentage Error (MAPE) shown below. Its ease of explanation is why this is a common metric to calculate. In our case, the MAPE returned is 0.0699. This means our forecasted demand is approximately 7% away from the actual values, on average. A lower MAPE suggests a more reliable forecast. However, there is no hard rule on what should be considered a “good” MAPE. In general, I would consider around 10% to be an acceptable result.

\[ Mean\ Absolute\ Percentage\ Error \\ MAPE = \frac{1}{n} \sum_{t = 1}^{n} {| \frac{A_t - F_t}{A_t} | } \]

Validation = Historical_Demand %>% filter( year(TimePeriod) == 2023)

# Now on the validation data set - lets generate the forecast
Validation = Validation %>%
  mutate(Predicted_Demand = modelcoef[1] + modelcoef[2]*Index) %>%
  cbind(distinct(Training, `month(TimePeriod)`, AvgSI  ) ) %>%
  mutate(Seasonalized_Demand = Predicted_Demand * AvgSI) %>%
  # calculate the MAPE
  mutate( Error =  Demand - Seasonalized_Demand,
          ErrorOverActual = Error/Demand,
          AbsVal = abs(ErrorOverActual) )

mean(Validation$AbsVal)
## [1] 0.0699
Time Period Actual Demand Predicted Demand Seasonalized Prediction Actual - Prediction Error Over Actual Absolute Value
2023-01-01 2674 3388 2497 176.64 0.0661 0.0661
2023-02-01 3112 3412 3039 73.49 0.0236 0.0236
2023-03-01 2848 3437 2994 -145.83 -0.0512 0.0512
2023-04-01 3398 3461 3504 -106.42 -0.0313 0.0313
2023-05-01 3770 3486 3968 -197.79 -0.0525 0.0525
2023-06-01 3714 3510 4129 -414.74 -0.1117 0.1117
2023-07-01 3155 3535 2828 327.47 0.1038 0.1038
2023-08-01 3268 3559 3282 -13.74 -0.0042 0.0042
2023-09-01 3568 3583 3099 469.29 0.1315 0.1315
2023-10-01 4475 3608 4065 409.75 0.0916 0.0916
2023-11-01 5248 3632 4552 696.27 0.1327 0.1327
2023-12-01 4592 3657 4413 178.02 0.0388 0.0388

Closing Comments

One assumption to keep in mind when using this particular technique: equal time periods. We are making an assumption that time periods in this series are behaving in a reasonable & consistent manner, including that seasonal component that we extracted out of the demand. If say, some event alters that behavior (maybe lawsuit or supply chain issue) then those abnormalities can impact the quality of our forecast. More robust approaches would be needed to deal with cases like those. I hope short exercise provides some utility to whomever may stumble across this! Keep in mind this is one technique of many that can be employed to answer our earlier problem statement. Below are some additional resources & readings I found interesting for anyone curious.

Thanks for reading! 😄