Just a forecasting exercise I came across while browsing my old materials from school. By no means is this meant to be a in-depth guide on time series analysis. There will be some links at the end of this report for further readings. So ends the preamble.
Let’s assume a scenario where we have some historical sales data. Our company wants to forecast the following year so leadership can better understand for future planning needs How can we approach this? Basically, a simple time series can be assumed of the following components
To estimate “m” - first we must isolate it from the effects of “s”. We can do this by first “deseasonalizing” the series then using a linear regression to estimate the trend.
Lets create some monthly historical data for 2021 through 2023. Notice the fluctuations month to month. We can see that there is some sort of repeating or “systematic” variation influence in the historical data. For example, the climb starting in January followed by a drop in July before climbing back up again. In addition, there looks to be an upward trend year to year.
# simulate some historical monthly demand
Historical_Demand = data.frame(
TimePeriod = seq.Date(from = as.Date('2021-01-01'),
to = as.Date('2023-12-01'),
by = 'months'),
Demand = c(2303,2778,2428,2996,3349,3449,2338,2604.50,2251.50,3246,3542,3331,
2217,2683.50,2954.50,3235,3659,3794,2589,3087,3112,3700,4193,4129
,2674,3112.50,2848.50,3398,3770,3714,3155,3268,3568,4475,5248.50,4591.50),
Index = seq(1:36)
)
| Index | Time Period | Demand |
|---|---|---|
| 1 | 2021-01-01 | 2303 |
| 2 | 2021-02-01 | 2778 |
| 3 | 2021-03-01 | 2428 |
| 4 | 2021-04-01 | 2996 |
| 5 | 2021-05-01 | 3349 |
| 6 | 2021-06-01 | 3449 |
| 7 | 2021-07-01 | 2338 |
| 8 | 2021-08-01 | 2604 |
| 9 | 2021-09-01 | 2252 |
| 10 | 2021-10-01 | 3246 |
| 11 | 2021-11-01 | 3542 |
| 12 | 2021-12-01 | 3331 |
| 13 | 2022-01-01 | 2217 |
| 14 | 2022-02-01 | 2684 |
| 15 | 2022-03-01 | 2954 |
| 16 | 2022-04-01 | 3235 |
| 17 | 2022-05-01 | 3659 |
| 18 | 2022-06-01 | 3794 |
| 19 | 2022-07-01 | 2589 |
| 20 | 2022-08-01 | 3087 |
| 21 | 2022-09-01 | 3112 |
| 22 | 2022-10-01 | 3700 |
| 23 | 2022-11-01 | 4193 |
| 24 | 2022-12-01 | 4129 |
| 25 | 2023-01-01 | 2674 |
| 26 | 2023-02-01 | 3112 |
| 27 | 2023-03-01 | 2848 |
| 28 | 2023-04-01 | 3398 |
| 29 | 2023-05-01 | 3770 |
| 30 | 2023-06-01 | 3714 |
| 31 | 2023-07-01 | 3155 |
| 32 | 2023-08-01 | 3268 |
| 33 | 2023-09-01 | 3568 |
| 34 | 2023-10-01 | 4475 |
| 35 | 2023-11-01 | 5248 |
| 36 | 2023-12-01 | 4592 |
ggplot(data = Historical_Demand,
aes(x = TimePeriod,
y = Demand,
fill = as.factor( year (TimePeriod) ) ) ) +
geom_col() +
scale_y_continuous(labels = comma) +
scale_fill_manual(NULL, values = c("#51d6ff", "#8D9EC6", "#A06B9A") ) +
labs(title = 'Historical Demand 2021-2023',
x= NULL,
y = 'Demand') +
custom_theme
The whole point of deseasonalizing the data is to accurately capture the real trend without that cyclical seasonal influence. We do not want that pattern of higher & lower months observed in the earlier section to impact our estimate of the trend Now, lets try to remove the seasonality from the demand series. Apply the following steps
This final column, “Deseasonalized”, is what demand would look like with the seasonality removed. The below plot illustrates the difference between the Deseasonalized demand vs actual demand. Obviously, there is still some ups & downs but notice that the Deseasonalized series is flatter
# generate calculations
Historical_Demand = Historical_Demand %>%
group_by(year(TimePeriod)) %>%
# Yearly Average
mutate(AvgDemand = mean(Demand)) %>%
ungroup() %>%
# ratio of Demand to Average Demand
mutate(Seasonal_Index = Demand / AvgDemand) %>%
group_by(month(TimePeriod)) %>%
# average seasonal index
mutate(AvgSI = mean(Seasonal_Index)) %>%
ungroup() %>%
# Deseasonalize the Demand
mutate(Deseasonalized = Demand / AvgSI)
| Index | Time Period | Year | Month | Demand | Average Demand | Seasonal Index | Average SI | Deasonalized Demand |
|---|---|---|---|---|---|---|---|---|
| 1 | 2021-01-01 | 2021 | 1 | 2303 | 2885 | 0.7984 | 0.7355 | 3131 |
| 2 | 2021-02-01 | 2021 | 2 | 2778 | 2885 | 0.9630 | 0.8779 | 3164 |
| 3 | 2021-03-01 | 2021 | 3 | 2428 | 2885 | 0.8417 | 0.8409 | 2887 |
| 4 | 2021-04-01 | 2021 | 4 | 2996 | 2885 | 1.0386 | 0.9852 | 3041 |
| 5 | 2021-05-01 | 2021 | 5 | 3349 | 2885 | 1.1610 | 1.1030 | 3036 |
| 6 | 2021-06-01 | 2021 | 6 | 3449 | 2885 | 1.1956 | 1.1232 | 3071 |
| 7 | 2021-07-01 | 2021 | 7 | 2338 | 2885 | 0.8105 | 0.8213 | 2847 |
| 8 | 2021-08-01 | 2021 | 8 | 2604 | 2885 | 0.9029 | 0.9130 | 2853 |
| 9 | 2021-09-01 | 2021 | 9 | 2252 | 2885 | 0.7805 | 0.9022 | 2496 |
| 10 | 2021-10-01 | 2021 | 10 | 3246 | 2885 | 1.1253 | 1.1596 | 2799 |
| 11 | 2021-11-01 | 2021 | 11 | 3542 | 2885 | 1.2279 | 1.3145 | 2694 |
| 12 | 2021-12-01 | 2021 | 12 | 3331 | 2885 | 1.1547 | 1.2237 | 2722 |
| 13 | 2022-01-01 | 2022 | 1 | 2217 | 3279 | 0.6760 | 0.7355 | 3014 |
| 14 | 2022-02-01 | 2022 | 2 | 2684 | 3279 | 0.8183 | 0.8779 | 3057 |
| 15 | 2022-03-01 | 2022 | 3 | 2954 | 3279 | 0.9009 | 0.8409 | 3514 |
| 16 | 2022-04-01 | 2022 | 4 | 3235 | 3279 | 0.9865 | 0.9852 | 3284 |
| 17 | 2022-05-01 | 2022 | 5 | 3659 | 3279 | 1.1157 | 1.1030 | 3317 |
| 18 | 2022-06-01 | 2022 | 6 | 3794 | 3279 | 1.1569 | 1.1232 | 3378 |
| 19 | 2022-07-01 | 2022 | 7 | 2589 | 3279 | 0.7895 | 0.8213 | 3152 |
| 20 | 2022-08-01 | 2022 | 8 | 3087 | 3279 | 0.9413 | 0.9130 | 3381 |
| 21 | 2022-09-01 | 2022 | 9 | 3112 | 3279 | 0.9489 | 0.9022 | 3450 |
| 22 | 2022-10-01 | 2022 | 10 | 3700 | 3279 | 1.1282 | 1.1596 | 3191 |
| 23 | 2022-11-01 | 2022 | 11 | 4193 | 3279 | 1.2786 | 1.3145 | 3190 |
| 24 | 2022-12-01 | 2022 | 12 | 4129 | 3279 | 1.2591 | 1.2237 | 3374 |
| 25 | 2023-01-01 | 2023 | 1 | 2674 | 3652 | 0.7322 | 0.7355 | 3635 |
| 26 | 2023-02-01 | 2023 | 2 | 3112 | 3652 | 0.8523 | 0.8779 | 3546 |
| 27 | 2023-03-01 | 2023 | 3 | 2848 | 3652 | 0.7800 | 0.8409 | 3388 |
| 28 | 2023-04-01 | 2023 | 4 | 3398 | 3652 | 0.9305 | 0.9852 | 3449 |
| 29 | 2023-05-01 | 2023 | 5 | 3770 | 3652 | 1.0323 | 1.1030 | 3418 |
| 30 | 2023-06-01 | 2023 | 6 | 3714 | 3652 | 1.0170 | 1.1232 | 3307 |
| 31 | 2023-07-01 | 2023 | 7 | 3155 | 3652 | 0.8639 | 0.8213 | 3841 |
| 32 | 2023-08-01 | 2023 | 8 | 3268 | 3652 | 0.8949 | 0.9130 | 3579 |
| 33 | 2023-09-01 | 2023 | 9 | 3568 | 3652 | 0.9770 | 0.9022 | 3955 |
| 34 | 2023-10-01 | 2023 | 10 | 4475 | 3652 | 1.2254 | 1.1596 | 3859 |
| 35 | 2023-11-01 | 2023 | 11 | 5248 | 3652 | 1.4372 | 1.3145 | 3993 |
| 36 | 2023-12-01 | 2023 | 12 | 4592 | 3652 | 1.2573 | 1.2237 | 3752 |
# overlay deseasonalized values
ggplot(data = Historical_Demand, aes(x = TimePeriod) ) +
scale_y_continuous(labels = comma) +
geom_col(aes(y=Demand), fill = 'deepskyblue4') +
geom_line(aes(y=Deseasonalized), col = 'red', linewidth = 1.5) +
geom_point(aes(y=Deseasonalized), col = 'black', size = 2.5) +
labs(title = "Actual Demand against Deseasonalized Demand",
subtitle = "Still some variation but notice how its flatter",
x = NULL) +
custom_theme
Now to use this Deseasonalized demand series to generate our forecast. We run a simple linear regression using time period to explain our Deseasonalized demand. Our model returns the following equation: Predicted Demand = 2,750.95 + 28.12 * Time Period. This will give us the Deseasonalized demand predictions for a given month. However, for our forecast we need to add back in the seasonality we removed earlier. To do this, we can refer back to the average SI by month calculated in the prior steps. Multiply the Deseasonalized demand point by its month’s corresponding Average SI.
# estimate trend. Index = Period 1, 2, 3..
m1 = lm(Deseasonalized ~ Index ,data = Historical_Demand)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 2750.95 | 73.723 | 37.315 | 0 | 2601.13 | 2900.78 |
| Index | 28.12 | 3.475 | 8.094 | 0 | 21.06 | 35.18 |
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6583 | 0.6483 | 216.6 | 65.51 | 0 | 1 | -243.7 | 493.3 | 498.1 | 1594772 | 34 | 36 |
# pull model coefficients
modelcoef = as.vector(coef(m1))
# Create a new dataset
demand_2024 = data.frame(
TimePeriod = seq.Date(from = as.Date('2024-01-01'),
to = as.Date('2024-12-01'),
by = 'months'),
Index = seq(from = 37, to = 48, by = 1)) %>%
# generate demand forecast off model
mutate(Predicted_Demand = modelcoef[1] + modelcoef[2]*Index) %>%
# append the AvgSI calculated earlier using the historical demand
cbind(distinct(Historical_Demand,`month(TimePeriod)`,AvgSI)) %>%
# add seasonality back to predictions
mutate(Seasonalized_Demand = Predicted_Demand * AvgSI)
| Index | Time Period | Month | Predicted Demand | Average SI | Seasonalized Prediction |
|---|---|---|---|---|---|
| 37 | 2024-01-01 | 1 | 3792 | 0.7355 | 2789 |
| 38 | 2024-02-01 | 2 | 3820 | 0.8779 | 3353 |
| 39 | 2024-03-01 | 3 | 3848 | 0.8409 | 3235 |
| 40 | 2024-04-01 | 4 | 3876 | 0.9852 | 3818 |
| 41 | 2024-05-01 | 5 | 3904 | 1.1030 | 4306 |
| 42 | 2024-06-01 | 6 | 3932 | 1.1232 | 4416 |
| 43 | 2024-07-01 | 7 | 3960 | 0.8213 | 3253 |
| 44 | 2024-08-01 | 8 | 3988 | 0.9130 | 3641 |
| 45 | 2024-09-01 | 9 | 4017 | 0.9022 | 3624 |
| 46 | 2024-10-01 | 10 | 4045 | 1.1596 | 4690 |
| 47 | 2024-11-01 | 11 | 4073 | 1.3145 | 5354 |
| 48 | 2024-12-01 | 12 | 4101 | 1.2237 | 5018 |
# create a plot with actual & forecast
ggplot (
data = rbind( Historical_Demand %>% select(TimePeriod, Index, Demand),
demand_2024 %>%
select(TimePeriod, Index, Seasonalized_Demand) %>%
rename(Demand = Seasonalized_Demand) ) %>%
mutate( Flag = ifelse(Index >= 37, "Forecast", "Actual Demand") ),
aes(x = TimePeriod, y = Demand, fill = Flag) ) +
geom_col() +
scale_fill_manual(NULL, values=c('deepskyblue4','sienna1') ) +
scale_y_continuous(labels = comma, breaks = seq(0,5500, 1000)) +
labs(title = "Comparing Actual vs Forecast Demand",
subtitle='Seems legit',
x = NULL) +
custom_theme
The forecast generated looks to align pretty well. We’re observing the same patterns we did in the historical data. But just passing the eyeball test isn’t enough. It’s important to gauge how reliable our forecast is. Obviously, no one is waiting for 2024 to end to compare to the actual numbers. What we can do now is repeat the exercise as above but only using a subset of our data, say the first two years. Then use the model returned to predict the demand for 2023 & directly compare to real values.
# remove the extra columns created
Historical_Demand = Historical_Demand[,1:3]
# split data set
Training = Historical_Demand %>% filter( year(TimePeriod) < 2023)
# repeat same process on the training data
Training = Training %>%
group_by(year(TimePeriod)) %>%
# Yearly Average
mutate(AvgDemand = mean(Demand)) %>%
ungroup() %>%
# ratio of Demand to Average Demand
mutate(Seasonal_Index = Demand / AvgDemand) %>%
group_by(month(TimePeriod)) %>%
# average seasonal index
mutate(AvgSI = mean(Seasonal_Index)) %>%
ungroup() %>%
# Deseasonalize the Demand
mutate(Deseasonalized = Demand / AvgSI)
# rerun regression using the training data subset
m2 = lm(Deseasonalized ~ Index ,data = Training)
# Pull coefficients
modelcoef = as.vector(coef(m2))
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 2775.71 | 76.862 | 36.11 | 0e+00 | 2616.31 | 2935.11 |
| Index | 24.48 | 5.379 | 4.55 | 2e-04 | 13.32 | 35.63 |
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.4848 | 0.4614 | 182.4 | 20.71 | 2e-04 | 1 | -158 | 321.9 | 325.5 | 732088 | 22 | 24 |
One useful metric to calculate is Mean Absolute Percentage Error (MAPE) shown below. Its ease of explanation is why this is a common metric to calculate. In our case, the MAPE returned is 0.0699. This means our forecasted demand is approximately 7% away from the actual values, on average. A lower MAPE suggests a more reliable forecast. However, there is no hard rule on what should be considered a “good” MAPE. In general, I would consider around 10% to be an acceptable result.
\[ Mean\ Absolute\ Percentage\ Error \\ MAPE = \frac{1}{n} \sum_{t = 1}^{n} {| \frac{A_t - F_t}{A_t} | } \]
Validation = Historical_Demand %>% filter( year(TimePeriod) == 2023)
# Now on the validation data set - lets generate the forecast
Validation = Validation %>%
mutate(Predicted_Demand = modelcoef[1] + modelcoef[2]*Index) %>%
cbind(distinct(Training, `month(TimePeriod)`, AvgSI ) ) %>%
mutate(Seasonalized_Demand = Predicted_Demand * AvgSI) %>%
# calculate the MAPE
mutate( Error = Demand - Seasonalized_Demand,
ErrorOverActual = Error/Demand,
AbsVal = abs(ErrorOverActual) )
mean(Validation$AbsVal)
## [1] 0.0699
| Time Period | Actual Demand | Predicted Demand | Seasonalized Prediction | Actual - Prediction | Error Over Actual | Absolute Value |
|---|---|---|---|---|---|---|
| 2023-01-01 | 2674 | 3388 | 2497 | 176.64 | 0.0661 | 0.0661 |
| 2023-02-01 | 3112 | 3412 | 3039 | 73.49 | 0.0236 | 0.0236 |
| 2023-03-01 | 2848 | 3437 | 2994 | -145.83 | -0.0512 | 0.0512 |
| 2023-04-01 | 3398 | 3461 | 3504 | -106.42 | -0.0313 | 0.0313 |
| 2023-05-01 | 3770 | 3486 | 3968 | -197.79 | -0.0525 | 0.0525 |
| 2023-06-01 | 3714 | 3510 | 4129 | -414.74 | -0.1117 | 0.1117 |
| 2023-07-01 | 3155 | 3535 | 2828 | 327.47 | 0.1038 | 0.1038 |
| 2023-08-01 | 3268 | 3559 | 3282 | -13.74 | -0.0042 | 0.0042 |
| 2023-09-01 | 3568 | 3583 | 3099 | 469.29 | 0.1315 | 0.1315 |
| 2023-10-01 | 4475 | 3608 | 4065 | 409.75 | 0.0916 | 0.0916 |
| 2023-11-01 | 5248 | 3632 | 4552 | 696.27 | 0.1327 | 0.1327 |
| 2023-12-01 | 4592 | 3657 | 4413 | 178.02 | 0.0388 | 0.0388 |
One assumption to keep in mind when using this particular technique: equal time periods. We are making an assumption that time periods in this series are behaving in a reasonable & consistent manner, including that seasonal component that we extracted out of the demand. If say, some event alters that behavior (maybe lawsuit or supply chain issue) then those abnormalities can impact the quality of our forecast. More robust approaches would be needed to deal with cases like those. I hope short exercise provides some utility to whomever may stumble across this! Keep in mind this is one technique of many that can be employed to answer our earlier problem statement. Below are some additional resources & readings I found interesting for anyone curious.
Thanks for reading! 😄
Resource I Different types of forecast KPI measures
Resource II Time series modeling (like ARIMA, SES, etc.,). With code examples!
Resource III Another Good reading on Time Series
Resource IV Holt-Winters Forecasting - another popular method
Resource V Exponential smoothing technique - again, another popular method)
Resource VI Neat blog with examples