Forecasting: Principles and Practices

Author

Teddy Kelly

Chapter 1 Getting Started

Forecasting is the science of predicting the future. This textbook explores the most reliable methods for producing forecasts.

1.1 What can be forecast?

The predictability of an event or a quantity depends on the following factors:

How well we understand the factors that contribute to it.
How much data is available.
How similar the future is to the past.
Whether the forecasts can affect the thing we are trying to forecast.

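Short-term electricity demand usually meets all of those criteria, so it can be forecast accurately. Exchange rates meet only one of them (plenty of data is available), so they are difficult to predict.
A key step in forecasting is knowing when something can be forecast accurately, and when you can do no better than a coin flip.
Forecasts of exchange rates affect the rates themselves: if a rise is widely forecast, people immediately adjust the price they are willing to pay, so the forecast becomes self-fulfilling. This is an example of the “efficient market hypothesis”.
A good forecasting model captures the way in which things are changing. The usual assumption is that the way in which the environment is changing will continue into the future.
Judgmental forecasting is used when there is no data available.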

1.2 Forecasting, goals and planning

Forecasting

Is about predicting the future as accurately as possible, given all available information, including historical data and knowledge of any future events that might affect the forecasts.

Goals

Are what you would like to occur. Goals should be linked to forecasts and plans, but this does not always happen.
Forecasts should be used to determine whether a goal is realistic or not.

Planning

A response to forecasts and goals. Planning involves the appropriate actions that are required to make your forecasts match your goals.

Short-Term Forecasts

Needed for scheduling of personnel, production and transportation. Forecasts of demand are often also required as part of the scheduling process.

Medium-Term Forecasts

Needed to determine future resource requirements, in order to purchase raw materials, hire personnel, or buy machinery and equipment.

Long-Term Forecasts

Used for strategic planning. Such decisions may take account of market opportunities, environmental factors and internal resources.

An organization must develop a forecasting system that involves several approaches to predicting uncertain events.

1.3 Determining What to Forecast

It is important to ask the following questions when tackling a forecasting project:

What needs to be forecast?
What is the forecasting horizon? (one month, 6 months, 10 years?)
How frequently are forecasts required? Forecasts that are done more frequently are better done by using an automated system.
Once the logistics of the forecasting procedure have been worked out, it’s important to find or collect the data on which the forecasts will be based.
Types of data used for forecasting include: sales records of a company, historical demand for a product, or the unemployment rate for a geographic region.

1.4 Forecasting Data and Methods

The appropriate forecasting method depends on the kind of data available.

Qualitative Forecasting methods must be used when there is no data available, or if the data are not relevant to the forecasts.

Quantitative Forecasting can be applied when two conditions are satisfied:

Numerical information about the past is available
It’s reasonable to assume that some aspects of the past patterns will continue into the future.

There is a wide range of quantitative forecasting methods. Most quantitative prediction problems use time-series data.

Time Series Forecasting Examples

Annual profits of a firm
Unemployment data in the US from 1930-2026
Monthly rainfall
Weekly sales
Daily IBM stock
Hourly electricity demand
5-minute freeway traffic counts

The aim of forecasting time series data is to estimate how the sequences of observations will continue into the future. The simplest time series methods use only the information about the variable of interest and do not take into account other factors that might influence its behavior.

Predictor Variables and Time Series Forecasting

A model forecasting the hourly electricity demand with predictor variables might be of the form of an explanatory model:

\[ ED=f(current\_temp, strength\_of\_economy,population,time\_of\_day,day\_of\_week,error) \]

It’s called an explanatory model because it helps explain what causes the variation in electricity demand.

We could also use a time series model for forecasting, since the data form a time series, where \(t\) is the present hour:

\[ ED_{t+1}=f(ED_t,ED_{t-1},ED_{t-2},...,error) \]

Here, prediction of the future electricity demand is based on past values of electricity demand but not on external variables that may affect electricity demand. The error term allows for random variation and the effects of relevant variables that are not included in the model.

There is a third type of model, called a mixed model, which combines the features of the two models above:

\[ ED_{t+1}=f(ED_t, current\_temp,time\_of\_day,day\_of\_week,error) \]

Why use a time-series model to forecast instead of an explanatory model?

The system may not be understood, and it may be difficult to measure the relationships between the variables.
Forecasting the variable of interest requires forecasts of the future values of the predictors, which may be too difficult to obtain.
The goal may just be to predict what will happen, not to know why it happens.
Time series models may give more accurate predictions than an explanatory or mixed model.

Deciding which forecasting model to use depends on the data that is available, the accuracy of the competing models given the data, and the goal of the forecasting project.
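As a rough illustration of the three model forms above, the sketch below fits each one to simulated hourly data with base R's lm(). Everything in it is an assumption made for illustration: the data are made up, f is taken to be linear, and the variable names (temp, ed, ed_next, ed_now, temp_next) are hypothetical.

```r
# Simulated data: every number below is invented for illustration only.
set.seed(1)
n <- 200
temp <- 20 + 8 * sin(2 * pi * (1:n) / 24) + rnorm(n)  # hourly temperature
ed <- numeric(n)                                       # electricity demand
ed[1] <- 65
for (t in 2:n) {
  ed[t] <- 10 + 0.7 * ed[t - 1] + 0.5 * temp[t] + rnorm(1)
}

dat <- data.frame(
  ed_next   = ed[-1],    # ED_{t+1}, the value being forecast
  ed_now    = ed[-n],    # ED_t, the most recent observation
  temp_next = temp[-1]   # temperature in the hour being forecast
)

# Assumption: f is linear in each case.
models <- list(
  explanatory = lm(ed_next ~ temp_next, data = dat),
  time_series = lm(ed_next ~ ed_now, data = dat),
  mixed       = lm(ed_next ~ ed_now + temp_next, data = dat)
)

# In-sample root mean squared error of each model
rmse <- sapply(models, function(m) sqrt(mean(residuals(m)^2)))
rmse
```

In-sample, the mixed model cannot fit worse than either of the others because it contains both as special cases. That says nothing about which model forecasts best on new data, which is what the choice should be based on.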

1.5 Case Studies

1.6 The Basic Steps in a Forecasting Task

Forecasting tasks usually involve five basic steps:

Step 1: Problem Definition

Often the most difficult part.
Defining the problem requires an understanding of the way the forecasts will be used, who needs them, and how the forecasting function fits within the organization requiring the forecasts.

Step 2: Gathering Information

Two kinds of information required:
- Statistical Data
- The accumulated expertise of the people who collect the data and use the forecasts
It is often difficult to obtain enough historical data to fit a good statistical model, so judgmental forecasting methods may also be needed.

Step 3: Preliminary (Exploratory) Analysis

Always begin by graphing the data to uncover key relationships between the variables, trends, and the presence of outliers.

Step 4: Choosing and Fitting Models

Every model rests on a set of assumptions and involves one or more parameters, which must be estimated using the historical data.
The best model depends on the availability of historical data, the strength of the relationships between the explanatory variables and the forecast variable, and the way in which the forecasts are to be used.

Step 5: Using and Evaluating a Forecasting Model

The performance of the model can only be accurately evaluated after the event occurs.

1.7 The Statistical Forecasting Perspective

We can think of what we are trying to forecast as a random variable because it is unknown and could take on a range of possible values.

The further ahead we forecast, the more uncertain we are. The variation associated with what we are trying to forecast decreases as the event approaches.

For forecasting, we are estimating the middle of the range of possible values the random variable could take.
A forecast is accompanied by a prediction interval giving a range of values that the random variable could take with a relatively high probability.
The average of the possible future values is called the point forecast.

\(y_t|I\) means “the random variable \(y_t\) given what we know, \(I\)”. The set of values that this random variable could take, along with their relative probabilities, is known as the “probability distribution” of \(y_t|I\). In forecasting, it’s called the forecast distribution.

The “forecast” usually means the average value of the forecast distribution. \(\hat{y}_t\) is the forecast of \(y_t\), meaning the average of the possible values that \(y_t\) could take given what we know.

\(\hat{y}_{t|t-1}\) means the forecast of \(y_t\) taking into account all the previous observations of \(y\): \((y_1,...,y_{t-1})\).
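The relationship between a forecast distribution, a point forecast and a prediction interval can be sketched in base R. The forecast distribution here is hypothetical: it is represented by 10,000 simulated values drawn from a normal distribution whose mean and standard deviation were chosen arbitrarily.

```r
set.seed(42)

# Hypothetical forecast distribution of y_t given I, represented by
# simulated possible future values (the numbers are invented).
possible_values <- rnorm(10000, mean = 100, sd = 15)

# Point forecast: the average of the possible future values.
point_forecast <- mean(possible_values)

# 80% prediction interval: the range containing the middle 80% of values.
interval_80 <- quantile(possible_values, probs = c(0.10, 0.90))

point_forecast
interval_80
```

The further ahead we forecast, the wider this distribution would typically be, and so the wider the prediction interval.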

Chapter 2: Time Series Graphics

2.1 `tsibble` objects

A tsibble is a specialized data structure that deals specifically with time series data.

tsibble objects have keys, index, and the variable(s) we are attempting to forecast. The index is usually the measure of time.

Index

The variable that indexes the time series; this is usually the measure of time.
Every tsibble object must have an index.

Key

The variable that determines the different unique time series in the dataset
There can be multiple key variables.
Must uniquely define each time series.

Measured Variables

Individual variables that we might wish to model.

A tsibble allows storage and manipulation of multiple time series in R. tsibbles must have an index, measured variables, and optionally some key variables which uniquely identify each series.

library(fpp3)
my_data <- tsibble(
  year = 2015:2019,
  data = c(203, 59, 1, 52, 110),
  index = year #Must have an index
)

When a dataset contains more than one time series, the key must be specified when creating the tsibble. In general, each row of a tsibble must have a unique combination of the time index and the keys; duplicates are not allowed.

prison <- read.csv("https://OTexts.com/fpp3/extrafiles/prison_population.csv")

# Convert into a tsibble object
prison <- prison |>
  mutate(Quarter = yearquarter(Date)) |>
  select(-Date) |>
  as_tsibble(index = Quarter,
             key = c(State, Gender, Legal, Indigenous))
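The uniqueness requirement can be checked before conversion. The sketch below uses a small hypothetical data frame (sales, with made-up columns year, store and units) and the is_duplicated() helper from the tsibble package to show why a key is needed when several series share the same index values.

```r
library(tsibble)

# Hypothetical data: two stores observed over the same three years.
sales <- data.frame(
  year  = rep(2017:2019, times = 2),
  store = rep(c("A", "B"), each = 3),
  units = c(10, 12, 15, 7, 9, 8)
)

# Without a key, each year appears twice, so the rows are not unique.
is_duplicated(sales, index = year)               # TRUE

# With store as the key, every (store, year) pair is unique.
is_duplicated(sales, key = store, index = year)  # FALSE

# The conversion therefore succeeds once the key is supplied.
sales_ts <- as_tsibble(sales, key = store, index = year)
```

If is_duplicated() returns TRUE even with all keys supplied, the offending rows need to be inspected and resolved before as_tsibble() will accept the data.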

Say I want to find the total Count for each quarter. Here is how I would do it:

prison |>
  summarise(TotalCount = sum(Count))

# A tsibble: 48 x 2 [1Q]
   Quarter TotalCount
     <qtr>      <int>
 1 2005 Q1      24296
 2 2005 Q2      24643
 3 2005 Q3      24511
 4 2005 Q4      24393
 5 2006 Q1      24524
 6 2006 Q2      25017
 7 2006 Q3      25428
 8 2006 Q4      25913
 9 2007 Q1      25912
10 2007 Q2      26517
# ℹ 38 more rows

PBS Data

PBS |> filter(ATC2 == "A10") |>
  select(Month, Concession, Type, Cost) |>
  summarize(TotalC = (sum(Cost) / 1e6)) -> a10

2.2 Time Plots

Plotting data allows you to observe patterns, unusual observations, changes over time, and relationships between variables.

Using autoplot(), which is available once fpp3 is loaded:

a10 |> autoplot() +
  labs(y = 'Cost in millions', title = "Australian Antidiabetic Drug Sales")

Plot variable not specified, automatically selected `.vars = TotalC`

Using ggplot:

a10 |>
  ggplot(mapping = aes(x = Month, y = TotalC)) +
  geom_line() +
  labs(title = "Time Plot", y = "Cost in Millions of $")

Adding points to my plot to show when key changes happen:

a10 |> autoplot(TotalC) + geom_point()+
  labs(title = "Australian Antidiabetic Drug Costs for A10", y = "Cost in Millions of $")

Ansett Dataset

Graph the time series of flights for economy class for the Melbourne to Sydney airports.

ansett |> filter(Class == "Economy" & Airports == "MEL-SYD") |> 
  mutate(Passengers = Passengers /1000) |>
  autoplot(Passengers) + labs(title = "Ansett Airlines Economy Class",
                              subtitle = "Melbourne-Sydney",
                              y = "Thousands of Passengers")

2.3 Time Series Patterns

Terminology to describe time series graphs

Trend

A trend exists when there is a long-term increase or decrease in the data. The trend does not have to be linear.
Sometimes we refer to a trend as “changing direction”, for example when it goes from an increasing trend to a decreasing one.

Seasonal

A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of year, the day of the week, or even the time of day.
Seasonality always has a fixed and known period.

Cyclic

A cycle occurs when the data exhibit rises and falls that are not of a fixed frequency.
These fluctuations are usually due to economic conditions and are often related to the “business cycle”. The duration of these fluctuations is usually at least 2 years.
The average length of a cycle is generally longer than the length of a seasonal pattern.
The magnitude of cycles tends to be more variable than the magnitude of seasonal patterns.

2.4 Seasonal Plots

Seasonal plots are similar to time plots except that the data are plotted against the individual “seasons” in which the data were observed.

A season can be a year, month, week, or day.
Then, on the x-axis of the seasonality plot, the ticks will be the next lower measure of time. For example, if the period is set equal to year, then the months of the year will appear on the x-axis.
Use the gg_season command and set period equal to the desired “season” you want.

Here is an example using the a10 tsibble created earlier:

a10 |> gg_season(TotalC, labels = 'both') +
  labs(title = "Seasonal Plot: Antidiabetic Drug Sales", y = "Total Cost in millions of $")

The seasonal plot allows the underlying seasonal pattern to be seen more clearly and helps us identify the years in which the pattern changes.

The seasonal plot confirms that January is the peak month for antidiabetic drug sales, followed by a large drop in February.
Sales also tend to rise from February to March in most years. Part of that rise is artificial, since March has 31 days compared with only 28 (or 29) in February. The exception is 2008, when March sales were unusually low; this coincides with the Great Recession, although the plot alone cannot tell us the cause.
Since there is an upward trend in the data, each successive year is higher than the last.

Multiple Seasonal Periods

Using the vic_elec tsibble, we can specify different time periods to graph the seasonal plots.

vic_elec |> autoplot(Demand) + 
  labs(title = "Time Series Plot of Victorian Electricity Demand")

We can roughly see that electricity demand peaks during the summer months in Australia, when people are likely using air conditioning much more often.
Electricity demand has a local maximum during the winter months and a trough during the spring and fall months.

By day:

vic_elec |> gg_season(Demand, period = 'day') +
  labs(title = "Seasonal Plot By Day", 
       subtitle = "Victorian Electricity Demand")

Electricity demand peaks towards the end of the day at around 6pm.
The major high spikes in the blue are on very hot days during the summer.

By Week:

vic_elec |> gg_season(Demand, period = 'week') +
  labs(title = "Seasonal Plot by Week",
       subtitle = "Victorian Electricity Demand")

We can see that the electricity demand is lower on the weekends.

Plotting by hour takes too long to run, so it is skipped here.

By year:

vic_elec |> gg_season(Demand, period = 'year') +
  labs(title = "Seasonal Plot by Year",
       subtitle = "Victorian Electricity Demand")

Clearly, we can see that electricity demand peaks during the summer months in Australia.

Time-series plot for Australian Beer Production:

aus_production |> select(Quarter, Beer) |> filter(year(Quarter) >= 1992) |>
  autoplot(Beer) + geom_point() + labs(title = 'Australian Beer Production')

We can see that beer production peaks in quarter 4 which is during the summer in Australia

Seasonality Plot:

aus_production |> filter(year(Quarter) >= 1992) |> gg_season(Beer, labels = 'right') +
  labs(title = 'Australian Beer Production Seasonality')

The seasonality plot confirms what we found in the time series plot that beer production peaks during quarter 4 in the summer.

2.5 Seasonal Subseries Plots

Seasonal Subseries Plot: A subseries plot collects the data for each season together and shows each season as its own mini time plot.

Allows us to see the underlying seasonality patterns more clearly.

Let’s look at the a10 subseries plot.

a10 |> gg_subseries(TotalC)

Warning: `gg_subseries()` was deprecated in feasts 0.4.2.
ℹ Please use `ggtime::gg_subseries()` instead.

The horizontal blue line indicates the mean value for each month. We can clearly see that mean antidiabetic drug sales are highest in January.
Also, the sub-series plots show an upward trend for each month over the years.

Example: Australian Holiday Tourism

For the tourism tsibble, let’s say we want to find the total trips where the purpose is Holiday by State for each quarter. The following code accomplishes this:

tourism |> filter(Purpose == 'Holiday') |> group_by(State) |> 
  summarize(TotalTrips = sum(Trips)) -> holidays

Time series of the tsibble holidays

holidays |> autoplot(TotalTrips) + 
  labs(title = "Australian Domestic Holidays by State", y ="Overnight Trips")

Let’s look at a seasonal plot to observe the timing of the seasonal peaks using gg_season and facet_wrap to make the plots go into rows.

For facet_wrap, you indicate which variable to split the plots by (State in this case) and the number of rows you want. Setting scales to “free_y” gives each panel its own y-axis scale, which is useful here because the number of trips differs widely between states.

holidays |> gg_season(TotalTrips, period = 'year') + 
  facet_wrap(vars(State), nrow = 3, scales = 'free_y') +
  labs(title = "Australian Domestic Holiday Travel",
       subtitle = "Seasonal Plot", y = "Overnight Trips")

We can see that the northern states and territories receive the most visitors during Q3, which covers the winter months in Australia.
Travel demand for the southern states is highest during the Australian summer months.

It’s difficult to tell, so let’s look at a seasonal subseries:

holidays |> gg_subseries(TotalTrips)

For some states, the highest mean number of overnight trips occurs in quarter 1, but for others it occurs in quarter 2 or 3.

The takeaway from this section is that time plots, seasonal plots, and seasonal subseries plots can each reveal features that may not be noticeable in the others.

2.6 Scatter Plots

We can use scatter plots to study the relationships between two variables in a tsibble. For example if we have two time series plots showing Demand and Temperature during 2014 for the vic_elec dataset, we can study the relationship between electricity demand and the temperature outside.

# Electricity demand in Victoria
vic_elec |> filter(year(Time) == 2014) |> autoplot(Demand) +
  labs(title = "Half-hourly Electricity Demand: Victoria", y = "Gigawatts")

# Temperature in Melbourne
vic_elec |> filter(year(Time) == 2014) |> autoplot(Temperature) +
  labs(title = "Half-hourly Temperature: Melbourne")

Scatter Plot of Demand and Temperature

vic_elec |> filter(year(Time) == 2014) |> 
  ggplot(mapping = aes(x = Temperature, y = Demand)) +
  geom_point() +
  labs(title = "Scatter Plot of Electricity Demand Vs Temperature",
       x = 'Temperature (degrees Celsius)',
       y = 'Electricity Demand (in Gigawatts)')

We can see that as the temperature rises from cold to mild, electricity demand initially decreases.
However, once the temperature becomes hot, electricity demand increases with it, likely because people are using air conditioning more.

Correlation

The correlation between two variables can be calculated using the cor() function.
The correlation coefficient measures the strength of their linear relationship.

The formula is as follows for the correlation coefficient:

\[ r=\frac{\sum(x_t-\bar{x})(y_t-\bar{y})}{\sqrt{\sum(x_t-\bar{x})^2}\sqrt{\sum(y_t-\bar{y})^2}} \]

The correlation coefficient for the above example is about 0.28, but this is misleading since the non-linear relationship is stronger than that. The correlation coefficient only measures the strength of the linear relationship.
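To see how a strong non-linear relationship can produce a modest correlation coefficient, here is a minimal sketch on simulated data. The data are invented stand-ins for temperature and demand (not the vic_elec values), with a U-shaped relationship built in; the coefficient is computed both from the formula above and with cor().

```r
set.seed(123)

# Simulated stand-ins for temperature and demand (made-up numbers):
# demand is lowest near 20 degrees and rises on either side.
temp <- runif(500, min = 5, max = 40)
demand <- 4 + 0.01 * (temp - 20)^2 + rnorm(500, sd = 0.3)

# Correlation coefficient written out term by term from the formula
r_manual <- sum((temp - mean(temp)) * (demand - mean(demand))) /
  (sqrt(sum((temp - mean(temp))^2)) * sqrt(sum((demand - mean(demand))^2)))

# The built-in function gives the same number
r_builtin <- cor(temp, demand)

# Correlation with the curved term that actually drives demand here
r_curved <- cor((temp - 20)^2, demand)

c(manual = r_manual, builtin = r_builtin, curved = r_curved)
```

In this simulation the linear correlation understates how closely demand follows temperature, which is why it is worth looking at the scatter plot and not only at the coefficient.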

Scatter plot Matrices

Display the time series of quarterly visitors across the states and territories of Australia from the tourism tsibble.

tourism |> group_by(State) |> summarize(TotalTrips = sum(Trips)) |>
  ggplot(mapping = aes(x = Quarter, y = TotalTrips)) + geom_line() +
  facet_grid(vars(State), scales = 'free_y') + 
  labs(title = "Australian Domestic Tourism", y = 'Overnight Trips')

To see the relationships between these eight time series, we can plot each time series against the others using a scatter plot matrix.

tourism |> group_by(State) |> summarize(TotalTrips = sum(Trips)) |>
  pivot_wider(values_from = TotalTrips, names_from = State) |>
  GGally::ggpairs(columns = 2:9)

2.7 Lag Plots

Lag plots can be created with the gg_lag() function, specifying geom = "point".

aus_production |> filter(year(Quarter) >= 1992) |> ggtime::gg_lag(Beer, geom = 'point')

What do the lag plots show us?

Lag plots help us visualize the relationship between a time series and a lagged version of itself, which helps us to identify patterns of randomness, autocorrelation, and seasonality.

Each graph shows \(y_t\) plotted against \(y_{t-k}\) for different values of \(k\). The value of \(k\) corresponds to the number associated with the indicated lag plot.
For example, in the lag 2 plot, the yellow points show the quarter 4 values plotted against the quarter 2 values of the same year. The blue points, which correspond to quarter 2, show the quarter 2 values on the y-axis plotted against the quarter 4 values of the previous year.
There are strong positive relationships at lag 4 and lag 8, which makes sense because each quarter is being plotted against the same quarter one or two years earlier. When there is strong seasonality, these seasonal lags will show a strong relationship.
On the other hand, there is a strong negative relationship at lags 2 and 6 because we are effectively plotting the peaks against the troughs.
Lag plots show the relationship between points that are k periods apart from each other.
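The pairing behind each panel of a lag plot can be written out directly. The sketch below uses a simulated quarterly series (made-up numbers, with a trough in Q2 and a peak in Q4, loosely mimicking the beer data) and a hypothetical helper, lag_pairs(), that matches each \(y_t\) with \(y_{t-k}\).

```r
set.seed(7)

# Simulated quarterly series (invented numbers): trough in Q2, peak in Q4.
pattern <- c(2, -6, -2, 6)
y <- rep(pattern, times = 20) + rnorm(80)

# Pair each y_t with y_{t-k}; these are the points a lag-k panel displays.
lag_pairs <- function(y, k) {
  n_obs <- length(y)
  data.frame(y_t = y[(k + 1):n_obs], y_lagged = y[1:(n_obs - k)])
}

# Peaks line up against troughs at lag 2, and against peaks at lag 4.
cor_lag2 <- with(lag_pairs(y, 2), cor(y_t, y_lagged))
cor_lag4 <- with(lag_pairs(y, 4), cor(y_t, y_lagged))

c(lag2 = cor_lag2, lag4 = cor_lag4)
```

Calling plot(lag_pairs(y, 4)) would draw the base R equivalent of a single lag 4 panel.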

2.8 Autocorrelation

Autocorrelation measures the linear relationship between lagged values of a time series. Think self correlation.

There are multiple autocorrelation coefficients, corresponding to each panel in the lag plot.
For example, \(r_1\) measures the relationship between \(y_t\) and \(y_{t-1}\), \(r_2\) measures the relationship between \(y_t\) and \(y_{t-2}\) and so on.
The value of the autocorrelation coefficient \(r_k\) is the following:

\[ r_k=\frac{\sum\limits_{t=k+1}^T(y_t-\bar{y})(y_{t-k}-\bar{y})}{\sum\limits_{t=1}^T(y_t-\bar{y})^2} \]

\(T\) is the length of the time series, and the autocorrelation coefficients make up the autocorrelation function, or ACF.
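The formula for \(r_k\) translates almost directly into base R. The sketch below defines a hypothetical helper, autocorr(), applies it to a simulated seasonal series (made-up numbers), and compares the result with base R's acf(), which uses the same definition.

```r
# r_k computed term by term from the formula above.
autocorr <- function(y, k) {
  n_obs <- length(y)   # T in the formula
  y_bar <- mean(y)
  numerator <- sum((y[(k + 1):n_obs] - y_bar) * (y[1:(n_obs - k)] - y_bar))
  denominator <- sum((y - y_bar)^2)
  numerator / denominator
}

set.seed(2024)
# Simulated quarterly series (invented numbers)
y <- rep(c(2, -6, -2, 6), times = 15) + rnorm(60)

r_manual <- sapply(1:9, function(k) autocorr(y, k))

# acf() also returns lag 0 (always 1), which is dropped here.
r_builtin <- as.numeric(acf(y, lag.max = 9, plot = FALSE)$acf)[-1]

all.equal(r_manual, r_builtin)
```

Note that the denominator always sums over the whole series, while the numerator has only \(T-k\) terms, so \(r_k\) is not quite the ordinary correlation between \(y_t\) and \(y_{t-k}\).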

aus_production |> filter(year(Quarter) >= 1992) |> ACF(Beer, lag_max = 9)

# A tsibble: 9 x 2 [1Q]
       lag     acf
  <cf_lag>   <dbl>
1       1Q -0.102 
2       2Q -0.657 
3       3Q -0.0603
4       4Q  0.869 
5       5Q -0.0892
6       6Q -0.635 
7       7Q -0.0542
8       8Q  0.832 
9       9Q -0.108

We can plot the ACF to see how the correlations change with the lag \(k\). This plot is known as a correlogram: it simply plots each autocorrelation coefficient against its lag.

aus_production |> filter(year(Quarter) >= 1992) |> ACF(Beer, lag_max = 9) |> autoplot() +
  labs(title = "Australian Beer Production Correlogram")

\(r_4\) and \(r_8\) (comparing the same quarter in previous years) are the two strongest positive autocorrelations, while \(r_2\) and \(r_6\) (comparing quarters two apart within the seasonal cycle) are the two strongest negative ones, which makes sense.
Peaks tend to be 4 quarters behind previous peaks and troughs tend to be 2 quarters behind peaks.
The dashed blue lines indicate whether the correlations are significantly different from zero.

Trend and Seasonality in ACF Plots

When the data have a trend, the autocorrelations for small lags tend to be large and positive because observations nearby in time will also be close in value. So the ACF of a trended time series tends to have positive values that slowly decrease as the lags increase.
When data are seasonal, the autocorrelations will be larger for the seasonal lags than for other lags. (At the multiples of the seasonal frequency)
When the data are both trended and seasonal, there is a combination of these effects.
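The effect of a trend on the ACF can be checked on simulated data. This is a minimal sketch using base R's acf() on a made-up series consisting of a linear trend plus noise; the slope and noise level are arbitrary choices.

```r
set.seed(11)

# Simulated trended series (invented numbers): linear trend plus noise.
n_obs <- 100
trended <- 0.5 * seq_len(n_obs) + rnorm(n_obs, sd = 2)

# Autocorrelations at lags 1 to 20 (lag 0 dropped)
r_trend <- as.numeric(acf(trended, lag.max = 20, plot = FALSE)$acf)[-1]

# Large and positive at small lags, decreasing slowly as the lag grows
round(r_trend[c(1, 5, 10, 20)], 2)
```

Adding a repeating seasonal component to the simulated series would superimpose bumps at the seasonal lags on top of this slow decay.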

The a10 data is both trended and seasonal:

a10 |> gg_lag(TotalC, geom = "point")

Warning: `gg_lag()` was deprecated in feasts 0.4.2.
ℹ Please use `ggtime::gg_lag()` instead.

a10 |> ACF(TotalC, lag_max = 48) |> autoplot() + 
  labs(title = "Australian Antidiabetic Drug Sales ACF Plot")

The autocorrelation coefficients are strong and positive, which confirms the trend and seasonality in the data.
Also, the autocorrelation coefficient \(r_k\) peaks at the seasonal lags \(k=12, 24, 36, ...\)
The autocorrelation coefficients gradually decrease as the lag increases, which makes sense: as \(k\) increases, the observations being compared are further apart in time, and because of the trend their values of \(TotalC\) are further apart as well.

gafa_stock |> filter(year(Date) == 2018 & Symbol == "FB") |> autoplot()

Plot variable not specified, automatically selected `.vars = Open`

gafa_stock |> filter(year(Date) == 2018) |> group_by(Symbol) |> summarize(Close = Close) |>
  ggplot(aes(x = Date, y = Close)) + 
  geom_line() + facet_grid(vars(Symbol), scales = 'free_y') +
  labs(title = "Closing Values by Company for 2018", x = "Date", y = "Closing Value")

2.9 White Noise

Time series that show no autocorrelation are called white noise.

White noise data are uncorrelated across time, with zero mean and constant variance.
Stricter definitions also require the observations to be independent.

set.seed(30)
y <- tsibble(sample = 1:50, data = rnorm(50), index = sample)
y |> autoplot(data)

Doing the ACF plot, we get:

y |> ACF(data) |> autoplot()

We can see that most of the spikes are within the dashed blue lines, meaning that the autocorrelation coefficients are very small. This suggests that there is no autocorrelation (observations are not correlated with past values).
If one or more of the large spikes are outside the blue line, or if substantially more than 5% of spikes are outside these bounds, then the time series is probably not white noise.
The blue line bound is determined by the size of the series \(T\) and the following formula is used:

\[ \pm1.96/\sqrt{T} \]
For this case, \(T=50\), so the bounds are \(\pm1.96/\sqrt{50}=\pm0.28\) as seen in the graph above.
The blue lines show 95% critical values.
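The bound calculation and the "more than 5% of spikes" check can be reproduced in base R. This sketch simulates its own white noise series with rnorm() and uses base R's acf(), so it will not match the plot above exactly; the choice of 15 lags is an assumption.

```r
set.seed(30)

n_obs <- 50
white_noise <- rnorm(n_obs)

# 95% critical values for the autocorrelations of white noise
bound <- 1.96 / sqrt(n_obs)
round(bound, 2)

# Autocorrelations at lags 1 to 15 (lag 0 dropped)
r <- as.numeric(acf(white_noise, lag.max = 15, plot = FALSE)$acf)[-1]

# Share of spikes outside the bounds; for white noise about 5% is expected
share_outside <- mean(abs(r) > bound)
share_outside
```

With only 15 lags, a single spike outside the bounds already exceeds 5%, so the size of the spikes matters as well as their number.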

Real Example:

Using the aus_livestock tsibble, let’s graph the time series for pigs slaughtered in Victoria since 2014.

aus_livestock |> filter(year(Month) >= 2014 & Animal == 'Pigs' & State == 'Victoria') |>
  autoplot(Count/1e3)

ACF Plot:

aus_livestock |> filter(year(Month) >= 2014 & Animal == 'Pigs' & State == 'Victoria') |>
  ACF(Count) |> autoplot()

One or perhaps two lags have an autocorrelation coefficient outside the blue bounds.
There may be some slight seasonality, since the biggest spike is at lag 12, which corresponds to the same month of the previous year. This suggests that the series is not white noise.