Introduction to Time Series Forecasting

2022-11-14

Forecasting

Forecasting has been part of human life since the beginning of civilization.
It has sometimes been considered a sign of divine inspiration and sometimes been treated as a criminal activity.
Some forecasts are based on tea-leaf reading, palm reading, the distribution of maggots, etc.
Some people believe they can forecast someone’s personality from the day they were born (I am a Leo).

Forecasting

We will focus on reliable methods for producing forecasts, not on looking into a crystal ball.

-I don't practice Santeria, I ain't got no crystal ball. Well, if I had a million dollars I'd spend it all

What can we forecast?

daily electricity demand in 3 days
time of sunrise this day next year
Google stock price in 6 months
exchange rate of $US/R$ next week
total sales of iPhones in the US next month

Which one is the easiest to forecast?

From easiest to hardest:

time of sunrise this day next year
daily electricity demand in 3 days
total sales of iPhones in the US next month
exchange rate of $US/R$ next week
Google stock price in 6 months

Which is easiest to forecast?

How can we decide which variables are easy to forecast and which are hard?
The predictability of an event depends on several factors, such as:

how well we understand the factors that contribute to it;
how much data is available;
how similar the future is to the past;
whether the forecasts can affect the thing we are trying to forecast.

Example: Short-term natural gas demand

Forecasts of short-term natural gas demand can be quite accurate because all four conditions are largely met:
- Factors: the main drivers are well understood (temperature, stock levels, economic conditions).
- Data: there is plenty of data on natural gas consumption by state, county, etc.
- Similarity: for short-term forecasting (up to a few weeks), it is safe to assume that demand will behave as it has in the past.
- Feedback: for most residential users, the price of natural gas does not depend on demand, so the demand forecasts have little or no effect on consumer behavior.

Example - exchange rate of $US/R$ next week

Which of these conditions are satisfied when forecasting exchange rates?
- There is plenty of data!
- Several factors affect the exchange rate, and we do not have a full picture of them.
- If there are well-publicized forecasts that the exchange rate will increase, people will immediately adjust the price they are willing to pay, so the forecasts are self-fulfilling.

Example - exchange rate of $US/R$ next week

Forecasting whether the exchange rate will rise or fall tomorrow is about as predictable as forecasting whether a tossed coin will come down as a head or a tail (50%).
Hence, forecasters need to be aware of their own limitations, and not claim more than is possible.
Good forecasts capture the genuine patterns and relationships which exist in the historical data, but do not replicate past events that will not occur again.

Forecasting, Goals and Planning

Forecasting informs decisions by people and firms about production, transportation, planning, policies, etc.
Forecasting should be an integral part of decision-making activities.
The appropriate forecasting method depends largely on what data are available.
Qualitative forecasting: used when no data are available. These methods are not guesswork (e.g., the wisdom of the crowd).
Quantitative forecasting can be applied when:
1. historical data are available;
2. it is reasonable to assume that some aspects of the past will continue into the future.

Random futures

The thing we are trying to forecast is unknown, so we can think of it as a random variable.
For very short-term forecasts, we usually have a good idea of what the next value will be, so the forecast is more precise.
The further ahead we forecast, however, the more uncertain the future value becomes.
When forecasting, we can imagine many possible futures, each yielding a different value.
Next, we will forecast US job openings in the health care and social assistance sector.

Random futures

A forecast is an estimate of the probabilities of possible futures.

Random futures

[Figure: 1,000 simulated futures]

Random futures

To obtain a point forecast, we estimate the middle of the range of possible values the random variable could take (its mean or median).
A forecast should also come with a prediction interval, which gives a range of values the random variable could take with relatively high probability.
For example, a 95% prediction interval is a range of values that should include the actual future value with probability 95%.
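
The idea of many possible futures can be sketched with a small simulation in base R. This is an illustration only: it assumes the series behaves like a random walk with normally distributed steps, and `last_value`, `step_sd`, `n_sims` and `horizon` are made-up values, not taken from the job openings data.

```r
# Sketch: simulate many possible futures and summarise them.
# Assumption (not from the slides): a random walk with normal steps.
set.seed(123)

n_sims     <- 1000  # number of simulated futures
horizon    <- 12    # number of steps ahead
last_value <- 100   # hypothetical last observed value
step_sd    <- 2     # hypothetical standard deviation of a one-step change

# Each column holds the random steps of one simulated future.
steps <- matrix(rnorm(n_sims * horizon, mean = 0, sd = step_sd),
                nrow = horizon, ncol = n_sims)

# Cumulate the steps to get one path per column (rows = horizons).
paths <- last_value + apply(steps, 2, cumsum)

# Point forecast: the mean of the simulated values at each horizon.
point_forecast <- rowMeans(paths)

# 95% prediction interval: 2.5% and 97.5% quantiles at each horizon.
lower <- apply(paths, 1, quantile, probs = 0.025)
upper <- apply(paths, 1, quantile, probs = 0.975)

interval_width <- upper - lower
print(round(data.frame(h = 1:horizon, point_forecast, lower, upper), 1))
```

Under these assumptions the interval widens as the horizon grows, which matches the point that uncertainty increases the further ahead we forecast.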

Random futures

[Figure: forecast with a 95% prediction interval]

Some Forecasts are just BANANAS

A tidy forecasting workflow

Example

To illustrate the process, we will fit a linear trend model to the national GDP data stored in global_economy.
First, install and load the fpp3 package in R.

1 - Data Preparation (tidy)

The first step in forecasting is to prepare the data in the correct format.
This includes loading the data, identifying missing values, filtering the time series, and other pre-processing tasks.
Each data set is unique, so this step is essential for understanding the data’s features and should always be done before models are estimated.

1 - Data Preparation (tidy)

A time series can be thought of as a list of numbers (the measurements) along with the times at which those numbers were recorded (the index).
In the tidyverse framework, we use a tsibble object to store a time series in R.

head(global_economy)

## # A tsibble: 6 x 9 [1Y]
## # Key:       Country [1]
##   Country     Code   Year       GDP Growth   CPI Imports Exports Population
##   <fct>       <fct> <dbl>     <dbl>  <dbl> <dbl>   <dbl>   <dbl>      <dbl>
## 1 Afghanistan AFG    1960    5.38e8     NA    NA    7.02    4.13    8996351
## 2 Afghanistan AFG    1961    5.49e8     NA    NA    8.10    4.45    9166764
## 3 Afghanistan AFG    1962    5.47e8     NA    NA    9.35    4.88    9345868
## 4 Afghanistan AFG    1963    7.51e8     NA    NA   16.9     9.17    9533954
## 5 Afghanistan AFG    1964    8.00e8     NA    NA   18.1     8.89    9731361
## 6 Afghanistan AFG    1965    1.01e9     NA    NA   21.4    11.3     9938414

Year - Index
Country - Key
GDP, Imports, Exports, Population - Measurements

1 - Data Preparation (tidy)

We create GDP per capita so that we can compare GDP across countries.

# Compute GDP per capita and keep only the columns we need.
gdppc <- global_economy %>%
  mutate(GDP_per_capita = GDP / Population) %>%
  select(Year, Country, GDP, Population, GDP_per_capita)
gdppc

## # A tsibble: 15,150 x 5 [1Y]
## # Key:       Country [263]
##     Year Country             GDP Population GDP_per_capita
##    <dbl> <fct>             <dbl>      <dbl>          <dbl>
##  1  1960 Afghanistan  537777811.    8996351           59.8
##  2  1961 Afghanistan  548888896.    9166764           59.9
##  3  1962 Afghanistan  546666678.    9345868           58.5
##  4  1963 Afghanistan  751111191.    9533954           78.8
##  5  1964 Afghanistan  800000044.    9731361           82.2
##  6  1965 Afghanistan 1006666638.    9938414          101. 
##  7  1966 Afghanistan 1399999967.   10152331          138. 
##  8  1967 Afghanistan 1673333418.   10372630          161. 
##  9  1968 Afghanistan 1373333367.   10604346          130. 
## 10  1969 Afghanistan 1408888922.   10854428          130. 
## # … with 15,140 more rows

2 - Data Visualization

Visualization is an essential step in understanding the data: it helps us identify common patterns and subsequently specify an appropriate model.

gdppc %>%
  filter(Country=="Brazil") %>%
  autoplot(GDP_per_capita) +
    labs(title = "GDP per capita for Brazil", y = "$US")

3 - Define a model (specify)

There are many different time series models that can be used for forecasting.
Specifying an appropriate model for the data is essential for producing appropriate forecasts.
We can specify different models using the function model() from the fable framework, which is loaded with fpp3.
Model functions use a formula (y ~ x) interface: the response variable(s) are specified on the left of the formula, and the structure of the model is written on the right.

3 - Define and train a model

# Fit a linear trend model (TSLM) to Brazil's GDP per capita.
gdpBR <- gdppc %>%
  filter(Country == "Brazil") %>%
  model(TSLM(GDP_per_capita ~ trend()))

gdpBR

## # A mable: 1 x 2
## # Key:     Country [1]
##   Country `TSLM(GDP_per_capita ~ trend())`
##   <fct>                            <model>
## 1 Brazil                            <TSLM>

In this case, we are using TSLM(), a time series linear model (i.e., linear regression).
The response variable is GDP_per_capita, and it is modeled using a linear trend, trend().
The resulting object is a model table, or mable.
Use report() to see the estimated model.
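
As a minimal sketch of that last step, the chunk below refits the same linear trend model and prints it with report(). The object name `fit_br` and the model label `trend_model` are chosen here for illustration, and tidy() is shown as one possible way to get the coefficients as a table.

```r
library(fpp3)

# Refit the linear trend model for Brazil (same specification as above).
# `trend_model` is an arbitrary label for the model column.
fit_br <- global_economy %>%
  filter(Country == "Brazil") %>%
  mutate(GDP_per_capita = GDP / Population) %>%
  model(trend_model = TSLM(GDP_per_capita ~ trend()))

# Print the estimated coefficients and summary statistics of the fit.
report(fit_br)

# The same coefficients as a tidy table (one row per term).
coefs <- tidy(fit_br)
coefs
```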

4 - Check model performance (evaluate)

Once a model has been fitted, it is important to check how well it has performed on the data.
There are several diagnostic tools for checking model behavior, as well as accuracy measures that allow one model to be compared against another.
This is an essential step to make sure your model accounts for the behavior of the data!
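
One possible way to run such checks on the Brazil model is sketched below. The choice of augment() and accuracy() is an assumption about which tools to show first, not the only option, and `fit_br`, `aug` and `acc` are illustrative names.

```r
library(fpp3)

# Refit the linear trend model so that this chunk runs on its own.
fit_br <- global_economy %>%
  filter(Country == "Brazil") %>%
  mutate(GDP_per_capita = GDP / Population) %>%
  model(trend_model = TSLM(GDP_per_capita ~ trend()))

# Fitted values and residuals for every observation.
aug <- augment(fit_br)

# Training-set accuracy measures (RMSE, MAE, MAPE, ...).
acc <- accuracy(fit_br)
acc

# Residuals over time: a well-specified model leaves no obvious pattern.
ggplot(aug, aes(x = Year, y = .resid)) +
  geom_line() +
  labs(title = "Residuals of the linear trend model for Brazil", y = "$US")
```

If the residuals show long runs above or below zero, a straight-line trend is probably too simple for this series.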

5 - Forecasting

With an appropriate model specified, estimated and checked, it is time to produce the forecasts using the function forecast().
We have to specify the number of future observations to forecast. For example, forecasts for the next 10 observations can be generated using h = 10. Here we forecast the next 3 years with h = 3.

fcstBR <- gdpBR %>% forecast(h = 3)

5 - Forecasting

Each row corresponds to one forecast period.
The GDP_per_capita column contains the forecast distribution.
The .mean column contains the point forecast, which is the mean of the forecast distribution.
hilo(95) adds a column, `95%`, containing the 95% prediction interval.

fcstBR %>%
  hilo(95)

## # A tsibble: 3 x 6 [1Y]
## # Key:       Country, .model [1]
##   Country .model                Year   GDP_per_capita .mean           `95%`
##   <fct>   <chr>                <dbl>           <dist> <dbl>          <hilo>
## 1 Brazil  TSLM(GDP_per_capita…  2018 N(9194, 3400207) 9194. [5580, 12808]95
## 2 Brazil  TSLM(GDP_per_capita…  2019 N(9380, 3411928) 9380. [5760, 13001]95
## 3 Brazil  TSLM(GDP_per_capita…  2020 N(9567, 3424040) 9567. [5940, 13194]95
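
As a rough check of how the interval relates to the forecast distribution, the first row above can be reproduced by hand. The sketch assumes that N(9194, 3400207) denotes a normal distribution with mean 9194 and variance 3400207, and it uses the rounded values printed in the table.

```r
# Values read from the first forecast row (2018), as printed above.
mu     <- 9194      # mean of the forecast distribution (point forecast)
sigma2 <- 3400207   # assumed to be the variance of the forecast distribution

# For a normal distribution, the 95% interval is mean +/- z * sd,
# where z is the 97.5% quantile of the standard normal (about 1.96).
z     <- qnorm(0.975)
lower <- mu - z * sqrt(sigma2)
upper <- mu + z * sqrt(sigma2)

round(c(lower = lower, upper = upper))
```

The result should agree with the printed interval for 2018 up to rounding.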

5 - Forecasting

The forecasts can be plotted along with the historical data using autoplot().

fcstBR%>%
  autoplot(gdppc) +
  labs(y = "$US", title = "GDP per capita for Brazil")

Forecasting

We will see four simple forecasting methods. To illustrate them, we will work with the weekly positive rate of the influenza virus in the US.
Please load the data set US_influenza (available on the class’s website).
Let’s forecast it for the next 150 weeks!
Make sure the fpp3 and readr packages are loaded in R.

US_influenza <- read_csv("US_influenza.csv")
head(US_influenza)

## # A tibble: 6 × 5
##    year  week total_tests positive_cases  rate
##   <dbl> <dbl>       <dbl>          <dbl> <dbl>
## 1  2014     1       16955           4898 0.289
## 2  2014     2       16829           4554 0.271
## 3  2014     3       16051           4133 0.257
## 4  2014     4       15130           3461 0.229
## 5  2014     5       13460           2555 0.190
## 6  2014     6       12072           2091 0.173

First step

We have to create a variable that identifies each week of each year, and then use it as the index of the time series.

Steps:

1. Create a new variable (date) to reflect the week of the year - `yearweek()`

2. Transform the tibble into a tsibble indexing the new date variable.

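US_influenza <- US_influenza %>%
  # One date per row: weekly dates starting 2014-01-05, converted to year-week
  mutate(date = yearweek(seq(as.Date("2014/1/5"), as.Date("2019/12/31"), "week"))) %>%
  # Declare `date` as the time index
  tsibble(index = date)
head(US_influenza)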

## # A tsibble: 6 x 6 [1W]
##    year  week total_tests positive_cases  rate     date
##   <dbl> <dbl>       <dbl>          <dbl> <dbl>   <week>
## 1  2014     1       16955           4898 0.289 2014 W01
## 2  2014     2       16829           4554 0.271 2014 W02
## 3  2014     3       16051           4133 0.257 2014 W03
## 4  2014     4       15130           3461 0.229 2014 W04
## 5  2014     5       13460           2555 0.190 2014 W05
## 6  2014     6       12072           2091 0.173 2014 W06

Now you can plot the positive rate variable and observe the aspects of the data.

Plotting

ggplot(US_influenza)+
  geom_line(aes(date,rate))+
  labs(
    x = "Weeks",
    y = "Influenza Positive Rate")

`MEAN(y)`: Average method

The forecast of all future values is equal to the mean of the historical data $\{y_1,\dots,y_T\}$.
Forecasts: $\hat{y}_{T+h|T} = \bar{y} = (y_1+\dots+y_T)/T$.
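
The formula can be checked by hand on a tiny series. The numbers below are a made-up toy example (not the full influenza data), and `mean_forecast` is an illustrative name.

```r
# Hypothetical toy series and forecast horizon.
y <- c(0.29, 0.27, 0.26, 0.23, 0.19, 0.17)
h <- 3

# Average method: every future value is forecast by the historical mean.
mean_forecast <- rep(mean(y), h)
mean_forecast
```

All `h` forecasts are identical, which is why this method shows up as a flat line in the plot.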

`MEAN(y)`: Average method

US_influenza %>%
  model(MEAN(rate)) %>%
  forecast(h = 150) %>%
  autoplot(US_influenza) +
  labs(
    x = "Weeks",
    y = "Influenza Positive Rate")

`NAIVE(y)`: Naïve method

Forecasts are equal to the last observed value.
Forecasts: $\hat{y}_{T+h|T} = y_T$.
For financial series, this is a consequence of the efficient market hypothesis.
The naïve forecast is optimal when the data follow a random walk, so these are also called random walk forecasts.
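
The link to the random walk can be made explicit with a short derivation. It assumes a random walk whose errors $\varepsilon_t$ have mean zero and cannot be predicted from the past; the symbol $\varepsilon_t$ is introduced here for illustration.

```latex
\begin{aligned}
y_t &= y_{t-1} + \varepsilon_t, \qquad \mathrm{E}[\varepsilon_t \mid y_1,\dots,y_{t-1}] = 0 \\
y_{T+h} &= y_T + \sum_{i=1}^{h} \varepsilon_{T+i} \\
\mathrm{E}[y_{T+h} \mid y_1,\dots,y_T] &= y_T + \sum_{i=1}^{h} \mathrm{E}[\varepsilon_{T+i} \mid y_1,\dots,y_T] = y_T = \hat{y}_{T+h|T}
\end{aligned}
```

Under these assumptions the expected future value is the last observation for any horizon $h$, which is exactly the naïve forecast.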

`NAIVE(y)`: Naïve method

US_influenza %>%
  model(NAIVE(rate)) %>%
  forecast(h = 150) %>%
  autoplot(US_influenza)+
  labs(
    x = "Weeks",
    y = "Influenza Positive Rate")

`SNAIVE(y~lag(m))`: Seasonal naïve method

A similar method is useful for highly seasonal data.
We set each forecast equal to the last observed value from the same season (e.g., the same month of the previous year).
Forecasts: $\hat{y}_{T+h|T} = y_{T+h-m(k+1)}$, where $m$ is the seasonal period and $k$ is the integer part of $(h-1)/m$ (i.e., the number of complete years in the forecast period prior to time $T+h$).
For example, with monthly data, the forecast for all future February values is equal to the last observed February value.
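
The index arithmetic in the formula is easy to get wrong, so here is a small base R sketch of it. The monthly series is made up, and `snaive_forecast()` is a hypothetical helper written for this illustration, not a function from fable.

```r
# Hypothetical monthly toy series: two full years, seasonal period m = 12.
m <- 12
y <- c(seq(10, 120, by = 10),   # year 1: January to December
       seq(11, 121, by = 10))   # year 2: January to December

# Seasonal naive forecast: y_hat[T + h] = y[T + h - m * (k + 1)],
# where k is the integer part of (h - 1) / m.
snaive_forecast <- function(y, m, h) {
  n_obs    <- length(y)
  horizons <- seq_len(h)
  k        <- (horizons - 1) %/% m
  y[n_obs + horizons - m * (k + 1)]
}

snaive_fc <- snaive_forecast(y, m, h = 14)
snaive_fc
```

The 1st and 13th forecasts are both Januaries, so both equal the last observed January value, and the pattern of the final observed year repeats.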

`SNAIVE(y~lag(m))`: Seasonal naïve method

US_influenza %>%
  model(SNAIVE(rate)) %>%
  forecast(h = 150) %>%
  autoplot(US_influenza) +
  labs(
    x = "Weeks",
    y = "Influenza Positive Rate")

`RW(y ~ drift())`: Drift method

A variation on the naïve method is to allow the forecasts to increase or decrease over time.
The amount of change over time (called the drift) is set to be the average change seen in the historical data.
Forecasts are equal to the last value plus $h$ times the average change:

$\hat{y}_{T+h|T} = y_{T} + \frac{h}{T-1}\sum_{t=2}^T (y_t-y_{t-1}) = y_T + \frac{h}{T-1}(y_T -y_1)$

Equivalent to extrapolating a line drawn between first and last observations.
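
A quick numerical check of the two forms of the drift formula, using a made-up toy series. The names `drift_sum` and `drift_line` are illustrative.

```r
# Hypothetical toy series and forecast horizons 1, 2, 3.
y     <- c(0.29, 0.27, 0.26, 0.23, 0.19, 0.17)
n_obs <- length(y)
h     <- 1:3

# Form 1: last value plus h times the average historical change.
avg_change <- sum(diff(y)) / (n_obs - 1)
drift_sum  <- y[n_obs] + h * avg_change

# Form 2: extrapolate the line through the first and last observations.
drift_line <- y[n_obs] + h / (n_obs - 1) * (y[n_obs] - y[1])

rbind(drift_sum, drift_line)
```

The two rows agree because the sum of successive changes telescopes to $y_T - y_1$.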

Drift method

All 4 models
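
The chunk that fits all four methods is not shown in these slides. The sketch below is one plausible way to do it with a single model() call. To keep it runnable on its own, it uses a synthetic weekly series `toy` in place of US_influenza. The labels `Mean`, `Naive` and `Drift` are assumptions (only `Seasonal_naive` appears elsewhere in the slides), as is the explicit seasonal lag of 52 weeks.

```r
library(fpp3)

# Synthetic weekly series standing in for US_influenza (illustration only).
set.seed(1)
n_weeks <- 313
toy <- tsibble(
  date  = yearweek(seq(as.Date("2014-01-05"), by = "week", length.out = n_weeks)),
  rate  = 0.12 + 0.08 * sin(2 * pi * seq_len(n_weeks) / 52) +
    rnorm(n_weeks, sd = 0.01),
  index = date
)

# Fit the four simple methods at once; each becomes a column of the mable.
fit <- toy %>%
  model(
    Mean           = MEAN(rate),
    Naive          = NAIVE(rate),
    Seasonal_naive = SNAIVE(rate ~ lag(52)),  # assumed yearly seasonality
    Drift          = RW(rate ~ drift())
  )

# Forecast 150 weeks ahead with every model.
fc <- fit %>% forecast(h = 150)

# Plot point forecasts only (level = NULL hides the prediction intervals).
fc %>%
  autoplot(toy, level = NULL) +
  labs(x = "Weeks", y = "Influenza Positive Rate (synthetic)")
```

With the real data, replacing `toy` by `US_influenza` should give a forecast table `fc` with a `.model` column that can be filtered by model name.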

Forecasting (zoom)

# Keep only the seasonal naïve forecasts and plot them against the data from 2018 onwards.
fc %>%
  dplyr::filter(.model == "Seasonal_naive") %>%
  autoplot(dplyr::filter(US_influenza, year(date) >= 2018)) +
  labs(
    x = "Weeks",
    y = "Influenza Positive Rate")

Actual vs Forecasted

## Series: rate 
## Model: SNAIVE 
## 
## sigma^2: 0.0047

Forecasting - agrrrrrrhhhhhh

"He who sees the past as surprise-free is bound to have a future full of surprises." - Amos Tversky

Questions?
