2022-11-14

Forecasting

  • Forecasting has been part of human life since the beginning of civilization.

  • It has sometimes been considered a sign of divine inspiration and sometimes been treated as a criminal activity.

  • Some forecasts are based on tea reading, palm reading, distribution of maggots, etc.

  • Some people believe that they can forecast someone’s personality based on the day they were born (I am a Leo).

Forecasting

  • We will focus on reliable methods for producing forecasts, not on looking into a crystal ball.

-I don't practice Santeria, I ain't got no crystal ball. Well, if I had a million dollars I'd spend it all

What can we forecast?

  • daily electricity demand in 3 days

  • time of sunrise this day next year

  • Google stock price in 6 months

  • exchange rate of $US/R$ next week

  • total sales of iPhones in the US next month

Which one is the easiest to forecast?

  1. time of sunrise this day next year
  2. daily electricity demand in 3 days
  3. total sales of iPhones in the US next month
  4. exchange rate of $US/R$ next week
  5. Google stock price in 6 months

Which is easiest to forecast?

  • How can we define which variable is easy to forecast and which variable is hard to forecast?

  • The predictability of an event depends on several factors, such as:

  1. how well we understand the factors that contribute to it;
  2. how much data is available;
  3. how similar the future is to the past;
  4. whether the forecasts can affect the thing we are trying to forecast.

Example: Short-term natural gas demand

  • Forecasting of short-term natural gas demand can be quite accurate:

    • Factors: temperatures, stock levels, economic conditions.
    • Data: there is plenty of data on natural gas consumption by state, county, etc.
    • For short-term forecasting (up to a few weeks), it is safe to assume that demand behavior will be similar to what has been seen in the past.
    • For most residential users, the price of natural gas is not dependent on demand, and so the demand forecasts have little or no effect on consumer behavior.

Example - exchange rate of $US/R$ next week

  • Which of these conditions are satisfied when forecasting exchange rates?

    • There is plenty of data!!
    • There are several factors that impact exchange rate, and we do not have a full picture of them.
    • If there are well-publicized forecasts that the exchange rate will increase, then people will immediately adjust the price they are willing to pay and so the forecasts are self-fulfilling.

Example - exchange rate of $US/R$ next week

  • Forecasting whether the exchange rate will rise or fall tomorrow is about as predictable as forecasting whether a tossed coin will come down as a head or a tail (50%).

  • Hence, forecasters need to be aware of their own limitations, and not claim more than is possible.

  • Good forecasts capture the genuine patterns and relationships which exist in the historical data, but do not replicate past events that will not occur again.

Forecasting, Goals and Planning

  • Forecasting is useful to inform people/firms regarding decisions about production, transportation, planning, policies, etc.

  • Forecasting should be an integral part of the decision-making activities

  • The appropriate forecasting methods depend largely on what data are available.

  • Qualitative forecasting: used when no data are available. These methods are not guesswork (e.g., the wisdom of the crowd).

  • Quantitative forecasting can be applied when:

    1. historical data are available;
    2. some aspects of the past will continue into the future.

Random futures

  • The thing we are trying to forecast is unknown, so we can think of it as a random variable.

  • For a very short-term forecast, we usually have a good idea of what the next value will be, so the forecast is more precise.

  • However, the further ahead we forecast, the more uncertain the future value becomes.

  • For forecasting, we can imagine many possible futures, each yielding a different value.

  • Next, we will forecast US job openings in the health care and social assistance sector.

Random futures

  • A forecast is an estimate of the probabilities of possible futures.

Random futures

(Figure: 1,000 simulated futures)

Random futures

  • A point forecast estimates the middle of the range of possible values the random variable could take (its mean or median).

  • A forecast should also come with a prediction interval, a range of values the random variable could take with relatively high probability.

  • For example, a 95% prediction interval contains a range of values which should include the actual future value with probability 95%.
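
The idea of many possible futures can be sketched with a small simulation. This is only an illustration under assumed values: a random walk that starts from a hypothetical last observation of 100, with normally distributed shocks of standard deviation 2. It is not the model behind the job openings example.

```r
set.seed(42)          # reproducible illustration
n_sims <- 1000        # number of simulated futures
h <- 12               # forecast horizon (periods ahead)
last_value <- 100     # hypothetical last observed value
step_sd <- 2          # hypothetical size of the period-to-period shocks

# Each column is one simulated future path of length h (a random walk).
shocks <- matrix(rnorm(h * n_sims, mean = 0, sd = step_sd), nrow = h)
paths <- last_value + apply(shocks, 2, cumsum)

# Point forecast: the mean across simulated futures at each horizon.
point_forecast <- rowMeans(paths)

# 95% prediction interval: the 2.5% and 97.5% quantiles at each horizon.
interval_95 <- t(apply(paths, 1, quantile, probs = c(0.025, 0.975)))

summary_tbl <- data.frame(
  horizon = 1:h,
  point = point_forecast,
  lower = interval_95[, 1],
  upper = interval_95[, 2]
)
print(round(summary_tbl, 1))
```

In this sketch the interval gets wider as the horizon grows, which matches the point that uncertainty increases the further ahead we forecast.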

Random futures

(Figure: 95% prediction interval)

Some Forecasts are just BANANAS

A tidy forecasting workflow

Example

  • To illustrate the process, we will fit a linear trend model to the national GDP data stored in global_economy.

  • Install and load the fpp3 package in R.

1 - Data Preparation (tidy)

  • The first step in forecasting is to prepare data in the correct format.

  • This can involve loading the data, identifying missing values, filtering the time series, and other pre-processing tasks.

  • Each data set is unique, so this step is essential to understanding the data’s features and should always be done before models are estimated.

1 - Data Preparation (tidy)

  • A time series can be thought of as a list of numbers (the measurements) along with the times at which those numbers were recorded (the index).

  • In the tidyverse, we use a tsibble object to store a time series in R.

head(global_economy)
## # A tsibble: 6 x 9 [1Y]
## # Key:       Country [1]
##   Country     Code   Year       GDP Growth   CPI Imports Exports Population
##   <fct>       <fct> <dbl>     <dbl>  <dbl> <dbl>   <dbl>   <dbl>      <dbl>
## 1 Afghanistan AFG    1960    5.38e8     NA    NA    7.02    4.13    8996351
## 2 Afghanistan AFG    1961    5.49e8     NA    NA    8.10    4.45    9166764
## 3 Afghanistan AFG    1962    5.47e8     NA    NA    9.35    4.88    9345868
## 4 Afghanistan AFG    1963    7.51e8     NA    NA   16.9     9.17    9533954
## 5 Afghanistan AFG    1964    8.00e8     NA    NA   18.1     8.89    9731361
## 6 Afghanistan AFG    1965    1.01e9     NA    NA   21.4    11.3     9938414
  • Year - index
  • Country - key
  • GDP, Imports, Exports, Population - measurements

1 - Data Preparation (tidy)

  • We create GDP per capita so that we can compare GDP across countries.
gdppc <- global_economy %>%
  mutate(GDP_per_capita = GDP/Population) %>%
  select(Year, Country, GDP, Population, GDP_per_capita)
gdppc
## # A tsibble: 15,150 x 5 [1Y]
## # Key:       Country [263]
##     Year Country             GDP Population GDP_per_capita
##    <dbl> <fct>             <dbl>      <dbl>          <dbl>
##  1  1960 Afghanistan  537777811.    8996351           59.8
##  2  1961 Afghanistan  548888896.    9166764           59.9
##  3  1962 Afghanistan  546666678.    9345868           58.5
##  4  1963 Afghanistan  751111191.    9533954           78.8
##  5  1964 Afghanistan  800000044.    9731361           82.2
##  6  1965 Afghanistan 1006666638.    9938414          101. 
##  7  1966 Afghanistan 1399999967.   10152331          138. 
##  8  1967 Afghanistan 1673333418.   10372630          161. 
##  9  1968 Afghanistan 1373333367.   10604346          130. 
## 10  1969 Afghanistan 1408888922.   10854428          130. 
## # … with 15,140 more rows

2 - Data Visualization

  • Visualization is an essential step in understanding the data: it helps us identify common patterns and then specify an appropriate model.
gdppc %>%
  filter(Country=="Brazil") %>%
  autoplot(GDP_per_capita) +
    labs(title = "GDP per capita for Brazil", y = "$US")

3 - Define a model (specify)

  • There are many different time series models that can be used for forecasting.

  • Specifying an appropriate model for the data is essential for producing appropriate forecasts.

  • We can specify different models using the function model in the package fable.

  • This function uses a formula (y ~ x) interface. The response variable(s) are specified on the left of the formula, and the structure of the model is written on the right.

3 - Define and train a model

gdpBR <- gdppc %>%
  filter(Country == "Brazil") %>%
  model(TSLM(GDP_per_capita ~ trend()))

gdpBR
## # A mable: 1 x 2
## # Key:     Country [1]
##   Country `TSLM(GDP_per_capita ~ trend())`
##   <fct>                            <model>
## 1 Brazil                            <TSLM>
  • In this case, we are using the TSLM model (time series linear model - linear regression)

  • The response variable is GDP_per_capita and it is being modeled using trend()

  • The resulting object is a model table or a mable.

  • Use report() to see the estimated model.

4 - Check model performance (evaluate)

  • Once a model has been fitted, it is important to check how well it has performed on the data.

  • There are several diagnostic tools available to check model behavior, and also accuracy measures that allow one model to be compared against another.

  • This step is essential to make sure your model accounts for the behavior of the data!
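
As a rough sketch of what this check might look like for the Brazil model, the block below rebuilds the fit so that it runs on its own, then looks at the estimated coefficients, the training-set accuracy measures, and the residuals. The choice of a Ljung-Box test with 10 lags is an assumption made for illustration, not something prescribed in these notes.

```r
library(fpp3)

# Rebuild the linear trend model for Brazil so this block is self-contained.
gdpBR <- global_economy %>%
  filter(Country == "Brazil") %>%
  mutate(GDP_per_capita = GDP / Population) %>%
  model(TSLM(GDP_per_capita ~ trend()))

# Estimated coefficients and fit summary.
report(gdpBR)

# Training-set accuracy measures (RMSE, MAE, MAPE, ...).
accuracy(gdpBR)

# Fitted values and residuals, one row per year.
resid_tbl <- augment(gdpBR)

# Residuals over time: visible patterns suggest the model misses something.
resid_tbl %>% autoplot(.innov)

# Ljung-Box test for remaining autocorrelation (lag = 10 is an assumed choice).
resid_tbl %>% features(.innov, ljung_box, lag = 10)
```

If the residuals still show a clear pattern or strong autocorrelation, the linear trend is probably leaving information in the data unused.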

5 - Forecasting

  • With an appropriate model specified, estimated and checked, it is time to produce the forecasts using the function forecast().

  • We have to specify the number of future observations to forecast. For example, forecasts for the next 10 observations can be generated using h = 10; here we use h = 3.

fcstBR<-gdpBR %>% forecast(h = 3)

5 - Forecasting

  • Each row corresponds to one forecast period.

  • The GDP_per_capita column contains the forecast distribution

  • The .mean column contains the point forecast. The point forecast is the mean of the forecast distribution.

  • hilo(95) adds a column with the 95% prediction interval.

fcstBR%>%
  hilo(95)
## # A tsibble: 3 x 6 [1Y]
## # Key:       Country, .model [1]
##   Country .model                Year   GDP_per_capita .mean           `95%`
##   <fct>   <chr>                <dbl>           <dist> <dbl>          <hilo>
## 1 Brazil  TSLM(GDP_per_capita…  2018 N(9194, 3400207) 9194. [5580, 12808]95
## 2 Brazil  TSLM(GDP_per_capita…  2019 N(9380, 3411928) 9380. [5760, 13001]95
## 3 Brazil  TSLM(GDP_per_capita…  2020 N(9567, 3424040) 9567. [5940, 13194]95
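
To see where the interval comes from, the 2018 row can be reproduced by hand. This sketch assumes the printed distribution N(9194, 3400207) is a normal distribution given by its mean and variance, and it uses the rounded values from the output, so the result only matches to the nearest unit.

```r
mu <- 9194          # point forecast for 2018 (rounded, from the output above)
sigma2 <- 3400207   # forecast variance for 2018 (from the output above)

# For a normal distribution, a 95% interval is mean +/- z * standard deviation.
z <- qnorm(0.975)
lower <- mu - z * sqrt(sigma2)
upper <- mu + z * sqrt(sigma2)

round(c(lower = lower, upper = upper))
# lower upper
#  5580 12808
```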

5 - Forecasting

  • The forecasts can be plotted along with the historical data using autoplot()
fcstBR%>%
  autoplot(gdppc) +
  labs(y = "$US", title = "GDP per capita for Brazil")

Forecasting

  • We will see four simple forecasting methods. To illustrate them, we will work with the weekly positive rate of the influenza virus in the US.

  • Please load the data set US_influenza (available on the class’s website).

  • Let’s forecast it for the next 150 weeks!

  • Make sure fpp3 and readr packages are loaded in R

US_influenza <- read_csv("US_influenza.csv")
head(US_influenza)
## # A tibble: 6 × 5
##    year  week total_tests positive_cases  rate
##   <dbl> <dbl>       <dbl>          <dbl> <dbl>
## 1  2014     1       16955           4898 0.289
## 2  2014     2       16829           4554 0.271
## 3  2014     3       16051           4133 0.257
## 4  2014     4       15130           3461 0.229
## 5  2014     5       13460           2555 0.190
## 6  2014     6       12072           2091 0.173

First step

  • We have to create a variable that represents the weeks of every year, and then index it!

  • Steps:

    1. Create a new variable (date) to reflect the week of the year - `yearweek()`
    
    2. Transform the tibble into a tsibble indexing the new date variable.
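# Add a weekly index and convert the tibble to a tsibble.
# The date sequence must have exactly one entry per row of the data.
US_influenza <- US_influenza %>%
  mutate(date = yearweek(seq(as.Date("2014/1/5"), as.Date("2019/12/31"), "week"))) %>%
  as_tsibble(index = date)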
head(US_influenza)
## # A tsibble: 6 x 6 [1W]
##    year  week total_tests positive_cases  rate     date
##   <dbl> <dbl>       <dbl>          <dbl> <dbl>   <week>
## 1  2014     1       16955           4898 0.289 2014 W01
## 2  2014     2       16829           4554 0.271 2014 W02
## 3  2014     3       16051           4133 0.257 2014 W03
## 4  2014     4       15130           3461 0.229 2014 W04
## 5  2014     5       13460           2555 0.190 2014 W05
## 6  2014     6       12072           2091 0.173 2014 W06
  • Now you can plot the positive rate variable and observe the aspects of the data.

Plotting

ggplot(US_influenza)+
  geom_line(aes(date,rate))+
  labs(
    x = "Weeks",
    y = "Influenza Positive Rate")

MEAN(y): Average method

  • The forecast of all future values is equal to the mean of the historical data \(\{y_1,\dots,y_T\}\).
  • Forecasts: \(\hat{y}_{T+h|T} = \bar{y} = (y_1+\dots+y_T)/T\)
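
A minimal base R sketch of the formula, using only the six rates printed by head(US_influenza) as a toy series (MEAN() would use the whole series):

```r
# Toy series: the first six weekly positive rates shown earlier.
y <- c(0.289, 0.271, 0.257, 0.229, 0.190, 0.173)
h <- 4   # number of periods to forecast

# Every future value gets the same forecast: the historical mean.
mean_forecast <- rep(mean(y), h)
mean_forecast
```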

MEAN(y): Average method

US_influenza %>%
  model(MEAN(rate)) %>%
  forecast(h = 150) %>%
  autoplot(US_influenza) +
  labs(
    x = "Weeks",
    y = "Influenza Positive Rate")

NAIVE(y): Naïve method

  • Forecasts are equal to the last observed value.

  • Forecasts: \(\hat{y}_{T+h|T} = y_T\).

  • A consequence of the efficient market hypothesis.

  • The naïve forecast is optimal when the data follow a random walk, so these are also called random walk forecasts.
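
The random walk claim can be checked with a small simulation. This is a sketch under assumed settings (2,000 simulated random walks of 50 observations each, standard normal steps), comparing one-step-ahead errors of the naïve forecast and the mean forecast.

```r
set.seed(123)     # reproducible illustration
n_series <- 2000  # number of simulated random walks (assumed)
T_obs <- 50       # observations used to build each forecast (assumed)

errors <- replicate(n_series, {
  y <- cumsum(rnorm(T_obs + 1))   # random walk with T_obs + 1 points
  train <- y[1:T_obs]             # what the forecaster sees
  actual <- y[T_obs + 1]          # the value to be forecast
  c(naive = actual - train[T_obs],   # naive: last observed value
    mean = actual - mean(train))     # average method: historical mean
})

# Mean squared one-step-ahead error of each method.
mse <- rowMeans(errors^2)
print(round(mse, 2))
```

On random walk data the naïve method should show a much smaller mean squared error than the average method.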

NAIVE(y): Naïve method

US_influenza %>%
  model(NAIVE(rate)) %>%
  forecast(h = 150) %>%
  autoplot(US_influenza)+
  labs(
    x = "Weeks",
    y = "Influenza Positive Rate")

SNAIVE(y~lag(m)): Seasonal naïve method

  • A similar method is useful for highly seasonal data.
  • We set each forecast to be equal to the last observed value from the same season (e.g., the same month of the previous year).
  • Forecasts: \(\hat{y}_{T+h|T} = y_{T+h-m(k+1)}\), where \(m\) is the seasonal period and \(k\) is the integer part of \((h-1)/m\) (i.e., the number of complete years in the forecast period prior to time \(T+h\)).
  • For example, with monthly data, the forecast for all future February values is equal to the last observed February value.
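
The index arithmetic in the formula can be sketched in base R. The function name snaive_forecast and the quarterly toy series are made up for this illustration; in practice SNAIVE() does this for you.

```r
# Seasonal naive forecasts for horizons 1..h, following the formula above.
# snaive_forecast is a hypothetical helper written for this example.
snaive_forecast <- function(y, m, h) {
  T_obs <- length(y)
  stopifnot(T_obs >= m)          # need at least one full season of data
  steps <- seq_len(h)            # horizons 1, 2, ..., h
  k <- (steps - 1) %/% m         # complete seasonal cycles before T + h
  y[T_obs + steps - m * (k + 1)]
}

# Two "years" of quarterly toy data (m = 4).
y <- c(10, 20, 30, 40, 11, 21, 31, 41)
snaive_forecast(y, m = 4, h = 6)
# 11 21 31 41 11 21
```

The forecasts repeat the last observed year: quarter 1 always gets 11, quarter 2 always gets 21, and so on.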

Seasonal naïve method

US_influenza %>%
  model(SNAIVE(rate)) %>%
  forecast(h = 150) %>%
  autoplot(US_influenza) +
  labs(
    x = "Weeks",
    y = "Influenza Positive Rate")

RW(y ~ drift()): Drift method

  • A variation on the naïve method is to allow the forecasts to increase or decrease over time.

  • The amount of change over time (called the drift) is set to be the average change seen in the historical data.

  • Forecasts equal to last value plus average change.

\(\hat{y}_{T+h|T} = y_{T} + \frac{h}{T-1}\sum_{t=2}^T (y_t-y_{t-1}) = y_T + \frac{h}{T-1}(y_T -y_1)\)

  • Equivalent to extrapolating a line drawn between first and last observations.
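
A short base R check of the two equivalent forms of the drift forecast. The function name drift_forecast and the toy series are assumptions made for this illustration.

```r
# Drift forecasts for horizons 1..h: last value plus h times the average change.
# drift_forecast is a hypothetical helper written for this example.
drift_forecast <- function(y, h) {
  T_obs <- length(y)
  avg_change <- mean(diff(y))    # average of y_t - y_{t-1}
  y[T_obs] + seq_len(h) * avg_change
}

y <- c(3, 5, 4, 8, 11)           # toy series
fc_drift <- drift_forecast(y, h = 3)
fc_drift
# 13 15 17

# Same result from the line through the first and last observations.
T_obs <- length(y)
fc_line <- y[T_obs] + (1:3) * (y[T_obs] - y[1]) / (T_obs - 1)
all.equal(fc_drift, fc_line)
```

The average change is 2 whether it is computed from the four individual changes or from (11 - 3) / 4, so only the first and last observations matter.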

Drift method

All 4 models
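
One way the four methods might be fitted together is sketched below. To keep the block runnable on its own, it uses a synthetic weekly series as a stand-in for US_influenza (the data are made up, not the class data). The model names Mean, Naive and Drift are assumptions; Seasonal_naive is the name used in the zoom code.

```r
library(fpp3)

# Synthetic stand-in for the weekly positive rate (NOT the class data):
# a yearly cycle plus noise, clipped to stay between 0 and 1.
set.seed(1)
n <- 312
weeks <- yearweek(seq(as.Date("2014-01-05"), by = "week", length.out = n))
toy <- tsibble(
  date = weeks,
  rate = pmin(pmax(0.12 + 0.10 * cos(2 * pi * seq_len(n) / 52) +
                     rnorm(n, sd = 0.01), 0), 1),
  index = date
)

# Fit the four simple methods in a single mable, one column per model.
fit <- toy %>%
  model(
    Mean = MEAN(rate),
    Naive = NAIVE(rate),
    Seasonal_naive = SNAIVE(rate),
    Drift = RW(rate ~ drift())
  )

# Forecast 150 weeks ahead with every model.
fc <- fit %>% forecast(h = 150)

# Plot point forecasts only (level = NULL hides the prediction intervals).
fc %>% autoplot(toy, level = NULL)
```

With the real data, replacing toy by the US_influenza tsibble should give an fc object with the same structure.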

Forecasting (zoom)

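# fc is assumed to hold forecasts from all four models,
# with the seasonal naïve model named "Seasonal_naive".
fc %>%
  dplyr::filter(.model == "Seasonal_naive") %>%
  autoplot(dplyr::filter(US_influenza, year(date) >= 2018)) +
  labs(
    x = "Weeks",
    y = "Influenza Positive Rate")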

Actual vs Forecasted

## Series: rate 
## Model: SNAIVE 
## 
## sigma^2: 0.0047

Forecasting - agrrrrrrhhhhhh

  "He who sees the past as surprise-free is bound to have a future full of surprises." - Amos Tversky

Questions ?