Lecture 24 - Introduction to Forecasting

Penelope Pooler Eisenbies
BUA 345

2024-04-10

Housekeeping

Upcoming Dates

HW 9 is due on Monday, 4/15.
- Grace Period ends tonight (Tues. 4/16) at midnight.
HW 10 will be posted on 4/16 and is due Monday, 4/22.
Lecture 26 on Thu. 4/18 is Optional.
No Lecture on 4/23
Course Review on 4/25
NEW PACKAGE FOR FORECASTING: forecast
- If you are having trouble installing/loading any packages or components of R or RStudio, please come to office hours or make an appointment with me.

💥 Lecture 24 In-class Exercises - Q1 - Review 💥

Session ID: bua345s24

We have data for 2022 annual salaries of 75 Upstate NY residents ranging from $50K to $150K and we use that data to model how much someone spends on their first house.

Is it valid to apply that model to someone with an annual salary of $350K?

A. Yes, this is extrapolation and it is valid.

B. No, this is extrapolation and it is invalid.

C. Yes, this is interpolation and it is valid.

D. No, this is interpolation and it is invalid.

Introduction to Forecasting

Today’s Topics:

Cross-Sectional Data vs. Time Series Data
Basic Forecasting Terminology
Forecasting Trends without Seasonality in R
- Example 1 - US Population
- Example 2 - Netflix Stock Prices
HW 10 will be posted on 4/16
- The first part of HW 10 pertains to today’s lecture.

Cross-Sectional Data

Shows a Snapshot of One Time Period

Time Series Data

Shows Trend over Time

U.S. Population - Cross-Sectional Data

Population by County in 2019

U.S. Population - Time Series Data

U.S. Population 1950 - 2023

Time Series Terminology

In time series data, new observations are often correlated with prior observations

This is referred to as auto-correlation
- A variable is correlated with itself
- When data are auto-correlated, we use that information
- This process is called auto-regression
  - Using previous observations to predict into the future.
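A minimal sketch of measuring auto-correlation in base R, assuming a small made-up series (`toy` is hypothetical, not course data). It correlates each value with the value one period earlier, then asks `acf()` for the first few lags. The two numbers use slightly different formulas, so they will not match exactly.

```r
# Hypothetical series with an upward trend (not course data)
toy <- c(10, 12, 13, 15, 16, 18, 21, 22, 24, 27)
n <- length(toy)

# Correlation between each observation and the one before it (lag 1)
lag1_cor <- cor(toy[-1], toy[-n])
lag1_cor

# Sample auto-correlation function for lags 0 through 3
acf_values <- acf(toy, lag.max = 3, plot = FALSE)
acf_values$acf[, 1, 1]   # first entry is lag 0, which is always 1
```

A value of `lag1_cor` close to 1 means each observation is strongly related to the one before it, which is the information auto-regression uses.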

Introductory Time Series R Functions

R function: auto.arima function in forecast package
- ARIMA is an acronym:
  - AR: auto-regressive
  - I: integrated
  - MA: moving average
In ARIMA models, all three components are optimized to provide a reliable forecast.
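One rough way to picture "optimized": fit several candidate (p, d, q) combinations and keep the one with the lowest AIC. The sketch below does this with base R's `arima()` on simulated data, with d fixed at 1 as an assumption. It is an illustration only; `auto.arima` uses its own search procedure and chooses d separately, so this is not its actual algorithm.

```r
# Simulated series (hypothetical data, for illustration only)
set.seed(24)
y <- arima.sim(model = list(order = c(1, 1, 0), ar = 0.5), n = 119)

# Candidate values of p and q; d is fixed at 1 here (an assumption)
candidates <- expand.grid(p = 0:2, q = 0:2)
candidates$aic <- NA_real_

for (i in seq_len(nrow(candidates))) {
  fit <- tryCatch(
    arima(y, order = c(candidates$p[i], 1, candidates$q[i])),
    error = function(e) NULL        # skip any model that fails to fit
  )
  if (!is.null(fit)) candidates$aic[i] <- AIC(fit)
}

# Candidate with the lowest AIC
best <- candidates[which.min(candidates$aic), ]
best
```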

Terminology: ARIMA model components (p, d, q)

Auto-Regressive Models (AR)

Similar to a simple linear regression model or non-linear regression model
Key difference: The regressor or predictor variable (X) is the dependent variable (Y) itself at a specific LAG
Lag (p) is how many previous time periods the model looks back to estimate the next time period.
- If p = 1, the model estimates the next time period based on the most recent one.
  - Looks back one time period
- If p = 2, the model estimates the next time period based on the most recent one AND the one before it.
  - Looks back two time periods
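The "regression on lagged values" idea can be sketched with ordinary `lm()`. This assumes simulated data and hypothetical column names (`lag1`, `lag2`); ARIMA software estimates these models differently, so treat it as a picture of the concept rather than the method `auto.arima` uses.

```r
# Simulated auto-regressive series (hypothetical data)
set.seed(345)
y <- as.numeric(arima.sim(model = list(ar = c(0.5, 0.3)), n = 200))

# embed() lines up each value with its previous values:
# column 1 = y[t], column 2 = y[t-1], column 3 = y[t-2]
lagged <- embed(y, 3)
ar_data <- data.frame(y = lagged[, 1], lag1 = lagged[, 2], lag2 = lagged[, 3])

# p = 1: look back one time period
ar1_fit <- lm(y ~ lag1, data = ar_data)

# p = 2: look back two time periods
ar2_fit <- lm(y ~ lag1 + lag2, data = ar_data)

coef(ar1_fit)
coef(ar2_fit)
```

The first two rows of the series are lost because they have no two earlier values to look back to.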

Example 1: U.S. Population - 1950 to Present

Forecast Questions:
- What will the U.S. Population be in 2040?
- What ARIMA model was chosen (p,d,q)?
Model Assessment Questions:
- How valid is our model?
  - Check residual plots.
- How accurate are our estimates?
  - Examine Prediction Intervals and Prediction Bands
  - Check fit statistics

Terminology: ARIMA model components (p, d, q)

Differencing (I = Integration)

Stationarity: mean and variance of data are consistent over timespan
- needed for accurate modeling
- Can be verified by examining residuals
Differencing transforms non-stationary data to stationary
Differencing order (d) determined by model:
- if d = 1: each obs. is replaced by its difference from the previous one (removes a linear trend)
- if d = 2: the differences are differenced again (removes a quadratic trend)
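Differencing can be tried directly with base R's `diff()`. The two short vectors below are made up for illustration: one follows a straight line and one follows a curve.

```r
# Hypothetical series: one linear trend, one quadratic trend
linear_trend    <- c(2, 5, 8, 11, 14)
quadratic_trend <- c(1, 4, 9, 16, 25, 36)

# d = 1: difference from the previous observation
d1_linear <- diff(linear_trend)        # constant, so the trend is gone
d1_quad   <- diff(quadratic_trend)     # still increasing

# d = 2: difference of the differences
d2_quad <- diff(quadratic_trend, differences = 2)   # now constant

d1_linear
d1_quad
d2_quad
```

One round of differencing flattens the linear series, while the quadratic series needs two rounds before its values stop trending.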

Terminology: ARIMA model components (p, d, q)

Moving Average (MA)

Moving average (q): how many past forecast errors are incorporated into each new estimate.
Algorithm uses the errors from a specific number of lagged terms
Including past errors smooths out temporary instability in the data
- If q = 1: the model adjusts using the forecast error from the previous time period.
- If q = 2: the model adjusts using the forecast errors from the two previous time periods.

U.S. Population - Interactive Plot

U.S. Population - Modeling Time Series Data

Population Trend Forecast

Create time series using population data
- Specify freq = 1 - one observation per year
- Specify start = 1950 - first year in dataset
Model data using auto.arima function
- Specify ic = "aic" - AIC is the information criterion used to select the model.
- Specify seasonal = FALSE - no seasonal (repeating) pattern in the data.
These commands will create and save the model:

library(forecast)                                           # provides auto.arima() and forecast()
pop_ts <- ts(uspop$popM, freq=1, start=1950)                # create time series
pop_model <- auto.arima(pop_ts, ic="aic", seasonal=FALSE)   # model data using auto.arima
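To see what the frequency and start arguments do, here is a small sketch with made-up annual values (`toy_pop` is hypothetical and stands in for the population column).

```r
# Five hypothetical annual values (not the course data)
toy_pop <- c(150, 153, 156, 159, 162)

# One observation per year, first year is 1950
toy_ts <- ts(toy_pop, frequency = 1, start = 1950)

time(toy_ts)     # the year attached to each observation
start(toy_ts)    # first time point
end(toy_ts)      # last time point

# Keep only the observations from 1952 onward
recent <- window(toy_ts, start = 1952)
recent
```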

U.S. Population - Create and Plot Forecasts

Create forecasts (until 2040)

h = 17 indicates we want to forecast 17 years
Most recent year in our data is 2023
- 2040 - 2023 = 17
Forecasts become less accurate the further into the future you forecast.

pop_forecast <- forecast(pop_model, h=17) # create forecasts (until 2040)
uspop_pred_plot <- autoplot(pop_forecast) + 
  labs(y = "U.S. Population (Millions)") +
  theme_classic()
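The horizon arithmetic and the loss of accuracy over time can both be sketched in base R. This assumes a simulated 74-year series and a hand-picked ARIMA(1,1,0) fitted with `stats::arima()`, which is not the model or the function used above. `predict()` returns a standard error for each future year, and those errors grow with the horizon.

```r
# Simulated annual series covering 1950 to 2023 (hypothetical data)
set.seed(2040)
growth <- arima.sim(model = list(ar = 0.6), n = 74)
toy_ts <- ts(cumsum(as.numeric(growth)), frequency = 1, start = 1950)

# Horizon = target year minus the most recent year in the data
last_year <- end(toy_ts)[1]
h <- 2040 - last_year
h

# Hand-picked model (an assumption, chosen only for illustration)
toy_fit <- arima(toy_ts, order = c(1, 1, 0))
toy_pred <- predict(toy_fit, n.ahead = h)

# Standard errors of the forecasts, one per future year
se_by_year <- as.numeric(toy_pred$se)
round(se_by_year, 2)
```

Because prediction intervals are built from these standard errors, the shaded bands in a forecast plot widen as the forecast moves further from the data.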

Darker purple: 80% Prediction Interval Bounds
Lighter purple: 95% Prediction Interval Bounds
Plot title shows the chosen model, ARIMA(2,1,1):
- Lags (p = 2), Differencing (d = 1), Moving Average (q = 1)

U.S. Population - Forecast Plot

U.S. Population - Examine Numerical Forecasts

Point Forecast is the forecasted estimate for each future time period
Lo 80 and Hi 80 are lower and upper bounds for the 80% prediction interval
Lo 95 and Hi 95 are lower and upper bounds for the 95% prediction interval

pop_forecast          # prints out forecast values

     Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
2024       341.9140 341.7513 342.0767 341.6651 342.1628
2025       343.9590 343.4409 344.4770 343.1666 344.7513
2026       346.0985 345.1394 347.0577 344.6316 347.5654
2027       348.3135 346.8715 349.7556 346.1081 350.5189
2028       350.5901 348.6480 352.5323 347.6198 353.5605
2029       352.9173 350.4718 355.3628 349.1772 356.6574
2030       355.2861 352.3424 358.2298 350.7841 359.7881
2031       357.6891 354.2573 361.1209 352.4406 362.9375
2032       360.1202 356.2133 364.0272 354.1451 366.0953
2033       362.5746 358.2070 366.9422 355.8950 369.2542
2034       365.0480 360.2350 369.8611 357.6871 372.4090
2035       367.5372 362.2939 372.7804 359.5183 375.5560
2036       370.0393 364.3809 375.6976 361.3855 378.6930
2037       372.5520 366.4930 378.6109 363.2856 381.8183
2038       375.0734 368.6280 381.5189 365.2159 384.9309
2039       377.6021 370.7834 384.4208 367.1738 388.0304
2040       380.1367 372.9575 387.3160 369.1570 391.1165
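The 80% and 95% bounds are related. Assuming the intervals are the usual normal-based ones (point forecast plus or minus a multiple of the forecast standard error), the 95% bounds for 2024 can be rebuilt from the 80% bounds in the first row of the output. The variable names below are made up for this sketch.

```r
# Values copied from the 2024 row of the forecast output
point <- 341.9140
lo80  <- 341.7513
hi80  <- 342.0767

# An 80% interval spans qnorm(0.90) standard errors on each side
se <- (hi80 - lo80) / (2 * qnorm(0.90))

# A 95% interval spans qnorm(0.975) standard errors on each side
lo95 <- point - qnorm(0.975) * se
hi95 <- point + qnorm(0.975) * se

round(c(lo95, hi95), 3)
```

The result agrees with the Lo 95 and Hi 95 columns for 2024 to about three decimal places, which is why the 95% band is always wider than the 80% band.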

💥 Lecture 24 In-class Exercises - Q2 💥

Session ID: bua345s24

Based on the U.S. Population forecast output, we are 95% certain that the U.S. population in 2030 will be less than ______ million people?

How to input your answer:

Round to closest million (whole number)
If the answer were 123 million (e.g. 123.4233), you would enter 123.

U.S. Population - Examine Residuals and Model Fit

Top Plot: No spikes should be too large
- One obs. should be checked.
ACF: auto-correlation function.
- Ideally, most values fall within the dashed lines
Histogram: Distribution of residuals should be approx. normal
- One high outlier
Assessment: Trend is very smooth so small aberrations are exaggerated in residuals.

checkresiduals(pop_forecast) # examine residuals


    Ljung-Box test

data:  Residuals from ARIMA(2,1,1) with drift
Q* = 4.0208, df = 7, p-value = 0.7774

Model df: 3.   Total lags used: 10
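The p-value in this output can be checked by hand. A sketch, assuming the test statistic follows a chi-squared distribution with degrees of freedom equal to total lags minus model df (10 - 3 = 7, as printed above):

```r
# Values copied from the Ljung-Box output
q_star     <- 4.0208
total_lags <- 10
model_df   <- 3

# Degrees of freedom for the test
test_df <- total_lags - model_df

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(q_star, df = test_df, lower.tail = FALSE)
round(p_value, 4)
```

A large p-value like this one means there is no evidence of leftover auto-correlation in the residuals.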

U.S. Population - Examine Residuals and Model Fit

(acr <- accuracy(pop_forecast))              # examine model accuracy (fit)

                      ME     RMSE        MAE        MPE       MAPE       MASE
Training set 0.003635987 0.122532 0.08670159 0.00288164 0.03850593 0.03301367
                     ACF1
Training set 0.0004330708

Many options for comparing models
For BUA 345: We will use MAPE = Mean Absolute Percent Error
- (100 – MAPE) = Percent accuracy of model.
Despite the outlier and one large ACF value, our population model is estimated to be 99.96% accurate on the training data.
This doesn’t guarantee that forecasts will be 100% accurate, but it does improve our chances of accurate forecasting.
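A short sketch of the MAPE arithmetic. The first part uses three made-up actual and fitted values (hypothetical, chosen so the answer is easy to check); the second part applies 100 - MAPE to the value printed in the accuracy output above.

```r
# Hypothetical actual and fitted values
actual <- c(100, 200, 400)
fitted <- c( 90, 210, 400)

# Absolute percent errors are 10%, 5%, and 0%, so the mean is 5%
mape_toy <- mean(abs((actual - fitted) / actual)) * 100
mape_toy

# MAPE copied from the population model's accuracy output
mape_pop <- 0.03850593
pct_accuracy_pop <- 100 - mape_pop
round(pct_accuracy_pop, 2)
```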

Example 2: Netflix Stock Prices

Data from Yahoo Finance

Netflix Stock

Was mostly trending upward, but had a downturn and then another recent upturn.
Data shown are daily adjusted closing value
For analysis, we will use the monthly adjusted close (1st day of trading for each month)

Netflix Stock

Forecast Questions:
- What will the estimated stock price be in April of 2025?
- What ARIMA model was chosen (p,d,q)?
Model Assessment Questions:
- How valid is our model?
  - Check residual plots.
- How accurate are our estimates?
  - Examine Prediction Intervals and Prediction Bands
  - Check fit statistics

Netflix Stock - Modeling Time Series Data

Stock Trend Forecast

Create time series using Netflix Stock data
- Specify freq = 12 - 12 observations per year
- Specify start = c(2010, 1) - first obs. in dataset is January 2010
Model data using auto.arima function
- Specify ic = "aic" - AIC is the information criterion used to select the model.
- Specify seasonal = FALSE - no seasonal (repeating) pattern in the data.
This chunk will create and save the model.

nflx_ts <- ts(nflx$Adjusted, freq=12, start=c(2010,1))       # create time series
nflx_model <- auto.arima(nflx_ts, ic="aic", seasonal=FALSE)  # model data using auto.arima
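For monthly data, start takes a year and a month. A small sketch with 15 made-up monthly values (`toy_prices` is hypothetical) shows how R labels the observations:

```r
# Fifteen hypothetical monthly values (not the Netflix data)
toy_prices <- seq(10, 24, by = 1)

# Twelve observations per year, first observation is January 2010
toy_monthly <- ts(toy_prices, frequency = 12, start = c(2010, 1))

start(toy_monthly)    # year and month of the first observation
end(toy_monthly)      # year and month of the last observation
cycle(toy_monthly)    # month number (1 to 12) of each observation

# Keep only the observations from January 2011 onward
from_2011 <- window(toy_monthly, start = c(2011, 1))
from_2011
```

Fifteen months starting in January 2010 end in March 2011, so `end()` reports year 2011, month 3.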

Netflix Stock - Create and Plot Forecasts

Create forecasts (until April 2025)
- h = 12 indicates we want to forecast 12 months
- Most recent date in our data is April 1, 2024
- 12 Months until April 1, 2025
Forecasts become less accurate the further into the future you forecast.

nflx_forecast <- forecast(nflx_model, h=12) # create forecasts (until April 2025)
nflx_pred_plot <- autoplot(nflx_forecast) + labs(y = "Adjusted Closing Price") +
  theme_classic()

Darker purple: 80% Prediction Interval Bounds
Lighter purple: 95% Prediction Interval Bounds
Plot title shows the chosen model, ARIMA(2,1,2):
- Lags (p = 2), Differencing (d = 1), Moving Average (q = 2)

Netflix Stock - Forecast Plot

Netflix Stock - Examine Numerical Forecasts

Point Forecast is the forecasted estimate for each future time period
Lo 80 and Hi 80 are lower and upper bounds for the 80% prediction interval
Lo 95 and Hi 95 are lower and upper bounds for the 95% prediction interval

nflx_forecast                # prints out forecast values

         Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
May 2024       616.5980 577.8389 655.3570 557.3211 675.8749
Jun 2024       597.3982 540.9623 653.8340 511.0870 683.7094
Jul 2024       575.2199 501.5346 648.9053 462.5280 687.9119
Aug 2024       567.9570 477.2580 658.6559 429.2448 706.6691
Sep 2024       580.7199 475.2167 686.2230 419.3666 742.0731
Oct 2024       604.1698 487.2636 721.0759 425.3773 782.9623
Nov 2024       622.9879 497.5925 748.3833 431.2122 814.7636
Dec 2024       627.2257 494.8187 759.6327 424.7267 829.7247
Jan 2025       618.3484 478.9854 757.7114 405.2111 831.4857
Feb 2025       606.5939 459.5068 753.6811 381.6435 831.5443
Mar 2025       602.7019 447.2250 758.1789 364.9204 840.4835
Apr 2025       610.4861 446.7790 774.1931 360.1177 860.8544

💥 Lecture 24 In-class Exercises - Q3 💥

Session ID: bua345s24

Interpretation of Netflix Prediction Intervals

In January of 2025, the Netflix stock price is forecasted to be approximately $618. However, the 95% prediction interval indicates it may be as low as ____.

How to input your answer:

Round to closest whole dollar.
Don’t include dollar sign.

Netflix Stock - Examine Residuals and Model Fit

Top Plot: Spikes get larger over time
ACF: auto-correlation function.
- Ideally, all or most values are within the dashed lines
Histogram: Distribution of residuals should be approx. normal
Assessment: Stock prices are very volatile, so this fit is sufficient for our purposes.

checkresiduals(nflx_forecast) # examine residuals


    Ljung-Box test

data:  Residuals from ARIMA(2,1,2) with drift
Q* = 30.854, df = 20, p-value = 0.05715

Model df: 4.   Total lags used: 24
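The same kind of test can be run directly with base R's `Box.test()`. The sketch below uses random noise as a stand-in for model residuals (an assumption; the real residuals come from the fitted Netflix model) and borrows the lag and model df settings from the output above.

```r
# Hypothetical stand-in for model residuals (pure random noise)
set.seed(1)
toy_resid <- rnorm(172)

# lag = total lags used, fitdf = model df, as in the output above
lb <- Box.test(toy_resid, lag = 24, type = "Ljung-Box", fitdf = 4)

lb$parameter   # degrees of freedom: 24 - 4 = 20
lb$p.value     # compare to 0.05
```

For the Netflix model the reported p-value is 0.05715, just above 0.05, so the residuals narrowly pass this check.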

Netflix Stock - Examine Residuals and Model Fit

(acr <- accuracy(nflx_forecast))         # examine model accuracy (fit)

                     ME     RMSE      MAE       MPE     MAPE      MASE
Training set 0.02834461 29.71151 18.48178 -4.428438 12.66922 0.2157602
                    ACF1
Training set -0.02180208

Many options for comparing models
For BUA 345: We will use MAPE = Mean Absolute Percent Error
- 100 – MAPE = Percent accuracy of model.
Despite increasing volatility, our stock price model is estimated to be 87.33% accurate on the training data.
This doesn’t guarantee that forecasts will be 87% accurate, but it does improve our chances of accurate forecasting.

Key Points from Today

forecast package in R simplifies forecasting
Extrapolation OK in this case
- Report uncertainty as prediction bounds
You should know terminology and how to read and interpret output.
- You will be given data, R code, and output
- You will answer questions based on provided output.
HW 10 will cover Lectures 23-25
- HW 10 will be posted on Tue. 4/16 and due on Mon. 4/22.

To submit an Engagement Question or Comment about material from Today’s Lecture: Submit by midnight today (day of lecture). Click on Link next to the ❓ under today’s lecture.