Housekeeping

Upcoming Dates

  • HW 9 is due on Wednesday, 4/16.

    • Grace Period ends Thursday (4/17) at midnight.
  • HW 10 is now posted and is due on Monday 4/28.

  • Additional Practice Questions will be posted next week.

  • Course Review on 4/24.

Today’s plan 📋

Introduction to Forecasting

  • Cross-Sectional Data vs. Time Series Data

  • Basic Forecasting Terminology

  • Forecasting Trends without Seasonality in R

    • Example 1 - US Population

    • Example 2 - Netflix Stock Prices

  • NEW PACKAGE FOR FORECASTING: forecast

  • HW 10 is now posted and is due on 4/28

    • Part of HW 10 pertains to today’s lecture.

    • Demo videos for HW 10 will be posted this weekend.

In-class Polling (Session ID: bua345s25)

💥 Lecture 25 In-class Exercises - Q1 💥

Review Question from Linear Regression Modeling:

We have data for 2024 annual salaries of 75 Upstate NY residents ranging from $50K to $150K and we use that data to model how much someone spends on their first house.

Is it valid to apply that model to someone with an annual salary of $350K?


A. Yes, this is extrapolation and it is valid.

B. No, this is extrapolation and it is invalid.

C. Yes, this is interpolation and it is valid.

D. No, this is interpolation and it is invalid.

Cross-Sectional Data

Shows a Snapshot of One Time Period

Time Series Data

Shows Trend over Time

U.S. Population - Cross-Sectional Data

Population by County in 2019

U.S. Population - Time Series Data

U.S. Population 1950 - 2024

Time Series Terminology

In time series data, new observations are often correlated with prior observations

  • This is referred to as auto-correlation

    • A variable is correlated with itself

    • When data are auto-correlated, we use that information

    • This process is called auto-regression

      • Using previous observations to predict into the future.
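
As a minimal sketch of what auto-correlation looks like in practice, the base R code below builds a hypothetical trending series (the values, seed, and variable names are illustrative assumptions, not course data) and measures how strongly each value is correlated with the one before it.

```r
# Hypothetical yearly series with a steady upward trend plus noise
set.seed(345)
y <- ts(100 + 2 * (1:50) + rnorm(50), start = 1950, frequency = 1)

# Lag-1 auto-correlation by hand: correlate each value with the previous one
n <- length(y)
lag1_cor <- cor(y[-1], y[-n])

# acf() reports the auto-correlation at every lag (lag 0 is always 1)
y_acf <- acf(y, lag.max = 5, plot = FALSE)

print(round(lag1_cor, 3))       # close to 1 for a strongly trending series
print(round(y_acf$acf[2], 3))   # lag-1 value as computed by acf()
```

The two numbers differ slightly because acf() uses the overall mean and divides by the full series length, but both show that neighboring observations move together.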

Introductory Time Series R Functions

  • R function: auto.arima function in forecast package

    • ARIMA is an acronym:

      • AR: auto-regressive

      • I: integrated

      • MA: moving average

  • In ARIMA models, all three components are optimized to provide a reliable forecast.

Terminology: ARIMA model components (p, d, q)

Auto-Regressive Models (AR)

  • Similar to a simple linear regression model or non-linear regression model

  • Key difference: the regressor or predictor variable (X) is the dependent variable (Y) itself at a specific LAG

  • Lag (p) is how many previous time periods the model looks back to estimate the next time period.

    • If p = 1, the model estimates the next time period based on the most recent one.

      • Looks back one time period
    • If p = 2, the model estimates the next time period based on the two most recent ones.

      • Looks back two time periods
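
The idea that an auto-regressive model is a regression of Y on its own lagged values can be sketched in base R. This is an illustration under assumptions: the series is simulated, and the coefficients 0.6 and 0.25 and all variable names are hypothetical.

```r
set.seed(345)
# Simulate a hypothetical series where each value depends on the previous two
y <- as.numeric(arima.sim(model = list(ar = c(0.6, 0.25)), n = 300))

# embed() lines up each value with its two predecessors:
# column 1 = y[t], column 2 = y[t-1], column 3 = y[t-2]
lagged <- embed(y, 3)
ar_data <- data.frame(y_t = lagged[, 1],
                      y_lag1 = lagged[, 2],
                      y_lag2 = lagged[, 3])

# p = 2: regress the current value on the two most recent values
ar2_fit <- lm(y_t ~ y_lag1 + y_lag2, data = ar_data)
print(round(coef(ar2_fit), 2))
```

With p = 1 the formula would keep only y_lag1. The two rows lost at the start of the series are the price of looking back two periods.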

Terminology: ARIMA model components (p, d, q)

Differencing (I = Integration)

  • Stationarity: mean and variance of the data are constant over the timespan

    • needed for accurate modeling

    • Can be verified by examining residuals

  • Differencing transforms non-stationary data to stationary

  • Differencing order (d) determined by model:

    • if d = 1: each obs. is replaced by its difference from the previous one (removes a linear trend)

    • if d = 2: those differences are differenced again (removes a quadratic trend)
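
A small worked example of differencing, using base R's diff() on two made-up series (the numbers and variable names are illustrative only). One difference flattens a straight-line trend; a curved (quadratic) trend needs two.

```r
time_idx <- 1:10
linear <- 5 + 3 * time_idx       # straight-line trend
quadratic <- 2 + time_idx^2      # curved (quadratic) trend

d1_linear <- diff(linear, differences = 1)        # d = 1 on the linear series
d1_quadratic <- diff(quadratic, differences = 1)  # d = 1 on the quadratic series
d2_quadratic <- diff(quadratic, differences = 2)  # d = 2 on the quadratic series

print(d1_linear)     # 3 3 3 3 3 3 3 3 3   (constant: trend removed)
print(d1_quadratic)  # 3 5 7 9 11 13 15 17 19   (still trending)
print(d2_quadratic)  # 2 2 2 2 2 2 2 2   (constant: trend removed)
```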

Terminology: ARIMA model components (p, d, q)

Moving Average (MA)

  • Moving average order (q): how many lagged terms are incorporated into each average.

  • The algorithm calculates a weighted average over a specific number of lagged terms.

    • In an ARIMA model, these lagged terms are the forecast errors from previous time periods.

  • Moving averages smooth out temporary instability in the data.

    • If q = 1: the moving average combines the current term with the one from the previous time period.

    • If q = 2: the moving average combines the current term with the ones from the two previous time periods.
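
A minimal sketch of q = 1, assuming the standard ARIMA definition in which the MA part is a weighted combination of the most recent forecast errors. The series is simulated with base R, and the weight 0.7, the sample size, and the variable names are hypothetical choices for illustration.

```r
set.seed(345)
# Hypothetical MA(1) series: each value = current shock + 0.7 * previous shock
y <- arima.sim(model = list(ma = 0.7), n = 500)

# Fit p = 0, d = 0, q = 1 with base R's arima()
ma1_fit <- arima(y, order = c(0, 0, 1))

# The estimated ma1 coefficient should land near the 0.7 used to simulate
print(round(coef(ma1_fit), 2))
```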

Example 1: U.S. Population - 1950 to Present

  • Forecast Questions:

    • What will the U.S. Population be in 2040?

    • What ARIMA model was chosen (p,d,q)?

  • Model Assessment Questions:

    • How valid is our model?

      • Check residual plots.
    • How accurate are our estimates?

      • Examine Prediction Intervals and Prediction Bands

      • Check fit statistics

U.S. Population - Interactive Plot

U.S. Population - Modeling Time Series Data

Population Trend Forecast

  • Create time series using population data

    • Specify freq = 1 - one observation per year

    • Specify start = 1950 - first year in dataset

  • Model data using auto.arima function

    • Specify ic = "aic" - AIC (Akaike Information Criterion) is the information criterion used to select the model.

    • Specify seasonal = FALSE (or F) - no seasonal (repeating) pattern in the data.

  • These commands will create and save the model:

library(forecast)                                           # provides auto.arima() and forecast()
pop_ts <- ts(uspop$popM, freq=1, start=1950)                # create time series
pop_model <- auto.arima(pop_ts, ic="aic", seasonal=FALSE)   # model data using auto.arima
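
To make the role of ic = "aic" concrete, here is a simplified sketch of the idea behind the model search: fit several candidate (p, d, q) orders and keep the one with the lowest AIC. It uses base R's arima() on a simulated series, so the data, the candidate list, and the variable names are assumptions. The search in auto.arima is more elaborate (for example, it also chooses d), so this is not a reimplementation of it.

```r
set.seed(345)
# Hypothetical yearly series (a random walk) starting in 1950
y <- ts(150 + cumsum(rnorm(75)), start = 1950, frequency = 1)

# Candidate (p, d, q) orders, all with d = 1 so their AIC values are comparable
candidates <- list(c(0, 1, 0), c(1, 1, 0), c(0, 1, 1), c(1, 1, 1), c(2, 1, 1))

# Fit each candidate and record its AIC (NA if a fit fails)
aic_values <- sapply(candidates, function(ord) {
  fit <- tryCatch(arima(y, order = ord, method = "ML"), error = function(e) NULL)
  if (is.null(fit)) NA_real_ else AIC(fit)
})
names(aic_values) <- sapply(candidates, paste, collapse = ",")

print(round(aic_values, 2))
best_order <- names(which.min(aic_values))  # lowest AIC wins
print(best_order)
```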

U.S. Population - Create and Plot Forecasts

Create forecasts (until 2040)

  • h = 16 indicates we want to forecast 16 years

  • Most recent year in our data is 2024

    • 2040 - 2024 = 16
  • Forecasts become less precise the further into the future you forecast (the prediction intervals widen).

pop_forecast <- forecast(pop_model, h=16) # create forecasts (until 2040)
uspop_pred_plot <- autoplot(pop_forecast) + 
  labs(y = "U.S. Population (Millions)") +
  theme_classic()
  • Darker purple: 80% Prediction Interval Bounds
  • Lighter purple: 95% Prediction Interval Bounds
  • Plot shows:
    • Lags (p = 2), Differencing (d = 1), Moving Average (q = 1)
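
The value of h can also be computed from the time series itself rather than by hand. The sketch below uses placeholder values in place of the real population data (75 yearly observations is an assumption based on the 1950 to 2024 range); only the dates matter here.

```r
# Placeholder values standing in for yearly observations from 1950 to 2024
pop_placeholder <- ts(seq(150, 340, length.out = 75), start = 1950, frequency = 1)

last_year <- end(pop_placeholder)[1]   # most recent year in the series
target_year <- 2040
h <- target_year - last_year           # number of yearly steps to forecast

print(last_year)  # 2024
print(h)          # 16
```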

U.S. Population - Forecast Plot

U.S. Population - Examine Numerical Forecasts

  • Point Forecast is the forecasted estimate for each future time period
  • Lo 80 and Hi 80 are lower and upper bounds for the 80% prediction interval
  • Lo 95 and Hi 95 are lower and upper bounds for the 95% prediction interval
     Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
2025       343.6569 343.4946 343.8193 343.4086 343.9052
2026       345.5929 345.0776 346.1082 344.8048 346.3809
2027       347.6223 346.6686 348.5760 346.1638 349.0808
2028       349.7332 348.2972 351.1691 347.5371 351.9293
2029       351.9131 349.9744 353.8517 348.9481 354.8780
2030       354.1509 351.7028 356.5991 350.4068 357.8950
2031       356.4375 353.4816 359.3933 351.9169 360.9581
2032       358.7648 355.3083 362.2214 353.4785 364.0512
2033       361.1264 357.1795 365.0734 355.0901 367.1628
2034       363.5167 359.0917 367.9418 356.7492 370.2843
2035       365.9311 361.0414 370.8209 358.4529 373.4094
2036       368.3657 363.0251 373.7064 360.1980 376.5335
2037       370.8173 365.0398 376.5947 361.9814 379.6531
2038       373.2830 367.0826 379.4835 363.8003 382.7658
2039       375.7607 369.1507 382.3707 365.6516 385.8698
2040       378.2483 371.2419 385.2548 367.5328 388.9639
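
A quick arithmetic check on the table above shows how the prediction intervals widen with the forecast horizon. The numbers are copied from the output; the variable names are illustrative. The last step assumes the intervals are built from a normal distribution, which the ratio of the 95% and 80% half-widths is consistent with.

```r
# 95% bounds copied from the forecast table for three selected years
lo95 <- c("2025" = 343.4086, "2035" = 358.4529, "2040" = 367.5328)
hi95 <- c("2025" = 343.9052, "2035" = 373.4094, "2040" = 388.9639)

width95 <- hi95 - lo95
print(round(width95, 2))   # 0.50, 14.96, 21.43 (millions)

# For 2040: compare the 95% and 80% half-widths around the point forecast
half95 <- 388.9639 - 378.2483
half80 <- 385.2548 - 378.2483
ratio <- half95 / half80
print(round(ratio, 3))
print(round(qnorm(0.975) / qnorm(0.90), 3))  # ratio implied by normal intervals
```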

💥 Lecture 25 In-class Exercises - Q2 💥

Session ID: bua345s25


Based on the U.S. Population forecast output, we are 95% certain that the U.S. population in 2030 will be less than ______ million people?

How to input your answer:

  • Round to closest million (whole number)

  • If the answer were 123 million (e.g. 123.4233), you would enter 123.

U.S. Population - Examine Residuals and Model Fit

  • Top Plot: No spikes should be too large

    • One obs. should be checked.
  • ACF: auto-correlation function.

    • Ideally, most values fall within the dashed lines
  • Histogram: Distribution of residuals should be approx. normal

    • One large low outlier
  • Assessment: Trend is very smooth so small aberrations are exaggerated in residuals.


    Ljung-Box test

data:  Residuals from ARIMA(2,1,1) with drift
Q* = 3.2015, df = 7, p-value = 0.8658

Model df: 3.   Total lags used: 10
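
The p-value in this output can be reproduced from the other printed numbers. The sketch below assumes the usual form of the Ljung-Box test, where Q* is compared to a chi-square distribution with df = total lags minus model df. A large p-value (above 0.05) means there is no evidence of leftover auto-correlation in the residuals.

```r
q_star <- 3.2015      # test statistic from the output
total_lags <- 10      # "Total lags used"
model_df <- 3         # "Model df"

test_df <- total_lags - model_df   # 7, matching df in the output
p_value <- pchisq(q_star, df = test_df, lower.tail = FALSE)

print(test_df)
print(round(p_value, 4))   # compare with the p-value reported above
```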

U.S. Population - Examine Residuals and Model Fit

                      ME      RMSE        MAE         MPE       MAPE       MASE
Training set 0.003495822 0.1223609 0.08692589 0.003017647 0.03838409 0.03323733
                    ACF1
Training set 0.007946428
  • Many options for comparing models

  • For BUA 345: We will use MAPE = Mean Absolute Percent Error

    • (100 – MAPE) = Percent accuracy of model.
  • Despite outlier and one relatively large ACF value, our population model is estimated to be 99.96% accurate.

  • This doesn’t guarantee that forecasts will be 99.96% accurate, but it does improve our chances of accurate forecasting.
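
For reference, a minimal sketch of how MAPE is computed, using a tiny made-up set of actual and fitted values (the numbers and the helper name mape are illustrative). The last lines apply the 100 – MAPE rule to the value reported in the output above.

```r
# Mean Absolute Percent Error: average of |error| / |actual|, as a percent
mape <- function(actual, fitted) {
  mean(abs((actual - fitted) / actual)) * 100
}

actual <- c(200, 210, 220, 230)   # made-up observed values
fitted <- c(198, 212, 221, 227)   # made-up model estimates
toy_mape <- mape(actual, fitted)
print(round(toy_mape, 2))         # 0.93

# Percent accuracy for the population model, using the reported MAPE
reported_mape <- 0.03838409
print(round(100 - reported_mape, 2))  # 99.96
```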

Example 2: Netflix Stock Prices

Data from Yahoo Finance

Netflix Stock

  • Was mostly trending upward, but had a downturn and then another recent upturn.
  • Data shown are daily adjusted closing value
  • For analysis, we will use the monthly adjusted close (1st day of trading for each month)

Netflix Stock

  • Forecast Questions:

    • What will the estimated stock price be in April of 2026?

    • What ARIMA model was chosen (p,d,q)?

  • Model Assessment Questions:

    • How valid is our model?

      • Check residual plots.
    • How accurate are our estimates?

      • Examine Prediction Intervals and Prediction Bands

      • Check fit statistics

Netflix Stock - Modeling Time Series Data

Stock Trend Forecast

  • Create time series using Netflix Stock data

    • Specify freq = 12 - 12 observations per year

    • Specify start = c(2010, 1) - first obs. in dataset is January 2010

  • Model data using auto.arima function

    • Specify ic = "aic" - AIC (Akaike Information Criterion) is the information criterion used to select the model.

    • Specify seasonal = FALSE (or F) - no seasonal (repeating) pattern in the data.

  • This code will create and save the model:

nflx_ts <- ts(nflx$Adjusted, freq=12, start=c(2010,1))       # create time series
nflx_model <- auto.arima(nflx_ts, ic="aic", seasonal=FALSE)  # model data using auto.arima
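
A short sketch of how freq = 12 and start = c(2010, 1) label monthly data. The values are placeholders, and the count of 184 months assumes one observation for every month from January 2010 through April 2025 with no gaps.

```r
# Placeholder values: one per month from January 2010 through April 2025
n_months <- 15 * 12 + 4   # 15 full years (2010 to 2024) plus Jan to Apr 2025
monthly_placeholder <- ts(seq_len(n_months), frequency = 12, start = c(2010, 1))

print(start(monthly_placeholder))  # 2010 1
print(end(monthly_placeholder))    # 2025 4

# window() pulls out a date range using (year, month) pairs
print(window(monthly_placeholder, start = c(2025, 1)))
```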

Netflix Stock - Create and Plot Forecasts

  • Create forecasts (until April 2026)

    • h = 12 indicates we want to forecast 12 months

    • Most recent date in forecast data is April 1, 2025

    • 12 Months until April 1, 2026

  • Forecasts become less precise the further into the future you forecast (the prediction intervals widen).

nflx_forecast <- forecast(nflx_model, h=12) # create forecasts (until April 2026)
nflx_pred_plot <- autoplot(nflx_forecast) + labs(y = "Adjusted Closing Price") +
  theme_classic()
  • Darker purple: 80% Prediction Interval Bounds

  • Lighter purple: 95% Prediction Interval Bounds

  • Plot shows:

    • Lags (p = 0), Differencing (d = 1), Moving Average (q = 3)

Netflix Stock - Forecast Plot

Netflix Stock - Examine Numerical Forecasts

  • Point Forecast is the forecasted estimate for each future time period
  • Lo 80 and Hi 80 are lower and upper bounds for the 80% prediction interval
  • Lo 95 and Hi 95 are lower and upper bounds for the 95% prediction interval
         Point Forecast    Lo 80     Hi 80    Lo 95     Hi 95
May 2025       931.5097 887.9080  975.1115 864.8266  998.1929
Jun 2025       917.5952 854.4637  980.7266 821.0439 1014.1464
Jul 2025       909.0272 825.7946  992.2599 781.7339 1036.3206
Aug 2025       909.0272 805.4485 1012.6060 750.6172 1067.4372
Sep 2025       909.0272 788.4891 1029.5653 724.6801 1093.3744
Oct 2025       909.0272 773.6377 1044.4167 701.9668 1116.0876
Nov 2025       909.0272 760.2616 1057.7928 681.5099 1136.5446
Dec 2025       909.0272 747.9928 1070.0616 662.7463 1155.3081
Jan 2026       909.0272 736.5947 1081.4597 645.3145 1172.7400
Feb 2026       909.0272 725.9047 1092.1497 628.9655 1189.0889
Mar 2026       909.0272 715.8053 1102.2492 613.5197 1204.5347
Apr 2026       909.0272 706.2081 1111.8464 598.8421 1219.2124
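
The point forecasts in this table stop changing after July 2025. That is expected for a model with p = 0, d = 1, q = 3 and no drift: only the three most recent forecast errors carry information forward, so from the third step onward the point forecast stays flat while the intervals keep widening. The sketch below illustrates this with base R on a simulated series (the MA weights, seed, and names are hypothetical).

```r
set.seed(345)
# Hypothetical series generated from an ARIMA(0,1,3) process
y <- arima.sim(model = list(order = c(0, 1, 3), ma = c(0.4, 0.3, 0.2)), n = 200)

# Fit the same order with base R's arima() and forecast 6 steps ahead
fit <- arima(y, order = c(0, 1, 3))
fc <- predict(fit, n.ahead = 6)
point_fc <- as.numeric(fc$pred)
fc_se <- as.numeric(fc$se)

print(round(point_fc, 4))                      # steps 3 through 6 are identical
print(max(abs(diff(point_fc[3:6]))) < 1e-6)    # TRUE: forecasts are flat
print(all(diff(fc_se) > 0))                    # TRUE: uncertainty keeps growing
```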

💥 Lecture 25 In-class Exercises - Q3 💥

Session ID: bua345s25

Interpretation of Netflix Prediction Intervals


In January of 2026, the Netflix stock price is forecasted to be approximately $909. However, the 95% prediction interval indicates it may be as low as ____.

How to input your answer:

  • Round to closest whole dollar.

  • Don’t include dollar sign.

Netflix Stock - Examine Residuals and Model Fit

  • Top Plot: Spikes get larger over time

  • ACF: auto-correlation function.

    • Ideally, all or most values are within the dashed lines
  • Histogram: Distribution of residuals should be approx. normal

  • Assessment: Stock prices are very volatile, so some growth in the residual spikes is expected; the model is sufficient for our purposes.


    Ljung-Box test

data:  Residuals from ARIMA(0,1,3)
Q* = 27.197, df = 21, p-value = 0.1644

Model df: 3.   Total lags used: 24

Netflix Stock - Examine Residuals and Model Fit

                   ME    RMSE      MAE      MPE     MAPE     MASE         ACF1
Training set 3.464136 33.6508 21.59716 1.309471 10.99867 0.213375 -0.006420417
  • Many options for comparing models

  • For BUA 345: We will use MAPE = Mean Absolute Percent Error

    • 100 – MAPE = Percent accuracy of model.
  • Despite increasing volatility, our stock price model is estimated to be 89% accurate.

  • This doesn’t guarantee that forecasts will be 89% accurate but it does improve our chances of accurate forecasting.