Housekeeping

Upcoming Dates

  • HW 9 is due on Wednesday, 4/15.

    • Grace Period ends Thursday (4/16) at 10:00 PM.
  • HW 10 is now posted and is due on Monday 4/27.

  • OPTIONAL GitHub Quarto Dashboard Workshop on Fri., 4/17, at 3:15 PM.

    • If you are interested in attending, please sign up using the Google Form.
    • Space is limited on Friday so that I can help individual students troubleshoot this process.
  • Additional Practice Questions will be posted next week.

  • Course Review on 4/23.

Today’s plan 📋

Introduction to Forecasting

  • Cross-Sectional Data vs. Time Series Data

  • Basic Forecasting Terminology

  • Forecasting Trends without Seasonality in R

    • Example 1 - US Population

    • Example 2 - Netflix Stock Prices

  • NEW PACKAGE FOR FORECASTING: forecast

    • Part of HW 10 pertains to today’s lecture.

    • Demo videos for HW 10 will be posted this weekend.

💥 Lecture 25 In-class Exercises - Q1 💥

Poll Everywhere - My User Name: penelopepoolereisenbies685

Review Question from Linear Regression Modeling:

We have 2025 data for annual salaries of 75 Upstate NY residents ranging from $50K to $150K and we use that data to model how much someone spends on their first house.

Is it valid to apply that model to someone with an annual salary of $350K?


  • Yes, this is extrapolation and it is valid.

  • No, this is extrapolation and it is invalid.

  • Yes, this is interpolation and it is valid.

  • No, this is interpolation and it is invalid.

Cross-Sectional Data

Shows a Snapshot of One Time Period

Time Series Data

Shows Trend over Time

U.S. Population - Cross-Sectional Data

Population by County in 2019

U.S. Population - Time Series Data

U.S. Population 1950 - 2025

Time Series Terminology

In time series data, new observations are often correlated with prior observations

  • This is referred to as auto-correlation

    • A variable is correlated with itself

    • When data are auto-correlated, we use that information

    • This process is called auto-regression

      • Using previous observations to predict into the future.
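Auto-correlation can be checked directly by pairing each observation with the one before it. The sketch below uses made-up trending values (not course data) and base R only; the variable names are illustrative.

```r
# Toy yearly series with a steady upward trend plus small noise (made-up values)
set.seed(345)
y <- 100 + 2 * (1:40) + rnorm(40, sd = 1)

# Pair each observation with the one before it
current  <- y[-1]          # observations 2 through 40
previous <- y[-length(y)]  # observations 1 through 39

# Lag-1 correlation: how strongly each value tracks the previous one
lag1_cor <- cor(current, previous)
lag1_cor   # close to 1 for a smooth trending series
```

A value near 1 means the previous observation carries a lot of information about the next one, which is what auto-regression takes advantage of.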

Introductory Time Series R Functions

  • R function: auto.arima in the forecast package

    • ARIMA is an acronym:

      • AR: auto-regressive

      • I: integrated

      • MA: moving average

  • In ARIMA models, all three components are optimized to provide a reliable forecast.

Terminology: ARIMA model components (p, d, q)

Auto-Regressive Models (AR)

  • Similar to a simple linear regression model or non-linear regression model

  • Key difference: the predictor variable (X) is the dependent variable (Y) itself at a specific LAG

  • Lag order (p) is how many previous time periods the model looks back to estimate the next time period.

    • If p = 1, the model estimates the next time period based on the most recent one.

      • Looks back one time period
    • If p = 2, the model estimates the next time period based on the two most recent ones.

      • Looks back two time periods
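To make the "regression on lagged values" idea concrete, here is a minimal sketch that fits p = 1 and p = 2 by ordinary regression on a simulated series. The data are simulated (not course data), and using lm() this way is a teaching shortcut, not how auto.arima estimates its models.

```r
set.seed(345)
# Simulate an AR(1) series: each value is 0.7 times the previous value plus noise
y <- as.numeric(arima.sim(model = list(ar = 0.7), n = 500))
n <- length(y)

# p = 1: regress each value on the single most recent value
ar1_fit <- lm(y[2:n] ~ y[1:(n - 1)])

# p = 2: regress each value on the two most recent values
ar2_fit <- lm(y[3:n] ~ y[2:(n - 1)] + y[1:(n - 2)])

coef(ar1_fit)   # slope should land near the simulated value of 0.7
coef(ar2_fit)   # intercept plus one slope per lag
```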

Terminology: ARIMA model components (p, d, q)

Differencing (I = Integrated)

  • Stationarity: mean and variance of the data are constant over the time span

    • Needed for accurate modeling

    • Can be verified by examining residuals

  • Differencing transforms non-stationary data to stationary

  • Differencing order (d) is determined by the model:

    • if d = 1: each obs. is replaced by its difference from the previous one (removes a linear trend)

    • if d = 2: the differences are differenced again (removes a quadratic trend)
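Differencing is easy to see on small made-up series with base R's diff(). In this sketch, one round of differencing flattens a straight-line trend and two rounds flatten a quadratic one; the numbers are illustrative only.

```r
time_index <- 1:8
linear    <- 5 + 3 * time_index   # straight-line trend
quadratic <- 2 + time_index^2     # curved (quadratic) trend

diff(linear)                       # d = 1: every difference is 3
diff(quadratic)                    # d = 1: still trending (3, 5, 7, ...)
diff(quadratic, differences = 2)   # d = 2: every value is 2
```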

Terminology: ARIMA model components (p, d, q)

Moving Average (MA)

  • Moving average order (q): how many lagged forecast errors are incorporated into the model.

  • Algorithm estimates a weight for each of a specific number of lagged error terms

  • Moving average terms account for temporary instability (short-term shocks) in the data

    • If q = 1: the model adjusts for the error from the previous time period.

    • If q = 2: the model adjusts for the errors from the two previous time periods.
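In base R's arima(), the q terms are coefficients on the errors (shocks) from previous time periods. The sketch below simulates a series with one such term and fits a q = 1 model to recover it. The data are simulated, and base R's arima() stands in here for auto.arima.

```r
set.seed(345)
# Simulate an MA(1) series: value = current shock + 0.6 * previous shock
y <- arima.sim(model = list(ma = 0.6), n = 1000)

# Fit p = 0, d = 0, q = 1 with base R's arima()
ma1_fit <- arima(y, order = c(0, 0, 1))

coef(ma1_fit)["ma1"]   # estimate should land near the simulated value of 0.6
```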

Example 1: U.S. Population - 1950 to Present

  • Forecast Questions:

    • What will the U.S. Population be in 2041?

    • What ARIMA model was chosen (p,d,q)?

  • Model Assessment Questions:

    • How valid is our model?

      • Check residual plots.
    • How accurate are our estimates?

      • Examine Prediction Intervals and Prediction Bands

      • Check fit statistics

U.S. Population - Interactive Plot

U.S. Population - Modeling Time Series Data

Population Trend Forecast

  • Create time series using population data

    • Specify freq = 1 - one observation per year

    • Specify start = 1950 - first year in dataset

  • Model data using auto.arima function

    • Specify ic = "aic" - AIC is the information criterion used to select the model.

    • Specify seasonal = F - no seasonal (repeating) pattern in the data.

  • These commands will create and save the model:

library(forecast)                                       # provides auto.arima
pop_ts <- ts(uspop$popM, freq=1, start=1950)            # create time series
pop_model <- auto.arima(pop_ts, ic="aic", seasonal=F)   # model data using auto.arima
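To get a feel for what selecting a model by AIC means, the sketch below fits a few candidate (p, d, q) orders by hand and compares their AIC values (lower is better). It uses simulated data and base R's arima(). The candidate list is hypothetical, and auto.arima runs its own, more thorough search.

```r
set.seed(345)
# Simulated stand-in series (not the population data)
y <- arima.sim(model = list(ar = 0.7), n = 200)

# Hypothetical short list of (p, d, q) orders to compare
candidates <- list(c(0, 0, 0), c(1, 0, 0), c(0, 0, 1), c(1, 0, 1))

# Fit each candidate and record its AIC
aic_values <- sapply(candidates, function(ord) {
  AIC(arima(y, order = ord, method = "ML"))
})
names(aic_values) <- sapply(candidates, paste, collapse = ",")

aic_values                     # one AIC per candidate order
names(which.min(aic_values))   # the order with the lowest AIC
```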

U.S. Population - Create and Plot Forecasts

Create forecasts (until 2041)

  • h = 16 indicates we want to forecast 16 years

  • Most recent year in our data is 2025

    • 2041 - 2025 = 16
  • Forecasts become less accurate the further into the future you forecast.

library(ggplot2)                          # provides labs() and theme_classic()
pop_forecast <- forecast(pop_model, h=16) # create forecasts (until 2041)
uspop_pred_plot <- autoplot(pop_forecast) +
  labs(y = "U.S. Population (Millions)") +
  theme_classic()
  • Darker purple: 80% Prediction Interval Bounds
  • Lighter purple: 95% Prediction Interval Bounds
  • Plot shows:
    • Lags (p = 2), Differencing (d = 1), Moving Average (q = 1)
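The widening of the prediction bands can be checked numerically. This sketch uses a made-up random-walk series and base R's arima() and predict() in place of auto.arima and forecast(), so the model and numbers are illustrative only.

```r
set.seed(345)
# Made-up yearly series standing in for real data, starting in 1950
y <- ts(cumsum(rnorm(76)), frequency = 1, start = 1950)

fit <- arima(y, order = c(0, 1, 0))   # simple random-walk model
fc  <- predict(fit, n.ahead = 16)     # 16 steps ahead, like h = 16

round(fc$se, 2)   # forecast standard errors grow with the horizon
```

Because the prediction interval half-width is a multiple of the standard error, a growing standard error shows up in the plot as bands that fan out.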

U.S. Population - Forecast Plot

U.S. Population - Examine Numerical Forecasts

  • Point Forecast is the forecasted estimate for each future time period
  • Lo 80 and Hi 80 are lower and upper bounds for the 80% prediction interval
  • Lo 95 and Hi 95 are lower and upper bounds for the 95% prediction interval
     Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
2026       345.4317 345.2705 345.5929 345.1851 345.6783
2027       347.3536 346.8411 347.8660 346.5699 348.1373
2028       349.3666 348.4173 350.3159 347.9148 350.8184
2029       351.4591 350.0280 352.8901 349.2705 353.6476
2030       353.6191 351.6845 355.5537 350.6604 356.5778
2031       355.8363 353.3899 358.2827 352.0949 359.5778
2032       358.1019 355.1441 361.0597 353.5783 362.6255
2033       360.4083 356.9447 363.8718 355.1112 365.7053
2034       362.7491 358.7891 366.7091 356.6927 368.8055
2035       365.1191 360.6739 369.5643 358.3207 371.9174
2036       367.5136 362.5959 372.4314 359.9926 375.0347
2037       369.9289 364.5519 375.3060 361.7055 378.1524
2038       372.3618 366.5390 378.1846 363.4566 381.2670
2039       374.8095 368.5544 381.0646 365.2431 384.3759
2040       377.2697 370.5955 383.9439 367.0623 387.4770
2041       379.7405 372.6599 386.8210 368.9117 390.5692
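The 80% and 95% bounds in each row are tied together by one forecast standard error. As a rough check, the sketch below backs out that standard error from the 2026 row's 80% interval and rebuilds the 95% interval, assuming normal-based intervals of the form point ± z × se.

```r
# 2026 row copied from the forecast table above
point <- 345.4317
lo80  <- 345.2705
hi80  <- 345.5929

# Back out the forecast standard error from the 80% interval
# (assumes a normal-based interval: point +/- z * se)
se <- (hi80 - lo80) / (2 * qnorm(0.90))

# Rebuild the 95% interval from the same standard error
lo95 <- point - qnorm(0.975) * se
hi95 <- point + qnorm(0.975) * se
round(c(lo95, hi95), 3)   # compare with Lo 95 and Hi 95 in the 2026 row
```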

💥 Lecture 25 In-class Exercises - Q2 💥



Based on the U.S. Population forecast output, we are 95% certain that the U.S. population in 2030 will be less than ______ million people.


Round to closest million (whole number), e.g. 345 million.

U.S. Population - Examine Residuals and Model Fit

  • Top Plot: No spikes should be too large

    • One obs. should be checked.
  • ACF: auto-correlation function.

    • Ideally, most values fall within lines
  • Histogram: Distribution of residuals should be approx. normal

    • One large low outlier
  • Assessment: Trend is very smooth so small aberrations are exaggerated in residuals.


    Ljung-Box test

data:  Residuals from ARIMA(2,1,1) with drift
Q* = 3.4469, df = 7, p-value = 0.8408

Model df: 3.   Total lags used: 10
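A Ljung-Box test of this form can be run with base R's Box.test(). The sketch below uses made-up noise as stand-in residuals, so the statistic and p-value will not match the output above. Only the lag and degrees-of-freedom settings mirror it.

```r
set.seed(345)
# Stand-in residuals: pure noise, which is what a well-fitting model should leave behind
resid_demo <- rnorm(76)

# lag = 10 total lags; fitdf = 3 model degrees of freedom, as in the output above
lb <- Box.test(resid_demo, lag = 10, type = "Ljung-Box", fitdf = 3)

lb$parameter   # degrees of freedom: 10 - 3 = 7
lb$p.value     # a large p-value means no evidence of leftover auto-correlation
```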

U.S. Population - Examine Residuals and Model Fit

                      ME      RMSE        MAE        MPE       MAPE       MASE        ACF1
Training set 0.003349508 0.1215547 0.08631584 0.00303556 0.03801874 0.03314369 0.009871574
  • Many options for comparing models

  • For BUA 345: We will use MAPE = Mean Absolute Percent Error

    • (100 – MAPE) = Percent accuracy of model.
  • Despite outlier and one relatively large ACF value, our population model is estimated to be 99.96% accurate.

  • This doesn’t guarantee that forecasts will be almost 100% accurate but it does improve our chances of accurate forecasting.
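For reference, MAPE can be computed by hand as the average absolute error expressed as a percent of each actual value. The sketch below uses made-up actual and fitted values, then applies the course's 100 – MAPE rule to the population model's reported MAPE.

```r
# Toy actual and fitted values (made up for illustration)
actual      <- c(100, 200, 400)
fitted_vals <- c(110, 190, 400)

# MAPE: average absolute error, each expressed as a percent of the actual value
mape <- mean(abs((actual - fitted_vals) / actual)) * 100
mape          # 5
100 - mape    # 95, the "percent accuracy" as defined in this course

# Same subtraction with the MAPE reported for the population model
100 - 0.03801874   # about 99.96
```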

Example 2: Netflix Stock Prices

Data from Yahoo Finance

Netflix Stock

  • Was mostly trending upward, but had a downturn and then another recent upturn.
  • Data shown are daily adjusted closing value
  • For analysis, we will use the monthly adjusted close (1st day of trading for each month)

Netflix Stock

  • Forecast Questions:

    • What will the estimated stock price be in April of 2027?

    • What ARIMA model was chosen (p,d,q)?

  • Model Assessment Questions:

    • How valid is our model?

      • Check residual plots.
    • How accurate are our estimates?

      • Examine Prediction Intervals and Prediction Bands

      • Check fit statistics

Netflix Stock - Modeling Time Series Data

Stock Trend Forecast

  • Create time series using Netflix Stock data

    • Specify freq = 12 - 12 observations per year

    • Specify start = c(2010, 1) - first obs. in dataset is January 2010

  • Model data using auto.arima function

    • Specify ic = "aic" - AIC is the information criterion used to select the model.

    • Specify seasonal = F - no seasonal (repeating) pattern in the data.

  • This code will create and save the model:

nflx_ts <- ts(nflx$Adjusted, freq=12, start=c(2010,1))   # create time series
nflx_model <- auto.arima(nflx_ts, ic="aic", seasonal=F)  # model data using auto.arima
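With monthly data, it helps to confirm that the time series starts and ends where expected before modeling. The sketch below uses placeholder values and assumes one observation for every month from January 2010 through April 2026 with no gaps. The real nflx data may differ.

```r
# Placeholder values: one observation per month, Jan 2010 through Apr 2026 (assumed, no gaps)
n_months <- (2026 - 2010) * 12 + 4   # 196 months
demo_ts  <- ts(seq_len(n_months), frequency = 12, start = c(2010, 1))

start(demo_ts)   # 2010 1  (January 2010)
end(demo_ts)     # 2026 4  (April 2026)
```

If end() does not report the expected final month, the number of rows and the start value do not line up, and forecast dates will be shifted.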

Netflix Stock - Create and Plot Forecasts

  • Create forecasts (until April 2027)

    • h = 12 indicates we want to forecast 12 months

    • Most recent date in our data is April 1, 2026

    • 12 months until April 1, 2027

  • Forecasts become less accurate the further into the future you forecast.

nflx_forecast <- forecast(nflx_model, h=12) # create forecasts (until April 2027)
nflx_pred_plot <- autoplot(nflx_forecast) + labs(y = "Adjusted Closing Price") +
  theme_classic()
  • Darker purple: 80% Prediction Interval Bounds

  • Lighter purple: 95% Prediction Interval Bounds

  • Plot shows:

    • Lags (p = 2), Differencing (d = 1), Moving Average (q = 2)

Netflix Stock - Forecast Plot

Netflix Stock - Examine Numerical Forecasts

  • Point Forecast is the forecasted estimate for each future time period
  • Lo 80 and Hi 80 are lower and upper bounds for the 80% prediction interval
  • Lo 95 and Hi 95 are lower and upper bounds for the 95% prediction interval


         Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
May 2026       96.28770 90.87356 101.7018 88.00749 104.5679
Jun 2026       97.88642 90.43508 105.3378 86.49058 109.2823
Jul 2026       99.36907 90.06138 108.6768 85.13418 113.6040
Aug 2026      100.03050 88.78655 111.2745 82.83436 117.2266
Sep 2026       99.90374 86.80360 113.0039 79.86881 119.9387
Oct 2026       99.59476 84.93684 114.2527 77.17741 122.0121
Nov 2026       99.72772 83.85743 115.5980 75.45620 123.9992
Dec 2026      100.48277 83.62884 117.3367 74.70690 126.2586
Jan 2027      101.54850 83.77874 119.3183 74.37199 128.7250
Feb 2027      102.44464 83.71553 121.1737 73.80094 131.0883
Mar 2027      102.90728 83.15997 122.6546 72.70638 133.1082
Apr 2027      103.04149 82.28462 123.7984 71.29660 134.7864

💥 Lecture 25 In-class Exercises - Q3 💥


Interpretation of Netflix Prediction Intervals


In February of 2027, the Netflix stock price is forecasted to be approximately $102. However, the 95% prediction interval indicates it may be as low as ____.

  • Round your answer to the closest whole dollar.

Netflix Stock - Examine Residuals and Model Fit

  • Top Plot: Spikes get larger over time

  • ACF: auto-correlation function.

    • Ideally, all or most values are within the dashed lines
  • Histogram: Distribution of residuals should be approx. normal

  • Assessment: Stock prices are very volatile, so this fit is sufficient.
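The dashed lines on an ACF plot are approximate 95% bounds at ±1.96/√n. The sketch below computes them with base R's acf() on made-up noise standing in for model residuals, then counts how many lags fall outside. The sample size of 196 is an assumption, not taken from the Netflix data.

```r
set.seed(345)
resid_demo <- rnorm(196)   # stand-in residuals (made-up pure noise)

acf_out <- acf(resid_demo, lag.max = 24, plot = FALSE)
bound   <- qnorm(0.975) / sqrt(length(resid_demo))   # height of the dashed lines

# Drop lag 0 (always 1), then count lags outside the dashed lines
acf_vals  <- acf_out$acf[-1]
n_outside <- sum(abs(acf_vals) > bound)
n_outside   # with pure noise, expect only about 1 in 20 lags outside by chance
```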


    Ljung-Box test

data:  Residuals from ARIMA(2,1,2) with drift
Q* = 29.59, df = 20, p-value = 0.07677

Model df: 4.   Total lags used: 24

Netflix Stock - Examine Residuals and Model Fit

                       ME     RMSE      MAE       MPE     MAPE      MASE      ACF1
Training set 0.0007043361 4.159505 2.555861 -6.764627 13.76645 0.2189833 0.0568807
  • Many options for comparing models

  • For BUA 345: We will use MAPE = Mean Absolute Percent Error

    • 100 – MAPE = Percent accuracy of model.
  • Despite increasing volatility, our stock price model is estimated to be ____% accurate. (Next question)

  • This doesn’t guarantee that forecasts will be this accurate but it does improve our chances of accurate forecasting.

💥 Lecture 25 In-class Exercises - Q4 💥



Based on the Netflix Model Mean Absolute Percent Error (MAPE), what is the percent accuracy of our forecast model?


Answer is reported as a percentage with two decimal places.