BUA 345 - Lecture 25

Introduction to Forecasting

Author

Penelope Pooler Eisenbies

Published

April 14, 2026

Housekeeping

Upcoming Dates

HW 9 is due on Wednesday, 4/15.
- Grace Period ends Thursday (4/16) at 10:00 PM.
HW 10 is now posted and is due on Monday 4/27.
OPTIONAL GitHub Quarto Dashboard Workshop on Fri., 4/17, at 3:15 PM.
- If you are interested in attending, please sign up using the Google Form.
- Space is limited on Friday so that I can help individual students troubleshoot this process.
Additional Practice Questions will be posted next week.
Course Review on 4/23.

Today’s plan

Introduction to Forecasting

Cross-Sectional Data vs. Time Series Data
Basic Forecasting Terminology
Forecasting Trends without Seasonality in R
- Example 1 - US Population
- Example 2 - Netflix Stock Prices
NEW PACKAGE FOR FORECASTING: forecast
- Part of HW 10 pertains to today’s lecture.
- Demo videos for HW 10 will be posted this weekend.

Lecture 25 In-class Exercises - Q1

Poll Everywhere - My User Name: penelopepoolereisenbies685

Review Question from Linear Regression Modeling:

We have 2025 data for annual salaries of 75 Upstate NY residents ranging from $50K to $150K and we use that data to model how much someone spends on their first house.

Is it valid to apply that model to someone with an annual salary of $350K?

Yes, this is extrapolation and it is valid.
No, this is extrapolation and it is invalid.
Yes, this is interpolation and it is valid.
No, this is interpolation and it is invalid.

Cross-Sectional Data

Shows a Snapshot of One Time Period

Time Series Data

Shows Trend over Time

U.S. Population - Cross-Sectional Data

Population by County in 2019

U.S. Population - Time Series Data

U.S. Population 1950 - 2025

Time Series Terminology

In time series data, new observations are often correlated with prior observations

This is referred to as auto-correlation
- A variable is correlated with itself
- When data are auto-correlated, we use that information
- This process is called auto-regression
  - Using previous observations to predict into the future.
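As a rough illustration of auto-correlation, the sketch below correlates a series with a copy of itself shifted by one time period. The series `y` is simulated and hypothetical (it is not the lecture data), and only base R functions are used.

```r
# Hypothetical series: a steady upward trend plus small noise (not the lecture data)
set.seed(345)
y <- 100 + 2 * (1:50) + rnorm(50, sd = 1)

# Lag-1 auto-correlation by hand: correlate each value with the one before it
n <- length(y)
lag1_cor <- cor(y[-1], y[-n])

# Same idea using base R's acf(); element 1 of $acf is lag 0, element 2 is lag 1
lag1_acf <- acf(y, lag.max = 5, plot = FALSE)$acf[2]

round(c(by_hand = lag1_cor, acf = lag1_acf), 3)
```

The two numbers differ slightly because `acf()` uses the overall mean and divides by the full series length, but both should be close to 1 for a strongly trending series like this one.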

Introductory Time Series R Functions

R function: auto.arima function in forecast package
- ARIMA is an acronym:
  - AR: auto-regressive
  - I: integrated
  - MA: moving average
In ARIMA models, all three components are optimized to provide a reliable forecast.

Terminology: ARIMA model components (p, d, q)

Auto-Regressive Models (AR)

Similar to a simple linear regression model or non-linear regression model
Key difference: The regressor or predictor variable (X) is the dependent variable (Y) itself at a specific LAG
Lag (p) is how many previous time periods the model looks back to estimate the next time period.
- If p = 1, the model estimates the next time period based on the most recent one.
  - Looks back one time period
- If p = 2, the model estimates the next time period based on the two most recent ones.
  - Looks back two time periods
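One way to picture the lag idea is to build the lagged columns by hand and fit an ordinary regression. This is only a sketch: the series is simulated with `arima.sim()`, the data frame and column names are hypothetical, and `lm()` stands in for the estimation that `auto.arima` does internally.

```r
# Hypothetical auto-regressive series simulated for illustration (not the lecture data)
set.seed(345)
y <- as.numeric(arima.sim(model = list(ar = 0.8), n = 200))

# Build the lagged predictors by shifting the series
n <- length(y)
ar_data <- data.frame(
  y_now  = y[3:n],        # value at time t
  y_lag1 = y[2:(n - 1)],  # value at time t - 1
  y_lag2 = y[1:(n - 2)]   # value at time t - 2
)

# p = 1: predict from the most recent period only
fit_p1 <- lm(y_now ~ y_lag1, data = ar_data)

# p = 2: predict from the two most recent periods
fit_p2 <- lm(y_now ~ y_lag1 + y_lag2, data = ar_data)

coef(fit_p1)
coef(fit_p2)
```

Because the simulated series was generated with a lag-1 coefficient of 0.8, the `y_lag1` slope in `fit_p1` should land near that value.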

Terminology: ARIMA model components (p, d, q)

Differencing (I = Integration)

Stationarity: mean and variance of data are consistent over timespan
- needed for accurate modeling
- Can be verified by examining residuals
Differencing transforms non-stationary data to stationary
Differencing order (d) determined by model:
- if d = 1: each obs. is replaced by its difference from the previous one (removes a linear trend)
- if d = 2: the differences are differenced again (removes a quadratic trend)
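A small worked example may help here. The two series below are made up for illustration; base R's `diff()` does the differencing.

```r
t <- 1:10
linear    <- 5 + 3 * t            # straight-line trend
quadratic <- 5 + 3 * t + 2 * t^2  # curved (quadratic) trend

# d = 1 on a linear trend: every difference equals the slope, so the trend is gone
d1_linear <- diff(linear)

# d = 1 on a quadratic trend: the differences still increase over time
d1_quad <- diff(quadratic)

# d = 2 on a quadratic trend: differencing twice leaves a constant
d2_quad <- diff(quadratic, differences = 2)

d1_linear
d1_quad
d2_quad
```

In this toy case one round of differencing flattens the linear series, while the quadratic series needs two rounds.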

Terminology: ARIMA model components (p, d, q)

Moving Average (MA)

Moving average (q): how many lagged terms are incorporated into each average within the data.
Algorithm calculates the average for a specific number of lagged terms
Moving averages smooth out temporary instability in the data
- If q = 1: moving average is average of current term with the one from the previous time period.
- If q = 2: moving average is average of the current term with the ones from the two previous time periods.
Note: strictly speaking, the lagged terms in the MA part of an ARIMA model are past forecast errors; the averaging description above is the intuition.
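The smoothing intuition can be sketched with a trailing average in base R. The values in `y` are made up, and this shows only the averaging idea; it is not the calculation `auto.arima` performs for the MA component.

```r
# Hypothetical bumpy series (not the lecture data)
y <- c(10, 14, 9, 15, 11, 16, 12)

# Average of the current term and the one before it
ma_2 <- stats::filter(y, rep(1 / 2, 2), sides = 1)

# Average of the current term and the two before it
ma_3 <- stats::filter(y, rep(1 / 3, 3), sides = 1)

# The first values are NA because there are not enough earlier terms to average
data.frame(y = y, ma_2 = as.numeric(ma_2), ma_3 = as.numeric(ma_3))
```

The averaged columns swing less from one period to the next than the original values do.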

Example 1: U.S. Population - 1950 to Present

Forecast Questions:
- What will the U.S. Population be in 2041?
- What ARIMA model was chosen (p,d,q)?
Model Assessment Questions:
- How valid is our model?
  - Check residual plots.
- How accurate are our estimates?
  - Examine Prediction Intervals and Prediction Bands
  - Check fit statistics

U.S. Population - Interactive Plot

U.S. Population - Modeling Time Series Data

Population Trend Forecast

Create time series using population data
- Specify freq = 1 - one observation per year
- Specify start = 1950 - first year in dataset
Model data using auto.arima function
- Specify ic = "aic" - aic is the information criterion used to determine model.
- Specify seasonal = F - no seasonal (repeating) pattern in the data.
These commands will create and save the model:

```{r create pop time series and model, echo=T}
pop_ts <- ts(uspop$popM, freq=1, start=1950)            # create annual time series starting in 1950
pop_model <- auto.arima(pop_ts, ic="aic", seasonal=F)   # model data using auto.arima (AIC, no seasonality)
```
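To see what the `ts()` step produces, here is a minimal sketch with placeholder numbers (hypothetical values, not the lecture data). It spells out `frequency`, which `freq` matches by R's partial argument matching.

```r
# Hypothetical population values in millions (placeholders, not the lecture data)
pop_demo <- c(150.2, 152.9, 155.6, 158.3, 161.0)

# One observation per year, starting in 1950
demo_ts <- ts(pop_demo, frequency = 1, start = 1950)

time(demo_ts)                   # the year attached to each observation
end(demo_ts)                    # last time point, here the year 1954
window(demo_ts, start = 1952)   # subset by year instead of by position
```

The time series object carries its own calendar, which is what lets the forecast output label future years automatically.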

U.S. Population - Create and Plot Forecasts

Create forecasts (until 2041)

h = 16 indicates we want to forecast 16 years
Most recent year in our data is 2025
- 2041 - 2025 = 16
Forecasts become less accurate the further into the future you forecast.

```{r create forecasts and plot, echo=T}
pop_forecast <- forecast(pop_model, h=16) # create forecasts (until 2041)
uspop_pred_plot <- autoplot(pop_forecast) + 
  labs(y = "U.S. Population (Millions)") +
  theme_classic()
```

Darker purple: 80% Prediction Interval Bounds
Lighter purple: 95% Prediction Interval Bounds
Plot shows:
- Lags (p = 2), Differencing (d = 1), Moving Average (q = 1)
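The horizon arithmetic and the widening prediction intervals can be sketched without the `forecast` package. Everything here is an assumption made for illustration: the series is a simulated random walk, the ARIMA(1,1,0) order is fixed by hand rather than chosen by `auto.arima`, and the bounds are built as point forecast plus or minus a normal quantile times the standard error from base R's `predict()`.

```r
set.seed(345)
# Hypothetical annual series: a random walk without drift (not the lecture data)
y <- ts(300 + cumsum(rnorm(76)), frequency = 1, start = 1950)

last_year   <- end(y)[1]          # last observed year: 2025
target_year <- 2041
h <- target_year - last_year      # forecast horizon: 16 years

# Order fixed by hand for this sketch (auto.arima would search for one)
fit <- arima(y, order = c(1, 1, 0))
fc  <- predict(fit, n.ahead = h)  # point forecasts ($pred) and standard errors ($se)

point <- as.numeric(fc$pred)
se    <- as.numeric(fc$se)
intervals <- data.frame(
  year  = (last_year + 1):target_year,
  point = point,
  lo80  = point - qnorm(0.90)  * se,
  hi80  = point + qnorm(0.90)  * se,
  lo95  = point - qnorm(0.975) * se,
  hi95  = point + qnorm(0.975) * se
)

# Width of the 95% interval for each forecast year
width95 <- intervals$hi95 - intervals$lo95
head(intervals, 3)
```

In this sketch `width95` grows with every additional year, which is the numerical version of the bands fanning out in the forecast plot.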

U.S. Population - Forecast Plot

U.S. Population - Examine Numerical Forecasts

Point Forecast is the forecasted estimate for each future time period
Lo 80 and Hi 80 are lower and upper bounds for the 80% prediction interval
Lo 95 and Hi 95 are lower and upper bounds for the 95% prediction interval

     Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
2026       345.4317 345.2705 345.5929 345.1851 345.6783
2027       347.3536 346.8411 347.8660 346.5699 348.1373
2028       349.3666 348.4173 350.3159 347.9148 350.8184
2029       351.4591 350.0280 352.8901 349.2705 353.6476
2030       353.6191 351.6845 355.5537 350.6604 356.5778
2031       355.8363 353.3899 358.2827 352.0949 359.5778
2032       358.1019 355.1441 361.0597 353.5783 362.6255
2033       360.4083 356.9447 363.8718 355.1112 365.7053
2034       362.7491 358.7891 366.7091 356.6927 368.8055
2035       365.1191 360.6739 369.5643 358.3207 371.9174
2036       367.5136 362.5959 372.4314 359.9926 375.0347
2037       369.9289 364.5519 375.3060 361.7055 378.1524
2038       372.3618 366.5390 378.1846 363.4566 381.2670
2039       374.8095 368.5544 381.0646 365.2431 384.3759
2040       377.2697 370.5955 383.9439 367.0623 387.4770
2041       379.7405 372.6599 386.8210 368.9117 390.5692
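As a quick check on how uncertainty grows, the sketch below copies the first and last rows of the output above into a small data frame (the name `fc_rows` is hypothetical) and compares the 95% interval widths.

```r
# First and last forecast rows, copied from the output above
fc_rows <- data.frame(
  year  = c(2026, 2041),
  point = c(345.4317, 379.7405),
  lo95  = c(345.1851, 368.9117),
  hi95  = c(345.6783, 390.5692)
)

# Width of the 95% prediction interval, in millions of people
fc_rows$width95 <- fc_rows$hi95 - fc_rows$lo95

# Half-width as a percentage of the point forecast
fc_rows$half_width_pct <- 100 * (fc_rows$width95 / 2) / fc_rows$point

fc_rows
```

The 2026 interval is about half a million people wide, while the 2041 interval spans more than 21 million.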

Lecture 25 In-class Exercises - Q2

Poll Everywhere - My User Name: penelopepoolereisenbies685

Based on the U.S. Population forecast output, we are 95% certain that the U.S. population in 2030 will be less than ______ million people.

Round to closest million (whole number), e.g. 345 million.

U.S. Population - Examine Residuals and Model Fit

Top Plot: No spikes should be too large
- One obs. should be checked.
ACF: auto-correlation function.
- Ideally, most values fall within the dashed lines
Histogram: Distribution of residuals should be approx. normal
- One large low outlier
Assessment: Trend is very smooth so small aberrations are exaggerated in residuals.


    Ljung-Box test

data:  Residuals from ARIMA(2,1,1) with drift
Q* = 3.4469, df = 7, p-value = 0.8408

Model df: 3.   Total lags used: 10
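The Ljung-Box output can be reproduced in spirit with base R's `Box.test()`. The residuals below are simulated white noise (hypothetical, not the model's residuals); the lag count and model df are taken from the output above.

```r
set.seed(345)
# Hypothetical residuals that behave like white noise (not the model's residuals)
resid_demo <- rnorm(75)

lags_used <- 10   # "Total lags used: 10" in the output above
model_df  <- 3    # "Model df: 3" in the output above

lb <- Box.test(resid_demo, lag = lags_used, type = "Ljung-Box", fitdf = model_df)

lb$parameter   # test df = lags used minus model df
lb$p.value     # a large p-value means no evidence of leftover auto-correlation
```

The test degrees of freedom come out as 10 - 3 = 7, matching `df = 7` in the lecture output, where the p-value of 0.8408 gives no evidence of remaining auto-correlation in the residuals.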

U.S. Population - Examine Residuals and Model Fit

                      ME      RMSE        MAE        MPE       MAPE       MASE
Training set 0.003349508 0.1215547 0.08631584 0.00303556 0.03801874 0.03314369
                    ACF1
Training set 0.009871574

Many options for comparing models
For BUA 345: We will use MAPE = Mean Absolute Percent Error
- (100 – MAPE) = Percent accuracy of model.
Despite outlier and one relatively large ACF value, our population model is estimated to be 99.96% accurate.
This doesn’t guarantee that forecasts will be almost 100% accurate but it does improve our chances of accurate forecasting.
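To make the MAPE calculation concrete, here is a minimal sketch with made-up actual and fitted values (hypothetical numbers), followed by the percent accuracy computed from the MAPE reported in the output above.

```r
# Hypothetical actual and fitted values (not the lecture data)
actual <- c(200, 250, 300, 400)
fitted <- c(198, 255, 297, 404)

# Absolute percent errors are 1%, 2%, 1%, and 1%, so their mean is 1.25%
mape <- mean(abs((actual - fitted) / actual)) * 100
pct_accuracy <- 100 - mape

c(MAPE = mape, accuracy = pct_accuracy)

# Using the MAPE from the population model output above
pop_accuracy <- round(100 - 0.03801874, 2)
pop_accuracy
```

The last line reproduces the 99.96% figure quoted for the population model.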

Example 2: Netflix Stock Prices

Data from Yahoo Finance

Netflix Stock

Was mostly trending upward, but had a downturn and then another recent upturn.
Data shown are daily adjusted closing values
For analysis, we will use the monthly adjusted close (1st day of trading for each month)

Netflix Stock

Forecast Questions:
- What will the estimated stock price be in April of 2027?
- What ARIMA model was chosen (p,d,q)?
Model Assessment Questions:
- How valid is our model?
  - Check residual plots.
- How accurate are our estimates?
  - Examine Prediction Intervals and Prediction Bands
  - Check fit statistics

Netflix Stock - Modeling Time Series Data

Stock Trend Forecast

Create time series using Netflix Stock data
- Specify freq = 12 - 12 observations per year
- Specify start = c(2010, 1) - first obs. in dataset is January 2010
Model data using auto.arima function
- Specify ic = "aic" - aic is the information criterion used to determine model.
- Specify seasonal = F - no seasonal (repeating) pattern in the data.
This code will create and save the model:

```{r create nflx time series and model, echo=T}
nflx_ts <- ts(nflx$Adjusted, freq=12, start=c(2010,1))   # create time series
nflx_model <- auto.arima(nflx_ts, ic="aic", seasonal=F)  # model data using auto.arima
```
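With monthly data the `start` argument takes a year and a month. The sketch below uses placeholder prices (hypothetical, not Netflix data) to show how the monthly calendar is attached.

```r
# Hypothetical monthly prices (placeholders, not Netflix data): 28 months from Jan 2010
price_demo <- seq(from = 7, by = 0.5, length.out = 28)

# 12 observations per year, first observation is January 2010
demo_ts <- ts(price_demo, frequency = 12, start = c(2010, 1))

end(demo_ts)                          # year and month of the last observation
cycle(demo_ts)                        # month number (1 to 12) of each observation
window(demo_ts, start = c(2012, 1))   # keep only observations from January 2012 on
```

Twenty-eight months starting in January 2010 end in April 2012, so `end(demo_ts)` returns the pair 2012 and 4.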

Netflix Stock - Create and Plot Forecasts

Create forecasts (until April 2027)
- h = 12 indicates we want to forecast 12 months
- Most recent date in forecast data is April 1, 2026
- 12 Months until April 1, 2027
Forecasts become less accurate the further into the future you forecast.

```{r create nflx forecasts, echo=T}
nflx_forecast <- forecast(nflx_model, h=12) # create forecasts (until April 2027)
nflx_pred_plot <- autoplot(nflx_forecast) + labs(y = "Adjusted Closing Price") +
  theme_classic()
```

Darker purple: 80% Prediction Interval Bounds
Lighter purple: 95% Prediction Interval Bounds
Plot shows:
- Lags (p = 2), Differencing (d = 1), Moving Average (q = 2)
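For monthly data, the horizon `h` counts months rather than years. A small sketch of that arithmetic, with hypothetical variable names:

```r
last_obs <- c(2026, 4)   # most recent observation: April 2026 (year, month)
target   <- c(2027, 4)   # forecast target: April 2027

# Number of months between the last observation and the target
h <- (target[1] - last_obs[1]) * 12 + (target[2] - last_obs[2])

# The months being forecast, starting with the month after the last observation
forecast_months <- seq(as.Date("2026-05-01"), by = "month", length.out = h)

h
range(forecast_months)
```

The twelve forecast months run from May 2026 through April 2027, matching the rows of the numerical forecast output.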

Netflix Stock - Forecast Plot

Netflix Stock - Examine Numerical Forecasts

Point Forecast is the forecasted estimate for each future time period
Lo 80 and Hi 80 are lower and upper bounds for the 80% prediction interval
Lo 95 and Hi 95 are lower and upper bounds for the 95% prediction interval

         Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
May 2026       96.28770 90.87356 101.7018 88.00749 104.5679
Jun 2026       97.88642 90.43508 105.3378 86.49058 109.2823
Jul 2026       99.36907 90.06138 108.6768 85.13418 113.6040
Aug 2026      100.03050 88.78655 111.2745 82.83436 117.2266
Sep 2026       99.90374 86.80360 113.0039 79.86881 119.9387
Oct 2026       99.59476 84.93684 114.2527 77.17741 122.0121
Nov 2026       99.72772 83.85743 115.5980 75.45620 123.9992
Dec 2026      100.48277 83.62884 117.3367 74.70690 126.2586
Jan 2027      101.54850 83.77874 119.3183 74.37199 128.7250
Feb 2027      102.44464 83.71553 121.1737 73.80094 131.0883
Mar 2027      102.90728 83.15997 122.6546 72.70638 133.1082
Apr 2027      103.04149 82.28462 123.7984 71.29660 134.7864

Lecture 25 In-class Exercises - Q3

Poll Everywhere - My User Name: penelopepoolereisenbies685

Interpretation of Netflix Prediction Intervals

In February of 2027, the Netflix stock price is forecasted to be approximately $102. However, the 95% prediction interval indicates it may be as low as ____.

Round your answer to the closest whole dollar.

Netflix Stock - Examine Residuals and Model Fit

Top Plot: Spikes get larger over time
ACF: auto-correlation function.
- Ideally, all or most values are within the dashed lines
Histogram: Distribution of residuals should be approx. normal
Assessment: Stock prices are very volatile and this is sufficient.


    Ljung-Box test

data:  Residuals from ARIMA(2,1,2) with drift
Q* = 29.59, df = 20, p-value = 0.07677

Model df: 4.   Total lags used: 24

Netflix Stock - Examine Residuals and Model Fit

                       ME     RMSE      MAE       MPE     MAPE      MASE
Training set 0.0007043361 4.159505 2.555861 -6.764627 13.76645 0.2189833
                  ACF1
Training set 0.0568807

Many options for comparing models
For BUA 345: We will use MAPE = Mean Absolute Percent Error
- 100 – MAPE = Percent accuracy of model.
Despite increasing volatility, our stock price model is estimated to be ____% accurate. (Next question)
This doesn’t guarantee that forecasts will be this accurate but it does improve our chances of accurate forecasting.

Lecture 25 In-class Exercises - Q4

Poll Everywhere - My User Name: penelopepoolereisenbies685

Based on the Netflix Model Mean Absolute Percent Error (MAPE), what is the percent accuracy of our forecast model?

Answer is reported as a percentage with two decimal places.

Key Points from Today

forecast package in R simplifies forecasting.
Extrapolation OK in this case
- Report uncertainty as prediction bounds
You should know terminology and how to read and interpret output.
- You will be given data, R code, and output
- You will answer questions based on provided output.
HW 10 includes material from Lectures 24-26
HW 9 is due on 4/15.

To submit an Engagement Question or Comment about material from Lecture 25: Submit it by midnight today (day of lecture).
