class: center, middle, inverse, title-slide

# BUA 345-Lecture 25: Forecasting - Part 1
### Slides are a work in progress.
### 2022-04-26

---

### Review from Regression concepts:

<p> </p>

We have data for 2020 annual salaries of 75 Upstate NY residents ranging from `$50K` to `$150K`, and we use that data to model how much someone spends on their first house.

<p> </p>

- **Is it valid to apply that model to someone with an annual salary of `$350K`?**

--

<p> </p>

A. Yes, this is extrapolation and it is valid

B. No, this is extrapolation and it is invalid

C. Yes, this is interpolation and it is valid

D. No, this is interpolation and it is invalid

---

### Plan for Today

#### Introduction to Forecasting

- **Cross-Sectional Data vs. Time Series Data**

- Basic Forecasting Terminology

- Forecasting Trends without Seasonality in R

  - Example 1 - US Population

  - Example 2 - Netflix Stock Prices

<p> </p>

#### HW Assignment 10

- **Today:** Questions 2 - 5

---

### Cross-Sectional Data vs. Time Series Data

#### Example: Golden Snowball Award

- [2021-2022 Data](https://goldensnowball.com/)

- [Historic Data](https://en.wikipedia.org/wiki/Golden_Snowball_Award)

--

.pull-left[

#### Cross-Sectional Data

- Data represent a snapshot in time

]

--

.pull-right[

#### Time Series Data

- Data show a trend over time

]

---

### Cross-Sectional Data vs. Time Series Data

--

#### Example: U.S. Population Data

--

.pull-left[

#### Cross-Sectional Data

- Data represent a snapshot in time

- Data from Louisiana were missing

- [R: U.S County Population in 2019](https://CRAN.R-project.org/package=usdata)

]

--

.pull-right[

#### Time Series Data

- Data represent a trend over time

- Observations are not independent

- [U.S. Population 1950 - 2021](https://www.macrotrends.net/countries/USA/united-states/population)

]

---

### Plan for Today

#### Introduction to Forecasting

- Cross-Sectional Data vs. Time Series Data

- **Basic Forecasting Terminology**

- Forecasting Trends without Seasonality in R

  - Example 1 - US Population

  - Example 2 - Netflix Stock Prices

<p> </p>

#### HW Assignment 10

- **Today:** Questions 2 - 5

---

### Time Series Terminology

- In time series data, new observations are often correlated with prior observations

--

  - This is referred to as **auto-correlation**

  - A variable is correlated with itself

- When data are auto-correlated, we use that information

  - This process is called **auto-regression**

  - Using previous observations to predict into the future.

--

- **R function:** **`auto.arima`** function in **`forecast`** package

--

- **ARIMA** is an acronym:

  - **AR:** auto-regressive

  - **I:** integrated

  - **MA:** moving average

- In **ARIMA** models, all three components are optimized to provide a reliable forecast.

---

### Terminology: ARIMA model components (p, d, q)

#### Auto-Regressive Models (AR)

- Similar to a simple linear regression model or non-linear regression model

- Key difference: the regressor (predictor) variable is the dependent variable itself at a specific LAG

--

- Lag (**p**) is how many previous time periods the model looks back to estimate the next time period.

--

- If **p = 1**, the model estimates the next time period based on the most recent one.

  - Looks back **one** time period

- If **p = 2**, the model estimates the next time period based on the most recent one and the one **BEFORE** it.

  - Looks back **two** time periods

---

### Terminology: ARIMA model components (p, d, q)

.pull-right[]

#### Differencing (I = Integration)

- **Stationarity:** mean and variance of the data are constant over the time span

  - Needed for accurate modeling

  - Can be verified by examining residuals

<p> </p>

<p> </p>

- **Differencing** transforms non-stationary data to stationary

- Differencing order (**d**) determined by model:

  - if **d = 1:** each obs. is replaced by its difference from the previous one (removes a linear trend)

  - if **d = 2:** each obs. is replaced by the difference of those differences (removes a quadratic trend)

---

### Terminology: ARIMA model components (p, d, q)

.pull-right[]

#### Moving Average (MA)

- Moving average (**q**): how many terms are incorporated into each average within the data.

- Algorithm calculates the average for a specific number of lagged terms

- Moving averages smooth out temporary instability in the data

- If **q = 1:** moving average is the average of the current term with the one from the previous time period.

- If **q = 2:** moving average is the average of the current term with the ones from the two previous time periods.

---

### Plan for Today

#### Introduction to Forecasting

- Cross-Sectional Data vs. Time Series Data

- Basic Forecasting Terminology

- **Forecasting Trends without Seasonality in R**

  - **Example 1 - US Population**

  - Example 2 - Netflix Stock Prices

<p> </p>

#### HW Assignment 10

- **Today:** Questions 2 - 5

---

### Example 1: U.S. Population - 1950 to Present

.pull-right[ ]

.pull-left[ ]

- **Forecast Questions:**

  - What will the U.S. Population be in 2040?

  - What ARIMA model was chosen (p,d,q)?

<p> </p>

- **Model Assessment Questions:**

  - How valid is our model?

     - Check residual plots.

  - How accurate are our estimates?

     - Review of Confidence Intervals and Confidence Bands

     - Check fit statistics

---

### U.S. Population - Interactive Plot

- One way to examine data effectively

--

```r
# packages that provide xts() and hchart()
library(xts)
library(highcharter)

# convert to xts (extensible time series)
uspop_xts <- xts(x=uspop[,3], order.by= uspop$Date)

# create interactive plot
hchart(uspop_xts$popM, name="Pop. (Mill.)", color="darkmagenta")
```
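---

### Sketch: What Differencing Does

A minimal sketch of the differencing step described in the terminology slides, using base R's `diff()`. The two series and all variable names below are made up for illustration; they are not the lecture data. A linear trend becomes constant after one difference (`d = 1`), and a quadratic trend becomes constant after two (`d = 2`).

```r
# Toy series (hypothetical values, not the lecture data)
t <- 1:8
linear_trend    <- 10 + 3 * t     # straight-line growth
quadratic_trend <- 5 + 2 * t^2    # curved growth

# d = 1: each observation minus the previous one
d1_linear <- diff(linear_trend, differences = 1)

# d = 2: difference of the differences
d2_quadratic <- diff(quadratic_trend, differences = 2)

print(d1_linear)      # every value is 3, so the trend is gone
print(d2_quadratic)   # every value is 4, so the curvature is gone
```

Each round of differencing shortens the series by one observation, which is why `d1_linear` has 7 values and `d2_quadratic` has 6.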
---

### U.S. Population - Modeling Time Series Data

--

#### Population Trend Forecast

- Create time series using population data

  - Specify `freq = 1` - one observation per year

  - Specify `start = 1950` - first year in dataset

--

- Model data using `auto.arima` function

  - Specify `ic = "aic"` - `aic` is the information criterion used to determine model.

  - Specify `seasonal = F` - no seasonal (repeating) pattern in the data.

--

- This chunk will create and save the model.

--

```r
# forecast package provides auto.arima() and forecast()
library(forecast)

# create time series for forecast
pop_ts <- ts(uspop$popM, freq=1, start=1950)

# model data using auto.arima function
pop_model <- auto.arima(pop_ts, ic="aic", seasonal=F)
```

---

### U.S. Population - Create and Plot Forecasts

--

- **Create forecasts (until 2040)**

  - `h = 18` indicates we want to forecast 18 years

  - Most recent year in data is 2022

  - 2040 - 2022 = 18

--

<p> </p>

- **Forecasts become less accurate the further into the future you specify.**

--

<p> </p>

```r
# create forecasts (until 2040)
pop_forecast <- forecast(pop_model, h=18)
```

---

### U.S. Population - Create and Plot Forecasts

- Darker purple: 80% Prediction Interval Bounds

- Lighter purple: 95% Prediction Interval Bounds

- Plot shows:

  - Lags (`p = 5`), Differencing (`d = 1`), Moving Average (`q = 1`)

```r
autoplot(pop_forecast) + labs(y = "U.S. Population (Millions)") +
  theme_classic()
```

---

### U.S. Population - Examine Numerical Forecasts

- Point Forecast is the forecasted estimate for each future time period

- Lo 80 and Hi 80 are the lower and upper bounds for the 80% prediction interval

- Lo 95 and Hi 95 are the lower and upper bounds for the 95% prediction interval

```r
# print out forecasts
pop_forecast
```

```
##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2023       336.6683 336.6352 336.7015 336.6176 336.7190
## 2024       338.4981 338.3517 338.6445 338.2742 338.7220
## 2025       340.2850 339.8912 340.6788 339.6828 340.8872
## 2026       342.0233 341.2056 342.8411 340.7727 343.2739
## 2027       343.7084 342.2664 345.1504 341.5031 345.9138
## 2028       345.3414 343.0744 347.6085 341.8743 348.8086
## 2029       346.9241 343.6481 350.2002 341.9139 351.9344
## 2030       348.4617 344.0199 352.9036 341.6685 355.2550
## 2031       349.9571 344.2196 355.6946 341.1823 358.7318
## 2032       351.4134 344.2714 358.5553 340.4907 362.3360
## 2033       352.8301 344.1854 361.4749 339.6091 366.0512
## 2034       354.2070 343.9622 364.4517 338.5389 369.8750
## 2035       355.5411 343.5932 367.4890 337.2684 373.8138
## 2036       356.8313 343.0692 370.5934 335.7840 377.8786
## 2037       358.0760 342.3821 373.7699 334.0743 382.0777
## 2038       359.2761 341.5315 377.0208 332.1381 386.4142
## 2039       360.4325 340.5221 380.3429 329.9822 390.8828
## 2040       361.5479 339.3658 383.7299 327.6233 395.4724
```

---

### Interpretation of Pop. Prediction Interval

<p> </p>

Based on the U.S. population forecast output, we are 95% certain that the U.S. population in 2030 will be less than `______` million people.

**How to input your answer:**

- Round to closest million (whole number)

- If the answer were 123 million (e.g. 123.4233), you would enter 123.

---

### U.S. Population - Examine Residuals and Model Fit

.pull-left[

```r
# examine residuals
checkresiduals(pop_forecast)
```

```
## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(5,1,1)
## Q* = 7.6745, df = 4, p-value = 0.1043
## 
## Model df: 6.   Total lags used: 10
```

]

.pull-right[

#### Examining Residuals:

- Top Plot: No spikes should be too large

  - One obs. seems large and should be checked.

- ACF: auto-correlation function.

  - Ideally, all or most values are within the dashed lines

- Histogram: Distribution of residuals should be approx. normal

  - We have one high outlier

- Assessment: Trend is very smooth so small aberrations are exaggerated in residuals.

]

---

### U.S. Population - Examine Residuals and Model Fit

```r
# examine model accuracy (fit)
(acr <- accuracy(pop_forecast))
```

```
##                       ME       RMSE       MAE         MPE        MAPE
## Training set 0.003523809 0.02460722 0.0145186 0.001910726 0.006482353
##                     MASE        ACF1
## Training set 0.005939396 -0.03419504
```

--

#### Fit Statistics for Model Comparisons

- Many options for comparing models

- **For BUA 345:** We will use MAPE = Mean Absolute Percent Error

- **(100 - MAPE) = Percent accuracy of model.**

- Despite outlier and one large ACF value, our population model is estimated to be 99.99% accurate.

- This doesn’t guarantee that forecasts will be 100% accurate but it does improve our chances of accurate forecasting.

---

### Plan for Today

#### Introduction to Forecasting

- Cross-Sectional Data vs. Time Series Data

- Basic Forecasting Terminology

- **Forecasting Trends without Seasonality in R**

  - Example 1 - US Population

  - **Example 2 - Netflix Stock Prices**

<p> </p>

#### HW Assignment 10

- **Today:** Questions 2 - 5

---

### Example 2: Netflix Stock Prices

#### Interactive Plot

- Data from [Yahoo Finance](https://finance.yahoo.com/)

```r
# quantmod provides getSymbols()
library(quantmod)

# import from yahoo finance and plot hchart
# (by default getSymbols() creates the object NFLX and returns its name)
nflx <- getSymbols("NFLX", from = "2010-01-01", to = "2022-04-02")
hchart(NFLX$NFLX.Adjusted, name="Adjusted", color="red")
```
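---

### Sketch: Computing MAPE by Hand

The percent-accuracy rule used for the population model, (100 - MAPE), can be checked by hand. This is a minimal base R sketch; the `actual` and `fitted` vectors are hypothetical numbers chosen for easy arithmetic, not output from either lecture model.

```r
# Hypothetical actual and fitted values (illustration only)
actual <- c(100, 200, 400, 500)
fitted <- c(110, 190, 380, 500)

# Absolute percent error for each observation: 10, 5, 5, 0
ape <- abs((actual - fitted) / actual) * 100

# MAPE is the mean of the absolute percent errors
mape <- mean(ape)

# Percent accuracy as defined for BUA 345
pct_accuracy <- 100 - mape

print(mape)          # 5
print(pct_accuracy)  # 95
```

MAPE is already expressed in percent, so it is subtracted from 100 rather than from 1.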
---

### Netflix Stock

- Was mostly trending upward, but had a downturn.

- Data imported are daily adjusted close

- Data we will use are monthly adjusted close (1st day of trading for each month)

---

### Netflix Stock

--

- **Forecast Questions:**

  - What will the estimated stock price be in April of 2023?

  - What ARIMA model was chosen (p,d,q)?

--

- **Model Assessment Questions:**

  - How valid is our model?

     - Check residual plots.

  - How accurate are our estimates?

     - Review of Confidence Intervals and Confidence Bands

     - Check fit statistics

---

### Netflix Stock - Modeling Time Series Data

--

#### Stock Trend Forecast

- Create time series using Netflix Stock data

  - Specify `freq = 12` - 12 observations per year

  - Specify `start = c(2010, 1)` - first obs. in dataset is January 2010

--

- Model data using `auto.arima` function

  - Specify `ic = "aic"` - `aic` is the information criterion used to determine model.

  - Specify `seasonal = F` - no seasonal (repeating) pattern in the data.

--

- This chunk will create and save the model.

--

```r
# create time series for forecast
nflx_ts <- ts(nflx$Adjusted, freq=12, start=c(2010,1))

# model data using auto.arima function
nflx_model <- auto.arima(nflx_ts, ic="aic", seasonal=F)
```

---

### Netflix Stock - Create and Plot Forecasts

--

- Create forecasts (until April 2023)

  - `h = 12` indicates we want to forecast 12 months

  - Most recent month in data is April 2022

  - 12 months until April 2023

--

<p> </p>

- **Forecasts become less accurate the further into the future you specify.**

--

<p> </p>

```r
# create forecasts (until April 2023)
nflx_forecast <- forecast(nflx_model, h=12)
```

---

### Netflix Stock - Create and Plot Forecasts

- Darker purple: 80% Prediction Interval Bounds

- Lighter purple: 95% Prediction Interval Bounds

- Plot shows:

  - Lags (`p = 2`), Differencing (`d = 1`), Moving Average (`q = 2`)

```r
autoplot(nflx_forecast) + labs(y = "Adjusted Closing Price") +
  theme_classic()
```

---

### Netflix Stock - Examine Numerical Forecasts

- Point Forecast is the forecasted estimate for each future time period

- Lo 80 and Hi 80 are the lower and upper bounds for the 80% prediction interval

- Lo 95 and Hi 95 are the lower and upper bounds for the 95% prediction interval

```r
# print out forecasts
nflx_forecast
```

```
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## May 2022       386.0325 354.3571 417.7079 337.5891 434.4759
## Jun 2022       388.4639 339.8750 437.0527 314.1537 462.7740
## Jul 2022       378.4275 315.1390 441.7161 281.6360 475.2191
## Aug 2022       387.1893 315.5042 458.8743 277.5564 496.8221
## Sep 2022       384.3023 302.9559 465.6487 259.8937 508.7108
## Oct 2022       381.9712 292.3050 471.6374 244.8386 519.1038
## Nov 2022       386.0832 289.6141 482.5524 238.5464 533.6200
## Dec 2022       383.2967 279.5396 487.0538 224.6139 541.9794
## Jan 2023       383.6612 273.5645 493.7578 215.2829 552.0395
## Feb 2023       384.9666 268.9293 501.0039 207.5029 562.4303
## Mar 2023       383.4174 261.4237 505.4110 196.8442 569.9905
## Apr 2023       384.2102 256.8058 511.6146 189.3620 579.0584
```

---
### Interpretation of Netflix Prediction Intervals

<p> </p>

In January of 2023, the Netflix stock price is forecasted to be approximately $384. However, the 95% prediction interval indicates it may be as low as `____`.

**How to input your answer:**

- Round to closest whole dollar.

- Don't include dollar sign.

---

### Netflix Stock - Examine Residuals and Model Fit

.pull-left[

```r
# examine residuals
checkresiduals(nflx_forecast)
```

```
## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,2)
## Q* = 23.753, df = 20, p-value = 0.2533
## 
## Model df: 4.   Total lags used: 24
```

]

.pull-right[

#### Examining Residuals:

- Top Plot: Spikes get larger over time

- ACF: auto-correlation function.

  - Ideally, all or most values are within the dashed lines

- Histogram: Distribution of residuals should be approx. normal

  - Appears okay

- Assessment: Stock prices are very volatile and this is sufficient.

]

---

### Netflix Stock - Examine Residuals and Model Fit

```r
# examine model accuracy (fit)
(acr <- accuracy(nflx_forecast))
```

```
##                    ME     RMSE      MAE      MPE     MAPE      MASE       ACF1
## Training set 2.173023 24.29538 14.95796 1.338211 11.02551 0.2500016 0.04051291
```

--

#### Fit Statistics for Model Comparisons

- Many options for comparing models

- **For BUA 345:** We will use MAPE = Mean Absolute Percent Error

- **(100 - MAPE) = Percent accuracy of model.**

- Despite increasing volatility, our stock price model is estimated to be 88.97% accurate.

- This doesn’t guarantee that forecasts will be 89% accurate but it does improve our chances of accurate forecasting.

---

### Plan for Today

#### Introduction to Forecasting

- Cross-Sectional Data vs. Time Series Data

- Basic Forecasting Terminology

- Forecasting Trends without Seasonality in R

  - Example 1 - US Population

  - Example 2 - Netflix Stock Prices

<p> </p>

#### **HW Assignment 10**

- **Today: Questions 2 - 5**

---

### Key Points from Today

*3 Lectures Left*

--

- **R and the `forecast` package simplify forecasting**

--

- **Extrapolation OK in this case**

  - Report uncertainty as prediction bounds

--

- **You should know terminology and how to read and interpret output.**

  - You will be given data, R code, and output

  - You will answer questions based on provided output.

--

- **HW 10 will cover Lectures 24, 25, and 26**

  - Due Monday, 5/2/2022

--

**To submit a question or comment about material from Lecture 25:**

- Submit it by tonight, Tuesday 4/26, at midnight for credit.

- Click on the link next to the question mark under Lecture 25
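
---

### Appendix Sketch: How the 80% and 95% Bounds Relate

A rough sketch of why the 95% prediction bounds are wider than the 80% bounds. It assumes the intervals are normal-based (point forecast plus or minus a z-value times a standard error), which is an assumption about how the printed output was produced, not something stated in the slides. The two input numbers are copied from the 2030 row of the U.S. population forecast table; the variable names are illustrative.

```r
# Values copied from the 2030 row of the population forecast table
point <- 348.4617
hi80  <- 352.9036

# Assumption: bounds = point forecast +/- z * standard error
z80 <- qnorm(0.90)    # about 1.28 for an 80% interval
z95 <- qnorm(0.975)   # about 1.96 for a 95% interval

# Back out the implied forecast standard error from the 80% bound
se <- (hi80 - point) / z80

# Rebuild the 95% bounds from that standard error
lo95 <- point - z95 * se
hi95 <- point + z95 * se

print(c(lo95 = lo95, hi95 = hi95))
```

Under that assumption the rebuilt bounds land very close to the table's Lo 95 and Hi 95 for 2030 (341.6685 and 355.2550), so the two bands differ only in the z-value used.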