library(fpp3)
## ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
## ✔ tibble      3.1.8     ✔ tsibble     1.1.2
## ✔ dplyr       1.0.9     ✔ tsibbledata 0.4.0
## ✔ tidyr       1.2.0     ✔ feasts      0.2.2
## ✔ lubridate   1.8.0     ✔ fable       0.3.1
## ✔ ggplot2     3.3.6
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()
library(readxl)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ readr   2.1.2     ✔ stringr 1.4.0
## ✔ purrr   0.3.4     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::as.difftime() masks base::as.difftime()
## ✖ lubridate::date()        masks base::date()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ tsibble::intersect()     masks lubridate::intersect(), base::intersect()
## ✖ tsibble::interval()      masks lubridate::interval()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ tsibble::setdiff()       masks lubridate::setdiff(), base::setdiff()
## ✖ tsibble::union()         masks lubridate::union(), base::union()
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
data_eco <- readxl::read_xlsx("C:\\Users\\PythonAcct\\Downloads\\Table (1).xlsx")

This loads the data.

new_eco <- data_eco %>% 
  select(-GeoFips, -GeoName, -LineCode) %>%                                 # drop identifier columns
  pivot_longer(!Description, names_to = "Time", values_to = "count") %>%   # year columns become a Time column
  pivot_wider(names_from = Description, values_from = count) %>%           # one column per economic series
  select(-`Current dollar statistics (millions of dollars)`, -Employment)  # drop unneeded / NA-only series
new_eco
## # A tibble: 23 × 5
##    Time    GDP    PI   PCE    TE
##    <chr> <dbl> <dbl> <dbl> <dbl>
##  1 1999    5.4   5.9   7.5   1.9
##  2 2000    8.6   9     8.6   3.2
##  3 2001    4.7   6.3   5.6   1.2
##  4 2002    2.1   1.3   3.6   0.1
##  5 2003    5.5   4     4.8   0.7
##  6 2004    9.6   4.9   6.3   1.7
##  7 2005    9     8.5   7.8   3.3
##  8 2006   10.8   9.8   6.7   3.8
##  9 2007    8     6.2   6.2   3.8
## 10 2008    4.5   9.9   4.5   2.6
## # … with 13 more rows
## # ℹ Use `print(n = ...)` to see more rows

The dataset contains several extraneous columns, so I dropped those first. pivot_longer() was needed to reshape the data so the year labels become a Time column, and pivot_wider() then spreads the values into one column per series. Finally, I dropped the broader categories that were unnecessary and contained nothing but NA values.

new_eco$Time = as.numeric(as.character(new_eco$Time))

As it stood, my Time column was a character vector and therefore not usable as an annual tsibble index; this line (found via a Google search) converts it to numeric so the years are recognized.

teco <- new_eco %>% 
  as_tsibble(index=Time)
teco
## # A tsibble: 23 x 5 [1Y]
##     Time   GDP    PI   PCE    TE
##    <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1999   5.4   5.9   7.5   1.9
##  2  2000   8.6   9     8.6   3.2
##  3  2001   4.7   6.3   5.6   1.2
##  4  2002   2.1   1.3   3.6   0.1
##  5  2003   5.5   4     4.8   0.7
##  6  2004   9.6   4.9   6.3   1.7
##  7  2005   9     8.5   7.8   3.3
##  8  2006  10.8   9.8   6.7   3.8
##  9  2007   8     6.2   6.2   3.8
## 10  2008   4.5   9.9   4.5   2.6
## # … with 13 more rows
## # ℹ Use `print(n = ...)` to see more rows

With the numeric index, my tibble could be coerced to a tsibble with an annual [1Y] interval.

ggpairs(new_eco %>% select(-Time))

GDP and Personal Income have a correlation of .668, which is relatively strong and certainly positive. This means they tend to increase together. This result is statistically significant at the 99.9% confidence level.

GDP and Personal Consumption Expenditures have a correlation of .695, which is relatively strong and certainly positive. This means they tend to increase together. This result is statistically significant at the 99.9% confidence level.

GDP and Total Employment have a correlation of .505, which is of medium strength and certainly positive. This means they tend to increase together. This result is statistically significant at the 95% confidence level.

Personal Income and Personal Consumption Expenditures have a correlation of .519, which is of medium strength and certainly positive. This means they tend to increase together. This result is statistically significant at the 95% confidence level.

Personal Income and Total Employment have a correlation of .578, which is of medium strength and certainly positive. This means they tend to increase together. This result is statistically significant at the 99% confidence level.

Personal Consumption Expenditures and Total Employment have a correlation of .513, which is of medium strength and certainly positive. This means they tend to increase together. This result is statistically significant at the 95% confidence level.
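
As a quick check on these pairwise claims, the correlation matrix and a formal test for one pair can be computed directly; a minimal sketch using base R's cor() and cor.test() on the new_eco tibble built above:

new_eco %>% select(-Time) %>% cor()        # pairwise correlation matrix
cor.test(new_eco$GDP, new_eco$PI)          # correlation and p-value for the GDP/PI pair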

ecom <- teco %>% model(lm=TSLM(PCE ~ GDP+PI+TE))

This line of code estimates a linear model of percent change in Personal Consumption Expenditures as a function of the percent changes in GDP, Personal Income, and Total Employment.

report(ecom)
## Series: PCE 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6825 -0.9657  0.4163  1.4112  3.9410 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 1.122693   1.153561   0.973   0.3427  
## GDP         0.487332   0.182650   2.668   0.0152 *
## PI          0.005509   0.234792   0.023   0.9815  
## TE          0.553992   0.510214   1.086   0.2912  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.061 on 19 degrees of freedom
## Multiple R-squared: 0.5182,  Adjusted R-squared: 0.4421
## F-statistic: 6.812 on 3 and 19 DF, p-value: 0.0026339

The results of that model are printed here.

For a one percent increase in GDP, Personal Consumption Expenditures are expected to increase by .487 percent. This is statistically significant at the 95% confidence level.

For a one percent increase in Personal Income, Personal Consumption Expenditures are expected to increase by .005509 percent. This value is NOT significant at conventional levels.

For a one percent increase in Total Employment, Personal Consumption Expenditures are expected to increase by .553992 percent. This value is NOT statistically significant at conventional levels.

The intercept indicates a baseline percentage change of 1.122 when all of the regressors are zero. It is not statistically significant at conventional levels either.
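
The same coefficient table can also be pulled out programmatically. A minimal sketch, assuming fabletools' tidy() and glance() accessors behave for this TSLM as they do for other fable models:

tidy(ecom)    # coefficient estimates, standard errors, t statistics, p-values
glance(ecom)  # R-squared, adjusted R-squared, and other fit statistics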

augment(ecom) %>% 
  ggplot(aes(x = Time)) +
  geom_line(aes(y = PCE, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  scale_colour_manual(
    values = c(Data = "black", Fitted = "#D55E00")
  ) +
  labs(y = "% Change in Consumption",
       title = "%Change in Personal Consumption Expenditures: Fitted Vs Actual") +
  guides(colour = guide_legend(title = "Series"))

This plot shows my fitted values against the actual values. The fitted series tracks the data reasonably well, but the fit is far from perfect.

augment(ecom) %>% 
  autoplot(.resid) 

Here is a plot of the residuals given by my model. They are relatively large but seem random.

augment(ecom) %>% 
  features(.resid, ljung_box, lag=10, dof=0)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 lm        6.95     0.730

A Ljung-Box test with dof = 0 indicates that my residuals cannot be distinguished from white noise (p = 0.73).

augment(ecom) %>% 
  features(.resid, ljung_box, lag=10, dof=19)
## Warning in pchisq(STATISTIC, lag - fitdf): NaNs produced
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 lm        6.95       NaN

The regression output reports 19 residual degrees of freedom, so I tried that value for dof, but it produces a NaN: with dof larger than lag, the chi-square degrees of freedom (lag - dof = 10 - 19) are negative. The dof argument is meant to be the number of coefficients estimated by the model, not the residual degrees of freedom.
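
A sketch of an alternative run, with dof set to the number of estimated coefficients (four: the intercept plus three slopes), which keeps lag - dof positive:

augment(ecom) %>% 
  features(.resid, ljung_box, lag = 10, dof = 4)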

constant_scenarios <- scenarios(
  Increase = new_data(teco, 3) %>%
    mutate(GDP=1, PI=1, TE=1),
  names_to = "Scenario")

fc <- forecast(ecom, new_data = constant_scenarios)
hilo(fc)
## # A tsibble: 3 x 10 [1Y]
## # Key:       Scenario, .model [1]
##   Scenario .model  Time         PCE .mean   GDP    PI    TE
##   <chr>    <chr>  <dbl>      <dist> <dbl> <dbl> <dbl> <dbl>
## 1 Increase lm      2022 N(2.2, 5.1)  2.17     1     1     1
## 2 Increase lm      2023 N(2.2, 5.1)  2.17     1     1     1
## 3 Increase lm      2024 N(2.2, 5.1)  2.17     1     1     1
## # … with 2 more variables: `80%` <hilo>, `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
teco %>%
  autoplot(PCE) +
  autolayer(fc) +
  labs(title = "US consumption", y = "% change")

This creates a scenario with a one percent increase in each of my independent variables, held constant for three periods. A forecast is then produced from my regression model, with the 80% and 95% prediction intervals shown alongside it.

For Scenario 2, I originally wanted to specify successive (cumulative) percent increases but was unable to get that working; a sketch of one possible approach follows. Additionally, since no values were specified for the other variables, I arbitrarily set GDP and Total Employment to zero.
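
One possible way to encode successive increases (a sketch I did not run; growth_scenarios is a hypothetical name, and it relies on mutate() assigning one value to each of the three forecast rows):

growth_scenarios <- scenarios(
  Cumulative = new_data(teco, 3) %>%
    mutate(GDP = 1:3, PI = 1:3, TE = 1:3),   # 1%, then 2%, then 3% in successive years
  names_to = "Scenario")
forecast(ecom, new_data = growth_scenarios)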

un_scenarios <- scenarios(
  unpercent = new_data(teco, 3) %>%
  mutate(PI=1,GDP=0,TE=0),
  names_to="Scenario")
fc1 <- forecast(ecom, new_data = un_scenarios)
hilo(fc1)
## # A tsibble: 3 x 10 [1Y]
## # Key:       Scenario, .model [1]
##   Scenario  .model  Time         PCE .mean    PI   GDP    TE
##   <chr>     <chr>  <dbl>      <dist> <dbl> <dbl> <dbl> <dbl>
## 1 unpercent lm      2022 N(1.1, 5.5)  1.13     1     0     0
## 2 unpercent lm      2023 N(1.1, 5.5)  1.13     1     0     0
## 3 unpercent lm      2024 N(1.1, 5.5)  1.13     1     0     0
## # … with 2 more variables: `80%` <hilo>, `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
teco %>%
  autoplot(PCE) +
  autolayer(fc1) +
  labs(title = "US consumption", y = "% change")

This is a forecast where Personal Income grows by 1 percent while the other variables stay at zero for three years. Because Personal Income was not significant in my model, the scenario has almost no effect: the forecast is essentially just the intercept term.
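
Because the PI coefficient is so small, the point forecast is essentially the intercept; a quick check by hand using the coefficients from report(ecom):

1.122693 + 0.487332*0 + 0.005509*1 + 0.553992*0   # about 1.128, matching the reported .mean of 1.13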

neg_scenarios <- scenarios(
  negpercent=new_data(teco,3) %>% 
  mutate(PI=-.5,GDP=0,TE=0),
  names_to="scenario")
fc2 <- forecast(ecom, new_data = neg_scenarios)
hilo(fc2)
## # A tsibble: 3 x 10 [1Y]
## # Key:       scenario, .model [1]
##   scenario   .model  Time         PCE .mean    PI   GDP    TE
##   <chr>      <chr>  <dbl>      <dist> <dbl> <dbl> <dbl> <dbl>
## 1 negpercent lm      2022 N(1.1, 5.7)  1.12  -0.5     0     0
## 2 negpercent lm      2023 N(1.1, 5.7)  1.12  -0.5     0     0
## 3 negpercent lm      2024 N(1.1, 5.7)  1.12  -0.5     0     0
## # … with 2 more variables: `80%` <hilo>, `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
teco %>%
  autoplot(PCE) +
  autolayer(fc2) +
  labs(title = "US consumption", y = "% change")

This is a forecast where personal income experiences -.5 percent growth while the others stay at zero for three years. Because Personal Income was not significant for my model, there is almost no effect. The value is just the intercept term.

eight_scenarios <- scenarios(
  eightpercent=new_data(teco,3) %>% 
    mutate(PI=.8,GDP=0,TE=0),
  names_to="scenario")
fc3 <- forecast(ecom, new_data = eight_scenarios)
hilo(fc3)
## # A tsibble: 3 x 10 [1Y]
## # Key:       scenario, .model [1]
##   scenario     .model  Time         PCE .mean    PI   GDP    TE
##   <chr>        <chr>  <dbl>      <dist> <dbl> <dbl> <dbl> <dbl>
## 1 eightpercent lm      2022 N(1.1, 5.5)  1.13   0.8     0     0
## 2 eightpercent lm      2023 N(1.1, 5.5)  1.13   0.8     0     0
## 3 eightpercent lm      2024 N(1.1, 5.5)  1.13   0.8     0     0
## # … with 2 more variables: `80%` <hilo>, `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
teco %>%
  autoplot(PCE) +
  autolayer(fc3) +
  labs(title = "US consumption", y = "% change")

This is a forecast where personal income experiences .8 percent growth while the others stay at zero for three years. Because Personal Income was not significant for my model, there is almost no effect. The value is basically just the intercept term.

all_scenarios <- scenarios(
  opercent=new_data(teco,3) %>% 
    mutate(PI=1,GDP=0,TE=0),
  npercent=new_data(teco,3) %>% 
    mutate(PI=-.5,GDP=0,TE=0),
  epercent=new_data(teco,3) %>% 
    mutate(PI=.8,GDP=0,TE=0),
  names_to = "scenario"
)

fc4 <- forecast(ecom, new_data = all_scenarios)
hilo(fc4)
## # A tsibble: 9 x 10 [1Y]
## # Key:       scenario, .model [3]
##   scenario .model  Time         PCE .mean    PI   GDP    TE
##   <chr>    <chr>  <dbl>      <dist> <dbl> <dbl> <dbl> <dbl>
## 1 opercent lm      2022 N(1.1, 5.5)  1.13   1       0     0
## 2 opercent lm      2023 N(1.1, 5.5)  1.13   1       0     0
## 3 opercent lm      2024 N(1.1, 5.5)  1.13   1       0     0
## 4 npercent lm      2022 N(1.1, 5.7)  1.12  -0.5     0     0
## 5 npercent lm      2023 N(1.1, 5.7)  1.12  -0.5     0     0
## 6 npercent lm      2024 N(1.1, 5.7)  1.12  -0.5     0     0
## 7 epercent lm      2022 N(1.1, 5.5)  1.13   0.8     0     0
## 8 epercent lm      2023 N(1.1, 5.5)  1.13   0.8     0     0
## 9 epercent lm      2024 N(1.1, 5.5)  1.13   0.8     0     0
## # … with 2 more variables: `80%` <hilo>, `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
teco %>%
  autoplot(PCE) +
  autolayer(fc4) +
  labs(title = "US consumption", y = "% change")

Here are all three scenarios together. PI had virtually no effect and was not significant, so each forecast is essentially the intercept term. Even if these values were applied successively, the forecast would remain roughly constant at about 1.12% because that is essentially the estimated intercept.

pd <- readxl::read_xlsx("C:\\Users\\PythonAcct\\Downloads\\M3_exam1.xlsx") %>% 
  filter(`Student Name` == "Brandon") %>%                                     # keep only my assigned row
  pivot_longer(!`Student Name`, names_to = "Time", values_to = "count") %>%   # spread the columns into rows
  select(-`Student Name`) %>%                                                 # the name column is no longer needed
  na.omit() %>%                                                               # drop empty cells
  select(count) %>% 
  mutate(Time = yearquarter(-96:-37)) %>%                                     # quarters relative to 1970 Q1: 1946 Q1 to 1960 Q4
  as_tsibble(index = Time)
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `1943 Q1` -> `1943 Q1...12`
## • `1943 Q1` -> `1943 Q1...13`
## • `1944 Q1` -> `1944 Q1...17`
## • `1944 Q1` -> `1944 Q1...18`
## • `1945 Q1` -> `1945 Q1...22`
## • `1945 Q1` -> `1945 Q1...23`
## • `1946 Q1` -> `1946 Q1...27`
## • `1946 Q1` -> `1946 Q1...28`
## • `1947 Q1` -> `1947 Q1...32`
## • `1947 Q1` -> `1947 Q1...33`
## • `1948 Q1` -> `1948 Q1...37`
## • `1948 Q1` -> `1948 Q1...38`
## • `1949 Q1` -> `1949 Q1...42`
## • `1949 Q1` -> `1949 Q1...43`
## • `1950 Q1` -> `1950 Q1...47`
## • `1950 Q1` -> `1950 Q1...48`
## • `1951 Q1` -> `1951 Q1...52`
## • `1951 Q1` -> `1951 Q1...53`
## • `1952 Q1` -> `1952 Q1...57`
## • `1952 Q1` -> `1952 Q1...58`
## • `1953 Q1` -> `1953 Q1...62`
## • `1953 Q1` -> `1953 Q1...63`
## • `1954 Q1` -> `1954 Q1...67`
## • `1954 Q1` -> `1954 Q1...68`
## • `1955 Q1` -> `1955 Q1...72`
## • `1955 Q1` -> `1955 Q1...73`
## • `1956 Q1` -> `1956 Q1...77`
## • `1956 Q1` -> `1956 Q1...78`
pd
## # A tsibble: 60 x 2 [1Q]
##    count    Time
##    <dbl>   <qtr>
##  1  3600 1946 Q1
##  2  3800 1946 Q2
##  3  3000 1946 Q3
##  4  2200 1946 Q4
##  5  1600 1947 Q1
##  6  1600 1947 Q2
##  7  1400 1947 Q3
##  8  1400 1947 Q4
##  9  1800 1948 Q1
## 10  2400 1948 Q2
## # … with 50 more rows
## # ℹ Use `print(n = ...)` to see more rows

This loads the dataset and filters on my name to pull out my assigned series (I cleaned the file slightly in Excel first to make it workable). The pivot turns the year columns into rows, the student-name column is dropped because it is no longer needed, and the NA values and extraneous columns are removed. The yearquarter(-96:-37) line looks strange, but yearquarter() reads integers as quarters relative to 1970 Q1, so -96:-37 yields the sequence 1946 Q1 through 1960 Q4. Finally, the result is converted to a tsibble.
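
An equivalent and arguably more readable way to build the same index would be to start from an explicit quarter and add offsets (a sketch, assuming integer arithmetic on yearquarter objects adds quarters):

yearquarter("1946 Q1") + 0:59   # 60 consecutive quarters, 1946 Q1 through 1960 Q4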

The tidying of the data is completed.

pd %>% 
  autoplot(count)+
  labs(y="Count",title="Demographic")

The next step is to visualize the data. The series appears essentially random, with little trend or seasonality, and a transformation does not appear to be needed.
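
One way to sanity-check the no-transformation judgment is the Guerrero feature from feasts, which suggests a Box-Cox lambda (a sketch; a lambda near 1 would indicate little need for transformation):

pd %>% features(count, features = guerrero)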

pd %>% 
  model(STL(count~trend(window=4)+season(window=11),robust=TRUE)) %>% 
  components() %>% 
  autoplot()

This is the best I could get the STL decomposition to look. Given how little structure the series has, the decomposition adds little.

pd_train <- pd %>% 
  slice(1:55)
pd_test <- pd %>% 
  slice(56:60)

Here, I split the data into a training set (the first 55 quarters) and a test set (the last 5 quarters).
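
The same split can also be expressed with the time index rather than row positions (a sketch using tsibble's filter_index(); the boundary quarters are taken from the series above):

pd_train2 <- pd %>% filter_index(~ "1959 Q3")      # everything up to 1959 Q3
pd_test2  <- pd %>% filter_index("1959 Q4" ~ .)    # 1959 Q4 onward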

pd_mean <- pd_train %>%
  model(Mean = MEAN(count)
  )
pd_mfc <- pd_mean %>% 
  forecast(h = 4)
hilo(pd_mfc)
## # A tsibble: 4 x 6 [1Q]
## # Key:       .model [1]
##   .model    Time            count .mean                 `80%`
##   <chr>    <qtr>           <dist> <dbl>                <hilo>
## 1 Mean   1959 Q4 N(7095, 1.4e+07) 7095. [2355.387, 11833.7]80
## 2 Mean   1960 Q1 N(7095, 1.4e+07) 7095. [2355.387, 11833.7]80
## 3 Mean   1960 Q2 N(7095, 1.4e+07) 7095. [2355.387, 11833.7]80
## 4 Mean   1960 Q3 N(7095, 1.4e+07) 7095. [2355.387, 11833.7]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
pd_mfc %>%
  autoplot(pd_train) +
  autolayer(
    filter_index(pd_test),
    colour = "black"
  )
## Plot variable not specified, automatically selected `.vars = count`

gg_tsresiduals(pd_mean)

This is a poor forecast. The mean method leaves significant autocorrelation and residuals that are far from normally distributed; the residuals essentially reproduce the original series.

pd_naive <- pd_train %>% 
  model(`Naive`=NAIVE(count))
pd_nfc <- pd_naive %>% 
  forecast(h=4)
hilo(pd_nfc)
## # A tsibble: 4 x 6 [1Q]
## # Key:       .model [1]
##   .model    Time            count .mean                   `80%`
##   <chr>    <qtr>           <dist> <dbl>                  <hilo>
## 1 Naive  1959 Q4 N(7600, 1071111)  7600 [6273.665,  8926.335]80
## 2 Naive  1960 Q1 N(7600, 2142222)  7600 [5724.279,  9475.721]80
## 3 Naive  1960 Q2 N(7600, 3213333)  7600 [5302.720,  9897.280]80
## 4 Naive  1960 Q3 N(7600, 4284444)  7600 [4947.329, 10252.671]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
pd_nfc %>%
  autoplot(pd_train) +
  autolayer(
    filter_index(pd_test),
    colour = "black"
  )
## Plot variable not specified, automatically selected `.vars = count`

gg_tsresiduals(pd_naive)
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).

This is much better than the mean forecast. There is less autocorrelation, the residuals are closer to normally distributed, and they resemble white noise more closely. A naive forecast works well when the data are essentially a random walk, and mine appear to be one.

pd_snaive <- pd_train %>% 
  model(`Seasonal naive`=SNAIVE(count)
  )
pd_snfc <- pd_snaive %>% 
  forecast(h=4)
hilo(pd_snfc)
## # A tsibble: 4 x 6 [1Q]
## # Key:       .model [1]
##   .model            Time            count .mean                    `80%`
##   <chr>            <qtr>           <dist> <dbl>                   <hilo>
## 1 Seasonal naive 1959 Q4 N(4600, 9148235)  4600 [ 723.8128,  8476.187]80
## 2 Seasonal naive 1960 Q1 N(6400, 9148235)  6400 [2523.8128, 10276.187]80
## 3 Seasonal naive 1960 Q2 N(7600, 9148235)  7600 [3723.8128, 11476.187]80
## 4 Seasonal naive 1960 Q3 N(7600, 9148235)  7600 [3723.8128, 11476.187]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
pd_snfc %>%
  autoplot(pd_train) +
  autolayer(
    filter_index(pd_test),
    colour = "black"
  )
## Plot variable not specified, automatically selected `.vars = count`

gg_tsresiduals(pd_snaive)
## Warning: Removed 4 row(s) containing missing values (geom_path).
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing non-finite values (stat_bin).

This is also a poor forecast. The SNAIVE method leaves significant autocorrelation and residuals that are far from normally distributed; the residuals essentially reproduce the original series.

pd_drift <- pd_train %>% 
  model(RW(count~drift()))
pd_dfc <- pd_drift %>% 
  forecast(h=4)
hilo(pd_dfc)
## # A tsibble: 4 x 6 [1Q]
## # Key:       .model [1]
##   .model                 Time            count .mean                   `80%`
##   <chr>                 <qtr>           <dist> <dbl>                  <hilo>
## 1 RW(count ~ drift()) 1959 Q4 N(7674, 1085730) 7674. [6338.718,  9009.430]80
## 2 RW(count ~ drift()) 1960 Q1 N(7748, 2211673) 7748. [5842.264,  9654.032]80
## 3 RW(count ~ drift()) 1960 Q2 N(7822, 3377827) 7822. [5466.876, 10177.569]80
## 4 RW(count ~ drift()) 1960 Q3 N(7896, 4584194) 7896. [5152.401, 10640.192]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
pd_dfc %>% 
  autoplot(pd_train)+
  autolayer(
    filter_index(pd_test),
    colour='black'
  )
## Plot variable not specified, automatically selected `.vars = count`

gg_tsresiduals(pd_drift)
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).

This is much better than the mean and SNAIVE forecasts. There is less autocorrelation, the residuals are closer to normally distributed, and they resemble white noise more closely. Like the naive method, the drift method (a random walk with drift) suits data that behave like a random walk, and mine appear to.

pd_model <- pd_train %>% 
  model(TSLM(count ~ trend()+season())) %>% 
  report()
## Series: count 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4989.5 -3488.7  -185.3  3356.8  5701.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5354.68    1305.31   4.102 0.000151 ***
## trend()          49.83      31.50   1.582 0.120017    
## season()year2   450.17    1400.65   0.321 0.749245    
## season()year3   686.06    1401.72   0.489 0.626668    
## season()year4   234.79    1427.32   0.164 0.870004    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3705 on 50 degrees of freedom
## Multiple R-squared: 0.05374, Adjusted R-squared: -0.02196
## F-statistic: 0.7099 on 4 and 50 DF, p-value: 0.58903
pd_modfc <- pd_model %>%
  forecast(h=4)
hilo(pd_modfc)
## # A tsibble: 4 x 6 [1Q]
## # Key:       .model [1]
##   .model                      Time            count .mean                  `80%`
##   <chr>                      <qtr>           <dist> <dbl>                 <hilo>
## 1 TSLM(count ~ trend() + … 1959 Q4 N(8380, 1.6e+07) 8380. [3324.577, 13434.94]80
## 2 TSLM(count ~ trend() + … 1960 Q1 N(8195, 1.6e+07) 8195. [3133.171, 13256.42]80
## 3 TSLM(count ~ trend() + … 1960 Q2 N(8695, 1.6e+07) 8695. [3633.171, 13756.42]80
## 4 TSLM(count ~ trend() + … 1960 Q3 N(8981, 1.6e+07) 8981. [3918.885, 14042.13]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
pd_modfc %>% 
  autoplot(pd_train) + 
  autolayer(
    filter_index(pd_test),
    colour='black'
  )
## Plot variable not specified, automatically selected `.vars = count`

gg_tsresiduals(pd_model)

Here I estimated a TSLM with trend and season as the regressors. It shares the same problems as the SNAIVE and mean methods.

augment(pd_mean) %>%
  features(.resid, ljung_box, lag=10, dof=0)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 Mean      179.         0

The mean does not leave white noise residuals.

augment(pd_naive) %>%
  features(.resid, ljung_box, lag=10, dof=0)
## # A tibble: 1 × 3
##   .model lb_stat   lb_pvalue
##   <chr>    <dbl>       <dbl>
## 1 Naive     50.6 0.000000207

The naive does not leave white noise residuals.

augment(pd_snaive) %>%
  features(.resid, ljung_box, lag=10, dof=0)
## # A tibble: 1 × 3
##   .model         lb_stat lb_pvalue
##   <chr>            <dbl>     <dbl>
## 1 Seasonal naive    133.         0

The snaive does not leave white noise residuals.

augment(pd_drift) %>%
  features(.resid, ljung_box, lag=10, dof=0)
## # A tibble: 1 × 3
##   .model              lb_stat   lb_pvalue
##   <chr>                 <dbl>       <dbl>
## 1 RW(count ~ drift())    50.6 0.000000207

The drift does not leave white noise residuals.

augment(pd_model) %>%
  features(.resid, ljung_box, lag=10, dof=50)
## Warning in pchisq(STATISTIC, lag - fitdf): NaNs produced
## # A tibble: 1 × 3
##   .model                           lb_stat lb_pvalue
##   <chr>                              <dbl>     <dbl>
## 1 TSLM(count ~ trend() + season())    182.       NaN
augment(pd_model) %>%
  features(.resid, ljung_box, lag=10, dof=0)
## # A tibble: 1 × 3
##   .model                           lb_stat lb_pvalue
##   <chr>                              <dbl>     <dbl>
## 1 TSLM(count ~ trend() + season())    182.         0

I run into the same degrees-of-freedom issue as with my earlier TSLM: with dof = 50 the chi-square degrees of freedom are negative, so the p-value is NaN. With dof = 0 the test clearly rejects white noise residuals, which was already obvious from the residual plot.

pd_fit <- pd_train %>%
  model(
    Mean = MEAN(count),
    Naive = NAIVE(count),
    Seasonal_naive = SNAIVE(count),
    Drift = RW(count ~ drift()),
    tslm1=TSLM(count ~ trend()+season())
    
  )
pd_fc <- pd_fit %>%
  forecast(h = 4)
accuracy(pd_fit)
## # A tibble: 5 × 10
##   .model         .type           ME  RMSE   MAE      MPE  MAPE  MASE RMSSE  ACF1
##   <chr>          <chr>        <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mean           Training -4.13e-13 3631. 3162. -49.4     77.7 1.24  1.20  0.952
## 2 Naive          Training  7.41e+ 1 1035.  807.  -0.0431  12.7 0.318 0.342 0.539
## 3 Seasonal_naive Training  2.67e+ 2 3025. 2541.  -7.96    43.0 1     1     0.907
## 4 Drift          Training -2.23e-13 1032.  816.  -1.59    13.2 0.321 0.341 0.539
## 5 tslm1          Training -1.82e-13 3532. 3123. -43.7     71.8 1.23  1.17  0.960
accuracy(pd_fc, pd_test)
## # A tibble: 5 × 10
##   .model         .type     ME  RMSE   MAE    MPE  MAPE  MASE RMSSE  ACF1
##   <chr>          <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Drift          Test   -535.  655.  535.  -7.61  7.61   NaN   NaN 0.208
## 2 Mean           Test    155.  334.  303.   1.98  4.14   NaN   NaN 0.193
## 3 Naive          Test   -350   458.  350   -5.01  5.01   NaN   NaN 0.193
## 4 Seasonal_naive Test    700  1643. 1300    8.92 17.6    NaN   NaN 0.227
## 5 tslm1          Test  -1312. 1434. 1312. -18.5  18.5    NaN   NaN 0.254

On the training data, the naive method fits best; on the test set, the mean method performs best. Given that my data appear to be a random walk and any attempt to forecast them is probably misguided, I choose the naive method, and it helps that it fit my data best.
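
To rank the methods directly, the test-set accuracy table can be sorted by RMSE (a minimal sketch):

accuracy(pd_fc, pd_test) %>% arrange(RMSE)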

fullnaive <- pd %>% 
  model(`Naive`=NAIVE(count))
boots <- fullnaive %>%
  forecast(h=4,bootstrap=TRUE,times=5000)
boots %>% 
  autoplot(pd)

Since normally distributed forecast errors are not a safe assumption here, I bootstrapped 5,000 simulated sample paths to obtain these prediction intervals for the naive forecast.
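
The bootstrapped intervals come from simulated future sample paths; a few of those paths could be inspected directly with fabletools' generate() (a sketch):

fullnaive %>% 
  generate(h = 4, times = 5, bootstrap = TRUE)   # five simulated paths, innovations resampled from the residuals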

pd_cv <- pd %>%
  stretch_tsibble(.init = 5, .step = 1) %>% 
  relocate(Time, .id)
pd_cv %>%
  model(`Naive`=NAIVE(count)) %>%
  forecast(h = 4) %>%
  accuracy(pd)
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
## 4 observations are missing between 1961 Q1 and 1961 Q4
## # A tibble: 1 × 10
##   .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Naive  Test   245. 2164. 1658.  1.33  23.6 0.683 0.740 0.780
pd_cv %>% 
  model(RW(count ~ drift())) %>% 
  forecast(h=4) %>% 
  accuracy(pd)
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
## 4 observations are missing between 1961 Q1 and 1961 Q4
## # A tibble: 1 × 10
##   .model              .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>               <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 RW(count ~ drift()) Test  -50.4 2366. 1862.  1.04  28.9 0.767 0.809 0.778

The only two serious competitors were the drift and naive methods, and the naive method won out in the cross-validation comparison.

hilo(boots,95)
## # A tsibble: 4 x 5 [1Q]
## # Key:       .model [1]
##   .model    Time        count .mean                   `95%`
##   <chr>    <qtr>       <dist> <dbl>                  <hilo>
## 1 Naive  1961 Q1 sample[5000] 6371. [4952.542,  8752.542]95
## 2 Naive  1961 Q2 sample[5000] 6354. [3905.085,  9305.085]95
## 3 Naive  1961 Q3 sample[5000] 6351. [3457.627, 10057.627]95
## 4 Naive  1961 Q4 sample[5000] 6370. [3010.169, 10610.169]95

My official forecast is contained in the table above, with the prediction intervals.