library(fpp3)
## ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
## ✔ tibble 3.1.8 ✔ tsibble 1.1.2
## ✔ dplyr 1.0.9 ✔ tsibbledata 0.4.0
## ✔ tidyr 1.2.0 ✔ feasts 0.2.2
## ✔ lubridate 1.8.0 ✔ fable 0.3.1
## ✔ ggplot2 3.3.6
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(readxl)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ readr 2.1.2 ✔ stringr 1.4.0
## ✔ purrr 0.3.4 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::as.difftime() masks base::as.difftime()
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks lubridate::intersect(), base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks lubridate::setdiff(), base::setdiff()
## ✖ tsibble::union() masks lubridate::union(), base::union()
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
data_eco <- readxl::read_xlsx("C:\\Users\\PythonAcct\\Downloads\\Table (1).xlsx")
This loads the data.
new_eco <- data_eco %>%
  select(-GeoFips, -GeoName, -LineCode) %>%
  pivot_longer(!Description, names_to = "Time", values_to = "count") %>%
  pivot_wider(names_from = Description, values_from = count) %>%
  select(-`Current dollar statistics (millions of dollars)`, -Employment)
new_eco
## # A tibble: 23 × 5
## Time GDP PI PCE TE
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1999 5.4 5.9 7.5 1.9
## 2 2000 8.6 9 8.6 3.2
## 3 2001 4.7 6.3 5.6 1.2
## 4 2002 2.1 1.3 3.6 0.1
## 5 2003 5.5 4 4.8 0.7
## 6 2004 9.6 4.9 6.3 1.7
## 7 2005 9 8.5 7.8 3.3
## 8 2006 10.8 9.8 6.7 3.8
## 9 2007 8 6.2 6.2 3.8
## 10 2008 4.5 9.9 4.5 2.6
## # … with 13 more rows
## # ℹ Use `print(n = ...)` to see more rows
The dataset contains several extraneous columns, so first I dropped those. pivot_longer() reshapes the data so the year labels form a Time column, and pivot_wider() then spreads the values back out with one column per series. Finally, I dropped the broader categories that were unnecessary and contained nothing but NA values.
new_eco$Time = as.numeric(as.character(new_eco$Time))
As it stood, the Time column was stored as character, which is not usable as a tsibble index, so this line converts it to numeric.
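For reference, the same conversion can be written inside a mutate() call (a sketch of an equivalent alternative):
# Convert the character Time column to numeric so it can serve as
# an annual tsibble index (same effect as the line above).
new_eco <- new_eco %>%
  mutate(Time = as.numeric(Time))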
teco <- new_eco %>%
as_tsibble(index=Time)
teco
## # A tsibble: 23 x 5 [1Y]
## Time GDP PI PCE TE
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1999 5.4 5.9 7.5 1.9
## 2 2000 8.6 9 8.6 3.2
## 3 2001 4.7 6.3 5.6 1.2
## 4 2002 2.1 1.3 3.6 0.1
## 5 2003 5.5 4 4.8 0.7
## 6 2004 9.6 4.9 6.3 1.7
## 7 2005 9 8.5 7.8 3.3
## 8 2006 10.8 9.8 6.7 3.8
## 9 2007 8 6.2 6.2 3.8
## 10 2008 4.5 9.9 4.5 2.6
## # … with 13 more rows
## # ℹ Use `print(n = ...)` to see more rows
With a numeric index, the tibble can now be coerced to a tsibble with an annual interval.
ggpairs(new_eco %>% select(-Time))
GDP and Personal Income have a correlation of 0.668, which is fairly strong and positive, meaning the two tend to increase together. This correlation is statistically significant at the 99.9% confidence level.
GDP and Personal Consumption Expenditures have a correlation of 0.695, which is fairly strong and positive, so they also tend to increase together. This correlation is statistically significant at the 99.9% confidence level.
GDP and Total Employment have a correlation of 0.505, which is of moderate strength and positive, so they tend to increase together. This correlation is statistically significant at the 95% confidence level.
Personal Income and Personal Consumption Expenditures have a correlation of 0.519, which is of moderate strength and positive, so they tend to increase together. This correlation is statistically significant at the 95% confidence level.
Personal Income and Total Employment have a correlation of 0.578, which is of moderate strength and positive, so they tend to increase together. This correlation is statistically significant at the 99% confidence level.
Personal Consumption Expenditures and Total Employment have a correlation of 0.513, which is of moderate strength and positive, so they tend to increase together. This correlation is statistically significant at the 95% confidence level.
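As a quick numeric cross-check of the ggpairs panel (a sketch), the same correlation matrix can be computed directly:
# Pairwise correlations among the four percent-change series,
# rounded to three decimals.
new_eco %>%
  select(-Time) %>%
  cor() %>%
  round(3)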
ecom <- teco %>% model(lm=TSLM(PCE ~ GDP+PI+TE))
This line of code estimates a linear model of percent change in Personal Consumption Expenditures as a function of the percent changes in GDP, Personal Income, and Total Employment.
report(ecom)
## Series: PCE
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6825 -0.9657 0.4163 1.4112 3.9410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.122693 1.153561 0.973 0.3427
## GDP 0.487332 0.182650 2.668 0.0152 *
## PI 0.005509 0.234792 0.023 0.9815
## TE 0.553992 0.510214 1.086 0.2912
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.061 on 19 degrees of freedom
## Multiple R-squared: 0.5182, Adjusted R-squared: 0.4421
## F-statistic: 6.812 on 3 and 19 DF, p-value: 0.0026339
The results of that model are printed here.
For a one percent increase in GDP, Personal Consumption Expenditures are expected to increase by 0.487 percent, holding the other variables constant. This effect is statistically significant at the 95% confidence level.
For a one percent increase in Personal Income, Personal Consumption Expenditures are expected to increase by only 0.0055 percent. This effect is NOT significant at conventional levels.
For a one percent increase in Total Employment, Personal Consumption Expenditures are expected to increase by 0.554 percent. This effect is NOT statistically significant at conventional levels.
The intercept of 1.123 is the expected percent change in consumption when all three regressors are zero; it is not statistically significant at conventional levels either.
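For reference, the same coefficient estimates, standard errors, and p-values can be pulled into a tibble from the fitted mable (a sketch):
# Coefficient table for the fitted TSLM model in tidy form,
# plus model-level statistics such as R-squared.
tidy(ecom)
glance(ecom)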
augment(ecom) %>%
ggplot(aes(x = Time)) +
geom_line(aes(y = PCE, colour = "Data")) +
geom_line(aes(y = .fitted, colour = "Fitted")) +
scale_colour_manual(
values = c(Data = "black", Fitted = "#D55E00")
) +
labs(y = "% Change in Consumption",
title = "%Change in Personal Consumption Expenditures: Fitted Vs Actual") +
guides(colour = guide_legend(title = "Series"))
This plot compares the fitted values with the actual series. The fit tracks the data reasonably well, but it is far from perfect.
augment(ecom) %>%
autoplot(.resid)
Here is a plot of the residuals from my model. They are fairly large in magnitude but show no obvious pattern.
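A fuller set of residual diagnostics (time plot, ACF, histogram) is available in one display; this is only a sketch of an alternative check:
# Residual time plot, autocorrelation function, and histogram
# for the regression model.
ecom %>% gg_tsresiduals()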
augment(ecom) %>%
features(.resid, ljung_box, lag=10, dof=0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 lm 6.95 0.730
A Ljung-Box test with dof = 0 gives a p-value of 0.73, so the residuals cannot be distinguished from white noise.
augment(ecom) %>%
features(.resid, ljung_box, lag=10, dof=19)
## Warning in pchisq(STATISTIC, lag - fitdf): NaNs produced
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 lm 6.95 NaN
The regression output reports 19 residual degrees of freedom, but that is not what the dof argument expects: it should be the number of parameters estimated by the model, and it has to be smaller than the chosen lag. With dof = 19 and lag = 10, lag - dof is negative, so pchisq() returns NaN.
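One reasonable (though debatable) choice is to set dof to the number of estimated coefficients, four here including the intercept, which keeps lag - dof positive; a sketch:
# Ljung-Box test on the regression residuals, adjusting by the
# 4 estimated coefficients rather than the 19 residual df.
augment(ecom) %>%
  features(.resid, ljung_box, lag = 10, dof = 4)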
constant_scenarios <- scenarios(
Increase = new_data(teco, 3) %>%
mutate(GDP=1, PI=1, TE=1),
names_to = "Scenario")
fc <- forecast(ecom, new_data = constant_scenarios)
hilo(fc)
## # A tsibble: 3 x 10 [1Y]
## # Key: Scenario, .model [1]
## Scenario .model Time PCE .mean GDP PI TE
## <chr> <chr> <dbl> <dist> <dbl> <dbl> <dbl> <dbl>
## 1 Increase lm 2022 N(2.2, 5.1) 2.17 1 1 1
## 2 Increase lm 2023 N(2.2, 5.1) 2.17 1 1 1
## 3 Increase lm 2024 N(2.2, 5.1) 2.17 1 1 1
## # … with 2 more variables: `80%` <hilo>, `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
teco %>%
autoplot(PCE) +
autolayer(fc) +
labs(title = "US consumption", y = "% change")
This creates a scenario in which each of the three regressors increases by one percent, held at that rate for three periods. A forecast is then produced from my regression model, with the prediction intervals shown alongside it.
For Scenario 2, I tried to specify successive (cumulative) percent increases but could not get it to work; one possible specification is sketched below. Additionally, no values for GDP, Personal Income, and Total Employment were specified, so I arbitrarily chose zero for the variables not being changed.
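One way the cumulative version could have been specified is to give mutate() a vector with one value per forecast period; this is only a sketch and is not what is used below:
# Hypothetical cumulative scenario: the growth rates step up by one
# percentage point in each of the three forecast years.
cumulative_scenario <- scenarios(
  Cumulative = new_data(teco, 3) %>%
    mutate(GDP = c(1, 2, 3), PI = c(1, 2, 3), TE = c(1, 2, 3)),
  names_to = "Scenario")
forecast(ecom, new_data = cumulative_scenario) %>% hilo()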
un_scenarios <- scenarios(
unpercent = new_data(teco, 3) %>%
mutate(PI=1,GDP=0,TE=0),
names_to="Scenario")
fc1 <- forecast(ecom, new_data = un_scenarios)
hilo(fc1)
## # A tsibble: 3 x 10 [1Y]
## # Key: Scenario, .model [1]
## Scenario .model Time PCE .mean PI GDP TE
## <chr> <chr> <dbl> <dist> <dbl> <dbl> <dbl> <dbl>
## 1 unpercent lm 2022 N(1.1, 5.5) 1.13 1 0 0
## 2 unpercent lm 2023 N(1.1, 5.5) 1.13 1 0 0
## 3 unpercent lm 2024 N(1.1, 5.5) 1.13 1 0 0
## # … with 2 more variables: `80%` <hilo>, `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
teco %>%
autoplot(PCE) +
autolayer(fc1) +
labs(title = "US consumption", y = "% change")
This is a forecast in which Personal Income grows by 1 percent while the other variables stay at zero for three years. Because Personal Income was not significant in my model, it has almost no effect: the point forecast is essentially just the intercept term.
neg_scenarios <- scenarios(
negpercent=new_data(teco,3) %>%
mutate(PI=-.5,GDP=0,TE=0),
names_to="scenario")
fc2 <- forecast(ecom, new_data = neg_scenarios)
hilo(fc2)
## # A tsibble: 3 x 10 [1Y]
## # Key: scenario, .model [1]
## scenario .model Time PCE .mean PI GDP TE
## <chr> <chr> <dbl> <dist> <dbl> <dbl> <dbl> <dbl>
## 1 negpercent lm 2022 N(1.1, 5.7) 1.12 -0.5 0 0
## 2 negpercent lm 2023 N(1.1, 5.7) 1.12 -0.5 0 0
## 3 negpercent lm 2024 N(1.1, 5.7) 1.12 -0.5 0 0
## # … with 2 more variables: `80%` <hilo>, `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
teco %>%
autoplot(PCE) +
autolayer(fc2) +
labs(title = "US consumption", y = "% change")
This is a forecast in which Personal Income declines by 0.5 percent while the other variables stay at zero for three years. Because Personal Income was not significant in my model, it has almost no effect: the point forecast is essentially just the intercept term.
eight_scenarios <- scenarios(
eightpercent=new_data(teco,3) %>%
mutate(PI=.8,GDP=0,TE=0),
names_to="scenario")
fc3 <- forecast(ecom, new_data = eight_scenarios)
hilo(fc3)
## # A tsibble: 3 x 10 [1Y]
## # Key: scenario, .model [1]
## scenario .model Time PCE .mean PI GDP TE
## <chr> <chr> <dbl> <dist> <dbl> <dbl> <dbl> <dbl>
## 1 eightpercent lm 2022 N(1.1, 5.5) 1.13 0.8 0 0
## 2 eightpercent lm 2023 N(1.1, 5.5) 1.13 0.8 0 0
## 3 eightpercent lm 2024 N(1.1, 5.5) 1.13 0.8 0 0
## # … with 2 more variables: `80%` <hilo>, `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
teco %>%
autoplot(PCE) +
autolayer(fc3) +
labs(title = "US consumption", y = "% change")
This is a forecast in which Personal Income grows by 0.8 percent while the other variables stay at zero for three years. Because Personal Income was not significant in my model, it has almost no effect: the point forecast is basically just the intercept term.
all_scenarios <- scenarios(
opercent=new_data(teco,3) %>%
mutate(PI=1,GDP=0,TE=0),
npercent=new_data(teco,3) %>%
mutate(PI=-.5,GDP=0,TE=0),
epercent=new_data(teco,3) %>%
mutate(PI=.8,GDP=0,TE=0),
names_to = "scenario"
)
fc4 <- forecast(ecom, new_data = all_scenarios)
hilo(fc4)
## # A tsibble: 9 x 10 [1Y]
## # Key: scenario, .model [3]
## scenario .model Time PCE .mean PI GDP TE
## <chr> <chr> <dbl> <dist> <dbl> <dbl> <dbl> <dbl>
## 1 opercent lm 2022 N(1.1, 5.5) 1.13 1 0 0
## 2 opercent lm 2023 N(1.1, 5.5) 1.13 1 0 0
## 3 opercent lm 2024 N(1.1, 5.5) 1.13 1 0 0
## 4 npercent lm 2022 N(1.1, 5.7) 1.12 -0.5 0 0
## 5 npercent lm 2023 N(1.1, 5.7) 1.12 -0.5 0 0
## 6 npercent lm 2024 N(1.1, 5.7) 1.12 -0.5 0 0
## 7 epercent lm 2022 N(1.1, 5.5) 1.13 0.8 0 0
## 8 epercent lm 2023 N(1.1, 5.5) 1.13 0.8 0 0
## 9 epercent lm 2024 N(1.1, 5.5) 1.13 0.8 0 0
## # … with 2 more variables: `80%` <hilo>, `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
teco %>%
autoplot(PCE) +
autolayer(fc4) +
labs(title = "US consumption", y = "% change")
Here are all three scenarios together. Personal Income has virtually no effect and is not significant, so each forecast essentially reduces to the intercept term. Even if these values were applied successively, the forecast would remain roughly constant at about 1.12%, because that is the estimated intercept.
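As a sanity check (a sketch using the rounded coefficients from the report above), the scenario point forecasts can be reproduced by hand:
# Expected PCE change when PI = 1 and GDP = TE = 0:
# intercept + 0.487332 * GDP + 0.005509 * PI + 0.553992 * TE
1.122693 + 0.487332 * 0 + 0.005509 * 1 + 0.553992 * 0
# approximately 1.128, matching the 1.13 shown in the forecast tables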
pd <- readxl::read_xlsx("C:\\Users\\PythonAcct\\Downloads\\M3_exam1.xlsx") %>%
filter(`Student Name` =="Brandon") %>%
pivot_longer(!`Student Name`, names_to = "Time", values_to="count") %>%
select(-`Student Name`) %>%
na.omit() %>%
select(count)%>%
mutate(Time=yearquarter(-96:-37)) %>%
as_tsibble(index=Time)
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `1943 Q1` -> `1943 Q1...12`
## • `1943 Q1` -> `1943 Q1...13`
## • `1944 Q1` -> `1944 Q1...17`
## • `1944 Q1` -> `1944 Q1...18`
## • `1945 Q1` -> `1945 Q1...22`
## • `1945 Q1` -> `1945 Q1...23`
## • `1946 Q1` -> `1946 Q1...27`
## • `1946 Q1` -> `1946 Q1...28`
## • `1947 Q1` -> `1947 Q1...32`
## • `1947 Q1` -> `1947 Q1...33`
## • `1948 Q1` -> `1948 Q1...37`
## • `1948 Q1` -> `1948 Q1...38`
## • `1949 Q1` -> `1949 Q1...42`
## • `1949 Q1` -> `1949 Q1...43`
## • `1950 Q1` -> `1950 Q1...47`
## • `1950 Q1` -> `1950 Q1...48`
## • `1951 Q1` -> `1951 Q1...52`
## • `1951 Q1` -> `1951 Q1...53`
## • `1952 Q1` -> `1952 Q1...57`
## • `1952 Q1` -> `1952 Q1...58`
## • `1953 Q1` -> `1953 Q1...62`
## • `1953 Q1` -> `1953 Q1...63`
## • `1954 Q1` -> `1954 Q1...67`
## • `1954 Q1` -> `1954 Q1...68`
## • `1955 Q1` -> `1955 Q1...72`
## • `1955 Q1` -> `1955 Q1...73`
## • `1956 Q1` -> `1956 Q1...77`
## • `1956 Q1` -> `1956 Q1...78`
pd
## # A tsibble: 60 x 2 [1Q]
## count Time
## <dbl> <qtr>
## 1 3600 1946 Q1
## 2 3800 1946 Q2
## 3 3000 1946 Q3
## 4 2200 1946 Q4
## 5 1600 1947 Q1
## 6 1600 1947 Q2
## 7 1400 1947 Q3
## 8 1400 1947 Q4
## 9 1800 1948 Q1
## 10 2400 1948 Q2
## # … with 50 more rows
## # ℹ Use `print(n = ...)` to see more rows
This loads the dataset, and I then filter by my name to pull out my assigned series. I cleaned the file up a bit in Excel first to make it workable. Pivoting turns the row of quarterly values into a column, the Student Name column is dropped because it is no longer needed, the NA values are omitted, and the extraneous columns are removed. The yearquarter(-96:-37) line looks odd, but yearquarter() with a numeric argument counts quarters from 1970 Q1, so -96 through -37 correspond to 1946 Q1 through 1960 Q4. Finally, I converted the result to a tsibble.
The tidying of the data is completed.
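As a sketch of an arguably clearer alternative to the -96:-37 offsets, yearquarter() also accepts a starting quarter plus an integer offset:
# yearquarter() with a number counts quarters from 1970 Q1, so -96 is 1946 Q1.
yearquarter(-96)
# The same 60-quarter index, written from an explicit starting quarter.
yearquarter("1946 Q1") + 0:59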
pd %>%
autoplot(count)+
labs(y="Count",title="Demographic")
The next step is to visualize the data. The series looks largely random, with little apparent trend or seasonality, and a transformation does not appear to be needed.
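To back up that visual impression (a sketch), the seasonality and autocorrelation could also be examined directly:
# Seasonal plot: systematic quarterly patterns would show up as
# similarly shaped lines across years.
pd %>% gg_season(count)
# Sample autocorrelations of the series.
pd %>% ACF(count) %>% autoplot()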
pd %>%
model(STL(count~trend(window=4)+season(window=11),robust=TRUE)) %>%
components() %>%
autoplot()
This is the best I could get the STL decomposition to look. The trend and seasonal components capture very little, so the decomposition appears largely unnecessary here.
pd_train <- pd %>%
slice(1:55)
pd_test <- pd %>%
slice(56:60)
Here I split the data into a training set (the first 55 quarters) and a test set (the last 5).
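The same split can be written with filter_index() on the quarter index rather than row positions (a sketch; rows 1-55 run from 1946 Q1 through 1959 Q3):
# Training data through 1959 Q3, test data from 1959 Q4 onward.
pd_train_alt <- pd %>% filter_index(~ "1959 Q3")
pd_test_alt  <- pd %>% filter_index("1959 Q4" ~ .)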
pd_mean <- pd_train %>%
  model(Mean = MEAN(count))
pd_mfc <- pd_mean %>%
forecast(h = 4)
hilo(pd_mfc)
## # A tsibble: 4 x 6 [1Q]
## # Key: .model [1]
## .model Time count .mean `80%`
## <chr> <qtr> <dist> <dbl> <hilo>
## 1 Mean 1959 Q4 N(7095, 1.4e+07) 7095. [2355.387, 11833.7]80
## 2 Mean 1960 Q1 N(7095, 1.4e+07) 7095. [2355.387, 11833.7]80
## 3 Mean 1960 Q2 N(7095, 1.4e+07) 7095. [2355.387, 11833.7]80
## 4 Mean 1960 Q3 N(7095, 1.4e+07) 7095. [2355.387, 11833.7]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
pd_mfc %>%
autoplot(pd_train) +
autolayer(
filter_index(pd_test),
colour = "black"
)
## Plot variable not specified, automatically selected `.vars = count`
gg_tsresiduals(pd_mean)
This is a poor forecast. The mean method leaves significant autocorrelation, residuals that are far from normally distributed, and a residual plot that largely reproduces the original series.
pd_naive <- pd_train %>%
model(`Naive`=NAIVE(count))
pd_nfc <- pd_naive %>%
forecast(h=4)
hilo(pd_nfc)
## # A tsibble: 4 x 6 [1Q]
## # Key: .model [1]
## .model Time count .mean `80%`
## <chr> <qtr> <dist> <dbl> <hilo>
## 1 Naive 1959 Q4 N(7600, 1071111) 7600 [6273.665, 8926.335]80
## 2 Naive 1960 Q1 N(7600, 2142222) 7600 [5724.279, 9475.721]80
## 3 Naive 1960 Q2 N(7600, 3213333) 7600 [5302.720, 9897.280]80
## 4 Naive 1960 Q3 N(7600, 4284444) 7600 [4947.329, 10252.671]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
pd_nfc %>%
autoplot(pd_train) +
autolayer(
filter_index(pd_test),
colour = "black"
)
## Plot variable not specified, automatically selected `.vars = count`
gg_tsresiduals(pd_naive)
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).
This is much better than the mean forecast: there is less autocorrelation, the residuals look closer to normally distributed, and they more closely resemble white noise. A naive forecast works well when the data behave like a random walk, and mine appear to.
pd_snaive <- pd_train %>%
  model(`Seasonal naive` = SNAIVE(count))
pd_snfc <- pd_snaive %>%
forecast(h=4)
hilo(pd_snfc)
## # A tsibble: 4 x 6 [1Q]
## # Key: .model [1]
## .model Time count .mean `80%`
## <chr> <qtr> <dist> <dbl> <hilo>
## 1 Seasonal naive 1959 Q4 N(4600, 9148235) 4600 [ 723.8128, 8476.187]80
## 2 Seasonal naive 1960 Q1 N(6400, 9148235) 6400 [2523.8128, 10276.187]80
## 3 Seasonal naive 1960 Q2 N(7600, 9148235) 7600 [3723.8128, 11476.187]80
## 4 Seasonal naive 1960 Q3 N(7600, 9148235) 7600 [3723.8128, 11476.187]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
pd_snfc %>%
autoplot(pd_train) +
autolayer(
filter_index(pd_test),
colour = "black"
)
## Plot variable not specified, automatically selected `.vars = count`
gg_tsresiduals(pd_snaive)
## Warning: Removed 4 row(s) containing missing values (geom_path).
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing non-finite values (stat_bin).
This is also a poor forecast. The seasonal naive method leaves significant autocorrelation, residuals that are far from normally distributed, and a residual plot that largely reproduces the original series.
pd_drift <- pd_train %>%
model(RW(count~drift()))
pd_dfc <- pd_drift %>%
forecast(h=4)
hilo(pd_dfc)
## # A tsibble: 4 x 6 [1Q]
## # Key: .model [1]
## .model Time count .mean `80%`
## <chr> <qtr> <dist> <dbl> <hilo>
## 1 RW(count ~ drift()) 1959 Q4 N(7674, 1085730) 7674. [6338.718, 9009.430]80
## 2 RW(count ~ drift()) 1960 Q1 N(7748, 2211673) 7748. [5842.264, 9654.032]80
## 3 RW(count ~ drift()) 1960 Q2 N(7822, 3377827) 7822. [5466.876, 10177.569]80
## 4 RW(count ~ drift()) 1960 Q3 N(7896, 4584194) 7896. [5152.401, 10640.192]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
pd_dfc %>%
autoplot(pd_train)+
autolayer(
filter_index(pd_test),
colour='black'
)
## Plot variable not specified, automatically selected `.vars = count`
gg_tsresiduals(pd_drift)
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).
This is much better than the mean and seasonal naive forecasts: there is less autocorrelation, the residuals look closer to normally distributed, and they more closely resemble white noise. The drift method is essentially the naive method with an average trend added, so it also suits data that behave like a random walk.
pd_model <- pd_train %>%
model(TSLM(count ~ trend()+season())) %>%
report()
## Series: count
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -4989.5 -3488.7 -185.3 3356.8 5701.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5354.68 1305.31 4.102 0.000151 ***
## trend() 49.83 31.50 1.582 0.120017
## season()year2 450.17 1400.65 0.321 0.749245
## season()year3 686.06 1401.72 0.489 0.626668
## season()year4 234.79 1427.32 0.164 0.870004
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3705 on 50 degrees of freedom
## Multiple R-squared: 0.05374, Adjusted R-squared: -0.02196
## F-statistic: 0.7099 on 4 and 50 DF, p-value: 0.58903
pd_modfc <- pd_model %>%
forecast(h=4)
hilo(pd_modfc)
## # A tsibble: 4 x 6 [1Q]
## # Key: .model [1]
## .model Time count .mean `80%`
## <chr> <qtr> <dist> <dbl> <hilo>
## 1 TSLM(count ~ trend() + … 1959 Q4 N(8380, 1.6e+07) 8380. [3324.577, 13434.94]80
## 2 TSLM(count ~ trend() + … 1960 Q1 N(8195, 1.6e+07) 8195. [3133.171, 13256.42]80
## 3 TSLM(count ~ trend() + … 1960 Q2 N(8695, 1.6e+07) 8695. [3633.171, 13756.42]80
## 4 TSLM(count ~ trend() + … 1960 Q3 N(8981, 1.6e+07) 8981. [3918.885, 14042.13]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
pd_modfc %>%
autoplot(pd_train) +
autolayer(
filter_index(pd_test),
colour='black'
)
## Plot variable not specified, automatically selected `.vars = count`
gg_tsresiduals(pd_model)
Here I estimated a TSLM with trend and season as the predictors. It shares the same problems as the seasonal naive and mean methods.
augment(pd_mean) %>%
features(.resid, ljung_box, lag=10, dof=0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 Mean 179. 0
The mean does not leave white noise residuals.
augment(pd_naive) %>%
features(.resid, ljung_box, lag=10, dof=0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 Naive 50.6 0.000000207
The naive does not leave white noise residuals.
augment(pd_snaive) %>%
features(.resid, ljung_box, lag=10, dof=0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 Seasonal naive 133. 0
The snaive does not leave white noise residuals.
augment(pd_drift) %>%
features(.resid, ljung_box, lag=10, dof=0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 RW(count ~ drift()) 50.6 0.000000207
The drift does not leave white noise residuals.
augment(pd_model) %>%
features(.resid, ljung_box, lag=10, dof=50)
## Warning in pchisq(STATISTIC, lag - fitdf): NaNs produced
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 TSLM(count ~ trend() + season()) 182. NaN
augment(pd_model) %>%
features(.resid, ljung_box, lag=10, dof=0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 TSLM(count ~ trend() + season()) 182. 0
I run into the same degrees-of-freedom issue as with my earlier TSLM model: using the 50 residual degrees of freedom makes lag - dof negative and produces NaN. With dof = 0 the test still clearly rejects white noise residuals, which was already obvious from the residual plot.
pd_fit <- pd_train %>%
model(
Mean = MEAN(count),
Naive = NAIVE(count),
Seasonal_naive = SNAIVE(count),
Drift = RW(count ~ drift()),
tslm1=TSLM(count ~ trend()+season())
)
pd_fc <- pd_fit %>%
forecast(h = 4)
accuracy(pd_fit)
## # A tibble: 5 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mean Training -4.13e-13 3631. 3162. -49.4 77.7 1.24 1.20 0.952
## 2 Naive Training 7.41e+ 1 1035. 807. -0.0431 12.7 0.318 0.342 0.539
## 3 Seasonal_naive Training 2.67e+ 2 3025. 2541. -7.96 43.0 1 1 0.907
## 4 Drift Training -2.23e-13 1032. 816. -1.59 13.2 0.321 0.341 0.539
## 5 tslm1 Training -1.82e-13 3532. 3123. -43.7 71.8 1.23 1.17 0.960
accuracy(pd_fc, pd_test)
## # A tibble: 5 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Drift Test -535. 655. 535. -7.61 7.61 NaN NaN 0.208
## 2 Mean Test 155. 334. 303. 1.98 4.14 NaN NaN 0.193
## 3 Naive Test -350 458. 350 -5.01 5.01 NaN NaN 0.193
## 4 Seasonal_naive Test 700 1643. 1300 8.92 17.6 NaN NaN 0.227
## 5 tslm1 Test -1312. 1434. 1312. -18.5 18.5 NaN NaN 0.254
On training accuracy the naive method is clearly the best, while on the test set the mean method appears best. Given that my data look like a random walk, so that any elaborate attempt to forecast is probably misguided, I choose the naive method; it also helps that it fit the training data best.
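For a visual comparison of the five methods against the held-out quarters, the combined fable can be plotted without intervals (a sketch):
# Overlay all five benchmark forecasts on the full series.
pd_fc %>%
  autoplot(pd, level = NULL) +
  labs(title = "Benchmark forecasts vs. actuals", y = "Count")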
fullnaive <- pd %>%
model(`Naive`=NAIVE(count))
boots <- fullnaive %>%
forecast(h=4,bootstrap=TRUE,times=5000)
boots %>%
autoplot(pd)
Because normally distributed prediction intervals are not a safe assumption here, I bootstrapped 5,000 sample paths to obtain these prediction intervals for the naive forecast.
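The bootstrapped intervals summarise simulated future sample paths; a handful of those paths can be inspected directly (a sketch):
# Simulate five bootstrapped sample paths from the naive model;
# forecast(..., bootstrap = TRUE) aggregates thousands of these.
fullnaive %>%
  generate(h = 4, times = 5, bootstrap = TRUE)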
pd_cv <- pd %>%
stretch_tsibble(.init = 5, .step = 1) %>%
relocate(Time, .id)
pd_cv %>%
model(`Naive`=NAIVE(count)) %>%
forecast(h = 4) %>%
accuracy(pd)
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
## 4 observations are missing between 1961 Q1 and 1961 Q4
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Naive Test 245. 2164. 1658. 1.33 23.6 0.683 0.740 0.780
pd_cv %>%
model(RW(count ~ drift())) %>%
forecast(h=4) %>%
accuracy(pd)
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
## 4 observations are missing between 1961 Q1 and 1961 Q4
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 RW(count ~ drift()) Test -50.4 2366. 1862. 1.04 28.9 0.767 0.809 0.778
The only two real competitors were the drift and naive methods, and the naive method won out in the end under cross-validation.
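The same cross-validated comparison can be done in a single pass by fitting both models to every fold together (a sketch):
# Fit both candidates to each cross-validation fold and compare accuracy.
pd_cv %>%
  model(Naive = NAIVE(count),
        Drift = RW(count ~ drift())) %>%
  forecast(h = 4) %>%
  accuracy(pd)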
hilo(boots,95)
## # A tsibble: 4 x 5 [1Q]
## # Key: .model [1]
## .model Time count .mean `95%`
## <chr> <qtr> <dist> <dbl> <hilo>
## 1 Naive 1961 Q1 sample[5000] 6371. [4952.542, 8752.542]95
## 2 Naive 1961 Q2 sample[5000] 6354. [3905.085, 9305.085]95
## 3 Naive 1961 Q3 sample[5000] 6351. [3457.627, 10057.627]95
## 4 Naive 1961 Q4 sample[5000] 6370. [3010.169, 10610.169]95
My official forecast is contained in the table above, with the prediction intervals.