Load packages and data

# tidyverse attaches dplyr, ggplot2, tidyr, and tibble;
# fpp3 attaches tsibble, fable, feasts, and friends
library(tidyverse)
library(fpp3)
library(ggfortify)
library(moments)
library(zoo)
library(readxl)
library(seasonal)
library(caTools)
library(GGally)

Question 1

Part A

illest_table_eva_ <- read_excel("~/Downloads/illest table eva .xlsx")


fixed_table <- illest_table_eva_ %>%
  pivot_longer(!Description) %>%
  pivot_wider(names_from = Description, values_from = value) %>%
  rename(Year = name) %>%              # first column holds the years
  mutate(Year = as.numeric(Year))

fixxy <- fixed_table %>%
  as_tsibble(index = Year)
fixxy
## # A tsibble: 23 x 5 [1Y]
##     Year   GDP    PI   PCE    TE
##    <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1999   5.4   4.5   5.5   1.1
##  2  2000   5.8   7.6   7.6   1.9
##  3  2001   3     3.7   3.4  -0.6
##  4  2002   2.6   1.2   2.9  -1.3
##  5  2003   3.6   2     4.3  -0.4
##  6  2004   5.6   4     5.2   0.7
##  7  2005   4.8   4.2   5.6   1.2
##  8  2006   5.8   6.6   5.1   1.6
##  9  2007   4.1   5.8   4.8   1.7
## 10  2008   0.7   2.6   2.5  -0.3
## # … with 13 more rows
fixxy %>%
  ggplot(aes(x = Year)) +
  geom_line(aes(y = GDP, color = "GDP")) +
  geom_line(aes(y = PI, color = "PI")) +
  geom_line(aes(y = PCE, color = "PCE")) +
  geom_line(aes(y = TE, color = "TE")) +
  labs(y = "Percentage Change", title = "GDP, Income, Consumption, and Employment") +
  scale_colour_manual(values = c(GDP = "black", PI = "green", PCE = "blue", TE = "pink")) +
  guides(colour = guide_legend(title = NULL))

After gathering my data from the website, I cleaned it slightly in Excel, mainly renaming some variables and removing excess information that I couldn't use in R. I then converted the Year variable to numeric, converted the data into a tsibble with Year as the index, and plotted each series over time to get a better feel for the data.
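The four separate geom_line() calls above work, but the same figure can be drawn more compactly by pivoting to long form first, so ggplot maps one colour per series automatically; a minimal sketch against the same fixxy tsibble:

# Sketch: long format replaces four geom_line() calls and the manual colour scale
fixxy %>%
  pivot_longer(c(GDP, PI, PCE, TE), names_to = "Series", values_to = "Change") %>%
  ggplot(aes(x = Year, y = Change, colour = Series)) +
  geom_line() +
  labs(y = "Percentage Change", title = "GDP, Income, Consumption, and Employment")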

Part B

ggpairs(fixed_table %>% select(-Year))

Part C

GDP and Personal Income have a correlation of 0.557. This is a moderate positive correlation, so the two series tend to rise and fall together. The two stars show that the correlation is statistically significant at the 1% level.

GDP and Personal Consumption Expenditure have a correlation of 0.924. This is a very strong positive correlation, so the two series move closely together. The three stars indicate significance at the 0.1% level.

GDP and Total Employment have a correlation of 0.809. This is a strong positive correlation, so the two series tend to move together. The three stars indicate significance at the 0.1% level.

Personal Income and Personal Consumption Expenditure have a correlation of 0.584. This is a moderate positive correlation, so the two series tend to move together. The two stars indicate significance at the 1% level.

Personal Income and Total Employment have a correlation of 0.568. This is a moderate positive correlation, so the two series tend to move together. The two stars indicate significance at the 1% level.

Personal Consumption Expenditure and Total Employment have a correlation of 0.753. This is a strong positive correlation, so the two series tend to move together. The three stars indicate significance at the 0.1% level.
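For reference, the same numbers can be printed as a plain correlation matrix, which is sometimes easier to scan than the ggpairs panel; a quick sketch:

# Sketch: correlation matrix matching the ggpairs upper panels
fixed_table %>%
  select(-Year) %>%
  cor() %>%
  round(3)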

Part D

ohno <- fixxy %>%
  model(lm = TSLM(PCE ~ GDP + PI + TE))
report(ohno)
## Series: PCE 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68710 -0.73783 -0.04094  0.59836  2.05388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.09343    0.64213  -0.145    0.886    
## GDP          0.97351    0.16482   5.907  1.1e-05 ***
## PI           0.12106    0.12435   0.974    0.343    
## TE          -0.03121    0.27974  -0.112    0.912    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.213 on 19 degrees of freedom
## Multiple R-squared: 0.8602,  Adjusted R-squared: 0.8381
## F-statistic: 38.96 on 3 and 19 DF, p-value: 2.5788e-08

I used the TSLM() function to regress Consumption on GDP, Income, and Employment.

The intercept estimate says that when GDP, Income, and Employment growth are all zero, Consumption changes by -0.09343 percentage points on average. This estimate is not statistically significant at any conventional level (p = 0.886).

The GDP coefficient says that, holding the other predictors fixed, a 1 percentage point increase in GDP is associated with a 0.97351 percentage point increase in Consumption on average. This effect is statistically significant at the 0.1% level (p ≈ 1.1e-05), which is highly significant.

The Income coefficient says that a 1 percentage point increase in Income is associated with a 0.12106 percentage point increase in Consumption on average, but this effect is not statistically significant (p = 0.343).

The Employment coefficient says that a 1 percentage point increase in Employment is associated with a 0.03121 percentage point decrease in Consumption on average, but this effect is not statistically significant (p = 0.912).
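As a sanity check on these estimates, any fitted value can be rebuilt by hand from the regression equation. A minimal sketch using the 1999 observation from the tsibble above (GDP = 5.4, PI = 4.5, TE = 1.1):

# Sketch: fitted PCE for 1999 from the reported coefficients
# PCE_hat = -0.09343 + 0.97351*GDP + 0.12106*PI - 0.03121*TE
-0.09343 + 0.97351 * 5.4 + 0.12106 * 4.5 - 0.03121 * 1.1
# about 5.67; this should match the 1999 .fitted value from augment(ohno)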

Part E

augment(ohno) %>%
  ggplot(aes(x = Year)) +
  geom_line(aes(y = PCE, color = "Data")) +
  geom_line(aes(y = .fitted, color = "Fitted")) +
  scale_color_manual(values = c(Data = "red", Fitted = "blue")) + 
  labs(y = "Consumption", title = "Fitted vs Actual") + 
  guides(color = guide_legend(title = "Series"))

Part F

daddyfit <- fixxy %>%
  model(lm = TSLM(PCE ~ PI + GDP + TE))


theman <- scenarios(
  Increase = new_data(fixxy, 3) %>% mutate(PI = 1, GDP = 1, TE = 1),
  names_to = "Scenario")

damncast <- forecast(daddyfit, new_data = theman)
damncast
## # A fable: 3 x 8 [1Y]
## # Key:     Scenario, .model [1]
##   Scenario .model  Year          PCE .mean    PI   GDP    TE
##   <chr>    <chr>  <dbl>       <dist> <dbl> <dbl> <dbl> <dbl>
## 1 Increase lm      2022 N(0.97, 1.9) 0.970     1     1     1
## 2 Increase lm      2023 N(0.97, 1.9) 0.970     1     1     1
## 3 Increase lm      2024 N(0.97, 1.9) 0.970     1     1     1
fixxy %>%
  autoplot(PCE) +
  autolayer(damncast) +
  labs(title = "PCE forecast", y = "% change")

pi_scenarios <- scenarios(
  Increase = new_data(fixxy, 3) %>% mutate(PI = 1, GDP = 0, TE = 0),
  Decrease = new_data(fixxy, 3) %>% mutate(PI = -0.5, GDP = 0, TE = 0),
  Increase2 = new_data(fixxy, 3) %>% mutate(PI = 0.8, GDP = 0, TE = 0),
  names_to = "Scenario")

pi_fc <- forecast(daddyfit, new_data = pi_scenarios)

pi_fc
## # A fable: 9 x 8 [1Y]
## # Key:     Scenario, .model [3]
##   Scenario  .model  Year            PCE    .mean    PI   GDP    TE
##   <chr>     <chr>  <dbl>         <dist>    <dbl> <dbl> <dbl> <dbl>
## 1 Increase  lm      2022  N(0.028, 1.8)  0.0276    1       0     0
## 2 Increase  lm      2023  N(0.028, 1.8)  0.0276    1       0     0
## 3 Increase  lm      2024  N(0.028, 1.8)  0.0276    1       0     0
## 4 Decrease  lm      2022  N(-0.15, 1.9) -0.154    -0.5     0     0
## 5 Decrease  lm      2023  N(-0.15, 1.9) -0.154    -0.5     0     0
## 6 Decrease  lm      2024  N(-0.15, 1.9) -0.154    -0.5     0     0
## 7 Increase2 lm      2022 N(0.0034, 1.8)  0.00342   0.8     0     0
## 8 Increase2 lm      2023 N(0.0034, 1.8)  0.00342   0.8     0     0
## 9 Increase2 lm      2024 N(0.0034, 1.8)  0.00342   0.8     0     0
fixxy %>%
  autoplot(PCE, color = "black") +
  autolayer(pi_fc, color = "red") + 
  labs(title = "IL consumption", y = "% change")

fixxy %>%
  autoplot(PCE, color = "black") +
  autolayer(pi_fc, color = "red") + 
  autolayer(damncast, color = "blue") +
  labs(title = "IL consumption (Both together with PI)", y = "% change")

The prediction intervals for all of the scenarios are plotted together. Personal Income had very little effect across the scenarios: the point forecasts barely move, and the only scenario whose forecast visibly departs from the baseline is the decrease. Even then, the prediction intervals overlap heavily, so the differences between scenarios are not statistically meaningful.
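The overlap can be checked numerically rather than by eye; a quick sketch pulling 95% intervals out of the scenario fable (the .model key is retained automatically by the tsibble select):

# Sketch: 95% prediction intervals for each Personal Income scenario
pi_fc %>%
  hilo(95) %>%
  select(Scenario, Year, .mean, `95%`)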

Question 2

Cleaning and plotting the data

dat_data <- read_excel("~/Downloads/M3_exam1.xlsx")


Hdawg <- dat_data %>%
  filter(`Student Name` == 'Hunter') %>%
  select(-c(Series, N, NF, Category, `Starting Year`, `Starting Quarter`)) %>%
  pivot_longer(-`Student Name`) %>%
  select(-`Student Name`, -name) %>%
  rename(demographic = value)

Hdawg <- na.omit(Hdawg)  

daddyhunt <- Hdawg %>%
  mutate(Date = yearquarter(-24:23)) %>%  # quarter offsets from 1970 Q1: 1964 Q1 through 1975 Q4
  select(Date, demographic) %>%
  as_tsibble(index = Date)
daddyhunt
daddyhunt
## # A tsibble: 48 x 2 [1Q]
##       Date demographic
##      <qtr>       <dbl>
##  1 1964 Q1        6420
##  2 1964 Q2        6160
##  3 1964 Q3        6300
##  4 1964 Q4        7580
##  5 1965 Q1        7870
##  6 1965 Q2        7700
##  7 1965 Q3        7530
##  8 1965 Q4        8640
##  9 1966 Q1        8470
## 10 1966 Q2        7610
## # … with 38 more rows
autoplot(daddyhunt)
## Plot variable not specified, automatically selected `.vars = demographic`

Test and Train

choochoo <- daddyhunt %>%
  filter(year(Date) <= 1971)
choochoo
## # A tsibble: 32 x 2 [1Q]
##       Date demographic
##      <qtr>       <dbl>
##  1 1964 Q1        6420
##  2 1964 Q2        6160
##  3 1964 Q3        6300
##  4 1964 Q4        7580
##  5 1965 Q1        7870
##  6 1965 Q2        7700
##  7 1965 Q3        7530
##  8 1965 Q4        8640
##  9 1966 Q1        8470
## 10 1966 Q2        7610
## # … with 22 more rows
testies <- daddyhunt %>%
  filter(year(Date) > 1971)
testies
## # A tsibble: 16 x 2 [1Q]
##       Date demographic
##      <qtr>       <dbl>
##  1 1972 Q1        2140
##  2 1972 Q2        2420
##  3 1972 Q3        3340
##  4 1972 Q4        2850
##  5 1973 Q1        3180
##  6 1973 Q2        3880
##  7 1973 Q3        4240
##  8 1973 Q4        4200
##  9 1974 Q1        4640
## 10 1974 Q2        4930
## 11 1974 Q3        4450
## 12 1974 Q4        2490
## 13 1975 Q1        2070
## 14 1975 Q2        2000
## 15 1975 Q3        1650
## 16 1975 Q4        1650

Transformations and Decomp

daddyhunt %>%
  features(demographic, features = guerrero)
## # A tibble: 1 × 1
##   lambda_guerrero
##             <dbl>
## 1            1.67
daddyhunt %>%
  autoplot(box_cox(demographic, 1.67)) +
  labs(title = "Box-Cox Transformation", y = "Demographic trasformed")

daddyhunt %>%
  model(classical_decomposition(demographic, type = "multiplicative")) %>%
  components() %>%
  autoplot() +
  labs(title = "Multiplicative Decomp")
## Warning: Removed 2 row(s) containing missing values (geom_path).

daddyhunt %>%
  model(STL(demographic ~ season(window = 4), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomp")

The Four Models

Accuracy across the four

supafore <- choochoo %>%
  model(
    Mean = MEAN(demographic),
    Naive = NAIVE(demographic),
    Seasonal_naive = SNAIVE(demographic),
    Drift = RW(demographic ~ drift()))

crazykids <- supafore %>%
  forecast(h = 4)  # one year (4 quarters) into the test period
accuracy(crazykids, daddyhunt)
## # A tibble: 4 × 10
##   .model         .type      ME  RMSE   MAE    MPE  MAPE  MASE RMSSE   ACF1
##   <chr>          <chr>   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 Drift          Test    298.   636.  528.   7.87  18.5 0.302 0.293 0.198 
## 2 Mean           Test  -2507.  2548. 2507. -98.8   98.8 1.43  1.17  0.0947
## 3 Naive          Test     -2.5  454.  408.  -2.95  15.5 0.233 0.209 0.0947
## 4 Seasonal_naive Test   -348.  1394.  972. -21.2   40.3 0.556 0.642 0.0716
crazykids %>%
  autoplot() +
  autolayer(daddyhunt)
## Plot variable not specified, automatically selected `.vars = demographic`

gg_tsresiduals of all models

huntseanafr <- daddyhunt %>%
  model(SNAIVE(demographic))
huntseanafr %>%
  gg_tsresiduals()
## Warning: Removed 4 row(s) containing missing values (geom_path).
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing non-finite values (stat_bin).

huntmeanfr <- daddyhunt %>%
  model(MEAN(demographic))
huntmeanfr %>%
  gg_tsresiduals()

huntdriftfr <- daddyhunt %>%
  model(RW(demographic ~ drift()))
huntdriftfr %>%
  gg_tsresiduals() +
  labs(title = "Drift resid")
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).

huntnaifr <- daddyhunt %>%
  model(NAIVE(demographic))
huntnaifr %>%
  gg_tsresiduals() +
  labs(title = "Naive resid")
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).

Ljung-Box tests

augment(huntnaifr) %>%
  features(.resid, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 3
##   .model             lb_stat lb_pvalue
##   <chr>                <dbl>     <dbl>
## 1 NAIVE(demographic)    28.9  0.000326
augment(huntseanafr) %>%
  features(.resid, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 3
##   .model              lb_stat lb_pvalue
##   <chr>                 <dbl>     <dbl>
## 1 SNAIVE(demographic)    112.         0
augment(huntmeanfr) %>%
  features(.resid, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 3
##   .model            lb_stat lb_pvalue
##   <chr>               <dbl>     <dbl>
## 1 MEAN(demographic)    101.         0
augment(huntdriftfr) %>%
  features(.resid, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 3
##   .model                    lb_stat lb_pvalue
##   <chr>                       <dbl>     <dbl>
## 1 RW(demographic ~ drift())    28.9  0.000326

Two best models (Drift and Naive)

casting <- huntnaifr %>%
  forecast(h = "1 year")
casting
## # A fable: 4 x 4 [1Q]
## # Key:     .model [1]
##   .model                Date      demographic .mean
##   <chr>                <qtr>           <dist> <dbl>
## 1 NAIVE(demographic) 1976 Q1  N(1650, 829313)  1650
## 2 NAIVE(demographic) 1976 Q2 N(1650, 1658626)  1650
## 3 NAIVE(demographic) 1976 Q3 N(1650, 2487938)  1650
## 4 NAIVE(demographic) 1976 Q4 N(1650, 3317251)  1650
daddyhunt %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = demographic)) + 
  autolayer(casting) + 
  labs(title = "Naive Method")

casting2 <- huntdriftfr %>%
  forecast(h = "1 year")
casting2
## # A fable: 4 x 4 [1Q]
## # Key:     .model [1]
##   .model                       Date      demographic .mean
##   <chr>                       <qtr>           <dist> <dbl>
## 1 RW(demographic ~ drift()) 1976 Q1  N(1549, 836817) 1549.
## 2 RW(demographic ~ drift()) 1976 Q2 N(1447, 1709244) 1447.
## 3 RW(demographic ~ drift()) 1976 Q3 N(1346, 2617280) 1346.
## 4 RW(demographic ~ drift()) 1976 Q4 N(1244, 3560925) 1244.
daddyhunt %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = demographic)) + 
  autolayer(casting2) + 
  labs(title = "Drift Method")

TSLM model

traingang <- choochoo %>%
  model(TSLM(demographic ~ trend() + season()))
report(traingang)
## Series: demographic 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3019.33 -1268.84   -53.04  1390.54  1884.51 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8013.73     749.19  10.697 3.28e-11 ***
## trend()        -150.92      31.63  -4.771 5.63e-05 ***
## season()year2  -779.08     820.57  -0.949    0.351    
## season()year3  -876.92     822.40  -1.066    0.296    
## season()year4   340.25     825.44   0.412    0.683    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1640 on 27 degrees of freedom
## Multiple R-squared: 0.489,   Adjusted R-squared: 0.4133
## F-statistic:  6.46 on 4 and 27 DF, p-value: 0.00088007
huntdatatest <- traingang %>%
  forecast(h = 16)
accuracy(huntdatatest, daddyhunt)
## # A tibble: 1 × 10
##   .model                   .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>                    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 TSLM(demographic ~ tren… Test  1560. 2055. 1681.  45.6  51.1 0.961 0.947 0.464
huntdatatest %>%
  autoplot() +
  autolayer(daddyhunt)
## Plot variable not specified, automatically selected `.vars = demographic`

Linear and Exponential

linex <- choochoo %>%
  model(Linear = TSLM(demographic ~ trend()),
        Exponential = TSLM(log(demographic) ~ trend()))

linexfore <- linex %>%
  forecast(h = 16)
linexfore %>%
  autoplot(daddyhunt)

accuracy(linexfore, daddyhunt)
## # A tibble: 2 × 10
##   .model      .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>       <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Exponential Test   608. 1238.  937.  9.86  26.3 0.536 0.570 0.729
## 2 Linear      Test  1528. 1907. 1621. 45.5   49.8 0.927 0.879 0.728
linex %>%
  select(Linear) %>%
  gg_tsresiduals()

linex %>%
  select(Exponential) %>%
  gg_tsresiduals()

Bootstrapping

fullgang <- daddyhunt %>% 
  model(Naive = NAIVE(demographic))
boots <- fullgang %>%
  forecast(h = 4, bootstrap = TRUE, times = 5000)
boots %>% 
  autoplot(daddyhunt)

hilo(boots, 95)
## # A tsibble: 4 x 5 [1Q]
## # Key:       .model [1]
##   .model    Date  demographic .mean                    `95%`
##   <chr>    <qtr>       <dist> <dbl>                   <hilo>
## 1 Naive  1976 Q1 sample[5000] 1631. [ -448.5106, 3611.489]95
## 2 Naive  1976 Q2 sample[5000] 1617. [-1327.0213, 3982.979]95
## 3 Naive  1976 Q3 sample[5000] 1607. [-1826.2819, 4434.718]95
## 4 Naive  1976 Q4 sample[5000] 1610. [-2244.0426, 4976.207]95

Cross validation

crossvally <- daddyhunt %>%
  stretch_tsibble(.init = 5, .step = 1) %>% 
  relocate(Date, .id)
crossvally %>%
  model(Naive = NAIVE(demographic)) %>%
  forecast(h = 4) %>%
  accuracy(daddyhunt)
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
## 4 observations are missing between 1976 Q1 and 1976 Q4
## # A tibble: 1 × 10
##   .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Naive  Test  -365. 1592. 1154. -19.2  37.0 0.714 0.797 0.724
crossvally %>% 
  model(RW(demographic ~ drift())) %>% 
  forecast(h = 4) %>% 
  accuracy(daddyhunt)
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
## 4 observations are missing between 1976 Q1 and 1976 Q4
## # A tibble: 1 × 10
##   .model                   .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>                    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 RW(demographic ~ drift(… Test  -250. 1739. 1281. -13.5  38.8 0.792 0.871 0.740

Between drift and naive, the naive method won the cross-validation, with lower RMSE (1592 vs. 1739) and MAE (1154 vs. 1281).
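Both candidates can also be cross-validated side by side in a single model() call, which puts their accuracy rows in one table; a minimal sketch:

# Sketch: cross-validate both candidates in one call
crossvally %>%
  model(Naive = NAIVE(demographic),
        Drift = RW(demographic ~ drift())) %>%
  forecast(h = 4) %>%
  accuracy(daddyhunt)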

Best model (NAIVE)

huntnaifr <- daddyhunt %>%
  model(NAIVE(demographic))
huntnaifr %>%
  gg_tsresiduals() +
  labs(title = "Naive resid")
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).

augment(huntnaifr) %>%
  features(.resid, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 3
##   .model             lb_stat lb_pvalue
##   <chr>                <dbl>     <dbl>
## 1 NAIVE(demographic)    28.9  0.000326
casting <- huntnaifr %>%
  forecast(h = "1 year")
casting
## # A fable: 4 x 4 [1Q]
## # Key:     .model [1]
##   .model                Date      demographic .mean
##   <chr>                <qtr>           <dist> <dbl>
## 1 NAIVE(demographic) 1976 Q1  N(1650, 829313)  1650
## 2 NAIVE(demographic) 1976 Q2 N(1650, 1658626)  1650
## 3 NAIVE(demographic) 1976 Q3 N(1650, 2487938)  1650
## 4 NAIVE(demographic) 1976 Q4 N(1650, 3317251)  1650
daddyhunt %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = demographic)) + 
  autolayer(casting) + 
  labs(title = "Naive Method")

I chose the naive method as my best forecast. I believe this series is close to a random walk, so all of the forecasting methods struggle to predict it accurately. I was stuck between the drift and naive methods, since their Ljung-Box statistics were identical and better than those of the other two models; but if the data really is a random walk, the naive method is the one to use. It also fits my data the best and was the most accurate under cross-validation. I kept running into the same issue with the other benchmark models and the TSLM model: none of them left white-noise residuals (strictly, even the naive residuals fail the Ljung-Box test, just less badly). Because the residuals are not normally distributed, I bootstrapped the prediction intervals for the naive method instead of relying on the default normal intervals.
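The non-normality claim behind the bootstrap choice can be checked directly on the residuals; a minimal sketch using a Shapiro-Wilk test (base stats, nothing extra to load):

# Sketch: test whether the naive residuals look normally distributed
resids <- augment(huntnaifr) %>% pull(.resid)
shapiro.test(na.omit(resids))  # a small p-value suggests non-normal residuals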