# fpp3 attaches the core time-series stack (tsibble, fable, feasts), and
# tidyverse attaches dplyr, ggplot2, tidyr, and tibble, so those packages
# do not need separate library() calls
library(fpp3)
library(tidyverse)
library(readxl)    # Excel import
library(GGally)    # ggpairs() scatterplot matrix
library(seasonal)  # seasonal adjustment helpers
library(ggfortify)
library(moments)
library(zoo)
library(caTools)
illest_table_eva_ <- read_excel("~/Downloads/illest table eva .xlsx")
fixed_table <- illest_table_eva_ %>%
pivot_longer(!Description) %>%
pivot_wider(names_from = Description, values_from = value)
colnames(fixed_table)[1] <- "Year"
fixed_table$Year <- as.numeric(as.character(fixed_table$Year))
fixxy <- fixed_table %>%
as_tsibble(index = Year)
fixxy
## # A tsibble: 23 x 5 [1Y]
## Year GDP PI PCE TE
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1999 5.4 4.5 5.5 1.1
## 2 2000 5.8 7.6 7.6 1.9
## 3 2001 3 3.7 3.4 -0.6
## 4 2002 2.6 1.2 2.9 -1.3
## 5 2003 3.6 2 4.3 -0.4
## 6 2004 5.6 4 5.2 0.7
## 7 2005 4.8 4.2 5.6 1.2
## 8 2006 5.8 6.6 5.1 1.6
## 9 2007 4.1 5.8 4.8 1.7
## 10 2008 0.7 2.6 2.5 -0.3
## # … with 13 more rows
fixxy %>%
ggplot(aes(x = Year)) +
geom_line(aes(y = GDP, color = "GDP")) +
geom_line(aes(y = PI, color = "PI")) +
geom_line(aes(y = PCE, color = "PCE")) +
geom_line(aes(y = TE, color = "TE")) +
labs(y = "Percentage Change", title = "GDP, Income, Consumption, and Employment") +
scale_colour_manual(values = c(GDP = "black", PI = "green", PCE = "blue", TE = "pink")) +
guides(colour = guide_legend(title = NULL))
After gathering my data from the website, I cleaned it slightly in Excel, mainly renaming some of the variables and removing excess information that I could not use in R. In R, I converted the Year variable to numeric and then converted the data into a tsibble with Year as the index. Finally, I plotted the series over time to get a better understanding of the data.
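For reference, the renaming, type conversion, and tsibble conversion can be collapsed into one chain; a sketch equivalent to the steps above:
fixxy <- illest_table_eva_ %>%
  pivot_longer(!Description) %>%
  pivot_wider(names_from = Description, values_from = value) %>%
  rename(Year = 1) %>%                  # first column holds the years
  mutate(Year = as.numeric(Year)) %>%
  as_tsibble(index = Year)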
ggpairs(fixed_table %>% select(-Year))
GDP and Personal Income have a correlation of 0.557, a moderate positive correlation: the two series tend to move together. The two stars indicate statistical significance at the 99% level.
GDP and Personal Consumption Expenditure have a correlation of 0.924, a very strong positive correlation: the two series move together closely. The three stars indicate statistical significance at the 99.9% level.
GDP and Total Employment have a correlation of 0.809, a strong positive correlation: the two series tend to move together. The three stars indicate statistical significance at the 99.9% level.
Personal Income and Personal Consumption Expenditure have a correlation of 0.584, a moderate positive correlation: the two series tend to move together. The two stars indicate statistical significance at the 99% level.
Personal Income and Total Employment have a correlation of 0.568, a moderate positive correlation: the two series tend to move together. The two stars indicate statistical significance at the 99% level.
Personal Consumption Expenditure and Total Employment have a correlation of 0.753, a strong positive correlation: the two series tend to move together. The three stars indicate statistical significance at the 99.9% level.
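The stars that ggpairs() prints come from pairwise correlation tests; any one of them can be double-checked with base R's cor.test(). A quick sketch:
# Should reproduce the 0.924 correlation and a p-value below 0.001.
with(fixed_table, cor.test(GDP, PCE))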
ohno <- fixxy %>%
model(lm = TSLM(PCE ~ GDP + PI + TE))
report(ohno)
## Series: PCE
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68710 -0.73783 -0.04094 0.59836 2.05388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.09343 0.64213 -0.145 0.886
## GDP 0.97351 0.16482 5.907 1.1e-05 ***
## PI 0.12106 0.12435 0.974 0.343
## TE -0.03121 0.27974 -0.112 0.912
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.213 on 19 degrees of freedom
## Multiple R-squared: 0.8602, Adjusted R-squared: 0.8381
## F-statistic: 38.96 on 3 and 19 DF, p-value: 2.5788e-08
I used the TSLM() function to run a regression, with GDP, Income, and Employment as predictors of Consumption.
The intercept indicates that when GDP, Income, and Employment all change by 0 percentage points, Consumption changes by -0.09343 percentage points on average. This effect is not statistically significant, even at the 80% level.
The GDP coefficient indicates that a 1 percentage point increase in GDP is associated with a 0.97351 percentage point increase in Consumption on average, holding the other predictors fixed. This effect is statistically significant at the 99.9% level.
The Income coefficient indicates that a 1 percentage point increase in Income is associated with a 0.12106 percentage point increase in Consumption on average. This effect is not statistically significant, even at the 80% level.
The Employment coefficient indicates that a 1 percentage point increase in Employment is associated with a 0.03121 percentage point decrease in Consumption on average. This effect is not statistically significant, even at the 80% level.
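For further use, the coefficient table and model-level statistics can be pulled out as data frames; fabletools provides tidy() and glance() methods for fitted model tables. A minimal sketch:
tidy(ohno)    # one row per coefficient: estimate, std.error, statistic, p.value
glance(ohno)  # model-level fit statistics such as r_squared, AIC, and CV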
augment(ohno) %>%
ggplot(aes(x = Year)) +
geom_line(aes(y = PCE, color = "Data")) +
geom_line(aes(y = .fitted, color = "Fitted")) +
scale_color_manual(values = c(Data = "red", Fitted = "blue")) +
labs(y = "Consumption", title = "Fitted vs Actual") +
guides(color = guide_legend(title = "Series"))
daddyfit <- fixxy %>%
model(lm = TSLM(PCE ~ PI + GDP + TE))
theman <- scenarios(
  Increase = new_data(fixxy, 3) %>%
    mutate(PI = 1, GDP = 1, TE = 1),
  names_to = "Scenario")
basecast <- forecast(daddyfit, new_data = theman)
basecast
## # A fable: 3 x 8 [1Y]
## # Key: Scenario, .model [1]
## Scenario .model Year PCE .mean PI GDP TE
## <chr> <chr> <dbl> <dist> <dbl> <dbl> <dbl> <dbl>
## 1 Increase lm 2022 N(0.97, 1.9) 0.970 1 1 1
## 2 Increase lm 2023 N(0.97, 1.9) 0.970 1 1 1
## 3 Increase lm 2024 N(0.97, 1.9) 0.970 1 1 1
fixxy %>%
autoplot(PCE) +
autolayer(basecast) +
labs(title = "PCE forecast", y = "% change")
pi_scenarios <- scenarios(
  Increase = new_data(fixxy, 3) %>%
    mutate(PI = 1, GDP = 0, TE = 0),
  Decrease = new_data(fixxy, 3) %>%
    mutate(PI = -0.5, GDP = 0, TE = 0),
  Increase2 = new_data(fixxy, 3) %>%
    mutate(PI = 0.8, GDP = 0, TE = 0),
  names_to = "Scenario")
picast <- forecast(daddyfit, new_data = pi_scenarios)
picast
## # A fable: 9 x 8 [1Y]
## # Key: Scenario, .model [3]
## Scenario .model Year PCE .mean PI GDP TE
## <chr> <chr> <dbl> <dist> <dbl> <dbl> <dbl> <dbl>
## 1 Increase lm 2022 N(0.028, 1.8) 0.0276 1 0 0
## 2 Increase lm 2023 N(0.028, 1.8) 0.0276 1 0 0
## 3 Increase lm 2024 N(0.028, 1.8) 0.0276 1 0 0
## 4 Decrease lm 2022 N(-0.15, 1.9) -0.154 -0.5 0 0
## 5 Decrease lm 2023 N(-0.15, 1.9) -0.154 -0.5 0 0
## 6 Decrease lm 2024 N(-0.15, 1.9) -0.154 -0.5 0 0
## 7 Increase2 lm 2022 N(0.0034, 1.8) 0.00342 0.8 0 0
## 8 Increase2 lm 2023 N(0.0034, 1.8) 0.00342 0.8 0 0
## 9 Increase2 lm 2024 N(0.0034, 1.8) 0.00342 0.8 0 0
fixxy %>%
autoplot(PCE, color = "black") +
autolayer(picast, color = "red") +
labs(title = "IL consumption", y = "% change")
fixxy %>%
autoplot(PCE, color = "black") +
autolayer(picast, color = "red") +
autolayer(basecast, color = "blue") +
labs(title = "IL consumption (Both together with PI)", y = "% change")
The prediction intervals for the scenarios are all plotted together. Varying Personal Income had very little effect: the only scenario whose forecast moved noticeably away from the zero-change baseline was the decrease scenario, and even then the prediction intervals overlapped, so the scenario forecasts are not statistically distinguishable from one another.
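One way to make the overlap concrete is to pull out the 95% intervals per scenario with hilo() and compare the bounds directly; a quick sketch:
# Overlapping [lower, upper] ranges across scenarios support the
# conclusion that the scenario forecasts are not distinguishable.
picast %>%
  hilo(95) %>%
  as_tibble() %>%
  select(Scenario, Year, .mean, `95%`)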
dat_data <- read_excel("~/Downloads/M3_exam1.xlsx")
Hdawg <- dat_data %>%
  filter(`Student Name` == "Hunter") %>%
  select(-Series, -N, -NF, -Category, -`Starting Year`, -`Starting Quarter`) %>%
  pivot_longer(-`Student Name`, values_to = "demographic") %>%
  select(demographic)
Hdawg <- na.omit(Hdawg)
daddyhunt <- Hdawg %>%
  # yearquarter() counts quarters from 1970 Q1 = 0, so -24:23 spans
  # 1964 Q1 through 1975 Q4 (48 quarters)
  mutate(Date = yearquarter(-24:23)) %>%
  select(Date, demographic) %>%
  as_tsibble(index = Date)
daddyhunt
## # A tsibble: 48 x 2 [1Q]
## Date demographic
## <qtr> <dbl>
## 1 1964 Q1 6420
## 2 1964 Q2 6160
## 3 1964 Q3 6300
## 4 1964 Q4 7580
## 5 1965 Q1 7870
## 6 1965 Q2 7700
## 7 1965 Q3 7530
## 8 1965 Q4 8640
## 9 1966 Q1 8470
## 10 1966 Q2 7610
## # … with 38 more rows
autoplot(daddyhunt)
## Plot variable not specified, automatically selected `.vars = demographic`
choochoo <- daddyhunt %>%
filter(year(Date) <= 1971)
choochoo
## # A tsibble: 32 x 2 [1Q]
## Date demographic
## <qtr> <dbl>
## 1 1964 Q1 6420
## 2 1964 Q2 6160
## 3 1964 Q3 6300
## 4 1964 Q4 7580
## 5 1965 Q1 7870
## 6 1965 Q2 7700
## 7 1965 Q3 7530
## 8 1965 Q4 8640
## 9 1966 Q1 8470
## 10 1966 Q2 7610
## # … with 22 more rows
testies <- daddyhunt %>%
filter(year(Date) > 1971)
testies
## # A tsibble: 16 x 2 [1Q]
## Date demographic
## <qtr> <dbl>
## 1 1972 Q1 2140
## 2 1972 Q2 2420
## 3 1972 Q3 3340
## 4 1972 Q4 2850
## 5 1973 Q1 3180
## 6 1973 Q2 3880
## 7 1973 Q3 4240
## 8 1973 Q4 4200
## 9 1974 Q1 4640
## 10 1974 Q2 4930
## 11 1974 Q3 4450
## 12 1974 Q4 2490
## 13 1975 Q1 2070
## 14 1975 Q2 2000
## 15 1975 Q3 1650
## 16 1975 Q4 1650
daddyhunt %>%
features(demographic, features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 1.67
daddyhunt %>%
autoplot(box_cox(demographic, 1.67)) +
labs(title = "Box-Cox Transformation", y = "Demographic trasformed")
daddyhunt %>%
model(classical_decomposition(demographic, type = "multiplicative")) %>%
components() %>%
autoplot() +
labs(title = "Multiplicative Decomp")
## Warning: Removed 2 row(s) containing missing values (geom_path).
daddyhunt %>%
model(STL(demographic ~ season(window = 4), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "STL Decomp")
supafore <- choochoo %>%
model(
Mean = MEAN(demographic),
Naive = NAIVE(demographic),
Seasonal_naive = SNAIVE(demographic),
Drift = RW(demographic ~ drift()))
crazykids <- supafore %>%
forecast(h = 4)
accuracy(crazykids, daddyhunt)
## # A tibble: 4 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Drift Test 298. 636. 528. 7.87 18.5 0.302 0.293 0.198
## 2 Mean Test -2507. 2548. 2507. -98.8 98.8 1.43 1.17 0.0947
## 3 Naive Test -2.5 454. 408. -2.95 15.5 0.233 0.209 0.0947
## 4 Seasonal_naive Test -348. 1394. 972. -21.2 40.3 0.556 0.642 0.0716
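With four benchmarks it helps to sort the accuracy table by the metric being minimized; a quick sketch:
# On this 4-quarter test window the naive method has the lowest RMSE.
accuracy(crazykids, daddyhunt) %>%
  arrange(RMSE)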
crazykids %>%
autoplot() +
autolayer(daddyhunt)
## Plot variable not specified, automatically selected `.vars = demographic`
huntseanafr <- daddyhunt %>%
model(SNAIVE(demographic))
huntseanafr %>%
gg_tsresiduals()
## Warning: Removed 4 row(s) containing missing values (geom_path).
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing non-finite values (stat_bin).
huntmeanfr <- daddyhunt %>%
model(MEAN(demographic))
huntmeanfr %>%
gg_tsresiduals()
huntdriftfr <- daddyhunt %>%
model(RW(demographic ~ drift()))
huntdriftfr %>%
gg_tsresiduals() +
labs(title = "Drift resid")
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).
huntnaifr <- daddyhunt %>%
model(NAIVE(demographic))
huntnaifr %>%
gg_tsresiduals() +
labs(title = "Naive resid")
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).
augment(huntnaifr) %>%
features(.resid, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 NAIVE(demographic) 28.9 0.000326
augment(huntseanafr) %>%
features(.resid, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 SNAIVE(demographic) 112. 0
augment(huntmeanfr) %>%
features(.resid, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 MEAN(demographic) 101. 0
augment(huntdriftfr) %>%
features(.resid, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 RW(demographic ~ drift()) 28.9 0.000326
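The four Ljung-Box tests above can also be run in a single pass by fitting all four benchmarks on the full series in one model table; a sketch:
daddyhunt %>%
  model(
    Naive = NAIVE(demographic),
    Seasonal_naive = SNAIVE(demographic),
    Mean = MEAN(demographic),
    Drift = RW(demographic ~ drift())
  ) %>%
  augment() %>%
  features(.resid, ljung_box, lag = 8, dof = 0)  # one row per model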
casting <- huntnaifr %>%
forecast(h = "1 year")
casting
## # A fable: 4 x 4 [1Q]
## # Key: .model [1]
## .model Date demographic .mean
## <chr> <qtr> <dist> <dbl>
## 1 NAIVE(demographic) 1976 Q1 N(1650, 829313) 1650
## 2 NAIVE(demographic) 1976 Q2 N(1650, 1658626) 1650
## 3 NAIVE(demographic) 1976 Q3 N(1650, 2487938) 1650
## 4 NAIVE(demographic) 1976 Q4 N(1650, 3317251) 1650
daddyhunt %>%
ggplot(aes(x = Date)) +
geom_line(aes(y = demographic)) +
autolayer(casting) +
labs(title = "Naive Method")
casting2 <- huntdriftfr %>%
forecast(h = "1 year")
casting2
## # A fable: 4 x 4 [1Q]
## # Key: .model [1]
## .model Date demographic .mean
## <chr> <qtr> <dist> <dbl>
## 1 RW(demographic ~ drift()) 1976 Q1 N(1549, 836817) 1549.
## 2 RW(demographic ~ drift()) 1976 Q2 N(1447, 1709244) 1447.
## 3 RW(demographic ~ drift()) 1976 Q3 N(1346, 2617280) 1346.
## 4 RW(demographic ~ drift()) 1976 Q4 N(1244, 3560925) 1244.
daddyhunt %>%
ggplot(aes(x = Date)) +
geom_line(aes(y = demographic)) +
autolayer(casting2) +
labs(title = "Drift Method")
traingang <- choochoo %>%
model(TSLM(demographic ~ trend() + season()))
report(traingang)
## Series: demographic
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -3019.33 -1268.84 -53.04 1390.54 1884.51
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8013.73 749.19 10.697 3.28e-11 ***
## trend() -150.92 31.63 -4.771 5.63e-05 ***
## season()year2 -779.08 820.57 -0.949 0.351
## season()year3 -876.92 822.40 -1.066 0.296
## season()year4 340.25 825.44 0.412 0.683
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1640 on 27 degrees of freedom
## Multiple R-squared: 0.489, Adjusted R-squared: 0.4133
## F-statistic: 6.46 on 4 and 27 DF, p-value: 0.00088007
huntdatatest <- traingang %>%
forecast(h = 16)
accuracy(huntdatatest, daddyhunt)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 TSLM(demographic ~ tren… Test 1560. 2055. 1681. 45.6 51.1 0.961 0.947 0.464
huntdatatest %>%
autoplot() +
autolayer(daddyhunt)
## Plot variable not specified, automatically selected `.vars = demographic`
linex <- choochoo %>%
model(Linear = TSLM(demographic ~ trend()), Exponential = TSLM(log(demographic) ~ trend()))
linexfore <- linex %>%
forecast(h = 16)
linexfore %>%
autoplot(daddyhunt)
accuracy(linexfore, daddyhunt)
## # A tibble: 2 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Exponential Test 608. 1238. 937. 9.86 26.3 0.536 0.570 0.729
## 2 Linear Test 1528. 1907. 1621. 45.5 49.8 0.927 0.879 0.728
linex %>%
select(Linear) %>%
gg_tsresiduals()
linex %>%
select(Exponential) %>%
gg_tsresiduals()
fullgang <- daddyhunt %>%
model(`Naive`=NAIVE(demographic))
boots <- fullgang %>%
forecast(h = 4, bootstrap = TRUE, times = 5000)
boots %>%
autoplot(daddyhunt)
hilo(boots, 95)
## # A tsibble: 4 x 5 [1Q]
## # Key: .model [1]
## .model Date demographic .mean `95%`
## <chr> <qtr> <dist> <dbl> <hilo>
## 1 Naive 1976 Q1 sample[5000] 1631. [ -448.5106, 3611.489]95
## 2 Naive 1976 Q2 sample[5000] 1617. [-1327.0213, 3982.979]95
## 3 Naive 1976 Q3 sample[5000] 1607. [-1826.2819, 4434.718]95
## 4 Naive 1976 Q4 sample[5000] 1610. [-2244.0426, 4976.207]95
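The bootstrapped intervals are built from simulated sample paths, which can be inspected directly with generate(); a small sketch (sim is just a local name):
sim <- fullgang %>%
  generate(h = 4, times = 25, bootstrap = TRUE)
daddyhunt %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = demographic)) +
  geom_line(aes(y = .sim, group = .rep), data = sim,
            colour = "steelblue", alpha = 0.5) +
  labs(title = "Bootstrapped naive sample paths", y = "demographic")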
crossvally <- daddyhunt %>%
stretch_tsibble(.init = 5, .step = 1) %>%
relocate(Date, .id)
crossvally %>%
model(`Naive`=NAIVE(demographic)) %>%
forecast(h = 4) %>%
accuracy(daddyhunt)
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
## 4 observations are missing between 1976 Q1 and 1976 Q4
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Naive Test -365. 1592. 1154. -19.2 37.0 0.714 0.797 0.724
crossvally %>%
model(RW(demographic ~ drift())) %>%
forecast(h=4) %>%
accuracy(daddyhunt)
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
## 4 observations are missing between 1976 Q1 and 1976 Q4
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 RW(demographic ~ drift(… Test -250. 1739. 1281. -13.5 38.8 0.792 0.871 0.740
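The two cross-validation runs can be combined into one pipeline so the comparison is explicit; a sketch:
# Naive wins on RMSE (~1592 vs ~1739) and MAE (~1154 vs ~1281).
crossvally %>%
  model(Naive = NAIVE(demographic),
        Drift = RW(demographic ~ drift())) %>%
  forecast(h = 4) %>%
  accuracy(daddyhunt) %>%
  arrange(RMSE)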
Between the drift and naive methods, the naive method won the cross-validation comparison, with a lower RMSE (1592 vs 1739) and MAE (1154 vs 1281).
huntnaifr <- daddyhunt %>%
model(NAIVE(demographic))
huntnaifr %>%
gg_tsresiduals() +
labs(title = "Naive resid")
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).
augment(huntnaifr) %>%
features(.resid, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 NAIVE(demographic) 28.9 0.000326
casting <- huntnaifr %>%
forecast(h = "1 year")
casting
## # A fable: 4 x 4 [1Q]
## # Key: .model [1]
## .model Date demographic .mean
## <chr> <qtr> <dist> <dbl>
## 1 NAIVE(demographic) 1976 Q1 N(1650, 829313) 1650
## 2 NAIVE(demographic) 1976 Q2 N(1650, 1658626) 1650
## 3 NAIVE(demographic) 1976 Q3 N(1650, 2487938) 1650
## 4 NAIVE(demographic) 1976 Q4 N(1650, 3317251) 1650
daddyhunt %>%
ggplot(aes(x = Date)) +
geom_line(aes(y = demographic)) +
autolayer(casting) +
labs(title = "Naive Method")
I chose the naive method as my best forecast. I believe the data follow a random walk, which is why all of the forecasting methods struggle to forecast it accurately. I was stuck between the drift method and the naive method, since their Ljung-Box statistics were identical (28.9) and much smaller than those of the other two benchmarks, although even these residuals fail the white-noise test. If the data really are a random walk, though, the naive method is the appropriate choice; it also fits my data best and was the most accurate under cross-validation. The other benchmark models and the TSLM model all ran into the same issue: their residuals were clearly not white noise. Because the naive model's residuals are not normally distributed, I bootstrapped the prediction intervals for the naive forecast.
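As a final check, the normal-theory and bootstrapped 95% intervals for the naive model can be printed side by side, reusing the objects fit above; a sketch:
huntnaifr %>% forecast(h = 4) %>% hilo(95)   # normal-theory intervals
huntnaifr %>% forecast(h = 4, bootstrap = TRUE, times = 5000) %>%
  hilo(95)                                   # bootstrapped intervals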