Load packages and data

library(fpp3)
library(readxl)
library(GGally)

Questions

Exercise 1

ohio_data <- read_excel("C:\\Users\\Zacha\\Downloads\\Ohio table data.xlsx") %>%
  select(-State)

# First look at the raw layout: pivoting only Description keeps the year
# columns wide, so this is not yet the reshape we need
ohio_data %>%
  pivot_longer("Description")
## # A tibble: 4 x 25
##   `1999` `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009`
##    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1    4.1    4.2    1.2    3.5    3.1    5.3    5.1    3      3.1   -0.2   -2.9
## 2    3.1    5.4    2.7    1.4    3      3.9    3.1    5.1    4.3    3.5   -2.8
## 3    5.9    6.7    4.1    3.4    4.5    4.4    4.7    3.3    3.9    1.7   -1.6
## 4    1.2    1.6   -0.9   -1.3   -0.3    0.7    0.6    0.6    0.7   -1     -4  
## # ... with 14 more variables: `2010` <dbl>, `2011` <dbl>, `2012` <dbl>,
## #   `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>,
## #   `2018` <dbl>, `2019` <dbl>, `2020` <dbl>, `2021` <dbl>, name <chr>,
## #   value <chr>
## # i Use `colnames()` to see all variable names
ohio.data2 <- ohio_data %>%
  pivot_longer(!Description, names_to = "Time", values_to = "count") %>%
  pivot_wider(names_from = Description, values_from = count) %>%
  mutate(Time = as.numeric(Time))


ohio.datats <- ohio.data2 %>%
  as_tsibble(index = "Time")
ohio.datats
## # A tsibble: 23 x 5 [1Y]
##     Time   GDP Income Consumption Employment
##    <dbl> <dbl>  <dbl>       <dbl>      <dbl>
##  1  1999   4.1    3.1         5.9        1.2
##  2  2000   4.2    5.4         6.7        1.6
##  3  2001   1.2    2.7         4.1       -0.9
##  4  2002   3.5    1.4         3.4       -1.3
##  5  2003   3.1    3           4.5       -0.3
##  6  2004   5.3    3.9         4.4        0.7
##  7  2005   5.1    3.1         4.7        0.6
##  8  2006   3      5.1         3.3        0.6
##  9  2007   3.1    4.3         3.9        0.7
## 10  2008  -0.2    3.5         1.7       -1  
## # ... with 13 more rows
## # i Use `print(n = ...)` to see more rows

Exercise 2

ohio.notime <- ohio.data2 %>%
  select(-Time)
ggpairs(ohio.notime)

Exercise 3

Income and GDP are positively correlated and statistically significant at the 95% level.

Consumption and GDP are positively correlated and statistically significant at the 99.9% level.

Employment and GDP are positively correlated and statistically significant at the 99.9% level.

Consumption and Income are positively correlated, but the correlation is not statistically significant. That is not what I would expect, since income is usually thought of as a key driver of consumption.

Employment and Income are positively correlated and statistically significant at the 95% level.

Employment and Consumption are positively correlated and statistically significant at the 99.9% level.
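
As a numerical cross-check on the ggpairs panel, the same correlations and one of the underlying tests can be computed directly. This is only a sketch reusing ohio.notime from Exercise 2; the Consumption/Income pair is picked simply because it is the surprising one.

# Pairwise correlation matrix behind the ggpairs panel
cor(ohio.notime)

# Formal test for a single pair, e.g. the surprisingly weak Consumption-Income link
cor.test(ohio.notime$Consumption, ohio.notime$Income)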

Exercise 4

sad.model <- ohio.datats %>%
  model(TSLM(Consumption ~ Employment + Income + GDP))
report(sad.model)
## Series: Consumption 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9349 -0.5946 -0.1337  0.6100  2.2211 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.28152    0.73214   1.750 0.096185 .  
## Employment   0.21636    0.30503   0.709 0.486751    
## Income      -0.02549    0.15318  -0.166 0.869602    
## GDP          0.71746    0.17089   4.198 0.000487 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.257 on 19 degrees of freedom
## Multiple R-squared: 0.7736,  Adjusted R-squared: 0.7379
## F-statistic: 21.65 on 3 and 19 DF, p-value: 2.3915e-06

Exercise 5

For B0: When employment, income, and GDP growth are all zero, consumption grows by about 1.28 percentage points on average.

For B1: A 1 percentage point increase in employment growth is associated with a 0.216 percentage point increase in consumption growth on average, holding the other variables constant.

For B2: A 1 percentage point increase in income growth is associated with a 0.025 percentage point decrease in consumption growth on average, holding the other variables constant.

For B3: A 1 percentage point increase in GDP growth is associated with a 0.717 percentage point increase in consumption growth on average, holding the other variables constant. GDP is the only predictor that is statistically significant (p < 0.001).
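
The numbers quoted above can also be pulled straight from the fitted model rather than read off report(); a small sketch using the tidy() and glance() helpers that come with fpp3:

# Coefficient estimates, standard errors and p-values (same values as report())
tidy(sad.model)

# Model-level fit statistics such as R-squared and adjusted R-squared
glance(sad.model)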

Exercise 6

augment(sad.model) %>%
  ggplot(aes(x = Time)) + 
  geom_line(aes(y = Consumption, color = "Data")) + 
  geom_line(aes(y = .fitted, color = "Fitted")) + 
  scale_color_manual(values = c(Data = "red", Fitted = "blue")) +
  labs(y= "Consumption", title = "Fitted vs Actual") + 
  guides(color = guide_legend(title = "Series"))

Exercise 7

spooky.scenario <- scenarios(
  Increase = new_data(ohio.datats, 3) %>%
    mutate(Income = 1, Employment = 1, GDP= 1))

spooky.fc <- forecast(sad.model, new_data = spooky.scenario)  

ohio.datats %>%
  autoplot(Consumption) +
  autolayer(spooky.fc) +
  labs(title = "Ohio Consumption" , y= "Percent Change")

spooky.fc
## # A fable: 3 x 8 [1Y]
## # Key:     .scenario, .model [1]
##   .scenario .model                   Time Consumption .mean Income Emplo~1   GDP
##   <chr>     <chr>                   <dbl>      <dist> <dbl>  <dbl>   <dbl> <dbl>
## 1 Increase  TSLM(Consumption ~ Emp~  2022 N(2.2, 2.1)  2.19      1       1     1
## 2 Increase  TSLM(Consumption ~ Emp~  2023 N(2.2, 2.1)  2.19      1       1     1
## 3 Increase  TSLM(Consumption ~ Emp~  2024 N(2.2, 2.1)  2.19      1       1     1
## # ... with abbreviated variable name 1: Employment
spooky.scenario2 <- scenarios(
  Increase = new_data(ohio.datats, 3) %>%
    mutate(Income=1, Employment=0, GDP=0),
  Increase2 = new_data(ohio.datats, 3) %>%
    mutate(Income=.8, Employment=0, GDP=0),
  Decrease = new_data(ohio.datats, 3) %>%
    mutate(Income= -.5, Employment=0, GDP=0),
  names_to = "Scenario")

spooky.fc2 <- forecast(sad.model, new_data = spooky.scenario2)

ohio.datats %>%
  autoplot(Consumption) +
  autolayer(spooky.fc2) +
  labs(title = "US consumption", y = "% change")

spooky.fc2
## # A fable: 9 x 8 [1Y]
## # Key:     Scenario, .model [3]
##   Scenario  .model                   Time Consumption .mean Income Emplo~1   GDP
##   <chr>     <chr>                   <dbl>      <dist> <dbl>  <dbl>   <dbl> <dbl>
## 1 Increase  TSLM(Consumption ~ Emp~  2022   N(1.3, 2)  1.26    1         0     0
## 2 Increase  TSLM(Consumption ~ Emp~  2023   N(1.3, 2)  1.26    1         0     0
## 3 Increase  TSLM(Consumption ~ Emp~  2024   N(1.3, 2)  1.26    1         0     0
## 4 Increase2 TSLM(Consumption ~ Emp~  2022   N(1.3, 2)  1.26    0.8       0     0
## 5 Increase2 TSLM(Consumption ~ Emp~  2023   N(1.3, 2)  1.26    0.8       0     0
## 6 Increase2 TSLM(Consumption ~ Emp~  2024   N(1.3, 2)  1.26    0.8       0     0
## 7 Decrease  TSLM(Consumption ~ Emp~  2022 N(1.3, 2.2)  1.29   -0.5       0     0
## 8 Decrease  TSLM(Consumption ~ Emp~  2023 N(1.3, 2.2)  1.29   -0.5       0     0
## 9 Decrease  TSLM(Consumption ~ Emp~  2024 N(1.3, 2.2)  1.29   -0.5       0     0
## # ... with abbreviated variable name 1: Employment

Exercise 8

The forecasts are pretty similar visually. I would expect a 1 percent increase in Employment and GDP to have a stronger effect on Consumption than a 1 percent increase in Income, because both have a higher correlation with Consumption. The first forecast says that if GDP, Employment, and Income all increase by 1 percent, consumption growth comes in at about 2.2 percent. In the second set of scenarios, the prediction intervals for a 1 percent increase in income and a 0.8 percent increase are essentially identical, which shows how weak the income relationship is in this model. Even a 0.5 percentage point decrease in income only nudges the forecast slightly. This doesn't match my intuition; I would expect income to have a strong effect on consumption. Very interesting relationships here.
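
To see why the income scenarios barely move the forecast, the scenario means can be reconstructed by hand from the regression coefficients. A small sketch, assuming sad.model from Exercise 4; it should reproduce the .mean column of spooky.fc2.

# Pull the coefficients out of the fitted TSLM
coef_tbl <- tidy(sad.model)
coefs <- setNames(coef_tbl$estimate, coef_tbl$term)

# Income at 1, 0.8 and -0.5 with Employment and GDP held at 0:
# intercept + b_Income * Income
coefs["(Intercept)"] + coefs["Income"] * c(1, 0.8, -0.5)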

Question 2

zach_data <- read_excel("C:\\Users\\Zacha\\Downloads\\M3_exam1 (1).xlsx")

zach.data2 <- zach_data %>%
  select(-c(Series, N, NF, Category, `Starting Year`, `Starting Quarter`)) %>%
  filter(`Student Name` == 'Zach') %>%
  pivot_longer(-`Student Name`) %>%
  select(-c(`Student Name`, name)) %>%
  rename(Micro = value)

clean.zach <- na.omit(zach.data2)
clean.zach <- clean.zach %>%
  mutate(Date = yearquarter(56:99)) %>%  # 56:99 quarters after 1970 Q1, i.e. 1984 Q1 to 1994 Q4
  select(Date, Micro) %>%
  as_tsibble(index = Date)

clean.zach
## # A tsibble: 44 x 2 [1Q]
##       Date Micro
##      <qtr> <dbl>
##  1 1984 Q1 2449.
##  2 1984 Q2 2524 
##  3 1984 Q3 2704.
##  4 1984 Q4 2892.
##  5 1985 Q1 2894.
##  6 1985 Q2 2945.
##  7 1985 Q3 3082.
##  8 1985 Q4 3187.
##  9 1986 Q1 3309.
## 10 1986 Q2 3335.
## # ... with 34 more rows
## # i Use `print(n = ...)` to see more rows
autoplot(clean.zach)
## Plot variable not specified, automatically selected `.vars = Micro`

# The series trends steadily upward and doesn't look like it needs any transformation.

# This data looks like a great candidate for the drift method. We'll start there.
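
# As a sanity check on skipping a transformation, Guerrero's method (from the feasts
# package loaded with fpp3) suggests a Box-Cox lambda; a lambda near 1 would mean there
# is little to gain from transforming. Just a sketch:
clean.zach %>%
  features(Micro, features = guerrero)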

drift.zach <- clean.zach %>%
  model(RW(Micro ~ drift())) %>%
  forecast(h = 4) %>%
  autoplot(clean.zach, level = NULL) +
  # dashed line joins the first and last observations, i.e. the slope the drift method extrapolates
  geom_line(data = slice(clean.zach, range(cumsum(!is.na(Micro)))),
            aes(y = Micro), linetype = "dashed", colour = "#0072B2") +
  labs(title = "Micro Forecast")
drift.zach

zach.fit <- clean.zach %>%
  model(RW(Micro ~ drift()))

gg_tsresiduals(zach.fit)
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).

# The innovation residuals do not look like they have a mean of zero at the start of the series,
# so a bootstrap is going to be necessary to create the prediction intervals. The histogram
# doesn't appear to be centered on zero either, and there's a clear rightward skew. The
# autocorrelation plot looks the best of the three, but I'm going to run a Ljung-Box test on it anyway.
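
# As a numerical check on the histogram, a Shapiro-Wilk test on the innovation residuals
# is one option; this is just a sketch using base R's shapiro.test(), where a small
# p-value points away from normality and toward bootstrapping.
augment(zach.fit) %>%
  pull(.innov) %>%
  shapiro.test()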

# ACF check with a Ljung-Box test.
augment(zach.fit) %>%
  features(.resid, ljung_box, lag=10, dof=0)
## # A tibble: 1 x 3
##   .model              lb_stat lb_pvalue
##   <chr>                 <dbl>     <dbl>
## 1 RW(Micro ~ drift())    13.9     0.177
# A p-value of .177 means the residual autocorrelations are not statistically significant, so the residuals are indistinguishable from white noise. This is good.
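
# Since the drift model estimates one parameter, the same test can also be run with one
# degree of freedom removed; just a variant for comparison, read the same way as above.
augment(zach.fit) %>%
  features(.resid, ljung_box, lag = 10, dof = 1)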

# Let's do some accuracy tests.

zach.fit2 <- clean.zach %>%
  model(
    Mean = MEAN(Micro),
    Naive = NAIVE(Micro),
    Seasonal_naive = SNAIVE(Micro),
    Drift = RW(Micro ~ drift())
  )

accuracy(zach.fit2)
## # A tibble: 4 x 10
##   .model         .type          ME  RMSE   MAE     MPE  MAPE  MASE RMSSE    ACF1
##   <chr>          <chr>       <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 Mean           Training 4.24e-13 1786. 1587. -14.1   36.0  2.87  2.47  0.908  
## 2 Naive          Training 1.50e+ 2  318.  192.   2.84   3.59 0.348 0.439 0.00355
## 3 Seasonal_naive Training 5.49e+ 2  723.  552.  10.4   10.4  1     1     0.651  
## 4 Drift          Training 1.06e-14  280.  178.  -0.423  3.30 0.321 0.387 0.00355
# The drift method yields the lowest RMSE (and the lowest MAE and MASE) of the four models, which supports my argument that drift is the GOAT forecasting method here.
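
# For a quick side-by-side ranking, the same accuracy table can be sorted by RMSE;
# a small convenience sketch.
accuracy(zach.fit2) %>%
  select(.model, RMSE, MAE, MASE) %>%
  arrange(RMSE)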


# The drift model's innovation residuals do not appear to be normally distributed, so I'm going to bootstrap the residuals to build the prediction intervals rather than rely on a normal approximation.


# Simulate a handful of bootstrapped sample paths to see what they look like
bootstrap <- zach.fit %>%
  generate(h = 4, times = 5, bootstrap = TRUE)

# Forecast with prediction intervals built from 5,000 bootstrapped sample paths
bootstrap2 <- zach.fit %>%
  forecast(h = 4, bootstrap = TRUE, times = 5000)

autoplot(bootstrap2, clean.zach) + labs(title = "Drift Forecast with Bootstrapped Residuals")

# That looks like a fantastic forecast to me.

bootstrap2 %>%
  hilo(80)
## # A tsibble: 4 x 5 [1Q]
## # Key:       .model [1]
##   .model                 Date        Micro .mean                   `80%`
##   <chr>                 <qtr>       <dist> <dbl>                  <hilo>
## 1 RW(Micro ~ drift()) 1995 Q1 sample[5000] 9098. [8860.553,  9319.993]80
## 2 RW(Micro ~ drift()) 1995 Q2 sample[5000] 9293. [8927.506,  9698.532]80
## 3 RW(Micro ~ drift()) 1995 Q3 sample[5000] 9499. [9024.921, 10075.813]80
## 4 RW(Micro ~ drift()) 1995 Q4 sample[5000] 9691. [9137.440, 10409.900]80

Why My Forecast is the GOAT

I kept it pretty simple. After plotting my data it was very clear that the drift was going to produce the best forecast. A MEAN forecast would simple place the forecast in the middle of the data which would look awful. There is a clear upward trend in the data so NAIVE would be the wrong way to go; it’s clearly not random. The SNAIVE forecast would require my data to have a more distinct seasonality that it does not really possess. The drift was the clear choice, but I have to prove that. I set up a fit with all 4 models in it and I then tested the accuracy primarily looking at the RMSE value because it’s the most important in regards to forecasting. The drift of course had the lowest p value for RMSE out of the 4 models so it’s once again the clear choice. From here I just wanted to make my forecast presentable and acceptable. The last thing I needed to do was create the prediction interval around my forecast. I knew my residuals were not normally distributed when I looked at the residuals earlier, so I knew I needed to boostrap. I bootstrapped 5,000 times and I think the end result looks fantastic. A good quality forecast that far exceeds one that the other 3 forecast methods would have produced: the GOAT.