library(fpp3)
library(readxl)
library(GGally)
ohio_data <- read_excel("C:\\Users\\Zacha\\Downloads\\Ohio table data.xlsx") %>%
select(-State)
ohio_data %>%
pivot_longer("Description")
## # A tibble: 4 x 25
## `1999` `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4.1 4.2 1.2 3.5 3.1 5.3 5.1 3 3.1 -0.2 -2.9
## 2 3.1 5.4 2.7 1.4 3 3.9 3.1 5.1 4.3 3.5 -2.8
## 3 5.9 6.7 4.1 3.4 4.5 4.4 4.7 3.3 3.9 1.7 -1.6
## 4 1.2 1.6 -0.9 -1.3 -0.3 0.7 0.6 0.6 0.7 -1 -4
## # ... with 14 more variables: `2010` <dbl>, `2011` <dbl>, `2012` <dbl>,
## # `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>,
## # `2018` <dbl>, `2019` <dbl>, `2020` <dbl>, `2021` <dbl>, name <chr>,
## # value <chr>
## # i Use `colnames()` to see all variable names
ohio.data2 <- ohio_data %>%
pivot_longer(!Description, names_to = "Time", values_to = "count") %>%
pivot_wider(names_from = Description, values_from = count)
ohio.data2$Time <- as.numeric(ohio.data2$Time)
ohio.datats <- ohio.data2 %>%
as_tsibble(index = "Time")
ohio.datats
## # A tsibble: 23 x 5 [1Y]
## Time GDP Income Consumption Employment
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1999 4.1 3.1 5.9 1.2
## 2 2000 4.2 5.4 6.7 1.6
## 3 2001 1.2 2.7 4.1 -0.9
## 4 2002 3.5 1.4 3.4 -1.3
## 5 2003 3.1 3 4.5 -0.3
## 6 2004 5.3 3.9 4.4 0.7
## 7 2005 5.1 3.1 4.7 0.6
## 8 2006 3 5.1 3.3 0.6
## 9 2007 3.1 4.3 3.9 0.7
## 10 2008 -0.2 3.5 1.7 -1
## # ... with 13 more rows
## # i Use `print(n = ...)` to see more rows
ohio.notime <- ohio.data2 %>%
select(-Time)
ggpairs(ohio.notime)
From the ggpairs plot:

- Income and GDP are positively correlated and statistically significant at the 95% level.
- Consumption and GDP are positively correlated and statistically significant at the 99.9% level.
- Employment and GDP are positively correlated and statistically significant at the 99.9% level.
- Consumption and Income are positively correlated but not statistically significant, which is not what I would expect to see.
- Employment and Income are positively correlated and statistically significant at the 95% level.
- Employment and Consumption are positively correlated and statistically significant at the 99.9% level.
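As a quick numeric cross-check of these pairwise relationships (a sketch, not part of the original ggpairs output, using base R's cor() and cor.test()):
#Correlation matrix for the four Ohio series.
round(cor(ohio.notime), 2)
#Example significance test for one pair (Consumption vs Income), mirroring the ggpairs stars.
cor.test(ohio.notime$Consumption, ohio.notime$Income)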
sad.model <- ohio.datats %>% model(TSLM(Consumption ~ Employment + Income + GDP))
report(sad.model)
## Series: Consumption
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9349 -0.5946 -0.1337 0.6100 2.2211
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.28152 0.73214 1.750 0.096185 .
## Employment 0.21636 0.30503 0.709 0.486751
## Income -0.02549 0.15318 -0.166 0.869602
## GDP 0.71746 0.17089 4.198 0.000487 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.257 on 19 degrees of freedom
## Multiple R-squared: 0.7736, Adjusted R-squared: 0.7379
## F-statistic: 21.65 on 3 and 19 DF, p-value: 2.3915e-06
For B0: when employment, income, and GDP growth are all zero, consumption grows on average by 1.28 percentage points.
For B1: a 1 percentage point increase in employment is associated with an average increase of 0.216 percentage points in consumption, holding the other variables constant.
For B2: a 1 percentage point increase in income is associated with an average change of -0.025 percentage points in consumption, holding the other variables constant.
For B3: a 1 percentage point increase in GDP is associated with an average increase of 0.717 percentage points in consumption, holding the other variables constant. GDP is the only statistically significant predictor (p = 0.0005, significant at the 99.9% level).
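To make the interpretation concrete, here is a small worked check (a sketch, not part of the original output) that plugs the fitted coefficients into the regression equation for a year in which all three predictors grow by 1 percentage point:
#Predicted consumption growth when Employment, Income, and GDP each grow by 1 percentage point, using the coefficients from report(sad.model).
b0 <- 1.28152; b1 <- 0.21636; b2 <- -0.02549; b3 <- 0.71746
b0 + b1 * 1 + b2 * 1 + b3 * 1
#This comes to about 2.19, which matches the scenario forecast mean in Exercise 7 below.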
augment(sad.model) %>%
ggplot(aes(x = Time)) +
geom_line(aes(y = Consumption, color = "Data")) +
geom_line(aes(y = .fitted, color = "Fitted")) +
scale_color_manual(values = c(Data = "red", Fitted = "blue")) +
labs(y= "Consumption", title = "Fitted vs Actual") +
guides(color = guide_legend(title = "Series"))
### Exercise 7
spooky.scenario <- scenarios(
Increase = new_data(ohio.datats, 3) %>%
mutate(Income = 1, Employment = 1, GDP= 1))
spooky.fc <- forecast(sad.model, new_data = spooky.scenario)
ohio.datats %>%
autoplot(Consumption) +
autolayer(spooky.fc) +
labs(title = "Ohio Consumption" , y= "Percent Change")
spooky.fc
## # A fable: 3 x 8 [1Y]
## # Key: .scenario, .model [1]
## .scenario .model Time Consumption .mean Income Emplo~1 GDP
## <chr> <chr> <dbl> <dist> <dbl> <dbl> <dbl> <dbl>
## 1 Increase TSLM(Consumption ~ Emp~ 2022 N(2.2, 2.1) 2.19 1 1 1
## 2 Increase TSLM(Consumption ~ Emp~ 2023 N(2.2, 2.1) 2.19 1 1 1
## 3 Increase TSLM(Consumption ~ Emp~ 2024 N(2.2, 2.1) 2.19 1 1 1
## # ... with abbreviated variable name 1: Employment
spooky.scenario2 <- scenarios(
Increase = new_data(ohio.datats, 3) %>%
mutate(Income=1, Employment=0, GDP=0),
Increase2 = new_data(ohio.datats, 3) %>%
mutate(Income=.8, Employment=0, GDP=0),
Decrease = new_data(ohio.datats, 3) %>%
mutate(Income= -.5, Employment=0, GDP=0),
names_to = "Scenario")
spooky.fc2 <- forecast(sad.model, new_data = spooky.scenario2)
ohio.datats %>%
autoplot(Consumption) +
autolayer(spooky.fc2) +
labs(title = "US consumption", y = "% change")
spooky.fc2
## # A fable: 9 x 8 [1Y]
## # Key: Scenario, .model [3]
## Scenario .model Time Consumption .mean Income Emplo~1 GDP
## <chr> <chr> <dbl> <dist> <dbl> <dbl> <dbl> <dbl>
## 1 Increase TSLM(Consumption ~ Emp~ 2022 N(1.3, 2) 1.26 1 0 0
## 2 Increase TSLM(Consumption ~ Emp~ 2023 N(1.3, 2) 1.26 1 0 0
## 3 Increase TSLM(Consumption ~ Emp~ 2024 N(1.3, 2) 1.26 1 0 0
## 4 Increase2 TSLM(Consumption ~ Emp~ 2022 N(1.3, 2) 1.26 0.8 0 0
## 5 Increase2 TSLM(Consumption ~ Emp~ 2023 N(1.3, 2) 1.26 0.8 0 0
## 6 Increase2 TSLM(Consumption ~ Emp~ 2024 N(1.3, 2) 1.26 0.8 0 0
## 7 Decrease TSLM(Consumption ~ Emp~ 2022 N(1.3, 2.2) 1.29 -0.5 0 0
## 8 Decrease TSLM(Consumption ~ Emp~ 2023 N(1.3, 2.2) 1.29 -0.5 0 0
## 9 Decrease TSLM(Consumption ~ Emp~ 2024 N(1.3, 2.2) 1.29 -0.5 0 0
## # ... with abbreviated variable name 1: Employment
### Exercise 8
The forecasts are pretty similar visually. I would expect a 1 percent growth in Employment and GDP to have a stronger effect on Consumption than Income would, because both have a higher correlation with Consumption. The first forecast predicts that if GDP, Employment, and Income all experience a 1 percent increase, then the percent change in consumption will drop. In the second forecast, my prediction intervals are essentially identical for a 1 percent increase in income and a 0.8 percent increase in income, which reflects the weak relationship between income and consumption in the model. When income decreases by 0.5 percentage points, it has an impact on consumption, albeit a small one. These results don't seem to make intuitive sense; I would expect income to have a strong effect on consumption. Very interesting relationships here.
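To compare the scenario prediction intervals numerically rather than just by eye (a sketch, assuming spooky.fc2 from Exercise 7 is still available):
#Extract the 95% prediction intervals for each scenario so they can be compared side by side.
spooky.fc2 %>%
hilo(95) %>%
as_tibble() %>%
select(Scenario, Time, .mean, `95%`)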
### Question 2
zach_data <- read_excel("C:\\Users\\Zacha\\Downloads\\M3_exam1 (1).xlsx")
zach.data2 <- zach_data %>%
select(-Series, -N, -NF, -Category, -`Starting Year`, -`Starting Quarter`) %>%
filter(`Student Name` == 'Zach') %>%
pivot_longer(-`Student Name`) %>%
select(-`Student Name`, -name) %>%
rename(Micro = value)
clean.zach <- na.omit(zach.data2)
#yearquarter() on the integers 56:99 counts quarters from 1970 Q1, so this labels the 44 observations as 1984 Q1 through 1994 Q4.
clean.zach <- clean.zach %>%
mutate(Date = yearquarter(56:99)) %>%
select(Date,Micro) %>%
as_tsibble(index = Date)
clean.zach
## # A tsibble: 44 x 2 [1Q]
## Date Micro
## <qtr> <dbl>
## 1 1984 Q1 2449.
## 2 1984 Q2 2524
## 3 1984 Q3 2704.
## 4 1984 Q4 2892.
## 5 1985 Q1 2894.
## 6 1985 Q2 2945.
## 7 1985 Q3 3082.
## 8 1985 Q4 3187.
## 9 1986 Q1 3309.
## 10 1986 Q2 3335.
## # ... with 34 more rows
## # i Use `print(n = ...)` to see more rows
autoplot(clean.zach)
## Plot variable not specified, automatically selected `.vars = Micro`
#Doesn't look like any transformations are going to be necessary.
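#As a quick cross-check on the "no transformation needed" call (a sketch, not part of the original analysis), the Guerrero feature from feasts suggests a Box-Cox lambda; a value near 1 means little transformation is warranted.
clean.zach %>%
features(Micro, features = guerrero)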
#This data looks like a great candidate for the drift method. We'll start there.
drift.zach <- clean.zach %>%
model(RW(Micro ~ drift())) %>%
forecast(h =4) %>%
autoplot(clean.zach, level = NULL) +
geom_line(data = slice(clean.zach, range(cumsum(!is.na(Micro)))),
aes(y=Micro), linetype = "dashed", colour = "#0072B2") +
labs(title = "Micro Forecast")
drift.zach
zach.fit <- clean.zach %>%
model(RW(Micro ~ drift()))
gg_tsresiduals(zach.fit)
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).
#The innovation residuals do not look like they have a mean of zero, and the histogram doesn't appear to be centered around zero either, with a clear rightward skew, so a bootstrap is going to be necessary to create a reliable prediction interval. The autocorrelation plot looks the best of the three, but I'm going to run a Ljung-Box test on it anyway.
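#A quick numeric check of those claims (a sketch, assuming zach.fit from above): the mean and median of the residuals.
augment(zach.fit) %>%
as_tibble() %>%
summarise(mean_resid = mean(.resid, na.rm = TRUE), median_resid = median(.resid, na.rm = TRUE))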
#ACF check with a Ljung-Box test.
augment(zach.fit) %>%
features(.resid, ljung_box, lag=10, dof=0)
## # A tibble: 1 x 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 RW(Micro ~ drift()) 13.9 0.177
#A p-value of 0.177 means we fail to reject the null hypothesis that the residuals are white noise, so the residuals are not distinguishable from white noise. This is good.
#Let's compare the accuracy of several benchmark methods.
zach.fit2 <- clean.zach %>%
model(
Mean = MEAN(Micro),
Naive = NAIVE(Micro),
Seasonal_naive = SNAIVE(Micro),
Drift = RW(Micro ~ drift())
)
accuracy(zach.fit2)
## # A tibble: 4 x 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mean Training 4.24e-13 1786. 1587. -14.1 36.0 2.87 2.47 0.908
## 2 Naive Training 1.50e+ 2 318. 192. 2.84 3.59 0.348 0.439 0.00355
## 3 Seasonal_naive Training 5.49e+ 2 723. 552. 10.4 10.4 1 1 0.651
## 4 Drift Training 1.06e-14 280. 178. -0.423 3.30 0.321 0.387 0.00355
#The drift method yields the lowest RMSE value, which strongly supports my argument that drift is the GOAT forecasting method here.
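#To make that comparison explicit (a sketch, not part of the original output), sort the accuracy table by RMSE so the best method is on top.
accuracy(zach.fit2) %>%
arrange(RMSE)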
#The drift model's innovation residuals do not appear to be normally distributed, so I'm going to bootstrap to create my prediction intervals rather than rely on a normal approximation.
#Simulate a few bootstrapped future sample paths from the drift model.
bootstrap <- zach.fit %>%
generate(h = 4, times = 5, bootstrap = TRUE)
#Forecast with prediction intervals built from 5,000 bootstrapped sample paths instead of the normal approximation.
bootstrap2 <- zach.fit %>%
forecast(h = 4, bootstrap = TRUE, times = 5000)
autoplot(bootstrap2, clean.zach) + labs(title = "Drift Forecast with Bootstrapped Residuals")
#That looks like a fantastic forecast to me.
bootstrap2 %>%
hilo(80)
## # A tsibble: 4 x 5 [1Q]
## # Key: .model [1]
## .model Date Micro .mean `80%`
## <chr> <qtr> <dist> <dbl> <hilo>
## 1 RW(Micro ~ drift()) 1995 Q1 sample[5000] 9098. [8860.553, 9319.993]80
## 2 RW(Micro ~ drift()) 1995 Q2 sample[5000] 9293. [8927.506, 9698.532]80
## 3 RW(Micro ~ drift()) 1995 Q3 sample[5000] 9499. [9024.921, 10075.813]80
## 4 RW(Micro ~ drift()) 1995 Q4 sample[5000] 9691. [9137.440, 10409.900]80
### Why My Forecast is the GOAT
I kept it pretty simple. After plotting my data it was very clear that the drift method was going to produce the best forecast. A MEAN forecast would simply place the forecast in the middle of the data, which would look awful. There is a clear upward trend in the data, so NAIVE would be the wrong way to go; it would just carry the last observation forward and ignore that trend. The SNAIVE forecast would require my data to have a more distinct seasonality than it really possesses. The drift was the clear choice, but I had to prove that. I set up a fit with all 4 models and then tested their accuracy, looking primarily at RMSE because it is the measure I care most about for forecasting. The drift, of course, had the lowest RMSE of the 4 models, so it is once again the clear choice. From there I just wanted to make my forecast presentable and acceptable. The last thing I needed to do was create the prediction interval around my forecast. I knew from the residual diagnostics earlier that my residuals were not normally distributed, so I knew I needed to bootstrap. I bootstrapped 5,000 times and I think the end result looks fantastic. A good quality forecast that far exceeds anything the other 3 methods would have produced: the GOAT.