Load packages and data

library(tidyverse)
library(fpp3)
library(GGally)

Questions

Question 1

1.

az_data <- readxl::read_excel("//Users//colinadams//Documents//GCSU//Fall 2022//Forecasting//Test 1//Test 1 EC.xlsx")

az_step1 <- az_data %>%
  select(-GeoFips, -GeoName, -LineCode) %>%
  pivot_longer(!Description, names_to = "Time", values_to = "count") %>%
  pivot_wider(names_from = Description, values_from = count)

az_step1$Time = as.numeric(as.character(az_step1$Time))

az_final <- az_step1 %>%
  as_tsibble(index = Time)

az_final
## # A tsibble: 23 x 5 [1Y]
##     Time   GDP Income Consumption Employment
##    <dbl> <dbl>  <dbl>       <dbl>      <dbl>
##  1  1999   8.7    6.2         7.8        3  
##  2  2000   6      8.6         8.5        3.9
##  3  2001   4.1    4.6         4.8        1  
##  4  2002   5      3.6         5          0.6
##  5  2003   7.3    6           6.9        2.5
##  6  2004   6.7    9.2         8.8        4.3
##  7  2005  10.3   10.5         9.4        5.9
##  8  2006   7.9   10.8         8.4        4.8
##  9  2007   6.3    5.7         5.7        2.7
## 10  2008   0.6    1.5         1.3       -1.8
## # … with 13 more rows
## # ℹ Use `print(n = ...)` to see more rows
az_final %>%
  ggplot(aes(x = Time)) +
  geom_line(aes(y = GDP, color = "GDP")) +
  geom_line(aes(y = Income, color = "Income")) +
  geom_line(aes(y = Consumption, color = "Consumption")) +
  geom_line(aes(y = Employment, color = "Employment")) +
  labs(y = "Percentage Change", title = "GDP, Income, Consumption, and Employment Overtime") +
  scale_colour_manual(values = c(GDP = "black", Income = "red", Consumption = "blue", Employment = "dark green")) +
  guides(colour = guide_legend(title = NULL))

After downloading my data from the BEA website as an Excel file, I imported it into R. I renamed the description fields before importing in order to make the data more understandable and easier to work with: current-dollar gross domestic product became GDP, personal income became Income, total personal consumption expenditure became Consumption, and total employment became Employment. I then removed unnecessary columns such as GeoFips, GeoName, and LineCode, since these were identical for all of my Arizona observations. Next, I used pivot_longer() and pivot_wider() to turn a few rows with many columns into many rows with few columns. This put the descriptive names at the top of my data set, with one row of observations per year underneath.
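
For reproducibility, the same renaming could also be done in R after import. This is only a sketch: the str_detect() patterns below are my guesses at the original BEA description strings, so they may need adjusting.

# hypothetical sketch: recode the BEA Description labels in R instead of Excel
az_data_renamed <- az_data %>%
  mutate(Description = case_when(
    str_detect(Description, "gross domestic product") ~ "GDP",
    str_detect(Description, "[Pp]ersonal income") ~ "Income",
    str_detect(Description, "consumption") ~ "Consumption",
    str_detect(Description, "employment") ~ "Employment",
    TRUE ~ Description
  ))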

I then converted my Time variable to numeric so that it would work as a tsibble index. I used as_tsibble() to create the final version of the cleaned data as a tsibble, indexed by Time because this is the variable the other variables are tracked over. This allowed me to do many of the operations below.

Finally, I plotted the variables over time to get a sense of my data. This allowed me to form a hypothesis about each variable and how they might be correlated with one another; I examine these correlations in the next part. Looking at the data, I can see the effects of the Great Recession in 2008 on GDP, Income, Consumption, and Employment: that event caused all of them to drop steeply. There are similar but weaker dips in 2001 and 2020 due to 9/11 and the Covid-19 pandemic. Interestingly, the only variable that did not drop in 2020 was Income. After the pandemic, Consumption, GDP, and Employment each recovered quickly.

2.

ggpairs(az_final %>% select(-Time))

I used the ggpairs() function from the GGally package to see the correlation and statistical significance between each pair of variables. The function also draws a scatter plot for each combination, which helps show how each pair of variables moves together. I explain these correlations below.
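
As a numeric cross-check on the values discussed below, the same pairwise correlations can be pulled out directly with cor(). (Because a tsibble keeps its index, select(-Time) does not actually drop Time, which is why the Time correlations still appear in the pairs plot.)

# numeric correlation matrix behind the pairs plot; rounding is just for readability
az_final %>%
  as_tibble() %>%
  cor() %>%
  round(2)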

3.

-Income and GDP are positively correlated and this is statistically significant at the 99.9% level. This is very highly correlated in that when one increases the other increases most of the time. This finding makes sense because Income is partially a function of GDP.

-Consumption and GDP are positively correlated and this is statistically significant at the 99.9% level. This is very highly correlated in that when one increases the other almost certainly increases. This finding makes sense because Consumption is partially a function of GDP.

-Employment and GDP are positively correlated and this is statistically significant at the 99.9% level. This is very highly correlated in that when one increases the other almost certainly increases. This finding is unsurprising: when more people work, overall output in Arizona increases.

-Time and GDP are negatively correlated, but I did not find this to be statistically significant. If it were significant, it would mean that GDP growth tends to fall as time passes. The lack of significance makes sense: these series are percentage changes, so there is no strong reason for the growth rate to trend up or down over time, even though the level of GDP generally rises.

-Consumption and Income are positively correlated and this is statistically significant at the 99.9% level. This is highly correlated in that when one increases the other increases most of the time. This relationship makes sense because one's consumption opportunities are ultimately defined by their income.

-Employment and Income are positively correlated and this is statistically significant at the 99.9% level. This is highly correlated in that when one increases the other increases most of the time.

-Time and Income are only weakly correlated, and I did not find the correlation to be statistically significant. I believe it would make sense for Income to be positively correlated with Time, since income tends to increase over time, but I did not find statistically significant evidence of this for Arizona.

-Employment and Consumption are positively correlated and this is statistically significant at the 99.9% level. This is highly correlated in that when one increases the other increases most of the time. This makes sense because as more people work, the overall amount of consumption should increase. This can be due to either more people being in the state or people having more money.

-Time and Consumption are negatively correlated, but I did not find this to be statistically significant. If this were significant, this would mean that when one increases, the other decreases. I believe it would make sense for Consumption to be positively correlated with Time as Consumption tends to increase over Time. I did not find any statistically significant evidence to support this correlation in Arizona.

-Time and Employment are negatively correlated, but I did not find this to be statistically significant. If this were significant, this would mean that when one increases, the other decreases. I believe it would make sense for Employment to be positively correlated with Time as total Employment tends to increase over Time. This is because population increases over Time as well. Also, employment follows business cycles which could work against the population effect I mentioned. I did not find any statistically significant evidence to support this correlation in Arizona.

4.

az_model <- az_final %>%
  model(TSLM(Consumption ~ GDP + Income + Employment))
report(az_model)
## Series: Consumption 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4318 -0.8778 -0.5747  0.6752  3.2778 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.2379     0.6671   1.856 0.079070 .  
## GDP           0.9274     0.2035   4.556 0.000216 ***
## Income       -0.1662     0.1522  -1.092 0.288426    
## Employment    0.1856     0.2929   0.634 0.533843    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.381 on 19 degrees of freedom
## Multiple R-squared: 0.8568,  Adjusted R-squared: 0.8342
## F-statistic: 37.89 on 3 and 19 DF, p-value: 3.2342e-08

I used the TSLM() function in R to run a regression for Consumption in Arizona, with GDP, Income, and Employment as predictors, in order to measure their effects on Consumption.

5.

-The Intercept coefficient from my TSLM regression indicates that when GDP, Income, and Employment growth are all zero, Consumption still grows by 1.2379 percentage points on average. This estimate is significant at the 10% level but not at the 5% level.

-My GDP coefficient indicates that when GDP increases by 1 percentage point, Consumption increases by 0.9274 percentage points on average, holding the other predictors constant. This effect is statistically significant at the 99.9% level, which makes sense in context: when GDP increases, Arizona produces more and has more money to spend (consume).

-My Income coefficient indicates that when Income increases by 1 percentage point, Consumption decreases by 0.1662 percentage points on average. I did not find this effect to be statistically significant, and I cannot think of how a negative effect would make sense in context. Since the estimate is neither large nor significant, more data would be needed to pin down Income's effect on Consumption.

-My Employment coefficient indicates that when Employment increases by 1 percentage point, Consumption increases by 0.1856 percentage points on average. I did not find this effect to be statistically significant.
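
To make the coefficient interpretation concrete, here is a quick sanity check that plugs hypothetical growth rates (my own choice, not values from the problem) into the estimated equation:

# fitted Consumption growth = 1.2379 + 0.9274*GDP - 0.1662*Income + 0.1856*Employment
# e.g. with GDP = 5, Income = 5, and Employment = 2 (hypothetical values):
1.2379 + 0.9274 * 5 - 0.1662 * 5 + 0.1856 * 2   # about 5.42 percentage points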

6.

augment(az_model) %>%
  ggplot(aes(x = Time)) +
  geom_line(aes(y = Consumption, color = "Data")) +
  geom_line(aes(y = .fitted, color = "Fitted")) +
  labs(y = "Percentage Change", title = "Percent Change in Arizona Consumption") +
  scale_colour_manual(values = c(Data = "black", Fitted = "red")) +
  guides(colour = guide_legend(title = NULL))

I plotted the fitted values for Consumption from my TSLM regression against the observed data to see how large the residuals are. The fitted values track the observed values closely, which suggests this model could produce a reasonable forecast of future Consumption.

7.

fit_az_model <- az_final %>%
  model(lm = TSLM(Consumption ~ GDP + Income + Employment))

az_scenarios <- scenarios(
  Increase = new_data(az_final, 3) %>%
  mutate(GDP = 1, Income = 1, Employment = 1),
  names_to = "Scenario")

az_forecast1 <- forecast(fit_az_model, new_data = az_scenarios)

az_scenarios2 <- scenarios(
  Increase = new_data(az_final, 3) %>%
    mutate(GDP = 0, Income = 1, Employment = 0),
  Decrease = new_data(az_final, 3) %>%
    mutate(GDP = 0, Income = -.5, Employment = 0),
  Increase1 = new_data(az_final, 3) %>%
    mutate(GDP = 0, Income = .8, Employment = 0), names_to = "Scenario")

az_forecast2 <- forecast(fit_az_model, new_data = az_scenarios2)

az_final %>%
  autoplot(Consumption) +
  autolayer(az_forecast1) +
  autolayer(az_forecast2) +
  labs(title = "Arizona Scenario 1 and 2 on Consumption Forecast", y = "Percentage Change")

I used scenario-based forecasting to forecast the individual scenarios given in the problem. I forecast these scenarios individually and then plotted them together. Using the given changes in GDP, Income, and Employment, I forecast the value of Consumption up to three years ahead, with prediction intervals. I found the drop in forecasted consumption growth interesting; I do not believe the drop would be this large given our scenarios.
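
The size of the drop does follow from the coefficients themselves: the scenario predictor values (between -0.5 and 1) are much smaller than the recent observed growth rates, so the fitted line is pulled down. A rough check for the Income-only scenarios, using the estimates reported above:

# point forecasts implied by the regression when GDP = 0 and Employment = 0;
# these should line up with the scenario forecast means in the plot
income_changes <- c(Increase = 1, Decrease = -0.5, Increase1 = 0.8)
1.2379 - 0.1662 * income_changes   # roughly 1.07, 1.32, and 1.10 percentage points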

Question 2

Preparing data and plotting

class_data <- readxl::read_excel("//Users//colinadams//Documents//GCSU//Fall 2022//Forecasting//Test 1//M3_exam1.xlsx")

colin_data <- class_data %>%
  select(-Series, -N, -NF, -Category, -'Starting Year', -'Starting Quarter') %>%
  filter(`Student Name` == 'Colin') %>%
  pivot_longer(-'Student Name', values_to = "Micro") %>%
  select(Micro)
  
colin_data <- na.omit(colin_data)  

colin_data <- colin_data %>%
  mutate(Date = yearquarter(56:99)) %>%
  select(Date, Micro) %>%
  as_tsibble(index = Date)

colin_data
## # A tsibble: 44 x 2 [1Q]
##       Date Micro
##      <qtr> <dbl>
##  1 1984 Q1 3143.
##  2 1984 Q2 3191.
##  3 1984 Q3 3179.
##  4 1984 Q4 3171.
##  5 1985 Q1 3124.
##  6 1985 Q2 3170 
##  7 1985 Q3 3201.
##  8 1985 Q4 3177.
##  9 1986 Q1 3170.
## 10 1986 Q2 3269.
## # … with 34 more rows
## # ℹ Use `print(n = ...)` to see more rows
colin_data %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = Micro)) +
  labs(title = "Colin's Micro Data Over Time", y = "Micro Data Index")

In preparing the data, I removed unnecessary columns and rows, such as other students' data, columns that did not contain my data, and the category fields. I renamed my value column "Micro" because I am using microeconomic data. I also added a Date column giving the quarter in which each observation of my microeconomic data occurred. My data starts in the first quarter (Q1) of 1984.
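
One detail worth flagging: yearquarter() interprets a bare number as the count of quarters since 1970 Q1, which is why 56:99 gives 44 quarters running from 1984 Q1 to 1994 Q4 (the printed tsibble above confirms the start date).

# quick check of the numeric yearquarter() indexing used above
yearquarter(56)   # 1984 Q1 (56 quarters after 1970 Q1)
yearquarter(99)   # 1994 Q4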

After I loaded and cleaned the data, I printed and plotted it to learn its nature. From the plot, my first observation is that the series jumps sharply in 1987 Q4, roughly doubling its level. I see a small increasing trend before the jump, and this trend continues after it as well. The magnitude of the variation in the data does not appear to change much after the jump either. I also see some random-walk behavior in my data.

Setting up Test, Train, etc. sub-datasets

colin_train <- colin_data %>%
  filter(year(Date) <= 1991)
colin_test <- colin_data %>%
  filter(year(Date) > 1991)
colin_drift_prep <- colin_data %>%
  filter(year(Date) > 1987)
colin_drift_prep_train <- colin_drift_prep %>%
  filter(year(Date) < 1993)
colin_drift_prep_test <- colin_drift_prep %>%
  filter(year(Date) > 1992)

After preparing and observing my data, I divided it into training and test subsets so that I can assess the accuracy of the forecasting methods I try. I also made a separate training and test set containing only observations after the jump. I plan to use these to compute a drift forecast whose first observation (y1) is 1988 Q1. This lets me capture the increasing trend that continues after the jump while preventing the jump itself from inflating the drift, which depends only on the first and last observations.
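
As a rough check on that reasoning, the drift slope is just the average change from the first to the last observation of the training window, so it can be computed directly. This mirrors how RW(y ~ drift()) estimates its slope.

# average per-quarter change implied by the post-jump training data
colin_drift_prep_train %>%
  as_tibble() %>%
  summarise(drift_slope = (last(Micro) - first(Micro)) / (n() - 1))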

Transformations to understand the data

colin_data %>%
  features(Micro, features = guerrero)
## # A tibble: 1 × 1
##   lambda_guerrero
##             <dbl>
## 1            2.00
lambda <- 2.00
colin_data %>%
  autoplot(box_cox(Micro, lambda)) +
  labs(title = "Box-Cox Transformed Colin's Micro Data", x = "Date")

colin_data_add <- colin_data %>%
  model(classical_decomposition(Micro, type = "additive")) %>%
  components() %>%
  autoplot() +
  labs(title = "Classical Additive Decomposition of Colin's Micro Data")
colin_data_add
## Warning: Removed 2 row(s) containing missing values (geom_path).

colin_data_mult <- colin_data %>%
  model(classical_decomposition(Micro, type = "multiplicative")) %>%
  components() %>%
  autoplot() +
  labs(title = "Classical Multiplicative Decomposition of Colin's Micro Data")
colin_data_mult
## Warning: Removed 2 row(s) containing missing values (geom_path).

colin_data_stl <- colin_data %>%
  model(STL(Micro ~ season(window = 4), robust = TRUE)) %>%
  components() %>% 
  autoplot() +
  labs(title = "STL Decomposition: Colin's Micro Data")
colin_data_stl

After dividing the data into training and test sets, I did a transformation and several decompositions to learn more about the data. I applied a Box-Cox transformation to try to make the variance of my microeconomic data constant. This transformation did not benefit me much because the variance did not change much to begin with; it tells me the variance is close to constant, apart from the jump in 1987. That lends itself well to an additive classical decomposition and an STL decomposition, but I tried additive, multiplicative, and STL decompositions. None of the three showed much seasonality: the scale of the seasonal component is negligible compared to the trend and remainder. This makes it unlikely that I will choose a seasonal naive forecast or any method that leans heavily on seasonality. The decompositions did confirm my belief in a slight increasing trend, and the effect of the jump in my data is visible in the remainder.
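
For reference, the Box-Cox transformation for positive data with lambda not equal to 0 is (y^lambda - 1) / lambda, so the guerrero choice of lambda = 2 essentially squares the series; with variance already roughly constant, that mostly just stretches the larger values. A minimal sketch of the transform:

# Box-Cox for positive y: (y^lambda - 1) / lambda, or log(y) when lambda == 0
manual_box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
manual_box_cox(3143, 2)   # compare with box_cox(3143, 2) used above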

Chosen Forecasting method: The Naive Forecast

colin_naive <- colin_data %>%
  model(Naive = NAIVE(Micro))
colin_naive %>%
  gg_tsresiduals()
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).

augment(colin_naive) %>%
  features(.resid, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 Naive    0.382      1.00
colin_naive_forecast <- colin_naive %>%
  forecast(h = "1 year")
colin_naive_forecast
## # A fable: 4 x 4 [1Q]
## # Key:     .model [1]
##   .model    Date           Micro .mean
##   <chr>    <qtr>          <dist> <dbl>
## 1 Naive  1995 Q1  N(6177, 69243) 6177.
## 2 Naive  1995 Q2 N(6177, 138486) 6177.
## 3 Naive  1995 Q3 N(6177, 207728) 6177.
## 4 Naive  1995 Q4 N(6177, 276971) 6177.
colin_data %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = Micro)) +
  autolayer(colin_naive_forecast) +
  labs(title = "Naive Forecast of Colin's Micro Data", y = "Micro Data Index")

I chose the Naive forecast as my preferred method because it had the lowest RMSE, MAE, MAPE, and MASE from the accuracy() function in R, its residuals are uncorrelated with a mean of roughly 0, and the residuals resemble white noise.

I used many forecasting methods, shown below, and checked the accuracy measures of each as a starting point for choosing my forecast. After comparing Naive, SNaive, Mean, Drift, TSLM regression, exponential trend, and piecewise trend forecasts on my training and test data, I found that the Naive method had the best accuracy measures.

Then, I looked at the residuals of the Naive method using gg_tsresiduals(): the residuals are not autocorrelated, they appear to be white noise, and their mean is roughly 0. I have two major outliers, but other than those observations the residuals are roughly normally distributed. That is good for the Naive forecast, even though normality is not necessary for a good forecast. Other than the two outliers, the residuals also appear homoskedastic. My last check was the Ljung-Box test, which returned a p-value of 1.00 for the Naive model. Because this p-value is well above 0.05, I fail to reject the null hypothesis of the Ljung-Box test, meaning my residuals are indistinguishable from white noise. I also ran time series cross-validation to check my model, but it was not very informative, so I did not include it here; I believe that is because of my small number of observations and my short forecasting window of 4 periods.
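
For completeness, this is roughly what that cross-validation looked like; the .init and .step values here are my own choices and not anything given in the problem.

# sketch of time series cross-validation for the Naive model (not run above)
colin_cv_accuracy <- colin_data %>%
  stretch_tsibble(.init = 12, .step = 1) %>%
  model(Naive = NAIVE(Micro)) %>%
  forecast(h = 4) %>%
  accuracy(colin_data)
colin_cv_accuracy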

Finally, I believe the Naive forecast is the best choice for my microeconomic data because it meets the two requirements of a good forecast (uncorrelated residuals with a mean of 0), and its residuals are also homoskedastic, roughly normally distributed, and indistinguishable from white noise. The Naive forecast also produced the lowest RMSE, MAE, MAPE, and MASE of all the methods I tested. I also believe the Naive method fits my data because it accounts for the random walk seen in the series, while there is little trend.

I included the point forecast of 6176.6, the distribution, and a plot of my forecast for the next year above.
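
The forecast distribution above also behaves as expected for a naive model: the variance grows linearly with the horizon, so each additional quarter ahead adds another one-step variance.

# naive forecast variance scales with the horizon h: h * sigma^2
sigma2_one_step <- 69243        # one-step variance from the fable above
sigma2_one_step * 1:4           # matches 69243, 138486, 207728, 276971 up to rounding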

Other forecasting methods I tried

Basic 4 Forecasts (Naive, Drift, Mean, SNaive)

colin_basic <- colin_train %>%
  model(
    Seasonal_naive = SNAIVE(Micro),
    Naive = NAIVE(Micro),
    Drift = RW(Micro ~ drift()),
    Mean = MEAN(Micro)
  )

colin_basic_forecast <- colin_basic %>%
  forecast(h = 4)
accuracy(colin_basic_forecast, colin_data)
## # A tibble: 4 × 10
##   .model         .type     ME   RMSE    MAE   MPE  MAPE  MASE  RMSSE   ACF1
##   <chr>          <chr>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>
## 1 Drift          Test  -252.   279.   252.  -4.57  4.57 0.653 0.403   0.306
## 2 Mean           Test  1100.  1100.  1100.  19.9  19.9  2.85  1.59   -0.125
## 3 Naive          Test   -55.1   69.0   56.5 -1.00  1.03 0.146 0.0998 -0.125
## 4 Seasonal_naive Test   -69.9   86.4   69.9 -1.27  1.27 0.181 0.125  -0.213
colin_basic_forecast %>%
  autoplot() +
  autolayer(colin_data)
## Plot variable not specified, automatically selected `.vars = Micro`

colin_naive1 <- colin_train %>%
  model(Naive = NAIVE(Micro))
colin_naive1 %>%
  gg_tsresiduals()
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).

colin_naive1_forecast <- colin_naive1 %>%
  forecast(h = "1 year")
accuracy(colin_naive1_forecast, colin_data)
## # A tibble: 1 × 10
##   .model .type    ME  RMSE   MAE   MPE  MAPE  MASE  RMSSE   ACF1
##   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>
## 1 Naive  Test  -55.1  69.0  56.5 -1.00  1.03 0.146 0.0998 -0.125

When I began looking at forecasting methods for my data, I fit the four basic models to form a baseline against which to compare my other forecasts: Drift, Mean, Naive, and SNaive. I did not expect to end up choosing the Naive model, but that is how it turned out. As shown above, the Naive method has the lowest value of every accuracy measure, which shows it to be the best of the four basic models.

Second preferred method: Drift After The Jump

colin_drift2 <- colin_drift_prep_train %>% 
  model(Drift = RW(Micro ~ drift())) 

colin_drift2_forecast <- colin_drift2 %>%
  forecast(h = "1 year")

colin_drift2 %>%
  gg_tsresiduals()
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).

augment(colin_drift2) %>%
  features(.resid, ljung_box, lag = 10, dof = 0)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 Drift     5.26     0.873
accuracy(colin_drift2_forecast, colin_data)
## # A tibble: 1 × 10
##   .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Drift  Test   115.  137.  116.  2.00  2.03 0.336 0.211 0.139
colin_drift3 <- colin_data %>% 
  model(Drift = RW(Micro ~ drift())) 

colin_drift3_forecast <- colin_drift3 %>%
  forecast(h = "1 year")

colin_data %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = Micro)) +
  autolayer(colin_drift3_forecast) +
  labs(title = "Naive Forecast of Colin's Micro Data", y = "Micro Data Index")

My primary motive for attempting the Drift model after the jump was to account for a trend in my data. A small trend is visible in all of my decompositions above, which could make drift the best forecast. Because of the jump, using the drift method with data starting in 1984 Q1 would produce a drift that is too large and would not make sense for my data.

To account for this issue, I created a new sub-dataset containing only data after 1987, where the large jump occurred, and split it into training and test sets, keeping an approximate 70/30 train/test split. The accuracy() measures looked promising but unfortunately fell short of the Naive model shown above.

Because this is my second-best forecasting method, I still wanted to examine the residuals, and they also fall short of the Naive forecast. There is no autocorrelation in the residuals and their mean is approximately 0, but their distribution is far from normal. The residuals no longer contain the outliers because the sample starts after the jump, but that alone was not enough to make the distribution preferable. The Ljung-Box test still indicated residuals that are indistinguishable from white noise.

I believe if there were a stronger trend in my data, or if the data were to begin a trend, the Drift after the jump method would be preferred to the Naive.

TSLM regression using trend and seasons

colin_data_reg_train <- colin_train %>%
  model(TSLM(Micro ~ trend() + season()))
report(colin_data_reg_train)
## Series: Micro 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -800.73 -385.19   38.83  410.46  663.99 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2619.922    227.085  11.537 6.07e-12 ***
## trend()        113.583      9.588  11.847 3.33e-12 ***
## season()year2  -61.565    248.723  -0.248    0.806    
## season()year3 -139.241    249.276  -0.559    0.581    
## season()year4  -37.310    250.197  -0.149    0.883    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 497.1 on 27 degrees of freedom
## Multiple R-squared: 0.8403,  Adjusted R-squared: 0.8167
## F-statistic: 35.52 on 4 and 27 DF, p-value: 2.1663e-10
colin_data_reg_test <- colin_data_reg_train %>%
  forecast(h = 4)
accuracy(colin_data_reg_test, colin_data)
## # A tibble: 1 × 10
##   .model                   .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>                    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 TSLM(Micro ~ trend() + … Test  -945.  955.  945. -17.1  17.1  2.45  1.38 0.231

I wanted to check a possible forecast using regression with the TSLM() function; my first attempt is above. I used trend and seasonality as the predictors and my microeconomic data as the forecast variable. When I checked the RMSE of this regression, I was very disappointed to see it was roughly 14 times that of the Naive method. This made me want to try the different regressions shown below.

Linear, Exponential, and Piecewise trend forecasts without seasonality

colin_data_lepw <- colin_train %>%
  model(
  linear = TSLM(Micro ~ trend()),
  exponential = TSLM(log(Micro) ~ trend()))

colin_data_lepw_forecast <- colin_data_lepw %>%
  forecast(h = 12) 
colin_data_lepw_forecast %>%
  autoplot(colin_data)

accuracy(colin_data_lepw_forecast, colin_data)
## # A tibble: 2 × 10
##   .model      .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>       <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 exponential Test  -2059. 2139. 2059. -36.0  36.0  5.33  3.09 0.768
## 2 linear      Test  -1242. 1273. 1242. -21.8  21.8  3.22  1.84 0.692
colin_data_lepw %>%
  select(linear) %>%
  gg_tsresiduals()

colin_data_lepw %>%
  select(exponential) %>%
  gg_tsresiduals()

I also tried forecasting with just a linear trend and an exponential trend. Both gave a higher RMSE than the original trend-plus-season regression, and their residuals from gg_tsresiduals() look the worst of all my forecasting attempts: they are highly correlated and not homoskedastic. When I plotted the forecasts from the training data, each missed the test observations by a wide margin, which is a large part of why they scored so poorly on the accuracy measures. A sketch of the piecewise trend I mentioned earlier follows below.
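
For reference, this is roughly how that piecewise trend would be specified; the knot at 1988 Q1 is my own choice, placed at the jump in the series.

# sketch of a piecewise linear trend with a knot at the 1988 Q1 jump
colin_piecewise <- colin_train %>%
  model(piecewise = TSLM(Micro ~ trend(knots = yearquarter("1988 Q1"))))
colin_piecewise %>%
  forecast(h = 12) %>%
  accuracy(colin_data)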