library(fpp3)
library(dplyr)
library(tidyverse)
library(tidyquant)
library(stargazer)

library(vars)
library(urca)
library(tsDyn)
library(apt) 
library(timetk)
library(readxl)
library(kableExtra)
library(grid)
library(gridExtra)

1 Abstract

The purpose of this report is to explore whether or not using Google Trends query results helps to forecast the number of unemployed persons in a specified state. The state of interest is Louisiana. We use three Google Trends terms, “Unemployment Benefits,” “Food Stamps,” and “File for Unemployment,” as well as the lagged observation of Unemployment itself, to forecast the future value of Unemployment. Because each of the Google Trends terms behave differently when included in the same mode, we will create separate models to use in a time series cross validation for each term. This report concludes that using Google Trends data is a valuable part of a forecaster’s toolbox for the observed data.

2 Acquiring Data

2.1 Selecting FRED Unemployment Data

FRED offers a data series titled “Unemployed Persons in Louisiana” (Code LASST220000000000004), with a frequency of Monthly, and units of Persons. The data is seasonally adjusted. The range January 2004 - December 2020 was downloaded in Excel format to be later appended with Google Trends Terms.

2.3 Importing and Plotting Data

We import the data from Excel as a tsibble.

file <- read_excel("C:/Users/dange/Documents/Spring 2021/Data Analysis/unempfile.xlsx", 
     col_types = c("date", "numeric", "numeric", 
         "numeric","numeric")) %>%
  mutate(month = yearmonth(Month)) %>%
  as_tsibble(index = month)
Variable Description
LASST220000000000005 Employed Persons in Louisiana, Seasonally Adjusted Monthly
file_for_unemp “File For Unemployment” Google search query index
unemp_benefits “Unemployment Benefits” Google search query index
food_stamps “Food Stamps” Google search query index

The four variables are plotted separately for preliminary observation.

p1 <- ggplot(file, aes(Month, log(UNEMP), color = "Unemployment")) +
  geom_line() + labs(y = "log of Unemployment") + theme_bw() + theme(legend.title = element_blank())
p2 <-
  ggplot(file, aes(Month, log(file_for_unemp), color = "GT: File For Unemployment")) +
  geom_line() + labs(y = "Index") + theme_bw() + theme(legend.title = element_blank())
p3 <-
  ggplot(file, aes(Month, log(unemp_benefits), color = "GT: Unemp Benefits")) +
  geom_line() + labs(y = "Index") + theme_bw() + theme(legend.title = element_blank())
p4 <-
  ggplot(file, aes(Month, log(food_stamps), color = "GT: Food Stamps")) +
  geom_line() + labs(y = "Index") + theme_bw() + theme(legend.title = element_blank())
grid.arrange(p1, p2, p3, p4, ncol = 1)

The three Google Trends terms are plotted on the same chart to casually observe covariation.

file %>%    ggplot(aes(x = month)) +
  geom_line(aes(y = file_for_unemp, color = "File for Unemployment")) +
  geom_line(aes(y = unemp_benefits, color = "Unemployment Benefits")) +
  geom_line(aes(y = food_stamps, color = "Food Stamps")) +
  labs(
    title = "Google Trends Queries",
    caption = "Source: Google Trends",
    y = "Search Popularity Index",
    x = "Month"
  )  +
  theme(legend.position = "top", legend.title = element_blank())

3 Train and Test Data

We create a training dataset from January 2004 to December 2018, and a testing dataset from January 2019 to December 2020.

train <- file %>% filter_index(. ~"2018-12-01")
test  <- file %>% filter_index("2019-01-01"~ .)

3.1 Estimation and Results

We estimate these models, and report their estimation results.

\[ log(UNEMP_{t}) = const + log(UNEMP_{t-1})\]

Equation of fit_gt \[ log(UNEMP_{t}) = const+ log(UNEMP_{t-1}) + log(foodstamps_{t-1}+1) + log(fileforunemp_{t-1}+1) + log(unempbenefits_{t-1}+1) + \epsilon_t\]

fits <- train %>%
  model(
    fit_naive = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP))),
    fit_GT     = TSLM(
      log(UNEMP) ~ 1 + lag(log(UNEMP)) + log(food_stamps + 1) + log(file_for_unemp + 1) + log(unemp_benefits + 1)
    )
  )
tidy(fits)[, c(1, 2, 3, 4, 6)] %>%
  kable(
    format = "html",
    table.attr = "style='width:75%;' ",
    caption = "Model Estimation Results",
    digits = 3
  ) %>%
  kable_classic_2(full_width = F)
Model Estimation Results
.model term estimate std.error p.value
fit_naive (Intercept) 0.716 0.311 0.022
fit_naive lag(log(UNEMP)) 0.939 0.026 0.000
fit_GT (Intercept) 1.307 0.320 0.000
fit_GT lag(log(UNEMP)) 0.882 0.028 0.000
fit_GT log(food_stamps + 1) 0.028 0.009 0.002
fit_GT log(file_for_unemp + 1) -0.012 0.008 0.175
fit_GT log(unemp_benefits + 1) 0.023 0.008 0.008

Thus, the estimated Naive model is as follows: \[ log(UNEMP_{t}) = 0.939log(UNEMP_{t-1}) + 0.716\] And the Google Trends estimated model is as follows:

\[ log(UNEMP_{t}) = 0.882log(UNEMP_{t-1}) + 0.028log(foodstamps_{t-1}+1) -0.012 log(fileforunemp_{t-1}+1) + 0.023log(unempbenefits_{t-1}+1) + 1.307\]

3.2 Forecast Estimation and Results

3.2.1 Naive and Combined models

We compute the forecasts.

We report the MPE and MAPE of the model forecasts.

accuracy(fcs, test)[, c(1, 2, 6, 7)] %>%
  kable(
    format = "html",
    table.attr = "style='width:50%;' ",
    caption = "Model Forecast Results",
    col.names = c("model", "type", "MPE", "MAPE"),
    digits = 3,
  ) %>%
  kable_classic_2(full_width = F)
Model Forecast Results
model type MPE MAPE
fit_GT Test -1.783 8.741
fit_naive Test 0.263 8.781

We plot the forecasts.

fcs %>%
  autoplot(filter(file, year(Month) > 2016), level = NULL)

4 Time Series Cross-Validation

4.1 Estimation and Results

We run a time series cross-validation using rolling windows of 48 months.

We construct four models: A naive simple lagged model, and one model with each of the Google Trends variables. The existing statistical methodologies at our disposal do not include a singular model containing all Google Trend variables. This is a limitation of this report and further research may allow for exploration into that statistical method.

Uslide <- file %>% 
  slide_tsibble(.size=48, .step = 1)
slideacc <- Uslide %>% 
  model(
    naive = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP))),
    GT_stamps = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP)  ) + log(food_stamps + 1)),
    GT_benefits = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP)) + log(unemp_benefits + 1)),
    GT_file = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP)) + log(file_for_unemp + 1))
    ) %>% 
  forecast(new_data = Uslide) %>% 
  accuracy(file)
slideacc[, c(1, 2, 6, 7)] %>%
  kable(
    format = "html",
    table.attr = "style='width:50%;' ",
    caption = "Model Forecast Results",
    col.names = c("Model", "Type", "MPE", "MAPE"),
    digits = 3
  ) %>%
  kable_classic_2(full_width = F)
Model Forecast Results
Model Type MPE MAPE
GT_benefits Test 0.280 2.947
GT_file Test 0.248 2.848
GT_stamps Test 0.315 2.970
naive Test 0.220 2.867

The MAPE of the Naive model has decreased to 2.867, from the MAPE of the Naive model of 8.781. From this, we can conclude that the slide of period = 48 strengthened this model.

5 Conclusion

We have observed that Google Trends variables, when both combined and modeled individually, rival a Naive forecast in Mean Absolute Percentage Error. This is especially important considering that observations of Unemployment data often are not available to economists until months after the present, while Google Trends data is available in near- real-time. When included with lagged observations of Unemployment, these specific Google Trends terms forecasted future values of Unemployment almost as well as a Naive model. In the real world, where the “lag time” between occurence and observation of unemployment data acts as the lag executed in this report, Google Trends data are a valuable part of an Economist’s toolbox when attempting to forecast unemployment.