library(fpp3)
library(dplyr)
library(tidyverse)
library(tidyquant)
library(stargazer)

library(vars)
library(urca)
library(tsDyn)
library(apt) 
library(timetk)
library(readxl)
library(kableExtra)
library(grid)
library(gridExtra)

1 Abstract

The purpose of this report is to explore whether or not using Google Trends query results helps to forecast the number of unemployed persons in a specified state. The state of interest is Louisiana. We use three Google Trends terms, “Unemployment Benefits,” “Food Stamps,” and “File for Unemployment,” as well as the lagged observation of Unemployment itself, to forecast the future value of Unemployment. Because each of the Google Trends terms behave differently when included in the same mode, we will create separate models to use in a time series cross validation for each term. This report concludes that using Google Trends data is a valuable part of a forecaster’s toolbox for the observed data.

2 Acquiring Data

2.1 Selecting FRED Unemployment Data

FRED offers a data series titled “Unemployed Persons in Louisiana” (Code LASST220000000000004), with a frequency of Monthly, and units of Persons. The data is seasonally adjusted. The range January 2004 - December 2020 was downloaded in Excel format to be later appended with Google Trends Terms.

2.2 Selecting Google Trends Terms

Understanding which Google Trends terms have the best prediction power is a subject of study in itself. A brief literature review was conducted to survey papers that use Google Trends terms to predict changes in factors related to employment. A paper titled Google Searches Predict Initial Unemployment Insurance Claims by Zheng Tian and Stephan J. Goetz. Penn State University, May 1, 2020, use two specific terms.

The two terms used by Trian et al. are “file for unemployment” and “unemployment benefits”. For this reason, those terms are downloaded from trends.google.com for the specified time period in the State of Louisiana.

A presentation on Nowcasting using Google Trends hosted on the Federal Reserve Bank of San Francisco website enumerates more specific variables. While the analysis uses the “index” of the Welfare and Unemployment group of search queries, some specific search terms are also suggested such as Jobs, Monster, Indeed, Job Search, Resume, LinkedIn, and so on. For the purposes of this analysis, Food Stamps is chosen as the third search query.

The three queries were downloaded separately from trends.google.com and merged in Microsoft Excel, along with the Unemployment data from FRED. This file is then imported into R.

2.3 Importing and Plotting Data

We import the data from Excel as a tsibble.

file <- read_excel("C:/Users/dange/Documents/Spring 2021/Data Analysis/unempfile.xlsx", 
     col_types = c("date", "numeric", "numeric", 
         "numeric","numeric")) %>%
  mutate(month = yearmonth(Month)) %>%
  as_tsibble(index = month)

Variable	Description
LASST220000000000005	Employed Persons in Louisiana, Seasonally Adjusted Monthly
file_for_unemp	“File For Unemployment” Google search query index
unemp_benefits	“Unemployment Benefits” Google search query index
food_stamps	“Food Stamps” Google search query index

The four variables are plotted separately for preliminary observation.

p1 <- ggplot(file, aes(Month, log(UNEMP), color = "Unemployment")) +
  geom_line() + labs(y = "log of Unemployment") + theme_bw() + theme(legend.title = element_blank())
p2 <-
  ggplot(file, aes(Month, log(file_for_unemp), color = "GT: File For Unemployment")) +
  geom_line() + labs(y = "Index") + theme_bw() + theme(legend.title = element_blank())
p3 <-
  ggplot(file, aes(Month, log(unemp_benefits), color = "GT: Unemp Benefits")) +
  geom_line() + labs(y = "Index") + theme_bw() + theme(legend.title = element_blank())
p4 <-
  ggplot(file, aes(Month, log(food_stamps), color = "GT: Food Stamps")) +
  geom_line() + labs(y = "Index") + theme_bw() + theme(legend.title = element_blank())
grid.arrange(p1, p2, p3, p4, ncol = 1)

The three Google Trends terms are plotted on the same chart to casually observe covariation.

file %>%    ggplot(aes(x = month)) +
  geom_line(aes(y = file_for_unemp, color = "File for Unemployment")) +
  geom_line(aes(y = unemp_benefits, color = "Unemployment Benefits")) +
  geom_line(aes(y = food_stamps, color = "Food Stamps")) +
  labs(
    title = "Google Trends Queries",
    caption = "Source: Google Trends",
    y = "Search Popularity Index",
    x = "Month"
  )  +
  theme(legend.position = "top", legend.title = element_blank())

3 Train and Test Data

We create a training dataset from January 2004 to December 2018, and a testing dataset from January 2019 to December 2020.

train <- file %>% filter_index(. ~"2018-12-01")
test  <- file %>% filter_index("2019-01-01"~ .)

3.1 Estimation and Results

We estimate these models, and report their estimation results.

\[ log(UNEMP_{t}) = const + log(UNEMP_{t-1})\]

Equation of fit_gt \[ log(UNEMP_{t}) = const+ log(UNEMP_{t-1}) + log(foodstamps_{t-1}+1) + log(fileforunemp_{t-1}+1) + log(unempbenefits_{t-1}+1) + \epsilon_t\]

fits <- train %>%
  model(
    fit_naive = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP))),
    fit_GT     = TSLM(
      log(UNEMP) ~ 1 + lag(log(UNEMP)) + log(food_stamps + 1) + log(file_for_unemp + 1) + log(unemp_benefits + 1)
    )
  )

tidy(fits)[, c(1, 2, 3, 4, 6)] %>%
  kable(
    format = "html",
    table.attr = "style='width:75%;' ",
    caption = "Model Estimation Results",
    digits = 3
  ) %>%
  kable_classic_2(full_width = F)

Model Estimation Results
.model	term	estimate	std.error	p.value
fit_naive	(Intercept)	0.716	0.311	0.022
fit_naive	lag(log(UNEMP))	0.939	0.026	0.000
fit_GT	(Intercept)	1.307	0.320	0.000
fit_GT	lag(log(UNEMP))	0.882	0.028	0.000
fit_GT	log(food_stamps + 1)	0.028	0.009	0.002
fit_GT	log(file_for_unemp + 1)	-0.012	0.008	0.175
fit_GT	log(unemp_benefits + 1)	0.023	0.008	0.008

Thus, the estimated Naive model is as follows: \[ log(UNEMP_{t}) = 0.939log(UNEMP_{t-1}) + 0.716\] And the Google Trends estimated model is as follows:

\[ log(UNEMP_{t}) = 0.882log(UNEMP_{t-1}) + 0.028log(foodstamps_{t-1}+1) -0.012 log(fileforunemp_{t-1}+1) + 0.023log(unempbenefits_{t-1}+1) + 1.307\]

3.1.1 Splitting Google Trends Terms

Because of the conflicting coefficient signs in the above equations, we estimate using each query individually as well. We construct three separate fitted models, each consisting of the lagged Unemployment observations as well as the log of each Google Trend query. We add a constant of 1 to each Google Trend observation to allow them to be logged, as some observations hold a value of zero.

fits_split <- train %>%
  model(
    fit_GT_foodstamps = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP)) + log(food_stamps + 1)),
    fit_GT_file = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP)) + log(file_for_unemp + 1)),
    fit_GT_benefits = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP))+ log(unemp_benefits + 1)),
  )

tidy(fits_split)[, c(1, 2, 3, 4, 6)] %>%
  kable(
    format = "html",
    table.attr = "style='width:75%;' ",
    caption = "Model Estimation Results",
    digits = 3
  ) %>%
  kable_classic_2(full_width = F)

Model Estimation Results
.model	term	estimate	std.error	p.value
fit_GT_foodstamps	(Intercept)	1.192	0.319	0.000
fit_GT_foodstamps	lag(log(UNEMP))	0.892	0.028	0.000
fit_GT_foodstamps	log(food_stamps + 1)	0.031	0.007	0.000
fit_GT_file	(Intercept)	0.777	0.323	0.017
fit_GT_file	lag(log(UNEMP))	0.933	0.028	0.000
fit_GT_file	log(file_for_unemp + 1)	0.006	0.008	0.482
fit_GT_benefits	(Intercept)	1.116	0.315	0.001
fit_GT_benefits	lag(log(UNEMP))	0.902	0.027	0.000
fit_GT_benefits	log(unemp_benefits + 1)	0.031	0.008	0.000

According to the results above, the terms “Food Stamps” and “Unemployment Benefits” have slightly positive coefficients with a P-value of <0.001. This suggests that they may be strongly correlated with the expected value of Unemployment. We will use them to forecast the value in the next section.

3.2 Forecast Estimation and Results

3.2.1 Naive and Combined models

We compute the forecasts.

We report the MPE and MAPE of the model forecasts.

accuracy(fcs, test)[, c(1, 2, 6, 7)] %>%
  kable(
    format = "html",
    table.attr = "style='width:50%;' ",
    caption = "Model Forecast Results",
    col.names = c("model", "type", "MPE", "MAPE"),
    digits = 3,
  ) %>%
  kable_classic_2(full_width = F)

Model Forecast Results
model	type	MPE	MAPE
fit_GT	Test	-1.783	8.741
fit_naive	Test	0.263	8.781

We plot the forecasts.

fcs %>%
  autoplot(filter(file, year(Month) > 2016), level = NULL)

3.2.2 Split Google Trends Models

We report the MPE and MAPE of the model forecasts.

accuracy(fcs_split, test)[, c(1, 2, 6, 7)] %>%
  kable(
    format = "html",
    table.attr = "style='width:50%;' ",
    caption = "Model Forecast Results",
    col.names = c("model", "type", "MPE", "MAPE"),
    digits = 3
  ) %>%
  kable_classic_2(full_width = F)

Model Forecast Results
model	type	MPE	MAPE
fit_GT_benefits	Test	-2.446	9.359
fit_GT_file	Test	-0.441	8.815
fit_GT_foodstamps	Test	-1.465	9.095

We plot the forecasts.

fcs_split %>%
  autoplot(filter(file, year(Month) > 2016), level = NULL)

Breaking down the combined Google Trends model into its components of the three query terms reveals a different behavior exhibited by the three terms. Visually, the term “Food Stamps” underpredicted the spike in unemployment in 2020, while “Unemployment Benefits” seems to have overestimated Unemployment after the spike.

Because of the different behaviors of the query terms, we will then construct separate models in the time series cross-validation section.

4 Time Series Cross-Validation

4.1 Estimation and Results

We run a time series cross-validation using rolling windows of 48 months.

We construct four models: A naive simple lagged model, and one model with each of the Google Trends variables. The existing statistical methodologies at our disposal do not include a singular model containing all Google Trend variables. This is a limitation of this report and further research may allow for exploration into that statistical method.

Uslide <- file %>% 
  slide_tsibble(.size=48, .step = 1)

slideacc <- Uslide %>% 
  model(
    naive = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP))),
    GT_stamps = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP)  ) + log(food_stamps + 1)),
    GT_benefits = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP)) + log(unemp_benefits + 1)),
    GT_file = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP)) + log(file_for_unemp + 1))
    ) %>% 
  forecast(new_data = Uslide) %>% 
  accuracy(file)

slideacc[, c(1, 2, 6, 7)] %>%
  kable(
    format = "html",
    table.attr = "style='width:50%;' ",
    caption = "Model Forecast Results",
    col.names = c("Model", "Type", "MPE", "MAPE"),
    digits = 3
  ) %>%
  kable_classic_2(full_width = F)

Model Forecast Results
Model	Type	MPE	MAPE
GT_benefits	Test	0.280	2.947
GT_file	Test	0.248	2.848
GT_stamps	Test	0.315	2.970
naive	Test	0.220	2.867

The MAPE of the Naive model has decreased to 2.867, from the MAPE of the Naive model of 8.781. From this, we can conclude that the slide of period = 48 strengthened this model.

5 Conclusion

We have observed that Google Trends variables, when both combined and modeled individually, rival a Naive forecast in Mean Absolute Percentage Error. This is especially important considering that observations of Unemployment data often are not available to economists until months after the present, while Google Trends data is available in near- real-time. When included with lagged observations of Unemployment, these specific Google Trends terms forecasted future values of Unemployment almost as well as a Naive model. In the real world, where the “lag time” between occurence and observation of unemployment data acts as the lag executed in this report, Google Trends data are a valuable part of an Economist’s toolbox when attempting to forecast unemployment.

Google Trends and Unemployment

Patrick Martin

May 15, 2021