library(fpp3)
library(dplyr)
library(tidyverse)
library(tidyquant)
library(stargazer)
library(vars)
library(urca)
library(tsDyn)
library(apt)
library(timetk)
library(readxl)
library(kableExtra)
library(grid)
library(gridExtra)
The purpose of this report is to explore whether or not using Google Trends query results helps to forecast the number of unemployed persons in a specified state. The state of interest is Louisiana. We use three Google Trends terms, “Unemployment Benefits,” “Food Stamps,” and “File for Unemployment,” as well as the lagged observation of Unemployment itself, to forecast the future value of Unemployment. Because each of the Google Trends terms behave differently when included in the same mode, we will create separate models to use in a time series cross validation for each term. This report concludes that using Google Trends data is a valuable part of a forecaster’s toolbox for the observed data.
FRED offers a data series titled “Unemployed Persons in Louisiana” (Code LASST220000000000004), with a frequency of Monthly, and units of Persons. The data is seasonally adjusted. The range January 2004 - December 2020 was downloaded in Excel format to be later appended with Google Trends Terms.
Understanding which Google Trends terms have the best prediction power is a subject of study in itself. A brief literature review was conducted to survey papers that use Google Trends terms to predict changes in factors related to employment. A paper titled Google Searches Predict Initial Unemployment Insurance Claims by Zheng Tian and Stephan J. Goetz. Penn State University, May 1, 2020, use two specific terms.
The two terms used by Trian et al. are “file for unemployment” and “unemployment benefits”. For this reason, those terms are downloaded from trends.google.com for the specified time period in the State of Louisiana.
A presentation on Nowcasting using Google Trends hosted on the Federal Reserve Bank of San Francisco website enumerates more specific variables. While the analysis uses the “index” of the Welfare and Unemployment group of search queries, some specific search terms are also suggested such as Jobs, Monster, Indeed, Job Search, Resume, LinkedIn, and so on. For the purposes of this analysis, Food Stamps is chosen as the third search query.
The three queries were downloaded separately from trends.google.com and merged in Microsoft Excel, along with the Unemployment data from FRED. This file is then imported into R.
We import the data from Excel as a tsibble.
file <- read_excel("C:/Users/dange/Documents/Spring 2021/Data Analysis/unempfile.xlsx",
col_types = c("date", "numeric", "numeric",
"numeric","numeric")) %>%
mutate(month = yearmonth(Month)) %>%
as_tsibble(index = month)
| Variable | Description |
|---|---|
| LASST220000000000005 | Employed Persons in Louisiana, Seasonally Adjusted Monthly |
| file_for_unemp | “File For Unemployment” Google search query index |
| unemp_benefits | “Unemployment Benefits” Google search query index |
| food_stamps | “Food Stamps” Google search query index |
The four variables are plotted separately for preliminary observation.
p1 <- ggplot(file, aes(Month, log(UNEMP), color = "Unemployment")) +
geom_line() + labs(y = "log of Unemployment") + theme_bw() + theme(legend.title = element_blank())
p2 <-
ggplot(file, aes(Month, log(file_for_unemp), color = "GT: File For Unemployment")) +
geom_line() + labs(y = "Index") + theme_bw() + theme(legend.title = element_blank())
p3 <-
ggplot(file, aes(Month, log(unemp_benefits), color = "GT: Unemp Benefits")) +
geom_line() + labs(y = "Index") + theme_bw() + theme(legend.title = element_blank())
p4 <-
ggplot(file, aes(Month, log(food_stamps), color = "GT: Food Stamps")) +
geom_line() + labs(y = "Index") + theme_bw() + theme(legend.title = element_blank())
grid.arrange(p1, p2, p3, p4, ncol = 1)
The three Google Trends terms are plotted on the same chart to casually observe covariation.
file %>% ggplot(aes(x = month)) +
geom_line(aes(y = file_for_unemp, color = "File for Unemployment")) +
geom_line(aes(y = unemp_benefits, color = "Unemployment Benefits")) +
geom_line(aes(y = food_stamps, color = "Food Stamps")) +
labs(
title = "Google Trends Queries",
caption = "Source: Google Trends",
y = "Search Popularity Index",
x = "Month"
) +
theme(legend.position = "top", legend.title = element_blank())
We create a training dataset from January 2004 to December 2018, and a testing dataset from January 2019 to December 2020.
train <- file %>% filter_index(. ~"2018-12-01")
test <- file %>% filter_index("2019-01-01"~ .)
We estimate these models, and report their estimation results.
\[ log(UNEMP_{t}) = const + log(UNEMP_{t-1})\]
Equation of fit_gt \[ log(UNEMP_{t}) = const+ log(UNEMP_{t-1}) + log(foodstamps_{t-1}+1) + log(fileforunemp_{t-1}+1) + log(unempbenefits_{t-1}+1) + \epsilon_t\]
fits <- train %>%
model(
fit_naive = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP))),
fit_GT = TSLM(
log(UNEMP) ~ 1 + lag(log(UNEMP)) + log(food_stamps + 1) + log(file_for_unemp + 1) + log(unemp_benefits + 1)
)
)
tidy(fits)[, c(1, 2, 3, 4, 6)] %>%
kable(
format = "html",
table.attr = "style='width:75%;' ",
caption = "Model Estimation Results",
digits = 3
) %>%
kable_classic_2(full_width = F)
| .model | term | estimate | std.error | p.value |
|---|---|---|---|---|
| fit_naive | (Intercept) | 0.716 | 0.311 | 0.022 |
| fit_naive | lag(log(UNEMP)) | 0.939 | 0.026 | 0.000 |
| fit_GT | (Intercept) | 1.307 | 0.320 | 0.000 |
| fit_GT | lag(log(UNEMP)) | 0.882 | 0.028 | 0.000 |
| fit_GT | log(food_stamps + 1) | 0.028 | 0.009 | 0.002 |
| fit_GT | log(file_for_unemp + 1) | -0.012 | 0.008 | 0.175 |
| fit_GT | log(unemp_benefits + 1) | 0.023 | 0.008 | 0.008 |
Thus, the estimated Naive model is as follows: \[ log(UNEMP_{t}) = 0.939log(UNEMP_{t-1}) + 0.716\] And the Google Trends estimated model is as follows:
\[ log(UNEMP_{t}) = 0.882log(UNEMP_{t-1}) + 0.028log(foodstamps_{t-1}+1) -0.012 log(fileforunemp_{t-1}+1) + 0.023log(unempbenefits_{t-1}+1) + 1.307\]
Because of the conflicting coefficient signs in the above equations, we estimate using each query individually as well. We construct three separate fitted models, each consisting of the lagged Unemployment observations as well as the log of each Google Trend query. We add a constant of 1 to each Google Trend observation to allow them to be logged, as some observations hold a value of zero.
fits_split <- train %>%
model(
fit_GT_foodstamps = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP)) + log(food_stamps + 1)),
fit_GT_file = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP)) + log(file_for_unemp + 1)),
fit_GT_benefits = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP))+ log(unemp_benefits + 1)),
)
tidy(fits_split)[, c(1, 2, 3, 4, 6)] %>%
kable(
format = "html",
table.attr = "style='width:75%;' ",
caption = "Model Estimation Results",
digits = 3
) %>%
kable_classic_2(full_width = F)
| .model | term | estimate | std.error | p.value |
|---|---|---|---|---|
| fit_GT_foodstamps | (Intercept) | 1.192 | 0.319 | 0.000 |
| fit_GT_foodstamps | lag(log(UNEMP)) | 0.892 | 0.028 | 0.000 |
| fit_GT_foodstamps | log(food_stamps + 1) | 0.031 | 0.007 | 0.000 |
| fit_GT_file | (Intercept) | 0.777 | 0.323 | 0.017 |
| fit_GT_file | lag(log(UNEMP)) | 0.933 | 0.028 | 0.000 |
| fit_GT_file | log(file_for_unemp + 1) | 0.006 | 0.008 | 0.482 |
| fit_GT_benefits | (Intercept) | 1.116 | 0.315 | 0.001 |
| fit_GT_benefits | lag(log(UNEMP)) | 0.902 | 0.027 | 0.000 |
| fit_GT_benefits | log(unemp_benefits + 1) | 0.031 | 0.008 | 0.000 |
According to the results above, the terms “Food Stamps” and “Unemployment Benefits” have slightly positive coefficients with a P-value of <0.001. This suggests that they may be strongly correlated with the expected value of Unemployment. We will use them to forecast the value in the next section.
We compute the forecasts.
We report the MPE and MAPE of the model forecasts.
accuracy(fcs, test)[, c(1, 2, 6, 7)] %>%
kable(
format = "html",
table.attr = "style='width:50%;' ",
caption = "Model Forecast Results",
col.names = c("model", "type", "MPE", "MAPE"),
digits = 3,
) %>%
kable_classic_2(full_width = F)
| model | type | MPE | MAPE |
|---|---|---|---|
| fit_GT | Test | -1.783 | 8.741 |
| fit_naive | Test | 0.263 | 8.781 |
We plot the forecasts.
fcs %>%
autoplot(filter(file, year(Month) > 2016), level = NULL)
We report the MPE and MAPE of the model forecasts.
accuracy(fcs_split, test)[, c(1, 2, 6, 7)] %>%
kable(
format = "html",
table.attr = "style='width:50%;' ",
caption = "Model Forecast Results",
col.names = c("model", "type", "MPE", "MAPE"),
digits = 3
) %>%
kable_classic_2(full_width = F)
| model | type | MPE | MAPE |
|---|---|---|---|
| fit_GT_benefits | Test | -2.446 | 9.359 |
| fit_GT_file | Test | -0.441 | 8.815 |
| fit_GT_foodstamps | Test | -1.465 | 9.095 |
We plot the forecasts.
fcs_split %>%
autoplot(filter(file, year(Month) > 2016), level = NULL)
Breaking down the combined Google Trends model into its components of the three query terms reveals a different behavior exhibited by the three terms. Visually, the term “Food Stamps” underpredicted the spike in unemployment in 2020, while “Unemployment Benefits” seems to have overestimated Unemployment after the spike.
Because of the different behaviors of the query terms, we will then construct separate models in the time series cross-validation section.
We run a time series cross-validation using rolling windows of 48 months.
We construct four models: A naive simple lagged model, and one model with each of the Google Trends variables. The existing statistical methodologies at our disposal do not include a singular model containing all Google Trend variables. This is a limitation of this report and further research may allow for exploration into that statistical method.
Uslide <- file %>%
slide_tsibble(.size=48, .step = 1)
slideacc <- Uslide %>%
model(
naive = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP))),
GT_stamps = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP) ) + log(food_stamps + 1)),
GT_benefits = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP)) + log(unemp_benefits + 1)),
GT_file = TSLM(log(UNEMP) ~ 1 + lag(log(UNEMP)) + log(file_for_unemp + 1))
) %>%
forecast(new_data = Uslide) %>%
accuracy(file)
slideacc[, c(1, 2, 6, 7)] %>%
kable(
format = "html",
table.attr = "style='width:50%;' ",
caption = "Model Forecast Results",
col.names = c("Model", "Type", "MPE", "MAPE"),
digits = 3
) %>%
kable_classic_2(full_width = F)
| Model | Type | MPE | MAPE |
|---|---|---|---|
| GT_benefits | Test | 0.280 | 2.947 |
| GT_file | Test | 0.248 | 2.848 |
| GT_stamps | Test | 0.315 | 2.970 |
| naive | Test | 0.220 | 2.867 |
The MAPE of the Naive model has decreased to 2.867, from the MAPE of the Naive model of 8.781. From this, we can conclude that the slide of period = 48 strengthened this model.
We have observed that Google Trends variables, when both combined and modeled individually, rival a Naive forecast in Mean Absolute Percentage Error. This is especially important considering that observations of Unemployment data often are not available to economists until months after the present, while Google Trends data is available in near- real-time. When included with lagged observations of Unemployment, these specific Google Trends terms forecasted future values of Unemployment almost as well as a Naive model. In the real world, where the “lag time” between occurence and observation of unemployment data acts as the lag executed in this report, Google Trends data are a valuable part of an Economist’s toolbox when attempting to forecast unemployment.