We compiled a list of six keywords: ‘recession’, ‘depression’, ‘slowdown’, ‘downturn’, ‘inflation’, and ‘uncertainty’, and downloaded the frequency of their search queries on Google, as measured by Google Trends.
More instructions on how to bulk download data from Google Trends can be found here: https://www.rubenvezzoli.online/bulk-download-data-google-trends-r/
#packages used throughout this section
library(gtrendsR) #gtrends() interface to Google Trends
library(tidyverse) #readr, dplyr, tidyr, purrr, ggplot2
library(reshape) #cast()
library(forecast) #auto.arima(), ggseasonplot(), autoplot()
library(gridExtra) #grid.arrange()
library(corrplot)
library(formattable)
kwlist <- readLines("list.csv") #list of keywords to download, one per line in a csv file
googleTrendsData <- function(keywords) {
  country <- c('US') #set the region
  time <- "2004-04-01 2021-04-01" #set the time window
  channel <- 'web'
  trends <- gtrends(keywords,
                    gprop = channel,
                    geo = country,
                    time = time)
  trends$interest_over_time #return the interest-over-time table
}
output <- map_dfr(.x = kwlist,
                  .f = googleTrendsData) #run googleTrendsData over every keyword in kwlist
write.csv(output, "Download.csv")
data <- read_csv('Download.csv')
data <- data %>% select(c('date','keyword','hits'))
data <- cast(data, date ~ keyword) #reshape the dataframe from long form to wide form (cast() is from the reshape package)
write.csv(data, "GGTrend.csv")
#load the FRED data and merge into one
gdp <- read_csv('GDP.csv')
consumption <- read_csv('Consumption.csv')
investment <- read_csv('Investment.csv')
#merge multiple datasets
dataset_list = list(gdp, consumption, investment)
Fred_data <- Reduce(
function(x, y, ...) merge(x, y, all = TRUE, ...),
dataset_list
)
names(Fred_data)[1] <- 'date'
write.csv(Fred_data, "Fred_data.csv")
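# The same merge can also be written with a tidyverse reduce/full join
# (a sketch only; Fred_data_alt is an illustrative alternative name, and the
# result still needs its first column renamed to 'date' as above):
Fred_data_alt <- purrr::reduce(dataset_list, dplyr::full_join)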
#merge data from GGTrend and data from FRED
data_full <- merge(Fred_data, data, by = 'date', all.x = T)
data_full <- data_full %>% dplyr::rename('GDP' = 'GDP_PCH', 'Consumption' = 'PCE_PCH', 'Investment' = 'GPDIC1_PCH')
#clean and add data of S&P500
sp500 <- read_csv('sp500.csv')
sp500 <- sp500 %>% mutate(log = 100*log(Value)) #100 x log of the index value
sp500['next_log'] <- c(sp500$log[-1], NA) #log value of the next observation
sp500 <- sp500 %>% mutate(SP_growth = next_log - log) #log-difference growth rate
sp500 <- sp500 %>% filter(grepl('Jan|Apr|Jul|Oct', Date)) #keep only the quarter-starting months
sp500 <- sp500[(nrow(sp500)-67):nrow(sp500),] #keep the last 68 quarters to match the sample period
data_full <- cbind(data_full, sp500)
drop <- c('Date','log','next_log','Value')
data_full <- data_full[, !(names(data_full) %in% drop)]
data_full <- data_full %>% dplyr::rename('SP500' = 'SP_growth')
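# As an aside (a sketch assuming sp500.csv has the same 'Date' and 'Value'
# columns used above, not a change to the analysis): the growth-rate
# construction can be written as a single dplyr pipe, which makes explicit
# that SP_growth is the log-difference to the next observation.
sp500_alt <- read_csv('sp500.csv') %>%
  mutate(log = 100 * log(Value),
         SP_growth = lead(log) - log) %>% #dplyr::lead() takes the next row's value
  filter(grepl('Jan|Apr|Jul|Oct', Date)) %>%
  slice_tail(n = 68) #keep the last 68 quarters, as above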
formattable(head(data_full,15))
| date | GDP | Consumption | Investment | depression | downturn | inflation | recession | slowdown | uncertainty | SP500 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2004-04-01 | 1.58402 | 1.31129 | 3.99449 | 100 | 0 | 88 | 8 | 61 | 76 | -2.7104839 |
| 2004-07-01 | 1.60503 | 1.61523 | 1.55136 | 64 | 9 | 63 | 5 | 33 | 77 | -1.5491471 |
| 2004-10-01 | 1.78070 | 1.95054 | 2.08644 | 89 | 16 | 83 | 7 | 47 | 74 | 4.4432319 |
| 2005-01-01 | 1.90787 | 1.15601 | 2.81132 | 77 | 7 | 75 | 4 | 26 | 31 | 1.5287992 |
| 2005-04-01 | 1.16512 | 1.74196 | -1.33062 | 89 | 7 | 83 | 6 | 27 | 71 | 1.1866789 |
| 2005-07-01 | 1.80365 | 2.05041 | 1.32168 | 60 | 6 | 62 | 3 | 39 | 56 | 0.1716739 |
| 2005-10-01 | 1.44141 | 1.08107 | 3.25078 | 84 | 15 | 88 | 6 | 23 | 69 | 3.7379835 |
| 2006-01-01 | 2.03728 | 1.62961 | 1.48467 | 82 | 4 | 65 | 5 | 40 | 59 | -0.1565313 |
| 2006-04-01 | 1.07229 | 1.31527 | -0.64480 | 81 | 8 | 82 | 4 | 71 | 49 | -0.9412923 |
| 2006-07-01 | 0.85574 | 1.33330 | -0.38581 | 52 | 4 | 60 | 4 | 64 | 52 | 2.1198878 |
| 2006-10-01 | 1.22415 | 0.79753 | -1.94005 | 67 | 0 | 68 | 6 | 35 | 69 | 1.8314465 |
| 2007-01-01 | 1.22062 | 1.50873 | -0.68803 | 63 | 15 | 64 | 3 | 24 | 31 | 1.4360651 |
| 2007-04-01 | 1.22316 | 0.99867 | 1.10295 | 73 | 9 | 77 | 6 | 27 | 53 | 3.1870386 |
| 2007-07-01 | 1.06130 | 1.13049 | -1.00936 | 49 | 21 | 61 | 4 | 28 | 42 | -4.4439806 |
| 2007-10-01 | 1.00790 | 1.27062 | -1.23041 | 67 | 14 | 77 | 10 | 33 | 55 | -5.0825097 |
Note: all macroeconomic variables and the S&P 500 are in growth rates
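For reference, the FRED series used here carry the _PCH suffix, which presumably denotes the quarter-over-quarter percentage change \[ g_t = 100\left(\frac{X_t}{X_{t-1}} - 1\right) \] while the S&P 500 growth rate is the 100-times log difference constructed above; the two are approximately equal for small changes.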
We first plot the time series of the keywords. Generally, the more ‘specific’ the word (‘recession’, ‘downturn’), the more accurately it reflects true economic events: search queries for these words spiked during periods of economic shock. In contrast, more general words (‘uncertainty’), or words with other connotations (‘depression’), seem to show some features of seasonality.
data <- data_full %>% drop_na() #drop rows with missing values
data$Consumption <- as.numeric(data$Consumption)
data_fullyear <- data[-c(1,2,3),] #drop the three 2004 quarters so the sample starts in 2005 Q1
raw_data <- read_csv('Download.csv')
#Plot graphs of all keywords
raw_data %>% ggplot(aes(x = date, y = hits, color = keyword)) + geom_line() +
facet_wrap(~keyword) + labs(x = NULL, y = NULL, title = 'Google Trend Queries by Keywords') +
theme(plot.title = element_text(hjust = 0.5))
Signs of seasonality are clearest for the keyword ‘uncertainty’, so we examine it further. In particular, ‘uncertainty’ is searched more frequently in Q2 and Q4 than in Q1 and Q3.
#Plot the Seasonality of certain Keywords
ts_uncertainty <- ts(data_fullyear$uncertainty, start = 2005, frequency = 4) #convert to time-series
ggseasonplot(window(ts_uncertainty,start = 2005)) + ggtitle('Seasonality in Google Queries of Uncertainty')
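As an additional check (not part of the original write-up), a sub-series plot from the forecast package splits the series by quarter and makes the Q2/Q4 peaks easier to see:
ggsubseriesplot(ts_uncertainty) + ggtitle('Quarterly Sub-series of Uncertainty Queries')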
We constructed our own indicator of negative sentiment by summing the search frequencies of all six keywords and then standardizing the values.
We include both specific and general keywords in our indicator because, while the specific words capture the negative sentiment caused by particular economic shocks, the more general words seem to also pick up a kind of ‘recurrent’, ‘personal’ sentiment (judging by their seasonality) that is orthogonal to the shocks. Thus, the combination of the two should represent aggregate negative sentiment.
When plotting the results, we can see that the indicator reflects real-world events quite well, while still capturing the seasonality present in normal times.
#Create an Indicator of Sentiments based on all the keywords
data_fullyear <- data_fullyear %>% mutate(Indicator = depression + downturn + inflation + recession + slowdown + uncertainty)
mean_ind <- mean(data_fullyear$Indicator)
sd_ind <-sd(data_fullyear$Indicator)
data_fullyear <- data_fullyear %>% mutate(standard_Ind = (Indicator - mean_ind)/sd_ind) #standardize the indicator (z-score)
#Plot the indicator
data_fullyear[["date"]] <- as.Date(data_fullyear[["date"]])
data_fullyear %>% ggplot(aes(x = date)) + geom_line(aes(y = standard_Ind), color = 'darkblue') +
labs(x = NULL, y = NULL, title = 'Normalized Indicator of Sentiments') +
theme(panel.background = element_rect(fill = "#DDEFF5"),panel.grid.major = element_blank(),plot.title = element_text(hjust = 0.5)) +
geom_vline(xintercept=as.numeric(data$date[c(19, 64)]), linetype=4, colour="black") +
annotate('text', x = as.Date("2006-10-01"), y = 3.2, label = 'Financial Crisis', fontface = 'bold') +
annotate('text', x = as.Date("2018-10-01"), y = 2, label = 'Covid-19', fontface = 'bold')
We first explore the correlations among the variables in the dataset. Generally, the growth rates of GDP, Consumption, and Investment are negatively correlated with the keyword frequencies and the sentiment indicator, and the more specific the word (‘recession’, ‘downturn’), the stronger the negative correlation. This aligns with our intuition, since a higher sentiment indicator should imply lower autonomous consumption and business confidence, which in turn depresses consumption, investment, and ultimately GDP. However, the growth rate of the S&P 500 is only weakly negatively correlated with the keywords and the indicator. This might be because the data here are quarterly, whereas stock prices are highly volatile even at the daily level; quarterly sentiment therefore cannot capture the sentiment embedded in stock prices, and can only mirror slower-moving macroeconomic variables.
#explore correlation between the variables
data_corr <- data_fullyear %>% select(-c('date'))
mydata.cor = cor(data_corr)
corrplot(mydata.cor)
To get a dynamic sense of the correlation, we plot the time series of the three macroeconomic variables against the indicator. Generally, when one of the two is positive the other tends to be negative, and this relationship is most marked during shocks such as the Financial Crisis or the 2020 pandemic.
ggplot(data_fullyear) + geom_bar(aes(x = date, y = standard_Ind), stat="identity", fill = 'blue') +
geom_line(aes(x = date, y = Consumption), color = 'darkblue') +
theme(panel.background = element_rect(fill = "#DDEFF5"), plot.title = element_text(hjust = 0.5)) + labs(y = NULL, x = NULL) +
ggtitle('Sentiment Indicator (bar) vs Consumption Growth Rate (line)')
ggplot(data_fullyear) + geom_bar(aes(x = date, y = standard_Ind), stat="identity", fill = '#F3A536') +
geom_line(aes(x = date, y = GDP), color = '#DAAD36') +
theme(panel.background = element_rect(fill = "#FCF3E8"), plot.title = element_text(hjust = 0.5)) + labs(y = NULL, x = NULL) +
ggtitle('Sentiment Indicator (bar) vs GDP Growth Rate (line)')
ggplot(data_fullyear) + geom_bar(aes(x = date, y = standard_Ind), stat="identity", fill = 'darkgreen') +
geom_line(aes(x = date, y = Investment), color = 'green') +
theme(panel.background = element_rect(fill = "#DFF8E0"), plot.title = element_text(hjust = 0.5)) + labs(y = NULL, x = NULL) +
ggtitle('Sentiment Indicator (bar) vs Investment Growth Rate (line)')
Formula of the ARIMA(\(p\),\(d\),\(q\)) model: \[ y'_t = c + \phi_1y'_{t-1} +...+\phi_py'_{t-p} + \theta_1\epsilon_{t-1} +...+\theta_q\epsilon_{t-q}+\epsilon_t \] where \(y'_t\) is the (\(d\)-times) differenced time series, \(\phi_i\) are the coefficients of the lagged values (the AR part), and \(\theta_i\) are the coefficients of the lagged errors (the MA part).
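As a quick, self-contained illustration of how this formula maps onto R output (a sketch, not part of the analysis itself), one can simulate an ARIMA process and recover its coefficients with forecast::Arima():
set.seed(123)
sim <- arima.sim(model = list(order = c(1, 1, 1), ar = 0.5, ma = 0.3), n = 200) #simulate an ARIMA(1,1,1) series
fit <- Arima(sim, order = c(1, 1, 1)) #the estimated ar1 and ma1 should be close to 0.5 and 0.3
summary(fit)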
We will use this model to forecast the growth rate of Consumption first. We take the data from before 2018 to train the model and use the data from 2018 onwards to test our forecasts.
The ARIMA model relies on past values to forecast, so we do not expect it to be accurate during periods of extreme exogenous shocks. We therefore split the forecasting period into two sub-periods, 2018-2019 and 2020, to take a closer look at how the model performs in a relatively stable period (2018-2019) and an unstable one (2020).
#Plot consumption in the original data
Consumption <- data_fullyear$Consumption
Consump_ts <- ts(Consumption, start = 2005, frequency = 4)
consump_actual <- autoplot(Consump_ts) + labs(x = NULL, y = NULL) + ggtitle('Actual Consumption growth rate')
#Divide the data into parts to fit into the model
data_train <- data_fullyear[1:(nrow(data_fullyear)-12),] #data before 2018
data_pre2020 <- data_fullyear[1:(nrow(data_fullyear)-4),] #data before 2020
Consump_pre2020 <- ts(data_pre2020$Consumption, start = 2005, frequency = 4) #consumption before 2020
gr1 <- autoplot(Consump_pre2020) + labs(x = NULL, y = NULL) + ggtitle('Actual Consumption growth rate 2018-2019 period')
Consump_train <- ts(data_train$Consumption, start = 2005, frequency = 4) #convert to timeseries
con_arima <- auto.arima(Consump_train, D = 0, max.Q = 0, max.P = 0) #ARIMA model fitted on consumption data before 2018, seasonal terms switched off
con_arima
## Series: Consump_train
## ARIMA(1,0,0) with non-zero mean
##
## Coefficients:
## ar1 mean
## 0.4505 0.9418
## s.e. 0.1235 0.1494
##
## sigma^2 estimated as 0.3748: log likelihood=-47.37
## AIC=100.73 AICc=101.23 BIC=106.59
The auto.arima algorithm chooses the combination of parameters with the lowest AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion). In this case, the algorithm returns an ARIMA(1,0,0) model, which is equivalent to an autoregressive AR(1) model. Since the reported ‘mean’ is the process mean rather than the intercept, the fitted model is: \[ y_t - 0.9418 = 0.4505(y_{t-1} - 0.9418) + \epsilon_t \] or, equivalently, \( y_t = 0.5175 + 0.4505y_{t-1} + \epsilon_t \).
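To make the mean-form of the equation concrete, the first point forecast can be reproduced by hand from the fitted coefficients (a small sketch using the objects created above):
cf <- con_arima$coef
phi <- cf["ar1"]
mu <- cf[grep("mean|intercept", names(cf))] #the process mean printed above
y_T <- tail(Consump_train, 1) #last observation of the training sample
unname(mu + phi*(y_T - mu)) #should match the 2018 Q1 point forecast below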
We can then use this fitted model to predict the consumption growth rates for 2018 and 2019.
forecast(con_arima, level = 95) #forecast 2018 and 2019 values
## Point Forecast Lo 95 Hi 95
## 2018 Q1 1.2760789 0.07610644 2.476051
## 2018 Q2 1.0924005 -0.22374002 2.408541
## 2018 Q3 1.0096454 -0.32884572 2.348136
## 2018 Q4 0.9723605 -0.37062210 2.315343
## 2019 Q1 0.9555620 -0.38833047 2.299455
## 2019 Q2 0.9479936 -0.39608355 2.292071
## 2019 Q3 0.9445836 -0.39953094 2.288698
## 2019 Q4 0.9430473 -0.40107487 2.287170
gr2 <- con_arima %>% forecast(h = 10) %>% autoplot() + labs(x = NULL, y = NULL)
To incorporate the effects of sentiment, we fit an ARIMAX(\(p\),\(d\),\(q\)) model, with the standardized Indicator of Negative Sentiment as the exogenous independent variable. The model is as follows: \[ y'_t = c + \phi_1y'_{t-1} +...+\phi_py'_{t-p} + \theta_1\epsilon_{t-1} +...+\theta_q\epsilon_{t-q}+\epsilon_t + \beta x'_t \] with \(\beta\) being the coefficient of the standardized Indicator of Negative Sentiment, and the other coefficients defined as in the previous ARIMA model.
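A note on estimation: when an xreg argument is supplied, forecast::auto.arima() fits this as a linear regression with ARIMA errors, \[ y_t = c + \beta x_t + \eta_t \] where the error term \(\eta_t\) follows the ARIMA process above. This is why the output below is labelled ‘Regression with ARIMA(0,0,1) errors’; the formulation is closely related to, though not identical with, the ARIMAX equation as written.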
data1819 <- data_fullyear[(nrow(data_fullyear)-11):(nrow(data_fullyear)-4),] #data from 2018 to 2019
ind_forecast <- ts(data1819$standard_Ind, start = 2018, frequency = 4) #indicator values in 2018-2019, used to forecast
ind_train <- ts(data_train$standard_Ind, start = 2005, frequency = 4) #indicator values in 2005-2017, used to fit the model
con_arimax <- auto.arima(Consump_train, xreg = ind_train) #fit the ARIMAX model on 2005-2017 data
con_arimax
## Series: Consump_train
## Regression with ARIMA(0,0,1) errors
##
## Coefficients:
## ma1 intercept xreg
## 0.3509 0.9290 -0.2729
## s.e. 0.1452 0.0986 0.0798
##
## sigma^2 estimated as 0.2966: log likelihood=-40.7
## AIC=89.41 AICc=90.26 BIC=97.21
forecast(con_arimax, xreg = ind_forecast, level = 95) #predict 2018 and 2019 values
## Point Forecast Lo 95 Hi 95
## 2018 Q1 1.2909405 0.22357817 2.358303
## 2018 Q2 0.8933795 -0.23780497 2.024564
## 2018 Q3 1.2433109 0.11212644 2.374495
## 2018 Q4 0.7889223 -0.34226211 1.920107
## 2019 Q1 0.9560538 -0.17513069 2.087238
## 2019 Q2 0.7523623 -0.37882211 1.883547
## 2019 Q3 1.1858595 0.05467502 2.317044
## 2019 Q4 0.7889223 -0.34226211 1.920107
gr3<- con_arimax %>% forecast(xreg = ind_forecast, h = 6) %>% autoplot() + labs(x = NULL, y = NULL)
grid.arrange(gr1, gr2, gr3, ncol =1) #compare actual values, ARIMA values, ARIMAX values
The ARIMAX forecasts track the actual movements of the growth rate quite closely, and the forecast values are not far off from the actual ones.
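As a rough way to quantify ‘not far off’ (a sketch using the objects created above, not part of the original write-up), forecast::accuracy() compares each model’s errors against the 2018-2019 actuals:
fc_arima <- forecast(con_arima, h = 8) #ARIMA forecasts for 2018-2019
fc_arimax <- forecast(con_arimax, xreg = ind_forecast) #ARIMAX forecasts for 2018-2019
accuracy(fc_arima, data1819$Consumption) #test-set RMSE/MAE, ARIMA
accuracy(fc_arimax, data1819$Consumption) #test-set RMSE/MAE, ARIMAX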
Here, we fit similar models to predict values for the year 2020, whose values are substantially affected by the pandemic.
The ARIMA model is based on past values, so it is unlikely to capture the full extent of the exogenous shock induced by the pandemic.
However, by comparing the results of the ARIMA and the ARIMAX model, we can see how sentiments alter the course that the ARIMA model predicts.
#predict values in Covid 2020
con_precovid <- ts(data_pre2020$Consumption, start = 2005, frequency = 4)
arima_covid <- auto.arima(con_precovid, D = 0, max.Q = 0, max.P = 0) #Arima model with no sentiments
arima_covid
## Series: con_precovid
## ARIMA(1,0,0) with non-zero mean
##
## Coefficients:
## ar1 mean
## 0.4350 0.9414
## s.e. 0.1146 0.1290
##
## sigma^2 estimated as 0.3383: log likelihood=-51.71
## AIC=109.41 AICc=109.84 BIC=115.7
forecast(arima_covid, level = 95)
## Point Forecast Lo 95 Hi 95
## 2020 Q1 0.8700877 -0.2698458 2.010021
## 2020 Q2 0.9103714 -0.3327623 2.153505
## 2020 Q3 0.9278964 -0.3338194 2.189612
## 2020 Q4 0.9355204 -0.3296814 2.200722
## 2021 Q1 0.9388372 -0.3270233 2.204698
## 2021 Q2 0.9402801 -0.3257050 2.206265
## 2021 Q3 0.9409079 -0.3251009 2.206917
## 2021 Q4 0.9411810 -0.3248323 2.207194
gr_arima_covid <- arima_covid %>% forecast(h = 4) %>% autoplot() + labs(x = NULL, y = NULL)
data_covid <- data_fullyear[(nrow(data_fullyear)-3):nrow(data_fullyear),] #the four quarters of 2020
ind_precovid <- ts(data_pre2020$standard_Ind, start = 2005, frequency = 4)
ind_covid <- data_covid$standard_Ind
arimax_covid<- auto.arima(con_precovid, xreg = ind_precovid) #Arimax model with sentiments
arimax_covid
## Series: con_precovid
## Regression with ARIMA(1,0,0) errors
##
## Coefficients:
## ar1 intercept xreg
## 0.3038 0.9363 -0.2637
## s.e. 0.1378 0.0950 0.0762
##
## sigma^2 estimated as 0.2804: log likelihood=-45.5
## AIC=99 AICc=99.73 BIC=107.38
forecast(arimax_covid, xreg = ind_covid, h = 4)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2020 Q1 1.0512470 0.3726000 1.729894 0.01334603 2.089148
## 2020 Q2 0.3746889 -0.3345846 1.083962 -0.71005129 1.459429
## 2020 Q3 1.1586669 0.4466331 1.870701 0.06970523 2.247629
## 2020 Q4 0.8713755 0.1590875 1.583663 -0.21797500 1.960726
gr_arimax_covid <- arimax_covid %>% forecast(xreg = ind_covid, h = 4) %>% autoplot() + labs(x = NULL, y = NULL)
grid.arrange(consump_actual, gr_arima_covid, gr_arimax_covid, ncol =1)
The model generated by auto.arima() is a regression with ARIMA(1,0,0) errors; written out with the coefficients from the output above: \[ y_t = 0.9363 - 0.2637x_t + \eta_t, \quad \eta_t = 0.3038\eta_{t-1} + \epsilon_t \]
Although the results from the ARIMAX model are not as extreme as those in reality, it is able to predict the trajectory of the growth rate in 2020: a relatively large drop in Q2 followed by a recovery in Q3.
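The same kind of check as before (again a sketch with the objects created above) makes the 2020 comparison concrete:
accuracy(forecast(arima_covid, h = 4), data_covid$Consumption) #ARIMA errors over 2020
accuracy(forecast(arimax_covid, xreg = ind_covid), data_covid$Consumption) #ARIMAX errors over 2020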
Now, we carry out similar steps for the other three dependent variables. Overall, the forecasting patterns of GDP and Investment are similar to those of Consumption.
data_after18 <- data_fullyear[(nrow(data_fullyear)-11):nrow(data_fullyear),] #data from 2018 to 2020 (inclusive)
#Plot gdp in the original data
Gdp <- data_fullyear$GDP
Gdp_ts <- ts(Gdp, start = 2005, frequency = 4)
Gdp_actual <- autoplot(Gdp_ts) + labs(x = NULL, y = NULL) + ggtitle('Actual GDP growth rate')
#ArimaGDP
GDP_train <- ts(data_train$GDP, start = 2005, frequency = 4)
arima_gdp <- auto.arima(GDP_train, D = 0, max.Q = 0, max.P = 0)
gr_gdp_arima <- arima_gdp %>% forecast(h = 14) %>% autoplot()
#indicators used to forecast from 2018 to 2020 (inclusive)
ind_forecast_full <- ts(data_after18$standard_Ind, start = 2018, frequency = 4)
gdp_arimax <- auto.arima(GDP_train, xreg = ind_train) #fit the arimax model 2005-2017
ARIMA(1,0,0) model for GDP: \[ y_t = 0.9278 + 0.4532y_{t-1} + \epsilon_t \]
ARIMAX(2,0,1) model for GDP: \[ y_t = 0.9084 - 0.6077y_{t-1} + 0.3695y_{t-2} - 0.4614x_t + \epsilon_t + 0.8705\epsilon_{t-1} \]
Forecast results and graphs for GDP growth rate:
forecast(gdp_arimax, xreg = ind_forecast_full, level = 95) #predict values for 2018-2020
## Point Forecast Lo 95 Hi 95
## 2018 Q1 0.7591160 -0.35387977 1.8721117
## 2018 Q2 0.9930677 -0.11992802 2.1060634
## 2018 Q3 1.0750995 -0.03789627 2.1880952
## 2018 Q4 0.8118905 -0.30110526 1.9248862
## 2019 Q1 0.7229848 -0.41273630 1.8587059
## 2019 Q2 0.6809079 -0.45481320 1.8166290
## 2019 Q3 0.9886494 -0.14707170 2.1243706
## 2019 Q4 0.7994163 -0.33630485 1.9351374
## 2020 Q1 0.9545102 -0.20053505 2.1095555
## 2020 Q2 -0.1910869 -1.34613222 0.9639584
## 2020 Q3 0.9669049 -0.18814041 2.1219502
## 2020 Q4 0.9447443 -0.21030099 2.0997896
gr_gdp_arimax<- gdp_arimax %>% forecast(xreg = ind_forecast_full, h = 6) %>% autoplot()
grid.arrange(Gdp_actual, gr_gdp_arima, gr_gdp_arimax, ncol =1)
Investment <- data_fullyear$Investment
Investment <- ts(Investment, start = 2005, frequency = 4)
Investment_actual <- autoplot(Investment) + labs(x = NULL, y = NULL) + ggtitle('Actual Investment Growth Rate')
Investment_train <- ts(data_train$Investment, start = 2005, frequency = 4)
arima_inv <- auto.arima(Investment_train, D = 0, max.Q = 0, max.P = 0)
gr_inv_arima <- arima_inv %>% forecast(h = 14) %>% autoplot()+ labs(x = NULL, y = NULL)
inv_arimax <- auto.arima(Investment_train, xreg = ind_train) #fit the arimax model 2005-2017
forecast(inv_arimax, xreg = ind_forecast_full, level = 95) #predict values for 2018-2020
## Point Forecast Lo 95 Hi 95
## 2018 Q1 -0.02722691 -5.373566 5.319112
## 2018 Q2 1.12641697 -4.408037 6.660871
## 2018 Q3 1.06144495 -4.807016 6.929906
## 2018 Q4 -0.19527036 -6.066552 5.676011
## 2019 Q1 -0.82729142 -6.767152 5.112569
## 2019 Q2 -0.43211195 -6.466886 5.602662
## 2019 Q3 0.82048326 -5.214436 6.855403
## 2019 Q4 -0.07182856 -6.130509 5.986852
## 2020 Q1 0.16019463 -5.973548 6.293937
## 2020 Q2 -2.82579249 -8.962511 3.310926
## 2020 Q3 0.78633705 -5.374971 6.947645
## 2020 Q4 0.15778522 -6.012873 6.328444
gr_inv_arimax<- inv_arimax %>% forecast(xreg = ind_forecast_full, h = 6) %>% autoplot()+ labs(x = NULL, y = NULL)
grid.arrange(Investment_actual, gr_inv_arima, gr_inv_arimax, ncol =1)
ARIMA(1,0,0) model for Investment: \[ y_t = 0.4885y_{t-1} + \epsilon_t \] With the sentiment regressor, the auto.arima() function selects a seasonal ARIMA(2,0,1)(1,0,0)[4] specification, whose ARMA part (omitting the regression term for brevity) has the form: \[ (1 - \phi_1B - \phi_2B^2)(1-\Phi_1B^4)y_t = (1+\theta_1B)\epsilon_t \] where \(B\) is the backshift operator (\(B^ny_t = y_{t-n}\)).
After expanding, rearranging, and plugging in the estimated coefficients, we have: \[ z_t = 0.4920z_{t-1} - 0.4505z_{t-2} + \epsilon_t + 0.7440\epsilon_{t-1}\] where \(z_t = y_t + 0.3565y_{t-4}\)
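The selected orders can also be read directly off the fitted object (a quick check, assuming inv_arimax from the chunk above):
arimaorder(inv_arimax) #returns p, d, q, P, D, Q and the seasonal period m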
Sp500 <- data_fullyear$SP500
Sp500 <- ts(Sp500, start = 2005, frequency = 4)
Sp500_actual <- autoplot(Sp500) + ggtitle('Actual S&P500') + labs(x = NULL, y = NULL)
Sp_train <- ts(data_train$SP500, start = 2005, frequency = 4)
arima_sp <- auto.arima(Sp_train, D = 0, max.Q = 0, max.P = 0)
gr_sp_arima <- arima_sp %>% forecast(h = 6) %>% autoplot() + labs(x = NULL, y = NULL)
ind_forecast_down <- ts(data_after18$slowdown, start = 2018, frequency = 4) #'slowdown' queries in 2018-2020, used to forecast
ind_train_down <- ts(data_train$slowdown, start = 2005, frequency = 4) #'slowdown' queries in 2005-2017, used to fit the model
sp_arimax <- auto.arima(Sp_train, xreg = ind_train_down) #for the S&P 500 we use the 'slowdown' keyword as the exogenous regressor
forecast(sp_arimax, xreg = ind_forecast_down, level = 95)
## Point Forecast Lo 95 Hi 95
## 2018 Q1 0.6461015 -6.322179 7.614382
## 2018 Q2 1.1518438 -5.816437 8.120124
## 2018 Q3 0.9350971 -6.033183 7.903378
## 2018 Q4 1.2240927 -5.744188 8.192373
## 2019 Q1 0.6461015 -6.322179 7.614382
## 2019 Q2 0.5016037 -6.466677 7.469884
## 2019 Q3 0.8628482 -6.105432 7.831129
## 2019 Q4 0.7183504 -6.249930 7.686631
## 2020 Q1 1.2240927 -5.744188 8.192373
## 2020 Q2 0.8628482 -6.105432 7.831129
## 2020 Q3 1.2963416 -5.671939 8.264622
## 2020 Q4 1.1518438 -5.816437 8.120124
gr_sp_arimax<- sp_arimax %>% forecast(xreg = ind_forecast_down, h = 6) %>% autoplot() +
ggtitle('Forecasts with the ARIMAX model') + labs(x = NULL, y = NULL)
grid.arrange(Sp500_actual, gr_sp_arima, gr_sp_arimax, ncol =1)
ARIMA(2,0,2) model for the S&P 500: \[ y_t = 0.7444y_{t-1} - 0.9133y_{t-2} + \epsilon_t - 0.8314\epsilon_{t-1} + 0.7589\epsilon_{t-2} \] ARIMAX(2,0,0) model: \[ y_t = 1.9411 + 0.1672y_{t-1} - 0.3891y_{t-2} -0.0470x_t + \epsilon_t \] It can be seen that these models fail to capture both the pattern and the extent of the index's movements, which can be attributed to the very weak correlation between the keywords and the index. To improve on the models above, more control variables should be included. Such variables can account for factors other than sentiment, and thus help forecast the extent of the movements more accurately.
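One way to act on this suggestion is to pass a matrix of regressors to auto.arima(). The sketch below simply reuses an existing column of the dataset as a stand-in for a genuine control variable (an interest rate, a volatility index, etc. merged into the data):
xreg_train <- cbind(sentiment = data_train$standard_Ind, control = data_train$inflation) #placeholder control column
xreg_test <- cbind(sentiment = data_after18$standard_Ind, control = data_after18$inflation)
sp_arimax_multi <- auto.arima(Sp_train, xreg = xreg_train) #regression with ARIMA errors and two regressors
forecast(sp_arimax_multi, xreg = xreg_test, level = 95)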