Here I hypothesize that fluctuations in the stock market can be predicted from relative changes in how often terms related to a stock are searched on search engines. Google Trends is useful for this, since it reports the number of searches for a term relative to the total search volume on Google. I used the googletrend library to load the data for a particular topic.
I will test this hypothesis on Apple stock, since Apple is very popular and often talked about. If there is no correlation, then we cannot reject the null hypothesis: the stock price is independent of relative search volume on Google.
This week I will use the Google Trends output as a predictor in a time series model. The time series model I will use is an autoregressive integrated moving average (ARIMA) model. This model takes \(x\) days of time series data and uses it to forecast a given number of days ahead.
The google trend output for keyword ‘Apple’ in the United States
We can see that there is a clear trend in the data. Notably there appears to be some time-dependence, with spikes roughly twice per year.
We can also plot the adjusted stock price as a function of time.
We can also plot the log daily return, given as \(r_i = \log(P_i) - \log(P_{i-1})\), where \(P_{i}\) is the price on a given day, \(i\).
And we can plot the autocorrelation function to see if there is any correlation between the daily returns.
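As a minimal sketch of the log daily return, the snippet below takes the log of each day's price and subtracts the log of the previous day's price. The `prices` vector is made-up illustrative data, not the AAPL series used in this post.

```r
# Hypothetical adjusted closing prices (illustrative only, not AAPL data).
prices <- c(100, 102, 101, 105, 104)

# Log daily return: log of each day's price minus log of the previous day's.
log_returns <- diff(log(prices))

# Log returns are additive: their sum equals the log of the total price ratio.
total_log_return <- sum(log_returns)

print(round(log_returns, 4))
```

One convenient property of log returns is that they add up over time, which is why the sum over the period equals the log of the last price divided by the first.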
In general there does not seem to be much correlation between daily returns, as there are few spikes beyond the 0.05 significance line. This indicates that the daily returns are akin to random fluctuations. For comparison, we can plot the autocorrelation for uniform random numbers, which should show no correlation.
We can see that the random numbers show larger significant spikes than the daily returns. This tells us that it should be quite difficult to forecast the stock market return by looking at previous returns. At a lag of 2 days there is a negative spike of around 0.05 in magnitude, and at a lag of 4 days a positive spike above 0.05, which tells us that there is a slightly significant negative correlation after 2 days and a significant positive correlation after 4 days in the stock market return.
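The random-number comparison can be sketched with base R as follows. The seed, sample size and number of lags are arbitrary choices of mine, and the bound is the usual approximate 95% band that R draws as dashed lines on an ACF plot.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
n <- 500                          # hypothetical sample size
random_series <- runif(n)         # uniform random numbers, no serial structure

# Compute the sample autocorrelation without drawing the plot.
acf_random <- acf(random_series, lag.max = 20, plot = FALSE)

# Approximate 95% bound for the autocorrelation of an uncorrelated series.
bound <- qnorm(0.975) / sqrt(n)

# Lags (excluding lag 0) whose autocorrelation falls outside the bound.
lags <- acf_random$lag[-1]
outside <- lags[abs(acf_random$acf[-1]) > bound]
```

Even for uncorrelated data, roughly one lag in twenty is expected to cross the bound by chance, so an occasional spike on its own is weak evidence of structure.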
The trend is by no means clear though, which may lend credibility to the idea of looking at google trends as an indicator for stock market returns.
We will attempt to predict the stock market price using the ARIMA model, in which the stock market return is a weighted linear sum of the \(n\) last daily stock returns. The ARIMA model has \(p\) autoregressive terms, \(d\) differencing operations, and \(q\) moving average terms. To select the best number of parameters we will run through all combinations in parameter space and pick the best one.
We combine the ARIMA model with the generalized autoregressive conditional heteroskedasticity (GARCH) model, which models volatility and how it changes over time. Typically the volatility varies less over time than the daily return, so we use larger windows, such as 100 days, to predict it.
We run the model, and if it predicts that the return will be positive, a \(1\) is output, which represents a long position, or buy. If the predicted return is negative, a \(-1\) is output, representing a short position, or sell.
Finally the equity curve is produced, which displays the relative change in value of the asset over time. If the equity curve remained at zero, there would be no change; if the equity curve climbed to 1, the value of the asset would have doubled.
We can see that the ARIMA-GARCH models do not perform very well. This may be because there is not much autocorrelation in the daily returns of Apple stock; when there is serial correlation this model typically performs very well. Including the Google Trends data in the model did outperform the buy and hold strategy, but only slightly.
Including the Google Trends data also made the model run much faster. Without the Google Trends data the model took 4838 seconds, which is roughly 1 hour 20 minutes, whereas with the Google Trends data it took only 23 minutes. This may be because the threshold for determining whether to buy or sell is achieved faster with the Google Trends information than without.
We have shown that including Google Trends data can influence the performance of an ARIMA-GARCH forecasting model used to predict the daily return on Apple stock. Including this information can generate larger equity and gives a faster run time than the equivalent ARIMA-GARCH model without Google Trends data.
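A parameter search of this kind might look like the sketch below, which uses base R's `arima()`. Several things here are my assumptions rather than the model used in this post: the series is simulated, the selection criterion is the AIC, and only \(p\) and \(q\) are searched while \(d\) is held at 0, since returns are already a differenced series and AIC values are not comparable across different \(d\).

```r
set.seed(42)                      # arbitrary seed, for reproducibility only
# Simulated AR(1) series standing in for daily log returns (hypothetical data).
returns <- as.numeric(arima.sim(model = list(ar = 0.5), n = 300))

best_aic <- Inf
best_order <- c(0, 0, 0)

# Try every (p, q) pair on a small grid and keep the fit with the lowest AIC.
for (p in 0:2) {
  for (q in 0:2) {
    fit <- tryCatch(arima(returns, order = c(p, 0, q)),
                    error = function(e) NULL)  # skip orders that fail to fit
    if (!is.null(fit) && fit$aic < best_aic) {
      best_aic <- fit$aic
      best_order <- c(p, 0, q)
    }
  }
}

# Refit with the selected order and forecast one step ahead.
best_fit <- arima(returns, order = best_order)
next_return <- as.numeric(predict(best_fit, n.ahead = 1)$pred)
```

The `tryCatch` wrapper matters in practice because some orders fail to converge on some windows, and a single failure should not stop the whole search.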
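As a rough sketch of how such an equity curve can be built, the snippet below multiplies each day's return by the position held that day and compounds the result. The returns and signals are made-up numbers, and I assume each signal was generated before that day's return was realized.

```r
# Hypothetical daily simple returns and model signals (1 = long, -1 = short).
daily_returns <- c(0.01, -0.02, 0.015, 0.005, -0.01)
signals       <- c(1, -1, 1, 1, -1)

# Return earned each day by following the signal.
strategy_returns <- signals * daily_returns

# Relative change in value over time: 0 means no change, 1 means doubled.
equity_curve   <- cumprod(1 + strategy_returns) - 1
buy_hold_curve <- cumprod(1 + daily_returns) - 1
```

In this toy example every signal happens to be right, so the strategy curve sits above the buy and hold curve; with real predictions the two curves will cross.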
Now that we have an ARIMA-GARCH model that works well with the Google Trends data and gives us reliable returns, the goals of this week are to add a hold condition to the model, email the results to myself, and schedule the script to run every day.
The ARIMA-GARCH model predicts the expected return of the next day given a certain number of previous days. To remind ourselves, if the expected return is positive the price of the asset is expected to increase, and if the expected return is negative the price is expected to decrease.
In the model developed last week, a positive expected return triggered a buy/long condition and a negative one triggered a sell/short condition; there were no hold conditions. This week I will add a threshold on the condition, such that we buy only if the expected return is greater than a given value and sell only if it is below a certain value. This can remove some of the uncertainty from the model, but may miss out on some profits. Overall it should add a degree of conservatism to the model.
If we plot the marginal change in stock price we can see how the two versions of the model compare. In one there is no threshold, which is equivalent to the model developed last week, and in the other there is a threshold of 0.1% daily return. We again compare to a buy and hold strategy.
With a 0.1% threshold on the daily expected return, the two forms of the ARIMA-GARCH model perform differently. Both do worse at the beginning of May; after that, however, the less conservative model outperforms the conservative one. Where the expected returns predicted by the model are below 0.1%, a hold action is executed, and this misses out on some of the profits gained by the less conservative model.
Overall the less conservative model outperformed the model with the 0.1% threshold on the expected return. The threshold can be configured depending on the risk tolerance of the user.
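The thresholded buy, sell or hold rule can be sketched as a small function. The function name, the example returns and the convention that 0 means hold are my own illustrative choices, and the threshold is expressed as a fraction (0.001 is 0.1%).

```r
# Map an expected daily return to an action: 1 = buy, -1 = sell, 0 = hold.
# `threshold` is in the same units as `expected_return` (0.001 = 0.1%).
threshold_signal <- function(expected_return, threshold = 0) {
  ifelse(expected_return > threshold, 1,
         ifelse(expected_return < -threshold, -1, 0))
}

# Hypothetical expected returns for four days.
expected <- c(0.0025, 0.0004, -0.0003, -0.0018)

no_threshold <- threshold_signal(expected)          # always buy or sell
conservative <- threshold_signal(expected, 0.001)   # hold on weak predictions
```

With the 0.1% threshold, the two middle days become holds because the predicted move is too small to act on, which is where the conservative model can miss profits that the unthresholded model captures.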
Emailing the results to myself turned out to be a little harder than I anticipated. The process involved setting up the Gmail API, which can send emails when connected to the R package gmailr, and having the R script read the API client ID and secret. The complete setup process can be found here.
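library(gmailr)  # provides mime() and send_message()

# Build the daily email. ind[1] is the value reported as the prediction and
# bsh is the conservative buy/sell rate in percent; both are set earlier in
# the script.
complete_email <- mime(
  To = "moocarme@gmail.com",
  From = "moocarme@gmail.com",
  Subject = paste(toString(Sys.Date()), "Stock prediction finished"),
  body = paste0(
    'The trading prediction has finished for ', toString(Sys.Date()), '. \n',
    'The prediction for AAPL was ', toString(ind[1]), '. \n',
    'You should ', toString(ifelse(ind[1] > 0, 'buy', 'sell')), '. \n',
    'A more conservative model at a daily buy/sell rate of ',
    toString(bsh), ' percent would suggest you ',
    ifelse(ind[1] > bsh, 'buy.',
           ifelse(ind[1] < (-bsh), 'sell.', 'hold.'))))

# Send the message through the authenticated Gmail API connection.
send_message(complete_email)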
A screenshot of the email output from the above code is shown below. The dates, daily returns and actions are generalized so the same email body template can be used every day.
Email screenshot
Next, the script is scheduled so that the file will run at the end of each day. Some modifications to the script were necessary, such as generalizing the dates, so that the stock prices and Google Trends data will be the most recent.
The script was scheduled on a Linux machine using crontab. The following crontab entry was used to run the script.
0 19 * * 1-5 /usr/bin/R --vanilla --quiet < /home/Documents/DataScienceBootcamp/Project/project_23Jun16.R
The first two entries are the minute and the hour in 24-hour format, so the script is scheduled to run at 7pm. The third and fourth entries are the day of the month and the month on which the script runs; an asterisk matches every value. The fifth entry is the day of the week, where ‘1-5’ means Monday through Friday. The last part of the entry tells cron to run the file at the given location with the program R.
The program is set to run at 7pm when the trading markets have closed.
This week has been focused on adjustments to the model and on building a complete data prediction product that needs little supervision. As it stands the model runs, has good predictive results (makes money), and the only action the user needs to take is provided in the email: whether to buy, sell or hold the asset, depending on their risk tolerance. Only two risk tolerances are provided here, but this could easily be expanded.
My next goal is to automate the process further by putting the script onto a Raspberry Pi, a small, cheap computer with 1 GB of RAM that runs on only 5 W, so it won’t cost a lot to keep running. As long as the script takes under 24 hours to run there should be no problem. This could become an issue with many stocks and many predictive features, which will be another problem for the future.
This process could also be set up with some trading software, such as E*TRADE or Scottrade, so that the process is completely automated.