Stock market prices fluctuate in a way that is notoriously difficult to predict. Here I will use data gathered from the internet to try to predict large fluctuations in stock prices that would trigger a buy or sell action.
My hypothesis is that many fluctuations in stock prices may be predicted by analysing news reports, Twitter feeds, etc. For example, the stock price of Apple climbs at the release of a new product. Around the time of the release there are likely to be many news reports and articles posted across the web, as well as many consumers tweeting their excitement about the upcoming product.
By incorporating this information, perhaps I can better predict large fluctuations in stock prices and generate higher returns on investments.
First I will load the data and tidy it up. My goal is to have each column be a separate variable and each row a separate observation. I will load the data using the quantmod package. Data tidying will be done using dplyr and tidyr. Additional help will come from the zoo, xts, and TTR libraries, since the historical stock data comes in xts form, which is convenient for working with time-series data.
library(zoo)
library(TTR)
library(xts)
library(quantmod)
library(dplyr)
library(tidyr)
Here I add the stock symbols that I want. The function getSymbols creates an xts object for each stock symbol, named after that symbol.
# Choose stock company
Nasdaq100_Symbols <- c('AAPL', 'AMZN')
stocks <- getSymbols(Nasdaq100_Symbols)
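By default, getSymbols assigns each downloaded series to a variable named after its ticker and returns the ticker names, which is why the later steps look objects up with get(). The sketch below mimics that pattern using base R only. The environment `env` and the prices are made up for illustration and are not real market data.

```r
# Hypothetical stand-in for the default getSymbols() behaviour:
# create one object per ticker and keep the ticker names.
env <- new.env()
symbols <- c("AAPL", "AMZN")

for (s in symbols) {
  # Made-up prices, for illustration only
  toy <- data.frame(Close = c(10, 11, 12))
  names(toy) <- paste0(s, ".Close")
  assign(s, toy, envir = env)
}

# Look an object up by its name, as get(stocks[i]) does later on
first <- get(symbols[1], envir = env)
colnames(first)
```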
Since the data for each stock is stored separately, I will combine everything in a matrix named df and also collect the column names in a separate list, as this will be useful in understanding the data.
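# Collect the column names of every stock as we go
cols = list()
# One block of 6 columns (Open, High, Low, Close, Volume, Adjusted) per stock
df <- matrix(data = NA, ncol = 6*length(stocks), nrow = dim(get(stocks[1]))[1])
for(i in 1:length(stocks)){
  # First and last column of the block belonging to stock i
  in1 <- 6*(i-1)+1
  in2 <- 6*i
  df[,in1:in2] <- get(stocks[i])
  cols <- c(cols, colnames(get(stocks[i])))
}
colnames(df) <- cols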
Next, row names are added, a separate date column is added and labelled, and the matrix is converted to a data frame.
rownames(df)<- as.character.Date(index(get(stocks[1])))
df.dates <- as.data.frame(cbind(as.character.Date(index(get(stocks[1]))), df))
colnames(df.dates)[1] <- 'Date'
head(df.dates)
## Date AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume
## 2007-01-03 2007-01-03 86.289999 86.579999 81.899999 83.800002 309579900
## 2007-01-04 2007-01-04 84.050001 85.949998 83.820003 85.659998 211815100
## 2007-01-05 2007-01-05 85.77 86.199997 84.400002 85.049997 208685400
## 2007-01-08 2007-01-08 85.959998 86.529998 85.280003 85.47 199276700
## 2007-01-09 2007-01-09 86.450003 92.979999 85.15 92.570003 837324600
## 2007-01-10 2007-01-10 94.749999 97.800002 93.450003 96.999997 738220000
## AAPL.Adjusted AMZN.Open AMZN.High AMZN.Low AMZN.Close
## 2007-01-03 11.01952 38.68 39.060001 38.049999 38.700001
## 2007-01-04 11.264106 38.59 39.139999 38.259998 38.900002
## 2007-01-05 11.183892 38.720001 38.790001 37.599998 38.369999
## 2007-01-08 11.239121 38.220001 38.310001 37.169998 37.5
## 2007-01-09 12.172756 37.599998 38.060001 37.34 37.779999
## 2007-01-10 12.755291 37.490002 37.700001 37.07 37.150002
## AMZN.Volume AMZN.Adjusted
## 2007-01-03 12405100 38.700001
## 2007-01-04 6318400 38.900002
## 2007-01-05 6619700 38.369999
## 2007-01-08 6783000 37.5
## 2007-01-09 5703000 37.779999
## 2007-01-10 6527500 37.150002
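In the step above, the dates are bound to the matrix as character strings, and cbind then coerces every price to text as well, which is why later steps wrap Price in as.numeric. The sketch below shows the coercion on a small made-up matrix, along with one possible way to keep the prices numeric by building the data frame directly. The names `prices` and `dates` and all values are assumptions for illustration.

```r
# Made-up numeric matrix standing in for the combined price matrix
prices <- matrix(c(11.0, 11.3, 38.7, 38.9), nrow = 2,
                 dimnames = list(NULL, c("AAPL.Adjusted", "AMZN.Adjusted")))
dates <- as.Date(c("2007-01-03", "2007-01-04"))

# cbind() with a character vector turns the whole matrix into character
as_text <- cbind(as.character(dates), prices)
typeof(as_text)

# Building the data frame column by column keeps each type intact
as_numbers <- data.frame(Date = dates, prices)
sapply(as_numbers, class)
```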
Now we will tidy the data using dplyr and tidyr to keep only the data we are interested in. Here we will take just the adjusted values, since they account for stock splits, dividends, etc., which would otherwise skew prices.
tidy_stocks <- as.data.frame(df.dates) %>%
select(Date, contains('Adjusted')) %>%
gather(Index, Price, -Date)
head(tidy_stocks)
## Date Index Price
## 1 2007-01-03 AAPL.Adjusted 11.01952
## 2 2007-01-04 AAPL.Adjusted 11.264106
## 3 2007-01-05 AAPL.Adjusted 11.183892
## 4 2007-01-08 AAPL.Adjusted 11.239121
## 5 2007-01-09 AAPL.Adjusted 12.172756
## 6 2007-01-10 AAPL.Adjusted 12.755291
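To make the reshaping step concrete, here is a small sketch on made-up numbers (the data frame `wide` is hypothetical). It uses pivot_longer, which the tidyr documentation recommends over gather for new code. Its row order differs from gather, so the result is sorted to keep each stock's rows together.

```r
library(dplyr)
library(tidyr)

# Made-up wide table: one column per stock
wide <- data.frame(
  Date = c("2007-01-03", "2007-01-04"),
  AAPL.Adjusted = c(11.0, 11.3),
  AMZN.Adjusted = c(38.7, 38.9)
)

long <- wide %>%
  pivot_longer(cols = -Date, names_to = "Index", values_to = "Price") %>%
  arrange(Index, Date)   # group each stock's rows together, oldest first

long
```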
Next we calculate the returns on prices. The simple daily return, \(R_s\), is given by \(R_s = \frac{P_j-P_{j-1}}{P_{j-1}}\), where \(P_{j}\) is the price on day \(j\) and \(P_{j-1}\) is the price on the previous day. The log return, \(R_{cc}\), is another useful measure that assumes the price change is continuously compounded. It is calculated as \(R_{cc} = \log(P_{j})-\log(P_{j-1})\).
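As a quick check of the two definitions, the sketch below applies them to the first two adjusted Apple prices from the table above and confirms the identity \(R_{cc} = \log(1 + R_s)\), which follows directly from the two formulas. The variable names are my own.

```r
# First two adjusted AAPL prices from the table above
p_prev <- 11.01952
p_curr <- 11.264106

simple_return <- (p_curr - p_prev) / p_prev
log_return <- log(p_curr) - log(p_prev)

# The two are linked by R_cc = log(1 + R_s)
all.equal(log_return, log(1 + simple_return))

round(simple_return, 6)
round(log_return, 6)
```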
We can add this to the data frame using the mutate function.
stock_returns <- tidy_stocks %>%
mutate(Simple.return = ifelse(lag(Index)==Index,(as.numeric(Price)-as.numeric(lag(Price)))/(as.numeric(lag(Price))),0)) %>%
mutate(Log.return = ifelse(lag(Index)==Index,(log(as.numeric(Price))-log(as.numeric(lag(Price)))),0))
head(stock_returns)
## Date Index Price Simple.return Log.return
## 1 2007-01-03 AAPL.Adjusted 11.01952 NA NA
## 2 2007-01-04 AAPL.Adjusted 11.264106 0.022195704 0.021952964
## 3 2007-01-05 AAPL.Adjusted 11.183892 -0.007121204 -0.007146681
## 4 2007-01-08 AAPL.Adjusted 11.239121 0.004938263 0.004926110
## 5 2007-01-09 AAPL.Adjusted 12.172756 0.083070108 0.079799701
## 6 2007-01-10 AAPL.Adjusted 12.755291 0.047855638 0.046745826
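Note that the ifelse(lag(Index)==Index, ...) guard leaves the first row of the first stock as NA but sets the first row of every later stock to 0. A possible alternative, sketched here on made-up prices, is to group by Index so that lag never crosses from one stock to the next and every stock's first return is NA. The data frame `toy` is hypothetical.

```r
library(dplyr)

toy <- data.frame(
  Date  = rep(c("2007-01-03", "2007-01-04", "2007-01-05"), times = 2),
  Index = rep(c("AAPL.Adjusted", "AMZN.Adjusted"), each = 3),
  Price = c("10", "11", "12.1", "40", "38", "38")   # stored as text, as in the post
)

toy_returns <- toy %>%
  mutate(Price = as.numeric(Price)) %>%
  group_by(Index) %>%
  mutate(
    # dplyr::lag is spelled out so stats::lag is never picked up by mistake
    Simple.return = (Price - dplyr::lag(Price)) / dplyr::lag(Price),
    Log.return    = log(Price) - log(dplyr::lag(Price))
  ) %>%
  ungroup()

toy_returns
```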
For exploratory data analysis it is useful to plot various things to get a feel for the data. Here we will just look at Apple.
Using ggplot we can plot the price as a function of time, as well as the daily log return and a histogram of the daily log return.
library(ggplot2)
aapl_returns <- stock_returns %>% filter(grepl('AAPL',Index)) %>% select(-Index)
head(aapl_returns)
## Date Price Simple.return Log.return
## 1 2007-01-03 11.01952 NA NA
## 2 2007-01-04 11.264106 0.022195704 0.021952964
## 3 2007-01-05 11.183892 -0.007121204 -0.007146681
## 4 2007-01-08 11.239121 0.004938263 0.004926110
## 5 2007-01-09 12.172756 0.083070108 0.079799701
## 6 2007-01-10 12.755291 0.047855638 0.046745826
plot1 <- ggplot(data = aapl_returns, aes(x = Date, y = as.numeric(Price), group = 1))+geom_line()+ geom_point(color="blue")
plot1 + ggtitle('Price over time')
plot2 <- ggplot(data = aapl_returns, aes(x = Date, y = Log.return, group = 1)) + geom_line()
plot2 + ggtitle('Log return over time')
plot3 <- ggplot(data = aapl_returns, aes(Log.return)) + geom_histogram(bins = 70)
plot3 + ggtitle('Histogram of the Log return')
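Because Date and Price are stored as text, the plots above need group = 1 and as.numeric(). One option, sketched below on made-up data (the data frame `toy_prices` is hypothetical), is to convert the columns first so that ggplot2 draws a continuous date axis.

```r
library(ggplot2)

toy_prices <- data.frame(
  Date  = c("2007-01-03", "2007-01-04", "2007-01-05", "2007-01-08"),
  Price = c("11.0", "11.3", "11.2", "11.25"),
  stringsAsFactors = FALSE
)

# Convert the text columns to proper types before plotting
toy_prices$Date  <- as.Date(toy_prices$Date)
toy_prices$Price <- as.numeric(toy_prices$Price)

toy_plot <- ggplot(toy_prices, aes(x = Date, y = Price)) +
  geom_line() +
  ggtitle("Price over time")

toy_plot
```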
We can see that Apple's stock price has increased steadily over the course of the dataset, which starts at the beginning of 2007. The histogram of the daily log return is centered close to zero, which agrees with intuition: on average, holding Apple stock for a single day would likely not produce much of a return.
meanDist.daily <- mean(na.omit(aapl_returns$Log.return))
meanDist.yearly <- meanDist.daily * 252
meanDist.daily
## [1] 0.0009244331
meanDist.yearly
## [1] 0.2329572
On average, the daily log return is about 0.0924%. Scaling this by the roughly 252 trading days in a year gives an annual log return of about 23.2957%.
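Log returns add up over time, so the yearly figure above is itself a log return. Converting it to a simple return uses \(e^{R_{cc}} - 1\). The sketch below does this with the daily mean reported above, treating 252 trading days per year as an approximation. The variable names are my own.

```r
mean_daily_log <- 0.0009244331   # daily mean log return reported above
trading_days <- 252              # approximate number of trading days per year

annual_log <- mean_daily_log * trading_days
annual_simple <- exp(annual_log) - 1   # equivalent simple (compounded) return

annual_log
annual_simple
```

The simple annual return comes out at roughly 26%, slightly above the log figure, as expected for a positive return.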