The financial markets contain a plethora of statistical patterns, and their behavior resembles that of patterns in natural phenomena: both are driven by unknown and unstable variables, which leads to high unpredictability and volatility and makes it almost impossible to forecast future behavior. Burton Malkiel argues as much in his 1973 book, "A Random Walk Down Wall Street."
Nevertheless, one forecasting methodology is to use the past performance of markets as a predictor of the future. This can be achieved by observing the changes over small seasonal intervals, when the time series is stationary.
The purpose of this project is to analyze three different algorithms for forecasting the Daimler share price. Disclaimer: I work in Advanced Analytics at Daimler AG. This analysis is for educational purposes and is not financial advice.
We will use Daimler historical share market datasets (from 2010 onward) to forecast future values. In this kind of forecasting we assume that some patterns in our datasets carry over into short future intervals; the same approach is applied in weather forecasting.
We will apply mathematical technical indicators to our datasets in the following domains:
-Support & resistance
-Trend
-Momentum
-Volume
-Volatility
Some of these indicators are:
-Moving average convergence/divergence
-Relative strength index
-Stochastic oscillator
-Ease of movement
-Larry Williams oscillator, etc. (a computation sketch follows this list)
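As a minimal sketch of how these indicators can be computed, the TTR package (loaded below) provides ready-made implementations; the code assumes the DDAIF xts price object that is downloaded later in this analysis:
require(TTR)
require(quantmod)
ddaif_macd <- MACD(Cl(DDAIF)) # Moving average convergence/divergence
ddaif_rsi <- RSI(Cl(DDAIF)) # Relative strength index
ddaif_stoch <- stoch(HLC(DDAIF)) # Stochastic oscillator
ddaif_emv <- EMV(cbind(Hi(DDAIF), Lo(DDAIF)), Vo(DDAIF)) # Ease of movement
ddaif_wpr <- WPR(HLC(DDAIF)) # Larry Williams %R oscillator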
The three algorithms that we will compare to predict the behavior of the Daimler share are:
-LASSO (Least Absolute Shrinkage and Selection Operator), a method based on a linear regression model, proposed as a novel way to predict financial market behavior
-Deep Learning (a neural network of a linear stack of densely connected layers)
-eXtreme Gradient Boosting (XGBoost)
Appreciation to the colleagues from:
Cornell University, arXiv:1512.04916v3 [q-fin.CP] (Ruoxuan Xiong, 2016)
Cornell University, Social and Information Networks (cs.SI); Computational Finance (q-fin.CP) (Jichang Zhao, 2019)
This article is intended for academic and educational purposes and is not an investment recommendation. The information we provide is not a substitute for advice from an investment professional. The models discussed in this paper do not reflect actual investment performance. A decision to invest in any product or strategy should not be based on the information or conclusions contained herein. This is neither an offer to sell or buy nor a solicitation of an offer to buy interests in securities.
Load required libraries
library(dplyr)
library(ggthemes)
options(kableExtra.latex.load_packages = FALSE)
library(kableExtra)
library(dynlm)
library(ggplot2)
library(tidyr)
library(tseries)
library(TSA)
library(forecast)
library(gridExtra)
library(corrplot)
library(data.table)
library(quantmod)
library(plotly)
library(GGally)
library(caret)
library(xgboost)
library(lubridate)
library(DT)
library(xts)
library(e1071)
library(doParallel)
library(keras)
library(tensorflow)
library(knitr)
load("daimler_stock_env.RData")
Download the Daimler share prices since 2010
We will use the Daimler AG ticker symbol (DDAIF) as our dataset.
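The download step can be sketched as follows, assuming Yahoo Finance as the data source (consistent with the xts "src" attribute printed below):
require(quantmod)
DDAIF <- getSymbols("DDAIF", from = "2010-01-01", auto.assign = FALSE) # daily OHLCV data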
Display of the first six entries of our dataset
A per-column missing-value check shows that the raw OHLCV columns (DDAIF.Open, DDAIF.High, DDAIF.Low, DDAIF.Close, DDAIF.Volume, DDAIF.Adjusted) and the same-day derived columns (Range, cash_tradet, perc_range_previous, perc_range_atpr, perc_range_williams) contain no NAs (FALSE), while all lag- and window-based columns (Avg_volume_*, Volume_perc_avg_60, perc_change_closing, change_from_yest, moving_avg_*, perc_moving_avg_*, avg_cash_trated_*, Avg_Dollar_volume_pct_*, nightgap, night_gap_perc, one_month_range_perc, EMA/WMA/EVWMA/ZLEMA/VWAP/HMA/ALMA) contain NAs (TRUE) during their warm-up windows.
The first row of the dataset (2010-05-03) shows DDAIF.Open 50.90, DDAIF.High 51.62, DDAIF.Low 50.68, DDAIF.Close 51.26, DDAIF.Volume 753100, DDAIF.Adjusted 37.62413, Range 0.94, cash_tradet 38603904, perc_range_previous 38.30, perc_range_atpr 1.83 and perc_range_williams 38.30; all lag- and window-based indicators are still NA at this point. (The remaining 5 rows of the head() output were omitted by getOption("max.print").)
An 'xts' object on 2010-05-03/2019-04-30 containing:
Data: num [1:2264, 1:40] 50.9 48.9 47.3 47.1 46.2 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:40] "DDAIF.Open" "DDAIF.High" "DDAIF.Low" "DDAIF.Close" ...
Indexed by objects of class: [Date] TZ: UTC
xts Attributes:
List of 2
$ src : chr "yahoo"
$ updated: POSIXct[1:1], format: "2019-05-15 18:38:05"
Calculate log returns
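The calculation itself is not shown in the chunk output; a minimal sketch (the object name is an assumption):
DDAIF_log_returns <- diff(log(DDAIF$DDAIF.Adjusted)) # daily log returns of the adjusted close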
Chart series technical analysis graph
Bollinger Bands, and Moving Average Convergence Divergence graph
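A sketch of how such a chart can be produced with quantmod (the subset window and theme are assumptions):
require(quantmod)
chartSeries(DDAIF, type = "candlesticks", subset = "2018::2019",
    TA = "addBBands();addMACD();addVo()", theme = chartTheme("black"))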
Interactive graph of the Opening Prices since 2010, with six-month axis breaks
require(ggthemes)
require(ggplot2)
require(plotly)
p1 <- ggplot(DDAIF, aes(x = index(DDAIF), y = DDAIF[, 1])) + geom_line(color = "yellow") +
ggtitle("Daimler stock Opening Prices") + xlab("Date") + ylab("Opening Prices") +
scale_x_date(date_labels = "%b %y", date_breaks = "6 months") + theme_solarized(light = FALSE) +
theme(axis.text = element_text(size = 10, angle = 90), plot.title = element_text(size = 11,
color = "yellow", hjust = 0.5))
ggplotly(p1)
Interactive graph of the Adjusted Closing Prices
p2 <- ggplot(DDAIF, aes(x = index(DDAIF), y = DDAIF[, 6])) + geom_line(color = "yellow") +
ggtitle("Daimler Adjusted Closing Prices") + xlab("Date") + ylab("Adjusted Closing Prices") +
scale_x_date(date_labels = "%b %y", date_breaks = "6 months") + theme_solarized(light = FALSE) +
theme(axis.text = element_text(size = 10, angle = 90), plot.title = element_text(size = 11,
color = "yellow", hjust = 0.5))
ggplotly(p2)
Histograms and density plots of the Open, High, Low and Close prices
require(gridExtra)
options(repr.plot.width = 10, repr.plot.height = 10)
popen = ggplot(DDAIF, aes(DDAIF.Open)) + geom_histogram(bins = 50, aes(y = ..density..),
col = "yellow", fill = "yellow", alpha = 0.2) + geom_density()
phigh = ggplot(DDAIF, aes(DDAIF.High)) + geom_histogram(bins = 50, aes(y = ..density..),
col = "blue", fill = "red", alpha = 0.2) + geom_density()
plow = ggplot(DDAIF, aes(DDAIF.Low)) + geom_histogram(bins = 50, aes(y = ..density..),
col = "red", fill = "red", alpha = 0.2) + geom_density()
pclose = ggplot(DDAIF, aes(DDAIF.Close)) + geom_histogram(bins = 50, aes(y = ..density..),
col = "black", fill = "red", alpha = 0.2) + geom_density()
grid.arrange(popen, phigh, plow, pclose, nrow = 2, ncol = 2)
We add new indicators to our datasets to improve prediction, such as:
-Weighted Moving Average (WMA)
-Double Exponential Moving Average (DEMA), a measure of a security's trending average
-Elastic Volume Weighted Moving Average (EVWMA), which uses the volume to define the period of the moving average
-Zero Lag Exponential Moving Average (ZLEMA), which, as is the case with the DEMA, aims to reduce the lag of a traditional EMA
-Volume Weighted Average Price (VWAP) and moving volume weighted average price
-Hull Moving Average (HMA), developed by Alan Hull, an extremely fast and smooth moving average
-Arnaud Legoux Moving Average (ALMA), which uses the curve of the Normal (Gauss) distribution for its weights
# We create a response variable: to predict a future day's price, we apply
# the lag function to the price change.
# We calculate various moving averages (MA) of the volume series, for the
# past 10, 20 and 60 days
require(TTR)
DDAIF$Avg_volume_10 <- SMA(DDAIF$DDAIF.Volume, n = 10)
DDAIF$Avg_volume_20 <- SMA(DDAIF$DDAIF.Volume, n = 20)
DDAIF$Avg_volume_60 <- SMA(DDAIF$DDAIF.Volume, n = 60)
# We calculate today's volume as a % of the average volume over the above periods
DDAIF$Volume_perc_avg_10 <- (DDAIF$DDAIF.Volume/DDAIF$Avg_volume_10) * 100
DDAIF$Volume_perc_avg_20 <- (DDAIF$DDAIF.Volume/DDAIF$Avg_volume_20) * 100
DDAIF$Volume_perc_avg_60 <- (DDAIF$DDAIF.Volume/DDAIF$Avg_volume_60) * 100
# We calculate the range between high and low
DDAIF$Range <- DDAIF$DDAIF.High - DDAIF$DDAIF.Low
# % change of closing price.
DDAIF$perc_change_closing <- (DDAIF$DDAIF.Close - lag(DDAIF$DDAIF.Close))/lag(DDAIF$DDAIF.Close) *
100
# Difference between the prior day's closing price and today's closing price
DDAIF$change_from_yest <- DDAIF$DDAIF.Close - lag(DDAIF$DDAIF.Close)
# We calculate the same moving averages (MA), now for the daily range, over
# the past 10, 20 and 60 days
DDAIF$moving_avg_10 <- SMA(DDAIF$Range, n = 10)
DDAIF$moving_avg_20 <- SMA(DDAIF$Range, n = 20)
DDAIF$moving_avg_60 <- SMA(DDAIF$Range, n = 60)
# Today's range as a % of the average range over the above periods
DDAIF$perc_moving_avg_10 <- (DDAIF$Range/DDAIF$moving_avg_10) * 100
DDAIF$perc_moving_avg_20 <- (DDAIF$Range/DDAIF$moving_avg_20) * 100
DDAIF$perc_moving_avg_60 <- (DDAIF$Range/DDAIF$moving_avg_60) * 100
# The total cash traded (in dollars): closing price multiplied by volume
DDAIF$cash_tradet <- DDAIF$DDAIF.Close * DDAIF$DDAIF.Volume
# The average cash traded over the same periods as above
DDAIF$avg_cash_trated_10 <- SMA(DDAIF$cash_tradet, n = 10)
DDAIF$avg_cash_trated_20 <- SMA(DDAIF$cash_tradet, n = 20)
DDAIF$avg_cash_trated_60 <- SMA(DDAIF$cash_tradet, n = 60)
# Today's dollar volume as a % of the average dollar volume
DDAIF$Avg_Dollar_volume_pct_10 <- (DDAIF$cash_tradet/DDAIF$avg_cash_trated_10) *
100
DDAIF$Avg_Dollar_volume_pct_20 <- (DDAIF$cash_tradet/DDAIF$avg_cash_trated_20) *
100
DDAIF$Avg_Dollar_volume_pct_60 <- (DDAIF$cash_tradet/DDAIF$avg_cash_trated_60) *
100
# Overnight gap: today's open vs. yesterday's close
require(data.table)
require(dplyr)
DDAIF$nightgap <- DDAIF$DDAIF.Open - lag(DDAIF$DDAIF.Close)
# The overnight gap as a % gain or loss relative to yesterday's closing price
DDAIF$night_gap_perc <- (DDAIF$DDAIF.Open - lag(DDAIF$DDAIF.Close))/lag(DDAIF$DDAIF.Close) *
100
# Absolute size of the open-to-close move as a % of the day's range
DDAIF$perc_range_previous = abs((DDAIF$DDAIF.Close - DDAIF$DDAIF.Open)/(DDAIF$DDAIF.High -
    DDAIF$DDAIF.Low) * 100)
# The daily range as a % of the closing price
DDAIF$perc_range_atpr = (DDAIF$Range/DDAIF$DDAIF.Close) * 100
# Williams %R-style measure: distance of the close from the high as a % of the range
DDAIF$perc_range_williams = (DDAIF$DDAIF.High - DDAIF$DDAIF.Close)/(DDAIF$DDAIF.High -
    DDAIF$DDAIF.Low) * 100
# Compute the one-month (20 trading days) range: rolling max of the high
# minus rolling min of the low
require(zoo)
one_month_range_perc <- rollapply(DDAIF$DDAIF.High, 20, max) - rollapply(DDAIF$DDAIF.Low,
    20, min)
# Position of the close within the one-month range, in %
DDAIF$one_month_range_perc = (DDAIF$DDAIF.Close - DDAIF$DDAIF.Low)/one_month_range_perc *
    100
gc() #clean RAM
# Moving averages smooth the price data to form a trend-following indicator.
# They do not predict direction, but define the current direction with a lag.
require(TTR)
# Exponential Moving Averages of the low price over 10, 20 and 60 days
DDAIF$EMA10 <- EMA(DDAIF$DDAIF.Low, n = 10)
DDAIF$EMA20 <- EMA(DDAIF$DDAIF.Low, n = 20)
DDAIF$EMA60 <- EMA(DDAIF$DDAIF.Low, n = 60)
# Weighted Moving Average
DDAIF$WMA10 <- WMA(DDAIF$DDAIF.Low, n = 10)
# The EVWMA uses the volume to define the period of the MA
DDAIF$EVWMA10 <- EVWMA(DDAIF$DDAIF.Low, DDAIF$DDAIF.Volume)
# Zero Lag Exponential Moving Average (ZLEMA): as is the case with the DEMA,
# it aims to reduce the lag of a traditional EMA
DDAIF$ZLEMA10 <- ZLEMA(DDAIF$DDAIF.Low, n = 10)
# Volume weighted average price (VWAP)
DDAIF$VWAP10 <- VWAP(DDAIF$DDAIF.Low, DDAIF$DDAIF.Volume)
# The Hull Moving Average (HMA), developed by Alan Hull, is an extremely
# fast and smooth moving average
DDAIF$HMA10 <- HMA(DDAIF$DDAIF.Low, n = 20)
# The ALMA uses the curve of the Normal (Gauss) distribution for its weights
DDAIF$ALMA10 <- ALMA(DDAIF$DDAIF.Low, n = 9, offset = 0.85, sigma = 6)
# DDAIF <- DDAIF[complete.cases(DDAIF), ]
write.csv(DDAIF, file = "DDAIF_with_TI.csv", row.names = F)
Augmented Dickey-Fuller test and correlation test
# Augmented Dickey-Fuller test and correlation test
require(tseries)
# From the tseries package we use adf.test(), which computes the Augmented
# Dickey-Fuller test for the null hypothesis that x has a unit root
adf.test(DDAIF$DDAIF.Adjusted)
# The result shows that we need to transform the series to achieve stationarity
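# A minimal sketch of one remedy (an assumption; this step is not shown in
# the output): first-difference the adjusted close and re-run the ADF test
adf.test(na.omit(diff(DDAIF$DDAIF.Adjusted)))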
require(forecast)
require(xts)
require(e1071)
require(doParallel)
require(dynlm)
require(caret)
DDAIF_lm <- na.omit(DDAIF) #we handle missing values
set.seed(123) #seed for reproducibility
X <- DDAIF_lm[, -6]
y <- DDAIF_lm[, 6]
# We scale the variables in order to run the models
X.scaled <- scale(X)
gc() #clean RAM
# We merge them back
DDAIF_lm <- cbind(X.scaled, y)
# create index
numerical_Vars <- which(sapply(DDAIF_lm, is.numeric))
# save the vector
numerical_VarNames <- names(numerical_Vars)
cat("They exist", length(numerical_Vars), "numerical variables.\n")
sum_numVar <- DDAIF_lm[, numerical_Vars]
We calculate the correlations of all numerical variables.
We display the correlations of our features, in order to choose the ones that are not highly correlated with each other.
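The display step can be sketched with the corrplot package loaded above (the plot options are assumptions):
require(corrplot)
cor_matrix <- cor(sum_numVar, use = "pairwise.complete.obs")
corrplot(cor_matrix, method = "square", type = "lower", tl.cex = 0.6)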
We remove the highly correlated variables to avoid overfitting of models
# We remove the highly correlated variables to avoid overfitting of models
del <- cor(DDAIF_lm)
del[upper.tri(del)] <- 0
diag(del) <- 0
DDAIF_lm <- DDAIF_lm[, !apply(del, 2, function(x) any(x > 0.9))]
#------------------------------------------
# We create our Train and Test Datasets
#------------------------------------------
# For a next-day forecast, n = days_to_forecast + 1. If you want to forecast
# more days, change days_to_forecast below.
days_to_forecast = 7
n = days_to_forecast + 1
X_train = DDAIF_lm[1:(nrow(DDAIF_lm) - (n - 1)), -17]
# Our dependent variable is the adjusted price
y_train = DDAIF_lm[n:nrow(DDAIF_lm), 17]
X_test = DDAIF_lm[((nrow(DDAIF_lm) - (n - 2)):nrow(DDAIF_lm)), -17]
require(quantmod)
# We create the validation set of the real prices of the next 7 days. Adapt
# the dates according to your n days of forecast
DDAIF2 = getSymbols("DDAIF", from = "2019-04-22", to = "2019-05-01", auto.assign = FALSE)
ourdate <- time(DDAIF2)
y_test <- as.numeric(DDAIF2$DDAIF.Adjusted)
train <- cbind(X_train, y_train)
# check the number of features
dim(X_train)
dim(X_test)
KERAS DEEP LEARNING: backend TensorFlow
We will apply a deep learning network of a linear stack of densely connected layers. If you have already installed Keras and TensorFlow, skip the commands below.
devtools::install_github("rstudio/keras")
devtools::install_github("rstudio/tensorflow")
install_tensorflow()
require(keras)
require(tensorflow)
keras_model <- keras_model_sequential()
keras_model %>%
#We ddd a densely-connected NN layer to an output
#ReLU (Rectified Linear Unit) Activation Function
layer_dense(units = 60, activation = 'relu', input_shape = ker) %>%
layer_dropout(rate = 0.2) %>% #We apply dropout to prevent overfitting
layer_dense(units = 50, activation = 'relu') %>%
layer_dropout(rate = 0.2) %>%
layer_dense(units = 1, activation = 'linear')
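Before fitting, a Keras model must be compiled; this step is not shown in the output, so the loss, optimizer and metric below are assumptions:
keras_model %>% compile(
    loss = 'mean_squared_error', # standard regression loss (assumed)
    optimizer = optimizer_adam(),
    metrics = 'mean_absolute_error'
)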
We train the NN model
keras_history <- keras_model %>% fit(X_train, y_train, epochs = 200, batch_size = 28,
validation_split = 0.1, callbacks = callback_tensorboard("logs/run_a"))
Plot of Keras Model History
Interactive plot of Keras Model Predictions vs Actuals
#------------------------------------------
# Step 3: We plot the predictions
#------------------------------------------
require(ggplot2)
require(ggthemes)
require(plotly)
real_VS_pred <- data.frame(keras_pred, y_test)
colnames(real_VS_pred) <- c("KERAS PRED", "REAL PRICES")
p4 <- ggplot(real_VS_pred, aes(date)) + geom_line(aes(y = keras_pred, colour = "keras_pred")) +
geom_line(aes(y = y_test, colour = "real_prices")) + geom_point(aes(y = keras_pred,
colour = "keras_pred"), size = 2) + geom_point(aes(y = y_test, colour = "real_prices"),
size = 2) + labs(title = "Keras (Predicted vs Actual)", x = "Date", y = "Daimler Share Price in $") +
theme_solarized(light = FALSE)
ggplotly(p4)
# We display Pred vs Actual for the Keras model
require(kableExtra)
kable(real_VS_pred) %>% kable_styling(bootstrap_options = "bordered", full_width = F,
position = "center") %>% column_spec(1, bold = T, color = "red")
| KERAS PRED | REAL PRICES |
|---|---|
| 61.47425 | 66.77 |
| 62.02896 | 66.31 |
| 60.32361 | 65.07 |
| 62.72711 | 64.31 |
| 61.83586 | 65.03 |
| 59.63830 | 65.13 |
| 60.34718 | 65.74 |
With the caret package we will apply cross-validation in order to find the optimal hyperparameters
require(caret)
train$DDAIF.Adjusted <- as.numeric(train$DDAIF.Adjusted)
set.seed(123) #seed for reproducibility
trainControl <- trainControl(method = "cv", number = 5)
lassoGrid <- expand.grid(alpha = 1, lambda = seq(0.001, 0.1, by = 5e-04))
lassomod <- train(DDAIF.Adjusted ~ ., data = na.omit(train), method = "glmnet",
trControl = trainControl, tuneGrid = lassoGrid)
# we display the optimal alpha and lambda penalties
lassomod$bestTune
# We display the root mean squared error
min(lassomod$results$RMSE)
# From the caret package we use varImp(), a generic method for calculating
# variable importance for objects produced by train() and method-specific
# models
lasso_VarImp <- varImp(lassomod, scale = F)
lasso_Importance <- lasso_VarImp$importance
vars_Selected <- length(which(lasso_Importance$Overall != 0))
vars_NotSelected <- length(which(lasso_Importance$Overall == 0))
Display of the Lasso regression variable penalties
The Lasso regression used 11 variables and dropped 5 variables.
Interactive plot of Keras and Lasso regression: Predicted vs Actual Prices
| Date | KERAS PRED | LASSO PRED | REAL PRICES |
|---|---|---|---|
| 2019-04-22 | 61.47425 | 63.98916 | 66.77 |
| 2019-04-23 | 62.02896 | 55.00517 | 66.31 |
| 2019-04-24 | 60.32361 | 63.50995 | 65.07 |
| 2019-04-25 | 62.72711 | 56.25471 | 64.31 |
| 2019-04-26 | 61.83586 | 63.31208 | 65.03 |
| 2019-04-29 | 59.63830 | 56.38149 | 65.13 |
| 2019-04-30 | 60.34718 | 68.96010 | 65.74 |
We define the parameter grid that caret will use to search for the optimal hyperparameters
library(xgboost)
xgb_grid = expand.grid(nrounds = 1000, eta = c(0.1, 0.05, 0.01), max_depth = c(2,
3, 4, 5, 6), gamma = 0, colsample_bytree = 1, min_child_weight = c(1, 2,
3, 4, 5), subsample = 1)
# With the 5 fold cross validation, caret package can find the optimal
# hyperparameters for our model (takes a lot of time...)
xgb_hyperparam <- train(DDAIF.Adjusted ~ ., data = na.omit(train), method = "xgbTree",
trControl = trainControl, tuneGrid = xgb_grid)
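The XGBoost prediction step is likewise not shown; a minimal sketch (the name xgb_pred is an assumption):
xgb_pred <- as.numeric(predict(xgb_hyperparam, newdata = as.data.frame(X_test)))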
| Date | KERAS PRED | LASSO PRED | XGB PRED | REAL PRICES |
|---|---|---|---|---|
| 2019-04-22 | 61.47425 | 63.98916 | 61.09808 | 66.77 |
| 2019-04-23 | 62.02896 | 55.00517 | 62.61695 | 66.31 |
| 2019-04-24 | 60.32361 | 63.50995 | 59.31833 | 65.07 |
| 2019-04-25 | 62.72711 | 56.25471 | 62.09966 | 64.31 |
| 2019-04-26 | 61.83586 | 63.31208 | 56.91272 | 65.03 |
| 2019-04-29 | 59.63830 | 56.38149 | 61.77945 | 65.13 |
| 2019-04-30 | 60.34718 | 68.96010 | 56.54294 | 65.74 |
In ordinary periods, the daily fluctuation of a share price is usually between 1 and 2 percent. Unfortunately, despite the mathematical technical indicators we added to our datasets before training, the above models produced predictions with much higher day-to-day fluctuation.
Our current models are therefore not adequate to forecast financial market time series successfully.
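To quantify the gap, we could compare each model's root mean squared error against the real prices; a minimal sketch, assuming the prediction vectors keras_pred, lasso_pred and xgb_pred sketched above:
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2)) # root mean squared error
rmse(keras_pred, y_test)
rmse(lasso_pred, y_test)
rmse(xgb_pred, y_test)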
In another paper, I have also created models to analyze the correlation between social media sentiment and the Daimler share price; there, my forecasts were significantly more successful. I would propose creating a model that analyzes and combines the results of:
-Social media sentiment
-Economic News
-Market Time Series Analysis
Thank you for reading my analysis
KR
Niko
(https://www.linkedin.com/in/niko-papacosmas-mba-pmp-mcse-695a2695/)
REFERENCES
[1] Ruoxuan Xiong, Eric P. Nichols, Yuan Shen. Deep Learning Stock Volatility with Google Domestic Trends. Cornell University, arXiv:1512.04916v3 [q-fin.CP] (2016).
[2] Junran Wu, Ke Xu, Jichang Zhao. Online reviews can predict long-term returns of individual stocks. Cornell University, arXiv [cs.SI, q-fin.CP] (2019).