Abstract

There are a few ways of predicting stock prices. From blind guessing to machine learning, many techniques have been tested to try to achieve decent results. Capturing the complexity of the market in a model is a daunting task but it can be broken down in a number of approaches.

Since predicting the exact percent increase of a stock the next day is difficult, we can instead try instead to predict the direction of the movement. Alternatively, we can look at the trends and seasonality in the data and try to predict with some level of confidence where the price would like in an arbitrary number of days.

The logistic models which try to predict the direction of the next day’s movement yielded poor results. Hardly any of the predictor variable were significant, and the classification accuracies were around 52% which is hardly better than guessing. Panel Regression as well as the autoregressive models using ARIMA attempt to predict closing prices of stocks using one stock at a time. Panel Regression attempts to build one model for any number of stocks across the same time periods. Doing this also yielded poor and sporadic results when predicting per stock. On the other hand, the auto-regressive models that predict subsequent values of a stock time series were of greater use. This required finding a suitable combination of parameters to fit the data closely.

Key Words

Time series
Logistic regression
Auto-regressive models
Panel regression

Introduction

Finding reliable ways to predict stock prices is a difficult exercise and the methods used often yield unsatisfactory results. A lot of stock market data is readily available and it is important to find significant predictors variables amongst all the options. If we believe that the price of a stock, or the direction of its movement can be predicted using recent information then we can look at variables associated with recent movements with indicators such as RSI or moving averages as well as intraday variations in price via the OHLC prices.

If on other the hand we believe that trends and long-term movements are a better way to predict prices then time series analysis can be used. The analysis consists of comparing a series of prices such as the close price of a stock with a lagged or shifted version of itself in order to discern any temporal correlations. This technique requires certain assumptions to be met that require transformations of the data. Through different modeling approaches and parameter tuning, we can prediction what future prices will be to a certain degree of confidence.

Methodology

The dataset is composed of a variety of stocks. For the purpose of brevity, the data exploration will be limited to a single stock, Apple (AAPL). The dataset contains typical pricing variables and it is also augmented with engineered features which are believed to have predictive power. The engineered features used are technical indicators such as the RSI (Relative Strength Indicator), the MACD (Moving Average Convergence Divergence), or simple moving averages.

The logistic models require the creation of a binary target variable. It is calculated by comparing the price of the stock with the price on the following day. If the difference is positive, the target is evaluated as 1, and 0 otherwise. The logistic models are built using Generalized Linear Models suited for the exponential family of distributions (in this case binomial) with the help of the glm function.

The autoregressive models only make use of the price time series and fits a model that explains a given time series based on its previous values and how correlated it is to its lags (previous values). Auto Regressive Integrated Moving Average models (ARIMA) are 3 parameter models that can be used to forecast future values. The parameter p represents the AR (auto regressive) term, d is the MA (moving average) term and I (integration) is the order of differentiation required to make the series stationary.

The Panel regression models also use the engineered features such as RSI, MACD, closing price moving averages, and trade volume moving averages. Since our entire dataset consists of several different stocks over given time periods, the individuality of each company is taken into consideration using a fixed effects panel model. The closing price is then predicted factoring in the individual specific company effects.

Experimentation and Results

Data Description

The variables description are as follows:

Variable	Description
date	Trading Date
open	Price of the stock at market open
high	Highest price reached in the trade day
low	Lowest price reached in the trade day
close	Price of the stock at market close
volume	Number of shares traded
unadjustedVolume	Volume for stocks, unadjusted by stock splits
change	Change in closing price from prior trade day close
changePercent	Percentage change in closing price from prior trade day close
vwap	Volume weighted average price (VWAP) is the ratio of the value traded to total volume traded
label	Trading Date
changeOverTime	Percent change of each interval relative to first value. Useful for comparing multiple stocks.
ticker	Abbreviation used to uniquely identify publicly traded shares

dataset <- read_csv('https://raw.githubusercontent.com/salma71/Data_621/master/Project_Proposal/stocks_combined.csv')
tickers <- read_csv('https://raw.githubusercontent.com/salma71/Data_621/master/Project_Proposal/tickers.csv')

The tickers available in this dataset are listed below. We will be focusing on the AAPL stock which is described below.

tickers

skim(dataset)

Data summary

Name	dataset
Number of rows	36850
Number of columns	13
_______________________
Column type frequency:
character	3
numeric	10
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
date	1	8	10	1258
ticker	1	1	4	30
label	1	5	9	1258

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
open	1	94.52	52.68	18.17	56.83	84.09	118.30	417.14	▇▅▁▁▁
high	1	95.25	53.14	18.41	57.40	84.67	119.14	421.84	▇▅▁▁▁
low	1	93.76	52.19	18.11	56.19	83.54	117.39	417.11	▇▅▁▁▁
close	1	94.53	52.68	18.18	56.77	84.09	118.32	421.55	▇▅▁▁▁
volume	1	11196571.67	12461717.47	305358.00	3977137.50	7045060.00	13777850.50	618237630.00	▇▁▁▁▁
unadjustedVolume	1	10967241.59	12152154.22	305358.00	3817578.50	6895243.00	13581565.00	618237630.00	▇▁▁▁▁
change	1	0.05	1.50	-23.45	-0.43	0.04	0.56	65.29	▁▇▁▁▁
changePercent	1	0.05	1.30	-14.34	-0.56	0.05	0.70	16.67	▁▁▇▁▁
vwap	1	94.52	52.68	13.79	56.79	84.11	118.24	420.32	▇▅▁▁▁
changeOverTime	1	0.38	0.50	-0.41	0.06	0.24	0.54	3.19	▇▅▁▁▁

head(dataset)

Data Processing

The only necessary processing of the data is formatting the date column appropriately.

dataset$date <- as.Date(dataset$date, format="%m/%d/%Y")

Feature Engineering

The dataset is augmented with technicals indicators from the ta-lib package and the target variable for the logistic model is created. Observations are required to be dropped where NA values are introduced due to the shifting required by the window functions. We are interested in rates of change instead of value in order to be able to any stocks, which will not have the same price.

TARGET: the binaby target variable for the logistic model
MACD: the MACD signal (only) using the typical 12 and 26 window sizes
m5: momentum indicator over the last 5 days (moving average of one trading week)
m20: momentum indicator over the last 20 days (moving average of one trading month)
vol1: rate of change of the volume over the last day
vol5: rate of change of the volume over the last five dasy (one trading week)

AAPL <- dataset %>% filter(ticker=='AAPL') 

# Augment data frame
AAPL_full <- AAPL %>% mutate(macd=MACD(Cl(AAPL), nFast=12, nSlow=26, nSig=9, maType=SMA)[,1], 
                        m5=momentum(AAPL$close, n=5), 
                        m20=momentum(AAPL$close, n=20), 
                        rsi=RSI(AAPL$close, n=14),
                        vol1=ROC(AAPL$volume, n=1),
                        vol5=ROC(AAPL$volume, n=5)) %>%
                        drop_na() %>%
                      select(-c(date,ticker,label, changeOverTime))

# Create target variable
AAPL_model1_data <- AAPL_full%>% 
                  mutate(TARGET=if_else(shift(close, n=1, fill=NA, type="lead") > close, 1, 0)) %>%
                  #select(-c(date,ticker,label,open,high,low,close,volume,unadjustedVolume,vwap,change,changeOverTime)) %>%
                  drop_na()
                  
AAPL_model1_data$TARGET <- factor(AAPL_model1_data$TARGET)

head(AAPL_model1_data,10)

Data Exploration

The stock price for AAPL over the years are risen steadily. There are two important periods of decline but the overall trends remains positive. These periods of decline should provide balance to the dataset.

AAPL %>% ggplot(aes(x=date,y=close)) + geom_line() + ggtitle('AAPL Close Price')

The distributions of the predictor variables are mostly centered. The variables related to volume are skewed to the right tail. The variables close, high and low

AAPL_full %>% 
  keep(is.numeric) %>%
  gather() %>% 
  ggplot(aes(x= value)) + 
  geom_histogram(fill='pink') + 
  facet_wrap(~key, scales = 'free')

The correlation plot reveals that some variable are nearly perfectly correlated. This makes sense given that some of these variables generally move together or are derived from another.

library(corrplot)
corr_dataframe <- AAPL_model1_data %>% mutate_if(is.factor, as.numeric)
corr.d <- cor(corr_dataframe)
corr.d[ lower.tri( corr.d, diag = TRUE ) ] <- NA
corrplot( corr.d, type = "upper", diag = FALSE )

The target variable we created has a few more observations where the movement was positive. This seems reasonable.

AAPL_model1_data %>% ggplot(aes(x=TARGET)) + geom_histogram(stat="count")

Modeling

Model 1 (Logistic)

# Initialize a df that will store the metrics of models
models.df <- tibble(id=character(), formula=character(), res.deviance=numeric(), null.deviance=numeric(),
                 aic=numeric(), accuracy=numeric(), sensitivity=numeric(), specificity=numeric(),
                precision.deviance=numeric(), stringsAsFactors=FALSE)

M1A Full Model

The first logistic model yields unsatisfactory results. Only the close variable is significant. It is revealed that a lot of the variables are multicollinear. This makes sense given that a number of variables are related to the price are generally move together (close, open, low, high), and some are derived from other variables. The model should be reduced by removing the least significant predictors until the model is significant.

model.full <- build_model('model.full', "TARGET ~ .", data = AAPL_model1_data)

## 
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.526  -1.214   1.016   1.136   1.516  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)  
## (Intercept)       2.671e-01  6.027e-01   0.443   0.6577  
## open              8.583e-03  8.433e-02   0.102   0.9189  
## high             -1.930e-03  1.517e-01  -0.013   0.9899  
## low              -4.638e-02  1.281e-01  -0.362   0.7172  
## close            -2.640e-01  1.042e-01  -2.533   0.0113 *
## volume            2.347e-09  5.330e-09   0.440   0.6597  
## unadjustedVolume -3.069e-09  5.218e-09  -0.588   0.5563  
## change            7.907e-02  1.042e-01   0.759   0.4479  
## changePercent    -5.918e-02  1.341e-01  -0.441   0.6590  
## vwap              3.025e-01  2.193e-01   1.379   0.1678  
## macd             -2.321e-02  6.687e-02  -0.347   0.7286  
## m5                1.059e-02  2.147e-02   0.493   0.6218  
## m20               1.028e-02  1.728e-02   0.595   0.5520  
## rsi              -8.145e-04  9.055e-03  -0.090   0.9283  
## vol1             -3.231e-02  1.985e-01  -0.163   0.8707  
## vol5             -4.458e-03  1.609e-01  -0.028   0.9779  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1705.0  on 1231  degrees of freedom
## Residual deviance: 1694.3  on 1216  degrees of freedom
## AIC: 1726.3
## 
## Number of Fisher Scoring iterations: 4

models.df <- rbind(models.df,model.full$df_info)
#summary(model.full)

Variance Inflation Factors

##             open             high              low            close 
##      3056.394030     10078.470977      6913.127072      4666.194550 
##           volume unadjustedVolume           change    changePercent 
##         3.792919         3.290902        14.888381        12.928151 
##             vwap             macd               m5              m20 
##     20652.289953         8.450642         3.005024         8.777751 
##              rsi             vol1             vol5 
##         4.624581         1.314428         1.638141

M1B Small Model

Reducing the model down to only the significant predictors leads to a nonsensical results where the two variables cancel each other out.

model.small <- build_model('model.small', "TARGET ~ .-vol5 -high -rsi -open -vol1-changePercent-low-macd-m20-volume-unadjustedVolume-m5-change", data = AAPL_model1_data)

## 
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.534  -1.214   1.044   1.136   1.356  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.19119    0.20539   0.931   0.3519  
## close       -0.16739    0.07510  -2.229   0.0258 *
## vwap         0.16665    0.07511   2.219   0.0265 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1705.0  on 1231  degrees of freedom
## Residual deviance: 1699.6  on 1229  degrees of freedom
## AIC: 1705.6
## 
## Number of Fisher Scoring iterations: 4

models.df <- rbind(models.df,model.small$df_info)
summary(model.small)

##         Length Class  Mode
## model   30     glm    list
## df_info  8     tbl_df list

M1C Multicollinearity Removed

This model removes a lot of the predictors (open,high,low,vwap,change,macd) but adds Close2Open and Open2Open in order to try to retain some information from the variables that were dropped. The removed variables were determined to be too correlated and we retained only the predictors with Variance Inflation Factors below an allowable treshold as seen below. Unfortunately, this model also lacks significance.

m1c_data <- AAPL_model1_data
m1c_data$Close2Open <- m1c_data$open - shift(m1c_data$close, n=1, fill=NA, type="lag")
m1c_data$Open2Open <- m1c_data$open - shift(m1c_data$open, n=1, fill=NA, type="lag")
m1c_data_trimmed <- m1c_data %>% select(-c(open,high,low,vwap,change,macd)) %>% drop_na()

model.m1c  <- build_model('model.m1c', "TARGET ~ .", data = m1c_data_trimmed )

## 
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.452  -1.215   1.041   1.133   1.522  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)       2.315e-01  5.546e-01   0.417    0.676
## close            -6.350e-04  1.716e-03  -0.370    0.711
## volume            3.806e-09  5.157e-09   0.738    0.461
## unadjustedVolume -3.337e-09  5.102e-09  -0.654    0.513
## changePercent    -8.424e-02  5.278e-02  -1.596    0.110
## m5                1.639e-02  1.749e-02   0.937    0.349
## m20               4.954e-03  9.862e-03   0.502    0.615
## rsi              -1.601e-03  7.907e-03  -0.202    0.840
## vol1             -2.477e-02  1.974e-01  -0.126    0.900
## vol5             -1.477e-02  1.599e-01  -0.092    0.926
## Close2Open        1.055e-01  7.038e-02   1.499    0.134
## Open2Open        -1.718e-02  3.825e-02  -0.449    0.653
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1703.5  on 1230  degrees of freedom
## Residual deviance: 1697.7  on 1219  degrees of freedom
## AIC: 1721.7
## 
## Number of Fisher Scoring iterations: 4

models.df <- rbind(models.df,model.m1c$df_info)
summary(model.m1c)

##         Length Class  Mode
## model   30     glm    list
## df_info  8     tbl_df list

Variance Inflation Factors

car::vif(model.m1c$model)

##            close           volume unadjustedVolume    changePercent 
##         1.271410         3.558891         3.154121         2.017531 
##               m5              m20              rsi             vol1 
##         2.014173         2.882669         3.543602         1.306379 
##             vol5       Close2Open        Open2Open 
##         1.619487         2.799758         2.131832

Logistic Model Results

The results of the logistic models are displayed below. The in-sample testing accuracy is fairly low at 52%. No model has a particular edge so we proceed to move on to other types of model. This exercise could be enhanced by using data from other stocks by removing all the variables that are stock dependent. Additionally, more target could be used such as the the direction of the price movement over 5 days or 20 days.

models.df

Model 2 (Autoregressive Models)

In this section we drop all the predictors but the close price of the time series. We only use information contain the series itself to fit a model and make predictions. The AAPL stock price had an underlying trend but large fluctuations. We need to verify if certain conditions are met in order to model this time series. We first look at the autocorrelation of the series and see that the series is highly correlated with its previous values with a significant lag. We also consider the PACF which removes variations explained by earlier lags so we get only the relevant features.

Identification of an AR model is often done using the PACF while MA models use the ACF. The order of the model is determined by the number of significant lags taken into account. The PACF suggests ar AR(1) model and a regression line of time t onto time t-1 shows a close linear fit and a highly statistical coefficient of approximately 1 for t-1. A few methods are used to arrive at the estiamte but the coefficient is consistent. However, the slowly decaying ACF suggests that the series is not stationary and that some degree of differentiation might be required to stabilize the mean.

par(mfrow=c(2,2))
acf(AAPL$close)
pacf(AAPL$close)

t <- AAPL$close[-1]
t_1 <- AAPL$close[-length(AAPL$close)]
m2.lm <- lm(t ~ t_1)

plot(t_1, t)
abline(m2.lm, col=3, lwd=2)

summary(m2.lm)

## 
## Call:
## lm(formula = t ~ t_1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6872  -0.8211  -0.0341   0.9425  11.1222 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.340328   0.210876   1.614    0.107    
## t_1         0.997991   0.001572 634.859   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.129 on 1255 degrees of freedom
## Multiple R-squared:  0.9969, Adjusted R-squared:  0.9969 
## F-statistic: 4.03e+05 on 1 and 1255 DF,  p-value: < 2.2e-16

We use the generalized least squares method to obtain more information on the mode. We specify an error correlation structure of order 1. We note that the coefficient estiamte is again 1 and the AIC is 5478. This will serve as a comparison of more complex models.

AAPL_close <- ts(AAPL$close)
m2.gls <- gls(AAPL_close ~ time(AAPL_close), correlation = corAR1(form=~1))
summary(m2.gls)

## Generalized least squares fit by REML
##   Model: AAPL_close ~ time(AAPL_close) 
##   Data: NULL 
##        AIC     BIC    logLik
##   5478.647 5499.19 -2735.324
## 
## Correlation Structure: AR(1)
##  Formula: ~1 
##  Parameter estimate(s):
##       Phi 
## 0.9999998 
## 
## Coefficients:
##                     Value Std.Error   t-value p-value
## (Intercept)      68.90127  3174.416 0.0217052  0.9827
## time(AAPL_close)  0.08198     0.060 1.3647854  0.1726
## 
##  Correlation: 
##                  (Intr)
## time(AAPL_close) -0.012
## 
## Standardized residuals:
##          Min           Q1          Med           Q3          Max 
## -0.009293359 -0.001385895  0.002775650  0.006557553  0.020775575 
## 
## Residual standard error: 3174.415 
## Degrees of freedom: 1258 total; 1256 residual

We test the justification of the added autoregressive structure using a likelyhood ratio test. With a very small p value, we can reject the null hypothesis that the added term is not necessary.

m2.gls.0 <- update(m2.gls, correlation=NULL)
anova(m2.gls, m2.gls.0) # AR(1) vs Uncorrelated errors

Fitting an AR(1) model using the ar function confirms the coefficient of approxmiately 1 (0.997).

m2.ar1 <- ar(AAPL_close)
m2.ar1_fitted <- AAPL$close - residuals(m2.ar1)
m2.ar1

## 
## Call:
## ar(x = AAPL_close)
## 
## Coefficients:
##     1  
## 0.997  
## 
## Order selected 1  sigma^2 estimated as  8.862

We can view the fitted of the GLS and AR(1) models side by side.

par(mfrow=c(1,2))
m2.gls_fitted <- AAPL_close - residuals(m2.gls)
plot(AAPL$close, type = "l", col = 4, lty = 1, main="GLS")
points(m2.gls_fitted, type = "l", col = 2, lty = 2)

plot(AAPL$close, type = "l", col = 4, lty = 1, main="AR(1)")
points(m2.ar1_fitted, type = "l", col = 2, lty = 2)

We noted earlier that the slowly decaying structure of the ACF suggested that the series is not stationary. We verify this claim using the Dickey Fuller tests below. The null hypothesis that the series is not stationary is not rejected for the original series, but rejected for the once differentiated series. Therefore, the original series is not stationary and should be differentiated to stabilize the mean.

Dickey Fuller Test on time series:

adf.test(diff(AAPL$close, differences=1))

## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(AAPL$close, differences = 1)
## Dickey-Fuller = -9.8502, Lag order = 10, p-value = 0.01
## alternative hypothesis: stationary

Dickey Fuller Test on differentiated time series:

adf.test(diff(AAPL$close, differences=1))

## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(AAPL$close, differences = 1)
## Dickey-Fuller = -9.8502, Lag order = 10, p-value = 0.01
## alternative hypothesis: stationary

ARIMA

ARIMA models allows to combine the AR structure seen above with differentiation and moving average terms.

Here we look at the first two differentials. The higher order differentials have mean 0 but the variance seems to increase in the last part of the series.

Order 1 differential

The order 1 differential shows the highest correlation at position 0 and a significant lag at position 6, while the partial ACF seenms to show periodic behavior with a few significant lags at 7 and 8.

APPLdiff1 <- diff(AAPL$close, differences=1)
par(mfrow=c(2,2))
acf(APPLdiff1, lag.max=20)             
pacf(APPLdiff1, lag.max=20)
plot.ts(APPLdiff1)

Order 2 differential

On the other hand, the order two differential has several meaningful lags at position 1 and 7, and the partial ACF decays and stays significant until lag 6.

APPLdiff2 <- diff(AAPL$close, differences=2)
par(mfrow=c(2,2))
acf(APPLdiff2, lag.max=20)             
pacf(APPLdiff2, lag.max=20)
plot.ts(APPLdiff2)

Form these observations, we can study the following ARMA (autoregressive moving average) models are possible for the differentiated time series:

First Order ARIMA(7,1,0): AR(7) high order on the autoregressive structure
First Order ARIMA(0,1,2): MA(1) moving average structure
First Order ARIMA(1,1,1): ARMA(1,1) combination
Second Order ARIMA(1,2,0): AR(1) autoregressive structure
Second Order ARIMA(0,2,1): MA(1) moving average structure
Second Order ARIMA(6,2,0): AR(6) autoregressive structure
Second Order ARIMA(6,2,1): ARMA(6,1) combination

arimadf <- tibble(model=character(), coef=character(), loglik=numeric(), aic=numeric()) 

arimamodels <- function(p,d,q) {
  # A specification of the non-seasonal part of the ARIMA model: the three components (p, d, q) are the AR order, the degree of differencing, and the MA order. ARMA(p,q)
  arima.model <- arima(AAPL$close, order = c(p,d,q))
  arima.model_fitted <- AAPL$close - residuals(arima.model)
  df <- tibble(model=paste0(p,d,q), loglik=arima.model$loglik, aic=arima.model$aic)
  return.list <- list(model=arima.model,fitted=arima.model_fitted, df=df)
  return(return.list)
}

Model Results

The results of the ARIMA models listed above are printed here. We find that the model with the lowest AIC is the first order differential AR(7) model with a value of 5466. However is is also a complex model with 8 parameters. A smaller model with marginally decreased AIC is preffered. We can choose for the any of the remaining 3 parameter models. We select the second order MA(1) model at the best model.

# First order models
m710 <- arimamodels(p=7, d=1, q=0)
arimadf <- rbind(arimadf, m710$df)
m012 <- arimamodels(p=0, d=1, q=2)
arimadf <- rbind(arimadf, m012$df)
m111 <- arimamodels(p=1, d=1, q=1)
arimadf <- rbind(arimadf, m111$df)

# Second order models
m120 <- arimamodels(p=1, d=2, q=0)
arimadf <- rbind(arimadf, m120$df)
m021 <- arimamodels(0, 2, 1)
arimadf <- rbind(arimadf, m021$df)
m620 <- arimamodels(6, 2, 0)
arimadf <- rbind(arimadf, m620$df)
m621 <- arimamodels(6, 2, 1)
arimadf <- rbind(arimadf, m621$df)

arimadf

The coefficients for the best model are listed below and we can also see from the graph that the model is a good fit.

m021$model

## 
## Call:
## arima(x = AAPL$close, order = c(p, d, q))
## 
## Coefficients:
##          ma1
##       -1.000
## s.e.   0.003
## 
## sigma^2 estimated as 4.536:  log likelihood = -2735.32,  aic = 5474.65

plot(AAPL$close, type = "l", col = 4, lty = 1)
points(m021$fitted, type = "l", col = 2, lty = 2)

We can now use the model to forecast the price of the stock. In the example below the price is predicted for the next 150 trading days. Confidence intervals (shown in blue) are provided at 80% and 95% levels.

AAPLforecast <- forecast(m021$model, h=150)
autoplot(AAPLforecast)

Forecast vs Actual

arimaforecast <- function(stock, ticker) {
  # Subset the training data, fit model and get fitted valyes
  training_subset <- stock$close[1:round(0.7*length(AAPL$close))] 
  model <- arima(training_subset, order = c(0, 2, 1))
  fitted_values <- training_subset - residuals(model)
  
  boundary <- round(length(stock$close)*.7)
  end <- length(stock$close)
  data <- tibble(date=stock$date, price=stock$close, prediction = NA, fitted=NA, Low95 = NA,
         Low80 = NA,
         High95 = NA,
         High80 = NA)
  stocksubsetforecast <- forecast(model, h=377)
  forecast_df <- fortify(as.data.frame(stocksubsetforecast)) %>% as_tibble()
  forecast_df <- forecast_df %>%
    rename("Low95" = "Lo 95",
           "Low80" = "Lo 80",
           "High95" = "Hi 95",
           "High80" = "Hi 80",
           "Forecast" = "Point Forecast")
  data$fitted <- c(as.vector(fitted_values),rep(NA,(end-boundary)))
  data$prediction <- c(rep(NA,boundary),forecast_df$Forecast)
  data$Low80 <- c(rep(NA,boundary),forecast_df$Low80)
  data$High80 <- c(rep(NA,boundary),forecast_df$High80)
  data$Low95 <- c(rep(NA,boundary),forecast_df$Low95)
  data$High95 <- c(rep(NA,boundary),forecast_df$High95)
  
  ggplot(data, aes(x = date)) + 
    geom_ribbon(aes(ymin = Low95, ymax = High95, fill = "95%")) +
    geom_ribbon(aes(ymin = Low80, ymax = High80, fill = "80%")) +
    geom_point(aes(y = price, colour = "price"), size = 1) +
    geom_line(aes(y = price, group = 1, colour = "price"), 
              linetype = "dotted", size = 0.75) +
    geom_line(aes(y = fitted, group = 2, colour = "fitted"), size = 0.75) +
    geom_line(aes(y = prediction, group = 3, colour = "prediction"), size = 0.75) +
    scale_x_date(breaks = scales::pretty_breaks(), date_labels = "%b %y") +
    scale_colour_brewer(name = "Legend", type = "qual", palette = "Dark2") +
    scale_fill_brewer(name = "Intervals") +
    guides(colour = guide_legend(order = 1), fill = guide_legend(order = 2)) +
    theme_bw(base_size = 14) +
    ggtitle(ticker)
}

Now that we have a well-fiting model that can predict the stock price we can compare the prediction performance to the actual price of the stock by subsetting a training set using the first 70% of the data and predicting on the remaining 30% of the trading days. We find that the actual price of AAPL remain within the 80% and 95% confidence interval ribbons. The actual price is slightly lower that the linearly predicted price. We also tested how the structure of this ARIMA model generalized to other stock price time series. Below are price prediction for Microsoft and IBM. In the case of the former, the price stays in the upper range of the of the 95% confidence interval and even exceeds it. The final price is well above the prediction. The forecast for IBM is more within the range of the linear predictor seems a reasonable estimation of the movement.

arimaforecast(AAPL,'AAPL')

MSFT <- dataset %>% filter(ticker=='MSFT') 
IBM <- dataset %>% filter(ticker=='IBM')

arimaforecast(MSFT,'MSFT')

arimaforecast(IBM,'IBM')

Model 3 (Panel Regression)

For the 3rd model we will attempt to panel models on the stocks dataset as this dataset provides data across stocks and over time, thus having both cross-sectional and time-series dimensions. The general assumption for this dataset is that there is correlation over time for a given stock, but each stock is independent of other stocks. This dataset has not have any time-invariant regressors.

Balance dataset

Determine if all stocks are observed for all time periods and if unbalanced, balance so that all stocks are observed on the same trade dates. This indicates that some observations will be dropped so that all stocks have an observations on the same trading days

dataset2 <- dataset %>% select(-label)
is.pbalanced(dataset2)

## [1] FALSE

dataset2 <- make.pbalanced(dataset2,balance.type = "shared.times")
is.pbalanced(dataset2)

## [1] TRUE

Create Predictors

For all stock observations in our dataset, we will create the engineered features: MACD, RSI, m5, m20, v1, and v5

create_predictors <- function(df){
  df_full <- df %>% mutate(macd=MACD(Cl(df), nFast=12, nSlow=26, nSig=9, maType=SMA)[,1], 
                        m5=momentum(df$close, n=5), 
                        m20=momentum(df$close, n=20), 
                        rsi=RSI(df$close, n=14),
                        vol1=ROC(df$volume, n=1),
                        vol5=ROC(df$volume, n=5))
  return(df_full)
}

balanced_tickers <- unique(dataset2$ticker)
dataset2_full <- dataset2 %>% filter(ticker=="AAPL") %>% create_predictors() 

for(i in balanced_tickers[2:length(balanced_tickers)])
{
 dataset2_full <- rbind(dataset2_full, dataset2 %>% filter(ticker==i) %>% create_predictors())
}

dataset2_full %>% filter(date=="2014-05-21")

Test and Train datasets

Create test and training datasets for prediction purposes, and ensure both datasets are balanced

stock_dates <- unique(dataset2_full$date)

test_dates <- tail(stock_dates, 250)
train_dates <- head(stock_dates, length(stock_dates)-250)

test_stocks <- dataset2_full %>% dplyr::filter(date %in% test_dates)
train_stocks <- dataset2_full %>% dplyr::filter(date %in% train_dates)
is.pbalanced(test_stocks)

## [1] TRUE

is.pbalanced(train_stocks)

## [1] TRUE

train_stocks <- train_stocks %>% drop_na()
panel_stocks <- pdata.frame(train_stocks, index = c("ticker", "date"))

Model 3A: Pooled OLS

A Pooled model on panel data is applies the Ordinary Least Squares technique on the data. It is the most restrictive panel model as cross-sectional dimensions are ignored: \(y_{it}=\alpha+\beta x_{it}+u_{it}\)

stocks_m3_pooled <- plm(close ~ macd + m5 + m20,  data = panel_stocks, model = "pooling")
summary(stocks_m3_pooled)

## Pooling Model
## 
## Call:
## plm(formula = close ~ macd + m5 + m20, data = panel_stocks, model = "pooling")
## 
## Balanced Panel: n = 29, T = 983, N = 28507
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -98.2878 -31.1645  -5.8626  21.9314 280.1262 
## 
## Coefficients:
##             Estimate Std. Error t-value  Pr(>|t|)    
## (Intercept) 86.36634    0.25985 332.376 < 2.2e-16 ***
## macd        -8.03873    0.25224 -31.869 < 2.2e-16 ***
## m5          -1.87150    0.12322 -15.188 < 2.2e-16 ***
## m20          5.02074    0.10047  49.975 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    58494000
## Residual Sum of Squares: 52403000
## R-Squared:      0.10414
## Adj. R-Squared: 0.10404
## F-statistic: 1104.43 on 3 and 28503 DF, p-value: < 2.22e-16

Pooled_Model <- glance(stocks_m3_pooled)

Fixed Effect Models

In Fixed affects models we will assume there is an unobserved heterogeneity across the stocks such as each company’s core competency, or anything else that is unique to each company and thus something unobserved factored into each company’s stock price. This heterogeneity (\(\alpha_i\)) is not known but we would want to investigate it’s correlation with the created predictor variables see it’s impact on the closing price: \(y_{it}=\alpha_i+\beta x_{it}+u_{it}\)

Model 3B: Fixed Model - Within Estimator

stocks_m3_within <- plm(close ~ macd + m5 + m20 + rsi + vol5,  data = panel_stocks, model = "within")
summary(stocks_m3_within)

## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = close ~ macd + m5 + m20 + rsi + vol5, data = panel_stocks, 
##     model = "within")
## 
## Balanced Panel: n = 29, T = 983, N = 28507
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -64.76738  -8.82872  -0.75545   6.23842 204.90320 
## 
## Coefficients:
##       Estimate Std. Error  t-value  Pr(>|t|)    
## macd -3.473877   0.130298 -26.6609 < 2.2e-16 ***
## m5   -1.048252   0.065263 -16.0619 < 2.2e-16 ***
## m20   2.270449   0.045208  50.2219 < 2.2e-16 ***
## rsi   0.185650   0.017343  10.7047 < 2.2e-16 ***
## vol5  0.739358   0.237983   3.1068  0.001893 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    11677000
## Residual Sum of Squares: 10039000
## R-Squared:      0.14026
## Adj. R-Squared: 0.13926
## F-statistic: 929.038 on 5 and 28473 DF, p-value: < 2.22e-16

Fixed_Model <- glance(stocks_m3_within)

Model 3C: Fixed Model - First Difference Estimator

The First difference estimator uses period changes for each stock, where the individual specific effects (unobserved heterogeneity) is canceled out: \(y_{it}-y_{i,t-1} = \beta(x_{it}-x_{i,t-1})+(e_{it}-e_{i,t-1})\)

Looking at the result, we see that while the adjusted \(R^2\) is fairly high compared to the fitted values are nonsensical

stocks_m3_fd <- plm(close ~ macd + m5 + m20 + rsi + vol1 + vol5,  data = panel_stocks, model = "fd")
summary(stocks_m3_fd)

## Oneway (individual) effect First-Difference Model
## 
## Call:
## plm(formula = close ~ macd + m5 + m20 + rsi + vol1 + vol5, data = panel_stocks, 
##     model = "fd")
## 
## Balanced Panel: n = 29, T = 983, N = 28507
## Observations used in estimation: 28478
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -7.214443 -0.244970 -0.011041  0.233413 10.271243 
## 
## Coefficients:
##                Estimate  Std. Error  t-value  Pr(>|t|)    
## (Intercept)  0.05347588  0.00323057  16.5531 < 2.2e-16 ***
## macd         0.23063393  0.01368203  16.8567 < 2.2e-16 ***
## m5           0.21175171  0.00241756  87.5890 < 2.2e-16 ***
## m20          0.21845909  0.00256222  85.2617 < 2.2e-16 ***
## rsi          0.11329595  0.00093788 120.8004 < 2.2e-16 ***
## vol1        -0.02273702  0.00690079  -3.2948  0.000986 ***
## vol5         0.01974797  0.00788665   2.5040  0.012286 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    42088
## Residual Sum of Squares: 8460.4
## R-Squared:      0.79899
## Adj. R-Squared: 0.79894
## F-statistic: 18860.9 on 6 and 28471 DF, p-value: < 2.22e-16

Fixed_Model_FD <- glance(stocks_m3_fd)
fitted.values(stocks_m3_fd)[1:10]

##  [1] -0.2416019  0.5803779  0.2802988 -0.7070574 -1.3107559 -1.3498763
##  [7] -0.3178371  1.1958132 -0.8536587 -0.2521375

Model 3D: Random Effect Model

In Random affects models assumes no fixed effects.The individual specific effects (unobserved heterogeneity) are not correlated with the predictor variables and are independent of the predictor variables. This would result in the assumption that any residual variation on the dependent variable ( closing price) is random and should randomly distributed with the error term:
\(y_{it}=\beta x_{it}+(\alpha_i+ e_{it})\)

stocks_m3_random <- plm(close ~ macd + m5 + m20 + rsi + vol5,  data = panel_stocks, model = "random")
summary(stocks_m3_random)

## Oneway (individual) effect Random Effect Model 
##    (Swamy-Arora's transformation)
## 
## Call:
## plm(formula = close ~ macd + m5 + m20 + rsi + vol5, data = panel_stocks, 
##     model = "random")
## 
## Balanced Panel: n = 29, T = 983, N = 28507
## 
## Effects:
##                  var std.dev share
## idiosyncratic 352.59   18.78 0.386
## individual    560.33   23.67 0.614
## theta: 0.9747
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -62.2882  -8.8892  -1.3109   5.9865 206.3854 
## 
## Coefficients:
##              Estimate Std. Error  z-value  Pr(>|z|)    
## (Intercept) 77.366926   4.492012  17.2232 < 2.2e-16 ***
## macd        -3.477005   0.130416 -26.6609 < 2.2e-16 ***
## m5          -1.048818   0.065322 -16.0561 < 2.2e-16 ***
## m20          2.272307   0.045249  50.2183 < 2.2e-16 ***
## rsi          0.185538   0.017358  10.6886 < 2.2e-16 ***
## vol5         0.739583   0.238199   3.1049  0.001903 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    11707000
## Residual Sum of Squares: 10067000
## R-Squared:      0.14006
## Adj. R-Squared: 0.13991
## Chisq: 4642.05 on 5 DF, p-value: < 2.22e-16

Random_Model <- glance(stocks_m3_random)

Panel Model Testing & Selection

Below is a summary glance of the 4 Panel models run on the stocks dataset. Choosing between either a pooled, fixed effect, or random effect model requires running Breusch–Pagan Lagrange Multiplier and Hausman tests. Of the 4 models below, we will discard the First Difference model due to its nonsensical fitted values

Panel_Model_Summary <- data.frame(Pooled_Model)

Panel_Model_Summary <- rbind(Panel_Model_Summary, Fixed_Model)
Panel_Model_Summary <- rbind(Panel_Model_Summary, Fixed_Model_FD)
Panel_Model_Summary <- rbind(Panel_Model_Summary, Random_Model)


rownames(Panel_Model_Summary) <- c("Pooled Model", "Fixed Model", "First Difference Model", "Random Model")
Panel_Model_Summary

The Breusch–Pagan Lagrange Multiplier Test (LM Test) is used to test for heteroskedasticity in a linear regression model. The null hypothesis assumes homoskedasticity and in the alternate hypothesis heteroskedasticity is assumed.

First, we test for Random Effects against OLS. From the test we see that since the p-value is near 0, we reject the null hypothesis. The individual specific effects (unobserved heterogeneity) of each stock are significant and therefore we should not use the Pooled OLS model.

plmtest(stocks_m3_pooled, effect = "individual")

## 
##  Lagrange Multiplier Test - (Honda) for balanced panels
## 
## data:  close ~ macd + m5 + m20
## normal = 2918.2, p-value < 2.2e-16
## alternative hypothesis: significant effects

We test the Fixed Effects model against OLS Model, we again see that we reject the null for the alternative hypothesis as the test suggest more support for the Fixed Effect model

#LM test for fixed effects vs OLS
#significant result, more support for fixed effects model over ols
pFtest(stocks_m3_within, stocks_m3_pooled)

## 
##  F test for individual effects
## 
## data:  close ~ macd + m5 + m20 + rsi + vol5
## F = 4005, df1 = 30, df2 = 28473, p-value < 2.2e-16
## alternative hypothesis: significant effects

From the two tests above, we can set aside the Pooled OLS model which ignores both the cross-sectional and time-series dimensions. Next, we compare the Fixed Effects and Random Effects model. We do this using the Hausman Test, which evaluates the consistency of estimators, in our case the fixed effect and random effect estimators. If there is no correlation between the independent variables and the individual specific effects, then both Fixed Effect and Random Effect models are consistent, but the Fixed Effect model is not efficient. Should there be a correlation, then the Fixed Effects model is consistent and the Random Effects model is inconsistent.

From the result we see that the null hypothesis is rejected. This indicates that the hypothesis is that individual random effects of each stock are uncorrelated with the error term does not have support. We therefore choose the alternative hypothesis and select the Fixed Effect model

phtest(stocks_m3_random, stocks_m3_within)

## 
##  Hausman Test
## 
## data:  close ~ macd + m5 + m20 + rsi + vol5
## chisq = 1.078, df = 5, p-value = 0.956
## alternative hypothesis: one model is inconsistent

Discussion and Conclusion

In this paper, we implemented models using logistic regression, panel regression, and auto-regressive models using ARIMA.

The logistic models which try to predict the direction of the next day’s movement yielded poor results. Hardly any of the predictor variable were significant, and the classification accuracies were around 52% which is hardly better than guessing. The multicollinearity issues that were addressed did not improve the results so this model was abandonned. Some improvements are possible using a larger dataset of multiple stocks or using different target variables.

The auto-regressive models that predict subsequent values of a stock time series were of greater use. This required finding a suitable combination of parameters to fit the data closely. Simple linear regression suggested that an AR(1) could be used where the predictor is the previous value of the price. We estimated the coefficients and fit a model but also noted that the series was not stationary and should be differentiated in order to stabilize the mean to conform with the modeling assumptions. We progress some a simple AR(1) auto-regressive model to include the differentiation parameter as well as the moving average to the structure. A variety of models were fit and it was found that smalle models were preferred. The model selected as the best model was a second order differential model with a MA(1) structure. We were able to test its validity by splitting stock data into training and test sets to evaluate the performance of the model. Surprsingly, the MA(1) consistently capture the movement of the stock price within its confidence intervals. The intervals remain fairly wide and the linear trend is not garanteed to reflect the actual price of the stock at any time, but only centers the trend.

For Panel Regression we created and compared the significance of 3 models, the Pooled OLS, Fixed Effects, and Random Effects models. Within the Fixed Model, we tried the First Differences estimator which while having explaining the most variability produced results which do not make sense therefore was discarded. The remaining 3 models had similar explanation of variability, but given our dataset and the nature of it only the Fixed Effect model should be applicable. The Breusch–Pagan Lagrange Multiplier Test (LM Test) and Hausman Test showed support for this over the Pooled OLS and Random Effect models because every company has innate characteristics that impact it’s closing price. The limitation of the Fixed Effect model was it could not predict results for the test dataset due to an unresolvable multicollinearity issue with the engineered dataset. Resolution of this would require different feature set to be engineered and is left for future work.

Given these 3 models, the auto-regressive models using ARIMA produced the most stable results. The output of each of the 3 models is different making comparison of each to each other problematic. The logistic model produces binary results, the panel model predicts actual close values, and the auto-regressive model estimates general stock trend over a given period of time. Given that it is a difficult task to produce a fairly accurate prediction of stock prices by trade day, the auto-regressive ARIMA model that gives trends over periods of time is a better and more suitable method determining/estimating stock price movement

install_load <- function(pkg){
  # Load packages & Install them if needed.
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg)) install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE, quietly = TRUE, warn.conflicts = FALSE)
}

DATA 621 Final Project - Stock Predictions Methods

Dhairav Chhatbar, Mael Illien, Salma Elshahawy
DATA 621, Master of Science in Data Science,
City University of New York

Abstract

Key Words

Introduction

Methodology

Experimentation and Results

Data Description

Data Processing

Feature Engineering

Data Exploration

Modeling

Model 1 (Logistic)

M1A Full Model

Variance Inflation Factors

M1B Small Model

M1C Multicollinearity Removed

Variance Inflation Factors

Logistic Model Results

Model 2 (Autoregressive Models)

ARIMA

Order 1 differential

Order 2 differential

Model Results

Forecast vs Actual

Model 3 (Panel Regression)

Balance dataset

Create Predictors

Test and Train datasets

Model 3A: Pooled OLS

Fixed Effect Models

Model 3B: Fixed Model - Within Estimator

Model 3C: Fixed Model - First Difference Estimator

Model 3D: Random Effect Model

Panel Model Testing & Selection

Discussion and Conclusion

DATA 621 Final Project - Stock Predictions Methods

Dhairav Chhatbar, Mael Illien, Salma Elshahawy DATA 621, Master of Science in Data Science, City University of New York

Abstract

Key Words

Introduction

Methodology

Experimentation and Results

Data Description

Data Processing

Feature Engineering

Data Exploration

Modeling

Model 1 (Logistic)

M1A Full Model

Variance Inflation Factors

M1B Small Model

M1C Multicollinearity Removed

Variance Inflation Factors

Logistic Model Results

Model 2 (Autoregressive Models)

ARIMA

Order 1 differential

Order 2 differential

Model Results

Forecast vs Actual

Model 3 (Panel Regression)

Balance dataset

Create Predictors

Test and Train datasets

Model 3A: Pooled OLS

Fixed Effect Models

Model 3B: Fixed Model - Within Estimator

Model 3C: Fixed Model - First Difference Estimator

Model 3D: Random Effect Model

Panel Model Testing & Selection

Discussion and Conclusion

Dhairav Chhatbar, Mael Illien, Salma Elshahawy
DATA 621, Master of Science in Data Science,
City University of New York