Libraries
library(dplyr)
library(tidyr)
library(stats)
library(prophet)
library(pROC)
library(ggplot2)
library(forecast)
library(tseries)
library(segmented)
library(data.table)
library(RCurl)
library(XML)
library(rvest)
library(scales)
library(MASS)
library(Metrics)
library(stR)
library(stats)
library(zoo)
library(shiny)
Primary Data Source Loading
Our data is sourced from middle-tier over-the-counter (OTCQB) developing US and international stocks, found here: https://www.otcmarkets.com/home. A lower price threshold of stocks that are $0.10 or more was set to filter out potentially bankrupt stocks. The meta-data and metrics that we’ve loaded are date, price, open, high, low, and close for the stocks that we’ll be analyzing.
#Adjust file path
price_history <- "/Users/digitalmarketer1977/Documents/price_history.csv"
insider_disclosure <- "/Users/digitalmarketer1977/Documents/insider_disclosure.csv"
price_historytb <- read.csv(price_history, header = TRUE, sep = ",")
price_historytb = subset(price_historytb, select = -c(X) )
insider_disclosure <- read.csv(insider_disclosure, header = TRUE, sep = ",")
Exploratory Look at the Data
At a very high-level with these class of stocks it’s important to remember that there is sunstantial risk to investors when buy in selling them1. This is also has impact on the assumptions that we make during our analysis. Given the highly volatile nature of some of these stocks, history is not always the best predictor of will happen tomorrow. Even some of the stocks that we have a 500+ day history on are highly variable, in some instances appear as though they’ve gone under - which highly plausible outcome given the class of equities we’re talking about.
It’s actually quite plausible, that the best fitting model in this case might not actually be the most realistic one, once domain expertise is factored into what we know.
# Remove Zero Trading Days
pricesubset <- filter(price_historytb, Volume > 0)
pricesubset$Symbol <- as.character(pricesubset$Symbol)
# Most Traded Stocks
tradevol <- pricesubset %>%
na.omit() %>%
group_by(Symbol) %>%
summarise(total = sum(as.numeric(Volume))) %>%
arrange(desc(total))
tradevol[1:10,]
## # A tibble: 10 x 2
## Symbol total
## <chr> <dbl>
## 1 FNMA 2592681339
## 2 BTCS 1469266818
## 3 FMCC 1310215137
## 4 MCIG 1302232599
## 5 MGTI 1179152597
## 6 grid 1069739756
## 7 LQMT 836746267
## 8 NWBO 732616101
## 9 SRNA 401087181
## 10 TITXF 318835611
ggplot(tradevol[1:10,], aes(x = Symbol, y = total)) + geom_histogram(fill="blue", stat = "identity", color="red") + ggtitle("OTCQB - Highest Volume Stocks (000)") + geom_text(aes(label = total/1000, sep=","), position = position_dodge(0.9), size=3, vjust=-0.25)
tradedays <- pricesubset %>%
na.omit() %>%
group_by(Symbol) %>%
summarise(alldays = NROW(Date)) %>%
arrange(desc(alldays))
tradedays[1:10,]
## # A tibble: 10 x 2
## Symbol alldays
## <chr> <int>
## 1 ACUR 504
## 2 ARTH 504
## 3 ATTBF 504
## 4 BTCS 504
## 5 CANN 504
## 6 CBDS 504
## 7 CNAB 504
## 8 COCP 504
## 9 CVSI 504
## 10 CYDY 504
ggplot(tradedays[1:10,], aes(x = Symbol, y = alldays)) + geom_histogram(fill="black", stat = "identity", color="red") + ggtitle("OTCQB - Most Trading Days") + geom_text(aes(label = alldays, sep=","), position = position_dodge(0.9), size=3, vjust=-0.25)
#Stock list Function...TBD
stocklist <- unique(pricesubset$Symbol)
trend = data.frame()
for (i in stocklist){
filter(pricesubset, Symbol == i)
trend <- pricesubset[c(1,5,7)]
}
toymodelB<-toymodel[c(1,2)]
colnames(toymodelB)[which(names(toymodelB) == "Close")] <- "y"
colnames(toymodelB)[which(names(toymodelB) == "Date")] <- "ds"
m <- prophet(toymodelB, daily.seasonality = TRUE)
## Initial log joint probability = -17.4411
## Optimization terminated normally:
## Convergence detected: absolute parameter change was below tolerance
summary(m)
## Length Class Mode
## growth 1 -none- character
## changepoints 25 POSIXct numeric
## n.changepoints 1 -none- numeric
## yearly.seasonality 1 -none- character
## weekly.seasonality 1 -none- character
## daily.seasonality 1 -none- logical
## holidays 0 -none- NULL
## seasonality.prior.scale 1 -none- numeric
## changepoint.prior.scale 1 -none- numeric
## holidays.prior.scale 1 -none- numeric
## mcmc.samples 1 -none- numeric
## interval.width 1 -none- numeric
## uncertainty.samples 1 -none- numeric
## specified.changepoints 1 -none- logical
## start 1 POSIXct numeric
## y.scale 1 -none- numeric
## logistic.floor 1 -none- logical
## t.scale 1 -none- numeric
## changepoints.t 25 -none- numeric
## seasonalities 2 -none- list
## extra_regressors 0 -none- list
## stan.fit 0 -none- NULL
## params 6 -none- list
## history 5 data.frame list
## history.dates 158 POSIXct numeric
future <- make_future_dataframe(m, periods = 15)
forecast <- predict(m, future)
plot(m, forecast)
prophet_plot_components(m, forecast)
Part III - forecast data for 2017 ONLY in a data frame for subsequent analysis
Part IV - stR Decomposition
The Str & Propphet compoments depict a similar story when comparing the trends, seasonal and random error plots - that the data appears to be seaonal and somewhat stationary.
Part V - Forecast package
Forecast against test data set. na.approx was used to impute missing data points.
In order for ARIMA models to work, the data has to be stationary. A check of this, along with our visual inspection of the deasonalized trend shows that the data is stationary (adf test p-value of .05), tells us that we do not need differencing in this model. This data is seasonal.
Following this, our (p,d,q) values that are generated by auto.arime() means that our “p” of 5 means that we have spikes at 5 “lags” or each of the periods/years in our Auto-Regressive (AR) model - (see ar1, ar2, ar3,ar4 & ar5 below). As we stated earlier, our p-value is .05, so this nets a differencing value of 1 in the AR model and 3 moving average lags (ma1, ma2, ma3).
##
## Augmented Dickey-Fuller Test
##
## data: na.approx(Pseries)
## Dickey-Fuller = -3.4521, Lag order = 10, p-value = 0.04691
## alternative hypothesis: stationary
Alternate Models - Insider Disclosure
There’s no correlation between score and price. The scatterplot essentially shows a flat distribution. Although we only tested a simple regression below, there’s nothing in the distribution to suggest another model (Negative Binomial, Exponential, Poisson) will yield different results.
##
## Call:
## lm(formula = Net ~ Score, data = id_score)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57199716 -4486131 -4178747 -3033812 590461395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4533168 2964599 1.529 0.128
## Score 9569 14353 0.667 0.506
##
## Residual standard error: 42770000 on 207 degrees of freedom
## (20 observations deleted due to missingness)
## Multiple R-squared: 0.002143, Adjusted R-squared: -0.002678
## F-statistic: 0.4445 on 1 and 207 DF, p-value: 0.5057
Part VI Actuals vs. Forecast for 2016
Based on the RMSE, MAE & MSE being the lowest for the prophet forecast when compared to other methods (ARIMA, Seasonal) from the forecast package - I would say this model was the best peforming of the ones we tested.