Libraries

library(dplyr)
library(tidyr)
library(stats)
library(prophet)
library(pROC)
library(ggplot2)
library(forecast)
library(tseries)
library(segmented)
library(data.table)
library(RCurl)
library(XML)
library(rvest)
library(scales)
library(MASS)
library(Metrics)
library(stR)
library(zoo)
library(shiny)

Primary Data Source Loading

Our data is sourced from middle-tier over-the-counter (OTCQB) developing US and international stocks, found here: https://www.otcmarkets.com/home. A lower price threshold of $0.10 was set to filter out potentially bankrupt stocks. The metadata and metrics we’ve loaded are date, price, open, high, low, close, and volume for the stocks we’ll be analyzing.

# Adjust file paths to your environment
price_history <- "/Users/digitalmarketer1977/Documents/price_history.csv"
insider_disclosure <- "/Users/digitalmarketer1977/Documents/insider_disclosure.csv"

price_historytb <- read.csv(price_history, header = TRUE, sep = ",")
price_historytb <- subset(price_historytb, select = -c(X))  # drop the exported row-index column

insider_disclosure <- read.csv(insider_disclosure, header = TRUE, sep = ",")
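The $0.10 price floor described above is not shown in the loading code, so it was presumably applied when the data was pulled; a minimal sketch of reapplying it in R, assuming the Close column holds the daily closing price:

price_historytb <- subset(price_historytb, Close >= 0.10)  # drop likely-bankrupt penny stocks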

Exploratory Look at the Data

At a very high level with this class of stocks, it’s important to remember that there is substantial risk to investors when buying and selling them [1]. This also has an impact on the assumptions that we make during our analysis. Given the highly volatile nature of some of these stocks, history is not always the best predictor of what will happen tomorrow. Even some of the stocks that we have a 500+ day history on are highly variable, and in some instances appear as though they’ve gone under - a highly plausible outcome given the class of equities we’re talking about.

It’s actually quite plausible that the best-fitting model in this case might not be the most realistic one once domain expertise is factored into what we know.

# Remove Zero Trading Days
pricesubset <- filter(price_historytb, Volume > 0)
pricesubset$Symbol <- as.character(pricesubset$Symbol)

# Most Traded Stocks
tradevol <- pricesubset %>%
 na.omit() %>%
 group_by(Symbol) %>%
 summarise(total = sum(as.numeric(Volume))) %>%
 arrange(desc(total))

tradevol[1:10,]
## # A tibble: 10 x 2
##    Symbol      total
##     <chr>      <dbl>
## 1    FNMA 2592681339
## 2    BTCS 1469266818
## 3    FMCC 1310215137
## 4    MCIG 1302232599
## 5    MGTI 1179152597
## 6   grid  1069739756
## 7    LQMT  836746267
## 8    NWBO  732616101
## 9    SRNA  401087181
## 10  TITXF  318835611
ggplot(tradevol[1:10, ], aes(x = Symbol, y = total)) +
  geom_col(fill = "blue", color = "red") +  # geom_col draws bars at the given y values
  ggtitle("OTCQB - Highest Volume Stocks (000)") +
  geom_text(aes(label = comma(total / 1000)), position = position_dodge(0.9), size = 3, vjust = -0.25)

tradedays <- pricesubset %>%
 na.omit() %>%
 group_by(Symbol) %>%
 summarise(alldays = NROW(Date)) %>%
 arrange(desc(alldays))

tradedays[1:10,]
## # A tibble: 10 x 2
##    Symbol alldays
##     <chr>   <int>
## 1    ACUR     504
## 2    ARTH     504
## 3   ATTBF     504
## 4    BTCS     504
## 5    CANN     504
## 6    CBDS     504
## 7    CNAB     504
## 8    COCP     504
## 9    CVSI     504
## 10   CYDY     504
ggplot(tradedays[1:10, ], aes(x = Symbol, y = alldays)) +
  geom_col(fill = "black", color = "red") +
  ggtitle("OTCQB - Most Trading Days") +
  geom_text(aes(label = alldays), position = position_dodge(0.9), size = 3, vjust = -0.25)

# Stock list (per-symbol function TBD): accumulate the trend columns (1, 5, and 7 of pricesubset) for each symbol
stocklist <- unique(pricesubset$Symbol)

trend <- data.frame()
for (i in stocklist) {
  sym <- filter(pricesubset, Symbol == i)  # keep only rows for this symbol
  trend <- rbind(trend, sym[c(1, 5, 7)])   # append its trend columns
}
# prophet expects a data frame with columns ds (date) and y (value to forecast)
toymodelB <- toymodel[c(1, 2)]
colnames(toymodelB)[which(names(toymodelB) == "Close")] <- "y"
colnames(toymodelB)[which(names(toymodelB) == "Date")] <- "ds"
m <- prophet(toymodelB, daily.seasonality = TRUE)
## Initial log joint probability = -17.4411
## Optimization terminated normally: 
##   Convergence detected: absolute parameter change was below tolerance
summary(m)
##                         Length Class      Mode     
## growth                    1    -none-     character
## changepoints             25    POSIXct    numeric  
## n.changepoints            1    -none-     numeric  
## yearly.seasonality        1    -none-     character
## weekly.seasonality        1    -none-     character
## daily.seasonality         1    -none-     logical  
## holidays                  0    -none-     NULL     
## seasonality.prior.scale   1    -none-     numeric  
## changepoint.prior.scale   1    -none-     numeric  
## holidays.prior.scale      1    -none-     numeric  
## mcmc.samples              1    -none-     numeric  
## interval.width            1    -none-     numeric  
## uncertainty.samples       1    -none-     numeric  
## specified.changepoints    1    -none-     logical  
## start                     1    POSIXct    numeric  
## y.scale                   1    -none-     numeric  
## logistic.floor            1    -none-     logical  
## t.scale                   1    -none-     numeric  
## changepoints.t           25    -none-     numeric  
## seasonalities             2    -none-     list     
## extra_regressors          0    -none-     list     
## stan.fit                  0    -none-     NULL     
## params                    6    -none-     list     
## history                   5    data.frame list     
## history.dates           158    POSIXct    numeric
future <- make_future_dataframe(m, periods = 15)  # extend the history by 15 days
forecast <- predict(m, future)
plot(m, forecast)

prophet_plot_components(m, forecast)

Part III - Forecast data for 2017 only, in a data frame for subsequent analysis
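A minimal sketch of that extraction, assuming the forecast data frame produced by predict() above (ds, yhat, yhat_lower, and yhat_upper are standard prophet output columns):

forecast_2017 <- subset(forecast, format(ds, "%Y") == "2017",
                        select = c(ds, yhat, yhat_lower, yhat_upper))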

Part IV - stR Decomposition

The stR and prophet components depict a similar story when comparing the trend, seasonal, and random error plots - the data appears to be seasonal and somewhat stationary.
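A minimal sketch of the stR decomposition, assuming Pseries is the closing-price series used for the ADF test below and treating the seasonal period as weekly:

Pseries_ts <- ts(na.approx(Pseries), frequency = 7)  # weekly seasonality is an assumption
decomp <- AutoSTR(Pseries_ts)                        # automatic STR decomposition
plot(decomp)                                         # trend, seasonal, and random components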

Part V - Forecast package

Forecast against the test data set. na.approx() (from zoo) was used to impute missing data points.

In order for ARIMA models to work, the data has to be stationary. A check of this, along with our visual inspection of the deseasonalized trend, shows that the data is stationary (ADF test p-value of .047, below .05), which tells us that we do not need further differencing in this model. The data is seasonal.

Following this, the (p,d,q) values generated by auto.arima() give an AR order p of 5, meaning spikes at 5 lags in the Auto-Regressive (AR) part of the model (see ar1 through ar5 below), a differencing order d of 1, and 3 moving-average lags (ma1, ma2, ma3).

## 
##  Augmented Dickey-Fuller Test
## 
## data:  na.approx(Pseries)
## Dickey-Fuller = -3.4521, Lag order = 10, p-value = 0.04691
## alternative hypothesis: stationary
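A sketch of the calls behind the output above, assuming Pseries is the closing-price series (adf.test() is from tseries, auto.arima() from forecast):

Pseries_clean <- na.approx(Pseries)   # interpolate missing points (zoo)
adf.test(Pseries_clean)               # the stationarity check shown above
fit <- auto.arima(Pseries_clean)      # selects the (p,d,q) orders, here ARIMA(5,1,3)
fc <- forecast(fit, h = 15)           # 15-day horizon, matching the prophet forecast
plot(fc)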

Alternate Models - Insider Disclosure

There’s no correlation between score and price. The scatterplot essentially shows a flat distribution. Although we only tested a simple linear regression below, there’s nothing in the distribution to suggest another model (negative binomial, exponential, Poisson) would yield different results.
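The fit summarized below; the call is echoed in the output:

id_model <- lm(Net ~ Score, data = id_score)
summary(id_model)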

## 
## Call:
## lm(formula = Net ~ Score, data = id_score)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -57199716  -4486131  -4178747  -3033812 590461395 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4533168    2964599   1.529    0.128
## Score           9569      14353   0.667    0.506
## 
## Residual standard error: 42770000 on 207 degrees of freedom
##   (20 observations deleted due to missingness)
## Multiple R-squared:  0.002143,   Adjusted R-squared:  -0.002678 
## F-statistic: 0.4445 on 1 and 207 DF,  p-value: 0.5057

Part VI - Actuals vs. Forecast for 2016

Based on the RMSE, MAE, and MSE being the lowest for the prophet forecast when compared to the other methods (ARIMA, seasonal) from the forecast package, I would say prophet was the best-performing model of the ones we tested.
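A minimal sketch of that comparison with the Metrics package, where actuals, prophet_pred, and arima_pred are placeholder names for the 2016 test values and the aligned predictions from each model:

data.frame(
  model = c("prophet", "ARIMA"),
  RMSE  = c(rmse(actuals, prophet_pred), rmse(actuals, arima_pred)),
  MAE   = c(mae(actuals, prophet_pred),  mae(actuals, arima_pred)),
  MSE   = c(mse(actuals, prophet_pred),  mse(actuals, arima_pred))
)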

Data Wrangling Cheatsheet

Omit NA
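A base-R example, where df is a placeholder for any data frame:

df_clean <- na.omit(df)  # drop rows containing any NA values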

Plot Title & Format
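A ggplot2 example, assuming a data frame df with columns x and y (placeholder names):

ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  ggtitle("My Title") +
  theme(plot.title = element_text(hjust = 0.5))  # center the title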

ggplot2 Show Count Label
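A ggplot2 example, assuming a categorical column group (placeholder name):

ggplot(df, aes(x = group)) +
  geom_bar() +
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.25)  # label each bar with its count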

Factorization of Dates
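A base-R example, assuming Date is stored as a character column:

df$Date <- as.Date(df$Date, format = "%Y-%m-%d")               # parse dates
df$month <- factor(format(df$Date, "%b"), levels = month.abb)  # ordered month factor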


  1. History of OTC Stocks