Brian Scott
This markdown uses the LM_DIAGNOSTIC Function and additional code to evaluate an OLS Model. In this case we are intending to predict BTC’s closing price and directional change one week in the future. A logit or probit model would be be a better model to use to determine directional change. The Logit model markdown will be the next addition to this repository. In this markdown, I determine directional change accuracy by the predicted point estimate. If the prediction for next week is greater than BTC’s price today and BTC actually increases over that week, we conclude a success, in terms of directional accuracy.
This function adds a heteroscedasticity test, autocorrelation test, and a multicollinearity test to the standard R summary statistics. This function can also be found in my R_Functions GitHub Repository and that repository includes a full explanation of the function’s output.
LM_DIAGNOSTIC(DF, "var1_log", c("var3_log_Lag", "var4_log_Lag", "var12_log_Lag"))## Loading required package: car
## Loading required package: carData
## Loading required package: gridExtra
All variable p-values are well below .05, indicating significance.The model overall has a very small FStat-PValue and is significant as well.
The adjusted R Square is .97722. Indicating a very strong model fit.
The VIF values are all around 1, indicating no multicollinearity.
The fitted vs residual plot shows no obvious changes in residual variance across the plot. For now I will conclude homoscedastic residuals, although plots can be misleading.
The residual histogram is skewed to the right, but shows that the residuals are generally normally distributed. This is backed up by the Q-Q plot.
The durbin watson statistic is 1.71, indicating a small degree of negative autocorrelation. In my experience 1.71 is a reasonably strong DW-Stat for time series data.
The Residuals vs Leverage plot has observation 24 very close to the Cooks’ Distance line. Technically it passes the Cooks’ Distance test, but that observation should be evaluated further.
This should always be examined before even running a model, and this is hard to confirm considering the variables are masked. None of the variables directly effect any other variable in the model, so endogeneity is no concern.
The model is very strong with only two issues, there is a small degree of negative autocorrelation and there are some potential outliers.
The code below evaluates the accuracy of the model over varying hold out samples. The samples are randomly generated and they change over every iteration. In this case there are 50 iterations. The results capture the predicted direction accuracy and the point estimate accuracy. The results are in percentage terms. A value of .1 indcatesI created a function for this, but it is only works with the specified variables I have. I intend to update the function to work with any specified variables.
set.seed(100)
LM_HOLD_OUT_TEST <- function(df, yVar, xVars, testSize){
D <- na.omit(df)
require(dplyr)
#Create Y variable for regression model
depV <- paste(yVar, "~")
#create X variable string for regression model
indepV <- paste(xVars, collapse = " + ")
#Combine to form the full model
formulaReg <- paste(depV, indepV)
#while loop starting point
iteration <- 1
#while loop ending point
amt <- testSize
#table to collect loop results
loopTestDataFrame <- data.frame(matrix(ncol = 4, nrow = 0))
#colnames for table
colnames(loopTestDataFrame) <- c("AVG_PCT_E", "MAX_PCT_E", "MIN_PCT_E", "Correct_Dir_PCT" )
while(iteration <= amt){
#create train and test samples
#sample is approximately 80% of the data, and it does not have to be in consecutive order.
sample <- sample(c(TRUE,FALSE), nrow(D), replace = TRUE, prob = c(.8,.2))
train <- D[sample, ]
test <- D[!sample,]
#create regression based on train sample
regTrain <- lm(formulaReg, data = train)
#create predictions for the test data based on the training regression model
predictions <- regTrain %>% predict(test)
#calculate percentage error
errorPCT <- ((test$var1_log - predictions)/ test$var1_log) * 100
#Calculate if the prediction was in the right direction as the actual direction change
realDir <- test$var1_log - test$var1_log_Lag
predDir <- test$var1_log_Lag - predictions
signPD <- sign(predDir)
signRD <- sign(realDir)
correctDir <- signPD * signRD
#find the average correct direction
x <- sum(correctDir >0 )
y <- nrow(test)
correctDIRPCT <- (x/y)*100
#make vector of the metrics to be added to the loopTestDataFrame
meanEPCT <- mean(errorPCT)
maxEPCT <- max(errorPCT)
minEPCT <- min(errorPCT)
iterationSummary <- c(meanEPCT,maxEPCT,minEPCT,correctDIRPCT)
#combine the vector to the dataframe
loopTestDataFrame <- rbind(loopTestDataFrame,iterationSummary)
colnames(loopTestDataFrame) <- c("AVG_PCT_E", "MAX_PCT_E", "MIN_PCT_E", "Correct_Dir_PCT" )
iteration = iteration +1
}
#Loop Summaries
lDF <- loopTestDataFrame
#summary table of the error averages over entire loop cycle
errorDescriptivesTbl <- data.frame(AVG_E = round(mean(lDF$AVG_PCT_E),4), AVG_MAX_E = round(mean(lDF$MAX_PCT_E),4), AVG_MIN_E = round(mean(lDF$MIN_PCT_E),4) )
#summary table of the direction stats over entire loop
directionDescriptivesTbl <- data.frame(AVG_Dir_Correct = round(mean(lDF$Correct_Dir_PCT),4), MAX_Correct_PCT = round(max(lDF$Correct_Dir_PCT),4), MIN_Correct_PCT = round(min(lDF$Correct_Dir_PCT),4))
#combined accuracy and direction table
fullLoopDescriptives <- cbind(errorDescriptivesTbl,directionDescriptivesTbl)
row.names(fullLoopDescriptives) <- c("Row")
print(fullLoopDescriptives, right = FALSE)
}
LM_HOLD_OUT_TEST(DF,"var1_log", c("var3_log_Lag", "var4_log_Lag", "var12_log_Lag"), 50)## AVG_E AVG_MAX_E AVG_MIN_E AVG_Dir_Correct MAX_Correct_PCT MIN_Correct_PCT
## Row -0.0222 3.3533 -3.8174 51.9349 64.2857 37.037
The average percentage error is minuscule but we are only catching the correct direction 52% of the time. The average ranges from -3.817 to 3.533%. The model seems to be working strongly in terms of percentage error, but if the direction is off then investment using this method is risky. Deeper analysis is needed to determine how the percentage error can be so small and not catch the direction.