#Load packages

library(devtools)
## Warning: package 'devtools' was built under R version 4.0.2
## Loading required package: usethis
## Warning: package 'usethis' was built under R version 4.0.2
library(vars)
## Warning: package 'vars' was built under R version 4.0.2
## Loading required package: MASS
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.0.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.0.2
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.0.2
## Loading required package: urca
## Warning: package 'urca' was built under R version 4.0.2
## Loading required package: lmtest
## Warning: package 'lmtest' was built under R version 4.0.2
library(mFilter)
## Warning: package 'mFilter' was built under R version 4.0.2
library(astsa)
library(zoo)
library(tseries)
## Warning: package 'tseries' was built under R version 4.0.2
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(xts)
## Warning: package 'xts' was built under R version 4.0.2

The cell below reads our data. The data consists of 2,501 daily observations of five variables (volume and the low, high, open, and close prices), plus a date column. The aim is to predict the closing price five days into the future using the close, open, high, and low prices, together with volume, as inputs.

df <- read.csv(file = "Price_History.csv", header=TRUE)
head(df)
##         Date    Volume      Low     High     Open    Close
## 1 2018-06-21 414717344 56226.37 57122.49 56651.66 56234.43
## 2 2018-06-20 362642010 56253.31 57174.56 56253.31 56651.65
## 3 2018-06-19 318060683 56141.15 57236.84 57236.84 56253.30
## 4 2018-06-18 197545773 56860.06 57660.50 57660.50 57236.84
## 5 2018-06-15 628754578 57639.70 58703.69 58495.67 57660.50
## 6 2018-06-14 239805357 57953.73 58659.02 58437.23 58495.67

The code below changes the Date column from a factor to a Date with the format year, month, and day.

#change date column from a factor to date
df$Date = as.Date(df$Date, format = "%Y-%m-%d")
class(df$Date)
## [1] "Date"

The High series contains one outlier above 62,000. The code below shows how such data points can be removed: replace them with NA and then drop the rows containing NA (the replacement line is left commented out here).

#Remove outliers
#Replace the outlier values in High with NA (left commented out):
#df$High[df$High >= 62000] <- NA

#remove rows containing NA
data <- na.omit(df)
head(data)
##         Date    Volume      Low     High     Open    Close
## 1 2018-06-21 414717344 56226.37 57122.49 56651.66 56234.43
## 2 2018-06-20 362642010 56253.31 57174.56 56253.31 56651.65
## 3 2018-06-19 318060683 56141.15 57236.84 57236.84 56253.30
## 4 2018-06-18 197545773 56860.06 57660.50 57660.50 57236.84
## 5 2018-06-15 628754578 57639.70 58703.69 58495.67 57660.50
## 6 2018-06-14 239805357 57953.73 58659.02 58437.23 58495.67

Sort the data by Date (from earliest to latest).

#sort data by date
data <- data[order(data$Date),] 
#head(data)

The code below sets the Date column as the row index.

#set the row index to the Date column
rownames(data) <- data$Date
head(data)
##                  Date    Volume      Low     High     Open    Close
## 2008-06-20 2008-06-20 220592198 30577.17 31295.46 31295.46 30580.63
## 2008-06-23 2008-06-23 166902116 30370.91 30893.41 30589.36 30473.44
## 2008-06-24 2008-06-24 203936859 30121.44 30784.39 30473.44 30222.88
## 2008-06-25 2008-06-25 250539073 29882.91 30222.88 30222.88 29964.55
## 2008-06-26 2008-06-26 224698362 29750.97 30249.56 29964.55 29988.12
## 2008-06-27 2008-06-27 234933104 29613.68 30375.47 29988.12 30375.47

Now that the Date column serves as the index, we can drop it from the data itself. Below we create a list of the columns we want to keep.

#list of cols to keep
keeps <- c("Volume","Low","High","Open","Close")
data <- data[keeps]
head(data)
##               Volume      Low     High     Open    Close
## 2008-06-20 220592198 30577.17 31295.46 31295.46 30580.63
## 2008-06-23 166902116 30370.91 30893.41 30589.36 30473.44
## 2008-06-24 203936859 30121.44 30784.39 30473.44 30222.88
## 2008-06-25 250539073 29882.91 30222.88 30222.88 29964.55
## 2008-06-26 224698362 29750.97 30249.56 29964.55 29988.12
## 2008-06-27 234933104 29613.68 30375.47 29988.12 30375.47

The code below converts our data frame into a time series using the zoo and xts packages.

data_series <- zoo(as.xts(data))
head(data_series)
##               Volume      Low     High     Open    Close
## 2008-06-20 220592198 30577.17 31295.46 31295.46 30580.63
## 2008-06-23 166902116 30370.91 30893.41 30589.36 30473.44
## 2008-06-24 203936859 30121.44 30784.39 30473.44 30222.88
## 2008-06-25 250539073 29882.91 30222.88 30222.88 29964.55
## 2008-06-26 224698362 29750.97 30249.56 29964.55 29988.12
## 2008-06-27 234933104 29613.68 30375.47 29988.12 30375.47

We split the data into a training set (80%) and a testing set (20%). The testing data will be used to validate the performance of the model. The price values differ in range from the volume, so we need to transform the data so that all series are on a similar scale. In the next step we use a log transformation.

train_index <- floor(nrow(data_series) * 0.8)

train <- data_series[1:train_index,]
test <- data_series[-c(1:train_index),]
head(train)
##               Volume      Low     High     Open    Close
## 2008-06-20 220592198 30577.17 31295.46 31295.46 30580.63
## 2008-06-23 166902116 30370.91 30893.41 30589.36 30473.44
## 2008-06-24 203936859 30121.44 30784.39 30473.44 30222.88
## 2008-06-25 250539073 29882.91 30222.88 30222.88 29964.55
## 2008-06-26 224698362 29750.97 30249.56 29964.55 29988.12
## 2008-06-27 234933104 29613.68 30375.47 29988.12 30375.47
nrow(train)
## [1] 2000
nrow(test)
## [1] 501

The code below applies the log transformation and shows the transformed data.

# transform the series

train_log <- log(train)
head(train_log)
##              Volume      Low     High     Open    Close
## 2008-06-20 19.21183 10.32801 10.35123 10.35123 10.32812
## 2008-06-23 18.93292 10.32124 10.33830 10.32841 10.32461
## 2008-06-24 19.13332 10.31299 10.33476 10.32461 10.31635
## 2008-06-25 19.33913 10.30504 10.31635 10.31635 10.30777
## 2008-06-26 19.23027 10.30062 10.31724 10.30777 10.30856
## 2008-06-27 19.27481 10.29599 10.32139 10.30856 10.32139

We plot the time series of the training data to look at the trend. We observe a similar increasing trend in all the price series, but not in the volume. Trends in the data indicate that the data is non-stationary. When building a VAR model, one requirement is that each series be stationary, i.e. its mean, variance, and covariances are constant over time. In the next step we test for stationarity using the augmented Dickey-Fuller test.

plot(train_log)
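
As an informal complement to the plot, a rolling mean and standard deviation make a drifting level easy to see. The sketch below uses zoo::rollapply on the log Close series; the 60-day window is an arbitrary choice for illustration.

#informal stationarity check: 60-day rolling mean and SD of the log Close
roll_mean <- rollapply(train_log$Close, width = 60, FUN = mean, align = "right")
roll_sd <- rollapply(train_log$Close, width = 60, FUN = sd, align = "right")
plot(merge(LogClose = train_log$Close, RollMean = roll_mean, RollSD = roll_sd))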

The code below tests for stationarity using the augmented Dickey-Fuller (ADF) test. The null hypothesis is that the series is non-stationary; we reject it, and conclude that the series is stationary, when the p-value is small. Note that tseries::adf.test includes a linear trend in the test regression, so the small p-values below should be read with care for these trending price series; given the clear trends in the plot above, we difference the data before modelling.

#ADF test
adf.test(train_log$Volume, alternative ="stationary")
## Warning in adf.test(train_log$Volume, alternative = "stationary"): p-value
## smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  train_log$Volume
## Dickey-Fuller = -9.6964, Lag order = 12, p-value = 0.01
## alternative hypothesis: stationary
adf.test(train_log$Close, alternative ="stationary")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  train_log$Close
## Dickey-Fuller = -3.8167, Lag order = 12, p-value = 0.01815
## alternative hypothesis: stationary
adf.test(train_log$Open, alternative ="stationary")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  train_log$Open
## Dickey-Fuller = -3.9358, Lag order = 12, p-value = 0.01219
## alternative hypothesis: stationary
adf.test(train_log$Low, alternative ="stationary")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  train_log$Low
## Dickey-Fuller = -3.8213, Lag order = 12, p-value = 0.01792
## alternative hypothesis: stationary
adf.test(train_log$High, alternative ="stationary")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  train_log$High
## Dickey-Fuller = -3.872, Lag order = 12, p-value = 0.01538
## alternative hypothesis: stationary

The code below differences our training data (first differences of the logs, i.e. daily log returns).

train_log_diff <- diff(train_log, differences = 1)
head(train_log_diff)
##                 Volume          Low          High         Open        Close
## 2008-06-23 -0.27890823 -0.006768410 -0.0129301466 -0.022820803 -0.003511317
## 2008-06-24  0.20040292 -0.008248032 -0.0035351492 -0.003796752 -0.008256231
## 2008-06-25  0.20580446 -0.007950465 -0.0184084899 -0.008256231 -0.008584237
## 2008-06-26 -0.10885600 -0.004425009  0.0008823855 -0.008584237  0.000786287
## 2008-06-27  0.04454192 -0.004625320  0.0041537359  0.000786287  0.012834072
## 2008-06-30 -0.19214323  0.025398958  0.0124851620  0.012834072  0.001248912
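
Since the model will be fitted on these log-differences, any forecast must eventually be mapped back to the price scale. As a quick sanity check, sketched here for the Close series, a price level is recovered as the exponential of the initial log level plus the cumulative sum of the log-differences.

#sanity check: invert the log-difference transform for Close
recovered <- exp(as.numeric(train_log$Close)[1] + cumsum(as.numeric(train_log_diff$Close)))
head(recovered) #matches the original Close series from day 2 onward
head(as.numeric(train$Close)[-1])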

Performing an ADF test on the differenced series: the p-values for all variables are at or below the printed minimum of 0.01, so we reject the null hypothesis of non-stationarity and can start building our VAR model.

#ADF test on differenced series

adf.test(train_log_diff$Volume, alternative ="stationary")
## Warning in adf.test(train_log_diff$Volume, alternative = "stationary"): p-value
## smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  train_log_diff$Volume
## Dickey-Fuller = -18.548, Lag order = 12, p-value = 0.01
## alternative hypothesis: stationary
adf.test(train_log_diff$Close, alternative ="stationary")
## Warning in adf.test(train_log_diff$Close, alternative = "stationary"): p-value
## smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  train_log_diff$Close
## Dickey-Fuller = -13.31, Lag order = 12, p-value = 0.01
## alternative hypothesis: stationary
adf.test(train_log_diff$Open, alternative ="stationary")
## Warning in adf.test(train_log_diff$Open, alternative = "stationary"): p-value
## smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  train_log_diff$Open
## Dickey-Fuller = -13.372, Lag order = 12, p-value = 0.01
## alternative hypothesis: stationary
adf.test(train_log_diff$Low, alternative ="stationary")
## Warning in adf.test(train_log_diff$Low, alternative = "stationary"): p-value
## smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  train_log_diff$Low
## Dickey-Fuller = -12.814, Lag order = 12, p-value = 0.01
## alternative hypothesis: stationary
adf.test(train_log_diff$High, alternative ="stationary")
## Warning in adf.test(train_log_diff$High, alternative = "stationary"): p-value
## smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  train_log_diff$High
## Dickey-Fuller = -13.44, Lag order = 12, p-value = 0.01
## alternative hypothesis: stationary

Before fitting the VAR model we need to choose the number of lags. We use the information criteria reported by VARselect to pick the lag order p. The AIC (together with HQ and FPE) suggests p = 8, while the more conservative SC suggests p = 5.

no_lags <- VARselect(train_log_diff, lag.max = 8)
no_lags$selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      8      8      5      8

Although the criteria above favour a larger lag order, the code below fits a more parsimonious VAR of order p = 2; the summary output that follows corresponds to this VAR(2).

#VAR Model
var_model <- VAR(train_log_diff, p = 2, type = "const")
summary(var_model)
## 
## VAR Estimation Results:
## ========================= 
## Endogenous variables: Volume, Low, High, Open, Close 
## Deterministic variables: const 
## Sample size: 1997 
## Log Likelihood: 33465.292 
## Roots of the characteristic polynomial:
## 0.556 0.556 0.5491 0.5491 0.5216 0.5216 0.5026 0.4171 0.4171 0.3071
## Call:
## VAR(y = train_log_diff, p = 2, type = "const")
## 
## 
## Estimation results for equation Volume: 
## ======================================= 
## Volume = Volume.l1 + Low.l1 + High.l1 + Open.l1 + Close.l1 + Volume.l2 + Low.l2 + High.l2 + Open.l2 + Close.l2 + const 
## 
##             Estimate Std. Error t value Pr(>|t|)    
## Volume.l1 -0.4100058  0.0222089 -18.461  < 2e-16 ***
## Low.l1    -0.0642087  1.3053649  -0.049  0.96077    
## High.l1    3.4623960  1.3351854   2.593  0.00958 ** 
## Open.l1    5.9427209  9.0220330   0.659  0.51017    
## Close.l1  -2.0458449  1.1325952  -1.806  0.07102 .  
## Volume.l2 -0.2769344  0.0221126 -12.524  < 2e-16 ***
## Low.l2    -0.8790857  1.3007932  -0.676  0.49924    
## High.l2    1.9721238  1.3347321   1.478  0.13969    
## Open.l2   -1.6218308  0.9994884  -1.623  0.10482    
## Close.l2  -8.4198172  9.0493741  -0.930  0.35226    
## const      0.0007441  0.0062985   0.118  0.90598    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.2813 on 1986 degrees of freedom
## Multiple R-Squared: 0.1701,  Adjusted R-squared: 0.1659 
## F-statistic:  40.7 on 10 and 1986 DF,  p-value: < 2.2e-16 
## 
## 
## Estimation results for equation Low: 
## ==================================== 
## Low = Volume.l1 + Low.l1 + High.l1 + Open.l1 + Close.l1 + Volume.l2 + Low.l2 + High.l2 + Open.l2 + Close.l2 + const 
## 
##             Estimate Std. Error t value Pr(>|t|)    
## Volume.l1 -0.0001321  0.0006878  -0.192  0.84774    
## Low.l1    -0.7035128  0.0404252 -17.403  < 2e-16 ***
## High.l1   -0.0975319  0.0413487  -2.359  0.01843 *  
## Open.l1    0.5207709  0.2793990   1.864  0.06248 .  
## Close.l1   0.9338618  0.0350748  26.625  < 2e-16 ***
## Volume.l2 -0.0007432  0.0006848  -1.085  0.27791    
## Low.l2    -0.2782235  0.0402836  -6.907 6.66e-12 ***
## High.l2   -0.1231704  0.0413347  -2.980  0.00292 ** 
## Open.l2    0.1571355  0.0309527   5.077 4.20e-07 ***
## Close.l2   0.0360224  0.2802458   0.129  0.89774    
## const      0.0001533  0.0001951   0.786  0.43208    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.008711 on 1986 degrees of freedom
## Multiple R-Squared: 0.3921,  Adjusted R-squared: 0.389 
## F-statistic: 128.1 on 10 and 1986 DF,  p-value: < 2.2e-16 
## 
## 
## Estimation results for equation High: 
## ===================================== 
## High = Volume.l1 + Low.l1 + High.l1 + Open.l1 + Close.l1 + Volume.l2 + Low.l2 + High.l2 + Open.l2 + Close.l2 + const 
## 
##             Estimate Std. Error t value Pr(>|t|)    
## Volume.l1 -3.356e-05  6.049e-04  -0.055   0.9558    
## Low.l1     3.987e-02  3.556e-02   1.121   0.2623    
## High.l1   -6.897e-01  3.637e-02 -18.964   <2e-16 ***
## Open.l1    6.093e-01  2.457e-01   2.479   0.0132 *  
## Close.l1   7.971e-01  3.085e-02  25.839   <2e-16 ***
## Volume.l2 -6.642e-04  6.023e-04  -1.103   0.2703    
## Low.l2     7.055e-02  3.543e-02   1.991   0.0466 *  
## High.l2   -3.431e-01  3.636e-02  -9.436   <2e-16 ***
## Open.l2    6.116e-02  2.722e-02   2.246   0.0248 *  
## Close.l2  -1.924e-01  2.465e-01  -0.781   0.4351    
## const      1.714e-04  1.716e-04   0.999   0.3178    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.007662 on 1986 degrees of freedom
## Multiple R-Squared: 0.4544,  Adjusted R-squared: 0.4517 
## F-statistic: 165.4 on 10 and 1986 DF,  p-value: < 2.2e-16 
## 
## 
## Estimation results for equation Open: 
## ===================================== 
## Open = Volume.l1 + Low.l1 + High.l1 + Open.l1 + Close.l1 + Volume.l2 + Low.l2 + High.l2 + Open.l2 + Close.l2 + const 
## 
##             Estimate Std. Error t value Pr(>|t|)    
## Volume.l1  1.039e-04  4.829e-05   2.151  0.03157 *  
## Low.l1     8.909e-03  2.838e-03   3.139  0.00172 ** 
## High.l1    2.262e-03  2.903e-03   0.779  0.43598    
## Open.l1   -5.011e-01  1.962e-02 -25.548  < 2e-16 ***
## Close.l1   9.931e-01  2.462e-03 403.305  < 2e-16 ***
## Volume.l2  6.414e-05  4.808e-05   1.334  0.18231    
## Low.l2     6.430e-03  2.828e-03   2.274  0.02310 *  
## High.l2   -2.614e-03  2.902e-03  -0.901  0.36776    
## Open.l2   -1.985e-03  2.173e-03  -0.913  0.36117    
## Close.l2   4.935e-01  1.967e-02  25.083  < 2e-16 ***
## const      2.787e-07  1.369e-05   0.020  0.98377    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.0006116 on 1986 degrees of freedom
## Multiple R-Squared: 0.9977,  Adjusted R-squared: 0.9977 
## F-statistic: 8.531e+04 on 10 and 1986 DF,  p-value: < 2.2e-16 
## 
## 
## Estimation results for equation Close: 
## ====================================== 
## Close = Volume.l1 + Low.l1 + High.l1 + Open.l1 + Close.l1 + Volume.l2 + Low.l2 + High.l2 + Open.l2 + Close.l2 + const 
## 
##             Estimate Std. Error t value Pr(>|t|)  
## Volume.l1 -3.153e-05  9.942e-04  -0.032   0.9747  
## Low.l1    -4.767e-02  5.844e-02  -0.816   0.4148  
## High.l1   -8.243e-02  5.977e-02  -1.379   0.1680  
## Open.l1    7.691e-01  4.039e-01   1.904   0.0570 .
## Close.l1   9.737e-02  5.070e-02   1.920   0.0549 .
## Volume.l2 -9.349e-04  9.899e-04  -0.944   0.3450  
## Low.l2     3.564e-02  5.823e-02   0.612   0.5406  
## High.l2   -7.765e-02  5.975e-02  -1.300   0.1939  
## Open.l2   -4.624e-02  4.474e-02  -1.033   0.3015  
## Close.l2  -7.339e-01  4.051e-01  -1.812   0.0702 .
## const      3.028e-04  2.820e-04   1.074   0.2830  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.01259 on 1986 degrees of freedom
## Multiple R-Squared: 0.01205, Adjusted R-squared: 0.007076 
## F-statistic: 2.422 on 10 and 1986 DF,  p-value: 0.007272 
## 
## 
## 
## Covariance matrix of residuals:
##            Volume        Low      High       Open      Close
## Volume  7.912e-02 -2.304e-04 2.274e-04  1.977e-06 -5.492e-06
## Low    -2.304e-04  7.588e-05 4.410e-05 -2.396e-07  9.264e-05
## High    2.274e-04  4.410e-05 5.870e-05  1.985e-07  7.752e-05
## Open    1.977e-06 -2.396e-07 1.985e-07  3.740e-07 -1.220e-07
## Close  -5.492e-06  9.264e-05 7.752e-05 -1.220e-07  1.586e-04
## 
## Correlation matrix of residuals:
##           Volume      Low    High     Open     Close
## Volume  1.000000 -0.09402 0.10553  0.01149 -0.001551
## Low    -0.094018  1.00000 0.66076 -0.04497  0.844578
## High    0.105535  0.66076 1.00000  0.04237  0.803551
## Open    0.011491 -0.04497 0.04237  1.00000 -0.015841
## Close  -0.001551  0.84458 0.80355 -0.01584  1.000000

The code below is meant to plot the residuals. There is an issue with the figure margins: with the default settings only the residual plot for Volume is visible. The margins are reduced with par(mar = ...), but the plot call is left commented out until the layout is fixed so that the residual plots for the other variables in the model can be inspected as well. In the next step we perform a goodness-of-fit test using the Portmanteau test.

#plotting the residuals
graphics.off()
par("mar")
## [1] 5.1 4.1 4.1 2.1
par(mar=c(1,1,1,1))
#plot(var_model, size= 1)
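
As an alternative to plot(var_model), the residual matrix can be extracted and plotted directly with plot.ts, which draws one panel per equation and avoids the margin problem. A minimal sketch:

#extract the residuals (one column per equation) and plot one panel per variable
res <- residuals(var_model)
plot.ts(res, main = "VAR(2) residuals")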

We test whether the model is an adequate fit, i.e. whether any serial autocorrelation remains in the residuals. The null hypothesis states that there is no serial autocorrelation in the residuals, and we reject it when the p-value is less than 0.05. The p-value below (< 2.2e-16) suggests that serial correlation remains in the residuals, so the model is not an adequate fit.

serial.test(var_model, lags.pt = 30, type = "PT.adjusted")
## 
##  Portmanteau Test (adjusted)
## 
## data:  Residuals of VAR object var_model
## Chi-squared = 2300.6, df = 700, p-value < 2.2e-16
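
A natural response to the rejected Portmanteau test is to refit the model with the AIC-selected lag order from VARselect and re-run the test; whether this removes the remaining autocorrelation has to be judged from the new output. A sketch:

#refit with the AIC-selected lag order and retest the residuals
var_model8 <- VAR(train_log_diff, p = 8, type = "const")
serial.test(var_model8, lags.pt = 30, type = "PT.adjusted")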

We now test whether the residuals are normally distributed. A p-value greater than 0.05 would indicate that the residuals are approximately normal; the results below indicate that our residuals are not normally distributed.

normality.test(var_model, multivariate.only = TRUE)
## $JB
## 
##  JB-Test (multivariate)
## 
## data:  Residuals of VAR object var_model
## Chi-squared = 19512041, df = 10, p-value < 2.2e-16
## 
## 
## $Skewness
## 
##  Skewness only (multivariate)
## 
## data:  Residuals of VAR object var_model
## Chi-squared = 3111.4, df = 5, p-value < 2.2e-16
## 
## 
## $Kurtosis
## 
##  Kurtosis only (multivariate)
## 
## data:  Residuals of VAR object var_model
## Chi-squared = 19508930, df = 5, p-value < 2.2e-16

Now we predict 5 days ahead.

#predict 5 steps ahead (the argument is n.ahead; n.predict is not an argument and would be silently ignored)
VAR_pred <- predict(var_model, n.ahead = 5)
plot(VAR_pred, col = "green")
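
The vars package also provides fanchart() for plotting the forecasts of a varprd object with shaded confidence regions:

#fan chart of the 5-step-ahead forecasts
fanchart(VAR_pred)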

The next step is to validate the performance of the model on the testing data and to calculate the difference between the true values and the values estimated by the model.
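
A minimal sketch of that validation for the Close series follows. It forecasts all nrow(test) steps at once, which is a strong simplification (a rolling one-step-ahead scheme would be more realistic), and it inverts the log-difference transform the same way as in the sanity check above: the predicted log-differences are cumulated, added to the last observed log level, and exponentiated.

#sketch: out-of-sample validation of the Close forecasts
h <- nrow(test)
fc <- predict(var_model, n.ahead = h)

#invert the transform: last observed log level + cumulated predicted log-differences
last_log_close <- log(as.numeric(train$Close)[nrow(train)])
pred_close <- exp(last_log_close + cumsum(fc$fcst$Close[, "fcst"]))

#compare with the true closing prices
actual_close <- as.numeric(test$Close)
rmse <- sqrt(mean((actual_close - pred_close)^2))
rmse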