Background

The Standard and Poor’s 500, or simply the S&P 500, is a stock market index tracking the stock performance of roughly 500 of the largest companies listed on stock exchanges in the United States.

The FTSE 100 is a major stock market index which tracks the performance of the 100 most highly capitalized companies traded on the London Stock Exchange. These companies represent about 80% of the entire market capitalization of the London Stock Exchange, and the index is free-float weighted. It has a base value of 1000, dating from January 3, 1984.

The Hang Seng is a major stock market index which tracks the performance of around 50 of the largest companies listed on the Stock Exchange of Hong Kong. It is a free-float-adjusted, capitalization-weighted index. The HSI has a base value of 100, dating from June 30, 1964.

The EURO STOXX 50 is a major stock market index which tracks the performance of 50 Blue-chip companies based in twelve Euro Area countries: Austria, Belgium, Finland, France, Germany, Greece, Ireland, Italy, Luxembourg, the Netherlands, Portugal and Spain.

The Nikkei 225 Stock Average Index is a major stock market index which tracks the performance of 225 top-rated companies listed on the Tokyo Stock Exchange. It is a price-weighted index with a base value of 176.21, dating from May 16, 1949.

Data description

  • The time series span a five-year period, from 2018-01-04 to 2022-12-31.
  • Five global market indices are analysed and predicted.
  • The data contains daily observations.

Objectives

First Section

Comparative analysis of five global markets using five years of data, from 2018 to 2022

  1. Short-term returns over the five-year period
  2. Volatility in daily returns and risk calculations
  3. Correlations among global market returns
  4. Time taken to recover during the pandemic

Second Section Predicting the open direction of the S&P 500 market using other global and market-specific indicators.

  1. Data preparation for prediction
  2. Machine Learning algorithms to predict the opening direction of market

Third Section Pair trading in the technology sector - identifying cointegrated pairs and entry/exit timing

  1. Data collection from the Yahoo Finance website
  2. Performing cointegration tests
  3. Automatic signal detection

The analysis is also available through the accompanying Shiny application at: https://ptrgska.shinyapps.io/financial_analysis_shiny/

# Import packages
pacman::p_load(quantmod, knitr, zoo, ggplot2, GGally, TSstudio, stringr,
               corrplot, ggcorrplot, PerformanceAnalytics, Hmisc, dplyr,
               reshape2, moments, urca, rpart, aTSA, egcm, caret, neuralnet,
               tidyverse, randomForest, bslib, TTR, e1071, shinyalert, car,
               performance, see, patchwork)

Data import

Financial data has been downloaded from the Yahoo Finance website.

# Create environment for stocks and function to download data.
ENV.STOCK <- new.env()
ticks.index <- c('^FTSE','^GSPC', '^HSI', '^SPEUP', '^N225')
sDate.index <- as.POSIXct('2018-01-04')
eDate.index <- as.POSIXct('2022-12-31')

readFinancialData <- function(ticker, envrmnt, start, end){
  data <- tryCatch(
            expr = {
              getSymbols(ticker,
                         src='yahoo',
                         from = start,
                         to = end,
                         periodicity = 'daily',
                         env = envrmnt)
            },
            error = function(e){
              message(paste('Please check the tickers are correct and try again.'))
              message(e)
              return(NA)
            },
            finally = {
              message(paste(" Processed ticker: ", ticker))
            }
          )
  return(data)
}

readFinancialData(ticks.index, ENV.STOCK, sDate.index, eDate.index)
## [1] "FTSE"  "GSPC"  "HSI"   "SPEUP" "N225"

Data cleaning

Checking the number of missing entries in the financial data obtained from Yahoo. Missing values in a time series object will cause errors in later calculations, hence we interpolate to fill them in.

# Clear the downloaded data from NA's.
for (i in ls(ENV.STOCK)){
  x <- get(i, envir = ENV.STOCK)
  print(paste0('The number of missing observations from ',
               i, ' is: ', sum(is.na(x))))
  if (sum(is.na(x)) != 0){
    ENV.STOCK[[i]] <- na.approx(x)
    print(paste0('Missing values are being interpolated for ', i, '.'))
  }
}
## [1] "The number of missing observations from FTSE is: 6"
## [1] "Missing values are being interpolated for FTSE."
## [1] "The number of missing observations from GSPC is: 0"
## [1] "The number of missing observations from HSI is: 0"
## [1] "The number of missing observations from N225 is: 72"
## [1] "Missing values are being interpolated for N225."
## [1] "The number of missing observations from SPEUP is: 18"
## [1] "Missing values are being interpolated for SPEUP."

As we can see, there were missing values in the data. We used linear interpolation (zoo’s na.approx) to fill in values for the missing days.
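To illustrate the difference between linear and spline interpolation, a minimal sketch on a toy zoo series (the values here are made up for illustration):

# Hedged sketch: linear vs spline interpolation on a toy series with a gap.
z <- zoo(c(10, NA, NA, 16, 18), order.by = as.Date('2020-01-01') + 0:4)
na.approx(z)  # linear fill: 10, 12, 14, 16, 18
na.spline(z)  # cubic spline fill; close to, but not necessarily equal to, the linear values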

# Check for missing values and head of each stock.
for (i in ls(ENV.STOCK)){
      x <- get(i, envir = ENV.STOCK)
      print(ifelse(sum(is.na(x)) > 0, paste0('There are still some missing values in ',i),
                     paste0('There are no missing values in ',i,
                            ' , data has been successfully cleaned.')))
    }
## [1] "There are no missing values in FTSE , data has been successfully cleaned."
## [1] "There are no missing values in GSPC , data has been successfully cleaned."
## [1] "There are no missing values in HSI , data has been successfully cleaned."
## [1] "There are no missing values in N225 , data has been successfully cleaned."
## [1] "There are no missing values in SPEUP , data has been successfully cleaned."
lapply(X = ENV.STOCK, FUN = function(x) head(x))
## $GSPC
##            GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
## 2018-01-04   2719.31   2729.29  2719.07    2723.99  3697340000       2723.99
## 2018-01-05   2731.33   2743.45  2727.92    2743.15  3239280000       2743.15
## 2018-01-08   2742.67   2748.51  2737.60    2747.71  3246160000       2747.71
## 2018-01-09   2751.15   2759.14  2747.86    2751.29  3467460000       2751.29
## 2018-01-10   2745.55   2750.80  2736.06    2748.23  3579900000       2748.23
## 2018-01-11   2752.97   2767.56  2752.78    2767.56  3645690000       2767.56
## 
## $FTSE
##            FTSE.Open FTSE.High FTSE.Low FTSE.Close FTSE.Volume FTSE.Adjusted
## 2018-01-04    7671.1    7702.5   7671.1     7695.9   705864000        7695.9
## 2018-01-05    7695.9    7727.7   7689.8     7724.2   636035700        7724.2
## 2018-01-08    7724.2    7733.4   7691.8     7696.5   635135400        7696.5
## 2018-01-09    7696.5    7733.1   7696.5     7731.0   709674500        7731.0
## 2018-01-10    7731.0    7756.1   7716.2     7748.5   861758600        7748.5
## 2018-01-11    7748.5    7769.0   7734.6     7762.9   982791800        7762.9
## 
## $N225
##            N225.Open N225.High N225.Low N225.Close N225.Volume N225.Adjusted
## 2018-01-04  23073.73  23506.33 23065.20   23506.33   102200000      23506.33
## 2018-01-05  23643.00  23730.47 23520.52   23714.53   101900000      23714.53
## 2018-01-08  23872.48  23897.07 23721.90   23816.12    96050000      23816.12
## 2018-01-09  23948.97  23952.61 23789.03   23849.99    94100000      23849.99
## 2018-01-10  23832.81  23864.76 23755.45   23788.20    88800000      23788.20
## 2018-01-11  23656.39  23734.97 23601.84   23710.43    83700000      23710.43
## 
## $SPEUP
##            SPEUP.Open SPEUP.High SPEUP.Low SPEUP.Close SPEUP.Volume
## 2018-01-04    1576.98    1593.98   1576.78     1591.51            0
## 2018-01-05    1592.17    1606.62   1591.68     1606.12            0
## 2018-01-08    1610.94    1612.89   1606.90     1610.35            0
## 2018-01-09    1612.98    1618.48   1609.74     1617.60            0
## 2018-01-10    1615.43    1617.30   1607.13     1612.59            0
## 2018-01-11    1615.29    1615.29   1603.81     1607.95            0
##            SPEUP.Adjusted
## 2018-01-04        1591.51
## 2018-01-05        1606.12
## 2018-01-08        1610.35
## 2018-01-09        1617.60
## 2018-01-10        1612.59
## 2018-01-11        1607.95
## 
## $HSI
##            HSI.Open HSI.High  HSI.Low HSI.Close HSI.Volume HSI.Adjusted
## 2018-01-04 30691.71 30796.93 30560.43  30736.48 2995571100     30736.48
## 2018-01-05 30893.86 30911.01 30638.53  30814.64 2263930100     30814.64
## 2018-01-08 30895.09 30929.15 30732.54  30899.53 2004162500     30899.53
## 2018-01-09 30933.51 31056.70 30852.43  31011.41 2004940800     31011.41
## 2018-01-10 31009.24 31267.57 30928.58  31073.72 2595365800     31073.72
## 2018-01-11 31066.21 31133.18 30950.07  31120.39 1794778900     31120.39

The data has been cleaned and ready for further analysis. The first few rows of each downloaded index can be seen above.

Time series

# Create dataset for plot indices.
xts.adjusted <- do.call(merge, eapply(ENV.STOCK, Ad))
xts.adjusted <- na.locf(xts.adjusted)
xts.adjusted <- xts.adjusted[,c(2,1,5,3,4)]
    
for(i in colnames(xts.adjusted)){
    a <- str_locate(i, '.Adjusted')
    colnames(xts.adjusted)[which(names(xts.adjusted) == i)] <- str_sub(i,start = 1L, end = a[1,1]-1)
}

As the holidays and opening hours of the markets differ across North America, Europe and Asia, merging the different data sets inadvertently introduces NAs. Hence we replace each NA with the most recent preceding value (na.locf). This ensures that we use the last available value on days when the North American or European markets were closed but Asia was trading.

Once the data is cleaned we can plot the time series; we use the adjusted values of the indices.

The closing price is simply the last transacted price before the market closes, while the adjusted closing price also factors in corporate actions such as dividends, stock splits and rights offerings. It amends a stock’s closing price to reflect the stock’s value after accounting for those corporate actions.
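For the indices above the two columns coincide, since no dividends or splits apply (as the head() output above shows); for an individual dividend-paying stock they differ. A minimal sketch, using AAPL purely as a hypothetical example outside this analysis:

# Hedged sketch: closing vs adjusted closing price for a dividend payer.
aapl <- getSymbols('AAPL', src = 'yahoo', from = '2018-01-04',
                   to = '2022-12-31', auto.assign = FALSE)
head(merge(Cl(aapl), Ad(aapl)))  # the adjusted price sits below the raw close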

# Plot the time series of the indices.
lapply(X = xts.adjusted, FUN = function(x) plot.xts(x, 
       main = paste0('The price movement for the index ',colnames(x))))
## $FTSE

## 
## $GSPC

## 
## $HSI

## 
## $N225

## 
## $SPEUP

The plots above show the price movement of the five global markets over the five-year period.

Short term returns

Daily returns

The daily returns of the markets over the five-year period can be seen below.

# Calculating daily return and plot for each indices.
xts.dailyreturns <- data.frame()
for (i in ls(ENV.STOCK)){
  x <- get(i, envir = ENV.STOCK)
  x2 <- round(dailyReturn(x)*100, 2)
  if(ncol(xts.dailyreturns) == 0){
    xts.dailyreturns = x2
  } else {
    xts.dailyreturns <- cbind(xts.dailyreturns, x2)
  }
  colnames(xts.dailyreturns)[ncol(xts.dailyreturns)] <- c(i)
}

par(mfrow=c(2,1))
lapply(X = xts.dailyreturns, FUN = function(x) 
       plot.xts(x, main = paste0('Simple daily return for the index ',colnames(x))))
## $FTSE
## 
## $GSPC

## 
## $HSI
## 
## $N225

## 
## $SPEUP

We create a cumulative plot to show the price movement of the markets relative to each other over the years. The cumulative return is the aggregate effect of price changes on the invested portfolio: the amount the investment gained or lost over time. As the price fluctuates, so does the return.

# Calculate cumulative product of indices and plot data.
xts.cumulative.dailyret <- data.frame()
for (i in colnames(xts.adjusted)){
  x2 <- cumprod(1+dailyReturn(xts.adjusted[,i], type = 'arithmetic'))
  if(ncol(xts.cumulative.dailyret) == 0){
    xts.cumulative.dailyret = x2
  } else {
    xts.cumulative.dailyret = cbind(xts.cumulative.dailyret, x2)
  }
  colnames(xts.cumulative.dailyret)[ncol(xts.cumulative.dailyret)] <- c(i)
}

plot.xts(xts.cumulative.dailyret, main = 'The cumulative returns for the five indices',
         legend.names = c('FTSE', 'GSPC', 'HSI', 'N225', 'SPEUP'),
         legend.loc = 'topleft',
         col=c('black', 'red', 'blue', 'lightblue', 'green'),
         lty=c(1, 1), lwd=c(2, 1)
         )

The mean of the cumulative return index for each market over the five years (note these are means of the cumulative series above, not average daily returns):

# Calculate the mean of the cumulative return series.
for (i in colnames(xts.cumulative.dailyret)){
  print(paste0(i, ' mean: ' , mean(xts.cumulative.dailyret[,i], na.rm = TRUE)))
}
## [1] "FTSE mean: 0.916545663974266"
## [1] "GSPC mean: 1.26690985982175"
## [1] "HSI mean: 0.84079746980419"
## [1] "N225 mean: 1.04490776300021"
## [1] "SPEUP mean: 1.01655010733551"

Log returns

Calculating daily log returns and visualizing each of the series. The log return measures the percentage-wise increase or decrease in the price of a stock on a logarithmic scale. The one-period log return is log(price at time t / price at time t-1), which is approximately equal to the percentage change in price from time t-1 to time t for small moves.
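A minimal sketch of this approximation on toy prices (assumed values):

# Hedged sketch: log returns approximate simple returns for small moves.
p <- c(100, 101, 99.5)                # toy prices
log_ret <- diff(log(p))               # log(p_t / p_{t-1})
simple_ret <- diff(p) / head(p, -1)   # (p_t - p_{t-1}) / p_{t-1}
round(cbind(log_ret, simple_ret), 5)  # the two columns are nearly equal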

# Create logreturn dataset and plot data.
xts.logreturns <- diff(log(xts.adjusted), lag = 1)
xts.logreturns <- na.omit(xts.logreturns)

par(mfrow=c(2,1))
lapply(X = xts.logreturns, FUN = function(x) 
       plot.xts(x, main = paste0('Daily log return for ',colnames(x))))
## $FTSE
## 
## $GSPC

## 
## $HSI
## 
## $N225

## 
## $SPEUP

The monthly and annual log returns are calculated for each of the indices. The table shows the annual log return of each index by year.

# Create monthly and annual data-set for log returns.
xts.annuallogreturns <- data.frame()
xts.monthlylogreturns <- data.frame()

for (i in ls(ENV.STOCK)){
  y <- get(i, envir = ENV.STOCK)
  x <- coredata(annualReturn(y, type = 'log'))
  x2 <- monthlyReturn(y, type = 'log')
  if(ncol(xts.annuallogreturns) == 0 && ncol(xts.monthlylogreturns) == 0){
    xts.annuallogreturns = x
    xts.monthlylogreturns = x2
  } else {
    xts.annuallogreturns = cbind(xts.annuallogreturns, x)
    xts.monthlylogreturns = cbind(xts.monthlylogreturns, x2)
  }
  colnames(xts.annuallogreturns)[ncol(xts.annuallogreturns)] <- c(i)
  colnames(xts.monthlylogreturns)[ncol(xts.monthlylogreturns)] <- c(i)
}
kable(xts.annuallogreturns, caption = 'Annual log return of the indices.')
Annual log return of the indices.
Year FTSE GSPC HSI N225 SPEUP
2018 -0.1311672 -0.0813512 -0.1718484 -0.1519675 -0.1422057
2019 0.1142476 0.2536966 0.0868142 0.1769157 0.2035754
2020 -0.1548337 0.1506496 -0.0345976 0.1485108 -0.0559373
2021 0.1336765 0.2381720 -0.1517244 0.0479338 0.2039324
2022 0.0090590 -0.2162030 -0.1678938 -0.0983630 -0.1221858

Visualizing the monthly mean price of each index together with its monthly percentage returns. The yearly return of each index is also summarised in a table.

# Mean plot of the indices.
xts.mon_mean <- apply.monthly(xts.adjusted, function(x) apply(x, 2, mean))
xts.monthlyreturns <- data.frame()
xts.yearlyreturns <- data.frame()

for (i in ls(ENV.STOCK)){
  x <- get(i, envir = ENV.STOCK)
  x2 <- round(monthlyReturn(x)*100, 2)
  x3 <- coredata(round(yearlyReturn(x)*100, 2))
  if(ncol(xts.monthlyreturns) == 0){
    xts.monthlyreturns = x2
    xts.yearlyreturns = x3
  } else {
    xts.monthlyreturns = cbind(xts.monthlyreturns, x2)
    xts.yearlyreturns = cbind(xts.yearlyreturns, x3)
  }
  colnames(xts.monthlyreturns)[ncol(xts.monthlyreturns)] <- c(i)
  colnames(xts.yearlyreturns)[ncol(xts.yearlyreturns)] <- c(i)
  
  plt <- chart_Series(xts.mon_mean[,i], TA=paste0('add_TA(xts.monthlyreturns$',i,',type = "h", name = "Monthly mean return over the 5 years period")'),
                    name = paste0('Monthly mean for ',i,' over the 5 years period'))
  print(plt)
}

kable(xts.yearlyreturns, caption = 'Yearly returns of each index in percent.')
Yearly returns of each index in percent.
Year FTSE GSPC HSI N225 SPEUP
2018 -12.29 -7.81 -15.79 -14.10 -13.26
2019 12.10 28.88 9.07 19.35 22.58
2020 -14.34 16.26 -3.40 16.01 -5.44
2021 14.30 26.89 -14.08 4.91 22.62
2022 0.91 -19.44 -15.46 -9.37 -11.50

Volatility in markets and their respective daily returns

The volatility of a security is the standard deviation of its daily returns; it is used to measure the risk of the security. We calculate the daily historical volatility, a statistical measure of the dispersion of returns.
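The same idea can be implemented by hand as a rolling standard deviation of log returns; a minimal sketch (the 21-day window and the sqrt(252) annualization are illustrative assumptions, not the settings used in the chunk below):

# Hedged sketch: rolling annualized volatility from log returns.
ret <- diff(log(Ad(ENV.STOCK$GSPC)))[-1]
vol21 <- rollapply(ret, width = 21, FUN = sd) * sqrt(252) * 100
tail(vol21)  # 21-day rolling volatility, in percent per annum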

# Create volatility data-set and plot data.
xts.dailyvolatility <- data.frame()
for (i in ls(ENV.STOCK)){
  x <- get(i, envir = ENV.STOCK)
  x2 <- round(volatility(x, n = 3, calc = 'close'), 4)
  if(ncol(xts.dailyvolatility) == 0){
    xts.dailyvolatility = x2
  } else {
    xts.dailyvolatility <- cbind(xts.dailyvolatility, x2)
  }
  colnames(xts.dailyvolatility)[ncol(xts.dailyvolatility)] <- c(i)
}

par(mfrow=c(2,1))
lapply(X = xts.dailyvolatility, FUN = function(x) 
       plot.xts(x, main = paste0('Daily Volatility for ',colnames(x))))
## $FTSE
## 
## $GSPC

## 
## $HSI
## 
## $N225

## 
## $SPEUP

The volatility plots for each market can be seen above.

Calculating yearly volatility. The table below shows the annualized volatility of the indices by year, in percent.

# Create annual volatility data-set.
df.yearlyvolatility <- data.frame()
for(i in colnames(xts.logreturns)){
  x <- apply.yearly(xts.logreturns[,i], sd)
  if(ncol(df.yearlyvolatility) == 0){
    df.yearlyvolatility = x * sqrt(252) * 100
  } else {
    df.yearlyvolatility = cbind(df.yearlyvolatility, x * sqrt(252) * 100)
  }
  colnames(df.yearlyvolatility)[ncol(df.yearlyvolatility)] <- c(i)
}

annualvola <- sapply(xts.logreturns, FUN = sd)
vec.yearlyvolatility <- round(annualvola * sqrt(252) * 100, 2)
print(paste0('The average annual percentage-wise volatility for the index ',
             c('FTSE', 'GSPC', 'HSI', 'N225', 'SPEUP'), ' ', vec.yearlyvolatility, '%'))
## [1] "The average annual percentage-wise volatility for the index FTSE 17.67%" 
## [2] "The average annual percentage-wise volatility for the index GSPC 21.62%" 
## [3] "The average annual percentage-wise volatility for the index HSI 22.42%"  
## [4] "The average annual percentage-wise volatility for the index N225 19.24%" 
## [5] "The average annual percentage-wise volatility for the index SPEUP 17.69%"
kable(df.yearlyvolatility, caption = 'Yearly volatility by index from 2018 to 2022')
Yearly volatility by index from 2018 to 2022
Year FTSE GSPC HSI N225 SPEUP
2018 12.59850 16.85766 19.23865 18.35699 12.81583
2019 11.58218 12.29823 15.45603 13.19002 11.21460
2020 29.10246 34.14966 22.81332 24.95778 28.11037
2021 12.59168 12.90541 19.65196 18.07612 12.09630
2022 16.15612 23.76849 31.64136 19.78637 18.28945

Risk vs Return plot

# Risk - Return plot
df.logret <- data.frame(coredata(xts.annuallogreturns))
df.logret <- as.data.frame(melt(lapply(df.logret, na.omit)))
df.yvola <- melt(coredata(df.yearlyvolatility))

df.logret[ncol(df.logret)+1] <- df.yvola[,3]

ggplot(data = df.logret, aes(x = value, y = V3))+
  geom_line(aes(linetype = L1, colour = L1))+
  geom_text(aes(label = round(V3, 2)), vjust = -1, size = 3)+
  ggtitle('Risk-return plot of the global markets')+
  labs(x = 'Annual log return', y = 'Annual volatility (%)', color = 'Indices')

Value at Risk (VaR) and Expected Shortfall (ES) calculations are important for understanding the risk of an investment. They tell us the worst loss we can expect at a given confidence level, and the average loss we can expect in the cases beyond that threshold. We first check our returns for normality.

# Normality test and histogram plot.
df.stat <- data.frame()

for(i in colnames(xts.logreturns)){
    plt <-  ggplot(data = xts.logreturns, aes(x = xts.logreturns[,i])) +
            geom_histogram(aes(y = ..density..),
                                 binwidth = 0.01,
                                 color = "black",
                                 fill = 4) +
            geom_density(size = 1.0) + 
            xlab('Daily Log Returns') +
            ylab('Density')+
            ggtitle(paste0(i,' Index'),
                    subtitle = paste0('Mean = ',round(mean(xts.logreturns[,i]),6),
                                      ' Std. Deviation = ', round(sd(xts.logreturns[,i]),4),
                                      ' Skewness: ', round(skewness(xts.logreturns[,i]),4),
                                      ' Kurtosis: ', round(kurtosis(xts.logreturns[,i]),4)))
    
  print(plt)
  jbt <- jarque.test(as.vector(xts.logreturns[,i]))
  swt <- shapiro.test(as.vector(xts.logreturns[,i]))
  
  new.row <- data.frame(Index = i,
                        Mean = round(mean(xts.logreturns[,i]),6),
                        Std.Deviation = round(sd(xts.logreturns[,i]),4),
                        Skewness = round(skewness(xts.logreturns[,i]),4),
                        Kurtosis = round(kurtosis(xts.logreturns[,i]),4),
                        Jarque.Bera.Test = c(jbt$statistic),
                        JB.p.value = c(jbt$p.value),
                        Shapiro.Wilk.Test = c(swt$statistic),
                        SW.p.value = c(swt$p.value))
                        
  df.stat <- rbind(df.stat, new.row)
}

kable(df.stat, caption = 'Statistics of the five indices.')
Statistics of the five indices.
Index Mean Std.Deviation Skewness Kurtosis Jarque.Bera.Test JB.p.value Shapiro.Wilk.Test SW.p.value
JB FTSE -0.000025 0.0111 -1.1020 15.0518 12561.6694 0 0.8809021 0
JB1 GSPC 0.000264 0.0136 -0.7988 13.3745 9849.2251 0 0.8691023 0
JB2 HSI -0.000340 0.0141 0.1871 3.6812 744.6045 0 0.9564242 0
JB3 N225 0.000080 0.0121 -0.0813 3.8213 795.5233 0 0.9563523 0
JB4 SPEUP 0.000060 0.0111 -1.3701 16.7422 15621.8464 0 0.8762599 0

The distribution of the log returns deviates from normality: the histograms of all indices are leptokurtic and four of them are negatively skewed, meaning there are more extreme negative values than positive ones. The HSI is the only one with a roughly symmetric distribution. The non-normality is also supported by the Jarque-Bera and Shapiro-Wilk normality tests. Our indices are more volatile than a normal distribution would imply. As the data is asymmetric, we ruled out the normal distribution and will instead simulate directly from the empirical distribution, without making any assumption about its shape.

  • VaR - the alpha-quantile of the return distribution

  • ES - the average of all the returns to the left of the VaR

# Value at Risk and Expected Shortfall calculation.
df.VaR.ES <- data.frame()

for(i in colnames(xts.logreturns)){
    sample.emp <- sample(as.vector(xts.logreturns[,i]), 1000000, replace = TRUE)
    VaR.emp <- round(quantile(sample.emp, probs = 0.05), 6)
    ES.emp <- round(mean(sample.emp[sample.emp < VaR.emp]), 6)
    cat('Value at Risk for ',i,': ' , VaR.emp*100, '% and the Expected Shortfall: ',
        ES.emp*100, '%', fill = TRUE)
    pf <- 1000 # portfolio
    VaR.pf <- pf * (exp(VaR.emp) - 1)
    ES.pf <- pf * (exp(ES.emp) - 1)
    cat('Value at Risk for',i,': ' ,'in terms of $1000 investments is: ' ,
        VaR.pf, 'and the Expected Shortfall is: ', ES.pf, fill = TRUE)
    
     new.row <- data.frame(Index = i,
                           Value_at_Risk = VaR.emp,
                           Expected_Shortfall = ES.emp,
                           Value_at_Risk_Portfolio = VaR.pf,
                           Expected_Shortfall_Portfolio = ES.pf)
     df.VaR.ES <- rbind(df.VaR.ES, new.row)
  }
## Value at Risk for  FTSE :  -1.6966 % and the Expected Shortfall:  -2.8738 %
## Value at Risk for FTSE :  in terms of $1000 investments is:  -16.82289 
## and the Expected Shortfall is:  -28.32899
## Value at Risk for  GSPC :  -2.119 % and the Expected Shortfall:  -3.441 %
## Value at Risk for GSPC :  in terms of $1000 investments is:  -20.96707 
## and the Expected Shortfall is:  -33.82471
## Value at Risk for  HSI :  -2.3 % and the Expected Shortfall:  -3.2336 %
## Value at Risk for HSI :  in terms of $1000 investments is:  -22.73752 
## and the Expected Shortfall is:  -31.81878
## Value at Risk for  N225 :  -2.0354 % and the Expected Shortfall:  -2.8907 %
## Value at Risk for N225 :  in terms of $1000 investments is:  -20.14826 
## and the Expected Shortfall is:  -28.49319
## Value at Risk for  SPEUP :  -1.6879 % and the Expected Shortfall:  -2.8728 %
## Value at Risk for SPEUP :  in terms of $1000 investments is:  -16.73735 
## and the Expected Shortfall is:  -28.31927
kable(df.VaR.ES, caption = 'Value at Risk and Expected Shortfall for each index on average')
Value at Risk and Expected Shortfall for each index on average
Index Value_at_Risk Expected_Shortfall Value_at_Risk_Portfolio Expected_Shortfall_Portfolio
5% FTSE -0.016966 -0.028738 -16.82289 -28.32899
5%1 GSPC -0.021190 -0.034410 -20.96707 -33.82471
5%2 HSI -0.023000 -0.032336 -22.73752 -31.81878
5%3 N225 -0.020354 -0.028907 -20.14826 -28.49319
5%4 SPEUP -0.016879 -0.028728 -16.73735 -28.31927

The table shows the Value at Risk and Expected Shortfall, on average, for an invested $1000.

With 95% confidence we don’t expect to lose more than the Value at Risk on any given day, and in the worst 5% of cases we can expect a loss equal, on average, to the Expected Shortfall.

Correlations among global markets and their daily returns

The correlations between the pairs of daily returns of the five global markets are shown below.

# Correlation plots - heatmap.
xts.daily.cor <- rcorr(as.matrix(na.approx(xts.dailyreturns)))
xts.monthly.cor <- rcorr(as.matrix(xts.monthlyreturns))
xts.yearly.cor <- rcorr(as.matrix(xts.yearlyreturns))

ggcorrplot(xts.daily.cor$r, hc.order = TRUE, outline.col = "white", 
           colors = c("blue", "white", "red"), lab = TRUE)+
      labs(x = NULL, y = NULL, title="Daily correlation")

chart.Correlation(na.approx(xts.dailyreturns), histogram=TRUE, pch = '+',
                  method = c("pearson", "kendall", "spearman"))

ggcorrplot(xts.monthly.cor$r, hc.order = TRUE, outline.col = "white",
           colors = c("blue", "white", "red"), lab = TRUE)+
      labs(x = NULL, y = NULL, title="Monthly correlation")

chart.Correlation(na.approx(xts.monthlyreturns), histogram=TRUE, pch = '+',
                  method = c("pearson", "kendall", "spearman"))

ggcorrplot(xts.yearly.cor$r, hc.order = TRUE, outline.col = "white",
           colors = c("blue", "white", "red"), lab = TRUE)+
      labs(x = NULL, y = NULL, title="Annual correlation")

chart.Correlation(na.approx(xts.yearlyreturns), histogram=TRUE, pch = '+',
                  method = c("pearson", "kendall", "spearman"))

The Pearson’s correlation coefficient, also known as Pearson’s r, describes the linear relationship between two quantitative variables.

The assumptions the data must meet if we want to use Pearson’s r:

  • Both variables are on an interval or ratio level of measurement.
  • Data from both variables follow normal distributions. (We have concluded that our data is skewed and not normally distributed.)
  • There are no outliers.
  • The data come from a random or representative sample.
  • A linear relationship is expected between the two variables.
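As a quick check of the definition, Pearson’s r is the covariance of the two series scaled by the product of their standard deviations; a minimal sketch comparing it with cor() on the daily returns:

# Hedged sketch: Pearson's r computed from its definition.
xy <- na.omit(xts.dailyreturns[, c('FTSE', 'GSPC')])
x <- as.numeric(xy[, 1]); y <- as.numeric(xy[, 2])
r_manual <- cov(x, y) / (sd(x) * sd(y))
all.equal(r_manual, cor(x, y))  # TRUE: matches the built-in Pearson r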

The rolling correlation between markets over a chosen window is shown below for a few combinations of markets and time frames.

# Correlation plots.
xts.dailyClose <- do.call(merge, eapply(ENV.STOCK, Cl))
xts.dailyClose <- na.locf(xts.dailyClose)

for(i in colnames(xts.dailyClose)){
  a <- str_locate(i, '.Close')
  colnames(xts.dailyClose)[which(names(xts.dailyClose) == i)] <- str_sub(i,start = 1L, end = a[1,1]-1)
}

xts.dailyClose <- as.data.frame(xts.dailyClose)
xts.dailyClose <-  xts.dailyClose %>% 
                   mutate(Cor = rollapply(cbind(xts.dailyClose[,1], xts.dailyClose[,2]),
                                           width = as.numeric(60),
                                           FUN = function(z) cor(z[,1],z[,2]),
                                           by.column = FALSE,
                                           fill = NA))
      
ggplot(xts.dailyClose, aes(x = index(xts.dailyClose), y = xts.dailyClose$Cor, group = 1))+
      geom_line()+
      geom_area(aes(fill = xts.dailyClose$Cor))+
      ggtitle('Rolling correlation between GSPC and FTSE, 60-day window')+
      labs(x = 'Investigated period', y = 'Correlation positive/negative', fill = NULL)

xts.dailyClose <-  xts.dailyClose %>% 
                   mutate(Cor = rollapply(cbind(xts.dailyClose[,3], xts.dailyClose[,5]),
                                           width = as.numeric(90),
                                           FUN = function(z) cor(z[,1],z[,2]),
                                           by.column = FALSE,
                                           fill = NA))
      
ggplot(xts.dailyClose, aes(x = index(xts.dailyClose), y = xts.dailyClose$Cor, group = 1))+
      geom_line()+
      geom_area(aes(fill = xts.dailyClose$Cor))+
      ggtitle('Rolling correlation between N225 and HSI, 90-day window')+
      labs(x = 'Investigated period', y = 'Correlation positive/negative', fill = NULL)

xts.dailyClose <-  xts.dailyClose %>% 
                    mutate(Cor = rollapply(cbind(xts.dailyClose[,1], xts.dailyClose[,5]),
                                           width = as.numeric(60),
                                           FUN = function(z) cor(z[,1],z[,2]),
                                           by.column = FALSE,
                                           fill = NA))
      
ggplot(xts.dailyClose, aes(x = index(xts.dailyClose), y = xts.dailyClose$Cor, group = 1))+
      geom_line()+
      geom_area(aes(fill = xts.dailyClose$Cor))+
      ggtitle('Rolling correlation between GSPC and HSI, 60-day window')+
      labs(x = 'Investigated period', y = 'Correlation positive/negative', fill = NULL)

xts.dailyClose <-  xts.dailyClose %>% 
                    mutate(Cor = rollapply(cbind(xts.dailyClose[,4], xts.dailyClose[,2]),
                                           width = as.numeric(30),
                                           FUN = function(z) cor(z[,1],z[,2]),
                                           by.column = FALSE,
                                           fill = NA))
      
ggplot(xts.dailyClose, aes(x = index(xts.dailyClose), y = xts.dailyClose$Cor, group = 1))+
      geom_line()+
      geom_area(aes(fill = xts.dailyClose$Cor))+
      ggtitle('Rolling correlation between SPEUP and FTSE, 30-day window')+
      labs(x = 'Investigated period', y = 'Correlation positive/negative', fill = NULL)

Scatter plots are shown below for a few of the possible market pairs and their return correlations.

# Correlation plots - scatter.
df.dailyreturns <- as.data.frame(xts.dailyreturns)
      
ggplot(data = df.dailyreturns, aes(x = df.dailyreturns[,1], 
                                   y = df.dailyreturns[,2], 
                                   colour = df.dailyreturns[,1]))+
      geom_point()+
      labs(x = 'FTSE index', y = 'GSPC index') + 
      ggtitle('Correlation between FTSE and GSPC') +
      theme(plot.title = element_text(hjust = 0.5))+
      geom_smooth(method='loess',  linetype="dashed", 
                  color="red")+
      scale_colour_gradientn(colours = rainbow(3), 
                             name = paste('Colored by FTSE'))

ggplot(data = df.dailyreturns, aes(x = df.dailyreturns[,4], 
                                   y = df.dailyreturns[,3], 
                                   colour = df.dailyreturns[,4]))+
      geom_point()+
      labs(x = 'N225 index', y = 'HSI index') + 
      ggtitle('Correlation between N225 and HSI') +
      theme(plot.title = element_text(hjust = 0.5))+
      geom_smooth(method='loess',  linetype="dashed", 
                  color="red")+
      scale_colour_gradientn(colours = rainbow(3), 
                             name = paste('Colored by N225'))

ggplot(data = df.dailyreturns, aes(x = df.dailyreturns[,2], 
                                   y = df.dailyreturns[,5], 
                                   colour = df.dailyreturns[,2]))+
      geom_point()+
      labs(x = 'GSPC index', y = 'SPEUP index') + 
      ggtitle('Correlation between GSPC and SPEUP') +
      theme(plot.title = element_text(hjust = 0.5))+
      geom_smooth(method='loess',  linetype="dashed", 
                  color="red")+
      scale_colour_gradientn(colours = rainbow(3), 
                             name = paste('Colored by GSPC'))

ggplot(data = df.dailyreturns, aes(x = df.dailyreturns[,1], 
                                   y = df.dailyreturns[,5], 
                                   colour = df.dailyreturns[,1]))+
      geom_point()+
      labs(x = 'FTSE index', y = 'SPEUP index') + 
      ggtitle('Correlation between FTSE and SPEUP') +
      theme(plot.title = element_text(hjust = 0.5))+
      geom_smooth(method='loess',  linetype="dashed", 
                  color="red")+
      scale_colour_gradientn(colours = rainbow(3), 
                             name = paste('Colored by FTSE'))

Time taken to recover during pandemic

On 20 February 2020, stock markets across the world suddenly crashed amid growing instability due to the COVID-19 pandemic. The following plots show the period from the crash until each index recovered to its pre-pandemic level, together with the time taken and the drop in the value of the index.

# Recovery time calculations.
start.date = as.POSIXct('2020-02-20')
crash_value <- window(xts.adjusted, index = start.date)
kable(crash_value, caption = 'The values of the indices at the time of crash')
The values of the indices at the time of crash
FTSE GSPC HSI N225 SPEUP
7436.6 3373.23 27609.16 23479.15 1727.86
xts.adjusted.crash <- window(xts.adjusted, start = '2020-02-21')

for(i in colnames(xts.adjusted.crash)){
  x <- crash_value[,i]
    for(j in xts.adjusted.crash[,i]){
      if(j >= x){
        print(paste0('Stock ',i, ' crash value is ', x ,
                     ' and the recovery value ',j))
        print(paste0(' reached the same level at time ',
          index(xts.adjusted.crash[xts.adjusted.crash[,i] %in% j])))
        k <- index(xts.adjusted.crash[xts.adjusted.crash[,i] %in% j])
        print(paste0((k-index(crash_value)),
                     ' days elapsed between the crash and the recovery.'))
        plt <- plot.xts(xts.adjusted[,i],
                        subset = paste0('2020-02-20::',k),
                        main = paste0('Recovery time for ', i))
        break
      }
    }
  end.date = as.POSIXct(k)
  subdata <- window(xts.adjusted.crash[,i], start = start.date, end = end.date)
  print(paste0('The smallest market value of the crash was ',
               min(subdata), ' for the index ', i))
  print(paste0(', the Maximum Drawdown (MaxDD) was ', x-min(subdata) ,
               ' in price within the recovery period, that is '))
  print(paste0(round(100-((min(subdata)/x)*100), 2),'% drop in the price.'))
  print(plt)
}
## [1] "Stock FTSE crash value is 7436.60009765625 and the recovery value 7505.2001953125"
## [1] " reached the same level at time 2022-01-04"
## [1] "684 days has been ellapsed since the crash and recovery."
## [1] "The smallest market value of the crash was 4993.89990234375 for the index FTSE"
## [1] ", the Maximum Drowdrawn (MAXDD) was 2442.7001953125 in price within the recovery period, that is "
## [1] "32.85% drop in the price."

## [1] "Stock GSPC crash value is 3373.22998046875 and the recovery value 3380.35009765625"
## [1] " reached the same level at time 2020-08-12"
## [1] "174 days has been ellapsed since the crash and recovery."
## [1] "The smallest market value of the crash was 2237.39990234375 for the index GSPC"
## [1] ", the Maximum Drowdrawn (MAXDD) was 1135.830078125 in price within the recovery period, that is "
## [1] "33.67% drop in the price."

## [1] "Stock HSI crash value is 27609.16015625 and the recovery value 27649.859375"
## [1] " reached the same level at time 2021-01-05"
## [1] "320 days has been ellapsed since the crash and recovery."
## [1] "The smallest market value of the crash was 21696.130859375 for the index HSI"
## [1] ", the Maximum Drowdrawn (MAXDD) was 5913.029296875 in price within the recovery period, that is "
## [1] "21.42% drop in the price."

## [1] "Stock N225 crash value is 23479.150390625 and the recovery value 23559.30078125"
## [1] " reached the same level at time 2020-09-14"
## [1] "207 days has been ellapsed since the crash and recovery."
## [1] "The smallest market value of the crash was 16552.830078125 for the index N225"
## [1] ", the Maximum Drowdrawn (MAXDD) was 6926.3203125 in price within the recovery period, that is "
## [1] "29.5% drop in the price."

## [1] "Stock SPEUP crash value is 1727.85998535156 and the recovery value 1731.39001464844"
## [1] " reached the same level at time 2021-04-06"
## [1] "411 days has been ellapsed since the crash and recovery."
## [1] "The smallest market value of the crash was 1121.72998046875 for the index SPEUP"
## [1] ", the Maximum Drowdrawn (MAXDD) was 606.130004882812 in price within the recovery period, that is "
## [1] "35.08% drop in the price."

Predicting the open direction of the S&P 500 market using other global and market-specific indicators

Trading on any given day comes to an abrupt halt each afternoon when the markets close until the next day, leaving uncertainty between then and the next day’s open. While the financial markets have clearly stated business hours, developments outside trading hours continue to influence both valuations and investors’ actions. Geopolitical events and natural disasters can occur at any time. Company earnings announcements made after the market closes or before it opens can also influence the market’s direction. During January, April, July and October, the majority of firms release their quarterly results, which can change the direction of markets.

After-hours trading activity is a common indicator of the next day’s open; this extended-hours trading takes place on electronic communication networks (ECNs) before the financial markets open for the day, as well as after they close. Index futures, which trade virtually around the clock, can likewise indicate how the market is likely to trend at the start of the next day. Unlike the stock market, futures markets rarely close: when one market is closed for the day, international markets are open and trading. A good day in Asian markets can suggest that U.S. markets will open higher, while disastrous losses can lead to a lower open.

# Creating new environment for ML and downloading data.
ENV.ML <- new.env()
ticks.index <- c('^FTSE','^GSPC', '^HSI', '^SPEUP', '^N225')
sDate.index <- as.POSIXct('2018-01-04')
eDate.index <- as.POSIXct('2022-12-31')
readFinancialData(ticks.index, ENV.ML, sDate.index, eDate.index)
## [1] "FTSE"  "GSPC"  "HSI"   "SPEUP" "N225"
for (i in ls(ENV.ML)){
  x <- get(i, envir = ENV.ML)
  if (sum(is.na(x)) != 0){
    ENV.ML[[i]] <- na.approx(x)
  }
}

Preparing data for machine learning prediction.

# Assembling data-set for machine learning predictions.
df.GSPC <- ENV.ML$GSPC
ClToOp <- Cl(ENV.ML$GSPC) / Op(ENV.ML$GSPC)
HiToLow <- Hi(ENV.ML$GSPC)/Lo(ENV.ML$GSPC)
Vola <- round(volatility(ENV.ML$GSPC, n = 3, calc = 'close'), 4)
IndDir_N225 <- ifelse(dailyReturn(ENV.ML$N225) > 0,1,0)
IndDir_HSI <- ifelse(dailyReturn(ENV.ML$HSI) > 0,1,0)
IndDir_FTSE <- ifelse(((ENV.ML$FTSE$FTSE.Close - 
                        lag(ENV.ML$FTSE$FTSE.Close))/lag(ENV.ML$FTSE$FTSE.Close)) > 0,1,0)
IndDir_SPEUP <- ifelse(((ENV.ML$SPEUP$SPEUP.Close - 
                        lag(ENV.ML$SPEUP$SPEUP.Close))/lag(ENV.ML$SPEUP$SPEUP.Close)) > 0,1,0)
IndDir_GSPC <- ifelse(((ENV.ML$GSPC$GSPC.Close - 
                        lag(ENV.ML$GSPC$GSPC.Close))/lag(ENV.ML$GSPC$GSPC.Close)) > 0,1,0)

df.GSPC <- merge(df.GSPC, ClToOp, HiToLow, Vola, IndDir_GSPC,
                 IndDir_FTSE, IndDir_SPEUP, IndDir_HSI, IndDir_N225 )
df.GSPC <- na.locf(df.GSPC, fromLast = TRUE)
df.GSPC <- as.data.frame(df.GSPC)
colnames(df.GSPC)[7:14] <- c('GSPC.CloseToOpen','GSPC.HighToLow', 'GSPC.Volatility',
                             'GSPC.Direction', 'FTSE.Direction','SPEUP.Direction',
                             'HSI.Direction', 'N225.Direction')

df.GSPC <- df.GSPC %>% slice(-1)
df.GSPC <- na.omit(df.GSPC) 

index <- createDataPartition(df.GSPC$GSPC.Direction, p = 0.7, list = FALSE)
GSPC_train <- df.GSPC[index,]
GSPC_train <- na.locf(GSPC_train) 
GSPC_test <- df.GSPC[-index,]
GSPC_test <- na.locf(GSPC_test)

GSPC.Direction <- as.data.frame(GSPC_test$GSPC.Direction)
colnames(GSPC.Direction) <- c('GSPC_Direction') 
    
GSPC.formula <- as.formula(paste("GSPC.Direction ~ ",
                           paste(colnames(df.GSPC)[c(7, 8, 9, 11, 12, 13, 14)],
                           collapse = " + ")))

The directional plot for GSPC over the five-year period is shown below, together with the first few rows of the prepared data set.

# Directional plot of GSPC.
z <- as.data.frame(table(df.GSPC$GSPC.Direction))
z <- z %>% mutate(prop = (z[2] / length(df.GSPC$GSPC.Direction)) * 100)

ggplot(data = z, aes(x="", y = Freq, fill = Var1)) +
       geom_bar(width = 1, color = 'white', stat="identity")+
       coord_polar("y", start=0)+
       geom_text(aes(label = paste(round(prop$Freq, 2)," %")),
                 position = position_stack(vjust = 0.5))+
       ggtitle(paste0('Open/Close position for the index - GSPC' ))+
       labs(x = NULL, y = NULL, fill = NULL)+
       theme_classic() +
       theme(axis.line = element_blank(),
             axis.text = element_blank(),
             axis.ticks = element_blank())

head(df.GSPC)
##            GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
## 2018-01-05   2731.33   2743.45  2727.92    2743.15  3239280000       2743.15
## 2018-01-08   2742.67   2748.51  2737.60    2747.71  3246160000       2747.71
## 2018-01-09   2751.15   2759.14  2747.86    2751.29  3467460000       2751.29
## 2018-01-10   2745.55   2750.80  2736.06    2748.23  3579900000       2748.23
## 2018-01-11   2752.97   2767.56  2752.78    2767.56  3645690000       2767.56
## 2018-01-12   2770.18   2787.85  2769.64    2786.24  3587220000       2786.24
##            GSPC.CloseToOpen GSPC.HighToLow GSPC.Volatility GSPC.Direction
## 2018-01-05         1.004327       1.005693          0.0610              1
## 2018-01-08         1.001838       1.003985          0.0610              1
## 2018-01-09         1.000051       1.004105          0.0041              1
## 2018-01-10         1.000976       1.005387          0.0275              0
## 2018-01-11         1.005300       1.005369          0.0926              1
## 2018-01-12         1.005797       1.006575          0.0032              1
##            FTSE.Direction SPEUP.Direction HSI.Direction N225.Direction
## 2018-01-05              1               1             1              1
## 2018-01-08              0               1             1              1
## 2018-01-09              1               1             1              1
## 2018-01-10              1               0             1              0
## 2018-01-11              1               0             1              0
## 2018-01-12              1               1             1              0

Binary Logistic Regression Model

We use logistic regression to train a model and predict the likely opening direction of the market on the test set.

# Binary Logistic Regression models and predictions.
model_BLR <- glm(GSPC.formula, 
                 data = GSPC_train, 
                 family = binomial)
summary(model_BLR)
## 
## Call:
## glm(formula = GSPC.formula, family = binomial, data = GSPC_train)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -414.40498   40.79457 -10.158  < 2e-16 ***
## GSPC.CloseToOpen  489.16292   37.53878  13.031  < 2e-16 ***
## GSPC.HighToLow    -75.63567   26.60316  -2.843  0.00447 ** 
## GSPC.Volatility     0.89575    1.57693   0.568  0.57001    
## FTSE.Direction      0.75443    0.28896   2.611  0.00903 ** 
## SPEUP.Direction     1.23007    0.29438   4.179 2.93e-05 ***
## HSI.Direction       0.74467    0.23603   3.155  0.00161 ** 
## N225.Direction      0.01772    0.23211   0.076  0.93916    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1254.59  on 908  degrees of freedom
## Residual deviance:  520.94  on 901  degrees of freedom
## AIC: 536.94
## 
## Number of Fisher Scoring iterations: 7
model_BLR2 <- glm(GSPC.Direction ~ GSPC.CloseToOpen + SPEUP.Direction +
                                   FTSE.Direction + HSI.Direction + GSPC.HighToLow,
                 data = GSPC_train, 
                 family = binomial)
summary(model_BLR2)
## 
## Call:
## glm(formula = GSPC.Direction ~ GSPC.CloseToOpen + SPEUP.Direction + 
##     FTSE.Direction + HSI.Direction + GSPC.HighToLow, family = binomial, 
##     data = GSPC_train)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -423.2677    37.7020 -11.227  < 2e-16 ***
## GSPC.CloseToOpen  488.5194    37.4422  13.047  < 2e-16 ***
## SPEUP.Direction     1.2307     0.2902   4.241 2.23e-05 ***
## FTSE.Direction      0.7572     0.2886   2.623  0.00870 ** 
## HSI.Direction       0.7554     0.2312   3.268  0.00108 ** 
## GSPC.HighToLow    -66.1233    20.1250  -3.286  0.00102 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1254.59  on 908  degrees of freedom
## Residual deviance:  521.27  on 903  degrees of freedom
## AIC: 533.27
## 
## Number of Fisher Scoring iterations: 7
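The fitted coefficients above are on the log-odds scale; exponentiating them gives odds ratios, which are easier to interpret. A minimal sketch:

# Hedged sketch: read the logistic coefficients as odds ratios.
round(exp(coef(model_BLR2)), 3)
# e.g. an up-day in SPEUP multiplies the odds of a GSPC up-day by roughly
# exp(1.23) ~ 3.4, holding the other predictors fixed.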
# Model comparison with the intercept only model.
glm_intercept <- glm(GSPC.Direction~1, 
                     data = GSPC_train, 
                     family = binomial)

anova(glm_intercept, model_BLR2, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: GSPC.Direction ~ 1
## Model 2: GSPC.Direction ~ GSPC.CloseToOpen + SPEUP.Direction + FTSE.Direction + 
##     HSI.Direction + GSPC.HighToLow
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       908    1254.59                          
## 2       903     521.27  5   733.32 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Prediction
GSPC.Direction$BLR.Pred <- ifelse(predict(model_BLR2,
                                          newdata = GSPC_test,
                                          type = 'response') > 0.6, 1, 0)

GSPC.Direction <- GSPC.Direction %>% 
                  mutate_at(c('GSPC_Direction','BLR.Pred'), as.factor)
confusionMatrix(GSPC.Direction$GSPC_Direction,
                GSPC.Direction$BLR.Pred,
                positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 174  16
##          1  25 174
##                                           
##                Accuracy : 0.8946          
##                  95% CI : (0.8597, 0.9233)
##     No Information Rate : 0.5116          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7893          
##                                           
##  Mcnemar's Test P-Value : 0.2115          
##                                           
##             Sensitivity : 0.9158          
##             Specificity : 0.8744          
##          Pos Pred Value : 0.8744          
##          Neg Pred Value : 0.9158          
##              Prevalence : 0.4884          
##          Detection Rate : 0.4473          
##    Detection Prevalence : 0.5116          
##       Balanced Accuracy : 0.8951          
##                                           
##        'Positive' Class : 1               
## 
print('Multicollinearity between the variables:')
## [1] "Multicollinearity between the variables:"
vif(model_BLR2)
## GSPC.CloseToOpen  SPEUP.Direction   FTSE.Direction    HSI.Direction 
##         1.263266         1.728906         1.716132         1.100891 
##   GSPC.HighToLow 
##         1.090890
check_model(model_BLR2)

In our first model some of the variables are not significant, hence the second, reduced model. Multicollinearity is not present between the variables, and the model has achieved about 89% accuracy on the test set.
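The class predictions above use a 0.6 probability cutoff rather than the conventional 0.5; a minimal sketch scanning a few cutoffs on the test set (the threshold grid is an illustrative assumption):

# Hedged sketch: test-set accuracy of model_BLR2 across probability cutoffs.
probs <- predict(model_BLR2, newdata = GSPC_test, type = 'response')
for (thr in c(0.4, 0.5, 0.6, 0.7)) {
  acc <- mean(ifelse(probs > thr, 1, 0) == GSPC_test$GSPC.Direction)
  print(paste0('Cutoff ', thr, ': accuracy ', round(acc, 4)))
}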

Decision Tree Model

# Decision Tree models and predictions.
model_DT <- rpart(GSPC.formula,
                  data = GSPC_train, 
                  method = 'class',
                  parms = list(split='information'))

model_DT2 <- rpart(GSPC.Direction ~ GSPC.CloseToOpen + SPEUP.Direction + 
                                    FTSE.Direction + HSI.Direction + GSPC.HighToLow,
                   data = GSPC_train, 
                  method = 'class',
                  parms = list(split='information'))

plot(model_DT)
text(model_DT, splits = T, use.n = T, all = T, cex = 0.75)

importances <- varImp(model_DT)
importances %>% arrange(desc(Overall))
##                      Overall
## GSPC.CloseToOpen 329.1631963
## SPEUP.Direction  109.2568007
## FTSE.Direction    74.0931034
## GSPC.HighToLow    43.0717581
## GSPC.Volatility   20.7836966
## HSI.Direction     16.6512750
## N225.Direction     0.8051349
GSPC.Direction$DT.Pred <- predict(model_DT, newdata = GSPC_test, type = 'class')
GSPC.Direction$DT2.Pred <- predict(model_DT2, newdata = GSPC_test, type = 'class')

confusionMatrix(GSPC.Direction$GSPC_Direction, GSPC.Direction$DT.Pred, positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 163  27
##          1  24 175
##                                           
##                Accuracy : 0.8689          
##                  95% CI : (0.8312, 0.9008)
##     No Information Rate : 0.5193          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7376          
##                                           
##  Mcnemar's Test P-Value : 0.7794          
##                                           
##             Sensitivity : 0.8663          
##             Specificity : 0.8717          
##          Pos Pred Value : 0.8794          
##          Neg Pred Value : 0.8579          
##              Prevalence : 0.5193          
##          Detection Rate : 0.4499          
##    Detection Prevalence : 0.5116          
##       Balanced Accuracy : 0.8690          
##                                           
##        'Positive' Class : 1               
## 
confusionMatrix(GSPC.Direction$GSPC_Direction, GSPC.Direction$DT2.Pred, positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 163  27
##          1  24 175
##                                           
##                Accuracy : 0.8689          
##                  95% CI : (0.8312, 0.9008)
##     No Information Rate : 0.5193          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7376          
##                                           
##  Mcnemar's Test P-Value : 0.7794          
##                                           
##             Sensitivity : 0.8663          
##             Specificity : 0.8717          
##          Pos Pred Value : 0.8794          
##          Neg Pred Value : 0.8579          
##              Prevalence : 0.5193          
##          Detection Rate : 0.4499          
##    Detection Prevalence : 0.5116          
##       Balanced Accuracy : 0.8690          
##                                           
##        'Positive' Class : 1               
## 

The decision tree model reaches about 87% accuracy with both formula specifications.

Random Forest Model

# Random Forest models and predictions.
set.seed(2)
model_RF <- randomForest(GSPC.formula,
                         data = GSPC_train,
                         mtry = 2,
                         ntree = 150,
                         importance = TRUE,
                         cutoff = c(0.6,0.4))

set.seed(5)
model_RF2 <- randomForest(GSPC.Direction ~ GSPC.CloseToOpen + SPEUP.Direction + 
                                           FTSE.Direction + HSI.Direction + GSPC.HighToLow,
                         data = GSPC_train,
                         mtry = 4,
                         ntree = 500,
                         importance = TRUE,
                         cutoff = c(0.6,0.4))

print(model_RF)
## 
## Call:
##  randomForest(formula = GSPC.formula, data = GSPC_train, mtry = 2,      ntree = 150, importance = TRUE, cutoff = c(0.6, 0.4)) 
##                Type of random forest: regression
##                      Number of trees: 150
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 0.09135514
##                     % Var explained: 63.23
plot(model_RF)

print(model_RF2)
## 
## Call:
##  randomForest(formula = GSPC.Direction ~ GSPC.CloseToOpen + SPEUP.Direction +      FTSE.Direction + HSI.Direction + GSPC.HighToLow, data = GSPC_train,      mtry = 4, ntree = 500, importance = TRUE, cutoff = c(0.6,          0.4)) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 0.09060435
##                     % Var explained: 63.54
plot(model_RF2)

GSPC.Direction$RF.Pred <- ifelse(predict(model_RF, GSPC_test, type = 'class') > 0.6, 1, 0)
GSPC.Direction$RF2.Pred <- ifelse(predict(model_RF2, GSPC_test, type = 'class') > 0.6, 1, 0)

GSPC.Direction$RF.Pred <- as.factor(GSPC.Direction$RF.Pred)
GSPC.Direction$RF2.Pred <- as.factor(GSPC.Direction$RF2.Pred)

confusionMatrix(GSPC.Direction$GSPC_Direction, GSPC.Direction$RF.Pred, positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 174  16
##          1  32 167
##                                           
##                Accuracy : 0.8766          
##                  95% CI : (0.8397, 0.9076)
##     No Information Rate : 0.5296          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.7536          
##                                           
##  Mcnemar's Test P-Value : 0.03038         
##                                           
##             Sensitivity : 0.9126          
##             Specificity : 0.8447          
##          Pos Pred Value : 0.8392          
##          Neg Pred Value : 0.9158          
##              Prevalence : 0.4704          
##          Detection Rate : 0.4293          
##    Detection Prevalence : 0.5116          
##       Balanced Accuracy : 0.8786          
##                                           
##        'Positive' Class : 1               
## 
confusionMatrix(GSPC.Direction$GSPC_Direction, GSPC.Direction$RF2.Pred, positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 174  16
##          1  37 162
##                                           
##                Accuracy : 0.8638          
##                  95% CI : (0.8256, 0.8962)
##     No Information Rate : 0.5424          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.728           
##                                           
##  Mcnemar's Test P-Value : 0.00601         
##                                           
##             Sensitivity : 0.9101          
##             Specificity : 0.8246          
##          Pos Pred Value : 0.8141          
##          Neg Pred Value : 0.9158          
##              Prevalence : 0.4576          
##          Detection Rate : 0.4165          
##    Detection Prevalence : 0.5116          
##       Balanced Accuracy : 0.8674          
##                                           
##        'Positive' Class : 1               
## 

The random forest models achieved over 86% accuracy in both cases after thresholding their predictions at 0.6; increasing the number of trees and the number of variables tried at each split did not improve the test accuracy further here.
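Note that the forests above were fitted in regression mode because GSPC.Direction is numeric, so the cutoff argument plays no role there. Recoding the response as a factor yields a classification forest with class votes and an out-of-bag error estimate; a minimal sketch (the seed and hyper-parameters are assumptions):

# Hedged sketch: the same model refit as a classification forest.
set.seed(2)
model_RF_cls <- randomForest(as.factor(GSPC.Direction) ~ GSPC.CloseToOpen +
                               SPEUP.Direction + FTSE.Direction +
                               HSI.Direction + GSPC.HighToLow,
                             data = GSPC_train,
                             mtry = 2, ntree = 150, cutoff = c(0.6, 0.4))
print(model_RF_cls)  # reports the OOB error rate and a confusion matrix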

Artificial Neural Net Model

# Artificial Neural Net models and predictions.
# Min-max normalization: rescale a numeric vector to the [0, 1] range.
normalize <- function(x){
  return((x - min(x)) / (max(x) - min(x)))
}
GSPC_train$GSPC.CloseToOpen <- normalize(GSPC_train$GSPC.CloseToOpen)
GSPC_train$GSPC.HighToLow <- normalize(GSPC_train$GSPC.HighToLow)
GSPC_train$GSPC.Volatility <- normalize(GSPC_train$GSPC.Volatility)

GSPC_test$GSPC.CloseToOpen <- normalize(GSPC_test$GSPC.CloseToOpen)
GSPC_test$GSPC.HighToLow <- normalize(GSPC_test$GSPC.HighToLow)
GSPC_test$GSPC.Volatility <- normalize(GSPC_test$GSPC.Volatility)
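# A possible refinement (an assumption, not the original approach): instead of
# normalizing each set with its own min/max as above, the raw training-set
# min/max could be reused for the test set so both sets share one scale:
normalize_with <- function(x, ref){
  return((x - min(ref)) / (max(ref) - min(ref)))
}
# e.g., applied to the raw (pre-normalization) columns:
# GSPC_test$GSPC.CloseToOpen <-
#   normalize_with(GSPC_test$GSPC.CloseToOpen, GSPC_train$GSPC.CloseToOpen)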

GSPC_train <- GSPC_train %>% mutate_at(c('GSPC.Direction', 'FTSE.Direction',
                                         'N225.Direction', 'SPEUP.Direction',
                                         'HSI.Direction'), as.numeric)
GSPC_test<- GSPC_test %>% mutate_at(c('GSPC.Direction', 'FTSE.Direction',
                                      'N225.Direction', 'SPEUP.Direction',
                                      'HSI.Direction'), as.numeric)

model_ANN <- neuralnet(GSPC.formula,
                       data = GSPC_train,
                       hidden = c(4,2),
                       linear.output = F,
                       threshold = 0.05,
                       rep = 2,
                       algorithm = "rprop+")

model_ANN2 <- neuralnet(GSPC.Direction ~ GSPC.CloseToOpen + SPEUP.Direction +
                        FTSE.Direction + HSI.Direction + GSPC.HighToLow,
                       data = GSPC_train,
                       hidden = c(3,4),
                       linear.output = F,
                       threshold = 0.05,
                       rep = 4)

plot(model_ANN, rep = 'best')

plot(model_ANN2, rep = 'best')

GSPC.Direction$ANN.Pred <- ifelse(predict(model_ANN, GSPC_test) > 0.6, 1, 0)
GSPC.Direction$ANN2.Pred <- ifelse(predict(model_ANN2, GSPC_test) > 0.6, 1, 0)

GSPC.Direction$ANN.Pred <- as.factor(GSPC.Direction$ANN.Pred)
GSPC.Direction$ANN2.Pred <- as.factor(GSPC.Direction$ANN2.Pred)

# Align the factor levels with the reference directions so that
# confusionMatrix compares '0'/'1' labels in both factors.
levels(GSPC.Direction$ANN.Pred) <- levels(GSPC.Direction$GSPC_Direction)
levels(GSPC.Direction$ANN2.Pred) <- levels(GSPC.Direction$GSPC_Direction)
confusionMatrix(GSPC.Direction$GSPC_Direction, GSPC.Direction$ANN.Pred, positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 144  46
##          1   9 190
##                                         
##                Accuracy : 0.8586        
##                  95% CI : (0.82, 0.8917)
##     No Information Rate : 0.6067        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.7158        
##                                         
##  Mcnemar's Test P-Value : 1.208e-06     
##                                         
##             Sensitivity : 0.8051        
##             Specificity : 0.9412        
##          Pos Pred Value : 0.9548        
##          Neg Pred Value : 0.7579        
##              Prevalence : 0.6067        
##          Detection Rate : 0.4884        
##    Detection Prevalence : 0.5116        
##       Balanced Accuracy : 0.8731        
##                                         
##        'Positive' Class : 1             
## 
confusionMatrix(GSPC.Direction$GSPC_Direction, GSPC.Direction$ANN2.Pred, positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 147  43
##          1   8 191
##                                           
##                Accuracy : 0.8689          
##                  95% CI : (0.8312, 0.9008)
##     No Information Rate : 0.6015          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7366          
##                                           
##  Mcnemar's Test P-Value : 1.927e-06       
##                                           
##             Sensitivity : 0.8162          
##             Specificity : 0.9484          
##          Pos Pred Value : 0.9598          
##          Neg Pred Value : 0.7737          
##              Prevalence : 0.6015          
##          Detection Rate : 0.4910          
##    Detection Prevalence : 0.5116          
##       Balanced Accuracy : 0.8823          
##                                           
##        'Positive' Class : 1               
## 

Both neural network models reach a prediction accuracy above 85% (85.9% and 86.9%, respectively).
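
As a quick illustrative check (not part of the original analysis), the 0.6 cut-off can be compared with other thresholds on the raw outputs of model_ANN2:

# Accuracy of model_ANN2 at different cut-offs, against the reference
# directions used in the confusion matrices above.
raw <- predict(model_ANN2, GSPC_test)
truth <- as.numeric(as.character(GSPC.Direction$GSPC_Direction))
sapply(c(0.4, 0.5, 0.6, 0.7), function(th) mean(ifelse(raw > th, 1, 0) == truth))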

Support Vector Machines Model

# Support Vector Machines models and predictions.
GSPC_train$GSPC.Direction <- as.factor(GSPC_train$GSPC.Direction)
GSPC_test$GSPC.Direction <- as.factor(GSPC_test$GSPC.Direction)

model_SVM <- svm(GSPC.formula,
                      data = GSPC_train,
                      type = 'C-classification',
                      probability = TRUE,
                      kernel = 'linear',
                      cost = 10,
                      scale = FALSE
                      )

model_SVM2 <- svm(GSPC.Direction ~ GSPC.CloseToOpen + SPEUP.Direction +
                                   FTSE.Direction + HSI.Direction + GSPC.HighToLow,
                      data = GSPC_train,
                      type = 'nu-classification',
                      probability = TRUE,
                      kernel = 'polynomial',
                      cost = 10,
                      scale = FALSE
                      )

summary(model_SVM)
## 
## Call:
## svm(formula = GSPC.formula, data = GSPC_train, type = "C-classification", 
##     probability = TRUE, kernel = "linear", cost = 10, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  364
## 
##  ( 183 181 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
summary(model_SVM2)
## 
## Call:
## svm(formula = GSPC.Direction ~ GSPC.CloseToOpen + SPEUP.Direction + 
##     FTSE.Direction + HSI.Direction + GSPC.HighToLow, data = GSPC_train, 
##     type = "nu-classification", probability = TRUE, kernel = "polynomial", 
##     cost = 10, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  nu-classification 
##  SVM-Kernel:  polynomial 
##      degree:  3 
##       gamma:  0.2 
##      coef.0:  0 
##          nu:  0.5 
## 
## Number of Support Vectors:  463
## 
##  ( 232 231 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
GSPC.Direction$SVM.Pred <- predict(model_SVM, GSPC_test, probability = TRUE)
GSPC.Direction$SVM2.Pred <- predict(model_SVM2, GSPC_test, probability = TRUE)

confusionMatrix(GSPC.Direction$GSPC_Direction, GSPC.Direction$SVM.Pred, positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 147  43
##          1   7 192
##                                           
##                Accuracy : 0.8715          
##                  95% CI : (0.8341, 0.9031)
##     No Information Rate : 0.6041          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7417          
##                                           
##  Mcnemar's Test P-Value : 7.431e-07       
##                                           
##             Sensitivity : 0.8170          
##             Specificity : 0.9545          
##          Pos Pred Value : 0.9648          
##          Neg Pred Value : 0.7737          
##              Prevalence : 0.6041          
##          Detection Rate : 0.4936          
##    Detection Prevalence : 0.5116          
##       Balanced Accuracy : 0.8858          
##                                           
##        'Positive' Class : 1               
## 
confusionMatrix(GSPC.Direction$GSPC_Direction, GSPC.Direction$SVM2.Pred, positive = '1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 158  32
##          1  23 176
##                                         
##                Accuracy : 0.8586        
##                  95% CI : (0.82, 0.8917)
##     No Information Rate : 0.5347        
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.7168        
##                                         
##  Mcnemar's Test P-Value : 0.2807        
##                                         
##             Sensitivity : 0.8462        
##             Specificity : 0.8729        
##          Pos Pred Value : 0.8844        
##          Neg Pred Value : 0.8316        
##              Prevalence : 0.5347        
##          Detection Rate : 0.4524        
##    Detection Prevalence : 0.5116        
##       Balanced Accuracy : 0.8595        
##                                         
##        'Positive' Class : 1             
## 

Both support vector machine models also predict with better than 85% accuracy (87.2% and 85.9%, respectively).

All of the models tested above achieve an overall accuracy in the 80-90% range.
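
A compact way to compare them (a sketch, assuming the prediction columns stored above) is to tabulate each model's accuracy against the reference direction:

# Accuracy of every stored prediction column against the reference.
acc <- function(pred) mean(pred == GSPC.Direction$GSPC_Direction)
sapply(GSPC.Direction[, c('RF.Pred', 'RF2.Pred', 'ANN.Pred', 'ANN2.Pred',
                          'SVM.Pred', 'SVM2.Pred')], acc)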

Pair Trading in the Technology - Semiconductor industry - Identifying co-integrated pairs and time to entry/exit

The underlying logic of pairs trading is to monitor the movements of co-integrated stocks and to look for trading opportunities when a divergence appears. Under the mean-reversion assumption, each stock price tends to move back towards the long-term equilibrium, so the spread between two co-integrated stock prices should eventually converge. This lets us define trading signals for when to take a short or a long position.

We investigate stocks in the Technology - Semiconductor industry that are listed on the Nasdaq in the US. All of the stocks investigated have been traded on the exchange throughout the analysed period.

# Creating environment for pair-trading and downloading data.
ENV.PAIR <- new.env()

ticks.pair <- c('AMD', 'AOSL', 'AMBA', 'AMKR', 'ASYS', 'ADI', 'AAOI',
                'ATOM', 'AXTI', 'BEEM', 'AVGO', 'CEVA','IMOS', 'CRUS',
                'CCMP', 'DIOD', 'EMKR', 'ENTG', 'FSLR', 'FORM', 'GSIT',
                'HIMX', 'ICHR', 'INTC','IPGP', 'ISUN', 'KLIC', 'LSCC',
                'MTSI', 'MRVL', 'MMAT', 'MCHP', 'MU', 'MPWR', 'MOSY',
                'LASR','NVEC', 'NVDA', 'NXPI', 'OIIM', 'ON', 'PLAB',
                'PXLW', 'POWI', 'QUIK', 'RMBS', 'RESN', 'RBCN','LEDS',
                'SMTC', 'SLAB', 'SIMO', 'SWKS', 'SGH', 'SEDG', 'SPI',
                'SPWR', 'SUNW', 'SPCB', 'TXN','TSEM', 'UCTT', 'OLED',
                'VECO', 'XLNX')
sDate.pair <- as.POSIXct('2019-01-01')
eDate.pair <- as.POSIXct('2022-12-31')

readFinancialData(ticks.pair, ENV.PAIR, sDate.pair, eDate.pair)
##  [1] "AMD"  "AOSL" "AMBA" "AMKR" "ASYS" "ADI"  "AAOI" "ATOM" "AXTI" "BEEM"
## [11] "AVGO" "CEVA" "IMOS" "CRUS" "DIOD" "EMKR" "ENTG" "FSLR" "FORM" "GSIT"
## [21] "HIMX" "ICHR" "INTC" "IPGP" "ISUN" "KLIC" "LSCC" "MTSI" "MRVL" "MMAT"
## [31] "MCHP" "MU"   "MPWR" "LASR" "NVEC" "NVDA" "NXPI" "ON"   "PLAB" "PXLW"
## [41] "POWI" "QUIK" "RMBS" "RBCN" "LEDS" "SMTC" "SLAB" "SIMO" "SWKS" "SGH" 
## [51] "SEDG" "SPI"  "SPWR" "SUNW" "SPCB" "TXN"  "TSEM" "UCTT" "OLED" "VECO"

Engle-Granger Test - Creating trading pairs and identifying co-integration

Co-integration describes a long-term relationship between prices. "Integration" refers to a time series integrated of order d, denoted I(d). An I(1) series (for example a random walk) becomes stationary after differencing once, while an I(0) series is already stationary: its mean and variance are finite and do not change over time. This time-invariant property means that if the series moves too far from its mean, it is dragged back towards it.
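
A quick simulated illustration (synthetic data, not part of the original analysis): a random walk is I(1), while its first difference is I(0).

# Simulated example of integration orders using the ADF test from aTSA.
set.seed(1)
rw <- cumsum(rnorm(500))  # random walk: contains a unit root, so I(1)
aTSA::adf.test(rw)        # fails to reject the unit root
aTSA::adf.test(diff(rw))  # rejects it: the differenced series is I(0)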

By definition, x_t and y_t are co-integrated if both are I(1) series and there exists a beta such that z_t = x_t - beta*y_t is an I(0) series. This makes it possible to construct a stationary time series from two asset prices.

Financial price data are non-stationary time series. For two stocks, X and Y, we run a linear regression of one on the other and check whether the residual is stationary using the Augmented Dickey-Fuller (ADF) test. If the residual is stationary, the two asset prices are co-integrated, and the co-integration coefficient is obtained as the coefficient of the regressor.
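
The same two-step procedure can be sketched by hand (x and y below are placeholders for two log-price series; the allpairs.egcm call further down performs an equivalent, more thorough test across all pairs):

# Engle-Granger two-step sketch: regress, then ADF-test the residuals.
eg_step <- function(x, y){
  fit <- lm(y ~ x)                 # step 1: co-integrating regression
  aTSA::adf.test(residuals(fit))   # step 2: test residual stationarity
  unname(coef(fit)['x'])           # the co-integration coefficient beta
}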

We take the closing prices of the stocks and perform the Engle-Granger test on each possible pair. The test applies the logic explained above and returns the test statistics together with a flag for whether the pair is co-integrated. We collect this information in a data frame and keep only the co-integrated pairs.

# Creating data-set for identification and performing test.
xts.close.pair <- do.call(merge, eapply(ENV.PAIR, Cl))

for(i in colnames(xts.close.pair)){
      if (sum(is.na(xts.close.pair[,i])) != 0){
        xts.close.pair[,i] <- na.approx(xts.close.pair[,i])
      }
  
      a <- str_locate(i, '.Close')
      colnames(xts.close.pair)[which(names(xts.close.pair) == i)] <- 
                              str_sub(i,start = 1L, end = a[1,1]-1)
}

xts.close.pair <- log(xts.close.pair)

df.allpairs <- allpairs.egcm(xts.close.pair)
df.cointegrated <- df.allpairs[df.allpairs$is.cointegrated == TRUE,]
# The first few co-integrated trading pairs based on the Engle-Granger test:
head(df.cointegrated[,c(1,2,6:24)])
##    series1 series2       alpha   alpha.se      beta     beta.se       rho
## 8     SPWR     TXN  4.32859378 0.01250565 0.2647162 0.003076056 0.9706755
## 41     TXN    NXPI -2.11067648 0.04864136 1.4159487 0.009550878 0.9636577
## 50     TXN    SLAB -0.09904814 0.07419998 0.9798712 0.014702258 0.9848352
## 55    NXPI    SLAB  1.34312784 0.04485927 0.6957459 0.008782136 0.9766837
## 83     TXN    DIOD -3.70563377 0.08008459 1.5635106 0.015869936 0.9850454
## 88    NXPI    DIOD -1.32001179 0.04867286 1.0931323 0.009554018 0.9801524
##      rho.raw      rho.se s1.i1.stat   s1.i1.p s2.i1.stat   s2.i1.p    r.stat
## 8  0.9643583 0.008236718  -5.416253 0.8076100  -9.543126 0.5773955 -28.51022
## 41 0.9573727 0.009150194  -9.543126 0.5773955  -8.805691 0.6185328 -40.30644
## 50 0.9784531 0.006508047  -9.543126 0.5773955 -14.641941 0.2929620 -21.82717
## 55 0.9703390 0.007634068  -8.805691 0.6185328 -14.641941 0.2929620 -26.94719
## 83 0.9786623 0.006472054  -9.543126 0.5773955 -11.363305 0.4758582 -22.85934
## 88 0.9737918 0.007247409  -8.805691 0.6185328 -11.363305 0.4758582 -23.26024
##            r.p eps.ljungbox.stat eps.ljungbox.p     s1.dsd     s2.dsd
## 8  0.011508422        29.1185209   6.808301e-08 0.05036979 0.02067641
## 41 0.004883774         0.3647013   5.459068e-01 0.02067641 0.02897405
## 50 0.043856432         2.7094609   9.975470e-02 0.02067641 0.03046120
## 55 0.016746889         3.6461270   5.619976e-02 0.02897405 0.03046120
## 83 0.036528230         5.0459231   2.468395e-02 0.02067641 0.02831944
## 88 0.033681954        12.6956614   3.665049e-04 0.02897405 0.02831944
##    residuals.sd     eps.sd
## 8    0.07304671 0.01909775
## 41   0.06398656 0.01857616
## 50   0.09849848 0.02035175
## 55   0.08519448 0.02064295
## 83   0.10632140 0.02184409
## 88   0.09268242 0.02130632

We choose a pair from our co-integrated data frame, AAOI - GSIT, and perform the analysis on this pair.

The plot shows the log closing prices of the selected pair.

# Pair time series plot.
chart_Series(xts.close.pair$AAOI, col = 'blue', type='line',
                 TA=c('add_TA(xts.close.pair$GSIT, col = "red", on = 1)'),
                 name = 'AAOI and GSIT, log plot over the investigated period')

The co-integrated pair plot shows the price series together with the residuals and innovations; the test summary follows.

# Test plot and summary (the egcm object is fitted once and reused).
obj.pair <- egcm(xts.close.pair$AAOI, xts.close.pair$GSIT)
plot(obj.pair)

summary(obj.pair)
## GSIT[i] =   0.5478 AAOI[i] +   0.6746 + R[i], R[i] =   0.9841 R[i-1] + eps[i], eps ~ N(0,  0.0376^2)
##            (0.0092)           (0.0266)                (0.0065)
## 
## R[2022-12-30] = -0.4752 (t = -2.612)
## 
## Unit Root Tests of Residuals
##                                                     Statistic    p-value
##   Augmented Dickey Fuller (ADF)                        -2.477    0.29305
##   Phillips-Perron (PP)                                -22.554    0.03869
##   Pantula, Gonzales-Farias and Fuller (PGFF)            0.985    0.11902
##   Elliott, Rothenberg and Stock DF-GLS (ERSD)          -1.090    0.49243
##   Johansen's Trace Test (JOT)                         -17.842    0.10423
##   Schmidt and Phillips Rho (SPR)                      -19.128    0.14352
## 
## Variances
##   SD(diff(AAOI))       =   0.046607
##   SD(diff(GSIT))       =   0.033410
##   SD(diff(residuals))  =   0.037753
##   SD(residuals)        =   0.181909
##   SD(innovations)      =   0.037554
## 
## Half life       =  43.262316
## R[last]         =  -0.475215 (t=-2.61)
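
The reported half-life follows directly from the AR(1) coefficient of the residual series; as a quick check:

# Half-life of mean reversion implied by rho = 0.9841 from the summary:
rho <- 0.9841
log(0.5) / log(rho)  # ~43.2 trading days, matching the egcm output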

Normality and partial autocorrelation plots of the residuals, together with a scatter plot of the selected pair.

# Normality, pacf and scatter plot.
df.residuals <- as.data.frame(obj.pair$residuals)
colnames(df.residuals) <- c('residuals')
qqnorm(df.residuals$residuals, col = 'red')
qqline(df.residuals$residuals, distribution = qnorm, col = 'blue')

pacf(df.residuals$residuals)

xts.pairScatter <- do.call(merge, eapply(ENV.PAIR, Cl))
    
for(i in colnames(xts.pairScatter)){
  a <- str_locate(i, '.Close')
  colnames(xts.pairScatter)[which(names(xts.pairScatter) == i)] <- 
    str_sub(i,start = 1L, end = a[1,1]-1)
}

df.pairScatter <- cbind(xts.pairScatter$AAOI, xts.pairScatter$GSIT)
df.pairScatter <- as.data.frame(df.pairScatter) 

ggplot(data = df.pairScatter, aes(x = AAOI, y = GSIT, color = AAOI))+
  geom_point()+
  geom_smooth(method = "lm", se = FALSE)+
  ggtitle('Scatter plot of the co-integrated pair with the best-fit regression line',
          subtitle = paste0('Standard deviation of the residuals: ', sd(df.residuals$residuals)))+
  labs(x = 'AAOI', y = 'GSIT')+
  scale_colour_gradientn(colours = rainbow(4), name = 'Colored by AAOI')

Generating the data for the signal points.

# Creating the signal-points data-set for the chosen pair.
# A 'Short' or 'Long' signal is generated when the residual first crosses
# +/- 1 standard deviation; a 'Sell' (exit) signal once it reverts back
# towards the mean, past 10% of a standard deviation.
df.signalPoints <- data.frame()
counter_up <- 0
counter_down <- 0
sd.residuals <- sd(df.residuals$residuals)
percentsd <- sd.residuals * 0.1

for(idx in seq_along(df.residuals$residuals)){
  i <- df.residuals$residuals[idx]
  if(i > sd.residuals & counter_up == 0){
    counter_up <- 1
    df.signalPoints <- rbind(df.signalPoints, list(idx, i, 'Short'))
  }
  else if(counter_up == 1 & i < percentsd){
    counter_up <- 0
    df.signalPoints <- rbind(df.signalPoints, list(idx, i, 'Sell'))
  }
  else if(i < -sd.residuals & counter_down == 0){
    counter_down <- 1
    df.signalPoints <- rbind(df.signalPoints, list(idx, i, 'Long'))
  }
  else if(counter_down == 1 & i > percentsd){
    counter_down <- 0
    df.signalPoints <- rbind(df.signalPoints, list(idx, i, 'Sell'))
  }
}
colnames(df.signalPoints) <- c('Index', 'Value', 'Position')
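
To read the signals on a calendar rather than by row index, the indices can be mapped back to trading dates (a sketch, assuming the residuals align one-to-one with the rows of xts.close.pair; kept in a separate object so the table below is unchanged):

# Map signal row indices back to trading dates (illustrative).
df.signalDates <- data.frame(Date = index(xts.close.pair)[df.signalPoints$Index],
                             df.signalPoints)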

The entry/exit plot for the AAOI-GSIT pair, where the red circles mark the signal points.

# Residuals plot with circles indicating the calculated signals.
ggplot(data = df.residuals, aes(x = index(df.residuals), y = df.residuals[,1] ))+
  geom_line()+
  geom_hline(yintercept = 0, linetype='dashed', 
             color = 'red', 
             linewidth = 0.25)+
  geom_hline(yintercept = c(-1*sd(df.residuals$residuals), 
                            1*sd(df.residuals$residuals),
                            percentsd, -percentsd),
             linetype='dashed', 
             color = 'blue', 
             linewidth = 0.25)+
  geom_point(data = df.signalPoints,
             aes(x = df.signalPoints[,1],
                 y = df.signalPoints[,2]), 
             pch = 21, 
             size = 4, 
             color = 'red')+
  theme_bw()+
  ggtitle('Residuals plot with signal points to entry exit from trade')+
  labs(x = 'Timescale from first to last index', y = 'Residuals')

kable(df.signalPoints, caption = 'The signal points for the AAOI-GSIT pair.')
The signal points for the AAOI-GSIT pair.
Index       Value  Position
    1  -0.5160480  Long
   58   0.0240140  Sell
  101   0.2059385  Short
  218  -0.0132628  Sell
  311   0.2585614  Short
  337  -0.0069688  Sell
  391  -0.2228104  Long
  471   0.0464928  Sell
  485   0.2007339  Short
  521  -0.0018420  Sell
  546   0.2174994  Short
  585   0.0068612  Sell
  706  -0.1874147  Long
  719   0.1024614  Sell
  859   0.1894793  Short
  926   0.0173314  Sell
  935  -0.3423715  Long