1 Objective

This analysis aims to forecast the daily direction (Up/Down) of the FTSE100 index using 5 years of historical data and a set of technical indicators.

2 Data Preparation

2.1 Loading and Cleaning Data

Load required packages

list.packages <- c("quantmod", "corrplot", "corrgram", "TTR", "caret", "caTools", "ROCR", "ROSE", "pROC", "MASS", "glmnet")
invisible(lapply(list.packages, require, character.only = TRUE))
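
If any of these packages are not yet installed, require() simply returns FALSE for them. An optional convenience sketch (not part of the original workflow) installs whatever is missing first:

#install any listed packages that are not yet available (optional)
new.packages <- list.packages[!(list.packages %in% installed.packages()[, "Package"])]
if (length(new.packages) > 0) install.packages(new.packages)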

Download FTSE100 data using the quantmod package and clean missing values

getSymbols("^FTSE", from = "2020-04-24", to = "2025-04-25")
## [1] "FTSE"
#check for NA values
sum(is.na(FTSE))
## [1] 6
#remove NA values
FTSE <- na.omit(FTSE)

2.2 Feature Engineering

Extract the open, close, high, low and volume price series into a data frame

FTSE <- data.frame(
  date = index(FTSE),
  open = as.numeric(Op(FTSE)),
  close = as.numeric(Cl(FTSE)),
  high = as.numeric(Hi(FTSE)),
  low = as.numeric(Lo(FTSE)),
  volume = as.numeric(Vo(FTSE)))
head(FTSE)
##         date   open  close   high    low     volume
## 1 2020-04-24 5826.6 5752.2 5827.2 5728.8  808821600
## 2 2020-04-27 5752.2 5846.8 5855.6 5752.2  792246400
## 3 2020-04-28 5846.8 5958.5 5977.8 5835.9 1196850700
## 4 2020-04-29 5958.5 6115.3 6129.6 5950.8 1288097400
## 5 2020-04-30 6115.3 5901.2 6151.6 5897.9 1933348900
## 6 2020-05-01 5901.2 5763.1 5901.2 5746.1  781328900

Calculate log returns

FTSE$return <- c(NA,diff(log(FTSE$close)))
head(FTSE)
##         date   open  close   high    low     volume      return
## 1 2020-04-24 5826.6 5752.2 5827.2 5728.8  808821600          NA
## 2 2020-04-27 5752.2 5846.8 5855.6 5752.2  792246400  0.01631205
## 3 2020-04-28 5846.8 5958.5 5977.8 5835.9 1196850700  0.01892430
## 4 2020-04-29 5958.5 6115.3 6129.6 5950.8 1288097400  0.02597502
## 5 2020-04-30 6115.3 5901.2 6151.6 5897.9 1933348900 -0.03563804
## 6 2020-05-01 5901.2 5763.1 5901.2 5746.1  781328900 -0.02368021
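
As a quick illustrative check (not part of the original code), the first non-missing return can be reproduced by hand from the closing prices shown above:

#log return on 2020-04-27: log of today's close over yesterday's close
log(5846.8 / 5752.2)  #approximately 0.01631, matching the return column above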

Calculating lagged returns and other indicators

Calculate lagged returns (lags 1 to 5) and other technical indicators as predictors: the 10-day rolling standard deviation and rolling mean of the closing price, the 10-day and 20-day simple moving averages, and the 14-day and 21-day relative strength index (RSI).

Lagged returns

FTSE$lag1 <- Lag(FTSE$return, k = 1)
FTSE$lag2 <- Lag(FTSE$return, k = 2)
FTSE$lag3 <- Lag(FTSE$return, k = 3)
FTSE$lag4 <- Lag(FTSE$return, k = 4)
FTSE$lag5 <- Lag(FTSE$return, k = 5)
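
The five Lag() calls above could equivalently be written as a short loop; an illustrative alternative (the explicit calls produce the same columns):

#equivalent loop form of the five Lag() calls above (illustrative)
for (k in 1:5) {
  FTSE[[paste0("lag", k)]] <- Lag(FTSE$return, k = k)
}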

Rolling Standard deviation and mean of closing prices

FTSE$roll_sd <- rollapply(FTSE$close, width = 10, FUN = sd, fill = NA, align = "right")
FTSE$roll_mean <- rollapply(FTSE$close, width = 10, FUN = mean, fill = NA, align = "right")

10-day and 20-day simple moving averages

FTSE$ma10 <- SMA(FTSE$close, n = 10)
FTSE$ma20 <- SMA(FTSE$close, n = 20)

14-day and 21-day relative strength index

FTSE$rsi14 <- RSI(FTSE$close, n = 14)
FTSE$rsi21 <- RSI(FTSE$close, n = 21)
head(FTSE)
##         date   open  close   high    low     volume      return       Lag.1
## 1 2020-04-24 5826.6 5752.2 5827.2 5728.8  808821600          NA          NA
## 2 2020-04-27 5752.2 5846.8 5855.6 5752.2  792246400  0.01631205          NA
## 3 2020-04-28 5846.8 5958.5 5977.8 5835.9 1196850700  0.01892430  0.01631205
## 4 2020-04-29 5958.5 6115.3 6129.6 5950.8 1288097400  0.02597502  0.01892430
## 5 2020-04-30 6115.3 5901.2 6151.6 5897.9 1933348900 -0.03563804  0.02597502
## 6 2020-05-01 5901.2 5763.1 5901.2 5746.1  781328900 -0.02368021 -0.03563804
##        Lag.2      Lag.3      Lag.4 Lag.5 roll_sd roll_mean ma10 ma20 rsi14
## 1         NA         NA         NA    NA      NA        NA   NA   NA    NA
## 2         NA         NA         NA    NA      NA        NA   NA   NA    NA
## 3         NA         NA         NA    NA      NA        NA   NA   NA    NA
## 4 0.01631205         NA         NA    NA      NA        NA   NA   NA    NA
## 5 0.01892430 0.01631205         NA    NA      NA        NA   NA   NA    NA
## 6 0.02597502 0.01892430 0.01631205    NA      NA        NA   NA   NA    NA
##   rsi21
## 1    NA
## 2    NA
## 3    NA
## 4    NA
## 5    NA
## 6    NA

Remove NA values

FTSE <- na.omit(FTSE)

Creating the binary target variable (direction)

Create a direction variable based on the returns of the FTSE100. The direction variable indicates whether the index went up or down on a given day; this is the dependent variable we want to predict using the lagged returns and other indicators. “Up” is assigned to non-negative returns and “Down” to negative returns.

FTSE$direction <- ifelse(FTSE$return >= 0,"Up","Down")
head(FTSE)
##          date   open  close   high    low     volume       return        Lag.1
## 22 2020-05-27 6067.8 6144.3 6168.0 6067.8 1559401400  0.012528722  0.012353923
## 23 2020-05-28 6144.3 6218.8 6234.0 6144.3 1210102000  0.012052140  0.012528722
## 24 2020-05-29 6218.8 6076.6 6218.8 6060.2 2246313300 -0.023131586  0.012052140
## 25 2020-06-01 6076.6 6166.4 6179.3 6076.6  742112200  0.014669838 -0.023131586
## 26 2020-06-02 6166.4 6220.1 6238.3 6165.1 1192259900  0.008670816  0.014669838
## 27 2020-06-03 6220.1 6382.4 6394.7 6220.1 1232836600  0.025758187  0.008670816
##           Lag.2        Lag.3        Lag.4        Lag.5   roll_sd roll_mean
## 22 -0.003664045 -0.008591055  0.010771145 -0.007700755 126.49935   5978.41
## 23  0.012353923 -0.003664045 -0.008591055  0.010771145 143.90602   6009.88
## 24  0.012528722  0.012353923 -0.003664045 -0.008591055 109.32880   6043.39
## 25  0.012052140  0.012528722  0.012353923 -0.003664045  74.48401   6080.05
## 26 -0.023131586  0.012052140  0.012528722  0.012353923  85.38445   6097.20
## 27  0.014669838 -0.023131586  0.012052140  0.012528722 117.12928   6135.22
##       ma10     ma20    rsi14    rsi21 direction
## 22 5978.41 5942.485 61.98998 60.52730        Up
## 23 6009.88 5955.500 64.36032 62.11849        Up
## 24 6043.39 5953.565 57.04761 57.47507      Down
## 25 6080.05 5966.825 60.12855 59.48332        Up
## 26 6097.20 5989.675 61.88904 60.65014        Up
## 27 6135.22 6021.105 66.67786 63.94523        Up
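
Before modelling, it can be useful to look at the class balance of the new target variable; a quick check (counts not reproduced here):

#number of Up and Down days
table(FTSE$direction)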

Create a subset of the data containing only the required predictors and the target variable

FTSE_sub <- FTSE[, c("date","volume", names(FTSE)[8:19])]
head(FTSE_sub)
##          date     volume        Lag.1        Lag.2        Lag.3        Lag.4
## 22 2020-05-27 1559401400  0.012353923 -0.003664045 -0.008591055  0.010771145
## 23 2020-05-28 1210102000  0.012528722  0.012353923 -0.003664045 -0.008591055
## 24 2020-05-29 2246313300  0.012052140  0.012528722  0.012353923 -0.003664045
## 25 2020-06-01  742112200 -0.023131586  0.012052140  0.012528722  0.012353923
## 26 2020-06-02 1192259900  0.014669838 -0.023131586  0.012052140  0.012528722
## 27 2020-06-03 1232836600  0.008670816  0.014669838 -0.023131586  0.012052140
##           Lag.5   roll_sd roll_mean    ma10     ma20    rsi14    rsi21
## 22 -0.007700755 126.49935   5978.41 5978.41 5942.485 61.98998 60.52730
## 23  0.010771145 143.90602   6009.88 6009.88 5955.500 64.36032 62.11849
## 24 -0.008591055 109.32880   6043.39 6043.39 5953.565 57.04761 57.47507
## 25 -0.003664045  74.48401   6080.05 6080.05 5966.825 60.12855 59.48332
## 26  0.012353923  85.38445   6097.20 6097.20 5989.675 61.88904 60.65014
## 27  0.012528722 117.12928   6135.22 6135.22 6021.105 66.67786 63.94523
##    direction
## 22        Up
## 23        Up
## 24      Down
## 25        Up
## 26        Up
## 27        Up

Check structure of the data

str(FTSE_sub)
## 'data.frame':    1240 obs. of  14 variables:
##  $ date     : Date, format: "2020-05-27" "2020-05-28" ...
##  $ volume   : num  1.56e+09 1.21e+09 2.25e+09 7.42e+08 1.19e+09 ...
##  $ lag1     : num [1:1240, 1] 0.0124 0.0125 0.0121 -0.0231 0.0147 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr "Lag.1"
##  $ lag2     : num [1:1240, 1] -0.00366 0.01235 0.01253 0.01205 -0.02313 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr "Lag.2"
##  $ lag3     : num [1:1240, 1] -0.00859 -0.00366 0.01235 0.01253 0.01205 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr "Lag.3"
##  $ lag4     : num [1:1240, 1] 0.01077 -0.00859 -0.00366 0.01235 0.01253 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr "Lag.4"
##  $ lag5     : num [1:1240, 1] -0.0077 0.01077 -0.00859 -0.00366 0.01235 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr "Lag.5"
##  $ roll_sd  : num  126.5 143.9 109.3 74.5 85.4 ...
##  $ roll_mean: num  5978 6010 6043 6080 6097 ...
##  $ ma10     : num  5978 6010 6043 6080 6097 ...
##  $ ma20     : num  5942 5955 5954 5967 5990 ...
##  $ rsi14    : num  62 64.4 57 60.1 61.9 ...
##  $ rsi21    : num  60.5 62.1 57.5 59.5 60.7 ...
##  $ direction: chr  "Up" "Up" "Down" "Up" ...

Convert the character direction variable into a factor

FTSE_sub$direction <- factor(FTSE_sub$direction)

Checking for Multicollinearity

Correlation analysis was performed to measure the strength of association between the predictors. Understanding how the predictors relate to one another helps in specifying the regression model: highly correlated predictors can lead to multicollinearity, which reduces the reliability of the coefficient estimates. In the correlation-based model (Model 1), only one variable from each group of highly correlated predictors was retained.

cor.data <- cor(FTSE_sub[, 2:13])  #correlation matrix of all numeric predictors
corrplot(cor.data, method = "color", type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45, addCoef.col = "black", diag = FALSE)
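
As an optional complement to the plot (a sketch, not part of the original analysis), caret's findCorrelation() can flag predictors whose pairwise correlation exceeds a chosen cutoff, suggesting which variable could be dropped from each highly correlated group:

#flag predictors with absolute pairwise correlation above 0.9 (assumed cutoff)
high_corr <- findCorrelation(cor.data, cutoff = 0.9, names = TRUE)
high_corr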

Split the data into training (before 2025) and testing (2025) sets

The data was split into training and test sets: data before 2025 was used as the training data, while 2025 data was used as the test data. Dependent and independent variables were then created from the training and test data.

train = (FTSE_sub$date < "2025-01-01")
FTSE_train = FTSE_sub[train,]
FTSE_test = FTSE_sub[!train,]

Create separate dependent and independent variables from the split data

train_x <- FTSE_train[, 2:13] # independent variables
train_y <- FTSE_train$direction # dependent variable
test_x <- FTSE_test[, 2:13] # independent variables
test_y <- FTSE_test$direction # dependent variable
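
A quick check of the split sizes (an illustrative sketch; from the model summaries and confusion matrices further below, the training set should contain 1161 rows and the test set 79):

#number of observations in each split
nrow(FTSE_train)
nrow(FTSE_test)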

3 Model Evaluation

Three models were explored. The first included predictors selected from the correlation analysis, the second included all predictors, and the third used stepwise regression to identify the relevant predictors. The statistical significance of each predictor was assessed from its p-value at the 5% significance level: predictors with p-values below 0.05 were considered statistically significant, while those with p-values above 0.05 were not.
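
Each model below is evaluated with the same steps: predict class probabilities on the test set, threshold them at 0.5, and compute the accuracy and confusion matrix. A small helper function like the following (a sketch, not used in the code below) could bundle these steps:

#helper bundling the evaluation steps repeated for each model below (illustrative)
evaluate_model <- function(model, newdata, truth, threshold = 0.5) {
  prob <- predict(model, newdata = newdata, type = "response")
  pred <- ifelse(prob > threshold, "Up", "Down")
  list(accuracy = mean(pred == truth),
       confusion = table(truth, prob > threshold))
}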

3.1 Model 1: Selected predictors (Correlation based)

set.seed(100)
#LRM with predictors selected from correlation plot
lrm.corr <- glm(train_y ~ lag1 + lag2 + lag3 + lag4 + lag5 + volume + roll_sd + ma10 + rsi14,
                data = train_x, maxit = 1000, family = "binomial")
#summary of the estimation
summary(lrm.corr)
## 
## Call:
## glm(formula = train_y ~ lag1 + lag2 + lag3 + lag4 + lag5 + volume + 
##     roll_sd + ma10 + rsi14, family = "binomial", data = train_x, 
##     maxit = 1000)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.598e+00  9.587e-01  -5.839 5.24e-09 ***
## lag1        -6.760e+01  8.730e+00  -7.743 9.68e-15 ***
## lag2        -5.397e+01  8.433e+00  -6.400 1.56e-10 ***
## lag3        -5.774e+01  8.330e+00  -6.932 4.16e-12 ***
## lag4        -2.223e+01  7.875e+00  -2.823  0.00475 ** 
## lag5        -3.101e+01  7.814e+00  -3.969 7.23e-05 ***
## volume      -4.742e-10  2.322e-10  -2.042  0.04111 *  
## roll_sd      1.031e-02  2.024e-03   5.094 3.51e-07 ***
## ma10        -8.395e-05  1.076e-04  -0.780  0.43533    
## rsi14        1.158e-01  9.495e-03  12.196  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1604.1  on 1160  degrees of freedom
## Residual deviance: 1390.0  on 1151  degrees of freedom
## AIC: 1410
## 
## Number of Fisher Scoring iterations: 4
#calculate fitted probabilities
prob_lrm.corr <- predict(lrm.corr,newdata=test_x,type='response')
pred_lrm.corr <- ifelse(prob_lrm.corr > 0.5,"Up","Down")

#Calculate the accuracy of the model
error_lrm.corr <- mean(pred_lrm.corr != test_y)
accuracy_lrm.corr <- (1-error_lrm.corr)
accuracy_lrm.corr
## [1] 0.6582278
#Confusion matrix
table(test_y, prob_lrm.corr > 0.5)
##       
## test_y FALSE TRUE
##   Down    17   17
##   Up      10   35

From the first model, all the independent variables are significant except the 10-day moving average. The accuracy of the model is 65.82%, which is reasonable for predicting daily stock price direction. The confusion matrix breaks this accuracy down. The 35 true positives are the instances where the model correctly predicted the market going up, and the 17 true negatives are the instances where it correctly predicted the market going down. The 17 false positives and 10 false negatives are the instances the model incorrectly predicted as “Up” when they were actually “Down”, and as “Down” when they were actually “Up”, respectively. In summary, the model predicted 52 (17 + 35) instances correctly and 27 (17 + 10) instances incorrectly, giving an accuracy of 65.82%.
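
The accuracy can be recomputed directly from the confusion matrix counts as an arithmetic check:

#accuracy = correct predictions / all test observations
(17 + 35) / (17 + 17 + 10 + 35)  #52/79, approximately 0.6582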

3.2 Model 2: All predictors included

#LRM with full predictors
lrm.full <- glm(train_y ~ lag1 + lag2 + lag3 + lag4 + lag5 + volume + roll_sd + roll_mean + ma10 + ma20 + rsi14 + rsi21,
                data = train_x, maxit = 1000, family = "binomial")
#summary of the estimation
summary(lrm.full)
## 
## Call:
## glm(formula = train_y ~ lag1 + lag2 + lag3 + lag4 + lag5 + volume + 
##     roll_sd + roll_mean + ma10 + ma20 + rsi14 + rsi21, family = "binomial", 
##     data = train_x, maxit = 1000)
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.079e+00  1.379e+00  -5.860 4.62e-09 ***
## lag1        -1.783e+02  1.387e+01 -12.851  < 2e-16 ***
## lag2        -1.450e+02  1.244e+01 -11.657  < 2e-16 ***
## lag3        -1.287e+02  1.179e+01 -10.908  < 2e-16 ***
## lag4        -6.444e+01  1.012e+01  -6.369 1.90e-10 ***
## lag5        -6.077e+01  9.466e+00  -6.419 1.37e-10 ***
## volume      -2.907e-11  2.756e-10  -0.106  0.91598    
## roll_sd      7.848e-03  2.491e-03   3.150  0.00163 ** 
## roll_mean   -2.695e-02  2.156e-03 -12.500  < 2e-16 ***
## ma10                NA         NA      NA       NA    
## ma20         2.676e-02  2.144e-03  12.484  < 2e-16 ***
## rsi14        8.013e-01  5.771e-02  13.886  < 2e-16 ***
## rsi21       -6.214e-01  6.016e-02 -10.330  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1604.1  on 1160  degrees of freedom
## Residual deviance: 1051.7  on 1149  degrees of freedom
## AIC: 1075.7
## 
## Number of Fisher Scoring iterations: 5
#calculate fitted probabilities
prob_lrm.full <- predict(lrm.full,newdata=test_x,type='response')
pred_lrm.full <- ifelse(prob_lrm.full > 0.5,"Up","Down")

#Calculate the accuracy of the model
error_lrm.full <- mean(pred_lrm.full != test_y)
accuracy_lrm.full <- (1-error_lrm.full)
accuracy_lrm.full
## [1] 0.7848101
#confusion matrix
table(test_y, prob_lrm.full > 0.5)
##       
## test_y FALSE TRUE
##   Down    25    9
##   Up       8   37

From the second model, all the predictors have p-values below 0.05 and are hence significant, except for volume (p = 0.92) and ma10, whose coefficient is NA. The NA means ma10 is perfectly collinear with another predictor, the 10-day rolling mean (both are 10-day averages of the closing price), as was also indicated by the correlation plot. The accuracy of the model is 78.48%, which is reasonable for predicting daily stock price direction. The confusion matrix breaks this accuracy down. The 37 true positives are the instances where the model correctly predicted the market going up, and the 25 true negatives are the instances where it correctly predicted the market going down. The 9 false positives and 8 false negatives are the instances the model incorrectly predicted as “Up” when they were actually “Down”, and as “Down” when they were actually “Up”, respectively. In summary, the model predicted 62 (37 + 25) instances correctly and 17 (9 + 8) instances incorrectly, giving an accuracy of 78.48%.
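
The source of the NA coefficient can be confirmed directly by checking the correlation between the two 10-day averages (a quick sketch):

#ma10 and roll_mean are both 10-day averages of the close, so their correlation is exactly 1
cor(train_x$ma10, train_x$roll_mean)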

3.3 Model 3: Stepwise-selected predictors

Stepwise Regression

#Stepwise regression to Identify important variables
model_step <- step(lrm.full, direction = "both")
## Start:  AIC=1075.74
## train_y ~ lag1 + lag2 + lag3 + lag4 + lag5 + volume + roll_sd + 
##     roll_mean + ma10 + ma20 + rsi14 + rsi21
## 
## 
## Step:  AIC=1075.74
## train_y ~ lag1 + lag2 + lag3 + lag4 + lag5 + volume + roll_sd + 
##     roll_mean + ma20 + rsi14 + rsi21
## 
##             Df Deviance    AIC
## - volume     1   1051.8 1073.8
## <none>           1051.7 1075.7
## - roll_sd    1   1061.7 1083.7
## - lag5       1   1095.2 1117.2
## - lag4       1   1095.8 1117.8
## - rsi21      1   1185.0 1207.0
## - lag3       1   1205.0 1227.0
## - lag2       1   1234.0 1256.0
## - ma20       1   1273.3 1295.3
## - roll_mean  1   1273.7 1295.7
## - lag1       1   1297.9 1319.9
## - rsi14      1   1350.7 1372.7
## 
## Step:  AIC=1073.75
## train_y ~ lag1 + lag2 + lag3 + lag4 + lag5 + roll_sd + roll_mean + 
##     ma20 + rsi14 + rsi21
## 
##             Df Deviance    AIC
## <none>           1051.8 1073.8
## + volume     1   1051.7 1075.7
## - roll_sd    1   1062.1 1082.1
## - lag5       1   1095.2 1115.2
## - lag4       1   1095.8 1115.8
## - rsi21      1   1186.8 1206.8
## - lag3       1   1205.2 1225.2
## - lag2       1   1234.0 1254.0
## - ma20       1   1274.0 1294.0
## - roll_mean  1   1274.6 1294.6
## - lag1       1   1297.9 1317.9
## - rsi14      1   1354.8 1374.8
# Print the summary of the selected model
summary(model_step)
## 
## Call:
## glm(formula = train_y ~ lag1 + lag2 + lag3 + lag4 + lag5 + roll_sd + 
##     roll_mean + ma20 + rsi14 + rsi21, family = "binomial", data = train_x, 
##     maxit = 1000)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.080e+00  1.379e+00  -5.861 4.59e-09 ***
## lag1        -1.783e+02  1.387e+01 -12.856  < 2e-16 ***
## lag2        -1.450e+02  1.244e+01 -11.657  < 2e-16 ***
## lag3        -1.287e+02  1.179e+01 -10.913  < 2e-16 ***
## lag4        -6.446e+01  1.012e+01  -6.369 1.91e-10 ***
## lag5        -6.076e+01  9.468e+00  -6.418 1.38e-10 ***
## roll_sd      7.788e-03  2.425e-03   3.211  0.00132 ** 
## roll_mean   -2.696e-02  2.154e-03 -12.516  < 2e-16 ***
## ma20         2.677e-02  2.142e-03  12.496  < 2e-16 ***
## rsi14        8.018e-01  5.752e-02  13.940  < 2e-16 ***
## rsi21       -6.220e-01  5.989e-02 -10.385  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1604.1  on 1160  degrees of freedom
## Residual deviance: 1051.7  on 1150  degrees of freedom
## AIC: 1073.7
## 
## Number of Fisher Scoring iterations: 5

Fit Model 3 with the stepwise-selected predictors

#LRM with predictors from stepwise regression
lrm.step <- glm(train_y ~ lag1 + lag2 + lag3 + lag4 + lag5 + roll_sd + roll_mean + ma20 + rsi14 + rsi21,
                data = train_x, maxit = 1000, family = "binomial")
#summary of the estimation
summary(lrm.step)
## 
## Call:
## glm(formula = train_y ~ lag1 + lag2 + lag3 + lag4 + lag5 + roll_sd + 
##     roll_mean + ma20 + rsi14 + rsi21, family = "binomial", data = train_x, 
##     maxit = 1000)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.080e+00  1.379e+00  -5.861 4.59e-09 ***
## lag1        -1.783e+02  1.387e+01 -12.856  < 2e-16 ***
## lag2        -1.450e+02  1.244e+01 -11.657  < 2e-16 ***
## lag3        -1.287e+02  1.179e+01 -10.913  < 2e-16 ***
## lag4        -6.446e+01  1.012e+01  -6.369 1.91e-10 ***
## lag5        -6.076e+01  9.468e+00  -6.418 1.38e-10 ***
## roll_sd      7.788e-03  2.425e-03   3.211  0.00132 ** 
## roll_mean   -2.696e-02  2.154e-03 -12.516  < 2e-16 ***
## ma20         2.677e-02  2.142e-03  12.496  < 2e-16 ***
## rsi14        8.018e-01  5.752e-02  13.940  < 2e-16 ***
## rsi21       -6.220e-01  5.989e-02 -10.385  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1604.1  on 1160  degrees of freedom
## Residual deviance: 1051.7  on 1150  degrees of freedom
## AIC: 1073.7
## 
## Number of Fisher Scoring iterations: 5
#calculate fitted probabilities
prob_lrm.step <- predict(lrm.step,newdata=test_x,type='response')
pred_lrm.step <- ifelse(prob_lrm.step > 0.5,"Up","Down")

#Calculate the accuracy of the model
error_lrm.step <- mean(pred_lrm.step != test_y)
accuracy_lrm.step <- (1-error_lrm.step)
accuracy_lrm.step
## [1] 0.7848101
#confusion matrix
table(test_y, prob_lrm.step > 0.5)
##       
## test_y FALSE TRUE
##   Down    25    9
##   Up       8   37

In the third model, stepwise regression was used to determine the relevant predictors: lag1 to lag5, the rolling standard deviation, the rolling mean, the 20-day moving average, and the 14-day and 21-day relative strength indices. The accuracy of the model is 78.48%, which is reasonable for predicting daily stock price direction. The confusion matrix breaks this accuracy down. The 37 true positives are the instances where the model correctly predicted the market going up, and the 25 true negatives are the instances where it correctly predicted the market going down. The 9 false positives and 8 false negatives are the instances the model incorrectly predicted as “Up” when they were actually “Down”, and as “Down” when they were actually “Up”, respectively. In summary, the model predicted 62 (37 + 25) instances correctly and 17 (9 + 8) instances incorrectly, giving an accuracy of 78.48%.
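
Since the stepwise model differs from the full model only by dropping the insignificant volume term, it is worth checking how closely their test-set predictions agree (a sketch; near-total agreement is expected given the identical confusion matrices):

#proportion of test days on which the full and stepwise models predict the same class
mean(pred_lrm.step == pred_lrm.full)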

4 Results and Conclusion

4.1 Model Comparison

#Accuracy comparison
accuracy_comparison <- data.frame(
  Model = c("lrm.corr", "lrm.full", "lrm.step"),
  Accuracy = c(accuracy_lrm.corr, accuracy_lrm.full, accuracy_lrm.step)
)
print(accuracy_comparison)
##      Model  Accuracy
## 1 lrm.corr 0.6582278
## 2 lrm.full 0.7848101
## 3 lrm.step 0.7848101

4.2 ROC Curve

#ROC Curve for all the models
par(cex.main = 2, cex.axis = 1.5, cex.lab = 1.5)
roc.curve(test_y, prob_lrm.corr, 
          main = "ROC curve of the 3 models", col = 1, lwd = 2, lty = 1)
## Area under the curve (AUC): 0.686
roc.curve(test_y,prob_lrm.full, add=TRUE, col=2, 
          lwd=3, lty=2)
## Area under the curve (AUC): 0.867
roc.curve(test_y,prob_lrm.step, add=TRUE, col=3, 
          lwd=3, lty=3)
## Area under the curve (AUC): 0.866
legend("bottomright", c("LR with Corr. Predictors", "LR with full Predictors", "LR with stepwise regression Predictors"), 
       col=1:3, lty=1:3, lwd=2)

From the accuracy table of the 3 models, the second and third models have the highest accuracy at 78.48%, indicating that both correctly predicted 78.48% of the test data. From the ROC plot, the Area Under the Curve (AUC) of the first model is 0.686 (about 69%), while the second and third models have AUCs of 0.867 and 0.866 (about 87%) respectively. This indicates that the second and third models are better able to distinguish between upward and downward movements, as shown by their higher AUCs.
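
As an optional complement to the ROC plot (an illustrative sketch using the already loaded pROC package), the AUC values can also be collected into a table alongside the accuracies:

#AUC comparison table (values should match those printed by roc.curve above)
auc_comparison <- data.frame(
  Model = c("lrm.corr", "lrm.full", "lrm.step"),
  AUC = c(as.numeric(auc(roc(test_y, prob_lrm.corr))),
          as.numeric(auc(roc(test_y, prob_lrm.full))),
          as.numeric(auc(roc(test_y, prob_lrm.step)))))
print(auc_comparison)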