#INTRODUCTION
#In this paper we will explore stock market data and build a logistic regression model. The S&P 500 Index (formerly the Standard & Poor's 500 Index) is a market-capitalization-weighted index of 500 of the largest publicly traded U.S. companies. The index is widely regarded as the best single gauge of large-cap U.S. equities.
#Data: We will use the built-in Smarket dataset (S&P Stock Market Data) from the ISLR package, which has 1250 observations and 9 variables. The columns are described below:
#Year: The year that the observation was recorded
#Lag1: Percentage return for previous day
#Lag2: Percentage return for 2 days previous
#Lag3: Percentage return for 3 days previous
#Lag4: Percentage return for 4 days previous
#Lag5: Percentage return for 5 days previous
#Volume: Volume of shares traded (number of daily shares traded in billions)
#Today: Percentage return for today
#Direction: A factor with levels Down and Up indicating whether the market had a positive or negative return on a given day
#Business Problem: Explore the stock market data and the daily percentage returns for the current day and the previous 5 days, examine the relationships amongst the returns, and then build a logistic regression model for the market's direction. Let's start our work.
#Loading Smarket from ISLR Package
library(ISLR)
data("Smarket")
#Data Preparation
str(Smarket)
## 'data.frame': 1250 obs. of 9 variables:
## $ Year : num 2001 2001 2001 2001 2001 ...
## $ Lag1 : num 0.381 0.959 1.032 -0.623 0.614 ...
## $ Lag2 : num -0.192 0.381 0.959 1.032 -0.623 ...
## $ Lag3 : num -2.624 -0.192 0.381 0.959 1.032 ...
## $ Lag4 : num -1.055 -2.624 -0.192 0.381 0.959 ...
## $ Lag5 : num 5.01 -1.055 -2.624 -0.192 0.381 ...
## $ Volume : num 1.19 1.3 1.41 1.28 1.21 ...
## $ Today : num 0.959 1.032 -0.623 0.614 0.213 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...
Smarket$Direction <- factor(Smarket$Direction, levels = c("Up", "Down"), labels = c(1,0))
#Relabelled the Direction factor levels ("Up", "Down") as (1, 0).
#We won't convert Year to a factor because we will need it in numeric form later.
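#Optional check of the recode (an extra step, just to confirm nothing was lost): level order and class counts
levels(Smarket$Direction)
table(Smarket$Direction)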
#Checking for missingness in our data
colSums(is.na(Smarket))
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume
## 0 0 0 0 0 0 0
## Today Direction
## 0 0
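#An equivalent one-line check (optional): anyNA() returns TRUE if any value anywhere is missing
anyNA(Smarket)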
#Good: there are no missing values, so no imputation is needed.
#No further data preparation is required for our purpose, so the second step is done as well.
##Summary
summary(Smarket)
## Year Lag1 Lag2
## Min. :2001 Min. :-4.922000 Min. :-4.922000
## 1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500
## Median :2003 Median : 0.039000 Median : 0.039000
## Mean :2003 Mean : 0.003834 Mean : 0.003919
## 3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750
## Max. :2005 Max. : 5.733000 Max. : 5.733000
## Lag3 Lag4 Lag5
## Min. :-4.922000 Min. :-4.922000 Min. :-4.92200
## 1st Qu.:-0.640000 1st Qu.:-0.640000 1st Qu.:-0.64000
## Median : 0.038500 Median : 0.038500 Median : 0.03850
## Mean : 0.001716 Mean : 0.001636 Mean : 0.00561
## 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.59700
## Max. : 5.733000 Max. : 5.733000 Max. : 5.73300
## Volume Today Direction
## Min. :0.3561 Min. :-4.922000 1:648
## 1st Qu.:1.2574 1st Qu.:-0.639500 0:602
## Median :1.4229 Median : 0.038500
## Mean :1.4783 Mean : 0.003138
## 3rd Qu.:1.6417 3rd Qu.: 0.596750
## Max. :3.1525 Max. : 5.733000
#Let's visualise our data
##Data Exploration: Data visualization is perhaps the fastest and most useful way to summarize and learn more about your data.
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.6
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
#Histograms of the eight numeric columns (Year, Lag1-Lag5, Volume, Today)
for(i in 1:8) {
hist(Smarket[,i], main=names(Smarket)[i], col = "orange")
}
#(eight histograms are rendered here, one per variable)
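#Optional layout tweak (an assumption about presentation, not part of the original code): draw all eight histograms on one 2 x 4 grid so they are easier to compare
par(mfrow = c(2, 4))
for(i in 1:8) {
hist(Smarket[,i], main=names(Smarket)[i], col = "orange")
}
par(mfrow = c(1, 1))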
#Boxplots of the same eight columns
for(i in 1:8) {
boxplot(Smarket[,i], main=names(Smarket)[i], col="red")
}
#(eight boxplots are rendered here, one per variable)
#The histograms and boxplots show that most of the variables follow an approximately Gaussian (normal) distribution.
#The Lag variables all share a similar range; Volume sits on its own, narrower scale. Let's visualise the correlations.
#Correlation
library(corrplot)
## corrplot 0.84 loaded
library(corrgram)
corrgram(Smarket, lower.panel = panel.shade, upper.panel = panel.pie, text.panel = panel.txt, main="Corrgram")

corrplot(cor(Smarket[,1:8]), method="circle")

#Apart from the obvious Year-Volume relationship, the variables show essentially no cross-correlation; in particular, Today is nearly uncorrelated with the lag variables. This is not encouraging for our model, since the market's direction does not appear to depend directly on the lags. Let's see whether we can still build a model with good accuracy.
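#To back this reading up with numbers (an optional check in base R), here is the correlation of every numeric column with Today:
round(cor(Smarket[, 1:8])["Today", ], 3)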
library(caret)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:corrgram':
##
## panel.fill
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
featurePlot(Smarket[,1:8], Smarket[,9], plot="density")

#The feature plot shows that the density curves for the Up and Down classes overlap heavily. Only the Today attribute separates the two classes (unsurprisingly, since Direction is derived from it). All other variables have a similarly weak visual impact on the binary outcome.
#Building a logistic regression model: Since the lags appear to have roughly equal (and weak) association with the outcome and no correlation amongst themselves, our first model will include all of the lags plus Volume, and we will see how it performs.
#MODEL1: glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial)
regressor1 <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data=Smarket, family = binomial)
summary(regressor1)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Smarket)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.326 -1.145 -1.065 1.203 1.446
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.126000 0.240736 0.523 0.601
## Lag1 0.073074 0.050167 1.457 0.145
## Lag2 0.042301 0.050086 0.845 0.398
## Lag3 -0.011085 0.049939 -0.222 0.824
## Lag4 -0.009359 0.049974 -0.187 0.851
## Lag5 -0.010313 0.049511 -0.208 0.835
## Volume -0.135441 0.158360 -0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
#The summary shows that none of the predictors is statistically significant; Lag3, Lag4 and Lag5 have the largest p-values and contribute even less than Lag1 and Lag2.
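#One optional way to read the coefficients (not part of the original write-up): exponentiating the log-odds gives odds ratios per unit change in each predictor
round(exp(coef(regressor1)), 3)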
#Predicting values based on MODEL1
prob1 <- predict(regressor1, type="response", newdata = Smarket)
prob1[1:5]
## 1 2 3 4 5
## 0.4929159 0.5185321 0.5188612 0.4847776 0.4892188
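#Side note (a small sanity check, assuming only base R): type = "response" simply applies the logistic transform to the linear predictor
all.equal(prob1, plogis(predict(regressor1, newdata = Smarket, type = "link")))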
#The fitted probabilities hover around 0.5, so the model is not very decisive. Let's turn these probabilities of the dependent variable into class labels by thresholding at 0.5.
pred1 <- ifelse(prob1>0.5, 1, 0)
#Confusion Matrix: a table comparing the predicted values with the actual values; it lets us calculate the accuracy of the predictions.
table(pred1, Smarket$Direction)
##
## pred1 1 0
## 0 507 457
## 1 141 145
mean(pred1==Smarket$Direction)
## [1] 0.4784
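#The same accuracy can be read straight off the confusion matrix: correct predictions over total observations
(141 + 457) / 1250
## [1] 0.4784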
#Checking the accuracy of the model
library(caret)
confusionMatrix(factor(pred1), Smarket$Direction)
## Warning in confusionMatrix.default(factor(pred1), Smarket$Direction):
## Levels are not in the same order for reference and data. Refactoring data
## to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 141 145
## 0 507 457
##
## Accuracy : 0.4784
## 95% CI : (0.4504, 0.5065)
## No Information Rate : 0.5184
## P-Value [Acc > NIR] : 0.9979
##
## Kappa : -0.0228
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.2176
## Specificity : 0.7591
## Pos Pred Value : 0.4930
## Neg Pred Value : 0.4741
## Prevalence : 0.5184
## Detection Rate : 0.1128
## Detection Prevalence : 0.2288
## Balanced Accuracy : 0.4884
##
## 'Positive' Class : 1
##
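#The warning above appears because pred1 defaults to level order c("0", "1") while Direction uses c("1", "0"); one optional way to avoid it is to align the levels explicitly (the output is identical to the above, minus the warning):
confusionMatrix(factor(pred1, levels = levels(Smarket$Direction)), Smarket$Direction)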
#Accuracy of the model is 47.84%, measured on the same data the model was trained on, and it is still worse than a coin flip. We haven't created a good model. Let's split the dataset into training and test subsets and evaluate on held-out data.
#Conclusion: We will remove Lag4 and Lag5, the predictors with the largest p-values.
#MODEL2: glm(Direction ~ Lag1 + Lag2 + Lag3 + Volume, data = training_set, family = binomial)
#Let's split the dataset into training and test subsets, fit the model on training_set, and evaluate its predictions on test_set by comparing predicted and true values.
#Training Data: Data prior to 2005
library(caTools)
training_set <- subset(Smarket, Year<2005)
#Test Data: Data of Year=2005
test_set <- subset(Smarket, Year==2005)
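#Quick check of the split sizes (optional): there should be 998 training days (2001-2004) and 252 test days (2005)
nrow(training_set)
nrow(test_set)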
#MODEL2
regressor2 <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Volume, data = training_set, family = binomial)
summary(regressor2)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Volume, family = binomial,
## data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.359 -1.161 -1.082 1.190 1.299
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.193568 0.332328 -0.582 0.560
## Lag1 0.054163 0.051789 1.046 0.296
## Lag2 0.045884 0.051786 0.886 0.376
## Lag3 -0.007131 0.051614 -0.138 0.890
## Volume 0.118004 0.238595 0.495 0.621
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1383.3 on 997 degrees of freedom
## Residual deviance: 1381.1 on 993 degrees of freedom
## AIC: 1391.1
##
## Number of Fisher Scoring iterations: 3
##Predicting values based on MODEL2
prob2 <- predict(regressor2, type="response", newdata = test_set)
prob2[1:5]
## 999 1000 1001 1002 1003
## 0.4731380 0.4836376 0.4775169 0.4860566 0.5005673
#The predicted probabilities again hover around 0.5, so they are not very decisive, but this time they are computed on held-out data. Let's turn these probabilities into class labels by thresholding at 0.5.
pred2 <- ifelse(prob2>0.5, 1, 0)
#Confusion Matrix: comparing the predicted values with the actual values on the test set. Let's see if it is better than the first one.
table(pred2, test_set$Direction)
##
## pred2 1 0
## 0 42 33
## 1 99 78
mean(pred2==test_set$Direction)
## [1] 0.5238095
#Checking the accuracy of the model
library(caret)
confusionMatrix(factor(pred2), test_set$Direction)
## Warning in confusionMatrix.default(factor(pred2), test_set$Direction):
## Levels are not in the same order for reference and data. Refactoring data
## to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 99 78
## 0 42 33
##
## Accuracy : 0.5238
## 95% CI : (0.4602, 0.5869)
## No Information Rate : 0.5595
## P-Value [Acc > NIR] : 0.885805
##
## Kappa : -6e-04
## Mcnemar's Test P-Value : 0.001398
##
## Sensitivity : 0.7021
## Specificity : 0.2973
## Pos Pred Value : 0.5593
## Neg Pred Value : 0.4400
## Prevalence : 0.5595
## Detection Rate : 0.3929
## Detection Prevalence : 0.7024
## Balanced Accuracy : 0.4997
##
## 'Positive' Class : 1
##
#Accuracy of the model is 52.38%. We have improved on MODEL1, and this time the accuracy is measured on held-out test data rather than the training data.
#Conclusion: Following backward elimination, we will next remove Lag3, the predictor with the highest p-value.
#MODEL3: glm(Direction ~ Lag1 + Lag2 + Volume, data = training_set, family = binomial)
regressor3 <- glm(Direction ~ Lag1 + Lag2 + Volume, data = training_set, family = binomial)
summary(regressor3)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Volume, family = binomial,
## data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.351 -1.161 -1.087 1.189 1.299
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.19653 0.33163 -0.593 0.553
## Lag1 0.05421 0.05179 1.047 0.295
## Lag2 0.04606 0.05177 0.890 0.374
## Volume 0.12018 0.23807 0.505 0.614
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1383.3 on 997 degrees of freedom
## Residual deviance: 1381.1 on 994 degrees of freedom
## AIC: 1389.1
##
## Number of Fisher Scoring iterations: 3
##Predicting values based on MODEL3
prob3 <- predict(regressor3, type="response", newdata = test_set)
prob3[1:10]
## 999 1000 1001 1002 1003 1004 1005
## 0.4728124 0.4837164 0.4774226 0.4847586 0.4985877 0.4973743 0.4986346
## 1006 1007 1008
## 0.4912696 0.4961702 0.4891323
#The predicted probabilities sit just below 0.5, so the model is still not very decisive. Let's threshold at 0.5 again and compare with MODEL2.
pred3 <- ifelse(prob3>0.5, 1, 0)
#Confusion Matrix: comparing predicted and actual values on the test set. Let's see if it is better than the second one.
table(pred3, test_set$Direction)
##
## pred3 1 0
## 0 41 32
## 1 100 79
mean(pred3==test_set$Direction)
## [1] 0.5238095
#Accuracy
confusionMatrix(factor(pred3),test_set$Direction)
## Warning in confusionMatrix.default(factor(pred3), test_set$Direction):
## Levels are not in the same order for reference and data. Refactoring data
## to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 100 79
## 0 41 32
##
## Accuracy : 0.5238
## 95% CI : (0.4602, 0.5869)
## No Information Rate : 0.5595
## P-Value [Acc > NIR] : 0.8858048
##
## Kappa : -0.0026
## Mcnemar's Test P-Value : 0.0007312
##
## Sensitivity : 0.7092
## Specificity : 0.2883
## Pos Pred Value : 0.5587
## Neg Pred Value : 0.4384
## Prevalence : 0.5595
## Detection Rate : 0.3968
## Detection Prevalence : 0.7103
## Balanced Accuracy : 0.4988
##
## 'Positive' Class : 1
##
#MODEL3 is essentially equivalent to MODEL2, with the same accuracy. Dropping Lag3 makes no difference to the outcome when the other variables are kept. Let's put Lag3 back in and take Volume out and see if that works.
#MODEL4: (Direction ~ Lag1 + Lag2 + Lag3, data = training_set, family = binomial)
regressor4 <- glm(Direction ~ Lag1 + Lag2 + Lag3, data = training_set, family = binomial)
summary(regressor4)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3, family = binomial,
## data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.335 -1.163 -1.072 1.189 1.338
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.032230 0.063377 -0.509 0.611
## Lag1 0.055523 0.051709 1.074 0.283
## Lag2 0.044300 0.051674 0.857 0.391
## Lag3 -0.008815 0.051495 -0.171 0.864
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1383.3 on 997 degrees of freedom
## Residual deviance: 1381.4 on 994 degrees of freedom
## AIC: 1389.4
##
## Number of Fisher Scoring iterations: 3
#Lag3 isn't contributing to the model significantly.
##Predicting values based on MODEL4
prob4 <- predict(regressor4, type="response", newdata = test_set)
prob4[1:5]
## 999 1000 1001 1002 1003
## 0.4901877 0.4791816 0.4670936 0.4757875 0.4953661
#All of the predicted probabilities shown are below 0.5, so this model classifies every test day as Down; that does not look promising. Let's remove Lag3 as well.
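#For context (an optional baseline, not in the original write-up): a model that always predicts Down can only score the share of Down days in the test set
mean(test_set$Direction == "0")   # roughly 0.44 (111 of the 252 test days were Down)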
#MODEL5: (Direction ~ Lag1 + Lag2 , data = training_set, family = binomial)
regressor5 <- glm(Direction ~ Lag1 + Lag2 , data = training_set, family = binomial)
summary(regressor5)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.326 -1.164 -1.074 1.188 1.345
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.03222 0.06338 -0.508 0.611
## Lag1 0.05562 0.05171 1.076 0.282
## Lag2 0.04449 0.05166 0.861 0.389
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1383.3 on 997 degrees of freedom
## Residual deviance: 1381.4 on 995 degrees of freedom
## AIC: 1387.4
##
## Number of Fisher Scoring iterations: 3
##Predicting values based on MODEL5
prob5 <- predict(regressor5, type="response", newdata = test_set)
prob5[1:10]
## 999 1000 1001 1002 1003 1004 1005
## 0.4901725 0.4791763 0.4667365 0.4739426 0.4927897 0.4938612 0.4951110
## 1006 1007 1008
## 0.4872698 0.4906968 0.4843769
#Here too all the predicted probabilities fall below 0.5. Adding the Volume attribute back gives MODEL3, which is the best model we can build with the basic logistic-regression tools used here, and it remains our most accurate model at 52.38%.
#Final MODEL: Direction ~ Lag1 + Lag2 + Volume
#CONCLUSION: MODEL3 (Direction ~ Lag1 + Lag2 + Volume) is our final model. Its test accuracy matches MODEL2's, but it uses one fewer predictor, which lowers the risk of overfitting. 52.38% is about the best we can do with a plain logistic regression, without transformations or more advanced techniques. Note that this is not the best model possible; its accuracy could be improved with other machine learning algorithms.
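#One optional side-by-side comparison of the training-set models (an extra check, not in the original write-up): AIC, where lower is better. AIC would slightly favour the most parsimonious fit (regressor5, Lag1 + Lag2), whereas we selected MODEL3 on test-set accuracy.
AIC(regressor2, regressor3, regressor4, regressor5)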
#Thanks for reading. Take care !!