#INTRODUCTION
#In this paper we will explore stock market data and build a logistic regression model. The S&P 500 Index (formerly the Standard & Poor's 500 Index) is a market-capitalization-weighted index of 500 of the largest publicly traded U.S. companies by market value. The index is widely regarded as the best single gauge of large-cap U.S. equities.

#Data: We will make use of the built-in dataset Smarket (S&P Stock Market Data), which has 1250 observations and 9 variables. The columns are described below:

#Year: The year that the observation was recorded
#Lag1: Percentage return for previous day
#Lag2: Percentage return for 2 days previous
#Lag3: Percentage return for 3 days previous
#Lag4: Percentage return for 4 days previous
#Lag5: Percentage return for 5 days previous
#Volume: Volume of shares traded (number of daily shares traded in billions)
#Today: Percentage return for today
#Direction: A factor with levels Down and Up indicating whether the market had a positive or negative return on a given day

#Business Problem: Explore the stock market data, in particular today's percentage return and the returns of the previous 5 trading days, to understand the relationship among the returns, and then build a logistic regression model for the daily market direction. Let's start our work.

#Loading Smarket from ISLR Package

library(ISLR)
data("Smarket")

#Data Preparation

str(Smarket)
## 'data.frame':    1250 obs. of  9 variables:
##  $ Year     : num  2001 2001 2001 2001 2001 ...
##  $ Lag1     : num  0.381 0.959 1.032 -0.623 0.614 ...
##  $ Lag2     : num  -0.192 0.381 0.959 1.032 -0.623 ...
##  $ Lag3     : num  -2.624 -0.192 0.381 0.959 1.032 ...
##  $ Lag4     : num  -1.055 -2.624 -0.192 0.381 0.959 ...
##  $ Lag5     : num  5.01 -1.055 -2.624 -0.192 0.381 ...
##  $ Volume   : num  1.19 1.3 1.41 1.28 1.21 ...
##  $ Today    : num  0.959 1.032 -0.623 0.614 0.213 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...
Smarket$Direction <- factor(Smarket$Direction, levels = c("Up", "Down"), labels = c(1,0)) 

#Relabelled the Direction factor levels ("Up", "Down") as (1, 0).
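#A quick, optional check of how R will encode the relabelled factor (a small
#sketch; with a binomial glm(), the first factor level is treated as the
#reference class, so the level order chosen above determines which class the
#fitted probabilities refer to):
levels(Smarket$Direction)
contrasts(Smarket$Direction)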

#We won't convert Year to a factor, as we will need it later in numeric form.

#Checking for missingness in our data
colSums(is.na(Smarket)) 
##      Year      Lag1      Lag2      Lag3      Lag4      Lag5    Volume 
##         0         0         0         0         0         0         0 
##     Today Direction 
##         0         0
#There are no missing values, so no imputation work is needed.
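#An equivalent one-line check (TRUE would signal at least one missing value anywhere in the data frame):
anyNA(Smarket)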

#No further data preparation is required for this analysis, so our second step is also done.

##Summary
summary(Smarket)
##       Year           Lag1                Lag2          
##  Min.   :2001   Min.   :-4.922000   Min.   :-4.922000  
##  1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500  
##  Median :2003   Median : 0.039000   Median : 0.039000  
##  Mean   :2003   Mean   : 0.003834   Mean   : 0.003919  
##  3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
##  Max.   :2005   Max.   : 5.733000   Max.   : 5.733000  
##       Lag3                Lag4                Lag5         
##  Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.92200  
##  1st Qu.:-0.640000   1st Qu.:-0.640000   1st Qu.:-0.64000  
##  Median : 0.038500   Median : 0.038500   Median : 0.03850  
##  Mean   : 0.001716   Mean   : 0.001636   Mean   : 0.00561  
##  3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.59700  
##  Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.73300  
##      Volume           Today           Direction
##  Min.   :0.3561   Min.   :-4.922000   1:648    
##  1st Qu.:1.2574   1st Qu.:-0.639500   0:602    
##  Median :1.4229   Median : 0.038500            
##  Mean   :1.4783   Mean   : 0.003138            
##  3rd Qu.:1.6417   3rd Qu.: 0.596750            
##  Max.   :3.1525   Max.   : 5.733000
#Let's visualise our data

##Data Exploration: Data visualization is perhaps the fastest and most useful way to summarize and learn more about your data.

library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.6
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
for(i in 1:8) {
    hist(Smarket[,i], main=names(Smarket)[i], col = "orange")
}

for(i in 1:8) {
    boxplot(Smarket[,i], main=names(Smarket)[i], col="red")
}

# The histograms and boxplots show that the lag and Today returns follow an approximately Gaussian (normal) distribution.
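#The same distributions can also be drawn with ggplot2 (already loaded via the
#tidyverse above). A minimal sketch: gather() reshapes the eight numeric columns
#to long form so that one faceted histogram covers them all.
Smarket %>%
  select(Year:Today) %>%
  gather(key = "variable", value = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "orange", colour = "black") +
  facet_wrap(~ variable, scales = "free")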

#The lag variables all span a similar range. Next, let's visualise the correlation structure.

#Correlation
  
library(corrplot)
## corrplot 0.84 loaded
library(corrgram)
corrgram(Smarket, lower.panel = panel.shade, upper.panel = panel.pie, text.panel = panel.txt, main="Corrgram")

corrplot(cor(Smarket[,1:8]), method="circle")

#Apart from the trivial perfect correlation of each variable with itself, there is essentially no cross-correlation: the lag returns are nearly uncorrelated with Today, and the only notable relationship is between Year and Volume. This is not encouraging for our model, since today's direction does not appear to depend linearly on the lagged returns. Let's see whether we can still build a model with reasonable accuracy.
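#To back up the visual impression with numbers (output not shown here), one can
#inspect the correlations of the predictors with Today and the Year-Volume
#correlation directly:
round(cor(Smarket[, 1:8])[, "Today"], 3)
round(cor(Smarket$Year, Smarket$Volume), 3)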

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:corrgram':
## 
##     panel.fill
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
featurePlot(Smarket[,1:8], Smarket[,9], plot="density")

#The feature plot shows that the density curves for the Up and Down classes overlap heavily for every predictor. Only the Today attribute clearly separates the two classes, which is expected since Direction is derived from it. All other variables show very similar distributions for both outcomes.

#Building a logistic regression model: Since the lags appear to carry similarly weak information about the outcome and show no correlation among themselves, our first model will include all the lags (plus Volume), and we will see how it performs.

#MODEL1: glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial)

regressor1 <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data=Smarket, family = binomial)
summary(regressor1)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Smarket)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.326  -1.145  -1.065   1.203   1.446  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.126000   0.240736   0.523    0.601
## Lag1         0.073074   0.050167   1.457    0.145
## Lag2         0.042301   0.050086   0.845    0.398
## Lag3        -0.011085   0.049939  -0.222    0.824
## Lag4        -0.009359   0.049974  -0.187    0.851
## Lag5        -0.010313   0.049511  -0.208    0.835
## Volume      -0.135441   0.158360  -0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3
#The summary shows that none of the predictors is statistically significant; Lag3, Lag4 and Lag5 in particular have much larger p-values than Lag1 and Lag2, so they contribute little to explaining the binary outcome.
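#To read the p-values off programmatically rather than from the printed summary
#(a small sketch, output not shown):
round(coef(summary(regressor1))[, "Pr(>|z|)"], 3)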

#Predicting values based on MODEL1
prob1 <- predict(regressor1, type="response", newdata = Smarket)
prob1[1:5]
##         1         2         3         4         5 
## 0.4929159 0.5185321 0.5188612 0.4847776 0.4892188
#The fitted probabilities hover around 0.5, so the model is not very decisive. Let's convert these probabilities of the dependent variable into class predictions by thresholding at 0.5.
pred1 <- ifelse(prob1>0.5, 1, 0)

#Confusion Matrix: A table comparing the predicted classes with the actual values. It is used to calculate the accuracy of the predictions.

table(pred1, Smarket$Direction)
##      
## pred1   1   0
##     0 507 457
##     1 141 145
mean(pred1==Smarket$Direction)
## [1] 0.4784
#Checking the accuracy of the model

library(caret)
confusionMatrix(factor(pred1), Smarket$Direction)
## Warning in confusionMatrix.default(factor(pred1), Smarket$Direction):
## Levels are not in the same order for reference and data. Refactoring data
## to match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   0
##          1 141 145
##          0 507 457
##                                           
##                Accuracy : 0.4784          
##                  95% CI : (0.4504, 0.5065)
##     No Information Rate : 0.5184          
##     P-Value [Acc > NIR] : 0.9979          
##                                           
##                   Kappa : -0.0228         
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.2176          
##             Specificity : 0.7591          
##          Pos Pred Value : 0.4930          
##          Neg Pred Value : 0.4741          
##              Prevalence : 0.5184          
##          Detection Rate : 0.1128          
##    Detection Prevalence : 0.2288          
##       Balanced Accuracy : 0.4884          
##                                           
##        'Positive' Class : 1               
## 
#The accuracy of the model is 47.84%, so we haven't created a good model yet (and this is an optimistic estimate, since it was evaluated on the same data used for fitting). Let's split the dataset into training and test sets and evaluate the next model on held-out test data.

#Conclusion: We will remove Lag4 and Lag5, as they have the highest p-values.
#MODEL2: glm(Direction ~ Lag1 + Lag2 + Lag3 + Volume, data = training_set, family = binomial)

#Let's split our dataset into training and test subsets, fit the model on training_set, and compare its predictions on test_set against the true values.

#Training Data: Data prior to 2005
library(caTools)
training_set <- subset(Smarket, Year<2005)

#Test Data: Data of Year=2005
test_set <- subset(Smarket, Year==2005)
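#Quick sanity check of the split sizes (998 training rows and 252 test rows are
#expected, consistent with the 997 degrees of freedom of the null deviance
#reported below):
dim(training_set)
dim(test_set)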

#MODEL2
regressor2 <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Volume, data = training_set, family = binomial)
summary(regressor2)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Volume, family = binomial, 
##     data = training_set)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.359  -1.161  -1.082   1.190   1.299  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.193568   0.332328  -0.582    0.560
## Lag1         0.054163   0.051789   1.046    0.296
## Lag2         0.045884   0.051786   0.886    0.376
## Lag3        -0.007131   0.051614  -0.138    0.890
## Volume       0.118004   0.238595   0.495    0.621
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1383.3  on 997  degrees of freedom
## Residual deviance: 1381.1  on 993  degrees of freedom
## AIC: 1391.1
## 
## Number of Fisher Scoring iterations: 3
##Predicting values based on MODEL2
prob2 <- predict(regressor2, type="response", newdata = test_set)
prob2[1:5]
##       999      1000      1001      1002      1003 
## 0.4731380 0.4836376 0.4775169 0.4860566 0.5005673
#The fitted probabilities again hover around 0.5, so the model is still not very decisive, though slightly better behaved than MODEL1. Let's convert these probabilities into class predictions by thresholding at 0.5.
pred2 <- ifelse(prob2>0.5, 1, 0)

#Confusion Matrix: A table comparing the predicted classes with the actual values, used to calculate the accuracy of the predictions. Let's see whether it is better than the first one.

table(pred2, test_set$Direction)
##      
## pred2  1  0
##     0 42 33
##     1 99 78
mean(pred2==test_set$Direction)
## [1] 0.5238095
#Checking the accuracy of the model

library(caret)
confusionMatrix(factor(pred2), test_set$Direction)
## Warning in confusionMatrix.default(factor(pred2), test_set$Direction):
## Levels are not in the same order for reference and data. Refactoring data
## to match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  0
##          1 99 78
##          0 42 33
##                                           
##                Accuracy : 0.5238          
##                  95% CI : (0.4602, 0.5869)
##     No Information Rate : 0.5595          
##     P-Value [Acc > NIR] : 0.885805        
##                                           
##                   Kappa : -6e-04          
##  Mcnemar's Test P-Value : 0.001398        
##                                           
##             Sensitivity : 0.7021          
##             Specificity : 0.2973          
##          Pos Pred Value : 0.5593          
##          Neg Pred Value : 0.4400          
##              Prevalence : 0.5595          
##          Detection Rate : 0.3929          
##    Detection Prevalence : 0.7024          
##       Balanced Accuracy : 0.4997          
##                                           
##        'Positive' Class : 1               
## 
#The accuracy of the model on the test set is 52.38%, so we have improved on MODEL1 in terms of agreement with the actual values, although none of the coefficients is statistically significant.

#Conclusion: We will remove Lag3 next, as it has the highest p-value, in line with the backward elimination approach to model selection.
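#As a cross-check on this manual backward elimination, AIC-based selection could
#be automated with step() (a sketch only; output not shown, and the formula it
#settles on may differ from the manual choice):
step_model <- step(regressor2, direction = "backward", trace = FALSE)
formula(step_model)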
#MODEL3:
regressor3 <- glm(Direction ~ Lag1 + Lag2 + Volume, data = training_set, family = binomial)
summary(regressor3)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Volume, family = binomial, 
##     data = training_set)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.351  -1.161  -1.087   1.189   1.299  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.19653    0.33163  -0.593    0.553
## Lag1         0.05421    0.05179   1.047    0.295
## Lag2         0.04606    0.05177   0.890    0.374
## Volume       0.12018    0.23807   0.505    0.614
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1383.3  on 997  degrees of freedom
## Residual deviance: 1381.1  on 994  degrees of freedom
## AIC: 1389.1
## 
## Number of Fisher Scoring iterations: 3
##Predicting values based on MODEL3
prob3 <- predict(regressor3, type="response", newdata = test_set)
prob3[1:10]
##       999      1000      1001      1002      1003      1004      1005 
## 0.4728124 0.4837164 0.4774226 0.4847586 0.4985877 0.4973743 0.4986346 
##      1006      1007      1008 
## 0.4912696 0.4961702 0.4891323
#Most of these fitted probabilities are just below 0.5 and very close to the MODEL2 values, so the model is still not very decisive. Let's convert these probabilities into class predictions by thresholding at 0.5.
pred3 <- ifelse(prob3>0.5, 1, 0)

#Confusion Matrix: A table comparing the predicted classes with the actual values, used to calculate the accuracy of the predictions. Let's see if it is better than the second model.

table(pred3, test_set$Direction)
##      
## pred3   1   0
##     0  41  32
##     1 100  79
mean(pred3==test_set$Direction)
## [1] 0.5238095
#Accuracy 
confusionMatrix(factor(pred3),test_set$Direction)
## Warning in confusionMatrix.default(factor(pred3), test_set$Direction):
## Levels are not in the same order for reference and data. Refactoring data
## to match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   0
##          1 100  79
##          0  41  32
##                                           
##                Accuracy : 0.5238          
##                  95% CI : (0.4602, 0.5869)
##     No Information Rate : 0.5595          
##     P-Value [Acc > NIR] : 0.8858048       
##                                           
##                   Kappa : -0.0026         
##  Mcnemar's Test P-Value : 0.0007312       
##                                           
##             Sensitivity : 0.7092          
##             Specificity : 0.2883          
##          Pos Pred Value : 0.5587          
##          Neg Pred Value : 0.4384          
##              Prevalence : 0.5595          
##          Detection Rate : 0.3968          
##    Detection Prevalence : 0.7103          
##       Balanced Accuracy : 0.4988          
##                                           
##        'Positive' Class : 1               
## 
#MODEL3 looks equivalent to MODEL2, with exactly the same test accuracy: dropping Lag3 makes no difference to the outcome when the other variables are kept. A quick AIC comparison of the two models is sketched below. Next, let's keep Lag3 but drop Volume instead and see whether that helps.
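#Since MODEL2 and MODEL3 are fitted on the same training data, their AICs can be
#compared directly (a quick sketch; the summaries above already show the smaller
#MODEL3 with the lower AIC, 1389.1 vs 1391.1):
AIC(regressor2, regressor3)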
#MODEL4: (Direction ~ Lag1 + Lag2 + Lag3, data = training_set, family = binomial)

regressor4 <- glm(Direction ~ Lag1 + Lag2 + Lag3, data = training_set, family = binomial)
summary(regressor4)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3, family = binomial, 
##     data = training_set)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.335  -1.163  -1.072   1.189   1.338  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.032230   0.063377  -0.509    0.611
## Lag1         0.055523   0.051709   1.074    0.283
## Lag2         0.044300   0.051674   0.857    0.391
## Lag3        -0.008815   0.051495  -0.171    0.864
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1383.3  on 997  degrees of freedom
## Residual deviance: 1381.4  on 994  degrees of freedom
## AIC: 1389.4
## 
## Number of Fisher Scoring iterations: 3
#Lag3 isn't contributing to the model significantly.

##Predicting values based on MODEL4
prob4 <- predict(regressor4, type="response", newdata = test_set)
prob4[1:5]
##       999      1000      1001      1002      1003 
## 0.4901877 0.4791816 0.4670936 0.4757875 0.4953661
#All of the displayed probabilities are below 0.5, so the model assigns every one of these days to the same class: not a promising model. Let's remove Lag3 too.

#MODEL5: (Direction ~ Lag1 + Lag2 , data = training_set, family = binomial)

regressor5 <- glm(Direction ~ Lag1 + Lag2 , data = training_set, family = binomial)
summary(regressor5)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = training_set)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.326  -1.164  -1.074   1.188   1.345  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.03222    0.06338  -0.508    0.611
## Lag1         0.05562    0.05171   1.076    0.282
## Lag2         0.04449    0.05166   0.861    0.389
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1383.3  on 997  degrees of freedom
## Residual deviance: 1381.4  on 995  degrees of freedom
## AIC: 1387.4
## 
## Number of Fisher Scoring iterations: 3
##Predicting values based on MODEL5
prob5 <- predict(regressor5, type="response", newdata = test_set)
prob5[1:10]
##       999      1000      1001      1002      1003      1004      1005 
## 0.4901725 0.4791763 0.4667365 0.4739426 0.4927897 0.4938612 0.4951110 
##      1006      1007      1008 
## 0.4872698 0.4906968 0.4843769
#Here too the fitted probabilities are all below 0.5, so the model again puts every displayed day in the same class. Adding the Volume attribute back to this model gives us the best model we can build with the basic logistic regression techniques used here, and it achieves the highest accuracy so far, 52.38%.

#Final MODEL: Direction ~ Lag1 + Lag2 + Volume
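#The fitted probabilities of the final model sit very close to 0.5, so the 0.5
#cut-off is somewhat arbitrary. A purely illustrative sketch of how test-set
#accuracy moves with the threshold (tuning a threshold on the test set would
#risk overfitting, so this is exploration only):
thresholds <- seq(0.45, 0.55, by = 0.01)
sapply(thresholds, function(t) mean(ifelse(prob3 > t, 1, 0) == test_set$Direction))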

#CONCLUSION: MODEL3 (Direction ~ Lag1 + Lag2 + Volume) will be our final model. Its accuracy matches MODEL2's, but MODEL2 carries an extra predictor (Lag3) and therefore a slightly higher risk of overfitting. 52.38% is the best we can achieve with this basic logistic regression approach, without transformations or more advanced techniques. Note that this is not the best model possible; its accuracy could be improved with other machine learning algorithms.
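#One possible follow-up, sketched here but not run: fit an alternative
#classifier, for example linear discriminant analysis from the MASS package, on
#the same train/test split and compare its test accuracy with the 52.38% above.
library(MASS)
lda_fit  <- lda(Direction ~ Lag1 + Lag2 + Volume, data = training_set)
lda_pred <- predict(lda_fit, newdata = test_set)$class
mean(lda_pred == test_set$Direction)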

#Thanks for reading. Take care !!