Looking at logistic regression models using the glm() function.

URL: https://www.youtube.com/watch?v=TxvEVc8YNlU&feature=youtu.be&list=PL5-da3qGB5IC4vaDba5ClatUmFppXLAhE

library(ISLR)
## Warning: package 'ISLR' was built under R version 3.6.3
#contains the datasets for the model
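If ISLR is not already installed, a one-time install from CRAN fixes the library() call above (not part of the video):

#install.packages("ISLR")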

Load the Smarket data (it ships with the ISLR package) and take a look at it.

names(Smarket)
## [1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
## [7] "Volume"    "Today"     "Direction"
summary(Smarket)
##       Year           Lag1                Lag2          
##  Min.   :2001   Min.   :-4.922000   Min.   :-4.922000  
##  1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500  
##  Median :2003   Median : 0.039000   Median : 0.039000  
##  Mean   :2003   Mean   : 0.003834   Mean   : 0.003919  
##  3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
##  Max.   :2005   Max.   : 5.733000   Max.   : 5.733000  
##       Lag3                Lag4                Lag5         
##  Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.92200  
##  1st Qu.:-0.640000   1st Qu.:-0.640000   1st Qu.:-0.64000  
##  Median : 0.038500   Median : 0.038500   Median : 0.03850  
##  Mean   : 0.001716   Mean   : 0.001636   Mean   : 0.00561  
##  3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.59700  
##  Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.73300  
##      Volume           Today           Direction 
##  Min.   :0.3561   Min.   :-4.922000   Down:602  
##  1st Qu.:1.2574   1st Qu.:-0.639500   Up  :648  
##  Median :1.4229   Median : 0.038500             
##  Mean   :1.4783   Mean   : 0.003138             
##  3rd Qu.:1.6417   3rd Qu.: 0.596750             
##  Max.   :3.1525   Max.   : 5.733000
#?Smarket
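A couple of quick structural checks beyond summary() (the Direction counts above, 602 + 648, account for all the rows):

dim(Smarket) #1250 daily observations of 9 variables
class(Smarket$Direction) #a factor with levels Down/Up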

Create a pairs plot to look at the data.

pairs(Smarket, col=Smarket$Direction)

The colors of the pairs plot correspond to Direction, the variable we will use as our response later in the analysis.

We start with a model including ALL the variables except for “Today”.

glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
            data = Smarket, family = binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Smarket)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.446  -1.203   1.065   1.145   1.326  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3

It seems none of the p-values are significant. That does not mean the predictors are useless; it just means none of them shows a significant individual effect.
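If we only want the p-values, they can be pulled out of the coefficient matrix directly; a small sketch using the fit above:

round(coef(summary(glm.fit))[,"Pr(>|z|)"],3) #fourth column of the coefficient table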

We run predict() on our model with type = "response" and show the first 5 values. These are predicted probabilities of whether the market will be up or down, based on the predictors.

glm.probs=predict(glm.fit,type="response")
glm.probs[1:5]
##         1         2         3         4         5 
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812

We see they are all very close to 50%, which is not surprising for stock market data; strong predictions are not expected.
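The probabilities are for the market going Up, because glm() treats the second factor level as the “success”; a quick check of the coding (assuming the standard Down/Up levels in the ISLR data):

contrasts(Smarket$Direction) #Down = 0, Up = 1, so glm.probs is P(Up)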

We can turn the predicted probabilities into classifications by using ifelse() with a cut-off of 50%.

glm.pred=ifelse(glm.probs>0.5,"Up","Down")

Now we can look at the performance. It is easier to attach the data frame so the variables are available by name.

attach(Smarket)
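An alternative to attach() is with(), which looks variables up in the data frame without modifying the search path; a small sketch (not how the video does it, but it gives the same numbers):

with(Smarket, mean(glm.pred==Direction)) #same accuracy, no attach needed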

We can make a table of our predicted classes (Up/Down) against the true direction and compute the mean accuracy.

table(glm.pred,Direction)
##         Direction
## glm.pred Down  Up
##     Down  145 141
##     Up    457 507
mean(glm.pred==Direction)
## [1] 0.5216

We see in the table that there are lots of counts on the off-diagonals, which is where we misclassify (the diagonals are where we classify correctly). The mean is computed by checking where the predicted class equals the actual direction and averaging those matches. We are “slightly” better than chance.
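The same accuracy can be read straight off the confusion matrix: the diagonal entries are the correct classifications (a small sketch, reusing the table above):

tab=table(glm.pred,Direction)
sum(diag(tab))/sum(tab) #(145+507)/1250 = 0.5216, identical to mean(glm.pred==Direction)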

We may have overfit the training data. We need to divide the data into a training set and a test set.

Make training and test sets.

#setting the cut off point at 2005
train = Year<2005
#refit our data on the training set
glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
            data = Smarket, family = binomial, subset=train) #only the observations where train = TRUE, i.e. years before 2005
glm.probs=predict(glm.fit,newdata=Smarket[!train,], type = "response") #predict on the 2005 data NOT part of the training set; don't forget the comma in [!train,]
glm.pred=ifelse(glm.probs>0.5,"Up","Down") #again convert the probabilities to Up/Down classes
Direction.2005=Smarket$Direction[!train] #the true Direction values for the test years (2005)
table(glm.pred,Direction.2005) #confusion matrix on the test data
##         Direction.2005
## glm.pred Down Up
##     Down   77 97
##     Up     34 44
mean(glm.pred==Direction.2005) #test-set accuracy
## [1] 0.4801587

We have done slightly worse than 50%, so worse than the null rate. This is evidence of overfitting.
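As a sanity check on the null rate: a classifier that always predicts Up would score the fraction of Up days in the 2005 test data (141 of 252, from the column totals of the table above):

table(Direction.2005) #111 Down, 141 Up in 2005
mean(Direction.2005=="Up") #roughly 0.56, the no-information benchmark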

Fit a smaller model, using only the first two lag variables.

glm.fit=glm(Direction~Lag1+Lag2,
            data=Smarket,family = binomial, subset = train)
glm.probs=predict(glm.fit,newdata=Smarket[!train,], type = "response")
glm.pred=ifelse(glm.probs>0.5,"Up", "Down")
table(glm.pred,Direction.2005)
##         Direction.2005
## glm.pred Down  Up
##     Down   35  35
##     Up     76 106
mean(glm.pred==Direction.2005)
## [1] 0.5595238

Did anything become significant now that we have a smaller model?

summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Smarket, 
##     subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.345  -1.188   1.074   1.164   1.326  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.03222    0.06338   0.508    0.611
## Lag1        -0.05562    0.05171  -1.076    0.282
## Lag2        -0.04449    0.05166  -0.861    0.389
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1383.3  on 997  degrees of freedom
## Residual deviance: 1381.4  on 995  degrees of freedom
## AIC: 1387.4
## 
## Number of Fisher Scoring iterations: 3

Nope, but our performance at predicting the direction improved.
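For reference, the improved accuracy can be read straight from the confusion matrix above:

(35+106)/(35+35+76+106) #141/252 = 0.5595, up from 0.4802 with the six-predictor model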