Fitting logistic regression models with the glm() function.
URL: https://www.youtube.com/watch?v=TxvEVc8YNlU&feature=youtu.be&list=PL5-da3qGB5IC4vaDba5ClatUmFppXLAhE
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.6.3
#contains the datasets for the model
Import the data and take a look at it.
names(Smarket)
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
summary(Smarket)
##       Year           Lag1                Lag2
##  Min.   :2001   Min.   :-4.922000   Min.   :-4.922000
##  1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500
##  Median :2003   Median : 0.039000   Median : 0.039000
##  Mean   :2003   Mean   : 0.003834   Mean   : 0.003919
##  3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750
##  Max.   :2005   Max.   : 5.733000   Max.   : 5.733000
##       Lag3                Lag4                Lag5
##  Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.92200
##  1st Qu.:-0.640000   1st Qu.:-0.640000   1st Qu.:-0.64000
##  Median : 0.038500   Median : 0.038500   Median : 0.03850
##  Mean   : 0.001716   Mean   : 0.001636   Mean   : 0.00561
##  3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.59700
##  Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.73300
##      Volume           Today           Direction
##  Min.   :0.3561   Min.   :-4.922000   Down:602
##  1st Qu.:1.2574   1st Qu.:-0.639500   Up  :648
##  Median :1.4229   Median : 0.038500
##  Mean   :1.4783   Mean   : 0.003138
##  3rd Qu.:1.6417   3rd Qu.: 0.596750
##  Max.   :3.1525   Max.   : 5.733000
#?Smarket
Create a pairs plot to look at the data.
pairs(Smarket, col=Smarket$Direction)
The colors in the pairs plot encode Direction, which we are going to use as our response variable later in the analysis.
We start with a model that includes all five lag variables and Volume, leaving out Year and Today.
glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
data = Smarket, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Smarket)
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -1.446  -1.203   1.065   1.145   1.326
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
None of the p-values are significant. This does not mean the predictors are useless, only that there is no evidence of an effect at the usual significance levels.
We run predict() on our model with type = "response" and show the first 5 values. Each value is the fitted probability that the market goes Up, given the predictors.
glm.probs=predict(glm.fit,type="response")
glm.probs[1:5]
## 1 2 3 4 5
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812
We see they are very close to 50%, which is not surprising as this is stock market data. Strong predictions are not expected.
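To make the meaning of type = "response" concrete, here is a minimal sketch on simulated data. The names x, y and fit are made up for illustration and are not part of the Smarket analysis. It checks that the response-scale predictions are the inverse logit of the link-scale (log-odds) predictions.

```r
# Simulated data stand in for Smarket so the sketch runs without ISLR (assumption).
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(0.3 * x))
fit <- glm(y ~ x, family = binomial)

# type = "link" returns the linear predictor (log-odds);
# type = "response" returns the fitted probabilities.
log_odds <- predict(fit, type = "link")
probs <- predict(fit, type = "response")

# The two agree once the inverse logit, plogis(), is applied to the log-odds.
same <- all.equal(unname(probs), unname(plogis(log_odds)))
same
```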
We can turn the probabilities into class labels with ifelse(), using 0.5 as the threshold: anything above 0.5 is classified as Up.
glm.pred=ifelse(glm.probs>0.5,"Up","Down")
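As a small illustration of the thresholding step, the sketch below applies the same ifelse() call to the five probabilities printed earlier, rounded to three decimals. The vector name probs is only for this example.

```r
# First five fitted probabilities from the output above, rounded (illustrative copy).
probs <- c(0.507, 0.481, 0.481, 0.515, 0.511)

# Strictly greater than 0.5 becomes "Up"; everything else, including exactly 0.5, becomes "Down".
pred <- ifelse(probs > 0.5, "Up", "Down")
pred
# "Up" "Down" "Down" "Up" "Up"
```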
Now we can look at the performance. Attaching the data frame makes this easier, so the variables are available by name.
attach(Smarket)
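attach() is convenient in an interactive session, but attached columns can be masked by other objects with the same names. As a minimal sketch of an alternative, the example below uses $ and with() on a toy data frame. The objects toy and pred are hypothetical and only stand in for Smarket and the model predictions.

```r
# Hypothetical toy data frame standing in for Smarket (assumption).
toy <- data.frame(
  Year = c(2004, 2004, 2005, 2005),
  Direction = factor(c("Up", "Down", "Up", "Up"), levels = c("Down", "Up"))
)
pred <- c("Up", "Up", "Down", "Up")

# Refer to the column explicitly with $ ...
tab <- table(pred, toy$Direction)

# ... or evaluate an expression inside the data frame with with().
acc <- with(toy, mean(pred == Direction))
acc
# 0.5
```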
We can make a table of our predictions (Up and Down) against the true direction and compute the proportion that are correct.
table(glm.pred,Direction)
##         Direction
## glm.pred Down  Up
##     Down  145 141
##     Up    457 507
mean(glm.pred==Direction)
## [1] 0.5216
The table has many entries on the off-diagonals, which is where we misclassify (the diagonal is where we classify correctly). The mean is the proportion of observations where the predicted class equals the actual direction. We are only slightly better than chance.
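The link between the table and the mean can be checked by hand. The sketch below re-enters the counts from the table above into a matrix (the name conf is just for this example) and recomputes the accuracy from the diagonal. It also computes, for comparison, the share of days that were actually Up, which is what a rule that always predicts Up would score.

```r
# Counts copied from the confusion table above; rows are predictions, columns are truth.
conf <- matrix(c(145, 457, 141, 507), nrow = 2,
               dimnames = list(glm.pred = c("Down", "Up"),
                               Direction = c("Down", "Up")))

# Correct classifications sit on the diagonal.
accuracy <- sum(diag(conf)) / sum(conf)

# Accuracy of a rule that always predicts "Up" (share of true Up days).
always_up <- colSums(conf)[["Up"]] / sum(conf)

c(accuracy = accuracy, always_up = always_up)
# accuracy 0.5216, always_up 0.5184
```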
We may have overfit, since we measured performance on the same data we trained on. We need to divide the data into a training set and a test set.
Make training and test set
#training set: observations before 2005
train = Year<2005
#refit the model on the training set
glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
data = Smarket, family = binomial, subset=train) #only observations where train is TRUE (Year before 2005)
glm.probs=predict(glm.fit,newdata=Smarket[!train,], type = "response") #predict on the 2005 data, which is NOT part of the training set; don't forget the comma in Smarket[!train,]
glm.pred=ifelse(glm.probs>0.5,"Up","Down") #convert the probabilities to Up/Down labels again
Direction.2005=Smarket$Direction[!train] #true directions for the test set (2005)
table(glm.pred,Direction.2005) #predictions against the true test-set directions
##         Direction.2005
## glm.pred Down Up
##     Down   77 97
##     Up     34 44
mean(glm.pred==Direction.2005) #test-set accuracy
## [1] 0.4801587
We are below 50%, so worse than the null rate. This suggests we may be overfitting.
Fit a smaller model, using only the first two lag variables.
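The train-then-test workflow can be sketched end to end on simulated data. Everything here is an assumption made for illustration: the data frame sim, its columns, and the helper acc() are not part of the Smarket analysis, and the response is generated as pure noise so that no real signal exists. The sketch splits by year, fits on the earlier years, and reports accuracy on both parts.

```r
# Simulated data (assumption): 5 "years" of 100 observations, response unrelated to predictors.
set.seed(42)
n <- 500
sim <- data.frame(year = rep(2001:2005, each = n / 5),
                  x1 = rnorm(n), x2 = rnorm(n))
sim$y <- factor(ifelse(runif(n) > 0.5, "Up", "Down"), levels = c("Down", "Up"))

# Chronological split: train on years before 2005, test on 2005.
train <- sim$year < 2005
fit <- glm(y ~ x1 + x2, data = sim, family = binomial, subset = train)

# Hypothetical helper: accuracy of the 0.5-threshold classifier on the given rows.
acc <- function(rows) {
  p <- predict(fit, newdata = sim[rows, ], type = "response")
  mean(ifelse(p > 0.5, "Up", "Down") == sim$y[rows])
}

results <- c(train = acc(train), test = acc(!train))
results
```

Comparing the two numbers side by side is the point of the split: training accuracy tends to be optimistic, while test accuracy is the fairer estimate.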
glm.fit=glm(Direction~Lag1+Lag2,
data=Smarket,family = binomial, subset = train)
glm.probs=predict(glm.fit,newdata=Smarket[!train,], type = "response")
glm.pred=ifelse(glm.probs>0.5,"Up", "Down")
table(glm.pred,Direction.2005)
##         Direction.2005
## glm.pred Down  Up
##     Down   35  35
##     Up     76 106
mean(glm.pred==Direction.2005)
## [1] 0.5595238
Did anything become significant now that we have a smaller model?
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Smarket,
## subset = train)
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -1.345  -1.188   1.074   1.164   1.326
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.03222    0.06338   0.508    0.611
## Lag1        -0.05562    0.05171  -1.076    0.282
## Lag2        -0.04449    0.05166  -0.861    0.389
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1383.3 on 997 degrees of freedom
## Residual deviance: 1381.4 on 995 degrees of freedom
## AIC: 1387.4
##
## Number of Fisher Scoring iterations: 3
Nope, but our performance at predicting the direction on the test set improved.
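To put the improvement in context, the sketch below re-enters the counts from the smaller model's test table by hand (the name conf is just for this example) and computes three quantities: overall accuracy, accuracy on the days the model predicted Up, and the share of 2005 days that were actually Up. On these counts the overall accuracy matches the share of Up days, so the gain is best read cautiously.

```r
# Counts copied from the smaller model's test table; rows are predictions, columns are truth.
conf <- matrix(c(35, 76, 35, 106), nrow = 2,
               dimnames = list(glm.pred = c("Down", "Up"),
                               Direction.2005 = c("Down", "Up")))

# Overall test accuracy: (35 + 106) / 252.
accuracy <- sum(diag(conf)) / sum(conf)

# Accuracy restricted to the days the model predicted "Up": 106 / (76 + 106).
up_precision <- conf["Up", "Up"] / sum(conf["Up", ])

# Share of test days that were actually "Up": (35 + 106) / 252.
always_up <- colSums(conf)[["Up"]] / sum(conf)

round(c(accuracy = accuracy, up_precision = up_precision, always_up = always_up), 4)
# accuracy 0.5595, up_precision 0.5824, always_up 0.5595
```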