This paper is a walk-through of the methodology involved in carrying out an analysis with logistic regression and a few related classification and resampling methods. The exercises and much of the interpretation are taken from the book An Introduction to Statistical Learning. The objective of this paper is simply to learn more about logistic regression.
This will be done using the stock market (Smarket) data set in the ISLR package. The data set and the variables involved are shown below.
library(ISLR)
names(Smarket)
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
summary(Smarket)
## Year Lag1 Lag2 Lag3
## Min. :2001 Min. :-4.922 Min. :-4.922 Min. :-4.922
## 1st Qu.:2002 1st Qu.:-0.640 1st Qu.:-0.640 1st Qu.:-0.640
## Median :2003 Median : 0.039 Median : 0.039 Median : 0.038
## Mean :2003 Mean : 0.004 Mean : 0.004 Mean : 0.002
## 3rd Qu.:2004 3rd Qu.: 0.597 3rd Qu.: 0.597 3rd Qu.: 0.597
## Max. :2005 Max. : 5.733 Max. : 5.733 Max. : 5.733
## Lag4 Lag5 Volume Today
## Min. :-4.922 Min. :-4.922 Min. :0.356 Min. :-4.922
## 1st Qu.:-0.640 1st Qu.:-0.640 1st Qu.:1.257 1st Qu.:-0.640
## Median : 0.038 Median : 0.038 Median :1.423 Median : 0.038
## Mean : 0.002 Mean : 0.006 Mean :1.478 Mean : 0.003
## 3rd Qu.: 0.597 3rd Qu.: 0.597 3rd Qu.:1.642 3rd Qu.: 0.597
## Max. : 5.733 Max. : 5.733 Max. :3.152 Max. : 5.733
## Direction
## Down:602
## Up :648
##
##
##
##
Let us first check whether there are any correlations between the variables. The cor() function in R does this; the ninth column, Direction, is excluded because it is qualitative.
cor(Smarket[,-9])
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume
## Year 1.00000 0.029700 0.030596 0.033195 0.035689 0.029788 0.53901
## Lag1 0.02970 1.000000 -0.026294 -0.010803 -0.002986 -0.005675 0.04091
## Lag2 0.03060 -0.026294 1.000000 -0.025897 -0.010854 -0.003558 -0.04338
## Lag3 0.03319 -0.010803 -0.025897 1.000000 -0.024051 -0.018808 -0.04182
## Lag4 0.03569 -0.002986 -0.010854 -0.024051 1.000000 -0.027084 -0.04841
## Lag5 0.02979 -0.005675 -0.003558 -0.018808 -0.027084 1.000000 -0.02200
## Volume 0.53901 0.040910 -0.043383 -0.041824 -0.048414 -0.022002 1.00000
## Today 0.03010 -0.026155 -0.010250 -0.002448 -0.006900 -0.034860 0.01459
## Today
## Year 0.030095
## Lag1 -0.026155
## Lag2 -0.010250
## Lag3 -0.002448
## Lag4 -0.006900
## Lag5 -0.034860
## Volume 0.014592
## Today 1.000000
The correlations between the lag variables and today's returns are negligible; the only substantial correlation is between Year and Volume. Let us plot the daily volumes. The plot shows that the volume of shares traded has increased over time.
attach(Smarket)
plot(Volume)
Let us fit a logistic regression model of Direction on all the predictors.
sharefit <- glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family="binomial")
summary(sharefit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = "binomial", data = Smarket)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.45 -1.20 1.07 1.15 1.33
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.12600 0.24074 -0.52 0.60
## Lag1 -0.07307 0.05017 -1.46 0.15
## Lag2 -0.04230 0.05009 -0.84 0.40
## Lag3 0.01109 0.04994 0.22 0.82
## Lag4 0.00936 0.04997 0.19 0.85
## Lag5 0.01031 0.04951 0.21 0.83
## Volume 0.13544 0.15836 0.86 0.39
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1742
##
## Number of Fisher Scoring iterations: 3
As seen from the summary, all the p-values are fairly high, so there is no clear evidence of a real association between the predictors and Direction.
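To pull out just the coefficients or the p-values programmatically, the summary object can be indexed directly; a small sketch using the sharefit object from above:
# Coefficient estimates alone
coef(sharefit)
# The p-value column of the coefficient table (column 4 of summary()$coefficients)
summary(sharefit)$coefficients[, 4]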
Let us use the fitted model to predict on the training data.
sharepred <- predict(sharefit, type = "response")  # no newdata supplied, so these are fitted probabilities on the training data
sharepred[1:10]
## 1 2 3 4 5 6 7 8 9 10
## 0.5071 0.4815 0.4811 0.5152 0.5108 0.5070 0.4927 0.5092 0.5176 0.4888
contrasts(Smarket$Direction)
## Up
## Down 0
## Up 1
These fitted probabilities do not by themselves indicate whether the prediction is Up or Down. Let us convert them with the following code. The data set has 1,250 observations, so we create a vector of class labels of that length, assign "Up" whenever the predicted probability is at least 0.5 and "Down" otherwise, and finally compare the predictions to the actual direction to calculate the error rate.
sharepred.class <- rep("Down", 1250)
sharepred.class[sharepred >= .5] <- "Up"
table(sharepred.class, Smarket$Direction)
##
##        Down  Up
##   Down  145 141
##   Up    457 507
Accuracy <- (145+507)/1250
Accuracy
## [1] 0.5216
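Equivalently, the training accuracy and error rate can be computed directly with mean(); a quick sketch using the sharepred.class vector created above, which should reproduce the same 52% figure:
mean(sharepred.class == Smarket$Direction)   # overall training accuracy
mean(sharepred.class != Smarket$Direction)   # training error rate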
As seen from these figures, the accuracy is only about 52%. In reality it will be lower still, since everything so far has been evaluated on the training set.
Let us divide our data into training and test sets.
train <- (Year < 2005)
Share2005 <- Smarket[!train,]
Direction2005 <- Direction[!train]
model2005 <- glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family="binomial",subset=train)
pred2005 <- predict(model2005,Share2005,type="response")
Let us check the accuracy of this new model on the held-out 2005 data.
length(Direction2005)
## [1] 252
temppred <- rep("Down",252)
temppred[pred2005 > .5] <- "Up"
table(temppred,Direction2005)
## Direction2005
## temppred Down Up
## Down 77 97
## Up 34 44
mean(temppred==Direction2005)
## [1] 0.4802
mean(temppred != Direction2005)
## [1] 0.5198
As you can see, the model's performance on the test set is quite poor: the error rate of about 52% is worse than random guessing.
Let us rework the model a bit and exclude the predictors with very high p-values, keeping only Lag1 and Lag2.
sharemod2 <- glm(Direction~Lag1+Lag2,family="binomial",data=Smarket,subset=train)
mod2pred <- predict(sharemod2,Share2005,type="response")
mod2check <- rep("Down", 252)
mod2check[mod2pred > .5] <- "Up"
table(mod2check,Direction2005)
## Direction2005
## mod2check Down Up
## Down 35 35
## Up 76 106
mean(mod2check==Direction2005)
## [1] 0.5595
mean(mod2check != Direction2005)
## [1] 0.4405
As seen above, the model's performance has improved quite a bit: about 56% of the daily movements are now predicted correctly.
Let us try predicting for some specific values of Lag1 and Lag2.
predict(sharemod2,newdata=data.frame(Lag1 = c(1.2,1.5),Lag2=c(1.1,-0.8)),type="response")
## 1 2
## 0.4791 0.4961
Let us now try to learn the dynamics of linear discriminant analysis (LDA). LDA is closely related to logistic regression. Let us fit an LDA model to the stock market data; the model is fitted with the lda() function, which is part of the MASS library.
library(MASS)
ldamod1 <- lda(Direction~Lag1+Lag2,data=Smarket,subset=train)
ldamod1
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.492 0.508
##
## Group means:
## Lag1 Lag2
## Down 0.04279 0.03389
## Up -0.03955 -0.03133
##
## Coefficients of linear discriminants:
## LD1
## Lag1 -0.6420
## Lag2 -0.5135
plot(ldamod1)
Let us try to learn more about the output of the LDA model.
The prior probabilities are the proportions of each class of the response variable within the training data. According to these figures, about 49% of the training observations correspond to days when the market went down and 51% to days when it went up.
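These priors are nothing more than the class proportions in the training portion of the data; a quick check, assuming the train indicator defined earlier:
# Prior probabilities are simply the training-set class frequencies
table(Smarket$Direction[train]) / sum(train)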
The group means give the average of each predictor within each class. Looking at them, the previous two days' returns tend to be positive on days when the market goes down, and negative on days when the market goes up.
Finally, the coefficients of linear discriminants are the multipliers applied to each predictor to form the decision rule. If -0.642 x Lag1 - 0.514 x Lag2 is large, the LDA classifier predicts that the market will go up, and vice versa.
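As a rough check of this rule (a sketch, assuming the ldamod1 and Share2005 objects from above), the raw discriminant scores can be obtained by multiplying the predictors by the scaling coefficients; note that predict() also centres the predictors first, so its x values differ from these raw scores only by a constant shift.
# Raw linear discriminant scores: test-set predictors times the scaling coefficients
ld.raw <- as.matrix(Share2005[, c("Lag1", "Lag2")]) %*% ldamod1$scaling
head(ld.raw)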
stockldapred <- predict(ldamod1,Share2005)
names(stockldapred)
## [1] "class" "posterior" "x"
The object returned by the predict() function consists of three elements worth discussing (a quick inspection follows after this list):
class - the predicted class for each observation in the test data.
posterior - a matrix with the posterior probability of each class for every observation.
x - the linear discriminant score computed for each observation.
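A quick way to inspect these elements, using the stockldapred object created above:
# Peek at the first few entries of each element of the prediction object
head(stockldapred$class)       # predicted class labels
head(stockldapred$posterior)   # posterior probabilities for Down and Up
head(stockldapred$x)           # linear discriminant scores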
The posterior probabilities can be used to do the classification directly. Let us take a threshold of 0.5 and use the posterior probabilities in the first column, which corresponds to the market going down.
ldapostpred1 <- sum(stockldapred$posterior[,1] >= .5)
ldapostpred2 <- sum(stockldapred$posterior[,1] < .5)
ldapostpred1
## [1] 70
ldapostpred2
## [1] 182
ldapostpred3 <- sum(stockldapred$posterior[,2] >= .5)
ldapostpred4 <- sum(stockldapred$posterior[,2] < .5)
ldapostpred3
## [1] 182
ldapostpred4
## [1] 70
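These counts can be cross-checked against the class element, which for two classes applies the same 0.5 threshold; a short sketch using the objects above:
# The class element uses the same 0.5 threshold, so its confusion matrix
# against the true 2005 directions should be consistent with the counts above
table(stockldapred$class, Direction2005)
mean(stockldapred$class == Direction2005)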
Quadratic discriminant analysis (QDA) is carried out in exactly the same way as LDA, using the qda() function from MASS. Let us fit it to the same training data.
stockqdamod <- qda(Direction~Lag1+Lag2,data=Smarket,subset=train)
stockqdamod
## Call:
## qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.492 0.508
##
## Group means:
## Lag1 Lag2
## Down 0.04279 0.03389
## Up -0.03955 -0.03133
stockqdapred <- predict(stockqdamod,Share2005)$class
table(stockqdapred,Direction2005)
## Direction2005
## stockqdapred Down Up
## Down 30 20
## Up 81 121
mean(stockqdapred==Direction2005)
## [1] 0.5992
We can see that the prediction accuracy has improved considerably with the quadratic model, to about 60%.
Unlike the other models, KNN does not have separate functions for fitting and predicting; both are done with a single command. The knn() function takes four inputs: 1. the matrix of predictors for the training data, 2. the matrix of predictors for the test data, 3. the vector of training responses, and 4. the value of K, the number of nearest neighbours.
Let us prepare the data accordingly.
train.X = cbind(Lag1,Lag2)[train,]
test.X = cbind(Lag1,Lag2)[!train,]
train.Direction = Direction[train]
Let us fit the model now. The knn() function requires the "class" library.
library(class)
set.seed(123)
stockknn1 <- knn(train.X,test.X,train.Direction,k=1)
table(stockknn1,Direction2005)
## Direction2005
## stockknn1 Down Up
## Down 43 58
## Up 68 83
mean(stockknn1==Direction2005)
## [1] 0.5
stockknn2 <- knn(train.X,test.X,train.Direction,k=3)
table(stockknn2,Direction2005)
## Direction2005
## stockknn2 Down Up
## Down 48 55
## Up 63 86
mean(stockknn2==Direction2005)
## [1] 0.5317
Not a very impressive result after all. However, increasing K from 1 to 3 improves the accuracy marginally, from 50% to about 53%.
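To explore this more systematically, one could loop over a few values of K and compare test accuracies; a sketch, assuming the training and test objects defined above:
# Try several neighbourhood sizes and print the test accuracy for each
for (k in c(1, 3, 5, 7)) {
  set.seed(123)
  knn.pred <- knn(train.X, test.X, train.Direction, k = k)
  cat("K =", k, "accuracy =", round(mean(knn.pred == Direction2005), 4), "\n")
}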
KNN can also be applied on a much larger scale, with many more variables. We will now try this with the Caravan data set in the ISLR package.
library(ISLR)
dim(Caravan)
## [1] 5822 86
The variable "Purchase", which indicates whether a customer buys a caravan insurance policy, is our response variable.
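Before modelling, it is worth tabulating the response to see how the two classes are balanced; a quick check:
# Class balance of the response variable
table(Caravan$Purchase)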
Because KNN relies on distances between observations, we first standardise the predictors so that every column has mean zero and standard deviation one. This can be done with the scale() function in R; the response in column 86 is excluded.
standard.X <- scale(Caravan[,-86])
test <- 1:1000
train.X = standard.X[-test,]
test.X = standard.X[test,]
train.Y = Caravan[-test,86]
test.Y=Caravan[test,86]
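As a sanity check, every column of standard.X should now have variance one, regardless of the original units; a short sketch comparing the first two predictors before and after standardisation:
# Variances before standardisation
var(Caravan[, 1]); var(Caravan[, 2])
# Variances after standardisation (both should be 1)
var(standard.X[, 1]); var(standard.X[, 2])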
set.seed(123)
knnpredict1 <- knn(train.X,test.X,train.Y,k=1)
table(knnpredict1,test.Y)
## test.Y
## knnpredict1 No Yes
## No 874 49
## Yes 67 10
mean(knnpredict1!= test.Y)
## [1] 0.116
set.seed(123)
knnpredict2 <- knn(train.X,test.X,train.Y,k=5)
table(knnpredict2,test.Y)
## test.Y
## knnpredict2 No Yes
## No 930 55
## Yes 11 4
mean(knnpredict2!= test.Y)
## [1] 0.066
Along with the classification exercises, let us also attempt some resampling-method exercises. The resampling methods covered here are the validation-set approach, leave-one-out cross-validation (LOOCV) and k-fold cross-validation.
Let us use the Auto data set in ISLR for our re-sampling exercise
library(ISLR)
set.seed(1)
train = sample(392,196)
Let us fit a linear model to the Auto data and calculate the test MSE on the held-out half.
attach(Auto)
autolm1 <- lm(mpg~horsepower,data=Auto,subset=train)
mean((mpg-predict(autolm1,Auto))[-train]^2)
## [1] 26.14
Let us try the same by fitting some polynomial models and finding the error rates
autolm2 <- lm(mpg~poly(horsepower,2),data=Auto,subset=train)
mean((mpg-predict(autolm2,Auto))[-train]^2)
## [1] 19.82
autolm3 <- lm(mpg~poly(horsepower,3),data=Auto,subset=train)
mean((mpg-predict(autolm3,Auto))[-train]^2)
## [1] 19.78
Let us repeat the exercise with a different training set and check the variability of the estimates.
set.seed(2)
train = sample(392,196)
autolm4 <- lm(mpg~horsepower,data=Auto,subset=train)
mean((mpg-predict(autolm4,Auto))[-train]^2)
## [1] 23.3
autolm5 <- lm(mpg~poly(horsepower,2),data=Auto,subset=train)
mean((mpg-predict(autolm5,Auto))[-train]^2)
## [1] 18.9
autolm6 <- lm(mpg~poly(horsepower,3),data=Auto,subset=train)
mean((mpg-predict(autolm6,Auto))[-train]^2)
## [1] 19.26
The results are consistent across the two splits and suggest that the quadratic fit is the best of the three models; the cubic term adds little or nothing.
Leave-one-out cross-validation (LOOCV) can be carried out using the cv.glm() function, which is part of the "boot" library. We will try this out for polynomials of various degrees.
library(boot)
automodel1 <- glm(mpg~horsepower,data=Auto)
model1cv <- cv.glm(Auto,automodel1)
model1cv$delta
## [1] 24.23 24.23
The delta component gives two cross-validation estimates of the test error: the first is the raw LOOCV estimate of the MSE and the second is a bias-corrected version; under LOOCV the two are essentially identical.
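To make the two components explicit, a small sketch reading them individually from the model1cv object above:
# delta[1]: raw LOOCV estimate of the test MSE
# delta[2]: bias-corrected estimate (essentially identical to delta[1] under LOOCV)
model1cv$delta[1]
model1cv$delta[2]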
model2cv <- rep(0,5)
for(i in 1:5) {
automodel2 <- glm(mpg~poly(horsepower,i),data=Auto)
model2cv[i] <- cv.glm(Auto,automodel2)$delta[1]
}
model2cv
## [1] 24.23 19.25 19.33 19.42 19.03
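Plotting these errors against the polynomial degree makes the sharp drop from the linear to the quadratic fit easy to see; a plotting sketch using the model2cv vector above:
# Visualise the LOOCV error as a function of polynomial degree
plot(1:5, model2cv, type = "b",
     xlab = "Degree of polynomial", ylab = "LOOCV estimate of test MSE")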
K-fold cross-validation works in the same way but is much faster than LOOCV, because the model is refitted only K times. A common choice is K = 10, passed via the K argument of cv.glm().
set.seed(17)
model3kf <- rep(0,10)
for(i in 1:10){
model3 <- glm(mpg~poly(horsepower,i),data=Auto)
model3kf[i] <- cv.glm(Auto,model3,K=10)$delta[1]
}
model3kf
## [1] 24.21 19.19 19.31 19.34 18.88 19.02 18.90 19.71 18.95 19.50