This paper is a walk-through of the methodology involved in carrying out an analysis with logistic regression and a few related classification and resampling methods. The exercises and much of the interpretation are taken from the book An Introduction to Statistical Learning. The objective of this paper is simply to learn more about logistic regression.
This will be done using the stock market (Smarket) data set in the ISLR package. The data set and the variables involved are shown below.
library(ISLR)
names(Smarket)
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
summary(Smarket)
## Year Lag1 Lag2 Lag3
## Min. :2001 Min. :-4.922 Min. :-4.922 Min. :-4.922
## 1st Qu.:2002 1st Qu.:-0.640 1st Qu.:-0.640 1st Qu.:-0.640
## Median :2003 Median : 0.039 Median : 0.039 Median : 0.038
## Mean :2003 Mean : 0.004 Mean : 0.004 Mean : 0.002
## 3rd Qu.:2004 3rd Qu.: 0.597 3rd Qu.: 0.597 3rd Qu.: 0.597
## Max. :2005 Max. : 5.733 Max. : 5.733 Max. : 5.733
## Lag4 Lag5 Volume Today
## Min. :-4.922 Min. :-4.922 Min. :0.356 Min. :-4.922
## 1st Qu.:-0.640 1st Qu.:-0.640 1st Qu.:1.257 1st Qu.:-0.640
## Median : 0.038 Median : 0.038 Median :1.423 Median : 0.038
## Mean : 0.002 Mean : 0.006 Mean :1.478 Mean : 0.003
## 3rd Qu.: 0.597 3rd Qu.: 0.597 3rd Qu.:1.642 3rd Qu.: 0.597
## Max. : 5.733 Max. : 5.733 Max. :3.152 Max. : 5.733
## Direction
## Down:602
## Up :648
##
##
##
##
Let us first check whether there are any correlations between the variables. The cor() function in R does this; the ninth column, Direction, is excluded because it is qualitative.
cor(Smarket[,-9])
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume
## Year 1.00000 0.029700 0.030596 0.033195 0.035689 0.029788 0.53901
## Lag1 0.02970 1.000000 -0.026294 -0.010803 -0.002986 -0.005675 0.04091
## Lag2 0.03060 -0.026294 1.000000 -0.025897 -0.010854 -0.003558 -0.04338
## Lag3 0.03319 -0.010803 -0.025897 1.000000 -0.024051 -0.018808 -0.04182
## Lag4 0.03569 -0.002986 -0.010854 -0.024051 1.000000 -0.027084 -0.04841
## Lag5 0.02979 -0.005675 -0.003558 -0.018808 -0.027084 1.000000 -0.02200
## Volume 0.53901 0.040910 -0.043383 -0.041824 -0.048414 -0.022002 1.00000
## Today 0.03010 -0.026155 -0.010250 -0.002448 -0.006900 -0.034860 0.01459
## Today
## Year 0.030095
## Lag1 -0.026155
## Lag2 -0.010250
## Lag3 -0.002448
## Lag4 -0.006900
## Lag5 -0.034860
## Volume 0.014592
## Today 1.000000
The correlations between the lag variables and today's returns are negligible; the only substantial correlation is between Year and Volume. Let us plot the daily volumes. The plot shows that the volume of shares traded has increased over time.
attach(Smarket)
plot(Volume)
Let us fit a logistic regression model of Direction on all the predictors.
sharefit <- glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family="binomial")
summary(sharefit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = "binomial", data = Smarket)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.45 -1.20 1.07 1.15 1.33
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.12600 0.24074 -0.52 0.60
## Lag1 -0.07307 0.05017 -1.46 0.15
## Lag2 -0.04230 0.05009 -0.84 0.40
## Lag3 0.01109 0.04994 0.22 0.82
## Lag4 0.00936 0.04997 0.19 0.85
## Lag5 0.01031 0.04951 0.21 0.83
## Volume 0.13544 0.15836 0.86 0.39
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1742
##
## Number of Fisher Scoring iterations: 3
As seen from the summary, all the p-values are fairly high, so there is no clear evidence of a real association between the predictors and Direction.
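To pull out just the coefficients or the p-values programmatically, the summary object can be indexed directly; a small sketch using the sharefit object from above:
# Coefficient estimates alone
coef(sharefit)
# The p-value column of the coefficient table (column 4 of summary()$coefficients)
summary(sharefit)$coefficients[, 4]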
Let us use the fitted model to predict on the training data.
sharepred <- predict(sharefit, type = "response")  # no newdata supplied, so these are fitted probabilities on the training data
sharepred[1:10]
## 1 2 3 4 5 6 7 8 9 10
## 0.5071 0.4815 0.4811 0.5152 0.5108 0.5070 0.4927 0.5092 0.5176 0.4888
contrasts(Smarket$Direction)
## Up
## Down 0
## Up 1
These fitted probabilities do not by themselves indicate whether the prediction is Up or Down. Let us convert them with the following code. The data set has 1,250 observations, so we create a vector of class labels of that length, assign "Up" whenever the predicted probability is at least 0.5 and "Down" otherwise, and finally compare the predictions to the actual direction to calculate the error rate.
sharepred.class <- rep("Down", 1250)
sharepred.class[sharepred >= .5] <- "Up"
table(sharepred.class, Smarket$Direction)
##
##        Down  Up
##   Down  145 141
##   Up    457 507
Accuracy <- (145+507)/1250
Accuracy
## [1] 0.5216
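Equivalently, the training accuracy and error rate can be computed directly with mean(); a quick sketch using the sharepred.class vector created above, which should reproduce the same 52% figure:
mean(sharepred.class == Smarket$Direction)   # overall training accuracy
mean(sharepred.class != Smarket$Direction)   # training error rate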
As seen from these figures, the accuracy is only about 52%. In reality it will be lower still, since everything so far has been evaluated on the training set.
Let us divide our data into training and test sets.
train <- (Year < 2005)
Share2005 <- Smarket[!train,]
Direction2005 <- Direction[!train]
model2005 <- glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family="binomial",subset=train)
pred2005 <- predict(model2005,Share2005,type="response")
Let us check the accuracy of this new model on the held-out 2005 data.
length(Direction2005)
## [1] 252
temppred <- rep("Down",252)
temppred[pred2005 > .5] <- "Up"
table(temppred,Direction2005)
## Direction2005
## temppred Down Up
## Down 77 97
## Up 34 44
mean(temppred==Direction2005)
## [1] 0.4802
mean(temppred != Direction2005)
## [1] 0.5198
As you can see, the model's performance on the test set is quite poor: the error rate of about 52% is worse than random guessing.
Let us rework the model a bit and exclude the predictors with very high p-values, keeping only Lag1 and Lag2.
sharemod2 <- glm(Direction~Lag1+Lag2,family="binomial",data=Smarket,subset=train)
mod2pred <- predict(sharemod2,Share2005,type="response")
mod2check <- rep("Down", 252)
mod2check[mod2pred > .5] <- "Up"
table(mod2check,Direction2005)
## Direction2005
## mod2check Down Up
## Down 35 35
## Up 76 106
mean(mod2check==Direction2005)
## [1] 0.5595
mean(mod2check != Direction2005)
## [1] 0.4405
As seen above, the model's performance has improved quite a bit: about 56% of the daily movements are now predicted correctly.
Let us try predicting for some specific values of Lag1 and Lag2.
predict(sharemod2,newdata=data.frame(Lag1 = c(1.2,1.5),Lag2=c(1.1,-0.8)),type="response")
## 1 2
## 0.4791 0.4961
Let us now try to learn the dynamics of linear discriminant analysis (LDA). LDA is closely related to logistic regression. Let us fit an LDA model to the stock market data; the model is fitted with the lda() function, which is part of the MASS library.
library(MASS)
ldamod1 <- lda(Direction~Lag1+Lag2,data=Smarket,subset=train)
ldamod1
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.492 0.508
##
## Group means:
## Lag1 Lag2
## Down 0.04279 0.03389
## Up -0.03955 -0.03133
##
## Coefficients of linear discriminants:
## LD1
## Lag1 -0.6420
## Lag2 -0.5135
plot(ldamod1)
Let us try to learn more about the output of the LDA model.
The prior probabilities are the proportions of each class of the response variable within the training data. According to these figures, about 49% of the training observations correspond to days when the market went down and 51% to days when it went up.
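These priors are nothing more than the class proportions in the training portion of the data; a quick check, assuming the train indicator defined earlier:
# Prior probabilities are simply the training-set class frequencies
table(Smarket$Direction[train]) / sum(train)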
The group means give the average of each predictor within each class. Looking at them, the previous two days' returns tend to be positive on days when the market goes down, and negative on days when the market goes up.
Finally, the coefficients of linear discriminants are the multipliers applied to each predictor to form the decision rule. If -0.642 x Lag1 - 0.514 x Lag2 is large, the LDA classifier predicts that the market will go up, and vice versa.
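As a rough check of this rule (a sketch, assuming the ldamod1 and Share2005 objects from above), the raw discriminant scores can be obtained by multiplying the predictors by the scaling coefficients; note that predict() also centres the predictors first, so its x values differ from these raw scores only by a constant shift.
# Raw linear discriminant scores: test-set predictors times the scaling coefficients
ld.raw <- as.matrix(Share2005[, c("Lag1", "Lag2")]) %*% ldamod1$scaling
head(ld.raw)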
stockldapred <- predict(ldamod1,Share2005)
names(stockldapred)
## [1] "class" "posterior" "x"
The object returned by the predict() function consists of three elements worth discussing (a quick inspection follows after this list):
class - the predicted class for each observation in the test data.
posterior - a matrix with the posterior probability of each class for every observation.
x - the linear discriminant score computed for each observation.
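A quick way to inspect these elements, using the stockldapred object created above:
# Peek at the first few entries of each element of the prediction object
head(stockldapred$class)       # predicted class labels
head(stockldapred$posterior)   # posterior probabilities for Down and Up
head(stockldapred$x)           # linear discriminant scores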
The posterior probabilities can be used to do the classification directly. Let us take a threshold of 0.5 and use the posterior probabilities in the first column, which corresponds to the market going down.
ldapostpred1 <- sum(stockldapred$posterior[,1] >= .5)
ldapostpred2 <- sum(stockldapred$posterior[,1] < .5)
ldapostpred1
## [1] 70
ldapostpred2
## [1] 182
ldapostpred3 <- sum(stockldapred$posterior[,2] >= .5)
ldapostpred4 <- sum(stockldapred$posterior[,2] < .5)
ldapostpred3
## [1] 182
ldapostpred4
## [1] 70
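These counts can be cross-checked against the class element, which for two classes applies the same 0.5 threshold; a short sketch using the objects above:
# The class element uses the same 0.5 threshold, so its confusion matrix
# against the true 2005 directions should be consistent with the counts above
table(stockldapred$class, Direction2005)
mean(stockldapred$class == Direction2005)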
Quadratic discriminant analysis (QDA) is carried out in exactly the same way as LDA, using the qda() function from MASS. Let us fit it to the same training data.
stockqdamod <- qda(Direction~Lag1+Lag2,data=Smarket,subset=train)
stockqdamod
## Call:
## qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.492 0.508
##
## Group means:
## Lag1 Lag2
## Down 0.04279 0.03389
## Up -0.03955 -0.03133
stockqdapred <- predict(stockqdamod,Share2005)$class
table(stockqdapred,Direction2005)
## Direction2005
## stockqdapred Down Up
## Down 30 20
## Up 81 121
mean(stockqdapred==Direction2005)
## [1] 0.5992
We can see that the prediction accuracy has improved considerably with the quadratic model, to about 60%.
Unlike the other models, KNN does not have separate functions for fitting and predicting; both are done with a single command. The knn() function takes four inputs: 1. the matrix of predictors for the training data, 2. the matrix of predictors for the test data, 3. the vector of training responses, and 4. the value of K, the number of nearest neighbours.
Let us prepare the data accordingly.
train.X = cbind(Lag1,Lag2)[train,]
test.X = cbind(Lag1,Lag2)[!train,]
train.Direction = Direction[train]
Let us fit the model now. The knn() function requires the "class" library.
library(class)
set.seed(123)
stockknn1 <- knn(train.X,test.X,train.Direction,k=1)
table(stockknn1,Direction2005)
## Direction2005
## stockknn1 Down Up
## Down 43 58
## Up 68 83
mean(stockknn1==Direction2005)
## [1] 0.5
stockknn2 <- knn(train.X,test.X,train.Direction,k=3)
table(stockknn2,Direction2005)
## Direction2005
## stockknn2 Down Up
## Down 48 55
## Up 63 86
mean(stockknn2==Direction2005)
## [1] 0.5317
Not a very impressive result after all. However, increasing K from 1 to 3 improves the accuracy marginally, from 50% to about 53%.
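To explore this more systematically, one could loop over a few values of K and compare test accuracies; a sketch, assuming the training and test objects defined above:
# Try several neighbourhood sizes and print the test accuracy for each
for (k in c(1, 3, 5, 7)) {
  set.seed(123)
  knn.pred <- knn(train.X, test.X, train.Direction, k = k)
  cat("K =", k, "accuracy =", round(mean(knn.pred == Direction2005), 4), "\n")
}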
KNN can also be applied on a much larger scale, with many more variables. We will now try this with the Caravan data set in the ISLR package.
library(ISLR)
dim(Caravan)
## [1] 5822 86
The variable "Purchase", which indicates whether a customer buys a caravan insurance policy, is our response variable.
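Before modelling, it is worth tabulating the response to see how the two classes are balanced; a quick check:
# Class balance of the response variable
table(Caravan$Purchase)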
Because KNN relies on distances between observations, we first standardise the predictors so that every column has mean zero and standard deviation one. This can be done with the scale() function in R; the response in column 86 is excluded.
standard.X <- scale(Caravan[,-86])
test <- 1:1000
train.X = standard.X[-test,]
test.X = standard.X[test,]
train.Y = Caravan[-test,86]
test.Y=Caravan[test,86]
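As a sanity check, every column of standard.X should now have variance one, regardless of the original units; a short sketch comparing the first two predictors before and after standardisation:
# Variances before standardisation
var(Caravan[, 1]); var(Caravan[, 2])
# Variances after standardisation (both should be 1)
var(standard.X[, 1]); var(standard.X[, 2])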
set.seed(123)
knnpredict1 <- knn(train.X,test.X,train.Y,k=1)
table(knnpredict1,test.Y)
## test.Y
## knnpredict1 No Yes
## No 874 49
## Yes 67 10
mean(knnpredict1!= test.Y)
## [1] 0.116
set.seed(123)
knnpredict2 <- knn(train.X,test.X,train.Y,k=5)
table(knnpredict2,test.Y)
## test.Y
## knnpredict2 No Yes
## No 930 55
## Yes 11 4
mean(knnpredict2!= test.Y)
## [1] 0.066
Along with the classification exercises, let us also attempt some resampling-method exercises. The resampling methods covered here are the validation-set approach, leave-one-out cross-validation (LOOCV) and k-fold cross-validation.
Let us use the Auto data set in ISLR for our re-sampling exercise
library(ISLR)
set.seed(1)
train = sample(392,196)
Let us fit a linear model to the Auto data and calculate the test MSE on the held-out half.
attach(Auto)
autolm1 <- lm(mpg~horsepower,data=Auto,subset=train)
mean((mpg-predict(autolm1,Auto))[-train]^2)
## [1] 26.14
Let us try the same by fitting some polynomial models and finding the error rates
autolm2 <- lm(mpg~poly(horsepower,2),data=Auto,subset=train)
mean((mpg-predict(autolm2,Auto))[-train]^2)
## [1] 19.82
autolm3 <- lm(mpg~poly(horsepower,3),data=Auto,subset=train)
mean((mpg-predict(autolm3,Auto))[-train]^2)
## [1] 19.78
Let us repeat the exercise with a different training set and check the variability of the estimates.
set.seed(2)
train = sample(392,196)
autolm4 <- lm(mpg~horsepower,data=Auto,subset=train)
mean((mpg-predict(autolm4,Auto))[-train]^2)
## [1] 23.3
autolm5 <- lm(mpg~poly(horsepower,2),data=Auto,subset=train)
mean((mpg-predict(autolm5,Auto))[-train]^2)
## [1] 18.9
autolm6 <- lm(mpg~poly(horsepower,3),data=Auto,subset=train)
mean((mpg-predict(autolm6,Auto))[-train]^2)
## [1] 19.26
The results are consistent across the two splits and suggest that the quadratic fit is the best of the three models; the cubic term adds little or nothing.
Leave-one-out cross-validation (LOOCV) can be carried out using the cv.glm() function, which is part of the "boot" library. We will try this out for polynomials of various degrees.
library(boot)
automodel1 <- glm(mpg~horsepower,data=Auto)
model1cv <- cv.glm(Auto,automodel1)
model1cv$delta
## [1] 24.23 24.23
The delta component gives two cross-validation estimates of the test error: the first is the raw LOOCV estimate of the MSE and the second is a bias-corrected version; under LOOCV the two are essentially identical.
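To make the two components explicit, a small sketch reading them individually from the model1cv object above:
# delta[1]: raw LOOCV estimate of the test MSE
# delta[2]: bias-corrected estimate (essentially identical to delta[1] under LOOCV)
model1cv$delta[1]
model1cv$delta[2]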
model2cv <- rep(0,5)
for(i in 1:5) {
automodel2 <- glm(mpg~poly(horsepower,i),data=Auto)
model2cv[i] <- cv.glm(Auto,automodel2)$delta[1]
}
model2cv
## [1] 24.23 19.25 19.33 19.42 19.03
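Plotting these errors against the polynomial degree makes the sharp drop from the linear to the quadratic fit easy to see; a plotting sketch using the model2cv vector above:
# Visualise the LOOCV error as a function of polynomial degree
plot(1:5, model2cv, type = "b",
     xlab = "Degree of polynomial", ylab = "LOOCV estimate of test MSE")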
K-fold cross-validation works in the same way but is much faster than LOOCV, because the model is refitted only K times. A common choice is K = 10, passed via the K argument of cv.glm().
set.seed(17)
model3kf <- rep(0,10)
for(i in 1:10){
model3 <- glm(mpg~poly(horsepower,i),data=Auto)
model3kf[i] <- cv.glm(Auto,model3,K=10)$delta[1]
}
model3kf
## [1] 24.21 19.19 19.31 19.34 18.88 19.02 18.90 19.71 18.95 19.50