Logistic Regression, LDA, QDA, KNN

Paul Jozefek

3/4/2020

Abstract


In this report I will give a brief overview of Logistic Regression, Linear Discriminant Analysis, Quadratic Discriminant Analysis, and K-Nearest Neighbors. I will then apply these classification methods to S&P 500 data.

Logistic Regression


Why It Is Used

If we were to use linear regression to predict a binary qualitative response, some estimates could lie outside of the 0 to 1 interval, making them difficult to interpret as probabilities.

Logistic Function

Instead we can use the logistic function, which for one predictor is:

\(p(X)=P(Y=1|X)\) where Y is a dummy variable, taking on values 0 or 1

\(p(X)=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}\) which results in a value between 0 and 1. We can now interpret the solution as a probability.

It will always produce an S-shaped curve.

ODDS = \(\frac{p(X)}{1-p(X)}=e^{\beta_0+\beta_1X}\)

LOG-ODDS = \(log\left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1X\)

A one unit increase in \(X\) changes the log-odds by \(\beta_1\)

Although the relationship between \(p(X)\) and \(X\) is not a straight line, if \(\beta_1\) is positive then increasing \(X\) will be associated with increasing \(p(X)\), and if \(\beta_1\) is negative then increasing \(X\) will be associated with decreasing \(p(X)\).
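For intuition, the short sketch below evaluates the logistic function at a few values of \(X\); the coefficients \(\beta_0=-1\) and \(\beta_1=2\) are arbitrary illustrative assumptions, not estimates from any data.

#Logistic function sketch (beta0=-1 and beta1=2 are assumed, not estimated)
beta0 <- -1; beta1 <- 2
logistic <- function(x) exp(beta0+beta1*x)/(1+exp(beta0+beta1*x))
round(logistic(c(-2,0,2)),3) #probabilities always lie between 0 and 1
curve(logistic, from=-4, to=4) #produces the S-shaped curve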

Generalizing the function

For multiple predictors the function can be generalized to:

\(p(X)=\frac{e^{\beta_0+\beta_1X_1+...+\beta_pX_p}}{1+e^{\beta_0+\beta_1X_1+...+\beta_pX_p}}\)

\(log\left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1X_1+...+\beta_pX_p\)

Estimating the Coefficients

The estimates for \(\hat\beta_0\) and \(\hat\beta_1\) are chosen to maximize the likelihood function.

\(L(\beta_0,\beta_1) = \displaystyle\prod_{i=1}^np(x_i)^{y_i}(1-p(x_i))^{1-y_i}\)

Where \(x_i\) is a vector of features and \(y_i\) is the observed class. The probability of the class is either \(p\) if \(y_i=1\) or \(1-p\) if \(y_i=0\).
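As a rough illustration (on simulated data, with the true coefficients below chosen arbitrarily), maximizing the log-likelihood directly with optim() gives essentially the same estimates as glm():

#Maximum likelihood sketch on simulated data (true coefficients are assumptions)
set.seed(1)
x <- rnorm(200)
y <- rbinom(200,1,plogis(0.5+1.2*x)) #assumed true beta0=0.5, beta1=1.2
negloglik <- function(beta){
  p <- plogis(beta[1]+beta[2]*x)
  -sum(y*log(p)+(1-y)*log(1-p)) #negative log-likelihood
}
optim(c(0,0),negloglik)$par #ML estimates of beta0 and beta1
coef(glm(y~x,family=binomial)) #should be nearly identical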

Linear Discriminant Analysis


Why It Is Used

When the classes are well separated, the coefficient estimates from logistic regression can be quite unstable. LDA avoids this problem, tends to be more stable when n is small and the predictors are approximately normal within each class, and extends naturally to more than two response classes.

Classification for 1 Predictor

\(\delta_k(x)=x \frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+log(\pi_k)\)

LDA assigns the observation to the class for which this equation is largest.

\(\pi_k\) represents the prior probability that an observation is associated with the \(k^{th}\) category. It is estimated with the proportion of observations that belong to the \(k^{th}\) class.

\(\hat\pi_k=\frac{n_k}{n}\)

LDA estimates \(\mu_k\) with the average of observations in the \(k^{th}\) class.

\(\hat\mu_k = \frac{1}{n_k}\displaystyle\sum_{i:y_i=k}x_i\)

The estimate for \(\sigma^2\) is the weighted average of the sample variances for the \(K\) classes.

\(\hat\sigma^2=\frac{1}{n-K}\displaystyle\sum_{k=1}^K\displaystyle\sum_{i:y_i=k}(x_i-\hat\mu_k)^2\)
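A minimal sketch of these estimates and of \(\delta_k(x)\) computed by hand on simulated data; the class means, class labels, and the test point \(x_0=0.3\) are arbitrary assumptions.

#One-predictor LDA discriminant computed by hand (simulated data, assumed means -1 and 1)
set.seed(1)
x <- c(rnorm(100,mean=-1),rnorm(100,mean=1))
y <- rep(c("A","B"),each=100)
pi.hat <- table(y)/length(y) #prior probabilities
mu.hat <- tapply(x,y,mean) #class means
sigma2.hat <- sum(tapply(x,y,function(v) sum((v-mean(v))^2)))/(length(x)-2) #pooled variance, n-K
delta <- function(x0,k) as.numeric(x0*mu.hat[k]/sigma2.hat - mu.hat[k]^2/(2*sigma2.hat) + log(pi.hat[k]))
sapply(c("A","B"),function(k) delta(0.3,k)) #assign x0=0.3 to the class with the larger value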

Generalizing the function

For more than one predictor, LDA assumes that \(X=(X_1,X_2,...,X_p)\) is drawn from a multivariate Gaussian distribution, with a class specific mean vector and common covariance matrix.

\(X \sim N(\mu,\Sigma)\)
\(Cov(X) = \Sigma\), the \(p \times p\) covariance matrix.

Equation
\(\delta_k(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+log\pi_k\)

LDA assigns the observation to the class for which this equation is largest.

Quadratic Discriminant Analysis


Why It Is Used

LDA assumes that all classes share a common covariance matrix. When that assumption is unrealistic, or when the training set is large enough to reliably estimate a separate covariance matrix for each class, QDA provides a more flexible, quadratic decision boundary.

QDA Function

In QDA we estimate a mean and covariance matrix for each class separately.

\(X \sim N(\mu_k,\Sigma_k)\)

\(Cov(X) = \Sigma_k\), the covariance matrix for the \(k^{th}\) class

Equation
\(\delta_k(x)=-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)-\frac{1}{2}log|\Sigma_k|+log\pi_k\)

QDA assigns the observation to the class for which this equation is largest.
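For intuition, the sketch below evaluates \(\delta_k(x)\) for a single class at one test point; the mean vector, covariance matrix, prior, and test observation are all arbitrary illustrative assumptions.

#QDA discriminant for one class, computed by hand (all inputs are assumptions)
mu.k <- c(0.1,-0.05) #assumed class mean vector
Sigma.k <- matrix(c(1.2,0.3,0.3,0.9),2,2) #assumed class covariance matrix
pi.k <- 0.53 #assumed prior probability
x0 <- c(0.5,0.2) #test observation
delta.k <- -0.5*t(x0-mu.k)%*%solve(Sigma.k)%*%(x0-mu.k) - 0.5*log(det(Sigma.k)) + log(pi.k)
delta.k #repeat for each class and assign x0 to the class with the largest value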

K-Nearest Neighbors


Why It Is Used

Ideally we would like to use the Bayes Classifier to predict qualitative responses, but we don’t know the conditional distribution of \(Y\) given \(X\). KNN attempts to estimate the conditional distribution of \(Y\) given \(X\) and then classify a given observation according to the highest estimated probability.

KNN Formula

\(K=\) a positive integer
\(x_0=\) a test observation
\(N_0=\) the \(K\) points in the training data that are closest to \(x_0\)
\(j\) = class

\(P(Y=j|X=x_0)=\frac{1}{K}\displaystyle\sum_{i\in N_0}I(y_i=j)\)

The conditional probability for class \(j\) is estimated by the fraction of points in \(N_0\) whose response equals \(j\).

KNN then applies the Bayes rule and labels \(x_0\) with the class that has the largest estimated probability.
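A minimal sketch of this estimate for a single test point, using a tiny made-up training set (all values below are assumptions):

#KNN probability estimate computed by hand for one test point (toy data)
train.X <- c(-1.2,-0.4,0.1,0.8,1.5) #assumed training feature values
train.Y <- c("Down","Down","Up","Up","Up") #assumed class labels
x0 <- 0; K <- 3
nearest <- order(abs(train.X-x0))[1:K] #indices of the K closest training points
prop.table(table(train.Y[nearest])) #estimated P(Y=j|X=x0); label x0 with the largest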

S&P 500 Data


Data Exploration

I will now apply these techniques to S&P 500 data, which can be obtained from the quantmod package.

library(quantmod)
library(knitr)
library(kableExtra)
#Retrieve S&P500 data
getSymbols("^GSPC",from = "2000-01-01",to = "2020-03-03")
[1] "^GSPC"
#Create Data Frame with index
GSPC2 <- data.frame(date=index(GSPC), coredata(GSPC)[,])

#Create Table
kable(head(GSPC2))%>%kable_styling(position="left",bootstrap_options=c("striped","condensed"),full_width=F)
date GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
2000-01-03 1469.25 1478.00 1438.36 1455.22 931800000 1455.22
2000-01-04 1455.22 1455.22 1397.43 1399.42 1009000000 1399.42
2000-01-05 1399.42 1413.27 1377.68 1402.11 1085500000 1402.11
2000-01-06 1402.11 1411.90 1392.10 1403.45 1092300000 1403.45
2000-01-07 1403.45 1441.47 1400.73 1441.47 1225200000 1441.47
2000-01-10 1441.47 1464.36 1441.47 1457.60 1064800000 1457.60

I then converted the output to a data frame with 5,067 days, from the beginning of 2000 until March 2, 2020. The columns include Year, the current day’s percentage return, Lag 1 through Lag 5, Volume, and an indicator of the current day’s direction.

#Create SP500 data frame
sp500 <- data.frame(Year=as.numeric(format(index(GSPC),"%Y")))
sp500$Lag1 <- round(as.numeric(lag(dailyReturn(GSPC),1)*100),3)
sp500$Lag2 <- round(as.numeric(lag(dailyReturn(GSPC),2)*100),3)
sp500$Lag3 <- round(as.numeric(lag(dailyReturn(GSPC),3)*100),3)
sp500$Lag4 <- round(as.numeric(lag(dailyReturn(GSPC),4)*100),3)
sp500$Lag5 <- round(as.numeric(lag(dailyReturn(GSPC),5)*100),3)
sp500$Volume <- as.numeric(GSPC$GSPC.Volume)/1000000000
sp500$Today <- round(as.numeric(dailyReturn(GSPC)*100),3)
sp500$Direction = as.factor(ifelse(sp500$Today<0,"Down","Up"))

#remove first 5 due to lags
sp500 <- sp500[-1:-5,]

dim(sp500) #number of rows and columns
[1] 5067    9
#Create Table
kable(head(sp500))%>%kable_styling(position="left",bootstrap_options=c("striped","condensed"),full_width=F)
Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
6 2000 2.709 0.096 0.192 -3.834 -0.955 1.0648 1.119 Up
7 2000 1.119 2.709 0.096 0.192 -3.834 1.0140 -1.306 Down
8 2000 -1.306 1.119 2.709 0.096 0.192 0.9746 -0.439 Down
9 2000 -0.439 -1.306 1.119 2.709 0.096 1.0304 1.217 Up
10 2000 1.217 -0.439 -1.306 1.119 2.709 1.0859 1.067 Up
11 2000 1.067 1.217 -0.439 -1.306 1.119 1.0567 -0.683 Down

It is typically helpful to view the correlation matrix.

#Correlation

kable(round(cor(sp500[,-9]),3))%>%kable_styling(position="left",bootstrap_options=c("striped","condensed"),full_width=F)
Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today
Year 1.000 0.016 0.016 0.018 0.019 0.020 0.567 0.018
Lag1 0.016 1.000 -0.072 -0.052 0.024 -0.012 -0.045 -0.073
Lag2 0.016 -0.072 1.000 -0.073 -0.052 0.024 -0.034 -0.055
Lag3 0.018 -0.052 -0.073 1.000 -0.073 -0.054 -0.030 0.025
Lag4 0.019 0.024 -0.052 -0.073 1.000 -0.073 -0.039 -0.014
Lag5 0.020 -0.012 0.024 -0.054 -0.073 1.000 -0.021 -0.045
Volume 0.567 -0.045 -0.034 -0.030 -0.039 -0.021 1.000 -0.016
Today 0.018 -0.073 -0.055 0.025 -0.014 -0.045 -0.016 1.000

We can see that the correlations are all close to zero, except for Volume and Year.

#bar chart from QuantMod
barChart(GSPC)

Volume has increased over time, as a step function. It increased in 2008 and remained at the higher level.

Logistic Regression with S&P 500 data

Full Model

First, I will try to predict Direction using Lag 1 through Lag 5 and Volume. I have used the entire dataset.

#full model with all data
glm.fit = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
               data=sp500, family=binomial)
summary(glm.fit) #only Lag1 is significant

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
    Volume, family = binomial, data = sp500)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.507  -1.233   1.052   1.120   1.465  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.162434   0.066185   2.454 0.014119 *  
Lag1        -0.083956   0.024186  -3.471 0.000518 ***
Lag2        -0.022233   0.024005  -0.926 0.354344    
Lag3         0.010868   0.024152   0.450 0.652725    
Lag4        -0.000491   0.024014  -0.020 0.983687    
Lag5        -0.028723   0.024008  -1.196 0.231552    
Volume      -0.005073   0.019276  -0.263 0.792415    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6998.3  on 5066  degrees of freedom
Residual deviance: 6983.8  on 5060  degrees of freedom
AIC: 6997.8

Number of Fisher Scoring iterations: 4

Based on the coefficient p-values it appears that only Lag 1 is significant. Since the Lag 1 coefficient is negative it suggests that the market is less likely to increase today if it increased yesterday.

#Check contrasts and examine 1st 10 predictions
contrasts(sp500$Direction)
     Up
Down  0
Up    1
glm.probs=round(predict(glm.fit,type="response"),2)
glm.probs[1:10]
   6    7    8    9   10   11   12   13   14   15 
0.49 0.53 0.57 0.56 0.49 0.50 0.56 0.55 0.54 0.54 

Up is recorded as 1 and Down is recorded as 0

glm.pred=rep("Down",5067) #Create vector of Down elements
glm.pred[glm.probs>.5]="Up" #Transform to Up for predicted prob >0.5
table(Prediction=glm.pred, Truth=sp500$Direction)
          Truth
Prediction Down   Up
      Down  205  193
      Up   2147 2522
mean(glm.pred==sp500$Direction) #correct predictions
[1] 0.5381883
mean(glm.pred!=sp500$Direction) #incorrect predictions
[1] 0.4618117

Logistic Regression correctly predicted the market 53.8% of the time, which is just slightly better than random guessing. Also, we can see that it predicts Up 92% of the time, which is not very helpful.

Full Model with Training Set

Next, I created a Training set with years 2000 through 2015 and a Testing set with years 2016 to present. I again ran the model with all the attributes.

#Create Subset of Data
sp500.Subset=(sp500$Year<2016) #Boolean TRUE FALSE for Year<2016
sp500.Train=sp500[sp500.Subset,] #Create training set
sp500.Test=sp500[!sp500.Subset,] #Create testing set
sp500.Direction=sp500.Test[,9] #Direction factor for test set

#Fit the training set
glm.fit = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
               data=sp500.Train, family=binomial)

#Predict for test set
glm.probs=predict(glm.fit,sp500.Test,type="response")

dim(sp500.Test) #1047
[1] 1047    9
glm.pred=rep("Down",1047) #Create vector of Down elements
glm.pred[glm.probs>.5]="Up" #Transform to Up for predicted prob >0.5
table(Prediction=glm.pred, Truth=sp500.Direction)
          Truth
Prediction Down  Up
      Down   15  19
      Up    453 560
mean(glm.pred==sp500.Direction) #Correct predictions
[1] 0.5491882
mean(glm.pred!=sp500.Direction) #Incorrect predictions
[1] 0.4508118

In this case Logistic Regression correctly predicted the market 54.9% of the time. Again, it almost always predicts Up, so it is not very helpful.

Partial Model with Training Set

Although only Lag 1 had a p-value <0.05, I chose to try a model with Lag 1 and Lag 2.

#Fit the training set with just Lag1 and Lag2
glm.fit = glm(Direction~Lag1 + Lag2,
              data=sp500.Train, family=binomial)

#Predict for test set
glm.probs=predict(glm.fit,sp500.Test,type="response")

glm.pred=rep("Down",1047) #Create vector of Down elements
glm.pred[glm.probs>.5]="Up" #Transform to Up for predicted prob >0.5
table(Prediction=glm.pred, Truth=sp500.Direction)
          Truth
Prediction Down  Up
      Down   16  21
      Up    452 558
mean(glm.pred==sp500.Direction) #Correct predictions
[1] 0.548233
mean(glm.pred!=sp500.Direction) #Incorrect predictions
[1] 0.451767

The result was essentially unchanged from the model with all predictors.

Linear Discriminant Analysis with S&P 500 data

library(MASS)
#Linear Discriminant Analysis
lda.fit = lda(Direction~Lag1 + Lag2, data=sp500.Train)
lda.fit
Call:
lda(Direction ~ Lag1 + Lag2, data = sp500.Train)

Prior probabilities of groups:
     Down        Up 
0.4686567 0.5313433 

Group means:
            Lag1        Lag2
Down  0.08783121 0.034017516
Up   -0.04437172 0.003476124

Coefficients of linear discriminants:
            LD1
Lag1 -0.7744647
Lag2 -0.2353294

The LDA output indicates that \(\hat\pi_1=0.469\) and \(\hat\pi_2=0.531\): \(46.9\%\) of the training observations correspond to days on which the market went down. The group means are the estimates of \(\mu_k\). They suggest that on days when the market goes up the previous day's return tends to be negative, and the opposite holds when the market goes down.

The coefficients are used to form the decision rule: if \(-0.774 \times Lag1 - 0.235 \times Lag2\) is large, then the classifier will predict a market increase, and if it is small, then it will predict a decrease.

plot(lda.fit)

#Predictions for the test set
lda.pred=predict(lda.fit,sp500.Test)
names(lda.pred)
[1] "class"     "posterior" "x"        
lda.class=lda.pred$class

table(Prediction=lda.class, Truth=sp500.Direction)
          Truth
Prediction Down  Up
      Down   16  20
      Up    452 559
mean(lda.class==sp500.Direction) #Correct predictions
[1] 0.5491882
mean(lda.class!=sp500.Direction) #Incorrect predictions
[1] 0.4508118

LDA correctly predicted the market 54.9% of the time, which is very similar to logistic regression.

We can recreate the predictions by examining the posterior probabilities.

#Posterior probability corresponds to the probability of a down day
sum(lda.pred$posterior[,1]>=0.5) #Down predictions
[1] 36
sum(lda.pred$posterior[,1]<0.5) #Up predictions
[1] 1011
round(lda.pred$posterior[1:10,1],2)
4026 4027 4028 4029 4030 4031 4032 4033 4034 4035 
0.44 0.43 0.46 0.44 0.41 0.43 0.46 0.49 0.42 0.49 
lda.class[1:10]
 [1] Up Up Up Up Up Up Up Up Up Up
Levels: Down Up

Quadratic Discriminant Analysis with S&P 500 data

#Quadratic Discriminant Analysis
qda.fit = qda(Direction~Lag1 + Lag2, data=sp500.Train)
qda.fit
Call:
qda(Direction ~ Lag1 + Lag2, data = sp500.Train)

Prior probabilities of groups:
     Down        Up 
0.4686567 0.5313433 

Group means:
            Lag1        Lag2
Down  0.08783121 0.034017516
Up   -0.04437172 0.003476124
qda.class=predict(qda.fit,sp500.Test)$class

The QDA output indicates that \(\hat\pi_1=0.469\) and \(\hat\pi_2=0.531\). \(46.9\%\) of the training observations correspond to days on which the market went down.

table(Prediction=qda.class, Truth=sp500.Direction)
          Truth
Prediction Down  Up
      Down   15  17
      Up    453 562
mean(qda.class==sp500.Direction) #Correct predictions
[1] 0.5510984
mean(qda.class!=sp500.Direction) #Incorrect predictions
[1] 0.4489016

QDA correctly predicted the market 55.1% of the time. However, the predictions are still overly weighted towards Up.

K-Nearest Neighbors Analysis with S&P 500 data

First, I have tried K=1

#K-Nearest Neighbors K=1

library(class)
#Create matrix training set with Lag 1 and Lag 2
train.X=cbind(sp500$Lag1,sp500$Lag2)[sp500.Subset,]
#Create matrix test set with Lag 1 and Lag 2
test.X=cbind(sp500$Lag1,sp500$Lag2)[!sp500.Subset,]
#vector of class labels
train.Direction=sp500$Direction[sp500.Subset]

set.seed(12345) #randomly break ties
knn.pred=knn(train.X,test.X,train.Direction,k=1)
table(prediction=knn.pred, truth=sp500.Direction)
          truth
prediction Down  Up
      Down  222 281
      Up    246 298
mean(knn.pred==sp500.Direction) #Correct predictions
[1] 0.4966571
mean(knn.pred!=sp500.Direction) #Incorrect predictions
[1] 0.5033429

KNN correctly predicted the market 49.7% of the time, which is similar to random guessing. Unlike Logistic Regression, LDA, or QDA, the predictions are more evenly split between Up and Down, which is closer to reality.

#KNN with K=3
set.seed(12345)
knn.pred=knn(train.X,test.X,train.Direction,k=3)
table(prediction=knn.pred, truth=sp500.Direction)
          truth
prediction Down  Up
      Down  215 269
      Up    253 310
mean(knn.pred==sp500.Direction) #Correct predictions
[1] 0.5014327
mean(knn.pred!=sp500.Direction) #Incorrect predictions
[1] 0.4985673

The results with K=3 are similar, with 50.1% of predictions correct.