In this report I will give a brief overview of Logistic Regression, Linear Discriminant Analysis, Quadratic Discriminant Analysis, and K-Nearest Neighbors. I will then apply these classification methods to S&P 500 data.
If we were to use linear regression to predict a binary qualitative response some estimates could lie outside of the 0 to 1 interval, making them difficult to interpret as probabilities.
Instead we can use the logistic function, which for one predictor is:
\(p(X)=P(Y=1|X)\) where Y is a dummy variable, taking on values 0 or 1
\(p(X)=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}\) which results in a value between 0 and 1. We can now interpret the solution as a probability.
It will always produce an S-shaped curve.
ODDS = \(\frac{p(X)}{1-p(X)}=e^{\beta_0+\beta_1X}\)
LOG-ODDS = \(log\left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1X\)
A one unit increase in \(X\) changes the log-odds by \(\beta_1\).
Although the relationship between \(p(X)\) and \(X\) is not a straight line, if \(\beta_1\) is positive then increasing \(X\) is associated with increasing \(p(X)\), and if \(\beta_1\) is negative then increasing \(X\) is associated with decreasing \(p(X)\).
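To make this concrete, here is a minimal sketch with hypothetical coefficients that plots the S-shaped curve and verifies that a one unit increase in \(X\) multiplies the odds by \(e^{\beta_1}\):
#Hypothetical coefficients, for illustration only
beta0 <- -2; beta1 <- 0.8
p <- function(x) exp(beta0+beta1*x)/(1+exp(beta0+beta1*x))
curve(p, from=-10, to=10, xlab="X", ylab="p(X)") #S-shaped curve
odds <- function(x) p(x)/(1-p(x))
odds(1)/odds(0) #odds ratio for a one unit increase in X
exp(beta1)      #the same value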
For multiple predictors the function generalizes to:
\(p(X)=\frac{e^{\beta_0+\beta_1X_1+...+\beta_pX_p}}{1+e^{\beta_0+\beta_1X_1+...+\beta_pX_p}}\)
\(log\left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1X_1+...+\beta_pX_p\)
The estimates \(\hat\beta_0\) and \(\hat\beta_1\) are chosen to maximize the likelihood function:
\(L(\beta_0,\beta_1) = \displaystyle\prod_{i=1}^np(x_i)^{y_i}(1-p(x_i))^{1-y_i}\)
where \(x_i\) is a vector of features and \(y_i\) is the observed class; the contribution of observation \(i\) is \(p(x_i)\) if \(y_i=1\) and \(1-p(x_i)\) if \(y_i=0\).
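As a sanity check on how the estimates arise, here is a minimal sketch on simulated data (not the S&P 500 series) that maximizes the log-likelihood directly with optim() and compares the result to glm():
#Maximize the likelihood directly, on simulated data
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-1+2*x)) #true beta0=-1, beta1=2
negloglik <- function(beta) {
  p <- plogis(beta[1]+beta[2]*x)
  -sum(y*log(p)+(1-y)*log(1-p))
}
optim(c(0,0), negloglik)$par        #maximum likelihood estimates
coef(glm(y~x, family=binomial))     #glm() recovers the same estimates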
Linear Discriminant Analysis takes a different approach: it models the distribution of the predictors within each class. With one predictor, LDA assumes the observations in class \(k\) are drawn from a normal distribution with class-specific mean \(\mu_k\) and shared variance \(\sigma^2\), which leads to the discriminant function:
\(\delta_k(x)=x \frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+\log(\pi_k)\)
LDA assigns the observation to the class for which this equation is largest.
\(\pi_k\) represents the prior probability that an observation is associated with the \(k^{th}\) category. It is estimated with the proportion of observations that belong to the \(k^{th}\) class.
\(\hat\pi_k=\frac{n_k}{n}\)
LDA estimates \(\mu_k\) with the average of observations in the \(k^{th}\) class.
\(\hat\mu_k = \frac{1}{n_k}\displaystyle\sum_{i:y_i=k}x_i\)
The estimate for \(\sigma^2\) is the weighted average of sample variances for each of the \(k\) classes.
\(\hat\sigma^2=\frac{1}{n-K}\displaystyle\sum_{k=1}^K\displaystyle\sum_{i:y_i=k}(x_i-\hat\mu_k)^2\)
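These estimates are straightforward to compute by hand. A sketch on simulated data (the toy variables here are illustrative):
#One-predictor LDA estimates by hand, on simulated data
set.seed(1)
cl <- rep(c("Down","Up"), each=50)
x <- rnorm(100, mean=ifelse(cl=="Down",0,1))
n <- length(x); K <- 2
pi.hat <- table(cl)/n         #prior proportions
mu.hat <- tapply(x, cl, mean) #class means
s2.hat <- sum(tapply(x, cl, function(v) sum((v-mean(v))^2)))/(n-K) #pooled variance
delta <- function(x0,k) x0*mu.hat[k]/s2.hat - mu.hat[k]^2/(2*s2.hat) + log(pi.hat[k])
c(delta(0.8,"Down"), delta(0.8,"Up")) #assign x0=0.8 to the larger value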
For more than one predictor, LDA assumes that \(X=(X_1,X_2,...,X_p)\) is drawn from a multivariate Gaussian distribution, with a class specific mean vector and common covariance matrix.
\(X \sim N(\mu,\Sigma)\)
\(Cov(X) = \Sigma\), the \(p \times p\) covariance matrix.
The discriminant function becomes:
\(\delta_k(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+\log\pi_k\)
LDA assigns the observation to the class for which this equation is largest.
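A sketch of this discriminant as an R function, with assumed inputs (a class mean vector, the common covariance matrix, and a prior):
#Multivariate LDA discriminant for class k
delta.k <- function(x0, mu.k, Sigma, pi.k) {
  Sinv <- solve(Sigma)
  drop(t(x0)%*%Sinv%*%mu.k - 0.5*t(mu.k)%*%Sinv%*%mu.k + log(pi.k))
}
mu1 <- c(0,0); mu2 <- c(1,1); Sigma <- matrix(c(1,0.3,0.3,1),2,2)
c(delta.k(c(0.8,0.2),mu1,Sigma,0.5), delta.k(c(0.8,0.2),mu2,Sigma,0.5)) #pick the larger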
In QDA we estimate a mean vector and covariance matrix for each class separately.
\(X \sim N(\mu_k,\Sigma_k)\)
\(Cov(X) = \Sigma_k\), the covariance matrix for the \(k^{th}\) class.
The discriminant function becomes:
\(\delta_k(x)=-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)-\frac{1}{2}\log|\Sigma_k|+\log\pi_k\)
QDA assigns the observation to the class for which this equation is largest.
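The QDA analogue differs only in the quadratic term and the class-specific determinant (again a sketch with assumed inputs):
#QDA discriminant for class k
delta.k.qda <- function(x0, mu.k, Sigma.k, pi.k) {
  d <- x0 - mu.k
  drop(-0.5*t(d)%*%solve(Sigma.k)%*%d - 0.5*log(det(Sigma.k)) + log(pi.k))
}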
Ideally we would like to use the Bayes Classifier to predict qualitative responses, but we don’t know the conditional distribution of \(Y\) given \(X\). KNN attempts to estimate the conditional distribution of \(Y\) given \(X\) and then classify a given observation according to the highest estimated probability.
\(K=\) a positive integer
\(x_0=\) a test observation
\(N_0=\) the set of \(K\) points in the training data that are closest to \(x_0\)
\(j=\) a class label
\(P(Y=j|X=x_0)=\frac{1}{K}\displaystyle\sum_{i\in N_0}I(y_i=j)\)
The conditional probability for class \(j\) is the fraction of points in \(N_0\) whose class equals \(j\).
KNN then applies the Bayes rule and labels \(x_0\) with the class that has the largest estimated probability.
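The estimate is simple to compute by hand. A sketch on simulated data (the toy training set, \(x_0\), and \(K\) are illustrative):
#Hand-rolled KNN on simulated data
set.seed(1)
train <- matrix(rnorm(40), ncol=2) #20 training points in 2 dimensions
labels <- rep(c("Down","Up"), each=10)
x0 <- c(0.5,-0.2); K <- 5
d <- sqrt(colSums((t(train)-x0)^2)) #Euclidean distances to x0
N0 <- order(d)[1:K]                 #indices of the K nearest points
table(labels[N0])/K                 #estimated P(Y=j|X=x0)
names(which.max(table(labels[N0]))) #predicted class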
I will now apply these techniques to S&P 500 data, which can be obtained from the quantmod package.
library(quantmod)
library(knitr)
library(kableExtra)
#Retrieve S&P500 data
getSymbols("^GSPC",from = "2000-01-01",to = "2020-03-03")
[1] "^GSPC"
#Create Data Frame with index
GSPC2 <- data.frame(date=index(GSPC), coredata(GSPC))
#Create Table
kable(head(GSPC2))%>%kable_styling(position="left",bootstrap_options=c("striped","condensed"),full_width=F)
date | GSPC.Open | GSPC.High | GSPC.Low | GSPC.Close | GSPC.Volume | GSPC.Adjusted |
---|---|---|---|---|---|---|
2000-01-03 | 1469.25 | 1478.00 | 1438.36 | 1455.22 | 931800000 | 1455.22 |
2000-01-04 | 1455.22 | 1455.22 | 1397.43 | 1399.42 | 1009000000 | 1399.42 |
2000-01-05 | 1399.42 | 1413.27 | 1377.68 | 1402.11 | 1085500000 | 1402.11 |
2000-01-06 | 1402.11 | 1411.90 | 1392.10 | 1403.45 | 1092300000 | 1403.45 |
2000-01-07 | 1403.45 | 1441.47 | 1400.73 | 1441.47 | 1225200000 | 1441.47 |
2000-01-10 | 1441.47 | 1464.36 | 1441.47 | 1457.60 | 1064800000 | 1457.60 |
I then converted the output to a data frame with 5,067 days, from the beginning of 2000 through March 2, 2020. The columns are Year, Lag 1 through Lag 5 (the previous five days' percentage returns), Volume (in billions of shares), Today (the current day's percentage return), and Direction (an Up/Down indicator for the current day).
#Create SP500 data frame
sp500 <- data.frame(Year=as.numeric(format(index(GSPC),"%Y")))
sp500$Lag1 <- round(as.numeric(lag(dailyReturn(GSPC),1)*100),3)
sp500$Lag2 <- round(as.numeric(lag(dailyReturn(GSPC),2)*100),3)
sp500$Lag3 <- round(as.numeric(lag(dailyReturn(GSPC),3)*100),3)
sp500$Lag4 <- round(as.numeric(lag(dailyReturn(GSPC),4)*100),3)
sp500$Lag5 <- round(as.numeric(lag(dailyReturn(GSPC),5)*100),3)
sp500$Volume <- as.numeric(GSPC$GSPC.Volume)/1000000000
sp500$Today <- round(as.numeric(dailyReturn(GSPC)*100),3)
sp500$Direction = as.factor(ifelse(sp500$Today<0,"Down","Up"))
#remove first 5 due to lags
sp500 <- sp500[-1:-5,]
dim(sp500) #number of rows and columns
[1] 5067 9
#Create Table
kable(head(sp500))%>%kable_styling(position="left",bootstrap_options=c("striped","condensed"),full_width=F)
 | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction |
---|---|---|---|---|---|---|---|---|---|
6 | 2000 | 2.709 | 0.096 | 0.192 | -3.834 | -0.955 | 1.0648 | 1.119 | Up |
7 | 2000 | 1.119 | 2.709 | 0.096 | 0.192 | -3.834 | 1.0140 | -1.306 | Down |
8 | 2000 | -1.306 | 1.119 | 2.709 | 0.096 | 0.192 | 0.9746 | -0.439 | Down |
9 | 2000 | -0.439 | -1.306 | 1.119 | 2.709 | 0.096 | 1.0304 | 1.217 | Up |
10 | 2000 | 1.217 | -0.439 | -1.306 | 1.119 | 2.709 | 1.0859 | 1.067 | Up |
11 | 2000 | 1.067 | 1.217 | -0.439 | -1.306 | 1.119 | 1.0567 | -0.683 | Down |
It is typically helpful to view the correlation matrix.
#Correlation
kable(round(cor(sp500[,-9]),3))%>%kable_styling(position="left",bootstrap_options=c("striped","condensed"),full_width=F)
 | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today |
---|---|---|---|---|---|---|---|---|
Year | 1.000 | 0.016 | 0.016 | 0.018 | 0.019 | 0.020 | 0.567 | 0.018 |
Lag1 | 0.016 | 1.000 | -0.072 | -0.052 | 0.024 | -0.012 | -0.045 | -0.073 |
Lag2 | 0.016 | -0.072 | 1.000 | -0.073 | -0.052 | 0.024 | -0.034 | -0.055 |
Lag3 | 0.018 | -0.052 | -0.073 | 1.000 | -0.073 | -0.054 | -0.030 | 0.025 |
Lag4 | 0.019 | 0.024 | -0.052 | -0.073 | 1.000 | -0.073 | -0.039 | -0.014 |
Lag5 | 0.020 | -0.012 | 0.024 | -0.054 | -0.073 | 1.000 | -0.021 | -0.045 |
Volume | 0.567 | -0.045 | -0.034 | -0.030 | -0.039 | -0.021 | 1.000 | -0.016 |
Today | 0.018 | -0.073 | -0.055 | 0.025 | -0.014 | -0.045 | -0.016 | 1.000 |
We can see that the correlations are all close to zero, except for Volume and Year.
#bar chart from QuantMod
barChart(GSPC)
Volume has increased over time, roughly as a step function: it jumped in 2008 and has remained at that higher level since.
First, I will try to predict Direction using Lag 1 through Lag 5 and Volume, fitting the model on the entire dataset.
#full model with all data
glm.fit = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
data=sp500, family=binomial)
summary(glm.fit) #only Lag1 is significant
Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
Volume, family = binomial, data = sp500)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.507 -1.233 1.052 1.120 1.465
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.162434 0.066185 2.454 0.014119 *
Lag1 -0.083956 0.024186 -3.471 0.000518 ***
Lag2 -0.022233 0.024005 -0.926 0.354344
Lag3 0.010868 0.024152 0.450 0.652725
Lag4 -0.000491 0.024014 -0.020 0.983687
Lag5 -0.028723 0.024008 -1.196 0.231552
Volume -0.005073 0.019276 -0.263 0.792415
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6998.3 on 5066 degrees of freedom
Residual deviance: 6983.8 on 5060 degrees of freedom
AIC: 6997.8
Number of Fisher Scoring iterations: 4
Based on the coefficient p-values it appears that only Lag 1 is significant. Since the Lag 1 coefficient is negative it suggests that the market is less likely to increase today if it increased yesterday.
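On the odds scale, this coefficient implies that a one-percentage-point gain yesterday multiplies today's odds of an Up day by about \(e^{-0.084}\approx 0.92\):
exp(coef(glm.fit)["Lag1"]) #odds multiplier per one unit increase in Lag1, about 0.92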
#Check contrasts and examine 1st 10 predictions
contrasts(sp500$Direction)
Up
Down 0
Up 1
glm.probs=round(predict(glm.fit,type="response"),2)
glm.probs[1:10]
6 7 8 9 10 11 12 13 14 15
0.49 0.53 0.57 0.56 0.49 0.50 0.56 0.55 0.54 0.54
Up is coded as 1 and Down as 0, so a predicted probability above 0.5 corresponds to a prediction of Up.
glm.pred=rep("Down",5067) #Create vector of Down elements
glm.pred[glm.probs>.5]="Up" #Transform to Up for predicted prob >0.5
table(Prediction=glm.pred, Truth=sp500$Direction)
Truth
Prediction Down Up
Down 205 193
Up 2147 2522
mean(glm.pred==sp500$Direction) #correct predictions
[1] 0.5381883
mean(glm.pred!=sp500$Direction) #incorrect predictions
[1] 0.4618117
Logistic Regression correctly predicted the market 53.8% of the time, which is just slightly better than random guessing. Also, we can see that it predicts Up 92% of the time, which is not very helpful.
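The 92% figure follows directly from the confusion matrix:
(2147+2522)/5067       #proportion of days predicted Up
mean(glm.pred=="Up")   #same figure from the prediction vector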
Next, I created a Training set with years 2000 through 2015 and a Testing set with years 2016 to present. I again ran the model with all the attributes.
#Create Subset of Data
sp500.Subset=(sp500$Year<2016) #Boolean TRUE FALSE for Year<2016
sp500.Train=sp500[sp500.Subset,] #Create training set
sp500.Test=sp500[!sp500.Subset,] #Create testing set
sp500.Direction=sp500.Test[,9] #Direction factor for test set
#Fit the training set
glm.fit = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
data=sp500.Train, family=binomial)
#Predict for test set
glm.probs=predict(glm.fit,sp500.Test,type="response")
dim(sp500.Test) #1047
[1] 1047 9
glm.pred=rep("Down",1047) #Create vector of Down elements
glm.pred[glm.probs>.5]="Up" #Transform to Up for predicted prob >0.5
table(Prediction=glm.pred, Truth=sp500.Direction)
Truth
Prediction Down Up
Down 15 19
Up 453 560
mean(glm.pred==sp500.Direction) #Correct predictions
[1] 0.5491882
mean(glm.pred!=sp500.Direction) #Incorrect predictions
[1] 0.4508118
In this case Logistic Regression correctly predicted the market 54.9% of the time. Again, it almost always predicts Up, so it is not very helpful.
Although only Lag 1 had a p-value <0.05, I chose to try a model with Lag 1 and Lag 2.
#Fit the training set with just Lag1 and Lag2
glm.fit = glm(Direction~Lag1 + Lag2,
data=sp500.Train, family=binomial)
#Predict for test set
glm.probs=predict(glm.fit,sp500.Test,type="response")
glm.pred=rep("Down",1047) #Create vector of Down elements
glm.pred[glm.probs>.5]="Up" #Transform to Up for predicted prob >0.5
table(Prediction=glm.pred, Truth=sp500.Direction)
Truth
Prediction Down Up
Down 16 21
Up 452 558
mean(glm.pred==sp500.Direction) #Correct predictions
[1] 0.548233
mean(glm.pred!=sp500.Direction) #Incorrect predictions
[1] 0.451767
The result was not much better than the model with all predictors.
library(MASS)
#Linear Discriminant Analysis
lda.fit = lda(Direction~Lag1 + Lag2, data=sp500.Train)
lda.fit
Call:
lda(Direction ~ Lag1 + Lag2, data = sp500.Train)
Prior probabilities of groups:
Down Up
0.4686567 0.5313433
Group means:
Lag1 Lag2
Down 0.08783121 0.034017516
Up -0.04437172 0.003476124
Coefficients of linear discriminants:
LD1
Lag1 -0.7744647
Lag2 -0.2353294
The LDA output indicates that \(\hat\pi_1=0.469\) and \(\hat\pi_2=0.531\): \(46.9\%\) of the training observations correspond to days on which the market went down. The group means are the estimates of \(\mu_k\); on days when the market goes up, the previous day tended to be down, and the opposite holds when the market goes down.
The coefficients form the decision rule: if \(-0.774 \times \text{Lag1} - 0.235 \times \text{Lag2}\) is large, the classifier predicts a market increase, and if it is small it predicts a decrease.
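We can compute this linear combination for the test observations directly from the fitted object (a sketch; predict() applies the same coefficients internally, after centering the predictors):
#Linear discriminant scores for the test set
ld <- as.matrix(sp500.Test[,c("Lag1","Lag2")]) %*% lda.fit$scaling
head(ld) #larger values favor an Up prediction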
plot(lda.fit)
#Predictions for the test set
lda.pred=predict(lda.fit,sp500.Test)
names(lda.pred)
[1] "class" "posterior" "x"
lda.class=lda.pred$class
table(Prediction=lda.class, Truth=sp500.Direction)
Truth
Prediction Down Up
Down 16 20
Up 452 559
mean(lda.class==sp500.Direction) #Correct predictions
[1] 0.5491882
mean(lda.class!=sp500.Direction) #Incorrect predictions
[1] 0.4508118
LDA correctly predicted the market 54.9% of the time, a result very similar to Logistic Regression.
We can recreate the predictions by examining the posterior probabilities:
#Posterior probability corresponds to the probability of a down day
sum(lda.pred$posterior[,1]>=0.5) #Down predictions
[1] 36
sum(lda.pred$posterior[,1]<0.5) #Up predictions
[1] 1011
round(lda.pred$posterior[1:10,1],2)
4026 4027 4028 4029 4030 4031 4032 4033 4034 4035
0.44 0.43 0.46 0.44 0.41 0.43 0.46 0.49 0.42 0.49
lda.class[1:10]
[1] Up Up Up Up Up Up Up Up Up Up
Levels: Down Up
#Quadratic Discriminant Analysis
qda.fit = qda(Direction~Lag1 + Lag2, data=sp500.Train)
qda.fit
Call:
qda(Direction ~ Lag1 + Lag2, data = sp500.Train)
Prior probabilities of groups:
Down Up
0.4686567 0.5313433
Group means:
Lag1 Lag2
Down 0.08783121 0.034017516
Up -0.04437172 0.003476124
qda.class=predict(qda.fit,sp500.Test)$class
The QDA output indicates that \(\hat\pi_1=0.469\) and \(\hat\pi_2=0.531\). \(46.9\%\) of the training observations correspond to days on which the market went down.
table(Prediction=qda.class, Truth=sp500.Direction)
Truth
Prediction Down Up
Down 15 17
Up 453 562
mean(qda.class==sp500.Direction) #Correct predictions
[1] 0.5510984
mean(qda.class!=sp500.Direction) #Incorrect predictions
[1] 0.4489016
QDA correctly predicted the market 55.1% of the time. However, the predictions are still overly weighted towards Up.
Finally, I will apply K-Nearest Neighbors, first with K=1.
#K-Nearest Neighbors K=1
library(class)
#Create matrix training set with Lag 1 and Lag 2
train.X=cbind(sp500$Lag1,sp500$Lag2)[sp500.Subset,]
#Create matrix test set with Lag 1 and Lag 2
test.X=cbind(sp500$Lag1,sp500$Lag2)[!sp500.Subset,]
#vector of class labels
train.Direction=sp500$Direction[sp500.Subset]
set.seed(12345) #randomly break ties
knn.pred=knn(train.X,test.X,train.Direction,k=1)
table(prediction=knn.pred, truth=sp500.Direction)
truth
prediction Down Up
Down 222 281
Up 246 298
mean(knn.pred==sp500.Direction) #Correct predictions
[1] 0.4966571
mean(knn.pred!=sp500.Direction) #Incorrect predictions
[1] 0.5033429
KNN correctly predicted the market 49.7% of the time, which is similar to guessing. However, unlike Logistic Regression, LDA, and QDA, its predictions are more evenly split between Up and Down, which is closer to reality.
#KNN with K=3
set.seed(12345)
knn.pred=knn(train.X,test.X,train.Direction,k=3)
table(prediction=knn.pred, truth=sp500.Direction)
truth
prediction Down Up
Down 215 269
Up 253 310
mean(knn.pred==sp500.Direction) #Correct predictions
[1] 0.5014327
mean(knn.pred!=sp500.Direction) #Incorrect predictions
[1] 0.4985673
The results are similar with K=3.
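A natural extension would be to scan a range of K values and compare test-set accuracy (a sketch; output not shown):
#Test-set accuracy for K = 1 through 15
set.seed(12345)
acc <- sapply(1:15, function(k) mean(knn(train.X, test.X, train.Direction, k=k)==sp500.Direction))
round(acc, 3)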