No one method will dominate the others in every situation.
In logistic regression, unlike linear regression, we predict the class (category) of the dependent variable from the independent variables, using the relationship between them.
Let us use the stock market dataset ‘Smarket’ from the ISLR library to predict the direction of the stock market from the other variables in the dataset.
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.4.3
attach(Smarket)
head(Smarket, 10)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 2001 0.381 -0.192 -2.624 -1.055 5.010 1.1913 0.959 Up
## 2 2001 0.959 0.381 -0.192 -2.624 -1.055 1.2965 1.032 Up
## 3 2001 1.032 0.959 0.381 -0.192 -2.624 1.4112 -0.623 Down
## 4 2001 -0.623 1.032 0.959 0.381 -0.192 1.2760 0.614 Up
## 5 2001 0.614 -0.623 1.032 0.959 0.381 1.2057 0.213 Up
## 6 2001 0.213 0.614 -0.623 1.032 0.959 1.3491 1.392 Up
## 7 2001 1.392 0.213 0.614 -0.623 1.032 1.4450 -0.403 Down
## 8 2001 -0.403 1.392 0.213 0.614 -0.623 1.4078 0.027 Up
## 9 2001 0.027 -0.403 1.392 0.213 0.614 1.1640 1.303 Up
## 10 2001 1.303 0.027 -0.403 1.392 0.213 1.2326 0.287 Up
names(Smarket)
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
Now let us create a logistic regression model to predict the ‘Direction’ of the market from the five lag variables and ‘Volume’. ‘Year’ and ‘Today’ are left out of the model.
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial)
Let us have a look at the model using the ‘summary’ function.
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Smarket)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.446 -1.203 1.065 1.145 1.326
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
Significance is read as in linear regression: variables with more stars are more significant, and a p-value below 0.05 is the usual threshold. In our model no variable gets a star. The smallest p-value is 0.145 (Lag1), so none of the predictors shows a significant association with ‘Direction’.
As our ‘Direction’ variable is categorical, R creates a dummy variable for it. To see how the dummy variable is coded, we use the ‘contrasts’ function.
contrasts(Direction)
## Up
## Down 0
## Up 1
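The coding matters because a binomial ‘glm’ models the probability of the level coded 1. A minimal sketch of how the coding follows the order of the factor levels, using a small made-up factor (the objects ‘d’ and ‘d2’ are illustrative, not part of the Smarket data):

```r
# Toy factor standing in for Smarket$Direction (values are made up)
d <- factor(c("Up", "Down", "Up", "Up", "Down"))

levels(d)      # "Down" "Up": levels are sorted alphabetically by default
contrasts(d)   # dummy column "Up": Down -> 0, Up -> 1

# Making "Up" the reference level flips the coding, so a binomial glm
# fitted on d2 would model P(Down) instead of P(Up)
d2 <- relevel(d, ref = "Up")
contrasts(d2)  # dummy column "Down": Up -> 0, Down -> 1
```

With the default coding shown by ‘contrasts(Direction)’, the fitted probabilities are therefore probabilities of ‘Up’.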
Now we will use our ‘glm.fit’ model to make predictions. As I do not have any separate test data, I will use the first 10 rows of the same data. Because these rows were also used to fit the model, the accuracies below are optimistic and only illustrate the workflow.
test <- head(Smarket, 10)
Now I will pass this ‘test’ set to the ‘predict’ function to obtain the probability P(Y = 1 | x), that is, the probability of ‘Up’. We give ‘type = “response”’ to obtain probabilities instead of log-odds.
predictions <- predict(glm.fit, newdata = test, type = "response")
predictions
## 1 2 3 4 5 6 7
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509
## 8 9 10
## 0.5092292 0.5176135 0.4888378
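To see what ‘type = “response”’ does, we can rebuild the first prediction by hand. The sketch below copies the rounded coefficients from the summary output and the first row of the data, computes the log-odds, and applies the inverse logit. The names ‘beta’, ‘x1’, ‘log_odds’ and ‘p_up’ are only illustrative, and because the coefficients are rounded the result agrees with ‘predict’ only approximately.

```r
# Coefficients copied (rounded) from the summary(glm.fit) output
beta <- c(Intercept = -0.126000, Lag1 = -0.073074, Lag2 = -0.042301,
          Lag3 = 0.011085, Lag4 = 0.009359, Lag5 = 0.010313,
          Volume = 0.135441)

# Row 1 of Smarket, with a leading 1 for the intercept
x1 <- c(1, 0.381, -0.192, -2.624, -1.055, 5.010, 1.1913)

log_odds <- sum(beta * x1)            # what type = "link" would return
p_up     <- 1 / (1 + exp(-log_odds))  # inverse logit; same as plogis(log_odds)

round(log_odds, 4)  # about 0.0283
round(p_up, 4)      # about 0.5071, matching the first prediction
```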
Now let us convert these probabilities into class labels, assigning ‘Up’ when the probability is greater than 0.5 and ‘Down’ otherwise.
direction <- ifelse(predictions > 0.5, 'Up', 'Down')
direction
## 1 2 3 4 5 6 7 8 9 10
## "Up" "Down" "Down" "Up" "Up" "Up" "Down" "Up" "Up" "Down"
Let us see how our predictions compare with the actual directions, using the ‘table’ function. Rows are the actual directions and columns are the predicted ones.
table(test$Direction, direction)
## direction
## Down Up
## Down 2 0
## Up 2 6
So our accuracy is 0.8: 8 of the 10 predictions (2 ‘Down’ and 6 ‘Up’) are correct.
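The accuracy is the share of observations on the diagonal of the confusion matrix. A small sketch that retypes the table above as a matrix (‘conf_mat’ and ‘accuracy’ are illustrative names):

```r
# Confusion matrix copied from the table() output
# (rows = actual direction, columns = predicted direction)
conf_mat <- matrix(c(2, 2, 0, 6), nrow = 2,
                   dimnames = list(actual = c("Down", "Up"),
                                   predicted = c("Down", "Up")))

# Correct predictions sit on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy  # 0.8
```

With the objects from this post, ‘mean(direction == test$Direction)’ should give the same number.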
Logistic regression is mostly used to classify a response with two classes. Linear Discriminant Analysis (LDA) also handles responses with more than two classes. It uses Bayes’ theorem and assumes that the predictors are normally distributed within each class, with the same covariance matrix in every class. We use the ‘lda’ function from the MASS library to fit it.
We use the same ‘Smarket’ data to understand LDA.
library(MASS)
## Warning: package 'MASS' was built under R version 3.4.3
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket)
lda.fit
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket)
##
## Prior probabilities of groups:
## Down Up
## 0.4816 0.5184
##
## Group means:
## Lag1 Lag2
## Down 0.05068605 0.03229734
## Up -0.03969136 -0.02244444
##
## Coefficients of linear discriminants:
## LD1
## Lag1 -0.7567605
## Lag2 -0.4707872
The output gives the ‘prior probabilities’ and the ‘class means’ that Bayes’ theorem uses. We used only the Lag1 and Lag2 variables in our model, just for simplicity.
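To make the role of Bayes’ theorem concrete, here is a minimal sketch that computes the LDA posterior probabilities by hand from the priors, the class means and a pooled covariance matrix, and compares them with ‘predict’. It runs on simulated data, not on Smarket, and all object names are hypothetical.

```r
library(MASS)

# Simulated two-class data (hypothetical; not the Smarket data)
set.seed(1)
n <- 200
g <- factor(rep(c("Down", "Up"), each = n / 2))
dat <- data.frame(Direction = g,
                  Lag1 = rnorm(n, mean = ifelse(g == "Up", -0.5, 0.5)),
                  Lag2 = rnorm(n, mean = ifelse(g == "Up", -0.3, 0.3)))
X <- as.matrix(dat[, c("Lag1", "Lag2")])

# Ingredients of Bayes' theorem: priors, class means, pooled covariance
prior <- as.numeric(table(g)) / n
mu <- rbind(Down = colMeans(X[g == "Down", ]), Up = colMeans(X[g == "Up", ]))
S <- ((sum(g == "Down") - 1) * cov(X[g == "Down", ]) +
      (sum(g == "Up") - 1) * cov(X[g == "Up", ])) / (n - 2)

# Linear discriminant score of each class, then normalise to posteriors
S_inv <- solve(S)
scores <- sapply(1:2, function(k) {
  drop(X %*% S_inv %*% mu[k, ]) -
    0.5 * drop(mu[k, ] %*% S_inv %*% mu[k, ]) + log(prior[k])
})
post_manual <- exp(scores) / rowSums(exp(scores))

# Compare with the posteriors returned by MASS::lda
fit <- lda(Direction ~ Lag1 + Lag2, data = dat)
post_lda <- predict(fit, dat)$posterior
max(abs(post_manual - post_lda))  # should be numerically negligible
```

The class with the larger posterior is the predicted class, which is what the ‘class’ component of ‘predict’ reports.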
We can use the ‘predict’ function to predict the class.
new_pred <- predict(lda.fit, newdata = test)
new_pred
## $class
## [1] Up Down Down Up Up Up Down Up Up Down
## Levels: Down Up
##
## $posterior
## Down Up
## 1 0.4861024 0.5138976
## 2 0.5027466 0.4972534
## 3 0.5104516 0.4895484
## 4 0.4817860 0.5182140
## 5 0.4854771 0.5145229
## 6 0.4920394 0.5079606
## 7 0.5085978 0.4914022
## 8 0.4896886 0.5103114
## 9 0.4774690 0.5225310
## 10 0.5049515 0.4950485
##
## $x
## LD1
## 1 -0.193187790
## 2 -0.900356413
## 3 -1.227714911
## 4 -0.009643717
## 5 -0.166603724
## 6 -0.445506476
## 7 -1.148941474
## 8 -0.345614409
## 9 0.174041524
## 10 -0.994023376
The prediction is a list with three components: ‘class’, the class predicted for each observation; ‘posterior’, the posterior probability of each class; and ‘x’, the linear discriminant score (LD1).
Let us look at the confusion matrix to verify our predictions.
table(test$Direction, new_pred$class)
##
## Down Up
## Down 2 0
## Up 2 6
Our accuracy is the same as with logistic regression, i.e., 80%.
The difference between LDA and QDA (Quadratic Discriminant Analysis) is that QDA lets each class have its own covariance matrix instead of assuming a common one. We fit it with the ‘qda’ function.
qda.fit <- qda(Direction ~ Lag1 + Lag2, data = Smarket)
qda.fit
## Call:
## qda(Direction ~ Lag1 + Lag2, data = Smarket)
##
## Prior probabilities of groups:
## Down Up
## 0.4816 0.5184
##
## Group means:
## Lag1 Lag2
## Down 0.05068605 0.03229734
## Up -0.03969136 -0.02244444
As with LDA, the output gives the ‘prior probabilities’ and the ‘class means’ that Bayes’ theorem uses. Again we used only the Lag1 and Lag2 variables, just for simplicity.
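The class-specific covariance is the only change from LDA. A minimal sketch, again on simulated data with hypothetical names, that builds the QDA posteriors from one normal density per class and checks them against ‘predict’:

```r
library(MASS)

# Simulated data in which the two classes have different spreads
# (hypothetical; not the Smarket data)
set.seed(2)
n <- 200
g <- factor(rep(c("Down", "Up"), each = n / 2))
dat <- data.frame(Direction = g,
                  Lag1 = rnorm(n, mean = ifelse(g == "Up", -0.5, 0.5),
                               sd = ifelse(g == "Up", 1, 2)),
                  Lag2 = rnorm(n, mean = ifelse(g == "Up", -0.3, 0.3)))
X <- as.matrix(dat[, c("Lag1", "Lag2")])

# Bivariate normal density with a class-specific mean and covariance
dens <- function(X, mu, S) {
  d <- mahalanobis(X, center = mu, cov = S)  # squared Mahalanobis distance
  exp(-0.5 * d) / (2 * pi * sqrt(det(S)))
}

# prior * density for each class, normalised to posteriors
num <- sapply(levels(g), function(k) {
  Xk <- X[g == k, ]
  mean(g == k) * dens(X, colMeans(Xk), cov(Xk))
})
post_manual <- num / rowSums(num)

# Compare with the posteriors returned by MASS::qda
fit <- qda(Direction ~ Lag1 + Lag2, data = dat)
post_qda <- predict(fit, dat)$posterior
max(abs(post_manual - post_qda))  # should be numerically negligible
```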
We can use the ‘predict’ function to predict the class.
qda_pred <- predict(qda.fit, newdata = test)
qda_pred
## $class
## [1] Up Up Down Up Up Up Down Up Up Up
## Levels: Down Up
##
## $posterior
## Down Up
## 1 0.4754475 0.5245525
## 2 0.4944981 0.5055019
## 3 0.5083068 0.4916932
## 4 0.4780731 0.5219269
## 5 0.4773961 0.5226039
## 6 0.4836726 0.5163274
## 7 0.5013701 0.4986299
## 8 0.4916379 0.5083621
## 9 0.4675238 0.5324762
## 10 0.4967838 0.5032162
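The prediction is a list with two components: ‘class’, the class predicted for each observation, and ‘posterior’, the posterior probability of each class. Unlike the LDA prediction, there is no ‘x’ component.
Let us look at the confusion matrix to verify our predictions.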
table(test$Direction, qda_pred$class)
##
## Down Up
## Down 2 0
## Up 0 8
And now our accuracy on these 10 rows is 100%.
kNN does not use a parametric approach. It finds the k training points nearest to a test point and assigns the test point to the class that occurs most often among those k neighbours.
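The idea fits in a few lines of base R. The function ‘knn_one’ and the toy data below are illustrative only and are not part of any package. Unlike a full implementation, this sketch simply takes the first class when votes are tied.

```r
# Classify one query point by majority vote among its k nearest training points
knn_one <- function(train, labels, query, k = 3) {
  # Euclidean distance from the query to every training row
  dists <- sqrt(rowSums(sweep(train, 2, query)^2))
  nearest <- order(dists)[seq_len(k)]  # indices of the k closest rows
  votes <- table(labels[nearest])      # class counts among the neighbours
  names(votes)[which.max(votes)]       # most frequent class (first on ties)
}

# Two well separated groups of made-up points
train  <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
labels <- factor(c("Down", "Down", "Down", "Up", "Up", "Up"))

knn_one(train, labels, query = c(0.5, 0.5), k = 3)  # "Down"
knn_one(train, labels, query = c(5.5, 5.5), k = 3)  # "Up"
```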
We will now perform kNN using the knn() function, which is part of the ‘class’ library. This function works rather differently from the other model-fitting functions that we have encountered thus far. Rather than a two-step approach in which we first fit the model and then use the model to make predictions, knn() forms predictions using a single command. The function requires four inputs: the training predictors, the test predictors, the class labels of the training observations, and the value of k.
library(class)
## Warning: package 'class' was built under R version 3.4.3
train_knn <- Smarket[,2:3]
test_knn <- test[,2:3]
direction <- Smarket$Direction
We have our inputs, so let us run kNN with k = 3. knn() returns the predictions directly.
knn.pred <- knn(train_knn, test_knn, direction, k = 3)
Now let us check the accuracy of our model, using the ‘table’ function to create the confusion matrix.
table(test$Direction, knn.pred)
## knn.pred
## Down Up
## Down 1 1
## Up 0 8
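Our accuracy is 90%: 9 of the 10 predictions are correct. Note that every test row is also in the training set, so each one counts among its own nearest neighbours.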
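All the accuracies in this post are measured on rows that were also used for fitting. A more honest check keeps some rows aside. Here is a minimal sketch of such a hold-out evaluation for logistic regression, on simulated data with a built-in signal; the data, the 80/20 split and all object names are assumptions made for illustration.

```r
# Simulated data with a real signal (hypothetical; not the Smarket data)
set.seed(42)
n <- 1000
dat <- data.frame(Lag1 = rnorm(n), Lag2 = rnorm(n))
p_up <- plogis(-1.5 * dat$Lag1 + 0.5 * dat$Lag2)
dat$Direction <- factor(ifelse(runif(n) < p_up, "Up", "Down"))

# Hold out the last 20% of rows; fit on the remaining rows only
is_test   <- seq_len(n) > 0.8 * n
train_set <- dat[!is_test, ]
test_set  <- dat[is_test, ]

fit  <- glm(Direction ~ Lag1 + Lag2, data = train_set, family = binomial)
prob <- predict(fit, newdata = test_set, type = "response")
pred <- ifelse(prob > 0.5, "Up", "Down")

# Confusion matrix and accuracy on rows the model never saw
conf_mat <- table(actual = test_set$Direction, predicted = pred)
conf_mat
holdout_accuracy <- mean(pred == test_set$Direction)
holdout_accuracy
```

For time-ordered data such as Smarket, holding out the most recent rows is a natural choice, since it mimics predicting the future from the past.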