In this lab, we are going to practice KNN classification and Logistic regression. KNN is one type of nonparametric classification method, while Logistic regression is a parametric model.
Practice KNN classification and Logistic regression on a binary (two level) qualitative response.
Create confusion matrix. Calculate the error metrics.
Use the error metrics to find the optimal value for tuning parameters.
Compare KNN classification with Logistic regression.
library(caret)
library(ISLR)
library(MASS)
library(ggplot2)
library(ggfortify)
library(GGally)
library(class)
We will begin by examining some numerical and graphical summaries of the Smarket data, which is part of the ISLR library. This data set consists of percentage returns for the S&P 500 stock index over 1, 250 days, from the beginning of 2001 until the end of 2005. For each date, we have recorded the percentage returns for each of the five previous trading days, Lag1 through Lag5. We have also recorded Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question) and Direction (whether the market was Up or Down on this date).
# print variable names
names(Smarket)
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
# dimension of data
dim(Smarket)
## [1] 1250 9
# summary(Smarket)
The pairs() function produce a scatter plot matrix that contains all of the pairwise scatter plot among the predictors in the dataset. You can also use ggpairs() function in the GGally package.ggpairs() will also calculate the correlation coefficients.
pairs(Smarket)
ggpairs(Smarket)
As one would expect, the correlations between the lag variables and to- dayās returns are close to zero. In other words, there appears to be little correlation between todayās returns and previous daysā returns. The only substantial correlation is between Year and Volume. By plotting the data we see that Volume is increasing over time. In other words, the average number of shares traded daily increased from 2001 to 2005.
Again, split the data into training set and test set. Here, we only use the first two variables Lag1 and Lag2 in the model.
# set random seed, select the columns for model
set.seed(1)
data <- Smarket[,c(2,3,9)] # select Lag1, Lag2, Direction
#---- training and test
n = nrow(data)
index <- sample(1:n, size = n*0.7 )
training <- data[index,]
test <- data[-index,]
# The first 2 columns are the predictors
train.x <- data.frame(training[,-3])
train.y <- training$Direction
# predictor in test set, will be used for prediction
test.x <- data.frame(test[,-3])
test.y <- test$Direction
We will now perform KNN using the knn() function, which is part of the class library. This function works rather differently from the other model- fitting functions that we have encountered thus far. Rather than a two-step approach in which we first fit the model and then we use the model to make predictions, knn() forms predictions using a single command. The function requires four inputs.
A matrix containing the predictors associated with the training data, labeled train.x .
A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X .
A vector containing the class labels for the training observations, labeled train.y .
A value for K, the number of nearest neighbors to be used by the classifier.
Now the knn() function can be used to predict the marketās movement for the test set. We set a random seed before we apply knn() because if several observations are tied as nearest neighbors, then R will randomly break the tie. Therefore, a seed must be set in order to ensure reproducibility of results. Letās try \(K=5\).
set.seed(1)
knn.pred=knn(train.x,test.x,train.y ,k=5)
We can use confusionMatrix() function to create the confusion matrix. confusionMatrix() will also calculate error metrics like sensitivity,specificityā¦
# confusionMatrix(prediction, true)
confusionMatrix(knn.pred, test.y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 85 96
## Up 89 105
##
## Accuracy : 0.5067
## 95% CI : (0.4549, 0.5584)
## No Information Rate : 0.536
## P-Value [Acc > NIR] : 0.8831
##
## Kappa : 0.0109
##
## Mcnemar's Test P-Value : 0.6591
##
## Sensitivity : 0.4885
## Specificity : 0.5224
## Pos Pred Value : 0.4696
## Neg Pred Value : 0.5412
## Prevalence : 0.4640
## Detection Rate : 0.2267
## Detection Prevalence : 0.4827
## Balanced Accuracy : 0.5054
##
## 'Positive' Class : Down
##
Or use table() function to have the confusion matrix only. We can define the error metrics here.
table( test.y,knn.pred)
## knn.pred
## test.y Down Up
## Down 85 89
## Up 96 105
# index the table, define positive = "up"
ConMat <- table(test.y,knn.pred)
TN <- ConMat[1,1]
FP <- ConMat[1,2]
FN <- ConMat[2,1]
TP <- ConMat[2,2]
Acc <- (TP + TN)/sum(ConMat)
TPR <- TP/(TP + FN)
TNR <- TN/(TN + FP)
Mis <- (FP+FN) / sum(ConMat)
Now, you can apply what we have in Lab 1 here, to find the optimal \(K\). For example, choose misclassification rate as the objective function. This time we want to find the \(K\) to achieve the lowest misclassification.
set.seed(1)
# create an empty vector to store the test misclassification
test_mis <- c()
# for each K=i, fit the knn, find the prediction,
# calculate the misclassification, store it in the ith item of vector misclassification
for(i in 1:100){
predicted <- knn(train.x,test.x,train.y ,k=i)
ConMat <- table(test.y,predicted)
TN <- ConMat[1,1]
FP <- ConMat[1,2]
FN <- ConMat[2,1]
TP <- ConMat[2,2]
test_mis[i] <-(FP+FN) / sum(ConMat)
}
test_mis
## [1] 0.4880000 0.4720000 0.4906667 0.4746667 0.4933333 0.5040000 0.5013333
## [8] 0.4933333 0.4906667 0.4880000 0.4853333 0.4746667 0.4800000 0.4800000
## [15] 0.4800000 0.4960000 0.4986667 0.4933333 0.4933333 0.5200000 0.4880000
## [22] 0.5093333 0.5146667 0.5093333 0.4693333 0.4826667 0.4906667 0.4746667
## [29] 0.4666667 0.4666667 0.4613333 0.4693333 0.4666667 0.4800000 0.4720000
## [36] 0.4826667 0.4746667 0.4800000 0.4506667 0.4480000 0.4560000 0.4640000
## [43] 0.4453333 0.4640000 0.4586667 0.4426667 0.4560000 0.4506667 0.4586667
## [50] 0.4586667 0.4480000 0.4533333 0.4693333 0.4666667 0.4666667 0.4640000
## [57] 0.4506667 0.4640000 0.4586667 0.4640000 0.4746667 0.4480000 0.4586667
## [64] 0.4506667 0.4720000 0.4693333 0.4746667 0.4586667 0.4640000 0.4560000
## [71] 0.4666667 0.4666667 0.4613333 0.4426667 0.4560000 0.4506667 0.4506667
## [78] 0.4453333 0.4613333 0.4640000 0.4693333 0.4560000 0.4586667 0.4693333
## [85] 0.4666667 0.4506667 0.4560000 0.4533333 0.4426667 0.4453333 0.4506667
## [92] 0.4560000 0.4373333 0.4506667 0.4480000 0.4613333 0.4586667 0.4746667
## [99] 0.4586667 0.4666667
which.min(test_mis) # find the k with minimum misclassification
## [1] 93
min(test_mis)
## [1] 0.4373333
The results are rather disappointing: the FDR is close to 50%, which is nearly as bad as random guessing! Of course this result is not all that surprising, given that one would not generally expect to be able to use previous daysā returns to predict future market performance. (After all, if it were possible to do so, then we would be out striking it rich rather than taking a statistics course)
Next, we will fit a logistic regression model in order to predict Direction using Lag1 and Lag2. The glm() function fits generalized linear models, a class of models that includes logistic regression. The syntax of the glm() function is similar to that of lm(), except that we must pass in the argument family=binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.
Fit the model on training set
glm.fits=glm(Direction ~ Lag1+Lag2, data=training ,family=binomial)
summary(glm.fits)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.339 -1.191 1.082 1.160 1.332
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.04396 0.06769 0.649 0.516
## Lag1 -0.06527 0.05993 -1.089 0.276
## Lag2 -0.02861 0.06038 -0.474 0.636
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1212.6 on 874 degrees of freedom
## Residual deviance: 1211.2 on 872 degrees of freedom
## AIC: 1217.2
##
## Number of Fisher Scoring iterations: 3
Again, use predict() function to get the predicted probability
glm.prob = predict(glm.fits, test.x, type = "response")
In glm, itās important to specify the ātypeā.
type = "response" gives the predicted probabilities
type = "terms" gives the predicted linear predictors, which is the log-odds.
Note: The default is on the scale of the linear predictors. If you want to have the predicted probability, make sure to specify type = "response".
Now, we know these values correspond to the probability of \(Y=1\). The following question is, how R created this dummy variable? Use contract() function
contrasts(data$Direction)
## Up
## Down 0
## Up 1
The result indicates that R has coded a dummy variable with a 1 for Up. We can complete the prediction using the cutoff point. Here, we introduce the ifelse() function.
ifelse(test, yes, no)
test is an object which can be coerced to logical mode.
yes is the return values for true elements of test.
no is the return values for false elements of test.
For example, to calculate the absolute value of \(x\), we first ātestā if \(x>0\), if \(x>0\), then \(|x|=x\), or \(|x|=-x\). Thus we can have
absolute <- function(x){
ifelse(x>0,x,-x)
}
absolute(2)
## [1] 2
absolute(-2)
## [1] 2
Use ifelse function to complete the prediction.
glm.pred <- ifelse(glm.prob>0.5, "Up", "Down")
Now, we can repeat the same procedure in section 1. Calculate the misclassification rate. Our tuning parameter is the cutoff point \(q\). Letās use a wide range of \(q\), say \(q=0.44, 0.45,0.46, 0.47, ...,0.58\)(why?). Report the test misclassification rate for each \(q\)
summary(glm.prob)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4229 0.5002 0.5104 0.5109 0.5228 0.5860
set.seed(1)
# create an empty vector to store the test misclassification
test_mis_l <- c()
# for each K=i, find the prediction,
# calculate the misclassification, store it in the ith item of vector test_misclassification
q <- seq(0.44,0.58,0.01) # generate sequence from 0.45 to 0.58 with the break 0.1
for(i in 1:15){
predicted <- ifelse(glm.prob>q[i], "Up", "Down")
ConMat <- table(test.y,predicted)
TN <- ConMat[1,1]
FP <- ConMat[1,2]
FN <- ConMat[2,1]
TP <- ConMat[2,2]
test_mis_l[i] <-(FP+FN) / sum(ConMat)
}
test_mis_l
## [1] 0.4666667 0.4693333 0.4586667 0.4453333 0.4506667 0.4666667 0.4666667
## [8] 0.4560000 0.4853333 0.5120000 0.5280000 0.5413333 0.5360000 0.5386667
## [15] 0.5386667
which.min(test_mis_l)
## [1] 4
min(test_mis_l)
## [1] 0.4453333
The misclassification rates of Logistic regression is less than the misclassification of KNN. However, both are very close to 0.5. Neither is good enough in practice.
glm.pred <- ifelse(glm.prob>q[4], "Up", "Down")
knn.pred=knn(train.x,test.x,train.y ,k=5)
# add the prediction into the test set
test$glm.pred <- glm.pred
test$knn.pred <- knn.pred
# create a plot: use color show the third variable
ggplot(data = test) + geom_point(aes(x = Lag1, y = Lag2, color = Direction))
# create a plot: text shows the true Direction, color shows the glm prediction
ggplot(data = test) + geom_text(aes(x = Lag1, y = Lag2, label = Direction,color = glm.pred)) + ggtitle("Classification Results from Logistic Regression")
# create a plot: text shows the true Direction, color shows the glm prediction
ggplot(data = test) + geom_text(aes(x = Lag1, y = Lag2, label = Direction,color = knn.pred)) + ggtitle("Classification Results from KNN Classification")
You can work individually or in pairs. Go back to the BreastCancerData.csv. You are going to explore the tumor diagnostics using certain predictors compactness_mean and concavity_mean. Write a statistical report on your finding. In your statistical report, make sure to include your response to the following questions.
Set the random seed to be 1, split the data as 90% training and 10% test set. Provide basic data description and EDA. Define your error metric.
Fit KNN classification model on the training set. Use the error metric to select the optimal \(K\).
Fit Logistic regression model on the training set. Use the error metric to select the optimal cutoff point.
Compare the logistic regression with KNN (e.g., accuracy, FDR, sensitivity, specificity, running timeā¦) What is multicollinearity? What if there is multicollinearity in logistic regression? Do your research and put a short discusison.