In this lab, we are going to practice KNN classification and logistic regression. KNN is a nonparametric classification method, while logistic regression is a parametric model.

Learning Objectives

library(caret)
library(ISLR)
library(MASS)
library(ggplot2)
library(ggfortify)
library(GGally)
library(class)

1. EDA

We will begin by examining some numerical and graphical summaries of the Smarket data, which is part of the ISLR library. This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each date, we have recorded the percentage returns for each of the five previous trading days, Lag1 through Lag5. We have also recorded Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question), and Direction (whether the market was Up or Down on this date).

# print variable names
names(Smarket)
## [1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
## [7] "Volume"    "Today"     "Direction"
# dimension of data
dim(Smarket)
## [1] 1250    9
# summary(Smarket)

The pairs() function produces a scatter plot matrix that contains all of the pairwise scatter plots among the predictors in the dataset. You can also use the ggpairs() function in the GGally package; ggpairs() will also calculate the correlation coefficients.

pairs(Smarket)

ggpairs(Smarket)

As one would expect, the correlations between the lag variables and today’s returns are close to zero. In other words, there appears to be little correlation between today’s returns and previous days’ returns. The only substantial correlation is between Year and Volume. By plotting the data we see that Volume is increasing over time. In other words, the average number of shares traded daily increased from 2001 to 2005.
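
For example, since the rows of Smarket are in chronological order, a quick base-graphics sketch of Volume against the row index shows this upward trend (this plot is just an optional check):

# Volume over time (observations are ordered from 2001 to 2005)
plot(Smarket$Volume, xlab = "Observation index", ylab = "Volume (billions of shares)")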

Again, split the data into a training set and a test set. Here, we only use the two predictors Lag1 and Lag2 in the model.

# set random seed, select the columns for model
set.seed(1)
data <- Smarket[,c(2,3,9)] # select Lag1, Lag2, Direction

#---- training and test
n = nrow(data)
index <- sample(1:n, size = n*0.7 )
training <- data[index,]
test <- data[-index,]

# The first 2 columns are the predictors
train.x <- data.frame(training[,-3]) 
train.y <- training$Direction

# predictor in test set, will be used for prediction
test.x <-  data.frame(test[,-3])
test.y <- test$Direction
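
Before fitting any model, it is worth checking the class balance in the two sets; the proportion of Up days gives a simple baseline accuracy to beat. A quick check:

# class balance in the training and test sets
prop.table(table(train.y))
prop.table(table(test.y))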

2. KNN Classification

We will now perform KNN using the knn() function, which is part of the class library. This function works rather differently from the other model-fitting functions that we have encountered thus far. Rather than a two-step approach in which we first fit the model and then use the model to make predictions, knn() forms predictions using a single command. The function requires four inputs: a matrix (or data frame) of training predictors, a matrix of test predictors, a vector of class labels for the training observations, and a value for \(K\), the number of nearest neighbors to use.

Now the knn() function can be used to predict the market’s movement for the test set. We set a random seed before we apply knn() because if several observations are tied as nearest neighbors, then R will randomly break the tie. Therefore, a seed must be set in order to ensure reproducibility of results. Let’s try \(K=5\).

set.seed(1)
knn.pred <- knn(train.x, test.x, train.y, k = 5)

We can use the confusionMatrix() function from the caret package to create the confusion matrix. confusionMatrix() will also calculate error metrics such as sensitivity, specificity, and so on.

# confusionMatrix(prediction, true)
confusionMatrix(knn.pred, test.y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down  Up
##       Down   85  96
##       Up     89 105
##                                           
##                Accuracy : 0.5067          
##                  95% CI : (0.4549, 0.5584)
##     No Information Rate : 0.536           
##     P-Value [Acc > NIR] : 0.8831          
##                                           
##                   Kappa : 0.0109          
##                                           
##  Mcnemar's Test P-Value : 0.6591          
##                                           
##             Sensitivity : 0.4885          
##             Specificity : 0.5224          
##          Pos Pred Value : 0.4696          
##          Neg Pred Value : 0.5412          
##              Prevalence : 0.4640          
##          Detection Rate : 0.2267          
##    Detection Prevalence : 0.4827          
##       Balanced Accuracy : 0.5054          
##                                           
##        'Positive' Class : Down            
## 

Or use the table() function to get just the confusion matrix; we can then define the error metrics ourselves.

table( test.y,knn.pred)
##       knn.pred
## test.y Down  Up
##   Down   85  89
##   Up     96 105
# index the table; define positive = "Up"

ConMat <- table(test.y, knn.pred)
TN <- ConMat[1,1] # true Down, predicted Down
FP <- ConMat[1,2] # true Down, predicted Up
FN <- ConMat[2,1] # true Up, predicted Down
TP <- ConMat[2,2] # true Up, predicted Up

Acc <- (TP + TN)/sum(ConMat) # accuracy
TPR <- TP/(TP + FN)          # sensitivity (true positive rate)
TNR <- TN/(TN + FP)          # specificity (true negative rate)
Mis <- (FP + FN)/sum(ConMat) # misclassification rate
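
As a sanity check, the misclassification rate can also be computed directly from the predictions, without indexing the confusion matrix; since both factors share the same levels, the result should equal Mis above.

# proportion of test observations where the prediction disagrees with the truth
mean(knn.pred != test.y)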

Now you can apply what we did in Lab 1 to find the optimal \(K\). For example, choose the misclassification rate as the objective function; this time we want to find the \(K\) that achieves the lowest test misclassification rate.

set.seed(1)
# create an empty vector to store the test misclassification rates
test_mis <- c()
# for each K = i, fit the KNN classifier, find the predictions,
# calculate the misclassification rate, and store it in the ith element of test_mis
for(i in 1:100){
  predicted <- knn(train.x,test.x,train.y ,k=i)
  ConMat <- table(test.y,predicted)
  TN <- ConMat[1,1]
  FP <- ConMat[1,2]
  FN <- ConMat[2,1]
  TP <- ConMat[2,2]
  test_mis[i] <-(FP+FN) / sum(ConMat)
}
test_mis
##   [1] 0.4880000 0.4720000 0.4906667 0.4746667 0.4933333 0.5040000 0.5013333
##   [8] 0.4933333 0.4906667 0.4880000 0.4853333 0.4746667 0.4800000 0.4800000
##  [15] 0.4800000 0.4960000 0.4986667 0.4933333 0.4933333 0.5200000 0.4880000
##  [22] 0.5093333 0.5146667 0.5093333 0.4693333 0.4826667 0.4906667 0.4746667
##  [29] 0.4666667 0.4666667 0.4613333 0.4693333 0.4666667 0.4800000 0.4720000
##  [36] 0.4826667 0.4746667 0.4800000 0.4506667 0.4480000 0.4560000 0.4640000
##  [43] 0.4453333 0.4640000 0.4586667 0.4426667 0.4560000 0.4506667 0.4586667
##  [50] 0.4586667 0.4480000 0.4533333 0.4693333 0.4666667 0.4666667 0.4640000
##  [57] 0.4506667 0.4640000 0.4586667 0.4640000 0.4746667 0.4480000 0.4586667
##  [64] 0.4506667 0.4720000 0.4693333 0.4746667 0.4586667 0.4640000 0.4560000
##  [71] 0.4666667 0.4666667 0.4613333 0.4426667 0.4560000 0.4506667 0.4506667
##  [78] 0.4453333 0.4613333 0.4640000 0.4693333 0.4560000 0.4586667 0.4693333
##  [85] 0.4666667 0.4506667 0.4560000 0.4533333 0.4426667 0.4453333 0.4506667
##  [92] 0.4560000 0.4373333 0.4506667 0.4480000 0.4613333 0.4586667 0.4746667
##  [99] 0.4586667 0.4666667
which.min(test_mis) # find the k with minimum misclassification
## [1] 93
min(test_mis)
## [1] 0.4373333
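
It also helps to visualize how the test misclassification rate changes with \(K\); a quick base-graphics sketch (ggplot2 would work equally well):

# test misclassification rate as a function of K
plot(1:100, test_mis, type = "b", xlab = "K", ylab = "Test misclassification rate")
abline(v = which.min(test_mis), lty = 2) # mark the K with the lowest rate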

The results are rather disappointing: the test misclassification rate is close to 50%, which is nearly as bad as random guessing! Of course this result is not all that surprising, given that one would not generally expect to be able to use previous days’ returns to predict future market performance. (After all, if it were possible to do so, then we would be out striking it rich rather than taking a statistics course.)

3. Logistic Regression

Next, we will fit a logistic regression model in order to predict Direction using Lag1 and Lag2. The glm() function fits generalized linear models, a class of models that includes logistic regression. The syntax of the glm() function is similar to that of lm(), except that we must pass in the argument family=binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.

Fit the model on the training set.

glm.fits <- glm(Direction ~ Lag1 + Lag2, data = training, family = binomial)
summary(glm.fits)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = training)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.339  -1.191   1.082   1.160   1.332  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.04396    0.06769   0.649    0.516
## Lag1        -0.06527    0.05993  -1.089    0.276
## Lag2        -0.02861    0.06038  -0.474    0.636
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1212.6  on 874  degrees of freedom
## Residual deviance: 1211.2  on 872  degrees of freedom
## AIC: 1217.2
## 
## Number of Fisher Scoring iterations: 3

Again, use the predict() function to get the predicted probabilities.

glm.prob = predict(glm.fits, test.x, type = "response")

When predicting from a glm fit, it’s important to specify the "type" argument.

Note: the default is on the scale of the linear predictors (the log-odds). If you want the predicted probabilities, make sure to specify type = "response".
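
To see the relationship between the two scales, compare a few test observations: applying the logistic function plogis() to the linear predictor recovers the probabilities returned by type = "response". A quick check:

# default scale: the linear predictor (log-odds)
head(predict(glm.fits, test.x, type = "link"))
# plogis() transforms log-odds into probabilities; this should match glm.prob
head(plogis(predict(glm.fits, test.x, type = "link")))
head(glm.prob)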

Now, we know these values correspond to the probability of \(Y=1\). The next question is: how did R create this dummy variable? Use the contrasts() function.

contrasts(data$Direction)
##      Up
## Down  0
## Up    1

The result indicates that R has coded a dummy variable with a 1 for Up. We can then convert the predicted probabilities into class predictions using a cutoff point. Here, we introduce the ifelse() function.

ifelse(test, yes, no)

For example, to calculate the absolute value of \(x\), we first "test" whether \(x>0\): if \(x>0\), then \(|x|=x\); otherwise \(|x|=-x\). Thus we can have

absolute <- function(x){
  ifelse(x>0,x,-x)
}
absolute(2)
## [1] 2
absolute(-2)
## [1] 2

Use the ifelse() function to complete the prediction.

glm.pred <- ifelse(glm.prob>0.5, "Up", "Down")

Now we can repeat the same procedure as in Section 2 and calculate the misclassification rate. Our tuning parameter is the cutoff point \(q\). Let’s use a wide range of \(q\), say \(q = 0.44, 0.45, 0.46, 0.47, \ldots, 0.58\) (why? Look at the range of the predicted probabilities in the summary below). Report the test misclassification rate for each \(q\).

summary(glm.prob) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4229  0.5002  0.5104  0.5109  0.5228  0.5860
set.seed(1)
# create an empty vector to store the test misclassification rates
test_mis_l <- c()
# for each cutoff q[i], find the predictions,
# calculate the misclassification rate, and store it in the ith element of test_mis_l

q <- seq(0.44, 0.58, 0.01) # generate a sequence from 0.44 to 0.58 in steps of 0.01
for(i in 1:15){
  
  predicted <- ifelse(glm.prob>q[i], "Up", "Down")
  ConMat <- table(test.y,predicted)
  TN <- ConMat[1,1]
  FP <- ConMat[1,2]
  FN <- ConMat[2,1]
  TP <- ConMat[2,2]
  test_mis_l[i] <-(FP+FN) / sum(ConMat)
}
test_mis_l
##  [1] 0.4666667 0.4693333 0.4586667 0.4453333 0.4506667 0.4666667 0.4666667
##  [8] 0.4560000 0.4853333 0.5120000 0.5280000 0.5413333 0.5360000 0.5386667
## [15] 0.5386667
which.min(test_mis_l)
## [1] 4
min(test_mis_l)
## [1] 0.4453333
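
As with KNN, a quick plot of the test misclassification rate against the cutoff \(q\) makes the pattern easier to see; a base-graphics sketch:

# test misclassification rate as a function of the cutoff q
plot(q, test_mis_l, type = "b", xlab = "Cutoff q", ylab = "Test misclassification rate")
abline(v = q[which.min(test_mis_l)], lty = 2) # mark the cutoff with the lowest rate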

4. Comparison

With the settings used below (\(K=5\) for KNN and cutoff \(q=0.47\) for logistic regression), the misclassification rate of logistic regression is lower than that of KNN. However, both are very close to 0.5; neither is good enough in practice.

glm.pred <- ifelse(glm.prob > q[4], "Up", "Down")
set.seed(1) # seed again so the KNN tie-breaking is reproducible
knn.pred <- knn(train.x, test.x, train.y, k = 5)

# add the prediction into the test set

test$glm.pred <- glm.pred
test$knn.pred <- knn.pred

# create a plot: use color to show the third variable
ggplot(data = test) + geom_point(aes(x = Lag1, y = Lag2, color = Direction))

# create a plot: text shows the true Direction, color shows the glm prediction
ggplot(data = test) + geom_text(aes(x = Lag1, y = Lag2, label = Direction, color = glm.pred)) + ggtitle("Classification Results from Logistic Regression")

# create a plot: text shows the true Direction, color shows the KNN prediction
ggplot(data = test) + geom_text(aes(x = Lag1, y = Lag2, label = Direction, color = knn.pred)) + ggtitle("Classification Results from KNN Classification")
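
To back up the comparison numerically, we can also compute the test misclassification rate for each set of predictions added to the test set above (a quick check, assuming the objects from the previous chunk are still in the workspace):

# test misclassification rate at the settings used above
mean(test$glm.pred != test$Direction) # logistic regression with cutoff q[4] = 0.47
mean(test$knn.pred != test$Direction) # KNN with K = 5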

5. Assignment

You can work individually or in pairs. Go back to BreastCancerData.csv. You are going to explore tumor diagnosis using the predictors compactness_mean and concavity_mean. Write a statistical report on your findings. In your report, make sure to include your responses to the following questions.

1. Preparation

Set the random seed to 1 and split the data into a 90% training set and a 10% test set. Provide a basic data description and EDA. Define your error metric.

2. KNN

Fit a KNN classification model on the training set. Use your error metric to select the optimal \(K\).

3. Logistic Regression

Fit a logistic regression model on the training set. Use your error metric to select the optimal cutoff point.

4. Comparison

Compare logistic regression with KNN (e.g., accuracy, FDR, sensitivity, specificity, running time, ...). What is multicollinearity? What happens if there is multicollinearity in logistic regression? Do your own research and include a short discussion.