We will test supervised classification using the k-nearest neighbors method, logistic regression, and support vector machines (SVM). The sections and subsections are independent, and you can address them in any order you prefer.

Presentation of the Data

We are interested in tumor data corresponding to samples of cancerous tissue of different types. The objective is to identify the type of tumors based on gene expression data.

  • The variable to predict is, therefore, the type of cancer (four possible types, coded 1, 2, 3, or 4).

  • The input variables are the expression levels of genes (2308 genes have been selected for the study).

We begin by loading the dataset in R.

# install.packages("ISLR")
library(ISLR)
data(Khan)
summary(Khan)
##        Length Class  Mode   
## xtrain 145404 -none- numeric
## xtest   46160 -none- numeric
## ytrain     63 -none- numeric
## ytest      20 -none- numeric
attach(Khan) # To use the variables xtrain, xtest, ytrain, ytest directly.
ytrain = factor(ytrain) # To code qualitative variables as factors.
ytest = factor(ytest)
ntrain = length(ytrain)
ntest = length(ytest)

We observe that the dataset already contains a training sample and a test sample. What is the size of the training sample? What about the test sample? In the training sample, how many tissues correspond to each of the 4 types of cancer?
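These quantities can be read directly from the loaded objects; for example:

dim(xtrain)   # 63 tissues x 2308 genes in the training sample
dim(xtest)    # 20 tissues x 2308 genes in the test sample
table(ytrain) # number of training tissues of each of the 4 tumor types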

Initially, we will focus on binary classification, meaning we will be interested in type 2 versus all other types.

Binary Classification

We start by generating a new variable ‘z’ that merges classes 1, 3, and 4 into a single class (coded FALSE), while class 2 is coded TRUE.

ztrain = ifelse(ytrain==2,T,F)
ztest = ifelse(ytest==2,T,F)
summary(ztrain)
##    Mode   FALSE    TRUE 
## logical      40      23

K-Nearest Neighbors

Test the k-nearest neighbors method for different values of k, using for example the knn() function from the ‘class’ package, and compute its error rate. We can also select k through cross-validation, for instance with the leave-one-out function below (a usage sketch follows it).

library(class) # for knn()

# Leave-one-out cross-validation for choosing k in k-NN:
# for each candidate k, every training point is predicted from the remaining ones,
# and the value of k minimizing the misclassification rate is returned.
knnCV <- function(xtrain, ytrain, kvect, plot=TRUE){
  ntrain = nrow(xtrain)
  nbk = length(kvect)
  risque = rep(NA,nbk)
  for (j in 1:nbk){
    # prediction of observation i from all training points except i
    preds <- sapply(1:ntrain, function(i)
        knn(xtrain[-i,],xtrain[i,],ytrain[-i],k=kvect[j]))
    risque[j] <- mean(preds != ytrain) # leave-one-out error rate for k = kvect[j]
  }
  khat <- kvect[which.min(risque)]
  print(cbind(k=kvect, risk=risque))
  if(plot)
  {
    plot(kvect, risque, type="b", pch=16)
    points(khat, min(risque), col="red")
  }
  khat
}
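For example (the object names zpred5 and k.CV.bin below are ours, and error rates may vary slightly since knn() breaks vote ties at random), we can test a fixed value of k and then select k by cross-validation on the binary labels:

library(class)
# k-NN with k = 5 on the binary problem and its test error rate
zpred5 = knn(xtrain, xtest, ztrain, k=5)
mean(zpred5 != ztest)
# select k among 1,...,20 by leave-one-out cross-validation on the training sample
k.CV.bin = knnCV(xtrain, ztrain, kvect=1:20)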

Logistic Regression

We recall that the logistic regression model aims to model the regression function \(\eta(x) = P(Y=1 \mid X=x)\) through a linear combination of the covariates passed through the logistic link. In other words, there exist coefficients \(\beta_0, \beta_1, \dots, \beta_d\) such that:

\(\eta(x) = g(\beta_0 + \beta_1 x_1 + \dots + \beta_d x_d), \quad \forall x = (x_1, \dots, x_d)^\top \in \mathbb{R}^d,\)

where

\(g(t) = \dfrac{e^t}{1 + e^t}.\)
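As a quick illustration (the name g below is used only here), this link function maps any real-valued score to a probability in (0, 1):

g = function(t) exp(t) / (1 + exp(t)) # logistic (sigmoid) link
g(0)        # 0.5: a zero linear score corresponds to probability 1/2
g(c(-5, 5)) # close to 0 and to 1 for strongly negative / positive scores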

The glm function, which works similarly to the lm function, allows for the estimation of the parameter \(\beta = (\beta_0, \beta_1, \dots, \beta_d)\) by maximizing the likelihood.

log1 = glm(ztrain~xtrain,family='binomial')
summary(log1)

Try to interpret the software output. There appears to be an issue related to the high dimensionality of the data (2308 variables for only 63 observations). To address this, we can fit penalized logistic regressions: GLM Lasso or GLM Ridge.

# GLM Lasso
# install.packages("glmnet")
library('glmnet')
cv.out = cv.glmnet(xtrain,ztrain,alpha=1,family="binomial")
plot(cv.out)

lambda_opt.lasso = cv.out$lambda.min
lasso.mod.opt = glmnet(xtrain,ztrain,alpha = 1, lambda = lambda_opt.lasso,family="binomial")
lasso.pred.opt = predict(lasso.mod.opt,newx = xtest,type='response') # Predict the probabilities P(Y=1|X) = η(X)
ztest.pred = (lasso.pred.opt>0.5)
table(ztest.pred,ztest)
##           ztest
## ztest.pred FALSE TRUE
##      FALSE    14    1
##      TRUE      0    5
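The corresponding test error rate can be computed directly from the predictions (1 error out of 20 here, as the confusion matrix shows):

mean(ztest.pred != ztest) # test error rate of the lasso logistic model: 1/20 = 0.05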

Do the same with GLM Ridge.

# GLM Ridge
library('glmnet')
cv.out.Ridge = cv.glmnet(xtrain,ztrain,alpha=0,family="binomial")
plot(cv.out.Ridge)

lambda_opt.Ridge = cv.out.Ridge$lambda.min
Ridge.mod.opt = glmnet(xtrain,ztrain,alpha = 0, lambda = lambda_opt.Ridge,family="binomial")
Ridge.pred.opt = predict(Ridge.mod.opt,newx = xtest,type='response') 
ztest.pred.Ridge = (Ridge.pred.opt>0.5)
table(ztest.pred.Ridge,ztest)
##                 ztest
## ztest.pred.Ridge FALSE TRUE
##            FALSE    14    0
##            TRUE      0    6

SVM

Now we test Support Vector Machines (SVM) using the svm function from the e1071 package.

# install.packages("e1071")
library("e1071")
# Parameter estimation
dat1=data.frame(x=xtrain , z=as.factor(ztrain))
svmlin = svm(z~., data=dat1, kernel="linear")
table(svmlin$fitted,ztrain)
##        ztrain
##         FALSE TRUE
##   FALSE    40    0
##   TRUE      0   23
# Prediction
dat1.test=data.frame(x=xtest , z=as.factor(ztest))
ztest.pred.svmlin=predict (svmlin , newdata =dat1.test)
table(ztest.pred.svmlin,ztest)
##                  ztest
## ztest.pred.svmlin FALSE TRUE
##             FALSE    14    0
##             TRUE      0    6

We notice that the linear SVM achieves perfect classification on both the training and test samples. Perfect classification on the training sample does not by itself indicate good performance, as it could be overfitting. In the case of SVMs, however, it shows that the training sample is linearly separable, which is often the case in high dimension with few observations. It is therefore not necessarily useful to go further with SVMs here. For other datasets, testing other kernels, such as the polynomial kernel (kernel="polynomial") or the Gaussian kernel (kernel="radial"), could be useful.

The parameter C appearing in the optimization problem seen in class corresponds to the ‘cost’ argument of svm(). We have not specified it here; check the function's documentation to see its default value. You can test different values: does it change the result?
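A possible sketch (the loop and names below are only illustrative; by default svm() uses cost = 1):

# effect of the cost parameter on the linear SVM (binary problem)
for (C in c(0.01, 1, 100)) {
  svmC = svm(z~., data=dat1, kernel="linear", cost=C)
  predC = predict(svmC, newdata=dat1.test)
  cat("cost =", C, "- number of test errors:", sum(predC != ztest), "\n")
}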

Multiclass Classification

We now return to the original dataset, where the type of cancer has to be predicted among 4 possibilities. Extensions of logistic regression to the multiclass case exist (multinomial logistic regression), but they are less widely used in practice, so in this section we will limit ourselves to the k-nearest neighbors method and SVM.

K-Nearest Neighbors

The knn function supports the multiclass case.

library(class)
# k=20
res.multi20 = knn(xtrain,xtest,ytrain,k=20)
table(res.multi20,ytest)
##            ytest
## res.multi20 1 2 3 4
##           1 1 0 0 0
##           2 2 6 3 0
##           3 0 0 3 0
##           4 0 0 0 5
sum(res.multi20!=ytest)/length(ytest)
## [1] 0.25
# With k=20, the k-NN predictions too often favor type 2 (several type-1 and type-3 tissues are misclassified)

# k selected by cross-validation
k.CV = knnCV(xtrain,ytrain,kvect=1:20) 
##        k       risk
##  [1,]  1 0.07936508
##  [2,]  2 0.11111111
##  [3,]  3 0.06349206
##  [4,]  4 0.06349206
##  [5,]  5 0.06349206
##  [6,]  6 0.07936508
##  [7,]  7 0.07936508
##  [8,]  8 0.14285714
##  [9,]  9 0.15873016
## [10,] 10 0.17460317
## [11,] 11 0.20634921
## [12,] 12 0.20634921
## [13,] 13 0.19047619
## [14,] 14 0.19047619
## [15,] 15 0.19047619
## [16,] 16 0.19047619
## [17,] 17 0.23809524
## [18,] 18 0.20634921
## [19,] 19 0.22222222
## [20,] 20 0.28571429

res.multi.CV = knn(xtrain,xtest,ytrain,k=k.CV)
table(res.multi.CV,ytest)
##             ytest
## res.multi.CV 1 2 3 4
##            1 3 0 0 0
##            2 0 6 0 0
##            3 0 0 5 0
##            4 0 0 1 5
sum(res.multi.CV!=ytest)/length(ytest)
## [1] 0.05

SVM

The version of SVM discussed in class does not handle multiclass classification. However, several methods exist to extend SVMs to the multiclass case. The one implemented in the svm() function that we are going to use is one-versus-one classification. Suppose our target variable Y can take K different values. This method trains one binary classifier for each of the K(K-1)/2 possible pairs of labels.

For example, in our case where K=4, we train 6 classifiers:

  • One classifier to differentiate type 1 from type 2.
  • One for type 1 versus type 3.
  • One for type 1 versus type 4.
  • One for type 2 versus type 3.
  • One for type 2 versus type 4.
  • One for type 3 versus type 4.

Each classifier is trained only on the subset of the training sample whose labels belong to the corresponding pair: for example, the parameters of the classifier distinguishing type 1 from type 2 are determined solely from the training individuals i such that \(Y_i \in \{1,2\}\).

The final class of a point x is then assigned by majority vote (e.g., class 1 is assigned if, among the 6 classifiers, class 1 is chosen for x more often than any other class). This is the method on which the svm() function relies.

library("e1071")
# Parameter estimation
dat2=data.frame(x=xtrain , y=as.factor(ytrain))
svmlin.multi = svm(y~., data=dat2, kernel="linear")
table(svmlin.multi$fitted,ytrain)
##    ytrain
##      1  2  3  4
##   1  8  0  0  0
##   2  0 23  0  0
##   3  0  0 12  0
##   4  0  0  0 20
# Prediction
dat2.test=data.frame(x=xtest , y=as.factor(ytest))
ytest.pred.svmlin.multi=predict (svmlin.multi , newdata =dat2.test)
table(ytest.pred.svmlin.multi,ytest)
##                        ytest
## ytest.pred.svmlin.multi 1 2 3 4
##                       1 3 0 0 0
##                       2 0 6 2 0
##                       3 0 0 4 0
##                       4 0 0 0 5
sum(ytest.pred.svmlin.multi!=ytest)/length(ytest)
## [1] 0.1