# Load Libraries
library(pls)      # plsr(), RMSEP()
library(plsdepot) # alternative PLS implementation (loaded but not used below)
library(dplyr)    # select(), arrange()
library(caret)    # train(), predict() for the GLM classifier

Objective

The objective of this assignment is to use the Partial Least Squares (PLS) algorithm to extract the components of our dataset and to build a classifier that predicts whether a patient has the ALL or the AML type of leukemia.

Treating the datasets

  1. Read the data files data_set_ALL_AML_train.csv and data_set_ALL_AML_independent.csv. Enter the response manually into R from the table_ALL_AML_predic.doc document (the response corresponds to the “actual” column of that Word document).

  2. Form the data matrices X and Xt, containing the gene expression for the 38 training samples and the 34 test samples. Be aware that the data are presented in transposed form (samples as columns) and that the samples are not in order. Only the numeric information is relevant to the problem.

# Reading training: keep only the numeric sample columns (read.csv prefixes
# their names with "X") and transpose so that samples become rows
train <- read.csv("Data/data_set_ALL_AML_train.csv", sep = ";")
train <- select(train, contains("X"))
train <- t(train)

# Same treatment for the independent (test) samples
test <- read.csv("Data/data_set_ALL_AML_independent.csv", sep = ";")
test <- select(test, contains("X"))
test <- t(test)

# Deal with ordering: inspect the sample ids (they are not sequential)
rownames(train)
rownames(test)

# Prepend a numeric index obtained by stripping the leading "X" from the sample names
train_index <- as.data.frame(cbind("index" = as.numeric(substring(rownames(train), 2)), train))
test_index  <- as.data.frame(cbind("index" = as.numeric(substring(rownames(test), 2)), test))

# Order the samples by index, then drop the helper column
training <- train_index %>% arrange(index)
training <- training[, -1]
testing <- test_index %>% arrange(index)
testing <- testing[, -1]

# Response variables
# Recoding the labels for the plsr() model: ALL = 0, AML = 1
Y_train <- c(rep(0, 27), rep(1, 11))  # the first 27 training samples are ALL, the last 11 AML
# Test labels transcribed from table_ALL_AML_predic.doc into a text file
Y_test <- as.numeric(read.table("Data/testinglabels.txt")[, 2] == "AML")

# Centering but not scaling the data: per-gene variance may carry information.
# The test set is centered with the means of the *training* data.
scale_training <- scale(training, scale = FALSE)
scale_testing  <- scale(testing, center = colMeans(training), scale = FALSE)

# Training and testing datasets with the response prepended
training <- as.data.frame(cbind(Y_train, scale_training))
testing  <- as.data.frame(cbind(Y_test, scale_testing))
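
As a quick optional sanity check (not part of the assignment), we can confirm that the training columns are now centered at zero:

# The column means of the centered training matrix should all be ~0
summary(as.vector(colMeans(scale_training)))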

PLS1 and Selecting Components

  1. Perform the PLS1 regression of the training data. Select the number of PLS1 components.
# PLS1 regression with leave-one-out cross-validation over all components
pls1 <- plsr(Y_train ~ ., data = training, validation = "LOO")
plot(RMSEP(pls1), main = 'RMSE vs Number of Components', 
     xlab = 'Number of Components', ylab = 'Root Mean Squared Error')

# The RMSEP curve flattens early, so we can clearly use fewer components

# Refit with at most 10 components for a closer look at the elbow
pls2 <- plsr(Y_train ~ ., ncomp = 10, data = training, validation = "LOO")
plot(RMSEP(pls2), main = 'RMSE vs Number of Components',
     xlab = 'Number of Components', ylab = 'Root Mean Squared Error')
abline(v = 4, col = "blue")  # candidate cut-off

Four components appear to be enough for our model: the cross-validated RMSEP has essentially reached its minimum by that point. The rest of the analysis is therefore computed with only the first 4 components.
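To corroborate this visual choice programmatically, the pls package provides the selectNcomp() helper (an optional sketch, not part of the original analysis; its heuristic may suggest a slightly different number than our visual pick of 4):

# One-sigma heuristic applied to the cross-validated RMSEP curve
selectNcomp(pls2, method = "onesigma", plot = TRUE)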

Test Data Projection

  1. Project the test data as supplementary individuals onto the selected PLS1 components (be aware of centering the test data with respect to the mean of the training data).
# Projection of the testing set onto the first 4 components (drop the label column)
test_projection <- as.matrix(testing[, -1]) %*% pls2$projection[, 1:4]

We project the test data so that we can plot its first two components alongside the training scores for visualization purposes.

Plotting the components

  1. Perform a joint plot of the train and test individuals in the plane of the first two PLS1 components, distinguishing the individuals with ALL from those with AML.
# The scores matrix contains the training coordinates on each component;
# bind the first two to the response for plotting
Components <- cbind(pls2$scores[, 1:2], training[, 1])

# Plot of the Components, colored by each Factor
plot(Components[,1:2], col = as.factor(Components[,3]), main="PLS 1st vs 2nd Component")

# Add the projections of the test dataset colored by type of Leukemia
points(test_projection[,1:2], pch=19, col=as.factor(testing[,1]))

legend("topleft",c("ALL","AML","ALL_test","AML_test"),
       pch=c(1,1,19,19),
       col=c("black","red","black","red"))

In the plane of the first two components, the ALL and AML samples appear largely separated, although the test points overlap somewhat more than the training points do. We must also keep in mind that this plot omits components 3 and 4, so it does not display all of the variance the model actually uses.
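
To quantify that remark, the pls package's explvar() reports the percentage of predictor variance captured by each component (a small optional check):

# Per-component and cumulative percentage of X variance for the 4 kept components
explvar(pls2)[1:4]
cumsum(explvar(pls2)[1:4])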

Logistic regression model to predict the response in the training data

Using the selected PLS1 components as input, we proceed to build a GLM classifier. First we assemble the training and testing sets from the component scores and the projected test components, for easier handling of the data.

# Setting up data frames with the components and the response
trainingset <- as.data.frame(cbind(Y = training[, 1], pls2$scores[, 1:4]))
names(trainingset) <- c("Y", paste0("X", 1:(ncol(trainingset) - 1)))

testingset <- as.data.frame(cbind(Y = testing[, 1], test_projection))
names(testingset) <- c("Y", paste0("X", 1:(ncol(testingset) - 1)))

# Building the model: logistic regression on the 4 PLS components
glmfit <- train(as.factor(Y) ~ ., data = trainingset, method = "glm", family = "binomial")
PredictGLM <- predict(glmfit)
cm <- table(training[, 1], PredictGLM)
(accuracy <- sum(diag(cm)) / sum(cm))
[1] 1

The training accuracy is 100%: every type of leukemia in the training set is predicted correctly. This leads us to suspect that we are most probably overfitting. To corroborate this, we can check how the model performs on the testing dataset.
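Before touching the test set, another way to probe for overfitting is leave-one-out cross-validation of the classifier on the training components (an optional sketch using caret's LOOCV resampling; glmfit_loo is a name introduced here). Note the estimate is still optimistic, since the PLS scores themselves were computed from all 38 training samples:

# LOOCV accuracy of the logistic regression on the 4 components
ctrl <- trainControl(method = "LOOCV")
glmfit_loo <- train(as.factor(Y) ~ ., data = trainingset,
                    method = "glm", family = "binomial", trControl = ctrl)
glmfit_loo$results$Accuracy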

Predict the probability of AML leukemia in the test sample.

# Predicting on the test data
# GLM: type = 'raw' returns the predicted class labels
PredictGLM <- predict(glmfit, newdata = testingset, type = 'raw')
cm <- table(testing[, 1], PredictGLM)
(accuracy <- sum(diag(cm)) / sum(cm))
[1] 0.9705882

The accuracy is 97.06%: out of 34 observations we have misclassified only one, assigning AML to a patient who actually had ALL.
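
To identify which sample was misclassified, we can compare the predicted and true labels directly (a small optional check; predict() returns a factor, so we convert it back to 0/1 first):

# Index of the misclassified test sample
which(as.numeric(as.character(PredictGLM)) != testingset$Y)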

Conclusions

It is remarkable how well the Partial Least Squares algorithm performs here, reducing our dataset from \(7129\) variables to only \(4\) components that retain most of the variability in the data. It would nevertheless be interesting to try it on more observations and see how well the approach generalizes.