Data pre-processing

First, load packages.

library(genefilter)
library(MASS)
library(qvalue)
library(scatterplot3d)
library(expm)
library(e1071)
library(AUC)

Let’s load the training and test data:

trainData = read.csv("train.csv", header=TRUE)
dimensions = dim(trainData)

The dimensions of the data are 76020 x 371.

For convenience, let’s separate the training data into an \(m \times n\) design matrix and an \(m \times 1\) label vector, where \(m\) is the number of samples and \(n\) is the number of features.

labels = trainData$TARGET
train = trainData[, !(names(trainData) %in% "TARGET")]
rm(trainData)

Let’s see how many classes there are and what their labels are:

n = length(unique(labels))
uniqueLabels = unique(labels)

Thus, there are 2 classes.

Let us look at the distribution of classes in our data. In particular, we want to see whether the composition of labels in the training data is balanced or imbalanced; if the classes turn out to be heavily imbalanced, accuracy would be a misleading metric, which is one reason we evaluate with AUC later.

p1 = round(sum(labels == uniqueLabels[1])/length(labels), 4)*100
p2 = round(sum(labels == uniqueLabels[2])/length(labels), 4)*100
col = c("grey50", "grey90")
ps = c(p1, p2)
pielabels = paste(ps, "%", sep="")
pie(ps, pielabels, main="Composition of samples", col=col)
legend("topright", c("0 class", "1 class"), fill=col)

Next, let us perform feature selection by removing features that cannot be used to sufficiently distinguish between the classes. Concretely, I will perform a t-test for each feature to see whether the two classes differ significantly, as evaluated by the p-value or q-value. Because a total of 370 features are being tested, this is a multiple hypothesis testing scenario. The Bonferroni correction is too conservative (too many false negatives), so we will instead correct for multiple testing by controlling the false discovery rate (FDR). Converting p-values to q-values addresses this: a feature’s q-value is the smallest FDR at which that feature would be called significant. Selecting features with q-value < 0.05 therefore keeps the expected proportion of false positives among the selected features below 5%.

groups = factor(labels)
# rowttests expects features in rows, so transpose the design matrix
trainTranspose = t(train)
pvals = rowttests(trainTranspose, groups)$p.val
hist(pvals, breaks = 20, ylab="Count")

qvals = p.adjust(pvals, method = "fdr")
hist(qvals, breaks = 20, ylab="Count")
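
To make the Bonferroni-versus-FDR comparison concrete, here is a quick check, purely for illustration, of how many features each correction would keep at the 0.05 level using the p-values computed above:

# Compare how many features survive each multiple-testing correction at 0.05
bonf = p.adjust(pvals, method = "bonferroni")   # controls the family-wise error rate
sum(bonf <= 0.05, na.rm = TRUE)
sum(qvals <= 0.05, na.rm = TRUE)                # FDR control (Benjamini-Hochberg)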

By choosing the false discovery rate threshold \(\alpha\), we control how many features we retain. Let us visualize how many features we retain for several values of \(\alpha\). Note that some p-values are NaN because the corresponding feature has the same value for every sample, so the t-statistic is undefined.

alphas = c(0.01, 0.05, 0.1, 0.15, 0.2)
retain = vector(length=length(alphas))
for (i in seq(length(alphas))) {
  a = alphas[i]
  retain[i] = sum(qvals <= a, na.rm = TRUE)
}
names(retain) = alphas
barplot(retain, xlab = "FDR", ylab = "Features Retained", main="Features retained vs Alphas")

perRetained = retain[2]/(dimensions[2]-1)
names(perRetained) = NULL

The number of features retained seems to level off after FDR = 0.05, so this is the threshold chosen to (1) keep the false discovery rate low while (2) retaining as many features as possible. With \(\alpha = 0.05\), we retain about 35.7% of the features (132 of 370).

Exploratory Data Analysis

Of our original design matrix, we will pick features that achieve a false discovery rate of less than or equal to 0.05. The reason is to remove features that are not informative enough to distinguish one class from another, while achieving some degree of dimensionality reduction.

idx = which(qvals <= 0.05)
trainR = train[, idx]
dim(trainR)
## [1] 76020   132

Let us observe the variance explained by each principal component, up to the 30th one.

pca = prcomp(trainR, center = TRUE, scale = TRUE)
vars = (pca$sdev)^2
varExplain = vars*100/sum(vars)
names(varExplain) = seq(length(varExplain))
barplot(varExplain[1:30], xlab = "Principal components", ylab = "Variance Explained (%)")
abline(v=6, col="red")  # vertical reference line at x = 6

Now let us visualize the data in the first three principal components.

cols = c(rgb(0, 0, 1, 0.4), rgb(1, 0, 0, 0.4))
colors = cols[labels + 1]
scatterplot3d(pca$x[,1], pca$x[,2], pca$x[,3], color = colors, pch = 19, cex.symbols = 0.5,
  xlab = paste("PC1:", round(varExplain[1], 2), "% variance explained"),
  ylab = paste("PC2:", round(varExplain[2], 2), "% variance explained"),
  zlab = paste("PC3:", round(varExplain[3], 2), "% variance explained"),
  main = paste0("PCA with ", dim(trainR)[2], " features"))
legend("topleft", c("Satisfied", "Unsatisfied"), fill=cols)

Training

It looks like the data are not linearly separable, so a kernel may be necessary to project the data into a higher-dimensional space where the classes are separable. Let us first prepare the data for SVM.

combined = cbind(trainR, labels)
save(combined, file="R_Object/trainPreproc.R")
if (file.exists("R_Object/tuneResult.R")) {
  load("R_Object/tuneResult.R")
}
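
As an aside on the kernel trick mentioned above: the polynomial kernel we will use from e1071 computes \(K(u, v) = (\gamma \cdot u^{\top}v + \mathrm{coef0})^{\mathrm{degree}}\) (see ?svm, where coef0 defaults to 0). A quick sanity check of that formula on two toy vectors, purely for illustration:

# Polynomial kernel as defined by e1071: (gamma * <u, v> + coef0)^degree
polyKernel = function(u, v, gamma, degree, coef0 = 0) {
  (gamma * sum(u * v) + coef0)^degree
}
polyKernel(c(1, 2, 3), c(0, 1, -1), gamma = 0.9, degree = 3)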

At this point, one of the team members provided another pre-processed dataset in which PCA dimensionality reduction had already been applied, keeping 100 principal components. We will use this pre-processed dataset for training because it gave better performance (a hypothetical sketch of how such a dataset could be produced appears after the tuning functions). Below are the functions written to perform tuning via cross-validation:

tuneSVM = function(data, c, degree, fold) {
  # Grid search over cost (c) and degree; returns the best pair and all CV AUCs
  bestArea = 0
  n = length(c) * length(degree)
  areas = vector(length=n)
  counter = 1
  for (i in seq(length(c))) {
    for (j in seq(length(degree))) {
      parameters = c(c[i], degree[j])
      area = CV(data, parameters, fold)
      areas[counter] = area
      if (area > bestArea) {
        bestParameters = parameters
        bestArea = area
      }
      print(paste(counter, area))   # progress: grid cell index and its CV AUC
      counter = counter + 1
    }
  }
  return(list(parameters=bestParameters, areas=areas))
}
CV = function(data, parameters, fold) {
  # k-fold cross-validation: returns the mean AUC across folds
  idx = sample(seq(nrow(data)))
  data = data[idx, ]             # shuffle rows before splitting into folds
  n = floor(nrow(data) / fold)   # fold size
  areas = vector(length=fold)
  for (i in seq(fold)) {
    tes1 = (i-1)*n + 1
    tes2 = i*n
    testidx = tes1:tes2
    trainidx = setdiff(seq(nrow(data)), testidx)
    test = data[testidx, ]
    train = data[trainidx, ]
    model = svm(TARGET ~., data = train, scale = FALSE, probability = TRUE, 
      kernel = "polynomial", cost = parameters[1], degree = parameters[2], gamma = 0.9)
    pred = predict(model, test)
    areas[i] = auc(roc(pred, factor(test[, "TARGET"])))
  }
  return(mean(areas))
}
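
For context, a 100-component dataset like the one used below could be produced along roughly these lines. This is a hypothetical sketch, not the teammate’s actual preprocessing, and it assumes the principal components are computed from the raw training features with the TARGET column reattached:

# Hypothetical reconstruction: 100 principal components plus the TARGET column
pcaFull = prcomp(train, center = TRUE, scale = FALSE)
pcaTrain = cbind(pcaFull$x[, 1:100], TARGET = labels)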

In our experience, the polynomial kernel performs best among the kernels we tried, so we will tune using the polynomial kernel.

combined = as.matrix(read.csv("florian_pca_train_full.csv"))
idx = sample(seq(nrow(combined)), 20000, replace = FALSE)
sampled = combined[idx, ]
result = tuneSVM(sampled, seq(6, 16, 2), seq(3, 5), 3)
save(result, file="R_Object/tuneResult.R")

After tuning on 20000 samples, the best parameters appear to be C = 14 and degree = 3, with a cross-validation AUC of 0.6542631. Now we can proceed with training, using the train and test indices set by the team.

trainIdx = read.csv("ensemble_train_idx.csv")$x
Idx = seq(nrow(combined))
testIdx = setdiff(Idx, trainIdx)
train = combined[trainIdx, ]
test = combined[testIdx, ]
model = svm(TARGET ~., data = train, scale = FALSE, probability = TRUE, 
    kernel = "polynomial", cost = 14, degree = 3, gamma = 0.9)
save(model, file="R_Object/model.R")
load("R_Object/model.R")

Prediction

Now predict on the test set using the model and calculate the AUC to evaluate performance.

pred = predict(model, test)
predictions = data.frame(index = testIdx, predictions = pred)
write.csv(predictions, file = "predictions.csv", row.names=FALSE)
area = auc(roc(pred, factor(test[, "TARGET"])))

It turns out that the AUC on the test set is around 0.65, which is not great performance. One likely reason is that we have not found the optimal parameters for the SVM, and finding them is difficult because tuning is time-consuming. To cut down on time, we performed tuning on only 20000 samples, about 29% of the entire data set. The SVM is also slow to train because its runtime appears to grow as \(\Theta(N^{2})\) (see SantanderSVM.ipynb).
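
To illustrate the scaling claim, here is a minimal timing sketch (assuming the PCA-reduced combined matrix is still in memory) that trains the SVM on increasing subsample sizes; if the elapsed time grows roughly fourfold each time the sample size doubles, that is consistent with \(\Theta(N^{2})\) growth:

# Time svm() training on increasing subsample sizes to eyeball the growth rate
sizes = c(2500, 5000, 10000, 20000)
times = sapply(sizes, function(n) {
  idx = sample(seq(nrow(combined)), n)
  sub = combined[idx, ]
  system.time(svm(TARGET ~ ., data = sub, scale = FALSE, kernel = "polynomial",
    cost = 14, degree = 3, gamma = 0.9))["elapsed"]
})
names(times) = sizes
times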