Contents

1 Background

One of the reviewer’s comment is :
The example of classifying colorectal cancer samples using pairs of RAVs the proposed method identifies RAVs that are most associated with CMS subtypes. Given the total number of 4764 RAVs there are more than 10 million possible combinations. Given the number of possible feature combinations, it doesn’t seem too surprising that the RAVs perform as well as CRC. The same approach would most likely work just as well on the gene level selecting two genes and as features.

To respond this comment, we analyze how likely we will get PCSS-like RAV by simulating a random RAV and comparing the similarity of it with PCSSs.

2 Load data

2.1 RAVmodel

dir <- "~/data2/GenomicSuperSignatureLibrary/refinebioRseq/RAVmodel_536"
RAVmodel <- readRDS(file.path(dir, "refinebioRseq_RAVmodel_C2_20220115.rds"))

We calculate the mean and standard deviation of each RAV in our model.

ravindex <- RAVindex(RAVmodel)
ravDistribution <- as.data.frame(matrix(nrow = ncol(RAVmodel), ncol = 2))
colnames(ravDistribution) <- c("mean", "sd")
rownames(ravDistribution) <- paste0("RAV", seq_len(ncol(RAVmodel)))

for (i in seq_len(ncol(RAVmodel))) {
    m <- mean(ravindex[,i])
    sd <- sd(ravindex[,i])
    ravDistribution[i, "mean"] <- m
    ravDistribution[i, "sd"] <- sd
}

RAVs show a normal distribution with the mean ~ 0 and the standard deviation ~ 0.0065.

summary(ravDistribution$mean)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -7.762e-03 -1.264e-04 -1.747e-06  6.430e-07  1.249e-04  7.729e-03
summary(ravDistribution$sd)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002191 0.005979 0.006097 0.006524 0.008468 0.008472

hist(RAVindex(RAVmodel)[,100]) # random example RAV

2.2 PCSSs

avgLoading <- read.csv("~/data2/GenomicSuperSignaturePaper/Results/CRC/data/avg_loadings.csv")
rownames(avgLoading) <- avgLoading[,1]
avgLoading <- avgLoading[,-1]

3 PCSSs vs. RAVs

## Common genes between RAVmodel and PCSSs
cg <- intersect(rownames(RAVmodel), rownames(avgLoading)) # 5,344 common genes

## Calculate Pearson coefficients between PCSS1-4 and all 4,764 RAVs
loading_cor <- abs(stats::cor(avgLoading[cg,], RAVindex(RAVmodel)[cg,], 
                              use="pairwise.complete.obs", method="pearson"))

## Maximum Pearson coefficients between PCSS1/2 and all 4,764 RAVs
max1 <- which.max(loading_cor[1,])  # max. correlation with PCSS1
max2 <- which.max(loading_cor[2,])  # max. correlation with PCSS2

## Pearson coefficients between PCSS1 and all 4,764 RAVs
hist(loading_cor[1,], 
     xlab = "Pearson Correlation Coefficient",
     main = "PCSS1 vs. 4,764 RAVs")

RAV1575 is most similar to PCSS1 (Pearson coefficient = 0.59) and RAV834 is most similar to PCSS2 (Pearson coefficient = -0.56).

ravs <- RAVindex(RAVmodel)[cg, c(max1, max2)] # two most similar RAVs
pcss <- avgLoading[cg, 1:2] # PCSS1/2

stats::cor(pcss, ravs, use = "pairwise.complete.obs", method = "pearson")
##       RAV1575     RAV834
## cl1 0.5894306  0.1592830
## cl2 0.1937131 -0.5624299

4 PCSSs vs. simulated RAV

We simulate RAVs that follow the distribution of our RAVs: mean = 0 and standard deviation ~ 0.0065. We create 4,764 of simulated RAVs and calculate the absolute Pearson coefficient between them and PCSS1. The maximum value from this simulation-comparison is 0.056, which is ~10 fold lower than our RAVs.

set.seed(1)
n <- 4764
res_all <- vector(length = n)

for (i in seq_len(n)) {
    simulateRAVs <- rnorm(length(cg), mean = 0, sd = mean(ravDistribution$sd))
    res <- abs(stats::cor(avgLoading[cg, 1], simulateRAVs, method = "pearson"))
    res_all[i] <- res
}

max(res_all)
## [1] 0.05600147

## Pearson coefficients between PCSS1 and 4,764 simulated RAVs
hist(res_all, 
     xlab = "Pearson Correlation Coefficient",
     main = "PCSS1 vs. 4,764 simulated RAVs")