Since we expect analysis methods to have highly differing depth coverages and thus differing ranges in relative LFQ values, we attempt to correct by z-scoring per approach, to reveal that all methods capture important biological differences.
This likely did not work since this protein subset could be an artifact of the filtering based on method (scDVP, cnDVP, DVP) coverage and not of acute biological relevance
# load transformed, filtered and imputed matrix of all measurementsprotein_data =read.delim("/Users/dvaldes/Downloads/20251219_2134_DV_adatadata.txt", header =TRUE, sep ="\t")metadata_df =read.delim("/Users/dvaldes/Downloads/20251219_2134_DV_adatametadata.txt", header =TRUE, sep ="\t")counts =t(as.matrix(protein_data))colnamies =as.character(unlist(counts[1, ]))counts = counts[-1, , drop =FALSE]rownamies =rownames(counts)counts =apply(counts, 2, as.numeric)rownames(counts) = rownamiescolnames(counts) = colnamiescat("Number of proteins in matrix:", (nrow((counts))))
Number of proteins in matrix: 84
# make sce for easier downstream analyses sce =SingleCellExperiment(assays =list(counts = counts), colData = metadata_df)# Separate per approach and do a z-scoring that accounds for method-specific variations in protein abundancessce_approach1 = sce[, sce$approach =="scDVP"]counts_approach1 =as.matrix(counts(sce_approach1))colnames1 =colnames(counts_approach1)rownames1 =rownames(counts_approach1)counts_approach1 =apply(counts_approach1, 2, as.numeric)colnames(counts_approach1) = colnames1rownames(counts_approach1) = rownames1z_approach1 =t(scale(t(counts_approach1)))sce_approach2 = sce[, sce$approach =="cnDVP"]counts_approach2 =as.matrix(counts(sce_approach2))colnames2 =colnames(counts_approach2)rownames2 =rownames(counts_approach2)counts_approach2 =apply(counts_approach2, 2, as.numeric)colnames(counts_approach2) = colnames2rownames(counts_approach2) = rownames2z_approach2 =t(scale(t(counts_approach2)))sce_approach3 = sce[, sce$approach =="DVP"]counts_approach3 =as.matrix(counts(sce_approach3))colnames3 =colnames(counts_approach3)counts_approach3 = counts_approach3rownames3 =rownames(counts_approach3)counts_approach3 =apply(counts_approach3, 2, as.numeric)colnames(counts_approach3) = colnames3rownames(counts_approach3) = rownames3z_approach3 =t(scale(t(counts_approach3)))# Join corrected_mat =cbind(z_approach1, z_approach2, z_approach3)
In the original paper data as well as this new dataset, PC2 appears to separate between the biological replicates (primary vs relapse, i.e. 991 vs 992).
The PCA shows this approach also corrects for biological replicate in this subset of proteins, supporting the notion that this protein subset is more an artifact of the filtering based on coverage and not of acute biological relevance (i.e. important biology was lost with this filtering).
This is also supperted within this protein subset by heatmap and cluster analysis of zscored proteins and also when subjected to COMBAT batch correction.
# do combat correctionbatch =as.factor(metadata_df$approach) model =model.matrix(~ sample_id, data = metadata_df) # to preserve biologyexpr_bc =ComBat(dat = counts, batch = batch, mod = model)