proQC Brief

Read PPP1 peptide matrix

getwd()

## [1] "/Users/gc5k/manuscript/Protein/2019Dec"

proFile="pppb_pep2.txt"
prot=read.table(proFile, as.is = T, header = T)
#rownames(prot)=prot[,1]
#prot=prot[,-c(1,2)]

PPP0=read.csv("PPPB1_lineup.csv", as.is = T, header = T)
PPP1=PPP0
PPP1$PPPA_ID=tolower(PPP1$PPPA_ID)

dim(PPP1)

## [1] 1566   12

dim(prot)

## [1] 16644  1566

Remove missing

#evaluate missing value
protT=t(prot)

misD=matrix(1, nrow(protT), 4)
for(i in 1:nrow(misD)) {
  naIdx=which(is.na(protT[i,]))
  if (length(naIdx) < ncol(protT)) {
    misD[i, 1] = length(naIdx)/ncol(protT)
    misD[i, 2] = min(protT[i,-naIdx])
    misD[i, 3] = median(protT[i, -naIdx])
    misD[i, 4] = mean(protT[i, -naIdx])
  }
}
blankInd = which(misD[,1]> 0.8)

prot=prot[,-blankInd]
protT=protT[-blankInd,]
PPP1=PPP1[-blankInd,]
print(paste0("Removed: ", length(blankInd), " samples"))

## [1] "Removed: 96 samples"

Density plot

######Slide 1: density plot
print("Slide 1: A-AQUA, C-CTRL, N-Normal, T-Tumor")

## [1] "Slide 1: A-AQUA, C-CTRL, N-Normal, T-Tumor"

proDensityPlot(protT, PPP1$Tissue)

Quality score

Quality score is defined as \[Q=\frac{\sum_{i=1}^\tilde{M}\tilde{s}_i(k-\frac{n_o}{n})^c}{\sum_{i=1}^Ms_i}\] in which \(\tilde{s}_i=[y_i-(E(y_i|n_{o_i})+T_{\alpha/M}\sigma_i)]^2\) if \(y_i > E(y_i|n_{o_i})\) or \(\tilde{s}_i=[y_i-(E(y_i|n_{o_i})-T_{\alpha/M}\sigma_i)]^2\) if \(y_i < E(y_i|n_{o_i})\). In contrast, \(s_i=[y_i-E(y_i|n_{o_i})]^2\).

\(T_{\alpha/M}\) is the z-score given the p-value cutoff \(\alpha/M\), and \(\alpha\) can be for example 0.01 or 0.05.

\(E(y_i|n_{o_i})\) can be predicted from the fitted spline.

\(\alpha\) determins the confidence interval of the spline. The smaller the \(\alpha\), the larger the interval and the smaller \(\tilde{M}\) till zero.

\(K\), between 0 and 1, indicates the direction of the penalty. When it is 1, the penalty is more weighted on the proteins with fewer observations, and nearly no penalty at all for proteins without missing values.

\(C\) is the penalty for the score.

In summary, the quality score is upon \(M\), \(\alpha\), \(K\), and \(C\).

## [1] "Slide 2: Qscore"

##               A          C          N          T
## [1,] 0.14320251 0.11069550 0.10148688 0.10206335
## [2,] 0.10885643 0.13331441 0.10983203 0.10756830
## [3,] 0.11976395 0.08213037 0.08023418 0.08110984
## [4,] 0.08900582 0.11677759 0.09476257 0.09175888
## [5,] 0.10121566 0.06375519 0.06496515 0.06546652
## [6,] 0.07045752 0.09840241 0.07949354 0.07611556

Pair-wise comparison for expression

GC

## [1] "Slide 3: GC control"

## [1] "Bartlett test"

PCA

## [1] "Slide 4: PCA"

##            [,1]      [,2]      [,3]
## [1,] -0.4108678 0.0844348 0.2532472
## [2,]  0.4551288 0.4578845 0.6650268
## [3,]  0.5130904 0.5709893 0.8342511
## [4,] -0.5057928 1.0116235 1.2674499

##            [,1]      [,2]      [,3]
## [1,]  0.4057713 0.4812725 0.6459228
## [2,]  0.2234960 1.2121509 1.2621014
## [3,]  0.1751561 0.9479307 0.9786104
## [4,] -0.2319347 0.9824451 1.0362388

G_2007.0659 example

Each patient has a biological sample for tumor and a sample for normal control. Each tomor/normal sample has been analyzed independently on a machine. The analysis for Patient G_2007.0659, who is taken as the representative of the layout of the experiment.

UniSamp=table(PPP1$USZ_ID) #get sample ids, and G_2007.0659 is the first sample
for(i in 1:1){
  idx=which(PPP1$USZ_ID == names(UniSamp)[i])
  CB=combn(length(idx), 2)
  mat=matrix(0, ncol(CB), 3)
  for(j in 1:ncol(CB)) {
    ln=which(!is.na(prot[,idx[CB[1,j]]]) & !is.na(prot[,idx[CB[2,j]]]))
    tg=paste0(PPP1[idx[CB[1,j]],]$Tissue, PPP1[idx[CB[2,j]],]$Tissue)
    mat[j, 1] = tg
    mat[j, 2] = length(ln)
    mat[j, 3] = names(UniSamp)[i]
  }
  nn=mean(as.numeric(mat[which(mat[,1]=="NN", 1), 2]))
  tt=mean(as.numeric(mat[which(mat[,1]=="TT", 1), 2]))
  nt=mean(as.numeric(mat[which(mat[,1]=="NT" | mat[,1] == "TN", 1), 2]))
  tn=mean(as.numeric(mat[which(mat[,1]=="NT" | mat[,1] == "TN", 1), 2]))

  if(i == 1) {
    TB = mat
  } else {
    TB = rbind(TB, mat)
  }

  layout(matrix(1:4, 2, 2))

  ID=matrix(c(1,2,3,4,1,4,2,3), 4,2, byrow = T)
  for(j in 1:4) {
    idxP=which(!is.na(prot[,idx[ID[j,1]]]) & !is.na(prot[,idx[ID[j,2]]]))
    plot(xlim=c(0, 25), ylim=c(0, 25), col="black", main=paste0(names(UniSamp)[i]," [", ID[j,1], ",", ID[j,2], "]"), xlab=paste0(PPP1$Tissue[idx[ID[j,1]]], ", ", PPP1$MS_ID[idx[ID[j,1]]], " [", length(which(!is.na(prot[idx[ID[j,1]]]))), "]"), ylab=paste0(PPP1$Tissue[idx[ID[j,2]]], ", ", PPP1$MS_ID[idx[ID[j,2]]], " [", length(which(!is.na(prot[idx[ID[j,2]]]))), "]"),
prot[idxP, idx[ID[j,1]]], prot[idxP, idx[ID[j,2]]],
pch=16, cex=0.5, bty='l')
    idxP_1=which(!is.na(prot[idx[ID[j,1]]]) & is.na(prot[idx[ID[j,2]]]))
    idxP_2=which(is.na(prot[ID[j,1]]) & !is.na(prot[idx[ID[j,2]]]))
    l1=length(which(!is.na(prot[idx[ID[j,1]]])))
    l2=length(which(!is.na(prot[idx[ID[j,2]]])))
    points(prot[idxP_1, idx[ID[j,1]]], rep(0, length(idxP_1)), col="grey", pch=3, cex=0.2)
    points(rep(0, length(idxP_2)), prot[idxP_2, idx[ID[j,2]]], col="grey", pch=3, cex=0.2)

    abline(a=0, b=1, col="red", lty=2)
    legend("topleft",
           legend = c(format(length(idxP)/sqrt(l1*l2), digits = 4), format(length(idxP)/min(l1,l2), digits = 4), format(digits=4, cor(prot[idxP, idx[ID[j,1]]], prot[idxP, idx[ID[j,2]]]))),
           pch=c(16, 16, 16), col=c("white", "white", "white"), bty = 'n')
    legend("bottomright",
           legend = c("Observed in both", "Observed in either"),
           pch=c(16, 16), col=c("black", "grey"), bty = 'n')
  }
}

Annotation for G2007.0659

The patient G2007.0659 has its normal sample and tumor sample scanned independently on proCan 1 and 4, respectively. Top left is the correlation for the peptides of the normal tissue observed in proCan 1 and 4, respectively. The number of commonly observed peptides and their expression correlation can be found at the top left in each plot. The bottom left is for the tumor tissues, and the top left is between one of the normal tissue and one of the tumor tissue. Similarly for the bottom right panel.

Variance: Tech vs Biology

UniSamp=table(PPP1$USZ_ID)
UniCor=matrix(-1, length(UniSamp), 6)

for(i in 1:length(UniSamp)) {
  idx=which(PPP1$USZ_ID == names(UniSamp)[i])
  CB=combn(length(idx), 2)
  mat=matrix(0, ncol(CB), 8)
  #mat[,7]=100
  for(j in 1:ncol(CB)) {
    ln=which(!is.na(prot[,idx[CB[1,j]]]) & !is.na(prot[,idx[CB[2,j]]]))
    ln1=which(!is.na(prot[,idx[CB[1,j]]]) & is.na(prot[,idx[CB[2,j]]]))
    ln2=which(is.na(prot[,idx[CB[1,j]]]) & !is.na(prot[,idx[CB[2,j]]]))
    if(length(ln) != 0 && length(ln1) !=0 && length(ln2) != 0) {
      tg=paste0(PPP1[idx[CB[1,j]],]$Tissue, PPP1[idx[CB[2,j]],]$Tissue)
      mat[j, 1] = tg
      mat[j, 2] = length(ln)
      mat[j, 3] = names(UniSamp)[i]
      mat[j, 4] = cor(prot[ln, idx[CB[1,j]]], prot[ln, idx[CB[2,j]]])
      mat[j, 5] = length(which(!is.na(prot[,idx[CB[1,j]]])))
      mat[j, 6] = length(which(!is.na(prot[,idx[CB[2,j]]])))
      pepDat=matrix(0, as.numeric(mat[j,5])+as.numeric(mat[j,6])-as.numeric(mat[j,2]), 2)
      pepDat[1:length(ln),1]=prot[ln,idx[CB[1,j]]]
      pepDat[1:length(ln),2]=prot[ln,idx[CB[2,j]]]
      pepDat[(length(ln)+1):as.numeric(mat[j,5]),1]=prot[ln1, idx[CB[1,j]]]
      pepDat[(as.numeric(mat[j,5])+1):nrow(pepDat),2]=prot[ln2, idx[CB[2,j]]]
      mat[,8] = cor(pepDat[,1], pepDat[,2])
    }
  }
  nn=mean(as.numeric(mat[which(mat[,1]=="NN", 1), 2]))
  tt=mean(as.numeric(mat[which(mat[,1]=="TT", 1), 2]))
  nt=mean(as.numeric(mat[which(mat[,1]=="NT" | mat[,1] == "TN", 1), 2]))
  tn=mean(as.numeric(mat[which(mat[,1]=="NT" | mat[,1] == "TN", 1), 2]))

  cnt=0
  if(length(which(mat[,1] == "NN") > 0)) {
    UniCor[i,1] = mean(as.numeric(mat[mat[,1]=="NN",4]))
    cnt=cnt+1
  }
  if(length(which(mat[,1] == "TT") > 0)) {
    UniCor[i,2] = mean(as.numeric(mat[mat[,1]=="TT",4]))
    cnt=cnt+1
  }
  if(length(which(mat[,1] == "NT" | mat[,1] == "TN") > 0)) {
    UniCor[i,3] = mean(as.numeric(mat[mat[,1]=="NT" || mat[,1] == "TN",4]))
    cnt=cnt+1
  }
  if(cnt==3) {
    z1=1-UniCor[i,1]
    z2=1-UniCor[i,2]
    z3=1-UniCor[i,3]
    UniCor[i,4]=z1/(z1+z2+z3)
    UniCor[i,5]=z2/(z1+z2+z3)
    UniCor[i,6]=z3/(z1+z2+z3)
  }
  if(i == 1) {
    TB = mat
  } else {
    TB = rbind(TB, mat)
  }
}
colnames(UniCor)=c("", "", "", "Var (Tech normal)", "Var (Tech tumor)", "Var (Biology)")
idxCor=which(UniCor[,4]>0 & UniCor[,5]>0 & UniCor[,6]>0)
boxplot(UniCor[idxCor,4:6], ylab="Proportion of variance", col=c("yellow", "red", "blue"))

Correlation at various levels

TSnames=c("AQUA", "CTRL", "Normal", "Tumor")
Aidx=which(PPP1$Tissue == "A")
Cidx=which(PPP1$Tissue == "C")
Nidx=which(PPP1$Tissue == "N")
Tidx=which(PPP1$Tissue == "T")
IDX=c(length(Aidx), length(Cidx), length(Nidx), length(Tidx))

TSidx=list(Aidx)
TSidx[2]=list(Cidx)
TSidx[3]=list(Nidx)
TSidx[4]=list(Tidx)

####AQ
PPP1_A=PPP1[TSidx[[1]],]
exDat_A=protT[TSidx[[1]],]

batchName=names(table(PPP1_A$Batch))
AQcor=matrix(0, length(batchName), 6)
cnt=0
for(i in 1:length(batchName)) {
  if(length(which(PPP1_A$Batch == i))!=2) next 
  cnt=cnt+1
  ex2=exDat_A[which(PPP1_A$Batch==i),]
  n1=length(which(!is.na(ex2[1,])))
  n2=length(which(!is.na(ex2[2,])))
  no=length(which(!is.na(ex2[1,]) & !is.na(ex2[2,])))
  ct=cor.test(ex2[1,], ex2[2,])
  AQcor[cnt,1:6]=c(ct$estimate, n1, n2, no, no/sqrt(n1*n2), no/min(n1,n2))
}

####CTRL
PPP1_CL=PPP1[TSidx[[2]],]
exDat_CL=protT[TSidx[[2]],]

batchNameCL=names(table(PPP1_CL$Batch))
CLcor=matrix(0, length(batchNameCL), 6)
CLcnt=0
for(i in 1:length(batchNameCL)) {
  if(length(which(PPP1_CL$Batch == i))!=2) next 
  CLcnt=CLcnt+1
  ex2=exDat_CL[which(PPP1_CL$Batch==i),]
  n1=length(which(!is.na(ex2[1,])))
  n2=length(which(!is.na(ex2[2,])))
  no=length(which(!is.na(ex2[1,]) & !is.na(ex2[2,])))
  ct=cor.test(ex2[1,], ex2[2,])
  CLcor[CLcnt,1:6]=c(ct$estimate, n1, n2, no, no/sqrt(n1*n2), no/min(n1,n2))
}

layout(matrix(1:6, 3, 2))

hist(main="AQUA", AQcor[1:cnt,5], xlim=c(0, 1), xlab="Correlation (Overlapping)")
legend("topleft", legend = format(mean(AQcor[1:cnt,5]), digits = 3), bty='n')
hist(main="AQUA", AQcor[1:cnt,6], xlim=c(0, 1), xlab="Correlation (Normalized)")
legend("topleft", legend = format(mean(AQcor[1:cnt,6]), digits = 3), bty='n')
hist(main="AQUA", AQcor[1:cnt,1], xlim=c(0, 1), xlab="Correlation expression")
legend("topleft", legend = format(mean(AQcor[1:cnt,1]), digits = 3), bty='n')

hist(main="CTRL", CLcor[1:CLcnt,5], xlim=c(0, 1), xlab="Correlation (Overlapping)")
legend("topleft", legend = format(mean(CLcor[1:CLcnt,5]), digits = 3), bty='n')

hist(main="CTRL",
     CLcor[1:CLcnt,6], xlim=c(0, 1), xlab="Correlation (Normalized)")
legend("topleft", legend = format(mean(CLcor[1:CLcnt,6]), digits = 3), bty='n')

hist(main="CTRL",
     CLcor[1:CLcnt,1], xlim=c(0, 1), xlab="Correlation expression")
legend("topleft", legend = format(mean(CLcor[1:CLcnt,1]), digits = 3), bty='n')

layout(matrix(1:12, 3, 4))

TAG=c("NN", "NT", "TN", "TT")
for(i in 1:length(TAG)) {
  idxTT=which(TB[,1]==TAG[i] & as.numeric(TB[,2])>0)
  hist(main=paste0(TAG[i], " [", length(idxTT),"]"), breaks = 40, ylim=c(0, 120), xlab="Correlation (Overlapping)", xlim =c(0.3, 1), as.numeric(TB[idxTT,2])/sqrt(as.numeric(TB[idxTT,5])*as.numeric(TB[idxTT,6])))
  legend("topleft", legend = format(mean(as.numeric(TB[idxTT,2])/sqrt(as.numeric(TB[idxTT,5])*as.numeric(TB[idxTT,6]))), digits = 4), bty='n')

  DidxTT=apply(matrix(c(as.numeric(TB[idxTT, 5]), as.numeric(TB[idxTT,6])), length(idxTT),2), 1, min)
  hist(main=paste(TAG[i]), breaks = 40, ylim=c(0, 120), xlab="Correlation (Normalized)", xlim =c(0.3, 1), as.numeric(TB[idxTT,2])/DidxTT, col="red")
  legend("topleft", legend = format(mean(as.numeric(TB[idxTT,2])/DidxTT), digits = 4), bty='n')

  hist(main=paste(TAG[i]), breaks = 40, ylim=c(0, 120), xlab="Correlation (Pearson)", xlim =c(0.3, 1), as.numeric(TB[idxTT,4]), col="green")
  legend("topleft", legend = format(mean(as.numeric(TB[idxTT,4])), digits = 4), bty='n')
}

Proportion of variance

Each column represents one of four possible sample pairs.

Top row: \(cor_1=\frac{n_{o}}{\sqrt{n_1n_2}}\) without normalization.
Middle row: \(cor_2=\frac{n_{o}}{min(n_1,n_2)}\) normalized correlation.
Bottom row: \(cor_3=cor(exp_1,exp_2)\) for \(n_o\) peptides.

Variance analysis

There are three kinds of pairs:

\(r_{N,N}=cor(N_1, N_2)\), correlation between a pair of normal samples, indicating technical variation.
\(r_{T,T}=cor(T_1, T_2)\), correlation between a pair of tumor sample, indicating technical variation.
\(r_{N,T}=r_{T,N}=cor(N, T)=cor(T,N)\), correlation between a normal sample and a tumor sample, indicating biological variation.

Then the proportion of total variance can be partitioned into

\(V(Tech N)=\frac{r_{N_1,N_2}}{r_{N_1,N_2}+r_{T_1,T_2}+r_{N,T}}\)
\(V(Tech T)=\frac{r_{T_1,T_2}}{r_{N_1,N_2}+r_{T_1,T_2}+r_{N,T}}\)
\(V(Tech Bio)=\frac{r_{N,T}}{r_{N_1,N_2}+r_{T_1,T_2}+r_{N,T}}\)