Data Processing Human PDZ Predictions

This document describes the data processing of the human PDZ domain predictions, covering both the sequence-based and the structure-based predictions.

1- Load all human domains

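# Read the list of human PDZ domain names (one per row, no header line).
pdz_domains <- read.csv('~/Documents/Work/PDZ/POWCLUI/domains/HumanPOWNames.txt', header=FALSE)
# Drop duplicate names.
pdz_domains <- unique(pdz_domains$V1)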
## [1] "Number of PDZ domains is 218"

2- Get all sequence-based prediction results files

pdzpath <- '~/Documents/Work/PDZ/Human-seq/Seq/'
files <- list.files(pdzpath, full.names=TRUE)
## [1] "Number of sequence-based prediction  files is 11"

3- Read all sequence-based predicted PDZ PPIs

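# Read every prediction file and stack the rows into one data frame.
pdz_seq <- data.frame()
# seq_along() avoids iterating over c(1, 0) when no files were found.
for (i in seq_along(files)) {
      d <- read.table(files[i], fill=TRUE, header=FALSE)
      pdz_seq <- rbind(pdz_seq, d)
}
# Keep only rows whose domain is in the human PDZ list, then the first four columns.
pdz_seq <- pdz_seq[pdz_seq[[1]] %in% pdz_domains, ]
pdz_seq <- pdz_seq[, 1:4]
names(pdz_seq) <- c("DomainName", "BindingSiteSequence", "PeptideSequence", "DecisionScore")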
## [1] "Number of sequence-based predicted PPIs is 11281248"
## [1] "Number of domains with predicted PPIs is 198"
DomainName BindingSiteSequence PeptideSequence DecisionScore
AHNAK2-1 ASGYSVTGKQLSYELE YCIGG -1.4038920541288675
AHNAK2-1 ASGYSVTGKQLSYELE PNAEP -0.9744675488251316
AHNAK2-1 ASGYSVTGKQLSYELE KHEYE -1.0880877951821866
AHNAK2-1 ASGYSVTGKQLSYELE EVSYA -0.869097214900627
AHNAK2-1 ASGYSVTGKQLSYELE PASAS -1.0076030891653431
AHNAK2-1 ASGYSVTGKQLSYELE TFQRA -1.6691679310600334
AHNAK2-1 ASGYSVTGKQLSYELE QEDDE -0.89593758624212
AHNAK2-1 ASGYSVTGKQLSYELE TPVKN -1.2779236501085798
AHNAK2-1 ASGYSVTGKQLSYELE QNQNL -1.035166200227473
AHNAK2-1 ASGYSVTGKQLSYELE KSSIC -0.7628681618461656

Preview of the first 10 records of the sequence-based predicted PDZ PPIs.
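
The read, stack, and filter pattern used in step 3 can also be written without growing a data frame inside a loop. The following is only a sketch under stated assumptions: the helper name `read_predictions`, the temporary files, and the toy rows (including the `UNKNOWN-1` domain) are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: read every prediction file in `paths`, stack the rows,
# and keep only rows whose first column is a known domain name.
read_predictions <- function(paths, known_domains) {
  # Nothing to read: return an empty data frame instead of failing below.
  if (length(paths) == 0) return(data.frame())
  tables <- lapply(paths, read.table, fill = TRUE, header = FALSE,
                   stringsAsFactors = FALSE)
  combined <- do.call(rbind, tables)
  combined[combined[[1]] %in% known_domains, , drop = FALSE]
}

# Toy input written to temporary files (illustrative data only).
tmp_dir <- tempfile("pdz_demo_")
dir.create(tmp_dir)
writeLines(c("AHNAK2-1 YCIGG -1.40", "DLG1-3 FYKAI -1.33"),
           file.path(tmp_dir, "part1.txt"))
writeLines(c("UNKNOWN-1 PNAEP -0.97", "DLG1-3 WSVTQ -1.83"),
           file.path(tmp_dir, "part2.txt"))

demo <- read_predictions(list.files(tmp_dir, full.names = TRUE),
                         known_domains = c("AHNAK2-1", "DLG1-3"))
names(demo) <- c("DomainName", "PeptideSequence", "DecisionScore")

# The two summary counts reported after loading.
print(paste("Number of predicted PPIs is", nrow(demo)))
print(paste("Number of domains with predicted PPIs is",
            length(unique(demo$DomainName))))
```

Binding all tables in a single `do.call(rbind, ...)` call avoids re-copying the accumulated rows on every iteration, which may matter at the scale of millions of predicted PPIs.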

4- Get all structure-based prediction results files

pdzpath <- '~/Documents/Work/PDZ/Human-seq/Struct/'
sfiles <- list.files(pdzpath, full.names=TRUE)
## [1] "Number of structure-based prediction  files is 11"

5- Read all structure-based predicted PDZ PPIs

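# Read every prediction file and stack the rows into one data frame.
pdz_struct <- data.frame()
# seq_along() avoids iterating over c(1, 0) when no files were found.
for (i in seq_along(sfiles)) {
      d <- read.table(sfiles[i], fill=TRUE, header=FALSE)
      pdz_struct <- rbind(pdz_struct, d)
}
# Keep only rows whose domain is in the human PDZ list, then the first three columns.
pdz_struct <- pdz_struct[pdz_struct[[1]] %in% pdz_domains, ]
pdz_struct <- pdz_struct[, 1:3]
names(pdz_struct) <- c("DomainName", "PeptideSequence", "DecisionScore")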
## [1] "Number of structure-based predicted PPIs is 5266638"
## [1] "Number of domains with predicted PPIs is 213"
DomainName PeptideSequence DecisionScore
DLG1-3 FYKAI -1.334127431816098
DLG1-3 WSVTQ -1.8271362367520325
DLG1-3 VVDCM -2.106712804560324
DLG1-3 KFWGT -2.1204977196802006
DLG1-3 QADKV -1.6264223331780503
DLG1-3 KVDSS -1.7435554130347501
DLG1-3 GEKAM -1.5968062021165985
DLG1-3 NQDNP -1.573107366795604
DLG1-3 TKPSP -2.113319129979189
DLG1-3 KVDSV -0.31159228150511353

Preview of the first 10 records of the structure-based predicted PDZ PPIs.
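
Both prediction sets cover fewer domains (198 and 213) than the 218 domains loaded in step 1. One possible way to list the domains that received no predictions is `setdiff`. This is a minimal sketch on toy vectors; the variable names and the example domain list are illustrative assumptions, not values from the real data.

```r
# Toy stand-ins for pdz_domains and a DomainName column (illustrative only).
all_domains       <- c("AHNAK2-1", "DLG1-3", "DLG1-1")
predicted_domains <- c("AHNAK2-1", "AHNAK2-1", "DLG1-3")

# Domains present in the reference list but absent from the predictions.
missing_domains <- setdiff(all_domains, unique(predicted_domains))
print(missing_domains)
```

Applied to the real objects, the same call would take `pdz_domains` as the first argument and `pdz_seq$DomainName` or `pdz_struct$DomainName` as the second.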