Data Processing of Human PDZ Predictions

This document describes the data processing applied to the human PDZ domain interaction predictions, covering both the sequence-based and the structure-based predictions.

1- Load all human domains

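# Read the list of human PDZ domain names (one name per line, no header row)
pdz_domains <- read.csv('~/Documents/Work/PDZ/POWCLUI/domains/HumanPOWNames.txt', header = FALSE)
# Keep each domain name once
pdz_domains <- unique(pdz_domains$V1)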

## [1] "Number of PDZ domains is 218"
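
The rendered report shows the count but not the call that printed it. A minimal sketch of how such a line could be produced, assuming the names file holds one domain name per line; the inline text stands in for HumanPOWNames.txt, and the toy names and the `msg` variable are illustrative rather than taken from the original script.

```r
# Toy stand-in for HumanPOWNames.txt (assumption: one domain name per line, no header).
toy_names <- "DLG1-1\nDLG1-2\nDLG1-3\nDLG1-3\nAHNAK2-1"
toy_domains <- read.csv(text = toy_names, header = FALSE, stringsAsFactors = FALSE)

# unique() drops the repeated DLG1-3 entry.
toy_domains <- unique(toy_domains$V1)

# Assumed form of the hidden reporting line.
msg <- paste("Number of PDZ domains is", length(toy_domains))
print(msg)
```

With the five toy lines above, four distinct names remain after deduplication.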

2- Get all sequence-based prediction results files

pdzpath <- '~/Documents/Work/PDZ/Human-seq/Seq/'
files <- list.files(pdzpath, full.names = TRUE)

## [1] "Number of sequence-based prediction  files is 11"

3- Read all sequence-based predicted PDZ PPIs

# Read every result file, then bind the pieces once instead of growing a data frame inside a loop
pdz_seq <- do.call(rbind, lapply(files, read.table, fill = TRUE, header = FALSE))
# Keep only rows whose first column is a known human PDZ domain, and only the first four columns
pdz_seq <- pdz_seq[pdz_seq[[1]] %in% pdz_domains, 1:4]
names(pdz_seq) <- c("DomainName", "BindingSiteSequence", "PeptideSequence", "DecisionScore")

## [1] "Number of sequence-based predicted PPIs is 11281248"

## [1] "Number of domains with predicted PPIs is 198"

DomainName	BindingSiteSequence	PeptideSequence	DecisionScore
AHNAK2-1	ASGYSVTGKQLSYELE	YCIGG	-1.4038920541288675
AHNAK2-1	ASGYSVTGKQLSYELE	PNAEP	-0.9744675488251316
AHNAK2-1	ASGYSVTGKQLSYELE	KHEYE	-1.0880877951821866
AHNAK2-1	ASGYSVTGKQLSYELE	EVSYA	-0.869097214900627
AHNAK2-1	ASGYSVTGKQLSYELE	PASAS	-1.0076030891653431
AHNAK2-1	ASGYSVTGKQLSYELE	TFQRA	-1.6691679310600334
AHNAK2-1	ASGYSVTGKQLSYELE	QEDDE	-0.89593758624212
AHNAK2-1	ASGYSVTGKQLSYELE	TPVKN	-1.2779236501085798
AHNAK2-1	ASGYSVTGKQLSYELE	QNQNL	-1.035166200227473
AHNAK2-1	ASGYSVTGKQLSYELE	KSSIC	-0.7628681618461656

Preview of the first 10 records of the sequence-based predicted PDZ PPIs.
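
The two counts reported for this step, and the gap between the 218 loaded domains and the 198 domains that have predictions, can be checked with `nrow()`, `unique()` and `setdiff()`. The following is a sketch on toy data, not the original reporting code; the `toy_*` objects, the `SCRIB-1` name and the rounded scores are illustrative assumptions.

```r
# Toy stand-ins for pdz_domains and pdz_seq (values are illustrative, not real results).
toy_domains <- c("AHNAK2-1", "DLG1-3", "SCRIB-1")
toy_seq <- data.frame(
  DomainName      = c("AHNAK2-1", "AHNAK2-1", "DLG1-3"),
  PeptideSequence = c("YCIGG", "PNAEP", "FYKAI"),
  DecisionScore   = c(-1.40, -0.97, -1.33),
  stringsAsFactors = FALSE
)

# One row per predicted domain-peptide pair.
n_ppis <- nrow(toy_seq)

# Domains that appear at least once in the predictions.
domains_with_ppis <- unique(toy_seq$DomainName)

# Loaded domains that never appear in the predictions.
missing_domains <- setdiff(toy_domains, domains_with_ppis)

print(paste("Number of sequence-based predicted PPIs is", n_ppis))
print(paste("Number of domains with predicted PPIs is", length(domains_with_ppis)))
print(missing_domains)
```

Applied to the real objects, `setdiff(pdz_domains, unique(pdz_seq$DomainName))` would list the domains for which no sequence-based prediction was read.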

4- Get all structure-based prediction results files

pdzpath <- '~/Documents/Work/PDZ/Human-seq/Struct/'
sfiles <- list.files(pdzpath, full.names = TRUE)

## [1] "Number of structure-based prediction  files is 11"

5- Read all structure-based predicted PDZ PPIs

# Read every result file, then bind the pieces once instead of growing a data frame inside a loop
pdz_struct <- do.call(rbind, lapply(sfiles, read.table, fill = TRUE, header = FALSE))
# Keep only rows whose first column is a known human PDZ domain, and only the first three columns
pdz_struct <- pdz_struct[pdz_struct[[1]] %in% pdz_domains, 1:3]
names(pdz_struct) <- c("DomainName", "PeptideSequence", "DecisionScore")

## [1] "Number of structure-based predicted PPIs is 5266638"

## [1] "Number of domains with predicted PPIs is 213"

DomainName	PeptideSequence	DecisionScore
DLG1-3	FYKAI	-1.334127431816098
DLG1-3	WSVTQ	-1.8271362367520325
DLG1-3	VVDCM	-2.106712804560324
DLG1-3	KFWGT	-2.1204977196802006
DLG1-3	QADKV	-1.6264223331780503
DLG1-3	KVDSS	-1.7435554130347501
DLG1-3	GEKAM	-1.5968062021165985
DLG1-3	NQDNP	-1.573107366795604
DLG1-3	TKPSP	-2.113319129979189
DLG1-3	KVDSV	-0.31159228150511353

Preview of the first 10 records of the structure-based predicted PDZ PPIs.
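
Both tables share the DomainName and PeptideSequence columns, so one possible way to line up the two prediction sets is to merge on that pair. This step is not part of the processing shown here; the sketch below runs on toy data, and the `toy_*` objects, the scores and the `.seq`/`.struct` suffixes are assumptions made for illustration.

```r
# Toy stand-ins for pdz_seq and pdz_struct (scores are illustrative, not real results).
toy_seq <- data.frame(
  DomainName      = c("DLG1-3", "DLG1-3", "AHNAK2-1"),
  PeptideSequence = c("FYKAI", "KVDSV", "YCIGG"),
  DecisionScore   = c(-0.5, 0.2, -1.4),
  stringsAsFactors = FALSE
)
toy_struct <- data.frame(
  DomainName      = c("DLG1-3", "DLG1-3"),
  PeptideSequence = c("FYKAI", "WSVTQ"),
  DecisionScore   = c(-1.3, -1.8),
  stringsAsFactors = FALSE
)

# Inner join: keeps only domain-peptide pairs scored by both methods.
both <- merge(
  toy_seq, toy_struct,
  by = c("DomainName", "PeptideSequence"),
  suffixes = c(".seq", ".struct")
)
print(both)
```

Only the DLG1-3/FYKAI pair occurs in both toy tables, so the merged result has a single row carrying one score column per method. Passing `all = TRUE` to `merge()` would instead keep pairs seen by only one method, with `NA` for the missing score.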