Prepare data
Get proteomes from ftp
C. auris proteome files were retrieved from NCBI ftp sites and stored in the data/proteome-fasta folder. The concatenated fasta file is stored in the same folder.
Write a fasta file
I wrote a combined fasta file using the command fwrite(x = Caurisfasta.table[, c(1,3)], file = “Cauris.txt”, sep = ", col.names = FALSE)
Using Notepad++ I changed /n to /n> then /t to /n to put in conventional fasta format. Saved as Caurisfasta.txt. There are 27278 proteins.
Note: I wrote a new fasta file to remove B11243 from the file. There are now 21772 proteins.
Update 2021-07-04 HB skipped this step as the concatenated file can be created by cat *.faa.gz > concatenated_cand_auris.faa.gz if needed
# we want to include the information about whether a protein is in our homologs list
# for that we will important the list of protein ids that were analyzed
Hil <- read_tsv("data/Hil-family-Cauris-homologs-id.txt", col_names = c("name", "strain"), col_types = "cc")
# initiate the seqInfo table
seqInfo <- read_tsv("data/Cauris-four-proteomes-seqLen.tsv", col_types = cols()) %>%
mutate(isHIL = name %in% Hil$name)
# number of Hil homologs by strain
seqInfo %>%
group_by(strain) %>%
summarize(Hil = sum(isHIL))
A total of 25 proteins in the four strains belong to the Hil family. B11221 misses Hil4 and B11245 misses Hil8. In both cases the reason is because of incomplete genome assembly rather than gene loss. B11220, the representative of clade II, indeed misses 5/8 homologs.
SignalP for signal peptides
Use SignalP 5.0 server to predict signal peptides. Maximum number of proteins is 5000, so I split the fasta file into 6 smaller files and submitted, providing my email each time. I used the old file with B11243 to run this. (HB: seems like the current result file contains B11245 instead). I resplit the new fasta into 5 files and reran. Downloaded .gff files as SignalP_(file #).gff3 and used command grep -v -h "##" SignalP* | sort > SignalP.txt
Note [HB]: SignalP output in GFF3 format only includes the input proteins that are predicted to have a signal peptide. There is a total of 1183 entried in the result file, from a total of >20,000 proteins submitted.
Modified file names, but otherwise Bin’s code:
# Signal peptide
gff.names <- c("id", "source", "name", "start", "end", "prob", "na1", "na2", "na3")
signalp5 <- read_tsv("output/SignalP.txt", comment = "#", col_names = gff.names, col_types = "ccciidccc")
# remove the column if it already exists
if("signalp" %in% names(seqInfo))
seqInfo <- select(seqInfo, -signalp)
seqInfo <- left_join(seqInfo, select(signalp5, name = id, prob), by = c("name" = "name")) %>%
mutate(signalp = !is.na(prob)) %>% select(-prob)
New problem B11243 and B11245 are identical. Because B11245 is assembled to the chromosome level, I will retain it and remove B11243 from the results.
PredGPI
Following Bin’s method, I used PredGPI to predict GPI anchors. For PredGPI, the maximum number of sequences is 500. To split the file I used the command split Caurisfasta.txt -l 1000 in my Ubuntu instance, which made 55 new files. The output includes name | FPrate | most likely position and the sequence for each. GPI anchor is “highly probable” when FPrate < 0.001, “probable” when 0.001 < FPrate < 0.005, “lowly probable” when 0.005 < FPrate < 0.01. Retain only header lines with the command grep -h “>” GPIPE* > PredGPIResults.txt.
This was done using the file that contained B11243. To remove duplicate results, I used the command grep “>” PredGPIResults.txt | sort -u > PredGPIResults2.txt. Then I renamed PredGPIResults2 PredGPIResults. It has 21772 proteins.
Update 2021-07-27 [HB]
It turns out that GPI-anchored proteins can be divided into two categories: those that are retained in the plasma membrane and those that are subsequently released and then covalently linked to the cell wall components (β-1,6-glucan). The latter, named cell wall proteins (CWP) account for the majority of the GPI-anchored proteins while the former, named plasma membrane proteins (PMP) are the minority. Based on Caro et al. 1997 (PMID: 9434352), the PMPs usually have a dibasic motif prior to (within 12 residues from) the cleavage site, or ω-site. Frieman and Cormack 2004 (PMID: 15470092) further showed that an elevated Ser/Thr frequency in the body of the protein can override the dibasic motif signal. In particular, they empirically showed that a group of known PMPs with dibasic motifs all have Ser/Thr frequencies lower than 30%, while the CWPs don’t have the dibasic motifs and 28/41 (68%) have a Ser/Thr frequency > 30%.
To remove the PMPs from consideration (from all I can glean from the literature, adhesins are CWPs and not PMPs), I determined the presence of a dibasic motif (“KK/RR/KR/RK”) in the 12 residues immediately prior to the ω-site, using a custom Python script (realized later that I could have used the pred.gpi tibble and seqinr package to do the same thing right here). Below I’ll read that in and merge it into the pred.gpi table.
tmp <- read_delim("output/PredGPIResults.txt", delim = "|", col_names = c("name","fp","omega"), col_types = cols())
tmp1 <- read_tsv("output/PredGPI_dibasic.tsv", col_types = cols())
pred.gpi <- tmp %>%
mutate(name = str_sub(name,2,-2), # remove > and the trailing space
fp = as.numeric(str_sub(fp, 9, -2)), # extract the numeric part
is.gpi = fp <= 0.01, # based on the cutoff of the PredGPI server (prob < 99% -> not GPI-anchored)
omega = str_sub(omega, 8),
cleaveRes = str_sub(omega, 1, 1),
cleavePos = as.integer(str_sub(omega, 3))
) %>%
left_join(tmp1, by = "name")
# remove the column if it already exists
if("pred.gpi" %in% names(seqInfo))
seqInfo <- select(seqInfo, -pred.gpi, -dibasic)
seqInfo <- left_join(seqInfo, select(pred.gpi, name, pred.gpi = is.gpi, dibasic), by = c("name"="name")) %>%
relocate(pred.gpi, dibasic, .after = signalp)
hmmscan for pfam domains
Using hmmscan to search the pfam database,the maximum allowed number of sequences is 500. I uploaded the same 55 files from PredGPI to hmmscan, providing my email address each time. HmmerWeb version 2.41.1. I resplit the new fasta into 44 files and reran. Copied and pasted results email into hmmer_results_emails.txt - checked to make sure that there were 44 and that each job id matched. Used the command grep “>” hmmer_results_emails.txt | sort > hmmer_results.txt. In Notepad++, copied and pasted header row from emails file.
This chunk will add a column that lists all non-overlapping domains found in each protein. In cases of overlap, the highest-scoring domain wins (using the outcompeted column in the data).
Update 2021-07-04 HB See HMMSCAN help page for clarification on the ’out-competed" column
In Pfam related entries are grouped into Clans, and as such can often match the same, or similar, regions on the query sequence. An additional column in the results table contains the clan accession for the family, if it belongs to a clan. Pfam employs a specific post processing on families from the same clan where the best match (determined by lowest E-value), is taken and the rest are out-competed.
hmm.names <- c("seq_id", "alignment_start", "alignment_end", "envelope_start", "envelope_end", "hmm_acc", "hmm_name", "hmm_start", "hmm_end", "hmm_length", "bit_score", "individual_e_value", "conditional_e_value", "database_significant", "outcompeted", "clan")
hmmscan_pfam <- read_table2(file = "output/hmmer_results.txt", col_names = hmm.names, col_types = "ciiiicciiidddiic", skip = 1)
df <- hmmscan_pfam %>%
filter(!outcompeted, # remove lower-scored overlapping domains
!hmm_name %in% c("Hyr1", "Candida_ALS")) %>% # remove two repeat domains
mutate(seq_id = str_sub(seq_id, 2)) %>% # remove >
group_by(seq_id) %>%
# collapse into list of pfam domains for each protein
summarise(pfam_domains = paste0(hmm_name, " (", envelope_start, "-", envelope_end, ")", collapse = ", "))
# remove the column if it already exists
if("pfam_domains" %in% names(seqInfo))
seqInfo <- select(seqInfo, -pfam_domains)
seqInfo <- left_join(seqInfo, df, by = c("name"="seq_id"))
TANGO
Created a .bat file to run tango on the condensed fasta. Used the command fwrite(x = Caurisfasta.table[, c(1,3)], file = "Cauris.txt", sep = "\t", col.names = FALSE). Command file is tango_in.bat. Ran it and compressed all results using the command gzip *.txt.
When I went to analyze I found that multiple files contained trash (columns with value -1.IO and values that didn’t match if the single line from the .bat file was rerun). Upon looking in the TANGO documentation, it says that files can’t be larger than 1000 sequences. Splitting it still induced errors, so I had Bin rerun all of the TANGO calculations and upload his error-free results to the github.
Note: Downloading and processing all of the TANGO output files takes a VERY long time. If you don’t need to regenerate the results, just import the file tango_results_df.txt as tango.res.df1. Everything after extracting the results will work off of that.
Using Bin’s tango extraction functions
require(dplyr)
extract_tango <- function(tango_output, agg_threshold = 5, required_in_serial = 5) {
require(tidyverse)
tmp <- read_tsv(file = tango_output, col_types = "icddddd") %>%
# a boolean vector for residues above threshold
mutate(pass = Aggregation > agg_threshold)
pass.rle <- rle(tmp$pass) # this creates a run length encoding that will be useful for identifying the sub-sequences in a run longer than certain length
# --- Explanation ---
# this rle object is at the core of this function
# an example of the rle looks like
# lengths: int[1:10] 5 19 20 8 1 5 19 6 181 18
# values: logi[1:10] F T F T F T F T F T
# note that by definition the values will always be T/F interdigited
# our goal is to identify the sub-sequences that is defined as a stretch of
# n consecutive positions with a score greater than the cutoff and record the
# sub-sequence, its length, start and end position, 90% quantile of the score
# --- End of explanation ---
# 1. assigns a unique id for each run of events
tmp$group <- rep(1:length(pass.rle$lengths), times = pass.rle$lengths)
# # 2. extract the subsequences
agg.seq <- tmp %>%
dplyr::filter(pass) %>% # drop residues not predicted to have aggregation potential
group_by(group) %>% # cluster by the runs
summarise(seq = paste0(aa, collapse = ""),
start = min(res), end = max(res), length = n(),
median = median(Aggregation),
q90 = quantile(Aggregation, probs = 0.9),
ivt = sum(aa %in% c("I","V","T")) / length(aa),
.groups = "drop") %>%
mutate(interval = start - dplyr::lag(end) - 1) %>%
dplyr::filter(length >= required_in_serial) %>%
select(-group)
return(agg.seq)
}
Apply to C. auris proteomes
tango.output.files <- list.files(path = "output/tango-output", pattern = ".txt|.txt.gz", full.names = TRUE)
# the read_csv() function used in the custom function can automatically decompress gzipped files
tango.res <- lapply(tango.output.files, extract_tango)
names(tango.res) <- gsub(".txt|.txt.gz", "", basename(tango.output.files))
# to add species information
# seqInfo <- read_tsv("raw-output/XP_028889033_homologs.tsv", comment = "#", col_types = c("ccci"))
tango.res.df <- bind_rows(tango.res, .id = "id")
# tango.res.df <- bind_rows(tango.res, .id = "id") %>%
# left_join(select(seqInfo, -length), by = c("id" = "id"))
# # save the tango output
# write_tsv(tango.res.df, "tango_summary_table.tsv.gz")
# # mutate(species = str_split(id, "_(?!.*_)", simplify = TRUE)[,2])
# # extract the species names
# # credit: https://stackoverflow.com/questions/20454768/how-to-split-a-string-from-right-to-left-like-pythons-rsplit
# # the split pattern is equivalent to the rsplit() function in python
Top 20 TANGO hits ranked by number of occurrences in all proteins, restricted to the strong hits with a median score >= 30%.
# tango.res.df1 <- tango.res.df %>% left_join(seqInfo %>% select(name, strain), by = c("id" = "name"))
tango.res.df1 <- read_tsv("output/tango_results_df.txt.gz", col_types = cols())
# find unique motifs and count the number of proteins and strains represented
motif.summary <- tango.res.df1 %>%
left_join(select(seqInfo, id = name, isHIL)) %>%
group_by(seq) %>%
summarize(n = n(), n.HIL = sum(isHIL),
n.prot = n_distinct(id),
n.strains = n_distinct(strain),
medScore = round(mean(median),1),
IVT = round(mean(ivt),2),
avg.intv = round(mean(interval, na.rm = T),1),
mad.intv = round(mad(interval, na.rm = T),1),
.groups = "drop") %>%
arrange(desc(n))
Joining, by = "id"
motif.summary %>%
filter(round(medScore,0) >= 30) %>%
head(20)
Discussion
- Somewhat to our surprise, even in the entire C. auris proteome, the “GVVIVTT” and its variants are still the dominant TANGO motif! Are they all in the Hil family homologs?
Identify variants of “GVVIVTT”
Since this is the dominant motif in the C. auris proteome, we want to determine what percent of the TANGO sequences in the C. auris proteome belong to this class?
To design the regular expression, we looked at both the TANGO sequences that “look similar” to “GVVIVTT”. In particular, Jan manually went through the TANGO hits in C. lusitaneae, C. auris, C. haemulonis, C. pseudohaemulonis, C. duohaemulonis, S. stipitis, S. cerevisiae, and C. glabrata and summarized the following pattern: [VI][VIVM[VI][VI]T
Below we will restrict the patterns to those with a median TANGO probability above 30.
seqs <- motif.summary$seq
# core group, must be G-[VI]x4-TT
pat0 <- "G[VI]{4}TT"
# version 1, requires G-[ALVI][VI][LVI][VI]-(either end of string or at least 1 suitable residue)
# OR (G not required)-[VI]x4-(two suitable residues)
# pat1 <- "G[ALVI][VI][LVI][VI]([LVIFAWYTM]{1,2}|$)|[VI]{4}[LVIFAWYTM]{2}"
# version 2, based on Jan's manual inspection of TANGO hits in the following species:
# C. lusitaneae, C. auris, C. haemulonis, C. pseudohaemulonis, C. duohaemulonis, S. stipitis, S. cerevisiae, and C. glabrata
pat.jan <- "[VI][VIVM[VI][VI]T"
cat(paste("Identify the group of TANGO sequences most similar to 'GVVIVTT' using the pattern", pat0, sep = " "))
Identify the group of TANGO sequences most similar to ‘GVVIVTT’ using the pattern G[VI]{4}TT
match0 <- grep(pat0, seqs)
motif.summary[match0,] %>% filter(round(medScore,0) >= 30) %>% arrange(desc(n)) %>% select(seq, n, n.HIL, medScore, IVT)
cat(paste("Identify TANGO sequences with slightly more variation from 'GVVIVTT', using `", pat.jan, "`", sep = ""))
Identify TANGO sequences with slightly more variation from ‘GVVIVTT’, using [VI][VIVM[VI][VI]T
match.jan <- grep(pat.jan, seqs)
# exclude those already identified with pattern 0
motif.summary[setdiff(match.jan, match0),] %>% filter(round(medScore,0) >= 30) %>% arrange(desc(n)) %>% select(seq, n, n.HIL, medScore, IVT)
# export the list of motifs for manual inspection
motif.summary[union(match.jan, match0),] %>%
filter(round(medScore,0) >= 30) %>%
arrange(desc(n)) %>%
write_csv("output/TANGO-GVVIVTT-like-sequences.csv")
Discussion
We can see that the vast majority of GVVIVTT-like sequences are within the Hil family proteins
Which non-Hil proteins have GVVIVTT sequences?
Note that XP_028891378.1 is a partial protein entry that I have determined to be the C-termianl fragment of Hil4 in B11221. Looking closely at the sequence and the chromosomal locations of QEL63124 and its neighboring QEL63125, I found that the latter, at ~500 a.a. is the N-terminal domain of Hil1 in B11245 while QEL63124 is the remaining part of that protein. So in conclusion, **ALL “GVVIVTT” like sequences, based on Bin’s regex, are within the Hil family.
In the second table that lists GVVIVTT-like sequences that are captured by Jan’s more inclusive regex and not my strict one, there is a notable outlier, namely GVVIVT, which is both so close to GVVIVTT and yet only present in a group of nonHil sequences, including PIS58185 in B8441 and others.
I analyzed this group of sequences further – see README.md in this folder.
This protein is very interesting: it no longer has the PF11765 domain but otherwise shows a high degree of sequence similarity with the latter.
Number and spacing of TANGO hits in each protein sequence
We first summarize the motif sequences within each protein and then use that to determine the distribution of TANGO hit # and spacing in the entire proteome, with or without the Hil family.
# first count the number of tango hits with median score >= 30 hits for each protein
# motif.per.seq <- tango.res.df1 %>%
# group_by(id) %>%
# summarize(n.all = sum(round(median, 0) >= 30))
# next we will filter the tango dataset in order to recalculate the intervals
# this will result in 14 sequences to be dropped since they have 0 hits meeting
# the criteria. we will add them back by right_join() with the tibble above
motif.per.seq <- tango.res.df1 %>%
# limit to sequences with median score >= 30
filter(round(median,0) >= 30) %>%
group_by(id) %>%
# recalculate the interval since we are now limiting the hits to >= 30
mutate(interval = start - lag(end) - 1) %>%
# summarize the results at a sequence level
summarize(n.all = n(),
n.type = length(unique(seq)),
medScore = round(mean(median),1),
IVT = round(mean(ivt),2),
avg.intv = round(mean(interval, na.rm = T),1),
IQR.intv = round(IQR(interval, na.rm = T)/1.349 ,1),
# median absolute deviation is a robust measure of the scale parameter
# https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/mad
mad.intv = round(mad(interval, na.rm = T),1),
seqs = paste(unique(seq), collapse = ","),
.groups = "drop") %>%
right_join(select(seqInfo, strain, id = name, length, isHIL), by = "id") %>%
mutate(n.all = ifelse(is.na(n.all), 0, n.all), n.type = ifelse(is.na(n.type), 0, n.type)) %>%
arrange(desc(n.all), desc(mad.intv))
#motif.per.seq
Distribution of the number of TANGO hits per kilo amino acids protein across the proteome.
ggplot(motif.per.seq, aes(x = strain, y = n.all/length*1000)) +
geom_boxplot(fill = "gray70", alpha = 0.7, outlier.alpha = 0.2) + ylab("# of TANGO hits per 1000 amino acids") +
geom_point(data = filter(motif.per.seq, isHIL), aes(color = length), position = position_jitter(0.05), size = 1.5) +
scale_color_viridis_c() +
theme_cowplot()

We can see that the 8 Hil orthologs fall into two groups. Hil1-4, which are longer than 1000 amino acids, have relatively high number of TANGO hits relative to their length. In fact, they are all above 89% percentile of the data.
Compare the distribution of TANGO hits # and score in the Hil family vs the proteome
ggplot(motif.per.seq, aes(x = length, y = n.all/length*1000)) +
geom_bin_2d(binwidth = c(200, 2)) + scale_fill_viridis_c() +
geom_point(data = filter(motif.per.seq, isHIL), color = "red") +
xlab("Protein length") + ylab("# of TANGO hits per 1000 amino acids") +
facet_wrap(~strain, nrow = 2) + theme_grey(base_size = 14)

Discussion
- this is another way of plotting the same data, again showing that Hil1-4 are exceptional in the proteome in that they harbor a very high amount of TANGO hits per 1000 amino acid residues relative to proteins of similar length to them.
Next we add the TANGO information to the seqInfo dataset.
if("n.tango" %in% names(seqInfo)){
seqInfo <- seqInfo %>%
select(-n.tango, -reg.spaced)
}
seqInfo <- seqInfo %>%
left_join(select(motif.per.seq, name = id, n.tango = n.all, mad.intv), by = "name") %>%
mutate(mad.intv = ifelse(n.tango > 2, mad.intv, NA),
reg.spaced = ifelse(n.tango > 3 & mad.intv < 5, TRUE, FALSE)) %>%
select(-mad.intv)
Summarize tango results for table
Three metrics for the by protein tango results can be number of pat0 (GVVIVTT) sequences, pat1 (GVVIVTT-like) sequences, and total number of aggregation sequences with score > 30
Update 2021-07-15 [HB] The code below is no longer included because the above analysese have shown that GVVIVTT and its variants are highly specific to the Hil family, which appears to be an outlier in terms of TANGO hits per protein in the entire proteome. In predicting additional adhesins, we will only look at the total number of TANGO hits.
# core group, must be G-[VI]x4-TT
pat0 <- "G[VI]{4}TT"
# version 1, requires G-[ALVI][VI][LVI][VI]-(either end of string or at least 1 suitable residue)
# OR (G not required)-[VI]x4-(two suitable residues)
pat1 <- "G[ALVI][VI][LVI][VI]([LVIFAWYTM]{1,2}|$)|[VI]{4}[LVIFAWYTM]{2}"
pat2 <- "[VI][VIVM][VI][VI]T"
match0 <- grep(pat0, motif.per.seq$seq)
match1 <- grep(pat1, motif.per.seq$seq)
match2 <- grep(pat2, motif.per.seq$seq)
tango.summ <- seqInfo[,"name"] %>%
# add pat0 results
left_join(motif.per.seq[match0,] %>% group_by(id) %>% summarise(n.gvvivtt.seqs = sum(n)), by = c("name" = "id")) %>%
# add pat1 results
left_join(motif.per.seq[setdiff(match1, match0),] %>% group_by(id) %>% summarise(n.gvvivtt.like.seqs = sum(n)), by = c("name" = "id")) %>%
# add pat2 results - pat2 from Fassler manual curation of TANGO positive sequences from FungalRV positive proteins in C. lusitaneae, C. auris, C. haemulonis, C. pseudohaemulonis, C. duohaemulonis, S. stipitis, S. cerevisiae, and C. glabrata into GVVIVTT-like and non GVVIVTT groups
left_join(motif.per.seq[match2,] %>% group_by(id) %>% summarise(n.vvvvt.seqs = sum(n)), by = c("name" = "id")) %>%
# add motif.per.seq results
left_join(motif.per.seq %>% group_by(id) %>% summarise(n.high.score.agg.seq = sum(n)), by = c("name" = "id"))
# join to seqInfo table
# remove the column if it already exists
if("n.gvvivtt.seqs" %in% names(seqInfo))
seqInfo <- select(seqInfo, -n.gvvivtt.seqs, -n.gvvivtt.like.seqs, -n.vvvvt.seqs, -n.high.score.agg.seq)
seqInfo <- left_join(seqInfo, tango.summ)
seqInfo <- seqInfo %>% mutate(n.gvvivtt.seqs = replace_na(n.gvvivtt.seqs,0), n.gvvivtt.like.seqs = replace_na(n.gvvivtt.like.seqs,0), n.vvvvt.seqs = replace_na(n.vvvvt.seqs,0), n.high.score.agg.seq = replace_na(n.high.score.agg.seq,0))
S/T frequency
Using Bin’s method, I’m running EMBOSS freak for S/T, S, T with a stepping value of 10 and an averaging window of 100. The results were saved as freak_aminoacid(s)_100_10.txt. Using Bin’s format_freaq_output.py script, each was converted to a readable .out file and all files were compressed for storage.
Update 2021-07-16 [HB] we will use two types of statistics to summarize the serine/threonine frequencies here. 1) frequency of the residues over the entire protein; 2) maximum frequency of serine or threnonine in a 100 amino acid sliding window (which will ignore any proteins < 100 residues). in the second measure, we reason that a 100 amino acid window is a reasonable size to capture a genuine “hike” in Serine or Threonine frequency. For one thing, it is large enough to avoid capturing just a small cluster of S/T, and for another thing the Serine rich cluster we observe in Hil1 is ~200 amino acids long and with a consistently elevated Serine content, suggesting that a 100 amino acid window is not too large to lead to diluted signals by flanking residues.
# load data
ST.protein <- read_tsv("output/ST-freq/Cauris-ST-freq-per-protein.txt.gz", col_types = cols())
ST.freq <- read_tsv("output/ST-freq/freak_st_100_10_freak.out.gz", col_types = "cid")
S.freq <- read_tsv("output/ST-freq/freak_s_100_10_freak.out.gz", col_types = "cid")
T.freq <- read_tsv("output/ST-freq/freak_t_100_10_freak.out.gz", col_types = "cid")
ST.window <- bind_rows("ST" = ST.freq, "S" = S.freq, "T" = T.freq, .id = "residue") %>%
group_by(id, residue) %>%
summarize(max = max(freq))
`summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
rm(list = c("ST.freq", "S.freq", "T.freq"))
ST.window %>%
left_join(select(seqInfo, id = name, isHIL), by = "id") %>%
mutate(residue = factor(residue, levels = c("ST", "S", "T"), labels = c("Ser/Thr", "Ser", "Thr")),
`Hil family` = ifelse(isHIL, "Yes", "No")) %>%
ggplot(aes(x = max)) + geom_histogram(binwidth = 0.02) +
xlab("Max frequency of Ser/Thr in 100 amino acid window") + scale_y_continuous(position = "right") +
facet_grid(`Hil family` ~ residue, scales = "free_y", labeller = "label_both", switch = "y") +
theme_cowplot() + panel_border()

Discussion
From the above we can see that the not all Hil family proteins have well-above-average Serine or Threonine content alone. But when both residues are counted together, they consistently rank above the majority of the proteome.
ST.protein %>%
left_join(select(seqInfo, ID = name, isHIL), by = "ID") %>%
mutate(`Ser/Thr` = (Ser+Thr)/length, Ser = Ser/length, Thr = Thr/length) %>%
pivot_longer(cols = c(`Ser/Thr`, Ser, Thr), names_to = "residue", values_to = "value") %>%
mutate(residue = factor(residue, levels = c("Ser/Thr", "Ser", "Thr")),
`Hil family` = ifelse(isHIL, "Yes", "No")) %>%
ggplot(aes(x = value)) + geom_histogram(binwidth = 0.02) +
xlab("Frequency of Ser/Thr in the whole protein") + scale_y_continuous(position = "right") +
facet_grid(`Hil family` ~ residue, scales = "free_y", labeller = "label_both", switch = "y") +
theme_cowplot() + panel_border()

Discussion
Similar as above, we see that the combined Ser/Thr frequency appear to better distinguish the Hil family from the rest of the proteins. For this reason let’s add the max S/T frequency in 100 amino acid windows and the overall S/T frequency per protein to seqInfo
#remove columns if present
if("ST.prot" %in% names(seqInfo))
seqInfo <- select(seqInfo, -ST.prot, -ST.window.max)
#join maximums back into seqInfo
tmp <- ST.protein %>%
mutate(ST.prot = round((Ser+Thr)/length, 3)) %>%
select(name = ID, ST.prot) %>%
left_join(ST.window %>% filter(residue == "ST") %>% select(name = id, ST.window.max = max), by = "name") %>%
# for proteins shorter than 100 a.a., the sliding window estimate would be NA. replace it with whole protein estimate in ST.protein
mutate(ST.window.max = round(coalesce(ST.window.max, ST.prot),3))
seqInfo <- seqInfo %>% left_join(tmp, by = "name")
# rm(list = c("ST.protein", "ST.window"))
Write the seqInfo to a file to be included as supplementary data
seqInfo %>%
select(-pfam_domains, pfam_domains) %>%
write_tsv(file = "output/Rmd-out/seqInfo_table_all.tsv")
Prediction
Here we try to use the sequence properties collected above to identify additional proteins outside of the Hil family that may function as adhesins. The idea here is to apply filters in a step-wise manner and watch if all Hil proteins are still retained and how many non-Hil proteins remain in the dataset.
Cell wall proteins
How many proteins are likely to be GPI-anchored cell wall proteins?
# Load function
# reference: http://rstudio-pubs-static.s3.amazonaws.com/6975_c4943349b6174f448104a5513fed59a9.html
source("http://pcwww.liv.ac.uk/~william/R/crosstab.r")
crosstab(seqInfo, row.vars = "signalp", col.vars = "pred.gpi", type = c("f", "t"), dec.places = 1, addmargins = FALSE)
Count Total %
pred.gpi FALSE TRUE FALSE TRUE
signalp
FALSE 20459.0 130.0 94.0 0.6
TRUE 861.0 322.0 4.0 1.5
We see that only a very small fraction (1.5%) of the proteome pass both the SignalP and PredGPI tests. By comparison, all eight C. auris Hil proteins pass both tests, and the vast majority of Hil homologs pass both tests as well. So far this is the most stringent (aka effective) filter to reduce the number of candidate proteins.
Update 2021-07-27 [HB] Following the discussion on GPI-CWP vs GPI-PMP (see notes in the PredGPI section above), we will additionally test whether proteins with Signal Peptide and GPI-anchor also have low (<30%) Ser/Thr frequency and the presence of a dibasic motif
filter1 <- seqInfo %>%
mutate(`Hil_family` = ifelse(isHIL, "Yes", "No"),
`GPI` = ifelse(signalp & pred.gpi, ifelse(ST.prot < 0.3 & !is.na(dibasic), "PMP", "CWP"), 'No'),
`GPI` = factor(GPI, levels = c("No", "CWP", "PMP"))) %>%
select(-signalp, -pred.gpi, -isHIL)
crosstab(filter1, row.vars = "GPI", col.vars = "Hil_family", addmargins = FALSE)
Hil_family No Yes
GPI
No 21448 2
CWP 273 23
PMP 26 0
The two proteins that dropped out
are both in the clade IV strain B11245 and appear to be missing the C-terminal sequence, hence failed the PredGPI test.
And the PMPs are shown below
Protein length
Below we show the distribution of protein length in the proteome.
p <- ggplot(data = seqInfo) +
geom_histogram(mapping = aes(x=length), binwidth = 50)+
geom_vline(aes(xintercept = 500), color="blue", size=1) +
annotate("text", x = 3000, y = 1500, label = paste0("# of proteins >= 500 residues: ", sum(seqInfo$length >= 500)), size = 5) +
xlab("Protein length (amino acids)") + ylab("# of proteins") + theme_cowplot()
print(p)

We classify the CWP hits into two groups based on their length. Those with more than 500 a.a. are labeled as “CWP long” and those smaller than 500 a.a. as “CWP short”.
filter2 <- filter1 %>%
mutate(Len = ifelse(length >= 500, ">=500", "<500"),
# group1 is the old definition
# group1 = ifelse(Hil_family == "Yes", # group 1: Hil family
# "Hil", ifelse(GPI != "No", # group 2&3: cell wall long / short
# ifelse(length >=500, "CWP long", "CWP short"),
# "Others")), # group 4: Others
# group1 = factor(group1, levels = c("Hil", "CWP long", "CWP short", "Others")),
# group is the new definition
GPI = as.character(GPI), # necessary for the case_when() statements
group = case_when(
Hil_family == "Yes" ~ "Hil",
GPI == "CWP" & length >= 500 ~ "CWP long",
GPI == "CWP" & length < 500 ~ "CWP short",
GPI == "PMP" ~ "PMP",
GPI == "No" ~ "Others",
),
group = factor(group, levels = c("Hil", "CWP long", "CWP short", "PMP", "Others")))
crosstab(filter2, row.vars = "group")
group Count Total %
Hil 25.00 0.11
CWP long 63.00 0.29
CWP short 210.00 0.96
PMP 26.00 0.12
Others 21448.00 98.51
Sum 21772.00 100.00
Ser/Thr frequency
seqInfo %>%
pivot_longer(cols = starts_with("ST."), names_to = "type", names_prefix = "ST.", values_to = "value") %>%
mutate(type = factor(type, levels = c("prot", "window.max"), labels = c("whole protein", "max in 100 a.a. windows"))) %>%
ggplot(aes(x = value, color = isHIL)) + stat_ecdf() + facet_wrap(~type) +
scale_x_continuous(limits = c(0,0.8)) + scale_color_manual("Hil family", values = c("black", "gold3")) +
xlab("Ser+Thr frequency") + ylab("Cumulative probability") +
theme_cowplot() + background_grid()
Warning: Removed 4 rows containing non-finite values (stat_ecdf).

p <- filter2 %>%
pivot_longer(cols = starts_with("ST."), names_to = "type", names_prefix = "ST.", values_to = "value") %>%
mutate(type = factor(type, levels = c("prot", "window.max"), labels = c("whole protein", "max in 100 a.a. window"))) %>%
ggplot(aes(x = type, y = value, fill = group)) + geom_boxplot(outlier.size = 0.5, outlier.shape = 1) +
scale_fill_viridis_d() + xlab("") + ylab("Ser/Thr frequency") + theme_cowplot()
p
ggsave(file = "output/figure/20210721-ST-boxplot-compare-cwp-to-rest.png", plot = p + theme(legend.position = "bottom"), width = 5, height = 4.5)

Seems like the two measures behave very similarly. Let’s see how they compare by applying the above cutoffs.
ST.th = c("ST.protein" = 0.15, "ST.window" = 0.25)
cat("Thresholds for whole protein or sliding window frequencies of Ser/Thr")
Thresholds for whole protein or sliding window frequencies of Ser/Thr
print(ST.th)
ST.protein ST.window
0.15 0.25
filter2 %>%
mutate(`S/T protein` = ST.prot > ST.th["ST.protein"],
`S/T max window` = ST.window.max > ST.th["ST.window"]) %>%
crosstab(row.vars = c("S/T protein", "S/T max window"), col.vars = "group", addmargins = FALSE)
group Hil CWP long CWP short PMP Others
S/T protein S/T max window
FALSE FALSE 0 0 44 6 14054
TRUE 0 0 0 0 463
TRUE FALSE 0 4 21 4 3835
TRUE 25 59 145 16 3096
Based on the above, we will apply a third filter of S/T frequency in the whole protein > 15%, which is roughly the lower cutoff for the group of CWPs in the Caro et al. 1997 paper.
# ref: https://stackoverflow.com/questions/34096162/dplyr-mutate-replace-several-columns-on-a-subset-of-rows
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
condition <- eval(substitute(condition), .data, envir)
.data[condition, ] <- .data[condition, ] %>% mutate(...)
.data
}
filter3 <- filter2 %>%
mutate(`S/T protein` = ST.prot > 0.15) %>%
mutate_cond(str_sub(group,1,3) == "CWP" & !`S/T protein`, group = "Others") %>%
select(-Len, -`S/T protein`)
cat("Genes removed due to the S/T cutoff")
Genes removed due to the S/T cutoff
filter2 %>% filter(group %in% c("CWP long", "CWP short"), ST.prot <= 0.15)
We can see that all the proteins that fall out are within the “CWP short” category, which are less likely to function as adhesins to begin with. Thus this cutoff could help reduce the number of proteins we have to manually look at without losing many potential hits.
After this step we have a total of 229 proteins left.
TANGO hits
p <- filter3 %>%
ggplot(aes(x = n.tango, fill = group)) + geom_histogram(binwidth = 1) +
facet_wrap(~group, nrow = 5, scales = "free_y") +
scale_y_continuous(n.breaks = 3) +
scale_fill_viridis_d() +
xlab("# of strong TANGO hits") + theme_cowplot() + panel_border()
p
ggsave("output/figure/20210721-tango-hits-distribution-compare.png", plot = p, width = 6, height = 4.5)

As we already showed before, the number and spacing of strong TANGO hits is a divisive feature that separate the Hil family proteins in C. auris into two groups – those that are longer than 1000 a.a. have a lot of them and have them regularly spaced, while the four shorter members have few. We see a similar picture when we look at the candidate adhesin proteins in our analysis up to this point: the cell-wall protein long group showed above average number of TANGO hits while the short ones have very few. The PMPs, despite most of them being shorter than 500 amino acids, actually have relatively high number of TANGO hits.
Tandem repeats
We didn’t want to perform the tandem repeat analysis on >20,000 proteins. Now that we have narrowed the list down to a much more manageable one, the code below will perform XSTREAM analysis on the group of candidate proteins identified above.
- Export the protein IDs
cat(filter3 %>% filter(str_sub(group, 1, 3) == "CWP") %>% pull(name), file = "data/20210718-cwp-proteins-id.txt", sep = "\n")
- Extract the fasta, run the following code at the command line
/Users/bhe2/sw/miniconda3/bin/python script/extract_fasta_gz.py data/proteome-fasta/Caurisfasta.faa.gz data/20210718-cwp-proteins-id.txt data/20210718-cwp-proteins.faa
- Execute the xstream program
cd script
sh xstream.sh ../data/20210718-cwp-proteins.faa cwp-proteins ../output/tandem-repeats
cd ..
- Import the xstream results
repeats <- read_tsv("output/tandem-repeats/XSTREAM_cwp-proteins_i0.7_g3_m5_L15_chart.tsv", col_types = cols())
Cluster by similarity
Now that we settled on the “CWP long” and “CWP short” lists, we would like to visualize their domain architecture. Since we started with four proteomes, with the goal of capturing a “species-wide” set of candidates, we have presumably four alleles of each protein (may be fewer) in the list. To make it easier to examine the domain features, let’s cluster the proteins and assemble a representative list.
Submit the fasta to CD-hit webserver.
Reformat the CD-hit output
/Users/bhe2/sw/miniconda3/bin/python script/format_cd-hit_output.py output/cd-hit/20210727-CD-hit-0.85/1627438838.fas.1.clstr.sorted output/cd-hit/formatted-cd-hit-0.85.tsv
/Users/bhe2/sw/miniconda3/bin/python script/format_cd-hit_output.py output/cd-hit/20210727-CD-hit-0.6/1627439988.fas.1.clstr.sorted output/cd-hit/formatted-cd-hit-0.6.tsv
cdhit.lo <- read_tsv("output/cd-hit/formatted-cd-hit-0.6.tsv", col_types = "ccc") %>%
group_by(cluster) %>%
mutate(leader = id[which(identity == "*")], # assign the representative protein ID to new variable "leader"
identity = ifelse(identity == "*", 100, as.numeric(str_sub(identity, 1, -2)))) %>% # extract percent identity
ungroup()
cdhit.hi <- read_tsv("output/cd-hit/formatted-cd-hit-0.85.tsv", col_types = "ccc") %>%
group_by(cluster) %>%
mutate(leader = id[which(identity == "*")], # assign the representative protein ID to new variable "leader"
identity = ifelse(identity == "*", 100, as.numeric(str_sub(identity, 1, -2)))) %>% # extract percent identity
ungroup()
cdhit <- full_join(cdhit.hi, cdhit.lo, by = "id", suffix = c(".hi", ".lo")) %>%
mutate(agree = leader.hi == leader.lo) %>%
select(id, everything()) %>% arrange(cluster.lo)
rm(list = c("cdhit.lo", "cdhit.hi"))
Disagreement between the two clustering results:
cdhit %>% group_by(cluster.lo) %>% filter(length(unique(cluster.hi)) > 1, leader.hi == id) %>% ungroup()
To determine if the extra clusters identified with the 85% threshold is real, I used blastp (ncbi server) to align each of the additional leader sequence to the shared leader sequence identified with the 60% threshold. When necessary, I also checked the protein’s chromosomal location to determine if they indeed come from different loci.
Table. Determine the orthology of the cluster leaders using the two thresholds
| PIS51179.1 |
PIS54714.1 |
83% |
70% |
91 |
82 |
chr1 |
chr7 |
different |
both in B8441 |
| PIS48171.1 |
PIS48170.1 |
100% |
67% |
176 |
166 |
chr6:40562-41092 |
chr6:39307-39807 |
different |
tandem duplicates |
| PIS48171.1 |
PIS48172.1 |
100% |
66% |
176 |
173 |
chr6:40562-41092 |
chr6:41619-42140 |
different |
tandem duplicates |
| PIS54802.1 |
QEL63474.1 |
100% |
93% |
861 |
834 |
chr7:767294-769879 |
chr7:760044-760968,761039-762618 |
orthologs |
B8441 and B11245 |
| PIS56912.1 |
QEO20379.1 |
NA |
NA |
765 |
747 |
chr3,scaffold5:864995-867292 |
chr1:2549461-2551704 |
likely orthologs |
B8441 and B11220 |
| XP_028890844.1 |
QEO21242.1 |
100% |
85% |
790 |
755 |
chr1:3100225-3102597 |
chr2:1361822-1364089 |
likely orthologs |
B11221 and B11220 1 |
| QEL58520.1 |
PIS51941.1 |
97% |
88% |
457 |
440 |
chr1 |
chr1,scaffold1 |
orthologs |
B11245 and B8441 |
| PIS58185.1 |
QEO22553.1 |
100% |
94% |
2608 |
970 |
chr2 |
chr3 |
orthologs |
B8441 and B11220 1 |
1 According to Muñoz et al. 2021, there is a large scale translocation between chromosome 1 and 2 in the clade II strains and the chromosome 1 location for XP_028890844 is within the translocated region and matches the coordinates of QEO21242 in the translocated region in its chromosome 2.
3. Based on the results above, we will manually update the cdhit table
In addition to the changes mentioned in the table above, I also manually merged QEO60526 into the PIS58185 cluster (see README.md in this folder for the rationale); I merged QEO21241 from B11220 to the PIS51941 cluster. My goal here is to manually check the single gene clusters and determine if they actually belong to an existing cluster. All three remaining single gene clusters cannot be merged to any existing cluster, because their orthologs (by BLASTP) are either missing the signal peptide and/or GPI-anchor.
Lastly, CD-hit by default chooses the longest protein in a cluster as the representative. I manually changed the representative (aka leader) sequence to the B8441 protein if one exists. This way the final list will be mostly represented by sequences from a single strain (with some exceptions).
- Then load the updated table and merge the information to produce the final candidate table.Only the leaders (longest representative in each cluster) are shown below.
cdhit1 <- read_tsv("output/cd-hit/20210727-cdhit-cluster-manually-edited.tsv", col_types = cols()) %>%
left_join(select(seqInfo, id = name, length), by = "id") %>%
select(-ends_with(c(".hi",".lo"))) %>%
group_by(leader) %>%
mutate(leader.len = max(length), n = n()) %>%
ungroup()
cdhit1 <- cdhit1 %>%
filter(is.leader) %>%
select(leader, leader.len) %>%
arrange(desc(leader.len)) %>%
mutate(cluster = seq_along(leader)) %>%
right_join(cdhit1, by = c("leader", "leader.len")) %>%
select(cluster, n, leader, leader.len, id, length, is.leader)
- We will merge the cluster information with other sequence features
# calculate the per sequence repeat content and add to the candidates list
tmp <- repeats %>%
mutate(len = end - start + 1) %>%
group_by(identifier) %>%
summarize(perc = round(sum(len/seqLength),3)) %>%
rename(name = identifier, repeats = perc)
candidates <- filter3 %>%
filter(str_sub(group,1,3) == "CWP") %>%
rename(ST.win = ST.window.max) %>%
left_join(tmp, by = "name") %>%
replace_na(list(repeats = 0)) %>%
left_join(select(cdhit1, name = id, cluster, cluster.size = n, is.leader), by = "name") %>%
select(-Hil_family) %>%
arrange(cluster)
- Count the number of leaders in each group
candidates %>% filter(is.leader) %>% count(group)
- Show their sequence features for the CWP long and CWP short groups separately
cwp.long <- candidates %>% filter(is.leader, group == "CWP long") %>% select(-group, -is.leader, -pfam_domains, pfam_domains)
cwp.short <- candidates %>% filter(is.leader, group == "CWP short") %>% select(-group, -is.leader, -pfam_domains, pfam_domains)
cat("CWP long")
CWP long
cwp.long
cat("CWP short")
CWP short
cwp.short
Discussion
In the first table above, cluster 1 along with XP_028893057, which is not in the candidates list because it misses the C-terminal GPI-anchor sequence (this is a partial mRNA missing the 3’ portion), are the only ones containing the “GVVIVT” sequence (see README.md in this folder). They all have the GLEYA domain in the N-terminus, and had very similar central domain repeat units as the Hil1/2 protein.
Rows 2, 3 and 4 all have the Candida_ALS_N domain, and they are the ALS family members.
Examining the N-terminal Pfam domains among the rest candidate revealed Flo11, Asp, PLA2_B, CFEM, Glyco_hydro_72 and X8. Many of these domains have connections to adhesion.
Write the candidates info to files
candidates %>%
filter(is.leader) %>%
select(-is.leader, -pfam_domains, pfam_domains) %>%
relocate(group, .after = name) %>%
write_tsv(file = "output/Rmd-out/20210729-candidate-leaders-seqinfo.tsv")
Schematic plot for leaders
The goal is to create a domain schematic plot for all the leader proteins. The desired plot will include the following features: SignalP, GPI-anchor, any pfam domain (except for the Hyr1, Flocculin_t3 and Candida_ALS repeat domains), tandem repeats as identified by xstream, and the plot should be ordered by protein length from large to small.
Update 2021-07-20 [HB] I decide to remove the clusters whose leader protein length is less than 195 a.a. This helps to shorten the list, making the plot more readable.
# get the ids to plot
id.to.plot <- candidates %>% filter(is.leader, length >= 195) %>% pull(name)
# we will be building the plot in layers: 1) whole protein (grey); 2) tandem repeats; 3) pfam domains; 4) signalp and pred.gpi; 5) if possible, tango hits
# the relevant data can be assembled separately
df.whole <- candidates %>%
filter(name %in% id.to.plot) %>%
mutate(start = 1, end = length, feature = "whole protein",
name = ordered(name, levels = rev(id.to.plot))) %>%
select(name, feature, start, end, group)
df.repeats <- repeats %>%
mutate(feature = paste("TR", type, "-")) %>%
select(name = identifier, feature, start, end) %>%
filter(name %in% id.to.plot) %>%
left_join(select(df.whole, name, group), by = "name")
df.pfam <- hmmscan_pfam %>%
filter(!outcompeted, # remove lower-scored overlapping domains
!hmm_name %in% c("Hyr1", "Candida_ALS")) %>% # remove two repeat domains , "Flocculin_t3"
mutate(name = str_sub(seq_id, 2), # remove > from seq.id
feature = paste0(hmm_name, " (", gsub("\\.[0-9]", "", hmm_acc), ")")) %>% # feature name
filter(name %in% id.to.plot) %>%
left_join(select(df.whole, name, group), by = "name") %>%
select(name, feature, start = envelope_start, end = envelope_end, group)
df.signalp <- signalp5 %>%
filter(id %in% id.to.plot) %>%
mutate(feature = "SignalP") %>%
select(name = id, start, end) %>%
left_join(select(df.whole, name, group), by = "name")
df.gpi <- pred.gpi %>%
filter(name %in% id.to.plot, is.gpi) %>%
mutate(feature = "GPI-anchor") %>%
left_join(select(df.whole, name, group, end), by = "name") %>%
select(name, feature, start = cleavePos, end, group)
df.tango <- tango.res.df1 %>%
filter(id %in% id.to.plot) %>%
mutate(feature = "tango") %>%
select(name = id, feature, start, end, median, seq) %>%
left_join(select(df.whole, name, group), by = "name")
Build the plot
# protein backbone
p0 <- ggplot(df.whole, aes(x = name, y = start, xend = name, yend = end)) +
geom_hline(yintercept = 500, color = "grey80", size = 0.2) + # add a visual divider to distinguish between long and short
geom_segment(color = "grey80", size = 2)
# tandem repeats
p1 <- geom_segment(data = df.repeats, aes(size = "Tandem Repeats"), color = "orange4", alpha = 0.5)
# pfam
require(RColorBrewer)
feature.col <- c(brewer.pal(12, "Paired")[c(9,2,4,3,5,7,11,8)], brewer.pal(8, "Dark2")[c(2,8,3,4,6)],"brown3", "navyblue", "skyblue")
#(length(unique(df.pfam$feature))) # define the colors
p2 <- geom_segment(data = df.pfam, aes(color = feature), size = 2)
# signalp and gpi
p3 <- geom_segment(data = df.signalp, aes(size = "SignalP"), color = "#e31a1c")
p4 <- geom_segment(data = df.gpi, aes(size = "GPI-anchor"), color = "#6a3d9a")
# tango
p5 <- geom_segment(data = df.tango, aes(y = start, yend = end + 1, alpha = median),
position = position_nudge(x = -0.42, y = 0), color = "darkred", size = .8)
# plot
p0 + p1 + p2 + p3 + p4 + p5 + coord_flip() +
#facet_wrap(~group, scales = "free_y") +
scale_color_manual(values = c(feature.col)) +
scale_size_manual(
"Other features", values=rep(2,3),
guide=guide_legend(override.aes = list(colour=c("#6a3d9a", "#e31a1c", alpha("orange4",.5))))
) +
guides(alpha = "none") +
scale_y_continuous(breaks = seq(0,2500,500), expand = expansion(mult = 0.02)) +
theme_classic(base_size = 12) + background_grid(major = "x") +
theme(axis.text.y = element_text(size = 5), axis.title.y = element_blank(),
axis.line.y = element_blank(),# axis.ticks.y = element_blank(),
#axis.line.x = element_blank(),# axis.ticks.x = element_blank(),
legend.position = c(0.65, 0.45), legend.box = "horizontal",
panel.background = element_rect(fill = alpha("lightblue",0.1))) +
labs(y = "Position in sequence", x = "Sequences", color = "Pfam domains")

ggsave("output/figure/20210720-cwp-candidates-domain-schematic.png", width = 7.2, height = 5.4)
---
title: "_C. auris_ adhesin prediction based on Hil family properties"
author: "Rachel Smoak, modified by Bin He"
date: "1/21/2021 Updated `r format(Sys.time(), '%d %B, %Y')`"
output: 
  html_notebook:
    theme: default
    toc: true
    toc_float: true
    code_folding: hide
    fig_asp: 1
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

library(RSQLite)
library(seqinr)
library(tidyverse)
library(cowplot)
library(stringr)
#working_directory = "C:/Users/Rachel/OneDrive - University of Iowa/Fall 2019/Bioinformatics/Project work/Paper/Adhesin_Analysis"
```

# Prepare data

## Get proteomes from ftp

*C. auris* proteome files were retrieved from NCBI ftp sites and stored in the `data/proteome-fasta` folder. The concatenated fasta file is stored in the same folder.

```{r get_proteomes, include=FALSE, eval=FALSE}
Caurisfasta.table <- NULL

# Get data. seqinr package is new and seems powerful but has odd data forms so I have to extract then remake data frames
file.list <- list.files(path = "data/proteome-fasta/", pattern = "*.faa.gz")
strain.list <- str_extract(file.list, "B[0-9]+")
file.list <- file.path("data/proteome-fasta", file.list)
names(file.list) <- strain.list


# working_directory = "./sandbox"
extract.fasta.fn <- function(file, working_directory = "sandbox"){
  #proteome_file <- paste(working_directory, "/","proteome_file.txt.gz", sep = "")
  #download.file(url = raw_url, destfile = proteome_file, quiet = TRUE)
  fasta <- read.fasta(file, seqtype = "AA", as.string = TRUE)
  t1 <- tibble(id = getName(fasta), annotation = unlist(getAnnot(fasta)), seq = unlist(getSequence(fasta, as.string = TRUE)))
  return(t1)
}

Caurisfasta.table <- map_dfr(file.list, ~extract.fasta.fn(.), .id = "strain") %>% 
  mutate(length = nchar(seq))
write_tsv(select(Caurisfasta.table, strain, name = id, length), file = "data/Cauris-four-proteomes-seqLen.tsv")
```

## Write a fasta file

I wrote a combined fasta file using the command fwrite(x = Caurisfasta.table[, c(1,3)], file = "Cauris.txt", sep = "\t", col.names = FALSE)

Using Notepad++ I changed /n to /n\> then /t to /n to put in conventional fasta format. Saved as Caurisfasta.txt. There are 27278 proteins.

Note: I wrote a new fasta file to remove B11243 from the file. There are now 21772 proteins.

> **Update 2021-07-04 HB** skipped this step as the concatenated file can be created by `cat *.faa.gz > concatenated_cand_auris.faa.gz` if needed

```{r initiate_seqInfo}
# we want to include the information about whether a protein is in our homologs list
# for that we will important the list of protein ids that were analyzed
Hil <- read_tsv("data/Hil-family-Cauris-homologs-id.txt", col_names = c("name", "strain"), col_types = "cc")

# initiate the seqInfo table
seqInfo <- read_tsv("data/Cauris-four-proteomes-seqLen.tsv", col_types = cols()) %>% 
  mutate(isHIL = name %in% Hil$name)

# number of Hil homologs by strain
seqInfo %>% 
  group_by(strain) %>% 
  summarize(Hil = sum(isHIL))
```

A total of `r sum(seqInfo$isHIL)` proteins in the four strains belong to the Hil family. B11221 misses Hil4 and B11245 misses Hil8. In both cases the reason is because of incomplete genome assembly rather than gene loss. B11220, the representative of clade II, indeed misses 5/8 homologs.

## SignalP for signal peptides

Use SignalP 5.0 server to predict signal peptides. Maximum number of proteins is 5000, so I split the fasta file into 6 smaller files and submitted, providing my email each time. I used the old file with B11243 to run this. (HB: seems like the current result file contains B11245 instead). I resplit the new fasta into 5 files and reran. Downloaded .gff files as SignalP\_(file \#).gff3 and used command `grep -v -h "##" SignalP* | sort > SignalP.txt`

**Note [HB]**: SignalP output in GFF3 format only includes the input proteins that are predicted to have a signal peptide. There is a total of 1183 entried in the result file, from a total of >20,000 proteins submitted.

Modified file names, but otherwise Bin's code:

```{r signalP, fig.width=5, fig.height=5}
# Signal peptide
gff.names <- c("id", "source", "name", "start", "end", "prob", "na1", "na2", "na3")
signalp5 <- read_tsv("output/SignalP.txt", comment = "#", col_names = gff.names, col_types = "ccciidccc")

# remove the column if it already exists
if("signalp" %in% names(seqInfo))
  seqInfo <- select(seqInfo, -signalp)
seqInfo <- left_join(seqInfo, select(signalp5, name = id, prob), by = c("name" = "name")) %>% 
  mutate(signalp = !is.na(prob)) %>% select(-prob)
```

> **New problem** B11243 and B11245 are identical. Because B11245 is assembled to the chromosome level, I will retain it and remove B11243 from the results.

## PredGPI {#predgpi}

Following Bin's method, I used PredGPI to predict GPI anchors. For PredGPI, the maximum number of sequences is 500. To split the file I used the command split Caurisfasta.txt -l 1000 in my Ubuntu instance, which made 55 new files. The output includes name \| FPrate \| most likely position and the sequence for each. GPI anchor is "highly probable" when FPrate \< 0.001, "probable" when 0.001 \< FPrate \< 0.005, "lowly probable" when 0.005 \< FPrate \< 0.01. Retain only header lines with the command grep -h "\>" GPIPE\* \> PredGPIResults.txt.

This was done using the file that contained B11243. To remove duplicate results, I used the command grep "\>" PredGPIResults.txt \| sort -u \> PredGPIResults2.txt. Then I renamed PredGPIResults2 PredGPIResults. It has 21772 proteins.

**Update 2021-07-27 [HB]**

It turns out that GPI-anchored proteins can be divided into two categories: those that are retained in the plasma membrane and those that are subsequently released and then covalently linked to the cell wall components (β-1,6-glucan). The latter, named cell wall proteins (CWP) account for the majority of the GPI-anchored proteins while the former, named plasma membrane proteins (PMP) are the minority. Based on Caro _et al._ 1997 (PMID: 9434352), the PMPs usually have a dibasic motif prior to (within 12 residues from) the cleavage site, or ω-site. Frieman and Cormack 2004 (PMID: 15470092) further showed that an elevated Ser/Thr frequency in the body of the protein can override the dibasic motif signal. In particular, they empirically showed that a group of known PMPs with dibasic motifs all have Ser/Thr frequencies lower than 30%, while the CWPs don't have the dibasic motifs and 28/41 (68%) have a Ser/Thr frequency > 30%.

To remove the PMPs from consideration (from all I can glean from the literature, adhesins are CWPs and not PMPs), I determined the presence of a dibasic motif ("KK/RR/KR/RK") in the 12 residues immediately prior to the ω-site, using a custom Python script (realized later that I could have used the `pred.gpi` tibble and `seqinr` package to do the same thing right here). Below I'll read that in and merge it into the `pred.gpi` table.
```{r gpi}
tmp <- read_delim("output/PredGPIResults.txt", delim = "|", col_names = c("name","fp","omega"), col_types = cols())
tmp1 <- read_tsv("output/PredGPI_dibasic.tsv", col_types = cols())
pred.gpi <- tmp %>% 
  mutate(name = str_sub(name,2,-2), # remove > and the trailing space
         fp = as.numeric(str_sub(fp, 9, -2)), # extract the numeric part
         is.gpi = fp <= 0.01,    # based on the cutoff of the PredGPI server (prob < 99% -> not GPI-anchored)
         omega = str_sub(omega, 8),
         cleaveRes = str_sub(omega, 1, 1),
         cleavePos = as.integer(str_sub(omega, 3))
         ) %>% 
  left_join(tmp1, by = "name")

# remove the column if it already exists
if("pred.gpi" %in% names(seqInfo))
  seqInfo <- select(seqInfo, -pred.gpi, -dibasic)
seqInfo <- left_join(seqInfo, select(pred.gpi, name, pred.gpi = is.gpi, dibasic), by = c("name"="name")) %>% 
  relocate(pred.gpi, dibasic, .after = signalp)
```

## hmmscan for pfam domains

Using hmmscan to search the pfam database,the maximum allowed number of sequences is 500. I uploaded the same 55 files from PredGPI to hmmscan, providing my email address each time. HmmerWeb version 2.41.1. I resplit the new fasta into 44 files and reran. Copied and pasted results email into hmmer_results_emails.txt - checked to make sure that there were 44 and that each job id matched. Used the command grep "\>" hmmer_results_emails.txt \| sort \> hmmer_results.txt. In Notepad++, copied and pasted header row from emails file.

This chunk will add a column that lists all non-overlapping domains found in each protein. In cases of overlap, the highest-scoring domain wins (using the outcompeted column in the data).

**Update 2021-07-04 HB** See [HMMSCAN help page](https://hmmer-web-docs.readthedocs.io/en/latest/result.html) for clarification on the 'out-competed" column

> In Pfam related entries are grouped into Clans, and as such can often match the same, or similar, regions on the query sequence. An additional column in the results table contains the clan accession for the family, if it belongs to a clan. Pfam employs a specific post processing on families from the same clan where the best match (determined by lowest E-value), is taken and the rest are out-competed.

```{r hmmscan}
hmm.names <- c("seq_id", "alignment_start", "alignment_end", "envelope_start", "envelope_end", "hmm_acc", "hmm_name", "hmm_start", "hmm_end", "hmm_length", "bit_score", "individual_e_value", "conditional_e_value", "database_significant", "outcompeted", "clan")
hmmscan_pfam <- read_table2(file = "output/hmmer_results.txt", col_names = hmm.names, col_types = "ciiiicciiidddiic", skip = 1)

df <- hmmscan_pfam %>% 
  filter(!outcompeted,  # remove lower-scored overlapping domains
         !hmm_name %in% c("Hyr1", "Candida_ALS")) %>% # remove two repeat domains
  mutate(seq_id = str_sub(seq_id, 2)) %>% # remove >
  group_by(seq_id) %>% 
  # collapse into list of pfam domains for each protein
  summarise(pfam_domains = paste0(hmm_name, " (", envelope_start, "-", envelope_end,  ")", collapse = ", "))

# remove the column if it already exists
if("pfam_domains" %in% names(seqInfo))
  seqInfo <- select(seqInfo, -pfam_domains)
seqInfo <- left_join(seqInfo, df, by = c("name"="seq_id"))
```

## TANGO

Created a .bat file to run tango on the condensed fasta. Used the command `fwrite(x = Caurisfasta.table[, c(1,3)], file = "Cauris.txt", sep = "\t", col.names = FALSE)`. Command file is tango_in.bat. Ran it and compressed all results using the command gzip \*.txt.

When I went to analyze I found that multiple files contained trash (columns with value -1.IO and values that didn't match if the single line from the .bat file was rerun). Upon looking in the TANGO documentation, it says that files can't be larger than 1000 sequences. Splitting it still induced errors, so I had Bin rerun all of the TANGO calculations and upload his error-free results to the github.

> Note: Downloading and processing all of the TANGO output files takes a VERY long time. If you don't need to regenerate the results, just import the file `tango_results_df.txt` as `tango.res.df1`. Everything after extracting the results will work off of that.

Using Bin's tango extraction functions

```{r extract_tango_info, eval=FALSE}
require(dplyr)
extract_tango <- function(tango_output, agg_threshold = 5, required_in_serial = 5) {
    require(tidyverse)
    tmp <- read_tsv(file = tango_output, col_types = "icddddd") %>% 
        # a boolean vector for residues above threshold
        mutate(pass = Aggregation > agg_threshold)
    pass.rle <- rle(tmp$pass) # this creates a run length encoding that will be useful for identifying the sub-sequences in a run longer than certain length
    # --- Explanation ---
    # this rle object is at the core of this function
    # an example of the rle looks like
    #   lengths: int[1:10] 5 19 20 8 1 5 19 6 181 18
    #   values: logi[1:10] F T  F  T F T F  T F   T
    #   note that by definition the values will always be T/F interdigited
    # our goal is to identify the sub-sequences that is defined as a stretch of 
    # n consecutive positions with a score greater than the cutoff and record the
    # sub-sequence, its length, start and end position, 90% quantile of the score
    # --- End of explanation ---
    # 1. assigns a unique id for each run of events
    tmp$group <- rep(1:length(pass.rle$lengths), times = pass.rle$lengths)
    # # 2. extract the subsequences
    agg.seq <- tmp %>%
        dplyr::filter(pass) %>% # drop residues not predicted to have aggregation potential
        group_by(group) %>% # cluster by the runs
        summarise(seq = paste0(aa, collapse = ""),
                  start = min(res), end = max(res), length = n(),
                  median = median(Aggregation),
                  q90 = quantile(Aggregation, probs = 0.9),
                  ivt = sum(aa %in% c("I","V","T")) / length(aa),
                  .groups = "drop") %>%
        mutate(interval = start - dplyr::lag(end) - 1) %>%
        dplyr::filter(length >= required_in_serial) %>%
        select(-group)
    return(agg.seq)
}

```

Apply to *C. auris* proteomes

```{r tango_extract, eval=FALSE, warning=FALSE}
tango.output.files <- list.files(path = "output/tango-output", pattern = ".txt|.txt.gz", full.names = TRUE)
# the read_csv() function used in the custom function can automatically decompress gzipped files
tango.res <- lapply(tango.output.files, extract_tango)
names(tango.res) <- gsub(".txt|.txt.gz", "", basename(tango.output.files))
# to add species information
# seqInfo <- read_tsv("raw-output/XP_028889033_homologs.tsv", comment = "#", col_types = c("ccci"))
tango.res.df <- bind_rows(tango.res, .id = "id")
# tango.res.df <- bind_rows(tango.res, .id = "id") %>% 
#   left_join(select(seqInfo, -length), by = c("id" = "id"))
# # save the tango output
# write_tsv(tango.res.df, "tango_summary_table.tsv.gz")
# # mutate(species = str_split(id, "_(?!.*_)", simplify = TRUE)[,2]) 
# # extract the species names
# # credit: https://stackoverflow.com/questions/20454768/how-to-split-a-string-from-right-to-left-like-pythons-rsplit
# # the split pattern is equivalent to the rsplit() function in python
```

Top 20 TANGO hits ranked by number of occurrences in all proteins, restricted to the strong hits with a median score \>= 30%.

```{r unique_motifs}
# tango.res.df1 <- tango.res.df %>% left_join(seqInfo %>% select(name, strain), by = c("id" = "name"))
tango.res.df1 <- read_tsv("output/tango_results_df.txt.gz", col_types = cols())

# find unique motifs and count the number of proteins and strains represented
motif.summary <- tango.res.df1 %>% 
  left_join(select(seqInfo, id = name, isHIL)) %>% 
  group_by(seq) %>% 
  summarize(n = n(), n.HIL = sum(isHIL),
            n.prot = n_distinct(id), 
            n.strains = n_distinct(strain),
            medScore = round(mean(median),1),
            IVT = round(mean(ivt),2),
            avg.intv = round(mean(interval, na.rm = T),1), 
            mad.intv = round(mad(interval, na.rm = T),1),
            .groups = "drop") %>% 
  arrange(desc(n))
motif.summary %>% 
  filter(round(medScore,0) >= 30) %>% 
  head(20)
```

***Discussion***

-   Somewhat to our surprise, even in the entire *C. auris* proteome, the "GVVIVTT" and its variants are still the dominant TANGO motif! Are they all in the Hil family homologs?

### Identify variants of "GVVIVTT"

Since this is the dominant motif in the *C. auris* proteome, we want to determine what percent of the TANGO sequences in the *C. auris* proteome belong to this class?

To design the regular expression, we looked at both the TANGO sequences that "look similar" to "GVVIVTT". In particular, Jan manually went through the TANGO hits in *C. lusitaneae, C. auris, C. haemulonis, C. pseudohaemulonis, C. duohaemulonis, S. stipitis, S. cerevisiae*, and *C. glabrata* and summarized the following pattern: `[VI][VIVM[VI][VI]T`

Below we will restrict the patterns to those with a median TANGO probability above 30.

```{r, results='asis'}
seqs <- motif.summary$seq
# core group, must be G-[VI]x4-TT
pat0 <- "G[VI]{4}TT"
# version 1, requires G-[ALVI][VI][LVI][VI]-(either end of string or at least 1 suitable residue) 
#                  OR (G not required)-[VI]x4-(two suitable residues)
# pat1 <- "G[ALVI][VI][LVI][VI]([LVIFAWYTM]{1,2}|$)|[VI]{4}[LVIFAWYTM]{2}"
# version 2, based on Jan's manual inspection of TANGO hits in the following species:
# C. lusitaneae, C. auris, C. haemulonis, C. pseudohaemulonis, C. duohaemulonis, S. stipitis, S. cerevisiae, and C. glabrata
pat.jan <- "[VI][VIVM[VI][VI]T"
cat(paste("Identify the group of TANGO sequences most similar to 'GVVIVTT' using the pattern", pat0, sep = " "))
match0 <- grep(pat0, seqs)
motif.summary[match0,] %>% filter(round(medScore,0) >= 30) %>% arrange(desc(n)) %>% select(seq, n, n.HIL, medScore, IVT)
cat(paste("Identify TANGO sequences with slightly more variation from 'GVVIVTT', using `", pat.jan, "`", sep = ""))
match.jan <- grep(pat.jan, seqs)
# exclude those already identified with pattern 0
motif.summary[setdiff(match.jan, match0),] %>% filter(round(medScore,0) >= 30) %>% arrange(desc(n)) %>% select(seq, n, n.HIL, medScore, IVT)

# export the list of motifs for manual inspection
motif.summary[union(match.jan, match0),] %>% 
  filter(round(medScore,0) >= 30) %>% 
  arrange(desc(n)) %>% 
  write_csv("output/TANGO-GVVIVTT-like-sequences.csv")
```

***Discussion***

-   We can see that the vast majority of GVVIVTT-like sequences are within the Hil family proteins

-   Which non-Hil proteins have GVVIVTT sequences?

    `r motif.per.seq %>% filter(grepl(pat0,seqs), !isHIL)`

    Note that XP_028891378.1 is a partial protein entry that I have determined to be the C-termianl fragment of Hil4 in B11221. Looking closely at the sequence and the chromosomal locations of QEL63124 and its neighboring QEL63125, I found that the latter, at ~500 a.a. is the N-terminal domain of Hil1 in B11245 while QEL63124 is the remaining part of that protein. So in conclusion, **ALL "GVVIVTT" like sequences, based on Bin's regex, are within the Hil family.

-   In the second table that lists GVVIVTT-like sequences that are captured by Jan's more inclusive regex and not my strict one, there is a notable outlier, namely GVVIVT, which is both so close to GVVIVTT and yet **only** present in a group of nonHil sequences, including PIS58185 in B8441 and others.

    `r tango.res.df1 %>% filter(seq == "GVVIVT") %>% select(id, strain) %>% left_join(seqInfo, by = c("id" = "name", "strain")) %>% unique()`

    I analyzed this group of sequences further -- see `README.md` in this folder.

-   This protein is very interesting: it no longer has the PF11765 domain but otherwise shows a high degree of sequence similarity with the latter.

    ![align](output/PIS58185/PIS58185-XP_028889033-alignment.png)

#### Number and spacing of TANGO hits in each protein sequence

We first summarize the motif sequences within each protein and then use that to determine the distribution of TANGO hit \# and spacing in the entire proteome, with or without the Hil family.

```{r}
# first count the number of tango hits with median score >= 30 hits for each protein
# motif.per.seq <- tango.res.df1 %>% 
#   group_by(id) %>% 
#   summarize(n.all = sum(round(median, 0) >= 30))

# next we will filter the tango dataset in order to recalculate the intervals
# this will result in 14 sequences to be dropped since they have 0 hits meeting
# the criteria. we will add them back by right_join() with the tibble above
motif.per.seq <- tango.res.df1 %>% 
  # limit to sequences with median score >= 30
  filter(round(median,0) >= 30) %>% 
  group_by(id) %>% 
  # recalculate the interval since we are now limiting the hits to >= 30
  mutate(interval = start - lag(end) - 1) %>% 
  # summarize the results at a sequence level
  summarize(n.all = n(),
            n.type = length(unique(seq)),
            medScore = round(mean(median),1),
            IVT = round(mean(ivt),2),
            avg.intv = round(mean(interval, na.rm = T),1), 
            IQR.intv = round(IQR(interval, na.rm = T)/1.349 ,1),
            # median absolute deviation is a robust measure of the scale parameter
            # https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/mad
            mad.intv = round(mad(interval, na.rm = T),1),
            seqs = paste(unique(seq), collapse = ","),
            .groups = "drop") %>% 
  right_join(select(seqInfo, strain, id = name, length, isHIL), by = "id") %>% 
  mutate(n.all = ifelse(is.na(n.all), 0, n.all), n.type = ifelse(is.na(n.type), 0, n.type)) %>% 
  arrange(desc(n.all), desc(mad.intv))
#motif.per.seq
```

Distribution of the number of TANGO hits per kilo amino acids protein across the proteome.

```{r}
ggplot(motif.per.seq, aes(x = strain, y = n.all/length*1000)) + 
  geom_boxplot(fill = "gray70", alpha = 0.7, outlier.alpha = 0.2) + ylab("# of TANGO hits per 1000 amino acids") + 
  geom_point(data = filter(motif.per.seq, isHIL), aes(color = length), position = position_jitter(0.05), size = 1.5) +
  scale_color_viridis_c() +
  theme_cowplot()
```

We can see that the 8 Hil orthologs fall into two groups. Hil1-4, which are longer than 1000 amino acids, have relatively high number of TANGO hits relative to their length. In fact, they are all above 89% percentile of the data.

Compare the distribution of TANGO hits \# and score in the Hil family vs the proteome

```{r fig.width=5, fig.height=4.5}
ggplot(motif.per.seq, aes(x = length, y = n.all/length*1000)) + 
  geom_bin_2d(binwidth = c(200, 2)) + scale_fill_viridis_c() + 
  geom_point(data = filter(motif.per.seq, isHIL), color = "red") +
  xlab("Protein length") + ylab("# of TANGO hits per 1000 amino acids") + 
  facet_wrap(~strain, nrow = 2) + theme_grey(base_size = 14)
```

***Discussion***

-   this is another way of plotting the same data, again showing that Hil1-4 are exceptional in the proteome in that they harbor a very high amount of TANGO hits per 1000 amino acid residues relative to proteins of similar length to them.

Next we add the TANGO information to the seqInfo dataset.

```{r}
if("n.tango" %in% names(seqInfo)){
  seqInfo <- seqInfo %>% 
    select(-n.tango, -reg.spaced)
}
seqInfo <- seqInfo %>% 
  left_join(select(motif.per.seq, name = id, n.tango = n.all, mad.intv), by = "name") %>% 
  mutate(mad.intv = ifelse(n.tango > 2, mad.intv, NA),
         reg.spaced = ifelse(n.tango > 3 & mad.intv < 5, TRUE, FALSE)) %>% 
  select(-mad.intv)
```

#### Summarize tango results for table

Three metrics for the by protein tango results can be number of pat0 (GVVIVTT) sequences, pat1 (GVVIVTT-like) sequences, and total number of aggregation sequences with score \> 30

**Update 2021-07-15 [HB]** The code below is no longer included because the above analysese have shown that GVVIVTT and its variants are highly specific to the Hil family, which appears to be an outlier in terms of TANGO hits per protein in the entire proteome. In predicting additional adhesins, we will only look at the total number of TANGO hits.

```r
# core group, must be G-[VI]x4-TT
pat0 <- "G[VI]{4}TT"
# version 1, requires G-[ALVI][VI][LVI][VI]-(either end of string or at least 1 suitable residue) 
#                  OR (G not required)-[VI]x4-(two suitable residues)
pat1 <- "G[ALVI][VI][LVI][VI]([LVIFAWYTM]{1,2}|$)|[VI]{4}[LVIFAWYTM]{2}"
pat2 <- "[VI][VIVM][VI][VI]T"
match0 <- grep(pat0, motif.per.seq$seq)
match1 <- grep(pat1, motif.per.seq$seq)
match2 <- grep(pat2, motif.per.seq$seq)
tango.summ <- seqInfo[,"name"] %>%
  # add pat0 results
  left_join(motif.per.seq[match0,] %>% group_by(id) %>% summarise(n.gvvivtt.seqs = sum(n)), by = c("name" = "id")) %>% 
  # add pat1 results
  left_join(motif.per.seq[setdiff(match1, match0),] %>% group_by(id) %>% summarise(n.gvvivtt.like.seqs = sum(n)), by = c("name" = "id")) %>% 
  # add pat2 results - pat2 from Fassler manual curation of TANGO positive sequences from FungalRV positive proteins in C. lusitaneae, C. auris, C. haemulonis, C. pseudohaemulonis, C. duohaemulonis, S. stipitis, S. cerevisiae, and C. glabrata into GVVIVTT-like and non GVVIVTT groups
  left_join(motif.per.seq[match2,] %>% group_by(id) %>% summarise(n.vvvvt.seqs = sum(n)), by = c("name" = "id")) %>% 
  # add motif.per.seq results
  left_join(motif.per.seq %>% group_by(id) %>% summarise(n.high.score.agg.seq = sum(n)), by = c("name" = "id"))
# join to seqInfo table
# remove the column if it already exists
if("n.gvvivtt.seqs" %in% names(seqInfo))
  seqInfo <- select(seqInfo, -n.gvvivtt.seqs, -n.gvvivtt.like.seqs, -n.vvvvt.seqs, -n.high.score.agg.seq)
seqInfo <- left_join(seqInfo, tango.summ)
seqInfo <- seqInfo %>% mutate(n.gvvivtt.seqs = replace_na(n.gvvivtt.seqs,0), n.gvvivtt.like.seqs = replace_na(n.gvvivtt.like.seqs,0), n.vvvvt.seqs = replace_na(n.vvvvt.seqs,0), n.high.score.agg.seq = replace_na(n.high.score.agg.seq,0))
```

## S/T frequency

Using Bin's method, I'm running EMBOSS freak for S/T, S, T with a stepping value of 10 and an averaging window of 100. The results were saved as freak_aminoacid(s)\_100_10.txt. Using Bin's format_freaq_output.py script, each was converted to a readable .out file and all files were compressed for storage.

**Update 2021-07-16 [HB]** we will use two types of statistics to summarize the serine/threonine frequencies here. 1) frequency of the residues over the entire protein; 2) maximum frequency of serine or threnonine in a 100 amino acid sliding window (which will ignore any proteins \< 100 residues). in the second measure, we reason that a 100 amino acid window is a reasonable size to capture a genuine "hike" in Serine or Threonine frequency. For one thing, it is large enough to avoid capturing just a small cluster of S/T, and for another thing the Serine rich cluster we observe in Hil1 is \~200 amino acids long and with a consistently elevated Serine content, suggesting that a 100 amino acid window is not too large to lead to diluted signals by flanking residues.

```{r S_T_freq}
# load data
ST.protein <- read_tsv("output/ST-freq/Cauris-ST-freq-per-protein.txt.gz", col_types = cols())
ST.freq <- read_tsv("output/ST-freq/freak_st_100_10_freak.out.gz", col_types = "cid")
S.freq <- read_tsv("output/ST-freq/freak_s_100_10_freak.out.gz", col_types = "cid")
T.freq <- read_tsv("output/ST-freq/freak_t_100_10_freak.out.gz", col_types = "cid")
ST.window <- bind_rows("ST" = ST.freq, "S" = S.freq, "T" = T.freq, .id = "residue") %>% 
  group_by(id, residue) %>% 
  summarize(max = max(freq))
rm(list = c("ST.freq", "S.freq", "T.freq"))
```

```{r fig.cap="Ser/Thr"}
ST.window %>% 
  left_join(select(seqInfo, id = name, isHIL), by = "id") %>% 
  mutate(residue = factor(residue, levels = c("ST", "S", "T"), labels = c("Ser/Thr", "Ser", "Thr")),
         `Hil family` = ifelse(isHIL, "Yes", "No")) %>% 
  ggplot(aes(x = max)) + geom_histogram(binwidth = 0.02) + 
  xlab("Max frequency of Ser/Thr in 100 amino acid window") + scale_y_continuous(position = "right") +
  facet_grid(`Hil family` ~ residue, scales = "free_y", labeller = "label_both", switch = "y") +
  theme_cowplot() + panel_border()
```

***Discussion***

From the above we can see that the not all Hil family proteins have well-above-average Serine or Threonine content alone. But when both residues are counted together, they consistently rank above the majority of the proteome.

```{r}
ST.protein %>% 
  left_join(select(seqInfo, ID = name, isHIL), by = "ID") %>% 
  mutate(`Ser/Thr` = (Ser+Thr)/length, Ser = Ser/length, Thr = Thr/length) %>% 
  pivot_longer(cols = c(`Ser/Thr`, Ser, Thr), names_to = "residue", values_to = "value") %>% 
  mutate(residue = factor(residue, levels = c("Ser/Thr", "Ser", "Thr")),
         `Hil family` = ifelse(isHIL, "Yes", "No")) %>% 
  ggplot(aes(x = value)) + geom_histogram(binwidth = 0.02) + 
  xlab("Frequency of Ser/Thr in the whole protein") + scale_y_continuous(position = "right") +
  facet_grid(`Hil family` ~ residue, scales = "free_y", labeller = "label_both", switch = "y") +
  theme_cowplot() + panel_border()
```

***Discussion***

Similar as above, we see that the combined Ser/Thr frequency appear to better distinguish the Hil family from the rest of the proteins. For this reason let's add the max S/T frequency in 100 amino acid windows and the overall S/T frequency per protein to `seqInfo`

```{r}
#remove columns if present
if("ST.prot" %in% names(seqInfo))
  seqInfo <- select(seqInfo, -ST.prot, -ST.window.max)

#join maximums back into seqInfo
tmp <- ST.protein %>% 
  mutate(ST.prot = round((Ser+Thr)/length, 3)) %>% 
  select(name = ID, ST.prot) %>% 
  left_join(ST.window %>% filter(residue == "ST") %>% select(name = id, ST.window.max = max), by = "name") %>% 
  # for proteins shorter than 100 a.a., the sliding window estimate would be NA. replace it with whole protein estimate in ST.protein
  mutate(ST.window.max = round(coalesce(ST.window.max, ST.prot),3))

seqInfo <- seqInfo %>% left_join(tmp, by = "name")
# rm(list = c("ST.protein", "ST.window"))
```

Write the `seqInfo` to a file to be included as supplementary data
```{r}
seqInfo %>% 
  select(-pfam_domains, pfam_domains) %>% 
  write_tsv(file = "output/Rmd-out/seqInfo_table_all.tsv")
```

# Prediction

Here we try to use the sequence properties collected above to identify additional proteins outside of the Hil family that may function as adhesins. The idea here is to apply filters in a step-wise manner and watch if all Hil proteins are still retained and how many non-Hil proteins remain in the dataset.

## Cell wall proteins

How many proteins are likely to be GPI-anchored cell wall proteins?
```{r}
# Load function
# reference: http://rstudio-pubs-static.s3.amazonaws.com/6975_c4943349b6174f448104a5513fed59a9.html
source("http://pcwww.liv.ac.uk/~william/R/crosstab.r")

crosstab(seqInfo, row.vars = "signalp", col.vars = "pred.gpi", type = c("f", "t"), dec.places = 1, addmargins = FALSE)
```

We see that only a very small fraction (1.5%) of the proteome pass both the SignalP and PredGPI tests. By comparison, all eight *C. auris* Hil proteins pass both tests, and the vast majority of Hil homologs pass both tests as well. So far this is the most stringent (aka effective) filter to reduce the number of candidate proteins.

**Update 2021-07-27 [HB]** Following the discussion on GPI-CWP vs GPI-PMP (see notes in the [PredGPI](#predgpi) section above), we will additionally test whether proteins with Signal Peptide and GPI-anchor also have low (<30%) Ser/Thr frequency and the presence of a dibasic motif

```{r}
filter1 <- seqInfo %>% 
  mutate(`Hil_family` = ifelse(isHIL, "Yes", "No"), 
         `GPI` = ifelse(signalp & pred.gpi, ifelse(ST.prot < 0.3 & !is.na(dibasic), "PMP", "CWP"), 'No'),
         `GPI` = factor(GPI, levels = c("No", "CWP", "PMP"))) %>% 
  select(-signalp, -pred.gpi, -isHIL)

crosstab(filter1, row.vars = "GPI", col.vars = "Hil_family", addmargins = FALSE)
```

The two proteins that dropped out `r filter1 %>% filter(GPI == "No", Hil_family == "Yes")` are both in the clade IV strain B11245 and appear to be missing the C-terminal sequence, hence failed the PredGPI test.

And the PMPs are shown below `r filter1 %>% filter(GPI == "PMP")`

## Protein length
Below we show the distribution of protein length in the proteome.
```{r}
p <- ggplot(data = seqInfo) + 
  geom_histogram(mapping = aes(x=length), binwidth = 50)+
  geom_vline(aes(xintercept = 500), color="blue", size=1) +
  annotate("text", x = 3000, y = 1500, label = paste0("# of proteins >= 500 residues: ", sum(seqInfo$length >= 500)), size = 5) +
  xlab("Protein length (amino acids)") + ylab("# of proteins") + theme_cowplot()
print(p)
```

We classify the CWP hits into two groups based on their length. Those with more than 500 a.a. are labeled as "CWP long" and those smaller than 500 a.a. as "CWP short".

```{r}
filter2 <- filter1 %>% 
  mutate(Len = ifelse(length >= 500, ">=500", "<500"),
         # group1 is the old definition
         # group1 = ifelse(Hil_family == "Yes",             # group 1: Hil family
         #                "Hil", ifelse(GPI != "No",        # group 2&3: cell wall long / short
         #                              ifelse(length >=500, "CWP long", "CWP short"), 
         #                              "Others")),         # group 4: Others
         # group1 = factor(group1, levels = c("Hil", "CWP long", "CWP short", "Others")),
         # group is the new definition
         GPI = as.character(GPI), # necessary for the case_when() statements
         group = case_when(
           Hil_family == "Yes" ~ "Hil",
           GPI == "CWP" & length >= 500 ~ "CWP long",
           GPI == "CWP" & length < 500  ~ "CWP short",
           GPI == "PMP" ~ "PMP",
           GPI == "No" ~ "Others",
         ),
         group = factor(group, levels = c("Hil", "CWP long", "CWP short", "PMP", "Others")))

crosstab(filter2, row.vars = "group")
```

## Ser/Thr frequency

```{r}
seqInfo %>% 
  pivot_longer(cols = starts_with("ST."), names_to = "type", names_prefix = "ST.", values_to = "value") %>% 
  mutate(type = factor(type, levels = c("prot", "window.max"), labels = c("whole protein", "max in 100 a.a. windows"))) %>% 
  ggplot(aes(x = value, color = isHIL)) + stat_ecdf() + facet_wrap(~type) + 
  scale_x_continuous(limits = c(0,0.8)) + scale_color_manual("Hil family", values = c("black", "gold3")) +
  xlab("Ser+Thr frequency") + ylab("Cumulative probability") + 
  theme_cowplot() + background_grid()
```

```{r}
p <- filter2 %>% 
  pivot_longer(cols = starts_with("ST."), names_to = "type", names_prefix = "ST.", values_to = "value") %>% 
  mutate(type = factor(type, levels = c("prot", "window.max"), labels = c("whole protein", "max in 100 a.a. window"))) %>% 
  ggplot(aes(x = type, y = value, fill = group)) + geom_boxplot(outlier.size = 0.5, outlier.shape = 1) +
  scale_fill_viridis_d() + xlab("") + ylab("Ser/Thr frequency") + theme_cowplot()
p
ggsave(file = "output/figure/20210721-ST-boxplot-compare-cwp-to-rest.png", plot = p + theme(legend.position = "bottom"), width = 5, height = 4.5)
```

Seems like the two measures behave very similarly. Let's see how they compare by applying the above cutoffs.

```{r}
ST.th = c("ST.protein" = 0.15, "ST.window" = 0.25)
cat("Thresholds for whole protein or sliding window frequencies of Ser/Thr")
print(ST.th)
filter2 %>% 
  mutate(`S/T protein` = ST.prot > ST.th["ST.protein"],
         `S/T max window` = ST.window.max > ST.th["ST.window"]) %>% 
  crosstab(row.vars = c("S/T protein", "S/T max window"), col.vars = "group", addmargins = FALSE)
```

Based on the above, we will apply a third filter of S/T frequency in the whole protein \> 15%, which is roughly the lower cutoff for the group of CWPs in the Caro _et al._ 1997 paper.

```{r}
# ref: https://stackoverflow.com/questions/34096162/dplyr-mutate-replace-several-columns-on-a-subset-of-rows
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

filter3 <- filter2 %>%
  mutate(`S/T protein` = ST.prot > 0.15) %>% 
  mutate_cond(str_sub(group,1,3) == "CWP" & !`S/T protein`, group = "Others") %>% 
  select(-Len, -`S/T protein`)

cat("Genes removed due to the S/T cutoff")
filter2 %>% filter(group %in% c("CWP long", "CWP short"), ST.prot <= 0.15)
```
We can see that all the proteins that fall out are within the "CWP short" category, which are less likely to function as adhesins to begin with. Thus this cutoff could help reduce the number of proteins we have to manually look at without losing many potential hits.

After this step we have a total of `r with(filter3, sum(str_sub(group, 1, 3) == "CWP"))` proteins left.

## TANGO hits

```{r}
p <- filter3 %>% 
  ggplot(aes(x = n.tango, fill = group)) + geom_histogram(binwidth = 1) +
  facet_wrap(~group, nrow = 5, scales = "free_y") +
  scale_y_continuous(n.breaks = 3) +
  scale_fill_viridis_d() +
  xlab("# of strong TANGO hits") + theme_cowplot() + panel_border()
p
ggsave("output/figure/20210721-tango-hits-distribution-compare.png", plot = p, width = 6, height = 4.5)
```

As we already showed before, the number and spacing of strong TANGO hits is a divisive feature that separate the Hil family proteins in *C. auris* into two groups -- those that are longer than 1000 a.a. have a lot of them and have them regularly spaced, while the four shorter members have few. We see a similar picture when we look at the candidate adhesin proteins in our analysis up to this point: the cell-wall protein long group showed above average number of TANGO hits while the short ones have very few. The PMPs, despite most of them being shorter than 500 amino acids, actually have relatively high number of TANGO hits.

## Tandem repeats
We didn't want to perform the tandem repeat analysis on >20,000 proteins. Now that we have narrowed the list down to a much more manageable one, the code below will perform XSTREAM analysis on the group of candidate proteins identified above.

1. Export the protein IDs

```{r}
cat(filter3 %>% filter(str_sub(group, 1, 3) == "CWP") %>% pull(name), file = "data/20210718-cwp-proteins-id.txt", sep = "\n")
```

2. Extract the fasta, run the following code at the command line

```{bash}
/Users/bhe2/sw/miniconda3/bin/python script/extract_fasta_gz.py data/proteome-fasta/Caurisfasta.faa.gz data/20210718-cwp-proteins-id.txt data/20210718-cwp-proteins.faa
```

3. Execute the xstream program
```{bash}
cd script
sh xstream.sh ../data/20210718-cwp-proteins.faa cwp-proteins ../output/tandem-repeats
cd ..
```

4. Import the xstream results
```{r}
repeats <- read_tsv("output/tandem-repeats/XSTREAM_cwp-proteins_i0.7_g3_m5_L15_chart.tsv", col_types = cols())
```

## Cluster by similarity

Now that we settled on the "CWP long" and "CWP short" lists, we would like to visualize their domain architecture. Since we started with four proteomes, with the goal of capturing a "species-wide" set of candidates, we have presumably four alleles of each protein (may be fewer) in the list. To make it easier to examine the domain features, let's cluster the proteins and assemble a representative list.

1.  Submit the fasta to [CD-hit](https://github.com/weizhongli/cdhit) [webserver](http://weizhong-lab.ucsd.edu/cdhit-web-server/cgi-bin/).

2.  Reformat the CD-hit output

``` bash
/Users/bhe2/sw/miniconda3/bin/python script/format_cd-hit_output.py output/cd-hit/20210727-CD-hit-0.85/1627438838.fas.1.clstr.sorted output/cd-hit/formatted-cd-hit-0.85.tsv
/Users/bhe2/sw/miniconda3/bin/python script/format_cd-hit_output.py output/cd-hit/20210727-CD-hit-0.6/1627439988.fas.1.clstr.sorted output/cd-hit/formatted-cd-hit-0.6.tsv
```

```{r}
cdhit.lo <- read_tsv("output/cd-hit/formatted-cd-hit-0.6.tsv", col_types = "ccc") %>% 
  group_by(cluster) %>% 
  mutate(leader = id[which(identity == "*")], # assign the representative protein ID to new variable "leader"
         identity = ifelse(identity == "*", 100, as.numeric(str_sub(identity, 1, -2)))) %>%  # extract percent identity
  ungroup()

cdhit.hi <- read_tsv("output/cd-hit/formatted-cd-hit-0.85.tsv", col_types = "ccc") %>% 
  group_by(cluster) %>% 
  mutate(leader = id[which(identity == "*")], # assign the representative protein ID to new variable "leader"
         identity = ifelse(identity == "*", 100, as.numeric(str_sub(identity, 1, -2)))) %>%  # extract percent identity
  ungroup()

cdhit <- full_join(cdhit.hi, cdhit.lo, by = "id", suffix = c(".hi", ".lo")) %>% 
  mutate(agree = leader.hi == leader.lo) %>% 
  select(id, everything()) %>% arrange(cluster.lo)
rm(list = c("cdhit.lo", "cdhit.hi"))
```

Disagreement between the two clustering results:

```{r}
cdhit %>% group_by(cluster.lo) %>% filter(length(unique(cluster.hi)) > 1, leader.hi == id) %>% ungroup()
```

To determine if the extra clusters identified with the 85% threshold is real, I used blastp (ncbi server) to align each of the additional leader sequence to the shared leader sequence identified with the 60% threshold. When necessary, I also checked the protein's chromosomal location to determine if they indeed come from different loci.

| leader.lo      | leader.hi  | %qcov | %ident | len.lo | len.hi | chromosome.lo                | chromosome.hi                    | conclusion       | comments                       |
|:---------------|:-----------|:------|:-------|:-------|:-------|:-----------------------------|:---------------------------------|:-----------------|:-------------------------------|
| PIS51179.1     | PIS54714.1 | 83%   | 70%    | 91     | 82     | chr1                         | chr7                             | different        | both in B8441                  |
| PIS48171.1     | PIS48170.1 | 100%  | 67%    | 176    | 166    | chr6:40562-41092             | chr6:39307-39807                 | different        | tandem duplicates              |
| PIS48171.1     | PIS48172.1 | 100%  | 66%    | 176    | 173    | chr6:40562-41092             | chr6:41619-42140                 | different        | tandem duplicates              |
| PIS54802.1     | QEL63474.1 | 100%  | 93%    | 861    | 834    | chr7:767294-769879           | chr7:760044-760968,761039-762618 | orthologs        | B8441 and B11245               |
| PIS56912.1     | QEO20379.1 | NA    | NA     | 765    | 747    | chr3,scaffold5:864995-867292 | chr1:2549461-2551704             | likely orthologs | B8441 and B11220               |
| XP_028890844.1 | QEO21242.1 | 100%  | 85%    | 790    | 755    | chr1:3100225-3102597         | chr2:1361822-1364089             | likely orthologs | B11221 and B11220 <sup>1</sup> |
| QEL58520.1     | PIS51941.1 | 97%   | 88%    | 457    | 440    | chr1                         | chr1,scaffold1                   | orthologs        | B11245 and B8441               |
| PIS58185.1     | QEO22553.1 | 100%  | 94%    | 2608   | 970    | chr2                         | chr3                             | orthologs        | B8441 and B11220 <sup>1</sup>  |

: **Table.** Determine the orthology of the cluster leaders using the two thresholds

<sup>1</sup> According to Muñoz _et al._ 2021, there is a large scale translocation between chromosome 1 and 2 in the clade II strains and the chromosome 1 location for XP_028890844 is within the translocated region and matches the coordinates of QEO21242 in the translocated region in its chromosome 2.
![translocation between chr1 and 2 in B11220](output/cd-hit/img/20210728-translocation-affects-synteny-in-clustering.png)
3. Based on the results above, we will manually update the `cdhit` table

```{r}
write_tsv(cdhit, file = "output/cd-hit/20210727-cdhit-cluster-for-manual-editing.tsv")
```
In addition to the changes mentioned in the table above, I also manually merged QEO60526 into the PIS58185 cluster (see README.md in this folder for the rationale); I merged QEO21241 from B11220 to the PIS51941 cluster. My goal here is to manually check the single gene clusters and determine if they actually belong to an existing cluster. All three remaining single gene clusters cannot be merged to any existing cluster, because their orthologs (by BLASTP) are either missing the signal peptide and/or GPI-anchor.

Lastly, CD-hit by default chooses the longest protein in a cluster as the representative. I manually changed the representative (aka leader) sequence to the B8441 protein if one exists. This way the final list will be mostly represented by sequences from a single strain (with some exceptions).

4. Then load the updated table and merge the information to produce the final candidate table.Only the leaders (longest representative in each cluster) are shown below.
```{r}
cdhit1 <- read_tsv("output/cd-hit/20210727-cdhit-cluster-manually-edited.tsv", col_types = cols()) %>% 
  left_join(select(seqInfo, id = name, length), by = "id") %>% 
  select(-ends_with(c(".hi",".lo"))) %>% 
  group_by(leader) %>% 
  mutate(leader.len = max(length), n = n()) %>% 
  ungroup()

cdhit1 <- cdhit1 %>% 
  filter(is.leader) %>% 
  select(leader, leader.len) %>% 
  arrange(desc(leader.len)) %>% 
  mutate(cluster = seq_along(leader)) %>% 
  right_join(cdhit1, by = c("leader", "leader.len")) %>%
  select(cluster, n, leader, leader.len, id, length, is.leader)
```

5. We will merge the cluster information with other sequence features
```{r}
# calculate the per sequence repeat content and add to the candidates list
tmp <- repeats %>% 
  mutate(len = end - start + 1) %>% 
  group_by(identifier) %>% 
  summarize(perc = round(sum(len/seqLength),3)) %>% 
  rename(name = identifier, repeats = perc)

candidates <- filter3 %>% 
  filter(str_sub(group,1,3) == "CWP") %>% 
  rename(ST.win = ST.window.max) %>% 
  left_join(tmp, by = "name") %>% 
  replace_na(list(repeats = 0)) %>% 
  left_join(select(cdhit1, name = id, cluster, cluster.size = n, is.leader), by = "name") %>% 
  select(-Hil_family) %>% 
  arrange(cluster)
```

6. Count the number of leaders in each group
```{r}
candidates %>% filter(is.leader) %>% count(group)
```

7. Show their sequence features for the CWP long and CWP short groups separately

```{r}
cwp.long <- candidates %>% filter(is.leader, group == "CWP long") %>% select(-group, -is.leader, -pfam_domains, pfam_domains)
cwp.short <- candidates %>% filter(is.leader, group == "CWP short") %>% select(-group, -is.leader, -pfam_domains, pfam_domains)
cat("CWP long")
cwp.long
cat("CWP short")
cwp.short
```
    
**_Discussion_**

1. In the first table above, cluster 1 along with XP_028893057, which is not in the candidates list because it misses the C-terminal GPI-anchor sequence (this is a partial mRNA missing the 3' portion), are the only ones containing the "GVVIVT" sequence (see README.md in this folder). They all have the GLEYA domain in the N-terminus, and had very similar central domain repeat units as the Hil1/2 protein.
1. Rows 2, 3 and 4 all have the Candida_ALS_N domain, and they are the ALS family members.
1. Examining the N-terminal Pfam domains among the rest candidate revealed Flo11, Asp, PLA2_B, CFEM, Glyco_hydro_72 and X8. Many of these domains have connections to adhesion.

8. Write the candidates info to files
```{r}
candidates %>% 
  filter(is.leader) %>% 
  select(-is.leader, -pfam_domains, pfam_domains) %>% 
  relocate(group, .after = name) %>% 
  write_tsv(file = "output/Rmd-out/20210729-candidate-leaders-seqinfo.tsv")
```


## Schematic plot for leaders
The goal is to create a domain schematic plot for all the leader proteins. The desired plot will include the following features: SignalP, GPI-anchor, any pfam domain (except for the Hyr1, Flocculin_t3 and Candida_ALS repeat domains), tandem repeats as identified by xstream, and the plot should be ordered by protein length from large to small.

**Update 2021-07-20 [HB]** I decide to remove the clusters whose leader protein length is less than 195 a.a. This helps to shorten the list, making the plot more readable.

```{r}
# get the ids to plot
id.to.plot <- candidates %>% filter(is.leader, length >= 195) %>% pull(name)

# we will be building the plot in layers: 1) whole protein (grey); 2) tandem repeats; 3) pfam domains; 4) signalp and pred.gpi; 5) if possible, tango hits
# the relevant data can be assembled separately

df.whole <- candidates %>% 
  filter(name %in% id.to.plot) %>% 
  mutate(start = 1, end = length, feature = "whole protein",
         name = ordered(name, levels = rev(id.to.plot))) %>% 
  select(name, feature, start, end, group)

df.repeats <- repeats %>% 
  mutate(feature = paste("TR", type, "-")) %>% 
  select(name = identifier, feature, start, end) %>% 
  filter(name %in% id.to.plot) %>% 
  left_join(select(df.whole, name, group), by = "name")

df.pfam <- hmmscan_pfam %>% 
  filter(!outcompeted,  # remove lower-scored overlapping domains
         !hmm_name %in% c("Hyr1", "Candida_ALS")) %>% # remove two repeat domains , "Flocculin_t3"
  mutate(name = str_sub(seq_id, 2), # remove > from seq.id
         feature = paste0(hmm_name, " (", gsub("\\.[0-9]", "", hmm_acc), ")")) %>% # feature name
  filter(name %in% id.to.plot) %>% 
  left_join(select(df.whole, name, group), by = "name") %>% 
  select(name, feature, start = envelope_start, end = envelope_end, group)

df.signalp <- signalp5 %>% 
  filter(id %in% id.to.plot) %>% 
  mutate(feature = "SignalP") %>% 
  select(name = id, start, end) %>% 
  left_join(select(df.whole, name, group), by = "name")

df.gpi <- pred.gpi %>% 
  filter(name %in% id.to.plot, is.gpi) %>% 
  mutate(feature = "GPI-anchor") %>% 
  left_join(select(df.whole, name, group, end), by = "name") %>% 
  select(name, feature, start = cleavePos, end, group)

df.tango <- tango.res.df1 %>% 
  filter(id %in% id.to.plot) %>% 
  mutate(feature = "tango") %>% 
  select(name = id, feature, start, end, median, seq) %>% 
  left_join(select(df.whole, name, group), by = "name")
```

Build the plot
```{r, fig.height=5.4, fig.width=7.2}
# protein backbone
p0 <- ggplot(df.whole, aes(x = name, y = start, xend = name, yend = end)) + 
  geom_hline(yintercept = 500, color = "grey80", size = 0.2) + # add a visual divider to distinguish between long and short
  geom_segment(color = "grey80", size = 2)
# tandem repeats
p1 <- geom_segment(data = df.repeats, aes(size = "Tandem Repeats"), color = "orange4", alpha = 0.5)

# pfam
require(RColorBrewer)
feature.col <- c(brewer.pal(12, "Paired")[c(9,2,4,3,5,7,11,8)], brewer.pal(8, "Dark2")[c(2,8,3,4,6)],"brown3", "navyblue", "skyblue")
#(length(unique(df.pfam$feature))) # define the colors
p2 <- geom_segment(data = df.pfam, aes(color = feature), size = 2)

# signalp and gpi
p3 <- geom_segment(data = df.signalp, aes(size = "SignalP"), color = "#e31a1c")
p4 <- geom_segment(data = df.gpi, aes(size = "GPI-anchor"), color = "#6a3d9a")

# tango
p5 <- geom_segment(data = df.tango, aes(y = start,  yend = end + 1, alpha = median),
                   position = position_nudge(x = -0.42, y = 0), color = "darkred", size = .8)

# plot
p0 + p1 + p2 + p3 + p4 + p5 + coord_flip() + 
  #facet_wrap(~group, scales = "free_y") +
  scale_color_manual(values = c(feature.col)) + 
  scale_size_manual(
    "Other features", values=rep(2,3),
    guide=guide_legend(override.aes = list(colour=c("#6a3d9a", "#e31a1c", alpha("orange4",.5))))
  ) +
  guides(alpha = "none") +
  scale_y_continuous(breaks = seq(0,2500,500), expand = expansion(mult = 0.02)) +
  theme_classic(base_size = 12) + background_grid(major = "x") +
  theme(axis.text.y = element_text(size = 5), axis.title.y = element_blank(),
        axis.line.y = element_blank(),# axis.ticks.y = element_blank(),
        #axis.line.x = element_blank(),# axis.ticks.x = element_blank(),
        legend.position = c(0.65, 0.45), legend.box = "horizontal",
        panel.background = element_rect(fill = alpha("lightblue",0.1))) +
  labs(y = "Position in sequence", x = "Sequences", color = "Pfam domains")
ggsave("output/figure/20210720-cwp-candidates-domain-schematic.png", width = 7.2, height = 5.4)
```