Analysis
1. Overlap between the three queries?
What’s the extent of overlap between the blastp hits using the three
different queries?
Hits with <50% query coverage are removed.
refseq %>% filter(qcovs < 50)
Venn diagram showing the extent of overlap and set specific hits:
set.ovlp <- refseq %>%
filter(qcovs >= 50) %>%
group_by(sid) %>%
summarize(
Caur = any(query == "Caur_Hil1"),
Calb = any(query == "Calb_Hyr1"),
Cgla = any(query == "Cgla_Hyr1")
)
ggvenn(set.ovlp)

Which genes are found by the C. albicans and/or C.
glabrata Hyr1 query blast but not by the C. auris Hil1
query blast?
Calb.not.Caur.hits <- set.ovlp %>% filter(!Caur) %>% pull(sid)
refseq %>% filter(sid %in% Calb.not.Caur.hits) %>%
group_by(sid) %>% filter(evalue == min(evalue)) %>% ungroup() %>%
select(query, qcovs, sid, slen, pident, evalue, species) %>%
mutate(species = paste(str_sub(species, 1, 1), word(species, 2, 2), sep = ". ")) %>%
arrange(species)
- The majority of the combined set is already in the subset identified
using C. auris Hil1 as the query.
- Of the ones that are specific to the C. albicans Hyr1
query, I found the three C. parasilosis hits are already
included in the 104 homologs via the FungiDB blast. Whatever the reason
that exluded them from the ncbi blastp search using C. auris
Hil1 domain as the query, they are already in our dataset.
- Same is true for XP_452176.1 from K. lactis, this time
through the GRYC search.
Among the species included previously, how many of the “new”
sequences were not part of the 104 homologs list?
| XP_002999585 |
CAGL0L00227g |
C. glabrata |
Yes |
GRYC |
| XP_002615218 |
CLUG_05233 |
C. lusitaniae |
Yes |
FungiDB |
| XP_002620045 |
CLUG_01204 |
C. lusitaniae |
No^ |
NA |
| XP_002614121 |
CLUG_05607 |
C. lusitaniae |
No |
NA |
| XP_036665262 |
CPAR2_806390 |
C. parapsilosis |
Yes |
FungiDB |
| XP_036662815 |
CPAR2_600430 |
C. parapsilosis |
Yes |
FungiDB |
| XP_036665263 |
CPAR2_806400 |
C. parapsilosis |
No |
NA |
| XP_452176 |
KLLA0_B14498g |
K. lactis |
Yes |
GRYC |
| XP_001527307 |
LELG_02136 |
L. elongisporus |
No |
NA |
| XP_001385683 |
PICST_3402 |
S. stipitis |
No^ |
NA |
| XP_001383953 |
PICST_31095 |
S. stipitis |
Yes* |
NCBI |
*this hit was originally included when the N560 aa was used as query.
In later analysis where N360 aa (of C. auris Hil1) was used as
query, this hit dropped off the list due to E-value being above the 1e-5
cutoff. So it is a legit, but borderline hit.
^shorter than 500 a.a.
Five hits are new, among which two are shorter than 500 a.a. and
would have been excluded from the list (although they were not
identified at all with the C. auris query).
2. Species to include
The new criteria is whether the species is among the 332 included in
the large-scale phylogeny of Shen et al. 2018. I downloaded the
tree from https://github.com/JLSteenwyk/treehouse/blob/master/Data/Saccharomycotina_fig2_Shen_etal_2018.tre
and manually converted it into a list of species names.
shen2018 <- scan("../../data/yeast-phylogeny/Saccharomycotina_fig2_Shen_etal_2018.txt", what = "character", sep = "\n")
Any species in the refseq hit list not in the Shen 2018
phylogeny?
setdiff(unique(refseq$species), shen2018)
[1] "Candida haemuloni" "Candida duobushaemulonis" "Candida pseudohaemulonii"
[4] "Suhomyces tanzawaensis" "Pseudomonas syringae;Pseudomonas" "Diutina rugosa"
[7] "Yamadazyma tenuis" "Schizosaccharomyces cryophilus" "Kazachstania barnettii"
[10] "Artibeus jamaicensis"
| Candida haemuloni |
45357 |
No |
NA |
| Candida duobushaemulonis |
1231522 |
No |
NA |
| Candida pseudohaemulonii |
418784 |
No |
NA |
| Suhomyces tanzawaensis |
984487 |
Yes |
Candida tanzawaensis |
| Pseudomonas syringae:Pseudomonas |
317;59510 |
No |
NA |
| Diutina rugosa |
5481 |
No |
NA |
| Yamadazyma tenuis |
590646 |
Yes |
Candida tenuis |
| Schizosaccharomyces cryophilus |
653667 |
No |
NA |
| Kazachstania barnettii |
61262 |
No |
NA |
| Artibeus jamaicensis |
9417 |
No |
NA |
Manually figure out species synonyms and record both names for the
species to use. update: downloaded the species name
convertion table from Shen et al 2018 supplementary material.
spsNameConvert <- read_tsv("../../data/yeast-phylogeny/343taxa_speicies-name_clade-name_color-code.txt", col_types = cols()) %>%
mutate(species = word(`Species name`, 1, 2))
spsNameSyn <- tibble(
ncbi.name = c(intersect(refseq$species, shen2018), "Suhomyces tanzawaensis", "Yamadazyma tenuis", "Lachancea kluyveri"),
shen.name = c(intersect(refseq$species, shen2018), "Candida tanzawaensis", "Candida tenuis", "Lachancea kluyveri")
)
3. Check protein completeness
Our goal is to create a table documenting the number of putative
homologs per species. We used to apply a length cutoff of 500 aa.
However, the specific value of the cutoff is not based on any
comprehensive studies. This combined with the uncertainty in some
putative homolog’s length due to assembly issues led us to reconsider
this filter. The solution we came up with is to include all blast hits
based on the PF11765 domain alone in the family expansion analysis (CAFE
and phylogenetic reconstruction). Later we will selectively include some
of them for the protein feature analysis. Since the latter analysis is
mainly to demonstrate the fast evolution of the central domain features,
it doesn’t rely on the completeness of the Hil family or the inclusion
of all species for which we have records for.
In curating the blast hits, we found that some (protein) blast hits
are annotated as missing the carboxy terminus. This would prevent us
from properly assessing the length of the protein. Note that
“incomplete” means different things for the protein vs mRNA record. For
the latter, “incomplete” could mean missing 5’ and/or 3’UTR while the
CDS may be complete. A large number of the mRNA records corresponding to
the protein hits in the list are “partial”. That’s not reason for
exclusion.
For C. tropicalis, I found a new, 2019
assembly that uses both PacBio and Illumina. The assembly did use
the reference genome assembly as a guide.
To identify all partial mRNA, we will use the rentrez
package to query the ncbi database.
# define a function for retrieving the mRNA information from the nucleotide databased, based on protein ID
retrieve_mrna_info <- function(ids){
nuc.lnk <- entrez_link(dbfrom = "protein", db = "nucleotide", id = ids, by_id = TRUE)
nucID <- sapply(nuc.lnk, function(x){
lnk = x$links$protein_nuccore_mrna
return(ifelse(is.null(lnk), NA, lnk))
})
nucSum <- entrez_summary(db = "nuccore", id = na.omit(nucID))
names(nucSum) <- ids[!is.na(nucID)]
return(nucSum)
}
retrieve_protein_info <- function(ids){
pSum <- entrez_summary(db = "protein", id = ids)
names(pSum) <- ids
return(pSum)
}
# all refseq protein IDs to get the mRNA information for
pid <- unique(refseq$sid)
# posting too many queries can hit the rate-limit. split into multiples of 50s
iter.pid <- split(x = pid, f = ceiling(seq_along(pid)/50))
pSum <- sapply(iter.pid, retrieve_protein_info)
nucSum <- sapply(iter.pid, retrieve_mrna_info)
# extract the mRNA info
mrnaInfo <- lapply(nucSum, function(x) {
extract_from_esummary(x, elements = c("caption", "completeness"), simplify = FALSE) %>%
bind_rows(.id = "sid") %>%
rename(tID = caption, tComplete = completeness)
}) %>% bind_rows()
# extract the protein completeness info
pInfo <- lapply(pSum, function(x) {
extract_from_esummary(x, elements = c("completeness"), simplify = FALSE) %>%
bind_rows(.id = "sid") %>%
rename(pComplete = completeness)
}) %>% bind_rows()
Now remove duplicates (resulting from the different queries) and
filter by species (those that overlap the Shen 2018 list plus the MDR
clade)
sps.include <- c(spsNameSyn$ncbi.name, "Candida haemuloni", "Candida duobushaemulonis", "Candida pseudohaemulonii")
dedup.ref <- refseq %>%
filter(species %in% sps.include, qcovs >= 50) %>%
group_by(sid) %>%
filter(evalue == min(evalue)) %>%
left_join(mrnaInfo, by = c("sid" = "sid")) %>%
left_join(pInfo, by = c("sid" = "sid")) %>%
select(query, sid, slen, pComplete, tID, tComplete, species, strain, everything())
Summarize the deduplicated list - for each species, record the total
number of hits, those that are labeled as incomplete in the protein
record, those that are above 500 aa and the number of hits below 500 aa
that are labeled as incomplete
sum.ref <- dedup.ref %>%
group_by(species, strain) %>%
summarize(
total = n(),
incompl = sum(pComplete != ""),
gt500 = sum(slen >= 500),
shortIncompl = sum(pComplete != "" & slen < 500)
)
`summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
sum.ref %>% arrange(desc(shortIncompl))
- S. stipitis and M. bicuspidata had a large number
of incomplete proteins, likely a reflection of the genome assembly
status. For M. bicuspidata, a newer Illumina + Oxford Nanopore
assembly
is available for this species but is only assembled to the contig level,
with no gene annotation. I tried blasting the refseq hits from the
species against the new assembly but the match was poor for most of the
hits - instead of getting a long, contiguous HSP, I got mostly broken
pieces with lots of mismatches. I suspect this is due to a combination
of tandem gene arrays causing difficulty for assembly around the Hil
homologs loci and the sequencing depth and assembly itself. Based on
these observations, I plan to include the species in the CAFE analysis
and potentially in the phylogenetic reconstruction as well, but exclude
it from the homolog protein trait analyses.
- For S. stipitis, I identified a newer
assembly for a different strain (NRRL Y-7124) and performed local
blast against the annotated protein sequences. I identified a total of
13 hits, of which seven are above the 50% query coverage threshold. This
is quite different from the 18 hits identified in the refseq assembly.
When I blasted the refseq hits against the new assembly, some of them
had poor matches. Without further investigation, it is not clear what is
the cause of the discrepancy. I plan to use the new assembly hits for
downstream analyses given that it is more recently sequenced.
- For the same reasons as above, I plan to include Y. tenuis,
S. tanzawaensis and C. jadinii in the CAFE analysis
but exclude them from the protein feature analyses. Given that they have
relatively few hits, I could consider manually searching for the full
length ORF, but that’s optional.
- For D. fabryi, the Y-1000 plus project has sequenced
another strain from this species. I blasted against that assembly and
identified two matches. But the two hits appear too similar to each
other, leaving me wonder if they are truly distinct loci, or are they
partially duplicate contigs.
Hits based on the new S. stipitis assembly
s.stipitis <- read_tsv("../../data/long-read-assemblies/s_stipitis_NRRL_Y7124/S-stipitis-NRRL-Y7124-Hil-homologs-blast.txt",
comment = "#", col_names = header[1:10], col_types = cols()) %>%
mutate(species = "Scheffersomyces stipitis", strain = "NRRL Y-7124", staxid = "913138",
pident = num(pident, digits = 1)) %>%
relocate(species:strain, .after = pident) %>%
group_by(sid) %>%
filter(evalue == min(evalue))
4. Candida glabrata hits
The Cormack lab released new assemblies for a couple of C.
glabrata strains in Xu 2020 and Xu 2021 (PMIDs: 32068314,
33713372). These assemblies were based on long-red sequencing
technologies supplemented with short reads for polishing the sequences.
They also paid special attention to the subtelomeres, using various
means to verify that their newly assembled subtelomeric regions are
correct. Performing blastp searches against these assemblies (I tested
two of them: ATTC2001=CBS138 and BG2) recovered 13 instead of 3 hits (in
the new refseq hit list). I verified that all of them are credible.
Actually, Xu et al. 2020 supplementary table specifically
listed these 13 as predicted adhesins belonging to cluster V, which also
included Awp2 studied by Reithofer et al 2021. Including them in the
homologs list would properly reflect the family size in the species, but
will also highlight the issue that incompletely assembled subtelomeres
may have led to many false negatives in the gene family inventory. I
think it is worth including it and commenting on the caveats.
c.glabrata <- read_tsv("../../data/long-read-assemblies/c_glabrata_CBS138/C-glabrata-CBS138-Hil-homologs-blast.txt",
comment = "#", col_names = header[1:10], col_types = cols()) %>%
mutate(species = "Candida glabrata", strain = "CBS 138", staxid = "5478",
pident = num(pident, digits = 1)) %>%
relocate(species:strain, .after = pident) %>%
group_by(sid) %>%
filter(evalue == min(evalue))# %>%
#rename(pid = sid)
c.glabrata
Note that I also checked three other species for which I can find a
recent assembly that used long-read technlogoies. None of them led to so
many more hits as the C. glabrata assembly did. In fact, only
the C. nivariensis assembly yielded one more hit than the
older, short-read based assembly. There may be two possibilities: 1) not
all new assemblies are created equal. there are some special properties
of the C. glabrata assembly that made its subtelomeric regions
especially well assembled; 2) there is something special about C.
glabrata in that the Hil family expanded more significantly than
even the closely related Nakaseomyces species. As of now we cannot
distinguish between these two.
5. FungiDB hits
FungiDB doesn’t include any species not already in the refseq
database. The hits that were added to the refseq hits from the earlier
analysis are
prev.fdb.add <- tribble(
~gid, ~pid, ~species,
"CLUG_05233", "XP_002615218.1", "Clavispora lusitaniae",
"CPAR2_600430", "XP_036662815.1", "Candida parapsilosis",
"CPAR2_806390", "XP_036665262.1", "Candida parapsilosis",
"CPAR2_806420", "XP_036665265.1", "Candida parapsilosis",
"B9J08_004098", "PIS52481.1", "Candida auris"
)
prev.fdb.add %>% filter(!pid %in% dedup.ref$sid)
The first four are not missing, just a difference in the ID format.
The missing C. auris sequence is known to us and will be added
regardless. Based on the above, we will no longer consult the FungiDB to
supplement the refseq hits.
c.auris <- read_tsv("../../output/XP_028889033-Cauris-five-strains-blast.txt",
comment = "#", col_types = cols()) %>%
filter(qseqid == "XP_028889033.1_PF11765", sseqid == "PIS52481.1") %>%
rename(sid = sseqid) %>%
mutate(
pid = sid,
query = "Caur_Hil1",
# the blast result from the above file was done using the first 325 amino acids as query
# I manually did two sequence blast with the PF11765 from Hil1 as the query and entered
# the query and subject start and end positions below
q.start = 12, q.end = 326, s.start = 11, s.end = 325,
qcovs = (q.end - q.start + 1)/325*100,
slen = 1247,
species = "Candida auris",
staxid = "498019"
) %>%
select(query, sid, slen, species, qcovs, pident, q.start, q.end, s.start, s.end, evalue, staxid, pid)
I now consider the refseq list complete, we will gather the final
list of IDs and retrieve their sequences.
Retrieve protein info (no mRNA information is available)
id.add <- c(c.glabrata$sid, s.stipitis$sid, c.auris$sid)
pSum1 <- retrieve_protein_info(id.add)
# extract the protein completeness info
pInfo1 <- extract_from_esummary(pSum1, elements = c("completeness"), simplify = FALSE) %>%
bind_rows(.id = "sid") %>%
rename(pComplete = completeness)
Replace C. glabrata and S. stipitis in refseq hit
list. Add the C. auris Hil4 from B8441.
dedup.ref.cg <- dedup.ref %>%
filter(!species %in% c("Candida glabrata", "Scheffersomyces stipitis")) %>%
bind_rows(filter(s.stipitis, qcovs >= 50)) %>%
bind_rows(filter(c.glabrata, qcovs >= 50)) %>%
bind_rows(filter(c.auris, qcovs >= 50)) %>%
rows_update(pInfo1, by = "sid", unmatched = "ignore") %>%
mutate(pid = gsub(".*\\|(.*)\\|.*", "\\1", sid))
Retrieve sequences
iter.pid1 <- split(x = dedup.ref.cg$pid, f = ceiling(seq_along(dedup.ref.cg$pid)/50))
ref.seq <- lapply(iter.pid1, function(x){
entrez_fetch(db = "protein", id = x, rettype = "fasta")
})
# merge all the sequences and write to file
ref.seq %>% paste0(collapse = "") %>%
write(file = "../../output/20220503-expanded-blast-refseq-homologs.faa")
6. GRYC hits
GRYC did include a few more species, especially in the Nakaseomyces
and Lachancea genera. While the same assembly data appear to exist in
the ncbi database, they are certainly not included in the refseq
database. Moreover, even when the assemblies are present in the ncbi
database, there are only the genomic sequences, not translated CDS,
which likely make them unavailable to blastp searches. Therefore I
downloaded the five non-glabrata Nakaseomyces genomes plus one Lachancea
species, L. kluyveri and performed local tblastn
searches. Below I will process the results and decide which ones to
include in the final list.
gryc.sps <- tribble(
~sps.abr, ~species, ~strain, ~staxid,
"CANI", "Candida nivariensis", "CBS 9983", "1279115",
"NADE", "Nakaseomyces delphensis", "CBS 2170", "1279113",
"NABA", "Nakaseomyces bacillisporus", "CBS 7720", "1279116",
"CACA", "Candida castellii", "CBS 4332", "1279086",
"CABR", "Candida bracarensis", "CBS 10154", "1279114",
"SAKL", "Lachancea kluyveri", "CBS 3082", "4934"
)
gryc <- read_tsv("../../data/Nakaseomyces_GRYC/Nakaseomyces-Hil-homologs-blast.txt",
comment = "#", col_names = header[1:10], col_types = cols()) %>%
mutate(sps.abr = gsub("BN[0-9]*_", "", sid) %>% str_sub(1,4), pident = num(pident, digits = 1)) %>%
left_join(gryc.sps, by = "sps.abr") %>%
relocate(species:strain, .after = pident) %>%
select(-sps.abr)
dedup.gryc <- gryc %>%
group_by(sid) %>%
filter(evalue == min(evalue)) %>%
mutate(pid = gsub("BN[0-9]*_", "", sid) %>% gsub("1_1$", "", .)) %>%
filter(qcovs >= 50) %>%
ungroup()
Extract the protein sequences from the combined fasta
cat(dedup.gryc$sid, file = "20220506-gryc-deduplicated-id-for-extracting-fasta.txt", sep = "\n")
Run the following at the command line assuming the current working
directory is 02-blast
python3 script/extract_fasta_gz.py \
data/Nakaseomyces_GRYC/20220503-gryc-nakaseomyces-protein.faa.gz \
analysis/expanded-blast/20220506-gryc-deduplicated-id-for-extracting-fasta.txt \
output/20220506-expanded-blast-gryc-homologs.faa
Merge the GRYC sequences with the refseq sequences - run the
following at the command line
cat 20220503-expanded-blast-refseq-homologs.faa \
20220506-expanded-blast-gryc-homologs.faa \
> 20220506-expanded-blast-combined-homologs.faa
Check to make sure that no duplicates exist.
bioawk -c fastx '{print $name}' 20220506-expanded-blast-combined-homologs.faa | sort | uniq -d
If there is no output, then there is no duplicates. The command will
output one line per sequence that occcurs more than once in the
file.
7. Merge all the sequences
Export the combined homologs list
dedup.merge <- bind_rows(
RefSeq = dedup.ref.cg,
GRYC = dedup.gryc,
.id = "source"
) %>%
mutate(name = paste(pid, gsub(" ", "_", species), sep = "_")) %>%
select(-q.start, -q.end, -s.start, -s.end) %>%
ungroup()
write_tsv(dedup.merge, file = "../../output/20220506-expanded-blast-combined-homologs.tsv")
9. Identify the PF11765 domain
Use HMMER to identify the location of the PF11765 domain. The hmm
model was downloaded from the pfam database on 2022-05-04. Run the
following command in 02-blast directory.
hmmscan --noali --domE 1e-3 \
-o 20220504-hmmscan.log \
--domtblout output/20220506-expanded-blast-combined-PF11765-hmmscan.txt \
data/HMM-profile/Hyphal_reg_CWP.hmm \
output/20220506-expanded-blast-combined-homologs.faa
Read the result
hmm.header <- c("target", "tacc", "tlen", "pid", "pacc", "slen", "full.evalue", "full.score", "full.bias",
"num", "of", "c.evalue", "i.evalue", "domain.score", "domain.bias",
"hmm.from", "hmm.to", "ali.from", "ali.to", "env.from", "env.to", "acc")
hmm <- read_table("../../output/20220506-expanded-blast-combined-PF11765-hmmscan.txt",
comment = "#", col_names = hmm.header, col_types = cols()) %>%
mutate(sid = pid, pid = gsub("BN[0-9]*_", "", pid) %>% gsub("1_1$", "", .)) %>%
select(sid, pid, num, of, i.evalue:env.to)
dedup.merge %>% filter(!pid %in% hmm$pid)
Curious why hmmscan didn’t identify the PF11765 domain
in the two sequences, I submitted the same sequence file to Batch CD-search,
which uses the RPS-BLAST to search a collection of protein sequences for
matches to a set of profile matrices. It is the reverse of the
PSI-BLAST.
cdsearch <- read_tsv("../../output/20220509-expanded-blast-combined-pfam-cd-search-output.txt", skip = 7,
col_types = cols()) %>%
filter(grepl("Hyphal", `Short name`)) %>%
mutate(sid = str_match(Query, ">([a-zA-Z0-9_.]*) ")[,2],
pid = gsub("BN[0-9]*_", "", sid) %>% gsub("1_1$", "", .)) %>%
select(sid, pid, everything(), -Query)
cdsearch %>% filter(pid %in% setdiff(dedup.merge$pid, hmm$pid)) %>% select(-sid)
Both sequences did have a match, although one of them has a
“superfamily” level match, suggesting that no hits exceed the domain
level E-value, but multiple hits met the “non-specific” criteria. Not
sure why hmmscan didn’t pick up the domain in the C. glabrata
sequence though
Merge the CD-Search and HMMSCAM results into one. I’m using the
envelope coordinates from hmmscan rather than the alignment
coordinates. According to PFam
documentation, the envelope coordinates reflect the region where the
match has been probabilistically determined to lie. It will be
slightly wider than the alignment.
hmm.cd <- cdsearch %>% filter(grepl("Hyphal_reg_CWP", `Short name`)) %>%
select(sid, pid, cd.type = `Hit type`, cd.incom = Incomplete, cd.from = From, cd.to = To) %>%
full_join(select(hmm, sid, pid, ali.from, ali.to, env.from, env.to)) %>%
mutate(cd.len = cd.to - cd.from + 1,
ali.len = ali.to - ali.from + 1,
env.len = env.to - env.from + 1)
Joining, by = c("sid", "pid")
hmm.cd %>% group_by(pid) %>% filter(n() > 1) %>% select(-sid, -ends_with("to"))
- Above are the sequences with more than one entries from either
search. Note that with the parameters used for CD-Search, it appears
that the algorithm only reports the strongest match (specific). This can
probably be changed by switching to the full report mode. But it’s not
important as I plan to use the HMM envelope coordinates whenever
available.
- The second match in XP_003671892.1 is very short. Remove it.
- The second match in NADE0s03e00110g largely overlaps with the first
one. Remove.
domain.final <- hmm.cd %>%
# remove the short hit and duplicates due to the join operation, but save the rows where hmmscan didn't find a match
filter(is.na(ali.from) | (abs(ali.from - cd.from) < 1000 & ali.len > 100)) %>%
filter(!(pid == "NADE0s03e00110g" & ali.from == 39)) # remove the overlapping entry
domain.final %>%
pivot_longer(c(cd.len, env.len), values_to = "length", names_to = "method") %>%
mutate(method = factor(method, levels = c("cd.len", "env.len"), labels = c("CD-Search hits", "HMMSCAN hits"))) %>%
ggplot(aes(x = length)) +
geom_histogram(aes(fill = cd.incom != "-"), bins = 30) +
geom_vline(xintercept = 328, linetype = 2) +
scale_fill_brewer("Incomplete", type = "qual") +
facet_wrap(~method, nrow = 2, ) + theme_cowplot()
Warning: Removed 2 rows containing non-finite values (stat_bin).

Some are incomplete matches.
Write the coordinates into a BED formatted file for extracting the
sequences.
domain.final %>%
mutate(
# if hmm hits are not available, use cd-search hit coordinates, with 3 bp expansion on both sides
# note that BED format is 0-based for the start position
from = ifelse(is.na(env.from), cd.from - 3 - 1, env.from - 1),
to = ifelse(is.na(env.to), cd.to + 3, env.to)) %>%
select(seq = sid, from, to) %>%
# the following statement is to remove the second PF11765 domain from one protein for the purpose of gene tree
group_by(seq) %>%
filter(from == min(from)) %>%
write_tsv(file = "../../output/20220512-expanded-blast-combined-homologs-PF11765.BED", col_names = FALSE)
Also write a version with the long name and not removing the second
domain from the one sequence, for the purpose of plotting and
calculating sequence length - PF11765
domain.final %>%
mutate(
from = ifelse(is.na(env.from), cd.from - 3 - 1, env.from - 1),
to = ifelse(is.na(env.to), cd.to + 3, env.to)) %>%
select(seq = pid, from, to) %>% # select pid for merging with dedup.merge
left_join(select(dedup.merge, seq = pid, name), by = "seq") %>%
ungroup %>% select(name, from, to) %>%
write_tsv(file = "../../output/20220623-expanded-blast-combined-homologs-PF11765-longname.BED", col_names = FALSE)
To extract the sequence matches to the PF11765 profile, execute the
following shell command
seqtk subseq 20220506-expanded-blast-combined-homologs.faa \
20220512-expanded-blast-combined-homologs-PF11765.BED > \
20220512-expanded-blast-combined-homologs-PF11765.faa
10. Rename the sequences for gene tree reconstruction
To facilitate gene tree reconstruction, rename the sequence names in
the form of ID_species_name, and get rid of the description
renamer <- dedup.merge %>%
# a bit complicated: in the sequence file, sequence names from Refseq sequences were modified
# while sequences from GRYC were not
mutate(id = ifelse(source == "RefSeq", pid, sid)) %>%
select(id, name) %>%
deframe()
myFastaRename <- function(file, out){
input <- read.fasta(file, seqtype = "AA", as.string = TRUE)
if(any(grepl(":", names(input)))){
# deal with the seqtk output
names(input) <- gsub(":.*$", "", names(input))
}
if(!all(names(input) %in% names(renamer))){
warning("The following sequences were not found in the rename list:")
print(setdiff(names(input), names(renamer)))
}
write.fasta(input, names = renamer[names(input)], file.out = out, nbchar = 5000)
}
myFastaRename(file = "../../output/20220506-expanded-blast-combined-homologs.faa",
out = "../../output/20220506-expanded-blast-combined-homologs-longname.faa")
myFastaRename(file = "../../output/20220512-expanded-blast-combined-homologs-PF11765.faa",
out = "../../output/20220512-expanded-blast-combined-homologs-PF11765-longname.faa")
11. Obtain subtrees for the species included
This is for the latter gene tree reconcilation and plotting.
Output the species used as a single column text file:
# after adding the GRYC species, we need to expand the sps.include list
# add S. cerevisiae for species tree illustrations
sps.final <- c(unique(dedup.merge$species), "Saccharomyces cerevisiae")
# convert the species names to the Shen 2018 name
sps.final.shen <- spsNameConvert %>%
mutate(ncbi.name = species,
shen2018 = word(gsub("_", " ", old_speceis_names), -2, -1)) %>%
select(ncbi.name, shen2018, `Major clade`, Family, Genus) %>%
filter(ncbi.name %in% sps.final)
cat("The following species are not present in the Shen 2018 list and need to be added later.")
The following species are not present in the Shen 2018 list and need to be added later.
setdiff(sps.final, spsNameConvert$species)
[1] "Candida haemuloni" "Candida duobushaemulonis" "Candida pseudohaemulonii"
sps.final.shen$shen2018 %>% na.omit() %>% gsub(" ", "_", .) %>%
write(file = paste0("data/", gsub("-", "", Sys.Date()), "-expanded-blast-species-list.txt"))
Get the species tree using TreeHouse
# first install shiny and phytools package
shiny::runGitHub("treehouse", "JLSteenwyk")
I then manually added the missing species by editing the Newick file
(without branch length version). The final result is shown below: 
Also export the gene-to-species mapping for generax
gene2sps.map <- dedup.merge %>%
select(name, species) %>%
# modify the two species names that differ between the gene names and species tree
mutate(treeName = gsub(" ", "_", species))#treeName = case_when(
#species == "Yamadazyma tenuis" ~ "Candida tenuis",
#species == "Suhomyces tanzawaensis" ~ "Candida tanzawaensis",
#TRUE ~ species),
#treeName = gsub(" ", "_", treeName))
gene2sps.map %>%
select(name, treeName) %>%
write_tsv("data/20220516-expanded-blast-genes-to-species-map.tsv", col_names = FALSE)
Append the species list with additional information
sps.patho <- read_tsv("data/20220518-expanded-blast-species-pathogen-info.txt", col_types = cols())
Warning: One or more parsing issues, see `problems()` for details
sps.summary <- dedup.merge %>%
group_by(species, staxid) %>%
summarize(
total = n(),
short = sum(slen < 500),
shortIncompl = ifelse(
any(is.na(pComplete)), NA,
sum(pComplete %in% c("no-ends", "no-left", "no-right") & slen < 500)
),
.groups = "drop"
) %>%
add_row(species = "Saccharomyces cerevisiae",
staxid = "559292",
total = 0, short = 0, shortIncompl = NA) %>%
left_join(sps.patho, by = "species") %>%
arrange(desc(total))
sps.summary %>% select(-staxid)
Query the taxonomy database for lineage information for each
species
taxon <- entrez_fetch(db = "taxonomy", id = sps.summary$staxid, rettype = "xml") %>%
XML::xmlToList()
names(taxon) <- sps.summary$species
Parse the taxonomy information
sps.lin <- lapply(taxon, function(x) {
bind_rows(x$LineageEx) %>% tail(-10) %>% filter(Rank != "species")
}) %>%
bind_rows(.id = "species")
write_tsv(sps.lin, "data/20220518-expanded-blast-species-taxonomy.tsv")
Write species info to file
spsInfo <- sps.lin %>%
filter(Rank == "family") %>%
select(species, family = ScientificName) %>%
mutate(treeName = gsub(" ", "_", species)) %>%
#treeName = case_when(
#species == "Yamadazyma tenuis" ~ "Candida tenuis",
#species == "Suhomyces tanzawaensis" ~ "Candida tanzawaensis",
#TRUE ~ species),
#treeName = gsub(" ", "_", treeName)) %>%
right_join(sps.summary, by = "species")
write_tsv(spsInfo, file = "data/20220518-expanded-blast-species-info.tsv")
12. Genome assembly info
Use species taxid to fetch the links to the taxonomy database. Note
that some of the taxids are for specific strains while others are at the
species level. As a result, there may be different number of assemblies
being fetched.
staxid <- sps.summary %>% select(species, staxid) %>% deframe()
staxid["Scheffersomyces stipitis"] <- "4924" # use the species general ID
assembly.lnk <- entrez_link(dbfrom = "taxonomy", db = "assembly", id = staxid, by_id = TRUE)
assembly.lnk <- lapply(assembly.lnk, function(x) x$links$taxonomy_assembly)
names(assembly.lnk) <- sps.summary$species
assembly.lnk["Lachancea kluyveri"] <- "9527871"
Get the assembly summary info from ncbi
assembly.info <- lapply(assembly.lnk, function(i){
if(!is_empty(i))
entrez_summary(db = "assembly", id = i)
})
Extract the information into a tibble
fields <- c("assemblyaccession", "lastmajorreleaseaccession", "taxid", "speciestaxid",
"speciesname", "assemblytype", "assemblystatus", "coverage", "submissiondate",
"lastupdatedate", "refseq_category", "contign50", "scaffoldn50")
assembly.tb <- lapply(assembly.info, function(x) {
tb <- extract_from_esummary(x, elements = fields, simplify = FALSE)
bind_rows(tb)
}) %>% bind_rows() %>%
mutate(species = gsub("\\[|\\]", "", speciesname)) %>%
arrange(desc(refseq_category), species) %>%
select(species, assemblystatus:scaffoldn50, everything())
#select(species = speciesname, everything())
Select the refseq or one alternative assembly
assembly.tbl <- assembly.tb %>%
mutate(assemblystatus = factor(assemblystatus, levels = c("Complete Genome", "Chromosome", "Scaffold", "Contig"))) %>%
group_by(species) %>%
filter(refseq_category != "na" | scaffoldn50 == max(scaffoldn50) | assemblystatus %in% c("Complete Genome", "Chromosome")) %>%
group_by(species, assemblystatus) %>%
filter(refseq_category != "na" | scaffoldn50 == max(scaffoldn50)) %>%
ungroup()
write_tsv(assembly.tbl, file = "../../output/20220525-expanded-blast-species-assembly-info.tsv")
13. Chromosomal locations
The goal is to fetch the chromosomal locations for the BLAST hits. We
are only going to use the assemblies that are at a chromosomal or
complete genome level. Note that C. auris refseq genome is
complete even though the record says it is scaffold.
# Get the list of species whose genome assembly are at a chromosomal or complete genome level
list.genome.compl <- assembly.tbl %>%
filter(assemblystatus %in% c("Complete Genome", "Chromosome"), refseq_category != "na") %>%
pull(species) %>% unique()
chr.sps <- c(list.genome.compl, "Scheffersomyces stipitis", "Candida auris")
chr.sid <- dedup.merge %>% filter(species %in% chr.sps, grepl("^ref", sid)) %>% pull(sid)
# Get the protein's gene ID (not refseq ID)
tmp.gene.lnk <- entrez_link(dbfrom = "protein", id = chr.sid, db = "gene", by_id = TRUE)
# extract gene ID
tmp.gene.id <- sapply(tmp.gene.lnk, function(x) x$links$protein_gene)
# fetch summary info from the gene database
tmp.gene <- entrez_summary(db = "gene", id = tmp.gene.id)
# extract gene name
tmp.gene.name <- extract_from_esummary(tmp.gene, elements = "name")
# extract systematic ID (stored in the otheralias field)
tmp.gene.alias <- extract_from_esummary(tmp.gene, elements = "otheraliases")
# extract chromosomal locations, only available for fully assembled genomes
tmp.gene.loc <- bind_rows(extract_from_esummary(tmp.gene, elements = "genomicinfo"), .id = "gene_uid")
# pool the unique chromosome sequence IDs to find out their length
tmp.chr.id <- unique(tmp.gene.loc$chraccver)
# obtain the chromosome length information from the nuccore database
tmp.chr <- entrez_summary("nuccore", id = tmp.chr.id)
tmp.chr.len <- extract_from_esummary(tmp.chr, elements = "slen");
names(tmp.chr.len) = tmp.chr.id
# assemble the chromosomal location dataset
chr.loc <- tibble(sid = chr.sid, gene_uid = names(tmp.gene.name),
gene_name = tmp.gene.name, gene_alias = tmp.gene.alias) %>%
left_join(tmp.gene.loc, by = "gene_uid") %>%
mutate(chrL = tmp.chr.len[chraccver], relLoc = num(chrstart/chrL, digits = 2)) %>%
left_join(select(dedup.merge, sid, pid, species, strain), by = "sid") %>%
relocate(pid:strain, .after = sid) %>% relocate(sid, .after = last_col()) %>%
select(-exoncount) %>%
arrange(species, chrloc)
write_tsv(chr.loc, file = "../../output/20220609-expanded-blast-chromosomal-locations.tsv")
---
title: "C037 Expanding NCBI blastp analysis"
author: Bin He
date: "2022-04-08 updated `r Sys.Date()`"
output: 
  html_notebook:
    toc: true
    toc_float: true
    code_folding: hide
---

```{r}
suppressPackageStartupMessages(require(tidyverse))
suppressPackageStartupMessages(require(cowplot))
suppressPackageStartupMessages(require(ggvenn))
suppressPackageStartupMessages(require(seqinr))
# set NCBI Entrez API key
suppressPackageStartupMessages(require(rentrez))
set_entrez_key("f4f01c220f3a0a162bada5b0bd5d4b131908")
```

## Background
This analysis is in response to reviewer 1's questions:

1. The submitted version used _C. auris_ Hil1's PF11765 domain as the query (bait). What if other homologs were used? Would the number and identity of the blast results change?
1. Why remove species from the blast output?

To address the first problem, I did blastp with two additional queries: the PF11765 domains in _C. albicans_ Hyr1 and _C. glabrata_ Hyr1.

As for the second question, my response and solutions are to include any species that were among the Y1000 tree in Shen et al. 2018. The more complete set can be used for the family expansion analysis. I plan to keep the homologs analysis restricted to the 104 homologs, as adding more species are not expected to change the main conclusions.

## Data
I performed blastp analysis using the three queries against either the refseq_prot, the new clustered nr databases or in FungiDB. For the former two (performed through web interface on NCBI website), I downloaded the results using the Request ID (RID) with the `blast_formatter` program and then used the same tool to reformat the output to contain the desired columns. For the FungiDB result, I tried downloading the results in ASN.1 format, but wasn't able to use `blast_formatter` to reformat it. Therefore I downloaded the table as shown on the default result page.
```{r}
# header for ncbi output
header <- c("query", "qcovs", "sid", "slen", "pident", "q.start", "q.end", "s.start", "s.end", "evalue", "staxid", "sscinames")
# refseq results
refseq <- read_tsv("data/20220406-expanded-PF11765-blastp-refseq.txt", comment = "#", col_types = cols(), col_names = header) %>% 
  mutate(species = word(sscinames, 1, 2) %>% gsub("\\[|\\]", "", x = .), strain = word(sscinames, 3, -1),
         pident = num(pident, digits = 1)) %>% 
  select(-sscinames) %>% 
  relocate(species:strain, .after = pident)
# nr results
nr <- read_tsv("data/20220406-expanded-PF11765-blastp-clusternr.txt", comment = "#", col_types = cols(), col_names = header) %>% 
  mutate(species = word(sscinames, 1, 2) %>% gsub("\\[|\\]", "", x = .), strain = word(sscinames, 3, -1),
         pident = num(pident, digits = 1)) %>% 
  select(-sscinames) %>% 
  relocate(species:strain, .after = pident)
# fungidb results
fdb <- read_csv("data/20220427-expanded-blast-fungidb.csv") %>% 
  rename(sid = Accession, evalue = `E-Value`, score = Score, alen = `Align Length`, qcovs = `Query Coverage`) %>% 
  mutate(
    species = word(Organism, 1, 2) %>% gsub("\\[|\\]", "", x = .), strain = word(Organism, 3, -1),
    query = word(Query, 1, 1),
    pident = num(Identity*100, digits = 1),
    qcovs = num(qcovs*100, digits = 0),
    evalue = signif(evalue, digits = 1),
    score = signif(score, digits = 3),
    .keep = "unused"
  ) %>% 
  # make spelling consistent with ncbi
  mutate(species = gsub("Candida haemulonis", "Candida haemuloni", species)) %>%
  select(query, qcovs, sid, pident, species, strain, score, evalue, `Rank Per Query`, `Rank Per Subject`)
```

## Analysis
### 1. Overlap between the three queries?
What's the extent of overlap between the blastp hits using the three different queries?

Hits with <50% query coverage are removed.
```{r}
refseq %>% filter(qcovs < 50)
```
Venn diagram showing the extent of overlap and set specific hits:
```{r}
set.ovlp <- refseq %>% 
  filter(qcovs >= 50) %>% 
  group_by(sid) %>% 
  summarize(
    Caur = any(query == "Caur_Hil1"),
    Calb = any(query == "Calb_Hyr1"),
    Cgla = any(query == "Cgla_Hyr1")
  )
ggvenn(set.ovlp)
```

Which genes are found by the _C. albicans_ and/or _C. glabrata_ Hyr1 query blast but not by the _C. auris_ Hil1 query blast?
```{r}
Calb.not.Caur.hits <- set.ovlp %>% filter(!Caur) %>% pull(sid)
refseq %>% filter(sid %in% Calb.not.Caur.hits) %>% 
  group_by(sid) %>% filter(evalue == min(evalue)) %>% ungroup() %>% 
  select(query, qcovs, sid, slen, pident, evalue, species) %>% 
  mutate(species = paste(str_sub(species, 1, 1), word(species, 2, 2), sep = ". ")) %>% 
  arrange(species)
```
> - The majority of the combined set is already in the subset identified using _C. auris_ Hil1 as the query.
> - Of the ones that are specific to the _C. albicans_ Hyr1 query, I found the three _C. parasilosis_ hits are already included in the 104 homologs via the FungiDB blast. Whatever the reason that exluded them from the ncbi blastp search using _C. auris_ Hil1 domain as the query, they are already in our dataset.
> - Same is true for XP_452176.1 from _K. lactis_, this time through the GRYC search.

Among the species included previously, how many of the "new" sequences were not part of the 104 homologs list?

| sid | gid | species | in_early_verion | source |
|:--|:--|:--|:--|:--|
| XP_002999585 | CAGL0L00227g | C. glabrata | Yes | GRYC |
| XP_002615218 | CLUG_05233 | C. lusitaniae | Yes | FungiDB |
| XP_002620045 | CLUG_01204 | C. lusitaniae | No^ | NA |
| XP_002614121 | CLUG_05607 | C. lusitaniae | No | NA |
| XP_036665262 | CPAR2_806390 | C. parapsilosis | Yes | FungiDB |
| XP_036662815 | CPAR2_600430 | C. parapsilosis | Yes | FungiDB |
| XP_036665263 | CPAR2_806400 | C. parapsilosis | No | NA |
| XP_452176 | KLLA0_B14498g | K. lactis | Yes | GRYC |
| XP_001527307 | LELG_02136 | L. elongisporus | No | NA |
| XP_001385683 | PICST_3402 | S. stipitis | No^ | NA |
| XP_001383953 | PICST_31095 | S. stipitis | Yes* | NCBI |

*this hit was originally included when the N560 aa was used as query. In later analysis where N360 aa (of _C. auris_ Hil1) was used as query, this hit dropped off the list due to E-value being above the 1e-5 cutoff. So it is a legit, but borderline hit.

^shorter than 500 a.a.

> Five hits are new, among which two are shorter than 500 a.a. and would have been excluded from the list (although they were not identified at all with the _C. auris_ query).


### 2. Species to include
The new criteria is whether the species is among the 332 included in the large-scale phylogeny of Shen _et al._ 2018. I downloaded the tree from https://github.com/JLSteenwyk/treehouse/blob/master/Data/Saccharomycotina_fig2_Shen_etal_2018.tre and manually converted it into a list of species names.
```{r}
shen2018 <- scan("../../data/yeast-phylogeny/Saccharomycotina_fig2_Shen_etal_2018.txt", what = "character", sep = "\n")
```

Any species in the refseq hit list not in the Shen 2018 phylogeny?
```{r}
setdiff(unique(refseq$species), shen2018)
```

| refseq_name | taxid | in_Shen_2018 | shen2018_name |
|:--|:--|:--|:--|
| Candida haemuloni | 45357 | No | NA |
| Candida duobushaemulonis | 1231522 | No | NA |
| Candida pseudohaemulonii | 418784 | No | NA |
| Suhomyces tanzawaensis | 984487 | Yes | Candida tanzawaensis |
| Pseudomonas syringae:Pseudomonas | 317;59510 | No | NA |
| Diutina rugosa | 5481 | No | NA |
| Yamadazyma tenuis | 590646 | Yes | Candida tenuis |
| Schizosaccharomyces cryophilus | 653667 | No | NA |
| Kazachstania barnettii | 61262 | No | NA |
| Artibeus jamaicensis | 9417 | No | NA |

Manually figure out species synonyms and record both names for the species to use. **update**: downloaded the species name convertion table from Shen et al 2018 supplementary material.
```{r}
spsNameConvert <- read_tsv("../../data/yeast-phylogeny/343taxa_speicies-name_clade-name_color-code.txt", col_types = cols()) %>% 
  mutate(species = word(`Species name`, 1, 2))
spsNameSyn <- tibble(
  ncbi.name = c(intersect(refseq$species, shen2018), "Suhomyces tanzawaensis", "Yamadazyma tenuis", "Lachancea kluyveri"),
  shen.name = c(intersect(refseq$species, shen2018), "Candida tanzawaensis", "Candida tenuis", "Lachancea kluyveri")
)
```

### 3. Check protein completeness

Our goal is to create a table documenting the number of putative homologs per species. We used to apply a length cutoff of 500 aa. However, the specific value of the cutoff is not based on any comprehensive studies. This combined with the uncertainty in some putative homolog's length due to assembly issues led us to reconsider this filter. The solution we came up with is to include all blast hits based on the PF11765 domain alone in the family expansion analysis (CAFE and phylogenetic reconstruction). Later we will selectively include some of them for the protein feature analysis. Since the latter analysis is mainly to demonstrate the fast evolution of the central domain features, it doesn't rely on the completeness of the Hil family or the inclusion of all species for which we have records for.

In curating the blast hits, we found that some (protein) blast hits are annotated as missing the carboxy terminus. This would prevent us from properly assessing the length of the protein. Note that "incomplete" means different things for the protein vs mRNA record. For the latter, "incomplete" could mean missing 5' and/or 3'UTR while the CDS may be complete. A large number of the mRNA records corresponding to the protein hits in the list are "partial". That's not reason for exclusion.

For _C. tropicalis_, I found a new, [2019 assembly](https://www.ncbi.nlm.nih.gov/assembly/GCA_013177555.1) that uses both PacBio and Illumina. The assembly did use the reference genome assembly as a guide.

To identify all partial mRNA, we will use the `rentrez` package to query the ncbi database.
```{r}
# define a function for retrieving the mRNA information from the nucleotide databased, based on protein ID
retrieve_mrna_info <- function(ids){
  nuc.lnk <- entrez_link(dbfrom = "protein", db = "nucleotide", id = ids, by_id = TRUE)
  nucID <- sapply(nuc.lnk, function(x){
    lnk = x$links$protein_nuccore_mrna
    return(ifelse(is.null(lnk), NA, lnk))
  })
  nucSum <- entrez_summary(db = "nuccore", id = na.omit(nucID))
  names(nucSum) <- ids[!is.na(nucID)]
  return(nucSum)
}

retrieve_protein_info <- function(ids){
  pSum <- entrez_summary(db = "protein", id = ids)
  names(pSum) <- ids
  return(pSum)
}
```

```{r}
# all refseq protein IDs to get the mRNA information for
pid <- unique(refseq$sid)
# posting too many queries can hit the rate-limit. split into multiples of 50s
iter.pid <- split(x = pid, f = ceiling(seq_along(pid)/50))
pSum <- sapply(iter.pid, retrieve_protein_info)
nucSum <- sapply(iter.pid, retrieve_mrna_info)
```

```{r}
# extract the mRNA info
mrnaInfo <- lapply(nucSum, function(x) {
  extract_from_esummary(x, elements = c("caption", "completeness"), simplify = FALSE) %>% 
    bind_rows(.id = "sid") %>% 
    rename(tID = caption, tComplete = completeness)
}) %>% bind_rows()
# extract the protein completeness info
pInfo <- lapply(pSum, function(x) {
  extract_from_esummary(x, elements = c("completeness"), simplify = FALSE) %>% 
    bind_rows(.id = "sid") %>% 
    rename(pComplete = completeness)
}) %>% bind_rows()
```

Now remove duplicates (resulting from the different queries) and filter by species (those that overlap the Shen 2018 list plus the MDR clade)
```{r}
sps.include <- c(spsNameSyn$ncbi.name, "Candida haemuloni", "Candida duobushaemulonis", "Candida pseudohaemulonii")
dedup.ref <- refseq %>% 
  filter(species %in% sps.include, qcovs >= 50) %>% 
  group_by(sid) %>% 
  filter(evalue == min(evalue)) %>% 
  left_join(mrnaInfo, by = c("sid" = "sid")) %>% 
  left_join(pInfo, by = c("sid" = "sid")) %>% 
  select(query, sid, slen, pComplete, tID, tComplete, species, strain, everything())
```


Summarize the deduplicated list - for each species, record the total number of hits, those that are labeled as incomplete in the protein record, those that are above 500 aa and the number of hits below 500 aa that are labeled as incomplete
```{r}
sum.ref <- dedup.ref %>% 
  group_by(species, strain) %>% 
  summarize(
    total = n(),
    incompl = sum(pComplete != ""),
    gt500 = sum(slen >= 500),
    shortIncompl = sum(pComplete != "" & slen < 500)
  )
sum.ref %>% arrange(desc(shortIncompl))
```
> - _S. stipitis_ and _M. bicuspidata_ had a large number of incomplete proteins, likely a reflection of the genome assembly status. For _M. bicuspidata_, a newer Illumina + Oxford Nanopore [assembly](https://www.ncbi.nlm.nih.gov/assembly/GCA_022575955.1/) is available for this species but is only assembled to the contig level, with no gene annotation. I tried blasting the refseq hits from the species against the new assembly but the match was poor for most of the hits - instead of getting a long, contiguous HSP, I got mostly broken pieces with lots of mismatches. I suspect this is due to a combination of tandem gene arrays causing difficulty for assembly around the Hil homologs loci and the sequencing depth and assembly itself. Based on these observations, I plan to include the species in the CAFE analysis and potentially in the phylogenetic reconstruction as well, but exclude it from the homolog protein trait analyses.
> - For _S. stipitis_, I identified a [newer assembly](https://www.ncbi.nlm.nih.gov/assembly/GCA_016859295.1/) for a different strain (NRRL Y-7124) and performed local blast against the annotated protein sequences. I identified a total of 13 hits, of which seven are above the 50% query coverage threshold. This is quite different from the 18 hits identified in the refseq assembly. When I blasted the refseq hits against the new assembly, some of them had poor matches. Without further investigation, it is not clear what is the cause of the discrepancy. I plan to use the new assembly hits for downstream analyses given that it is more recently sequenced.
> - For the same reasons as above, I plan to include _Y. tenuis_, _S. tanzawaensis_ and _C. jadinii_ in the CAFE analysis but exclude them from the protein feature analyses. Given that they have relatively few hits, I could consider manually searching for the full length ORF, but that's optional.
> - For _D. fabryi_, the Y-1000 plus project has sequenced another strain from this species. I blasted against that assembly and identified two matches. But the two hits appear too similar to each other, leaving me wonder if they are truly distinct loci, or are they partially duplicate contigs.

Hits based on the new _S. stipitis_ assembly
```{r}
s.stipitis <- read_tsv("../../data/long-read-assemblies/s_stipitis_NRRL_Y7124/S-stipitis-NRRL-Y7124-Hil-homologs-blast.txt",
                       comment = "#", col_names = header[1:10], col_types = cols()) %>% 
  mutate(species = "Scheffersomyces stipitis", strain = "NRRL Y-7124", staxid = "913138",
         pident = num(pident, digits = 1)) %>% 
  relocate(species:strain, .after = pident) %>% 
  group_by(sid) %>% 
  filter(evalue == min(evalue))
```

### 4. _Candida glabrata_ hits
The Cormack lab released new assemblies for a couple of _C. glabrata_ strains in Xu 2020 and Xu 2021 (PMIDs: 32068314, 33713372). These assemblies were based on long-red sequencing technologies supplemented with short reads for polishing the sequences. They also paid special attention to the subtelomeres, using various means to verify that their newly assembled subtelomeric regions are correct. Performing blastp searches against these assemblies (I tested two of them: ATTC2001=CBS138 and BG2) recovered 13 instead of 3 hits (in the new refseq hit list). I verified that all of them are credible. Actually, Xu _et al._ 2020 supplementary table specifically listed these 13 as predicted adhesins belonging to cluster V, which also included Awp2 studied by Reithofer et al 2021. Including them in the homologs list would properly reflect the family size in the species, but will also highlight the issue that incompletely assembled subtelomeres may have led to many false negatives in the gene family inventory. I think it is worth including it and commenting on the caveats.

```{r}
c.glabrata <- read_tsv("../../data/long-read-assemblies/c_glabrata_CBS138/C-glabrata-CBS138-Hil-homologs-blast.txt",
                       comment = "#", col_names = header[1:10], col_types = cols()) %>% 
  mutate(species = "Candida glabrata", strain = "CBS 138", staxid = "5478",
         pident = num(pident, digits = 1)) %>% 
  relocate(species:strain, .after = pident) %>% 
  group_by(sid) %>% 
  filter(evalue == min(evalue))# %>% 
  #rename(pid = sid)

c.glabrata
```

Note that I also checked three other species for which I can find a recent assembly that used long-read technlogoies. None of them led to so many more hits as the _C. glabrata_ assembly did. In fact, only the _C. nivariensis_ assembly yielded one more hit than the older, short-read based assembly. There may be two possibilities: 1) not all new assemblies are created equal. there are some special properties of the _C. glabrata_ assembly that made its subtelomeric regions especially well assembled; 2) there is something special about _C. glabrata_ in that the Hil family expanded more significantly than even the closely related Nakaseomyces species. As of now we cannot distinguish between these two.

### 5. FungiDB hits
FungiDB doesn't include any species not already in the refseq database. The hits that were added to the refseq hits from the earlier analysis are

```{r}
prev.fdb.add <- tribble(
  ~gid, ~pid, ~species,
  "CLUG_05233", "XP_002615218.1", "Clavispora lusitaniae",
  "CPAR2_600430", "XP_036662815.1", "Candida parapsilosis",
  "CPAR2_806390", "XP_036665262.1", "Candida parapsilosis",
  "CPAR2_806420", "XP_036665265.1", "Candida parapsilosis",
  "B9J08_004098", "PIS52481.1", "Candida auris"
)
prev.fdb.add %>% filter(!pid %in% dedup.ref$sid)
```
> The first four are not missing, just a difference in the ID format.
> The missing _C. auris_ sequence is known to us and will be added regardless.
> Based on the above, we will no longer consult the FungiDB to supplement the refseq hits.

```{r}
c.auris <- read_tsv("../../output/XP_028889033-Cauris-five-strains-blast.txt",
                    comment = "#", col_types = cols()) %>% 
  filter(qseqid == "XP_028889033.1_PF11765", sseqid == "PIS52481.1") %>% 
  rename(sid = sseqid) %>% 
  mutate(
    pid = sid,
    query = "Caur_Hil1", 
    # the blast result from the above file was done using the first 325 amino acids as query
    # I manually did two sequence blast with the PF11765 from Hil1 as the query and entered 
    # the query and subject start and end positions below
    q.start = 12, q.end = 326, s.start = 11, s.end = 325,
    qcovs = (q.end - q.start + 1)/325*100,
    slen = 1247,
    species = "Candida auris",
    staxid = "498019"
  ) %>% 
  select(query, sid, slen, species, qcovs, pident, q.start, q.end, s.start, s.end, evalue, staxid, pid)
```

I now consider the refseq list complete, we will gather the final list of IDs and retrieve their sequences.

Retrieve protein info (no mRNA information is available)
```{r}
id.add <- c(c.glabrata$sid, s.stipitis$sid, c.auris$sid)
pSum1 <- retrieve_protein_info(id.add)
# extract the protein completeness info
pInfo1 <- extract_from_esummary(pSum1, elements = c("completeness"), simplify = FALSE) %>% 
    bind_rows(.id = "sid") %>% 
    rename(pComplete = completeness)
```

Replace _C. glabrata_ and _S. stipitis_ in refseq hit list. Add the _C. auris_ Hil4 from B8441.
```{r}
dedup.ref.cg <- dedup.ref %>% 
  filter(!species %in% c("Candida glabrata", "Scheffersomyces stipitis")) %>% 
  bind_rows(filter(s.stipitis, qcovs >= 50)) %>% 
  bind_rows(filter(c.glabrata, qcovs >= 50)) %>% 
  bind_rows(filter(c.auris, qcovs >= 50)) %>% 
  rows_update(pInfo1, by = "sid", unmatched = "ignore") %>% 
  mutate(pid = gsub(".*\\|(.*)\\|.*", "\\1", sid))
```


Retrieve sequences
```{r}
iter.pid1 <- split(x = dedup.ref.cg$pid, f = ceiling(seq_along(dedup.ref.cg$pid)/50))
ref.seq <- lapply(iter.pid1, function(x){
  entrez_fetch(db = "protein", id = x, rettype = "fasta")
})
```

```{r}
# merge all the sequences and write to file
ref.seq %>% paste0(collapse = "") %>% 
  write(file = "../../output/20220503-expanded-blast-refseq-homologs.faa")
```

### 6. GRYC hits
GRYC did include a few more species, especially in the Nakaseomyces and Lachancea genera. While the same assembly data appear to exist in the ncbi database, they are certainly not included in the refseq database. Moreover, even when the assemblies are present in the ncbi database, there are only the genomic sequences, not translated CDS, which likely make them unavailable to blastp searches. Therefore I downloaded the five non-glabrata Nakaseomyces genomes plus one Lachancea species, _L. kluyveri_ and performed local `tblastn` searches. Below I will process the results and decide which ones to include in the final list.
```{r}
gryc.sps <- tribble(
  ~sps.abr, ~species, ~strain, ~staxid,
  "CANI", "Candida nivariensis", "CBS 9983", "1279115",
  "NADE", "Nakaseomyces delphensis", "CBS 2170", "1279113",
  "NABA", "Nakaseomyces bacillisporus", "CBS 7720", "1279116",
  "CACA", "Candida castellii", "CBS 4332", "1279086",
  "CABR", "Candida bracarensis", "CBS 10154", "1279114",
  "SAKL", "Lachancea kluyveri", "CBS 3082", "4934"
)

gryc <- read_tsv("../../data/Nakaseomyces_GRYC/Nakaseomyces-Hil-homologs-blast.txt",
                       comment = "#", col_names = header[1:10], col_types = cols()) %>% 
  mutate(sps.abr = gsub("BN[0-9]*_", "", sid) %>% str_sub(1,4), pident = num(pident, digits = 1)) %>% 
  left_join(gryc.sps, by = "sps.abr") %>% 
  relocate(species:strain, .after = pident) %>% 
  select(-sps.abr)

dedup.gryc <- gryc %>% 
  group_by(sid) %>% 
  filter(evalue == min(evalue)) %>% 
  mutate(pid = gsub("BN[0-9]*_", "", sid) %>% gsub("1_1$", "", .)) %>% 
  filter(qcovs >= 50) %>% 
  ungroup()
```

Extract the protein sequences from the combined fasta
```{r}
cat(dedup.gryc$sid, file = "20220506-gryc-deduplicated-id-for-extracting-fasta.txt", sep = "\n")
```

Run the following at the command line assuming the current working directory is `02-blast`
```bash
python3 script/extract_fasta_gz.py \
  data/Nakaseomyces_GRYC/20220503-gryc-nakaseomyces-protein.faa.gz \
  analysis/expanded-blast/20220506-gryc-deduplicated-id-for-extracting-fasta.txt \
  output/20220506-expanded-blast-gryc-homologs.faa
```

Merge the GRYC sequences with the refseq sequences - run the following at the command line
```bash
cat 20220503-expanded-blast-refseq-homologs.faa \
  20220506-expanded-blast-gryc-homologs.faa \
  > 20220506-expanded-blast-combined-homologs.faa
```

Check to make sure that no duplicates exist.

```
bioawk -c fastx '{print $name}' 20220506-expanded-blast-combined-homologs.faa | sort | uniq -d
```

If there is no output, then there is no duplicates. The command will output one line per sequence that occcurs more than once in the file.

### 7. Merge all the sequences
Export the combined homologs list
```{r}
dedup.merge <- bind_rows(
  RefSeq = dedup.ref.cg,
  GRYC = dedup.gryc,
  .id = "source"
  ) %>% 
  mutate(name = paste(pid, gsub(" ", "_", species), sep = "_")) %>% 
  select(-q.start, -q.end, -s.start, -s.end) %>% 
  ungroup()
write_tsv(dedup.merge, file = "../../output/20220506-expanded-blast-combined-homologs.tsv")
```

### 8. Compare with the former 104 list
Hits in the previous refseq blast list but not in the current one:
```{r}
p104 <- read_tsv("../../output/XP_028889033_homologs_combine_blast_stat.tsv", col_types = cols())
ggvenn(list(expanded_blast = dedup.merge$pid, previous_104 = p104$id))
p104 %>% filter(!id %in% gsub("BN[0-9]*_", "", dedup.merge$pid)) %>% arrange(species)
```
> - The four _S. stipitis_ sequences have been replaced by the new assembly
> - The _C. auris_, _C. glabrata_, _C. parasilosis_ and _Clavispora lusitaniae_ sequences are in fact all present in the new list, just under different IDs

### 9. Identify the PF11765 domain
Use HMMER to identify the location of the PF11765 domain. The hmm model was downloaded from the pfam database on 2022-05-04. Run the following command in `02-blast` directory.
```
hmmscan --noali --domE 1e-3 \
  -o 20220504-hmmscan.log \
  --domtblout output/20220506-expanded-blast-combined-PF11765-hmmscan.txt \
  data/HMM-profile/Hyphal_reg_CWP.hmm \
  output/20220506-expanded-blast-combined-homologs.faa
```

Read the result
```{r warning=FALSE}
hmm.header <- c("target", "tacc", "tlen", "pid", "pacc", "slen", "full.evalue", "full.score", "full.bias",
                "num", "of", "c.evalue", "i.evalue", "domain.score", "domain.bias",
                "hmm.from", "hmm.to", "ali.from", "ali.to", "env.from", "env.to", "acc")
hmm <- read_table("../../output/20220506-expanded-blast-combined-PF11765-hmmscan.txt", 
                  comment = "#", col_names = hmm.header, col_types = cols()) %>% 
  mutate(sid = pid, pid = gsub("BN[0-9]*_", "", pid) %>% gsub("1_1$", "", .)) %>% 
  select(sid, pid, num, of, i.evalue:env.to)
dedup.merge %>% filter(!pid %in% hmm$pid)
```

Curious why `hmmscan` didn't identify the PF11765 domain in the two sequences, I submitted the same sequence file to [`Batch CD-search`](https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi), which uses the RPS-BLAST to search a collection of protein sequences for matches to a set of profile matrices. It is the reverse of the PSI-BLAST.
```{r}
cdsearch <- read_tsv("../../output/20220509-expanded-blast-combined-pfam-cd-search-output.txt", skip = 7,
                     col_types = cols()) %>% 
  filter(grepl("Hyphal", `Short name`)) %>%
  mutate(sid = str_match(Query, ">([a-zA-Z0-9_.]*) ")[,2],
         pid = gsub("BN[0-9]*_", "", sid) %>% gsub("1_1$", "", .)) %>% 
  select(sid, pid, everything(), -Query)
cdsearch %>% filter(pid %in% setdiff(dedup.merge$pid, hmm$pid)) %>% select(-sid)
```
> Both sequences did have a match, although one of them has a "superfamily" level match, suggesting that no hits exceed the domain level E-value, but multiple hits met the "non-specific" criteria. Not sure why hmmscan didn't pick up the domain in the _C. glabrata_ sequence though

Merge the CD-Search and HMMSCAM results into one. I'm using the envelope coordinates from `hmmscan` rather than the alignment coordinates. According to [PFam documentation](https://pfam-docs.readthedocs.io/en/latest/glossary.html#alignment-coordinates), the envelope coordinates reflect the region where the match has been *probabilistically* determined to lie. It will be slightly wider than the alignment.
```{r}
hmm.cd <- cdsearch %>% filter(grepl("Hyphal_reg_CWP", `Short name`)) %>% 
  select(sid, pid, cd.type = `Hit type`, cd.incom = Incomplete, cd.from = From, cd.to = To) %>% 
  full_join(select(hmm, sid, pid, ali.from, ali.to, env.from, env.to)) %>% 
  mutate(cd.len = cd.to - cd.from + 1,
         ali.len = ali.to - ali.from + 1,
         env.len = env.to - env.from + 1)
hmm.cd %>% group_by(pid) %>% filter(n() > 1) %>% select(-sid, -ends_with("to"))
```

> - Above are the sequences with more than one entries from either search. Note that with the parameters used for CD-Search, it appears that the algorithm only reports the strongest match (specific). This can probably be changed by switching to the full report mode. But it's not important as I plan to use the HMM envelope coordinates whenever available.
> - The second match in XP_003671892.1 is very short. Remove it.
> - The second match in NADE0s03e00110g largely overlaps with the first one. Remove.

```{r}
domain.final <- hmm.cd %>% 
  # remove the short hit and duplicates due to the join operation, but save the rows where hmmscan didn't find a match
  filter(is.na(ali.from) | (abs(ali.from - cd.from) < 1000 & ali.len > 100)) %>% 
  filter(!(pid == "NADE0s03e00110g" & ali.from == 39)) # remove the overlapping entry
domain.final %>% 
  pivot_longer(c(cd.len, env.len), values_to = "length", names_to = "method") %>%
  mutate(method = factor(method, levels = c("cd.len", "env.len"), labels = c("CD-Search hits", "HMMSCAN hits"))) %>% 
  ggplot(aes(x = length)) + 
  geom_histogram(aes(fill = cd.incom != "-"), bins = 30) + 
  geom_vline(xintercept = 328, linetype = 2) +
  scale_fill_brewer("Incomplete", type = "qual") +
  facet_wrap(~method, nrow = 2, ) + theme_cowplot()
```
> Some are incomplete matches.

`r domain.final %>% filter(cd.incom != "-")`

Write the coordinates into a BED formatted file for extracting the sequences.
```{r}
domain.final %>% 
  mutate(
    # if hmm hits are not available, use cd-search hit coordinates, with 3 bp expansion on both sides
    # note that BED format is 0-based for the start position
    from = ifelse(is.na(env.from), cd.from - 3 - 1, env.from - 1), 
    to = ifelse(is.na(env.to), cd.to + 3, env.to)) %>% 
  select(seq = sid, from, to) %>% 
  # the following statement is to remove the second PF11765 domain from one protein for the purpose of gene tree
  group_by(seq) %>% 
  filter(from == min(from)) %>% 
  write_tsv(file = "../../output/20220512-expanded-blast-combined-homologs-PF11765.BED", col_names = FALSE)
```

Also write a version with the long name and not removing the second domain from the one sequence, for the purpose of plotting and calculating sequence length - PF11765
```{r}
domain.final %>% 
  mutate(
    from = ifelse(is.na(env.from), cd.from - 3 - 1, env.from - 1), 
    to = ifelse(is.na(env.to), cd.to + 3, env.to)) %>% 
  select(seq = pid, from, to) %>% # select pid for merging with dedup.merge
  left_join(select(dedup.merge, seq = pid, name), by = "seq") %>% 
  ungroup %>% select(name, from, to) %>% 
  write_tsv(file = "../../output/20220623-expanded-blast-combined-homologs-PF11765-longname.BED", col_names = FALSE)
```

To extract the sequence matches to the PF11765 profile, execute the following shell command
```
seqtk subseq 20220506-expanded-blast-combined-homologs.faa \
             20220512-expanded-blast-combined-homologs-PF11765.BED  > \
             20220512-expanded-blast-combined-homologs-PF11765.faa
```

### 10. Rename the sequences for gene tree reconstruction
To facilitate gene tree reconstruction, rename the sequence names in the form of ID_species_name, and get rid of the description

```{r}
renamer <- dedup.merge %>% 
  # a bit complicated: in the sequence file, sequence names from Refseq sequences were modified 
  # while sequences from GRYC were not
  mutate(id = ifelse(source == "RefSeq", pid, sid)) %>% 
  select(id, name) %>% 
  deframe()
myFastaRename <- function(file, out){
  input <- read.fasta(file, seqtype = "AA", as.string = TRUE)
  if(any(grepl(":", names(input)))){
    # deal with the seqtk output
    names(input) <- gsub(":.*$", "", names(input))
  }
  if(!all(names(input) %in% names(renamer))){
    warning("The following sequences were not found in the rename list:")
    print(setdiff(names(input), names(renamer)))
  }
  write.fasta(input, names = renamer[names(input)], file.out = out, nbchar = 5000)
}
```
```{r}
myFastaRename(file = "../../output/20220506-expanded-blast-combined-homologs.faa", 
              out = "../../output/20220506-expanded-blast-combined-homologs-longname.faa")
myFastaRename(file = "../../output/20220512-expanded-blast-combined-homologs-PF11765.faa", 
              out = "../../output/20220512-expanded-blast-combined-homologs-PF11765-longname.faa")
```

### 11. Obtain subtrees for the species included
This is for the latter gene tree reconcilation and plotting. 

Output the species used as a single column text file:
```{r}
# after adding the GRYC species, we need to expand the sps.include list
# add S. cerevisiae for species tree illustrations
sps.final <- c(unique(dedup.merge$species), "Saccharomyces cerevisiae")
# convert the species names to the Shen 2018 name
sps.final.shen <- spsNameConvert %>% 
  mutate(ncbi.name = species,
         shen2018 = word(gsub("_", " ", old_speceis_names), -2, -1)) %>% 
  select(ncbi.name, shen2018, `Major clade`, Family, Genus) %>% 
  filter(ncbi.name %in% sps.final)
cat("The following species are not present in the Shen 2018 list and need to be added later.")
setdiff(sps.final, spsNameConvert$species)
```

```{r}
sps.final.shen$shen2018 %>% na.omit() %>% gsub(" ", "_", .) %>%
  write(file = paste0("data/", gsub("-", "", Sys.Date()), "-expanded-blast-species-list.txt"))
```

Get the species tree using [TreeHouse](https://github.com/JLSteenwyk/treehouse)
```{r}
# first install shiny and phytools package
shiny::runGitHub("treehouse", "JLSteenwyk")
```

I then manually added the missing species by editing the Newick file (without branch length version). The final result is shown below:
![species tree](data/20220515-expanded-blast-species-tree.png)

Also export the gene-to-species mapping for `generax`
```{r}
gene2sps.map <- dedup.merge %>% 
  select(name, species) %>% 
  # modify the two species names that differ between the gene names and species tree
  mutate(treeName = gsub(" ", "_", species))#treeName = case_when(
    #species == "Yamadazyma tenuis" ~ "Candida tenuis",
    #species == "Suhomyces tanzawaensis" ~ "Candida tanzawaensis",
    #TRUE ~ species),
    #treeName = gsub(" ", "_", treeName))
    
gene2sps.map %>% 
  select(name, treeName) %>% 
  write_tsv("data/20220516-expanded-blast-genes-to-species-map.tsv", col_names = FALSE)
```

Append the species list with additional information
```{r}
sps.patho <- read_tsv("data/20220518-expanded-blast-species-pathogen-info.txt", col_types = cols())
sps.summary <- dedup.merge %>% 
  group_by(species, staxid) %>% 
  summarize(
    total = n(),
    short = sum(slen < 500),
    shortIncompl = ifelse(
      any(is.na(pComplete)), NA,
      sum(pComplete %in% c("no-ends", "no-left", "no-right") & slen < 500)
    ),
    .groups = "drop"
  ) %>% 
  add_row(species = "Saccharomyces cerevisiae",
          staxid = "559292",
          total = 0, short = 0, shortIncompl = NA) %>% 
  left_join(sps.patho, by = "species") %>% 
  arrange(desc(total))
sps.summary %>% select(-staxid)
```

Query the taxonomy database for lineage information for each species
```{r}
taxon <- entrez_fetch(db = "taxonomy", id = sps.summary$staxid, rettype = "xml") %>% 
  XML::xmlToList()
names(taxon) <- sps.summary$species
```

Parse the taxonomy information
```{r}
sps.lin <- lapply(taxon, function(x) {
  bind_rows(x$LineageEx) %>% tail(-10) %>% filter(Rank != "species")
  }) %>% 
  bind_rows(.id = "species")
write_tsv(sps.lin, "data/20220518-expanded-blast-species-taxonomy.tsv")
```


Write species info to file
```{r}
spsInfo <- sps.lin %>% 
  filter(Rank == "family") %>% 
  select(species, family = ScientificName) %>% 
  mutate(treeName = gsub(" ", "_", species)) %>% 
    #treeName = case_when(
    #species == "Yamadazyma tenuis" ~ "Candida tenuis",
    #species == "Suhomyces tanzawaensis" ~ "Candida tanzawaensis",
    #TRUE ~ species),
    #treeName = gsub(" ", "_", treeName)) %>% 
  right_join(sps.summary, by = "species")

write_tsv(spsInfo, file = "data/20220518-expanded-blast-species-info.tsv")
```

### 12. Genome assembly info
Use species taxid to fetch the links to the taxonomy database. Note that some of the taxids are for specific strains while others are at the species level. As a result, there may be different number of assemblies being fetched.
```{r genome_assembly}
staxid <- sps.summary %>% select(species, staxid) %>% deframe()
staxid["Scheffersomyces stipitis"] <- "4924" # use the species general ID
assembly.lnk <- entrez_link(dbfrom = "taxonomy", db = "assembly", id = staxid, by_id = TRUE)
assembly.lnk <- lapply(assembly.lnk, function(x) x$links$taxonomy_assembly)
names(assembly.lnk) <- sps.summary$species
assembly.lnk["Lachancea kluyveri"] <- "9527871"
```
Get the assembly summary info from ncbi
```{r}
assembly.info <- lapply(assembly.lnk, function(i){
  if(!is_empty(i))
    entrez_summary(db = "assembly", id = i)
})
```
Extract the information into a tibble
```{r}
fields <- c("assemblyaccession", "lastmajorreleaseaccession", "taxid", "speciestaxid",
            "speciesname", "assemblytype", "assemblystatus", "coverage", "submissiondate",
            "lastupdatedate", "refseq_category", "contign50", "scaffoldn50")
assembly.tb <- lapply(assembly.info, function(x) {
  tb <- extract_from_esummary(x, elements = fields, simplify = FALSE)
  bind_rows(tb)
}) %>% bind_rows() %>%
  mutate(species = gsub("\\[|\\]", "", speciesname)) %>% 
  arrange(desc(refseq_category), species) %>% 
  select(species, assemblystatus:scaffoldn50, everything())
  #select(species = speciesname, everything())
```
Select the refseq or one alternative assembly
```{r}
assembly.tbl <- assembly.tb %>% 
  mutate(assemblystatus = factor(assemblystatus, levels = c("Complete Genome", "Chromosome", "Scaffold", "Contig"))) %>% 
  group_by(species) %>% 
  filter(refseq_category != "na" | scaffoldn50 == max(scaffoldn50) | assemblystatus %in% c("Complete Genome", "Chromosome")) %>% 
  group_by(species, assemblystatus) %>% 
  filter(refseq_category != "na" | scaffoldn50 == max(scaffoldn50)) %>% 
  ungroup()
write_tsv(assembly.tbl, file = "../../output/20220525-expanded-blast-species-assembly-info.tsv")
```

### 13. Chromosomal locations
The goal is to fetch the chromosomal locations for the BLAST hits. We are only going to use the assemblies that are at a chromosomal or complete genome level. Note that _C. auris_ refseq genome is complete even though the record says it is scaffold.

```{r chr_loc1}
# Get the list of species whose genome assembly are at a chromosomal or complete genome level
list.genome.compl <- assembly.tbl %>% 
  filter(assemblystatus %in% c("Complete Genome", "Chromosome"), refseq_category != "na") %>% 
  pull(species) %>% unique()
chr.sps <- c(list.genome.compl, "Scheffersomyces stipitis", "Candida auris")
chr.sid <- dedup.merge %>% filter(species %in% chr.sps, grepl("^ref", sid)) %>% pull(sid) 
# Get the protein's gene ID (not refseq ID)
tmp.gene.lnk <- entrez_link(dbfrom = "protein", id = chr.sid, db = "gene", by_id = TRUE)
# extract gene ID
tmp.gene.id <- sapply(tmp.gene.lnk, function(x) x$links$protein_gene)
# fetch summary info from the gene database
tmp.gene <- entrez_summary(db = "gene", id = tmp.gene.id)
# extract gene name
tmp.gene.name <- extract_from_esummary(tmp.gene, elements = "name")
# extract systematic ID (stored in the otheralias field)
tmp.gene.alias <- extract_from_esummary(tmp.gene, elements = "otheraliases")
# extract chromosomal locations, only available for fully assembled genomes
tmp.gene.loc <- bind_rows(extract_from_esummary(tmp.gene, elements = "genomicinfo"), .id = "gene_uid")
# pool the unique chromosome sequence IDs to find out their length
tmp.chr.id <- unique(tmp.gene.loc$chraccver)
# obtain the chromosome length information from the nuccore database
tmp.chr <- entrez_summary("nuccore", id = tmp.chr.id)
tmp.chr.len <- extract_from_esummary(tmp.chr, elements = "slen"); 
names(tmp.chr.len) = tmp.chr.id
# assemble the chromosomal location dataset
chr.loc <- tibble(sid = chr.sid, gene_uid = names(tmp.gene.name), 
                  gene_name = tmp.gene.name, gene_alias = tmp.gene.alias) %>% 
  left_join(tmp.gene.loc, by = "gene_uid") %>% 
  mutate(chrL = tmp.chr.len[chraccver], relLoc = num(chrstart/chrL, digits = 2)) %>% 
  left_join(select(dedup.merge, sid, pid, species, strain), by = "sid") %>% 
  relocate(pid:strain, .after = sid) %>% relocate(sid, .after = last_col()) %>% 
  select(-exoncount) %>% 
  arrange(species, chrloc)
```

```{r}
write_tsv(chr.loc, file = "../../output/20220609-expanded-blast-chromosomal-locations.tsv")
```
