Expression of GAL1 (06051) and GAl10/UDP-glucose 4-epimerase (06050)
CPM and rank for TEF1 (CNAG_06125) in YPD or DMEM CO2 37 90 min
highly expressed crypto genes – 75th to 95th percentile
MSA of the promoters (100bp upstream) of genes shown above
MSA of 10 bp upstream of all introns
follow ups
- Checking intron extraction
  - promoters
including some exon parts and downstream
What?
- And the PWM

Expression of GAL1 (06051) and GAl10/UDP-glucose 4-epimerase (06050)

#> Joining, by = "fastqFileName"

CPM and rank for TEF1 (CNAG_06125) in YPD or DMEM CO2 37 90 min

gene	medium	median_log2cpm	quantile
CKF44_06125	DMEM	15.21845	0.9984211
CKF44_06125	YPD	14.71931	0.9982776

highly expressed crypto genes – 75th to 95th percentile

I tried many intervals, and widening to top 75 netted the most genes whose expression was consistently high across ALL conditions. narrowing the interval decreased the number of genes in that range across ALL conditions down to 5 or less (though most of the genes were highly expressed in, say, 142 / 146 of the unique concat conditions)

gene	treatment_tally	median_log2cpm	mean_quantile	quantile_sd
CKF44_00006	146	10.30493	0.8751453	0.0255795
CKF44_00402	146	10.38472	0.8817586	0.0281594
CKF44_00509	146	10.56398	0.8965574	0.0311383
CKF44_00602	146	10.51135	0.8952115	0.0270342
CKF44_01018	146	10.48432	0.8922041	0.0290333
CKF44_01083	146	10.40630	0.8787208	0.0349441
CKF44_01106	146	10.33366	0.8701196	0.0322008
CKF44_01278	146	10.26794	0.8730660	0.0352172
CKF44_01475	146	10.48109	0.8862681	0.0337783
CKF44_01568	146	10.39724	0.8840611	0.0278882
CKF44_01655	146	10.20040	0.8614053	0.0286747
CKF44_01722	146	10.85759	0.9204695	0.0219071
CKF44_02134	146	9.88108	0.8274114	0.0310610
CKF44_02177	146	10.54762	0.8901652	0.0317665
CKF44_02486	146	10.23253	0.8722982	0.0277540
CKF44_02671	146	10.78905	0.9169755	0.0142576
CKF44_02948	146	10.50731	0.8968837	0.0286800
CKF44_03439	146	10.61384	0.9030351	0.0209172
CKF44_03706	146	10.56189	0.8924391	0.0317257
CKF44_03931	146	10.67367	0.9024806	0.0280053

MSA of the promoters (100bp upstream) of genes shown above

#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

NOTE: there if a pwm/pfm that goes along with this image of course

MSA of 10 bp upstream of all introns

#>  [1] 0.1919815 0.1923186 0.1895406 0.1908105 0.1931301 0.1843017 0.1826488
#>  [8] 0.1702303 0.1452700 0.1110761

A	0.0385005	0.0512811	0.0524815	0.0330705	0.0508494	0.0853854	0.0896944	0.1284762	0.1574426	-0.0029644
C	0.0291867	0.0252927	0.0361979	0.0385244	0.0404272	0.0219504	0.0640099	0.0455409	-0.0081061	-0.0846052
G	0.0569385	0.0609691	0.0192342	0.0387392	0.0401427	0.0025708	-0.0135594	0.0024607	-0.0450855	0.1868622
T	0.0673558	0.0547757	0.0816270	0.0804764	0.0617109	0.0743951	0.0425039	-0.0062476	0.0410189	0.0117834

follow ups

Checking intron extraction

MSA of 10 bp upstream of promoters. Can you please double check that you don’t have an off-by-one error that’s causing you to include the first base of the intron. The reason I suspect this is that essentially all introns begin with GT and end with AG. You have 100% G in position 10, which should be the last base before the intron. I did not think the base before the intron was so constrained, although I could be wrong about that.

I am using a core bioconductor data structure for both the annotations and the genome. There are core functions which link the two. So, little of my own code.

Here is an examination of one intron

# this is created in a previous chunk, included here for clarity
# the genome below is a bioconductor package that I made to store the 
# kn99 genome, downloaded from GenBank -- it is imminently going
# up on bioconductor, actually.
# 
# library(BSgenome.CneoformansVarGrubiiKN99.NCBI.ASM221672v1)
# kn99_genome = BSgenome.CneoformansVarGrubiiKN99.NCBI.ASM221672v1
# chrominfo = kn99_genome@seqinfo
# 
# chrominfo = merge(chrominfo,
#       Seqinfo(seqnames = "G418", seqlengths = 2565,
#               isCircular = FALSE, genome = "ASM221672v1"),
#       Seqinfo(seqnames = "NAT", seqlengths = 1650,
#               isCircular = FALSE, genome = "ASM221672v1"))
# 
# kn99_gff = GenomicFeatures::makeTxDbFromGFF(KN99_GFF_PATH,
#                                             chrominfo = chrominfo)
# 
# introns = intronsByTranscript(kn99_gff)
# up_10_from_introns = promoters(introns, upstream = 10, downstream = 0)
# 
# up_10_from_introns_seq = getSeq(kn99_genome, unlist(up_10_from_introns))

selected_intron = introns[[4000]]

selected_gene = genes(kn99_gff)[
                        unique(findOverlaps(selected_intron, 
                                            genes(kn99_gff))@to)]$gene_id

selected_exons = exonsBy(kn99_gff, by = "gene")[
  names(exonsBy(kn99_gff, by = "gene")) == selected_gene]

# zoom in on exon 1 (furthest right)

## the exon and intron are 1 bp apart

#> [1] TRUE

I should have just included this: this is the definition of how the promoter function works:

promoters

promoters generates promoter ranges for each range in x relative to the transcription start site (TSS), where TSS is start(x). The promoter range is expanded around the TSS according to the upstream and downstream arguments. upstream represents the number of nucleotides in the 5’ direction and downstream the number in the 3’ direction. The full range is defined as, (start(x) - upstream) to (start(x) + downstream - 1). For documentation for using promoters on a GRanges object see ?promoters,GenomicRanges-method in the GenomicRanges package.

including some exon parts and downstream

Could you also provide the same information for the first few bases of the exon downstream of introns? Actually, why don’t you concatenate the upstream and downstream into a single PWM. Once we trim it down to the most important positions, we can use it to scan for good positions to insert introns in a designed gene.

I’m not quite sure I am following this – so let me explain what I did. I have taken the 5 bp upstream of the start of a given intron, and the 5 bp downstream of a given intron (which extends into the exon), and concatenated those sequences. here is a visualization of one

#>  [1]  0.3189775  0.3188008  0.3174729  0.3148040  0.3111479 -0.4023806
#>  [7] -0.2898850  0.2531190  0.2850127  0.1214468

What?

Not sure what is going on at positions 5 and 6. This is what just the downstream seqLogo looks like

#> [1] -0.126421241 -0.003477547  0.589958111  0.624813934  0.446056790

And the PWM

knitr::kable(pwm)

A	-0.0312519	-0.1509870	0.1885408	0.1933886	0.1110321
C	-0.1509870	0.0923087	0.1105515	0.1507696	0.0551158
G	0.2068047	-0.1509870	0.1781620	0.1363068	0.2050780
T	-0.1509870	0.2061878	0.1127038	0.1443489	0.0748308

Chemostat RnaSeq

Chase Mateusiak, chasem@wustl.edu

2022-02-08