#> Joining, by = "fastqFileName"
| gene | medium | median_log2cpm | quantile |
|---|---|---|---|
| CKF44_06125 | DMEM | 15.21845 | 0.9984211 |
| CKF44_06125 | YPD | 14.71931 | 0.9982776 |
I tried many intervals, and widening to top 75 netted the most genes whose expression was consistently high across ALL conditions. narrowing the interval decreased the number of genes in that range across ALL conditions down to 5 or less (though most of the genes were highly expressed in, say, 142 / 146 of the unique concat conditions)
| gene | treatment_tally | median_log2cpm | mean_quantile | quantile_sd |
|---|---|---|---|---|
| CKF44_00006 | 146 | 10.30493 | 0.8751453 | 0.0255795 |
| CKF44_00402 | 146 | 10.38472 | 0.8817586 | 0.0281594 |
| CKF44_00509 | 146 | 10.56398 | 0.8965574 | 0.0311383 |
| CKF44_00602 | 146 | 10.51135 | 0.8952115 | 0.0270342 |
| CKF44_01018 | 146 | 10.48432 | 0.8922041 | 0.0290333 |
| CKF44_01083 | 146 | 10.40630 | 0.8787208 | 0.0349441 |
| CKF44_01106 | 146 | 10.33366 | 0.8701196 | 0.0322008 |
| CKF44_01278 | 146 | 10.26794 | 0.8730660 | 0.0352172 |
| CKF44_01475 | 146 | 10.48109 | 0.8862681 | 0.0337783 |
| CKF44_01568 | 146 | 10.39724 | 0.8840611 | 0.0278882 |
| CKF44_01655 | 146 | 10.20040 | 0.8614053 | 0.0286747 |
| CKF44_01722 | 146 | 10.85759 | 0.9204695 | 0.0219071 |
| CKF44_02134 | 146 | 9.88108 | 0.8274114 | 0.0310610 |
| CKF44_02177 | 146 | 10.54762 | 0.8901652 | 0.0317665 |
| CKF44_02486 | 146 | 10.23253 | 0.8722982 | 0.0277540 |
| CKF44_02671 | 146 | 10.78905 | 0.9169755 | 0.0142576 |
| CKF44_02948 | 146 | 10.50731 | 0.8968837 | 0.0286800 |
| CKF44_03439 | 146 | 10.61384 | 0.9030351 | 0.0209172 |
| CKF44_03706 | 146 | 10.56189 | 0.8924391 | 0.0317257 |
| CKF44_03931 | 146 | 10.67367 | 0.9024806 | 0.0280053 |
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
NOTE: there if a pwm/pfm that goes along with this image of course
#> [1] 0.1919815 0.1923186 0.1895406 0.1908105 0.1931301 0.1843017 0.1826488
#> [8] 0.1702303 0.1452700 0.1110761
| A | 0.0385005 | 0.0512811 | 0.0524815 | 0.0330705 | 0.0508494 | 0.0853854 | 0.0896944 | 0.1284762 | 0.1574426 | -0.0029644 |
| C | 0.0291867 | 0.0252927 | 0.0361979 | 0.0385244 | 0.0404272 | 0.0219504 | 0.0640099 | 0.0455409 | -0.0081061 | -0.0846052 |
| G | 0.0569385 | 0.0609691 | 0.0192342 | 0.0387392 | 0.0401427 | 0.0025708 | -0.0135594 | 0.0024607 | -0.0450855 | 0.1868622 |
| T | 0.0673558 | 0.0547757 | 0.0816270 | 0.0804764 | 0.0617109 | 0.0743951 | 0.0425039 | -0.0062476 | 0.0410189 | 0.0117834 |
I am using a core bioconductor data structure for both the annotations and the genome. There are core functions which link the two. So, little of my own code.
Here is an examination of one intron
# this is created in a previous chunk, included here for clarity
# the genome below is a bioconductor package that I made to store the
# kn99 genome, downloaded from GenBank -- it is imminently going
# up on bioconductor, actually.
#
# library(BSgenome.CneoformansVarGrubiiKN99.NCBI.ASM221672v1)
# kn99_genome = BSgenome.CneoformansVarGrubiiKN99.NCBI.ASM221672v1
# chrominfo = kn99_genome@seqinfo
#
# chrominfo = merge(chrominfo,
# Seqinfo(seqnames = "G418", seqlengths = 2565,
# isCircular = FALSE, genome = "ASM221672v1"),
# Seqinfo(seqnames = "NAT", seqlengths = 1650,
# isCircular = FALSE, genome = "ASM221672v1"))
#
# kn99_gff = GenomicFeatures::makeTxDbFromGFF(KN99_GFF_PATH,
# chrominfo = chrominfo)
#
# introns = intronsByTranscript(kn99_gff)
# up_10_from_introns = promoters(introns, upstream = 10, downstream = 0)
#
# up_10_from_introns_seq = getSeq(kn99_genome, unlist(up_10_from_introns))
selected_intron = introns[[4000]]
selected_gene = genes(kn99_gff)[
unique(findOverlaps(selected_intron,
genes(kn99_gff))@to)]$gene_id
selected_exons = exonsBy(kn99_gff, by = "gene")[
names(exonsBy(kn99_gff, by = "gene")) == selected_gene]
# zoom in on exon 1 (furthest right)
## the exon and intron are 1 bp apart
#> [1] TRUE
I should have just included this: this is the definition of how the promoter function works:
promoters generates promoter ranges for each range in x relative to the transcription start site (TSS), where TSS is start(x). The promoter range is expanded around the TSS according to the upstream and downstream arguments. upstream represents the number of nucleotides in the 5’ direction and downstream the number in the 3’ direction. The full range is defined as, (start(x) - upstream) to (start(x) + downstream - 1). For documentation for using promoters on a GRanges object see ?promoters,GenomicRanges-method in the GenomicRanges package.
I’m not quite sure I am following this – so let me explain what I did. I have taken the 5 bp upstream of the start of a given intron, and the 5 bp downstream of a given intron (which extends into the exon), and concatenated those sequences. here is a visualization of one
#> [1] 0.3189775 0.3188008 0.3174729 0.3148040 0.3111479 -0.4023806
#> [7] -0.2898850 0.2531190 0.2850127 0.1214468
Not sure what is going on at positions 5 and 6. This is what just the downstream seqLogo looks like
#> [1] -0.126421241 -0.003477547 0.589958111 0.624813934 0.446056790
knitr::kable(pwm)
| A | -0.0312519 | -0.1509870 | 0.1885408 | 0.1933886 | 0.1110321 |
| C | -0.1509870 | 0.0923087 | 0.1105515 | 0.1507696 | 0.0551158 |
| G | 0.2068047 | -0.1509870 | 0.1781620 | 0.1363068 | 0.2050780 |
| T | -0.1509870 | 0.2061878 | 0.1127038 | 0.1443489 | 0.0748308 |