Author: Pablo Garcia-Nieto

Last updated: May 24th, 2021

Goals:

  1. To assess how many mouse genes in the cellxgene data corpus can be mapped to ENSEMBL IDs.
  2. To explore and implement solutions for those genes that can’t be mapped.
  3. To make a recommendation about how to move forward (i.e. do we want to upgrade genes in the current data corpus? If so, how?)

Summary of current data corpus in cellxgene

Gene symbols in the current cellxgene datasets were looked up in the ENSEMBL GTF files and assigned one of the following states:

  1. CONVERTIBLE - there’s exactly one existing GRCm39 ENSEMBL id for the symbol.
  2. AMBIGUOUS - there are multiple GRCm39 ENSEMBL ids for the symbol.
  3. UNKNOWN (GRCm38: CONVERTIBLE) - there are no GRCm39 ENSEMBL ids for the symbol, but exactly one id was found in GRCm38.
  4. UNKNOWN (GRCm38: AMBIGUOUS) - there are no GRCm39 ENSEMBL ids for the symbol, but more than one id was found in GRCm38.
  5. UNKNOWN (GRCm38: UNKNOWN) - no ENSEMBL ids were found in either GRCm39 or GRCm38.
Table: Summary of gene states in 7 datasets.
Gene state | Count (% of total)
AMBIGUOUS | 194 (0.11%)
CONVERTIBLE | 154541 (88%)
UNKNOWN (GRCm38: AMBIGUOUS) | 5 (0.0028%)
UNKNOWN (GRCm38: CONVERTIBLE) | 1058 (0.6%)
UNKNOWN (GRCm38: UNKNOWN) | 19751 (11%)
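
The percentages in the table can be re-derived from the raw counts, which sum to 175549 gene entries across the 7 datasets. The snippet below is a small sanity check rather than the original analysis code; the function name `percentages` is illustrative, and rounding to two significant digits is an assumption that happens to reproduce the figures shown.

```python
# Counts copied from the summary table above.
counts = {
    "AMBIGUOUS": 194,
    "CONVERTIBLE": 154541,
    "UNKNOWN (GRCm38: AMBIGUOUS)": 5,
    "UNKNOWN (GRCm38: CONVERTIBLE)": 1058,
    "UNKNOWN (GRCm38: UNKNOWN)": 19751,
}


def percentages(counts):
    """Return each state's share of the total, rounded to 2 significant digits."""
    total = sum(counts.values())
    return {state: f"{100 * n / total:.2g}%" for state, n in counts.items()}


for state, pct in percentages(counts).items():
    print(f"{state}: {counts[state]} ({pct})")
```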

Figure: gene states across datasets.

Genes shown in yellow and light green can be safely converted.

Recommendations: how to move forward

TODO

Technical information

File versions

  • GRCm39: ENSEMBL genes and transcripts from release-103 (Mus_musculus.GRCm39.104.chr.gtf.gz).
  • GRCm38: ENSEMBL genes and transcripts from release-87 (Mus_musculus.GRCm38.102.chr.gtf.gz).

Process to check the state of gene names

Building reference sets

  • Parse the GTFs to get a gene table and a transcript table, each with three columns: ID, name and version.
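
A minimal sketch of this parsing step is shown below, using only the Python standard library. It assumes the usual ENSEMBL GTF layout (nine tab-separated fields, with attributes such as gene_id, gene_name and gene_version in the last field); the function name `parse_gtf` and the regular expression are illustrative and not taken from the original analysis.

```python
import re

# Matches attribute pairs of the form: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf(lines):
    """Build (ID, name, version) rows for genes and transcripts from GTF lines."""
    genes, transcripts = [], []
    for line in lines:
        if line.startswith("#"):  # header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] not in ("gene", "transcript"):
            continue
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        kind = fields[2]
        row = (
            attrs.get(f"{kind}_id"),
            attrs.get(f"{kind}_name"),  # may be None if the record has no name
            attrs.get(f"{kind}_version"),
        )
        (genes if kind == "gene" else transcripts).append(row)
    return genes, transcripts
```

For a compressed file, the lines can be streamed with `gzip.open(path, "rt")` and passed directly to `parse_gtf`.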

Checking gene state

For each gene:

  1. Assign one of the following (do checks in order):
    1. If gene is an ENSEMBL gene or transcript ID, assign “VALID”.
    2. If gene is an ENSEMBL name with multiple ENSEMBL IDs, assign “AMBIGUOUS”.
    3. If gene is an ENSEMBL name with exactly one ENSEMBL ID, assign “CONVERTIBLE”.
    4. If gene is an ENSEMBL suffix-stripped name with exactly one ENSEMBL ID, assign “CONVERTIBLE”. A suffix-stripped name is one without the \.\d.* suffix.
  2. If no assignment and gene has a \.\d.* suffix, remove it and repeat step 1.
  3. If no assignment and gene has a -.* suffix, remove it and repeat step 1.
  4. If no assignment so far, assign “UNKNOWN”.

The process was first run with GRCm39 and then repeated with GRCm38.
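
The checks above can be sketched roughly as follows. This is one reading of the procedure, not the original code: it assumes that the "repeat" in the suffix-removal steps means re-running the ordered checks of step 1 on the shortened symbol, and that a GRCm38 lookup is only done when GRCm39 yields "UNKNOWN". All function names and the reference layout (a dict with `ids`, `names` and `stripped_names`) are hypothetical.

```python
import re
from collections import defaultdict

DOT_SUFFIX = re.compile(r"\.\d.*$")  # e.g. ".1" in "AC168977.1"
DASH_SUFFIX = re.compile(r"-.*$")    # e.g. "-ps1" in "Olfr240-ps1"


def build_reference(rows):
    """rows: iterable of (ensembl_id, name) pairs for one genome build."""
    ref = {"ids": set(), "names": defaultdict(set), "stripped_names": defaultdict(set)}
    for ensembl_id, name in rows:
        ref["ids"].add(ensembl_id)
        if name:
            ref["names"][name].add(ensembl_id)
            ref["stripped_names"][DOT_SUFFIX.sub("", name)].add(ensembl_id)
    return ref


def check_once(gene, ref):
    """Step 1: ordered checks; returns None when nothing matches."""
    if gene in ref["ids"]:
        return "VALID"
    n_ids = len(ref["names"].get(gene, ()))
    if n_ids > 1:
        return "AMBIGUOUS"
    if n_ids == 1:
        return "CONVERTIBLE"
    if len(ref["stripped_names"].get(gene, ())) == 1:
        return "CONVERTIBLE"
    return None


def gene_state(gene, ref):
    """Steps 1-4 against a single build."""
    state = check_once(gene, ref)
    for suffix in (DOT_SUFFIX, DASH_SUFFIX):
        if state is None and suffix.search(gene):
            gene = suffix.sub("", gene)  # remove the suffix and re-check
            state = check_once(gene, ref)
    return state or "UNKNOWN"


def combined_state(gene, ref_grcm39, ref_grcm38):
    """GRCm39 first; fall back to GRCm38 only for unknown symbols."""
    state = gene_state(gene, ref_grcm39)
    if state != "UNKNOWN":
        return state
    return f"UNKNOWN (GRCm38: {gene_state(gene, ref_grcm38)})"
```

With references built from the GRCm39 and GRCm38 tables, `combined_state(symbol, ref39, ref38)` returns labels in the same form as the states listed in the summary.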

Deep explorations

This section contains open-ended research that aims to understand gene symbols that are unknown.

Gene symbols not found in ENSEMBL

This section looks for patterns in gene symbols that were not found in any ENSEMBL version (those labeled as “UNKNOWN (GRCm38: UNKNOWN)” above).

Some useful stats about them:

  1. There are 14975 unique genes.
  2. Histogram of the number of times one of these genes is found across datasets:
    Number of datasets Count
    1 12520
    2 1394
    3 379
    4 285
    5 240
    6 133
    7 24
  3. Two-letter prefix counts (only top-50 most frequent are shown):
Prefix Count Examples
Gm 6079 Gm15698, Gm34858, Gm5538, Gm17365
LO 5502 LOC105244415, LOC105244438, LOC105246620, LOC100417215
c1 340 c13_tRNA-Lys-AAA, c1_tRNA-Leu-TTA, c13_tRNA-Met-i, c19_tRNA-Ser-AGY
49 214 4930455C21Rik, 4930480K23Rik, 4930583H14Rik, 4930539E08Rik
17 131 1700091E21Rik, 1700016G14Rik, 1700011J10Rik, 1700101G07Rik
mt 97 mt_AK165865, mt_AK165190, mt_AF071428, mt_AK157367
RP 92 RP24-308M10.2, RP24-286L23.3, RP24-177G14.2, RP24-484I1.3
BC 81 BC002163, BC068157, BC037034, BC049730
Zf 61 Zfp383, Zfp292, Zfp276, Zfp71-rs1
Mi 58 Mir5124, Mir219-2, Mira, Mir873
Vm 56 Vmn2r-ps71, Vmn2r-ps138, Vmn2r123, Vmn1r-ps44
23 52 2310014L17Rik, 2310001H18Rik, 2310010M20Rik, 2310031A07Rik
AC 52 AC168977.1, AC099934.3, AC164099.2, AC158605.2
ZN 50 ZNF507, ZNF76, ZNF513, ZNF235
11 46 1110018J18Rik, 1110021L09Rik, 1190005F20Rik, 1100001G20Rik
cx 37 cx_tRNA-Thr-ACY, cx_tRNA-Val-GTA, cx_tRNA-Leu-TTG, cx_tRNA-Pro-CCA
Ol 36 Olfr240-ps1, Olfr182-ps1, Olfr766, Olfr940-ps1
c3 35 c3_tRNA-Val-GTG, c3_tRNA-Gly-GGA, c3_tRNA-Pro-CCA, c3_tRNA-Gln-CAA
c7 35 c7_tRNA-Leu-CTG, c7_tRNA-Pro-CCY, c7_tRNA-Gly-GGG, c7_tRNA-Tyr-TAC
c6 34 c6_tRNA-Phe-TTY, c6_tRNA-Trp-TGG, c6_tRNA-Thr-ACA, c6_tRNA-Ala-GCY_
18 32 1810019J16Rik, 1810043H04Rik, 1810035I16Rik, 1810032O08Rik
c2 32 c2_tRNA-Gly-GGA, c2_tRNA-Ser-AGY, c2_tRNA-Ile-ATA, c2_tRNA-Gln-CAA
c5 31 c5_tRNA-Gln-CAG, c5_tRNA-Gly-GGY, c5_tRNA-Ala-GCY_, c5_tRNA-Asn-AAC
24 28 2400001E08Rik, 2410066E13Rik, 2410012M07Rik, 2410016O06Rik
26 28 2610034B18Rik, 2610101N10Rik, 2610203C20Rik, 2610039C10Rik
c9 28 c9_tRNA-Ile-ATA, c9_tRNA-Met, c9_tRNA-Val-GTA, c9_LSU-rRNA_Hsa
28 27 2810432D09Rik, 2810422J05Rik, 2810407C02Rik, 2810403A07Rik
c4 27 c4_tRNA-Tyr-TAT, c4_tRNA-Gly-GGG, c4_tRNA-Ala-GCA, c4_tRNA-Ala-GCY_
c8 26 c8_tRNA-Ile-ATT, c8_tRNA-Lys-AAA, c8_tRNA-Ile-ATA, c8_tRNA-Glu-GAG_
20 24 2010002M12Rik, 2010005H15Rik, 2010109I03Rik, 2010005H15Rik
D1 24 D14Ertd668e, D17H6S56E-3, D14Ertd668e, D17Ertd648e
AI 23 AI854517, AI314976, AI464131, AI481877
MI 23 MIR365A, MIR486-1, MIR151A, MIR544A
Fa 22 Fam21, Fam108b, Fam48a, Fam108a
06 20 0610007P22Rik, 0610037L13Rik, 0610038L08Rik, 0610011F06Rik
22 20 2210416O15Rik, 2210009G21Rik, 2210415F13Rik, 2210415F13Rik
Hi 18 Hist2h2aa2, Hist2h2aa2, Hist1h1a, Hist1h1b
FA 17 FAM160A2, FAM110D, FAM110D, FAM160B1
31 16 3110062M04Rik, 3110007F17Rik, 3110057O12Rik, 3110002H16Rik
57 16 5730457N03Rik, 5730494M16Rik, 5730457N03Rik, 5730528L13Rik
AT 16 ATP5F1E, ATP5PD, ATP5F1A, ATP5PD
H1 16 H1f0, H1f0, H1fnt, H1-7
At 15 Atp5h, Atpif1, Atp5h, Atpif1
TR 15 TRIM43, TRD, TRIM75P, TRBV22-1
15 14 1500015O10Rik, 1500016L03Rik, 1500015O10Rik, 1500032L24Rik
27 14 2700094K13Rik, 2700078E11Rik, 2700089E24Rik, 2700060E02Rik
63 14 6330408A02Rik, 6330578E17Rik, 6330439K17Rik, 6330416G13Rik
C3 14 C330019G07Rik, C330046G13Rik, C330005M16Rik, C330027C09Rik
Sp 14 Spry3, Speer4f, Spnb1, Spnb3
Mt 13 Mtap7d1, Mtap7d3, Mtag2, Mtap4
  4. There are 156 genes that have a “.” suffix, and these are the top 20 prefixes of those:
Prefix Count
RP 91
AC 51
CT 6
Ep 3
CA 2
AL 1
CR 1
WI 1
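
The statistics in this section can be reproduced along the following lines. This is a sketch under assumptions: the input is taken to be a mapping from dataset name to the list of unknown symbols in that dataset, prefixes are counted over unique symbols, and the function names are illustrative rather than those used in the original analysis.

```python
from collections import Counter, defaultdict


def dataset_histogram(genes_by_dataset):
    """Map 'number of datasets a gene appears in' to the number of such genes."""
    n_datasets = Counter()
    for genes in genes_by_dataset.values():
        n_datasets.update(set(genes))  # count each gene once per dataset
    return dict(sorted(Counter(n_datasets.values()).items()))


def prefix_table(genes, top=50, n_examples=4):
    """Two-letter prefix counts over unique symbols, with a few examples each."""
    unique = sorted(set(genes))
    counts = Counter(gene[:2] for gene in unique)
    examples = defaultdict(list)
    for gene in unique:
        if len(examples[gene[:2]]) < n_examples:
            examples[gene[:2]].append(gene)
    return [(prefix, n, examples[prefix]) for prefix, n in counts.most_common(top)]


def with_dot_suffix(genes):
    """Unique symbols that contain a '.' (e.g. RP24-308M10.2)."""
    return sorted({gene for gene in genes if "." in gene})
```

Calling `prefix_table(with_dot_suffix(genes), top=20)` gives a prefix breakdown restricted to the “.”-suffixed symbols.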

Appendix

Expanded gene category stats per dataset

Table: Counts and percentages of each gene state across datasets. Cells show count (%).
dataset | AMBIGUOUS | CONVERTIBLE | UNKNOWN (GRCm38: CONVERTIBLE) | UNKNOWN (GRCm38: UNKNOWN) | UNKNOWN (GRCm38: AMBIGUOUS)
A Single-Cell Transcriptional … | 12 (0.08) | 14598 (98) | 43 (0.29) | 266 (1.8) | NA
A transcriptomic atlas of the … | 24 (0.098) | 23026 (94) | 110 (0.45) | 1249 (5.1) | NA
Adult mouse cortical cell taxo … | 82 (0.12) | 54030 (77) | 526 (0.75) | 15131 (22) | 5 (0.0072)
All — A single-cell transcript … | 26 (0.13) | 18184 (90) | 79 (0.39) | 1849 (9.2) | NA
An integrated transcriptomic a … | 30 (0.12) | 23348 (97) | 146 (0.6) | 616 (2.6) | NA
Glutamatergic neurons — An Atl … | 20 (0.091) | 21105 (96) | 154 (0.7) | 636 (2.9) | NA
Molecular, spatial and project … | NA | 250 (98) | NA | 4 (1.6) | NA