Technical information

ENSEMBL GTF versions:

Summary of current data corpus in cellxgene

Gene symbols in current cellxgene datasets were looked up in the ENSEMBL GTF files, and each was assigned one of the following states:

  1. CONVERTIBLE - there is exactly one GRCh38 ENSEMBL id for the symbol.
  2. AMBIGUOUS - there are multiple GRCh38 ENSEMBL ids for the symbol.
  3. UNKNOWN (GRCh37: CONVERTIBLE) - there are no GRCh38 ENSEMBL ids for the symbol, but exactly one id was found in GRCh37.
  4. UNKNOWN (GRCh37: AMBIGUOUS) - there are no GRCh38 ENSEMBL ids for the symbol, but more than one id was found in GRCh37.
  5. UNKNOWN (GRCh37: UNKNOWN) - no ENSEMBL ids were found in either GRCh38 or GRCh37.
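The five states above can be expressed as a small lookup routine. The following is a minimal sketch, not the code used for this analysis: it assumes the GTF files have already been parsed into dictionaries that map each gene symbol to the set of ENSEMBL ids carrying that symbol. The function name, the dictionary names, and the toy symbols and ids are all hypothetical.

```python
from typing import Dict, Set


def classify_symbol(symbol: str,
                    grch38: Dict[str, Set[str]],
                    grch37: Dict[str, Set[str]]) -> str:
    """Return the state of a gene symbol given symbol -> ENSEMBL id lookups."""
    ids38 = grch38.get(symbol, set())
    if len(ids38) == 1:
        return "CONVERTIBLE"
    if len(ids38) > 1:
        return "AMBIGUOUS"
    # No GRCh38 id was found, so fall back to GRCh37.
    ids37 = grch37.get(symbol, set())
    if len(ids37) == 1:
        return "UNKNOWN (GRCh37: CONVERTIBLE)"
    if len(ids37) > 1:
        return "UNKNOWN (GRCh37: AMBIGUOUS)"
    return "UNKNOWN (GRCh37: UNKNOWN)"


# Toy lookups with made-up symbols and ids (not real ENSEMBL data).
grch38 = {"GENE_A": {"ID38_1"}, "GENE_B": {"ID38_2", "ID38_3"}}
grch37 = {"GENE_C": {"ID37_1"}, "GENE_D": {"ID37_2", "ID37_3"}}

states = {
    s: classify_symbol(s, grch38, grch37)
    for s in ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
}
```

GRCh37 is consulted only when GRCh38 has no id at all, which mirrors the order in which the states are defined.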

Numbers in parentheses are percentages.

Table: Summary of gene states in 12 datasets.
Gene state Count
AMBIGUOUS 15264 (4.5)
CONVERTIBLE 276961 (81)
UNKNOWN (GRCh37: AMBIGUOUS) 1483 (0.44)
UNKNOWN (GRCh37: CONVERTIBLE) 34991 (10)
UNKNOWN (GRCh37: UNKNOWN) 11651 (3.4)
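
The percentages in the table can be rechecked from the counts alone. This short sketch assumes each percentage is the state count divided by the sum of all five counts, formatted to two significant figures; the variable names are illustrative.

```python
# Counts copied from the summary table above.
counts = {
    "AMBIGUOUS": 15264,
    "CONVERTIBLE": 276961,
    "UNKNOWN (GRCh37: AMBIGUOUS)": 1483,
    "UNKNOWN (GRCh37: CONVERTIBLE)": 34991,
    "UNKNOWN (GRCh37: UNKNOWN)": 11651,
}

total = sum(counts.values())

# Percentage of each state, formatted to two significant figures.
percentages = {state: f"{100 * n / total:.2g}" for state, n in counts.items()}

for state, pct in percentages.items():
    print(f"{state}: {counts[state]} ({pct})")
```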

Figure: gene states across datasets.

Genes shown in yellow and light green, taken together, are the ones that can be safely converted.

Exploration of gene symbols not found in ENSEMBL

This section looks for patterns in gene symbols that were not found in any ENSEMBL version (those labeled as “UNKNOWN (GRCh37: UNKNOWN)” above).

Some useful stats about them:

  1. There are 8302 unique genes.
  2. Histogram of the number of datasets in which each of these genes is found:
    Number of datasets Count
    1 7332
    2 50
    3 233
    4 202
    5 212
    6 259
    7 14
  3. Two-letter prefix counts (only the 50 most frequent prefixes are shown):
Prefix Count Examples
RP 2270 RP11-567F11.2, RP11-113D19.9, RP11-775C24.5, RP11-203H19.2
LO 1252 LOC646903, LOC219347, LOC100128750, LOC93622
AC 900 AC096558.2, AC122710.3, AC068675.1, AC126281.3
Y_ 785 Y_RNA-226, Y_RNA-745, Y_RNA-659, Y_RNA-195
AL 470 AL358975.1, AL136221.3, AL133330.2, AL606763.1
SN 427 SNORA26-6, SNORA51-3, SNORA62-5, SNORA64-4
CT 338 CTD-2619J13.27, CTD-3234P18.6, CTD-3149D2.2, CTD-2145A24.5
Me 223 Metazoa_SRP-195, Metazoa_SRP-74, Metazoa_SRP-109, Metazoa_SRP-70
CH 125 CH507-152C13.5, CH17-140K24.7, CH17-140K24.2, CH17-360D5.2
AP 88 AP003386.1, AP003397.1, AP002759.1, AP001646.3
LI 60 LINC01759, LINC01336, LIMS3.1, LINC01578
U3 55 U3-16, U3-49, U3-8, U3-39
FL 50 FLJ23867, FLJ16171, FLJ33534, FLJ32063
sn 48 snoU13-10, snoMBII-202-1, snoU13-25, snoMe28S-Am2634-1
LL 36 LLNLF-173C4.1, LLNLR-304G9.1, LLNLF-18A12.1, LLNLR-285B5.1
SC 36 SCARNA16-2, SCARNA9L, SCARNA17-4, SCARNA20-3
uc 36 uc_338-3, uc_338-30, uc_338-24, uc_338-29
OR 30 OR4N3P-1, OR4C16-1, OR5H6-1, OR5L1-1
FA 28 FAM95B1-1, FAS-AS1, FABP5P13-1, FAM24B-CUZD1
MI 25 MIR3180-1-1, MIR3180-3-1, MIAT_exon5_3, MIR4697HG
XX 25 XXbac-B33L19.12, XXbac-BPG258E24.10, XXbac-BPG154L12.5, XXbac-BPG13B8.11
HO 23 HOTTIP_2, HOTAIRM1_3, HOTAIRM1_2, HOXA11-AS1_4
LA 21 LA16c-360A4.1, LA16c-360H6.1, LA16c-390H2.1, LA16c-380F5.1
SE 21 SEPT14P24, SEPT14P2, SEPT7-AS1, SEPT14P8
U6 21 U62317.5, U6-10, U6-15, U62317.5
U8 21 U8-19, U8-11, U8-21, U8-3
ZN 20 ZNF668.1, ZNF26.1, ZNRD1-AS1_1, ZNFX1-AS1_3
DL 19 DLEU1-AS1, DLEU1-AS1, DLEU2_1-1, DLX2-AS1
BX 18 BX649632.1, BX649632.1, BX005132.2, BX284668.6
U2 18 U2-18, U2-9, U2-11, U2-10
5S 17 5S_rRNA-4, 5S_rRNA-2, 5S_rRNA-13, 5S_rRNA-14
Cl 17 Clostridiales-1-11, Clostridiales-1-13, Clostridiales-1-3, Clostridiales-1-4
RN 17 RNASEK-C17ORF49, RNA5SP506, RNA5-8S5-7, RN45S
U1 16 U1-15, U1-10, U1.1, U1-8
KB 15 KB-176G8.1, KB-1572G7.5, KB-1396H2.2, KB-1125A3.12
AB 14 ABCF2.1, ABC7-42391500H16.4, AB015752.1, ABBA01037349.1
U7 14 U7-10, U7-14, U7-3, U7-12
pR 13 pRNA-8, pRNA-4, pRNA, pRNA-11
AF 12 AF011889.5, AF065393.2, AF127936.2, AF111168.2
bP 12 bP-2171C21.4, bP-2171C21.3, bP-2168N6.3, bP-21264C1.1
PR 12 PRICKLE2-AS1-1, PRCAT47, PRAMEF28, PRO0611
SP 12 SPRY4-IT1_1, SPRY4-IT1_2, SPG20-AS1, SPDYE14P
ST 12 ST7-OT3_3, ST7-OT4_4, ST7-AS2_1, ST7-OT4_2
C1 11 C10orf32-AS3MT, C1QTNF9B-AS1-1, C10ORF71-AS1, C1orf140
DK 11 DKFZP434K028, DKFZp686K1684, DKFZp686D0853, DKFZP434I0714
GS 11 GS1-345D13.1, GS1-214D18.3, GS1-54N10.1, GS1-114I9.3
MG 11 MGC45800, MGC2889, MGC27345, MGC16025
SM 11 SMAD5-AS1_2-1, SMAD5-AS1_1, SMIM25, SMIM37
CX 10 CXORF49, CXORF51B, CXORF51B, CXORF38
RM 10 RMST_8, RMST_7, RMRP-1, RMST_3
  4. There are 4500 genes that have a “.” suffix; the top 20 prefixes among them are:
Prefix Count
RP 2258
AC 887
AL 469
CT 336
CH 124
AP 88
LL 34
XX 25
LA 21
KB 15
BX 14
AB 13
AF 12
bP 12
GS 11
ZN 10
Z8 9
Z9 9
CM 7
AU 6
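
The prefix tables above can be reproduced with a simple counter. This is a sketch under two assumptions that the text does not spell out: the prefix is taken as the first two characters of the symbol, and a “.” suffix means a dot followed by digits at the end of the symbol. The sample symbols are taken from the examples listed above; the variable names are illustrative.

```python
import re
from collections import Counter

# A handful of symbols taken from the example column above.
symbols = [
    "RP11-567F11.2", "RP11-113D19.9", "AC096558.2",
    "LOC646903", "Y_RNA-226", "LINC01759",
]

# Two-letter prefix counts; most_common(50) would give the top-50 table.
prefix_counts = Counter(s[:2] for s in symbols)

# Assumed definition of a "." suffix: a dot followed by digits at the end.
DOT_SUFFIX = re.compile(r"\.\d+$")
with_suffix = [s for s in symbols if DOT_SUFFIX.search(s)]

# Prefix counts restricted to symbols that carry a "." suffix.
suffix_prefix_counts = Counter(s[:2] for s in with_suffix)

print(prefix_counts.most_common(50))
print(suffix_prefix_counts.most_common(20))
```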

Removing “.” suffix

Does removing those suffixes make “UNKNOWN” genes “CONVERTIBLE”?

After removing those suffixes from the 4500 genes (see above), and after removing any suffixes from the ENSEMBL names, these are the new gene states:

Gene state Count
AMBIGUOUS 339
CONVERTIBLE 426
UNKNOWN (GRCh37: AMBIGUOUS) 991
UNKNOWN (GRCh37: CONVERTIBLE) 748
UNKNOWN (GRCh37: UNKNOWN) 1996
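
The suffix-removal step can be sketched as follows. This is an illustration under stated assumptions, not the original analysis code: a “.” suffix is taken to be a dot followed by digits at the end of the name, and ENSEMBL names that collapse to the same stripped name have their ids merged. The function names, gene names, and ids are made up.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Assumed definition of a "." suffix: a dot followed by digits at the end.
DOT_SUFFIX = re.compile(r"\.\d+$")


def strip_dot_suffix(symbol: str) -> str:
    """Remove a trailing '.<digits>' suffix, if there is one."""
    return DOT_SUFFIX.sub("", symbol)


def build_stripped_index(name_to_ids: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Index ids by suffix-free name, merging names that collapse together."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, ids in name_to_ids.items():
        index[strip_dot_suffix(name)].update(ids)
    return dict(index)


# Made-up ENSEMBL-style names and ids (not real data).
ensembl_names = {
    "GENEA.1": {"ID_1"},
    "GENEA.2": {"ID_2"},
    "GENEB.3": {"ID_3"},
}
index = build_stripped_index(ensembl_names)

# A dataset symbol is looked up by its suffix-free form.
ids_for_b = index.get(strip_dot_suffix("GENEB.7"), set())  # one id
ids_for_a = index.get(strip_dot_suffix("GENEA.5"), set())  # two ids
```

Merging ids under the stripped name is one way stripping can turn an unknown symbol into an ambiguous one: two names that differed only in their suffix now compete for the same symbol.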

Notes on some prefixes

These notes are adapted from a Slack conversation with Brian Raymnor and Ambrose Carr.

RP

  • Potential solution: remove suffix.
  • Example: RP11-248B24.5
  • Link

LINC

  • Potential solution: upgrade symbol.
  • These seem to be old symbols that can be upgraded.
  • Link

ZN

  • Potential solution: remove suffix.
  • Example: ZNF709
  • Link

LO

  • Potential solution: unknown.
  • They seem to be genomic DNA annotations that are not part of the main genome assembly.
  • They may be genes that originate from those annotations.
  • They are in the “gene” database of NCBI.
  • Link
  • Link2

CT

  • Potential solution: remove suffix.
  • Example: CTD-3212A4.2
  • Link

Metazoa

  • Potential solution: unknown.
  • Example: Metazoa_SRP-187.
  • The prefix seems to be for Metazoan signal recognition particle RNA. (Link).
  • The meaning of the suffix is unclear.

uc

  • Potential solution: reformatting.
  • Example: uc_338-27 which could be uc.338
  • Link
  • Link 2
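
The reformatting suggested for the uc prefix could look like the sketch below. It assumes that every affected symbol follows the pattern uc_<number>-<number> and that the trailing number can be dropped, which is only inferred from the single example above; the function name is hypothetical.

```python
import re

# Assumed pattern: "uc_", a number, a hyphen, and a trailing number.
UC_PATTERN = re.compile(r"^uc_(\d+)-\d+$")


def reformat_uc(symbol: str) -> str:
    """Rewrite 'uc_<n>-<m>' as 'uc.<n>'; leave other symbols unchanged."""
    return UC_PATTERN.sub(r"uc.\1", symbol)


reformatted = reformat_uc("uc_338-27")
untouched = reformat_uc("LOC646903")
```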

CH

  • Potential solution: unknown.
  • Example: CH17-53B9.4
  • Some examples were found in an angiogenesis database; interestingly, they have links to ENSEMBL ids.
  • Link

AP

  • Potential solution: ignore.
  • In EBI, these entries carry a warning that they have been replaced with newer versions.
  • Link

Appendix

Table: Counts and percentages of state across datasets
dataset AMBIGUOUS CONVERTIBLE UNKNOWN (GRCh37: AMBIGUOUS) UNKNOWN (GRCh37: CONVERTIBLE) UNKNOWN (GRCh37: UNKNOWN)
A single-cell atlas of the hea … 5 (4.7) 5 (92) 5 (0.12) 5 (1.7) 5 (1.9)
A Single-Cell Transcriptional … 4 (5.2) 4 (94) 4 (0.11) 4 (1) NA
Direct Exposure to SARS-CoV-2 … 5 (4.9) 5 (92) 5 (0.003) 5 (0.52) 5 (2.7)
Evolution of cellular diversit … 5 (5) 5 (95) 5 (0.0068) 5 (0.074) 5 (0.1)
Fibroblasts — Cells of the adu … 5 (4.9) 5 (92) 5 (0.003) 5 (0.52) 5 (2.7)
Infiltrating Neoplastic Cells … 5 (5.7) 5 (88) 5 (0.047) 5 (0.36) 5 (6.3)
Multiomics single-cell analysi … 5 (4.3) 5 (92) 5 (0.12) 5 (1.7) 5 (2.3)
Single cell transcriptional an … 5 (4.1) 5 (92) 5 (0.12) 5 (1.6) 5 (2.1)
Single-cell RNA-Seq Investigat … 5 (4.5) 5 (77) 5 (0.74) 5 (18) 5 (0.083)
Single-cell transcriptomics of … 5 (3.6) 5 (57) 5 (1.2) 5 (29) 5 (9.6)
Spatiotemporal analysis of hum … 5 (4.9) 5 (92) 5 (0.003) 5 (0.52) 5 (2.7)
Time-resolved Systems Immunolo … 5 (3.7) 5 (59) 5 (1.4) 5 (36) 5 (0.35)