Author: Pablo Garcia-Nieto

Last updated: May 11th, 2021

Goals:

  1. To assess how many genes in the cellxgene data corpus can be mapped to ENSEMBL ids.
  2. To explore and implement solutions for those genes that can’t be mapped.
  3. To make a recommendation about how to move forward (i.e. do we want to upgrade genes in current data corpus? If so, how?)

Summary of current data corpus in cellxgene

Gene symbols in the current cellxgene datasets were looked up in the ENSEMBL GTF files, and each was assigned one of the following states:

  1. CONVERTIBLE - there’s exactly one existing GRCh38 ENSEMBL id for the symbol.
  2. AMBIGUOUS - there are multiple GRCh38 ENSEMBL ids for the symbol.
  3. UNKNOWN (GRCh37: CONVERTIBLE) - there are no GRCh38 ENSEMBL ids for the symbol, but exactly one id was found in GRCh37.
  4. UNKNOWN (GRCh37: AMBIGUOUS) - there are no GRCh38 ENSEMBL ids for the symbol, but more than one id was found in GRCh37.
  5. UNKNOWN (GRCh37: UNKNOWN) - no ENSEMBL ids were found in either GRCh38 or GRCh37.

Table: Summary of gene states in 12 datasets.
Gene state Count
AMBIGUOUS 1317 (0.39%)
CONVERTIBLE 302179 (89%)
UNKNOWN (GRCh37: AMBIGUOUS) 109 (0.032%)
UNKNOWN (GRCh37: CONVERTIBLE) 28835 (8.5%)
UNKNOWN (GRCh37: UNKNOWN) 7910 (2.3%)
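
The percentages in the table can be reproduced from the counts alone. The sketch below is a quick consistency check, assuming each percentage is the state's share of all gene entries summed over the 12 datasets, rounded to two significant figures. The variable names are illustrative, and treating every entry without a unique id as "droppable" is an assumption about how the programmatic conversion would behave.

```python
# Counts copied from the summary table above.
counts = {
    "AMBIGUOUS": 1317,
    "CONVERTIBLE": 302179,
    "UNKNOWN (GRCh37: AMBIGUOUS)": 109,
    "UNKNOWN (GRCh37: CONVERTIBLE)": 28835,
    "UNKNOWN (GRCh37: UNKNOWN)": 7910,
}

total = sum(counts.values())  # all gene entries across the 12 datasets


def percent(state):
    """Share of all gene entries in the given state, in percent."""
    return 100 * counts[state] / total


for state, n in counts.items():
    print(f"{state}: {n} ({percent(state):.2g}%)")

# Assumption: entries without a unique id in either assembly are the ones
# a purely programmatic conversion would have to drop.
droppable = sum(n for state, n in counts.items() if "CONVERTIBLE" not in state)
print(f"not convertible: {droppable} ({100 * droppable / total:.1f}%)")
```

Under these assumptions the non-convertible share comes out just under 3% of all entries.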

Figure: gene states across datasets.

Genes shown in yellow and light green, taken together, are those that can be safely converted.

Recommendations: how to move forward

High-level conclusions:

Based on these conclusions, there are two recommendations:

Recommendation 1: programmatic solution

  1. To programmatically add ENSEMBL ids to the current data corpus:
    1. Convert genes when possible to GRCh38 or GRCh37.
    2. Update GRCh37 IDs to GRCh38 IDs (further exploration is needed to assess whether this is necessary).
    3. Drop genes that are not convertible.
  2. [Optional] To give the original authors a heads-up and potentially wait for their approval.
  3. [Optional] In the meantime, try to solve “unmappable genes”. Those with LO and RP prefixes may have a solution, but further exploration is necessary to assess its feasibility.
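
Steps 1.1 and 1.3 above could look roughly like the sketch below. It assumes two lookup tables (symbol to list of ENSEMBL ids, one per assembly) have already been built; the function name, the table layout, and the toy ids are all hypothetical. Step 1.2 (updating GRCh37 ids to GRCh38) is deliberately left out, since whether it is needed is still open.

```python
def convert_symbols(symbols, grch38, grch37):
    """Return (converted, dropped) for a list of gene symbols.

    grch38 / grch37 map a symbol to the list of ENSEMBL ids found for it.
    A symbol is converted only if it has exactly one id; GRCh37 is consulted
    only when the symbol is absent from GRCh38.
    """
    converted = {}
    dropped = []
    for symbol in symbols:
        if symbol in grch38:
            ids, assembly = grch38[symbol], "GRCh38"
        else:
            ids, assembly = grch37.get(symbol, []), "GRCh37"
        if len(ids) == 1:
            converted[symbol] = (ids[0], assembly)
        else:
            dropped.append(symbol)  # ambiguous or unknown
    return converted, dropped


# Toy tables; the ids are placeholders, not real ENSEMBL ids.
grch38 = {"GENE_A": ["TOY_ID_1"], "GENE_B": ["TOY_ID_2", "TOY_ID_3"]}
grch37 = {"GENE_C": ["TOY_ID_4"]}

converted, dropped = convert_symbols(
    ["GENE_A", "GENE_B", "GENE_C", "GENE_D"], grch38, grch37
)
print(converted)
print(dropped)
```

Keeping the assembly next to each id makes it possible to revisit the GRCh37-derived ids later without re-running the lookup.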

Pros:

  • Easy to implement.
  • Quick turnaround.

Cons:

  • Dropping ~3% of genes.
  • There are likely some false-positive mappings between the current names and ENSEMBL ids.

Recommendation 2: re-curation

  1. To review each individual dataset and:
    1. If the original data has ENSEMBL ids, repeat the curation process using those IDs.
    2. [Optional] Else, ask authors for ENSEMBL ids.
  2. If no ENSEMBL ids are found for a dataset in the previous step, then implement recommendation 1.

Pros:

  • More accurate mapping for those datasets that had ENSEMBL ids in the original data.

Cons:

  • Very time-consuming, as there are more than 120 datasets in our data corpus.

Technical information

File versions

  • GRCh38: ENSEMBL genes and transcripts from release-103 (Homo_sapiens.GRCh38.103.chr.gtf.gz). *
  • GRCh37: ENSEMBL genes and transcripts from release-87 (Homo_sapiens.GRCh37.87.chr.gtf.gz). *
  • When possible, gene symbol aliases are converted to current approved names (hgnc_complete_set.tsv from 10/05/2021).

* I decided to use these GTF versions, which contain only the main assembly (excluding unassembled contigs), because this reduces the number of ambiguous gene symbols (see below).

Process to check the state of gene names

Building reference sets

  1. Parse the GTFs to get gene and transcript tables with three columns each: ID, name and version.
  2. Parse the HGNC symbol table to get mappings between aliases and the latest approved symbol.
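
The two parsing steps above can be sketched as follows. This is a minimal illustration, not the code used for the analysis. It assumes the usual tab-separated GTF layout with key "value" pairs in the ninth column (keys such as gene_id, gene_name, gene_version) and assumes the HGNC table has `symbol`, `alias_symbol` and `prev_symbol` columns with `|`-separated values. Including previous symbols alongside aliases is also an assumption. The function names and the toy inputs are hypothetical.

```python
import csv
import io
import re

# Matches key "value" pairs in the 9th GTF column, e.g. gene_id "...".
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def parse_gtf(lines, feature):
    """Yield (id, name, version) for each GTF record of the given feature.

    feature is "gene" or "transcript"; the attribute keys are assumed to be
    <feature>_id, <feature>_name and <feature>_version.
    """
    for line in lines:
        if line.startswith("#"):  # header comments
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature:
            continue
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        yield (
            attrs.get(f"{feature}_id"),
            attrs.get(f"{feature}_name"),
            attrs.get(f"{feature}_version"),
        )


def parse_hgnc_aliases(handle):
    """Map each alias or previous symbol to its approved symbol."""
    aliases = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        for column in ("alias_symbol", "prev_symbol"):
            for old in (row.get(column) or "").split("|"):
                if old:
                    aliases[old] = row["symbol"]
    return aliases


# Toy inputs in the assumed layouts (ids and symbols are placeholders).
gtf_lines = [
    "#!genome-build GRCh38\n",
    '1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "TOY_G1"; gene_version "5"; gene_name "GENE1";\n',
    '1\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "TOY_G1"; transcript_id "TOY_T1"; '
    'transcript_version "2"; transcript_name "GENE1-201";\n',
]
hgnc_tsv = "symbol\talias_symbol\tprev_symbol\nGENE1\tOLD1|OLD2\tOLDER1\n"

gene_table = list(parse_gtf(gtf_lines, "gene"))
transcript_table = list(parse_gtf(gtf_lines, "transcript"))
alias_table = parse_hgnc_aliases(io.StringIO(hgnc_tsv))
print(gene_table, transcript_table, alias_table)
```

For the real files, the same functions could be fed a text handle opened with `gzip.open(path, "rt")` for the GTFs and a regular `open(path, newline="")` for the HGNC table.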

Checking gene state

For each gene:

  1. Try to upgrade the gene (if it’s an alias in HGNC, update it to the approved symbol; otherwise do nothing).
  2. Assign one of the following (do the checks in order):
    1. If the gene is an ENSEMBL gene or transcript ID, assign “VALID”.
    2. If the gene is an ENSEMBL name with multiple ENSEMBL IDs, assign “AMBIGUOUS”.
    3. If the gene is an ENSEMBL name with exactly one ENSEMBL ID, assign “CONVERTIBLE”.
    4. If the gene is a previously withdrawn HGNC symbol, assign “WITHDRAWN”.
    5. If the gene is an ENSEMBL suffix-stripped name with exactly one ENSEMBL ID, assign “CONVERTIBLE”. A suffix-stripped name is one without the \.\d.* suffix.
  3. If there is no assignment and the gene has a \.\d.* suffix, remove it and repeat 2.
  4. If there is no assignment and the gene has a -.* suffix, remove it and repeat 2.
  5. If there is no assignment so far, assign “UNKNOWN”.
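
The ordered checks above can be sketched as below. This is an illustrative reading of the steps, not the original implementation: the `Reference` container, the function names and the toy data are hypothetical, and applying the `-.*` removal to the (alias-upgraded) symbol rather than to the already dot-stripped one is an assumption, since the steps do not say which.

```python
import re
from dataclasses import dataclass, field

DOT_SUFFIX = re.compile(r"\.\d.*$")   # e.g. ".3" in "CLONE7.3"
DASH_SUFFIX = re.compile(r"-.*$")     # e.g. "-AS9" in "GENE1-AS9"


@dataclass
class Reference:
    ids: set        # ENSEMBL gene and transcript ids
    names: dict     # ENSEMBL name -> set of ENSEMBL ids
    aliases: dict   # HGNC alias -> approved symbol
    withdrawn: set  # withdrawn HGNC symbols
    stripped: dict = field(init=False)

    def __post_init__(self):
        # Index of ENSEMBL names with the ".<digit>..." suffix removed.
        self.stripped = {}
        for name, hits in self.names.items():
            key = DOT_SUFFIX.sub("", name)
            self.stripped.setdefault(key, set()).update(hits)


def check(gene, ref):
    """Step 2: run the ordered checks once; None means no assignment."""
    if gene in ref.ids:
        return "VALID"
    hits = ref.names.get(gene, ())
    if len(hits) > 1:
        return "AMBIGUOUS"
    if len(hits) == 1:
        return "CONVERTIBLE"
    if gene in ref.withdrawn:
        return "WITHDRAWN"
    if len(ref.stripped.get(gene, ())) == 1:
        return "CONVERTIBLE"
    return None


def gene_state(gene, ref):
    gene = ref.aliases.get(gene, gene)           # step 1
    candidates = [gene]                          # step 2 on the symbol itself
    for pattern in (DOT_SUFFIX, DASH_SUFFIX):    # steps 3 and 4
        shorter = pattern.sub("", gene)
        if shorter != gene:
            candidates.append(shorter)
    for candidate in candidates:
        state = check(candidate, ref)
        if state is not None:
            return state
    return "UNKNOWN"                             # step 5


# Toy reference; ids and symbols are placeholders.
ref = Reference(
    ids={"TOY_G1", "TOY_G2", "TOY_G3", "TOY_G4"},
    names={"GENE1": {"TOY_G1"}, "GENE2": {"TOY_G2", "TOY_G3"}, "CLONE7.1": {"TOY_G4"}},
    aliases={"OLD1": "GENE1"},
    withdrawn={"GONE1"},
)
states = {
    symbol: gene_state(symbol, ref)
    for symbol in ["TOY_G1", "OLD1", "GENE2", "GONE1", "CLONE7.3", "GENE1-AS9", "NOPE"]
}
print(states)
```

In the toy data, "GENE1-AS9" resolves to GENE1 once the dash suffix is removed, which shows how this kind of stripping can produce a mapping that is not necessarily the intended gene.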

The process was run first with GRCh38 and then repeated with GRCh37.
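
One way to combine the two passes into the labels used in the summary is sketched below. It assumes the GRCh37 result is only consulted when the GRCh38 pass returns “UNKNOWN”; the function and the toy lookups are hypothetical.

```python
def combined_state(gene, state_in_grch38, state_in_grch37):
    """Merge per-assembly states into a single summary label.

    state_in_grch38 / state_in_grch37 are callables returning the state of a
    gene in one assembly; GRCh37 is only evaluated when GRCh38 is UNKNOWN.
    """
    state = state_in_grch38(gene)
    if state != "UNKNOWN":
        return state
    return f"UNKNOWN (GRCh37: {state_in_grch37(gene)})"


# Toy per-assembly results (placeholders, not real lookups).
toy38 = {"GENE_A": "CONVERTIBLE", "GENE_B": "AMBIGUOUS"}
toy37 = {"GENE_C": "CONVERTIBLE", "GENE_D": "AMBIGUOUS"}


def lookup38(gene):
    return toy38.get(gene, "UNKNOWN")


def lookup37(gene):
    return toy37.get(gene, "UNKNOWN")


labels = [combined_state(g, lookup38, lookup37) for g in ["GENE_A", "GENE_C", "GENE_D", "GENE_E"]]
print(labels)
```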

Deep explorations

This section contains open-ended research that aims to understand gene symbols that are ambiguous or unknown.

Many ambiguous genes were fixed (moved to convertible) after switching from the “full assembly” GTF (which contains unassembled contigs) to the main-assembly GTF.

Ambiguous gene symbols

These are gene symbols that map to more than one ENSEMBL id. Some useful stats about them:

  1. There are 1099 unique genes.
  2. Histogram of the number of times one of these genes is found across datasets:
    Number of datasets Count
    1 1001
    2 60
    3 8
    4 3
    6 1
    7 2
    8 2
    9 3
    10 8
    11 6
    12 5
  3. Top 15 repeated ambiguous genes
    Gene Count
    CYB561D2 12
    HERC3 12
    MATR3 12
    PDE11A 12
    RGS5 12
    ACTL10 11
    DNAJC9-AS1 11
    RMRP 11
    SPATA13 11
    TBCE 11
    TMSB15B 11
    CCDC39 10
    ELFN2 10
    GOLGA8M 10
    KBTBD11-OT1 10
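
Both tables above (the histogram and the top repeated genes) can be derived from a per-dataset list of ambiguous symbols. The sketch below assumes such a mapping exists; the function name and the toy data are hypothetical. As a sanity check, the histogram counts should add up to the number of unique genes, which holds for the table above (the counts sum to 1099).

```python
from collections import Counter


def dataset_histogram(genes_by_dataset):
    """Return (per_gene, histogram).

    per_gene counts in how many datasets each gene appears; histogram counts
    how many genes appear in exactly k datasets.
    """
    per_gene = Counter()
    for genes in genes_by_dataset.values():
        per_gene.update(set(genes))  # count each gene once per dataset
    histogram = Counter(per_gene.values())
    return per_gene, histogram


# Toy input: dataset name -> ambiguous symbols found in it.
toy = {
    "dataset_1": ["GENE_A", "GENE_B"],
    "dataset_2": ["GENE_A"],
    "dataset_3": ["GENE_A", "GENE_C", "GENE_C"],
}
per_gene, histogram = dataset_histogram(toy)

print(sorted(histogram.items()))  # (number of datasets, number of genes)
print(per_gene.most_common(15))   # top repeated genes
```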

Gene symbols not found in ENSEMBL

This section is intended to look for patterns in gene symbols that were not found in any ENSEMBL version (those labeled as “UNKNOWN (GRCh37: UNKNOWN)” above).

Some useful stats about them:

  1. There are 4964 unique genes.
  2. Histogram of the number of times one of these genes is found across datasets:
    Number of datasets Count
    1 4231
    2 45
    3 140
    4 138
    5 145
    6 187
    7 18
    8 5
    9 12
    10 5
    11 10
    12 28
  3. Two-letter prefix counts (only top-50 most frequent are shown):
Prefix Count Examples
RP 1449 RP11-521C22.2, RP11-79P5.10, RP11-157J13.1, RP11-49G2.3
LO 1251 LOC338817, LOC641746, LOC100128496, LOC729987
AC 680 AC007389.5, AC008750.8, AC074327.1, AC008448.1
AL 383 AL513023.1, AL031777.3, AL442647.1, AL592183.1
CT 229 CT978678.1, CTD-2201E9.3, CTC-260E6.12, CTC-788C1.2
CH 123 CH17-125A10.2, CH17-385C13.2, CH507-254M2.2, CH17-174L20.1
AP 51 AP003392.6, AP003386.1, AP003774.3, AP003097.2
FL 44 FLJ39080, FLJ37201, FLJ22447, FLJ33581
uc 36 uc_338-4, uc_338-27, uc_338-18, uc_338-9
SN 35 SNAR-C5, SNAR-C4, SNORD103C, SNAR-F
LL 31 LLNLF-18A12.1, LLNLR-276E7.1, LL0XNC01-116E7.5, LLNLR-285B5.1
LI 30 LINC01969, LINC00535, LINC00535, LINC02083
FA 28 FAM239C, FAM160B2, FAM160B2, FAM166AP4
C1 20 C1orf229, C12orf65, C11orf95, C16orf71
MI 19 MIAT_exon5_2, MIR3179-3-1, MIR3180-4-1, MIR1273E
XX 18 XXbac-BPG246D15.9, XXyac-YR21CG7.1, XXcos-LUCA16.1, XX-FYM637E10_5.1
Cl 17 Clostridiales-1-13, Clostridiales-1-1, Clostridiales-1-15, Clostridiales-1-7
DL 15 DLEU2_3, DLEU2_6, DLEU2_5, DLEU1_1
LA 15 LA16c-335H7.2, LA16c-360H6.1, LA16c-407A10.3, LA16c-380H5.6
KB 14 KB-1572G7.5, KB-1183D5.13, KB-1125A3.12, KB-1552D7.2
CC 13 CCDC151, CCDC114, CCDC114, CCDC114
HO 13 HOTAIRM1_3, HOTAIRM1_5, HOTAIR_2, HOTTIP_3
pR 13 pRNA-9, pRNA-3, pRNA-5, pRNA-4
RN 13 RNA5-8S5-5, RN45S, RNA5-8S5-4, RNMTL1P2
C2 12 C2orf91, C22orf24, C2orf27AP3, C21ORF62-AS1
bP 11 bP-21264C1.2, bP-2171C21.4, bP-2171C21.5, bP-2189O9.3
MG 11 MGC16142, MGC72080, MGC39584, MGC12916
KI 10 KIAA1841, KIAA1841, KIR2DS2, KIR3DS1
BX 9 BX649632.1, BX072566.2, BX072566.1, BX649632.1
C8 9 C8orf59P1, C8ORF37-AS1, C8ORF37-AS1, C8orf31
CX 9 CXORF51B, CXORF51A, CXORF51B, CXORF66
DK 9 DKFZp451B082, DKFZp566F0947, DKFZP434K028, DKFZp686K1684
hs 9 hsa-mir-3675, hsa-mir-8069-1, hsa-mir-3687-1-1, hsa-mir-6724-1-1
RM 9 RMST_10, RMST_5, RMST_8, RMST_9
CU 8 CU463998.3, CU634019.6, CU013544.1, CU634019.6
ME 8 MESTIT1_1, MEF2BNBP1, MEG3_2, MEG8_2
AB 7 ABP1, ABBA01037346.1, ABC12-47964100C23.1, ABBA01037345.1
CM 7 CMB9-22P13.2, CMB9-94B1.2, CMB9-22P13.1, CMB9-14B22.1
GS 7 GS1-25M2.1, GS1-214D18.3, GS1-204I12.3, GS1-259H13.13
H1 7 H19_3-1, H19_3-2, H19_1, H19_2
Si 7 Six3os1_5, Six3os1_1, Six3os1_7, Six3os1_2
TC 7 TCL6_2, TCTEX1D4, TCTEX1D1, TCTEX1D4
Z9 7 Z98744.1, Z97987.1, Z98750.1, Z97987.1
AF 6 AF241734.1, AF065393.1, AF065393.2, AF065393.3
AU 6 AUXG01000518.1, AUXG01000515.2, AUXG01000517.1, AUXG01000516.1
C9 6 C9orf62, C9ORF135-DT, C9orf147, C9orf147
FO 6 FO393415.3, FO538757.3, FO393407.1, FO393415.3
FT 6 FTX_3, FTX_1, FTX_4, FTH1P18
PA 6 PART1_2, PANO1, PAR4, PART1_3
Z8 6 Z84484.1, Z84484.1, Z82246.1, Z82205.1
  4. There are 3087 genes that have a “.” suffix, and these are the top 20 prefixes of those:
Prefix Count
RP 1445
AC 676
AL 383
CT 229
CH 122
AP 51
LL 31
XX 18
LA 15
KB 14
bP 11
BX 9
CU 8
CM 7
GS 7
Z9 7
AB 6
AF 6
AU 6
FO 6
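
The prefix tables above can be produced with a simple count over the first two characters of each symbol. The sketch below is an assumption about how that could be done, not the original code; the function name is hypothetical, the symbols are taken from the examples column, and a “.” suffix is taken to mean a dot followed by a digit.

```python
import re
from collections import Counter, defaultdict

DOT_SUFFIX = re.compile(r"\.\d")  # a "." followed by a digit


def prefix_table(genes, width=2, n_examples=4):
    """Return [(prefix, count, examples)] sorted by decreasing count."""
    genes = list(genes)
    counts = Counter(gene[:width] for gene in genes)
    examples = defaultdict(list)
    for gene in genes:
        if len(examples[gene[:width]]) < n_examples:
            examples[gene[:width]].append(gene)
    return [(prefix, n, examples[prefix]) for prefix, n in counts.most_common()]


genes = ["RP11-521C22.2", "LOC338817", "RP11-79P5.10", "AC007389.5", "LOC641746"]

all_prefixes = prefix_table(genes)
dotted = [gene for gene in genes if DOT_SUFFIX.search(gene)]
dotted_prefixes = prefix_table(dotted)

for prefix, n, shown in all_prefixes:
    print(prefix, n, ", ".join(shown))
```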

Removing “.” suffix

This solution has already been incorporated into the process described above, so the results below are outdated and should be ignored; they are not an accurate representation of the current state.

Does removing those suffixes make “UNKNOWN” genes “CONVERTIBLE”?

After removing those suffixes from the 3087 genes (see above) and removing any suffixes in the ENSEMBL names, these are the new gene states:

Gene state Counts
AMBIGUOUS 312
CONVERTIBLE 31
UNKNOWN (GRCh37: AMBIGUOUS) 897
UNKNOWN (GRCh37: CONVERTIBLE) 19
UNKNOWN (GRCh37: UNKNOWN) 1828
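
The sketch below illustrates the mechanics of stripping the suffix on both sides (query symbols and ENSEMBL names). It uses the \.\d.* pattern from the process description; the function names and the toy reference are hypothetical. It also shows one way a stripped symbol can end up AMBIGUOUS: several reference names collapse onto the same key. Whether that accounts for the counts above has not been verified here.

```python
import re
from collections import defaultdict

DOT_SUFFIX = re.compile(r"\.\d.*$")


def strip_suffix(symbol):
    """Remove a trailing '.<digit>...' suffix, if any."""
    return DOT_SUFFIX.sub("", symbol)


def collapse_reference(name_to_id):
    """Group ENSEMBL ids by suffix-stripped ENSEMBL name."""
    collapsed = defaultdict(set)
    for name, ensembl_id in name_to_id.items():
        collapsed[strip_suffix(name)].add(ensembl_id)
    return dict(collapsed)


def state_after_stripping(symbol, collapsed):
    hits = collapsed.get(strip_suffix(symbol), set())
    if not hits:
        return "UNKNOWN"
    return "CONVERTIBLE" if len(hits) == 1 else "AMBIGUOUS"


# Toy reference: names and ids are placeholders.
reference = {
    "CLONE-1A1.1": "TOY_ID_1",
    "CLONE-1A1.2": "TOY_ID_2",
    "CLONE-2B2.1": "TOY_ID_3",
}
collapsed = collapse_reference(reference)

results = {
    symbol: state_after_stripping(symbol, collapsed)
    for symbol in ["CLONE-1A1.7", "CLONE-2B2.4", "CLONE-9Z9.1"]
}
print(results)
```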

Appendix

Expanded gene category stats per dataset

Table: Counts and percentages of state across datasets
dataset                           AMBIGUOUS   CONVERTIBLE  UNKNOWN (GRCh37: CONVERTIBLE)  UNKNOWN (GRCh37: UNKNOWN)  UNKNOWN (GRCh37: AMBIGUOUS)
A single-cell atlas of the hea …  29 (0.13)   22426 (98)   111 (0.49)                     317 (1.4)                  NA
A Single-Cell Transcriptional …   16 (0.11)   14988 (99)   19 (0.13)                      50 (0.33)                  NA
Direct Exposure to SARS-CoV-2 …   28 (0.084)  32610 (97)   186 (0.56)                     682 (2)                    NA
Evolution of cellular diversit …  11 (0.074)  14701 (100)  9 (0.061)                      47 (0.32)                  NA
Fibroblasts — Cells of the adu …  28 (0.084)  32610 (97)   186 (0.56)                     679 (2)                    NA
Infiltrating Neoplastic Cells …   72 (0.31)   21641 (93)   73 (0.31)                      1517 (6.5)                 5 (0.021)
Multiomics single-cell analysi …  34 (0.13)   25764 (98)   148 (0.56)                     441 (1.7)                  NA
Single cell transcriptional an …  29 (0.13)   21854 (98)   103 (0.46)                     371 (1.7)                  NA
Single-cell RNA-Seq Investigat …  29 (0.13)   18094 (84)   3401 (16)                      75 (0.35)                  NA
Single-cell transcriptomics of …  981 (1.6)   41473 (68)   15232 (25)                     2937 (4.8)                 102 (0.17)
Spatiotemporal analysis of hum …  28 (0.084)  32610 (97)   186 (0.56)                     679 (2)                    NA
Time-resolved Systems Immunolo …  32 (0.098)  23408 (72)   9181 (28)                      115 (0.35)                 2 (0.0061)

Notes on some prefixes

These notes are adapted from a Slack conversation with Brian Raymnor and Ambrose Carr.

RP

  • Potential solution: unknown.
  • These seem to be annotated “fragments of the genome”.
  • Example: RP11-248B24.5
  • Link

LINC

  • Potential solution: upgrade symbol.
  • These seem to be old symbols that can be upgraded.
  • Link

ZN

  • Potential solution: remove suffix.
  • Example: ZNF709
  • Link

LO

  • Potential solution: unknown.
  • They seem to be genomic DNA annotations that are not part of the main genome assembly.
  • Maybe these are genes that originate from those annotations.
  • They are in the “gene” database of NCBI
  • Link
  • Link2

CT

  • Potential solution: remove suffix.
  • Example: CTD-3212A4.2
  • Link

Metazoa

  • Potential solution: unknown.
  • Example: Metazoa_SRP-187.
  • The prefix seems to be for Metazoan signal recognition particle RNA. (Link).
  • The meaning of the suffix is unclear.

uc

  • Potential solution: reformatting.
  • Example: uc_338-27 which could be uc.338
  • Link
  • Link 2
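
A possible shape for the reformatting mentioned above, as a sketch only. It assumes every affected symbol looks like uc_<number>-<number> and that the trailing number can be dropped, which follows the single example given (uc_338-27 to uc.338) but has not been checked against the other symbols. The function name is hypothetical.

```python
import re

# Assumed pattern: "uc_", an element number, "-", and a trailing number.
UC_SYMBOL = re.compile(r"^uc_(\d+)-\d+$")


def reformat_uc(symbol):
    """Rewrite 'uc_338-27' as 'uc.338'; leave other symbols untouched."""
    match = UC_SYMBOL.match(symbol)
    return f"uc.{match.group(1)}" if match else symbol


reformatted = [reformat_uc(s) for s in ["uc_338-27", "uc_338-4", "LOC338817"]]
print(reformatted)
```

Note that several input symbols collapse onto the same reformatted name, so the result would still need to be checked for ambiguity.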

CH

  • Potential solution: unknown.
  • Example: CH17-53B9.4
  • Some examples were found in an angiogenesis database; interestingly, they have links to ENSEMBL ids.
  • Link

AP

  • Potential solution: ignore.
  • In EBI they carry a warning that they have been replaced with newer versions.
  • Link