ENSEMBL status of genes in the cellxgene data corpus

Summary of current data corpus in cellxgene
Recommendations: how to move forward
- Recommendation 1: programmatic solution
- Recommendation 2: re-curation
Technical information
- File versions
- Process to check the state of gene names
Deep explorations
Appendix
- Expanded gene category stats per dataset
- Notes on some prefixes

Author: Pablo Garcia-Nieto

Last updated: May 11th, 2021

Goals:

To assess how many genes in the cellxgene data corpus can be mapped to ENSEMBL ids.
To explore and implement solutions for those genes that can’t be mapped.
To make a recommendation about how to move forward (i.e. do we want to upgrade genes in the current data corpus? If so, how?).

Summary of current data corpus in cellxgene

Gene symbols in the current cellxgene datasets were looked up in the ENSEMBL GTF files and assigned one of the following states:

CONVERTIBLE - there is exactly one GRCh38 ENSEMBL id for the symbol.
AMBIGUOUS - there are multiple GRCh38 ENSEMBL ids for the symbol.
UNKNOWN (GRCh37: CONVERTIBLE) - there are no GRCh38 ENSEMBL ids for the symbol, but exactly one id was found in GRCh37.
UNKNOWN (GRCh37: AMBIGUOUS) - there are no GRCh38 ENSEMBL ids for the symbol, but more than one id was found in GRCh37.
UNKNOWN (GRCh37: UNKNOWN) - no ENSEMBL ids were found in either GRCh38 or GRCh37.
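
The five states follow from how many ids a symbol has in each assembly. A minimal sketch, assuming two hypothetical dictionaries (`grch38`, `grch37`) that map a gene symbol to the set of ENSEMBL ids carrying that name. The function name and the toy ids are illustrative and are not taken from the actual pipeline:

```python
def gene_state(symbol, grch38, grch37):
    """Return one of the five states for a symbol.

    grch38 and grch37 are hypothetical dicts: symbol -> set of ENSEMBL ids.
    """
    n38 = len(grch38.get(symbol, ()))
    if n38 == 1:
        return "CONVERTIBLE"
    if n38 > 1:
        return "AMBIGUOUS"
    # No GRCh38 id: fall back to GRCh37 and report what was found there.
    n37 = len(grch37.get(symbol, ()))
    if n37 == 1:
        return "UNKNOWN (GRCh37: CONVERTIBLE)"
    if n37 > 1:
        return "UNKNOWN (GRCh37: AMBIGUOUS)"
    return "UNKNOWN (GRCh37: UNKNOWN)"


# Toy reference sets with made-up names and ids, for illustration only.
grch38 = {"GENE_A": {"ENSG_1"}, "GENE_B": {"ENSG_2", "ENSG_3"}}
grch37 = {"GENE_C": {"ENSG_4"}}

for s in ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]:
    print(s, gene_state(s, grch38, grch37))
```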

Table: Summary of gene states in 12 datasets.

Gene state	Count
AMBIGUOUS	1317 (0.39%)
CONVERTIBLE	302179 (89%)
UNKNOWN (GRCh37: AMBIGUOUS)	109 (0.032%)
UNKNOWN (GRCh37: CONVERTIBLE)	28835 (8.5%)
UNKNOWN (GRCh37: UNKNOWN)	7910 (2.3%)
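
The percentages in the table can be recomputed from the counts. A small sketch using only the numbers listed above; the variable names are illustrative:

```python
# Counts copied from the summary table.
counts = {
    "AMBIGUOUS": 1317,
    "CONVERTIBLE": 302179,
    "UNKNOWN (GRCh37: AMBIGUOUS)": 109,
    "UNKNOWN (GRCh37: CONVERTIBLE)": 28835,
    "UNKNOWN (GRCh37: UNKNOWN)": 7910,
}

total = sum(counts.values())
percent = {state: 100 * n / total for state, n in counts.items()}

for state, p in percent.items():
    # Two significant digits, as in the table.
    print(f"{state}\t{counts[state]} ({p:.2g}%)")

# Genes with exactly one id in either GRCh38 or GRCh37.
safe = percent["CONVERTIBLE"] + percent["UNKNOWN (GRCh37: CONVERTIBLE)"]
print(f"Safely convertible: {safe:.1f}%")
```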

Figure: gene states across datasets.

Genes shown in yellow and light green can be safely converted.

Recommendations: how to move forward

High-level conclusions:

About 97.3% of genes can be safely converted to an ENSEMBL id (CONVERTIBLE plus UNKNOWN (GRCh37: CONVERTIBLE)).
About 2.3% are not found in the mapping table. These are dominated by genes with the prefixes RP, LO, AC, AL, and CT (e.g. RP11-349F21.5, LOC392232, AC114400.1, CTB-52I2.7).
Implementing a conversion algorithm would be simple, and it could be added to cellxgene-schema or shipped as a peripheral script.

Based on these conclusions, there are two recommendations:

Recommendation 1: programmatic solution

To programmatically add ENSEMBL ids to the current data corpus:
1. Convert genes to GRCh38 or GRCh37 ids when possible.
2. Update GRCh37 ids to GRCh38 ids (further exploration is required to assess whether this is needed).
3. Drop genes that are not convertible.
[Optional] Give the original authors a heads-up and potentially wait for their approval.
[Optional] In the meantime, try to resolve the “unmappable genes”. Those with LO and RP prefixes may have a solution, but further exploration is necessary to assess feasibility.
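
The numbered steps could be sketched as follows. This is not an existing cellxgene-schema function: `convert_symbols`, `unique_38` and `unique_37` are hypothetical names, and the two dictionaries are assumed to hold only symbols with exactly one ENSEMBL id in the respective assembly. Step 2 is left out because it still needs exploration.

```python
def convert_symbols(symbols, unique_38, unique_37):
    """Split symbols into converted (symbol -> id) and dropped.

    unique_38 / unique_37 are hypothetical dicts holding only symbols with
    exactly one ENSEMBL id in GRCh38 / GRCh37 respectively.
    """
    converted, dropped = {}, []
    for symbol in symbols:
        if symbol in unique_38:      # step 1: prefer GRCh38
            converted[symbol] = unique_38[symbol]
        elif symbol in unique_37:    # step 1: fall back to GRCh37
            # Step 2 (updating GRCh37 ids to GRCh38) is not sketched here.
            converted[symbol] = unique_37[symbol]
        else:                        # step 3: drop
            dropped.append(symbol)
    return converted, dropped


# Toy input with made-up names and ids.
converted, dropped = convert_symbols(
    ["GENE_A", "GENE_C", "GENE_D"],
    unique_38={"GENE_A": "ENSG_1"},
    unique_37={"GENE_C": "ENSG_4"},
)
print(converted, dropped)
```

Returning the dropped symbols alongside the converted ones makes it easy to report to the original authors which genes would be lost.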

Pros:

Easy to implement.
Quick turnaround.

Cons:

Dropping ~3% of genes.
There are likely some false-positive mappings between the current names and ENSEMBL ids.

Recommendation 2: re-curation

Review each individual dataset and:
1. If the original data has ENSEMBL ids, repeat the curation process using those ids.
2. [Optional] Otherwise, ask the authors for ENSEMBL ids.
If no ENSEMBL ids are found for a dataset in the previous steps, apply recommendation 1.

Pros:

More accurate mapping for those datasets that had ENSEMBL ids in the original data.

Cons:

Very time-consuming, as there are more than 120 datasets in our data corpus.

Technical information

File versions

GRCh38: ENSEMBL genes and transcripts from release-103 (Homo_sapiens.GRCh38.103.chr.gtf.gz). *
GRCh37: ENSEMBL genes and transcripts from release-87 (Homo_sapiens.GRCh37.87.chr.gtf.gz). *
When possible, gene symbol aliases are converted to the current approved names (hgnc_complete_set.tsv from 10/05/2021).

* I decided to use these GTF versions, which contain only the main assembly (excluding unassembled contigs), because this reduces the number of ambiguous gene symbols (see below).

Process to check the state of gene names

Building reference sets

Parse the GTFs to get gene and transcript tables with three columns each: ID, name and version.
Parse the HGNC symbol table to get mappings between aliases and the latest approved symbol.
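
A possible shape for these two parsers is sketched below. It is not the code that was actually run. It assumes the usual ENSEMBL GTF layout (nine tab-separated columns, with attributes such as gene_id, gene_name and gene_version in the last one) and assumes the HGNC table has a `symbol` column and a pipe-separated `alias_symbol` column. The function names are hypothetical.

```python
import re

# Matches key "value" pairs in the GTF attribute column.
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def gtf_table(lines, feature="gene"):
    """Return (ID, name, version) tuples for rows of the given feature type.

    `feature` is "gene" or "transcript"; attribute keys such as
    gene_id / gene_name / gene_version are assumed to be present.
    """
    table = []
    for line in lines:
        if line.startswith("#"):  # header comments
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature:
            continue
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        table.append((attrs.get(feature + "_id"),
                      attrs.get(feature + "_name"),
                      attrs.get(feature + "_version")))
    return table


def alias_to_approved(rows):
    """Map each alias to an approved symbol.

    `rows` are dicts, one per HGNC record; the column names "symbol" and
    "alias_symbol" (pipe-separated) are assumptions about the file.
    """
    mapping = {}
    for row in rows:
        for alias in (row.get("alias_symbol") or "").split("|"):
            if alias:
                # Keep the first approved symbol seen for a shared alias.
                mapping.setdefault(alias, row["symbol"])
    return mapping


# One illustrative GTF row.
example = ('1\thavana\tgene\t11869\t14409\t.\t+\t.\t'
           'gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1";')
print(gtf_table([example]))
```

For the real files, the lines could come from `gzip.open(path, "rt")` and the HGNC rows from `csv.DictReader` with a tab delimiter.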

Checking gene state

For each gene:

Try to upgrade the gene (if it is an alias in HGNC, update it to the approved symbol; otherwise leave it unchanged).
Assign one of the following (do the checks in order):
1. If the gene is an ENSEMBL gene or transcript ID, assign “VALID”.
2. If the gene is an ENSEMBL name with multiple ENSEMBL IDs, assign “AMBIGUOUS”.
3. If the gene is an ENSEMBL name with exactly one ENSEMBL ID, assign “CONVERTIBLE”.
4. If the gene is a previously withdrawn HGNC symbol, assign “WITHDRAWN”.
5. If the gene is an ENSEMBL suffix-stripped name with exactly one ENSEMBL ID, assign “CONVERTIBLE”. A suffix-stripped name is one without the \.\d.* suffix.
If there is no assignment and the gene has a \.\d.* suffix, remove it and repeat from step 2.
If there is no assignment and the gene has a -.* suffix, remove it and repeat from step 2.
If there is still no assignment, assign “UNKNOWN”.
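
The ordered checks could be sketched like this for a single assembly. The `ref` dictionary and its keys are hypothetical stand-ins for the reference sets built earlier, and "repeat" is interpreted here as re-running checks 2 to 5 on the stripped name, which is an assumption.

```python
import re

DOT_SUFFIX = re.compile(r"\.\d.*$")
DASH_SUFFIX = re.compile(r"-.*$")


def name_checks(gene, ref):
    """Checks 2 to 5; return a state, or None if nothing matched."""
    ids = ref["name_to_ids"].get(gene, ())
    if len(ids) > 1:
        return "AMBIGUOUS"
    if len(ids) == 1:
        return "CONVERTIBLE"
    if gene in ref["withdrawn"]:
        return "WITHDRAWN"
    if len(ref["stripped_name_to_ids"].get(gene, ())) == 1:
        return "CONVERTIBLE"
    return None


def check_state(gene, ref):
    """Assign a state to one gene following the ordered checks."""
    gene = ref["alias_to_approved"].get(gene, gene)  # upgrade aliases
    if gene in ref["gene_ids"] or gene in ref["transcript_ids"]:
        return "VALID"
    state = name_checks(gene, ref)
    # Retry with the \.\d.* suffix removed, then with the -.* suffix removed.
    for suffix in (DOT_SUFFIX, DASH_SUFFIX):
        if state is None and suffix.search(gene):
            state = name_checks(suffix.sub("", gene), ref)
    return state or "UNKNOWN"


# Toy reference sets with made-up names and ids.
ref = {
    "alias_to_approved": {"OLD1": "GENE_A"},
    "gene_ids": {"ENSG_1", "ENSG_2", "ENSG_3", "ENSG_4"},
    "transcript_ids": {"ENST_1"},
    "name_to_ids": {"GENE_A": {"ENSG_1"}, "GENE_B": {"ENSG_2", "ENSG_3"}},
    "stripped_name_to_ids": {"CLONE": {"ENSG_4"}},
    "withdrawn": {"GONE1"},
}

for g in ["ENST_1", "OLD1", "GENE_B", "GONE1", "CLONE", "GENE_A.2", "NOPE"]:
    print(g, check_state(g, ref))
```

Running the same function with GRCh37 reference sets gives the second pass.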

The process was done first with GRCh38 and then repeated with GRCh37.

Deep explorations

This section contains open-ended research that aims to understand gene symbols that are ambiguous or unknown.

Many ambiguous genes were fixed (moved to convertible) after switching from the “full assembly” GTF (which contains unassembled contigs) to the main-assembly GTF.

Ambiguous gene symbols

These are gene symbols that map to more than one ENSEMBL id. Some useful stats about them:

There are 1099 unique genes.
Histogram of the number of times one of these genes is found across datasets:

Number of datasets	Count
1	1001
2	60
3	8
4	3
6	1
7	2
8	2
9	3
10	8
11	6
12	5

Top 15 repeated ambiguous genes

Gene	Count
CYB561D2	12
HERC3	12
MATR3	12
PDE11A	12
RGS5	12
ACTL10	11
DNAJC9-AS1	11
RMRP	11
SPATA13	11
TBCE	11
TMSB15B	11
CCDC39	10
ELFN2	10
GOLGA8M	10
KBTBD11-OT1	10
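
Both tables above can be derived from one per-gene tally. A minimal sketch, assuming a hypothetical dictionary `genes_by_dataset` that maps a dataset name to the list of gene symbols in the category of interest:

```python
from collections import Counter


def datasets_per_gene(genes_by_dataset):
    """Count, for each gene, the number of datasets it appears in."""
    per_gene = Counter()
    for genes in genes_by_dataset.values():
        per_gene.update(set(genes))  # count each gene once per dataset
    return per_gene


def dataset_histogram(per_gene):
    """Map 'number of datasets' -> 'number of genes found in that many'."""
    return Counter(per_gene.values())


# Toy input with made-up names.
genes_by_dataset = {
    "dataset_1": ["GENE_A", "GENE_B"],
    "dataset_2": ["GENE_A"],
    "dataset_3": ["GENE_A", "GENE_C", "GENE_C"],
}
per_gene = datasets_per_gene(genes_by_dataset)
print(sorted(dataset_histogram(per_gene).items()))
print(per_gene.most_common(15))  # most repeated genes
```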

Gene symbols not found in ENSEMBL

This section looks for patterns in gene symbols that were not found in any ENSEMBL version (those labeled as “UNKNOWN (GRCh37: UNKNOWN)” above).

Some useful stats about them:

There are 4964 unique genes.
Histogram of the number of times one of these genes is found across datasets:

Number of datasets	Count
1	4231
2	45
3	140
4	138
5	145
6	187
7	18
8	5
9	12
10	5
11	10
12	28

Two-letter prefix counts (only top-50 most frequent are shown):

Prefix	Count	Examples
RP	1449	RP11-521C22.2, RP11-79P5.10, RP11-157J13.1, RP11-49G2.3
LO	1251	LOC338817, LOC641746, LOC100128496, LOC729987
AC	680	AC007389.5, AC008750.8, AC074327.1, AC008448.1
AL	383	AL513023.1, AL031777.3, AL442647.1, AL592183.1
CT	229	CT978678.1, CTD-2201E9.3, CTC-260E6.12, CTC-788C1.2
CH	123	CH17-125A10.2, CH17-385C13.2, CH507-254M2.2, CH17-174L20.1
AP	51	AP003392.6, AP003386.1, AP003774.3, AP003097.2
FL	44	FLJ39080, FLJ37201, FLJ22447, FLJ33581
uc	36	uc_338-4, uc_338-27, uc_338-18, uc_338-9
SN	35	SNAR-C5, SNAR-C4, SNORD103C, SNAR-F
LL	31	LLNLF-18A12.1, LLNLR-276E7.1, LL0XNC01-116E7.5, LLNLR-285B5.1
LI	30	LINC01969, LINC00535, LINC00535, LINC02083
FA	28	FAM239C, FAM160B2, FAM160B2, FAM166AP4
C1	20	C1orf229, C12orf65, C11orf95, C16orf71
MI	19	MIAT_exon5_2, MIR3179-3-1, MIR3180-4-1, MIR1273E
XX	18	XXbac-BPG246D15.9, XXyac-YR21CG7.1, XXcos-LUCA16.1, XX-FYM637E10_5.1
Cl	17	Clostridiales-1-13, Clostridiales-1-1, Clostridiales-1-15, Clostridiales-1-7
DL	15	DLEU2_3, DLEU2_6, DLEU2_5, DLEU1_1
LA	15	LA16c-335H7.2, LA16c-360H6.1, LA16c-407A10.3, LA16c-380H5.6
KB	14	KB-1572G7.5, KB-1183D5.13, KB-1125A3.12, KB-1552D7.2
CC	13	CCDC151, CCDC114, CCDC114, CCDC114
HO	13	HOTAIRM1_3, HOTAIRM1_5, HOTAIR_2, HOTTIP_3
pR	13	pRNA-9, pRNA-3, pRNA-5, pRNA-4
RN	13	RNA5-8S5-5, RN45S, RNA5-8S5-4, RNMTL1P2
C2	12	C2orf91, C22orf24, C2orf27AP3, C21ORF62-AS1
bP	11	bP-21264C1.2, bP-2171C21.4, bP-2171C21.5, bP-2189O9.3
MG	11	MGC16142, MGC72080, MGC39584, MGC12916
KI	10	KIAA1841, KIAA1841, KIR2DS2, KIR3DS1
BX	9	BX649632.1, BX072566.2, BX072566.1, BX649632.1
C8	9	C8orf59P1, C8ORF37-AS1, C8ORF37-AS1, C8orf31
CX	9	CXORF51B, CXORF51A, CXORF51B, CXORF66
DK	9	DKFZp451B082, DKFZp566F0947, DKFZP434K028, DKFZp686K1684
hs	9	hsa-mir-3675, hsa-mir-8069-1, hsa-mir-3687-1-1, hsa-mir-6724-1-1
RM	9	RMST_10, RMST_5, RMST_8, RMST_9
CU	8	CU463998.3, CU634019.6, CU013544.1, CU634019.6
ME	8	MESTIT1_1, MEF2BNBP1, MEG3_2, MEG8_2
AB	7	ABP1, ABBA01037346.1, ABC12-47964100C23.1, ABBA01037345.1
CM	7	CMB9-22P13.2, CMB9-94B1.2, CMB9-22P13.1, CMB9-14B22.1
GS	7	GS1-25M2.1, GS1-214D18.3, GS1-204I12.3, GS1-259H13.13
H1	7	H19_3-1, H19_3-2, H19_1, H19_2
Si	7	Six3os1_5, Six3os1_1, Six3os1_7, Six3os1_2
TC	7	TCL6_2, TCTEX1D4, TCTEX1D1, TCTEX1D4
Z9	7	Z98744.1, Z97987.1, Z98750.1, Z97987.1
AF	6	AF241734.1, AF065393.1, AF065393.2, AF065393.3
AU	6	AUXG01000518.1, AUXG01000515.2, AUXG01000517.1, AUXG01000516.1
C9	6	C9orf62, C9ORF135-DT, C9orf147, C9orf147
FO	6	FO393415.3, FO538757.3, FO393407.1, FO393415.3
FT	6	FTX_3, FTX_1, FTX_4, FTH1P18
PA	6	PART1_2, PANO1, PAR4, PART1_3
Z8	6	Z84484.1, Z84484.1, Z82246.1, Z82205.1
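
A table like the one above could be produced along these lines. The function name and parameters are illustrative, and the examples are simply the first symbols seen for each prefix, which may differ from how the examples above were chosen:

```python
from collections import defaultdict


def prefix_table(genes, length=2, top=50, n_examples=4):
    """Return (prefix, count, examples) rows sorted by decreasing count."""
    groups = defaultdict(list)
    for gene in genes:
        groups[gene[:length]].append(gene)
    ranked = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(prefix, len(members), members[:n_examples])
            for prefix, members in ranked[:top]]


# A few symbols taken from the table above.
genes = ["RP11-521C22.2", "RP11-79P5.10", "LOC338817", "AC007389.5"]
for prefix, count, examples in prefix_table(genes):
    print(prefix, count, ", ".join(examples), sep="\t")
```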

There are 3087 genes that have a “.” suffix, and these are the top 20 prefixes of those:

Prefix	Count
RP	1445
AC	676
AL	383
CT	229
CH	122
AP	51
LL	31
XX	18
LA	15
KB	14
bP	11
BX	9
CU	8
CM	7
GS	7
Z9	7
AB	6
AF	6
AU	6
FO	6

Removing “.” suffix

This solution has already been incorporated into the process described above, so the results below are not an accurate representation and should be ignored.

Does removing those suffixes make “UNKNOWN” genes “CONVERTIBLE”?

After removing those suffixes from the 3087 genes (see above), and after removing any suffixes in the ENSEMBL names, these are the new gene states:

Gene state	Counts
AMBIGUOUS	312
CONVERTIBLE	31
UNKNOWN (GRCh37: AMBIGUOUS)	897
UNKNOWN (GRCh37: CONVERTIBLE)	19
UNKNOWN (GRCh37: UNKNOWN)	1828
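
The experiment strips suffixes on both sides of the lookup. A sketch under the assumption that the “.” suffix is the \.\d.* pattern used in the state checks; the names and ids are made up. The toy data also shows one way stripping can create ambiguity: two ENSEMBL names that differ only in their suffix collapse onto the same key.

```python
import re
from collections import defaultdict

DOT_SUFFIX = re.compile(r"\.\d.*$")


def strip_suffix(name):
    """Remove a trailing .<digit>... suffix, if any."""
    return DOT_SUFFIX.sub("", name)


def stripped_index(name_to_ids):
    """Index ENSEMBL ids by suffix-stripped name."""
    index = defaultdict(set)
    for name, ids in name_to_ids.items():
        index[strip_suffix(name)] |= set(ids)
    return index


def state_after_stripping(gene, index):
    """Classify a gene after stripping its suffix (single assembly)."""
    n = len(index.get(strip_suffix(gene), ()))
    if n == 1:
        return "CONVERTIBLE"
    return "AMBIGUOUS" if n > 1 else "UNKNOWN"


# Made-up ENSEMBL names and ids.
index = stripped_index({
    "AC000001.1": {"ENSG_1"},
    "AC000001.2": {"ENSG_2"},
    "AL000002.3": {"ENSG_3"},
})
for g in ["AC000001.7", "AL000002.1", "RP11-1A1.1"]:
    print(g, state_after_stripping(g, index))
```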

Appendix

Expanded gene category stats per dataset

Table: Counts and percentages of state across datasets

dataset	AMBIGUOUS	CONVERTIBLE	UNKNOWN (GRCh37: CONVERTIBLE)	UNKNOWN (GRCh37: UNKNOWN)	UNKNOWN (GRCh37: AMBIGUOUS)
A single-cell atlas of the hea …	29 (0.13)	22426 (98)	111 (0.49)	317 (1.4)	NA
A Single-Cell Transcriptional …	16 (0.11)	14988 (99)	19 (0.13)	50 (0.33)	NA
Direct Exposure to SARS-CoV-2 …	28 (0.084)	32610 (97)	186 (0.56)	682 (2)	NA
Evolution of cellular diversit …	11 (0.074)	14701 (100)	9 (0.061)	47 (0.32)	NA
Fibroblasts — Cells of the adu …	28 (0.084)	32610 (97)	186 (0.56)	679 (2)	NA
Infiltrating Neoplastic Cells …	72 (0.31)	21641 (93)	73 (0.31)	1517 (6.5)	5 (0.021)
Multiomics single-cell analysi …	34 (0.13)	25764 (98)	148 (0.56)	441 (1.7)	NA
Single cell transcriptional an …	29 (0.13)	21854 (98)	103 (0.46)	371 (1.7)	NA
Single-cell RNA-Seq Investigat …	29 (0.13)	18094 (84)	3401 (16)	75 (0.35)	NA
Single-cell transcriptomics of …	981 (1.6)	41473 (68)	15232 (25)	2937 (4.8)	102 (0.17)
Spatiotemporal analysis of hum …	28 (0.084)	32610 (97)	186 (0.56)	679 (2)	NA
Time-resolved Systems Immunolo …	32 (0.098)	23408 (72)	9181 (28)	115 (0.35)	2 (0.0061)

Notes on some prefixes

These notes are adapted from a Slack conversation with Brian Raymnor and Ambrose Carr.

Potential solution: unknown.
These seem to be annotated "fragments of the genome".
Example: RP11-248B24.5
Link

LINC

Potential solution: upgrade symbol.
These seem to be old symbols that can be upgraded.
Link

Potential solution: remove suffix.
Example: ZNF709
Link

Potential solution: unknown.
They seem to be genomic DNA annotations that are not part of the main genome assembly.
They may be genes that originate from those annotations.
They are in the “gene” database of NCBI
Link
Link2

Potential solution: remove suffix.
Example: CTD-3212A4.2
Link

Metazoa

Potential solution: unknown.
Example: Metazoa_SRP-187.
The prefix seems to be for Metazoan signal recognition particle RNA. (Link).
The meaning of the suffix is unclear.

Potential solution: reformatting.
Example: uc_338-27 which could be uc.338
Link
Link 2
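
The reformatting suggested above might look like the following. This is only a sketch: the pattern, the function name `reformat_uc`, and the choice to discard the number after the dash are assumptions based on the single example given.

```python
import re

# Assumed shape: uc_<number>-<number>, e.g. uc_338-27.
UC_PATTERN = re.compile(r"^uc_(\d+)-\d+$")


def reformat_uc(symbol):
    """Rewrite uc_<n>-<m> as uc.<n>; leave other symbols untouched."""
    match = UC_PATTERN.match(symbol)
    return f"uc.{match.group(1)}" if match else symbol


print(reformat_uc("uc_338-27"))
```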

Potential solution: unknown.
Example: CH17-53B9.4
Some examples were found in an angiogenesis database; interestingly, they have links to ENSEMBL ids.
Link

Potential solution: ignore.
In EBI they carry a warning that they have been replaced with newer versions.
Link