ENSEMBL GTF versions:
Gene symbols in current cellxgene datasets were looked up in the ENSEMBL GTF files, and each was assigned one of the following states:
Numbers in parentheses are percentages.
Table: Summary of gene states in 12 datasets.

| Gene state | Count |
|---|---|
| AMBIGUOUS | 15264 (4.5) |
| CONVERTIBLE | 276961 (81) |
| UNKNOWN (GRCh37: AMBIGUOUS) | 1483 (0.44) |
| UNKNOWN (GRCh37: CONVERTIBLE) | 34991 (10) |
| UNKNOWN (GRCh37: UNKNOWN) | 11651 (3.4) |
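
The percentages in parentheses can be reproduced from the counts alone. The sketch below assumes that each percentage is the state's share of the total over all five states, shown with two significant figures; the source does not state this, but the assumption reproduces every value in the table.

```python
# Counts copied from the summary table above.
counts = {
    "AMBIGUOUS": 15264,
    "CONVERTIBLE": 276961,
    "UNKNOWN (GRCh37: AMBIGUOUS)": 1483,
    "UNKNOWN (GRCh37: CONVERTIBLE)": 34991,
    "UNKNOWN (GRCh37: UNKNOWN)": 11651,
}

total = sum(counts.values())


def percent(count, total):
    """Share of the total in percent, formatted with two significant figures."""
    return f"{100 * count / total:.2g}"


# Assumption: percentages are relative to the total over all five states.
percentages = {state: percent(n, total) for state, n in counts.items()}

for state, n in counts.items():
    print(f"| {state} | {n} ({percentages[state]}) |")
```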
Figure: gene states across datasets.
Together, the yellow and light green categories are the genes that can be safely converted.
This section looks for patterns in gene symbols that were not found in any ENSEMBL version (those labeled “UNKNOWN (GRCh37: UNKNOWN)” above).
Some useful stats about them:
| Number of datasets | Count |
|---|---|
| 1 | 7332 |
| 2 | 50 |
| 3 | 233 |
| 4 | 202 |
| 5 | 212 |
| 6 | 259 |
| 7 | 14 |
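
A quick consistency check on the table above, as a sketch. It assumes the first column is the number of datasets in which a symbol appears and the second is the number of distinct symbols. Under that reading, weighting each row by its dataset count gives the 11651 occurrences reported for “UNKNOWN (GRCh37: UNKNOWN)” in the summary table. The helper `dataset_count_histogram` and its toy input are hypothetical and only illustrate how such a table could be built.

```python
from collections import Counter

# Rows copied from the table above: number of datasets -> number of symbols.
symbols_by_dataset_count = {1: 7332, 2: 50, 3: 233, 4: 202, 5: 212, 6: 259, 7: 14}

unique_symbols = sum(symbols_by_dataset_count.values())
occurrences = sum(k * n for k, n in symbols_by_dataset_count.items())
print(unique_symbols, occurrences)


def dataset_count_histogram(unknown_by_dataset):
    """Count, for each k, how many symbols appear in exactly k datasets.

    `unknown_by_dataset` maps a dataset name to its unknown gene symbols
    (hypothetical input shape, not taken from the source).
    """
    per_symbol = Counter(
        symbol for symbols in unknown_by_dataset.values() for symbol in set(symbols)
    )
    return Counter(per_symbol.values())


# Toy data for illustration only.
toy = {"d1": {"A", "B"}, "d2": {"B", "C"}, "d3": {"B"}}
toy_histogram = dataset_count_histogram(toy)
```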
| Prefix | Count | Examples |
|---|---|---|
| RP | 2270 | RP11-567F11.2, RP11-113D19.9, RP11-775C24.5, RP11-203H19.2 |
| LO | 1252 | LOC646903, LOC219347, LOC100128750, LOC93622 |
| AC | 900 | AC096558.2, AC122710.3, AC068675.1, AC126281.3 |
| Y_ | 785 | Y_RNA-226, Y_RNA-745, Y_RNA-659, Y_RNA-195 |
| AL | 470 | AL358975.1, AL136221.3, AL133330.2, AL606763.1 |
| SN | 427 | SNORA26-6, SNORA51-3, SNORA62-5, SNORA64-4 |
| CT | 338 | CTD-2619J13.27, CTD-3234P18.6, CTD-3149D2.2, CTD-2145A24.5 |
| Me | 223 | Metazoa_SRP-195, Metazoa_SRP-74, Metazoa_SRP-109, Metazoa_SRP-70 |
| CH | 125 | CH507-152C13.5, CH17-140K24.7, CH17-140K24.2, CH17-360D5.2 |
| AP | 88 | AP003386.1, AP003397.1, AP002759.1, AP001646.3 |
| LI | 60 | LINC01759, LINC01336, LIMS3.1, LINC01578 |
| U3 | 55 | U3-16, U3-49, U3-8, U3-39 |
| FL | 50 | FLJ23867, FLJ16171, FLJ33534, FLJ32063 |
| sn | 48 | snoU13-10, snoMBII-202-1, snoU13-25, snoMe28S-Am2634-1 |
| LL | 36 | LLNLF-173C4.1, LLNLR-304G9.1, LLNLF-18A12.1, LLNLR-285B5.1 |
| SC | 36 | SCARNA16-2, SCARNA9L, SCARNA17-4, SCARNA20-3 |
| uc | 36 | uc_338-3, uc_338-30, uc_338-24, uc_338-29 |
| OR | 30 | OR4N3P-1, OR4C16-1, OR5H6-1, OR5L1-1 |
| FA | 28 | FAM95B1-1, FAS-AS1, FABP5P13-1, FAM24B-CUZD1 |
| MI | 25 | MIR3180-1-1, MIR3180-3-1, MIAT_exon5_3, MIR4697HG |
| XX | 25 | XXbac-B33L19.12, XXbac-BPG258E24.10, XXbac-BPG154L12.5, XXbac-BPG13B8.11 |
| HO | 23 | HOTTIP_2, HOTAIRM1_3, HOTAIRM1_2, HOXA11-AS1_4 |
| LA | 21 | LA16c-360A4.1, LA16c-360H6.1, LA16c-390H2.1, LA16c-380F5.1 |
| SE | 21 | SEPT14P24, SEPT14P2, SEPT7-AS1, SEPT14P8 |
| U6 | 21 | U62317.5, U6-10, U6-15, U62317.5 |
| U8 | 21 | U8-19, U8-11, U8-21, U8-3 |
| ZN | 20 | ZNF668.1, ZNF26.1, ZNRD1-AS1_1, ZNFX1-AS1_3 |
| DL | 19 | DLEU1-AS1, DLEU1-AS1, DLEU2_1-1, DLX2-AS1 |
| BX | 18 | BX649632.1, BX649632.1, BX005132.2, BX284668.6 |
| U2 | 18 | U2-18, U2-9, U2-11, U2-10 |
| 5S | 17 | 5S_rRNA-4, 5S_rRNA-2, 5S_rRNA-13, 5S_rRNA-14 |
| Cl | 17 | Clostridiales-1-11, Clostridiales-1-13, Clostridiales-1-3, Clostridiales-1-4 |
| RN | 17 | RNASEK-C17ORF49, RNA5SP506, RNA5-8S5-7, RN45S |
| U1 | 16 | U1-15, U1-10, U1.1, U1-8 |
| KB | 15 | KB-176G8.1, KB-1572G7.5, KB-1396H2.2, KB-1125A3.12 |
| AB | 14 | ABCF2.1, ABC7-42391500H16.4, AB015752.1, ABBA01037349.1 |
| U7 | 14 | U7-10, U7-14, U7-3, U7-12 |
| pR | 13 | pRNA-8, pRNA-4, pRNA, pRNA-11 |
| AF | 12 | AF011889.5, AF065393.2, AF127936.2, AF111168.2 |
| bP | 12 | bP-2171C21.4, bP-2171C21.3, bP-2168N6.3, bP-21264C1.1 |
| PR | 12 | PRICKLE2-AS1-1, PRCAT47, PRAMEF28, PRO0611 |
| SP | 12 | SPRY4-IT1_1, SPRY4-IT1_2, SPG20-AS1, SPDYE14P |
| ST | 12 | ST7-OT3_3, ST7-OT4_4, ST7-AS2_1, ST7-OT4_2 |
| C1 | 11 | C10orf32-AS3MT, C1QTNF9B-AS1-1, C10ORF71-AS1, C1orf140 |
| DK | 11 | DKFZP434K028, DKFZp686K1684, DKFZp686D0853, DKFZP434I0714 |
| GS | 11 | GS1-345D13.1, GS1-214D18.3, GS1-54N10.1, GS1-114I9.3 |
| MG | 11 | MGC45800, MGC2889, MGC27345, MGC16025 |
| SM | 11 | SMAD5-AS1_2-1, SMAD5-AS1_1, SMIM25, SMIM37 |
| CX | 10 | CXORF49, CXORF51B, CXORF51B, CXORF38 |
| RM | 10 | RMST_8, RMST_7, RMRP-1, RMST_3 |
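
The prefixes in the table above are the first two characters of each symbol (for example `Y_`, `Me`, `5S`). A minimal sketch of that tally, using a handful of symbols from the Examples column; the function name `prefix_counts` is hypothetical.

```python
from collections import Counter


def prefix_counts(symbols, length=2):
    """Tally symbols by their first `length` characters (case-sensitive)."""
    return Counter(symbol[:length] for symbol in symbols)


# A few symbols taken from the Examples column above.
examples = [
    "RP11-567F11.2",
    "RP11-113D19.9",
    "LOC646903",
    "Y_RNA-226",
    "Metazoa_SRP-195",
    "5S_rRNA-4",
    "snoU13-10",
    "SNORA26-6",
]

counts = prefix_counts(examples)
for prefix, n in counts.most_common():
    print(prefix, n)
```

A two-character prefix is coarse: in the table above, the `LI` row mixes LINC symbols with LIMS3.1, and the `U6` row mixes U6-10 with U62317.5. A longer or pattern-based prefix may separate such families better.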
| Prefix | Count |
|---|---|
| RP | 2258 |
| AC | 887 |
| AL | 469 |
| CT | 336 |
| CH | 124 |
| AP | 88 |
| LL | 34 |
| XX | 25 |
| LA | 21 |
| KB | 15 |
| BX | 14 |
| AB | 13 |
| AF | 12 |
| bP | 12 |
| GS | 11 |
| ZN | 10 |
| Z8 | 9 |
| Z9 | 9 |
| CM | 7 |
| AU | 6 |
Does removing those suffixes make “UNKNOWN” genes “CONVERTIBLE”?
After removing those suffixes from the 4500 genes (see above), and after removing any suffixes in the ENSEMBL names, these are the new gene states:
| Gene state | Counts |
|---|---|
| AMBIGUOUS | 339 |
| CONVERTIBLE | 426 |
| UNKNOWN (GRCh37: AMBIGUOUS) | 991 |
| UNKNOWN (GRCh37: CONVERTIBLE) | 748 |
| UNKNOWN (GRCh37: UNKNOWN) | 1996 |
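
The suffix-removal step could be sketched as follows. This assumes that a “suffix” is a trailing dot followed by digits (as in RP11-567F11.2 or ZNF668.1); the source does not define it, so the pattern may need adjusting. The function names and the reference names are hypothetical and do not reflect actual ENSEMBL content.

```python
import re

# Assumption: a "suffix" is a trailing ".<digits>", e.g. "RP11-567F11.2".
DOT_NUMBER_SUFFIX = re.compile(r"\.\d+$")


def strip_suffix(symbol: str) -> str:
    """Remove a trailing ".<digits>" if present; otherwise return the symbol as is."""
    return DOT_NUMBER_SUFFIX.sub("", symbol)


def found_after_stripping(symbols, reference_names):
    """Report, per symbol, whether it matches a reference name once the
    suffix is stripped on both sides."""
    reference = {strip_suffix(name) for name in reference_names}
    return {symbol: strip_suffix(symbol) in reference for symbol in symbols}


dataset_symbols = ["RP11-567F11.2", "ZNF668.1", "LOC646903"]
reference_names = ["RP11-567F11.4", "ZNF668"]  # hypothetical reference names

result = found_after_stripping(dataset_symbols, reference_names)
print(result)
```

The five counts in the table above add up to 4500, the number of genes from which suffixes were removed.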
These notes are adapted from a Slack conversation with Brian Raymnor and Ambrose Carr.
RP
LINC
ZN
LO
CT
Metazoa
uc
CH
AP
| dataset | AMBIGUOUS | CONVERTIBLE | UNKNOWN (GRCh37: AMBIGUOUS) | UNKNOWN (GRCh37: CONVERTIBLE) | UNKNOWN (GRCh37: UNKNOWN) |
|---|---|---|---|---|---|
| A single-cell atlas of the hea … | 5 (4.7) | 5 (92) | 5 (0.12) | 5 (1.7) | 5 (1.9) |
| A Single-Cell Transcriptional … | 4 (5.2) | 4 (94) | 4 (0.11) | 4 (1) | NA |
| Direct Exposure to SARS-CoV-2 … | 5 (4.9) | 5 (92) | 5 (0.003) | 5 (0.52) | 5 (2.7) |
| Evolution of cellular diversit … | 5 (5) | 5 (95) | 5 (0.0068) | 5 (0.074) | 5 (0.1) |
| Fibroblasts — Cells of the adu … | 5 (4.9) | 5 (92) | 5 (0.003) | 5 (0.52) | 5 (2.7) |
| Infiltrating Neoplastic Cells … | 5 (5.7) | 5 (88) | 5 (0.047) | 5 (0.36) | 5 (6.3) |
| Multiomics single-cell analysi … | 5 (4.3) | 5 (92) | 5 (0.12) | 5 (1.7) | 5 (2.3) |
| Single cell transcriptional an … | 5 (4.1) | 5 (92) | 5 (0.12) | 5 (1.6) | 5 (2.1) |
| Single-cell RNA-Seq Investigat … | 5 (4.5) | 5 (77) | 5 (0.74) | 5 (18) | 5 (0.083) |
| Single-cell transcriptomics of … | 5 (3.6) | 5 (57) | 5 (1.2) | 5 (29) | 5 (9.6) |
| Spatiotemporal analysis of hum … | 5 (4.9) | 5 (92) | 5 (0.003) | 5 (0.52) | 5 (2.7) |
| Time-resolved Systems Immunolo … | 5 (3.7) | 5 (59) | 5 (1.4) | 5 (36) | 5 (0.35) |