Author: Pablo Garcia-Nieto
Last updated: May 11th, 2021
Goals:
Gene symbols in current cellxgene datasets were looked up in the ENSEMBL GTF files and assigned one of the following states:
| Gene state | Count |
|---|---|
| AMBIGUOUS | 1317 (0.39%) |
| CONVERTIBLE | 302179 (89%) |
| UNKNOWN (GRCh37: AMBIGUOUS) | 109 (0.032%) |
| UNKNOWN (GRCh37: CONVERTIBLE) | 28835 (8.5%) |
| UNKNOWN (GRCh37: UNKNOWN) | 7910 (2.3%) |
Figure: gene states across datasets.
Genes shown in yellow and light green can be safely converted.
High-level conclusions:
cellxgene-schema, or as a peripheral script.
Based on these conclusions, there are two existing recommendations:
Pros:
Cons:
Pros:
Cons:
* I decided to use these GTF versions, which contain only the main assembly (excluding unassembled contigs), because doing so reduces the number of ambiguous gene symbols (see below).
For each gene:
`\.\d.*` suffix.
`\.\d.*` suffix, remove it and repeat 2.
`-.*` suffix, remove it and repeat 2.

The process was done first with GRCh38 and then repeated with GRCh37.
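The lookup procedure above can be sketched roughly as follows. This is a minimal illustration, not the script that produced these results: the function names, the dictionary layout (symbol mapped to a set of ENSEMBL IDs), and the choice to strip the `-.*` suffix from the already version-stripped symbol are all assumptions.

```python
import re

# Suffix patterns taken from the steps above; the order in which they are
# applied is an assumption.
VERSION_SUFFIX = re.compile(r"\.\d.*$")
DASH_SUFFIX = re.compile(r"-.*$")


def lookup(symbol, symbol_to_ids):
    """Classify a symbol against one annotation (hypothetical helper).

    symbol_to_ids maps a gene symbol to the set of ENSEMBL IDs that carry it.
    """
    # Try the symbol as given, then with each suffix removed in turn.
    candidates = [symbol]
    for pattern in (VERSION_SUFFIX, DASH_SUFFIX):
        stripped = pattern.sub("", candidates[-1])
        if stripped and stripped != candidates[-1]:
            candidates.append(stripped)

    for candidate in candidates:
        ids = symbol_to_ids.get(candidate)
        if ids:
            return "CONVERTIBLE" if len(ids) == 1 else "AMBIGUOUS"
    return "UNKNOWN"


def gene_state(symbol, grch38, grch37):
    """Look the symbol up in GRCh38 first, then fall back to GRCh37."""
    state = lookup(symbol, grch38)
    if state != "UNKNOWN":
        return state
    return f"UNKNOWN (GRCh37: {lookup(symbol, grch37)})"
```

With this layout, a symbol found only in the older assembly is reported as `UNKNOWN (GRCh37: CONVERTIBLE)` or `UNKNOWN (GRCh37: AMBIGUOUS)`, matching the state names used in the tables of this document.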
This section contains open-ended research that aims to understand gene symbols that are ambiguous or unknown.
Many ambiguous genes were fixed (moved to convertible) after switching from the “full assembly” GTF (which contains unassembled contigs) to the main-assembly GTF.
These are gene symbols that map to more than one ENSEMBL ID. Some useful stats about them:
| Number of datasets | Count |
|---|---|
| 1 | 1001 |
| 2 | 60 |
| 3 | 8 |
| 4 | 3 |
| 6 | 1 |
| 7 | 2 |
| 8 | 2 |
| 9 | 3 |
| 10 | 8 |
| 11 | 6 |
| 12 | 5 |
| Gene | Count |
|---|---|
| CYB561D2 | 12 |
| HERC3 | 12 |
| MATR3 | 12 |
| PDE11A | 12 |
| RGS5 | 12 |
| ACTL10 | 11 |
| DNAJC9-AS1 | 11 |
| RMRP | 11 |
| SPATA13 | 11 |
| TBCE | 11 |
| TMSB15B | 11 |
| CCDC39 | 10 |
| ELFN2 | 10 |
| GOLGA8M | 10 |
| KBTBD11-OT1 | 10 |
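One way to find symbols that map to more than one ENSEMBL ID is to collect the `gene_id` and `gene_name` attributes from the `gene` records of a GTF file. The sketch below assumes the standard nine-column, tab-separated GTF layout with quoted attribute values; the function names are hypothetical and this is not necessarily how the counts above were produced.

```python
import re
from collections import defaultdict

# Attribute patterns for the ninth GTF column.
GENE_ID = re.compile(r'gene_id "([^"]+)"')
GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def symbol_to_ids(gtf_lines):
    """Map each gene symbol to the set of ENSEMBL gene IDs that use it."""
    mapping = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep only gene-level records
        gene_id = GENE_ID.search(fields[8])
        gene_name = GENE_NAME.search(fields[8])
        if gene_id and gene_name:
            mapping[gene_name.group(1)].add(gene_id.group(1))
    return mapping


def ambiguous_symbols(mapping):
    """Return only the symbols that map to more than one ENSEMBL ID."""
    return {symbol: ids for symbol, ids in mapping.items() if len(ids) > 1}
```

`symbol_to_ids` accepts any iterable of lines, so it can be fed an open file handle directly.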
This section looks for patterns in gene symbols that were not found in any ENSEMBL version (those labeled as “UNKNOWN (GRCh37: UNKNOWN)” above).
Some useful stats about them:
| Number of datasets | Count |
|---|---|
| 1 | 4231 |
| 2 | 45 |
| 3 | 140 |
| 4 | 138 |
| 5 | 145 |
| 6 | 187 |
| 7 | 18 |
| 8 | 5 |
| 9 | 12 |
| 10 | 5 |
| 11 | 10 |
| 12 | 28 |
| Prefix | Count | Examples |
|---|---|---|
| RP | 1449 | RP11-521C22.2, RP11-79P5.10, RP11-157J13.1, RP11-49G2.3 |
| LO | 1251 | LOC338817, LOC641746, LOC100128496, LOC729987 |
| AC | 680 | AC007389.5, AC008750.8, AC074327.1, AC008448.1 |
| AL | 383 | AL513023.1, AL031777.3, AL442647.1, AL592183.1 |
| CT | 229 | CT978678.1, CTD-2201E9.3, CTC-260E6.12, CTC-788C1.2 |
| CH | 123 | CH17-125A10.2, CH17-385C13.2, CH507-254M2.2, CH17-174L20.1 |
| AP | 51 | AP003392.6, AP003386.1, AP003774.3, AP003097.2 |
| FL | 44 | FLJ39080, FLJ37201, FLJ22447, FLJ33581 |
| uc | 36 | uc_338-4, uc_338-27, uc_338-18, uc_338-9 |
| SN | 35 | SNAR-C5, SNAR-C4, SNORD103C, SNAR-F |
| LL | 31 | LLNLF-18A12.1, LLNLR-276E7.1, LL0XNC01-116E7.5, LLNLR-285B5.1 |
| LI | 30 | LINC01969, LINC00535, LINC00535, LINC02083 |
| FA | 28 | FAM239C, FAM160B2, FAM160B2, FAM166AP4 |
| C1 | 20 | C1orf229, C12orf65, C11orf95, C16orf71 |
| MI | 19 | MIAT_exon5_2, MIR3179-3-1, MIR3180-4-1, MIR1273E |
| XX | 18 | XXbac-BPG246D15.9, XXyac-YR21CG7.1, XXcos-LUCA16.1, XX-FYM637E10_5.1 |
| Cl | 17 | Clostridiales-1-13, Clostridiales-1-1, Clostridiales-1-15, Clostridiales-1-7 |
| DL | 15 | DLEU2_3, DLEU2_6, DLEU2_5, DLEU1_1 |
| LA | 15 | LA16c-335H7.2, LA16c-360H6.1, LA16c-407A10.3, LA16c-380H5.6 |
| KB | 14 | KB-1572G7.5, KB-1183D5.13, KB-1125A3.12, KB-1552D7.2 |
| CC | 13 | CCDC151, CCDC114, CCDC114, CCDC114 |
| HO | 13 | HOTAIRM1_3, HOTAIRM1_5, HOTAIR_2, HOTTIP_3 |
| pR | 13 | pRNA-9, pRNA-3, pRNA-5, pRNA-4 |
| RN | 13 | RNA5-8S5-5, RN45S, RNA5-8S5-4, RNMTL1P2 |
| C2 | 12 | C2orf91, C22orf24, C2orf27AP3, C21ORF62-AS1 |
| bP | 11 | bP-21264C1.2, bP-2171C21.4, bP-2171C21.5, bP-2189O9.3 |
| MG | 11 | MGC16142, MGC72080, MGC39584, MGC12916 |
| KI | 10 | KIAA1841, KIAA1841, KIR2DS2, KIR3DS1 |
| BX | 9 | BX649632.1, BX072566.2, BX072566.1, BX649632.1 |
| C8 | 9 | C8orf59P1, C8ORF37-AS1, C8ORF37-AS1, C8orf31 |
| CX | 9 | CXORF51B, CXORF51A, CXORF51B, CXORF66 |
| DK | 9 | DKFZp451B082, DKFZp566F0947, DKFZP434K028, DKFZp686K1684 |
| hs | 9 | hsa-mir-3675, hsa-mir-8069-1, hsa-mir-3687-1-1, hsa-mir-6724-1-1 |
| RM | 9 | RMST_10, RMST_5, RMST_8, RMST_9 |
| CU | 8 | CU463998.3, CU634019.6, CU013544.1, CU634019.6 |
| ME | 8 | MESTIT1_1, MEF2BNBP1, MEG3_2, MEG8_2 |
| AB | 7 | ABP1, ABBA01037346.1, ABC12-47964100C23.1, ABBA01037345.1 |
| CM | 7 | CMB9-22P13.2, CMB9-94B1.2, CMB9-22P13.1, CMB9-14B22.1 |
| GS | 7 | GS1-25M2.1, GS1-214D18.3, GS1-204I12.3, GS1-259H13.13 |
| H1 | 7 | H19_3-1, H19_3-2, H19_1, H19_2 |
| Si | 7 | Six3os1_5, Six3os1_1, Six3os1_7, Six3os1_2 |
| TC | 7 | TCL6_2, TCTEX1D4, TCTEX1D1, TCTEX1D4 |
| Z9 | 7 | Z98744.1, Z97987.1, Z98750.1, Z97987.1 |
| AF | 6 | AF241734.1, AF065393.1, AF065393.2, AF065393.3 |
| AU | 6 | AUXG01000518.1, AUXG01000515.2, AUXG01000517.1, AUXG01000516.1 |
| C9 | 6 | C9orf62, C9ORF135-DT, C9orf147, C9orf147 |
| FO | 6 | FO393415.3, FO538757.3, FO393407.1, FO393415.3 |
| FT | 6 | FTX_3, FTX_1, FTX_4, FTH1P18 |
| PA | 6 | PART1_2, PANO1, PAR4, PART1_3 |
| Z8 | 6 | Z84484.1, Z84484.1, Z82246.1, Z82205.1 |
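A prefix tally like the one above can be reproduced with a short script. The sketch below is illustrative only: it assumes the prefix is simply the first two characters of each symbol (as the table suggests) and that symbols are counted once per occurrence; `prefix_summary` is a hypothetical name.

```python
from collections import Counter, defaultdict


def prefix_summary(symbols, prefix_len=2, n_examples=4):
    """Count symbols by leading characters and keep a few examples of each."""
    symbols = list(symbols)
    counts = Counter(symbol[:prefix_len] for symbol in symbols)

    examples = defaultdict(list)
    for symbol in symbols:
        bucket = examples[symbol[:prefix_len]]
        if len(bucket) < n_examples:
            bucket.append(symbol)

    # Most frequent prefixes first.
    return [(prefix, count, examples[prefix]) for prefix, count in counts.most_common()]
```

Each returned tuple holds the prefix, its count, and up to `n_examples` example symbols, which corresponds to one row of the table.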
| Prefix | Count |
|---|---|
| RP | 1445 |
| AC | 676 |
| AL | 383 |
| CT | 229 |
| CH | 122 |
| AP | 51 |
| LL | 31 |
| XX | 18 |
| LA | 15 |
| KB | 14 |
| bP | 11 |
| BX | 9 |
| CU | 8 |
| CM | 7 |
| GS | 7 |
| Z9 | 7 |
| AB | 6 |
| AF | 6 |
| AU | 6 |
| FO | 6 |
This solution has already been incorporated, so the results below are no longer an accurate representation and should be ignored.
Does removing those suffixes make “UNKNOWN” genes “CONVERTIBLE”?
After removing those suffixes from the 3087 genes (see above) and after removing any suffixes in the ENSEMBL names, these are the new gene states:
| Gene state | Counts |
|---|---|
| AMBIGUOUS | 312 |
| CONVERTIBLE | 31 |
| UNKNOWN (GRCh37: AMBIGUOUS) | 897 |
| UNKNOWN (GRCh37: CONVERTIBLE) | 19 |
| UNKNOWN (GRCh37: UNKNOWN) | 1828 |
| dataset | AMBIGUOUS | CONVERTIBLE | UNKNOWN (GRCh37: CONVERTIBLE) | UNKNOWN (GRCh37: UNKNOWN) | UNKNOWN (GRCh37: AMBIGUOUS) |
|---|---|---|---|---|---|
| A single-cell atlas of the hea … | 29 (0.13) | 22426 (98) | 111 (0.49) | 317 (1.4) | NA |
| A Single-Cell Transcriptional … | 16 (0.11) | 14988 (99) | 19 (0.13) | 50 (0.33) | NA |
| Direct Exposure to SARS-CoV-2 … | 28 (0.084) | 32610 (97) | 186 (0.56) | 682 (2) | NA |
| Evolution of cellular diversit … | 11 (0.074) | 14701 (100) | 9 (0.061) | 47 (0.32) | NA |
| Fibroblasts — Cells of the adu … | 28 (0.084) | 32610 (97) | 186 (0.56) | 679 (2) | NA |
| Infiltrating Neoplastic Cells … | 72 (0.31) | 21641 (93) | 73 (0.31) | 1517 (6.5) | 5 (0.021) |
| Multiomics single-cell analysi … | 34 (0.13) | 25764 (98) | 148 (0.56) | 441 (1.7) | NA |
| Single cell transcriptional an … | 29 (0.13) | 21854 (98) | 103 (0.46) | 371 (1.7) | NA |
| Single-cell RNA-Seq Investigat … | 29 (0.13) | 18094 (84) | 3401 (16) | 75 (0.35) | NA |
| Single-cell transcriptomics of … | 981 (1.6) | 41473 (68) | 15232 (25) | 2937 (4.8) | 102 (0.17) |
| Spatiotemporal analysis of hum … | 28 (0.084) | 32610 (97) | 186 (0.56) | 679 (2) | NA |
| Time-resolved Systems Immunolo … | 32 (0.098) | 23408 (72) | 9181 (28) | 115 (0.35) | 2 (0.0061) |
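Each cell in the table above is a count followed by a percentage of that dataset's genes. A minimal sketch of that tally for one dataset is shown below; it assumes the per-gene states have already been computed, and `state_summary` is a hypothetical name.

```python
from collections import Counter


def state_summary(states):
    """Return {state: (count, percent of all genes)} for one dataset."""
    counts = Counter(states)
    total = sum(counts.values())
    return {state: (n, 100 * n / total) for state, n in counts.items()}
```

States that never occur in a dataset are simply absent from the result, which corresponds to the `NA` cells in the table.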
These notes are adapted from a Slack conversation with Brian Raymnor and Ambrose Carr.
RP
LINC
ZN
LO
CT
Metazoa
uc
CH
AP