Author: Pablo Garcia-Nieto
Last updated: May 24th, 2021
Goals:
Gene symbols in current cellxgene datasets were looked up in the Ensembl GTF files and assigned one of the following states:
| Gene state | Count |
|---|---|
| AMBIGUOUS | 194 (0.11%) |
| CONVERTIBLE | 154541 (88%) |
| UNKNOWN (GRCm38: AMBIGUOUS) | 5 (0.0028%) |
| UNKNOWN (GRCm38: CONVERTIBLE) | 1058 (0.6%) |
| UNKNOWN (GRCm38: UNKNOWN) | 19751 (11%) |
Figure: gene states across datasets.
Genes shown in yellow and light green, taken together, are those that can be safely converted.
TODO
For each gene:
- […] `\.\d.*` prefix.
- […] `\.\d.*` prefix, remove it and repeat 2.
- […] `-.*` prefix, remove it and repeat 2.

The process was run first with GRCm39 and then repeated with GRCm38.
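
The lookup procedure above can be sketched roughly as follows. This is a minimal illustration and not the code used for this analysis. It assumes that each annotation has already been loaded into a plain dictionary mapping a gene symbol to the set of Ensembl IDs that carry it, that CONVERTIBLE means exactly one ID, AMBIGUOUS more than one, and UNKNOWN none, and that the two patterns are stripped cumulatively with `re.sub`. The names `classify`, `classify_with_fallback`, `grcm39` and `grcm38` are hypothetical.

```python
import re

# Patterns named in the steps above, tried in order when a lookup fails.
STRIP_PATTERNS = [r"\.\d.*", r"-.*"]


def lookup(symbol, symbol_to_ids):
    """Assumed state definitions: one ID, several IDs, or no ID."""
    ids = symbol_to_ids.get(symbol, set())
    if len(ids) == 1:
        return "CONVERTIBLE"
    if len(ids) > 1:
        return "AMBIGUOUS"
    return "UNKNOWN"


def classify(symbol, symbol_to_ids):
    """Look up a symbol, stripping one pattern at a time while it is unknown."""
    state = lookup(symbol, symbol_to_ids)
    for pattern in STRIP_PATTERNS:
        if state != "UNKNOWN":
            break
        # Assumption: stripping is cumulative across patterns.
        symbol = re.sub(pattern, "", symbol, count=1)
        state = lookup(symbol, symbol_to_ids)
    return state


def classify_with_fallback(symbol, grcm39, grcm38):
    """Try GRCm39 first; fall back to GRCm38 only for unknown symbols."""
    state = classify(symbol, grcm39)
    if state != "UNKNOWN":
        return state
    return f"UNKNOWN (GRCm38: {classify(symbol, grcm38)})"
```

With this layout, the labels returned by `classify_with_fallback` match the state names used in the tables of this report.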
This section contains open-ended research into the gene symbols that could not be resolved. It looks for patterns in gene symbols that were not found in any Ensembl version (those labeled “UNKNOWN (GRCm38: UNKNOWN)” above).
Some useful stats about them:
| Number of datasets | Count |
|---|---|
| 1 | 12520 |
| 2 | 1394 |
| 3 | 379 |
| 4 | 285 |
| 5 | 240 |
| 6 | 133 |
| 7 | 24 |
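
The table above tallies unique symbols by the number of datasets they appear in. Weighting each row by its dataset count (12520×1 + 1394×2 + … + 24×7) gives 19751, the UNKNOWN (GRCm38: UNKNOWN) total reported earlier. A tally of this kind could be produced along these lines. This is a sketch only: the input layout (a dictionary from dataset name to its unknown symbols) and the function name `datasets_per_symbol` are assumptions made for illustration.

```python
from collections import Counter


def datasets_per_symbol(unknown_by_dataset):
    """Map 'number of datasets' -> 'number of symbols seen in that many'.

    unknown_by_dataset: dict of dataset name -> iterable of unknown symbols
    (hypothetical input layout).
    """
    per_symbol = Counter()
    for symbols in unknown_by_dataset.values():
        # set() so a symbol repeated within one dataset is counted once.
        per_symbol.update(set(symbols))
    return Counter(per_symbol.values())
```
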
| Prefix | Count | Examples |
|---|---|---|
| Gm | 6079 | Gm15698, Gm34858, Gm5538, Gm17365 |
| LO | 5502 | LOC105244415, LOC105244438, LOC105246620, LOC100417215 |
| c1 | 340 | c13_tRNA-Lys-AAA, c1_tRNA-Leu-TTA, c13_tRNA-Met-i, c19_tRNA-Ser-AGY |
| 49 | 214 | 4930455C21Rik, 4930480K23Rik, 4930583H14Rik, 4930539E08Rik |
| 17 | 131 | 1700091E21Rik, 1700016G14Rik, 1700011J10Rik, 1700101G07Rik |
| mt | 97 | mt_AK165865, mt_AK165190, mt_AF071428, mt_AK157367 |
| RP | 92 | RP24-308M10.2, RP24-286L23.3, RP24-177G14.2, RP24-484I1.3 |
| BC | 81 | BC002163, BC068157, BC037034, BC049730 |
| Zf | 61 | Zfp383, Zfp292, Zfp276, Zfp71-rs1 |
| Mi | 58 | Mir5124, Mir219-2, Mira, Mir873 |
| Vm | 56 | Vmn2r-ps71, Vmn2r-ps138, Vmn2r123, Vmn1r-ps44 |
| 23 | 52 | 2310014L17Rik, 2310001H18Rik, 2310010M20Rik, 2310031A07Rik |
| AC | 52 | AC168977.1, AC099934.3, AC164099.2, AC158605.2 |
| ZN | 50 | ZNF507, ZNF76, ZNF513, ZNF235 |
| 11 | 46 | 1110018J18Rik, 1110021L09Rik, 1190005F20Rik, 1100001G20Rik |
| cx | 37 | cx_tRNA-Thr-ACY, cx_tRNA-Val-GTA, cx_tRNA-Leu-TTG, cx_tRNA-Pro-CCA |
| Ol | 36 | Olfr240-ps1, Olfr182-ps1, Olfr766, Olfr940-ps1 |
| c3 | 35 | c3_tRNA-Val-GTG, c3_tRNA-Gly-GGA, c3_tRNA-Pro-CCA, c3_tRNA-Gln-CAA |
| c7 | 35 | c7_tRNA-Leu-CTG, c7_tRNA-Pro-CCY, c7_tRNA-Gly-GGG, c7_tRNA-Tyr-TAC |
| c6 | 34 | c6_tRNA-Phe-TTY, c6_tRNA-Trp-TGG, c6_tRNA-Thr-ACA, c6_tRNA-Ala-GCY_ |
| 18 | 32 | 1810019J16Rik, 1810043H04Rik, 1810035I16Rik, 1810032O08Rik |
| c2 | 32 | c2_tRNA-Gly-GGA, c2_tRNA-Ser-AGY, c2_tRNA-Ile-ATA, c2_tRNA-Gln-CAA |
| c5 | 31 | c5_tRNA-Gln-CAG, c5_tRNA-Gly-GGY, c5_tRNA-Ala-GCY_, c5_tRNA-Asn-AAC |
| 24 | 28 | 2400001E08Rik, 2410066E13Rik, 2410012M07Rik, 2410016O06Rik |
| 26 | 28 | 2610034B18Rik, 2610101N10Rik, 2610203C20Rik, 2610039C10Rik |
| c9 | 28 | c9_tRNA-Ile-ATA, c9_tRNA-Met, c9_tRNA-Val-GTA, c9_LSU-rRNA_Hsa |
| 28 | 27 | 2810432D09Rik, 2810422J05Rik, 2810407C02Rik, 2810403A07Rik |
| c4 | 27 | c4_tRNA-Tyr-TAT, c4_tRNA-Gly-GGG, c4_tRNA-Ala-GCA, c4_tRNA-Ala-GCY_ |
| c8 | 26 | c8_tRNA-Ile-ATT, c8_tRNA-Lys-AAA, c8_tRNA-Ile-ATA, c8_tRNA-Glu-GAG_ |
| 20 | 24 | 2010002M12Rik, 2010005H15Rik, 2010109I03Rik, 2010005H15Rik |
| D1 | 24 | D14Ertd668e, D17H6S56E-3, D14Ertd668e, D17Ertd648e |
| AI | 23 | AI854517, AI314976, AI464131, AI481877 |
| MI | 23 | MIR365A, MIR486-1, MIR151A, MIR544A |
| Fa | 22 | Fam21, Fam108b, Fam48a, Fam108a |
| 06 | 20 | 0610007P22Rik, 0610037L13Rik, 0610038L08Rik, 0610011F06Rik |
| 22 | 20 | 2210416O15Rik, 2210009G21Rik, 2210415F13Rik, 2210415F13Rik |
| Hi | 18 | Hist2h2aa2, Hist2h2aa2, Hist1h1a, Hist1h1b |
| FA | 17 | FAM160A2, FAM110D, FAM110D, FAM160B1 |
| 31 | 16 | 3110062M04Rik, 3110007F17Rik, 3110057O12Rik, 3110002H16Rik |
| 57 | 16 | 5730457N03Rik, 5730494M16Rik, 5730457N03Rik, 5730528L13Rik |
| AT | 16 | ATP5F1E, ATP5PD, ATP5F1A, ATP5PD |
| H1 | 16 | H1f0, H1f0, H1fnt, H1-7 |
| At | 15 | Atp5h, Atpif1, Atp5h, Atpif1 |
| TR | 15 | TRIM43, TRD, TRIM75P, TRBV22-1 |
| 15 | 14 | 1500015O10Rik, 1500016L03Rik, 1500015O10Rik, 1500032L24Rik |
| 27 | 14 | 2700094K13Rik, 2700078E11Rik, 2700089E24Rik, 2700060E02Rik |
| 63 | 14 | 6330408A02Rik, 6330578E17Rik, 6330439K17Rik, 6330416G13Rik |
| C3 | 14 | C330019G07Rik, C330046G13Rik, C330005M16Rik, C330027C09Rik |
| Sp | 14 | Spry3, Speer4f, Spnb1, Spnb3 |
| Mt | 13 | Mtap7d1, Mtap7d3, Mtag2, Mtap4 |
| Prefix | Count |
|---|---|
| RP | 91 |
| AC | 51 |
| CT | 6 |
| Ep | 3 |
| CA | 2 |
| AL | 1 |
| CR | 1 |
| WI | 1 |
Gene states per dataset, shown as count (% of that dataset's gene symbols):

| Dataset | AMBIGUOUS | CONVERTIBLE | UNKNOWN (GRCm38: CONVERTIBLE) | UNKNOWN (GRCm38: UNKNOWN) | UNKNOWN (GRCm38: AMBIGUOUS) |
|---|---|---|---|---|---|
| A Single-Cell Transcriptional … | 12 (0.08) | 14598 (98) | 43 (0.29) | 266 (1.8) | NA |
| A transcriptomic atlas of the … | 24 (0.098) | 23026 (94) | 110 (0.45) | 1249 (5.1) | NA |
| Adult mouse cortical cell taxo … | 82 (0.12) | 54030 (77) | 526 (0.75) | 15131 (22) | 5 (0.0072) |
| All — A single-cell transcript … | 26 (0.13) | 18184 (90) | 79 (0.39) | 1849 (9.2) | NA |
| An integrated transcriptomic a … | 30 (0.12) | 23348 (97) | 146 (0.6) | 616 (2.6) | NA |
| Glutamatergic neurons — An Atl … | 20 (0.091) | 21105 (96) | 154 (0.7) | 636 (2.9) | NA |
| Molecular, spatial and project … | NA | 250 (98) | NA | 4 (1.6) | NA |