For the HGNChelper paper, we used the results from a run of the inst/analyses/test_GPLs.R script
on September 28th, 2020. The output is saved as inst/analyses/gpls_already_tested.csv.
Briefly, we downloaded 20,716 GEO platforms, of which 5,128 were from Homo sapiens.
If more than 50% of the values in a column were human gene symbols, we considered that column
a ‘(unique) gene symbol’ column and summarized the performance of HGNChelper
on the values of that column in the gpls_already_tested.csv table. The two main
pieces of information in that table are the fractions of valid gene symbols before and after HGNChelper.
colname : the name of the column in the fetched GPL data to which HGNChelper was applied
valid.frac : the fraction of valid gene symbols BEFORE HGNChelper
valid.after.hgnchelper.frac : the fraction of valid gene symbols AFTER HGNChelper
library(HGNChelper) # checkGeneSymbols()
library(GEOquery) # getGEO(), Table()
library(magrittr) # %>%
dat_dir <- "~/data/HGNChelper_raw_data_ver1"
x <- read.csv("gpls_already_tested_ver1.csv", as.is = TRUE)
x$valid.after.hgnchelper.frac <- as.numeric(x$valid.after.hgnchelper.frac)
x <- x[!is.na(x$valid.frac),]
x <- x[x$valid.frac > 0,]
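The 50% rule used to pick a gene symbol column can be sketched in base R as follows. This is a minimal illustration, not the code in test_GPLs.R: `valid_symbols`, `frac_valid` and `toy_gpl` are hypothetical names, the reference set is a tiny made-up stand-in for the full HGNC symbol list, and dropping empty values before computing the fraction is an assumption.

```r
# Hypothetical reference set; the real analysis would use the full HGNC symbol list.
valid_symbols <- c("TP53", "BRCA1", "HEBP1", "KCNE4", "BPIFA3")

# Fraction of non-empty values in one column that are found in the reference set.
# Assumption: empty strings and NA are ignored.
frac_valid <- function(values, reference) {
  values <- as.character(values)
  values <- values[!is.na(values) & nzchar(values)]
  if (length(values) == 0) return(0)
  mean(values %in% reference)
}

# Toy platform table (made-up rows, not a real GPL).
toy_gpl <- data.frame(
  ID = c("A_23_P117082", "A_33_P3246448", "A_33_P3318220", "A_24_P16856"),
  Gene.Symbol = c("HEBP1", "KCNE4", "BPIFA3", "DarkCorner"),
  stringsAsFactors = FALSE
)

# Score every column and keep those with more than 50% valid symbols.
fracs <- vapply(toy_gpl, frac_valid, numeric(1), reference = valid_symbols)
symbol_cols <- names(fracs)[fracs > 0.5]
symbol_cols
```

In this toy table only `Gene.Symbol` passes the threshold, because three of its four values are in the reference set.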
The outlier table summarizes the platforms in which fewer than 90% of gene symbols
are valid after HGNChelper.
outlier <- x[x$valid.after.hgnchelper.frac < 0.9,]
outlier <- outlier[order(outlier$valid.after.hgnchelper.frac),]
head(outlier)
## platform colname frac.hgnc nrow valid.frac
## 85 GPL26963 ORF 0.5044283 60795 0.4926269
## 37 GPL25978 ID 0.5111388 43317 0.4765565
## 168 GPL24444 ID 0.5111855 43315 0.4765786
## 237 GPL25261 Gene ID 0.5295989 43315 0.4937454
## 1640 GPL16309 Gene Symbol 0.5147768 62976 0.4957780
## 288 GPL23465 ID 0.5168011 36337 0.4976745
## valid.after.hgnchelper.frac distribution submission_date
## 85 0.5025098 custom-commercial Jul 25 2019
## 37 0.5044902 custom-commercial Dec 22 2018
## 168 0.5051137 commercial Dec 27 2017
## 237 0.5233084 commercial Jul 02 2018
## 1640 0.5531582 custom-commercial Nov 20 2012
## 288 0.5560173 custom-commercial May 11 2017
Column names of the GPLs that were not effectively corrected by HGNChelper:
table(outlier$colname) %>% sort(., decreasing = TRUE)
##
## ORF
## 85
## GENE_SYMBOL
## 56
## SPOT_ID
## 28
## Symbol
## 21
## Gene Symbol
## 19
## Gene_Symbol
## 13
## GeneSymbol
## 13
## gene_symbol
## 12
## GeneName
## 10
## ID
## 10
## Blast Gene Symbol
## 7
## GENE SYMBOL
## 7
## name
## 7
## symbol
## 7
## CompositeSequence BioSequence [Gene Symbol]
## 6
## ORF_LIST
## 6
## SYMBOL
## 6
## Name
## 5
## Gene_symbol
## 4
## OligoSet_geneSymbol
## 3
## gene symbol
## 2
## GENENAME
## 2
## HUGOname
## 2
## Primary Annotation
## 2
## Primary Sequence Name
## 2
## Alternative gene name
## 1
## Closest_TSS_gene_name
## 1
## Common
## 1
## COMMON
## 1
## Ensembl_gene_symbol
## 1
## External name
## 1
## GENE
## 1
## Gene ID
## 1
## gene name
## 1
## Gene name
## 1
## Gene Name
## 1
## Gene symbol
## 1
## Gene Symbols
## 1
## GENE_SYBMOL
## 1
## GENE_SYM
## 1
## gene_symbols
## 1
## Gene.Symbol
## 1
## GeneSym
## 1
## GENESYMBOL
## 1
## GeneSymbols
## 1
## Human Gene Symbols
## 1
## Reporter_Name
## 1
## SeqName
## 1
## Symbol v12
## 1
## symbol_from_entrez_gene
## 1
## V3.0.3_gene_symbol
## 1
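Many of the column names above differ only in case or separator. One possible way to see how the counts pool is to normalize the names before tallying. This is a sketch under assumptions: `normalize_colname` is a hypothetical helper, and only the first eight rows of the table above are re-entered by hand.

```r
# Lower-case a column name and drop everything that is not a letter or digit,
# so that "GENE_SYMBOL", "Gene Symbol" and "GeneSymbol" collapse to "genesymbol".
normalize_colname <- function(x) {
  gsub("[^a-z0-9]+", "", tolower(x))
}

# First eight rows of the table above, copied by hand.
colnames_seen <- c("ORF", "GENE_SYMBOL", "SPOT_ID", "Symbol",
                   "Gene Symbol", "Gene_Symbol", "GeneSymbol", "gene_symbol")
counts <- c(85, 56, 28, 21, 19, 13, 13, 12)

# Sum the counts of names that normalize to the same key.
pooled <- sapply(split(counts, normalize_colname(colnames_seen)), sum)
sort(pooled, decreasing = TRUE)
```

With these eight rows, the five "gene symbol" variants pool to 113 platforms, which is more than the 85 for `ORF`.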
HGNChelper was applied to the ORF column of GPL26963, and only ~50% of gene symbols
were valid both before and after applying HGNChelper.
outlier[1,]
## platform colname frac.hgnc nrow valid.frac valid.after.hgnchelper.frac
## 85 GPL26963 ORF 0.5044283 60795 0.4926269 0.5025098
## distribution submission_date
## 85 custom-commercial Jul 25 2019
This is the actual platform table of GPL26963 fetched from GEO.
i <- 1 # GPL26963
gpl <- outlier[i, "platform"]
gpldat <- getGEO(filename = file.path(dat_dir, "platforms", paste0(gpl, ".soft")))
gpltable <- Table(gpldat)
head(gpltable)
## ID TRANSCRIPT_TYPE ACC ORF SOURCE BUILD CHROM STRAND
## 1 ASHGV40000072V5 lncRNA NR_027417 LOC644669 Refseq HG19 chr18 -
## 2 ASHGV40000162V5 lncRNA NR_037149 NME1-NME2 Refseq HG19 chr17 +
## 3 ASHGV40000243V5 lncRNA NR_130700 LOC728323 Refseq HG19 chr2 +
## 4 ASHGV40004835V5 lncRNA NR_130699 LOC728323 Refseq HG19 chr2 +
## 5 ASHGV40030421V5 lncRNA NR_130701 LOC728323 Refseq HG19 chr2 +
## 6 ASHGV40000318V5 lncRNA NR_046258 FAR2P2 Refseq HG19 chr2 -
## txStart txEnd GENE DESCRIPTION
## 1 15313554 15325918
## 2 49230896 49249105
## 3 243030783 243102476
## 4 243030783 243102476
## 5 243030783 243102476
## 6 131174325 131186119
## SEQUENCE SPOT_ID
## 1 ATAGTAATCCAGAAGGAACATCTGAAGGAACACTTGATGAGGCTGCACCCTTGGCAGAAA NA
## 2 TTCTGCATACAAGTTGGCAGGACCATGGCCAACCTGGAGCGCACCTTCATCGCCATCAAG NA
## 3 CTGAGCACTGATACAAAGAAAGACAAACATCACCAAACCAATGCAGACCAAACCAATGCA NA
## 4 CAGCTGAGCACTGATACAAAGAAAGACAAACATCATATCTCTCTTGATTCTTAGTAACAA NA
## 5 TTAAATAGCGAAGATGGAGAAATACTCAATAATGAAGAGCATGAATATGCATCCAAAAAA NA
## 6 AGCAAAATGTGATTCCAGGTCTTGGCAACCTCTGAAATTCCAACTCCATTTGCGAGAGCT NA
columnName <- outlier[i, "colname"]
checkGeneSymbols(head(gpltable[,columnName]))
## x Approved Suggested.Symbol
## 1 LOC644669 FALSE <NA>
## 2 NME1-NME2 TRUE NME1-NME2
## 3 LOC728323 FALSE <NA>
## 4 LOC728323 FALSE <NA>
## 5 LOC728323 FALSE <NA>
## 6 FAR2P2 TRUE FAR2P2
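The before and after fractions can be derived from output of this shape. The sketch below rebuilds the six rows shown above by hand rather than calling checkGeneSymbols(). It assumes the fractions are computed on unique input values and that any non-missing `Suggested.Symbol` counts as valid after correction. Both assumptions are illustrative and may differ from test_GPLs.R.

```r
# Hand-built copy of the checkGeneSymbols() output shown above.
res <- data.frame(
  x = c("LOC644669", "NME1-NME2", "LOC728323", "LOC728323", "LOC728323", "FAR2P2"),
  Approved = c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE),
  Suggested.Symbol = c(NA, "NME1-NME2", NA, NA, NA, "FAR2P2"),
  stringsAsFactors = FALSE
)

# Assumption: each distinct input value is counted once.
res <- res[!duplicated(res$x), ]

# Fraction already valid before correction.
valid_frac <- mean(res$Approved)

# Fraction valid after correction: HGNChelper returned some suggestion.
valid_after_frac <- mean(!is.na(res$Suggested.Symbol))

c(before = valid_frac, after = valid_after_frac)
```

For these rows both fractions are 0.5: the LOC identifiers have no suggested replacement, so correction does not change the result.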
The ID column contains both gene symbols and probe names.
i <- 2 # GPL25978
gpl <- outlier[i, "platform"]
gpldat <- getGEO(filename = file.path(dat_dir, "platforms", paste0(gpl, ".soft")))
gpltable <- Table(gpldat)
head(gpltable)
## ID ProbeName GB_ACC
## 1 CNP A_21_P0014015;A_23_P21838;A_33_P3250383
## 2 HMGCL A_33_P3388283;A_23_P145
## 3 HTR2C A_23_P433586;A_33_P3319176
## 4 MTRR A_21_P0014139;A_21_P0014138;A_33_P3304377;A_23_P252211
## 5 PHYKPL A_33_P3253628;A_23_P41888
## 6 AKAP17A A_24_P16856;A_23_P342668
## Genbank ACCID
## 1 ENST00000592861
## 2 ENST00000496907
## 3 ENST00000276198
## 4 ENST00000507837
## 5 ENST00000476487
## 6 ENST00000381261
## Description
## 1 ens|2',3'-cyclic nucleotide 3' phosphodiesterase [Source:HGNC Symbol;Acc:HGNC:2158] [ENST00000592861]
## 2 ens|3-hydroxymethyl-3-methylglutaryl-CoA lyase [Source:HGNC Symbol;Acc:HGNC:5005] [ENST00000496907]
## 3 ens|5-hydroxytryptamine (serotonin) receptor 2C, G protein-coupled [Source:HGNC Symbol;Acc:HGNC:5295] [ENST00000276198]
## 4 ens|5-methyltetrahydrofolate-homocysteine methyltransferase reductase [Source:HGNC Symbol;Acc:HGNC:7473] [ENST00000507837]
## 5 ens|5-phosphohydroxy-L-lysine phospho-lyase [Source:HGNC Symbol;Acc:HGNC:28249] [ENST00000476487]
## 6 ens|A kinase (PRKA) anchor protein 17A [Source:HGNC Symbol;Acc:HGNC:18783] [ENST00000381261]
## SEQUENCE
## 1 GTCTGGAAATCAAAATCACAGATACTCCACTCCTCTTTGTGTCTTCCAGCCCCATAATGT;TAACAGGGCCTTGCTAATCGGGTTGTCACTCAACAAAAGTGCTTTGGATTTAAGTTACTA;CCCTATAATGCTGGAGCGGCTACTAAAAAGGATAAAATGTATCACTTAAATGTTACCAAA
## 2 AAGAGTGACTTCCTCTCCTATGCCTTGCCAGGCAGGTAAAGGGGAATTCTGCACAGCTGA;GGACATGGAAATGAGAATAGGTTAAATGGTGCAGGTACCTCATAGCCAGCTCTACACAGA
## 3 AATGGCTGAAGAAACACAGCATGCATTTAGCATGAGTTCTGCACATACAGATGGTGTCCT;GAAACTTTGGAAGTTTTACTTGATTAAGGACTACAGAATTGGGCCCTTAGAATGTGAAAA
## 4 GATCCTTCAATCCAGACTGTGAAGAAAACCCTTGAGTTAAGACATTAGACATCTGTGCAA;GTGGATGGCAATACAAATCCTTCTTCATCTAATTTAATTTCAACATCTGGTGAGAGAAGA;CCTCTCTGAATATTCCTGGTTTACCCCCAGAATATTTACAGGTACATCTGCAGGAGTCTC;TTCCATGCAAAGGCTTCCTGAAATAGGGGAGACTGACTGAGTAGCTCATTCTTGTGACTT
## 5 CAAAAGGTACCAAAAAGTACAGTAAAATTAACACTTCCGTTACAGGAAATGTATGACGCA;CCTGCTCTGCCTAAGTGTACTCCAGAAGAAACTCATCTCATCCAAATACACGCTATTGAG
## 6 TCTGATTACAGTTGTGATAAGTCCCCGGGAAGGAGCATGACAAGAGGCTGAGACATGTGG;AAATAGCTGTAACGTTCGCGTTAGGAAAGATGGTGTTTATTCCAGTTTGCATTTTTATGG
## SPOT_ID
## 1 ENST00000592861
## 2 ENST00000496907
## 3 ENST00000276198
## 4 ENST00000507837
## 5 ENST00000476487
## 6 ENST00000381261
columnName <- outlier[i, "colname"]
columnName
## [1] "ID"
checkGeneSymbols(tail(gpltable[,columnName]))
## x Approved Suggested.Symbol
## 1 THC2767512 FALSE NA
## 2 THC2772172 FALSE NA
## 3 THC2773168 FALSE NA
## 4 THC2773489 FALSE NA
## 5 THC2774795 FALSE NA
## 6 THC2779931 FALSE NA
The ID column contains both gene symbols and experiment-specific, re-annotated IDs.
i <- 3 # GPL24444
gpl <- outlier[i, "platform"]
gpldat <- getGEO(filename = file.path(dat_dir, "platforms", paste0(gpl, ".soft")))
gpltable <- Table(gpldat)
head(gpltable)
## ID ProbeName Genbank ID
## 1 MARCH1 A_24_P188800;A_33_P3293362;A_33_P3247205 NM_017923;NM_022746;NM_022746
## 2 MARCH2 A_23_P33683;A_23_P200685;A_33_P3219434 NM_016496;NM_017898;NM_017898
## 3 MARCH3 A_24_P555473;A_23_P321511 NM_178450;NM_178450
## 4 MARCH4 A_23_P333228 NM_020814
## 5 MARCH5 A_24_P64393 NM_017824
## 6 MARCH6 A_23_P110492 NM_005885
## Description
## 1 ref|Homo sapiens membrane-associated ring finger (C3HC4) 1, E3 ubiquitin protein ligase (MARCH1), transcript variant 2, mRNA [NM_017923];ref|Homo sapiens mitochondrial amidoxime reducing component 1 (MARC1), mRNA [NM_022746];ref|Homo sapiens mitochondrial amidoxime reducing component 1 (MARC1), mRNA [NM_022746]
## 2 ref|Homo sapiens membrane-associated ring finger (C3HC4) 2, E3 ubiquitin protein ligase (MARCH2), transcript variant 1, mRNA [NM_016496];ref|Homo sapiens mitochondrial amidoxime reducing component 2 (MARC2), mRNA [NM_017898];ref|Homo sapiens mitochondrial amidoxime reducing component 2 (MARC2), mRNA [NM_017898]
## 3 ref|Homo sapiens membrane-associated ring finger (C3HC4) 3, E3 ubiquitin protein ligase (MARCH3), mRNA [NM_178450];ref|Homo sapiens membrane-associated ring finger (C3HC4) 3, E3 ubiquitin protein ligase (MARCH3), mRNA [NM_178450]
## 4 ref|Homo sapiens membrane-associated ring finger (C3HC4) 4, E3 ubiquitin protein ligase (MARCH4), mRNA [NM_020814]
## 5 ref|Homo sapiens membrane-associated ring finger (C3HC4) 5 (MARCH5), mRNA [NM_017824]
## 6 ref|Homo sapiens membrane-associated ring finger (C3HC4) 6, E3 ubiquitin protein ligase (MARCH6), transcript variant 1, mRNA [NM_005885]
## SEQUENCE
## 1 GTCACGCAAAAGATTTTCAGAAAATGTTCGGATATAATTAGCTCTGTTAAATACCCACAG;AGTTAAAGCAACCAACTTCAGGCCCAATATTGTAATTTCAGGATGCGATGTCTATGCAGA;CTGAAAACCTTTAAAGGGGGAAAAGGAAAGCATATGTCAGTTGTTTAAAACCCAATATCT
## 2 CAGCATTCTCCACTGGCAGCTGGACTCCTGAAGAAGGTGGCAGAGGAGACACCAGTATGA;TAACAACAGCAGCAACGATACATCAGCAAATCCTTATTATCCAGCCTTCAACTATCTTTA;AATGAAAATGGAGAATTTCAGGCCAAATATTGTGGTGACCGGCTGTGATGCTTTTGAGGA
## 3 CTTGCTGTGATTCTGTCCTAATCATTTTTCTTGAGAATGTCATGTAGAGATAAATGTGTG;TCGGACCAATCAGAGGGTGATTCTCCTCATTCCAAAGTCTGTCAATGTACCTTCTAACCA
## 4 TTTTTAAAACTCTCTGTTGTTTGTAATATTCTCTTAAAAGCTTGAAAATAAAACTTCTTT
## 5 AGCCCCGAATTGAACACTTTTAAACCTAAAGAGCCTTATTATTATTAGCTCGAGAAATAC
## 6 TAAAGAGCAATGTGTTCTGGCTGTTTTATACTTCAACAATTTTTTCCCTAAGTGGTAAGC
## GB LIST SPOT_ID
## 1 NM_017923, NM_022746 MARCH1
## 2 NM_016496, NM_017898, NM_017898 MARCH2
## 3 NM_178450, NM_178450 MARCH3
## 4 NM_020814 MARCH4
## 5 NM_017824 MARCH5
## 6 NM_005885 MARCH6
columnName <- outlier[i, "colname"]
columnName
## [1] "ID"
checkGeneSymbols(head(gpltable[,columnName]))
## x Approved Suggested.Symbol
## 1 MARCH1 FALSE MARCHF1
## 2 MARCH2 FALSE MARCHF2
## 3 MARCH3 FALSE MARCHF3
## 4 MARCH4 FALSE MARCHF4
## 5 MARCH5 FALSE MARCHF5
## 6 MARCH6 FALSE MARCHF6
hgnc.vec <- unique(as.character(gpltable[, columnName])) # assumed that there is only one 'symbol' column
hgnc.vec <- gsub("[ ].+", "", hgnc.vec) # get rid of anything after a space
HGNChelper.output <- checkGeneSymbols(iconv(hgnc.vec, "latin1", "ASCII", ""),
map = currentHumanMap) # convert to ascii
## Warning in checkGeneSymbols(iconv(hgnc.vec, "latin1", "ASCII", ""), map =
## currentHumanMap): Human gene symbols should be all upper-case except for the
## 'orf' in open reading frames. The case of some letters was corrected.
## Warning in checkGeneSymbols(iconv(hgnc.vec, "latin1", "ASCII", ""), map =
## currentHumanMap): x contains non-approved gene symbols
notFixed <- which(is.na(HGNChelper.output$Suggested.Symbol))
head(HGNChelper.output[notFixed,])
## x Approved Suggested.Symbol
## 27 A_19_P00325768 FALSE <NA>
## 28 A_19_P00800244 FALSE <NA>
## 29 A_19_P00802027 FALSE <NA>
## 30 A_19_P00803499 FALSE <NA>
## 31 A_19_P00803675 FALSE <NA>
## 32 A_19_P00804070 FALSE <NA>
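The clean-up applied before checkGeneSymbols() in the code above (deduplicate, cut at the first space, convert to ASCII) can be tried on a toy vector. `clean_symbols` is a hypothetical wrapper around those same three calls, and the input values are made up for illustration.

```r
# Wrap the three preprocessing steps used above, in the same order.
clean_symbols <- function(x) {
  x <- unique(as.character(x))       # one entry per distinct raw value
  x <- gsub("[ ].+", "", x)          # drop everything from the first space on
  iconv(x, "latin1", "ASCII", sub = "")  # drop bytes that are not ASCII
}

# Made-up raw values: a duplicate, a multi-symbol entry, and a symbol with a note.
raw <- c("MARCH1", "MARCH1", "TP53 /// MDM2",
         "BRCA1 (breast cancer 1)", "A_19_P00325768")

cleaned <- clean_symbols(raw)
cleaned
```

Because `unique()` runs before the text after a space is removed, two raw values such as "TP53 /// MDM2" and "TP53" would both survive as "TP53". A second `unique()` after `gsub()` would remove such repeats if they matter for the fractions.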
i <- 4 # GPL25261
gpl <- outlier[i, "platform"]
gpldat <- getGEO(filename = file.path(dat_dir, "platforms", paste0(gpl, ".soft")))
gpltable <- Table(gpldat)
head(gpltable)
## ID Gene ID ProbeName
## 1 A_21_P0014386 A_21_P0014386
## 2 CPED1 CPED1 A_33_P3396872;A_24_P187799;A_24_P943781
## 3 BCOR BCOR A_33_P3267760;A_23_P405707;A_23_P159741
## 4 CHAC2 CHAC2 A_32_P194264
## 5 IFI30 IFI30 A_23_P153745
## 6 A_33_P3352837 A_33_P3352837
## SPOT_ID
## 1 A_21_P0014386
## 2 NM_001105533;NM_024913;NM_024913
## 3 ENST00000378463;ENST00000615339;NM_017745
## 4 NM_001008708
## 5 NM_006332
## 6 A_33_P3352837
## Description
## 1 Unknown
## 2 ref|Homo sapiens cadherin-like and PC-esterase domain containing 1 (CPED1), transcript variant 2, mRNA [NM_001105533];ref|Homo sapiens cadherin-like and PC-esterase domain containing 1 (CPED1), transcript variant 1, mRNA [NM_024913];ref|Homo sapiens cadherin-like and PC-esterase domain containing 1 (CPED1), transcript variant 1, mRNA [NM_024913]
## 3 ens|BCL6 corepressor [Source:HGNC Symbol;Acc:HGNC:20893] [ENST00000378463];ens|BCL6 corepressor [Source:HGNC Symbol;Acc:HGNC:20893] [ENST00000615339];ref|Homo sapiens BCL6 corepressor (BCOR), transcript variant 1, mRNA [NM_017745]
## 4 ref|Homo sapiens ChaC, cation transport regulator homolog 2 (E. coli) (CHAC2), mRNA [NM_001008708]
## 5 ref|Homo sapiens interferon, gamma-inducible protein 30 (IFI30), mRNA [NM_006332]
## 6 Unknown
columnName <- outlier[i, "colname"]
columnName
## [1] "Gene ID"
hgnc.vec <- unique(as.character(gpltable[, columnName])) # assumed that there is only one 'symbol' column
hgnc.vec <- gsub("[ ].+", "", hgnc.vec) # get rid of anything after a space
HGNChelper.output <- checkGeneSymbols(iconv(hgnc.vec, "latin1", "ASCII", ""),
map = currentHumanMap) # convert to ascii
notFixed <- which(is.na(HGNChelper.output$Suggested.Symbol))
head(HGNChelper.output[notFixed,])
## x Approved Suggested.Symbol
## 1 FALSE <NA>
## 6 ENST00000618272 FALSE <NA>
## 7 ENST00000436258 FALSE <NA>
## 10 lnc-ARMCX4-1 FALSE <NA>
## 15 LOC101927502 FALSE <NA>
## 18 lnc-SOX11-1 FALSE <NA>
i <- 5 # GPL16309
gpl <- outlier[i, "platform"]
gpldat <- getGEO(filename = file.path(dat_dir, "platforms", paste0(gpl, ".soft")))
gpltable <- Table(gpldat)
head(gpltable)
## ID NAME
## 1 1 GE_BrightCorner
## 2 2 DarkCorner
## 3 3 DarkCorner
## 4 4 A_23_P117082
## 5 5 A_33_P3246448
## 6 6 A_33_P3318220
## Genebank Accession GB_ACC
## 1
## 2
## 3
## 4 ref|NM_015987|ens|ENST00000014930|gb|AF117615|gb|BC016277 NM_015987
## 5 ref|NM_080671|ens|ENST00000281830|tc|THC2655788 NM_080671
## 6 ref|NM_178466|ens|ENST00000375454|ens|ENST00000471233|tc|THC2478474 NM_178466
## ENSEMBL_ID Refseq ID Gene Symbol
## 1 GE_BrightCorner GE_BrightCorner
## 2 DarkCorner DarkCorner
## 3 DarkCorner DarkCorner
## 4 NM_015987 HEBP1
## 5 NM_080671 KCNE4
## 6 NM_178466 BPIFA3
## Gene Description
## 1
## 2
## 3
## 4 Homo sapiens heme binding protein 1 (HEBP1), mRNA [NM_015987]
## 5 Homo sapiens potassium voltage-gated channel, Isk-related family, member 4 (KCNE4), mRNA [NM_080671]
## 6 Homo sapiens BPI fold containing family A, member 3 (BPIFA3), transcript variant 1, mRNA [NM_178466]
## Chromosome Map Location
## 1
## 2
## 3
## 4 hs|chr12:13127906-13127847
## 5 hs|chr2:223920197-223920256
## 6 hs|chr20:31812208-31812267
## SEQUENCE SPOT_ID
## 1 CONTROL
## 2 CONTROL
## 3 CONTROL
## 4 AAGGGGGAAAATGTGATTTGTGCCTGATCTTTCATCTGTGATTCTTATAAGAGCTTTGTC
## 5 GCAAGTCTCTCTGCACCTATTAAAAAGTGATGTATATACTTCCTTCTTATTCTGTTGAGT
## 6 CATTCCATAAGGAGTGGTTCTCGGCAAATATCTCACTTGAATTTGACCTTGAATTGAGAC
columnName <- outlier[i, "colname"]
columnName
## [1] "Gene Symbol"
hgnc.vec <- unique(as.character(gpltable[, columnName])) # assumed that there is only one 'symbol' column
hgnc.vec <- gsub("[ ].+", "", hgnc.vec) # get rid of anything after a space
HGNChelper.output <- checkGeneSymbols(iconv(hgnc.vec, "latin1", "ASCII", ""),
map = currentHumanMap) # convert to ascii
notFixed <- which(is.na(HGNChelper.output$Suggested.Symbol))
head(HGNChelper.output[notFixed,])
## x Approved Suggested.Symbol
## 1 GE_BrightCorner FALSE <NA>
## 2 DarkCorner FALSE <NA>
## 6 LOC100129869 FALSE <NA>
## 9 LOC100506844 FALSE <NA>
## 13 OCLM FALSE <NA>
## 20 ENST00000319813 FALSE <NA>
In GPL24307, for example, the column includes the chicken gene OvoDA3.
In GPL23453, the gene symbols are mixed with mouse genes, commercial RNAi product names (e.g. V1RF1), etc.
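The kinds of values that HGNChelper could not map in the examples above can be tallied with a few regular expressions. This is a rough sketch: the patterns and labels in `patterns` are assumptions derived only from the identifiers printed in this document, and `classify_ids` is a hypothetical helper, not part of HGNChelper.

```r
# Label -> regular expression, based on identifiers seen in the outputs above.
patterns <- c(
  "LOC identifier"        = "^LOC[0-9]+$",
  "Ensembl transcript ID" = "^ENST[0-9]+$",
  "probe name"            = "^A_[0-9]+_P[0-9]+$",
  "THC identifier"        = "^THC[0-9]+$",
  "lnc- name"             = "^lnc-",
  "control spot"          = "Corner$"
)

# Assign each identifier the first label whose pattern matches, else "other".
classify_ids <- function(ids, patterns) {
  out <- rep("other", length(ids))
  for (label in names(patterns)) {
    hit <- out == "other" & grepl(patterns[[label]], ids)
    out[hit] <- label
  }
  out
}

# Values taken from the notFixed outputs shown above.
ids <- c("LOC644669", "ENST00000618272", "A_19_P00325768", "THC2767512",
         "lnc-SOX11-1", "GE_BrightCorner", "DarkCorner", "OCLM", "")
classes <- classify_ids(ids, patterns)
table(classes)
```

Applied to `HGNChelper.output$x[notFixed]` for a platform, a tally like this gives a quick picture of why a column stays below the 90% threshold. Values that fall into "other", such as OCLM here, still need manual inspection.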