| Chromosome | Transcripts_total | Transcripts_protein_coding |
|---|---|---|
| chr1 | 22690 | 15844 |
| chr2 | 18925 | 11976 |
| chr3 | 16017 | 10815 |
| chr4 | 10390 | 6562 |
| chr5 | 12038 | 7706 |
| chr6 | 12004 | 7446 |
| chr7 | 12264 | 7933 |
| chr8 | 10299 | 6323 |
| chr9 | 8801 | 5869 |
| chr10 | 9170 | 5999 |
| chr11 | 14941 | 11282 |
| chr12 | 13772 | 10011 |
| chr13 | 4726 | 2400 |
| chr14 | 9237 | 6213 |
| chr15 | 9134 | 5818 |
| chr16 | 11797 | 8590 |
| chr17 | 14739 | 11261 |
| chr18 | 4789 | 2797 |
| chr19 | 14453 | 11452 |
| chr20 | 6073 | 3804 |
| chr21 | 3302 | 1638 |
| chr22 | 5341 | 3670 |
| chrX | 8107 | 5675 |
| chrY | 1014 | 313 |
Introduction
This report gives an overview of the data contents within the reference data bundle that comes with the Personal Cancer Genome Reporter (PCGR), an interpretation tool for genomic aberrations, aiming to provide clinical decision support for precision cancer medicine.
This overview of the reference data helps PCGR users understand what the tool is able to report on, and which knowledge resources it uses (and does not use) for interpretation.
Currently, the PCGR reference data bundle contains integrated datasets that inform on the following properties with respect to molecular cancer medicine:
- Basic human gene/transcript annotations - identifiers, official symbols, gene names etc.
- Human cancer gene annotations - known tumor suppressor and proto-oncogenes
- Cancer phenotypes - main sites/tissues of human cancers and associated subtypes
- Targeted anti-cancer agents - small molecule inhibitors/antibodies, their molecular targets, and tumor types they are indicated for (approved or early/late clinical development)
- Known somatic DNA mutations - found previously in tumor samples, relative frequencies across tumor types
- Mutational hotspots - sites of significantly frequent somatic mutations in tumor samples
- Known germline DNA variants - allelic frequencies across populations
- In silico variant effect predictions - assessment of damaging/tolerated effects of single nucleotide variants by multiple algorithms
- Biomarkers - expression markers, fusions/translocations, and DNA aberrations that are associated with prognosis, diagnosis, or sensitivity/resistance to particular treatments
- Protein domains
- Mutational signatures - cancer type prevalence, associated aetiologies
- Tumor gene expression patterns - cell lines and primary tumor samples, covering both early-onset (pediatric) and adult tumors
Note that this report is provided for a specific release of the data bundle, as outlined in the title banner.
Files, filesizes and MD5 checksums
- The contents of the assembly-specific data bundle are organized into seven main file folders. In the table below, one can explore the file types, file sizes, and MD5 checksums of each file within the various folders.
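The MD5 checksums listed for the bundle files can be used to confirm that a downloaded file is intact. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the file path and expected checksum must be taken from the table for the release in question.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_md5):
    """Compare a file's MD5 digest with the expected value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again before running PCGR.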
Gene and transcript data
Data resources (with versions and licenses):
- GENCODE - Human gene transcripts - release 46 (Free/open access)
- UniProtKB - UniProt identifiers and accessions with cross-references to Ensembl - release 2024_03 (CC BY 4.0)
- Ensembl Biomart - API for retrieval of gene and transcript cross-references (MANE, RefSeq) - release 112 (EMBL-EBI terms of use)
- APPRIS - Principal transcript isoform annotation - release 2024-06-08 (Free/open access)
Numbers - genome level
Genes
Transcripts
Protein-coding genes
Protein-coding transcripts
Numbers - chromosome level
Proto-oncogenes and tumor suppressor genes
Data resources (with versions and licenses):
- Cancer Gene Census - Collection of cancer-relevant genes (somatic/germline), tumor suppressor/oncogene annotations - v100 (Free for non-commercial, academic use. Commercial use requires licensing)
- Network of Cancer Genes - Tumor suppressors/oncogenes - v7.1 (Free/open access)
- CancerMine - Predicted tumor suppressors/oncogenes/cancer drivers from text mining of literature - v50 (March 2023) (CC0 1.0)
Brief synopsis
- Genes are annotated as proto-oncogenes or tumor suppressor genes if they are i) found in either of two curated resources: Cancer Gene Census (CGC Tier 1/2) or Network of Cancer Genes, or ii) predicted with the corresponding annotation in the CancerMine text mining resource. For oncogene/tumor suppressor candidates predicted exclusively by CancerMine, we require these to have support from at least 20 distinct publications in the literature.
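The annotation rule described above can be sketched as a small predicate. This is an illustration under stated assumptions, not the code used to build the bundle: the function and argument names are hypothetical, and the publication count is assumed to be the number of distinct publications supporting the CancerMine prediction for the role in question.

```python
# Threshold stated in the synopsis for CancerMine-only candidates
CANCERMINE_MIN_PUBLICATIONS = 20


def is_annotated(in_cgc, in_ncg, cancermine_publications):
    """Decide whether a gene receives a given role (oncogene or tumor suppressor).

    in_cgc / in_ncg: True if a curated resource (Cancer Gene Census Tier 1/2,
    Network of Cancer Genes) lists the gene with this role.
    cancermine_publications: number of distinct publications supporting the
    same role in CancerMine (0 if not predicted).
    """
    if in_cgc or in_ncg:
        # Curated support is sufficient on its own
        return True
    # CancerMine-only candidates need enough literature support
    return cancermine_publications >= CANCERMINE_MIN_PUBLICATIONS
```

Under this sketch, a gene absent from both curated resources with 19 supporting publications is not annotated, while one with 20 is.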
Tumor suppressor genes
360
Proto-oncogenes
372
Cancer predisposition genes
Data resources (with versions and licenses):
- Cancer Gene Census - Collection of cancer-relevant genes (somatic/germline), tumor suppressor/oncogene annotations - v100 (Free for non-commercial, academic use. Commercial use requires licensing)
- ACMG - secondary findings - Genes recommended for reporting of incidental findings in clinical exome sequencing - v3.2 (Free/open access)
- Cancer predisposition genes - curated/other - Candidate cancer predisposition genes - contributed e.g. by CPSR users - release 20221128 (Free/open access)
- Huang et al., Cell, 2018 - Collection of cancer predisposition genes screened in TCGA’s pancancer study - release . (Free/open access)
- Genomics England PanelApp - Collection of > 40 dedicated gene panels for various inherited cancer conditions and syndromes - v1 (API) (Commercial use requires separate agreement)
Brief synopsis
- Cancer predisposition genes that can be used for variant analysis and classification in CPSR are listed here, specifically virtual panel zero, the complete collection of predisposition genes. These have been collected from the Cancer Gene Census, genes in panels that target hereditary cancer conditions (Genomics England PanelApp), TCGA’s pancancer germline study, and curated (user-contributed) genes.
Variant data
Data resources (with versions and licenses):
- TCGA - The Cancer Genome Atlas - release39_20231204 (Free/open access)
- dbNSFP - A comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs - v4.5 (Free for non-commercial, academic use)
- ClinVar - Public archive of reports of the relationships among human DNA variations and phenotypes - release 2024-06 (NCBI data usage policies)
- dbMTS - A comprehensive database of putative human microRNA target site (MTS) SNVs and their functional predictions - v1.0 (Free/open access)
- GWAS Catalog - The NHGRI-EBI Catalog of human genome-wide association studies - v20240520 (EMBL-EBI terms of use)
- gnomAD - Genome Aggregation Database - non-cancer subset - v2.1.1 (CC0 1.0)
Brief synopsis
- Variant datasets are used for interrogation of somatic variant frequency across tissues (TCGA), germline variant population frequencies (gnomAD) and clinical significance (ClinVar), or in silico assessment of variant functional effect (dbNSFP/dbMTS).
Variant numbers - total
ClinVar
dbNSFP
dbMTS
TCGA
GWAS Catalog
gnomAD - non-cancer subset (cancer genes)
Variant numbers - chromosome level
| Chromosome | nAll | nSNV | nDeletion | nInsertion | nMNV |
|---|---|---|---|---|---|
| chr1 | 228,743 | 219,383 | 6,844 | 2,502 | 14 |
| chr2 | 164,612 | 157,849 | 4,932 | 1,823 | 8 |
| chr3 | 126,413 | 121,005 | 4,032 | 1,375 | 1 |
| chr4 | 89,137 | 85,331 | 2,755 | 1,047 | 4 |
| chr5 | 112,300 | 107,641 | 3,394 | 1,264 | 1 |
| chr6 | 113,358 | 108,250 | 3,765 | 1,339 | 4 |
| chr7 | 110,456 | 106,056 | 3,273 | 1,123 | 4 |
| chr8 | 81,997 | 78,640 | 2,447 | 907 | 3 |
| chr9 | 79,539 | 76,139 | 2,545 | 854 | 1 |
| chr10 | 84,662 | 80,787 | 2,834 | 1,037 | 4 |
| chr11 | 133,211 | 127,748 | 4,094 | 1,366 | 3 |
| chr12 | 116,910 | 111,751 | 3,790 | 1,368 | 1 |
| chr13 | 40,496 | 38,620 | 1,366 | 509 | 1 |
| chr14 | 69,522 | 66,406 | 2,283 | 825 | 8 |
| chr15 | 67,742 | 64,938 | 2,083 | 717 | 4 |
| chr16 | 79,146 | 75,577 | 2,734 | 833 | 2 |
| chr17 | 111,923 | 106,384 | 4,166 | 1,369 | 4 |
| chr18 | 35,105 | 33,616 | 1,111 | 377 | 1 |
| chr19 | 143,465 | 137,999 | 4,184 | 1,279 | 3 |
| chr20 | 55,221 | 53,092 | 1,575 | 553 | 1 |
| chr21 | 21,575 | 20,645 | 693 | 236 | 1 |
| chr22 | 39,900 | 38,218 | 1,252 | 427 | 3 |
| chrX | 100,706 | 97,111 | 2,574 | 1,020 | 1 |
| chrY | 693 | 663 | 24 | 6 | 0 |
| Chromosome | nAll | nSNV | nDeletion | nInsertion | nMNV |
|---|---|---|---|---|---|
| chr1 | 255,181 | 237,808 | 11,068 | 5,558 | 747 |
| chr2 | 270,552 | 246,763 | 15,128 | 7,806 | 855 |
| chr3 | 155,659 | 143,793 | 7,470 | 3,945 | 451 |
| chr4 | 100,119 | 93,069 | 4,486 | 2,262 | 302 |
| chr5 | 144,877 | 132,557 | 7,796 | 4,035 | 489 |
| chr6 | 129,301 | 118,605 | 6,788 | 3,515 | 393 |
| chr7 | 139,990 | 129,402 | 6,657 | 3,479 | 452 |
| chr8 | 99,186 | 91,854 | 4,677 | 2,320 | 335 |
| chr9 | 128,555 | 118,561 | 6,088 | 3,512 | 394 |
| chr10 | 107,160 | 98,709 | 5,199 | 2,895 | 357 |
| chr11 | 171,694 | 157,817 | 8,917 | 4,412 | 548 |
| chr12 | 128,909 | 118,870 | 6,409 | 3,234 | 396 |
| chr13 | 60,526 | 52,284 | 5,424 | 2,593 | 225 |
| chr14 | 92,124 | 85,578 | 4,180 | 2,125 | 241 |
| chr15 | 102,834 | 94,225 | 5,441 | 2,827 | 341 |
| chr16 | 152,353 | 140,578 | 7,530 | 3,693 | 552 |
| chr17 | 179,217 | 160,391 | 12,131 | 5,979 | 716 |
| chr18 | 50,565 | 46,690 | 2,347 | 1,390 | 138 |
| chr19 | 155,892 | 146,233 | 5,874 | 3,271 | 514 |
| chr20 | 59,137 | 55,302 | 2,360 | 1,302 | 173 |
| chr21 | 33,962 | 31,216 | 1,690 | 949 | 107 |
| chr22 | 62,901 | 57,990 | 3,107 | 1,580 | 224 |
| chrX | 108,016 | 96,259 | 7,783 | 3,708 | 266 |
| chrY | 95 | 83 | 8 | 4 | 0 |
| Chromosome | nAll | nSNV | nDeletion | nInsertion | nMNV |
|---|---|---|---|---|---|
| chr1 | 9,645,829 | 9,645,829 | 0 | 0 | 0 |
| chr2 | 7,083,654 | 7,083,654 | 0 | 0 | 0 |
| chr3 | 5,540,153 | 5,540,153 | 0 | 0 | 0 |
| chr4 | 3,815,852 | 3,815,852 | 0 | 0 | 0 |
| chr5 | 4,425,867 | 4,425,867 | 0 | 0 | 0 |
| chr6 | 4,836,504 | 4,836,504 | 0 | 0 | 0 |
| chr7 | 4,489,261 | 4,489,261 | 0 | 0 | 0 |
| chr8 | 3,244,915 | 3,244,915 | 0 | 0 | 0 |
| chr9 | 3,838,098 | 3,838,098 | 0 | 0 | 0 |
| chr10 | 3,726,476 | 3,726,476 | 0 | 0 | 0 |
| chr11 | 5,586,754 | 5,586,754 | 0 | 0 | 0 |
| chr12 | 5,052,139 | 5,052,139 | 0 | 0 | 0 |
| chr13 | 1,740,027 | 1,740,027 | 0 | 0 | 0 |
| chr14 | 3,011,585 | 3,011,585 | 0 | 0 | 0 |
| chr15 | 3,351,238 | 3,351,238 | 0 | 0 | 0 |
| chr16 | 4,040,167 | 4,040,167 | 0 | 0 | 0 |
| chr17 | 5,501,629 | 5,501,629 | 0 | 0 | 0 |
| chr18 | 1,506,525 | 1,506,525 | 0 | 0 | 0 |
| chr19 | 6,221,200 | 6,221,200 | 0 | 0 | 0 |
| chr20 | 2,252,415 | 2,252,415 | 0 | 0 | 0 |
| chr21 | 957,673 | 957,673 | 0 | 0 | 0 |
| chr22 | 1,992,402 | 1,992,402 | 0 | 0 | 0 |
| chrX | 3,548,544 | 3,548,544 | 0 | 0 | 0 |
| chrY | 190,146 | 190,146 | 0 | 0 | 0 |
| Chromosome | nAll | nSNV | nDeletion | nInsertion | nMNV |
|---|---|---|---|---|---|
| chr1 | 720,033 | 720,033 | 0 | 0 | 0 |
| chr2 | 519,030 | 519,030 | 0 | 0 | 0 |
| chr3 | 430,103 | 430,103 | 0 | 0 | 0 |
| chr4 | 344,379 | 344,379 | 0 | 0 | 0 |
| chr5 | 393,020 | 393,020 | 0 | 0 | 0 |
| chr6 | 396,215 | 396,215 | 0 | 0 | 0 |
| chr7 | 340,217 | 340,217 | 0 | 0 | 0 |
| chr8 | 269,169 | 269,169 | 0 | 0 | 0 |
| chr9 | 266,564 | 266,564 | 0 | 0 | 0 |
| chr10 | 298,408 | 298,408 | 0 | 0 | 0 |
| chr11 | 325,250 | 325,250 | 0 | 0 | 0 |
| chr12 | 445,379 | 445,379 | 0 | 0 | 0 |
| chr13 | 169,486 | 169,486 | 0 | 0 | 0 |
| chr14 | 266,405 | 266,405 | 0 | 0 | 0 |
| chr15 | 233,405 | 233,405 | 0 | 0 | 0 |
| chr16 | 158,793 | 158,793 | 0 | 0 | 0 |
| chr17 | 208,062 | 208,062 | 0 | 0 | 0 |
| chr18 | 105,918 | 105,918 | 0 | 0 | 0 |
| chr19 | 222,410 | 222,410 | 0 | 0 | 0 |
| chr20 | 105,773 | 105,773 | 0 | 0 | 0 |
| chr21 | 48,398 | 48,398 | 0 | 0 | 0 |
| chr22 | 80,370 | 80,370 | 0 | 0 | 0 |
| chrX | 44,492 | 44,492 | 0 | 0 | 0 |
| chrY | 0 | 0 | 0 | 0 | 0 |
| Chromosome | nAll | nSNV | nDeletion | nInsertion | nMNV |
|---|---|---|---|---|---|
| chr1 | 384 | 384 | 0 | 0 | 0 |
| chr2 | 516 | 516 | 0 | 0 | 0 |
| chr3 | 361 | 361 | 0 | 0 | 0 |
| chr4 | 184 | 184 | 0 | 0 | 0 |
| chr5 | 361 | 361 | 0 | 0 | 0 |
| chr6 | 657 | 657 | 0 | 0 | 0 |
| chr7 | 208 | 208 | 0 | 0 | 0 |
| chr8 | 402 | 402 | 0 | 0 | 0 |
| chr9 | 296 | 296 | 0 | 0 | 0 |
| chr10 | 324 | 324 | 0 | 0 | 0 |
| chr11 | 332 | 332 | 0 | 0 | 0 |
| chr12 | 289 | 289 | 0 | 0 | 0 |
| chr13 | 126 | 126 | 0 | 0 | 0 |
| chr14 | 148 | 148 | 0 | 0 | 0 |
| chr15 | 193 | 193 | 0 | 0 | 0 |
| chr16 | 240 | 240 | 0 | 0 | 0 |
| chr17 | 211 | 211 | 0 | 0 | 0 |
| chr18 | 111 | 111 | 0 | 0 | 0 |
| chr19 | 187 | 187 | 0 | 0 | 0 |
| chr20 | 229 | 229 | 0 | 0 | 0 |
| chr21 | 68 | 68 | 0 | 0 | 0 |
| chr22 | 140 | 140 | 0 | 0 | 0 |
| chrX | 47 | 47 | 0 | 0 | 0 |
| chrY | 0 | 0 | 0 | 0 | 0 |
| Chromosome | nAll | nSNV | nDeletion | nInsertion | nMNV |
|---|---|---|---|---|---|
| chr1 | 89,336 | 84,150 | 3,507 | 1,679 | 0 |
| chr2 | 99,140 | 93,534 | 3,728 | 1,878 | 0 |
| chr3 | 60,021 | 56,308 | 2,525 | 1,188 | 0 |
| chr4 | 35,279 | 33,132 | 1,468 | 679 | 0 |
| chr5 | 43,882 | 41,049 | 1,972 | 861 | 0 |
| chr6 | 53,169 | 50,133 | 2,050 | 986 | 0 |
| chr7 | 45,296 | 42,721 | 1,731 | 844 | 0 |
| chr8 | 38,065 | 35,797 | 1,545 | 723 | 0 |
| chr9 | 43,388 | 40,897 | 1,600 | 891 | 0 |
| chr10 | 40,315 | 37,890 | 1,640 | 785 | 0 |
| chr11 | 72,689 | 68,637 | 2,715 | 1,337 | 0 |
| chr12 | 43,338 | 40,566 | 1,847 | 925 | 0 |
| chr13 | 21,387 | 19,860 | 1,047 | 480 | 0 |
| chr14 | 27,961 | 26,150 | 1,235 | 576 | 0 |
| chr15 | 35,133 | 32,863 | 1,514 | 756 | 0 |
| chr16 | 56,043 | 53,002 | 2,016 | 1,025 | 0 |
| chr17 | 67,694 | 63,340 | 3,000 | 1,354 | 0 |
| chr18 | 11,529 | 10,882 | 455 | 192 | 0 |
| chr19 | 70,563 | 65,921 | 3,004 | 1,638 | 0 |
| chr20 | 15,335 | 14,308 | 703 | 324 | 0 |
| chr21 | 3,918 | 3,663 | 161 | 94 | 0 |
| chr22 | 20,991 | 19,649 | 881 | 461 | 0 |
| chrX | 24,222 | 22,903 | 892 | 427 | 0 |
| chrY | 43 | 43 | 0 | 0 | 0 |
Drug data
Data resources (with versions and licenses):
- NCI Thesaurus - Vocabulary for clinical care, translational and basic cancer research - release 24.05d (CC BY 4.0)
- Open Targets Platform - Tool that supports systematic identification and prioritisation of potential therapeutic drug targets - release 2024.06 (CC0 1.0)
- PubChem - World’s largest collection of freely accessible chemical information - v2023 (Free/open access)
- DGIdb - Database of gene-drug interactions - v2022_02 (Free/open access)
Brief synopsis
A repository of targeted cancer drugs organized according to primary tumor sites has been established through the pharmOncoX R package, which combines drug data from Open Targets Platform, NCI Thesaurus and DGIdb with cancer type classifications from phenOncoX.
Compounds (and associated targets) are listed below for the various tumor types, where compounds highlighted in green are those approved or in late clinical development.
Although we try to make drug listings as accurate as possible, drug misclassifications are likely to occur, and entries may be missing.
While the focus here is on molecularly targeted cancer compounds, note that the data bundle is also shipped with a more complete set of drugs not shown here (chemotherapy drugs, drugs primarily indicated for other diseases etc.).
Targeted compounds
934
Drug targets
484
Targeted agents per tumor type
Biomarker data
Data resources (with versions and licenses):
- CIViC - An open-source platform supporting crowdsourced and expert-moderated cancer variant curation. - release 20240621 (CC0 1.0)
- Cancer Genome Interpreter - The Cancer Biomarkers database - v20221017 (CC0 1.0)
- Mitelman Database - Database of Chromosome Aberrations and Gene Fusions in Cancer - v20240415 (CC BY 4.0)
Brief synopsis
The biomarker data in PCGR is structured largely according to the CIViC knowledge model, in which
- A particular genomic aberration (e.g. BRAF V600E) is associated with one or more clinical evidence items, which typically denote a relationship between the variant and a therapeutic response (or prognosis/diagnosis) in a defined disease/cancer type context. The type of relationship for a given evidence item is known as the evidence type, and the strength of the evidence is expressed through distinct evidence levels.
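The relationship between an aberration and its evidence items can be pictured as a simple data structure. The sketch below is only an illustration of the knowledge model described above: the class and field names are hypothetical and do not reflect the actual file layout of the bundle, and the example values are illustrative rather than copied from it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvidenceItem:
    """One clinical evidence item attached to a genomic aberration (hypothetical fields)."""
    evidence_type: str    # e.g. "Predictive", "Prognostic", "Diagnostic"
    evidence_level: str   # strength of the evidence, e.g. "A" to "E" in CIViC
    cancer_type: str      # disease/cancer type context
    therapies: tuple = () # relevant only for therapeutic-response evidence


@dataclass
class Biomarker:
    """A genomic aberration with one or more evidence items."""
    gene: str
    variant: str
    evidence_items: list = field(default_factory=list)

    def evidence_by_type(self, evidence_type):
        """Return the evidence items of a given evidence type."""
        return [e for e in self.evidence_items if e.evidence_type == evidence_type]


# Illustrative values only
braf = Biomarker("BRAF", "V600E")
braf.evidence_items.append(EvidenceItem("Predictive", "A", "Melanoma", ("Vemurafenib",)))
braf.evidence_items.append(EvidenceItem("Prognostic", "B", "Colorectal Cancer"))
```

One aberration can thus carry several evidence items that differ in type, level, and disease context, which is why the counts of variants and evidence items are reported separately.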
Given the disparate formatting notations of the CIViC and CGI resources, the effort to merge their contents into an integrated biomarker source is not yet complete. Hence, the aggregated numbers presented in the value boxes below may not be fully accurate, due to overlapping entries between the two resources.
Why are the numbers presented here different than the ones that can be seen on e.g. civicdb.org? Numbers presented here reflect accepted evidence, which has been subject to multiple post-processing and quality-control checks.
Biomarker genes - somatic
398
Biomarker variants - somatic
1499
Evidence items - somatic
3659
Biomarker genes - germline
64
Biomarker variants - germline
353
Evidence items - germline
830
Statistics - somatic biomarkers
Evidence items
Phenotype/disease data
Data resources (with versions and licenses):
- Experimental Factor Ontology - Systematic description of experimental variables available in EBI databases - v3.67.0 (EMBL-EBI terms of use)
- Disease Ontology - Consistent, reusable and sustainable descriptions of human disease terms - v2024-05-29 (CC0 1.0)
- MedGen - Organizes information related to human medical genetics - release 2024-06-18 (NCBI data usage policies)
- OncoTree - A cancer classification system for precision oncology - release 2021_11_02 (CC BY 4.0)
Brief synopsis:
- In order to cross-reference different knowledge resources that utilize different nomenclature for disease/cancer types, we have built an integrated resource (phenOncoX) that organizes cross-referenced phenotype terms across the major tumor types for different ontologies (OncoTree, Disease Ontology, EFO, ICD-10, MeSH).
- In PCGR, tumor types are organized according to 32 primary sites/tissues, and these are populated with specific and cross-referenced phenotype terms that typically denote distinct subtypes of a major cancer type.
Other
Mutational hotspots
Data resources (with versions and licenses):
- cancerhotspots.org - A resource for statistically significant mutations in cancer - v2 (ODbL v1.0)
Hotspot genes
240
Amino acid hotspot variants
3311
Splice site hotspot variants
118
Mutational signatures
Data resources (with versions and licenses):
- COSMIC - COSMIC mutational signatures collection - v3.4 (Free for non-commercial, academic use. Commercial use requires licensing)
Brief synopsis
- For each mutational signature in COSMIC (v3.4, SBS only), we have collected data on which tumor types the signatures have been observed in (signature attribution). This information is utilized to limit search space when signature re-fitting is performed for individual samples in PCGR.
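The idea of limiting the search space can be sketched as a lookup that returns only the signatures previously observed in a given tumor type. This is a simplified illustration, not PCGR's implementation: the function name and the attribution mapping are hypothetical, the toy entries below are not taken from the bundle, and the re-fitting step itself is not shown.

```python
def candidate_signatures(attributions, tumor_type):
    """Return the signatures observed in the given tumor type, sorted by name.

    attributions: dict mapping a signature ID to the set of tumor types
    in which that signature has been observed (hypothetical structure).
    """
    return sorted(
        signature
        for signature, tumor_types in attributions.items()
        if tumor_type in tumor_types
    )


# Toy attribution data for illustration only
toy_attributions = {
    "SBS1": {"Lung", "Skin", "Breast"},
    "SBS4": {"Lung"},
    "SBS7a": {"Skin"},
}

lung_candidates = candidate_signatures(toy_attributions, "Lung")
```

Only the returned subset would then be passed to the re-fitting procedure, which reduces the risk of assigning mutations to signatures that are implausible for the tumor type.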
Protein domains
- PFAM/InterPro - Functional analysis of protein sequences and prediction of families and protein domains - v (CC0 1.0)
Expression data
Data resources (with versions and licenses):
- DepMap - The Cancer Dependency Map - release 23Q4 (Free/open access)
- TCGA - The Cancer Genome Atlas - release39_20231204 (Free/open access)
- TreeHouse - The Treehouse Childhood Cancer Data Initiative - v11_2020 (Free/open access)
Brief synopsis:
- Reference data on gene expression across tumor types is currently collected from DepMap (cancer cell lines), TCGA (adult, primary tumor samples), and the TreeHouse Childhood Cancer Data Initiative (primarily pediatric tumor samples).
- Data from the three reference sources (TCGA, DepMap, TreeHouse) have been harmonized with respect to expression measures (TPM), gene annotation, and sample metadata. An overview of all samples is listed below.
Note that many of the samples publicly listed for the TreeHouse dataset have been excluded here, due to either i) incomplete sample metadata, or ii) sample metadata indicating adult-onset (> 30 years of age at diagnosis) rather than early-onset cancer. Only one sample per case has been included.
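The exclusion criteria above can be sketched as a simple filter. This is a minimal illustration under stated assumptions: the metadata field names and the set of required fields are hypothetical, and since the text does not say which sample is kept when a case has several, the sketch keeps the first one encountered.

```python
# Hypothetical metadata fields; a sample missing any of them counts as incomplete
REQUIRED_FIELDS = ("sample_id", "case_id", "age_at_diagnosis", "primary_site")
MAX_AGE_AT_DIAGNOSIS = 30  # samples above this age are treated as adult-onset


def select_samples(samples):
    """Filter a list of sample metadata dicts according to the stated criteria."""
    selected = []
    seen_cases = set()
    for sample in samples:
        # i) drop samples with incomplete metadata
        if any(sample.get(name) in (None, "") for name in REQUIRED_FIELDS):
            continue
        # ii) drop adult-onset cases
        if sample["age_at_diagnosis"] > MAX_AGE_AT_DIAGNOSIS:
            continue
        # keep only one sample per case (assumption: the first one seen)
        if sample["case_id"] in seen_cases:
            continue
        seen_cases.add(sample["case_id"])
        selected.append(sample)
    return selected
```

With this filter, a case diagnosed at exactly 30 years is retained, matching the "> 30" cutoff given in the note.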