You can check the file formats here.
library(PureCN)
reference.file <- system.file("extdata", "ex2_reference.fa",
package = "PureCN", mustWork = TRUE)
bed.file <- system.file("extdata", "ex2_intervals.bed",
package = "PureCN", mustWork = TRUE)
mappability.file <- system.file("extdata", "ex2_mappability.bigWig",
package = "PureCN", mustWork = TRUE)
intervals <- import(bed.file)
mappability <- import(mappability.file)
# Interval.R
interval.file <- preprocessIntervals(intervals, reference.file,
mappability=mappability, output.file = "ex2_gc_file.txt")
## WARN [2018-11-29 15:27:36] No reptiming scores provided.
## INFO [2018-11-29 15:27:36] Calculating GC-content...
bam.file <- system.file("extdata", "ex1.bam", package="PureCN",
mustWork = TRUE)
interval.file <- system.file("extdata", "ex1_intervals.txt",
package = "PureCN", mustWork = TRUE)
# Coverage.R
calculateBamCoverageByInterval(bam.file = bam.file,
interval.file = interval.file, output.file = "ex1_coverage.txt")
normal.coverage.file <- system.file("extdata", "example_normal.txt",
package="PureCN")
normal2.coverage.file <- system.file("extdata", "example_normal2.txt",
package="PureCN")
normal.coverage.files <- c(normal.coverage.file, normal2.coverage.file)
tumor.coverage.file <- system.file("extdata", "example_tumor.txt",
package="PureCN")
seg.file <- system.file("extdata", "example_seg.txt",
package = "PureCN")
vcf.file <- system.file("extdata", "example.vcf.gz", package="PureCN")
interval.file <- system.file("extdata", "example_intervals.txt",
package="PureCN")
# Coverage.R
correctCoverageBias(normal.coverage.file, interval.file,
output.file="example_normal_loess.txt", plot.bias=FALSE)
# NormalDB.R
normalDB <- createNormalDatabase(normal.coverage.files)
## INFO [2018-11-29 15:27:39] 576 on-target bins with low coverage in all samples.
## WARN [2018-11-29 15:27:39] You are likely not using the correct baits file!
## WARN [2018-11-29 15:27:39] Allosome coverage missing, cannot determine sex.
## WARN [2018-11-29 15:27:39] Allosome coverage missing, cannot determine sex.
## INFO [2018-11-29 15:27:39] Processing on-target regions...
## INFO [2018-11-29 15:27:39] Removing 930 intervals with low coverage in normalDB.
## INFO [2018-11-29 15:27:39] Removing 1 intervals with zero coverage in more than 3% of normalDB.
saveRDS(normalDB, file="normalDB.rds")
normalDB <- readRDS("normalDB.rds")
pool <- calculateTangentNormal(tumor.coverage.file, normalDB)
# NormalDB.R and PureCN.R
normal.panel.vcf.file <- system.file("extdata", "normalpanel.vcf.gz",
package="PureCN")
bias <- calculateMappingBiasVcf(normal.panel.vcf.file, genome = "h19")
## INFO [2018-11-29 15:27:41] Processing variants 1 to 5000...
saveRDS(bias, "mapping_bias.rds")
normal.panel.vcf.file <- "mapping_bias.rds"
# NOrmalDB.R
interval.weight.file <- "interval_weights.txt"
calculateIntervalWeights(normalDB$normal.coverage.files, interval.weight.file)
## INFO [2018-11-29 15:27:41] Loading coverage data...
## INFO [2018-11-29 15:27:42] Mean target coverages: 71X (tumor) 99X (normal).
## INFO [2018-11-29 15:27:42] Mean target coverages: 71X (tumor) 43X (normal).
# PureCN.R
ret <-runAbsoluteCN(normal.coverage.file=pool, # normal.coverage.file=normal.coverage.file,
tumor.coverage.file=tumor.coverage.file, vcf.file=vcf.file,
genome="hg19", sampleid="Sample1",
interval.file=interval.file, normalDB=normalDB,
# args.setMappingBiasVcf=list(normal.panel.vcf.file=normal.panel.vcf.file),
# args.filterVcf=list(snp.blacklist=snp.blacklist,
# stats.file=mutect.stats.file),
args.segmentation=list(interval.weight.file=interval.weight.file),
post.optimize=FALSE, plot.cnv=FALSE, verbose=FALSE)
## WARN [2018-11-29 15:27:43] Allosome coverage missing, cannot determine sex.
## WARN [2018-11-29 15:27:43] Allosome coverage missing, cannot determine sex.
file.rds <- "Sample1_PureCN.rds"
saveRDS(ret, file=file.rds)
The R data file (file.rds
) contains gene-level copy number calls, SNV status and LOH calls. The purity/ploidy combinations are sorted by likelihood and stored in ret$results
.
names(ret)
## [1] "candidates" "results" "input"
To understand allelic fractions of particular SNVs, we must know the (i) somatic status, the (ii) tumor purity, the (iii) local copy number, as well as the (iv) number of chromosomes harboring the mutations or SNPs. One of PureCN main functions is to find the most likely combination of these four values. We further assign posterior probabilities to all possible combinations or states. Availability of matched normals reduces the search space by already providing somatic status.
The predictSomatic function provides access to these probabilities. For predicted somatic mutations, this function also provides cellular fraction estimates, i.e. the fraction of tumor cells with mutation. Fractions significantly below 1 indicate sub-clonality
head(predictSomatic(ret), 3)
## chr start end ID REF ALT SOMATIC.M0
## 1 chr1 114515871 114515871 chr1114515871xxx G A 5.969924e-101
## 2 chr1 150044293 150044293 chr1150044293xxx T G 7.667563e-91
## 3 chr1 158449835 158449835 chr1158449835xxx A G 4.162785e-146
## SOMATIC.M1 SOMATIC.M2 SOMATIC.M3 SOMATIC.M4 SOMATIC.M5 SOMATIC.M6
## 1 4.511273e-38 4.183033e-07 2.668510e-262 0 0 0
## 2 1.887887e-38 4.276546e-10 2.149060e-264 0 0 0
## 3 1.939661e-61 1.303470e-14 3.004433e-266 0 0 0
## SOMATIC.M7 GERMLINE.M0 GERMLINE.M1 GERMLINE.M2 GERMLINE.M3
## 1 0 1.574106e-68 2.173694e-15 0.9999996 1.868708e-260
## 2 0 2.307980e-63 2.728936e-18 1.0000000 1.053343e-258
## 3 0 2.291025e-104 2.254699e-29 1.0000000 2.861128e-258
## GERMLINE.M4 GERMLINE.M5 GERMLINE.M6 GERMLINE.M7 GERMLINE.CONTHIGH
## 1 0 0 0 0 6.174341e-43
## 2 0 0 0 0 3.072618e-21
## 3 0 0 0 0 5.875696e-27
## GERMLINE.CONTLOW GERMLINE.HOMOZYGOUS ML.SOMATIC POSTERIOR.SOMATIC ML.M
## 1 1.263476e-286 0 FALSE 4.183033e-07 2
## 2 1.395456e-241 0 FALSE 4.276546e-10 2
## 3 0.000000e+00 0 FALSE 1.303470e-14 2
## ML.C ML.M.SEGMENT M.SEGMENT.POSTERIOR M.SEGMENT.FLAGGED ML.AR AR
## 1 2 0 1 FALSE 0.825 0.755183
## 2 2 0 1 FALSE 0.825 0.817078
## 3 2 0 1 FALSE 0.825 0.834266
## AR.ADJUSTED MAPPING.BIAS ML.LOH CN.SUBCLONAL CELLFRACTION FLAGGED
## 1 0.7736009 0.976192 TRUE FALSE NA FALSE
## 2 0.8370054 0.976192 TRUE FALSE NA FALSE
## 3 0.8546126 0.976192 TRUE FALSE NA FALSE
## log.ratio depth prior.somatic prior.contamination on.target seg.id
## 1 0.2842136 184 9.9e-05 0.01 1 1
## 2 -0.1686186 138 9.9e-05 0.01 1 1
## 3 0.4596841 217 9.9e-05 0.01 1 1
## gene.symbol
## 1 HIPK1
## 2 VPS45
## 3 <NA>
gene.calls <- callAlterations(ret)
head(gene.calls)
## chr start end C seg.mean seg.id number.targets
## EIF2A chr3 150270143 150301699 6 1.3351 5 13
## MIR548H2 chr3 151531951 151538241 6 1.3351 5 3
## GPNMB chr7 23286477 23313844 7 1.4723 17 11
## SH2D4B chr10 82298088 82403838 0 -1.5118 22 8
## SLC35G1 chr10 95653791 95661248 0 -1.3002 24 3
## CYP2C9 chr10 96698440 96748786 0 -1.3002 24 9
## gene.mean gene.min gene.max focal breakpoints type
## EIF2A 1.5159387 0.7544062 2.1392167 TRUE 0 AMPLIFICATION
## MIR548H2 0.8261594 0.4682628 1.2000253 TRUE 0 AMPLIFICATION
## GPNMB 1.4490939 0.9397072 1.8200775 TRUE 0 AMPLIFICATION
## SH2D4B -1.5141982 -1.9281007 -1.2607230 FALSE 0 DELETION
## SLC35G1 -1.5988526 -1.8066816 -1.4493561 FALSE 0 DELETION
## CYP2C9 -1.1901085 -2.1922992 -0.9875222 FALSE 0 DELETION
## num.snps.segment M M.flagged loh
## EIF2A 0 NA NA NA
## MIR548H2 0 NA NA NA
## GPNMB 0 NA NA NA
## SH2D4B 2 0 TRUE TRUE
## SLC35G1 0 NA NA NA
## CYP2C9 0 NA NA NA
The gene.calls
data.frame described above provides gene-level LOH information. To find the corresponding genomic regions in LOH, we can use the callLOH
function:
loh <- callLOH(ret)
head(loh)
## chr start end arm C M type
## 1 chr1 114515871 121535434 p 2 0 WHOLE ARM COPY-NEUTRAL LOH
## 2 chr1 124535434 248085104 q 2 0 WHOLE ARM COPY-NEUTRAL LOH
## 3 chr2 10262881 92326171 p 1 0 WHOLE ARM LOH
## 4 chr2 95326171 231775678 q 1 0 WHOLE ARM LOH
## 5 chr2 236403331 239039169 q 2 0 COPY-NEUTRAL LOH
## 6 chr3 11888043 90504854 p 2 1
This will generate a CSV file in which the correct purity and ploidy values can be manually entered.
createCurationFile(file.rds)
read.csv("Sample1_PureCN.csv")
## Sampleid Purity Ploidy Sex Contamination Flagged Failed Curated
## 1 Sample1 0.65 1.734572 ? 0 TRUE FALSE FALSE
## Comment
## 1 EXCESSIVE LOSSES;EXCESSIVE LOH
The predictSomatic
function can be used to efficiently remove private germline mutations. This in turn allows the calculation of mutation burden for un-matched tumor samples. A wrapper function for this specific task is included as callMutationBurden
:
callableBed <- import(system.file("extdata", "example_callable.bed.gz",
package = "PureCN"))
callMutationBurden(ret, callable=callableBed)
## somatic.ontarget somatic.all private.germline.ontarget
## 1 0 0 0
## private.germline.all callable.bases.ontarget callable.bases.flanking
## 1 0 1511056 2223991
## callable.bases.all somatic.rate.ontarget somatic.rate.ontarget.95.lower
## 1 3063762 0 0
## somatic.rate.ontarget.95.upper private.germline.rate.ontarget
## 1 2.441256 0
## private.germline.rate.ontarget.95.lower
## 1 0
## private.germline.rate.ontarget.95.upper
## 1 2.441256
This plot shows the purity and ploidy local optima, sorted by final likelihood score after fitting both copy number and allelic fractions.
plotAbs(ret, type = "overview")
The colors visualize the copy number fitting score from low (blue) to high (red). The numbers indicate the ranks of the local optima. Yellow fonts indicate that the corresponding solutions were flagged, which does not necessarily mean the solutions are wrong. The correct solution (number 1) of this toy example was flagged due to large amount of LOH.
plotAbs(ret, 1, type = "hist")
## NULL
This figure displays a histogram of tumor vs. normal copy number log-ratios for the maximum likelihood solution (number 1 in Overview). The height of a bar in this plot is proportional to the fraction of the genome falling into the particular log-ratio copy number range. The vertical dotted lines and numbers visualize the, for the given purity/ploidy combination, expected log-ratios for all integer copy numbers from 0 to 7.
This figure shows the allelic fractions of predicted germline SNPs. The goodness of fit (GoF) is provided on an arbitrary scale in which 100% corresponds to a perfect fit and 0% to the worst possible fit. The latter is defined as a fit in which allelic fractions on average differ by 0.2 from their expected fractions. Note that this does not take purity into account and low purity samples are expected to have a better fit. In the middle panel, the corresponding copy number log-ratios are shown. The lower panel displays the calculated integer copy numbers, corrected for purity and ploidy. We can zoom into particular chromosomes using chr
argument.
plotAbs(ret, 1, type="BAF")
Each dot is a (predicted) germline SNP. The first panel shows the allelic fractions as provided in the VCF file. The alternating grey and white background colors visualize odd and even chromosome numbers, re- spectively. The black lines visualize the expected (not the average!) allelic fractions in the segment. These are calculated using the estimated purity and the total and minor segment copy numbers. These are vi- sualized in black and grey, respectively, in the second and third panel. The second panel shows the copy number log-ratios, the third panel the integer copy numbers.