1 Example inputs/intermediates

You can check the file formats here.

library(PureCN)

1.1 Target information

reference.file <- system.file("extdata", "ex2_reference.fa",
    package = "PureCN", mustWork = TRUE)
bed.file <- system.file("extdata", "ex2_intervals.bed",
        package = "PureCN", mustWork = TRUE)
mappability.file <- system.file("extdata", "ex2_mappability.bigWig",
        package = "PureCN", mustWork = TRUE)
intervals <- import(bed.file)
mappability <- import(mappability.file)

# Interval.R
interval.file <- preprocessIntervals(intervals, reference.file,
    mappability=mappability, output.file = "ex2_gc_file.txt")

## WARN [2018-11-29 15:27:36] No reptiming scores provided.
## INFO [2018-11-29 15:27:36] Calculating GC-content...

1.2 Coverage data

bam.file <- system.file("extdata", "ex1.bam", package="PureCN",
    mustWork = TRUE)
interval.file <- system.file("extdata", "ex1_intervals.txt",
    package = "PureCN", mustWork = TRUE)

# Coverage.R
calculateBamCoverageByInterval(bam.file = bam.file,
 interval.file = interval.file, output.file = "ex1_coverage.txt")

1.3 Library-specific coverage bias

normal.coverage.file <- system.file("extdata", "example_normal.txt",
    package="PureCN")
normal2.coverage.file <- system.file("extdata", "example_normal2.txt",
    package="PureCN")
normal.coverage.files <- c(normal.coverage.file, normal2.coverage.file)
tumor.coverage.file <- system.file("extdata", "example_tumor.txt",
    package="PureCN")
seg.file <- system.file("extdata", "example_seg.txt",
    package = "PureCN")
vcf.file <- system.file("extdata", "example.vcf.gz", package="PureCN")
interval.file <- system.file("extdata", "example_intervals.txt",
    package="PureCN")

# Coverage.R
correctCoverageBias(normal.coverage.file, interval.file,
    output.file="example_normal_loess.txt", plot.bias=FALSE)

1.4 PON

# NormalDB.R
normalDB <- createNormalDatabase(normal.coverage.files)

## INFO [2018-11-29 15:27:39] 576 on-target bins with low coverage in all samples.
## WARN [2018-11-29 15:27:39] You are likely not using the correct baits file!
## WARN [2018-11-29 15:27:39] Allosome coverage missing, cannot determine sex.
## WARN [2018-11-29 15:27:39] Allosome coverage missing, cannot determine sex.
## INFO [2018-11-29 15:27:39] Processing on-target regions...
## INFO [2018-11-29 15:27:39] Removing 930 intervals with low coverage in normalDB.
## INFO [2018-11-29 15:27:39] Removing 1 intervals with zero coverage in more than 3% of normalDB.

saveRDS(normalDB, file="normalDB.rds")

normalDB <- readRDS("normalDB.rds")
pool <- calculateTangentNormal(tumor.coverage.file, normalDB)

1.5 VCF

# NormalDB.R and PureCN.R
normal.panel.vcf.file <- system.file("extdata", "normalpanel.vcf.gz",
                                     package="PureCN")
bias <- calculateMappingBiasVcf(normal.panel.vcf.file, genome = "h19")

## INFO [2018-11-29 15:27:41] Processing variants 1 to 5000...

saveRDS(bias, "mapping_bias.rds")
normal.panel.vcf.file <- "mapping_bias.rds"

1.6 Coverage data

# NOrmalDB.R
interval.weight.file <- "interval_weights.txt"
calculateIntervalWeights(normalDB$normal.coverage.files, interval.weight.file)

## INFO [2018-11-29 15:27:41] Loading coverage data...
## INFO [2018-11-29 15:27:42] Mean target coverages: 71X (tumor) 99X (normal).
## INFO [2018-11-29 15:27:42] Mean target coverages: 71X (tumor) 43X (normal).

1.7 Recommended run

# PureCN.R
ret <-runAbsoluteCN(normal.coverage.file=pool, # normal.coverage.file=normal.coverage.file,
    tumor.coverage.file=tumor.coverage.file, vcf.file=vcf.file,
    genome="hg19", sampleid="Sample1",
    interval.file=interval.file, normalDB=normalDB,
# args.setMappingBiasVcf=list(normal.panel.vcf.file=normal.panel.vcf.file),
# args.filterVcf=list(snp.blacklist=snp.blacklist,
# stats.file=mutect.stats.file),
    args.segmentation=list(interval.weight.file=interval.weight.file),
    post.optimize=FALSE, plot.cnv=FALSE, verbose=FALSE)

## WARN [2018-11-29 15:27:43] Allosome coverage missing, cannot determine sex.
## WARN [2018-11-29 15:27:43] Allosome coverage missing, cannot determine sex.

file.rds <- "Sample1_PureCN.rds"
saveRDS(ret, file=file.rds)

2 Output data structures

The R data file (file.rds) contains gene-level copy number calls, SNV status and LOH calls. The purity/ploidy combinations are sorted by likelihood and stored in ret$results.

names(ret)

## [1] "candidates" "results"    "input"

2.1 Prediction of somatic status and cellular fraction

To understand allelic fractions of particular SNVs, we must know the (i) somatic status, the (ii) tumor purity, the (iii) local copy number, as well as the (iv) number of chromosomes harboring the mutations or SNPs. One of PureCN main functions is to find the most likely combination of these four values. We further assign posterior probabilities to all possible combinations or states. Availability of matched normals reduces the search space by already providing somatic status.

The predictSomatic function provides access to these probabilities. For predicted somatic mutations, this function also provides cellular fraction estimates, i.e. the fraction of tumor cells with mutation. Fractions significantly below 1 indicate sub-clonality

head(predictSomatic(ret), 3)

##    chr     start       end               ID REF ALT    SOMATIC.M0
## 1 chr1 114515871 114515871 chr1114515871xxx   G   A 5.969924e-101
## 2 chr1 150044293 150044293 chr1150044293xxx   T   G  7.667563e-91
## 3 chr1 158449835 158449835 chr1158449835xxx   A   G 4.162785e-146
##     SOMATIC.M1   SOMATIC.M2    SOMATIC.M3 SOMATIC.M4 SOMATIC.M5 SOMATIC.M6
## 1 4.511273e-38 4.183033e-07 2.668510e-262          0          0          0
## 2 1.887887e-38 4.276546e-10 2.149060e-264          0          0          0
## 3 1.939661e-61 1.303470e-14 3.004433e-266          0          0          0
##   SOMATIC.M7   GERMLINE.M0  GERMLINE.M1 GERMLINE.M2   GERMLINE.M3
## 1          0  1.574106e-68 2.173694e-15   0.9999996 1.868708e-260
## 2          0  2.307980e-63 2.728936e-18   1.0000000 1.053343e-258
## 3          0 2.291025e-104 2.254699e-29   1.0000000 2.861128e-258
##   GERMLINE.M4 GERMLINE.M5 GERMLINE.M6 GERMLINE.M7 GERMLINE.CONTHIGH
## 1           0           0           0           0      6.174341e-43
## 2           0           0           0           0      3.072618e-21
## 3           0           0           0           0      5.875696e-27
##   GERMLINE.CONTLOW GERMLINE.HOMOZYGOUS ML.SOMATIC POSTERIOR.SOMATIC ML.M
## 1    1.263476e-286                   0      FALSE      4.183033e-07    2
## 2    1.395456e-241                   0      FALSE      4.276546e-10    2
## 3     0.000000e+00                   0      FALSE      1.303470e-14    2
##   ML.C ML.M.SEGMENT M.SEGMENT.POSTERIOR M.SEGMENT.FLAGGED ML.AR       AR
## 1    2            0                   1             FALSE 0.825 0.755183
## 2    2            0                   1             FALSE 0.825 0.817078
## 3    2            0                   1             FALSE 0.825 0.834266
##   AR.ADJUSTED MAPPING.BIAS ML.LOH CN.SUBCLONAL CELLFRACTION FLAGGED
## 1   0.7736009     0.976192   TRUE        FALSE           NA   FALSE
## 2   0.8370054     0.976192   TRUE        FALSE           NA   FALSE
## 3   0.8546126     0.976192   TRUE        FALSE           NA   FALSE
##    log.ratio depth prior.somatic prior.contamination on.target seg.id
## 1  0.2842136   184       9.9e-05                0.01         1      1
## 2 -0.1686186   138       9.9e-05                0.01         1      1
## 3  0.4596841   217       9.9e-05                0.01         1      1
##   gene.symbol
## 1       HIPK1
## 2       VPS45
## 3        <NA>

2.2 Amplifications and deletions

gene.calls <- callAlterations(ret)
head(gene.calls)

##            chr     start       end C seg.mean seg.id number.targets
## EIF2A     chr3 150270143 150301699 6   1.3351      5             13
## MIR548H2  chr3 151531951 151538241 6   1.3351      5              3
## GPNMB     chr7  23286477  23313844 7   1.4723     17             11
## SH2D4B   chr10  82298088  82403838 0  -1.5118     22              8
## SLC35G1  chr10  95653791  95661248 0  -1.3002     24              3
## CYP2C9   chr10  96698440  96748786 0  -1.3002     24              9
##           gene.mean   gene.min   gene.max focal breakpoints          type
## EIF2A     1.5159387  0.7544062  2.1392167  TRUE           0 AMPLIFICATION
## MIR548H2  0.8261594  0.4682628  1.2000253  TRUE           0 AMPLIFICATION
## GPNMB     1.4490939  0.9397072  1.8200775  TRUE           0 AMPLIFICATION
## SH2D4B   -1.5141982 -1.9281007 -1.2607230 FALSE           0      DELETION
## SLC35G1  -1.5988526 -1.8066816 -1.4493561 FALSE           0      DELETION
## CYP2C9   -1.1901085 -2.1922992 -0.9875222 FALSE           0      DELETION
##          num.snps.segment  M M.flagged  loh
## EIF2A                   0 NA        NA   NA
## MIR548H2                0 NA        NA   NA
## GPNMB                   0 NA        NA   NA
## SH2D4B                  2  0      TRUE TRUE
## SLC35G1                 0 NA        NA   NA
## CYP2C9                  0 NA        NA   NA

2.3 Find genomic regions in LOH

The gene.calls data.frame described above provides gene-level LOH information. To find the corresponding genomic regions in LOH, we can use the callLOH function:

loh <- callLOH(ret)
head(loh)

##    chr     start       end arm C M                       type
## 1 chr1 114515871 121535434   p 2 0 WHOLE ARM COPY-NEUTRAL LOH
## 2 chr1 124535434 248085104   q 2 0 WHOLE ARM COPY-NEUTRAL LOH
## 3 chr2  10262881  92326171   p 1 0              WHOLE ARM LOH
## 4 chr2  95326171 231775678   q 1 0              WHOLE ARM LOH
## 5 chr2 236403331 239039169   q 2 0           COPY-NEUTRAL LOH
## 6 chr3  11888043  90504854   p 2 1

2.4 Curation

This will generate a CSV file in which the correct purity and ploidy values can be manually entered.

createCurationFile(file.rds)
read.csv("Sample1_PureCN.csv")

##   Sampleid Purity   Ploidy Sex Contamination Flagged Failed Curated
## 1  Sample1   0.65 1.734572   ?             0    TRUE  FALSE   FALSE
##                          Comment
## 1 EXCESSIVE LOSSES;EXCESSIVE LOH

2.5 Mutation burden

The predictSomatic function can be used to efficiently remove private germline mutations. This in turn allows the calculation of mutation burden for un-matched tumor samples. A wrapper function for this specific task is included as callMutationBurden:

callableBed <- import(system.file("extdata", "example_callable.bed.gz",
    package = "PureCN"))

callMutationBurden(ret, callable=callableBed)

##   somatic.ontarget somatic.all private.germline.ontarget
## 1                0           0                         0
##   private.germline.all callable.bases.ontarget callable.bases.flanking
## 1                    0                 1511056                 2223991
##   callable.bases.all somatic.rate.ontarget somatic.rate.ontarget.95.lower
## 1            3063762                     0                              0
##   somatic.rate.ontarget.95.upper private.germline.rate.ontarget
## 1                       2.441256                              0
##   private.germline.rate.ontarget.95.lower
## 1                                       0
##   private.germline.rate.ontarget.95.upper
## 1                                2.441256

3 Plot

3.1 Overview

This plot shows the purity and ploidy local optima, sorted by final likelihood score after fitting both copy number and allelic fractions.

plotAbs(ret, type = "overview")

The colors visualize the copy number fitting score from low (blue) to high (red). The numbers indicate the ranks of the local optima. Yellow fonts indicate that the corresponding solutions were flagged, which does not necessarily mean the solutions are wrong. The correct solution (number 1) of this toy example was flagged due to large amount of LOH.

3.2 Log-ratio histogram

plotAbs(ret, 1, type = "hist")

## NULL

This figure displays a histogram of tumor vs. normal copy number log-ratios for the maximum likelihood solution (number 1 in Overview). The height of a bar in this plot is proportional to the fraction of the genome falling into the particular log-ratio copy number range. The vertical dotted lines and numbers visualize the, for the given purity/ploidy combination, expected log-ratios for all integer copy numbers from 0 to 7.

3.3 B-Allele frequency plot

This figure shows the allelic fractions of predicted germline SNPs. The goodness of fit (GoF) is provided on an arbitrary scale in which 100% corresponds to a perfect fit and 0% to the worst possible fit. The latter is defined as a fit in which allelic fractions on average differ by 0.2 from their expected fractions. Note that this does not take purity into account and low purity samples are expected to have a better fit. In the middle panel, the corresponding copy number log-ratios are shown. The lower panel displays the calculated integer copy numbers, corrected for purity and ploidy. We can zoom into particular chromosomes using chr argument.

plotAbs(ret, 1, type="BAF")

Each dot is a (predicted) germline SNP. The first panel shows the allelic fractions as provided in the VCF file. The alternating grey and white background colors visualize odd and even chromosome numbers, re- spectively. The black lines visualize the expected (not the average!) allelic fractions in the segment. These are calculated using the estimated purity and the total and minor segment copy numbers. These are vi- sualized in black and grey, respectively, in the second and third panel. The second panel shows the copy number log-ratios, the third panel the integer copy numbers.

An example of PureCN output

November 29, 2018

Contents