Analysis of beat AML release 081717

Division of Hematology at University of Colorado
The beatAML Website
Shanshan Pei Github

1. Set up

library("dplyr")
library("ggplot2")
library("RColorBrewer")
library("VennDiagram")
library("circlize")
bAMLcli <- read.csv("clinical-summary-20170703_143804.csv", header = TRUE) # clinical info
bAMLcli <- as_data_frame(bAMLcli)

bAMLdna <- read.csv("BeatAML_DR2_TumorOnly_vars_2017_06_09.csv", header = TRUE) # DNA results
bAMLdna <- as_data_frame(bAMLdna)

bAMLrna <- read.csv("BeatAML_DR2_RNASeq_rawcounts_2017_06_08.csv", header = TRUE) # RNA-seq results
bAMLrna <- as_data_frame(bAMLrna)

2. Analysis of clinical data

1098 unique patients.

542 male.
422 female.
41 with unknown sex.
93 are not labeled with any sex info.
258 with unknown vital status.
426 are alive.
415 are dead.
- 32 treatment death.
- 247 disease death.

422 patients with RNA-seq data.

231 male.
184 female.

576 patients with Exome-seq data.

286 male.
230 female.

1873 unique specimens, each with a unique labID.

1026 Bone Marrow Aspirate.
800 Peripheral Blood.
47 Leukapheresis.

Among the 1873 unique specimens:

734 with Exome-seq data (DNA mutation).
471 with RNA-seq data (gene expression).
423 with both RNA-seq and Exome-seq data, related to: 383 patients (213 male, 166 female).

## (polygon[GRID.polygon.127], polygon[GRID.polygon.128], polygon[GRID.polygon.129], polygon[GRID.polygon.130], text[GRID.text.131], text[GRID.text.132], text[GRID.text.133], text[GRID.text.134], text[GRID.text.135])

Among the 423 specimens with both RNA-seq and Exome-seq data:

222 are Bone Marrow Aspirate.
194 are Peripheral Blood.
7 are Leukapheresis.

Three interesting lists of specimens

222 Bone Marrow Aspirate specimens with both RNA-seq and Exome-seq data:

## # A tibble: 222 × 6
##    patientId    labId         specimenType percentBlastsBM
##        <int>   <fctr>               <fctr>          <fctr>
## 1        685 13-00118 Bone Marrow Aspirate              71
## 2        757 13-00537 Bone Marrow Aspirate              90
## 3        928 13-00226 Bone Marrow Aspirate              75
## 4        967 13-00160 Bone Marrow Aspirate           60-70
## 5       1011 13-00098 Bone Marrow Aspirate         76 | BM
## 6       1049 13-00149 Bone Marrow Aspirate              87
## 7       1054 13-00157 Bone Marrow Aspirate              80
## 8       1058 13-00163 Bone Marrow Aspirate              50
## 9       1072 13-00186 Bone Marrow Aspirate              60
## 10      1078 13-00195 Bone Marrow Aspirate              90
## # ... with 212 more rows, and 2 more variables:
## #   ageAtSpecimenAcquisition <int>, riskGroup <fctr>

194 Peripheral Blood specimens with both RNA-seq and Exome-seq data:

## # A tibble: 194 × 6
##    patientId    labId     specimenType percentBlastsPB
##        <int>   <fctr>           <fctr>          <fctr>
## 1       1060 13-00165 Peripheral Blood              71
## 2       1061 13-00166 Peripheral Blood              77
## 3       1111 13-00250 Peripheral Blood              16
## 4       1113 13-00253 Peripheral Blood               9
## 5       1118 13-00262 Peripheral Blood         94 | PB
## 6       1123 13-00270 Peripheral Blood               7
## 7       1126 13-00273 Peripheral Blood              12
## 8       1131 13-00281 Peripheral Blood              56
## 9       1157 13-00338 Peripheral Blood                
## 10      1163 13-00354 Peripheral Blood              71
## # ... with 184 more rows, and 2 more variables:
## #   ageAtSpecimenAcquisition <int>, riskGroup <fctr>

7 Leukapheresis specimens with both RNA-seq and Exome-seq data:

## # A tibble: 7 × 6
##   patientId    labId  specimenType
##       <int>   <fctr>        <fctr>
## 1      1190 13-00406 Leukapheresis
## 2      2235 15-00615 Leukapheresis
## 3      2314 15-00777 Leukapheresis
## 4      2495 16-00087 Leukapheresis
## 5      2511 16-00109 Leukapheresis
## 6      2538 16-01127 Leukapheresis
## 7      2690 16-00465 Leukapheresis
## # ... with 3 more variables: percentBlastsPB <fctr>,
## #   ageAtSpecimenAcquisition <int>, riskGroup <fctr>

3. Analysis of the exome-seq DNA mutation data

5289 genes present mutations only in tumor but not in matched normal tissue control.

The most mutated patient specimens:

## # A tibble: 344 × 2
##    Tumor_Sample_Barcode     n
##                  <fctr> <int>
## 1              20-00146  3273
## 2              20-00137  3106
## 3              16-00945  3046
## 4              20-00116  3019
## 5              20-00558  3017
## 6              14-00842  3008
## 7              16-00960  3007
## 8              16-00994  3004
## 9              20-00548  3004
## 10             20-00465  3003
## # ... with 334 more rows

The most mutated Chromosome:

## # A tibble: 25 × 2
##    Chromosome     n
##        <fctr> <int>
## 1           1 87666
## 2          19 58060
## 3           3 51640
## 4          11 47665
## 5          17 46092
## 6           2 46082
## 7           6 42385
## 8           7 41112
## 9          16 35915
## 10         12 32659
## # ... with 15 more rows

Now lets start to filter this data:

Below is a list of the 54 genes involved in the illumina TruSight Myeloid Sequencing Panel, which from now on we will call them “hot genes”:

##  [1] ABL1     ASXL1    ATRX     BCOR     BCORL1   BRAF     CALR    
##  [8] CBL      CBLB     CBLC     CDKN2A   CEBPA    CSF3R    CUX1    
## [15] DNMT3A   ETV6/TEL EZH2     FBXW7    FLT3     GATA1    GATA2   
## [22] GNAS     HRAS     IDH1     IDH2     IKZF1    JAK2     JAK3    
## [29] KDM6A    KIT      KRAS     MLL      MPL      MYD88    NOTCH1  
## [36] NPM1     NRAS     PDGFRA   PHF6     PTEN     PTPN11   RAD21   
## [43] RUNX1    SETBP1   SF3B1    SMC1A    SMC3     SRSF2    STAG2   
## [50] TET2     TP53     U2AF1    WT1      ZRSR2   
## 54 Levels: ABL1 ASXL1 ATRX BCOR BCORL1 BRAF CALR CBL CBLB CBLC ... ZRSR2

Within this list hot genes, 37 genes have detected mutations in the current release:

## $Hugo_Symbol
##  [1] CSF3R  NRAS   PTEN   SMC3   WT1    CBL    KRAS   PTPN11 FLT3   IDH2  
## [11] TP53   SRSF2  SETBP1 JAK3   CEBPA  DNMT3A SF3B1  IDH1   ASXL1  GNAS  
## [21] RUNX1  U2AF1  GATA2  KIT    TET2   NPM1   IKZF1  BRAF   EZH2   JAK2  
## [31] ZRSR2  BCOR   KDM6A  SMC1A  ATRX   STAG2  PHF6  
## 5289 Levels: A1CF A2M AADAC AAMDC AAMP AASDH AASDHPPT AATF AATK ... ZZEF1

Now let’s filter out only the hot genes and match the seq_ID to oringal lab_ID:

## Warning in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector

## # A tibble: 1,709 × 10
##    Tumor_Sample_Barcode Original_LabID PatientID Hugo_Symbol
##                   <chr>         <fctr>     <int>      <fctr>
## 1              09-00705       09-00705       163        FLT3
## 2              09-00705       09-00705       163       U2AF1
## 3              09-00705       09-00705       163        TET2
## 4              09-00705       09-00705       163        NPM1
## 5              10-00136       10-00136       174        FLT3
## 6              10-00136       10-00136       174      DNMT3A
## 7              10-00136       10-00136       174       U2AF1
## 8              10-00172       10-00172       175        FLT3
## 9              10-00172       10-00172       175       SRSF2
## 10             10-00172       10-00172       175      SETBP1
## # ... with 1,699 more rows, and 6 more variables: Transcript_ID <fctr>,
## #   HGVSp_Short <fctr>, Variant_Classification <fctr>,
## #   Reference_Allele <fctr>, Alt_Allele <fctr>,
## #   Tumor_VAF_VarScan.2.4.1 <dbl>

Specimens with most mutations in these hot genes

## # A tibble: 344 × 2
##    Original_LabID     n
##            <fctr> <int>
## 1        14-00602    12
## 2        16-00941    12
## 3        14-00803    10
## 4        15-00231    10
## 5        16-00151    10
## 6        14-00842     9
## 7        15-00057     9
## 8        16-00483     9
## 9        13-00202     8
## 10       14-00495     8
## # ... with 334 more rows

The most mutated hot genes:

## # A tibble: 37 × 3
##    Hugo_Symbol     n       freq
##         <fctr> <int>      <dbl>
## 1         FLT3   370 0.21650088
## 2        U2AF1   321 0.18782914
## 3        SMC1A   221 0.12931539
## 4         JAK3   176 0.10298420
## 5         NPM1    70 0.04095963
## 6       DNMT3A    67 0.03920421
## 7         PTEN    60 0.03510825
## 8         ATRX    57 0.03335284
## 9         NRAS    55 0.03218256
## 10         CBL    38 0.02223523
## # ... with 27 more rows

The most common type of mutations

## # A tibble: 9 × 2
##   Variant_Classification     n
##                   <fctr> <int>
## 1      Missense_Mutation   840
## 2                  5'UTR   489
## 3                  3'UTR   250
## 4        Frame_Shift_Ins    83
## 5      Nonsense_Mutation    30
## 6        Frame_Shift_Del     6
## 7            Splice_Site     5
## 8           In_Frame_Ins     4
## 9           In_Frame_Del     2

Now let’s focus on mutations in 3 genes: FLT3, DNMT3A and NPM1:

## # A tibble: 166 × 10
##    Tumor_Sample_Barcode Original_LabID PatientID Hugo_Symbol
##                   <chr>         <fctr>     <int>      <fctr>
## 1              10-00136       10-00136       174        FLT3
## 2              10-00136       10-00136       174      DNMT3A
## 3              10-00136       10-00136       174       U2AF1
## 4              10-00542       10-00542       174        FLT3
## 5              10-00542       10-00542       174        JAK3
## 6              10-00542       10-00542       174      DNMT3A
## 7              10-00542       10-00542       174       U2AF1
## 8              10-00542       10-00542       174       U2AF1
## 9              10-00542       10-00542       174        ATRX
## 10             11-00049       11-00049       174        FLT3
## # ... with 156 more rows, and 6 more variables: Transcript_ID <fctr>,
## #   HGVSp_Short <fctr>, Variant_Classification <fctr>,
## #   Reference_Allele <fctr>, Alt_Allele <fctr>,
## #   Tumor_VAF_VarScan.2.4.1 <dbl>

## # A tibble: 202 × 10
##    Tumor_Sample_Barcode Original_LabID PatientID Hugo_Symbol
##                   <chr>         <fctr>     <int>      <fctr>
## 1              09-00705       09-00705       163        FLT3
## 2              09-00705       09-00705       163       U2AF1
## 3              09-00705       09-00705       163        TET2
## 4              09-00705       09-00705       163        NPM1
## 5              10-00692       10-00692       201        KRAS
## 6              10-00692       10-00692       201        FLT3
## 7              10-00692       10-00692       201       U2AF1
## 8              10-00692       10-00692       201        NPM1
## 9              10-00692       10-00692       201       SMC1A
## 10             10-00792       10-00792       163        FLT3
## # ... with 192 more rows, and 6 more variables: Transcript_ID <fctr>,
## #   HGVSp_Short <fctr>, Variant_Classification <fctr>,
## #   Reference_Allele <fctr>, Alt_Allele <fctr>,
## #   Tumor_VAF_VarScan.2.4.1 <dbl>

31 specimens have mutations in FLT3 and DNMT3A but not NPM1.
37 specimens have mutations in FLT3 and NPM1 but not DNMT3A.
22 specimens have mutations in all 3 genes.

4. Analysis of the RNA-seq gene expression data

145 specimens are sequenced
56638 unique genes are profiled

Now let’s filter out only specimens that have known mutations in FLT3, DNMT3A and NPM1 Frist, lets extract seqIDs

## Joining, by = "Original_LabID"

## Warning in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector

## Joining, by = "Original_LabID"

## Warning in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector

##  [1] "X20.00078" "X20.00310" "X14.00602" "X16.00004" "X16.00007"
##  [6] "X20.00490" "X16.00124" "X20.00344" "X20.00327" "X20.00328"
## [11] "X20.00335" "X20.00458" "X20.00501" "X20.00502"

##  [1] "X13.00157" "X13.00262" "X14.00658" "X15.00653" "X15.00967"
##  [6] "X16.00087" "X16.00109" "X20.00420" "X20.00455" "X20.00456"
## [11] "X20.00507"

Second, Lets extract the corresponding RNA-seq data (cpm values)

## Warning in one_of(FLT3_DNMT3A_dual_seqID_final): Unknown variables:
## `X14.00602`, `X16.00004`, `X16.00007`, `X16.00124`

## Warning in one_of(FLT3_NPM1_dual_seqID_final): Unknown variables:
## `X13.00157`, `X13.00262`, `X14.00658`, `X15.00653`, `X15.00967`,
## `X16.00087`, `X16.00109`

## # A tibble: 63,677 × 11
##      Symbol X20.00078 X20.00310 X20.00490 X20.00344 X20.00327 X20.00328
##      <fctr>     <int>     <int>     <int>     <int>     <int>     <int>
## 1    TSPAN6         0         1        15        46        21        12
## 2      TNMD         0         0         0         0         0         0
## 3      DPM1      1343      1543       596      1266      1503      1456
## 4     SCYL3      1015       835       791       235       679       456
## 5  C1orf112       426       445       304       455       205       119
## 6       FGR     32049     33783      2395      4086      5305     18305
## 7       CFH        26      1477      3303       796        58        96
## 8     FUCA2      3286      3638      1603       759      1412      1048
## 9      GCLC      1829      2562      1226      1217      2041      1751
## 10     NFYA      2123      1777      1866      3230      1232      1490
## # ... with 63,667 more rows, and 4 more variables: X20.00335 <int>,
## #   X20.00458 <int>, X20.00501 <int>, X20.00502 <int>

## # A tibble: 63,677 × 5
##      Symbol X20.00420 X20.00455 X20.00456 X20.00507
##      <fctr>     <int>     <int>     <int>     <int>
## 1    TSPAN6        28         1        65        14
## 2      TNMD         0         0         0         0
## 3      DPM1      1134      1698       839      1370
## 4     SCYL3       736       824       841       905
## 5  C1orf112       639       596       283       627
## 6       FGR      1351      1290      2299       718
## 7       CFH      2386      4335      4412      4139
## 8     FUCA2      2052      2527      2012      1588
## 9      GCLC      1823      2282      1738      2311
## 10     NFYA      2353      2793      1778      3130
## # ... with 63,667 more rows

## # A tibble: 63,677 × 15
##      Symbol X20.00078 X20.00310 X20.00490 X20.00344 X20.00327 X20.00328
##      <fctr>     <int>     <int>     <int>     <int>     <int>     <int>
## 1    TSPAN6         0         1        15        46        21        12
## 2      TNMD         0         0         0         0         0         0
## 3      DPM1      1343      1543       596      1266      1503      1456
## 4     SCYL3      1015       835       791       235       679       456
## 5  C1orf112       426       445       304       455       205       119
## 6       FGR     32049     33783      2395      4086      5305     18305
## 7       CFH        26      1477      3303       796        58        96
## 8     FUCA2      3286      3638      1603       759      1412      1048
## 9      GCLC      1829      2562      1226      1217      2041      1751
## 10     NFYA      2123      1777      1866      3230      1232      1490
## # ... with 63,667 more rows, and 8 more variables: X20.00335 <int>,
## #   X20.00458 <int>, X20.00501 <int>, X20.00502 <int>, X20.00420 <int>,
## #   X20.00455 <int>, X20.00456 <int>, X20.00507 <int>

## Warning: package 'limma' was built under R version 3.3.3

## 
## FALSE  TRUE 
## 38586 18052

After removing low-expressing genes, out of the starting 56638 genes, 10922 are kept.

Now let’s check the result of first data pre-processing (removing low-expressing genes)

Now lets calcuate and update normalization factors

##  [1] 0.988 1.081 0.864 1.022 1.007 0.826 0.747 1.098 0.733 1.137 1.213 1.127 1.076 1.268

Now lets generate a MDS plot to initially anlayze the similarity of specimens

Now lets do a mean-difference plot of first and last specimen

Now lets do dispersion estimation

##    FLT3andDNMT3A FLT3andNPM1
## 1              1           0
## 2              1           0
## 3              1           0
## 4              1           0
## 5              1           0
## 6              1           0
## 7              1           0
## 8              1           0
## 9              1           0
## 10             1           0
## 11             0           1
## 12             0           1
## 13             0           1
## 14             0           1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group_name
## [1] "contr.treatment"

##          FLT3andDNMT3A FLT3andNPM1
## DPM1            -10.29      -10.49
## SCYL3           -10.83      -10.87
## C1orf112        -11.66      -11.33
## FGR              -7.48      -10.29
## FUCA2           -10.03       -9.97
## GCLC            -10.05       -9.98

Now lets to differential expression analysis

## Coefficient:  1*FLT3andDNMT3A -1*FLT3andNPM1 
##               logFC logCPM    F   PValue      FDR
## FTO           -3.80   6.45 86.4 8.04e-08 0.000878
## ARAP2          3.75   6.08 65.5 4.90e-07 0.002674
## CPA3          -5.33   7.48 53.8 1.76e-06 0.005667
## SERPINB9       3.79   5.95 52.3 2.08e-06 0.005667
## TWISTNB       -2.18   5.95 48.4 3.29e-06 0.006557
## SORD          -2.33   4.72 47.7 3.60e-06 0.006557
## TNFRSF14       3.94   7.18 45.0 5.26e-06 0.007776
## RNF144B        4.39   5.89 44.4 5.70e-06 0.007776
## RP11-706O15.1 -2.44   3.98 42.3 7.40e-06 0.008979
## ITPR2         -2.05   7.75 40.0 1.02e-05 0.011141

##    [,1] 
## -1   129
## 0  10559
## 1    234

## Coefficient:  1*FLT3andDNMT3A -1*FLT3andNPM1 
##          logFC unshrunk.logFC logCPM   PValue     FDR
## FTO      -3.80          -3.80   6.45 1.70e-07 0.00186
## ARAP2     3.75           3.75   6.08 8.27e-07 0.00451
## CPA3     -5.33          -5.33   7.48 2.36e-06 0.00857
## SERPINB9  3.79           3.79   5.95 3.32e-06 0.00906
## RNF144B   4.39           4.39   5.89 7.77e-06 0.01426
## TNFRSF14  3.94           3.94   7.18 7.84e-06 0.01426
## SORD     -2.33          -2.33   4.72 1.77e-05 0.02683
## TWISTNB  -2.18          -2.18   5.95 1.99e-05 0.02683
## HLA-DRA   5.01           5.01   9.85 2.21e-05 0.02683
## IL6ST     3.21           3.21   6.53 2.53e-05 0.02759

##    [,1] 
## -1    15
## 0  10867
## 1     40

Now lets do heatmap clustering

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess