Division of Hematology at University of Colorado
The beatAML Website
Shanshan Pei Github
library("dplyr")
library("ggplot2")
library("RColorBrewer")
library("VennDiagram")
library("circlize")
bAMLcli <- read.csv("clinical-summary-20170703_143804.csv", header = TRUE) # clinical info
bAMLcli <- as_data_frame(bAMLcli)
bAMLdna <- read.csv("BeatAML_DR2_TumorOnly_vars_2017_06_09.csv", header = TRUE) # DNA results
bAMLdna <- as_data_frame(bAMLdna)
bAMLrna <- read.csv("BeatAML_DR2_RNASeq_rawcounts_2017_06_08.csv", header = TRUE) # RNA-seq results
bAMLrna <- as_data_frame(bAMLrna)
1098 unique patients.
93 are not labeled with any sex info.
422 patients with RNA-seq data.
576 patients with Exome-seq data.
1873 unique specimens, each with a unique labID.
Among the 1873 unique specimens:
## (polygon[GRID.polygon.127], polygon[GRID.polygon.128], polygon[GRID.polygon.129], polygon[GRID.polygon.130], text[GRID.text.131], text[GRID.text.132], text[GRID.text.133], text[GRID.text.134], text[GRID.text.135])
Among the 423 specimens with both RNA-seq and Exome-seq data:
222 Bone Marrow Aspirate specimens with both RNA-seq and Exome-seq data:
## # A tibble: 222 × 6
## patientId labId specimenType percentBlastsBM
## <int> <fctr> <fctr> <fctr>
## 1 685 13-00118 Bone Marrow Aspirate 71
## 2 757 13-00537 Bone Marrow Aspirate 90
## 3 928 13-00226 Bone Marrow Aspirate 75
## 4 967 13-00160 Bone Marrow Aspirate 60-70
## 5 1011 13-00098 Bone Marrow Aspirate 76 | BM
## 6 1049 13-00149 Bone Marrow Aspirate 87
## 7 1054 13-00157 Bone Marrow Aspirate 80
## 8 1058 13-00163 Bone Marrow Aspirate 50
## 9 1072 13-00186 Bone Marrow Aspirate 60
## 10 1078 13-00195 Bone Marrow Aspirate 90
## # ... with 212 more rows, and 2 more variables:
## # ageAtSpecimenAcquisition <int>, riskGroup <fctr>
194 Peripheral Blood specimens with both RNA-seq and Exome-seq data:
## # A tibble: 194 × 6
## patientId labId specimenType percentBlastsPB
## <int> <fctr> <fctr> <fctr>
## 1 1060 13-00165 Peripheral Blood 71
## 2 1061 13-00166 Peripheral Blood 77
## 3 1111 13-00250 Peripheral Blood 16
## 4 1113 13-00253 Peripheral Blood 9
## 5 1118 13-00262 Peripheral Blood 94 | PB
## 6 1123 13-00270 Peripheral Blood 7
## 7 1126 13-00273 Peripheral Blood 12
## 8 1131 13-00281 Peripheral Blood 56
## 9 1157 13-00338 Peripheral Blood
## 10 1163 13-00354 Peripheral Blood 71
## # ... with 184 more rows, and 2 more variables:
## # ageAtSpecimenAcquisition <int>, riskGroup <fctr>
7 Leukapheresis specimens with both RNA-seq and Exome-seq data:
## # A tibble: 7 × 6
## patientId labId specimenType
## <int> <fctr> <fctr>
## 1 1190 13-00406 Leukapheresis
## 2 2235 15-00615 Leukapheresis
## 3 2314 15-00777 Leukapheresis
## 4 2495 16-00087 Leukapheresis
## 5 2511 16-00109 Leukapheresis
## 6 2538 16-01127 Leukapheresis
## 7 2690 16-00465 Leukapheresis
## # ... with 3 more variables: percentBlastsPB <fctr>,
## # ageAtSpecimenAcquisition <int>, riskGroup <fctr>
5289 genes present mutations only in tumor but not in matched normal tissue control.
The most mutated patient specimens:
## # A tibble: 344 × 2
## Tumor_Sample_Barcode n
## <fctr> <int>
## 1 20-00146 3273
## 2 20-00137 3106
## 3 16-00945 3046
## 4 20-00116 3019
## 5 20-00558 3017
## 6 14-00842 3008
## 7 16-00960 3007
## 8 16-00994 3004
## 9 20-00548 3004
## 10 20-00465 3003
## # ... with 334 more rows
The most mutated Chromosome:
## # A tibble: 25 × 2
## Chromosome n
## <fctr> <int>
## 1 1 87666
## 2 19 58060
## 3 3 51640
## 4 11 47665
## 5 17 46092
## 6 2 46082
## 7 6 42385
## 8 7 41112
## 9 16 35915
## 10 12 32659
## # ... with 15 more rows
Below is a list of the 54 genes involved in the illumina TruSight Myeloid Sequencing Panel, which from now on we will call them “hot genes”:
## [1] ABL1 ASXL1 ATRX BCOR BCORL1 BRAF CALR
## [8] CBL CBLB CBLC CDKN2A CEBPA CSF3R CUX1
## [15] DNMT3A ETV6/TEL EZH2 FBXW7 FLT3 GATA1 GATA2
## [22] GNAS HRAS IDH1 IDH2 IKZF1 JAK2 JAK3
## [29] KDM6A KIT KRAS MLL MPL MYD88 NOTCH1
## [36] NPM1 NRAS PDGFRA PHF6 PTEN PTPN11 RAD21
## [43] RUNX1 SETBP1 SF3B1 SMC1A SMC3 SRSF2 STAG2
## [50] TET2 TP53 U2AF1 WT1 ZRSR2
## 54 Levels: ABL1 ASXL1 ATRX BCOR BCORL1 BRAF CALR CBL CBLB CBLC ... ZRSR2
Within this list hot genes, 37 genes have detected mutations in the current release:
## $Hugo_Symbol
## [1] CSF3R NRAS PTEN SMC3 WT1 CBL KRAS PTPN11 FLT3 IDH2
## [11] TP53 SRSF2 SETBP1 JAK3 CEBPA DNMT3A SF3B1 IDH1 ASXL1 GNAS
## [21] RUNX1 U2AF1 GATA2 KIT TET2 NPM1 IKZF1 BRAF EZH2 JAK2
## [31] ZRSR2 BCOR KDM6A SMC1A ATRX STAG2 PHF6
## 5289 Levels: A1CF A2M AADAC AAMDC AAMP AASDH AASDHPPT AATF AATK ... ZZEF1
Now let’s filter out only the hot genes and match the seq_ID to oringal lab_ID:
## Warning in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector
## # A tibble: 1,709 × 10
## Tumor_Sample_Barcode Original_LabID PatientID Hugo_Symbol
## <chr> <fctr> <int> <fctr>
## 1 09-00705 09-00705 163 FLT3
## 2 09-00705 09-00705 163 U2AF1
## 3 09-00705 09-00705 163 TET2
## 4 09-00705 09-00705 163 NPM1
## 5 10-00136 10-00136 174 FLT3
## 6 10-00136 10-00136 174 DNMT3A
## 7 10-00136 10-00136 174 U2AF1
## 8 10-00172 10-00172 175 FLT3
## 9 10-00172 10-00172 175 SRSF2
## 10 10-00172 10-00172 175 SETBP1
## # ... with 1,699 more rows, and 6 more variables: Transcript_ID <fctr>,
## # HGVSp_Short <fctr>, Variant_Classification <fctr>,
## # Reference_Allele <fctr>, Alt_Allele <fctr>,
## # Tumor_VAF_VarScan.2.4.1 <dbl>
Specimens with most mutations in these hot genes
## # A tibble: 344 × 2
## Original_LabID n
## <fctr> <int>
## 1 14-00602 12
## 2 16-00941 12
## 3 14-00803 10
## 4 15-00231 10
## 5 16-00151 10
## 6 14-00842 9
## 7 15-00057 9
## 8 16-00483 9
## 9 13-00202 8
## 10 14-00495 8
## # ... with 334 more rows
The most mutated hot genes:
## # A tibble: 37 × 3
## Hugo_Symbol n freq
## <fctr> <int> <dbl>
## 1 FLT3 370 0.21650088
## 2 U2AF1 321 0.18782914
## 3 SMC1A 221 0.12931539
## 4 JAK3 176 0.10298420
## 5 NPM1 70 0.04095963
## 6 DNMT3A 67 0.03920421
## 7 PTEN 60 0.03510825
## 8 ATRX 57 0.03335284
## 9 NRAS 55 0.03218256
## 10 CBL 38 0.02223523
## # ... with 27 more rows
The most common type of mutations
## # A tibble: 9 × 2
## Variant_Classification n
## <fctr> <int>
## 1 Missense_Mutation 840
## 2 5'UTR 489
## 3 3'UTR 250
## 4 Frame_Shift_Ins 83
## 5 Nonsense_Mutation 30
## 6 Frame_Shift_Del 6
## 7 Splice_Site 5
## 8 In_Frame_Ins 4
## 9 In_Frame_Del 2
Now let’s focus on mutations in 3 genes: FLT3, DNMT3A and NPM1:
## # A tibble: 166 × 10
## Tumor_Sample_Barcode Original_LabID PatientID Hugo_Symbol
## <chr> <fctr> <int> <fctr>
## 1 10-00136 10-00136 174 FLT3
## 2 10-00136 10-00136 174 DNMT3A
## 3 10-00136 10-00136 174 U2AF1
## 4 10-00542 10-00542 174 FLT3
## 5 10-00542 10-00542 174 JAK3
## 6 10-00542 10-00542 174 DNMT3A
## 7 10-00542 10-00542 174 U2AF1
## 8 10-00542 10-00542 174 U2AF1
## 9 10-00542 10-00542 174 ATRX
## 10 11-00049 11-00049 174 FLT3
## # ... with 156 more rows, and 6 more variables: Transcript_ID <fctr>,
## # HGVSp_Short <fctr>, Variant_Classification <fctr>,
## # Reference_Allele <fctr>, Alt_Allele <fctr>,
## # Tumor_VAF_VarScan.2.4.1 <dbl>
## # A tibble: 202 × 10
## Tumor_Sample_Barcode Original_LabID PatientID Hugo_Symbol
## <chr> <fctr> <int> <fctr>
## 1 09-00705 09-00705 163 FLT3
## 2 09-00705 09-00705 163 U2AF1
## 3 09-00705 09-00705 163 TET2
## 4 09-00705 09-00705 163 NPM1
## 5 10-00692 10-00692 201 KRAS
## 6 10-00692 10-00692 201 FLT3
## 7 10-00692 10-00692 201 U2AF1
## 8 10-00692 10-00692 201 NPM1
## 9 10-00692 10-00692 201 SMC1A
## 10 10-00792 10-00792 163 FLT3
## # ... with 192 more rows, and 6 more variables: Transcript_ID <fctr>,
## # HGVSp_Short <fctr>, Variant_Classification <fctr>,
## # Reference_Allele <fctr>, Alt_Allele <fctr>,
## # Tumor_VAF_VarScan.2.4.1 <dbl>
31 specimens have mutations in FLT3 and DNMT3A but not NPM1.
37 specimens have mutations in FLT3 and NPM1 but not DNMT3A.
22 specimens have mutations in all 3 genes.
145 specimens are sequenced
56638 unique genes are profiled
Now let’s filter out only specimens that have known mutations in FLT3, DNMT3A and NPM1 Frist, lets extract seqIDs
## Joining, by = "Original_LabID"
## Warning in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector
## Joining, by = "Original_LabID"
## Warning in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector
## [1] "X20.00078" "X20.00310" "X14.00602" "X16.00004" "X16.00007"
## [6] "X20.00490" "X16.00124" "X20.00344" "X20.00327" "X20.00328"
## [11] "X20.00335" "X20.00458" "X20.00501" "X20.00502"
## [1] "X13.00157" "X13.00262" "X14.00658" "X15.00653" "X15.00967"
## [6] "X16.00087" "X16.00109" "X20.00420" "X20.00455" "X20.00456"
## [11] "X20.00507"
Second, Lets extract the corresponding RNA-seq data (cpm values)
## Warning in one_of(FLT3_DNMT3A_dual_seqID_final): Unknown variables:
## `X14.00602`, `X16.00004`, `X16.00007`, `X16.00124`
## Warning in one_of(FLT3_NPM1_dual_seqID_final): Unknown variables:
## `X13.00157`, `X13.00262`, `X14.00658`, `X15.00653`, `X15.00967`,
## `X16.00087`, `X16.00109`
## # A tibble: 63,677 × 11
## Symbol X20.00078 X20.00310 X20.00490 X20.00344 X20.00327 X20.00328
## <fctr> <int> <int> <int> <int> <int> <int>
## 1 TSPAN6 0 1 15 46 21 12
## 2 TNMD 0 0 0 0 0 0
## 3 DPM1 1343 1543 596 1266 1503 1456
## 4 SCYL3 1015 835 791 235 679 456
## 5 C1orf112 426 445 304 455 205 119
## 6 FGR 32049 33783 2395 4086 5305 18305
## 7 CFH 26 1477 3303 796 58 96
## 8 FUCA2 3286 3638 1603 759 1412 1048
## 9 GCLC 1829 2562 1226 1217 2041 1751
## 10 NFYA 2123 1777 1866 3230 1232 1490
## # ... with 63,667 more rows, and 4 more variables: X20.00335 <int>,
## # X20.00458 <int>, X20.00501 <int>, X20.00502 <int>
## # A tibble: 63,677 × 5
## Symbol X20.00420 X20.00455 X20.00456 X20.00507
## <fctr> <int> <int> <int> <int>
## 1 TSPAN6 28 1 65 14
## 2 TNMD 0 0 0 0
## 3 DPM1 1134 1698 839 1370
## 4 SCYL3 736 824 841 905
## 5 C1orf112 639 596 283 627
## 6 FGR 1351 1290 2299 718
## 7 CFH 2386 4335 4412 4139
## 8 FUCA2 2052 2527 2012 1588
## 9 GCLC 1823 2282 1738 2311
## 10 NFYA 2353 2793 1778 3130
## # ... with 63,667 more rows
## # A tibble: 63,677 × 15
## Symbol X20.00078 X20.00310 X20.00490 X20.00344 X20.00327 X20.00328
## <fctr> <int> <int> <int> <int> <int> <int>
## 1 TSPAN6 0 1 15 46 21 12
## 2 TNMD 0 0 0 0 0 0
## 3 DPM1 1343 1543 596 1266 1503 1456
## 4 SCYL3 1015 835 791 235 679 456
## 5 C1orf112 426 445 304 455 205 119
## 6 FGR 32049 33783 2395 4086 5305 18305
## 7 CFH 26 1477 3303 796 58 96
## 8 FUCA2 3286 3638 1603 759 1412 1048
## 9 GCLC 1829 2562 1226 1217 2041 1751
## 10 NFYA 2123 1777 1866 3230 1232 1490
## # ... with 63,667 more rows, and 8 more variables: X20.00335 <int>,
## # X20.00458 <int>, X20.00501 <int>, X20.00502 <int>, X20.00420 <int>,
## # X20.00455 <int>, X20.00456 <int>, X20.00507 <int>
## Warning: package 'limma' was built under R version 3.3.3
##
## FALSE TRUE
## 38586 18052
After removing low-expressing genes, out of the starting 56638 genes, 10922 are kept.
Now let’s check the result of first data pre-processing (removing low-expressing genes)
Now lets calcuate and update normalization factors
## [1] 0.988 1.081 0.864 1.022 1.007 0.826 0.747 1.098 0.733 1.137 1.213 1.127 1.076 1.268
Now lets generate a MDS plot to initially anlayze the similarity of specimens
Now lets do a mean-difference plot of first and last specimen
Now lets do dispersion estimation
## FLT3andDNMT3A FLT3andNPM1
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
## 7 1 0
## 8 1 0
## 9 1 0
## 10 1 0
## 11 0 1
## 12 0 1
## 13 0 1
## 14 0 1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group_name
## [1] "contr.treatment"
## FLT3andDNMT3A FLT3andNPM1
## DPM1 -10.29 -10.49
## SCYL3 -10.83 -10.87
## C1orf112 -11.66 -11.33
## FGR -7.48 -10.29
## FUCA2 -10.03 -9.97
## GCLC -10.05 -9.98
Now lets to differential expression analysis
## Coefficient: 1*FLT3andDNMT3A -1*FLT3andNPM1
## logFC logCPM F PValue FDR
## FTO -3.80 6.45 86.4 8.04e-08 0.000878
## ARAP2 3.75 6.08 65.5 4.90e-07 0.002674
## CPA3 -5.33 7.48 53.8 1.76e-06 0.005667
## SERPINB9 3.79 5.95 52.3 2.08e-06 0.005667
## TWISTNB -2.18 5.95 48.4 3.29e-06 0.006557
## SORD -2.33 4.72 47.7 3.60e-06 0.006557
## TNFRSF14 3.94 7.18 45.0 5.26e-06 0.007776
## RNF144B 4.39 5.89 44.4 5.70e-06 0.007776
## RP11-706O15.1 -2.44 3.98 42.3 7.40e-06 0.008979
## ITPR2 -2.05 7.75 40.0 1.02e-05 0.011141
## [,1]
## -1 129
## 0 10559
## 1 234
## Coefficient: 1*FLT3andDNMT3A -1*FLT3andNPM1
## logFC unshrunk.logFC logCPM PValue FDR
## FTO -3.80 -3.80 6.45 1.70e-07 0.00186
## ARAP2 3.75 3.75 6.08 8.27e-07 0.00451
## CPA3 -5.33 -5.33 7.48 2.36e-06 0.00857
## SERPINB9 3.79 3.79 5.95 3.32e-06 0.00906
## RNF144B 4.39 4.39 5.89 7.77e-06 0.01426
## TNFRSF14 3.94 3.94 7.18 7.84e-06 0.01426
## SORD -2.33 -2.33 4.72 1.77e-05 0.02683
## TWISTNB -2.18 -2.18 5.95 1.99e-05 0.02683
## HLA-DRA 5.01 5.01 9.85 2.21e-05 0.02683
## IL6ST 3.21 3.21 6.53 2.53e-05 0.02759
## [,1]
## -1 15
## 0 10867
## 1 40
Now lets do heatmap clustering
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess