Data Visualization: Nice Boxplots, Violon and Density plots with ggpubr package!

Dhafer Malouche

http://dhafermalouche.net

Why this presentation?

Outline

Installing R packages and data

TCGA data

TCGA data

The following R code installs the core RTCGA package as well as the clinical and mRNA gene expression data packages.

  1. Load the bioconductor installer.
> source("https://bioconductor.org/biocLite.R")
> biocLite("RTCGA")
  1. Install the main RTCGA package
> library(RTCGA)
> installTCGA("RTCGA.mRNA")
> library(RTCGA.mRNA) # Run this on R not in RStudio
## Welcome to the RTCGA (version: 1.4.0).

Genomic Data

> library(RTCGA)
> # Each cell is a data.frame 
> # and it represents the number of obs.
> infoTCGA() 

Genomic Data

##                   Cohort  BCR Clinical   CN LowP Methylation mRNA mRNASeq
## ACC-counts           ACC   92       92   90    0          80    0      79
## BLCA-counts         BLCA  412      412  410  112         412    0     408
## BRCA-counts         BRCA 1098     1097 1089   19        1097  526    1093
## CESC-counts         CESC  307      307  295   50         307    0     304
## CHOL-counts         CHOL   51       45   36    0          36    0      36
## COAD-counts         COAD  460      458  451   69         457  153     457
## COADREAD-counts COADREAD  631      629  616  104         622  222     623
## DLBC-counts         DLBC   58       48   48    0          48    0      48
## ESCA-counts         ESCA  185      185  184   51         185    0     184
## FPPP-counts         FPPP   38       38    0    0           0    0       0
## GBM-counts           GBM  613      595  577    0         420  540     160
## GBMLGG-counts     GBMLGG 1129     1110 1090   52         936  567     676
## HNSC-counts         HNSC  528      528  522  108         528    0     520
## KICH-counts         KICH  113      113   66    0          66    0      66
## KIPAN-counts       KIPAN  973      941  883    0         892   88     889
## KIRC-counts         KIRC  537      537  528    0         535   72     533
## KIRP-counts         KIRP  323      291  289    0         291   16     290
## LAML-counts         LAML  200      200  197    0         194    0     179
## LGG-counts           LGG  516      515  513   52         516   27     516
## LIHC-counts         LIHC  377      377  370    0         377    0     371
## LUAD-counts         LUAD  585      522  516  120         578   32     515
## LUSC-counts         LUSC  504      504  501    0         503  154     501
## MESO-counts         MESO   87       87   87    0          87    0      87
## OV-counts             OV  602      591  586    0         594  574     304
## PAAD-counts         PAAD  185      185  184    0         184    0     178
## PCPG-counts         PCPG  179      179  175    0         179    0     179
## PRAD-counts         PRAD  499      499  492  115         498    0     497
## READ-counts         READ  171      171  165   35         165   69     166
## SARC-counts         SARC  261      261  257    0         261    0     259
## SKCM-counts         SKCM  470      470  469  118         470    0     469
## STAD-counts         STAD  443      443  442  107         443    0     415
## STES-counts         STES  628      628  626  158         628    0     599
## TGCT-counts         TGCT  150      134  150    0         150    0     150
## THCA-counts         THCA  503      503  499   98         503    0     501
## THYM-counts         THYM  124      124  123    0         124    0     120
## UCEC-counts         UCEC  560      548  540  106         547   54     545
## UCS-counts           UCS   57       57   56    0          57    0      57
## UVM-counts           UVM   80       80   80   51          80    0      80
##                 miR miRSeq RPPA MAF rawMAF
## ACC-counts        0     80   46  90      0
## BLCA-counts       0    409  344 130    395
## BRCA-counts       0   1078  887 977      0
## CESC-counts       0    307  173 194      0
## CHOL-counts       0     36   30  35      0
## COAD-counts       0    406  360 154    367
## COADREAD-counts   0    549  491 223    489
## DLBC-counts       0     47   33  48      0
## ESCA-counts       0    184  126 185      0
## FPPP-counts       0     23    0   0      0
## GBM-counts      565      0  238 290    290
## GBMLGG-counts   565    512  668 576    806
## HNSC-counts       0    523  212 279    510
## KICH-counts       0     66   63  66     66
## KIPAN-counts      0    873  756 644    799
## KIRC-counts       0    516  478 417    451
## KIRP-counts       0    291  215 161    282
## LAML-counts       0    188    0 197      0
## LGG-counts        0    512  430 286    516
## LIHC-counts       0    372   63 198    373
## LUAD-counts       0    513  365 230    542
## LUSC-counts       0    478  328 178      0
## MESO-counts       0     87   63   0      0
## OV-counts       570    453  426 316    469
## PAAD-counts       0    178  123 150    184
## PCPG-counts       0    179   80 179      0
## PRAD-counts       0    494  352 332    498
## READ-counts       0    143  131  69    122
## SARC-counts       0    259  223 247      0
## SKCM-counts       0    448  353 343    366
## STAD-counts       0    436  357 289    395
## STES-counts       0    620  483 474    395
## TGCT-counts       0    150  118 149      0
## THCA-counts       0    502  222 402    496
## THYM-counts       0    124   90 123      0
## UCEC-counts       0    538  440 248      0
## UCS-counts        0     56   48  57      0
## UVM-counts        0     80   12  80      0

Merging data sets

We will extract the mRNA expression for five genes of interest

Merging data sets

> expr <- expressionsTCGA(BRCA.mRNA, OV.mRNA, LUSC.mRNA,
+                         extract.cols = c("GATA3",
+                                          "PTEN", "XBP1",
+                                          "ESR1", "MUC1"))

Merging data sets

> dim(expr)
## [1] 1305    7
> colnames(expr)
## [1] "bcr_patient_barcode" "dataset"             "GATA3"              
## [4] "PTEN"                "XBP1"                "ESR1"               
## [7] "MUC1"

Merging data sets

> expr
## # A tibble: 1,305 x 7
##    bcr_patient_barcode          datas…  GATA3   PTEN   XBP1   ESR1    MUC1
##    <chr>                        <chr>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
##  1 TCGA-A1-A0SD-01A-11R-A115-07 BRCA.…  2.87   1.36   2.98   3.08   1.65  
##  2 TCGA-A1-A0SE-01A-11R-A084-07 BRCA.…  2.17   0.428  2.55   2.39   3.08  
##  3 TCGA-A1-A0SH-01A-11R-A084-07 BRCA.…  1.32   1.31   3.02   0.791  2.99  
##  4 TCGA-A1-A0SJ-01A-11R-A084-07 BRCA.…  1.84   0.810  3.13   2.50  -1.92  
##  5 TCGA-A1-A0SK-01A-12R-A084-07 BRCA.… -6.03   0.251 -1.45  -4.86  -1.17  
##  6 TCGA-A1-A0SM-01A-11R-A084-07 BRCA.…  1.80   1.31   4.04   2.80   3.53  
##  7 TCGA-A1-A0SO-01A-22R-A084-07 BRCA.… -4.88  -0.237 -0.725 -4.49  -1.46  
##  8 TCGA-A1-A0SP-01A-11R-A084-07 BRCA.… -3.14  -1.24  -1.19  -1.63  -0.987 
##  9 TCGA-A2-A04N-01A-11R-A115-07 BRCA.…  2.03   1.21   2.28   4.12   0.668 
## 10 TCGA-A2-A04P-01A-31R-A034-07 BRCA.… -0.293  0.288 -1.61   0.473  0.0115
## # ... with 1,295 more rows

Preparing the data

display this

Preparing the data

> nb_samples <- table(expr$dataset)
> nb_samples
## 
## BRCA.mRNA LUSC.mRNA   OV.mRNA 
##       590       154       561
> expr$dataset <- gsub(pattern = ".mRNA", 
+                      replacement = "",  expr$dataset)

Preparing the data

> expr$bcr_patient_barcode <- paste0(expr$dataset,
+                                    c(1:590, 1:561, 1:154))

Preparing the data

> expr
## # A tibble: 1,305 x 7
##    bcr_patient_barcode dataset  GATA3   PTEN   XBP1   ESR1    MUC1
##    <chr>               <chr>    <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
##  1 BRCA1               BRCA     2.87   1.36   2.98   3.08   1.65  
##  2 BRCA2               BRCA     2.17   0.428  2.55   2.39   3.08  
##  3 BRCA3               BRCA     1.32   1.31   3.02   0.791  2.99  
##  4 BRCA4               BRCA     1.84   0.810  3.13   2.50  -1.92  
##  5 BRCA5               BRCA    -6.03   0.251 -1.45  -4.86  -1.17  
##  6 BRCA6               BRCA     1.80   1.31   4.04   2.80   3.53  
##  7 BRCA7               BRCA    -4.88  -0.237 -0.725 -4.49  -1.46  
##  8 BRCA8               BRCA    -3.14  -1.24  -1.19  -1.63  -0.987 
##  9 BRCA9               BRCA     2.03   1.21   2.28   4.12   0.668 
## 10 BRCA10              BRCA    -0.293  0.288 -1.61   0.473  0.0115
## # ... with 1,295 more rows

Boxplots

display this

Installing ggpubr package

> if(!require(devtools)) install.packages("devtools")
> devtools::install_github("kassambara/ggpubr")

Load ggpubr:

> library(ggpubr)
## Loading required package: ggplot2
## Loading required package: magrittr

First boxplot

Create a box plot of a gene expression profile, colored by groups (here dataset/cancer type)

GATA3

> ggboxplot(expr, x = "dataset", y = "GATA3",
+           title = "GATA3", ylab = "Expression",
+           color = "dataset", palette = "jco")

GATA3

PTEN

> ggboxplot(expr, x = "dataset", y = "GATA3",
+           title = "PTEN", ylab = "Expression",
+           color = "dataset", palette = "jco")

PTEN

Create a list of boxplots

Create a list of boxplots

> p <- ggboxplot(expr, x = "dataset",
+                y = c("GATA3", "PTEN", "XBP1"),
+                title = c("GATA3", "PTEN", "XBP1"),
+                 ylab = "Expression",
+                color = "dataset", palette = "jco")

View GATA3

> p$GATA3

View GATA3

View PTEN

> p$PTEN

View PTEN

View XBP1

> p$XBP1

View XBP1

Adding p-values

> my_comparisons <- list(c("BRCA", "OV"), c("OV", "LUSC"))
> ggboxplot(expr, x = "dataset", y = "GATA3",
+           title = "GATA3", ylab = "Expression",
+           color = "dataset", palette = "jco")+
+   stat_compare_means(comparisons = my_comparisons)

Adding p-values

Other comparisons

> my_comparisons <- list(c("BRCA", "OV"), 
+                        c("OV", "LUSC"),
+                        c("BRCA","LUSC"))
> ggboxplot(expr, x = "dataset", y = "GATA3",
+           title = "GATA3", ylab = "Expression",
+           color = "dataset", palette = "jco")+
+   stat_compare_means(comparisons = my_comparisons)

Other comparisons

Comparing different groups

> comp_data=compare_means(c(GATA3, PTEN, XBP1) ~ dataset, data = expr)
> comp_data
## # A tibble: 9 x 8
##   .y.    group1 group2                p            p.adj p.fo… p.si… meth…
##   <fctr> <chr>  <chr>             <dbl>            <dbl> <chr> <chr> <chr>
## 1 GATA3  BRCA   OV            1.11e⁻¹⁷⁷        3.34e⁻¹⁷⁷ < 2e… ****  Wilc…
## 2 GATA3  BRCA   LUSC          6.68e⁻ ⁷³        1.34e⁻ ⁷² < 2e… ****  Wilc…
## 3 GATA3  OV     LUSC          2.97e⁻  ⁸        2.97e⁻  ⁸ 3.0e… ****  Wilc…
## 4 PTEN   BRCA   OV            6.79e⁻  ⁵        6.79e⁻  ⁵ 6.8e… ****  Wilc…
## 5 PTEN   BRCA   LUSC          1.04e⁻ ¹⁶        3.13e⁻ ¹⁶ < 2e… ****  Wilc…
## 6 PTEN   OV     LUSC          1.28e⁻  ⁷        2.56e⁻  ⁷ 1.3e… ****  Wilc…
## 7 XBP1   BRCA   OV            2.55e⁻¹²³        7.65e⁻¹²³ < 2e… ****  Wilc…
## 8 XBP1   BRCA   LUSC          1.95e⁻ ⁴²        3.90e⁻ ⁴² < 2e… ****  Wilc…
## 9 XBP1   OV     LUSC          4.24e⁻ ¹¹        4.24e⁻ ¹¹ 4.2e… ****  Wilc…

More options

Select items (here cancer types) to display.

> ggboxplot(expr, x = "dataset", y = "GATA3",
+           title = "GATA3", ylab = "Expression",
+           color = "dataset", palette = "jco",
+           select = c("BRCA", "OV"))

More options

More options

Remove items (here cancer types) to display.

> ggboxplot(expr, x = "dataset", y = "GATA3",
+           title = "GATA3", ylab = "Expression",
+           color = "dataset", palette = "jco",
+           remove = "BRCA")

More options

More options

Change the order of the dataset on x axis

> ggboxplot(expr, x = "dataset", y = "GATA3",
+           title = "GATA3", ylab = "Expression",
+           color = "dataset", palette = "jco",
+           order = c("LUSC", "OV", "BRCA"))

More options

More options

Create horizontal plots

> ggboxplot(expr, x = "dataset", y = "GATA3",
+           title = "GATA3", ylab = "Expression",
+           color = "dataset", palette = "jco",
+           rotate = TRUE)

More options

Multi-plots in the same figure

Combine the three gene expression plots into a multi-panel plot, use the argument combine = TRUE:

> ggboxplot(expr, x = "dataset",
+           y = c("GATA3", "PTEN", "XBP1"),
+           combine = TRUE,
+           ylab = "Expression",
+           color = "dataset", palette = "jco")

Multi-plots in the same figure

Multi-plots in the same figure

Merge the 3 plots using the argument merge = TRUE:

> ggboxplot(expr, x = "dataset",
+           y = c("GATA3", "PTEN", "XBP1"),
+           merge = TRUE,
+           ylab = "Expression", 
+           palette = "jco")

Multi-plots in the same figure

Multi-plots in the same figure

Merge the 3 plots using the argument merge = "asis":

> ggboxplot(expr, x = "dataset",
+           y = c("GATA3", "PTEN", "XBP1"),
+           merge = "asis",
+           ylab = "Expression", 
+           palette = "jco")

Multi-plots in the same figure

Multi-plots in the same figure

The y variables (i.e.: genes) become x tick labels and the x variable (i.e.: dataset) becomes the grouping variable. To do this, use the argument merge = “flip”:

Multi-plots in the same figure

> ggboxplot(expr, x = "dataset",
+           y = c("GATA3", "PTEN", "XBP1"),
+           merge = "flip",
+           ylab = "Expression", 
+           palette = "jco")

Multi-plots in the same figure

Adding decorations

Add jittered points on the boxplot.

> ggboxplot(expr, x = "dataset",
+           y = c("GATA3", "PTEN", "XBP1"),
+           combine = TRUE,
+           color = "dataset", palette = "jco",
+           ylab = "Expression", 
+           add = "jitter",                              # Add jittered points
+           add.params = list(size = 0.1, jitter = 0.2)  # Point size and the amount of jittering
+ )

Adding decorations

Adding decorations

Add and adjust a dotplot.

> ggboxplot(expr, x = "dataset",
+             y = c("GATA3", "PTEN", "XBP1"),
+             combine = TRUE,
+             color = "dataset", palette = "jco",
+             ylab = "Expression", 
+             add = "dotplot",                              # Add dotplot
+             add.params = list(binwidth = 0.1, dotsize = 0.3)
+   )

Adding decorations

Adding labels

Adding labels

Example,

Adding labels

> ggboxplot(expr, x = "dataset",
+             y = c("GATA3", "PTEN", "XBP1"),
+             combine = TRUE,
+             color = "dataset", palette = "jco",
+             ylab = "Expression", 
+             add = "jitter",                               # Add jittered points
+             add.params = list(size = 0.1, jitter = 0.2),  # Point size and the amount of jittering
+             label = "bcr_patient_barcode",                # column containing point labels
+             label.select = list(top.up = 2, top.down = 2),# Select some labels to display
+             font.label = list(size = 9, face = "italic"), # label font
+             repel = TRUE                                  # Avoid label text overplotting
+   )

Adding labels

Adding labels

> label.select.criteria <- list(criteria = "`y` > 3.9 & `x` %in% c('BRCA', 'OV')")
>   ggboxplot(expr, x = "dataset",
+             y = c("GATA3", "PTEN", "XBP1"),
+             combine = TRUE,
+             color = "dataset", palette = "jco",
+             ylab = "Expression", 
+             label = "bcr_patient_barcode",              # column containing point labels
+             label.select = label.select.criteria,       # Select some labels to display
+             font.label = list(size = 9, face = "italic"), # label font
+             repel = TRUE                                # Avoid label text overplotting
+   )

Adding labels

Violin plots

display this

Violin + Boxplot

>  ggviolin(expr, x = "dataset",
+            y = c("GATA3", "PTEN", "XBP1"),
+            combine = TRUE, 
+            color = "dataset", palette = "jco",
+            ylab = "Expression", 
+            add = "boxplot")

Violin + Boxplot

Computing descriptive statistics

> library(reshape) 
> dt=as.data.frame(expr[,-1])
> expr2=melt(dt)
## Using dataset as id variables
> head(expr2,2)
##   dataset variable   value
## 1    BRCA    GATA3 2.87050
## 2    BRCA    GATA3 2.16625

Computing descriptive statistics

> res<-desc_statby(expr2,measure.var = "value",grps=c("dataset","variable"))
> res
##    dataset variable length       min      max     median       mean
## 1     BRCA    GATA3    590 -6.025250 3.930000  1.6763750  1.1035732
## 2     BRCA     PTEN    590 -2.398357 2.802786  0.8123929  0.7486928
## 3     BRCA     XBP1    590 -2.438250 4.736750  2.1194167  1.7498139
## 4     BRCA     ESR1    590 -4.860667 6.110833  3.0645833  2.2982359
## 5     BRCA     MUC1    590 -2.793875 4.280500  2.2476875  1.8527130
## 6     LUSC    GATA3    154 -5.059500 1.196625 -3.9270625 -3.7584818
## 7     LUSC     PTEN    154 -1.935429 1.472357  0.3048571  0.2516035
## 8     LUSC     XBP1    154 -2.291083 2.194083 -0.4241250 -0.2916250
## 9     LUSC     ESR1    154 -5.195417 0.421750 -3.0606250 -2.9925693
## 10    LUSC     MUC1    154 -1.814250 3.240375  0.6474375  0.5622597
## 11      OV    GATA3    561 -5.963250 1.395875 -4.2500000 -4.2256808
## 12      OV     PTEN    561 -2.750357 2.727500  0.6530714  0.5723408
## 13      OV     XBP1    561 -3.187750 1.910583 -0.8346667 -0.8727874
## 14      OV     ESR1    561 -5.033167 4.334583  1.6151667  1.4866153
## 15      OV     MUC1    561 -3.323375 4.561875  2.4697500  2.2671916
##          iqr       mad        sd         se         ci     range
## 1  2.0183125 1.1917324 1.8002158 0.07411371 0.14555931  9.955250
## 2  0.7508929 0.5542806 0.6917991 0.02848092 0.05593652  5.201143
## 3  1.7493750 1.1950992 1.5070681 0.06204501 0.12185639  7.175000
## 4  3.1700833 1.9828540 2.6218080 0.10793813 0.21199046 10.971500
## 5  2.0145625 1.4114352 1.4999340 0.06175131 0.12127955  7.074375
## 6  1.0329063 0.7490836 0.9132301 0.07359018 0.14538404  6.256125
## 7  0.8792321 0.6602865 0.6550587 0.05278614 0.10428378  3.407786
## 8  0.9772500 0.7592765 0.8248771 0.06647049 0.13131846  4.485167
## 9  1.0367708 0.7768206 1.0459820 0.08428763 0.16651783  5.617167
## 10 1.8160313 1.2928272 1.1499654 0.09266685 0.18307173  5.054625
## 11 0.8773750 0.6484522 0.7335973 0.03097250 0.06083647  7.359125
## 12 0.9690714 0.7268976 0.7741196 0.03268335 0.06419694  5.477857
## 13 1.3240000 0.9892648 0.9220309 0.03892817 0.07646308  5.098333
## 14 1.2080000 0.8978378 1.2030989 0.05079487 0.09977176  9.367750
## 15 1.1295000 0.8280321 1.0288869 0.04343964 0.08532454  7.885250
##            cv       var
## 1   1.6312609 3.2407769
## 2   0.9240093 0.4785860
## 3   0.8612734 2.2712544
## 4   1.1407915 6.8738772
## 5   0.8095879 2.2498020
## 6  -0.2429784 0.8339891
## 7   2.6035355 0.4291020
## 8  -2.8285542 0.6804223
## 9  -0.3495264 1.0940784
## 10  2.0452565 1.3224203
## 11 -0.1736045 0.5381650
## 12  1.3525501 0.5992611
## 13 -1.0564209 0.8501409
## 14  0.8092873 1.4474469
## 15  0.4538156 1.0586083

Statistics on the violin plot

> ggviolin(expr, x = "dataset",
+            y = c("GATA3", "PTEN", "XBP1"),
+            combine = TRUE, 
+            color = "dataset", palette = "jco",
+            ylab = "Expression", 
+            add = "mean_se")

Statistics on the violin plot

Adding quantiles

> ggviolin(expr, x = "dataset",
+            y = c("GATA3", "PTEN", "XBP1"),
+            combine = TRUE, draw_quantiles = c(.025,.5,.75,.975),
+            color = "dataset", palette = "jco",
+            ylab = "Expression", 
+            add = "boxplot")

Adding quantiles

Density plots

To visualize the distribution as a density plot,

> ggdensity(expr,
+        x = c("GATA3", "PTEN",  "XBP1"),
+        y = "..density..",
+        combine = TRUE,                  # Combine the 3 plots
+        xlab = "Expression", 
+        add = "median",                  # Add median line. 
+        rug = TRUE                       # Add marginal rug
+ )

Density plots

Density plots

Change color and fill by dataset

> ggdensity(expr,
+        x = c("GATA3", "PTEN",  "XBP1"),
+        y = "..density..",
+        combine = TRUE,                  # Combine the 3 plots
+        xlab = "Expression", 
+        add = "median",                  # Add median line. 
+        rug = TRUE,                      # Add marginal rug
+        color = "dataset", 
+        fill = "dataset",
+        palette = "jco"
+ )

Density plots

Density plots

Merging all the plots in one figure

> ggdensity(expr,
+        x = c("GATA3", "PTEN",  "XBP1"),
+        y = "..count..",
+        color = ".x.", fill = ".x.",     # color and fill by x variables
+        merge = TRUE,                    # Merge the 3 plots
+        xlab = "Expression", 
+        add = "median",                  # Add median line. 
+        rug = TRUE ,                     # Add marginal rug
+        palette = "jco"                  # Change color palette
+ )

Density plots

Merging all the plots in one figure

Density plots

Facet by dataset

> ggdensity(expr,
+        x = c("GATA3", "PTEN",  "XBP1"),
+        y = "..count..",
+        color = ".x.", fill = ".x.", 
+        facet.by = "dataset",            # Split by "dataset" into multi-panel
+        merge = TRUE,                    # Merge the 3 plots
+        xlab = "Expression", 
+        add = "median",                  # Add median line. 
+        rug = TRUE ,                     # Add marginal rug
+        palette = "jco"                  # Change color palette
+ )

Density plots