#High Quality Graphics in R
There are (at least) two types of data visualization. The first enables a scientist to explore data and make discoveries about the complex processes at work. The other type of visualization provides informative, clear and visually attractive illustrations of her results that she can show to others and eventually include in a publication.
Both of these types of visualizations can be made with R. In fact, R offers multiple graphics systems. This is because R is extensible, and because progress in R graphics over the years has proceeded largely not by replacing the old functions, but by adding packages. Each of the different graphics systems has its advantages and limitations. In this chapter we will get to know two of them. First, we have a cursory look at the base R plotting functions1. Subsequently we will switch to ggplot2.
1 They live in the graphics package, which ships with every basic R installation.
Figure 3.1: The ZUSE Plotter Z64 (presented in 1961). Source: https://en.wikipedia.org/wiki/Plotter. Base R graphics came historically first: simple, procedural, conceptually motivated by drawing on a canvas. There are specialized functions for different types of plots. These are easy to call – but when you want to combine them to build up more complex plots, or exchange one for another, this quickly gets messy, or even impossible. The user plots (the word harks back to some of the first graphics devices – see Figure 3.1) directly onto a (conceptual) canvas. She explicitly needs to deal with decisions such as how much space to allocate to margins, axes labels, titles, legends, subpanels; once something is “plotted” it cannot be moved or erased.
There is a more high-level approach: in the grammar of graphics, graphics are built up from modular logical pieces, so that we can easily try different visualization types for our data in an intuitive and easily deciphered way, like we can switch in and out parts of a sentence in human language. There is no concept of a canvas or a plotter, rather, the user gives ggplot2 a high-level description of the plot she wants, in the form of an R object, and the rendering engine takes a wholistic view on the scene to lay out the graphics and render them on the output device.
The most basic function is plot. In the code below, the output of which is shown in Figure 3.2, it is used to plot data from an enzyme-linked immunosorbent assay (ELISA) assay. The assay was used to quantify the activity of the enzyme deoxyribonuclease (DNase), which degrades DNA. The data are assembled in the R object DNase, which conveniently comes with base R. The object DNase is a dataframe whose columns are Run, the assay run; conc, the protein concentration that was used; and density, the measured optical density.
head(DNase)
## Run conc density
## 1 1 0.04882812 0.017
## 2 1 0.04882812 0.018
## 3 1 0.19531250 0.121
## 4 1 0.19531250 0.124
## 5 1 0.39062500 0.206
## 6 1 0.39062500 0.215
plot(DNase$conc, DNase$density)
This basic plot can be customized, for example by changing the plot
symbol and axis labels using the parameters xlab, ylab and pch (plot
character), as shown in Figure 3.3. Information about the variables is
stored in the object DNase, and we can access it with the attr
function.
plot(DNase$conc, DNase$density,
ylab = attr(DNase, "labels")$y,
xlab = paste(attr(DNase, "labels")$x, attr(DNase, "units")$x),
pch = 3,
col = "blue")
Besides scatterplots, we can also use built-in functions to create
histograms and boxplots (Figure 3.4).
hist(DNase$density, breaks=25, main = "")
boxplot(density ~ Run, data = DNase)
Boxplots are convenient for showing multiple distributions next to each
other in a compact space. We will see more about plotting multiple
univariate distributions in Section 3.6.
The base R plotting functions are great for quick interactive exploration of data; but we run soon into their limitations if we want to create more sophisticated displays. We are going to use a visualization framework called the grammar of graphics, implemented in the package ggplot2, that enables step by step construction of high quality graphics in a logical and elegant manner. First let us introduce and load an example dataset.
##2 An example dataset
Figure 3.5: Single-section immunofluorescence image of the E3.5 mouse blastocyst stained for Serpinh1, a marker of primitive endoderm (blue), Gata6 (red) and Nanog (green). To properly testdrive the ggplot2 functionality, we are going to need a dataset that is big enough and has some complexity so that it can be sliced and viewed from many different angles. We’ll use a gene expression microarray dataset that reports the transcriptomes of around 100 individual cells from mouse embryos at different time points in early development. The mammalian embryo starts out as a single cell, the fertilized egg. Through synchronized waves of cell divisions, the egg multiplies into a clump of cells that at first show no discernible differences between them. At some point, though, cells choose different lineages. By further and further specification, the different cell types and tissues arise that are needed for a full organism. The aim of the experiment, explained by Ohnishi et al. (2014), was to investigate the gene expression changes associated with the first symmetry breaking event in the embryo. We’ll further explain the data as we go. More details can be found in the paper and in the documentation of the Bioconductor data package Hiiragi2013. We first load the data:
library("Hiiragi2013")
## Loading required package: affy
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: boot
## Loading required package: clue
## Loading required package: cluster
## Loading required package: genefilter
## Loading required package: geneplotter
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
## Loading required package: annotate
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: XML
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:IRanges':
##
## space
## The following object is masked from 'package:S4Vectors':
##
## space
## The following object is masked from 'package:stats':
##
## lowess
## Loading required package: gtools
##
## Attaching package: 'gtools'
## The following objects are masked from 'package:boot':
##
## inv.logit, logit
## Loading required package: KEGGREST
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:AnnotationDbi':
##
## select
## The following object is masked from 'package:genefilter':
##
## area
## Loading required package: mouse4302.db
## Loading required package: org.Mm.eg.db
##
##
## Loading required package: RColorBrewer
## Loading required package: xtable
data("x")
dim(Biobase::exprs(x))
## [1] 45101 101
You can print out a more detailed summary of the ExpressionSet object x by just typing x at the R prompt. The 101 columns of the data matrix (accessed above through the exprs function from the Biobase package) correspond to the samples (each of these is a single cell), the 45101 rows correspond to the genes probed by the array, an Affymetrix mouse4302 array. The data were normalized using the RMA method (Irizarry et al. 2003). The raw data are also available in the package (in the data object a) and at EMBL-EBI’s ArrayExpress database under the accession code E-MTAB-1681.
Let’s have a look at what information is available about the samples2
head(pData(x), n = 2)
## File.name Embryonic.day Total.number.of.cells lineage genotype
## 1 E3.25 1_C32_IN E3.25 32 WT
## 2 E3.25 2_C32_IN E3.25 32 WT
## ScanDate sampleGroup sampleColour
## 1 E3.25 2011-03-16 E3.25 #CAB2D6
## 2 E3.25 2011-03-16 E3.25 #CAB2D6
The information provided is a mix of information about the cells (i.e., age, size and genotype of the embryo from which they were obtained) and technical information (scan date, raw data file name). By convention, time in the development of the mouse embryo is measured in days, and reported as, for instance, E3.5. Moreover, in the paper the authors divided the cells into 8 biological groups (sampleGroup), based on age, genotype and lineage, and they defined a color scheme to represent these groups (sampleColour3). Using the following code (see below for explanations), we define a small dataframe groups that contains summary information for each group: the number of cells and the preferred color.
library("dplyr")
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:AnnotationDbi':
##
## select
## The following objects are masked from 'package:IRanges':
##
## collapse, desc, intersect, setdiff, slice, union
## The following objects are masked from 'package:S4Vectors':
##
## first, intersect, rename, setdiff, setequal, union
## The following object is masked from 'package:Biobase':
##
## combine
## The following objects are masked from 'package:BiocGenerics':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
groups = group_by(pData(x), sampleGroup) |>
summarise(n = n(), color = unique(sampleColour))
groups
## # A tibble: 8 × 3
## sampleGroup n color
## <chr> <int> <chr>
## 1 E3.25 36 #CAB2D6
## 2 E3.25 (FGF4-KO) 17 #FDBF6F
## 3 E3.5 (EPI) 11 #A6CEE3
## 4 E3.5 (FGF4-KO) 8 #FF7F00
## 5 E3.5 (PE) 11 #B2DF8A
## 6 E4.5 (EPI) 4 #1F78B4
## 7 E4.5 (FGF4-KO) 10 #E31A1C
## 8 E4.5 (PE) 4 #33A02C
The cells in the groups whose name contains FGF4-KO are from embryos in which the FGF4 gene, an important regulator of cell differentiation, was knocked out. Starting from E3.5, the wildtype cells (without the FGF4 knock-out) undergo the first symmetry breaking event and differentiate into different cell lineages, called pluripotent epiblast (EPI) and primitive endoderm (PE).
Since the code chunk above is the first instance that we encounter the pipe operator |> and the functions group_by and summarise from the dplyr package, let’s unpack the code. First, the pipe |>4. Generally, the pipe is useful for making nested function calls easier to read for humans. The following two lines of code are equivalent to R.
##3 ggplot2
ggplot2 is a package by Hadley Wickham (Wickham 2016) that implements the idea of grammar of graphics – a concept created by Leland Wilkinson in his eponymous book (Wilkinson 2005). We will explore some of its functionality in this chapter, and you will see many examples of how it can be used in the rest of this book. Comprehensive documentation for the package can be found on its website. The online documentation includes example use cases for each of the graphic types that are introduced in this chapter (and many more) and is an invaluable resource when creating figures.
Let’s start by loading the package and redoing the simple plot of Figure 3.2.
library("ggplot2")
ggplot(DNase, aes(x = conc, y = density)) + geom_point()
We just wrote our first “sentence” using the grammar of graphics. Let’s
deconstruct this sentence. First, we specified the dataframe that
contains the data, DNase. The aes (this stands for aesthetic) argument
states which variables we want to see mapped to the - and -axes,
respectively. Finally, we stated that we want the plot to use points (as
opposed to, say, lines or bars), by adding the result of calling the
function geom_point.
Now let’s turn to the mouse single cell data and plot the number of samples for each of the 8 groups using the ggplot function. The result is shown in Figure 3.7.
ggplot(groups, aes(x = sampleGroup, y = n)) +
geom_bar(stat = "identity")
With geom_bar we now told ggplot that we want each data item (each row
of groups) to be represented by a bar. Bars are one example of geometric
object (geom in the ggplot2 package’s parlance) that ggplot knows about.
We have already seen another such object in Figure 3.6: points,
indicated by the geom_point function. We will encounter many other geoms
later. We used the aes to indicate that we want the groups shown along
the -axis and the sizes along the -axis. Finally, we provided the
argument stat = “identity” (in other words, do nothing) to the geom_bar
function, since otherwise it would try to compute a histogram of the
data (the default value of stat is “count”). stat is short for
statistic, which is what we call any function of data. Besides the
identity and count statistic, there are others, such as smoothing,
averaging, binning, or other operations that reduce the data in some
way.
These concepts –data, geometrical objects, statistics– are some of the ingredients of the grammar of graphics, just as nouns, verbs and adverbs are ingredients of an English sentence.
The plot in Figure 3.7 is not bad, but there are several potential improvements. We can use color for the bars to help us quickly see which bar corresponds to which group. This is particularly useful if we use the same color scheme in several plots. To this end, let’s define a named vector groupColor that contains our desired colors for each possible value of sampleGroup.5
groupColor = setNames(groups$color, groups$sampleGroup)
Another thing that we need to fix in Figure 3.7 is the readability of the bar labels. Right now they are running into each other — a common problem when you have descriptive names.
ggplot(groups, aes(x = sampleGroup, y = n, fill = sampleGroup)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = groupColor, name = "Groups") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
This is now already a longer and more complex sentence. Let’s dissect
it. We added an argument, fill to the aes function that states that we
want the bars to be colored (filled) based on sampleGroup (which in this
case co-incidentally is also the value of the x argument, but that need
not always be so). Furthermore we added a call to the scale_fill_manual
function, which takes as its input a color map – i.e., the mapping from
the possible values of a variable to the associated colors – as a named
vector. We also gave this color map a title (note that in more complex
plots, there can be several different color maps involved). Had we
omitted the call to scale_fill_manual, ggplot2 would have used its
choice of default colors. We also added a call to theme stating that we
want the -axis labels rotated by 90 degrees and right-aligned (hjust;
the default would be to center).
##4.1 Data flow The function ggplot expects your data in a dataframe. If they are in a matrix, in separate vectors, or other types of objects, you will have to convert them. The packages dplyr and broom, among others, offer facilities to this end. We will discuss this more in Section 13.10, and you will see examples of such conversions throughout the book.
The result of a call to the ggplot is a ggplot object. Let’s recall a piece of code from above:
gg = ggplot(DNase, aes(x = conc, y = density)) + geom_point()
We have now assigned the output of ggplot to the object gg, instead of sending it directly to the console, where it was “printed” and produced Figure 3.6. The situation is completely analogous to what you’re used to from working with the R console: when you enter an expression such as 1+1 and hit “Enter”, the result is printed. When the expression is an assignment, such as s = 1+1, the side effect takes place (the name “s” is bound to an object in memory that represents the value of 1+1), but nothing is printed. Similarly, when an expression is evaluated as part of a script called with source, it is not printed. Thus, the above code also does not create any graphic output, since no print method is invoked. To print gg, type its name (in an interactive session) or call print on it:
gg
print(gg)
Refrensi :