Visualizing data in 1D

#Visualizing data in 1D

A common task in biological data analysis is the comparison between several samples of univariate measurements. In this section we’ll explore some possibilities for visualizing and comparing such samples. As an example, we’ll use the intensities of a set of four genes: Fgf4, Gata4, Gata6 and Sox26. On the microarray, they are represented by

selectedProbes = c( Fgf4 = "1420085_at", Gata4 = "1418863_at",
                   Gata6 = "1425463_at",  Sox2 = "1416967_at")

To extract data from this representation and convert them into a dataframe, we use the function melt from the reshape2 package7.

'library("reshape2")
genes = melt(Biobase::exprs(x)[selectedProbes, ],
             varnames = c("probe", "sample"))'

## [1] "library(\"reshape2\")\ngenes = melt(Biobase::exprs(x)[selectedProbes, ],\n             varnames = c(\"probe\", \"sample\"))"

For good measure, we also add a column that provides the gene symbol along with the probe identifiers.

'1genes$gene =
  names(selectedProbes)[match(genes$probe, selectedProbes)]
head(genes)'

## [1] "1genes$gene =\n  names(selectedProbes)[match(genes$probe, selectedProbes)]\nhead(genes)"

## 1. Barplots

A popular way to display data such as in our dataframe genes is through barplots

'ggplot(genes, aes(x = gene, y = value)) +
  stat_summary(fun = mean, geom = "bar")'

## [1] "ggplot(genes, aes(x = gene, y = value)) +\n  stat_summary(fun = mean, geom = \"bar\")"

Here, we see again the principle of layered graphics: we use two summary functions, mean and mean_cl_normal, and two associated geometric objects, bar and errorbar. The function mean_cl_normal is from the Hmisc package and computes the standard error (or confidence limits) of the mean; it’s a simple function, and we could also compute it ourselves using base R expressions if we wished to do so. We have also colored the bars to make the plot more visually pleasing.

2. Boxplots

Boxplots take up a similar amount of space as barplots, but are much more informative.

'p = ggplot(genes, aes( x = gene, y = value, fill = gene))
p + geom_boxplot()'

## [1] "p = ggplot(genes, aes( x = gene, y = value, fill = gene))\np + geom_boxplot()"

In Figure 3.17 we see that two of the genes (Gata4, Gata6) have relatively concentrated distributions, with only a few data points venturing out in the direction of higher values. For Fgf4, we see that the distribution is right-skewed: the median, indicated by the horizontal black bar within the box is closer to the lower (or left) side of the box. Conversely, for Sox2 the distribution is left-skewed.

3. Dot plots and beeswarm plots

If the number of data points is not too large, it is possible to show the data points directly, and it is good practice to do so, compared to the summaries we saw above. However, plotting the data directly will often lead to overlapping points, which can be visually unpleasant, or even obscure the data. We can try to lay out the points so that they are as near possible to their proper locations without overlap (Wilkinson 1999).

'p + geom_dotplot(binaxis = "y", binwidth = 1/6,
       stackdir = "center", stackratio = 0.75,
       aes(color = gene))
library("ggbeeswarm")
p + geom_beeswarm(aes(color = gene))'

## [1] "p + geom_dotplot(binaxis = \"y\", binwidth = 1/6,\n       stackdir = \"center\", stackratio = 0.75,\n       aes(color = gene))\nlibrary(\"ggbeeswarm\")\np + geom_beeswarm(aes(color = gene))"

Figure 3.18: (a) Dot plots, made using geom_dotplot from ggplot2. (b) Beeswarm plots, made using geom_beeswarm from ggbeeswarm. The plot in panel (a) of Figure 3.18. The -coordinates of the points are discretized into bins (above we chose a bin size of 1/6), and then they are stacked next to each other.

An alternative is provided by the package ggbeeswarm, which provides geom_beeswarm. The plot is shown in panel (b) of Figure 3.18. The layout algorithm aims to avoid overlaps between the points. If a point were to overlap an existing point, it is shifted along the -axis by a small amount sufficient to avoid overlap. Some tweaking of the layout parameters is usually needed for each new dataset to make a dot plot or a beeswarm plot look good.

4. Density plots

Yet another way to represent the same data is by density plots. Here, we try to estimate the underlying data-generating density by smoothing out the data points (Figure 3.19).

'ggplot(genes, aes( x = value, color = gene)) + geom_density()'

## [1] "ggplot(genes, aes( x = value, color = gene)) + geom_density()"

As you can see from Figures 3.17—3.19, boxplots work fairly well for unimodal data, but they can be misleading if the data distribution is multimodal, and they also do not always give a fair impression of long-tailed distributions. By showing the data points, or their densities, directly, Figures 3.18—3.19) give a better impression of such features. For instance, we can see that the data for Fgf4 and Sox2 are bimodal, while Gata4 and Gata6 have most of their values concentrated around a baseline, but a certain fraction of cells have elevated expression levels, across a wide range of values.

Density estimation has, however, a number of complications, in particular, the need for choosing a smoothing window. A window size that is small enough to capture peaks in the dense regions of the data may lead to instable (“wiggly”) estimates elsewhere. On the other hand, if the window is made bigger, pronounced features of the density, such as sharp peaks, may be smoothed out. Moreover, the density lines do not convey the information on how much data was used to estimate them, and plots like Figure 3.19 can be especially problematic if the sample sizes for the curves differ.

5. Violin plots

Density plots such as Figure 3.19 can look crowded if there are several densities, and one idea is to arrange the densities in a layout inspired by boxplots: the violin plot (Figure 3.20). Here, the density estimate is used to draw a symmetric shape, which sometimes visually reminds of a violin.

'p + geom_violin()'

## [1] "p + geom_violin()"

##6.Ridgeline plots Another play on density plots are ridgeline plots (Figure 3.21):

'library("ggridges")
ggplot(genes, aes(x = value, y = gene, fill = gene)) + 
  geom_density_ridges()'

## [1] "library(\"ggridges\")\nggplot(genes, aes(x = value, y = gene, fill = gene)) + \n  geom_density_ridges()"

This type of display is perhaps most appropriate if the number of densities shown goes into the dozens.

'top42 = order(rowMeans(Biobase::exprs(x)), decreasing = TRUE)[1:42]
g42 = melt(Biobase::exprs(x)[rev(top42), ], varnames = c("probe", "sample"))
ggplot(g42, aes(x = value, y = probe))'

## [1] "top42 = order(rowMeans(Biobase::exprs(x)), decreasing = TRUE)[1:42]\ng42 = melt(Biobase::exprs(x)[rev(top42), ], varnames = c(\"probe\", \"sample\"))\nggplot(g42, aes(x = value, y = probe))"

## 7. ECDF plots The mathematically most convenient way to describe the distribution of a one-dimensional random variable is its cumulative distribution function (CDF), i.e., the function defined by

F(x) = P(Xx),

where takes all values along the real axis. The density of is then the derivative of , if it exists9. The finite sample version of the probability 3.1 is called the empirical cumulative distribution function (ECDF),

F_{n}(x) = = _{i=1}^n 𝟙({xx_i}),

where denote a sample of draws from and 𝟙 is the indicator function, i.e., the function that takes the value 1 if the expression in its argument is true and 0 otherwise. If this sounds abstract, we can get a perhaps more intuitive understanding from the following example

'simdata = rnorm(70)
tibble(index = seq(along = simdata),
          sx = sort(simdata)) %>%
ggplot(aes(x = sx, y = index)) + geom_step()'

## [1] "simdata = rnorm(70)\ntibble(index = seq(along = simdata),\n          sx = sort(simdata)) %>%\nggplot(aes(x = sx, y = index)) + geom_step()"

Plotting the sorted values against their ranks gives the essential features of the ECDF (Figure 3.23). In practice, we do not need to do the sorting and the other steps in the above code manually and will rather use the stat_ecdf() geometric object. The ECDFs of our data are shown in Figure 3.24.

'ggplot(genes, aes( x = value, color = gene)) + stat_ecdf()'

## [1] "ggplot(genes, aes( x = value, color = gene)) + stat_ecdf()"

The ECDF has several nice properties:

It is lossless: the ECDF contains all the information contained in the original sample , except for the order of the values, which is assumed to be unimportant.

As grows, the ECDF converges to the true CDF . Even for limited sample sizes , the difference between the two functions tends to be small. Note that this is not the case for the empirical density! Without smoothing, the empirical density of a finite sample is a sum of Dirac delta functions, which is difficult to visualize and quite different from any underlying smooth, true density. With smoothing, the difference can be less pronounced, but is difficult to control, as we discussed above.

##8.The effect of transformations on densities It is tempting to look at histograms or density plots and inspect them for evidence of bimodality (or multimodality) as an indication of some underlying biological phenomenon. Before doing so, it is important to remember that the number of modes of a density depends on scale transformations of the data, via the chain rule. For instance, let’s look at the data from one of the arrays in the Hiiragi dataset (Figure 3.26):

'ggplot(dfx, aes(x = `64 E4.5 (EPI)`)) + geom_histogram(bins = 100)
ggplot(dfx, aes(x = 2 ^ `64 E4.5 (EPI)`)) + 
  geom_histogram(binwidth = 20) + xlim(0, 1500)'

## [1] "ggplot(dfx, aes(x = `64 E4.5 (EPI)`)) + geom_histogram(bins = 100)\nggplot(dfx, aes(x = 2 ^ `64 E4.5 (EPI)`)) + \n  geom_histogram(binwidth = 20) + xlim(0, 1500)"

Refrensi :

https://www.huber.embl.de/msmb/03-chap.html#sec-graphics-univar

Visualizing data in 1D

Syukur Halim, NIM :220605220014, Dosen Pembimbing Prof. Dr. Suhartono, M.Kom., Magister Informatika UIN Maulana Malik Ibrahim Malang, Indonesia

2023-11-08

2. Boxplots

3. Dot plots and beeswarm plots

4. Density plots

5. Violin plots