Exploring data with ggplot2

Classic plots vs ggplot

plot, barplot, hist

  • One function per plot style
  • Limited ability to modify the plot (e.g. colours) depending on plot type
  • Inconsistent data requirements (e.g. matrix vs data frame, tall vs wide)

ggplot2

  • One package, many plot styles
  • Many aspects of plot appearance can be made to depend on the data
  • Data should generally be tall rather than wide. Use melt and cast (reshape2) or gather and spread (tidyr) to transform.

Grammar of Graphics

ggplot is an implementation of the grammar of graphics

  • Data
  • Geometric objects (points, lines, areas)
  • Aesthetic mappings between data and objects
  • Statistical transformations (binning, aggregating)
  • Coordinate system
  • Scales (map data to aesthetic attributes)
  • Annotations (title, legend etc)
  • Faceting (representing data across multiple plots)

Layers in ggplot

Each layer in ggplot has at least data, geom (geometric object), stat (statistical transform) and aes (aesthetic)

The ggplot function sets up a plot's baseline defaults

p<-ggplot(data,aes(x=someval,y=someval))

Add layers with the + operator

p+geom_point()

This layer inherits data and aes from the base plot and has a default stat
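
As a minimal sketch of that inheritance, using the built-in mtcars data (an assumption made for illustration, not a dataset from these slides): the base plot holds data and aes, and each layer either uses them as-is or adds its own mappings on top.

```r
library(ggplot2)

# Base plot: sets the default data and aesthetic mappings, draws no layer yet
p <- ggplot(mtcars, aes(x = wt, y = mpg))

# This layer inherits data, x and y from the base plot
p1 <- p + geom_point()

# This layer inherits x and y and adds a colour mapping of its own
p2 <- p + geom_point(aes(colour = factor(cyl)))

# layer_data() returns the data a layer will actually draw
inherited <- layer_data(p1, 1)
coloured  <- layer_data(p2, 1)
nrow(inherited)
length(unique(coloured$colour))
```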

A simple scatterplot

library(ggplot2)
data(diamonds)
p<-ggplot(diamonds)
p+geom_point(aes(x=carat,y=price))

Add another aesthetic mapping

p<-ggplot(diamonds)
p+geom_point(aes(x=carat,y=price,color=clarity))

Available aesthetics

Aesthetics for geom_point include:

  • x
  • y
  • alpha
  • colour
  • fill
  • shape
  • size

Find available aesthetics at http://docs.ggplot2.org/current/.
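
The list can also be queried from R. This sketch assumes the exported ggproto object GeomPoint and its aesthetics() method behave the same way in your installed ggplot2 version; the documentation remains the authoritative source.

```r
library(ggplot2)

# Ask the geom object which aesthetics it understands
point_aes <- GeomPoint$aesthetics()
print(sort(point_aes))
```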

Adding another layer

p<-ggplot(diamonds,aes(x=carat,y=price))
p<-p+geom_point(aes(color=clarity))
p+geom_smooth()

geom and stat objects are often complementary

For example geom_smooth() has stat_smooth() as its default stat

A stat adds new columns to the data that are used by the geom. For example stat_smooth() creates:

  • y: predicted value
  • ymin: lower pointwise confidence interval around the mean
  • ymax: upper pointwise confidence interval around the mean
  • se: standard error

These can all be accessed for aesthetic mapping if needed.
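
One way to see these columns is layer_data(), which returns the data a layer draws after its stat has run. The sketch below uses mtcars and a linear fit; both are assumptions chosen to keep it small and quiet.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Layer 2 is the smooth; its data holds the columns computed by stat_smooth()
smooth_data <- layer_data(p, 2)
head(smooth_data[, c("x", "y", "ymin", "ymax", "se")])
```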

Another stat and geom example

p <- ggplot(data=diamonds,aes(x=price))
p <- p + stat_bin(aes(y = ..count..))

Where does ..count.. come from?

help(stat_bin)

The stat_bin function adds four new computed variables to the data:

  • count
  • density
  • ncount
  • ndensity

We mapped ..count.. to y in order to create a histogram
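
The other computed variables can be mapped the same way. This sketch maps density instead of count and uses after_stat(), the newer spelling of the ..name.. notation (available in ggplot2 3.3.0 and later, as far as I know). The choice of 30 bins is arbitrary.

```r
library(ggplot2)

# Histogram scaled to density rather than raw counts
p <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30)

# Inspect the variables stat_bin computed for this layer
bin_data <- layer_data(p, 1)
head(bin_data[, c("x", "count", "density", "ncount", "ndensity")])
```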

Coordinate Systems

Coordinate systems map object positions onto the plot itself. Certain aesthetics are tied to the coordinate system: in Cartesian coordinates, x and y describe position on the page; in polar coordinates the equivalents are theta (angle) and radius. Examples of coordinate systems include:

  • Cartesian (standard x,y plot)
  • Polar (circular plots)
  • Maps (project positions on the globe to a flat plane)

Barplot in Cartesian Coords

p <- ggplot( data=diamonds)
p <- p + geom_bar(aes(fill=clarity, x=1))

Barplot in Polar Coords

Mapping radius to bar height

p <- ggplot( data=diamonds) + geom_bar(aes(fill=clarity, x=1))
p <- p  + coord_polar(theta="x")

Barplot in Polar Coords

Mapping angle to bar height

p <- ggplot( data=diamonds) + geom_bar(aes(fill=clarity, x=1))
p <- p  + coord_polar(theta="y")

Faceting

p<-ggplot(data=diamonds,aes(x=carat,y=price))
p<-p+geom_point(aes(color=cut))
p+facet_grid(color~cut)
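
In facet_grid(color~cut) the variable left of the ~ defines the rows and the one on the right defines the columns, giving one panel per combination. A small sketch that counts the resulting panels with layer_data() (it only counts panels that contain data):

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  facet_grid(color ~ cut)   # rows: color, columns: cut

# Each drawn row of data is tagged with the panel it belongs to
panels <- unique(layer_data(p, 1)$PANEL)
length(panels)
```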

Reshaping data

Use the reshape2 (or tidyr) package to easily transform data for ggplot

d <- read.csv("mqdata.csv")
head(d)
##   Protein      A1_1      A1_2      A2_1      A2_2      A3_1     A3_2
## 1       1    331950    361220    422930    630540    372500   365430
## 2       2   4563000   4448000   2539500   3830500   3864900  3596900
## 3       3    814690    898580   1085600   1056700    788260   728690
## 4       4 102070000 115290000 114100000 125680000 108000000 94464000
## 5       5    404190    378130    821670    658280    366020   366990
## 6       6        NA        NA        NA        NA        NA       NA
##        B1_1      B1_2       B2_1      B2_2      B3_1      B3_2 nmiss
## 1    354870    372210 3.6507e+05    423640    330950    275670     0
## 2   4010500   3505300 5.3546e+06   4319900   3299900   2096800     0
## 3    724620    579320 9.7646e+05    931880    596570    690040     0
## 4 211910000 183910000 1.7914e+08 179580000 259720000 204390000     0
## 5    246350    237430 3.5972e+05    205960    318120    322570     0
## 6        NA        NA 7.7677e+03        NA        NA        NA    11

Melt data from wide to tall

library(tidyr)
# reshape2 equivalent: md <- melt(d, id.vars=c("Protein","nmiss"), variable.name="Sample", value.name="Intensity")
md <- d %>% gather(Sample,Intensity,-Protein,-nmiss)
head(md)
##   Protein nmiss Sample Intensity
## 1       1     0   A1_1    331950
## 2       2     0   A1_1   4563000
## 3       3     0   A1_1    814690
## 4       4     0   A1_1 102070000
## 5       5     0   A1_1    404190
## 6       6    11   A1_1        NA

Most of the time this is what you want for ggplot
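
Newer tidyr versions offer pivot_longer() for the same wide-to-tall step. A sketch on a small made-up data frame (the values are hypothetical and only mimic the shape of the data above); note that the row order differs from gather().

```r
library(tidyr)

# Hypothetical wide data: one row per protein, one column per sample
wide <- data.frame(
  Protein = 1:3,
  A1_1    = c(10, 20, NA),
  A1_2    = c(11, 21, 31),
  nmiss   = c(0, 0, 1)
)

# Every column except Protein and nmiss becomes a (Sample, Intensity) pair
tall <- pivot_longer(
  wide,
  cols      = -c(Protein, nmiss),
  names_to  = "Sample",
  values_to = "Intensity"
)
tall
```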

Cast from tall to wide

md %>% spread(key = Sample, value = Intensity) %>% head
##   Protein nmiss      A1_1      A1_2      A2_1      A2_2      A3_1     A3_2
## 1       1     0    331950    361220    422930    630540    372500   365430
## 2       2     0   4563000   4448000   2539500   3830500   3864900  3596900
## 3       3     0    814690    898580   1085600   1056700    788260   728690
## 4       4     0 102070000 115290000 114100000 125680000 108000000 94464000
## 5       5     0    404190    378130    821670    658280    366020   366990
## 6       6    11        NA        NA        NA        NA        NA       NA
##        B1_1      B1_2       B2_1      B2_2      B3_1      B3_2
## 1    354870    372210 3.6507e+05    423640    330950    275670
## 2   4010500   3505300 5.3546e+06   4319900   3299900   2096800
## 3    724620    579320 9.7646e+05    931880    596570    690040
## 4 211910000 183910000 1.7914e+08 179580000 259720000 204390000
## 5    246350    237430 3.5972e+05    205960    318120    322570
## 6        NA        NA 7.7677e+03        NA        NA        NA

Often this is how data arrives
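
The reverse step has a newer spelling too, pivot_wider(). Again a sketch on hypothetical values rather than the data above:

```r
library(tidyr)

# Hypothetical tall data: one row per protein and sample
tall <- data.frame(
  Protein   = rep(1:2, times = 2),
  Sample    = rep(c("A1_1", "A1_2"), each = 2),
  Intensity = c(10, 20, 11, 21)
)

# Each distinct Sample value becomes its own column
wide <- pivot_wider(tall, names_from = Sample, values_from = Intensity)
wide
```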

Plots that don't fit the ggplot paradigm

Heatmaps!

Made with heatmap.2 (gplots package), not ggplot

Why heatmaps break ggplot

  • Composed of multiple sub-plots
  • Image plot is made (on tall data)
  • Clustering and dendrograms (on wide data)
  • Want to map aesthetics to both row and column metadata
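
The tall versus wide split can be sketched as follows. This is not how heatmap.2 works internally; it is only an illustration on a random matrix (made-up data) of clustering on the wide form and drawing the image part from the tall form with geom_tile. Dendrograms and row or column annotations are left out.

```r
library(ggplot2)

set.seed(1)
# Hypothetical wide matrix: proteins in rows, samples in columns
m <- matrix(rnorm(20), nrow = 5,
            dimnames = list(paste0("P", 1:5), paste0("S", 1:4)))

# Clustering works on the wide matrix
hc <- hclust(dist(m))

# The image plot needs tall data: one row per (Protein, Sample) cell
tall <- as.data.frame(as.table(m), responseName = "Intensity")
names(tall)[1:2] <- c("Protein", "Sample")

# Order the rows of the image by the clustering result
tall$Protein <- factor(tall$Protein, levels = rownames(m)[hc$order])

p <- ggplot(tall, aes(x = Sample, y = Protein, fill = Intensity)) +
  geom_tile()
```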