Exploring data with ggplot2

Classic plots vs ggplot

plot, barplot, hist

  • One function per plot style
  • Limited ability to modify the plot (eg colours) depending on plot type
  • Inconsistent data requirements (eg matrix vs data frame, tall vs wide)

ggplot2

  • One package, many plot styles
  • Many aspects of plot appearance can be made to depend on the data
  • Data should generally be tall rather than wide. Use melt and dcast from reshape2 to transform.
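
To make the contrast concrete, here is a minimal sketch that draws a histogram both ways. The built-in mtcars data set, the bin count and the choice of cyl for the fill are illustrative assumptions, not part of the original slides.

```r
library(ggplot2)

# Base graphics: one function per plot style, each with its own arguments
hist(mtcars$mpg, main = "Base histogram", xlab = "mpg")

# ggplot2: the same data frame can feed any geom; here cyl also drives the fill
p_hist <- ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
  geom_histogram(bins = 10)
print(p_hist)
```

Swapping geom_histogram() for another geom changes the plot style without changing how the data are supplied.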

Grammar of Graphics

ggplot is an implementation of the grammar of graphics

  • Data
  • Geometric objects (points, lines, areas)
  • Aesthetic mappings between data and objects
  • Statistical transformations (binning, aggregating)
  • Position adjustment (stack, dodge, jitter)
  • Coordinate system and scale
  • Annotations (title, legend etc)
  • Faceting (representing data across multiple plots)
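
As a rough illustration, each grammar component above corresponds to one piece of a ggplot call. The sketch below assumes the mpg data set that ships with ggplot2; the variables, palette and title are arbitrary choices.

```r
library(ggplot2)

p_grammar <- ggplot(mpg, aes(x = class, fill = drv)) +  # data + aesthetic mappings
  geom_bar(stat = "count", position = "dodge") +       # geom, stat, position adjustment
  scale_fill_brewer(palette = "Set2") +                # scale
  coord_flip() +                                       # coordinate system
  labs(title = "Cars by class and drive type") +       # annotation
  facet_wrap(~year)                                    # faceting
print(p_grammar)
```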

Layers in ggplot

Each layer in ggplot has at least data, geom (geometric object), stat (statistical transform) and aes (aesthetic)

The ggplot function sets up a plot's baseline defaults

p<-ggplot(data,aes(x=xvar,y=yvar))

Add layers with the + operator

p+geom_point()

This layer inherits data and aes from the base plot and uses the geom's default stat
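
One way to see what a geom shortcut fills in is to write the same layer out with ggplot2's layer() function. This is a sketch on the built-in mtcars data; the variable choices and the cyl == 4 subset are illustrative assumptions.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg))

# Shortcut: geom_point() supplies stat = "identity" and position = "identity"
p_short <- p + geom_point()

# Spelled out: the same layer with each component named explicitly
p_long <- p + layer(geom = "point", stat = "identity", position = "identity")

# A layer can also override the inherited defaults with its own data and aes
p_override <- p + geom_point(data = subset(mtcars, cyl == 4), aes(colour = hp))
```

p_short and p_long should draw the same points; p_override draws only the four-cylinder cars, coloured by hp.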

A simple scatterplot

library(ggplot2)
data(diamonds)
p<-ggplot(diamonds)
p+geom_point(aes(x=carat,y=price))

Add another aesthetic mapping

p<-ggplot(diamonds)
p+geom_point(aes(x=carat,y=price,color=clarity))

Available aesthetics

Aesthetics for geom_point are:

  • x
  • y
  • alpha
  • colour
  • fill
  • shape
  • size

Find available aesthetics at http://docs.ggplot2.org/current/.
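
An aesthetic can be mapped to a variable inside aes() or set to a constant outside it. The sketch below shows both on a random subsample of diamonds; the sample size, seed and constant values are arbitrary assumptions.

```r
library(ggplot2)

# Subsample so the plot draws quickly (the sample size is arbitrary)
set.seed(1)
small <- diamonds[sample(nrow(diamonds), 1000), ]

p_aes <- ggplot(small, aes(x = carat, y = price)) +
  geom_point(aes(colour = cut),  # mapped: colour varies with the data
             alpha = 0.4,        # set: the same value for every point
             size = 2)           # set: the same value for every point
print(p_aes)
```

Mapped aesthetics get a legend; constants set outside aes() do not.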

Adding another layer

p<-ggplot(diamonds,aes(x=carat,y=price))
p<-p+geom_point(aes(color=clarity))
p+geom_smooth()

geom and stat objects are often complementary

For example geom_smooth() has stat_smooth() as its default stat

A stat adds new columns to the data that are used by the geom. For example stat_smooth() creates:

  • y predicted value
  • ymin lower pointwise confidence interval around the mean
  • ymax upper pointwise confidence interval around the mean
  • se standard error

These can all be accessed for aesthetic mapping if needed.
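
The computed columns can be inspected with ggplot2's layer_data() function. This sketch uses mtcars and a linear fit (method = "lm") to keep it small; both are assumptions made for illustration rather than the settings used on the slides.

```r
library(ggplot2)

p_smooth <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Layer 2 is the smooth; its data holds the columns computed by stat_smooth()
smooth_data <- layer_data(p_smooth, 2)
head(smooth_data[, c("x", "y", "ymin", "ymax", "se")])
```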

Another stat example

p<-ggplot(data=diamonds,aes(x=carat,y=price))
p<-p+stat_density2d(aes(color=..level..))
p+scale_color_continuous("Density")

Where does ..level.. come from?

help(stat_contour)

Value

A data frame with additional column:

level  height of contour
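
More recent ggplot2 releases (3.3.0 and later) also provide after_stat() as a way to refer to a computed column such as level, and the ..level.. spelling is now deprecated. A minimal sketch, assuming such a version is installed and using the built-in faithful data set in place of diamonds:

```r
library(ggplot2)

p_density <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
  stat_density_2d(aes(colour = after_stat(level))) +
  scale_colour_continuous("Density")
print(p_density)

# The computed level column is visible in the layer's data
contour_data <- layer_data(p_density)
head(contour_data$level)
```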

Faceting

p<-ggplot(data=diamonds,aes(x=carat,y=price))
p<-p+geom_point(aes(color=cut))
p+facet_grid(color~cut)
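
The faceting formula decides how panels are laid out. The sketch below compares a few common forms on the mpg data set bundled with ggplot2; the variables are chosen only for illustration.

```r
library(ggplot2)

p_base <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

p_grid <- p_base + facet_grid(drv ~ cyl)  # rows by drv, columns by cyl
p_cols <- p_base + facet_grid(. ~ cyl)    # a single row, one column per cyl
p_wrap <- p_base + facet_wrap(~class)     # one variable wrapped into a grid
print(p_grid)
```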

Reshaping data

Use the reshape2 package to easily transform data for ggplot

d<-read.csv("mqdata.csv")
head(d)
##   Protein      A1_1      A1_2      A2_1      A2_2      A3_1     A3_2
## 1       1    331950    361220    422930    630540    372500   365430
## 2       2   4563000   4448000   2539500   3830500   3864900  3596900
## 3       3    814690    898580   1085600   1056700    788260   728690
## 4       4 102070000 115290000 114100000 125680000 108000000 94464000
## 5       5    404190    378130    821670    658280    366020   366990
## 6       6        NA        NA        NA        NA        NA       NA
##        B1_1      B1_2       B2_1      B2_2      B3_1      B3_2 nmiss
## 1    354870    372210 3.6507e+05    423640    330950    275670     0
## 2   4010500   3505300 5.3546e+06   4319900   3299900   2096800     0
## 3    724620    579320 9.7646e+05    931880    596570    690040     0
## 4 211910000 183910000 1.7914e+08 179580000 259720000 204390000     0
## 5    246350    237430 3.5972e+05    205960    318120    322570     0
## 6        NA        NA 7.7677e+03        NA        NA        NA    11

Melt data from wide to tall

library(reshape2)
md<-melt(d,id.vars=c("Protein","nmiss"),variable.name="Sample",value.name="Intensity")
head(md)
##   Protein nmiss Sample Intensity
## 1       1     0   A1_1    331950
## 2       2     0   A1_1   4563000
## 3       3     0   A1_1    814690
## 4       4     0   A1_1 102070000
## 5       5     0   A1_1    404190
## 6       6    11   A1_1        NA

Cast from tall to wide

cmd<-dcast(md,Protein+nmiss~Sample,value.var='Intensity')
head(cmd)
##   Protein nmiss      A1_1      A1_2      A2_1      A2_2      A3_1     A3_2
## 1       1     0    331950    361220    422930    630540    372500   365430
## 2       2     0   4563000   4448000   2539500   3830500   3864900  3596900
## 3       3     0    814690    898580   1085600   1056700    788260   728690
## 4       4     0 102070000 115290000 114100000 125680000 108000000 94464000
## 5       5     0    404190    378130    821670    658280    366020   366990
## 6       6    11        NA        NA        NA        NA        NA       NA
##        B1_1      B1_2       B2_1      B2_2      B3_1      B3_2
## 1    354870    372210 3.6507e+05    423640    330950    275670
## 2   4010500   3505300 5.3546e+06   4319900   3299900   2096800
## 3    724620    579320 9.7646e+05    931880    596570    690040
## 4 211910000 183910000 1.7914e+08 179580000 259720000 204390000
## 5    246350    237430 3.5972e+05    205960    318120    322570
## 6        NA        NA 7.7677e+03        NA        NA        NA
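
Since mqdata.csv is not distributed with these slides, the round trip can be tried on a toy table. The data frame below is a made-up stand-in with the same shape (one id column plus one column per sample).

```r
library(reshape2)

# Toy stand-in for mqdata.csv (values are invented for illustration)
wide <- data.frame(Protein = 1:3,
                   A1_1 = c(10, 20, NA),
                   B1_1 = c(11, NA, 31))

# Wide to tall: one row per Protein/Sample pair
tall <- melt(wide, id.vars = "Protein",
             variable.name = "Sample", value.name = "Intensity")

# Tall back to wide: one column per Sample again
wide_again <- dcast(tall, Protein ~ Sample, value.var = "Intensity")
```

Melting and then casting with the same id variables should return a table equivalent to the original.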

Visualise Missingness

p<-ggplot(md,aes(x=as.factor(nmiss),y=log2(Intensity)))
p+geom_boxplot()+facet_wrap(~Sample)
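
The nmiss column appears to count the missing intensities per protein (row 6 above has 11 NA values and nmiss of 11). Assuming that reading, a column like it could be computed as in this sketch on a made-up table:

```r
# Toy stand-in for the intensity table (values are invented)
toy <- data.frame(Protein = 1:3,
                  A1_1 = c(10, NA, NA),
                  A1_2 = c(12, 22, NA),
                  B1_1 = c(11, 21, 31))

# Every column except the id holds sample intensities
sample_cols <- setdiff(names(toy), "Protein")

# Count the NA entries in each row across the sample columns
toy$nmiss <- rowSums(is.na(toy[, sample_cols]))
toy$nmiss
```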