Statistical Graphics in R

A Survey

Robert Norberg & Robert Truesdale

Comparing Radon Variability in 2 Studies

plot of chunk unnamed-chunk-1

Kernel Densities Bias of Passive Samplers

plot of chunk unnamed-chunk-2

Boxplot of Radon Attenuation

plot of chunk unnamed-chunk-3

Boxplots of VOC Attenuation

alt text

Line Plot of VOC Concentrations

plot of chunk unnamed-chunk-4

Preliminary Data Inspection

plot of chunk unnamed-chunk-5

Clean Data and Plot Again

plot of chunk unnamed-chunk-6

Outline

  • Plotting options in R
    • Base R Graphics
    • Lattice Graphics
    • ggplot2
      • Used to make all of the figures you just saw
  • An in-depth example with ggplot2
  • Other effective data visualization techniques in R
    • Markdown
    • Interactive Graphics
    • GUIs

Base R Graphics

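# Iterate a simple two-variable map and color each point by how far it moved
n <- 40000
x0 <- 1
y0 <- 1
a <- 1
b <- 4
c <- 60  # masks the name c, but calls to c() below still find the function
x <- c(x0, rep(NA, n - 1))
y <- c(y0, rep(NA, n - 1))
cor <- rep(0, n)  # rounded step length, used only to choose a color
for (i in 2:n) {
    x[i] <- y[i - 1] - sign(x[i - 1]) * sqrt(abs(b * x[i - 1] - c))
    y[i] <- a - x[i - 1]
    cor[i] <- round(sqrt((x[i] - x[i - 1])^2 + (y[i] - y[i - 1])^2), 0)
}
# Index colors by the rank of each distinct step length. Indexing with cor
# directly silently drops zeros and returns NA for values above n.c.
cor.levels <- sort(unique(cor))
n.c <- length(cor.levels)
cores <- heat.colors(n.c)
plot(x, y, pch = ".", col = cores[match(cor, cor.levels)], main = "Fractal Example")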

plot of chunk unnamed-chunk-8

Pros:

  1. Almost infinitely customizable - if you can think of it, you can do it
  2. This is one of the biggest reasons for R's rising popularity

Cons:

  1. Tedious - Lots of trial and error required to get a nice looking figure
  2. Very low level - some simple commands produce plots, but they have awful default settings and altering these can be extremely tedious.
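
To make the trade-off concrete, here is a minimal sketch of the manual tuning a presentable base R figure tends to need. It uses the built-in mtcars data, which is not part of the original slides, and the colors, margins, and labels are arbitrary illustrative choices.

```r
# Draw to a null PDF device so the sketch also runs without a display;
# drop this line (and dev.off()) to see the plot interactively
pdf(NULL)

# par() returns the previous settings, so they can be restored afterwards
old.par <- par(mar = c(4.5, 4.5, 2, 1), las = 1, bty = "l")

cyl.cols <- c("4" = "steelblue", "6" = "darkorange", "8" = "firebrick")
plot(mtcars$wt, mtcars$mpg,
     pch = 19, col = cyl.cols[as.character(mtcars$cyl)],
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Fuel Economy by Weight")
grid(col = "grey85")

# Base graphics never builds the legend for you
legend("topright", legend = names(cyl.cols), col = cyl.cols,
       pch = 19, title = "Cylinders", bty = "n")

par(old.par)
invisible(dev.off())
```

Every visual decision here (margins, axis label orientation, colors, legend) is a separate argument or call, which is both the flexibility and the tedium described above.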

Lattice

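library(lattice)
# Build labeled factors in a copy of mtcars instead of using attach(),
# which can silently mask other objects in the workspace
cars <- transform(mtcars,
    gear.f = factor(gear, levels = c(3, 4, 5), labels = c("3gears", "4gears", "5gears")),
    cyl.f = factor(cyl, levels = c(4, 6, 8), labels = c("4cyl", "6cyl", "8cyl")))
# One 3D panel per cylinder count, laid out in a single row
cloud(mpg ~ wt * qsec | cyl.f, data = cars, layout = c(3, 1, 1), main = "3D Scatterplot by Cylinders")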

plot of chunk unnamed-chunk-9

Lattice

Pros:

  1. Higher level than base R
  2. Implement things like multiple panels
  3. Still extremely flexible
  4. Many base R plotting commands still apply

Cons:

  1. Still pretty low level
  2. Defaults are not especially great
  3. If you want to go outside the defaults, it can be extremely tedious
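
As an illustration of the multiple-panel support listed under the pros, the following sketch conditions a scatterplot on one variable and groups by another. It again uses the built-in mtcars data; the variable names cars and cyl.f are chosen here for illustration.

```r
library(lattice)

# One panel per cylinder count; groups = colors points by gear within each panel
cars <- transform(mtcars, cyl.f = factor(cyl, labels = c("4cyl", "6cyl", "8cyl")))
p <- xyplot(mpg ~ wt | cyl.f, data = cars, groups = gear,
            layout = c(3, 1), auto.key = list(columns = 3, title = "Gears"),
            xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Lattice functions return a "trellis" object; printing it draws the plot
class(p)
# print(p)
```

The formula after the | symbol is what creates the panels, so adding a second conditioning variable there gives a grid of panels without any extra layout code.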

ggplot2

library(ggplot2)  # provides ggplot() and the diamonds data set
ggplot(diamonds, aes(x = carat, y = price, color = clarity)) + 
facet_grid(color ~ cut) + 
scale_y_sqrt() + 
geom_point(alpha = 0.3) + 
geom_smooth(method = lm)

plot of chunk unnamed-chunk-11

ggplot2

Pros:

  1. Higher level
    • qplot() is especially quick and easy
  2. Very nice looking defaults
    • If you don't like them, create your own theme!
  3. Easy to create paneled plots
  4. Use "geom"s and "stat"s
    • For example, geom_point() indicates your desire to create a scatterplot, while stat_boxplot() calculates the summary statistics necessary to plot a boxplot.

Cons:

  1. Less flexible
  2. A lot like learning a new language
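
The geom/stat split and the idea of a custom theme can be sketched as follows. This is an illustration only: it uses the mpg data set that ships with ggplot2 rather than any data from this talk, and the theme settings are arbitrary.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = class, y = hwy))

# geom_boxplot() draws boxes; by default it relies on stat_boxplot() to
# compute the quartiles and whiskers from the raw data
p.box <- base + geom_boxplot()

# The same summary statistics drawn with a different geom (error bars)
p.bars <- base + stat_boxplot(geom = "errorbar", width = 0.4)

# A theme is just another object that gets added to the plot
my.theme <- theme_minimal(base_size = 14) +
    theme(panel.grid.minor = element_blank())
p.themed <- p.box + my.theme
```

Storing the theme in an object such as my.theme makes it easy to reuse the same look across every figure in a report.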

An in-depth example with ggplot2

The Data:

VOC concentration measured in indoor air over the course of a year.

library(xtable)  # A nice package for displaying tables in Markdown
print(xtable(head(voc.data, 8)), type = "html")
DateTime Location Chloroform Trichloroethene Tetrachloroethene
156 2011-12-12 04:10:13 Outside Air 0.52 1.55
157 2011-12-12 05:49:41 Outside Air 0.45 1.60
166 2011-12-12 20:44:48 Outside Air 0.50 0.53
167 2011-12-12 22:24:16 Outside Air 0.50 1.97
168 2011-12-13 00:03:44 Outside Air 0.57 0.54 1.95
169 2011-12-13 01:43:11 Outside Air 0.53 0.58 2.09
199 2011-12-15 12:45:02 Outside Air 0.56 1.99
200 2011-12-15 14:24:29 Outside Air 0.88 5.12

Melting Data

First, we must get this data into long form to use it in ggplot2. I use reshape2, a nice package created by the author of ggplot2 for manipulating data structures.

library(reshape2)
voc.long <- melt(data = voc.data, id.vars = c("DateTime", "Location"), measure.vars = c("Chloroform", 
    "Trichloroethene", "Tetrachloroethene"), value.name = "Concentration", variable.name = "Compound")
DateTime Location Compound Concentration
1 2011-12-12 04:10:13 Outside Air Chloroform 0.52
2 2011-12-12 05:49:41 Outside Air Chloroform 0.45
3 2011-12-12 20:44:48 Outside Air Chloroform 0.50
4 2011-12-12 22:24:16 Outside Air Chloroform 0.50
5 2011-12-13 00:03:44 Outside Air Chloroform 0.57
6 2011-12-13 01:43:11 Outside Air Chloroform 0.53
7 2011-12-15 12:45:02 Outside Air Chloroform 0.56
8 2011-12-15 14:24:29 Outside Air Chloroform 0.88
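
To see what melting does on a scale small enough to check by eye, here is a sketch with a made-up three-row data frame. The sample names and values are hypothetical. The na.rm = TRUE argument is an optional addition not used in the call above; it drops compound/sample combinations that were never measured.

```r
library(reshape2)

# Hypothetical wide data: one row per sample, one column per compound
wide <- data.frame(
    Sample = c("s1", "s2", "s3"),
    Chloroform = c(0.52, 0.45, NA),
    Trichloroethene = c(NA, 0.53, 0.58)
)

# Long form: one row per (sample, compound) measurement
long <- melt(wide, id.vars = "Sample", variable.name = "Compound",
             value.name = "Concentration", na.rm = TRUE)
long
```

The six cells of the wide table become four rows in the long table, because two of the cells were missing.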

Melting Data

  • I recommend beginners use the reshapeGUI package.
  • It helps you reshape your data, then generates the code necessary to implement the manipulation.
library(reshapeGUI)
reshapeGUI()

Tip: If your dataset is large, create a temporary dataset that consists of the first few rows and manipulate this in the GUI. It will work much faster.

temporary <- voc.data[1:7, ]

Dates in R

Next, we must coerce DateTime to a Date/Time class recognized by R (as of right now it is seen by R as a "factor", a categorical variable):

voc.long$DateTime <- as.POSIXct(voc.long$DateTime, format = "%Y-%m-%d %H:%M:%S", 
    tz = "GMT")

For questions about the format argument, see:

http://stat.ethz.ch/R-manual/R-patched/library/base/html/strptime.html
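
As a small illustration of how the format argument behaves, the sketch below parses a factor of timestamps written in a different layout from the one above. The values are made up for illustration. Note that a format string that does not match the text yields NA rather than an error.

```r
# A factor column of timestamps, as older versions of R produced by default
# when reading text files; these values are hypothetical
stamp <- factor(c("12/12/2011 04:10", "12/13/2011 00:03"))

# The format string must describe the text exactly
parsed <- as.POSIXct(as.character(stamp), format = "%m/%d/%Y %H:%M", tz = "GMT")
wrong <- as.POSIXct(as.character(stamp), format = "%Y-%m-%d %H:%M", tz = "GMT")

class(parsed)
is.na(wrong)
```

Checking for NA values right after the conversion is a cheap way to catch a mismatched format before it reaches a plot.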

Dates in R

R assumes timestamps were recorded in the time zone your computer is set to. If that is not the case, specify the time zone explicitly. Take for example this code:

is.na(as.POSIXct("3/11/2012 2:10:00", format = "%m/%d/%Y %H:%M:%S"))
## [1] TRUE

This time does not exist in US Eastern time: daylight saving time began on March 11, 2012, so clocks jumped from 2:00 to 3:00 that morning. If I specify the time zone, the problem is corrected:

is.na(as.POSIXct("3/11/2012 2:10:00", format = "%m/%d/%Y %H:%M:%S", tz = "GMT"))
## [1] FALSE
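
The time zone matters even when no value turns into NA. The sketch below parses the same text twice. It assumes "America/New_York" is the zone name your system uses for US Eastern time, and that the time zone database is available to R.

```r
stamp <- "2011-12-12 04:10:13"

as.gmt <- as.POSIXct(stamp, tz = "GMT")
as.eastern <- as.POSIXct(stamp, tz = "America/New_York")

# The clock time looks identical in both cases...
format(as.gmt, "%H:%M:%S") == format(as.eastern, "%H:%M:%S")

# ...but the two values are different instants, five hours apart in December
difftime(as.eastern, as.gmt, units = "hours")
```

If two instruments logged in different zones, an unstated time zone shifts one series against the other by exactly this kind of offset.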

Plotting in ggplot2

Now we're ready to plot the data. The basic plotting call is ggplot(). It has 2 key arguments - data and aes(), short for "aesthetics".

my.plot <- ggplot(data = voc.long, aes(x = DateTime, y = Concentration, color = Compound))
  • In the data argument I supply the long form data set we just created.
  • In the aes() argument I assign mappings between the data and graphical elements.
    • X axis = DateTime
    • Y axis = Concentration
    • Color = Compound

Now let's add a layer to our plot:

my.dot.plot <- my.plot + geom_point(alpha = 0.2)
my.dot.plot

Plotting in ggplot2

plot of chunk unnamed-chunk-23

Plotting in ggplot2

Until now our plot was an abstract idea inside the computer. Now we have supplied instructions for how that abstraction should manifest itself visually.

I chose geom_point() for a scatter plot, but you could choose any of the available geoms that make sense for what you're trying to plot.

Different geoms come with different options/arguments.

  • In this call to geom_point() the optional argument alpha = 0.2 makes each point semi-transparent.

Also, notice some key features of the default settings:

  • The color scheme has been chosen for you. You can specify your own if you desire.
  • The legend has been created for you. This can be renamed, relabeled, hidden, moved around, etc.
  • The axes have been scaled and labeled for you. These can be changed in just about any manner you could imagine.
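
A minimal sketch of swapping geoms and overriding those defaults follows. It uses the mpg data set bundled with ggplot2 instead of the VOC data, and the colors and legend labels are arbitrary choices for illustration.

```r
library(ggplot2)

# The base object holds data and mappings only; no geom has been chosen yet
base <- ggplot(mpg, aes(x = displ, y = hwy, color = drv))

# The same mappings rendered by two different geoms
p.points <- base + geom_point(alpha = 0.5)
p.jitter <- base + geom_jitter(alpha = 0.5, width = 0.1)

# Overriding the defaults: custom colors, legend title and labels, and
# legend position (labels are given in the order of the levels 4, f, r)
p.custom <- p.points +
    scale_color_manual(name = "Drive train",
                       values = c("4" = "black", "f" = "tomato", "r" = "dodgerblue"),
                       labels = c("4-wheel", "Front", "Rear")) +
    theme(legend.position = "bottom")
```

Because base never changes, each variant is a one-line experiment, which makes it cheap to try several geoms before settling on one.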

Plotting in ggplot2

  • You can add as many layers as you like on top of one another.

  • Additional layers will inherit the aesthetic mappings you have already supplied.

  • To demonstrate this I add a smoothed line to each compound:

my.trend.plot <- my.dot.plot + geom_smooth(lwd = 1.2)
my.trend.plot

plot of chunk unnamed-chunk-24

Plotting in ggplot2

  • You can add layers with different data sets to the same plot
    • Just add another "data =" argument to the new layer.
  • If you wish to ignore the aesthetic mappings already created, simply add inherit.aes=FALSE to your layer and then supply another aes() argument.
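# One-row data set holding the annotation's position and label
plot.text <- data.frame(y.coord = 200, x.coord = strptime("11/7/2011 0:0:00", 
    format = "%m/%d/%Y %H:%M:%S", tz = "GMT"), text = "HVAC On \n X")
# guide is not an aesthetic, so it does not belong in aes(); show.legend = FALSE
# (a layer argument in current ggplot2) keeps the text layer out of the legend
my.dot.plot <- my.dot.plot + geom_text(inherit.aes = FALSE, data = plot.text, aes(x = x.coord, 
    y = y.coord, label = text), show.legend = FALSE)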
my.dot.plot

Plotting in ggplot2

plot of chunk unnamed-chunk-26
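
The difference between a layer that inherits the plot's mappings and one that brings its own data can be sketched on a built-in data set. The annotation data frame note and its coordinates are hypothetical, and mpg comes with ggplot2.

```r
library(ggplot2)

# Hypothetical annotation data, separate from the main data set
note <- data.frame(x = 5.5, y = 40, text = "Annotation")

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
    geom_point(alpha = 0.5) +
    # Inherits x, y and color from ggplot(): one fitted line per drive type
    geom_smooth(method = "lm", se = FALSE) +
    # Own data and own mappings; inherit.aes = FALSE stops this layer from
    # looking for a drv column that the annotation data does not have
    geom_text(data = note, aes(x = x, y = y, label = text),
              inherit.aes = FALSE, show.legend = FALSE)
```

Without inherit.aes = FALSE the text layer would try to evaluate color = drv against note and fail, since that column exists only in the main data.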

Plotting in ggplot2

  • Now let's add another layer of complexity to the plot. Let's give each location its own panel in the figure:
my.panel.plot <- my.dot.plot + facet_wrap(~Location, nrow = 2)
my.panel.plot

plot of chunk unnamed-chunk-27

Plotting in ggplot2

  • Faceting is an extremely useful tool for easily adding an extra dimension or two to your figure.
    • Here we have added one extra dimension.
  • To add 2 dimensions use facet_grid(Dimension1~Dimension2)
  • Notice that the panels have been labeled for you. These can also be played with quite a bit.

  • Some of these panels look like they could be informative, but others look like nothing's there. Concentration data are usually best viewed on a log scale. Let's add this.

my.panel.plot <- my.panel.plot + scale_y_log10()
my.panel.plot

Plotting in ggplot2

plot of chunk unnamed-chunk-29
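
For comparison, here is a sketch of one-variable and two-variable faceting side by side, followed by the same log-scale step. It uses the diamonds data set from ggplot2 rather than the VOC data.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = carat, y = price)) + geom_point(alpha = 0.1)

# One faceting variable: panels wrap into the requested number of rows
p.wrap <- base + facet_wrap(~cut, nrow = 2)

# Two faceting variables: rows ~ columns, one panel per cut/color combination
p.grid <- base + facet_grid(cut ~ color)

# Prices are strongly right-skewed, so a log axis keeps the low-priced
# panels from being flattened against the bottom
p.log <- p.grid + scale_y_log10()
```

By default all panels share the same axes, which is why a few large values in one panel can make the others look empty until the scale is transformed.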

Plotting in ggplot2

Let's finish off the plot with some small aesthetic details.

  • Change some labels:
my.panel.plot <- my.panel.plot + labs(x = "Date", y = "Concentration (ug/L)", 
    title = "VOC Vapor Intrusion")
  • Add a "theme" and make labelling larger:
my.panel.plot <- my.panel.plot + theme_bw(base_size = 18)
  • Make my X axis labels just right:
library(scales)
my.panel.plot <- my.panel.plot + scale_x_datetime(labels = date_format("%b-'%y"), 
    breaks = "2 months", minor_breaks = "month") + theme(axis.text.x = element_text(angle = 25, 
    vjust = 0.5))

Plotting in ggplot2

plot of chunk unnamed-chunk-33

Get even fancier with R

knitr

# Display my code
print("Output appears below code chunk")
## [1] "Output appears below code chunk"

Then add text using markdown!

You can use markdown to create:

  • html documents
  • pdf documents
  • slide shows - (This whole presentation was created in markdown!)

Switch back and forth between R code and markdown (a simple text markup language)

knitr

Pros:

  1. Nice to switch back and forth between code and text
  2. Easily update documents to include new data
  3. Self contained - all of the figures should show up embedded in the document
    • I've had some trouble with this at times

Cons:

  1. If you're used to Word, markdown feels extremely limiting
  2. If you rerun a document, the figures and statistics might change, but your typed conclusions won't.
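
One partial remedy for the second con is to compute the numbers quoted in your text with inline R expressions, so they at least update along with the figures. The sketch below assumes the knitr package is installed. It knits a tiny markdown document held in a character vector, and builds the chunk fence with strrep() only to avoid writing literal fences inside this example.

```r
library(knitr)

fence <- strrep("`", 3)
src <- c(
    "The mean of 1 to 10 is `r mean(1:10)`.",
    "",
    paste0(fence, "{r}"),
    "summary(cars$speed)",
    fence
)

# knit() runs the chunk and the inline expression, returning markdown text
out <- knit(text = src, quiet = TRUE)
cat(out)
```

The sentence in the output reports a value that R computed, so it changes whenever the underlying data change. The interpretation you write around that number still has to be reviewed by hand.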

Interactive R Graphics

  • Google Vis API
  • rggobi

One step further - GUIs

  • R GUIs (RGtk2, tcltk2, gWidgets)
    • An example from a school project
      • Deliver a tool instead of a picture
      • Only hosted locally - must have R installed
  • Shiny - a new tool for making simple web apps
    • Also only hosted locally
    • Much simpler than building a GUI, but also less flexible
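
To give a sense of how little code a Shiny app needs, here is a minimal sketch assuming the shiny package is installed. The app itself (a histogram of the built-in faithful data with a slider for the number of bins) is invented for illustration and is not the school project mentioned above.

```r
library(shiny)

# User interface: a slider and a placeholder for the plot
ui <- fluidPage(
    titlePanel("Histogram Explorer"),
    sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
    plotOutput("hist")
)

# Server logic: redraw the histogram whenever the slider moves
server <- function(input, output) {
    output$hist <- renderPlot({
        hist(faithful$waiting, breaks = input$bins,
             main = "Old Faithful Waiting Times", xlab = "Minutes")
    })
}

app <- shinyApp(ui = ui, server = server)
# runApp(app)  # uncomment to launch the app in a browser
```

The plot is ordinary base R code; Shiny only supplies the input value and decides when to rerun it.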

Resources:

Base R graphics

  • R Graphics, 2nd ed., Paul Murrell, CRC Press, 2011.
    • I definitely recommend a book, not just online documentation

Lattice package

  • Lattice: Multivariate Data Visualization with R, Deepayan Sarkar, Springer New York, 2008.

ggplot2

  • Great online documentation: http://docs.ggplot2.org/current/
  • ggplot2: Elegant Graphics for Data Analysis
    • Written by the author of the software, Hadley Wickham; free online through the RTI library!

Resources, continued:

Markdown

R GUIs

Find this presentation online:

http://rpubs.com/rnorberg/StatGraphics