Harlan D. Harris, PhD
October 16, 2013
Graphics are (should be!) created by combining a specification with data. (Wilkinson, 2005)
The specification makes you think about what you're trying to communicate.
The specification is not the name of the visual form: bar graph, scatterplot, histogram.
The specification is a collection of rules that together describe how to build a graph, a…
Grammar of Graphics
I think about the message:
The computer translates it to an image:
spec + data.frame =
As we'll see later, a strength of ggplot is that you can do certain types of computations on your data from within the system. Here, it's computing a ratio of two things. In general, this power allows you to quickly explore your data, and different ways of presenting your data, without having to make changes to your data table.
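A minimal sketch of computing a ratio inside the plot specification. The `toy` data frame and its numbers are made up for illustration; only `aes()`, `geom_line()` and `layer_data()` come from ggplot2.

```r
library(ggplot2)

# Hypothetical toy data: hits (h) and at-bats (ab) per season, made-up numbers.
toy <- data.frame(year = 2001:2005,
                  h    = c(120, 135, 150, 110, 160),
                  ab   = c(480, 500, 520, 450, 540))

# The ratio h / ab is computed inside aes(); the data frame itself is not modified.
p_ratio <- ggplot(toy, aes(x = year, y = h / ab)) +
  geom_line()

# The y values ggplot2 computed for the line layer.
ratio_y <- layer_data(p_ratio)$y
```

The data table still has only its three original columns, so trying a different ratio means changing one expression in `aes()`.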
canonical flexibility example: pie chart is a stacked bar graph plotted in polar coordinates! In fact, that's how ggplot implements pie charts!
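The pie chart remark can be sketched as follows, assuming a small made-up data frame (`toy`, with hypothetical categories and counts). The only difference between the two plots is the coordinate system.

```r
library(ggplot2)

# Hypothetical toy data: three categories with made-up counts.
toy <- data.frame(category = c("a", "b", "c"),
                  count    = c(50, 40, 10))

# A single stacked bar: one x position, bar segments filled by category.
p_bar <- ggplot(toy, aes(x = factor(1), y = count, fill = category)) +
  geom_bar(stat = "identity", width = 1)

# The same specification in polar coordinates, with y mapped to the angle.
p_pie <- p_bar + coord_polar(theta = "y")
```

The stacked segments are identical in both plots; only the way positions are drawn changes.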
reshape2, plyr, lubridate, stringr, etc.
library(ggplot2); library(plyr)
data(baseball)
tail(baseball, 4)
id year stint team lg g ab r h X2b X3b
89526 benitar01 2007 1 SFN NL 19 0 0 0 0 0
89530 ausmubr01 2007 1 HOU NL 117 349 38 82 16 3
89533 aloumo01 2007 1 NYN NL 87 328 51 112 19 1
89534 alomasa02 2007 1 NYN NL 8 22 1 3 1 0
hr rbi sb cs bb so ibb hbp sh sf gidp
89526 0 0 0 0 0 0 0 0 0 0 0
89530 3 25 6 1 37 74 3 6 4 1 11
89533 13 49 3 0 27 30 5 2 0 3 13
89534 0 0 0 0 0 3 0 0 0 0 0
Craig Biggio from the Astros
library(reshape2);
alnl <- subset(baseball, lg %in% c('AL', 'NL'));
bb_wide <- ddply(alnl, .(lg, year), summarise, r=sum(r), h=sum(h), hr=sum(hr));
head(bb_wide,3)
lg year r h hr
1 AL 1901 1322 2330 61
2 AL 1902 1872 3648 88
3 AL 1903 1409 3050 68
bb_long <- melt(bb_wide, id=c('lg', 'year'), variable.name='stat')
head(bb_long, 3)
lg year stat value
1 AL 1901 r 1322
2 AL 1902 r 1872
3 AL 1903 r 1409
bb_long_al <- subset(bb_long, lg=='AL')
p <- ggplot(data=bb_long_al,
mapping=aes(x=year, y=value,
color=stat)) +
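layer(geom='line', stat='identity', position='identity')  # recent ggplot2 releases require an explicit stat and position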
print(p)
p: building an object, not plotting on the screen.
ggplot: creates a "core" ggplot object.
data: required; ggplot wants a data.frame.
mapping: all ggplots have an underlying x and y, and the mapping parameter describes how to map those underlying variables to the columns of the data frame.
x and y: realized as horizontal and vertical position here, but they might not be in other graphs. These are the obvious mappings. Note that ggplot uses the data frame column names as default axis labels.
color: the color of the lines is mapped to the value of the stat column. (The actual colors are set by default here; we will override them later.)
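A small sketch of the "object, not plot" point. The `toy` data frame reuses the first three AL rows for runs and hits shown earlier; the name `toy` and the use of `geom_line()` are choices made for this illustration.

```r
library(ggplot2)

# First three AL seasons for runs (r) and hits (h), in long form.
toy <- data.frame(year  = rep(1901:1903, times = 2),
                  stat  = rep(c("r", "h"), each = 3),
                  value = c(1322, 1872, 1409, 2330, 3648, 3050))

# Building the specification draws nothing; it only creates an R object.
p <- ggplot(data = toy, mapping = aes(x = year, y = value, color = stat)) +
  geom_line()

is_plot_object <- inherits(p, "ggplot")

# Rendering happens only when the object is printed, for example with print(p).
```

Because `p` is an ordinary object, it can be stored, extended with `+`, and printed later.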
ggplot(data=bb_long_al,
mapping=aes(x=year, y=value,
color=stat)) +
layer(geom='line', stat='identity', position='identity')
| data | layers | mapping | scales | coords | facet | theme |
|---|---|---|---|---|---|---|
| (copy) | ↓ | x=year y=value color=stat | (default) | (default) | (default) | (default) |

| data | mapping | geom | stat | geom.params | stat.params |
|---|---|---|---|---|---|
| ↑ | ↑ | line | identity | | |
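p2 <- p + layer(geom='point', stat='identity', position='identity',
params=list(size=4))  # recent ggplot2 releases take layer parameters via params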
summary(p2)
data: lg, year, stat, value [321x4]
mapping: x = year, y = value, colour = stat
faceting: facet_null()
-----------------------------------
geom_line:
stat_identity:
position_identity: (width = NULL, height = NULL)
geom_point: size = 4
stat_identity:
position_identity: (width = NULL, height = NULL)
| data | layers | mapping | scales | coords | facet | theme |
|---|---|---|---|---|---|---|
| (copy) | ↓ | x=year y=value color=stat | (default) | (default) | (default) | (default) |

| data | mapping | geom | stat | geom.params | stat.params |
|---|---|---|---|---|---|
| ↑ | ↑ | line | identity | | |

| data | mapping | geom | stat | geom.params | stat.params |
|---|---|---|---|---|---|
| ↑ | ↑ | point | identity | size=4 | |
Mappings of data to representations of displayable attributes.
p3 <- p2 + scale_y_continuous("Total", limits=c(0,15000),
breaks=c(0,5000,10000,12000)) +
scale_x_continuous("Year") +
scale_color_manual("Stat",
values=c('darkgreen','darkblue','darkred'))
Axis limits can be set with scale_x_continuous(), but:
coord_cartesian(xlim=c(1,10)) zooms in without discarding data
scale_x_continuous(limits=c(1,10)) truncates the data too, which can break statistical calculations
both are properties of the base object, not of layers, so they're universal for the whole plot
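The difference between zooming and truncating can be sketched with a boxplot, whose median is a statistical calculation. The `toy` data are made up, and the limits are applied to y rather than x purely to keep the example short.

```r
library(ggplot2)

# Hypothetical toy data: a single group with values 1 to 20.
toy <- data.frame(group = "a", y = 1:20)

p_box <- ggplot(toy, aes(x = group, y = y)) +
  geom_boxplot()

# Zoom only: all 20 values still feed the boxplot statistics.
p_zoom <- p_box + coord_cartesian(ylim = c(0, 10))

# Scale limits: values outside 0..10 are dropped before the statistics run.
p_trunc <- p_box + scale_y_continuous(limits = c(0, 10))

# The median ("middle") computed for each version of the plot.
median_zoom  <- layer_data(p_zoom)$middle
median_trunc <- suppressWarnings(layer_data(p_trunc)$middle)  # ggplot2 warns about removed rows
```

The two medians differ because the truncated plot only sees the values that fall inside the limits.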
Adjust properties of a graph as a whole.
p4 <- p3 + ggtitle("Baseball!") +
theme_bw() +
theme(plot.title=element_text(color='#FF8888', size=40))
Note that although this is the last in a series of function calls added together, it actually modifies the base ggplot object, not any of the layers. This is unfortunate. I'm not going into themes here, but you can modify many of the underlying lines, boxes, and colors that make up a ggplot plot, either case by case like this, or by creating a theme object that you make the default in your environment.
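One way to make a theme the default is `theme_set()`, which returns the previous default so it can be restored. A minimal sketch, assuming a made-up `toy` data frame:

```r
library(ggplot2)

# Hypothetical toy data.
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))

# Make theme_bw() the session default; keep the previous default for later.
old_theme <- theme_set(theme_bw())

# This plot uses the new default theme without naming it.
p_default <- ggplot(toy, aes(x, y)) + geom_point()

# Case-by-case tweaks are still added with theme().
p_custom <- p_default + ggtitle("Baseball!") +
  theme(plot.title = element_text(color = "#FF8888", size = 40))

# Restore the previous default theme.
theme_set(old_theme)
```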
layer() calls are tedious!
geom_*() creates a layer with a specific geom (and various defaults, including a stat)
stat_*() creates a layer with a specific stat (and various defaults, including a geom)
qplot() creates a ggplot and a layer
p_typ <- ggplot(bb_long_al, aes(x=year, y=value, color=stat)) +
geom_line() +
geom_point(size=2)+
ylab("Total") + xlab("Year") +
scale_color_manual("Stat", values=c('darkgreen','darkblue','darkred')) +
ggtitle("American League") + theme_bw()
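To illustrate the shortcut, the sketch below builds the same line layer twice, once with `layer()` and once with `geom_line()`. The `toy` data frame is made up; the explicit `stat`, `position` and `params` arguments follow the form that recent ggplot2 releases expect.

```r
library(ggplot2)

# Hypothetical toy data.
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))

# Long form: every component of the layer is spelled out.
p_long <- ggplot(toy, aes(x, y)) +
  layer(geom = "line", stat = "identity", position = "identity",
        params = list(na.rm = FALSE))

# Short form: geom_line() fills in the same stat and position by default.
p_short <- ggplot(toy, aes(x, y)) +
  geom_line()

# Compare the positions each layer will draw.
same_xy <- isTRUE(all.equal(layer_data(p_long)[c("x", "y")],
                            layer_data(p_short)[c("x", "y")]))
```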
stat="identity"
stat="lm": lm(), generate new data to be plotted by geom_line(), CIs w/ geom_ribbon()
stat="smooth": loess() or gam()
stat="summary"
stat="bin"
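A sketch of a stat generating new data, using ggplot2's `stat_smooth()` with `method = "lm"` as the linear-model case and the built-in `mtcars` data set (both chosen for this illustration rather than taken from the slides). The fitted line and confidence band are computed by the stat, not stored in the data.

```r
library(ggplot2)

# Points from the raw data, plus a layer whose data comes from a fitted lm().
p_lm <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x)

# Data generated by the stat for the second layer:
# y is the fitted value, ymin and ymax bound the confidence interval.
fit <- layer_data(p_lm, 2)
head(fit[c("x", "y", "ymin", "ymax")], 3)
```

The line is drawn from `y` and the ribbon from `ymin` and `ymax`, which matches the description of a line plus a ribbon for the confidence interval.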
p_typ <- ggplot(bb_long, aes(x=year, y=value, color=stat)) +
geom_line() +
facet_wrap(~lg)
p_annot <- ggplot(bb_long, aes(x=year, y=value, color=stat)) +
geom_line(alpha=.2, size=2) +
stat_smooth() +
facet_wrap(~lg) +
annotate("text", x=1981, y=0, label="Strike!",
angle=90, hjust=0)
library(scales)
df <- data.frame(
date = seq(Sys.Date(), len=100, by="1 day")[sample(100, 50)],
price = runif(50)
)
df <- df[order(df$date), ]
p_ts <- ggplot(df, aes(date, price)) + geom_line() +
scale_x_date(breaks=date_breaks("2 weeks"))
p_density <- ggplot(baseball, aes(ab, fill=lg)) +
geom_density(alpha=.4)
library(Hmisc)
p_fb <- ggplot(baseball, aes(lg, ab)) +
geom_jitter() +
stat_summary(fun.data='median_hilow', geom='crossbar', color='red', fun.args=list(conf.int=.9))