R Graphics

Shige

Three ways to do graphics in R

The base graphics system

  • “Pen and paper” model
  • Simple
  • Quick
  • Easy
library(Zelig)
data(turnout)
hist(turnout$educate)

The lattice graphics system

  • Flexible
  • Conditional plots made easy
library(lattice)
histogram(~ educate, data=turnout)
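
To illustrate the "conditional plots" point, here is a minimal sketch on a small made-up data frame (toy and its values are hypothetical, not the turnout data). The | in the formula asks lattice for one panel per level of the conditioning variable.

```r
library(lattice)

# Hypothetical toy data standing in for turnout
toy <- data.frame(
  educate = c(8, 10, 12, 12, 14, 16, 12, 16),
  race    = factor(rep(c("white", "others"), each = 4))
)

# One histogram panel per level of race; print(h) draws the plot
h <- histogram(~ educate | race, data = toy)
```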

The ggplot2 system

Based on the book The Grammar of Graphics, and thus:

  • Modular
  • Flexible
  • Elegant
  • Two sets of syntax
    1. qplot(): syntax similar to the base graphics
    2. ggplot(): raw power
library(ggplot2)
p <- ggplot(turnout, aes(x=educate)) + geom_histogram()
print(p)

Let's focus on ggplot2

One system, many faces: ggplot2 is themable

library(ggthemes)
p1 <- ggplot(turnout, aes(educate)) + geom_histogram() + theme_tufte()
print(p1)

library(ggthemes)
p2 <- ggplot(turnout, aes(educate)) + geom_histogram() + theme_stata()
print(p2)

More themes

p3 <- ggplot(turnout, aes(educate)) + geom_histogram() + theme_economist()
print(p3)

p4 <- ggplot(turnout, aes(educate)) + geom_histogram() + theme_excel()
print(p4)

Here are some more themes.
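
A theme is just one more component added with +, so the same plot can be restyled without touching the data or the geoms. The sketch below is an illustration only: it uses themes that ship with ggplot2 itself rather than ggthemes, and the toy data frame is made up.

```r
library(ggplot2)

# Hypothetical toy data standing in for turnout
toy <- data.frame(educate = c(8, 10, 12, 12, 14, 16))
base_plot <- ggplot(toy, aes(educate)) + geom_histogram(binwidth = 2)

# Themes built into ggplot2 (no ggthemes needed)
themes <- list(
  bw      = theme_bw(),
  minimal = theme_minimal(),
  classic = theme_classic()
)

# Adding a theme returns a new plot; base_plot itself is left unchanged
themed <- lapply(themes, function(th) base_plot + th)
```

Calling print(themed$minimal) would then draw the minimal-theme version.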

A more complicated example: grouping

ggplot(turnout, aes(educate)) + geom_histogram() + facet_grid(race ~ .)

ggplot(turnout, aes(educate)) + geom_histogram() + facet_grid(vote ~ .)

Putting them together

ggplot(turnout, aes(educate)) + geom_histogram() + facet_grid(race ~ vote)
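
The formula passed to facet_grid() reads as rows ~ columns, and a dot means "no split in this direction". A minimal sketch with a made-up data frame (toy is hypothetical, not the turnout data):

```r
library(ggplot2)

# Hypothetical toy data with two grouping variables
toy <- data.frame(
  educate = c(8, 12, 16, 10, 12, 14, 9, 12),
  race    = c("white", "white", "others", "others",
              "white", "others", "white", "others"),
  vote    = c(1, 0, 1, 1, 0, 0, 1, 1)
)

base_plot <- ggplot(toy, aes(educate)) + geom_histogram(binwidth = 2)

by_row <- base_plot + facet_grid(race ~ .)     # one row of panels per race
by_col <- base_plot + facet_grid(. ~ vote)     # one column of panels per vote
both   <- base_plot + facet_grid(race ~ vote)  # full race-by-vote grid
```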

If you like a frequency polygon:

ggplot(turnout, aes(educate)) + geom_freqpoly() + facet_grid(race ~ vote)

Box plots

Education by race:

ggplot(turnout, aes(y=educate, x=race)) + geom_boxplot()

Key concepts: the grammar of graphics

  1. Data: The information we want to visualize.
  2. Geoms: The geometric objects that are drawn to represent the data, such as bars, lines, and points.
  3. Aesthetics: Visual properties of geoms, such as x and y position, line color, point shapes, etc.
  4. Mapping: The process of connecting data values to aesthetics.
  5. Scales: They control the mapping from the values in the data space to values in the aesthetic space.
  6. Guides: They show the viewer how to map the visual properties back to the data space. The most commonly used guides are the tick marks and labels on an axis.
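
As a rough illustration, the six concepts can be matched to the pieces of a single plot specification. The data frame below is made up for the purpose, and the colours and legend title are arbitrary choices.

```r
library(ggplot2)

# 1. Data: a small hypothetical data frame
df <- data.frame(
  xval  = 1:4,
  yval  = c(3, 5, 6, 9),
  group = c("A", "B", "A", "B")
)

p <- ggplot(df,                                            # data
            aes(x = xval, y = yval, colour = group)) +     # aesthetics + mapping
  geom_point(size = 3) +                                   # geom
  scale_colour_manual(values = c(A = "red", B = "blue")) + # scale
  guides(colour = guide_legend(title = "Group"))           # guide (the legend)
```

The axes are guides too; ggplot2 adds them automatically for the x and y position scales.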

Let's create the data:

dat <- data.frame(xval=1:4, yval=c(3,5,6,9), group=c("A","B","A","B"))
names(dat)
[1] "xval"  "yval"  "group"

A basic ggplot() specification looks like the following. It creates a ggplot object from the data frame dat and sets the default aesthetic mappings within aes():

ggplot(dat, aes(x = xval, y = yval))
  • x=xval maps the column xval to the x position
  • y=yval maps the column yval to the y position
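
One way to see what this specification holds is to inspect the object. The sketch below assumes the same dat as above and looks at two fields of the plot object, mapping and layers. It shows that the mappings are stored but no layer exists until a geom is added.

```r
library(ggplot2)

dat <- data.frame(xval = 1:4, yval = c(3, 5, 6, 9),
                  group = c("A", "B", "A", "B"))

p <- ggplot(dat, aes(x = xval, y = yval))

names(p$mapping)   # the aesthetics that have a default mapping
length(p$layers)   # no layers yet, so nothing is drawn inside the panel

# Adding a geom creates the first layer
p_points <- p + geom_point()
length(p_points$layers)
```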

We also need to tell ggplot() what geometric objects (e.g., bars, points, lines) to put there. Let's begin with a scatter plot:

ggplot(dat, aes(x = xval, y = yval)) + geom_point()

But we can easily turn it into a line chart:

ggplot(dat, aes(x=xval, y=yval)) + geom_line()

Combining points and lines is even better:

ggplot(dat, aes(x=xval, y=yval)) + geom_point() + geom_line()

One way to do the grouping

ggplot(dat, aes(x=xval, y=yval)) + geom_point(aes(colour=group))

ggplot(dat, aes(x=xval, y=yval)) + geom_point(aes(colour=group)) + geom_line(aes(colour=group))
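
Another way to do the grouping, sketched here as an alternative rather than a replacement, is to put colour in the top-level aes(). Every layer then inherits it, so it does not have to be repeated per geom.

```r
library(ggplot2)

dat <- data.frame(xval = 1:4, yval = c(3, 5, 6, 9),
                  group = c("A", "B", "A", "B"))

# colour = group is set once and inherited by both layers
p <- ggplot(dat, aes(x = xval, y = yval, colour = group)) +
  geom_point() +
  geom_line()
```

Per-layer aes() is still useful when only some layers should be grouped, for example coloured points over a single black line.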

Let's add a regression line

ggplot(dat, aes(x = xval, y = yval)) + geom_point() + geom_smooth(method = "lm")

As you can see, it is just another layer.

ggplot(dat, aes(x = xval, y = yval)) + geom_point() + stat_smooth(method = "lm")

It turns out that geom_smooth() and stat_smooth() are interchangeable here: both add the same smoothing layer. stat_smooth() is one of the many statistical functions that can be used with ggplot().
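
A quick way to check this is to look at the layer each function builds. The sketch below inspects the stat and geom fields of the layer objects. Treat it as an exploration of ggplot2 internals, which may change between versions.

```r
library(ggplot2)

g <- geom_smooth(method = "lm")
s <- stat_smooth(method = "lm")

# Both layers pair the same statistic with the same geometric object
class(g$stat)[1]
class(s$stat)[1]
class(g$geom)[1]
class(s$geom)[1]
```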

Let's go back to our voter turnout example:

ggplot(turnout, aes(x=educate,y=income)) + geom_point() + geom_smooth(method="lm")

ggplot(turnout, aes(x=educate,y=income)) + geom_point() + geom_smooth()

Linear smoother

ggplot(turnout, aes(x=educate,y=income, colour=race)) + geom_point() + scale_colour_manual(values=c("red", "blue")) + geom_smooth(method="lm")

Nonlinear smoother

ggplot(turnout, aes(x=educate,y=income, colour=race)) + geom_point() + scale_colour_manual(values=c("red", "blue")) + geom_smooth()

Mean

# fun replaces the fun.y argument, which is deprecated since ggplot2 3.3.0
ggplot(turnout, aes(x = educate, y = income, colour = race)) +
    scale_colour_manual(values = c("red", "blue")) +
    stat_summary(fun = mean, geom = "point")

Median

# fun replaces the fun.y argument, which is deprecated since ggplot2 3.3.0
ggplot(turnout, aes(x = educate, y = income, colour = race)) +
    scale_colour_manual(values = c("red", "blue")) +
    stat_summary(fun = median, geom = "point")

Let's think about what we just did

  • We created a plot of the conditional mean and median directly from the raw data, without first creating a new data set of conditional means and medians;
  • If we were using Excel or Stata, how would we go about doing this?
  • That does not mean we cannot manipulate the data and create the summary-statistic plots manually.
  • In fact, the “reshape2” package makes such data manipulation trivially easy.
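
To make the contrast concrete, here is a minimal sketch of both routes on a made-up data frame (toy is hypothetical). The manual route computes the conditional means first with base R's aggregate(). The direct route lets stat_summary() do it at plot time, and assumes ggplot2 3.3.0 or later for the fun argument.

```r
library(ggplot2)

# Hypothetical toy data: income observed at three education levels
toy <- data.frame(
  educate = c(8, 8, 12, 12, 12, 16),
  income  = c(1, 3, 2, 4, 6, 8)
)

# Manual route: build the summary data set, then plot it
means <- aggregate(income ~ educate, data = toy, FUN = mean)
p_manual <- ggplot(means, aes(educate, income)) + geom_point()

# Direct route: plot the raw data and summarise inside the layer
p_direct <- ggplot(toy, aes(educate, income)) +
  stat_summary(fun = mean, geom = "point")
```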

Using the "reshape2" package: I

  • It is somewhat similar to the “reshape” command in Stata but much more powerful;
  • It can transform a data set from one arbitrary shape into another.

race   age  educate  income  vote
white  60   14       3.3458  1
white  51   10       1.8561  0
white  24   12       0.6304  0
white  38   8        3.4183  1
white  25   12       2.7852  1

These are the first 5 cases of the original data. Suppose we want to transform the data into something like this:

race    educate  income  vote
others  11.04    2.927   0.6267
white   12.24    4.051   0.7664

Or

age  educate  income  vote
17   14.0     6.78    1.000
18   11.2     2.87    0.364
19   12.5     3.43    0.714
20   13.1     3.15    0.467
21   12.3     3.07    0.638
22   12.1     2.57    0.477
23   12.5     2.44    0.535
24   13.0     3.44    0.600
25   12.9     3.92    0.628
26   13.0     3.57    0.755
27   12.9     4.10    0.722
28   12.8     3.70    0.692
29   13.1     3.93    0.698
30   13.5     4.48    0.595
31   13.1     4.07    0.712
32   12.9     4.21    0.780
33   13.3     3.86    0.766
34   12.7     4.27    0.806
35   12.9     4.60    0.771
36   13.2     4.53    0.688

Using the "reshape2" package: II

reshape2 has two main commands, “melt” and “cast” (dcast() when the result should be a data frame).

  • Melt: the process of transforming the data into the most flexible format, where each row represents one observation of one variable;
  • Cast: the process of transforming the “molten” data into the desired format.
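
The melt step is easiest to see on a tiny example. The sketch below uses a made-up three-row data frame (toy is hypothetical). Each measured column becomes a (variable, value) pair, so three rows with two measured columns turn into six rows.

```r
library(reshape2)

# Hypothetical toy data: two id columns, two measured columns
toy <- data.frame(
  race    = c("white", "white", "others"),
  age     = c(30, 30, 40),
  educate = c(12, 14, 10),
  income  = c(3, 5, 2)
)

# id.vars are kept as they are; every other column is stacked
molten <- melt(toy, id.vars = c("race", "age"))

names(molten)
nrow(molten)
```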

For our example, the first step looks like this:

library(reshape2)
new <- melt(turnout, id = c("race", "age"))

This creates the following data:

race    age  variable  value
white   17   educate   14.0000
white   17   income    6.7838
white   17   vote      1.0000
others  18   educate   10.0000
others  18   income    0.9457
others  18   vote      0.0000
white   18   educate   12.0000
white   18   educate   12.0000
white   18   educate   12.0000
white   18   educate   12.0000
white   18   educate   12.0000
white   18   educate   9.0000
white   18   educate   13.0000
white   18   educate   10.0000
white   18   educate   10.0000
white   18   educate   11.0000
white   18   income    7.0036
white   18   income    0.1936
white   18   income    4.6690
white   18   income    3.7972
white   18   income    6.2740
white   18   income    0.7294
white   18   income    0.9214
white   18   income    3.1356
white   18   income    0.2364

Using the "reshape2" package: III

Now we can “cast” the data into the form that we want:

new_cast <- dcast(new, race ~ variable, mean)

produces the following:

race    educate  income  vote
others  11.04    2.927   0.6267
white   12.24    4.051   0.7664

And

new_cast <- dcast(new, age + race ~ variable, mean)

produces:

age  race    educate  income  vote
17   white   14.00    6.7838  1.0000
18   others  10.00    0.9457  0.0000
18   white   11.30    3.0608  0.4000
19   others  12.00    2.8296  0.3333
19   white   12.64    3.5957  0.8182
20   others  14.00    2.8769  0.2500
20   white   12.73    3.2490  0.5455
21   others  12.09    2.2023  0.7273
21   white   12.31    3.3388  0.6111
22   others  11.70    1.6117  0.5000
22   white   12.21    2.8552  0.4706
23   others  12.29    1.2820  0.2857
23   white   12.53    2.6662  0.5833
24   others  11.75    3.6401  0.3750
24   white   13.24    3.3950  0.6486
25   others  11.71    3.6925  0.4286
25   white   13.17    3.9697  0.6667
26   others  11.73    2.6948  0.7273
26   white   13.33    3.8026  0.7619
27   others  12.80    3.1278  0.8000
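
The cast formula works like this: the id variables on the left of ~ define the rows, the right side defines the columns, and the aggregation function collapses whatever falls into each cell. The slides do not show code for the by-age table from earlier; presumably it follows the same pattern with age alone on the left. A minimal sketch on made-up data (toy is hypothetical):

```r
library(reshape2)

# Hypothetical toy data
toy <- data.frame(
  race    = c("white", "white", "others", "others"),
  age     = c(30, 40, 30, 30),
  educate = c(12, 14, 10, 12),
  income  = c(3, 5, 2, 4)
)
molten <- melt(toy, id.vars = c("race", "age"))

# One row per race, one column per measured variable, cells are means
by_race <- dcast(molten, race ~ variable, fun.aggregate = mean)

# Same idea with age alone as the row variable
by_age <- dcast(molten, age ~ variable, fun.aggregate = mean)

# Two row variables: one row per observed age/race combination
by_age_race <- dcast(molten, age + race ~ variable, fun.aggregate = mean)
```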

Creating a map with ggmap

ggmap is a cool map making package based on ggplot2.

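library(ggmap)
# Recent ggmap versions may need a Google API key before qmap() works,
# registered with register_google(key = "...")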
nyc <- "new york city"
qmap(nyc, zoom=12)

bj <- "beijing"
qmap(bj, zoom=14)

A short introduction to ggmap can be found here.

More resources