Handout 011: Intro to GGPlot

More on Visualization

R has many excellent facilities for visualization. One of the best is the “Grammar of Graphics” tool, universally known as “ggplot” after the name of its library, ggplot2. To install it, run:

install.packages("ggplot2")

The Advantages of GGPlot

GGplot makes prettier graphics. That doesn’t just matter for presentation – it also helps you decide what sorts of factors might be theoretically interesting to model.

For instance:

library(ggplot2)

state.df <- data.frame(state.x77)

out <- ggplot(state.df, aes(x=Illiteracy,y=Income)) +
        geom_point() +
        ggtitle("Income vs. Illiteracy, US States, 1977")
out  ## assigning a plot to an object does not draw it; type the object's name to display it

This is nice! But what else can we do?

Using GGPlot to Create More Informative Data Visualizations

When we plot something on a two-dimensional axis, we use two pieces of information: the x- and y-coordinates. But those aren’t the only parameters we can vary (see especially Few pp. 176–81). We can also vary:

color
shape of plotting characters
size of plotting characters
opacity (transparency) of plotting characters
visual weight (thickness, line width, line type)

Each of these can be made to vary with some other variable. For categorical variables, the most useful representations are usually those that are themselves categorical: for instance, shapes and line types. For continuous variables, the most useful representations are usually those that are themselves continuous: for instance, size and color. There are intermediate or mixed cases: we can use discrete colors to represent discrete categories, for example. The deeper point is this: even though a basic chart displays only two variables’ worth of data, we are not limited to those two.
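As a minimal sketch of that distinction, the example below maps a categorical variable to shape and a continuous variable to size. It assumes R's built-in state.region factor (not used elsewhere in this handout) as the categorical variable; the names Region and sketch are only illustrative.

```r
library(ggplot2)

state.df <- data.frame(state.x77)

## state.region is a built-in factor with one entry per state,
## listed in the same order as the rows of state.x77
state.df$Region <- state.region

## Categorical variable -> shape; continuous variable -> size.
## alpha is set to a constant here so overlapping points stay visible.
sketch <- ggplot(state.df, aes(x=Illiteracy, y=Income)) +
        geom_point(aes(shape=Region, size=Murder), alpha=0.6)
sketch
```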

So, for instance, if we wanted to illustrate a three-variable relationship among the states:

out +  geom_point(aes(size=Population))

This takes the plot object we created earlier and adds a layer in which the size of each point reflects the state’s population. (This kind of chart is called a “bubble chart”.)
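The same bubble chart can also be built in a single layer instead of adding a second set of points on top of the first. The sketch below is one way to do that; scale_size_area() is a ggplot2 scale that maps a value of 0 to a point of size 0, and the max_size and alpha values are arbitrary choices, not recommendations.

```r
library(ggplot2)

state.df <- data.frame(state.x77)

## One layer of points, with point area scaled to Population
bubble <- ggplot(state.df, aes(x=Illiteracy, y=Income)) +
        geom_point(aes(size=Population), alpha=0.5) +
        scale_size_area(max_size=12) +
        ggtitle("Income vs. Illiteracy, US States, 1977")
bubble
```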

We can also create new variables that give us the basis for additional visualizations. For instance:

state.df$South <- 0 ## Create a new variable

South <- c("Virginia","Georgia","North Carolina","Alabama","South Carolina","Arkansas","Texas","Mississippi")


state.df[rownames(state.df) %in% South,]$South <- 1 ## a quick way to set South to 1 for the listed states; for now, consider doing this manually
head(state.df[,c(1,3,5,7,9)]) ## look at the data after you change it

##            Population Illiteracy Murder Frost South
## Alabama          3615        2.1   15.1    20     1
## Alaska            365        1.5   11.3   152     0
## Arizona          2212        1.8    7.8    15     0
## Arkansas         2110        1.9   10.1    65     1
## California      21198        1.1   10.3    20     0
## Colorado         2541        0.7    6.8   166     0

state.df[state.df$South==1,c(1,3,5,7,9)]

##                Population Illiteracy Murder Frost South
## Alabama              3615        2.1   15.1    20     1
## Arkansas             2110        1.9   10.1    65     1
## Georgia              4931        2.0   13.9    60     1
## Mississippi          2341        2.4   12.5    50     1
## North Carolina       5441        1.8   11.1    80     1
## South Carolina       2816        2.3   11.6    65     1
## Texas               12237        2.2   12.2    35     1
## Virginia             4981        1.4    9.5    85     1
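The indicator can also be created in one step rather than two. This is a sketch of an equivalent approach, assuming the row names of state.df are the state names, as they are for state.x77.

```r
state.df <- data.frame(state.x77)

South <- c("Virginia","Georgia","North Carolina","Alabama","South Carolina","Arkansas","Texas","Mississippi")

## %in% returns TRUE or FALSE for each row name;
## as.integer() converts TRUE to 1 and FALSE to 0
state.df$South <- as.integer(rownames(state.df) %in% South)

table(state.df$South) ## count how many states fall in each group
```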

out + geom_point(aes(colour=as.factor(state.df$South)))

This plot helps us see quickly that the Southern states seem to have substantially higher illiteracy rates than non-Southern states. That might indicate a potential source of endogeneity.
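A visual impression like this one can be checked against the numbers. The sketch below compares mean illiteracy across the two groups using base R's aggregate(); the name illit.by.south is only illustrative, and the indicator is rebuilt here so the block runs on its own.

```r
state.df <- data.frame(state.x77)

South <- c("Virginia","Georgia","North Carolina","Alabama","South Carolina","Arkansas","Texas","Mississippi")
state.df$South <- as.integer(rownames(state.df) %in% South)

## Mean illiteracy rate for non-Southern (0) and Southern (1) states
illit.by.south <- aggregate(Illiteracy ~ South, data=state.df, FUN=mean)
illit.by.south
```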

out + geom_point(aes(colour=as.factor(state.df$South),size=Population))

Putting these two factors together suggests that smaller states might be more likely to have high illiteracy, while Southern states almost uniformly display higher levels of illiteracy. That would be interesting to explore further.
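The calls above pass state.df$South inside aes() because out was created before the South column was added to the data frame. One alternative, sketched here, is to add the grouping variable first and then build the plot, so that aes() can refer to columns by name and the legend gets readable labels. The names SouthLabel and combined are hypothetical, and the legend titles are illustrative choices.

```r
library(ggplot2)

state.df <- data.frame(state.x77)

South <- c("Virginia","Georgia","North Carolina","Alabama","South Carolina","Arkansas","Texas","Mississippi")

## Store the indicator as a labeled factor so the legend shows words, not 0 and 1
state.df$SouthLabel <- factor(ifelse(rownames(state.df) %in% South, "South", "Non-South"))

## Build the plot after the column exists, mapping color and size in one layer
combined <- ggplot(state.df, aes(x=Illiteracy, y=Income)) +
        geom_point(aes(colour=SouthLabel, size=Population), alpha=0.7) +
        labs(colour="Region", size="Population (thousands)") +
        ggtitle("Income vs. Illiteracy, US States, 1977")
combined
```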


Paul Musgrave

10/25/2016
