R has many excellent facilities for visualization. One of the best is the “Grammar of Graphics” tool, universally known as “ggplot” after the name of its package, ggplot2. To install it:
install.packages("ggplot2")
ggplot makes prettier graphics. That doesn’t just matter for presentation – it also helps you decide what sorts of factors might be theoretically interesting to model.
For instance:
library(ggplot2)
state.df <- data.frame(state.x77)
out <- ggplot(state.df, aes(x = Illiteracy, y = Income)) +
  geom_point() +
  ggtitle("Income vs. Illiteracy, US States, 1977")
out ## Typing the name of the stored plot object prints it, which is what draws the plot
This is nice! But what else can we do?
When we plot something on a two-dimensional axis, we use two pieces of information: the x- and y-coordinates. But those aren’t the only parameters we can vary (see especially Few pp. 176–81). We can also vary, among other things, the shape, line type, size, and color of the marks we draw.
Each of these can vary with some other variable. For categorical variables, the most useful representations are usually those that are themselves discrete: for instance, shapes and line types. For continuous variables, the most useful representations are usually those that vary smoothly: for instance, size and color. There are intermediate or mixed cases: we can use discrete colors to represent discrete categories, for example. The deeper point is this: even though a basic chart positions its points using only two variables, those aren’t the only variables it can display.
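As a rough illustration of matching the visual channel to the type of variable, the sketch below maps a categorical variable to shape and a continuous one to color. It assumes the built-in `state.region` factor (from R's datasets package, not used elsewhere in these notes) as the categorical variable; the names `df`, `Region`, and `p` are introduced only for this example.

```r
library(ggplot2)

# state.x77 and state.region both ship with R's datasets package and
# list the 50 states in the same order, so they can be combined directly.
df <- data.frame(state.x77, Region = state.region)

# Categorical variable (Region) -> shape; continuous variable (Murder) -> colour.
p <- ggplot(df, aes(x = Illiteracy, y = Income)) +
  geom_point(aes(shape = Region, colour = Murder), size = 3) +
  ggtitle("Income vs. Illiteracy by region and murder rate")

p  # printing the object draws the plot
```

ggplot picks a discrete legend for `Region` and a continuous color bar for `Murder` on its own, based on the type of each variable.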
So, for instance, if we wanted to illustrate a three-variable relationship among the states:
out + geom_point(aes(size=Population))
This takes the plot object we created earlier and adds a layer in which the size of each point reflects that state’s Population. (This kind of chart is called a “bubble chart”.)
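One side effect worth knowing: because `out` already contains a `geom_point()` layer, adding a second one draws every state twice, once at the default size and once scaled by Population. A minimal sketch of building the same chart with a single layer follows; the object name `bubble` and the `alpha` setting are choices made for this example, not part of the original code.

```r
library(ggplot2)

state.df <- data.frame(state.x77)

# Map size inside the only point layer, so each state is drawn once.
# alpha makes overlapping bubbles easier to tell apart.
bubble <- ggplot(state.df, aes(x = Illiteracy, y = Income)) +
  geom_point(aes(size = Population), alpha = 0.6) +
  ggtitle("Income vs. Illiteracy, US States, 1977")

bubble
```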
We can also create new variables that give us the basis for additional visualizations. For instance:
state.df$South <- 0 ## Create a new variable
South <- c("Virginia","Georgia","North Carolina","Alabama","South Carolina","Arkansas","Texas","Mississippi")
state.df[rownames(state.df) %in% South,]$South <- 1 ## set South to 1 for every row whose name appears in the South vector; this is a shortcut, and while you are learning you should consider setting the values one state at a time
head(state.df[,c(1,3,5,7,9)]) ## look at the data after you change it
##            Population Illiteracy Murder Frost South
## Alabama          3615        2.1   15.1    20     1
## Alaska            365        1.5   11.3   152     0
## Arizona          2212        1.8    7.8    15     0
## Arkansas         2110        1.9   10.1    65     1
## California      21198        1.1   10.3    20     0
## Colorado         2541        0.7    6.8   166     0
state.df[state.df$South==1,c(1,3,5,7,9)]
##                Population Illiteracy Murder Frost South
## Alabama              3615        2.1   15.1    20     1
## Arkansas             2110        1.9   10.1    65     1
## Georgia              4931        2.0   13.9    60     1
## Mississippi          2341        2.4   12.5    50     1
## North Carolina       5441        1.8   11.1    80     1
## South Carolina       2816        2.3   11.6    65     1
## Texas               12237        2.2   12.2    35     1
## Virginia             4981        1.4    9.5    85     1
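The indicator can also be built in a single vectorised step, since `%in%` returns TRUE or FALSE for every row name. This is only a sketch of an alternative; the name `south_states` is introduced here so that `South` is not used for both a vector and a column.

```r
state.df <- data.frame(state.x77)

# Same eight states as in the South vector used above.
south_states <- c("Virginia", "Georgia", "North Carolina", "Alabama",
                  "South Carolina", "Arkansas", "Texas", "Mississippi")

# TRUE/FALSE for each row name, converted to 1/0.
state.df$South <- as.integer(rownames(state.df) %in% south_states)

# Quick check: how many states fall in each group?
table(state.df$South)
```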
out + geom_point(aes(colour=as.factor(state.df$South)))
This plot helps us see quickly that the Southern states tend to have higher Illiteracy scores than non-Southern states. If region is also related to Income, that might indicate a potential source of endogeneity.
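Referring to `state.df$South` inside `aes()` works here because `out` stored its copy of the data before the South column existed, but it leaves the legend titled with the raw expression. One possible alternative, sketched below, stores the indicator as a labelled factor and rebuilds the plot from the updated data frame. The names `south_states`, `SouthLabel`, and `by_region`, and the legend title "Region", are assumptions made for this example.

```r
library(ggplot2)

state.df <- data.frame(state.x77)
south_states <- c("Virginia", "Georgia", "North Carolina", "Alabama",
                  "South Carolina", "Arkansas", "Texas", "Mississippi")

# A factor with readable labels gives a readable legend.
state.df$SouthLabel <- factor(rownames(state.df) %in% south_states,
                              levels = c(FALSE, TRUE),
                              labels = c("Non-South", "South"))

# Build the plot from the data frame that already holds the new column,
# so aes() can refer to it by name.
by_region <- ggplot(state.df, aes(x = Illiteracy, y = Income)) +
  geom_point(aes(colour = SouthLabel)) +
  labs(colour = "Region") +
  ggtitle("Income vs. Illiteracy, US States, 1977")

by_region
```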
out + geom_point(aes(colour=as.factor(state.df$South),size=Population))
Putting these two factors together suggests that less populous states might be more likely to have high illiteracy, while Southern states almost uniformly display higher levels of illiteracy. That would be interesting to explore further.
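A first step in exploring further could be to check the visual impressions against simple summaries. The sketch below compares mean Illiteracy across the two groups and computes the correlation between Population and Illiteracy. It is only a descriptive check, not a model, and the names `south_states`, `illit_by_south`, and `pop_illit_cor` are introduced for this example.

```r
state.df <- data.frame(state.x77)
south_states <- c("Virginia", "Georgia", "North Carolina", "Alabama",
                  "South Carolina", "Arkansas", "Texas", "Mississippi")
state.df$South <- as.integer(rownames(state.df) %in% south_states)

# Mean illiteracy for non-Southern (0) and Southern (1) states.
illit_by_south <- aggregate(Illiteracy ~ South, data = state.df, FUN = mean)
illit_by_south

# Pearson correlation between population and illiteracy across all 50 states.
pop_illit_cor <- cor(state.df$Population, state.df$Illiteracy)
pop_illit_cor
```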