Visualizing Data with ggplot2

ggplot2 is a graphics library inspired by Leland Wilkinson's book The Grammar of Graphics (hence the “gg”). Wilkinson's grammar is extremely intuitive to anyone with experience in GIS. In fact, Wilkinson explicitly notes in his book that his “grammar” borrows heavily from the work of geographers (he cites Tobler and MacEachren). At a conceptual level, Wilkinson's grammar incorporates key concepts from GIS, including layers, coordinate systems, and the use of a variable to determine the color, texture, and size of an object.

ggplot is built around three simple ideas:
1. Variables (in a data.frame) are mapped to visual variables (such as size, position, and color) using an “aesthetic mapping.”
2. Mappings can be applied to any number of “geometries” (in the language of ggplot these are called “geoms”).
3. Plots can have multiple layers (like a GIS). These layers can be superimposed, or subsets of the data can be displayed side by side in “facets.”
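
Here is a minimal sketch of those three ideas. It uses R's built-in mtcars data as a stand-in (an assumption on my part, it is not the county data used in the rest of this post), and the object names are just for illustration.

```r
library(ggplot2)

# mtcars is a stand-in dataset (assumption), not the county data used below.

# 1. Aesthetic mapping: columns of a data.frame are tied to visual variables.
base <- ggplot(mtcars, aes(x = wt, y = mpg))

# 2. The same mapping can be drawn with more than one geom; each geom is a layer.
layered <- base +
  geom_point(aes(colour = hp)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# 3. Facets show each subset of the data side by side.
faceted <- layered + facet_wrap(~ cyl)

length(base$layers)     # 0: a mapping alone has no layers
length(faceted$layers)  # 2: points plus the fitted line
```

Printing faceted draws one panel per value of cyl, each with its own points and fitted line.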

For me, as a geographer, this is the essence of ggplot, though the authors and other users of ggplot might disagree. I see ggplot as a GIS-like system for statistical graphics. Somewhat ironically, in spite of my characterization, ggplot stinks for making actual maps.

A key reference for using ggplot is the online documentation. It seems like ggplot2 is the Justin Bieber of R libraries: there are oddly obsessed fans. The plus side of ggplot's popularity is that there is a lot of information online; the downside is that the library evolves rapidly and there is some iffy stuff posted by ggplot-Beliebers.

library(ggplot2)
library(maptools)

##LOAD DATA
USA <- readShapePoly("~/Dropbox/GEOG 5023/GEOG 5023 - Spring 2013/Data/USA copy.shp")
##Remove count fields and rows with missing data 
USA <- USA[,c(1:8, 14:30)]
USA <- na.omit(USA)

For example, a simple scatter plot can be made in ggplot by mapping one variable to the x-coordinate and one variable to the y-coordinate.

plot1 <- ggplot(data=USA@data, aes(x=Obese, y=homevalu))

Notice the above line doesn't draw anything. That's because we have to assign a “geom” to the aesthetic mapping specified by aes().

plot1 + geom_point()
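
One way to see why nothing is drawn until a geom is added is to count the layers stored in the plot object. This is a small sketch using the built-in mtcars data as a stand-in (an assumption, not the county data); the object names are hypothetical.

```r
library(ggplot2)

# mtcars is a stand-in dataset (assumption).
p <- ggplot(mtcars, aes(x = wt, y = mpg))
length(p$layers)         # 0: only data and an aesthetic mapping so far

p_points <- p + geom_point()
length(p_points$layers)  # 1: the point geom added a layer to draw
```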


Now we've visualized our aesthetic mapping using a point geometry. There does not appear to be much of a relationship between obesity and home values, but the points are clumped together and over-plotted. We can log-transform both axes to spread them out:

plot1 + geom_point() + scale_x_log10() + scale_y_log10()


Using “alpha” we can make dots transparent. We can assign the transparency on the basis of a variable or a fixed value. Below I assign alpha=1/10, which means that roughly 10 dots have to be plotted atop each other to produce an opaque black point. Notice that most values are concentrated near the mean of “Obese” and “homevalu”.

##add transparency to the points to make overplotting visible.
plot1 + geom_point(alpha=1/10) + scale_x_log10() + scale_y_log10()
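
The difference between a fixed alpha and an alpha driven by a variable comes down to whether it sits inside aes(). A rough sketch, using the diamonds data that ships with ggplot2 as a stand-in (an assumption, not the county data):

```r
library(ggplot2)

# diamonds ships with ggplot2 and is used here as a stand-in dataset (assumption).

# Fixed value: alpha is set outside aes(), so every point gets the same transparency.
fixed_alpha <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 1/10)

# Variable: alpha is mapped inside aes(), so transparency varies with depth.
mapped_alpha <- ggplot(diamonds, aes(carat, price)) +
  geom_point(aes(alpha = depth))
```

The mapped version also gets a legend for alpha, just like any other aesthetic.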


We could add a fitted line to the plot:

plot1 + geom_point(alpha=1/10) + geom_smooth(method="lm") 


##try changing "lm" to "loess".

There are other ways to deal with the over-plotting problem.

library(hexbin)
plot1 + stat_binhex()


plot1 + geom_bin2d()


plot1 + geom_density2d()
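
What the binning geoms do is replace individual points with counts per cell. The sketch below pulls those counts out of a binned plot with layer_data(); it uses the diamonds data from ggplot2 as a stand-in (an assumption), and bins = 30 is an arbitrary choice.

```r
library(ggplot2)

# diamonds ships with ggplot2 and is used here as a stand-in dataset (assumption).
binned <- ggplot(diamonds, aes(carat, price)) +
  geom_bin2d(bins = 30)

# layer_data() returns the data computed for a layer:
# here, one row per non-empty rectangular bin with its point count.
bin_counts <- layer_data(binned)
head(bin_counts[, c("xmin", "xmax", "ymin", "ymax", "count")])
```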


ggplot2 makes it very easy to incorporate qualitative variables. These can be used in several ways:
1. Facets: Each level of a factor can be plotted in its own panel.
2. Groups: Each level of a factor can be assigned its own group. For example, plotting fitted lines for each group through a scatter plot.
3. Appearance: Color, symbols, line weight, fill, and other variables can be assigned to a factor (qualitative variable).
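
A compact sketch of all three uses, with the built-in mtcars data as a stand-in (an assumption, not the county data) and the number of cylinders turned into a factor:

```r
library(ggplot2)

# mtcars is a stand-in dataset (assumption); cyl becomes the qualitative variable.
cars <- transform(mtcars, cyl = factor(cyl))
base <- ggplot(cars, aes(wt, mpg))

# 1. Facets: one panel per level of the factor.
by_facet <- base + geom_point() + facet_wrap(~ cyl)

# 2. Groups: one fitted line per level of the factor.
by_group <- base + geom_point() +
  geom_smooth(aes(group = cyl), method = "lm", formula = y ~ x, se = FALSE)

# 3. Appearance: colour and symbol both depend on the factor.
by_look <- base + geom_point(aes(colour = cyl, shape = cyl))
```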

Let's create a qualitative variable:

USA$good_states <- ifelse(USA$STATE_NAME %in% c("New York", "Massachusetts", "Rhode Island", "Wyoming"), yes="its good", no="its ok") 
USA$good_states <- as.factor(USA$good_states)

#MODIFY PLOT 1
plot2 <- ggplot(data=USA@data, aes(x=Obese, y=homevalu, color=good_states))
plot2+geom_point()


plot2 <- ggplot(data=USA@data, aes(x=Obese, y=homevalu, color=good_states, shape=good_states))
plot2 + stat_smooth() #default method="auto": loess for small groups, gam when the largest group has >=1000 observations
## geom_smooth: method="auto" and size of largest group is >=1000, so using
## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
## smoothing method.


plot2 + geom_point() + stat_smooth(method="lm", se=TRUE, lwd=.5, lty = 1)


#lwd controls line thickness
#lty controls line type: 1 = solid line, higher numbers are various forms of dashed lines.
#se=FALSE turns off the grey standard error envelopes.

Adding marginalia and changing the appearance of the plot is easy. Let's look at the percent college educated (pctcoled) and the per capita income (pcincome). These two variables have a correlation of r=.7, so our plots should show some sort of relationship.

plot3 <- ggplot(data=USA@data, aes(x=pctcoled, y=pcincome))
plot3 + geom_point() + 
  ylab("Per Capita Income") + 
  xlab("Percent College Educated") + 
  ggtitle("US Counties (2000)\nPercent College Educated by Per Capita Income")


It's easy to make a multidimensional plot. Let's say we wanted to add the unemployment variable to the plot by coloring the dots according to the unemployment rate.

plot4 <- ggplot(data=USA@data, aes(x=pctcoled, y=pcincome, color=unemploy)) + 
  geom_point() + 
  ylab("Per Capita Income") + 
  xlab("Percent College Educated") + 
  ggtitle("US Counties (2000)\nPercent College Educated by Per Capita Income") + 
  scale_color_gradient2("Unemployment", 
                        breaks= c(min(USA$unemploy), mean(USA$unemploy), max(USA$unemploy)),
                        labels= c("Below Average", "Average", "Above Average"),
                        low="green",
                        mid="yellow",
                        high="red",
                        midpoint = mean(USA$unemploy))
plot4
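
The key piece above is the diverging scale: scale_color_gradient2 blends from low through mid to high, and midpoint says which data value gets the mid color. Here is a stripped-down sketch of just that part, using mtcars as a stand-in (an assumption) with horsepower in place of unemployment:

```r
library(ggplot2)

# mtcars is a stand-in dataset (assumption); hp plays the role of unemployment.
hp_mean <- mean(mtcars$hp)

diverging <- ggplot(mtcars, aes(wt, mpg, colour = hp)) +
  geom_point() +
  scale_colour_gradient2("Horsepower",
                         low = "green", mid = "yellow", high = "red",
                         midpoint = hp_mean)

# The colours actually assigned to each point, as hex strings.
point_colours <- layer_data(diverging)$colour
```

Cars near the mean horsepower come out yellowish, while those well below or above it shift toward green or red.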


It's possible to split the plot into panels based upon the “good states” variable. We create “facets” or subplots that display only the data for each level of the factor:

plot4 + facet_grid(.~good_states)


#If you wanted to go crazy you could do:
#plot4 + facet_grid(.~ STATE_NAME)

I don't especially like these plots. Let's tweak the appearance of plot4. It is possible to change the “theme” used to display the plot. This is a newer feature of ggplot. It is also possible to do this by hand.

plot4 + theme_classic()
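
Doing it “by hand” means setting individual theme elements with theme(). The sketch below is only a rough approximation of a classic-style look, not a reproduction of theme_classic(), and it uses mtcars as a stand-in dataset (an assumption).

```r
library(ggplot2)

# A hand-rolled theme: blank panel, no grid lines, black axis lines.
# This is an illustrative approximation, not the definition of theme_classic().
by_hand <- theme(
  panel.background = element_blank(),
  panel.grid       = element_blank(),
  axis.line        = element_line(colour = "black")
)

# mtcars is a stand-in dataset (assumption).
hand_plot <- ggplot(mtcars, aes(wt, mpg)) + geom_point() + by_hand
```

Because by_hand is an ordinary object, it can be added to any plot the same way a built-in theme is.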


Full-blown custom themes can be used to make a consistent set of graphics for a presentation or paper. The ggthemes package provides a library of themes:

##Name a CRAN mirror explicitly so the install also works in a non-interactive session
install.packages('ggthemes', dependencies = TRUE, repos = "https://cloud.r-project.org")
library(ggthemes)
plot4 + theme_economist()


plot4 + theme_solarized() #OUCH!


plot4 + theme_tufte()


I have a custom theme that I often use in presentations:

sethTheme <- theme(
    panel.background = element_rect(fill = "black"),
    plot.background = element_rect(fill = "black"), 
    panel.grid.minor = element_blank(), 
    panel.grid.major = element_line(linetype = 3, colour = "white"), 
    axis.text.x = element_text(colour = "grey80"), 
    axis.text.y = element_text(colour = "grey80"), 
    axis.title.x = element_text(colour = "grey80"), 
    axis.title.y = element_text(colour = "grey80"), 
    legend.key = element_rect(fill = "black"), 
    legend.text = element_text(colour = "white"), 
    legend.title = element_text(colour = "black"), 
    legend.background = element_rect(fill = "black"),
    axis.ticks = element_blank()) 
plot4 + sethTheme


The plot title and legend title are not displayed. See if you can figure out how to fix them.

One of my favorite things about ggplot is the ggsave function, which makes it very easy to save plots in just about any graphics format. I've found that plots saved as PDFs move nicely into Adobe Illustrator. Using ggsave is simple: ggsave("path/plotName.png") saves a PNG file. To save a PDF file, simply change the extension: ggsave("path/plotName.pdf").
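
A small runnable sketch of ggsave, writing to a temporary directory so it does not touch your files. The mtcars data, the file location, and the 6 x 4 inch size are all assumptions made for illustration.

```r
library(ggplot2)

# mtcars is a stand-in dataset (assumption).
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# The file extension picks the output format; width and height are in inches by default.
out_pdf <- file.path(tempdir(), "plotName.pdf")
ggsave(out_pdf, plot = p, width = 6, height = 4)

file.exists(out_pdf)
```

Passing plot = p explicitly avoids relying on whichever plot was displayed last, which is what ggsave uses when no plot is given.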

Making Maps

These maps look much better when plotted. Maps are a pain in ggplot; here's how it's done:

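##Use fortify to extract polygon boundaries from the SpatialPolygonsDataFrame (it's slow)
usa_geom<- fortify(USA, region="FIPS")

##reattach data to polygon boundaries
usa_map_df <- merge(usa_geom, USA, by.x="id", by.y="FIPS")
##merge() can reorder rows; restore the vertex order so polygons draw correctly
usa_map_df <- usa_map_df[order(usa_map_df$order), ]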

##make a map of bush_pct
map1 <- ggplot(usa_map_df,  aes(long,lat,group=group)) + 
  geom_polygon(data=usa_map_df, aes(fill=Bush_pct)) +
  coord_equal() +
  scale_fill_gradient(low="yellow", high="red") + 
  geom_path(data=usa_geom, aes(long,lat,group=group), lty = 3, lwd=.1, color="white")
map1
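
The fortify-then-merge pattern is easier to follow on a toy example. The sketch below builds a hypothetical vertex table by hand (two unit squares, shaped like fortify() output) and a hypothetical attribute table, so it runs without any shapefile. All names and values in it are made up for illustration.

```r
library(ggplot2)

# Hypothetical vertex table shaped like fortify() output: one row per vertex.
geom_df <- data.frame(
  id    = rep(c("A", "B"), each = 4),
  group = rep(c("A.1", "B.1"), each = 4),
  order = 1:8,
  long  = c(0, 1, 1, 0,  1, 2, 2, 1),
  lat   = c(0, 0, 1, 1,  0, 0, 1, 1)
)

# Hypothetical attribute table: one row per region.
attr_df <- data.frame(id = c("A", "B"), value = c(10, 20))

# Attach attributes to every vertex, then put the vertices back in drawing order,
# since merge() does not promise to keep the original row order.
map_df <- merge(geom_df, attr_df, by = "id")
map_df <- map_df[order(map_df$order), ]

toy_map <- ggplot(map_df, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = value)) +
  coord_equal()
```

The group aesthetic is what tells geom_polygon which vertices belong to the same polygon.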


We can apply the “sethTheme”:

map1 + sethTheme


It's possible to create proper thematic maps with legend classes; ColorBrewer is even implemented in ggplot.

library(classInt)
classIntervals(USA$Bush_pct, n = 5, style = "quantile") #can use "jenks" but it is very slow
## style: quantile
##     [0,50.52) [50.52,58.07) [58.07,64.37) [64.37,71.31) [71.31,92.83] 
##           622           622           622           622           623
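breaks <- c(0, 50, 58, 64, 71, 93) #approximate quantiles
labels <- c('[0 - 50%]', '[50% - 58%]', '[58% - 64%]', '[64% - 71%]', '[71% - 93%]')
##include.lowest=TRUE keeps values of exactly 0 in the first class instead of turning them into NA
usa_map_df$bushBreaks <- cut(usa_map_df$Bush_pct, breaks=breaks, labels=labels, include.lowest=TRUE)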
map2 <- ggplot(aes(long,lat,group=group),data=usa_map_df) + 
  geom_polygon(data=usa_map_df, aes(fill=bushBreaks)) +
  coord_equal()
map2
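
One detail of cut() is worth checking when building legend classes: by default the intervals are open on the left, so a value equal to the lowest break ends up as NA unless include.lowest = TRUE is set. A tiny sketch with made-up percentages (the values and label text are assumptions for illustration):

```r
# Made-up percentages (assumption), including one value exactly at the lowest break.
pct    <- c(0, 12, 50, 51, 58, 64.5, 71, 93)
breaks <- c(0, 50, 58, 64, 71, 93)
labels <- c("0-50%", "50-58%", "58-64%", "64-71%", "71-93%")

# Default: intervals are (0,50], (50,58], ... so the value 0 falls outside every class.
default_classes <- cut(pct, breaks = breaks, labels = labels)

# include.lowest = TRUE closes the first interval on the left: [0,50], (50,58], ...
closed_classes <- cut(pct, breaks = breaks, labels = labels, include.lowest = TRUE)

is.na(default_classes[1])  # TRUE
anyNA(closed_classes)      # FALSE
```

On a map, those NA values would show up as polygons filled with the missing-value color rather than the first class.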


Assign a sensible color scale for the legend:

library(RColorBrewer)
map2 + scale_fill_brewer('Votes for Bush in 2004 (%)', palette = "YlGnBu") + 
  sethTheme +
  ggtitle("Votes for Bush in 2004 (%)") + 
  theme(plot.title = element_text(size=24, face = "bold", color="white", hjust=2))
