Lab Exercise with ggplot package

April 22, 2013

see http://rpubs.com/Seth/ggplot_5023

Mappings connect data variables to visual variables: a column of the data frame is tied to a visual property such as shape, size, position, or transparency.

aes = aesthetic. aes() is where the mappings are declared.

Call: ggplot(dataframe, aes(variable1, variable2, color = v4)) + geom_point(). The same mapping can be connected to any kind of plot, e.g. geom_line() or hexagonal bins; there are tons of geometries available in ggplot.

See docs.ggplot2.org/current for a full list of the available geometries. There are also “statistics” that apply transformations to your data, e.g. you can say, just put a regression line through this. There is also density, etc.
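As a minimal sketch of the call pattern described above, the example below uses R's built-in mtcars data rather than the lab's county data (an assumption, made so that the snippet runs on its own). It maps two variables to position and a third to color, then adds a point geometry and a regression-line statistic.

```r
library(ggplot2)

# mtcars ships with R; it stands in for the lab's data frame here.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +                                           # geometry: one point per row
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)  # statistic: one fitted line per color group
p
```

Swapping geom_point() for another geometry changes how the same mapping is drawn; the aes() call does not need to change.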

ggplot gets you thinking about variables in lots of different ways, and it is really fun to play around with. Seth likes SAS or Stata better for regressions, but they don't have anything like ggplot. You can also export plots in formats that can be further manipulated in Illustrator.

So, ggplot is new and very, very popular.

## Setting repos avoids the error "trying to use CRAN without setting a mirror"
install.packages("ggplot2", dependencies = TRUE, repos = "https://cloud.r-project.org")
install.packages("maptools", dependencies = TRUE, repos = "https://cloud.r-project.org")
library(ggplot2)
library(maptools)
## Loading required package: foreign
## Warning: package 'foreign' was built under R version 2.15.3
## Loading required package: sp
## Warning: package 'sp' was built under R version 2.15.3
## Loading required package: grid
## Loading required package: lattice
## Checking rgeos availability: TRUE
USA <- readShapePoly("/Users/caitlin/Dropbox/CAITLINS DOCUMENTS/CU Boulder/Courses/GEOG 5023 Quant methods - Spielman/Data/USA copy.shp")
## Remove count fields and rows with missing data
USA <- USA[, c(1:8, 14:30)]
USA <- na.omit(USA)

For example, a simple scatter plot can be made in ggplot by mapping one variable to the x-coordinate and one variable to the y-coordinate.

plot1 <- ggplot(data = USA@data, aes(x = Obese, y = homevalu))

Notice the above line doesn't draw anything. That's because we have to assign a “geom” to the aesthetic mapping specified by aes().

plot1 + geom_point()

Now, we've visualized our aesthetic mapping using a point geometry. It appears there is not much of a relationship between obesity and home values but the points are clumped together and over-plotted. We can transform the coordinates:

plot1 + geom_point() + scale_x_log10() + scale_y_log10()

Using “alpha” we can make dots transparent. We can assign the transparency on the basis of a variable or a fixed value. Below I assign alpha=1/10 which means that 10 dots will have to be plotted atop each other to generate an opaque black point. Notice that most values are concentrated near the mean of “Obese” and “homevalu”.

## add transparency to the points to make overplotting visible.
plot1 + geom_point(alpha = 1/10) + scale_x_log10() + scale_y_log10()
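The difference between a fixed transparency and one driven by a variable can be sketched as follows. The data frame d and its column w are simulated and hypothetical; they are not part of the lab data.

```r
library(ggplot2)

set.seed(1)
# Simulated data (hypothetical), clumped so that points overplot.
d <- data.frame(x = rnorm(2000), y = rnorm(2000), w = runif(2000))

# Fixed value: alpha goes outside aes(), so every point gets the same transparency.
p_fixed <- ggplot(d, aes(x, y)) + geom_point(alpha = 1/10)

# Variable: alpha goes inside aes(), so transparency is scaled from column w.
p_mapped <- ggplot(d, aes(x, y, alpha = w)) + geom_point()

p_fixed
p_mapped
```

Only the mapped version produces a legend, because only there does alpha carry information from the data.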

We could add a fitted line to the plot:

plot1 + geom_point(alpha = 1/10) + geom_smooth(method = "lm")  #grey area is the standard error

##try changing “lm” to “loess”.

There are other ways to deal with the over-plotting problem.

## Setting repos avoids the error "trying to use CRAN without setting a mirror"
install.packages("hexbin", repos = "https://cloud.r-project.org")
library(hexbin)
plot1 + stat_binhex()

plot1 + geom_bin2d()

plot1 + geom_density2d()

ggplot2 makes it very easy to incorporate qualitative variables. These can be used in several ways:

1. Facets: Each level of a factor can be plotted in its own panel.
2. Groups: Each level of a factor can be assigned its own group. For example, plotting fitted lines for each group through a scatter plot.
3. Appearance: Color, symbols, line weight, fill, and other variables can be assigned to a factor (qualitative variable).

Let's create a qualitative variable:

USA$good_states <- ifelse(USA$STATE_NAME %in% c("New York", "Massachusetts", 
    "Rhode Island", "Wyoming"), yes = "its good", no = "its ok")
USA$good_states <- as.factor(USA$good_states)
# MODIFY PLOT 1
plot2 <- ggplot(data = USA@data, aes(x = Obese, y = homevalu, color = good_states))
plot2 + geom_point()  # each dot is a county; the states Seth likes are kind of in the middle...

plot2 <- ggplot(data = USA@data, aes(x = Obese, y = homevalu, color = good_states, 
    shape = good_states))
plot2 + stat_smooth()  # default smoother: a local (loess) fit for small data, but a GAM here (see the message below)
## geom_smooth: method="auto" and size of largest group is >=1000, so using
## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
## smoothing method.

plot2 + geom_point() + stat_smooth(method = "lm", se = TRUE, lwd = 0.5, lty = 1)

# lwd controls line thickness. lty controls line type: 1 = solid line, higher
# numbers give various forms of dashed lines. se can be used to turn off the
# grey standard error envelopes.
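A small sketch of those smoother options, using the built-in mtcars data as a stand-in for the county data (an assumption); the am column plays the role of the good_states factor.

```r
library(ggplot2)

# One fitted line per level of the factor, because color is mapped to it.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(am))) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x,
              se = FALSE,     # no grey standard error envelope
              linetype = 2)   # dashed line
p
```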

Adding marginalia and changing the appearance of the plot is easy. Let's look at the percent college educated (pctcoled) and the per capita income (pcincome). These two variables have a correlation of r = .7, so our plots should show some sort of relationship.

plot3 <- ggplot(data = USA@data, aes(x = pctcoled, y = pcincome))
plot3 + geom_point() + ylab("Per Capita Income") + xlab("Percent College Educated") + 
    ggtitle("US Counties (2000)\nPercent College Educated by Per Capita Income")

It's easy to make a multidimensional plot. Let's say we wanted to add the unemployment variable to the plot by changing the color of the dots based on the unemployment rate.

plot4 <- ggplot(data = USA@data, aes(x = pctcoled, y = pcincome, color = unemploy)) + 
    geom_point() + ylab("Per Capita Income") + xlab("Percent College Educated") + 
    ggtitle("US Counties (2000)\nPercent College Educated by Per Capita Income") + 
    scale_color_gradient2("Unemployment", breaks = c(min(USA$unemploy), mean(USA$unemploy), 
        max(USA$unemploy)), labels = c("Below Average", "Average", "Above Average"), 
        low = "green", mid = "yellow", high = "red", midpoint = mean(USA$unemploy))
plot4

# Plot an x and a y variable, and then color the dots according to a third
# variable, hoping to see a pattern in the variables as well as in the
# colors. The color of the dots is the unemployment rate. Colors are manually
# specified: green is below average, yellow is around average, and red is
# above average unemployment. The problem is that there is a long tail; this
# could be tweaked to make it better. Plotting order may be hiding some reds
# or greens where points overlap. You can use geom_jitter, which adds a small
# amount of random noise to the point positions, to help see the pattern.
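The jitter idea can be sketched with the mpg data that ships with ggplot2 (a stand-in for the county data, chosen because its integer values overplot heavily). The width, height, and seed values are arbitrary choices for illustration.

```r
library(ggplot2)

# cty and hwy are integers, so many points sit exactly on top of each other.
p_plain <- ggplot(mpg, aes(cty, hwy)) + geom_point(alpha = 1/3)

# geom_jitter draws the same points after adding a little random noise;
# a seed makes the random offsets reproducible.
p_jitter <- ggplot(mpg, aes(cty, hwy)) +
  geom_jitter(alpha = 1/3,
              position = position_jitter(width = 0.3, height = 0.3, seed = 42))

p_plain
p_jitter
```

Jitter changes where points are drawn, so it suits a quick look at density more than reading off exact values.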

It's possible to split the plot into panels based upon the “good states” variable. We create “facets” or subplots that display only the data for each level of the factor:

plot4 + facet_grid(. ~ good_states)

# facet means split the plot based upon good states and bad states
# If you wanted to go crazy you could do:
plot4 + facet_grid(. ~ STATE_NAME)  # you could do this for all 50 states! You could also try facet_wrap: instead of drawing all 50 charts in a row, it wraps them onto several rows.
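A minimal facet_wrap sketch, again on the mpg data from ggplot2 rather than the county data (an assumption); the choice of three columns is arbitrary.

```r
library(ggplot2)

# One panel per vehicle class, wrapped into rows of three
# instead of a single long row.
p_wrap <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~ class, ncol = 3)
p_wrap
```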

I don't especially like these plots. Let's tweak the appearance of plot4. It is possible to change the “theme” used to display the plot. This is a newer feature of ggplot. It is also possible to do this by hand.

plot4 + theme_classic()

# You can set up your own style for plots. This one is called
# 'theme_classic', and a library of themes is emerging in the ggthemes
# package. Its author has tried to make themes that mirror major
# publications and styles, like the WSJ, the Economist, and Tufte!

Full blown custom themes can be used to make a consistent set of graphics for a presentation or paper. Someone has created a library of themes:

## Setting repos avoids the error "trying to use CRAN without setting a mirror"
install.packages("ggthemes", dependencies = TRUE, repos = "https://cloud.r-project.org")
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 2.15.3
plot4 + theme_economist()

plot4 + theme_solarized()  #OUCH!

plot4 + theme_tufte()

You can spit things out of ggplot as a pdf, or you can do it in a file format that can be used in Illustrator for further manipulation. Once you have a theme, you can apply it to any plot.
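One way to write a plot to a PDF is ggsave(), sketched here with the built-in mtcars data. The file name and the plot size are hypothetical choices, and a temporary directory is used only to keep the snippet self-contained.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Hypothetical output location for this sketch.
out_file <- file.path(tempdir(), "example_plot.pdf")

# The file type is taken from the extension; width and height are in inches by default.
ggsave(out_file, plot = p, width = 7, height = 5)
```

PDF is a vector format, so the saved plot should remain editable as shapes and text in a drawing program.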

I have made a custom theme that I often use in presentations:

sethTheme <- theme(
    panel.background = element_rect(fill = "black"),
    plot.background = element_rect(fill = "black"), 
    panel.grid.minor = element_blank(), 
    panel.grid.major = element_line(linetype = 3, colour = "white"), 
    axis.text.x = element_text(colour = "grey80"), 
    axis.text.y = element_text(colour = "grey80"), 
    axis.title.x = element_text(colour = "grey80"), 
    axis.title.y = element_text(colour = "grey80"), 
    legend.key = element_rect(fill = "black"), 
    legend.text = element_text(colour = "white"), 
    legend.title = element_text(colour = "grey80"), 
    legend.background = element_rect(fill = "black"), #change to blue, shows legend title
    axis.ticks = element_blank(), 
    title = element_text(colour="grey80"))
plot4 + sethTheme

Lab exercise: the problem with Seth's theme is that the titles are gone; they are probably black. Exercise: fix this problem, and you'll have a pretty good idea of how themes work. Draw the plot title and the legend title in white. (DONE!)

Themes are not that well documented online, but there are good books. Look at the help file for theme (?theme in the console) and you can see everything that can be manipulated. Use it to figure out how to modify Seth's theme.
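One possible approach to adjusting a theme, sketched on a cut-down, hypothetical dark theme rather than Seth's full one: theme elements added with + override or extend the ones that came before. The mtcars data and the title text are stand-ins.

```r
library(ggplot2)

# A cut-down, hypothetical dark theme (not Seth's full theme).
dark_theme <- theme(
  plot.background   = element_rect(fill = "black"),
  panel.background  = element_rect(fill = "black"),
  legend.background = element_rect(fill = "black"),
  legend.text       = element_text(colour = "white"),
  axis.text         = element_text(colour = "grey80"),
  axis.title        = element_text(colour = "grey80"))

# Elements added later override or extend the earlier ones.
dark_theme_titles <- dark_theme +
  theme(plot.title   = element_text(colour = "white"),
        legend.title = element_text(colour = "white"))

p <- ggplot(mtcars, aes(wt, mpg, color = hp)) +
  geom_point() +
  ggtitle("Title drawn in white") +
  dark_theme_titles
p
```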

###MAKING MAPS

These maps look much better when plotted. Maps are a pain in ggplot. Here's how it's done:

## Use fortify to extract polygon boundaries from the SpatialPolygonsDataFrame (it's slow)
library(rgeos)
## Warning: package 'rgeos' was built under R version 2.15.3
## rgeos version: 0.2-16, (SVN revision 389) GEOS runtime version:
## 3.3.3-CAPI-1.7.4 Polygon checking: TRUE
usa_geom<- fortify(USA, region="FIPS") 

## Reattach data to polygon boundaries
usa_map_df <- merge(usa_geom, USA, by.x="id", by.y="FIPS")

##make a map of bush_pct
map1 <- ggplot(usa_map_df,  aes(long,lat,group=group)) + #setting x and y coordinates using lat and long data
  geom_polygon(data=usa_map_df, aes(fill=Bush_pct)) + #filling the polygons created above with the data
  coord_equal() +
  scale_fill_gradient(low="yellow", high="red") + 
  geom_path(data=usa_geom, aes(long,lat,group=group), lty = 3, lwd=.1, color="white")
map1
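The reattach step can be sketched on toy data in base R. Both data frames below are hypothetical stand-ins: geom_df imitates the kind of table fortify() returns (one row per polygon vertex, with id, order, and group columns) and attr_df imitates the attribute table. Re-sorting by order after the merge is a precaution added here, not something the lab code does.

```r
# Toy stand-in for fortified geometry: two triangles, one row per vertex.
geom_df <- data.frame(
  id    = c("A", "A", "A", "B", "B", "B"),
  long  = c(0, 1, 0.5, 2, 3, 2.5),
  lat   = c(0, 0, 1, 0, 0, 1),
  order = 1:6,
  group = rep(c("A.1", "B.1"), each = 3),
  stringsAsFactors = FALSE)

# Toy stand-in for the attribute table, keyed by a differently named column.
attr_df <- data.frame(FIPS = c("B", "A"), value = c(10, 20),
                      stringsAsFactors = FALSE)

# by.x / by.y name the key column in each table; every vertex row
# receives the attributes of its polygon.
map_df <- merge(geom_df, attr_df, by.x = "id", by.y = "FIPS")

# merge() may reorder rows; restore vertex order before drawing polygons.
map_df <- map_df[order(map_df$order), ]
map_df
```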

We can apply the “sethTheme”

map1 + sethTheme

It's possible to create proper thematic maps with legend classes. ColorBrewer is even implemented in ggplot.

library(classInt)
## Loading required package: class
## Loading required package: e1071
classIntervals(USA$Bush_pct, n = 5, style = "quantile")
## style: quantile
##     [0,50.52) [50.52,58.07) [58.07,64.37) [64.37,71.31) [71.31,92.83] 
##           622           622           622           622           623

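breaks <- c(0, 50, 58, 64, 71, 93)  # approximate quantiles
labels <- c("[0 - 50%]", "[50% - 58%]", "[58% - 64%]", "[64% - 71%]", "[71% - 93%]")
## include.lowest = TRUE keeps values of exactly 0 in the first class instead of turning them into NA
usa_map_df$bushBreaks <- cut(usa_map_df$Bush_pct, breaks = breaks, labels = labels, include.lowest = TRUE)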
map2 <- ggplot(aes(long, lat, group = group), data = usa_map_df) + geom_polygon(data = usa_map_df, 
    aes(fill = bushBreaks)) + coord_equal()
map2
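The classing step above can also be sketched with base R alone, computing quantile breaks instead of typing them in. The pct vector is simulated and hypothetical; it includes an exact 0 to show why the lowest break needs care.

```r
set.seed(1)
# Simulated percentages (hypothetical), with an exact 0 as the minimum.
pct <- c(0, runif(99, min = 20, max = 93))

# Five quantile classes need six break points.
brks <- quantile(pct, probs = seq(0, 1, by = 0.2))

# include.lowest = TRUE puts the minimum value in the first class;
# without it, cut() returns NA for that observation.
classes        <- cut(pct, breaks = brks, include.lowest = TRUE)
classes_no_low <- cut(pct, breaks = brks)

table(classes)
sum(is.na(classes_no_low))
```

Rounding the computed breaks by hand, as done in the lab code, gives tidier legend labels at the cost of slightly unequal class sizes.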

Assign a sensible color scale for the legend:

library(RColorBrewer)
map2 + scale_fill_brewer("Votes for Bush in 2004 (%)", palette = "YlGnBu") + 
    sethTheme + ggtitle("Votes for Bush in 2004 (%)") + theme(plot.title = element_text(size = 24, 
    face = "bold", color = "white", hjust = 2))
