April 22, 2013
see http://rpubs.com/Seth/ggplot_5023
mappings: connect data variables to visual variables (columns of a data frame to things like shape, size, position, transparency, etc.)
aes = aesthetic. aes() defines a mapping from data variables to visual properties.
call: ggplot(dataframe, aes(variable1, variable2, color = v4)) + geom_point(). The same mapping can be drawn with any kind of geometry, e.g. geom_line, hexbin… there are tons of geometries available in ggplot.
see docs.ggplot2.org/current for a full list of the available geometries. There are also “statistics” that apply transformations to your data, e.g. you can say, just put a regression line through this. There's also density, etc.
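A minimal sketch of the mapping / geometry / statistic idea, using the built-in mtcars data as a stand-in (an assumption; it is not the course data). The object names are made up for illustration.

```r
library(ggplot2)

# mtcars ships with R; it stands in for the course data here (assumption).
# aes() maps weight to x, fuel economy to y, and cylinder count to color.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl)))

# The same mapping can be drawn with different geometries...
p_points <- p + geom_point()
p_lines  <- p + geom_line()

# ...and a statistic can be layered on top, e.g. a linear regression line per group.
p_fit <- p_points + stat_smooth(method = "lm", formula = y ~ x)

print(p_fit)
```

Nothing about the data changes between the three plots; only the geometry or statistic added to the mapping does.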
ggplot gets you thinking about variables in lots of different ways. It's really fun to play around with it. Seth likes SAS or Stata better for regressions, but they don't have anything like ggplot. You can also export plots into formats that can be further manipulated in Illustrator.
so, ggplot is new and very, very popular.
install.packages("ggplot2", dependencies = TRUE, repos = "https://cloud.r-project.org")
# repos is needed when this runs non-interactively (e.g. while knitting); without
# it the call fails with "Error: trying to use CRAN without setting a mirror"
install.packages("maptools", dependencies = TRUE, repos = "https://cloud.r-project.org")
# repos set explicitly to avoid "Error: trying to use CRAN without setting a mirror"
library(ggplot2)
library(maptools)
## Loading required package: foreign
## Warning: package 'foreign' was built under R version 2.15.3
## Loading required package: sp
## Warning: package 'sp' was built under R version 2.15.3
## Loading required package: grid
## Loading required package: lattice
## Checking rgeos availability: TRUE
USA <- readShapePoly("/Users/caitlin/Dropbox/CAITLINS DOCUMENTS/CU Boulder/Courses/GEOG 5023 Quant methods - Spielman/Data/USA copy.shp")
## Remove count fields and rows with missing data
USA <- USA[, c(1:8, 14:30)]
USA <- na.omit(USA)
For example, a simple scatter plot can be made in ggplot by mapping one variable to the x-coordinate and one variable to the y-coordinate.
plot1 <- ggplot(data = USA@data, aes(x = Obese, y = homevalu))
Notice the above line doesn't draw anything. That's because we have to assign a “geom” to the aesthetic mapping specified by aes().
plot1 + geom_point()
Now, we've visualized our aesthetic mapping using a point geometry. It appears there is not much of a relationship between obesity and home values but the points are clumped together and over-plotted. We can transform the coordinates:
plot1 + geom_point() + scale_x_log10() + scale_y_log10()
Using “alpha” we can make dots transparent. We can assign the transparency on the basis of a variable or a fixed value. Below I assign alpha=1/10 which means that 10 dots will have to be plotted atop each other to generate an opaque black point. Notice that most values are concentrated near the mean of “Obese” and “homevalu”.
## add transparency to the points to make overplotting visible.
plot1 + geom_point(alpha = 1/10) + scale_x_log10() + scale_y_log10()
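To illustrate the difference between a fixed transparency and one driven by a variable, here is a small sketch. It assumes the diamonds data that ships with ggplot2 as a stand-in; the variable choice (depth) is arbitrary.

```r
library(ggplot2)

# diamonds ships with ggplot2 and is heavily over-plotted (assumed stand-in data).
base <- ggplot(diamonds, aes(x = carat, y = price))

# Fixed value: alpha is set outside aes(), so every point gets the same transparency.
p_fixed <- base + geom_point(alpha = 1/10)

# Variable: alpha is mapped inside aes(), so transparency varies with depth.
p_mapped <- base + geom_point(aes(alpha = depth))
```

The rule of thumb is that anything inside aes() is read from the data, and anything outside it is a constant.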
We could add a fitted line to the plot:
plot1 + geom_point(alpha = 1/10) + geom_smooth(method = "lm") # grey band is the 95% confidence interval around the fit
## try changing "lm" to "loess".
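The suggestion to swap the fitting method can be tried on any small data set. This sketch assumes the built-in mtcars data rather than the county data.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Straight-line fit; the grey band is the confidence band around the fit.
p_lm <- p + geom_smooth(method = "lm", formula = y ~ x)

# Local regression follows curvature that a straight line would miss.
p_loess <- p + geom_smooth(method = "loess", formula = y ~ x)
```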
There are other ways to deal with the over-plotting problem.
install.packages("hexbin", repos = "https://cloud.r-project.org")
# repos set explicitly to avoid "Error: trying to use CRAN without setting a mirror"
library(hexbin)
plot1 + stat_binhex()
plot1 + geom_bin2d()
plot1 + geom_density2d()
ggplot2 makes it very easy to incorporate qualitative variables. These can be used in several ways:
1. Facets: Each level of a factor can be plotted in its own panel.
2. Groups: Each level of a factor can be assigned its own group. For example, plotting fitted lines for each group through a scatter plot.
3. Appearance: Color, symbols, line weight, fill, and other variables can be assigned to a factor (qualitative variable).
Let's create a qualitative variable:
USA$good_states <- ifelse(USA$STATE_NAME %in% c("New York", "Massachusetts",
"Rhode Island", "Wyoming"), yes = "its good", no = "its ok")
USA$good_states <- as.factor(USA$good_states)
# MODIFY PLOT 1
plot2 <- ggplot(data = USA@data, aes(x = Obese, y = homevalu, color = good_states))
plot2 + geom_point() #each dot is a county. the states seth likes are kind of in the middle...
plot2 <- ggplot(data = USA@data, aes(x = Obese, y = homevalu, color = good_states,
shape = good_states))
plot2 + stat_smooth() # default method = "auto"; with groups this large it fits a GAM rather than a local (loess) fit, as the message below shows
## geom_smooth: method="auto" and size of largest group is >=1000, so using
## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
## smoothing method.
plot2 + geom_point() + stat_smooth(method = "lm", se = TRUE, lwd = 0.5, lty = 1)
# lwd controls line thickness. lty controls line type: 1 = solid line, higher
# numbers give various forms of dashed lines. se = FALSE can be used to turn
# off the grey standard error envelopes.
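The options described in the comments above can be tried on a small example. This is a sketch on the built-in mtcars data (an assumption), with one fitted line per group.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(am))) +
  geom_point() +
  # se = FALSE drops the grey envelope; lty = 2 draws dashed lines.
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE, lty = 2)

print(p)
```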
Adding marginalia and changing the appearance of the plot is easy. Let's look at the percent college educated (pctcoled) and the per capita income (pcincome). These two variables have a correlation of r = .7, so our plots should show some sort of relationship.
plot3 <- ggplot(data = USA@data, aes(x = pctcoled, y = pcincome))
plot3 + geom_point() + ylab("Per Capita Income") + xlab("Percent College Educated") +
ggtitle("US Counties (2000)\nPercent College Educated by Per Capita Income")
It's easy to make a multidimensional plot. Let's say we wanted to add the unemployment variable to the plot by changing the color of the dots based on the unemployment rate.
plot4 <- ggplot(data = USA@data, aes(x = pctcoled, y = pcincome, color = unemploy)) +
geom_point() + ylab("Per Capita Income") + xlab("Percent College Educated") +
ggtitle("US Counties (2000)\nPercent College Educated by Per Capita Income") +
scale_color_gradient2("Unemployment", breaks = c(min(USA$unemploy), mean(USA$unemploy),
max(USA$unemploy)), labels = c("Below Average", "Average", "Above Average"),
low = "green", mid = "yellow", high = "red", midpoint = mean(USA$unemploy))
plot4
# plot an x and y variable, and then color the dots according to a third
# variable. Hoping to see a pattern in the variables, as well as in the
# colors. Color of the dots is the unemployment rate. Colors are manually
# specified. Green is below average, yellow is around average, and red is
# above average unemployment. The problem is there is a long tail - this
# could be tweaked to make it better. Plotting order may be hiding some
# reds or greens where points overlap. You can use geom_jitter, which adds
# a little random noise to each point's position, to help see the pattern
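Two hedged ways to deal with points hiding each other, sketched on the mpg data that ships with ggplot2 (a stand-in, not the county data). The jitter amounts are arbitrary choices.

```r
library(ggplot2)

# cty and hwy are integers in mpg, so many points land on exactly the same spot.
set.seed(42)  # jitter is random; a seed makes the plot reproducible
p_jitter <- ggplot(mpg, aes(x = cty, y = hwy, color = displ)) +
  geom_jitter(width = 0.3, height = 0.3, alpha = 0.6)

# Alternative: points are drawn in row order, so sorting by the color variable
# puts the highest values on top instead of leaving it to chance.
mpg_sorted <- mpg[order(mpg$displ), ]
p_sorted <- ggplot(mpg_sorted, aes(x = cty, y = hwy, color = displ)) +
  geom_point()
```

Jitter changes where points appear, so it suits exploration better than final figures where exact positions matter.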
It's possible to split the plot into panels based upon the “good states” variable. We create “facets” or subplots that display only the data for each level of the factor:
plot4 + facet_grid(. ~ good_states)
# facet means split the plot based upon good states and bad states
# If you wanted to go crazy you could do:
plot4 + facet_grid(. ~ STATE_NAME) #you could do this for all 50 states! could try facet_wrap - instead of drawing all 50 charts in a row, it will wrap them around...
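A small sketch of the facet_wrap alternative mentioned above, using the mpg data from ggplot2 as a stand-in with fewer panels than 50 states. The choice of four columns is arbitrary.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# facet_grid(. ~ class) would put all seven panels in a single row;
# facet_wrap() flows them onto several rows instead.
p_wrap <- p + facet_wrap(~ class, ncol = 4)

print(p_wrap)
```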
I don't especially like these plots. Let's tweak the appearance of plot4. It is possible to change the “theme” used to display the plot. This is a newer feature of ggplot. It is also possible to do this by hand.
plot4 + theme_classic()
# you can set up your own style for plots. this one is called
# 'theme_classic', and there is a collection of themes emerging in the
# ggthemes library. Its author has tried to make themes that mirror major
# publications and styles, like the WSJ, the Economist, and Tufte!
Full blown custom themes can be used to make a consistent set of graphics for a presentation or paper. Someone has created a library of themes:
install.packages("ggthemes", dependencies = TRUE, repos = "https://cloud.r-project.org")
# repos set explicitly to avoid "Error: trying to use CRAN without setting a mirror"
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 2.15.3
plot4 + theme_economist()
plot4 + theme_solarized() #OUCH!
plot4 + theme_tufte()
You can spit things out of ggplot as a pdf, or you can do it in a file format that can be used in Illustrator for further manipulation. Once you have a theme, you can apply it to any plot.
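One way to do the export is ggsave(), sketched here on the built-in mtcars data. The file name and plot size are hypothetical choices, and the file goes to a temporary directory.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# PDF is a vector format, so the result stays editable in Illustrator.
out_file <- file.path(tempdir(), "scatter.pdf")    # hypothetical file name
ggsave(out_file, plot = p, width = 6, height = 4)  # size in inches

# A theme stored in a variable can be reused on any plot.
my_theme <- theme_bw() + theme(legend.position = "bottom")
p2 <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point() +
  my_theme
```

ggsave() picks the output format from the file extension, so changing ".pdf" to ".png" would write a raster image instead.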
I have a custom theme that I often use in presentations:
sethTheme <- theme(
panel.background = element_rect(fill = "black"),
plot.background = element_rect(fill = "black"),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(linetype = 3, colour = "white"),
axis.text.x = element_text(colour = "grey80"),
axis.text.y = element_text(colour = "grey80"),
axis.title.x = element_text(colour = "grey80"),
axis.title.y = element_text(colour = "grey80"),
legend.key = element_rect(fill = "black"),
legend.text = element_text(colour = "white"),
legend.title = element_text(colour = "grey80"),
legend.background = element_rect(fill = "black"), #change to blue, shows legend title
axis.ticks = element_blank(),
title = element_text(colour="grey80"))
plot4 + sethTheme
Lab exercise: the problem with Seth's theme is that the titles are gone - they are probably drawn in black. Exercise: fix this problem, and you'll have a pretty good idea of how themes work. Draw the plot title and legend title in white. (DONE!)
Themes are not that well documented online, but there are good books. Look at the help file for theme (?theme in the console) to see everything that can be manipulated, and use it to figure out how to modify Seth's theme.
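A sketch of how individual theme elements can be overridden, using a toy dark theme and the built-in mtcars data. Both are assumptions for illustration and not necessarily how the exercise was solved in class.

```r
library(ggplot2)

# A toy dark theme: black backgrounds, nothing else changed.
dark <- theme(
  plot.background  = element_rect(fill = "black"),
  panel.background = element_rect(fill = "black"),
  legend.background = element_rect(fill = "black")
)

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +
  geom_point() +
  ggtitle("Example title") +
  dark

# A later theme() call overrides earlier settings element by element,
# so the titles can be recolored without rewriting the whole theme.
p_fixed <- p + theme(
  plot.title   = element_text(colour = "white"),
  legend.title = element_text(colour = "white")
)

print(p_fixed)
```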
###MAKING MAPS
These maps look much better when plotted. Maps are a pain in ggplot. Here's how it's done:
## Use fortify to extract polygon boundaries from the SpatialPolygonsDataFrame (it's slow)
library(rgeos)
## Warning: package 'rgeos' was built under R version 2.15.3
## rgeos version: 0.2-16, (SVN revision 389) GEOS runtime version:
## 3.3.3-CAPI-1.7.4 Polygon checking: TRUE
usa_geom<- fortify(USA, region="FIPS")
## reattach data to polygon boundaries
usa_map_df <- merge(usa_geom, USA, by.x="id", by.y="FIPS")
##make a map of bush_pct
map1 <- ggplot(usa_map_df, aes(long,lat,group=group)) + #setting x and y coordinates using lat and long data
geom_polygon(data=usa_map_df, aes(fill=Bush_pct)) + #filling the polygons created above with the data
coord_equal() +
scale_fill_gradient(low="yellow", high="red") +
geom_path(data=usa_geom, aes(long,lat,group=group), lty = 3, lwd=.1, color="white")
map1
We can apply the “sethTheme”:
map1 + sethTheme
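One caveat about the merge() step used to reattach the data: merge() sorts its result on the key column, so polygon vertices may end up out of drawing order. The sketch below shows the usual remedy on toy data frames with made-up values; only the id and order column names follow what fortify() returns.

```r
# Toy stand-ins for the fortified boundaries and the attribute table (made-up values).
geom <- data.frame(
  id    = c("b", "b", "b", "a", "a", "a"),
  order = 1:6,
  long  = c(0, 1, 1, 2, 3, 3),
  lat   = c(0, 0, 1, 0, 0, 1)
)
attrs <- data.frame(id = c("a", "b"), value = c(10, 20))

merged <- merge(geom, attrs, by = "id")

# merge() sorted the rows by id; restore the original vertex sequence so that
# geom_polygon() connects the points in the intended order.
merged <- merged[order(merged$order), ]
```

If a map drawn from merged data shows stray lines across polygons, row order is a likely cause.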
It's possible to create proper thematic maps with legend classes. ColorBrewer is even implemented in ggplot.
library(classInt)
## Loading required package: class
## Loading required package: e1071
classIntervals(USA$Bush_pct, n = 5, style = "quantile")
## style: quantile
## [0,50.52) [50.52,58.07) [58.07,64.37) [64.37,71.31) [71.31,92.83]
## 622 622 622 622 623
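breaks <- c(0, 50, 58, 64, 71, 93) #approximate quantiles
labels <- c("[0 - 50%]", "[50% - 58%]", "[58% - 64%]", "[64% - 71%]", "[71% - 93%]")
# include.lowest = TRUE keeps counties with exactly 0% in the first class instead of turning them into NA
usa_map_df$bushBreaks <- cut(usa_map_df$Bush_pct, breaks = breaks, labels = labels, include.lowest = TRUE)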
map2 <- ggplot(aes(long, lat, group = group), data = usa_map_df) + geom_polygon(data = usa_map_df,
aes(fill = bushBreaks)) + coord_equal()
map2
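The class breaks above were typed in by hand from the classIntervals() output. As a hedged alternative, quantile breaks can be computed in base R. The data here are made-up percentages, and the result should only roughly mirror what classInt reports.

```r
set.seed(1)
pct <- c(0, runif(99, min = 0, max = 93))  # made-up percentages, including a 0

# Six break points give five quantile classes.
brks <- quantile(pct, probs = seq(0, 1, by = 0.2))

# Without include.lowest = TRUE the minimum value falls outside the first
# interval and becomes NA.
classes <- cut(pct, breaks = brks, include.lowest = TRUE)
table(classes)
```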
Assign a sensible color scale for the legend:
library(RColorBrewer)
map2 + scale_fill_brewer("Votes for Bush in 2004 (%)", palette = "YlGnBu") +
sethTheme + ggtitle("Votes for Bush in 2004 (%)") + theme(plot.title = element_text(size = 24,
face = "bold", color = "white", hjust = 2))