ggplot2 Figures from the Gapminder Data Set
We will use the gapminder data set. In the previous homework, we created plots using the lattice package. In this homework, we will be visualizing the gapminder data set using the ggplot2 package.
We load the data and perform a subset operation to remove the Oceania continent, as recommended by Jenny. A quick sanity check is done by checking the structure of the data set with str function after importing and subsetting the data.
str(gDat)
## 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
ggplot2 Figures from the Gapminder Data SetWe re-vist one visualization task from the previous homework to compare the lattice plot with the ggplot2 plot. This task is from Jenny's menu of data aggregation tasks (link to task). We alter the task very slightly by including the 10% trimmed mean in addition to the minimum and maximum GDP per capita.
We begin by using ddply to aggregate the data. We use a custom function in ddply to retrieve the countries with the minimum and maximum GDP per capita for each year.
# Min/Max GDP by year and Country
minMaxGDP <- ddply(gDat, ~year + continent, .drop = FALSE, function(x) {
minCountry = as.character(x$country[which.min(x$gdpPercap)])
minGdp = min(x$gdpPercap)
maxCountry = as.character(x$country[which.max(x$gdpPercap)])
maxGdp = max(x$gdpPercap)
trimMean = mean(x$gdpPercap, trim = 0.1)
return(data.frame(gdpPercap = c(minGdp, maxGdp, trimMean), country = c(minCountry,
maxCountry, ""), stat = c("minGdpPercap", "maxGdpPercap", "trimMean")))
})
We display the aggregated data in a “long” table below. To save some space, we display only the first 20 rows of the 144 rows in the table.
| year | continent | gdpPercap | country | stat |
|---|---|---|---|---|
| 1952 | Africa | 298.85 | Lesotho | minGdpPercap |
| 1952 | Africa | 4725.30 | South Africa | maxGdpPercap |
| 1952 | Africa | 1085.16 | trimMean | |
| 1952 | Americas | 1397.72 | Dominican Republic | minGdpPercap |
| 1952 | Americas | 13990.48 | United States | maxGdpPercap |
| 1952 | Americas | 3494.33 | trimMean | |
| 1952 | Asia | 331.00 | Myanmar | minGdpPercap |
| 1952 | Asia | 108382.35 | Kuwait | maxGdpPercap |
| 1952 | Asia | 1690.45 | trimMean | |
| 1952 | Europe | 973.53 | Bosnia and Herzegovina | minGdpPercap |
| 1952 | Europe | 14734.23 | Switzerland | maxGdpPercap |
| 1952 | Europe | 5436.62 | trimMean | |
| 1957 | Africa | 336.00 | Lesotho | minGdpPercap |
| 1957 | Africa | 5487.10 | South Africa | maxGdpPercap |
| 1957 | Africa | 1176.76 | trimMean | |
| 1957 | Americas | 1544.40 | Dominican Republic | minGdpPercap |
| 1957 | Americas | 14847.13 | United States | maxGdpPercap |
| 1957 | Americas | 4037.75 | trimMean | |
| 1957 | Asia | 350.00 | Myanmar | minGdpPercap |
| 1957 | Asia | 113523.13 | Kuwait | maxGdpPercap |
Now, we create the accompanying figure for the above table in lattice and ggplot2. For each continent, we plot the progression of the minimum, the maximum, and the 10% trimmed mean of the GDP per capita over the years in the data set. A locally fitted (LOESS) regression line is plotted over the points in the data. We make one change from our original plot in the previous homework - the gdpPercap variable on the y-axis is plotted in the log scale to allow us better compare the lines.
# Plot points and a loess smoothed fit in lattice
xyplot(gdpPercap ~ year | continent,
group=stat,
data=minMaxGDP,
grid='h',
auto.key = TRUE,
type=c("p", "smooth"),
scales=list(y = list(log = 10)))
# Do the same with ggplot2
ggplot(data=minMaxGDP,
aes(x=year, y=gdpPercap, group=stat, colour=stat)) +
geom_point() +
geom_smooth(fill=NA, method="loess") +
facet_wrap(~continent) +
scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x)))
We focus on comparing the lattice plot with the ggplot2 plot. For a narrative of the plots, please refer to the previous homework.
For the lattice plot, this plot was built without delving into making our own panel functions, which is where the true customization power lies in lattice. That being said, we were able to produce a nice figure quite easily using just the xyplot function. Faceting is achieved by simply providing a formula, grouping is expressed explicitely, and lattice takes care of the colours. Grid lines are added by simply calling grid='h', and the type of plot is provided in a character vector (type=c("p", "smooth") provides points and a smoothed LOESS line). I found that changing the axis to the log scale was unintuitive, requiring us to provide a list to the scales argument. On the plus side, the scales are rather well displayed by default in lattice.
In the ggplot2 plot, we use the grammar of graphics to build our plot. The data set and the main aesthetics are provided in the ggplot call, where we specify the variable for the x-axis, the variable for the y-axis, the grouping variable, and the colour. We then tell ggplot what to draw on the plot via calls to geom layers. We use geom_point to add points to the plot, and geom_smooth to add a smooth line fit to the points. Faceting is explicitely called using facet_wrap, and the faceting variable can be called using a formula. The y-axis can be transformed to the log scale by adding the scale_y_log10 call, but to make the scale look nice like the lattice plot, we used the package scales by Hadley Wickham.
The default order of the facets are different in the lattice plots when compared to plots made with ggplot2. The plots are added from the bottom left to to the top right, following the factor ordering in the data, when using lattice. On the other hand, ggplot2 adds the plots from the top left to the bottom right, which I found to be more intuitive.
We now focus our attention to creating Hans Rosling's infamous bubble plot. We will have the (log) GDP per capita on the x-axis, and Life Expectancy on the y-axis. The size of the points on the plot will represent the population size, and the colour of the points will represent the continent. We will focus on the data at years 1952 and 2007, the earliest and latest data points in the gapminder data set.
To help us with creating the bubble plot in lattice, Jenny provides a supplementary data set that contains nice colours and sorts the countries by size so that smaller countries are not hidden behind larger countries. We keep Oceania in the following plots.
# Get the copy of the data set that is ordered and given with colours
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderWithColorsAndSorted.txt"
gDatSorted <- read.delim(file = gdURL, as.is = 7)
# Get the data for 1952 and 2007
gDatSub <- subset(gDatSorted, year == 1952 | year == 2007)
# Continent Colours
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderContinentColors.txt"
continentColors <- read.delim(file = gdURL, as.is = 3) # protect color
The lattice plot code is adapted from Jenny's code from the course materials with some very minor modifications.
We begin by constructing a list to define the legend for the lattice plot. In particular, the location of the plot is given, as well as parameters for the labels, such as the text and the colour.
Then the bubble plot is made using the xyplot scatterplot function. We specify the size of the points by defining the cex parameter. The area of the points are related to the population size, and a custom panel function is used to fill in the country colours using Jenny's pre-specified colours in the data set.
# Define continent key
continentKey <-
with(continentColors,
list(x=0.95, y=0.05, corner=c(1, 0),
text=list(as.character(continent)),
points=list(pch=21, col='grey20', fill=color)))
# The bubble plot for the 2007 data, taken from Jenny's code
xyplot(lifeExp ~ gdpPercap | factor(year), data=gDatSub, aspect=2/3,
grid=TRUE, scales=list(x=list(log=10, equispaced.log=FALSE)),
cex=sqrt(gDatSub$pop/pi)/1500, fill.color=gDatSub$color,
col='grey20', key=continentKey,
panel=function(x, y, ..., cex, fill.color, subscripts) {
panel.xyplot(x, y, cex=cex[subscripts],
pch=21, fill=fill.color[subscripts], ...)
})
The ggplot2 figure is, in my opinion, more straightforward. The data is specified, and then the variable's x and y mapping are specified in the aesthetics of the ggplot function call. The more specific aesthetics for the points are given inside an aes call from within the geom_point layer, such as the size and the fill colour. We specify the range of sizes of the points using the scale_size_continuous call, and set it to not show the legend for the point size. To make the aspect ratio similar to the lattice plot, we add a call to the coord_fixed function, and set the ratio appropriately. We again use the scales package to make our x-axis labels look nicer in the log scale.
# Now in ggplot
ggplot(gDatSub, aes(x=gdpPercap, y=lifeExp)) +
geom_point(aes(size=sqrt(pop/pi), fill=continent),
colour="black", pch=21) +
scale_x_log10("log(gdpPercap)",
breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))) +
scale_size_continuous(range=c(1,40), guide="none") +
facet_wrap(~year, nrow=2) +
ylim(28,85) + coord_fixed(ratio=0.03)
I found that making plots using the ggplot2 package was more straightforward than using lattice to produce similar plots. For both packages, I used google to search how to make my plots look a certain way, but I found the instructions for ggplot2 to be more understandable and intuitive. While I do not claim that ggplot2 is superior to lattice, I find that the coding structure of the ggplot2 functions was easier to understand.
Looking at the code for lattice, sometimes it is very easy to make specialized plots (for example, adding a regression line is as easy as adding a letter to the type argument in xyplot), but at other times, it is very hard to achieve some effect (such as adding colours for each bubble in the bubble plot - this required creating a custom panel function to add the colours!).
I will continue to learn both packages, as I am still in the learning phases. I did notice some performance advantages when making plots with lattice (in some other plots not in this report), and I have been told that the ggplot2 functions can be some level of magnitude slower than lattice. This may prove to be useful when plotting figures for very large data sets
For the code used to generate this report, click here