STAT 545A - Homework #5

Christian Okkels

Introduction & loading of libraries and data

This assignment consists of making ggplot versions of figures already made with lattice. I will mainly revisit the data aggregation tasks and plots from my Homework #4, thus making ggplot companions to the many lattice plots I did there.

First, we load the necessary libraries and the data.

library(ggplot2)
library(lattice)
library(plyr)
library(xtable)
gDat <- read.delim("gapminderDataFiveYear.txt")
str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Just in case we feel the urge to see some of the data, we define a function for making HTML tables:

htmlPrint <- function(x, ..., digits = 0, include.rownames = FALSE) {
    print(xtable(x, digits = digits, ...), type = "html", include.rownames = include.rownames, 
        ...)
}

In Homework #4, I first considered a “typical” life expectancy for different years. So let's start with that.

Change of life expectancy over time for different continents

The data aggregation is performed via the ddply() command in the plyr package.

avgLifeExpByContAndYear <- ddply(gDat, ~continent + year, summarize, avgLifeExp = mean(lifeExp))
# htmlPrint(avgLifeExpByContAndYear)
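
To make the aggregation step concrete, here is a minimal base-R sketch of what the ddply() call computes: one mean per continent/year combination. The data frame `toy` and its values are made up for illustration and are not Gapminder data.

```r
# Toy stand-in for gDat (hypothetical values, not Gapminder data).
toy <- data.frame(
  continent = c("A", "A", "A", "A", "B", "B"),
  year      = c(1952, 1952, 1957, 1957, 1952, 1957),
  lifeExp   = c(40, 50, 44, 56, 60, 62)
)

# Base-R counterpart of
#   ddply(toy, ~continent + year, summarize, avgLifeExp = mean(lifeExp))
avgToy <- aggregate(lifeExp ~ continent + year, data = toy, FUN = mean)
names(avgToy)[names(avgToy) == "lifeExp"] <- "avgLifeExp"

# Sort like ddply does: by continent first, then by year.
avgToy <- avgToy[order(avgToy$continent, avgToy$year), ]
print(avgToy)
```

Each row of the result is one continent/year cell, which is exactly the shape both xyplot() and ggplot() expect for the plots that follow.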

Lattice:
Using lattice, we get the following figure.

xyplot(avgLifeExp ~ year | continent, avgLifeExpByContAndYear, type = c("p", 
    "smooth"))

Figure (lattice): average life expectancy against year with a smooth line, one panel per continent.

ggplot:
To obtain multi-panel conditioning in ggplot we need to use facet_wrap(). We want one panel per continent and thus specify the argument as ~ continent. The lattice plot above also includes a smooth line through the data points. In ggplot, this is achieved via the geom_smooth() command, for which we use the loess smoothing method; this is what geom_smooth() picks by default for small data sets, and it is suitable for the number of points we are dealing with here, as described in the ggplot eBook.

ggplot(avgLifeExpByContAndYear, aes(x = year, y = avgLifeExp)) + geom_point() + 
    facet_wrap(~continent) + geom_smooth(method = "loess")

Figure (ggplot): average life expectancy against year with a loess smooth, one panel per continent.

Evidently, the two plots are very much alike, although the smoothing appears slightly different.
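
One likely reason for the slightly different smoothing is that the two packages use different loess settings. As far as I can tell, lattice's type = "smooth" goes through panel.loess(), whose defaults are span = 2/3, degree = 1 and family = "symmetric", whereas geom_smooth(method = "loess") relies on the defaults of stats::loess() (span = 0.75, degree = 2, family = "gaussian"). The sketch below passes the lattice-style settings to ggplot; the data frame `toy` is made up for illustration.

```r
library(ggplot2)

# Toy data (hypothetical): two continents, twelve five-year time points each.
set.seed(1)
toy <- data.frame(
  continent  = rep(c("A", "B"), each = 12),
  year       = rep(seq(1952, 2007, by = 5), times = 2),
  avgLifeExp = c(40 + 2 * (1:12), 60 + 1:12) + rnorm(24, sd = 0.5)
)

# Pass lattice-like loess settings to geom_smooth(): span is a direct
# argument, the remaining loess() arguments go through method.args.
p <- ggplot(toy, aes(x = year, y = avgLifeExp)) +
  geom_point() +
  facet_wrap(~continent) +
  geom_smooth(method = "loess", formula = y ~ x, span = 2/3, se = FALSE,
              method.args = list(degree = 1, family = "symmetric"))

# print(p)  # uncomment to draw the figure
```

This is only a sketch of how to narrow the gap; the curves need not coincide exactly, since the two packages also evaluate the fit on different grids.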

Number and proportion of countries with low/high life expectancy over time by continent

# lifeExpThreshold = mean(gDat$lifeExp)
lifeExpThreshold = 50
nCountriesWithLowLifeExpOverTime <- ddply(gDat, ~continent + year, function(x) c(nCntryLowLE = sum(x$lifeExp <= 
    lifeExpThreshold)))
# htmlPrint(nCountriesWithLowLifeExpOverTime)

Lattice:
In lattice, the plotting is done as follows. Using smoothing in this particular case results in an error/warning message overlaid on the plot. The commented-out line of code gets rid of the message (but also of the smoothing).

xyplot(nCntryLowLE ~ year, nCountriesWithLowLifeExpOverTime, groups = continent, 
    auto.key = TRUE, type = c("p", "smooth"))  # returns a warning/error message on top of the plot.

Figure (lattice): number of countries with low life expectancy against year, grouped by continent, with smoothing.

# xyplot(nCntryLowLE ~ year, nCountriesWithLowLifeExpOverTime, groups =
# continent, auto.key = TRUE)

ggplot:
In the previous part we used multi-panel conditioning to distinguish between the different continents. This was done via facet_wrap() in ggplot. Now, however, we have used the groups = argument of lattice's xyplot() to show the data for the different continents in the same figure. As we would like to replicate this in ggplot, we need something other than facet_wrap(). In fact, ggplot has a similar grouping argument, group =, which goes into the aesthetics argument, aes(), of the ggplot() command.

By default, the different groups will have the same color and it will be difficult to tell which one belongs to which continent (see the commented lines in the chunk below). Thus, we need to add color and, with it, a key (legend). This is done by adding aes(col = continent) as an argument to geom_point() and/or geom_smooth(). Adding it only to geom_point() colors just the points (first figure below). Similarly, adding it only to geom_smooth() colors only the smoothed lines (second figure below). However, we can also do it for both (third figure below).

## ggplot(nCountriesWithLowLifeExpOverTime, aes(x = year, y = nCntryLowLE,
## group = continent)) + geom_point() + geom_smooth(method = 'loess') # this
## has no key/legend.
ggplot(nCountriesWithLowLifeExpOverTime, aes(x = year, y = nCntryLowLE, group = continent)) + 
    geom_point(aes(col = continent)) + geom_smooth(method = "loess")

Figure (ggplot): number of countries with low life expectancy against year, only the points colored by continent.

ggplot(nCountriesWithLowLifeExpOverTime, aes(x = year, y = nCntryLowLE, group = continent)) + 
    geom_point() + geom_smooth(method = "loess", aes(col = continent))

Figure (ggplot): number of countries with low life expectancy against year, only the smoothed lines colored by continent.

ggplot(nCountriesWithLowLifeExpOverTime, aes(x = year, y = nCntryLowLE, group = continent)) + 
    geom_point(aes(col = continent)) + geom_smooth(method = "loess", se = FALSE, 
    aes(col = continent))  # turn off confidence intervals via 'se = FALSE'

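Figure (ggplot): number of countries with low life expectancy against year, points and smoothed lines colored by continent, confidence intervals turned off.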

The plots created using ggplot closely resemble the lattice plot, apart from the “layout” in terms of colors and background, of course. A rather nice feature is that ggplot draws confidence intervals around the smoothed lines by default. These can be turned off by using se = FALSE, as exemplified in the last of the plots above.
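
As a possible shortcut, the color mapping can be placed once in the top-level aes() instead of in each layer. A discrete colour mapping also defines the groups, so a separate group = is then not needed. The sketch below uses a made-up data frame `toy` rather than the Gapminder counts.

```r
library(ggplot2)

# Toy data (hypothetical counts): two continents, twelve time points each.
toy <- data.frame(
  continent   = rep(c("A", "B"), each = 12),
  year        = rep(seq(1952, 2007, by = 5), times = 2),
  nCntryLowLE = c(12:1, rep(c(3, 2), times = 6))
)

# colour = continent in the top-level aes() is inherited by both layers,
# so points and smoothed lines share one color scale and one legend.
p <- ggplot(toy, aes(x = year, y = nCntryLowLE, colour = continent)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

# print(p)  # uncomment to draw the figure
```

The per-layer version remains the better choice when only one of the layers should be colored.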

In Homework #4, I considered not only the number but also the proportion of countries with a life expectancy lower than a certain threshold. Creating new plots for this case would be redundant, since it is only the data in the data.frame that has slightly changed; the plotting commands are entirely the same, save, of course, for the variable and data.frame names.
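
For completeness, here is a small sketch of how the proportion differs from the count on the data side: the sum of the logical vector is replaced by its mean. It uses base R and a made-up data frame `toy`; the column name `propCntryLowLE` is an illustrative choice and not necessarily the one used in Homework #4.

```r
# Toy data (hypothetical values): five countries in a single year.
toy <- data.frame(
  continent = c("A", "A", "A", "B", "B"),
  year      = 1952,
  lifeExp   = c(45, 48, 60, 55, 49)
)
lifeExpThreshold <- 50

# The mean of a logical vector is the proportion of TRUE values, so this
# gives the share of countries at or below the threshold in each cell.
propLow <- aggregate(lifeExp ~ continent + year, data = toy,
                     FUN = function(le) mean(le <= lifeExpThreshold))
names(propLow)[names(propLow) == "lifeExp"] <- "propCntryLowLE"
print(propLow)
```

Plotting then only requires swapping the y variable and the data.frame name in the commands above.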

Let us therefore move on.

Max, min, and spread of GDP per capita for the different continents

Lattice:
In Homework #4, the plot made in lattice was a so-called “box-and-whiskers” plot:

# bwplot(gdpPercap ~ as.factor(year) | continent, gDat)
bwplot(gdpPercap ~ as.factor(year) | continent, subset(gDat, continent != "Oceania"))

Figure (lattice): box-and-whiskers plots of GDP per capita by year, one panel per continent (Oceania excluded).

ggplot:
So, we wish to replicate this type of “box-and-whiskers” plot in ggplot. This is achieved by adding geom_boxplot(). The lattice plot uses multi-panel conditioning on the continents; thus, in ggplot, we add facet_wrap(~ continent).

ggplot(subset(gDat, continent != "Oceania"), aes(x = as.factor(year), y = gdpPercap)) + 
    facet_wrap(~continent) + geom_boxplot()

Figure (ggplot): boxplots of GDP per capita by year, one panel per continent (Oceania excluded).

The ggplot and lattice plots above are very similar, the only differences being the color/background and the order of the panels.
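
The different panel order comes from the packages' defaults: lattice fills the panels starting from the bottom-left corner, while facet_wrap() starts from the top-left. If matching orders are wanted, either side can be switched with its as.table argument. The sketch below shows both options on a made-up data frame `toy`; only one of the two changes is needed.

```r
library(lattice)
library(ggplot2)

# Toy data (hypothetical values): four continents, two years, ten countries each.
set.seed(1)
toy <- data.frame(
  continent = rep(c("A", "B", "C", "D"), each = 20),
  year      = rep(rep(c(1952, 1957), each = 10), times = 4),
  gdpPercap = exp(rnorm(80, mean = 8))
)

# Option 1: make lattice fill panels from the top-left, like ggplot does.
pLattice <- bwplot(gdpPercap ~ as.factor(year) | continent, toy,
                   as.table = TRUE)

# Option 2: make ggplot fill panels from the bottom-left, like lattice does.
pGgplot <- ggplot(toy, aes(x = as.factor(year), y = gdpPercap)) +
  facet_wrap(~continent, as.table = FALSE) +
  geom_boxplot()

# print(pLattice); print(pGgplot)  # uncomment to draw the figures
```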

Population over time for each continent

In Homework #4, the first part of this task considered a plot of the average population versus time, conditioned on the continents.

avgPopByContAndYear <- ddply(gDat, ~continent + year, summarize, avgPop = mean(pop))

Lattice:
In lattice, the plot looked like this:

xyplot(avgPop ~ year, avgPopByContAndYear, groups = continent, auto.key = TRUE, 
    type = c("p", "smooth"))

Figure (lattice): average population against year, grouped by continent, with smooth lines.

ggplot:
We wish to replicate the above plot using ggplot. We are dealing with a simple plot of the average population versus time, with the data coming from the new data.frame saved above. So, the average population and the year will be the aesthetics appearing in the aes() argument of the ggplot() command. In the lattice plot, we use the groups = argument to plot the data for the five continents in the same figure. We recall from above that ggplot has a similar argument, group =, to be passed into aes(). The lattice plot also includes smooth lines through the data points. In ggplot, this is achieved by geom_smooth(); here, we specify the smoothing method “loess”, tell ggplot not to draw confidence intervals (se = FALSE), and specify different colors for the different continents. Coloring the points themselves is done in geom_point(). All of this results in the plot below.

ggplot(avgPopByContAndYear, aes(x = year, y = avgPop, group = continent)) + 
    geom_point(aes(col = continent)) + geom_smooth(method = "loess", se = FALSE, 
    aes(col = continent))

Figure (ggplot): average population against year, points and loess lines colored by continent.

Evidently, the ggplot is extremely similar to the lattice plot above.

Interesting countries with respect to population

In Homework #4, I made a linear model of population through time:

yearMin <- min(gDat$year)
lmFun <- function(x) {
    estCoefs <- coef(lm(pop ~ I(year - yearMin), x))
    names(estCoefs) <- c("intercept", "slope")
    return(estCoefs)
}
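
To illustrate why the year is shifted by yearMin inside the model formula, here is a small worked example on made-up, exactly linear data. With the shift, the intercept is the fitted population in the first year; without it, the intercept is an extrapolation to year 0. The names `toy`, `toyYearMin` and `toyLmFun` are hypothetical and only mirror the function above.

```r
# Toy data (hypothetical): population grows by exactly 20 per year from 1000.
toy <- data.frame(year = seq(1952, 2007, by = 5))
toy$pop <- 1000 + 20 * (toy$year - 1952)

toyYearMin <- min(toy$year)
toyLmFun <- function(x) {
    estCoefs <- coef(lm(pop ~ I(year - toyYearMin), x))
    names(estCoefs) <- c("intercept", "slope")
    return(estCoefs)
}

shifted   <- toyLmFun(toy)               # intercept = population in 1952
unshifted <- coef(lm(pop ~ year, toy))   # intercept = extrapolation to year 0

print(shifted)
print(unshifted)
```

The slope is the same in both fits; only the intercept changes, from 1000 (shifted) to 1000 - 20 * 1952 = -38040 (unshifted), which is why the shifted version is the one that can sensibly be compared across countries.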

This function was then passed to ddply in order to get the model coefficients for each country:

lmCoefsForCountries <- ddply(gDat, ~country + continent, lmFun)
# str(lmCoefsForCountries)

Finally, I considered various cases and definitions of “interesting” countries, plotting the results.

First, we find the countries with the largest intercepts:

largestIntercepts <- ddply(lmCoefsForCountries, ~continent, function(x) {
    theMax <- which.max(x$intercept)
    x[theMax, c("continent", "country", "intercept", "slope")]
})

Lattice:
In lattice, the population versus time was plotted for these countries as follows:

xyplot(pop ~ year | country, gDat, subset = country %in% largestIntercepts$country, 
    type = c("p", "r"))

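Figure (lattice): population against year with a regression line, one panel per country with the largest intercept in its continent.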

ggplot:
In the lattice plot above we have used multi-panel conditioning, conditioning on the countries. In ggplot, this multi-panel conditioning is achieved via facet_wrap(~ country). Above, we also consider only a subset of the data, namely those countries having the largest intercepts. For ggplot, we obtain the same subset of the data by using the general subset() command.

ggplot(subset(gDat, subset = country %in% largestIntercepts$country), aes(x = year, 
    y = pop)) + geom_point() + facet_wrap(~country) + geom_smooth(method = "loess")

Figure (ggplot): population against year with a loess smooth, one panel per country with the largest intercept in its continent.

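The plots made using lattice and ggplot, respectively, are very much alike; save for some layout and ordering details, they show the same data. One difference is that the lattice plot draws a straight regression line (type = "r"), whereas the ggplot version draws a loess smooth.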
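
Since lattice's type = "r" adds a linear regression line, the closest ggplot counterpart is probably geom_smooth(method = "lm") rather than a loess smooth. A minimal sketch, on a made-up data frame `toy` with two hypothetical countries:

```r
library(ggplot2)

# Toy data (hypothetical): two countries, twelve five-year time points each.
toy <- data.frame(
  country = rep(c("X", "Y"), each = 12),
  year    = rep(seq(1952, 2007, by = 5), times = 2),
  pop     = c(1e6 + 5e4 * (0:11), 5e6 + 2e5 * (0:11)^1.2)
)

# method = "lm" fits a straight line per panel, like type = "r" in lattice;
# se = FALSE drops the confidence band, which lattice does not draw either.
p <- ggplot(toy, aes(x = year, y = pop)) +
  geom_point() +
  facet_wrap(~country) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# print(p)  # uncomment to draw the figure
```

This also ties the figure directly to the linear model whose intercepts were used to select the countries.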

The remaining plots in my Homework #4 consider the most “interesting” countries in terms of large slopes and maximum residuals. However, it is only some of the data calculations, functions, etc. that are different; the plotting is exactly the same (only with different variable names) as for the case considered above. So the ggplot example above, for the countries with the largest intercepts, carries over perfectly to these other cases.

In the above, we have considered many different types of plots. But let us look at one more type that has not appeared above, namely the stripplot. For this case I will use the part of Jenny's solutions to Homework #4 that considers “typical” life expectancy for different years.

Examine “typical” life expectancy for different years

We first find the trimmed mean (over all countries) for each year and save the resulting data.frame as foo. The trim fraction is taken to be 0.2, meaning that 20% of the observations are dropped from each end before the mean is computed.

jTrim <- 0.2
foo <- ddply(gDat, ~year, summarize, tMean = mean(lifeExp, trim = jTrim))
# htmlPrint(foo)
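
A small worked example may help to see what trim = 0.2 does. The vector below is made up: ten values with one extreme outlier. With ten observations, 20% from each end means the two smallest and the two largest values are dropped.

```r
# Toy vector (hypothetical values) with one extreme outlier.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1000)

plainMean   <- mean(x)              # pulled up by the outlier
trimmedMean <- mean(x, trim = 0.2)  # mean of the middle six values, 3 to 8

print(plainMean)    # 104.5
print(trimmedMean)  # 5.5
```

This robustness to extreme values is the reason a trimmed mean can serve as a “typical” value.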

Lattice:
Plotting via lattice yields the figure below. We have added some jitter to the data as well as a line (type "a") joining the per-year averages; by default, this is the plain mean rather than the trimmed mean computed above.

stripplot(lifeExp ~ factor(year), gDat, jitter.data = TRUE, type = c("p", "a"))

Figure (lattice): jittered stripplot of life expectancy by year with a line joining the per-year averages.

ggplot:
Below, we have attempted to create approximately the same figure using ggplot. Jitter is added via geom_jitter(). By default, the jitter made by ggplot is much wider than that made by lattice. However, the amount of jitter can be adjusted in geom_jitter() through the position = position_jitter() argument, which takes a width and/or a height. Below, we have chosen a width of 0.5, which matches the jitter width of the lattice plot above rather well. The confidence interval around the smoothed line can be turned on or off by setting the se = argument in geom_smooth() to TRUE or FALSE.

ggplot(gDat, aes(x = year, y = lifeExp)) + geom_jitter(position = position_jitter(width = 0.5)) + 
    geom_smooth(method = "loess", se = FALSE)

Figure (ggplot): jittered points of life expectancy against year with a loess smooth.

The ggplot and lattice plots are as good as identical (again, save for differences in layout, etc.).
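
If a line through the per-year averages is wanted instead of a loess smooth, stat_summary() is one way to get it in ggplot. The sketch below assumes ggplot2 3.3.0 or later (where the argument is called fun; older versions used fun.y) and a made-up data frame `toy`. Setting height = 0 in position_jitter() is my own addition to keep the points at their true y values.

```r
library(ggplot2)

# Toy data (hypothetical values): thirty countries per five-year time point.
set.seed(1)
toy <- data.frame(year = rep(seq(1952, 2007, by = 5), each = 30))
toy$lifeExp <- 40 + 0.4 * (toy$year - 1952) + rnorm(nrow(toy), sd = 8)

jTrim <- 0.2

# stat_summary() collapses the y values at each x to a single number and
# joins the results with a line. Here the summary is the trimmed mean;
# fun = mean would give the plain average instead.
p <- ggplot(toy, aes(x = year, y = lifeExp)) +
  geom_jitter(position = position_jitter(width = 0.5, height = 0)) +
  stat_summary(fun = function(y) mean(y, trim = jTrim), geom = "line")

# print(p)  # uncomment to draw the figure
```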

Christian Birch Okkels