Presets:
Sys.setenv(LANG = "en")
library("plyr")
library("lattice")
library("ggplot2")
I use the Gapminder data here, excluding Oceania:
dat <- read.delim("http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt")
dat = droplevels(subset(dat, continent != "Oceania"))
str(dat)
## 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
ggplot2
First, I draw a violin plot of life expectancy by year, treating year as a categorical variable:
ggplot(data = dat) + geom_violin(aes(x = factor(year), y = lifeExp))
From this figure, we can clearly see how the distribution of life expectancy across countries changes over time. In the earlier years, the “center of gravity” sits at the low end of the “violin”. As time goes by, it gradually moves upward, and after 1987 it is clearly at the high end of the “violin”.
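To put a number on the “center of gravity” of each violin, one option is the median life expectancy per year. The helper below is a minimal sketch: the name `median_by_year` is hypothetical, and it only assumes a data frame with `year` and `lifeExp` columns, such as `dat` above.

```r
# Median life expectancy for each year, as a numeric summary of each violin.
# `d` is any data frame with `year` and `lifeExp` columns (e.g. `dat` above).
# `median_by_year` is an illustrative helper name, not part of any package.
median_by_year <- function(d) {
  tapply(d$lifeExp, factor(d$year), median)
}
```

Calling `median_by_year(dat)` should return one median per survey year, which can be compared with the upward shift visible in the violin plot.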
geom_path in ggplot2
Now I sample 6 countries randomly from the dataset:
set.seed(100)
dat.string = sample(unique(dat$country), 6)
dat.sample = droplevels(subset(dat, country %in% dat.string))
str(dat.sample)
## 'data.frame': 72 obs. of 6 variables:
## $ country : Factor w/ 6 levels "Bangladesh","Dominican Republic",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 46886859 51365468 56839289 62821884 70759295 ...
## $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 37.5 39.3 41.2 43.5 45.3 ...
## $ gdpPercap: num 684 662 686 721 630 ...
levels(dat.sample$country)
## [1] "Bangladesh" "Dominican Republic" "France"
## [4] "Italy" "Japan" "Malawi"
First, let's try a basic scatterplot:
ggplot(data = dat.sample) + geom_point(aes(x = gdpPercap, y = lifeExp, color = country))
Now, let's try an “enhanced” scatterplot. The following depicts the trajectory of \((gdpPercap, ~ lifeExp)\) over time in two-dimensional space:
# size is a constant line width, not a data mapping, so it goes outside aes()
ggplot(data = dat.sample) + geom_path(aes(x = gdpPercap, y = lifeExp, group = country,
color = year), size = 2)
As we can see, there is an overall pattern in the trajectories of \((gdpPercap, ~ lifeExp)\) on the space of \(gdpPercap \times lifeExp\). For GDP per capita below 5000, life expectancy grows rapidly as GDP per capita increases. Once GDP per capita reaches 5000, life expectancy grows only gradually as GDP per capita grows. This pattern persists if we draw more trajectories.
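A rough way to check the “rapid, then gradual” pattern numerically is to compare linear slopes of life expectancy on GDP per capita below and above the 5000 cutoff mentioned above. This is only a sketch under strong assumptions: it pools all countries and years into one fit per regime, and the helper name `slope_by_regime` is hypothetical.

```r
# Slope of lifeExp on gdpPercap below and above a cutoff.
# Assumption: a single pooled linear fit per regime is a crude summary;
# it ignores country effects and time. `slope_by_regime` is illustrative.
slope_by_regime <- function(d, cutoff = 5000) {
  low  <- d[d$gdpPercap <  cutoff, ]
  high <- d[d$gdpPercap >= cutoff, ]
  c(below = unname(coef(lm(lifeExp ~ gdpPercap, data = low))["gdpPercap"]),
    above = unname(coef(lm(lifeExp ~ gdpPercap, data = high))["gdpPercap"]))
}
```

If the pattern holds, `slope_by_regime(dat.sample)` should give a noticeably larger `below` slope than `above` slope.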
To draw the same thing with lattice, I would try the following:
xyplot(lifeExp ~ gdpPercap, data = dat.sample, group = country, col.line = dat.sample$year,
type = "l", auto.key = list(columns = nlevels(dat.sample$country)/2), lwd = 5)
As in ggplot2, I supplied different variables for group and color. However, as the figure above shows, lattice does not draw what I wanted. The col.line option also behaves strangely. Compare the plot above with the plot below:
xyplot(lifeExp ~ gdpPercap, data = dat.sample, group = country, type = "l",
auto.key = list(columns = nlevels(dat.sample$country)/2), lwd = 5)
In the second plot, I removed the col.line option. Now we can see that in the first plot the colors of the lines do not match those in the legend, whereas in the second plot they do. I don't know how col.line worked in the first plot, but in any case it did not produce what I wanted to draw. ggplot2, on the other hand, works quite intuitively and produces exactly what I meant when I assign different variables to group and color.
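One possible explanation, offered as an assumption rather than something verified against the lattice source: with groups present, lattice appears to recycle col.line to one value per group, so the six countries would receive the first six elements of dat.sample$year as color codes, while the legend built by auto.key takes its colors from the theme settings instead. Base R treats a numeric color as an index into palette() and wraps around when the index exceeds the palette length, so year values would land on essentially arbitrary palette entries. The sketch below only illustrates that wrap-around.

```r
# First six values of dat.sample$year (taken from the str() output above).
years <- c(1952, 1957, 1962, 1967, 1972, 1977)

# Numeric colors index into palette(), wrapping around its length.
# Assumption: lattice hands these values to the graphics engine unchanged,
# one per group.
idx <- (years - 1) %% length(palette()) + 1
wrapped_colors <- palette()[idx]
wrapped_colors
```

If this reading is right, the line colors in the first lattice plot encode neither country nor year, which would explain the mismatch with the legend.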