In this homework we will examine the result of the data aggregation task graphically with lattice and ggplot2 . I will use Jenny's solutions as a template for the graphical analysis.
Since next week's homework is to use ggplot2 (and I am more familiar with it now) I will just make both lattice and ggplot2 figures in one go :)
Let's begin:
library(plyr)
library(xtable)
library(lattice)
library(ggplot2)
gDat <- read.delim("gapminderDataFiveYear.txt")
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
tail(gDat)
## country year pop continent lifeExp gdpPercap
## 1699 Zimbabwe 1982 7636524 Africa 60.36 788.9
## 1700 Zimbabwe 1987 9216418 Africa 62.35 706.2
## 1701 Zimbabwe 1992 10704340 Africa 60.38 693.4
## 1702 Zimbabwe 1997 11404948 Africa 46.81 792.4
## 1703 Zimbabwe 2002 11926563 Africa 39.99 672.0
## 1704 Zimbabwe 2007 12311143 Africa 43.49 469.7
I will use Jenny's htmlPrint table to make the printing process a little easier
htmlPrint <- function(x, ...,
digits = 0, include.rownames = FALSE) {
print(xtable(x, digits = digits, ...), type = 'html',
include.rownames = include.rownames, ...)
}
No disrespect to our commonwealth folks down under, but for practical purposes a continent with only two countries is not that interesting. Let's remove it from here on out:
gDat <- droplevels(subset(gDat, continent != "Oceania"))
An easy first plot would be to look at the min and max GDP per capita within the continents over time. Since we are going to graph these data we want a tall format (and need to modify JB's code to include all dates)
foo <- ddply(gDat, .(continent, year), function(x) {
gdpPercap <- range(x$gdpPercap)
return(data.frame(gdpPercap, stat = c("min", "max")))
})
htmlPrint(foo)
| continent | year | gdpPercap | stat |
|---|---|---|---|
| Africa | 1952 | 299 | min |
| Africa | 1952 | 4725 | max |
| Africa | 1957 | 336 | min |
| Africa | 1957 | 5487 | max |
| Africa | 1962 | 355 | min |
| Africa | 1962 | 6757 | max |
| Africa | 1967 | 413 | min |
| Africa | 1967 | 18773 | max |
| Africa | 1972 | 464 | min |
| Africa | 1972 | 21011 | max |
| Africa | 1977 | 502 | min |
| Africa | 1977 | 21951 | max |
| Africa | 1982 | 462 | min |
| Africa | 1982 | 17364 | max |
| Africa | 1987 | 390 | min |
| Africa | 1987 | 11864 | max |
| Africa | 1992 | 411 | min |
| Africa | 1992 | 13522 | max |
| Africa | 1997 | 312 | min |
| Africa | 1997 | 14723 | max |
| Africa | 2002 | 241 | min |
| Africa | 2002 | 12522 | max |
| Africa | 2007 | 278 | min |
| Africa | 2007 | 13206 | max |
| Americas | 1952 | 1398 | min |
| Americas | 1952 | 13990 | max |
| Americas | 1957 | 1544 | min |
| Americas | 1957 | 14847 | max |
| Americas | 1962 | 1662 | min |
| Americas | 1962 | 16173 | max |
| Americas | 1967 | 1452 | min |
| Americas | 1967 | 19530 | max |
| Americas | 1972 | 1654 | min |
| Americas | 1972 | 21806 | max |
| Americas | 1977 | 1874 | min |
| Americas | 1977 | 24073 | max |
| Americas | 1982 | 2011 | min |
| Americas | 1982 | 25010 | max |
| Americas | 1987 | 1823 | min |
| Americas | 1987 | 29884 | max |
| Americas | 1992 | 1456 | min |
| Americas | 1992 | 32004 | max |
| Americas | 1997 | 1342 | min |
| Americas | 1997 | 35767 | max |
| Americas | 2002 | 1270 | min |
| Americas | 2002 | 39097 | max |
| Americas | 2007 | 1202 | min |
| Americas | 2007 | 42952 | max |
| Asia | 1952 | 331 | min |
| Asia | 1952 | 108382 | max |
| Asia | 1957 | 350 | min |
| Asia | 1957 | 113523 | max |
| Asia | 1962 | 388 | min |
| Asia | 1962 | 95458 | max |
| Asia | 1967 | 349 | min |
| Asia | 1967 | 80895 | max |
| Asia | 1972 | 357 | min |
| Asia | 1972 | 109348 | max |
| Asia | 1977 | 371 | min |
| Asia | 1977 | 59265 | max |
| Asia | 1982 | 424 | min |
| Asia | 1982 | 33693 | max |
| Asia | 1987 | 385 | min |
| Asia | 1987 | 28118 | max |
| Asia | 1992 | 347 | min |
| Asia | 1992 | 34933 | max |
| Asia | 1997 | 415 | min |
| Asia | 1997 | 40301 | max |
| Asia | 2002 | 611 | min |
| Asia | 2002 | 36023 | max |
| Asia | 2007 | 944 | min |
| Asia | 2007 | 47307 | max |
| Europe | 1952 | 974 | min |
| Europe | 1952 | 14734 | max |
| Europe | 1957 | 1354 | min |
| Europe | 1957 | 17909 | max |
| Europe | 1962 | 1710 | min |
| Europe | 1962 | 20431 | max |
| Europe | 1967 | 2172 | min |
| Europe | 1967 | 22966 | max |
| Europe | 1972 | 2860 | min |
| Europe | 1972 | 27195 | max |
| Europe | 1977 | 3528 | min |
| Europe | 1977 | 26982 | max |
| Europe | 1982 | 3631 | min |
| Europe | 1982 | 28398 | max |
| Europe | 1987 | 3739 | min |
| Europe | 1987 | 31541 | max |
| Europe | 1992 | 2497 | min |
| Europe | 1992 | 33966 | max |
| Europe | 1997 | 3193 | min |
| Europe | 1997 | 41283 | max |
| Europe | 2002 | 4604 | min |
| Europe | 2002 | 44684 | max |
| Europe | 2007 | 5937 | min |
| Europe | 2007 | 49357 | max |
Here are the first two interesting plots broken down into continent changes over time.
xyplot(gdpPercap ~ year | continent, data = foo, groups = stat, auto.key = T, type = "o")
ggplot(foo, aes(x = year, y = gdpPercap, color = stat)) + geom_line() +
geom_point(shape = 1) + facet_wrap(~continent)
This probably looks better if we choose a more informative grouping. We should show all of the max on the same scale and then the min on the same scale.
xyplot(gdpPercap ~ year | stat, data = foo, groups = continent, auto.key = T, type = "o")
ggplot(foo, aes(x = year, y = gdpPercap, color = continent)) +
geom_line() + geom_point(shape = 1) + facet_wrap(~stat)
Two things are annoying here:
xyplot(gdpPercap ~ year | stat, data = foo, groups = continent, auto.key = T, scales = "free", type = "o")
ggplot(foo, aes(x = year, y = gdpPercap, color = continent)) +
geom_line() + geom_point(shape = 1) + facet_grid(stat ~ ., scales = "free")
This solves one problem, but not the other…I don't have a good solution for labeling when the factors are not invariant across groupings.
The second task I will explore is from Dean's report. He has looked at the absolute and relative world population in each of the continents.
worldRelativePop <- ddply(gDat, .(continent, year),
function(.data) {
.data <- as.list(.data)
.data['continentPop'] <- sum(.data$pop)
.data['worldPop'] <- sum(subset(gDat, year == .data$year[1])[['pop']])
.data['percent'] <- round(as.numeric(.data['continentPop'])/as.numeric(.data['worldPop'])*100,2)
quickdf(.data[c("continentPop","worldPop",'percent')])
}
)
htmlPrint(worldRelativePop, digits = c(0,0,0,0,0,2))
| continent | year | continentPop | worldPop | percent |
|---|---|---|---|---|
| Africa | 1952 | 237640501 | 2396271145 | 9.92 |
| Africa | 1957 | 264837738 | 2652462604 | 9.98 |
| Africa | 1962 | 296516865 | 2886499456 | 10.27 |
| Africa | 1967 | 335289489 | 3202877970 | 10.47 |
| Africa | 1972 | 379879541 | 3560871058 | 10.67 |
| Africa | 1977 | 433061021 | 3912806807 | 11.07 |
| Africa | 1982 | 499348587 | 4271041990 | 11.69 |
| Africa | 1987 | 574834110 | 4671903003 | 12.30 |
| Africa | 1992 | 659081517 | 5089790609 | 12.95 |
| Africa | 1997 | 743832984 | 5492963042 | 13.54 |
| Africa | 2002 | 833723916 | 5863522750 | 14.22 |
| Africa | 2007 | 929539692 | 6226463232 | 14.93 |
| Americas | 1952 | 345152446 | 2396271145 | 14.40 |
| Americas | 1957 | 386953916 | 2652462604 | 14.59 |
| Americas | 1962 | 433270254 | 2886499456 | 15.01 |
| Americas | 1967 | 480746623 | 3202877970 | 15.01 |
| Americas | 1972 | 529384210 | 3560871058 | 14.87 |
| Americas | 1977 | 578067699 | 3912806807 | 14.77 |
| Americas | 1982 | 630290920 | 4271041990 | 14.76 |
| Americas | 1987 | 682753971 | 4671903003 | 14.61 |
| Americas | 1992 | 739274104 | 5089790609 | 14.52 |
| Americas | 1997 | 796900410 | 5492963042 | 14.51 |
| Americas | 2002 | 849772762 | 5863522750 | 14.49 |
| Americas | 2007 | 898871184 | 6226463232 | 14.44 |
| Asia | 1952 | 1395357352 | 2396271145 | 58.23 |
| Asia | 1957 | 1562780599 | 2652462604 | 58.92 |
| Asia | 1962 | 1696357182 | 2886499456 | 58.77 |
| Asia | 1967 | 1905662900 | 3202877970 | 59.50 |
| Asia | 1972 | 2150972248 | 3560871058 | 60.41 |
| Asia | 1977 | 2384513556 | 3912806807 | 60.94 |
| Asia | 1982 | 2610135582 | 4271041990 | 61.11 |
| Asia | 1987 | 2871220762 | 4671903003 | 61.46 |
| Asia | 1992 | 3133292191 | 5089790609 | 61.56 |
| Asia | 1997 | 3383285500 | 5492963042 | 61.59 |
| Asia | 2002 | 3601802203 | 5863522750 | 61.43 |
| Asia | 2007 | 3811953827 | 6226463232 | 61.22 |
| Europe | 1952 | 418120846 | 2396271145 | 17.45 |
| Europe | 1957 | 437890351 | 2652462604 | 16.51 |
| Europe | 1962 | 460355155 | 2886499456 | 15.95 |
| Europe | 1967 | 481178958 | 3202877970 | 15.02 |
| Europe | 1972 | 500635059 | 3560871058 | 14.06 |
| Europe | 1977 | 517164531 | 3912806807 | 13.22 |
| Europe | 1982 | 531266901 | 4271041990 | 12.44 |
| Europe | 1987 | 543094160 | 4671903003 | 11.62 |
| Europe | 1992 | 558142797 | 5089790609 | 10.97 |
| Europe | 1997 | 568944148 | 5492963042 | 10.36 |
| Europe | 2002 | 578223869 | 5863522750 | 9.86 |
| Europe | 2007 | 586098529 | 6226463232 | 9.41 |
First thing I noticed was the use of this cool function quickdf which is a plyr implementation of creating a dataframe quickly. I then noticed that I was not getting the same results as Dean; however, the difference is due to the removal of Oceania! I will process without further worry after this check.
Let us first see how population changes by continent over time:
xyplot(continentPop ~ year, data = worldRelativePop, groups = continent, auto.key = T, type = c("o", "p"))
ggplot(worldRelativePop, aes(year, continentPop, color = continent)) + geom_line() + geom_point()
That was fairly painless, but the legend needs to be fixed. This seems to be more annoying/difficult than originally anticipated since the ordering needs to be done across some function of the chart in the time series as opposed to just one value. Need to ask JB about the proper approach for this functionality….
xyplot(percent ~ year, data = worldRelativePop, groups = continent, auto.key = T, type = c("o", "p"))
ggplot(worldRelativePop, aes(year, percent, color = continent)) + geom_line() + geom_point()
Clearly Asia is the winner, the more interesting details are in the other continents:
foo <- droplevels(subset(worldRelativePop, continent != "Asia"))
xyplot(percent ~ year, data = foo, groups = continent, auto.key = T, type = c("o", "p"))
ggplot(foo, aes(year, percent, color = continent)) + geom_line() + geom_point()
Now we can clearly see some interesting trends:
Another point of interest would be to see how stationary these percent distributions are (and I want to try a density plot in ggplot2). But, first we will do some stripplots
stripplot(percent ~ reorder(continent, percent), jitter.data = T, worldRelativePop, type = c("p", "a"))
ggplot(worldRelativePop) + geom_point(position = position_jitter(width = 0.15), aes(reorder(continent, percent), percent), shape = 1)
densityplot(~percent | continent, data = worldRelativePop, scales = "free")
ggplot(worldRelativePop, aes(percent)) + stat_density(geom = "path", position = "identity") +
facet_wrap(~continent, scales = "free")
ggplot does not actually do a good good reproducing these figures from lattice…I think there is likely a way to fine tune the grid over where the kernel estimate is being calculated, although I do not have time to look through too much documentation now.
####
date()
## [1] "Sun Sep 29 23:08:15 2013"
sessionInfo()
## R version 3.0.1 (2013-05-16)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_0.9.3.1 lattice_0.20-23 xtable_1.7-1 plyr_1.8
## [5] knitr_1.4.1
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.2-2 dichromat_2.0-0 digest_0.6.3
## [4] evaluate_0.4.7 formatR_0.9 grid_3.0.1
## [7] gtable_0.1.2 labeling_0.2 MASS_7.3-26
## [10] munsell_0.4.2 proto_0.3-10 RColorBrewer_1.0-5
## [13] reshape2_1.2.2 scales_0.2.3 stringr_0.6.2
## [16] tools_3.0.1