Stat545A Homework #3 Visualize a quantitative variable with lattice

In this homework we will examine the result of the data aggregation task graphically with lattice and ggplot2 . I will use Jenny's solutions as a template for the graphical analysis.

Since next week's homework is to use ggplot2 (and I am more familiar with it now) I will just make both lattice and ggplot2 figures in one go :)

Let's begin:

Basic import and quality control

library(plyr)
library(xtable)
library(lattice)
library(ggplot2)
gDat <- read.delim("gapminderDataFiveYear.txt")
str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
tail(gDat)
##       country year      pop continent lifeExp gdpPercap
## 1699 Zimbabwe 1982  7636524    Africa   60.36     788.9
## 1700 Zimbabwe 1987  9216418    Africa   62.35     706.2
## 1701 Zimbabwe 1992 10704340    Africa   60.38     693.4
## 1702 Zimbabwe 1997 11404948    Africa   46.81     792.4
## 1703 Zimbabwe 2002 11926563    Africa   39.99     672.0
## 1704 Zimbabwe 2007 12311143    Africa   43.49     469.7

I will use Jenny's htmlPrint table to make the printing process a little easier

htmlPrint <- function(x, ...,
   digits = 0, include.rownames = FALSE) {
   print(xtable(x, digits = digits, ...), type = 'html',
         include.rownames = include.rownames, ...)
   }

Get rid of Oceania

No disrespect to our commonwealth folks down under, but for practical purposes a continent with only two countries is not that interesting. Let's remove it from here on out:

gDat <- droplevels(subset(gDat, continent != "Oceania"))

Easy analysis of GDP “trends”

An easy first plot would be to look at the min and max GDP per capita within the continents over time. Since we are going to graph these data we want a tall format (and need to modify JB's code to include all dates)

foo <- ddply(gDat, .(continent, year), function(x) {
  gdpPercap <- range(x$gdpPercap)
  return(data.frame(gdpPercap, stat = c("min", "max")))
  })
htmlPrint(foo)
continent year gdpPercap stat
Africa 1952 299 min
Africa 1952 4725 max
Africa 1957 336 min
Africa 1957 5487 max
Africa 1962 355 min
Africa 1962 6757 max
Africa 1967 413 min
Africa 1967 18773 max
Africa 1972 464 min
Africa 1972 21011 max
Africa 1977 502 min
Africa 1977 21951 max
Africa 1982 462 min
Africa 1982 17364 max
Africa 1987 390 min
Africa 1987 11864 max
Africa 1992 411 min
Africa 1992 13522 max
Africa 1997 312 min
Africa 1997 14723 max
Africa 2002 241 min
Africa 2002 12522 max
Africa 2007 278 min
Africa 2007 13206 max
Americas 1952 1398 min
Americas 1952 13990 max
Americas 1957 1544 min
Americas 1957 14847 max
Americas 1962 1662 min
Americas 1962 16173 max
Americas 1967 1452 min
Americas 1967 19530 max
Americas 1972 1654 min
Americas 1972 21806 max
Americas 1977 1874 min
Americas 1977 24073 max
Americas 1982 2011 min
Americas 1982 25010 max
Americas 1987 1823 min
Americas 1987 29884 max
Americas 1992 1456 min
Americas 1992 32004 max
Americas 1997 1342 min
Americas 1997 35767 max
Americas 2002 1270 min
Americas 2002 39097 max
Americas 2007 1202 min
Americas 2007 42952 max
Asia 1952 331 min
Asia 1952 108382 max
Asia 1957 350 min
Asia 1957 113523 max
Asia 1962 388 min
Asia 1962 95458 max
Asia 1967 349 min
Asia 1967 80895 max
Asia 1972 357 min
Asia 1972 109348 max
Asia 1977 371 min
Asia 1977 59265 max
Asia 1982 424 min
Asia 1982 33693 max
Asia 1987 385 min
Asia 1987 28118 max
Asia 1992 347 min
Asia 1992 34933 max
Asia 1997 415 min
Asia 1997 40301 max
Asia 2002 611 min
Asia 2002 36023 max
Asia 2007 944 min
Asia 2007 47307 max
Europe 1952 974 min
Europe 1952 14734 max
Europe 1957 1354 min
Europe 1957 17909 max
Europe 1962 1710 min
Europe 1962 20431 max
Europe 1967 2172 min
Europe 1967 22966 max
Europe 1972 2860 min
Europe 1972 27195 max
Europe 1977 3528 min
Europe 1977 26982 max
Europe 1982 3631 min
Europe 1982 28398 max
Europe 1987 3739 min
Europe 1987 31541 max
Europe 1992 2497 min
Europe 1992 33966 max
Europe 1997 3193 min
Europe 1997 41283 max
Europe 2002 4604 min
Europe 2002 44684 max
Europe 2007 5937 min
Europe 2007 49357 max

Here are the first two interesting plots broken down into continent changes over time.

xyplot(gdpPercap ~ year | continent, data = foo, groups = stat, auto.key = T, type = "o")

plot of chunk unnamed-chunk-5

ggplot(foo, aes(x = year, y = gdpPercap, color = stat)) + geom_line() +
geom_point(shape = 1) + facet_wrap(~continent)

plot of chunk unnamed-chunk-5

This probably looks better if we choose a more informative grouping. We should show all of the max on the same scale and then the min on the same scale.

xyplot(gdpPercap ~ year | stat, data = foo, groups = continent, auto.key = T, type = "o")

plot of chunk unnamed-chunk-6

ggplot(foo, aes(x = year, y = gdpPercap, color = continent)) +
  geom_line() + geom_point(shape = 1) + facet_wrap(~stat)

plot of chunk unnamed-chunk-6

Two things are annoying here:

xyplot(gdpPercap ~ year | stat, data = foo, groups = continent, auto.key = T, scales = "free", type = "o")

plot of chunk unnamed-chunk-7

ggplot(foo, aes(x = year, y = gdpPercap, color = continent)) +
  geom_line() + geom_point(shape = 1) +  facet_grid(stat ~ ., scales = "free")

plot of chunk unnamed-chunk-7

This solves one problem, but not the other…I don't have a good solution for labeling when the factors are not invariant across groupings.

Absolute and relative world population in each of the continents

The second task I will explore is from Dean's report. He has looked at the absolute and relative world population in each of the continents.

worldRelativePop <- ddply(gDat, .(continent, year),
  function(.data) {
    .data <- as.list(.data)
    .data['continentPop'] <- sum(.data$pop)
    .data['worldPop'] <- sum(subset(gDat, year == .data$year[1])[['pop']])
    .data['percent'] <- round(as.numeric(.data['continentPop'])/as.numeric(.data['worldPop'])*100,2) 
    quickdf(.data[c("continentPop","worldPop",'percent')])
  }
)
htmlPrint(worldRelativePop, digits = c(0,0,0,0,0,2))
continent year continentPop worldPop percent
Africa 1952 237640501 2396271145 9.92
Africa 1957 264837738 2652462604 9.98
Africa 1962 296516865 2886499456 10.27
Africa 1967 335289489 3202877970 10.47
Africa 1972 379879541 3560871058 10.67
Africa 1977 433061021 3912806807 11.07
Africa 1982 499348587 4271041990 11.69
Africa 1987 574834110 4671903003 12.30
Africa 1992 659081517 5089790609 12.95
Africa 1997 743832984 5492963042 13.54
Africa 2002 833723916 5863522750 14.22
Africa 2007 929539692 6226463232 14.93
Americas 1952 345152446 2396271145 14.40
Americas 1957 386953916 2652462604 14.59
Americas 1962 433270254 2886499456 15.01
Americas 1967 480746623 3202877970 15.01
Americas 1972 529384210 3560871058 14.87
Americas 1977 578067699 3912806807 14.77
Americas 1982 630290920 4271041990 14.76
Americas 1987 682753971 4671903003 14.61
Americas 1992 739274104 5089790609 14.52
Americas 1997 796900410 5492963042 14.51
Americas 2002 849772762 5863522750 14.49
Americas 2007 898871184 6226463232 14.44
Asia 1952 1395357352 2396271145 58.23
Asia 1957 1562780599 2652462604 58.92
Asia 1962 1696357182 2886499456 58.77
Asia 1967 1905662900 3202877970 59.50
Asia 1972 2150972248 3560871058 60.41
Asia 1977 2384513556 3912806807 60.94
Asia 1982 2610135582 4271041990 61.11
Asia 1987 2871220762 4671903003 61.46
Asia 1992 3133292191 5089790609 61.56
Asia 1997 3383285500 5492963042 61.59
Asia 2002 3601802203 5863522750 61.43
Asia 2007 3811953827 6226463232 61.22
Europe 1952 418120846 2396271145 17.45
Europe 1957 437890351 2652462604 16.51
Europe 1962 460355155 2886499456 15.95
Europe 1967 481178958 3202877970 15.02
Europe 1972 500635059 3560871058 14.06
Europe 1977 517164531 3912806807 13.22
Europe 1982 531266901 4271041990 12.44
Europe 1987 543094160 4671903003 11.62
Europe 1992 558142797 5089790609 10.97
Europe 1997 568944148 5492963042 10.36
Europe 2002 578223869 5863522750 9.86
Europe 2007 586098529 6226463232 9.41

First thing I noticed was the use of this cool function quickdf which is a plyr implementation of creating a dataframe quickly. I then noticed that I was not getting the same results as Dean; however, the difference is due to the removal of Oceania! I will process without further worry after this check.

Let us first see how population changes by continent over time:

xyplot(continentPop ~ year, data = worldRelativePop, groups = continent, auto.key = T, type = c("o", "p"))

plot of chunk unnamed-chunk-9

ggplot(worldRelativePop, aes(year, continentPop, color = continent)) + geom_line() + geom_point()

plot of chunk unnamed-chunk-9

That was fairly painless, but the legend needs to be fixed. This seems to be more annoying/difficult than originally anticipated since the ordering needs to be done across some function of the chart in the time series as opposed to just one value. Need to ask JB about the proper approach for this functionality….

xyplot(percent ~ year, data = worldRelativePop, groups = continent, auto.key = T, type = c("o", "p"))

plot of chunk unnamed-chunk-10

ggplot(worldRelativePop, aes(year, percent, color = continent)) + geom_line() + geom_point()

plot of chunk unnamed-chunk-10

Clearly Asia is the winner, the more interesting details are in the other continents:

foo <- droplevels(subset(worldRelativePop, continent != "Asia"))
xyplot(percent ~ year, data = foo, groups = continent, auto.key = T, type = c("o", "p"))

plot of chunk unnamed-chunk-11

ggplot(foo, aes(year, percent, color = continent)) + geom_line() + geom_point()

plot of chunk unnamed-chunk-11

Now we can clearly see some interesting trends:

Another point of interest would be to see how stationary these percent distributions are (and I want to try a density plot in ggplot2). But, first we will do some stripplots

stripplot(percent ~ reorder(continent, percent), jitter.data = T, worldRelativePop, type = c("p", "a"))

plot of chunk unnamed-chunk-12

ggplot(worldRelativePop) + geom_point(position = position_jitter(width = 0.15), aes(reorder(continent, percent), percent), shape = 1)

plot of chunk unnamed-chunk-12

densityplot(~percent | continent, data = worldRelativePop, scales = "free")

plot of chunk unnamed-chunk-13

ggplot(worldRelativePop, aes(percent)) + stat_density(geom = "path", position = "identity") +
  facet_wrap(~continent, scales = "free")

plot of chunk unnamed-chunk-13

ggplot does not actually do a good good reproducing these figures from lattice…I think there is likely a way to fine tune the grid over where the kernel estimate is being calculated, although I do not have time to look through too much documentation now.

####

date()
## [1] "Sun Sep 29 23:08:15 2013"
sessionInfo()
## R version 3.0.1 (2013-05-16)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_0.9.3.1 lattice_0.20-23 xtable_1.7-1    plyr_1.8       
## [5] knitr_1.4.1    
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-2   dichromat_2.0-0    digest_0.6.3      
##  [4] evaluate_0.4.7     formatR_0.9        grid_3.0.1        
##  [7] gtable_0.1.2       labeling_0.2       MASS_7.3-26       
## [10] munsell_0.4.2      proto_0.3-10       RColorBrewer_1.0-5
## [13] reshape2_1.2.2     scales_0.2.3       stringr_0.6.2     
## [16] tools_3.0.1