Homework04

The goal of this assignment is to produce figures to accompany the data aggregation tasks from Homework 3. Note that for this assignment I dropped Oceania right from the beginning.

I have chosen to revisit the tasks where I believe JB's code is superior to my own.

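# packages used below
library(plyr)      # ddply, daply, arrange, summarize
library(xtable)    # xtable, for HTML tables
library(lattice)   # xyplot
library(stringr)   # str_replace

# read from internet
# gDat <- read.delim('http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt')

# read locally
gDat <- read.delim(file.path(str_replace(getwd(), "stat545a-2013-hw04", "GapMinder"), 
    "gapminderDataFiveYear.txt"))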

# I will drop Oceania in the beginning
gDat <- droplevels(subset(gDat, continent != "Oceania"))
summary(gDat$continent)  # oceania is not here
##   Africa Americas     Asia   Europe 
##      624      300      396      360

I must say, defining an htmlPrint() function was an excellent idea from JB. Anything that I do more than a couple of times probably deserves some simplification…

## define a function for converting and printing to HTML table
htmlPrint <- function(x, ..., digits = 0, include.rownames = FALSE) {
    print(xtable(x, digits = digits, ...), type = "html", include.rownames = include.rownames, 
        ...)
}

Task 1: Get the maximum and minimum of GDP per capita for all continents in a “tall” format. Medium/hard.

# my own code for the task; this is not very elegant
fct <- function(dframe) {
    res <- data.frame(factor_name = "min", GDP = min(dframe$gdpPercap))
    # return the two-row result explicitly rather than as the invisible value of an assignment
    rbind(res, data.frame(factor_name = "max", GDP = max(dframe$gdpPercap)))
}
minmax.gdp_by_continent.tall <- arrange(ddply(gDat, ~continent, fct), factor_name, 
    GDP)

# JB's code for the same task
foo <- ddply(gDat, ~continent, function(x) {
    gdpPercap <- range(x$gdpPercap)
    return(data.frame(gdpPercap, stat = c("min", "max")))
})

JB's code is far more readable than my own. It takes advantage of the fact that if a function returns a data frame with 2 rows, these mini data frames are stacked on top of each other when the pieces are glued back together during data aggregation. Furthermore, the intermediate gdpPercap object is a smart touch: the object name supplies the column label!
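To make the stacking behaviour concrete, here is a minimal base-R sketch of the same split, apply, combine steps done by hand. It is an illustration, not how ddply is implemented; the `toy` data frame, its values, and the `minmax` helper are hypothetical.

```r
# Hypothetical toy data (not the Gapminder values)
toy <- data.frame(
    continent = c("A", "A", "A", "B", "B"),
    gdpPercap = c(10, 30, 20, 5, 50)
)

# Returns a two-row data frame per group; the object name gdpPercap
# becomes the column name because data.frame() deparses its argument
minmax <- function(x) {
    gdpPercap <- range(x$gdpPercap)  # c(min, max)
    data.frame(gdpPercap, stat = c("min", "max"))
}

# split -> apply -> combine, done by hand
pieces <- lapply(split(toy, toy$continent), minmax)
stacked <- do.call(rbind, pieces)
stacked$continent <- rep(names(pieces), each = 2)  # label each two-row piece
rownames(stacked) <- NULL
print(stacked)
```

Each group contributes two rows, and the combined result has one min row and one max row per group, which is the "tall" layout the task asks for.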

Anyhow, here's the data table produced.

htmlPrint(foo)
continent gdpPercap stat
Africa 241 min
Africa 21951 max
Americas 1202 min
Americas 42952 max
Asia 331 min
Asia 113523 max
Europe 974 min
Europe 49357 max

Ideally I want the plot to show the minimum and maximum at a glance. I would like the x-axis to be the continents and the y-axis to be GDP per capita, with 2 different colours/shapes to represent max and min.

Maybe an xyplot will do it?

xyplot(gdpPercap ~ continent, data = arrange(foo, stat, gdpPercap), groups = stat, 
    auto.key = TRUE)

[Figure: xyplot of gdpPercap against continent, with min and max as separate groups]

But the continents are not ordered in any way. I'd like them to be ordered by the minimum GDP per capita. I'd also like a line connecting each set of points (the mins, or the maxes) to show the trend.

# reorder the continent levels by minimum gdpPercap; this is extremely useful
foo <- within(foo, continent <- reorder(continent, gdpPercap, FUN = min))

xyplot(gdpPercap ~ continent, data = arrange(foo, stat, gdpPercap), groups = stat, 
    auto.key = TRUE, type = c("p", "a"))

[Figure: the same plot with continents ordered by minimum GDP per capita and a line joining each group]
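As a quick check of what reorder() does to the factor levels, here is a small sketch. The `toy` data frame is hypothetical in name only: it holds the rounded min and max values from the table above.

```r
# Rounded min/max GDP per capita from the table above
toy <- data.frame(
    continent = factor(rep(c("Africa", "Americas", "Asia", "Europe"), each = 2)),
    gdpPercap = c(241, 21951, 1202, 42952, 331, 113523, 974, 49357)
)

levels(toy$continent)  # default order is alphabetical

# Reorder the levels by the minimum gdpPercap within each continent
toy <- within(toy, continent <- reorder(continent, gdpPercap, FUN = min))

levels(toy$continent)  # now ordered by each continent's minimum
```

Only the level order changes, not the rows, so lattice picks up the new ordering on the x-axis without any further changes to the plotting call.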

Task 2: How is life expectancy changing over time on different continents? “Wide” format. Hard (or possibly silly).

# my code
# note multiple stats can be computed with one call! This is actually unnecessary
mean.lifeExp_by_year_by_cont.wide <- daply(gDat, ~year, daply, ~continent, summarize, 
    mean = mean(lifeExp), median = median(lifeExp))


# JB's code
foo <- daply(gDat, ~continent + year, summarize, medLifeExp = median(lifeExp))

JB's code is much smarter. I had not realized that putting both variables in one formula (~continent + year) removes the need to nest the daply calls.
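The same idea, one call with two grouping variables instead of nested calls, can be sketched in base R with tapply(). This is a stand-in for illustration rather than the plyr approach used above, and the `toy` data and its values are hypothetical.

```r
# Hypothetical toy data (not the Gapminder values)
toy <- data.frame(
    continent = rep(c("A", "B"), each = 4),
    year = rep(c(1952, 1952, 1957, 1957), times = 2),
    lifeExp = c(40, 42, 44, 48, 60, 70, 65, 75)
)

# One call, two grouping variables: rows are continents, columns are years
wide <- with(toy, tapply(lifeExp, list(continent = continent, year = year), median))
print(wide)
```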

htmlPrint(as.data.frame(foo), include.rownames = TRUE)
1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
Africa 39 41 43 45 47 49 51 52 52 53 51 53
Americas 55 56 58 61 63 66 67 69 70 72 72 73
Asia 45 48 49 54 57 61 64 66 69 70 71 72
Europe 66 68 70 71 71 72 73 75 75 76 78 79

Now, how would I actually graph this? Arrays, multi-dimensional or otherwise, seem difficult to graph, since most graphing packages make it convenient to pass in column names to indicate which values to plot, and an array has no such columns to name…

I didn't manage to find any way to graph the output of daply… so I will try to work with the “tall” data.frame format.
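One possible route from a wide array back to a tall data frame is base R's as.table() followed by as.data.frame(). The sketch below assumes the array is a plain numeric matrix with named dimnames, which may not hold for every daply result. The `wide` matrix is a hypothetical stand-in built from two columns of the table above.

```r
# Two years of the median life expectancy table above, as a numeric matrix
wide <- matrix(
    c(39, 55, 45, 66,   # 1952
      53, 73, 72, 79),  # 2007
    nrow = 4,
    dimnames = list(
        continent = c("Africa", "Americas", "Asia", "Europe"),
        year = c("1952", "2007")
    )
)

# as.table() + as.data.frame() gives one row per (continent, year) cell
tall <- as.data.frame(as.table(wide), responseName = "medLifeExp",
    stringsAsFactors = FALSE)
tall$year <- as.integer(tall$year)  # dimnames are character; restore numbers
print(tall)
```

The result has continent, year and medLifeExp columns, so it can be passed to a formula interface such as xyplot(medLifeExp ~ year | continent, ...).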

Here I try it with one panel per continent. It is a little hard to compare across panels…

foo <- ddply(gDat, ~continent + year, summarize, medLifeExp = median(lifeExp))

xyplot(medLifeExp ~ year | continent, data = foo, grid = "h")

[Figure: median life expectancy against year, one panel per continent]

Here I try it with “groups”, so that all continents share the same graph.

foo <- within(foo, continent <- reorder(continent, medLifeExp, FUN = min))
xyplot(medLifeExp ~ year, groups = continent, data = foo, grid = "h", type = c("p", 
    "a"), auto.key = TRUE)

[Figure: median life expectancy against year, continents as groups in a single panel]

This is a pretty good attempt. But I can really see direct labelling being a big bonus here… Apparently, the directlabels package is now compatible with lattice.

library(directlabels)
## Warning: package 'directlabels' was built under R version 3.0.2
direct.label(xyplot(medLifeExp ~ year, groups = continent, data = foo, grid = "h", 
    type = c("b"), xlim = c(1945, 2010), ylim = c(30, 80)), "lasso.labels")

[Figure: the grouped plot with each continent labelled directly]

With direct labelling, it's much easier to compare continents. I also had to make adjustments to the graphing limits of the x-axis and y-axis to make sure that the labels are not cut off.
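Instead of hard-coding the limits, they could be derived from the data with extendrange() from grDevices. This is a sketch under assumptions: the 10% padding is a guess that may need tuning for the label size, and the `toy` data frame is a hypothetical stand-in holding a few values from the table above.

```r
library(lattice)

# A few values from the median life expectancy table above
toy <- data.frame(
    year = rep(c(1952, 1977, 2007), times = 2),
    medLifeExp = c(39, 49, 53, 66, 72, 79),
    continent = rep(c("Africa", "Europe"), each = 3)
)

# Pad each axis by 10% of its data range (an assumed fraction)
xlim <- extendrange(toy$year, f = 0.10)
ylim <- extendrange(toy$medLifeExp, f = 0.10)

p <- xyplot(medLifeExp ~ year, groups = continent, data = toy,
    type = "b", grid = "h", xlim = xlim, ylim = ylim)
print(p)
```

The resulting trellis object can be passed to direct.label() in the same way as the plot above.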