Homework04

The goal of this assignment is to produce figures to accompany the data aggregation tasks in homework 3. Note that for this assignment I dropped Oceania right from the beginning.

I have chosen to revisit the tasks where I believe JB's code is superior to my own.

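# packages used below (assumed to be installed)
library(plyr)      # ddply(), daply(), arrange()
library(xtable)    # xtable()
library(lattice)   # xyplot()
library(stringr)   # str_replace()

# read from internet
# gDat <- read.delim('http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt')

# read locally
gDat <- read.delim(file.path(str_replace(getwd(), "stat545a-2013-hw04", "GapMinder"), 
    "gapminderDataFiveYear.txt"))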

# I will drop Oceania in the beginning
gDat <- droplevels(subset(gDat, continent != "Oceania"))
summary(gDat$continent)  # Oceania is no longer a level

##   Africa Americas     Asia   Europe 
##      624      300      396      360

I must say, defining an htmlPrint() function is an excellent idea from JB. Anything that I do more than a couple of times probably deserves to be wrapped in a function…

## define a function for converting and printing to HTML table
htmlPrint <- function(x, ..., digits = 0, include.rownames = FALSE) {
    print(xtable(x, digits = digits, ...), type = "html", include.rownames = include.rownames, 
        ...)
}

Task 1: Get the maximum and minimum of GDP per capita for all continents in a “tall” format. Medium/hard.

# my own code for the task; this is not very elegant
fct <- function(dframe) {
    res <- data.frame(factor_name = "min", GDP = min(dframe$gdpPercap))
    # return the two-row result visibly instead of ending on an assignment
    rbind(res, data.frame(factor_name = "max", GDP = max(dframe$gdpPercap)))
}
minmax.gdp_by_continent.tall <- arrange(ddply(gDat, ~continent, fct), factor_name, 
    GDP)

# JB's code for the same task
foo <- ddply(gDat, ~continent, function(x) {
    gdpPercap <- range(x$gdpPercap)
    return(data.frame(gdpPercap, stat = c("min", "max")))
})

JB's code is far more readable than my own. It takes advantage of the fact that if a function returns a data frame with 2 rows, these mini data frames are stacked on top of each other when ddply() glues the results back together. Furthermore, the intermediate gdpPercap object is a smart touch: data.frame() names the column after the object, so no explicit label is needed!
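To convince myself of both points, here is a minimal sketch on toy data. The numbers are made up (they are not the Gapminder values), the helper name min_max is hypothetical, and base R's split(), lapply() and rbind() stand in for what ddply() does internally, so treat it as an illustration rather than plyr's actual implementation.

```r
# Toy data with made-up numbers (NOT the Gapminder values).
toy <- data.frame(
  continent = c("A", "A", "A", "B", "B"),
  gdpPercap = c(10, 30, 20, 5, 50)
)

# Hypothetical per-group helper: returns a two-row data frame,
# in the same spirit as JB's anonymous function.
min_max <- function(x) {
  gdpPercap <- range(x$gdpPercap)                # c(min, max)
  data.frame(gdpPercap, stat = c("min", "max"))  # column named after the object
}

# Base R stand-in for ddply(toy, ~continent, min_max): split, apply, stack.
pieces <- lapply(split(toy, toy$continent), min_max)
stacked <- do.call(rbind, pieces)
stacked$continent <- rep(names(pieces), each = 2)
rownames(stacked) <- NULL
stacked
```

Each group contributes two rows, and the value column is called gdpPercap without any explicit naming.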

Anyhow, here's the data table produced.

htmlPrint(foo)

continent	gdpPercap	stat
Africa	241	min
Africa	21951	max
Americas	1202	min
Americas	42952	max
Asia	331	min
Asia	113523	max
Europe	974	min
Europe	49357	max

Ideally I want the plot to show the minimum and maximum at a glance. I would like the x-axis to be the continents and the y-axis to be GDP per capita, with 2 different colours/shapes representing max and min.

Maybe an xyplot will do it?

xyplot(gdpPercap ~ continent, data = arrange(foo, stat, gdpPercap), groups = stat, 
    auto.key = TRUE)

[Figure: xyplot of gdpPercap against continent, with min and max as groups]

But the continents are not ordered in any meaningful way. I'd like them to be ordered by minimum GDP per capita. I also add a line connecting each set of points (mins or maxes) to show the trend.

# reorder() is extremely useful: sort the continent levels by minimum gdpPercap
foo <- within(foo, continent <- reorder(continent, gdpPercap, FUN = min))

xyplot(gdpPercap ~ continent, data = arrange(foo, stat, gdpPercap), groups = stat, 
    auto.key = TRUE, type = c("p", "a"))

[Figure: xyplot of gdpPercap against continent, continents ordered by minimum gdpPercap, lines joining each group]
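To see what reorder() does to the factor itself, here is a small sketch that rebuilds the min/max table from above by hand (the values are copied from the printed table) and inspects the resulting level order. It only uses base R, so it runs without plyr or lattice.

```r
# The eight rows of the min/max table, typed in by hand.
continent <- factor(c("Africa", "Africa", "Americas", "Americas",
                      "Asia", "Asia", "Europe", "Europe"))
gdpPercap <- c(241, 21951, 1202, 42952, 331, 113523, 974, 49357)

levels(continent)   # default order is alphabetical

# Re-level the factor by the minimum gdpPercap observed within each level.
reordered <- reorder(continent, gdpPercap, FUN = min)
levels(reordered)   # now sorted by each continent's minimum
```

Only the level order changes; the data values stay the same, which is why the plotting call itself needs no modification.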

Task 2: How is life expectancy changing over time on different continents? “Wide” format. Hard (or possibly silly).

# my code
# note: multiple stats can be computed with one call, though that is
# actually unnecessary here
mean.lifeExp_by_year_by_cont.wide <- daply(gDat, ~year, daply, ~continent, summarize, 
    mean = mean(lifeExp), median = median(lifeExp))


# JB's code
foo <- daply(gDat, ~continent + year, summarize, medLifeExp = median(lifeExp))

JB's code is much smarter. I had not figured out that a formula with two variables (~continent + year) makes nesting daply() calls unnecessary.

htmlPrint(as.data.frame(foo), include.rownames = TRUE)

	1952	1957	1962	1967	1972	1977	1982	1987	1992	1997	2002	2007
Africa	39	41	43	45	47	49	51	52	52	53	51	53
Americas	55	56	58	61	63	66	67	69	70	72	72	73
Asia	45	48	49	54	57	61	64	66	69	70	71	72
Europe	66	68	70	71	71	72	73	75	75	76	78	79
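The same "one call, two grouping variables" idea can be sketched in base R with tapply(). This is only an analogy on toy data with made-up numbers, not a replacement for the daply() call above.

```r
# Toy data with made-up numbers (NOT the Gapminder values).
toy <- data.frame(
  continent = c("A", "A", "A", "A", "B", "B", "B", "B"),
  year      = c(1952, 1952, 1957, 1957, 1952, 1952, 1957, 1957),
  lifeExp   = c(40, 42, 44, 48, 60, 64, 66, 70)
)

# One call, two grouping variables: rows are continents, columns are years.
wide <- with(toy, tapply(lifeExp,
                         list(continent = continent, year = year),
                         median))
wide
```

The result is a matrix with one row per continent and one column per year, the same shape as the table printed above.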

Now, how would I actually graph this? Multi-dimensional arrays, or arrays in general, seem difficult to graph, since most graphing packages make it convenient to pass in column names to indicate which values to plot. That is not possible with an array…

I didn't manage to find any way to graph the output of daply… so I will try to work with the “tall” data.frame format.
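One possible route from "wide" back to "tall", which I have not tried on the actual daply() output, is base R's as.table() followed by as.data.frame(). The sketch below uses a small hand-made matrix with made-up numbers, and assumes the real array carries named dimnames in the same way.

```r
# A small wide matrix with made-up numbers, shaped like a continent-by-year array.
wide <- matrix(
  c(41, 62, 46, 68),
  nrow = 2,
  dimnames = list(continent = c("A", "B"), year = c("1952", "1957"))
)

# as.table() + as.data.frame() unrolls the matrix into one row per cell.
tall <- as.data.frame(as.table(wide), responseName = "medLifeExp")

# The year column comes back as a factor; convert it for a numeric x-axis.
tall$year <- as.numeric(as.character(tall$year))
tall
```

The resulting data frame has continent, year and medLifeExp columns, which is the form a formula such as medLifeExp ~ year expects.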

Here I try doing it with a panel for each continent. It is a little hard to compare…

foo <- ddply(gDat, ~continent + year, summarize, medLifeExp = median(lifeExp))

xyplot(medLifeExp ~ year | continent, data = foo, grid = "h")

[Figure: xyplot of medLifeExp against year, one panel per continent]

Here I try to do it with “groups” in the same graph.

foo <- within(foo, continent <- reorder(continent, medLifeExp, FUN = min))
xyplot(medLifeExp ~ year, groups = continent, data = foo, grid = "h", 
    type = c("p", "a"), auto.key = TRUE)

[Figure: xyplot of medLifeExp against year, continents as groups in a single panel]

This is a pretty good attempt. But I can really see direct labelling being a big bonus here… Apparently, the directlabels package is now compatible with lattice.

library(directlabels)

## Warning: package 'directlabels' was built under R version 3.0.2

direct.label(xyplot(medLifeExp ~ year, groups = continent, data = foo, grid = "h", 
    type = "b", xlim = c(1945, 2010), ylim = c(30, 80)), "lasso.labels")

[Figure: grouped xyplot of medLifeExp against year, with a direct label on each continent's line]

With direct labelling, it's much easier to compare continents. I also had to make adjustments to the graphing limits of the x-axis and y-axis to make sure that the labels are not cut off.