The goal of this assignment is to produce figures to accompany the data aggregation tasks in homework 3. Note that for this assignment I dropped Oceania right from the beginning.
I have chosen to revisit the tasks where I believe JB's code is superior to my own.
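library(plyr)     # ddply(), daply(), arrange()
library(xtable)   # xtable(), used by htmlPrint()
library(lattice)  # xyplot()
library(stringr)  # str_replace()
# read from internet
# gDat <- read.delim('http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt')
# read locally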
gDat <- read.delim(file.path(str_replace(getwd(), "stat545a-2013-hw04", "GapMinder"),
"gapminderDataFiveYear.txt"))
# I will drop Oceania in the beginning
gDat <- droplevels(subset(gDat, continent != "Oceania"))
summary(gDat$continent) # oceania is not here
## Africa Americas Asia Europe
## 624 300 396 360
I must say, defining an htmlPrint() function is an excellent idea from JB. Anything that I do more than a couple of times probably deserves some simplification…
## define a function for converting and printing to HTML table
htmlPrint <- function(x, ..., digits = 0, include.rownames = FALSE) {
print(xtable(x, digits = digits, ...), type = "html", include.rownames = include.rownames,
...)
}
# my own code for the task; this is not very elegant
fct <- function(dframe) {
res <- data.frame(factor_name = "min", GDP = min(dframe$gdpPercap))
res <- rbind(res, data.frame(factor_name = "max", GDP = max(dframe$gdpPercap)))
res  # return the two-row data frame explicitly
}
minmax.gdp_by_continent.tall <- arrange(ddply(gDat, ~continent, fct), factor_name,
GDP)
# JB's code for the same task
foo <- ddply(gDat, ~continent, function(x) {
gdpPercap <- range(x$gdpPercap)
return(data.frame(gdpPercap, stat = c("min", "max")))
})
JB's code is far more readable than my own. It takes advantage of the fact that if a function returns a data frame with 2 rows, these mini data frames will be stacked on top of each other during the “gluing back” step of the data aggregation. Furthermore, it is smart in its use of the gdpPercap object in the intermediate step: the object name provides the label for the column!
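The stacking behaviour can be sketched in base R, without plyr. This is only an illustration on made-up toy data (the `toy` data frame and the names `pieces`, `per_group` and `stacked` are hypothetical), not how ddply() is implemented internally. It shows the two ideas above: each group returns a two-row data frame, and the unnamed `value` argument to data.frame() supplies its own column name.

```r
# Toy data, made up for illustration only
toy <- data.frame(
  group = c("a", "a", "a", "b", "b"),
  value = c(3, 9, 5, 10, 2)
)

# Split: one data frame per group
pieces <- split(toy, toy$group)

# Apply: each group returns a 2-row data frame (min and max)
per_group <- lapply(names(pieces), function(g) {
  value <- range(pieces[[g]]$value)   # c(min, max)
  # 'value' is passed unnamed, so the column is labelled "value"
  data.frame(group = g, value, stat = c("min", "max"))
})

# Combine: stack the mini data frames on top of each other
stacked <- do.call(rbind, per_group)
stacked
```

The result has two rows per group, in the same tall layout as the table JB's code produces.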
Anyhow, here's the data table that JB's code produces.
htmlPrint(foo)
| continent | gdpPercap | stat |
|---|---|---|
| Africa | 241 | min |
| Africa | 21951 | max |
| Americas | 1202 | min |
| Americas | 42952 | max |
| Asia | 331 | min |
| Asia | 113523 | max |
| Europe | 974 | min |
| Europe | 49357 | max |
Ideally I want the plot to show the minimum and maximum at one glance. I would like the x-axis to be the continents and the y-axis to be GDP per capita, with two different colours or shapes to represent max and min.
Maybe an xyplot will do it?
xyplot(gdpPercap ~ continent, data = arrange(foo, stat, gdpPercap), groups = stat,
auto.key = TRUE)
But the continents are not ordered in any way. I'd like them to be ordered by minimum GDP per capita. I'd also like to add a line connecting each set of points (mins or maxes) to show the trend.
# this is extremely useful
foo <- within(foo, continent <- reorder(continent, gdpPercap, FUN = min))
xyplot(gdpPercap ~ continent, data = arrange(foo, stat, gdpPercap), groups = stat,
auto.key = TRUE, type = c("p", "a"))
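To see what reorder() does to the factor levels, here is a minimal sketch on made-up toy data (the `toy` data frame is hypothetical and unrelated to Gapminder). It only changes the order of the levels, which is what lattice uses to place categories along the axis; the rows themselves are left where they are.

```r
# Toy data, made up for illustration only
toy <- data.frame(
  group = factor(c("a", "a", "b", "b", "c", "c")),
  value = c(5, 50, 1, 90, 3, 20)
)

before <- levels(toy$group)   # default: alphabetical order

# Reorder the levels by each group's minimum value
toy$group <- reorder(toy$group, toy$value, FUN = min)
after <- levels(toy$group)    # group minima are a = 5, b = 1, c = 3

before
after
```

Swapping `FUN = min` for `FUN = max` or `FUN = median` would order the levels by a different summary.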
# my code
mean.lifeExp_by_year_by_cont.wide <- daply(gDat, ~year, daply, ~continent, summarize,
mean = mean(lifeExp), median = median(lifeExp)) #note multiple stats can be computed with one call! This is actually unnecessary
# JB's code
foo <- daply(gDat, ~continent + year, summarize, medLifeExp = median(lifeExp))
JB's code is much smarter. I had not figured out that you don't need to nest function calls: both grouping variables can go into a single formula.
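The same idea, crossing two grouping variables in one call instead of nesting, can be sketched in base R with tapply(). This is an analogy on made-up toy data (the `toy` data frame and `med` are hypothetical), not a replacement for the daply() call above.

```r
# Toy data, made up for illustration only
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia", "Africa", "Asia"),
  year      = c(1952, 1952, 1952, 1952, 1957, 1957),
  lifeExp   = c(38, 40, 44, 48, 41, 50)
)

# One call, two grouping variables: rows are continents, columns are years
med <- with(toy, tapply(lifeExp, list(continent, year), median))
med
```

The result is a matrix with one row per continent and one column per year, the same wide shape as the table of medians.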
htmlPrint(as.data.frame(foo), include.rownames = TRUE)
| | 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Africa | 39 | 41 | 43 | 45 | 47 | 49 | 51 | 52 | 52 | 53 | 51 | 53 |
| Americas | 55 | 56 | 58 | 61 | 63 | 66 | 67 | 69 | 70 | 72 | 72 | 73 |
| Asia | 45 | 48 | 49 | 54 | 57 | 61 | 64 | 66 | 69 | 70 | 71 | 72 |
| Europe | 66 | 68 | 70 | 71 | 71 | 72 | 73 | 75 | 75 | 76 | 78 | 79 |
Now, how would I actually graph this? Multi-dimensional arrays, or arrays in general, seem difficult to graph, since most graphing packages make it convenient to pass in column names to indicate which values to plot. That is not possible here…
I didn't manage to find any way to graph the output of daply… so I will try to work with the “tall” data.frame format.
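One possible route from a wide array back to a tall data frame, not something I tried in this assignment, is base R's as.data.frame() on a table. The sketch below uses a made-up toy matrix (the names `wide` and `tall` and the values are hypothetical) standing in for the daply() output.

```r
# Toy wide matrix standing in for the daply() output (values are made up)
wide <- matrix(
  c(40, 55, 42, 57, 44, 59), nrow = 2,
  dimnames = list(continent = c("Africa", "Americas"),
                  year = c("1952", "1957", "1962"))
)

# One row per (continent, year) cell; the cell values go in 'medLifeExp'
tall <- as.data.frame(as.table(wide), responseName = "medLifeExp")

# The dimension names come back as factors; make year numeric for plotting
tall$year <- as.numeric(as.character(tall$year))
tall
```

A data frame in this shape has named columns, so it could be passed to a formula interface such as `medLifeExp ~ year` with `groups = continent`.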
Here I try panels for each continent. It is a little hard to compare…
foo <- ddply(gDat, ~continent + year, summarize, medLifeExp = median(lifeExp))
xyplot(medLifeExp ~ year | continent, data = foo, grid = "h")
Here I try “groups” within the same graph.
foo <- within(foo, continent <- reorder(continent, medLifeExp, FUN = min))
xyplot(medLifeExp ~ year, groups = continent, data = foo, grid = "h", type = c("p",
"a"), auto.key = TRUE)
This is a pretty good attempt. But I can really see direct labelling being a big bonus here… Apparently, the directlabels package is now compatible with lattice.
library(directlabels)
## Warning: package 'directlabels' was built under R version 3.0.2
direct.label(xyplot(medLifeExp ~ year, groups = continent, data = foo, grid = "h",
type = c("b"), xlim = c(1945, 2010), ylim = c(30, 80)), "lasso.labels")
With direct labelling, it's much easier to compare continents. I also had to make adjustments to the graphing limits of the x-axis and y-axis to make sure that the labels are not cut off.