Homework 4: Visualizing Quantitative Variables with lattice
For this Homework, I have used some data aggregation code from student Sean Jewell.
His report can be found here: http://rpubs.com/jewellsean/stat545a-2013-hw03_jewell-sea
and his code here: https://gist.github.com/jewellsean/bf3c28b63e6f99953153#file-stat545a-2013-hw03_jewell-sea-rmd
I found (and I chose it partly because) it was very well annotated in that there was a lot of textual discussion before and after each table. The tables were clear and well laid out. The discussion was interesting and the code was clear. The inline comments here and there were also helpful to clarify what was being done.
Load packages that will be used and import gapminder dataset and str data to check that import has been successful.
library(plyr)
library(lattice)
library(knitr)
library(xtable)
library(RColorBrewer)
setwd("~/Rwork/STAT545/20130909")
gDat <- read.delim(file = "gapminderDataFiveYear.txt")
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
tail(gDat)
## country year pop continent lifeExp gdpPercap
## 1699 Zimbabwe 1982 7636524 Africa 60.36 788.9
## 1700 Zimbabwe 1987 9216418 Africa 62.35 706.2
## 1701 Zimbabwe 1992 10704340 Africa 60.38 693.4
## 1702 Zimbabwe 1997 11404948 Africa 46.81 792.4
## 1703 Zimbabwe 2002 11926563 Africa 39.99 672.0
## 1704 Zimbabwe 2007 12311143 Africa 43.49 469.7
hDat <- droplevels(subset(gDat, continent != "Oceania")) # remove data from Oceania
table(hDat$continent) # confirm removal
##
## Africa Americas Asia Europe
## 624 300 396 360
Depict the maximum and minimum of GDP per capita for all contients.
It was strange not to have the year included in this table, and here I am trying to figure out a way to include year as a variable.
## some nice code that Sean added to avoid the repetitive task for every table generated (similar to what Jenny did in lecture example)
printTable <- function(df) {
print(xtable(df), type = "html", include.rownames = F)
}
minMaxGDPwide <- ddply(hDat, ~continent, summarize, minGDP = min(gdpPercap),
maxGDP = max(gdpPercap))
minMaxGDPwide <- arrange(minMaxGDPwide, minGDP) # sort on minGDP
printTable(minMaxGDPwide)
| continent | minGDP | maxGDP |
|---|---|---|
| Africa | 241.17 | 21951.21 |
| Asia | 331.00 | 113523.13 |
| Europe | 973.53 | 49357.19 |
| Americas | 1201.64 | 42951.65 |
xyplot(gdpPercap ~ year|continent, hDat, log = TRUE, type = c("p", "a"))
xyplot(log(gdpPercap) ~ year|continent, hDat, jitter.x = TRUE, log = TRUE, type = c("p", "a"))
The log transformed graph presents better, but it would depend on the audience - whether they were used to interpreting log axes.
While these both give a nice eyeball for the variability, it would be nice to include the label on the max and min as defined in the table above. I am not sure how to identify outliers with labels: there was some discussion online about how to use panel.pointLabel from maptools, but I was not quite brave enough to add this to my to do list yet. This could also be included textually, or by including a small table, for those who would like more detailed information.
minMaxGDPElements <- ddply(hDat, ~continent, summarize,
countryMin = country[which.min(gdpPercap)],
timeMin = year[which.min(gdpPercap)], minGDP = min(gdpPercap), countryMax = country[which.max(gdpPercap)],
timeMax = year[which.max(gdpPercap)], maxGDP = max(gdpPercap))
minMaxGDPElements <- arrange(minMaxGDPElements, minGDP)
printTable(minMaxGDPElements)
| continent | countryMin | timeMin | minGDP | countryMax | timeMax | maxGDP |
|---|---|---|---|---|---|---|
| Africa | Congo, Dem. Rep. | 2002 | 241.17 | Libya | 1977 | 21951.21 |
| Asia | Myanmar | 1952 | 331.00 | Kuwait | 1957 | 113523.13 |
| Europe | Bosnia and Herzegovina | 1952 | 973.53 | Norway | 2007 | 49357.19 |
| Americas | Haiti | 2007 | 1201.64 | United States | 2007 | 42951.65 |
Look at the spread of GDP per capita within the continents.
How is life expectancy changing over time on different contients?
densityplot(year ~ lifeExp | continent, hDat, plot.points = FALSE,
ref = TRUE, group = year, auto.key = list(columns = nlevels(hDat$continent)))
This is a bit hard to interpret as a result of both the colours (not in any kind of order) and the sheer number of lines in each plot. I thought it would be nice to group by decade and keep the colours in the same shade.
hDat$decade <- hDat$year%/%10 * 10
myColours <- brewer.pal(6,"Blues")
iDat <- ddply(hDat, ~country*decade*continent, summarize, lifeExp = mean(lifeExp))
densityplot(decade ~ lifeExp | continent, iDat, plot.points = FALSE,
col = myColours,
## par.settings = list(myColours), I can't get this to work - seems like a common frustration online!!
ref = TRUE, group = decade,
auto.key = list(columns = 3, col = myColours,
corner = c(0.05, 0.95), col = myColours)
)
The lines in the key are intensely frustrating. This seems to be a common theme online and part of the benefit to moving to ggplot2 I think.
While these density plots can be useful for getting a general sense of the data, it is also nice to get a more specific handle on the exact
Working from Sean's code and trying to find a nice way to demonstrate the mean life expectancy for different years.
meanLE <- ddply(hDat, ~ continent*year, summarize, meanLifeExp = mean(lifeExp))
lifeExpCont <- ddply(hDat, .(continent, year), summarize, meanLifeExp = mean(lifeExp),
medianLifeExp = median(lifeExp))
xyplot(meanLifeExp ~ year, lifeExpCont, groups = continent, auto.key = TRUE)
This turned out OK. I am sure that there are some more descriptive ways of showing the same information, but not a bad start.
Graph titles should be added.