stat545a-2013-hw04_inskip-jes

Homework 4: Visualizing Quantitative Variables with lattice

For this Homework, I have used some data aggregation code from student Sean Jewell.
His report can be found here: http://rpubs.com/jewellsean/stat545a-2013-hw03_jewell-sea
and his code here: https://gist.github.com/jewellsean/bf3c28b63e6f99953153#file-stat545a-2013-hw03_jewell-sea-rmd

I found (and I chose it partly because) it was very well annotated in that there was a lot of textual discussion before and after each table. The tables were clear and well laid out. The discussion was interesting and the code was clear. The inline comments here and there were also helpful to clarify what was being done.

Load packages that will be used and import gapminder dataset and str data to check that import has been successful.

library(plyr)
library(lattice)
library(knitr)
library(xtable)
library(RColorBrewer)
setwd("~/Rwork/STAT545/20130909")
gDat <- read.delim(file = "gapminderDataFiveYear.txt")
str(gDat)

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

tail(gDat)

##       country year      pop continent lifeExp gdpPercap
## 1699 Zimbabwe 1982  7636524    Africa   60.36     788.9
## 1700 Zimbabwe 1987  9216418    Africa   62.35     706.2
## 1701 Zimbabwe 1992 10704340    Africa   60.38     693.4
## 1702 Zimbabwe 1997 11404948    Africa   46.81     792.4
## 1703 Zimbabwe 2002 11926563    Africa   39.99     672.0
## 1704 Zimbabwe 2007 12311143    Africa   43.49     469.7


hDat <- droplevels(subset(gDat, continent != "Oceania"))  # remove data from Oceania
table(hDat$continent)  # confirm removal

## 
##   Africa Americas     Asia   Europe 
##      624      300      396      360

Depict the maximum and minimum of GDP per capita for all contients.

It was strange not to have the year included in this table, and here I am trying to figure out a way to include year as a variable.

## some nice code that Sean added to avoid the repetitive task for every table generated (similar to what Jenny did in lecture example)
printTable <- function(df) {
    print(xtable(df), type = "html", include.rownames = F)
    }

minMaxGDPwide <- ddply(hDat, ~continent, summarize, minGDP = min(gdpPercap), 
                       maxGDP = max(gdpPercap))
minMaxGDPwide <- arrange(minMaxGDPwide, minGDP)  # sort on minGDP
printTable(minMaxGDPwide)

continent	minGDP	maxGDP
Africa	241.17	21951.21
Asia	331.00	113523.13
Europe	973.53	49357.19
Americas	1201.64	42951.65


xyplot(gdpPercap ~ year|continent, hDat, log = TRUE, type = c("p", "a"))

plot of chunk table

xyplot(log(gdpPercap) ~ year|continent, hDat, jitter.x = TRUE, log = TRUE, type = c("p", "a"))

plot of chunk table

The log transformed graph presents better, but it would depend on the audience - whether they were used to interpreting log axes.

While these both give a nice eyeball for the variability, it would be nice to include the label on the max and min as defined in the table above. I am not sure how to identify outliers with labels: there was some discussion online about how to use panel.pointLabel from maptools, but I was not quite brave enough to add this to my to do list yet. This could also be included textually, or by including a small table, for those who would like more detailed information.

minMaxGDPElements <- ddply(hDat, ~continent, summarize, 
                           countryMin = country[which.min(gdpPercap)], 
                           timeMin = year[which.min(gdpPercap)], minGDP = min(gdpPercap), countryMax = country[which.max(gdpPercap)], 
    timeMax = year[which.max(gdpPercap)], maxGDP = max(gdpPercap))
minMaxGDPElements <- arrange(minMaxGDPElements, minGDP)
printTable(minMaxGDPElements)

continent	countryMin	timeMin	minGDP	countryMax	timeMax	maxGDP
Africa	Congo, Dem. Rep.	2002	241.17	Libya	1977	21951.21
Asia	Myanmar	1952	331.00	Kuwait	1957	113523.13
Europe	Bosnia and Herzegovina	1952	973.53	Norway	2007	49357.19
Americas	Haiti	2007	1201.64	United States	2007	42951.65

Look at the spread of GDP per capita within the continents.

How is life expectancy changing over time on different contients?

densityplot(year ~ lifeExp | continent, hDat, plot.points = FALSE, 
            ref = TRUE, group = year, auto.key = list(columns = nlevels(hDat$continent)))

plot of chunk unnamed-chunk-5

This is a bit hard to interpret as a result of both the colours (not in any kind of order) and the sheer number of lines in each plot. I thought it would be nice to group by decade and keep the colours in the same shade.

hDat$decade <- hDat$year%/%10 * 10
myColours <- brewer.pal(6,"Blues")
iDat  <- ddply(hDat, ~country*decade*continent, summarize, lifeExp = mean(lifeExp))
densityplot(decade ~ lifeExp | continent, iDat, plot.points = FALSE, 
            col = myColours,
            ## par.settings = list(myColours), I can't get this to work - seems like a common frustration online!!
            ref = TRUE, group = decade,
            auto.key = list(columns = 3, col = myColours, 
            corner = c(0.05, 0.95), col = myColours)
            )

plot of chunk unnamed-chunk-6

The lines in the key are intensely frustrating. This seems to be a common theme online and part of the benefit to moving to ggplot2 I think.

While these density plots can be useful for getting a general sense of the data, it is also nice to get a more specific handle on the exact

Working from Sean's code and trying to find a nice way to demonstrate the mean life expectancy for different years.

meanLE <- ddply(hDat, ~ continent*year, summarize, meanLifeExp = mean(lifeExp))
lifeExpCont <- ddply(hDat, .(continent, year), summarize, meanLifeExp = mean(lifeExp), 
    medianLifeExp = median(lifeExp))
xyplot(meanLifeExp ~ year, lifeExpCont, groups = continent, auto.key = TRUE)

plot of chunk unnamed-chunk-7

This turned out OK. I am sure that there are some more descriptive ways of showing the same information, but not a bad start.

Graph titles should be added.