Homework Number Four (Stat545a)

In this assignment we will be using the Gapminder dataset. (Located here for those who are curious).

The goal of the present assignment is to make use of code which was written by another student. I will use the code from Mina Park.

But first, we need to load the dataset and do our little sanity check with str(), followed by loading the packages we need: plyr and lattice.

gDat <- read.delim("gapminderDataFiveYear.txt")
str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

library(plyr)
library(lattice)

After looking through her (thankfully very well written and easy to follow!) code, I decide to make plots from her tasks 2 and 4.2. The former is a tall display of mean GDP per capita and mean life expectancy by continent over time. The latter is a tall calculation of the proportion of countries in each continent with a life expectancy below a hard-wired threshold, over time.

The code from task 2 is printed below:

lifeAndGDP <- ddply(gDat, ~year + continent, summarize, meanLifeExp = mean(lifeExp), 
    meanGdp = mean(gdpPercap))

Note that I have changed the name of the object produced to lifeAndGDP. This is different from how Mina had it, but I prefer my name to hers. Hers is much more descriptive, but I find mine more “friendly” to type. Different strokes for different folks I suppose.

I now want to check the names of the different variables in this new object:

names(lifeAndGDP)
## [1] "year"        "continent"   "meanLifeExp" "meanGdp"

Okay, it seems that the two of interest are meanLifeExp and meanGdp. To start, I make some simple scatter plots of each of these variables against year. Note that for both I make the choice to drop the Oceania continent, as it only has two countries. I should also explain what that hard-coded 4 is doing in the auto.key argument. In essence I agree with JB's idea that the key should be horizontal. I know it would be better not to hard-code the number of elements, but it seemed to me that if I added the actual code to “figure out” what the number ought to be, then it would get needlessly messy and long. Since I am certain that the number will work out to 4 with this dataset, I decided just to hard-code it.
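For anyone who would rather not rely on the literal 4, here is a minimal sketch of how the count could be derived from the data instead. It runs on a toy stand-in for lifeAndGDP (the object `toy`, its invented numbers, and the names `plotDat` and `nKeyCols` are assumptions for illustration, not part of Mina's code).

```r
# Toy stand-in for lifeAndGDP; the numbers are invented for illustration only.
toy <- data.frame(
  year = rep(c(1952, 2007), each = 5),
  continent = factor(rep(c("Africa", "Americas", "Asia", "Europe", "Oceania"),
                         times = 2)),
  meanLifeExp = c(39, 53, 46, 64, 69, 55, 74, 71, 78, 81)
)

# Drop Oceania and its now-empty factor level, as in the plots in this post.
plotDat <- droplevels(subset(toy, continent != "Oceania"))

# The number of key columns is simply the number of remaining levels.
nKeyCols <- nlevels(plotDat$continent)
print(nKeyCols)

# With lattice loaded, this value could stand in for the literal 4:
# xyplot(meanLifeExp ~ year, plotDat, groups = continent,
#        auto.key = list(columns = nKeyCols))
```

The cost is one extra intermediate object, which is roughly the "messiness" being weighed above.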

xyplot(meanLifeExp ~ year, droplevels(subset(lifeAndGDP, continent != "Oceania")), 
    groups = continent, auto.key = list(columns = 4), grid = "h", ylab = "Mean Life Expectancy", 
    xlab = "Year", main = "Mean Life Expectancy Over Time, by Continent")

Figure: Mean Life Expectancy Over Time, by Continent

and…

xyplot(meanGdp ~ year, droplevels(subset(lifeAndGDP, continent != "Oceania")), 
    groups = continent, auto.key = list(columns = 4), grid = "h", ylab = "Mean GDP", 
    xlab = "Year", main = "Mean GDP Over Time, by Continent")

Figure: Mean GDP Over Time, by Continent

These two plots show that across all continents there is a trend of increasing mean life expectancy and mean GDP over time. However, I feel that this would have been apparent from the table: in my opinion, the graphs aren't adding much to our understanding.
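To make the comparison with "the table" concrete, here is a small sketch of one way to lay the tall data out as a year-by-continent table using base R's xtabs(). The data frame `toy` and its values are invented for illustration; on the real data the same call would be pointed at lifeAndGDP.

```r
# Toy stand-in for lifeAndGDP; the numbers are invented for illustration only.
toy <- data.frame(
  year = rep(c(1952, 2007), each = 4),
  continent = rep(c("Africa", "Americas", "Asia", "Europe"), times = 2),
  meanLifeExp = c(39, 53, 46, 64, 55, 74, 71, 78)
)

# xtabs() sums the left-hand side within each cell. Because the tall data has
# exactly one row per year/continent pair, each cell is just that row's value.
lifeTable <- xtabs(meanLifeExp ~ year + continent, data = toy)
print(lifeTable)
```

Reading down any column of such a table shows the same upward trend as the plot.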

I really want to make one graph which makes use of all the information in our lifeAndGDP object. To put it another way, I want to make a really data-dense plot that illustrates something that is not visible in a table of the numbers.

I settle on another scatter plot, but this time I plot mean life expectancy against mean GDP, still by continent. The plot is shown below:

xyplot(meanLifeExp ~ meanGdp, droplevels(subset(lifeAndGDP, continent != "Oceania")), 
    groups = continent, auto.key = list(columns = 4), grid = "h", type = "a", 
    xlab = "Mean GDP", ylab = "Mean Life Expectancy", 
    main = "Mean GDP versus Mean Life Expectancy, by Continent")

Figure: Mean GDP versus Mean Life Expectancy, by Continent

So what's going on here? While it may take a couple more seconds to interpret than the other two plots, I think that it is worth it, as it is telling us a more interesting story. The story is that for all continents, there is a pretty strong, positive relationship between mean GDP and mean life expectancy. Moreover, we see how the different continents “stack up” with respect to each other on these two variables. It also gives us a sense of how the two variables of interest relate to each other. It appears not to be a straightforward linear relationship; as life expectancy increases, it takes a greater and greater increase in GDP to nudge life expectancy any higher.
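One rough way to put a number on that "diminishing returns" impression would be to compare a straight-line fit with a fit on log(GDP). The sketch below is only an illustration: `compareFits` is a hypothetical helper, and the `toy` values are invented to follow a log-shaped curve, so they say nothing about the real Gapminder numbers.

```r
# Hypothetical helper: R-squared of life expectancy regressed on GDP,
# once on the raw scale and once on the log scale.
compareFits <- function(d) {
  c(linear = summary(lm(meanLifeExp ~ meanGdp, data = d))$r.squared,
    logGdp = summary(lm(meanLifeExp ~ log(meanGdp), data = d))$r.squared)
}

# Invented data in which each doubling of GDP adds roughly four years.
toy <- data.frame(
  meanGdp = c(1000, 2000, 4000, 8000, 16000, 32000),
  meanLifeExp = c(51.5, 55.4, 60.0, 63.7, 68.3, 72.1)
)

fits <- compareFits(toy)
print(round(fits, 3))
```

If the log fit also came out clearly ahead on a single continent's rows of lifeAndGDP, that would support the reading above. With only twelve time points per continent, it would be a hint rather than a test.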

Really though this makes me wish I could create a 3-d plot with year as the z-axis… I will have to look into this, as I think it would be pretty cool.
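As a starting point for that idea, lattice itself provides cloud() for 3-d scatter plots, using a formula of the form z ~ x * y. The sketch below uses an invented `toy` data frame (two continents, three years) purely to show the call. I have not tried it on the real lifeAndGDP object.

```r
library(lattice)

# Toy stand-in for lifeAndGDP; the numbers are invented for illustration only.
toy <- data.frame(
  year = rep(c(1952, 1977, 2007), times = 2),
  continent = factor(rep(c("Africa", "Europe"), each = 3)),
  meanGdp = c(1250, 2600, 3100, 5700, 14000, 25000),
  meanLifeExp = c(39, 50, 55, 64, 72, 78)
)

# Year goes on the vertical (z) axis, the two means on the floor of the box.
p <- cloud(year ~ meanGdp * meanLifeExp, data = toy, groups = continent,
           auto.key = TRUE, xlab = "Mean GDP", ylab = "Mean Life Expectancy",
           zlab = "Year")
print(p)
```

Whether a static 3-d view is actually easier to read than the 2-d plot above is another matter. The `screen` argument of cloud() controls the viewing angle and usually needs some fiddling.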

I now turn to the data that Mina calculated for her task 4.2. The code is shown below:

DatLowLifeExp <- ddply(gDat, ~continent + country + year, summarize, 
    lowLifeExp = lifeExp < 60)
propLifeExp <- ddply(DatLowLifeExp, ~continent + year, summarize, 
    nCountriesLowLifeExp = sum(lowLifeExp), 
    nCountries = length(unique(country)), 
    propLowLifeExp = sum(lowLifeExp)/length(unique(country)))

Note that I have modified the code: I have renamed some of the variables to suit my own tastes, and I have increased the threshold for life expectancy to 60.
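Since the threshold is the one thing I changed, here is a sketch of how it could be turned into a function argument rather than a hard-wired number. This version swaps plyr for base R's aggregate(), and `propBelow` and the `toy` data are hypothetical names and invented values. It also assumes each country appears exactly once per year, so the mean of the logical vector equals the proportion of countries.

```r
# Hypothetical helper: proportion of countries below `threshold`,
# per continent and year. Assumes one row per country per year.
propBelow <- function(d, threshold = 60) {
  out <- aggregate(lifeExp ~ continent + year, data = d,
                   FUN = function(x) mean(x < threshold))
  names(out)[names(out) == "lifeExp"] <- "propLowLifeExp"
  out
}

# Invented data: two countries in each of two continents, two years.
toy <- data.frame(
  continent = rep(c("Africa", "Europe"), each = 4),
  country = rep(c("A1", "A2", "E1", "E2"), each = 2),
  year = rep(c(1952, 2007), times = 4),
  lifeExp = c(40, 58, 45, 62, 65, 78, 70, 80)
)

res <- propBelow(toy, threshold = 60)
print(res)
```

Calling propBelow(toy, threshold = 50) instead would show how sensitive the proportions are to the choice of cut-off.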

Let's start by seeing the different variables we have to work with:

names(propLifeExp)
## [1] "continent"            "year"                 "nCountriesLowLifeExp"
## [4] "nCountries"           "propLowLifeExp"

It seems that our variable of interest is propLowLifeExp. We start by making a set of side-by-side boxplots:

bwplot(propLowLifeExp ~ continent, propLifeExp, xlab = "Continent", ylab = "Proportion of Countries with \nLife Expectancy Under Sixty", 
    main = "Side-by-side Boxplots of Prevalence \nof Low Life Expectancy by Continent")

Figure: side-by-side boxplots of the proportion of countries with low life expectancy, by continent

This plot shows us that the proportion of countries with low life expectancies is consistently high in Africa, and that it is consistently low in Europe. The wide spread for Asia suggests that it may be fairly dynamic over the course of the dataset. The two outliers above Europe may require further investigation: which two years are not behaving like the others? As expected, there is little information to be gleaned from the Oceania data, partly reflecting the fact that there is little data, and thus information, there to begin with.
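One way to chase down those outlying years would be to apply the usual boxplot rule directly with boxplot.stats(), which, as far as I know, is also what lattice's bwplot() uses by default. The sketch below uses a hypothetical helper `outlierYears` and an invented `toy` series for Europe. On the real data it would be pointed at propLifeExp.

```r
# Hypothetical helper: rows of one continent whose proportion is flagged as
# an outlier by the standard 1.5 * IQR boxplot rule.
outlierYears <- function(d, cont) {
  sub <- d[d$continent == cont, ]
  flagged <- boxplot.stats(sub$propLowLifeExp)$out
  sub[sub$propLowLifeExp %in% flagged, c("year", "propLowLifeExp")]
}

# Invented series: high proportions early on, low ones afterwards.
toy <- data.frame(
  continent = "Europe",
  year = seq(1952, 2007, by = 5),
  propLowLifeExp = c(0.5, 0.4, 0.10, 0.08, 0.06, 0.05,
                     0.04, 0.03, 0.03, 0.02, 0.01, 0)
)

res <- outlierYears(toy, "Europe")
print(res)
```

In this made-up series the flagged rows are the two earliest years. Whether the same holds for the real Europe data is something the actual call would have to show.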

Just out of curiosity, I decide to re-run the same code as above, this time specifying a violin plot. There is no real reason for this, other than my insane love of violin plots. Oh, and it is a great excuse for me to specify a real gem of a color found in R: dodgerblue.

bwplot(propLowLifeExp ~ continent, propLifeExp, xlab = "Continent", ylab = "Proportion of Countries with \nLife Expectancy Under Sixty", 
    main = "Side-by-side Violin Plots of Prevalence \nof Low Life Expectancy by Continent", 
    panel = panel.violin, col = "dodgerblue")

Figure: violin plots of the proportion of countries with low life expectancy, by continent

The plot doesn't really add much to our understanding, however, other than to reaffirm that we should probably drop Oceania.

Finally, I create a scatter-plot:

xyplot(propLowLifeExp ~ year, droplevels(subset(propLifeExp, continent != "Oceania")), 
    groups = continent, auto.key = TRUE, type = "a", xlab = "Year", 
    ylab = "Proportion of Countries with Life Expectancy Under 60")

Figure: proportion of countries with life expectancy under 60 over time, by continent

In the end I think that this is actually the best way to display the information that was calculated by Mina, as it allows us to see trends over time (as opposed to guessing at them in the side-by-side boxplot figure).