Stat545A Homework #3 Data aggregation

In this homework assignment I will continue to leverage Hadley's impressive collection of R toolboxes. In particular, I will examine the Gapminder data with plyr. Unfortunately, no figures will be used here–it is very painful to try to examine trends without graphics–hopefully this pain is evident in the tables below. Just for fun, I made the tables a bit easier to read, but this should be no solace at all in the face of no figures!

Let's begin:

Basic import and quality control

library(plyr)
library(xtable)
library(MASS)
gDat <- read.delim("gapminderDataFiveYear.txt")
str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
tail(gDat)
##       country year      pop continent lifeExp gdpPercap
## 1699 Zimbabwe 1982  7636524    Africa   60.36     788.9
## 1700 Zimbabwe 1987  9216418    Africa   62.35     706.2
## 1701 Zimbabwe 1992 10704340    Africa   60.38     693.4
## 1702 Zimbabwe 1997 11404948    Africa   46.81     792.4
## 1703 Zimbabwe 2002 11926563    Africa   39.99     672.0
## 1704 Zimbabwe 2007 12311143    Africa   43.49     469.7

Since I will be making many tables, here is a utility function to make the process (a little) less painful:

printTable <- function(df) {
    print(xtable(df), type = "html", include.rownames = F)
}

Easy analysis of GDP “trends”

I am going to explore the GDP per capita data extensively in the next few steps. At first we will look to see the min and max GDP per capita on a continent by continent basis.

minMaxGDPwide <- ddply(gDat, ~continent, summarize, minGDP = min(gdpPercap), 
    maxGDP = max(gdpPercap))
minMaxGDPwide <- arrange(minMaxGDPwide, minGDP)  # sort on minGDP
printTable(minMaxGDPwide)
continent minGDP maxGDP
Africa 241.17 21951.21
Asia 331.00 113523.13
Europe 973.53 49357.19
Americas 1201.64 42951.65
Oceania 10039.60 34435.37

Nothing is too surprising here: as expected Africa and China have the smallest GDP per capita while Europe, Americas, and Oceania lead the group. Interestingly, the trends do not carry onto the maxGDP by continent–here we observe that Asia actually has the largest maxGDP. An obvious question is: for which country and in what year did these min and max values occur? We look into this in the below:

minMaxGDPElements <- ddply(gDat, ~continent, summarize, countryMin = country[which.min(gdpPercap)], 
    timeMin = year[which.min(gdpPercap)], minGDP = min(gdpPercap), countryMax = country[which.max(gdpPercap)], 
    timeMax = year[which.max(gdpPercap)], maxGDP = max(gdpPercap))
minMaxGDPElements <- arrange(minMaxGDPElements, minGDP)
printTable(minMaxGDPElements)
continent countryMin timeMin minGDP countryMax timeMax maxGDP
Africa Congo, Dem. Rep. 2002 241.17 Libya 1977 21951.21
Asia Myanmar 1952 331.00 Kuwait 1957 113523.13
Europe Bosnia and Herzegovina 1952 973.53 Norway 2007 49357.19
Americas Haiti 2007 1201.64 United States 2007 42951.65
Oceania Australia 1952 10039.60 Australia 2007 34435.37

A few disturbing cases pop out here:

What is the spread of GDP across continents?

spreadGDP <- ddply(gDat, ~continent, summarize, SD = sd(gdpPercap), MAD = mad(gdpPercap), 
    IQR = IQR(gdpPercap))
spreadGDP <- arrange(spreadGDP, SD)
printTable(spreadGDP)
continent SD MAD IQR
Africa 2827.93 775.32 1616.17
Oceania 6358.98 6459.10 8072.26
Americas 6396.76 3269.33 4402.43
Europe 9355.21 8846.05 13248.30
Asia 14045.37 2820.83 7492.26

In looking at the spread data, I have examined three different measures of spread:

The second two measures are meant to be less sensitive to outlines, and thus more robust. Interestingly the ordering in GDP spread is not invariant to measure. Most notably, I would want to point out that the dispersion in GDP per capita in the Americas seems much smaller on a MAD basis than on a standard deviation basis.

Life expectancy trends

For some preliminary results on life expectancy, I looked to see the average life expectancy as a function of year. In the table below, we see that average life expectancy has increased through time. To make the measure more robust, I also examined the trimmed mean–that is, the mean of a trimmed dataset. In this instance the data has been trimmed to exclude the 10% of the data at the ends of the ordered life expectancy set.

# Trimmed mean, trim describes the fraction of observations to be trimmed at
# each end--btwn 0 - 0.5
lifeExpTrim <- ddply(gDat, ~year, summarize, StandardMean = mean(lifeExp), TrimmedMean = mean(lifeExp, 
    trim = 0.1), `StandardMean - TrimmedMean` = StandardMean - TrimmedMean)
printTable(lifeExpTrim)
year StandardMean TrimmedMean StandardMean - TrimmedMean
1952 49.06 48.58 0.48
1957 51.51 51.27 0.24
1962 53.61 53.58 0.03
1967 55.68 55.87 -0.19
1972 57.65 58.01 -0.37
1977 59.57 60.10 -0.53
1982 61.53 62.12 -0.58
1987 63.21 63.92 -0.71
1992 64.16 65.19 -1.02
1997 65.01 66.02 -1.00
2002 65.69 66.72 -1.02
2007 67.01 68.11 -1.11

In the earlier years (1952-1962) the standard (untrimmed) mean is greater than the trimmed mean which indicates that there were countries with very high life expediencies as outliers to the whole data set. This trend reversed in 1967 which indicates that the outliers are now pulling the standard mean down.

It may also be of interest to see how life expectancy is changing over time on different continents:

lifeExpCont <- ddply(gDat, .(continent, year), summarize, meanLifeExp = mean(lifeExp), 
    medianLifeExp = median(lifeExp))
# How is life expectancy changing over time on different continents? 'Wide'
# format
lifeExpContWide <- ddply(lifeExpCont, ~year, function(t) setNames(t$meanLifeExp, 
    unique(t$continent)))
printTable(lifeExpCont)
continent year meanLifeExp medianLifeExp
Africa 1952 39.14 38.83
Africa 1957 41.27 40.59
Africa 1962 43.32 42.63
Africa 1967 45.33 44.70
Africa 1972 47.45 47.03
Africa 1977 49.58 49.27
Africa 1982 51.59 50.76
Africa 1987 53.34 51.64
Africa 1992 53.63 52.43
Africa 1997 53.60 52.76
Africa 2002 53.33 51.24
Africa 2007 54.81 52.93
Americas 1952 53.28 54.74
Americas 1957 55.96 56.07
Americas 1962 58.40 58.30
Americas 1967 60.41 60.52
Americas 1972 62.39 63.44
Americas 1977 64.39 66.35
Americas 1982 66.23 67.41
Americas 1987 68.09 69.50
Americas 1992 69.57 69.86
Americas 1997 71.15 72.15
Americas 2002 72.42 72.05
Americas 2007 73.61 72.90
Asia 1952 46.31 44.87
Asia 1957 49.32 48.28
Asia 1962 51.56 49.33
Asia 1967 54.66 53.66
Asia 1972 57.32 56.95
Asia 1977 59.61 60.77
Asia 1982 62.62 63.74
Asia 1987 64.85 66.30
Asia 1992 66.54 68.69
Asia 1997 68.02 70.27
Asia 2002 69.23 71.03
Asia 2007 70.73 72.40
Europe 1952 64.41 65.90
Europe 1957 66.70 67.65
Europe 1962 68.54 69.53
Europe 1967 69.74 70.61
Europe 1972 70.78 70.89
Europe 1977 71.94 72.34
Europe 1982 72.81 73.49
Europe 1987 73.64 74.81
Europe 1992 74.44 75.45
Europe 1997 75.51 76.12
Europe 2002 76.70 77.54
Europe 2007 77.65 78.61
Oceania 1952 69.25 69.25
Oceania 1957 70.30 70.30
Oceania 1962 71.09 71.09
Oceania 1967 71.31 71.31
Oceania 1972 71.91 71.91
Oceania 1977 72.85 72.85
Oceania 1982 74.29 74.29
Oceania 1987 75.32 75.32
Oceania 1992 76.94 76.94
Oceania 1997 78.19 78.19
Oceania 2002 79.74 79.74
Oceania 2007 80.72 80.72
printTable(lifeExpContWide)
year Africa Americas Asia Europe Oceania
1952 39.14 53.28 46.31 64.41 69.25
1957 41.27 55.96 49.32 66.70 70.30
1962 43.32 58.40 51.56 68.54 71.09
1967 45.33 60.41 54.66 69.74 71.31
1972 47.45 62.39 57.32 70.78 71.91
1977 49.58 64.39 59.61 71.94 72.85
1982 51.59 66.23 62.62 72.81 74.29
1987 53.34 68.09 64.85 73.64 75.32
1992 53.63 69.57 66.54 74.44 76.94
1997 53.60 71.15 68.02 75.51 78.19
2002 53.33 72.42 69.23 76.70 79.74
2007 54.81 73.61 70.73 77.65 80.72

The long data is presented first, and then is converted to a wide format through plyr. Note, however, that this is really a reshaping through plyr (which could have been done through reshape2) and not a native plyr command to generate wide data. In the actual data, it is encouraging to see that within each continent the general trend has been increased life expectancy, though clearly at different rates. This is consistent with our earlier findings on the trimmed mean!

How many of these continents, and in turn, countries have low life expectancy? We define low life expectancy to be less than or equal to the 5% quantile.

lowLifeExp <- as.numeric(quantile(gDat$lifeExp, probs = 0.05))
continentLifeExp <- ddply(gDat, .(continent, year), summarize, lowLifeInstances = sum(lifeExp <= 
    lowLifeExp))  # this is tall, aka not fun
continentLifeExp <- ddply(continentLifeExp, ~year, function(t) setNames(t$lowLifeInstances, 
    unique(t$continent)))
printTable(continentLifeExp)
year Africa Americas Asia Europe Oceania
1952 23 1 8 0 0
1957 16 0 3 0 0
1962 11 0 2 0 0
1967 7 0 2 0 0
1972 4 0 1 0 0
1977 2 0 2 0 0
1982 1 0 0 0 0
1987 0 0 0 0 0
1992 2 0 0 0 0
1997 1 0 0 0 0
2002 0 0 0 0 0
2007 0 0 0 0 0

I noticed immediately that the low life expectancy data was predominately in Africa and Asia and stopped in 1997 (though this is really an artifact of how a low life expectancy is defined). It is more informative to have the proportion of countries which have low life expectancy than just the raw counts:

continentLifeExpProp <- ddply(gDat, .(continent, year), summarize, lowLifeInstances = sum(lifeExp <= 
    lowLifeExp)/length(unique(country)))
continentLifeExpProp <- ddply(continentLifeExpProp, ~year, function(t) setNames(round(t$lowLifeInstances, 
    2) * 100, unique(t$continent)))
printTable(continentLifeExpProp)
year Africa Americas Asia Europe Oceania
1952 44.00 4.00 24.00 0.00 0.00
1957 31.00 0.00 9.00 0.00 0.00
1962 21.00 0.00 6.00 0.00 0.00
1967 13.00 0.00 6.00 0.00 0.00
1972 8.00 0.00 3.00 0.00 0.00
1977 4.00 0.00 6.00 0.00 0.00
1982 2.00 0.00 0.00 0.00 0.00
1987 0.00 0.00 0.00 0.00 0.00
1992 4.00 0.00 0.00 0.00 0.00
1997 2.00 0.00 0.00 0.00 0.00
2002 0.00 0.00 0.00 0.00 0.00
2007 0.00 0.00 0.00 0.00 0.00

Wow, in 1952 44% of Africa's countries had a “low life expectancy!”

Interesting story lines

yearMin <- min(gDat$year)
lifeExpRegression <- function(x) {
    estCoefs <- coef(lm(lifeExp ~ I(year - yearMin), x))
    names(estCoefs) <- c("intercept", "slope")
    return(estCoefs)
}

lifeExpChanges <- arrange(ddply(ddply(gDat, .(continent, country), lifeExpRegression), 
    ~continent, summarize, minInt = min(intercept), minIntCountry = country[which.min(intercept)], 
    minSlope = min(slope), minSlopeCountry = country[which.min(slope)]), minInt)
printTable(lifeExpChanges)
continent minInt minIntCountry minSlope minSlopeCountry
Africa 28.40 Gambia -0.09 Zimbabwe
Asia 29.91 Afghanistan 0.24 Iraq
Americas 38.76 Bolivia 0.16 Paraguay
Europe 46.02 Turkey 0.12 Denmark
Oceania 68.40 Australia 0.19 New Zealand

Slightly alarming is that Zimbabwe is actually experiencing deteriorating life expectancy.

We can modify our regression function to provide us with some interesting results based on residuals from the least-squares fit:

lifeExpRegression <- function(x) {
    fit <- lm(lifeExp ~ I(year - yearMin), x)
    res <- max(fit$residuals)
    names(res) <- c("maxResidual")
    return(res)
}

lifeExpRes <- arrange(ddply(ddply(gDat, .(continent, country), lifeExpRegression), 
    ~continent, summarize, maxRes = max(maxResidual), maxResCountry = country[which.max(maxResidual)]), 
    maxRes)
printTable(lifeExpRes)
continent maxRes maxResCountry
Oceania 0.91 New Zealand
Americas 2.50 Honduras
Europe 3.50 Montenegro
Asia 6.70 Iraq
Africa 10.39 Zimbabwe

It would be interesting to do some analysis of these 5 countries to see why the life expectancy trend broke down…But I need figures and that has to wait until next week :)