In this homework assignment I will continue to leverage Hadley's impressive collection of R toolboxes. In particular, I will examine the Gapminder data with plyr. Unfortunately, no figures will be used here–it is very painful to try to examine trends without graphics–hopefully this pain is evident in the tables below. Just for fun, I made the tables a bit easier to read, but this should be no solace at all in the face of no figures!
Let's begin:
library(plyr)
library(xtable)
library(MASS)
gDat <- read.delim("gapminderDataFiveYear.txt")
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
tail(gDat)
## country year pop continent lifeExp gdpPercap
## 1699 Zimbabwe 1982 7636524 Africa 60.36 788.9
## 1700 Zimbabwe 1987 9216418 Africa 62.35 706.2
## 1701 Zimbabwe 1992 10704340 Africa 60.38 693.4
## 1702 Zimbabwe 1997 11404948 Africa 46.81 792.4
## 1703 Zimbabwe 2002 11926563 Africa 39.99 672.0
## 1704 Zimbabwe 2007 12311143 Africa 43.49 469.7
Since I will be making many tables, here is a utility function to make the process (a little) less painful:
printTable <- function(df) {
print(xtable(df), type = "html", include.rownames = F)
}
I am going to explore the GDP per capita data extensively in the next few steps. At first we will look to see the min and max GDP per capita on a continent by continent basis.
minMaxGDPwide <- ddply(gDat, ~continent, summarize, minGDP = min(gdpPercap),
maxGDP = max(gdpPercap))
minMaxGDPwide <- arrange(minMaxGDPwide, minGDP) # sort on minGDP
printTable(minMaxGDPwide)
| continent | minGDP | maxGDP |
|---|---|---|
| Africa | 241.17 | 21951.21 |
| Asia | 331.00 | 113523.13 |
| Europe | 973.53 | 49357.19 |
| Americas | 1201.64 | 42951.65 |
| Oceania | 10039.60 | 34435.37 |
Nothing is too surprising here: as expected Africa and China have the smallest GDP per capita while Europe, Americas, and Oceania lead the group. Interestingly, the trends do not carry onto the maxGDP by continent–here we observe that Asia actually has the largest maxGDP. An obvious question is: for which country and in what year did these min and max values occur? We look into this in the below:
minMaxGDPElements <- ddply(gDat, ~continent, summarize, countryMin = country[which.min(gdpPercap)],
timeMin = year[which.min(gdpPercap)], minGDP = min(gdpPercap), countryMax = country[which.max(gdpPercap)],
timeMax = year[which.max(gdpPercap)], maxGDP = max(gdpPercap))
minMaxGDPElements <- arrange(minMaxGDPElements, minGDP)
printTable(minMaxGDPElements)
| continent | countryMin | timeMin | minGDP | countryMax | timeMax | maxGDP |
|---|---|---|---|---|---|---|
| Africa | Congo, Dem. Rep. | 2002 | 241.17 | Libya | 1977 | 21951.21 |
| Asia | Myanmar | 1952 | 331.00 | Kuwait | 1957 | 113523.13 |
| Europe | Bosnia and Herzegovina | 1952 | 973.53 | Norway | 2007 | 49357.19 |
| Americas | Haiti | 2007 | 1201.64 | United States | 2007 | 42951.65 |
| Oceania | Australia | 1952 | 10039.60 | Australia | 2007 | 34435.37 |
A few disturbing cases pop out here:
spreadGDP <- ddply(gDat, ~continent, summarize, SD = sd(gdpPercap), MAD = mad(gdpPercap),
IQR = IQR(gdpPercap))
spreadGDP <- arrange(spreadGDP, SD)
printTable(spreadGDP)
| continent | SD | MAD | IQR |
|---|---|---|---|
| Africa | 2827.93 | 775.32 | 1616.17 |
| Oceania | 6358.98 | 6459.10 | 8072.26 |
| Americas | 6396.76 | 3269.33 | 4402.43 |
| Europe | 9355.21 | 8846.05 | 13248.30 |
| Asia | 14045.37 | 2820.83 | 7492.26 |
In looking at the spread data, I have examined three different measures of spread:
The second two measures are meant to be less sensitive to outlines, and thus more robust. Interestingly the ordering in GDP spread is not invariant to measure. Most notably, I would want to point out that the dispersion in GDP per capita in the Americas seems much smaller on a MAD basis than on a standard deviation basis.
For some preliminary results on life expectancy, I looked to see the average life expectancy as a function of year. In the table below, we see that average life expectancy has increased through time. To make the measure more robust, I also examined the trimmed mean–that is, the mean of a trimmed dataset. In this instance the data has been trimmed to exclude the 10% of the data at the ends of the ordered life expectancy set.
# Trimmed mean, trim describes the fraction of observations to be trimmed at
# each end--btwn 0 - 0.5
lifeExpTrim <- ddply(gDat, ~year, summarize, StandardMean = mean(lifeExp), TrimmedMean = mean(lifeExp,
trim = 0.1), `StandardMean - TrimmedMean` = StandardMean - TrimmedMean)
printTable(lifeExpTrim)
| year | StandardMean | TrimmedMean | StandardMean - TrimmedMean |
|---|---|---|---|
| 1952 | 49.06 | 48.58 | 0.48 |
| 1957 | 51.51 | 51.27 | 0.24 |
| 1962 | 53.61 | 53.58 | 0.03 |
| 1967 | 55.68 | 55.87 | -0.19 |
| 1972 | 57.65 | 58.01 | -0.37 |
| 1977 | 59.57 | 60.10 | -0.53 |
| 1982 | 61.53 | 62.12 | -0.58 |
| 1987 | 63.21 | 63.92 | -0.71 |
| 1992 | 64.16 | 65.19 | -1.02 |
| 1997 | 65.01 | 66.02 | -1.00 |
| 2002 | 65.69 | 66.72 | -1.02 |
| 2007 | 67.01 | 68.11 | -1.11 |
In the earlier years (1952-1962) the standard (untrimmed) mean is greater than the trimmed mean which indicates that there were countries with very high life expediencies as outliers to the whole data set. This trend reversed in 1967 which indicates that the outliers are now pulling the standard mean down.
It may also be of interest to see how life expectancy is changing over time on different continents:
lifeExpCont <- ddply(gDat, .(continent, year), summarize, meanLifeExp = mean(lifeExp),
medianLifeExp = median(lifeExp))
# How is life expectancy changing over time on different continents? 'Wide'
# format
lifeExpContWide <- ddply(lifeExpCont, ~year, function(t) setNames(t$meanLifeExp,
unique(t$continent)))
printTable(lifeExpCont)
| continent | year | meanLifeExp | medianLifeExp |
|---|---|---|---|
| Africa | 1952 | 39.14 | 38.83 |
| Africa | 1957 | 41.27 | 40.59 |
| Africa | 1962 | 43.32 | 42.63 |
| Africa | 1967 | 45.33 | 44.70 |
| Africa | 1972 | 47.45 | 47.03 |
| Africa | 1977 | 49.58 | 49.27 |
| Africa | 1982 | 51.59 | 50.76 |
| Africa | 1987 | 53.34 | 51.64 |
| Africa | 1992 | 53.63 | 52.43 |
| Africa | 1997 | 53.60 | 52.76 |
| Africa | 2002 | 53.33 | 51.24 |
| Africa | 2007 | 54.81 | 52.93 |
| Americas | 1952 | 53.28 | 54.74 |
| Americas | 1957 | 55.96 | 56.07 |
| Americas | 1962 | 58.40 | 58.30 |
| Americas | 1967 | 60.41 | 60.52 |
| Americas | 1972 | 62.39 | 63.44 |
| Americas | 1977 | 64.39 | 66.35 |
| Americas | 1982 | 66.23 | 67.41 |
| Americas | 1987 | 68.09 | 69.50 |
| Americas | 1992 | 69.57 | 69.86 |
| Americas | 1997 | 71.15 | 72.15 |
| Americas | 2002 | 72.42 | 72.05 |
| Americas | 2007 | 73.61 | 72.90 |
| Asia | 1952 | 46.31 | 44.87 |
| Asia | 1957 | 49.32 | 48.28 |
| Asia | 1962 | 51.56 | 49.33 |
| Asia | 1967 | 54.66 | 53.66 |
| Asia | 1972 | 57.32 | 56.95 |
| Asia | 1977 | 59.61 | 60.77 |
| Asia | 1982 | 62.62 | 63.74 |
| Asia | 1987 | 64.85 | 66.30 |
| Asia | 1992 | 66.54 | 68.69 |
| Asia | 1997 | 68.02 | 70.27 |
| Asia | 2002 | 69.23 | 71.03 |
| Asia | 2007 | 70.73 | 72.40 |
| Europe | 1952 | 64.41 | 65.90 |
| Europe | 1957 | 66.70 | 67.65 |
| Europe | 1962 | 68.54 | 69.53 |
| Europe | 1967 | 69.74 | 70.61 |
| Europe | 1972 | 70.78 | 70.89 |
| Europe | 1977 | 71.94 | 72.34 |
| Europe | 1982 | 72.81 | 73.49 |
| Europe | 1987 | 73.64 | 74.81 |
| Europe | 1992 | 74.44 | 75.45 |
| Europe | 1997 | 75.51 | 76.12 |
| Europe | 2002 | 76.70 | 77.54 |
| Europe | 2007 | 77.65 | 78.61 |
| Oceania | 1952 | 69.25 | 69.25 |
| Oceania | 1957 | 70.30 | 70.30 |
| Oceania | 1962 | 71.09 | 71.09 |
| Oceania | 1967 | 71.31 | 71.31 |
| Oceania | 1972 | 71.91 | 71.91 |
| Oceania | 1977 | 72.85 | 72.85 |
| Oceania | 1982 | 74.29 | 74.29 |
| Oceania | 1987 | 75.32 | 75.32 |
| Oceania | 1992 | 76.94 | 76.94 |
| Oceania | 1997 | 78.19 | 78.19 |
| Oceania | 2002 | 79.74 | 79.74 |
| Oceania | 2007 | 80.72 | 80.72 |
printTable(lifeExpContWide)
| year | Africa | Americas | Asia | Europe | Oceania |
|---|---|---|---|---|---|
| 1952 | 39.14 | 53.28 | 46.31 | 64.41 | 69.25 |
| 1957 | 41.27 | 55.96 | 49.32 | 66.70 | 70.30 |
| 1962 | 43.32 | 58.40 | 51.56 | 68.54 | 71.09 |
| 1967 | 45.33 | 60.41 | 54.66 | 69.74 | 71.31 |
| 1972 | 47.45 | 62.39 | 57.32 | 70.78 | 71.91 |
| 1977 | 49.58 | 64.39 | 59.61 | 71.94 | 72.85 |
| 1982 | 51.59 | 66.23 | 62.62 | 72.81 | 74.29 |
| 1987 | 53.34 | 68.09 | 64.85 | 73.64 | 75.32 |
| 1992 | 53.63 | 69.57 | 66.54 | 74.44 | 76.94 |
| 1997 | 53.60 | 71.15 | 68.02 | 75.51 | 78.19 |
| 2002 | 53.33 | 72.42 | 69.23 | 76.70 | 79.74 |
| 2007 | 54.81 | 73.61 | 70.73 | 77.65 | 80.72 |
The long data is presented first, and then is converted to a wide format through plyr. Note, however, that this is really a reshaping through plyr (which could have been done through reshape2) and not a native plyr command to generate wide data. In the actual data, it is encouraging to see that within each continent the general trend has been increased life expectancy, though clearly at different rates. This is consistent with our earlier findings on the trimmed mean!
How many of these continents, and in turn, countries have low life expectancy? We define low life expectancy to be less than or equal to the 5% quantile.
lowLifeExp <- as.numeric(quantile(gDat$lifeExp, probs = 0.05))
continentLifeExp <- ddply(gDat, .(continent, year), summarize, lowLifeInstances = sum(lifeExp <=
lowLifeExp)) # this is tall, aka not fun
continentLifeExp <- ddply(continentLifeExp, ~year, function(t) setNames(t$lowLifeInstances,
unique(t$continent)))
printTable(continentLifeExp)
| year | Africa | Americas | Asia | Europe | Oceania |
|---|---|---|---|---|---|
| 1952 | 23 | 1 | 8 | 0 | 0 |
| 1957 | 16 | 0 | 3 | 0 | 0 |
| 1962 | 11 | 0 | 2 | 0 | 0 |
| 1967 | 7 | 0 | 2 | 0 | 0 |
| 1972 | 4 | 0 | 1 | 0 | 0 |
| 1977 | 2 | 0 | 2 | 0 | 0 |
| 1982 | 1 | 0 | 0 | 0 | 0 |
| 1987 | 0 | 0 | 0 | 0 | 0 |
| 1992 | 2 | 0 | 0 | 0 | 0 |
| 1997 | 1 | 0 | 0 | 0 | 0 |
| 2002 | 0 | 0 | 0 | 0 | 0 |
| 2007 | 0 | 0 | 0 | 0 | 0 |
I noticed immediately that the low life expectancy data was predominately in Africa and Asia and stopped in 1997 (though this is really an artifact of how a low life expectancy is defined). It is more informative to have the proportion of countries which have low life expectancy than just the raw counts:
continentLifeExpProp <- ddply(gDat, .(continent, year), summarize, lowLifeInstances = sum(lifeExp <=
lowLifeExp)/length(unique(country)))
continentLifeExpProp <- ddply(continentLifeExpProp, ~year, function(t) setNames(round(t$lowLifeInstances,
2) * 100, unique(t$continent)))
printTable(continentLifeExpProp)
| year | Africa | Americas | Asia | Europe | Oceania |
|---|---|---|---|---|---|
| 1952 | 44.00 | 4.00 | 24.00 | 0.00 | 0.00 |
| 1957 | 31.00 | 0.00 | 9.00 | 0.00 | 0.00 |
| 1962 | 21.00 | 0.00 | 6.00 | 0.00 | 0.00 |
| 1967 | 13.00 | 0.00 | 6.00 | 0.00 | 0.00 |
| 1972 | 8.00 | 0.00 | 3.00 | 0.00 | 0.00 |
| 1977 | 4.00 | 0.00 | 6.00 | 0.00 | 0.00 |
| 1982 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 1987 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 1992 | 4.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 1997 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 2002 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 2007 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
Wow, in 1952 44% of Africa's countries had a “low life expectancy!”
yearMin <- min(gDat$year)
lifeExpRegression <- function(x) {
estCoefs <- coef(lm(lifeExp ~ I(year - yearMin), x))
names(estCoefs) <- c("intercept", "slope")
return(estCoefs)
}
lifeExpChanges <- arrange(ddply(ddply(gDat, .(continent, country), lifeExpRegression),
~continent, summarize, minInt = min(intercept), minIntCountry = country[which.min(intercept)],
minSlope = min(slope), minSlopeCountry = country[which.min(slope)]), minInt)
printTable(lifeExpChanges)
| continent | minInt | minIntCountry | minSlope | minSlopeCountry |
|---|---|---|---|---|
| Africa | 28.40 | Gambia | -0.09 | Zimbabwe |
| Asia | 29.91 | Afghanistan | 0.24 | Iraq |
| Americas | 38.76 | Bolivia | 0.16 | Paraguay |
| Europe | 46.02 | Turkey | 0.12 | Denmark |
| Oceania | 68.40 | Australia | 0.19 | New Zealand |
Slightly alarming is that Zimbabwe is actually experiencing deteriorating life expectancy.
We can modify our regression function to provide us with some interesting results based on residuals from the least-squares fit:
lifeExpRegression <- function(x) {
fit <- lm(lifeExp ~ I(year - yearMin), x)
res <- max(fit$residuals)
names(res) <- c("maxResidual")
return(res)
}
lifeExpRes <- arrange(ddply(ddply(gDat, .(continent, country), lifeExpRegression),
~continent, summarize, maxRes = max(maxResidual), maxResCountry = country[which.max(maxResidual)]),
maxRes)
printTable(lifeExpRes)
| continent | maxRes | maxResCountry |
|---|---|---|
| Oceania | 0.91 | New Zealand |
| Americas | 2.50 | Honduras |
| Europe | 3.50 | Montenegro |
| Asia | 6.70 | Iraq |
| Africa | 10.39 | Zimbabwe |
It would be interesting to do some analysis of these 5 countries to see why the life expectancy trend broke down…But I need figures and that has to wait until next week :)