STAT 545A Homework 3

Matt Gingerich

In this homework set, we're again analysing a Gapminder data set, but with more complex data management than in the previous analysis. This first chunk of R code loads the data set, prints part of it as a sanity check, and loads the plyr library which will be used extensively throughout this page.

library(plyr)
gDat <- read.delim("gapminderDataFiveYear.txt")
tail(gDat)
##       country year      pop continent lifeExp gdpPercap
## 1699 Zimbabwe 1982  7636524    Africa   60.36     788.9
## 1700 Zimbabwe 1987  9216418    Africa   62.35     706.2
## 1701 Zimbabwe 1992 10704340    Africa   60.38     693.4
## 1702 Zimbabwe 1997 11404948    Africa   46.81     792.4
## 1703 Zimbabwe 2002 11926563    Africa   39.99     672.0
## 1704 Zimbabwe 2007 12311143    Africa   43.49     469.7

With the data set loaded, we can generate some summary statistics. Here we find the minimum and maximum GDP per capita on each continent and print the results in a wide table:

print(gdpMinMax <- ddply(gDat, ~continent, summarize, minGdpPercap = min(gdpPercap), 
    maxGdpPercap = max(gdpPercap)))
##   continent minGdpPercap maxGdpPercap
## 1    Africa        241.2        21951
## 2  Americas       1201.6        42952
## 3      Asia        331.0       113523
## 4    Europe        973.5        49357
## 5   Oceania      10039.6        34435

As seen above, printing the minimum and maximum GDP per capita of each continent in a wide table is quite straightforward, but it's considerably trickier to print the same information in a tall format without resorting to reshaping. The goal of the “tall” format is to have both the minimum and maximum values in the same column with an extra column used to describe whether the value is a minimum or a maximum. An implementation of maximum and minimum of GDP per capita for all continents in a “tall” format is shown in the next code chunk.

This code uses a custom function to generate a two-row dataframe for each piece of a dataframe it is given as input. This is necessary because in the tall format, the min and max values must occupy separate rows, and the standard “summarize” function only constructs a single row from the object it is given. I'm not entirely happy with this solution, as I'd have preferred to have the stackSummaries function accept the min and max functions as parameters instead of hardcoding them as I have done. It's also problematic that the labels “min” and “max” are hardcoded to match the functions used instead of being automatically derived from the function names. I suspect there's a much smarter way of doing this (using reshaping methods, for instance), but the custom function does provide a quick, dirty, and powerful mechanism for manipulating data.

stackGdpSummaries <- function(piece) {
    value <- c(min(piece$gdpPercap), max(piece$gdpPercap))
    summaryStat <- c("min", "max")
    return(data.frame(value, summaryStat))
}
print(gdpMinMaxTall <- ddply(gDat, ~continent, stackGdpSummaries))
##    continent    value summaryStat
## 1     Africa    241.2         min
## 2     Africa  21951.2         max
## 3   Americas   1201.6         min
## 4   Americas  42951.7         max
## 5       Asia    331.0         min
## 6       Asia 113523.1         max
## 7     Europe    973.5         min
## 8     Europe  49357.2         max
## 9    Oceania  10039.6         min
## 10   Oceania  34435.4         max

The following chunk examines various measures of variability and spread in the GDP per capita by continent data. The measures reported are the standard deviation (stddev), the variance, the median absolute deviation (MAD), and the interquartile range (IQR).

The rows are sorted in ascending order of variance. It comes as no surprise that variance and standard deviation both agree perfectly about the ordering of the countries by heterogeneity as standard deviation is simply the square root of variance. IQR and MAD both measure how much spread exists within certain percentiles of the dataset (when ordered according to their own definitions of spread), so these measures closely align with each other. However, the ordering suggested by IQR and MAD differ substantially from the ordering suggested by variance and standard deviation, since variance is influenced by the magnitudes of all deviations from the mean (by contast, IQR does not change when values outside the interquartile range are pushed further away from the median).

gdpSpread <- ddply(gDat, ~continent, summarize, stddev = sd(gdpPercap), variance = var(gdpPercap), 
    MAD = mad(gdpPercap), IQR = IQR(gdpPercap))
print(gdpSpread[order(gdpSpread$variance), ])
##   continent stddev  variance    MAD   IQR
## 1    Africa   2828   7997187  775.3  1616
## 5   Oceania   6359  40436669 6459.1  8072
## 2  Americas   6397  40918591 3269.3  4402
## 4    Europe   9355  87520020 8846.1 13248
## 3      Asia  14045 197272506 2820.8  7492

How is life expectancy changing over time on different continents? “Wide” format.

The goal of this section is to construct a data frame with six columns – one for the year and five for continents – with the values under the continent columns containing the life expectancy of the continent in the year on the same row.

This proved to be rather difficult. My first attempt was to use the following code snippet

lifeExpWide <- ddply(gDat, ~year+continent, summarize, medLifeExp=median(lifeExp))

which works beautifully to produce a table with the relevant data, but it's in the “tall” format that lists year, continent, and life expectancy in three columns and thus needs to create a row for each continent in each year.

I wasn't able to make any headway in using daply to produce data in the desired format, and after a considerable effort I'm not convinced it would really make sense to do so when the data set could be constructed using methods in the reshape library in a rather straightforward manner, seen below.

Note: after having completed and initially submitted this homework, I discovered Yumian Hu's elegant solution to this problem in the recently-published list on RPubs. With that in mind, I retract what I said earlier about the daply approach. Aside from this comment, this report is not influenced by other solutions and I've left my initial solution intact.

suppressPackageStartupMessages(library(reshape))
moltenDat <- melt(gDat, id = c("continent", "year", "country"))
print(lifeExpWide <- cast(moltenDat, year ~ continent, median, subset = variable == 
    "lifeExp"))
##    year Africa Americas  Asia Europe Oceania
## 1  1952  38.83    54.74 44.87  65.90   69.25
## 2  1957  40.59    56.07 48.28  67.65   70.30
## 3  1962  42.63    58.30 49.33  69.53   71.09
## 4  1967  44.70    60.52 53.66  70.61   71.31
## 5  1972  47.03    63.44 56.95  70.89   71.91
## 6  1977  49.27    66.35 60.77  72.34   72.85
## 7  1982  50.76    67.41 63.74  73.49   74.29
## 8  1987  51.64    69.50 66.30  74.81   75.32
## 9  1992  52.43    69.86 68.69  75.45   76.94
## 10 1997  52.76    72.15 70.27  76.12   78.19
## 11 2002  51.24    72.05 71.03  77.54   79.74
## 12 2007  52.93    72.90 72.40  78.61   80.72

A benefit to storing the “molten” data set is that it can be recast easily into other wide forms. For example, to find the percentage of countries in each continent that have a life expectancy above a certain average, we can replace the median function in the previous code chunk with a custom comparator function. The following chunk tabulates the percentage of countries with a life expectancy above 70 years.

baselineLifeExp <- 70
print(lifeExpComparison <- cast(moltenDat, year ~ continent, function(d) {
    return(sum(d > baselineLifeExp)/length(d) * 100)
}, subset = (variable == "lifeExp")))
##    year Africa Americas   Asia Europe Oceania
## 1  1952  0.000        0  0.000  16.67       0
## 2  1957  0.000        0  0.000  23.33     100
## 3  1962  0.000        8  0.000  40.00     100
## 4  1967  0.000       12  6.061  60.00     100
## 5  1972  0.000       16  9.091  70.00     100
## 6  1977  0.000       24 15.152  83.33     100
## 7  1982  0.000       36 18.182  90.00     100
## 8  1987  1.923       44 24.242  90.00     100
## 9  1992  3.846       48 33.333  90.00     100
## 10 1997  7.692       56 51.515  93.33     100
## 11 2002  9.615       76 57.576 100.00     100
## 12 2007 13.462       88 66.667 100.00     100

Life expectancies improved across all continents over time, but Africa in 2007 is still behind Europe's 1952 life expectancy by this measure.

It's worth pointing out that the number of countries per continent differ, which is why percentages are used above. The number of countries per continent can be crudely extacted with the following code (note that the number of countries is given as a function of the year, as the number of countries per continent could conceivably change over time).

cast(moltenDat, year ~ continent, length, subset = (variable == "lifeExp"))
##    year Africa Americas Asia Europe Oceania
## 1  1952     52       25   33     30       2
## 2  1957     52       25   33     30       2
## 3  1962     52       25   33     30       2
## 4  1967     52       25   33     30       2
## 5  1972     52       25   33     30       2
## 6  1977     52       25   33     30       2
## 7  1982     52       25   33     30       2
## 8  1987     52       25   33     30       2
## 9  1992     52       25   33     30       2
## 10 1997     52       25   33     30       2
## 11 2002     52       25   33     30       2
## 12 2007     52       25   33     30       2