Matt Gingerich
In this homework set, we're again analysing a Gapminder data set, but with more complex data management than in the previous analysis. This first chunk of R code loads the data set, prints part of it as a sanity check, and loads the plyr library which will be used extensively throughout this page.
library(plyr)
gDat <- read.delim("gapminderDataFiveYear.txt")
tail(gDat)
## country year pop continent lifeExp gdpPercap
## 1699 Zimbabwe 1982 7636524 Africa 60.36 788.9
## 1700 Zimbabwe 1987 9216418 Africa 62.35 706.2
## 1701 Zimbabwe 1992 10704340 Africa 60.38 693.4
## 1702 Zimbabwe 1997 11404948 Africa 46.81 792.4
## 1703 Zimbabwe 2002 11926563 Africa 39.99 672.0
## 1704 Zimbabwe 2007 12311143 Africa 43.49 469.7
With the data set loaded, we can generate some summary statistics. Here we find the minimum and maximum GDP per capita on each continent and print the results in a wide table:
print(gdpMinMax <- ddply(gDat, ~continent, summarize, minGdpPercap = min(gdpPercap),
maxGdpPercap = max(gdpPercap)))
## continent minGdpPercap maxGdpPercap
## 1 Africa 241.2 21951
## 2 Americas 1201.6 42952
## 3 Asia 331.0 113523
## 4 Europe 973.5 49357
## 5 Oceania 10039.6 34435
As seen above, printing the minimum and maximum GDP per capita of each continent in a wide table is quite straightforward, but it's considerably trickier to print the same information in a tall format without resorting to reshaping. The goal of the “tall” format is to have both the minimum and maximum values in the same column with an extra column used to describe whether the value is a minimum or a maximum. An implementation of maximum and minimum of GDP per capita for all continents in a “tall” format is shown in the next code chunk.
This code uses a custom function to generate a two-row dataframe for each piece of a dataframe it is given as input. This is necessary because in the tall format, the min and max values must occupy separate rows, and the standard “summarize” function only constructs a single row from the object it is given. I'm not entirely happy with this solution, as I'd have preferred to have the stackSummaries function accept the min and max functions as parameters instead of hardcoding them as I have done. It's also problematic that the labels “min” and “max” are hardcoded to match the functions used instead of being automatically derived from the function names. I suspect there's a much smarter way of doing this (using reshaping methods, for instance), but the custom function does provide a quick, dirty, and powerful mechanism for manipulating data.
stackGdpSummaries <- function(piece) {
value <- c(min(piece$gdpPercap), max(piece$gdpPercap))
summaryStat <- c("min", "max")
return(data.frame(value, summaryStat))
}
print(gdpMinMaxTall <- ddply(gDat, ~continent, stackGdpSummaries))
## continent value summaryStat
## 1 Africa 241.2 min
## 2 Africa 21951.2 max
## 3 Americas 1201.6 min
## 4 Americas 42951.7 max
## 5 Asia 331.0 min
## 6 Asia 113523.1 max
## 7 Europe 973.5 min
## 8 Europe 49357.2 max
## 9 Oceania 10039.6 min
## 10 Oceania 34435.4 max
The following chunk examines various measures of variability and spread in the GDP per capita by continent data. The measures reported are the standard deviation (stddev), the variance, the median absolute deviation (MAD), and the interquartile range (IQR).
The rows are sorted in ascending order of variance. It comes as no surprise that variance and standard deviation both agree perfectly about the ordering of the countries by heterogeneity as standard deviation is simply the square root of variance. IQR and MAD both measure how much spread exists within certain percentiles of the dataset (when ordered according to their own definitions of spread), so these measures closely align with each other. However, the ordering suggested by IQR and MAD differ substantially from the ordering suggested by variance and standard deviation, since variance is influenced by the magnitudes of all deviations from the mean (by contast, IQR does not change when values outside the interquartile range are pushed further away from the median).
gdpSpread <- ddply(gDat, ~continent, summarize, stddev = sd(gdpPercap), variance = var(gdpPercap),
MAD = mad(gdpPercap), IQR = IQR(gdpPercap))
print(gdpSpread[order(gdpSpread$variance), ])
## continent stddev variance MAD IQR
## 1 Africa 2828 7997187 775.3 1616
## 5 Oceania 6359 40436669 6459.1 8072
## 2 Americas 6397 40918591 3269.3 4402
## 4 Europe 9355 87520020 8846.1 13248
## 3 Asia 14045 197272506 2820.8 7492
The goal of this section is to construct a data frame with six columns – one for the year and five for continents – with the values under the continent columns containing the life expectancy of the continent in the year on the same row.
This proved to be rather difficult. My first attempt was to use the following code snippet
lifeExpWide <- ddply(gDat, ~year+continent, summarize, medLifeExp=median(lifeExp))
which works beautifully to produce a table with the relevant data, but it's in the “tall” format that lists year, continent, and life expectancy in three columns and thus needs to create a row for each continent in each year.
I wasn't able to make any headway in using daply to produce data in the desired format, and after a considerable effort I'm not convinced it would really make sense to do so when the data set could be constructed using methods in the reshape library in a rather straightforward manner, seen below.
Note: after having completed and initially submitted this homework, I discovered Yumian Hu's elegant solution to this problem in the recently-published list on RPubs. With that in mind, I retract what I said earlier about the daply approach. Aside from this comment, this report is not influenced by other solutions and I've left my initial solution intact.
suppressPackageStartupMessages(library(reshape))
moltenDat <- melt(gDat, id = c("continent", "year", "country"))
print(lifeExpWide <- cast(moltenDat, year ~ continent, median, subset = variable ==
"lifeExp"))
## year Africa Americas Asia Europe Oceania
## 1 1952 38.83 54.74 44.87 65.90 69.25
## 2 1957 40.59 56.07 48.28 67.65 70.30
## 3 1962 42.63 58.30 49.33 69.53 71.09
## 4 1967 44.70 60.52 53.66 70.61 71.31
## 5 1972 47.03 63.44 56.95 70.89 71.91
## 6 1977 49.27 66.35 60.77 72.34 72.85
## 7 1982 50.76 67.41 63.74 73.49 74.29
## 8 1987 51.64 69.50 66.30 74.81 75.32
## 9 1992 52.43 69.86 68.69 75.45 76.94
## 10 1997 52.76 72.15 70.27 76.12 78.19
## 11 2002 51.24 72.05 71.03 77.54 79.74
## 12 2007 52.93 72.90 72.40 78.61 80.72
A benefit to storing the “molten” data set is that it can be recast easily into other wide forms. For example, to find the percentage of countries in each continent that have a life expectancy above a certain average, we can replace the median function in the previous code chunk with a custom comparator function. The following chunk tabulates the percentage of countries with a life expectancy above 70 years.
baselineLifeExp <- 70
print(lifeExpComparison <- cast(moltenDat, year ~ continent, function(d) {
return(sum(d > baselineLifeExp)/length(d) * 100)
}, subset = (variable == "lifeExp")))
## year Africa Americas Asia Europe Oceania
## 1 1952 0.000 0 0.000 16.67 0
## 2 1957 0.000 0 0.000 23.33 100
## 3 1962 0.000 8 0.000 40.00 100
## 4 1967 0.000 12 6.061 60.00 100
## 5 1972 0.000 16 9.091 70.00 100
## 6 1977 0.000 24 15.152 83.33 100
## 7 1982 0.000 36 18.182 90.00 100
## 8 1987 1.923 44 24.242 90.00 100
## 9 1992 3.846 48 33.333 90.00 100
## 10 1997 7.692 56 51.515 93.33 100
## 11 2002 9.615 76 57.576 100.00 100
## 12 2007 13.462 88 66.667 100.00 100
Life expectancies improved across all continents over time, but Africa in 2007 is still behind Europe's 1952 life expectancy by this measure.
It's worth pointing out that the number of countries per continent differ, which is why percentages are used above. The number of countries per continent can be crudely extacted with the following code (note that the number of countries is given as a function of the year, as the number of countries per continent could conceivably change over time).
cast(moltenDat, year ~ continent, length, subset = (variable == "lifeExp"))
## year Africa Americas Asia Europe Oceania
## 1 1952 52 25 33 30 2
## 2 1957 52 25 33 30 2
## 3 1962 52 25 33 30 2
## 4 1967 52 25 33 30 2
## 5 1972 52 25 33 30 2
## 6 1977 52 25 33 30 2
## 7 1982 52 25 33 30 2
## 8 1987 52 25 33 30 2
## 9 1992 52 25 33 30 2
## 10 1997 52 25 33 30 2
## 11 2002 52 25 33 30 2
## 12 2007 52 25 33 30 2