This page will demonstrate some simple data aggregation examples using the plyr package. You can follow along with the Gapminder Data.
gDat <- read.delim("gapminderDataFiveYear.txt")
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
head(gDat)
## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.80 779.4
## 2 Afghanistan 1957 9240934 Asia 30.33 820.9
## 3 Afghanistan 1962 10267083 Asia 32.00 853.1
## 4 Afghanistan 1967 11537966 Asia 34.02 836.2
## 5 Afghanistan 1972 13079460 Asia 36.09 740.0
## 6 Afghanistan 1977 14880372 Asia 38.44 786.1
library(plyr)
Let's examine the min, max and mean GDP per capita for each continent in our data frame and then sort the data by the mean GDP per capita.
gdpByCont <- ddply(gDat, ~continent, summarize, minGDPperCap = min(gdpPercap),
maxGDPperCap = max(gdpPercap), meanGDPperCap = mean(gdpPercap))
tab <- (gdpByCont[order(gdpByCont$meanGDPperCap), ])
library(xtable)
tab <- xtable(tab)
print(tab, type = "html", include.rownames = F)
| continent | minGDPperCap | maxGDPperCap | meanGDPperCap |
|---|---|---|---|
| Africa | 241.17 | 21951.21 | 2193.75 |
| Americas | 1201.64 | 42951.65 | 7136.11 |
| Asia | 331.00 | 113523.13 | 7902.15 |
| Europe | 973.53 | 49357.19 | 14469.48 |
| Oceania | 10039.60 | 34435.37 | 18621.61 |
There is quite the range in the mean GDP per capita between the different continents. Oceania has the highest mean but has the second to lowest max GDP per capita. This is implies that Oceania has the tighest distribution of GDP per capita which makes sense because the continent group is only make up of two countries, Australia and New Zealand. Asia has the widest distribution but it appears heavily right skewed.
Let's examine the spread more in depth by calculating the standard deviation, median absolution deviation and the interquartile range. The results are sorted by the standard deviation.
spreadGDP <- ddply(gDat, ~continent, summarize, sdGDP = sd(gdpPercap), madGDP = mad(gdpPercap),
iqrGDP = IQR(gdpPercap))
print(xtable(spreadGDP[order(spreadGDP$sdGDP), ]), type = "html", include.rownames = F)
| continent | sdGDP | madGDP | iqrGDP |
|---|---|---|---|
| Africa | 2827.93 | 775.32 | 1616.17 |
| Oceania | 6358.98 | 6459.10 | 8072.26 |
| Americas | 6396.76 | 3269.33 | 4402.43 |
| Europe | 9355.21 | 8846.05 | 13248.30 |
| Asia | 14045.37 | 2820.83 | 7492.26 |
As expected, Asia has the largest standard deviation but interestingly it has the second lowest median absolute deviation.
It is also easy to look at mean life expectancy by year by using ddply. We can compare the mean life expectancy per year to two different trimmed means (5% and 10%).
avgLifebyYear <- ddply(gDat, ~year, summarize, avgLifeExp = mean(lifeExp), avgTrim5perc = mean(lifeExp,
trim = 0.05), avgTrim10per = mean(lifeExp, trim = 0.1))
print(xtable(avgLifebyYear), type = "html", include.rownames = F)
| year | avgLifeExp | avgTrim5perc | avgTrim10per |
|---|---|---|---|
| 1952 | 49.06 | 48.85 | 48.58 |
| 1957 | 51.51 | 51.42 | 51.27 |
| 1962 | 53.61 | 53.64 | 53.58 |
| 1967 | 55.68 | 55.80 | 55.87 |
| 1972 | 57.65 | 57.85 | 58.01 |
| 1977 | 59.57 | 59.89 | 60.10 |
| 1982 | 61.53 | 61.85 | 62.12 |
| 1987 | 63.21 | 63.61 | 63.92 |
| 1992 | 64.16 | 64.81 | 65.19 |
| 1997 | 65.01 | 65.56 | 66.02 |
| 2002 | 65.69 | 66.20 | 66.72 |
| 2007 | 67.01 | 67.56 | 68.11 |
The first thing that jumps out is that the average life expectancy is increasing by year. This is not unexpected. Also, the trimmed means for 5% and 10% do not differ significantly from the means.