Stat545a Homework #3 — Practicing Data Aggregation

This page demonstrates some simple data aggregation examples using the plyr package. You can follow along with the Gapminder data.

gDat <- read.delim("gapminderDataFiveYear.txt")
str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
head(gDat)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia   28.80     779.4
## 2 Afghanistan 1957  9240934      Asia   30.33     820.9
## 3 Afghanistan 1962 10267083      Asia   32.00     853.1
## 4 Afghanistan 1967 11537966      Asia   34.02     836.2
## 5 Afghanistan 1972 13079460      Asia   36.09     740.0
## 6 Afghanistan 1977 14880372      Asia   38.44     786.1
library(plyr)

Let's examine the min, max and mean GDP per capita for each continent in our data frame and then sort the data by the mean GDP per capita.

# Summarize GDP per capita by continent: minimum, maximum, and mean
gdpByCont <- ddply(gDat, ~continent, summarize, minGDPperCap = min(gdpPercap), 
    maxGDPperCap = max(gdpPercap), meanGDPperCap = mean(gdpPercap))
# Sort the rows by mean GDP per capita (ascending)
tab <- gdpByCont[order(gdpByCont$meanGDPperCap), ]
library(xtable)
tab <- xtable(tab)
print(tab, type = "html", include.rownames = FALSE)
continent  minGDPperCap  maxGDPperCap  meanGDPperCap
Africa           241.17      21951.21        2193.75
Americas        1201.64      42951.65        7136.11
Asia             331.00     113523.13        7902.15
Europe           973.53      49357.19       14469.48
Oceania        10039.60      34435.37       18621.61

The mean GDP per capita varies widely across continents. Oceania has the highest mean but the second-lowest maximum GDP per capita, which suggests that its values are comparatively tightly clustered. This makes sense because the Oceania group is made up of only two countries, Australia and New Zealand. Asia has the widest range, and its distribution appears heavily right-skewed: its maximum lies far above its mean.
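One quick, informal way to probe a suspicion of right skew is to compare the mean with the median for each group: a mean well above the median points to a long right tail. The sketch below uses base R's `tapply` on a small made-up data frame (the `toy` values and the `skewCheck` name are invented for illustration and are not Gapminder data).

```r
# Toy GDP-per-capita values, invented for illustration (not Gapminder data).
# The "skewed" group has one very large value; the "symmetric" group does not.
toy <- data.frame(
  group = rep(c("symmetric", "skewed"), each = 5),
  gdp   = c(10, 20, 30, 40, 50,
            10, 20, 30, 40, 400)
)

# Group-wise mean and median using base R's tapply()
meanGDP   <- tapply(toy$gdp, toy$group, mean)
medianGDP <- tapply(toy$gdp, toy$group, median)

# Combine the two summaries into one row per group
skewCheck <- data.frame(
  group     = names(meanGDP),
  meanGDP   = as.vector(meanGDP),
  medianGDP = as.vector(medianGDP)
)
print(skewCheck)
```

On the real data, the same comparison could be made by adding a term such as `medianGDPperCap = median(gdpPercap)` to the `ddply` call used for the continent summary.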

Let's examine the spread in more depth by calculating the standard deviation, median absolute deviation, and interquartile range of GDP per capita for each continent. The results are sorted by standard deviation.

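# Spread of GDP per capita by continent: SD, MAD, and IQR
spreadGDP <- ddply(gDat, ~continent, summarize, sdGDP = sd(gdpPercap), madGDP = mad(gdpPercap), 
    iqrGDP = IQR(gdpPercap))
# Print the table sorted by standard deviation (ascending)
print(xtable(spreadGDP[order(spreadGDP$sdGDP), ]), type = "html", include.rownames = FALSE)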
continent     sdGDP   madGDP    iqrGDP
Africa      2827.93   775.32   1616.17
Oceania     6358.98  6459.10   8072.26
Americas    6396.76  3269.33   4402.43
Europe      9355.21  8846.05  13248.30
Asia       14045.37  2820.83   7492.26

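As expected, Asia has the largest standard deviation, but interestingly it has the second-lowest median absolute deviation. This is consistent with a right-skewed distribution: a few very large values inflate the standard deviation while barely moving a median-based measure.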
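To see why the standard deviation and the median absolute deviation can disagree so sharply, here is a minimal sketch on a made-up vector (the values in `x` are invented, not Gapminder data). It also recomputes `mad()` by hand, using the default scaling constant of 1.4826 that R applies.

```r
# Made-up values: a tight cluster plus one extreme observation
x <- c(1, 2, 3, 4, 100)

sdX  <- sd(x)    # pulled upward by the single extreme value
madX <- mad(x)   # based on medians, so barely affected by it
iqrX <- IQR(x)   # based on quartiles, also resistant to the extreme value

# mad() by hand: scaled median of absolute deviations from the median.
# 1.4826 is the default `constant` argument of mad().
madByHand <- 1.4826 * median(abs(x - median(x)))

print(c(sd = sdX, mad = madX, iqr = iqrX, madByHand = madByHand))
```

The single large value dominates the standard deviation, whereas the MAD and IQR stay close to the spread of the four clustered values.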

It is also easy to look at mean life expectancy by year using ddply. We can compare the ordinary mean for each year to two trimmed means (5% and 10%).

# Mean life expectancy by year: untrimmed, 5% trimmed, and 10% trimmed
avgLifebyYear <- ddply(gDat, ~year, summarize, avgLifeExp = mean(lifeExp), avgTrim5perc = mean(lifeExp, 
    trim = 0.05), avgTrim10per = mean(lifeExp, trim = 0.1))
print(xtable(avgLifebyYear), type = "html", include.rownames = FALSE)
year  avgLifeExp  avgTrim5perc  avgTrim10per
1952       49.06         48.85         48.58
1957       51.51         51.42         51.27
1962       53.61         53.64         53.58
1967       55.68         55.80         55.87
1972       57.65         57.85         58.01
1977       59.57         59.89         60.10
1982       61.53         61.85         62.12
1987       63.21         63.61         63.92
1992       64.16         64.81         65.19
1997       65.01         65.56         66.02
2002       65.69         66.20         66.72
2007       67.01         67.56         68.11

The first thing that jumps out is that the average life expectancy increases in every period, which is not unexpected. Also, the 5% and 10% trimmed means stay close to the untrimmed mean, differing from it by at most about 1.1 years (in 2007).
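For readers unfamiliar with the `trim` argument of `mean()`, the following sketch shows what it does on a small made-up vector (the values are invented for illustration). With `trim = p`, R sorts the values and drops `floor(n * p)` observations from each end before averaging.

```r
# Made-up values: nine ordinary observations and one extreme one
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

plainMean <- mean(x)               # uses all 10 values
trim10    <- mean(x, trim = 0.1)   # drops floor(10 * 0.1) = 1 value per end
trim5     <- mean(x, trim = 0.05)  # floor(10 * 0.05) = 0, so nothing is dropped

# The 10% trimmed mean by hand: sort, then drop the smallest and largest value
byHand <- mean(sort(x)[2:9])

print(c(plain = plainMean, trim5 = trim5, trim10 = trim10, byHand = byHand))
```

If each year in the Gapminder file holds 142 countries (1704 rows over 12 years), the 5% and 10% trims would drop 7 and 14 countries from each end, respectively.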