Stat545a Homework #3 — Practicing Data Aggregation

This page demonstrates some simple data aggregation examples using the plyr package. You can follow along with the Gapminder data, read here from gapminderDataFiveYear.txt.

gDat <- read.delim("gapminderDataFiveYear.txt")
str(gDat)

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

head(gDat)

##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia   28.80     779.4
## 2 Afghanistan 1957  9240934      Asia   30.33     820.9
## 3 Afghanistan 1962 10267083      Asia   32.00     853.1
## 4 Afghanistan 1967 11537966      Asia   34.02     836.2
## 5 Afghanistan 1972 13079460      Asia   36.09     740.0
## 6 Afghanistan 1977 14880372      Asia   38.44     786.1

library(plyr)

Let's examine the min, max and mean GDP per capita for each continent in our data frame and then sort the data by the mean GDP per capita.

gdpByCont <- ddply(gDat, ~continent, summarize, minGDPperCap = min(gdpPercap), 
    maxGDPperCap = max(gdpPercap), meanGDPperCap = mean(gdpPercap))
tab <- gdpByCont[order(gdpByCont$meanGDPperCap), ]
library(xtable)
tab <- xtable(tab)
print(tab, type = "html", include.rownames = FALSE)

continent	minGDPperCap	maxGDPperCap	meanGDPperCap
Africa	241.17	21951.21	2193.75
Americas	1201.64	42951.65	7136.11
Asia	331.00	113523.13	7902.15
Europe	973.53	49357.19	14469.48
Oceania	10039.60	34435.37	18621.61

There is quite a range in mean GDP per capita across the continents. Oceania has the highest mean but the second-lowest maximum GDP per capita. This suggests that Oceania's values are tightly clustered relative to its mean, which makes sense because the continent group is made up of only two countries, Australia and New Zealand. Asia has the widest range, but its distribution appears heavily right-skewed: its maximum is far above its mean.
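To make the mechanics of the ddply call above more concrete, here is a minimal base-R sketch of the same split-apply-combine idea. It runs on a small made-up data frame (the `toy` values and the names `pieces`, `summaries` and `toySummary` are assumptions for illustration, not Gapminder data). It is not how plyr is implemented internally, only a rough equivalent.

```r
# Toy data (made-up values, not from Gapminder)
toy <- data.frame(
  continent = c("A", "A", "B", "B", "B"),
  gdpPercap = c(100, 300, 1000, 2000, 6000)
)

# Split: one data frame per continent
pieces <- split(toy, toy$continent)

# Apply: compute a one-row summary for each piece
summaries <- lapply(names(pieces), function(cont) {
  x <- pieces[[cont]]$gdpPercap
  data.frame(continent = cont, minGDPperCap = min(x),
             maxGDPperCap = max(x), meanGDPperCap = mean(x))
})

# Combine: stack the rows, then sort by the mean (ascending, as above)
toySummary <- do.call(rbind, summaries)
toySummary <- toySummary[order(toySummary$meanGDPperCap), ]
print(toySummary)
```

The ddply version does all three steps in one call, which is the main reason to prefer it for this kind of summary.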

Let's examine the spread in more depth by calculating the standard deviation, median absolute deviation and interquartile range. The results are sorted by the standard deviation.

spreadGDP <- ddply(gDat, ~continent, summarize, sdGDP = sd(gdpPercap), madGDP = mad(gdpPercap), 
    iqrGDP = IQR(gdpPercap))
print(xtable(spreadGDP[order(spreadGDP$sdGDP), ]), type = "html", include.rownames = FALSE)

continent	sdGDP	madGDP	iqrGDP
Africa	2827.93	775.32	1616.17
Oceania	6358.98	6459.10	8072.26
Americas	6396.76	3269.33	4402.43
Europe	9355.21	8846.05	13248.30
Asia	14045.37	2820.83	7492.26

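As expected, Asia has the largest standard deviation, but interestingly it has the second-lowest median absolute deviation. This is consistent with a right-skewed distribution: a few very large values inflate the standard deviation, while the median absolute deviation is robust to them.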
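The gap between Asia's standard deviation and its median absolute deviation can be illustrated with a small sketch. The two vectors below are made up for illustration (they are not Gapminder values) and differ only in one extreme observation; `spread` is a hypothetical helper that bundles the three measures used above.

```r
# Toy vectors (made-up values): identical except for one extreme value
x_plain  <- c(1, 2, 3, 4, 5)
x_skewed <- c(1, 2, 3, 4, 500)

# Hypothetical helper: the three measures of spread used in the table above
spread <- function(x) c(sd = sd(x), mad = mad(x), iqr = IQR(x))

# One row per vector, one column per measure of spread
print(rbind(plain = spread(x_plain), skewed = spread(x_skewed)))
```

In this toy case the single extreme value changes the standard deviation dramatically while leaving the median absolute deviation and the interquartile range unchanged, which is the same pattern the Asia row suggests.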

ddply also makes it easy to look at mean life expectancy by year. We can compare the mean life expectancy per year to two different trimmed means (5% and 10%).

avgLifebyYear <- ddply(gDat, ~year, summarize, avgLifeExp = mean(lifeExp), avgTrim5perc = mean(lifeExp, 
    trim = 0.05), avgTrim10per = mean(lifeExp, trim = 0.1))
print(xtable(avgLifebyYear), type = "html", include.rownames = FALSE)

year	avgLifeExp	avgTrim5perc	avgTrim10per
1952	49.06	48.85	48.58
1957	51.51	51.42	51.27
1962	53.61	53.64	53.58
1967	55.68	55.80	55.87
1972	57.65	57.85	58.01
1977	59.57	59.89	60.10
1982	61.53	61.85	62.12
1987	63.21	63.61	63.92
1992	64.16	64.81	65.19
1997	65.01	65.56	66.02
2002	65.69	66.20	66.72
2007	67.01	67.56	68.11

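The first thing that jumps out is that average life expectancy increases with every survey year, which is not unexpected. Also, the 5% and 10% trimmed means stay close to the untrimmed mean: the largest gap in the table is about 1.1 years (10% trim, 2007). From 1967 on, both trimmed means are slightly higher than the untrimmed mean, which suggests that a few low values pull the untrimmed mean down.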
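As a reminder of what the `trim` argument does, here is a short sketch on a made-up vector (`lifeExpToy` and its values are assumptions for illustration, not Gapminder data). With `trim = 0.1`, `mean` drops 10% of the observations from each end of the sorted values before averaging.

```r
# Toy life expectancies (made-up values) with one unusually low value
lifeExpToy <- c(30, 60, 62, 64, 66, 68, 70, 72, 74, 76)

plainMean   <- mean(lifeExpToy)
trimmedMean <- mean(lifeExpToy, trim = 0.1)

# Equivalent by hand: sort, drop one value from each end (10% of 10),
# then average the remaining eight values
byHand <- mean(sort(lifeExpToy)[2:9])

print(c(plain = plainMean, trimmed = trimmedMean, byHand = byHand))
```

Here the single low value pulls the plain mean (64.2) below the trimmed mean (67), the same direction of difference seen in the later years of the table.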