Data Aggregation

This tutorial is on:

Using plyr package for data aggregation (mostly the ddply() function for data. frames) - a better alternative to subdividing your data into smaller data.frames (by factor category for example)
Presenting data in html table using xtable package

Load the gapminder data

gDat <- read.delim("gapminderDataFiveYear.txt")
str(gDat)

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

If you feel the urge to store a little snippet of a data.frame:

(snippet <- subset(gDat, country == "Canada"))

##     country year      pop continent lifeExp gdpPercap
## 241  Canada 1952 14785584  Americas   68.75     11367
## 242  Canada 1957 17010154  Americas   69.96     12490
## 243  Canada 1962 18985849  Americas   71.30     13462
## 244  Canada 1967 20819767  Americas   72.13     16077
## 245  Canada 1972 22284500  Americas   72.88     18971
## 246  Canada 1977 23796400  Americas   74.21     22091
## 247  Canada 1982 25201900  Americas   75.76     22899
## 248  Canada 1987 26549700  Americas   76.86     26627
## 249  Canada 1992 28523502  Americas   77.95     26343
## 250  Canada 1997 30305843  Americas   78.61     28955
## 251  Canada 2002 31902268  Americas   79.77     33329
## 252  Canada 2007 33390141  Americas   80.65     36319

Stop and ask yourself …

Do I want to create sub-data.frames for each level of some factor (or unique combination of several factors) ... in order to compute or graph something?

If YES, then maybe you really do need to store a copy of a subset of the data.frame. But seriously consider whether you can achieve your goals by simply using the subset = argument – or perhaps with() coupled with subset() – to enact a computation on a specific set of rows. If this still does not suit your needs, then maybe you really should use subset() as shown above and carry on.

If NO, use data aggregation techniques or conditioning in lattice or facetting in ggplot2 plots – dont subset the data.frame. Or, to be totally clear, only subset the data.frame as a temporary measure as you develop your elegant code for computing on or visualizing these sub-data.frames.

Data aggregation landscape

There are two main options for data aggregation:

- built-in functions, often referred to as the apply family of functions
- the plyr add-on package

This tutorial is on plyr

Data aggregation using plyr

Install and load the plyr package

library(plyr)

plyr Big ideas

The plyr functions will not make much sense viewed individually, e.g. simply reading the help for ddply() is not the fast track to competence. There is a very important over-arching logic for the package and it is well worth reading the article The split-apply-combine strategy for data analysis, Hadley Wickham, Journal of Statistical Software, vol. 40, no. 1, pp. 129, 2011. Though it is no substitute for reading the above, here is the most critical information:

split-apply-combine: A common analytical pattern is to split data into logical bits, apply some function to each bit, and stick the results back together again. Recognize when you're solving such a problem and exploit the right tools.
The computations on these little bits must be truly independent, i.e. the problem must be embarrassingly or pleasingly parallel, in order to use plyr.

The heart of plyr is a set a functions with names like this: XYply where X specifies what sort of input you're giving and Y specifies the sort of output you want.

* a = array, where matrices and vectors are important special cases
* d = data.frame
* l = list
* _ = no output; only valid for Y, obviously; useful when you're operating on a list purely for the side effects, e.g., making a plot or sending output to screen/file

The usage is very similar across these functions. Here are the main arguments: * .data is the first argument = the input * the next argument specifies how to split up the input into bits; it is does not exist when the input is a list, because the pieces are obviously the list components * then comes the function and further arguments needed to describe the computation to be applied to the bits

Today we will emphasize ddply() which accepts a data.frame, splits it into pieces based on one or more factors, computes on the pieces, then returns the results as a data.frame. For the record, the built-in functions most relevant to ddply() are tapply() and friends.

ddply()

Let's say we want to get the maximum life expectancy for each continent.

(maxLeByCont <- ddply(gDat, ~continent, summarize, maxLifeExp = max(lifeExp)))

##   continent maxLifeExp
## 1    Africa      76.44
## 2  Americas      80.65
## 3      Asia      82.60
## 4    Europe      81.76
## 5   Oceania      81.23

str(maxLeByCont)

## 'data.frame':    5 obs. of  2 variables:
##  $ continent : Factor w/ 5 levels "Africa","Americas",..: 1 2 3 4 5
##  $ maxLifeExp: num  76.4 80.7 82.6 81.8 81.2

levels(maxLeByCont$continent)

## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

summarize() or its synonym summarise() is a plyr function that creates a new data.frame from an old one. It is related to the built-in function transform() that transforms variables in a data.frame or adds new ones. Feel free to play with it a bit in some top-level commands; you will use it alot inside plyr calls.

The two variables in maxLeByCont come from two sources. The continent factor is provided by ddply() and represents the labelling of the life expectancies with their associated continent. This is the book-keeping associated with dividing the input into little bits, computing on them, and gluing the results together again in an orderly, labelled fashion. We can take more credit for the other variable maxLifeExp, which has a name we chose (“maxLifeExp”) and arises from applying a function we specified (max()) to a variable of our choice (lifeExp).

compute the minimum GDP per capita by continent.

(minGDPbyCont <- ddply(gDat, ~continent, summarize, minGDP = min(gdpPercap)))

##   continent  minGDP
## 1    Africa   241.2
## 2  Americas  1201.6
## 3      Asia   331.0
## 4    Europe   973.5
## 5   Oceania 10039.6

The function you want to apply to the continent-specific data.frames can be built-in, like max() above, or a custom function you've written. This custom function can be written in advance or specified 'on the fly'. Here's how I would count the number of countries in this dataset for each continent.

ddply(gDat, ~continent, summarize, nUniqueCount = length(unique(country)))

##   continent nUniqueCount
## 1    Africa           52
## 2  Americas           25
## 3      Asia           33
## 4    Europe           30
## 5   Oceania            2

Here is another way to do the same thing that doesn't use summarize() at all:

ddply(gDat, ~continent, function(x) return(c(nUniqueCount = length(unique(x$country)))))

##   continent nUniqueCount
## 1    Africa           52
## 2  Americas           25
## 3      Asia           33
## 4    Europe           30
## 5   Oceania            2

You don't have to compute just one thing for each sub-data.frame, nor are you limited to computing on just one variable. Check it out.

ddply(gDat, ~continent, summarize, maxLifeExp = max(lifeExp), minLifeExp = min(lifeExp), 
    medGDPperCap = median(gdpPercap))

##   continent maxLifeExp minLifeExp medGDPperCap
## 1    Africa      76.44      23.60         1192
## 2  Americas      80.65      37.58         5466
## 3      Asia      82.60      28.80         2647
## 4    Europe      81.76      43.59        12082
## 5   Oceania      81.23      69.12        17983

Putting it all together: using ddply() and polishing the results

Now I want to do something more complicated. I want to fit a linear regression for each country, modelling life expectancy as a function of the year and then retain the estimated intercepts and slopes. I will walk before I run. Therefore, I will create a tiny sub-data.frame to prototype this, before I fold it into a ddply() call. If you're a newbie, watch how complicated tasks are slowly constructed.

jCountry <- "France"
(jDat <- subset(gDat, country == jCountry))  #temporary measure

##     country year      pop continent lifeExp gdpPercap
## 529  France 1952 42459667    Europe   67.41      7030
## 530  France 1957 44310863    Europe   68.93      8663
## 531  France 1962 47124000    Europe   70.51     10560
## 532  France 1967 49569000    Europe   71.55     13000
## 533  France 1972 51732000    Europe   72.38     16107
## 534  France 1977 53165019    Europe   73.83     18293
## 535  France 1982 54433565    Europe   74.89     20294
## 536  France 1987 55630100    Europe   76.34     22066
## 537  France 1992 57374179    Europe   77.46     24704
## 538  France 1997 58623428    Europe   78.64     25890
## 539  France 2002 59925035    Europe   79.59     28926
## 540  France 2007 61083916    Europe   80.66     30470


library(lattice)
xyplot(lifeExp ~ year, jDat, type = c("p", "r"))

plot of chunk unnamed-chunk-9


jFit <- lm(lifeExp ~ year, jDat)
summary(jFit)

## 
## Call:
## lm(formula = lifeExp ~ year, data = jDat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3801 -0.1389  0.0124  0.1429  0.3349 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.98e+02   7.29e+00   -54.6  1.0e-13 ***
## year         2.38e-01   3.68e-03    64.8  1.9e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.22 on 10 degrees of freedom
## Multiple R-squared:  0.998,  Adjusted R-squared:  0.997 
## F-statistic: 4.2e+03 on 1 and 10 DF,  p-value: 1.86e-14

Wow, check out that crazy intercept! Apparently the life expectancy in France around year 0 A.D. was minus 400 years! This a great opportunity for some sanity checking of a model fit and thinking about how to reparametrize the model to make the parameters have natural interpretation. I think it makes more sense for the intercept to correspond to life expectancy in 1952, the earliest date in our dataset. Let's try that again.

(yearMin <- min(gDat$year))

## [1] 1952

jFit <- lm(lifeExp ~ I(year - yearMin), jDat)
summary(jFit)

## 
## Call:
## lm(formula = lifeExp ~ I(year - yearMin), data = jDat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3801 -0.1389  0.0124  0.1429  0.3349 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       67.79013    0.11949   567.3  < 2e-16 ***
## I(year - yearMin)  0.23850    0.00368    64.8  1.9e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.22 on 10 degrees of freedom
## Multiple R-squared:  0.998,  Adjusted R-squared:  0.997 
## F-statistic: 4.2e+03 on 1 and 10 DF,  p-value: 1.86e-14

An intercept around 68 years makes much more common sense and is also supported by our plot.

What is this jFit object and how can I get stuff out of it?

class(jFit)

## [1] "lm"

mode(jFit)

## [1] "list"

str(jFit)

## List of 12
##  $ coefficients : Named num [1:2] 67.79 0.239
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "I(year - yearMin)"
##  $ residuals    : Named num [1:12] -0.3801 -0.0526 0.3349 0.1824 -0.1802 ...
##   ..- attr(*, "names")= chr [1:12] "529" "530" "531" "532" ...
##  $ effects      : Named num [1:12] -257.5522 14.2603 0.4152 0.2648 -0.0956 ...
##   ..- attr(*, "names")= chr [1:12] "(Intercept)" "I(year - yearMin)" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:12] 67.8 69 70.2 71.4 72.6 ...
##   ..- attr(*, "names")= chr [1:12] "529" "530" "531" "532" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:12, 1:2] -3.464 0.289 0.289 0.289 0.289 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:12] "529" "530" "531" "532" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "I(year - yearMin)"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.29 1.27
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 10
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = lifeExp ~ I(year - yearMin), data = jDat)
##  $ terms        :Classes 'terms', 'formula' length 3 lifeExp ~ I(year - yearMin)
##   .. ..- attr(*, "variables")= language list(lifeExp, I(year - yearMin))
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "lifeExp" "I(year - yearMin)"
##   .. .. .. ..$ : chr "I(year - yearMin)"
##   .. ..- attr(*, "term.labels")= chr "I(year - yearMin)"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(lifeExp, I(year - yearMin))
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "I(year - yearMin)"
##  $ model        :'data.frame':   12 obs. of  2 variables:
##   ..$ lifeExp          : num [1:12] 67.4 68.9 70.5 71.5 72.4 ...
##   ..$ I(year - yearMin):Class 'AsIs'  int [1:12] 0 5 10 15 20 25 30 35 40 45 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 lifeExp ~ I(year - yearMin)
##   .. .. ..- attr(*, "variables")= language list(lifeExp, I(year - yearMin))
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "lifeExp" "I(year - yearMin)"
##   .. .. .. .. ..$ : chr "I(year - yearMin)"
##   .. .. ..- attr(*, "term.labels")= chr "I(year - yearMin)"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(lifeExp, I(year - yearMin))
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "I(year - yearMin)"
##  - attr(*, "class")= chr "lm"

names(jFit)

##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

jFit$coefficients

##       (Intercept) I(year - yearMin) 
##           67.7901            0.2385

coef(jFit)

##       (Intercept) I(year - yearMin) 
##           67.7901            0.2385

As a rule, I use extractor functions like this when they are available.

We have achieved our goal for this specific country – we've gotten its intercept and slope. Now we need to package that as a function.

jFun <- function(x) coef(lm(lifeExp ~ I(year - yearMin), x))
jFun(jDat)

##       (Intercept) I(year - yearMin) 
##           67.7901            0.2385

I hate the names of these return values. Good names pay off downstream, so I will enhance my function.

jFun <- function(x) {
    estCoef <- coef(lm(lifeExp ~ I(year - yearMin), x))
    names(estCoef) <- c("Intercept", "slope")
    return(estCoef)
}

jFun(jDat)

## Intercept     slope 
##   67.7901    0.2385

It's always a good idea to try out a function on a few small examples.


jFun(subset(gDat, country == "Poland"))

## Intercept     slope 
##   64.7809    0.1962

jFun(subset(gDat, country == "Canada"))

## Intercept     slope 
##   68.8838    0.2189

jFun(subset(gDat, country == "Slovak Republic"))

## Intercept     slope 
##    67.010     0.134

jFun(subset(gDat, country == "India"))

## Intercept     slope 
##   39.2698    0.5053

jFun(subset(gDat, country == "China"))

## Intercept     slope 
##   47.1905    0.5307

It seems like we are ready to scale up by placing this function inside a ddply() call.

library(plyr)
jCoef <- ddply(gDat, ~country, jFun)
str(jCoef)

## 'data.frame':    142 obs. of  3 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Intercept: num  29.9 59.2 43.4 32.1 62.7 ...
##  $ slope    : num  0.275 0.335 0.569 0.209 0.232 ...

tail(jCoef)

##                country Intercept    slope
## 137          Venezuela     57.51  0.32972
## 138            Vietnam     39.01  0.67162
## 139 West Bank and Gaza     43.80  0.60110
## 140        Yemen, Rep.     30.13  0.60546
## 141             Zambia     47.66 -0.06043
## 142           Zimbabwe     55.22 -0.09302

This is in the script lmByCountry.r

Passing more then one argument to a function

the jFun above only requires one argument, x. What if it had more than one argument?

For example, let's imagine that the shift for the year covariate is an argument instead of a previously-assigned variable yearMin. Here's how it would work.

jFunTwoArg <- function(x, cvShift = 0) {
    estCoef <- coef(lm(lifeExp ~ I(year - cvShift), x))
    names(estCoef) <- c("Intercept", "slope")
    return(estCoef)
}

Since I've assigned cvShift = a default value of zero, we can get coefficients where the intercept corresponds to the year A.D. 0 with this simple call:

sillyCoef <- ddply(gDat, ~country, jFunTwoArg)
head(sillyCoef)

##       country Intercept  slope
## 1 Afghanistan    -507.5 0.2753
## 2     Albania    -594.1 0.3347
## 3     Algeria   -1067.9 0.5693
## 4      Angola    -376.5 0.2093
## 5   Argentina    -389.6 0.2317
## 6   Australia    -376.1 0.2277

We are getting the same estimated slopes but the silly year 0 intercepts we've seen before. Let's use the cvShift = argument to resolve this.

saneCoef <- ddply(gDat, ~country, jFunTwoArg, cvShift = 1952)
head(saneCoef)

##       country Intercept  slope
## 1 Afghanistan     29.91 0.2753
## 2     Albania     59.23 0.3347
## 3     Algeria     43.37 0.5693
## 4      Angola     32.13 0.2093
## 5   Argentina     62.69 0.2317
## 6   Australia     68.40 0.2277

We're back to our usual estimated intercepts, which reflect life expectancy in 1952. Of course hard-wiring 1952 is not a great idea, so here's probably our best code yet:

bestCoef <- ddply(gDat, ~country, jFunTwoArg, cvShift = min(gDat$year))
head(bestCoef)

##       country Intercept  slope
## 1 Afghanistan     29.91 0.2753
## 2     Albania     59.23 0.3347
## 3     Algeria     43.37 0.5693
## 4      Angola     32.13 0.2093
## 5   Argentina     62.69 0.2317
## 6   Australia     68.40 0.2277

Finally, let's present this information attractively in a table.

Markdown doesn't have a proper table syntax, because it is a ruthlessly simple language. To make a pretty table you need to install additional package(s). There are many but here we will use the xtable package. We will make an HTML table. This HTML code will survive unharmed as your R Markdown document is converted from R Markdown to Markdown and finally to HTML.

library(xtable)

Let's pick some countries at random and display their estimated coefficients. FYI: the following R chunk has an option results='asis' and that is important to the correct display of the table.

set.seed(916)
foo <- jCoef[sample(nrow(jCoef), size = 15), ]
foo <- xtable(foo)
print(foo, type = "html", include.rownames = F)

country	Intercept	slope
Lebanon	58.69	0.26
Senegal	36.75	0.50
Dominican Republic	48.60	0.47
Oman	37.21	0.77
Germany	67.57	0.21
Korea, Dem. Rep.	54.91	0.32
Mauritius	55.37	0.35
Slovak Republic	67.01	0.13
Comoros	40.00	0.45
Argentina	62.69	0.23
Central African Republic	38.81	0.18
Ecuador	49.07	0.50
West Bank and Gaza	43.80	0.60
Egypt	40.97	0.56
Myanmar	41.41	0.43

Two easy improvments to make this table more useful are

* include the continent information
* sort it rationally

The easiest way to get the continent information is to enhance our ddply() call.

Here is what we used:

jCoef <- ddply(gDat, ~country, jFun)

This divides gDat into country-specific pieces. But we can supply two factors in the second argument: **country and continent. In theory, sub-data.frames will be made for all possible combinations of the levels of country and continent. Many of those will have zero rows because there is, for example, no Belgium in Asia. By default, plyr functions drop these empty combinations. But the labelling work done by ddply() will still help us, as we will get both country and continent as factors in our result. This is easier to see than explain!

jCoef <- ddply(gDat, ~country + continent, jFun)
str(jCoef)

## 'data.frame':    142 obs. of  4 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
##  $ Intercept: num  29.9 59.2 43.4 32.1 62.7 ...
##  $ slope    : num  0.275 0.335 0.569 0.209 0.232 ...

tail(jCoef)

##                country continent Intercept    slope
## 137          Venezuela  Americas     57.51  0.32972
## 138            Vietnam      Asia     39.01  0.67162
## 139 West Bank and Gaza      Asia     43.80  0.60110
## 140        Yemen, Rep.      Asia     30.13  0.60546
## 141             Zambia    Africa     47.66 -0.06043
## 142           Zimbabwe    Africa     55.22 -0.09302

Now, prior to making the HTML table, we will sort the data.frame, so it starts with the country with the shortest life expectancy in 1952, and goes to the largest.

set.seed(916)
foo <- jCoef[sample(nrow(jCoef), size = 15), ]
foo <- arrange(foo, Intercept)

# an uglier way to order (w/o plyr) foo <- foo[order(foo$Intercept),]

foo <- xtable(foo)
print(foo, type = "html", include.rownames = F)

country	continent	Intercept	slope
Senegal	Africa	36.75	0.50
Oman	Asia	37.21	0.77
Central African Republic	Africa	38.81	0.18
Comoros	Africa	40.00	0.45
Egypt	Africa	40.97	0.56
Myanmar	Asia	41.41	0.43
West Bank and Gaza	Asia	43.80	0.60
Dominican Republic	Americas	48.60	0.47
Ecuador	Americas	49.07	0.50
Korea, Dem. Rep.	Asia	54.91	0.32
Mauritius	Africa	55.37	0.35
Lebanon	Asia	58.69	0.26
Argentina	Americas	62.69	0.23
Slovak Republic	Europe	67.01	0.13
Germany	Europe	67.57	0.21

Lessons

plyr is a powerful package for data aggregation.
ddply() is the most important function: data.frames are great, therefore ddply() is great because it takes data.frame as input and returns data.frame as output.
Simple functions, built-in or user-defined, can be provided directly in a call to ddply().
It's better to handle more complicated data aggregation differently. Build it up slowly and revisit earlier steps for improvement. Here is a gentle workflow:

* Create an indicative sub-data.frame
* Compute on it directly with top-level commands until you achieve your goal
* Package those commands inside a function
* Test the function with other sub-data.frames
* Refine your function, e.g. give the return values better names
* Construct a ddply() call using your function
* Identify more weakness in your function, refine, call ddply() again

Results you present to the world generally need polishing to make the best impression and have the best impact. This too will be an iterative process. Here are some good things to think about:

* Is there an obvious piece of ancillary information that I should include? Example: continent above. It was not needed for the computation but it is helpful to retain/restore it for the final table.

* Is there some logical ordering for the data? Alphabetical is ususally the default and it is completely arbitrary and often confusing.

* Have I presented the data in the most aesthetically pleasing way I currently know how? - xtable package