Using plyr package for data aggregation (mostly the ddply() function for data. frames) - a better alternative to subdividing your data into smaller data.frames (by factor category for example)
Presenting data in html table using xtable package
gDat <- read.delim("gapminderDataFiveYear.txt")
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
If you feel the urge to store a little snippet of a data.frame:
(snippet <- subset(gDat, country == "Canada"))
## country year pop continent lifeExp gdpPercap
## 241 Canada 1952 14785584 Americas 68.75 11367
## 242 Canada 1957 17010154 Americas 69.96 12490
## 243 Canada 1962 18985849 Americas 71.30 13462
## 244 Canada 1967 20819767 Americas 72.13 16077
## 245 Canada 1972 22284500 Americas 72.88 18971
## 246 Canada 1977 23796400 Americas 74.21 22091
## 247 Canada 1982 25201900 Americas 75.76 22899
## 248 Canada 1987 26549700 Americas 76.86 26627
## 249 Canada 1992 28523502 Americas 77.95 26343
## 250 Canada 1997 30305843 Americas 78.61 28955
## 251 Canada 2002 31902268 Americas 79.77 33329
## 252 Canada 2007 33390141 Americas 80.65 36319
Stop and ask yourself …
Do I want to create sub-data.frames for each level of some factor (or unique combination of several factors) ... in order to compute or graph something?
If YES, then maybe you really do need to store a copy of a subset of the data.frame. But seriously consider whether you can achieve your goals by simply using the subset = argument – or perhaps with() coupled with subset() – to enact a computation on a specific set of rows. If this still does not suit your needs, then maybe you really should use subset() as shown above and carry on.
If NO, use data aggregation techniques or conditioning in lattice or facetting in ggplot2 plots – dont subset the data.frame. Or, to be totally clear, only subset the data.frame as a temporary measure as you develop your elegant code for computing on or visualizing these sub-data.frames.
There are two main options for data aggregation:
- built-in functions, often referred to as the apply family of functions
- the plyr add-on package
This tutorial is on plyr
library(plyr)
The plyr functions will not make much sense viewed individually, e.g. simply reading the help for ddply() is not the fast track to competence. There is a very important over-arching logic for the package and it is well worth reading the article The split-apply-combine strategy for data analysis, Hadley Wickham, Journal of Statistical Software, vol. 40, no. 1, pp. 129, 2011. Though it is no substitute for reading the above, here is the most critical information:
split-apply-combine: A common analytical pattern is to split data into logical bits, apply some function to each bit, and stick the results back together again. Recognize when you're solving such a problem and exploit the right tools.
The computations on these little bits must be truly independent, i.e. the problem must be embarrassingly or pleasingly parallel, in order to use plyr.
The heart of plyr is a set a functions with names like this: XYply where X specifies what sort of input you're giving and Y specifies the sort of output you want.
* a = array, where matrices and vectors are important special cases
* d = data.frame
* l = list
* _ = no output; only valid for Y, obviously; useful when you're operating on a list purely for the side effects, e.g., making a plot or sending output to screen/file
The usage is very similar across these functions. Here are the main arguments: * .data is the first argument = the input * the next argument specifies how to split up the input into bits; it is does not exist when the input is a list, because the pieces are obviously the list components * then comes the function and further arguments needed to describe the computation to be applied to the bits
Today we will emphasize ddply() which accepts a data.frame, splits it into pieces based on one or more factors, computes on the pieces, then returns the results as a data.frame. For the record, the built-in functions most relevant to ddply() are tapply() and friends.
Let's say we want to get the maximum life expectancy for each continent.
(maxLeByCont <- ddply(gDat, ~continent, summarize, maxLifeExp = max(lifeExp)))
## continent maxLifeExp
## 1 Africa 76.44
## 2 Americas 80.65
## 3 Asia 82.60
## 4 Europe 81.76
## 5 Oceania 81.23
str(maxLeByCont)
## 'data.frame': 5 obs. of 2 variables:
## $ continent : Factor w/ 5 levels "Africa","Americas",..: 1 2 3 4 5
## $ maxLifeExp: num 76.4 80.7 82.6 81.8 81.2
levels(maxLeByCont$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
summarize() or its synonym summarise() is a plyr function that creates a new data.frame from an old one. It is related to the built-in function transform() that transforms variables in a data.frame or adds new ones. Feel free to play with it a bit in some top-level commands; you will use it alot inside plyr calls.
The two variables in maxLeByCont come from two sources. The continent factor is provided by ddply() and represents the labelling of the life expectancies with their associated continent. This is the book-keeping associated with dividing the input into little bits, computing on them, and gluing the results together again in an orderly, labelled fashion. We can take more credit for the other variable maxLifeExp, which has a name we chose (“maxLifeExp”) and arises from applying a function we specified (max()) to a variable of our choice (lifeExp).
compute the minimum GDP per capita by continent.
(minGDPbyCont <- ddply(gDat, ~continent, summarize, minGDP = min(gdpPercap)))
## continent minGDP
## 1 Africa 241.2
## 2 Americas 1201.6
## 3 Asia 331.0
## 4 Europe 973.5
## 5 Oceania 10039.6
The function you want to apply to the continent-specific data.frames can be built-in, like max() above, or a custom function you've written. This custom function can be written in advance or specified 'on the fly'. Here's how I would count the number of countries in this dataset for each continent.
ddply(gDat, ~continent, summarize, nUniqueCount = length(unique(country)))
## continent nUniqueCount
## 1 Africa 52
## 2 Americas 25
## 3 Asia 33
## 4 Europe 30
## 5 Oceania 2
Here is another way to do the same thing that doesn't use summarize() at all:
ddply(gDat, ~continent, function(x) return(c(nUniqueCount = length(unique(x$country)))))
## continent nUniqueCount
## 1 Africa 52
## 2 Americas 25
## 3 Asia 33
## 4 Europe 30
## 5 Oceania 2
ddply(gDat, ~continent, summarize, maxLifeExp = max(lifeExp), minLifeExp = min(lifeExp),
medGDPperCap = median(gdpPercap))
## continent maxLifeExp minLifeExp medGDPperCap
## 1 Africa 76.44 23.60 1192
## 2 Americas 80.65 37.58 5466
## 3 Asia 82.60 28.80 2647
## 4 Europe 81.76 43.59 12082
## 5 Oceania 81.23 69.12 17983
Now I want to do something more complicated. I want to fit a linear regression for each country, modelling life expectancy as a function of the year and then retain the estimated intercepts and slopes. I will walk before I run. Therefore, I will create a tiny sub-data.frame to prototype this, before I fold it into a ddply() call. If you're a newbie, watch how complicated tasks are slowly constructed.
jCountry <- "France"
(jDat <- subset(gDat, country == jCountry)) #temporary measure
## country year pop continent lifeExp gdpPercap
## 529 France 1952 42459667 Europe 67.41 7030
## 530 France 1957 44310863 Europe 68.93 8663
## 531 France 1962 47124000 Europe 70.51 10560
## 532 France 1967 49569000 Europe 71.55 13000
## 533 France 1972 51732000 Europe 72.38 16107
## 534 France 1977 53165019 Europe 73.83 18293
## 535 France 1982 54433565 Europe 74.89 20294
## 536 France 1987 55630100 Europe 76.34 22066
## 537 France 1992 57374179 Europe 77.46 24704
## 538 France 1997 58623428 Europe 78.64 25890
## 539 France 2002 59925035 Europe 79.59 28926
## 540 France 2007 61083916 Europe 80.66 30470
library(lattice)
xyplot(lifeExp ~ year, jDat, type = c("p", "r"))
jFit <- lm(lifeExp ~ year, jDat)
summary(jFit)
##
## Call:
## lm(formula = lifeExp ~ year, data = jDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3801 -0.1389 0.0124 0.1429 0.3349
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.98e+02 7.29e+00 -54.6 1.0e-13 ***
## year 2.38e-01 3.68e-03 64.8 1.9e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.22 on 10 degrees of freedom
## Multiple R-squared: 0.998, Adjusted R-squared: 0.997
## F-statistic: 4.2e+03 on 1 and 10 DF, p-value: 1.86e-14
Wow, check out that crazy intercept! Apparently the life expectancy in France around year 0 A.D. was minus 400 years! This a great opportunity for some sanity checking of a model fit and thinking about how to reparametrize the model to make the parameters have natural interpretation. I think it makes more sense for the intercept to correspond to life expectancy in 1952, the earliest date in our dataset. Let's try that again.
(yearMin <- min(gDat$year))
## [1] 1952
jFit <- lm(lifeExp ~ I(year - yearMin), jDat)
summary(jFit)
##
## Call:
## lm(formula = lifeExp ~ I(year - yearMin), data = jDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3801 -0.1389 0.0124 0.1429 0.3349
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.79013 0.11949 567.3 < 2e-16 ***
## I(year - yearMin) 0.23850 0.00368 64.8 1.9e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.22 on 10 degrees of freedom
## Multiple R-squared: 0.998, Adjusted R-squared: 0.997
## F-statistic: 4.2e+03 on 1 and 10 DF, p-value: 1.86e-14
An intercept around 68 years makes much more common sense and is also supported by our plot.
class(jFit)
## [1] "lm"
mode(jFit)
## [1] "list"
str(jFit)
## List of 12
## $ coefficients : Named num [1:2] 67.79 0.239
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "I(year - yearMin)"
## $ residuals : Named num [1:12] -0.3801 -0.0526 0.3349 0.1824 -0.1802 ...
## ..- attr(*, "names")= chr [1:12] "529" "530" "531" "532" ...
## $ effects : Named num [1:12] -257.5522 14.2603 0.4152 0.2648 -0.0956 ...
## ..- attr(*, "names")= chr [1:12] "(Intercept)" "I(year - yearMin)" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:12] 67.8 69 70.2 71.4 72.6 ...
## ..- attr(*, "names")= chr [1:12] "529" "530" "531" "532" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:12, 1:2] -3.464 0.289 0.289 0.289 0.289 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:12] "529" "530" "531" "532" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "I(year - yearMin)"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.29 1.27
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 10
## $ xlevels : Named list()
## $ call : language lm(formula = lifeExp ~ I(year - yearMin), data = jDat)
## $ terms :Classes 'terms', 'formula' length 3 lifeExp ~ I(year - yearMin)
## .. ..- attr(*, "variables")= language list(lifeExp, I(year - yearMin))
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "lifeExp" "I(year - yearMin)"
## .. .. .. ..$ : chr "I(year - yearMin)"
## .. ..- attr(*, "term.labels")= chr "I(year - yearMin)"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(lifeExp, I(year - yearMin))
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "I(year - yearMin)"
## $ model :'data.frame': 12 obs. of 2 variables:
## ..$ lifeExp : num [1:12] 67.4 68.9 70.5 71.5 72.4 ...
## ..$ I(year - yearMin):Class 'AsIs' int [1:12] 0 5 10 15 20 25 30 35 40 45 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 lifeExp ~ I(year - yearMin)
## .. .. ..- attr(*, "variables")= language list(lifeExp, I(year - yearMin))
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "lifeExp" "I(year - yearMin)"
## .. .. .. .. ..$ : chr "I(year - yearMin)"
## .. .. ..- attr(*, "term.labels")= chr "I(year - yearMin)"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(lifeExp, I(year - yearMin))
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "I(year - yearMin)"
## - attr(*, "class")= chr "lm"
names(jFit)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
jFit$coefficients
## (Intercept) I(year - yearMin)
## 67.7901 0.2385
coef(jFit)
## (Intercept) I(year - yearMin)
## 67.7901 0.2385
jFun <- function(x) coef(lm(lifeExp ~ I(year - yearMin), x))
jFun(jDat)
## (Intercept) I(year - yearMin)
## 67.7901 0.2385
jFun <- function(x) {
estCoef <- coef(lm(lifeExp ~ I(year - yearMin), x))
names(estCoef) <- c("Intercept", "slope")
return(estCoef)
}
jFun(jDat)
## Intercept slope
## 67.7901 0.2385
It's always a good idea to try out a function on a few small examples.
jFun(subset(gDat, country == "Poland"))
## Intercept slope
## 64.7809 0.1962
jFun(subset(gDat, country == "Canada"))
## Intercept slope
## 68.8838 0.2189
jFun(subset(gDat, country == "Slovak Republic"))
## Intercept slope
## 67.010 0.134
jFun(subset(gDat, country == "India"))
## Intercept slope
## 39.2698 0.5053
jFun(subset(gDat, country == "China"))
## Intercept slope
## 47.1905 0.5307
library(plyr)
jCoef <- ddply(gDat, ~country, jFun)
str(jCoef)
## 'data.frame': 142 obs. of 3 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Intercept: num 29.9 59.2 43.4 32.1 62.7 ...
## $ slope : num 0.275 0.335 0.569 0.209 0.232 ...
tail(jCoef)
## country Intercept slope
## 137 Venezuela 57.51 0.32972
## 138 Vietnam 39.01 0.67162
## 139 West Bank and Gaza 43.80 0.60110
## 140 Yemen, Rep. 30.13 0.60546
## 141 Zambia 47.66 -0.06043
## 142 Zimbabwe 55.22 -0.09302
This is in the script lmByCountry.r
the jFun above only requires one argument, x. What if it had more than one argument?
For example, let's imagine that the shift for the year covariate is an argument instead of a previously-assigned variable yearMin. Here's how it would work.
jFunTwoArg <- function(x, cvShift = 0) {
estCoef <- coef(lm(lifeExp ~ I(year - cvShift), x))
names(estCoef) <- c("Intercept", "slope")
return(estCoef)
}
Since I've assigned cvShift = a default value of zero, we can get coefficients where the intercept corresponds to the year A.D. 0 with this simple call:
sillyCoef <- ddply(gDat, ~country, jFunTwoArg)
head(sillyCoef)
## country Intercept slope
## 1 Afghanistan -507.5 0.2753
## 2 Albania -594.1 0.3347
## 3 Algeria -1067.9 0.5693
## 4 Angola -376.5 0.2093
## 5 Argentina -389.6 0.2317
## 6 Australia -376.1 0.2277
We are getting the same estimated slopes but the silly year 0 intercepts we've seen before. Let's use the cvShift = argument to resolve this.
saneCoef <- ddply(gDat, ~country, jFunTwoArg, cvShift = 1952)
head(saneCoef)
## country Intercept slope
## 1 Afghanistan 29.91 0.2753
## 2 Albania 59.23 0.3347
## 3 Algeria 43.37 0.5693
## 4 Angola 32.13 0.2093
## 5 Argentina 62.69 0.2317
## 6 Australia 68.40 0.2277
We're back to our usual estimated intercepts, which reflect life expectancy in 1952. Of course hard-wiring 1952 is not a great idea, so here's probably our best code yet:
bestCoef <- ddply(gDat, ~country, jFunTwoArg, cvShift = min(gDat$year))
head(bestCoef)
## country Intercept slope
## 1 Afghanistan 29.91 0.2753
## 2 Albania 59.23 0.3347
## 3 Algeria 43.37 0.5693
## 4 Angola 32.13 0.2093
## 5 Argentina 62.69 0.2317
## 6 Australia 68.40 0.2277
Markdown doesn't have a proper table syntax, because it is a ruthlessly simple language. To make a pretty table you need to install additional package(s). There are many but here we will use the xtable package. We will make an HTML table. This HTML code will survive unharmed as your R Markdown document is converted from R Markdown to Markdown and finally to HTML.
library(xtable)
Let's pick some countries at random and display their estimated coefficients. FYI: the following R chunk has an option results='asis' and that is important to the correct display of the table.
set.seed(916)
foo <- jCoef[sample(nrow(jCoef), size = 15), ]
foo <- xtable(foo)
print(foo, type = "html", include.rownames = F)
country | Intercept | slope |
---|---|---|
Lebanon | 58.69 | 0.26 |
Senegal | 36.75 | 0.50 |
Dominican Republic | 48.60 | 0.47 |
Oman | 37.21 | 0.77 |
Germany | 67.57 | 0.21 |
Korea, Dem. Rep. | 54.91 | 0.32 |
Mauritius | 55.37 | 0.35 |
Slovak Republic | 67.01 | 0.13 |
Comoros | 40.00 | 0.45 |
Argentina | 62.69 | 0.23 |
Central African Republic | 38.81 | 0.18 |
Ecuador | 49.07 | 0.50 |
West Bank and Gaza | 43.80 | 0.60 |
Egypt | 40.97 | 0.56 |
Myanmar | 41.41 | 0.43 |
Two easy improvments to make this table more useful are
* include the continent information
* sort it rationally
Here is what we used:
jCoef <- ddply(gDat, ~country, jFun)
This divides gDat into country-specific pieces. But we can supply two factors in the second argument: **country and continent. In theory, sub-data.frames will be made for all possible combinations of the levels of country and continent. Many of those will have zero rows because there is, for example, no Belgium in Asia. By default, plyr functions drop these empty combinations. But the labelling work done by ddply() will still help us, as we will get both country and continent as factors in our result. This is easier to see than explain!
jCoef <- ddply(gDat, ~country + continent, jFun)
str(jCoef)
## 'data.frame': 142 obs. of 4 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
## $ Intercept: num 29.9 59.2 43.4 32.1 62.7 ...
## $ slope : num 0.275 0.335 0.569 0.209 0.232 ...
tail(jCoef)
## country continent Intercept slope
## 137 Venezuela Americas 57.51 0.32972
## 138 Vietnam Asia 39.01 0.67162
## 139 West Bank and Gaza Asia 43.80 0.60110
## 140 Yemen, Rep. Asia 30.13 0.60546
## 141 Zambia Africa 47.66 -0.06043
## 142 Zimbabwe Africa 55.22 -0.09302
set.seed(916)
foo <- jCoef[sample(nrow(jCoef), size = 15), ]
foo <- arrange(foo, Intercept)
# an uglier way to order (w/o plyr) foo <- foo[order(foo$Intercept),]
foo <- xtable(foo)
print(foo, type = "html", include.rownames = F)
country | continent | Intercept | slope |
---|---|---|---|
Senegal | Africa | 36.75 | 0.50 |
Oman | Asia | 37.21 | 0.77 |
Central African Republic | Africa | 38.81 | 0.18 |
Comoros | Africa | 40.00 | 0.45 |
Egypt | Africa | 40.97 | 0.56 |
Myanmar | Asia | 41.41 | 0.43 |
West Bank and Gaza | Asia | 43.80 | 0.60 |
Dominican Republic | Americas | 48.60 | 0.47 |
Ecuador | Americas | 49.07 | 0.50 |
Korea, Dem. Rep. | Asia | 54.91 | 0.32 |
Mauritius | Africa | 55.37 | 0.35 |
Lebanon | Asia | 58.69 | 0.26 |
Argentina | Americas | 62.69 | 0.23 |
Slovak Republic | Europe | 67.01 | 0.13 |
Germany | Europe | 67.57 | 0.21 |
plyr is a powerful package for data aggregation.
ddply() is the most important function: data.frames are great, therefore ddply() is great because it takes data.frame as input and returns data.frame as output.
Simple functions, built-in or user-defined, can be provided directly in a call to ddply().
It's better to handle more complicated data aggregation differently. Build it up slowly and revisit earlier steps for improvement. Here is a gentle workflow:
* Create an indicative sub-data.frame
* Compute on it directly with top-level commands until you achieve your goal
* Package those commands inside a function
* Test the function with other sub-data.frames
* Refine your function, e.g. give the return values better names
* Construct a ddply() call using your function
* Identify more weakness in your function, refine, call ddply() again
Results you present to the world generally need polishing to make the best impression and have the best impact. This too will be an iterative process. Here are some good things to think about:
* Is there an obvious piece of ancillary information that I should include? Example: continent above. It was not needed for the computation but it is helpful to retain/restore it for the final table.
* Is there some logical ordering for the data? Alphabetical is ususally the default and it is completely arbitrary and often confusing.
* Have I presented the data in the most aesthetically pleasing way I currently know how? - xtable package