Some `plyr` examples

This is part of an occasional series showing how plyr can make life easier, and why it is worth the effort to learn.

Hadley wrote a great reference to the plyr package, “The Split-Apply-Combine Strategy for Data Analysis”. The goal of this series is to show some examples.

Installation and loading

Like most of Hadley's popular packages, plyr is available from CRAN. If you are using RStudio, just use the “Install Packages” button on the “Packages” tab. If you are not using RStudio, why aren't you using RStudio?

Like all packages, plyr can be loaded from the “Packages” tab, or by using an R command.

library(plyr)

`knitr` bookkeeping

The knitr package by Yihui Xie is used to make the html file for this work. It is a way to combine documentation with “live” R code, to help keep “everything together”. The buzzphrase here is “reproducible research”.

library(knitr)

options(width = 120)
opts_chunk$set(comment = NA, tidy = FALSE, fig.align = "center", fig.width = 10,
    fig.height = 6, dev = "png")

Examples

Writing csv

Let's say you have a data-frame that describes interval data from electrical meters at a variety of sites.

First, let's generate the data:

library(lubridate)
library(plyr)

Attaching package: 'plyr'

The following object is masked from 'package:lubridate':

here

library(stringr)

# define sites
site <- c("house", "shed", "pool")

# define the time intervals
interval_duration <- 900 # seconds (15-minute data)
interval_start <- 
  ymd("2001-01-01", tz="America/Chicago") + seq(0, 3)*interval_duration

# assemble data-frame
elec_usage <- expand.grid(site = site, interval_start = interval_start)
elec_usage <- 
  mutate(
    elec_usage, 
    interval_duration = rep(interval_duration, length(interval_start))
  )

elec_usage

    site      interval_start interval_duration
1  house 2001-01-01 00:00:00               900
2   shed 2001-01-01 00:00:00               900
3   pool 2001-01-01 00:00:00               900
4  house 2001-01-01 00:15:00               900
5   shed 2001-01-01 00:15:00               900
6   pool 2001-01-01 00:15:00               900
7  house 2001-01-01 00:30:00               900
8   shed 2001-01-01 00:30:00               900
9   pool 2001-01-01 00:30:00               900
10 house 2001-01-01 00:45:00               900
11  shed 2001-01-01 00:45:00               900
12  pool 2001-01-01 00:45:00               900

Some notes on the above:

- The packages lubridate and plyr both have a function called here(); R is telling us it will go first to the plyr version.
- lubridate is another useful Hadley package, particularly for those who are obsessed and frustrated with timezones. That's a different occasional series…
- mutate() is the plyr version of transform().

We need to fake some electrical-usage data.

Here, consumption refers to the amount of real electrical energy (kWh) consumed by the site, for the interval starting at interval_start and lasting interval_duration seconds.

elec_usage <- 
  mutate(
    elec_usage, 
    consumption = rnorm(n = length(interval_start), mean = 10, sd = 2)
  )

elec_usage

    site      interval_start interval_duration consumption
1  house 2001-01-01 00:00:00               900      10.144
2   shed 2001-01-01 00:00:00               900       8.875
3   pool 2001-01-01 00:00:00               900       8.100
4  house 2001-01-01 00:15:00               900       9.552
5   shed 2001-01-01 00:15:00               900      11.644
6   pool 2001-01-01 00:15:00               900       7.786
7  house 2001-01-01 00:30:00               900      14.275
8   shed 2001-01-01 00:30:00               900       7.657
9   pool 2001-01-01 00:30:00               900       9.701
10 house 2001-01-01 00:45:00               900      10.035
11  shed 2001-01-01 00:45:00               900      10.572
12  pool 2001-01-01 00:45:00               900      11.676
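One practical difference between mutate() and transform(), used twice above, is that mutate() evaluates its arguments in order, so a later column can refer to one created earlier in the same call. Here is a minimal sketch on a made-up toy data-frame (the names df, y and z are illustrative only, not part of the electrical data):

```r
library(plyr)

df <- data.frame(x = 1:3)

# mutate() evaluates its arguments one after another,
# so y already exists when z is computed
mutated <- mutate(df, y = x * 2, z = y + 1)

# transform() evaluates every argument against the original columns,
# so z cannot see y; this fails unless an object named y happens
# to exist in the calling environment
transformed <- tryCatch(
  transform(df, y = x * 2, z = y + 1),
  error = function(e) NULL
)

mutated
```

This is why the two mutate() calls above could, in principle, be folded into one.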

Let's say we want to write this electrical data to a series of files, one file per site. Here's where d_ply() comes in:

d_ply(
  .data = elec_usage,
  .variables = .(site),
  .fun = function(df) {

    # determine name of the file
    # (str_c() replaces the deprecated str_join())
    file_name <- str_c(as.character(df$site[1]), ".csv")

    # write out csv file
    write.csv(df, file=file_name, quote=FALSE, row.names=FALSE)

  }
)

Breaking this down:

- The d_ply() function is used because our input is a data-frame, elec_usage, and we are not expecting any output; writing the files is a side effect.
- The .data argument is our input data-frame, elec_usage.
- The .variables argument says that we want to split the input data-frame by the variable site.
- The .fun argument is the function that we wish to apply to each piece of the split data-frame. This function must take a data-frame as its first argument.

Each time the function is called:
- the column site has only one unique value, thanks to the split.
- we make a file name according to this value.
- we write out our csv file.

In essence, we have an implied loop over the variable site. This is great for simplifying your R program, but potentially not-so-great for the C# programmer who has to adapt your R prototype code.
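To make the implied loop concrete, here is a rough base-R equivalent, written as a sketch rather than as what d_ply() does internally. It uses a small made-up data-frame (usage) and writes to a temporary directory (out_dir) so that it does not touch the files created above; both names are assumptions for illustration.

```r
# toy data standing in for elec_usage (values are made up)
usage <- data.frame(
  site = rep(c("house", "shed"), each = 2),
  consumption = c(10.1, 9.6, 8.9, 11.6)
)

# hypothetical output location, chosen so nothing real is overwritten
out_dir <- tempdir()

# "split": one data-frame per site
pieces <- split(usage, usage$site)

# "apply": loop over the pieces explicitly, writing one file each
for (site_name in names(pieces)) {
  file_name <- file.path(out_dir, paste0(site_name, ".csv"))
  write.csv(pieces[[site_name]], file = file_name, quote = FALSE, row.names = FALSE)
}
```

The explicit version needs a loop variable and a lookup by name; d_ply() hides both.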

Reading in csv files

This time, we will use a different plyr function, ldply(); we will start with a (coerced) list and end with a data-frame.

We need to start with a vector that will be coerced into a list. Let's find all the csv files in our directory. Keep an eye out for the regular expression!

csv_files <- list.files(path=".", pattern="\\.csv$")

csv_files

[1] "house.csv" "pool.csv"  "shed.csv"
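The pattern argument of list.files() is a regular expression, not a shell-style wildcard, which is why the dot is escaped and the end of the string is anchored. A small sketch, using made-up file names, of how the pattern above differs from a naive one:

```r
# made-up file names, for illustration only
candidates <- c("house.csv", "house.csv.bak", "notcsv", "pool.CSV")

# escaped dot and end-of-string anchor: only names ending in ".csv" match
strict <- grepl("\\.csv$", candidates)

# unescaped and unanchored: "." matches any character, anywhere in the name
loose <- grepl(".csv", candidates)

data.frame(candidates, strict, loose)
```

Matching is case-sensitive by default; list.files() has an ignore.case argument if files such as pool.CSV should be picked up as well.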

Using our string vector, let's read in the files and assemble them into a data-frame. We use ldply() because it is easier to treat the vector as a coerced list than as a one-dimensional array, which would call for adply().

elec_usage_new <- 
  ldply(
    .data = csv_files,
    .fun = function(x){
      # read the csv file into a data-frame
      read.csv(file=x, header=TRUE)      
    }
  )

elec_usage_new

    site      interval_start interval_duration consumption
1  house 2001-01-01 00:00:00               900      10.144
2  house 2001-01-01 00:15:00               900       9.552
3  house 2001-01-01 00:30:00               900      14.275
4  house 2001-01-01 00:45:00               900      10.035
5   pool 2001-01-01 00:00:00               900       8.100
6   pool 2001-01-01 00:15:00               900       7.786
7   pool 2001-01-01 00:30:00               900       9.701
8   pool 2001-01-01 00:45:00               900      11.676
9   shed 2001-01-01 00:00:00               900       8.875
10  shed 2001-01-01 00:15:00               900      11.644
11  shed 2001-01-01 00:30:00               900       7.657
12  shed 2001-01-01 00:45:00               900      10.572

Breaking this down:

- The ldply() function is used because our input is a coerced list, csv_files, and our output will be a data-frame.
- The .data argument is our input list (coerced from a vector), csv_files.
- We are going to split the list by its members.
- The .fun argument is the function that we wish to apply to each member of the list. This function must take the member of the list (in this case, a string giving the file name) as its first argument.

Each time the function is called, the csv file is read into a data-frame, and that data-frame is returned. Once all the function calls have been made, the individual data-frames are combined into a single data-frame (using rbind in the background).

In essence, we have an implied loop over the members of the list. Again, this is great for simplifying your R program, but potentially not-so-great for the C# programmer who has to adapt your R prototype code.
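For comparison, here is roughly the same apply-then-combine written out in base R. This is a sketch under stated assumptions, not a description of plyr's internals: the directory in_dir and the two small files in it are made up so that the example is self-contained.

```r
# hypothetical scratch directory with two small, made-up csv files
in_dir <- file.path(tempdir(), "ldply_sketch")
dir.create(in_dir, showWarnings = FALSE)
write.csv(
  data.frame(site = "house", consumption = c(10.1, 9.6)),
  file = file.path(in_dir, "house.csv"), row.names = FALSE
)
write.csv(
  data.frame(site = "shed", consumption = c(8.9, 11.6)),
  file = file.path(in_dir, "shed.csv"), row.names = FALSE
)

csv_paths <- list.files(path = in_dir, pattern = "\\.csv$", full.names = TRUE)

# "apply": read each file into its own data-frame
pieces <- lapply(csv_paths, read.csv)

# "combine": stack the data-frames row-wise
combined <- do.call(rbind, pieces)

combined
```

ldply() folds the last two steps into a single call.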

Note about time

Let's look at the structure of our newly-imported data-frame, and compare it with our existing data-frame.

str(elec_usage)

'data.frame':   12 obs. of  4 variables:
 $ site             : Factor w/ 3 levels "house","shed",..: 1 2 3 1 2 3 1 2 3 1 ...
 $ interval_start   : POSIXct, format: "2001-01-01 00:00:00" "2001-01-01 00:00:00" "2001-01-01 00:00:00" "2001-01-01 00:15:00" ...
 $ interval_duration: num  900 900 900 900 900 900 900 900 900 900 ...
 $ consumption      : num  10.14 8.87 8.1 9.55 11.64 ...
 - attr(*, "out.attrs")=List of 2
  ..$ dim     : Named int  3 4
  .. ..- attr(*, "names")= chr  "site" "interval_start"
  ..$ dimnames:List of 2
  .. ..$ site          : chr  "site=house" "site=shed" "site=pool"
  .. ..$ interval_start: chr  "interval_start=2001-01-01 00:00:00" "interval_start=2001-01-01 00:15:00" "interval_start=2001-01-01 00:30:00" "interval_start=2001-01-01 00:45:00"


str(elec_usage_new)

'data.frame':   12 obs. of  4 variables:
 $ site             : Factor w/ 3 levels "house","pool",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ interval_start   : Factor w/ 4 levels "2001-01-01 00:00:00",..: 1 2 3 4 1 2 3 4 1 2 ...
 $ interval_duration: int  900 900 900 900 900 900 900 900 900 900 ...
 $ consumption      : num  10.14 9.55 14.28 10.03 8.1 ...

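Aside from the attributes, we note that the variable interval_start is a POSIXct date-time in the original data-frame, but a factor (built from character strings) in the imported data-frame. (In R 4.0.0 and later, where stringsAsFactors defaults to FALSE, read.csv() returns a plain character vector instead.)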

To correct this we can use the mutate() function from plyr to operate on the data-frame, and the ymd_hms() function from lubridate to parse the string into a POSIXct date-time.

elec_usage_new <-
  mutate(
    elec_usage_new,
    interval_start = ymd_hms(as.character(interval_start), tz="America/Chicago")
  )

str(elec_usage_new)

'data.frame':   12 obs. of  4 variables:
 $ site             : Factor w/ 3 levels "house","pool",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ interval_start   : POSIXct, format: "2001-01-01 00:00:00" "2001-01-01 00:15:00" "2001-01-01 00:30:00" "2001-01-01 00:45:00" ...
 $ interval_duration: int  900 900 900 900 900 900 900 900 900 900 ...
 $ consumption      : num  10.14 9.55 14.28 10.03 8.1 ...
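The tz argument matters here because the text stored in the csv file carries no time-zone information. A minimal sketch, assuming lubridate is installed, of how the same string is parsed into two different instants depending on tz:

```r
library(lubridate)

stamp <- "2001-01-01 00:15:00"

# parsed as local time in Chicago (UTC-6 in January)
chicago <- ymd_hms(stamp, tz = "America/Chicago")

# ymd_hms() defaults to tz = "UTC"
utc <- ymd_hms(stamp)

# the two results are different instants, six hours apart
offset_hours <- as.numeric(difftime(chicago, utc, units = "hours"))
offset_hours
```

Leaving out tz when re-importing would silently shift every interval by that offset.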

Other ways to call `ldply()`

The way we originally called this was a bit verbose, but for a purpose. Here are some simpler ways to make exactly the same call, shown here to illustrate more about the **ply() family of functions:

Following the .fun argument, you can supply further arguments, which are passed on to that function after its first argument. Here's an example; the argument header = TRUE is passed on to read.csv(), after the member of csv_files.

elec_usage_new <- 
  ldply(
    .data = csv_files,
    .fun = read.csv,
    header = TRUE
  )

elec_usage_new

    site      interval_start interval_duration consumption
1  house 2001-01-01 00:00:00               900      10.144
2  house 2001-01-01 00:15:00               900       9.552
3  house 2001-01-01 00:30:00               900      14.275
4  house 2001-01-01 00:45:00               900      10.035
5   pool 2001-01-01 00:00:00               900       8.100
6   pool 2001-01-01 00:15:00               900       7.786
7   pool 2001-01-01 00:30:00               900       9.701
8   pool 2001-01-01 00:45:00               900      11.676
9   shed 2001-01-01 00:00:00               900       8.875
10  shed 2001-01-01 00:15:00               900      11.644
11  shed 2001-01-01 00:30:00               900       7.657
12  shed 2001-01-01 00:45:00               900      10.572

Of course, header = TRUE is a default for read.csv(), so we need not include it.

elec_usage_new <- 
  ldply(
    .data = csv_files,
    .fun = read.csv
  )

elec_usage_new

    site      interval_start interval_duration consumption
1  house 2001-01-01 00:00:00               900      10.144
2  house 2001-01-01 00:15:00               900       9.552
3  house 2001-01-01 00:30:00               900      14.275
4  house 2001-01-01 00:45:00               900      10.035
5   pool 2001-01-01 00:00:00               900       8.100
6   pool 2001-01-01 00:15:00               900       7.786
7   pool 2001-01-01 00:30:00               900       9.701
8   pool 2001-01-01 00:45:00               900      11.676
9   shed 2001-01-01 00:00:00               900       8.875
10  shed 2001-01-01 00:15:00               900      11.644
11  shed 2001-01-01 00:30:00               900       7.657
12  shed 2001-01-01 00:45:00               900      10.572

Finally, we can also leave out the argument names (although I will often leave them in for clarity).

elec_usage_new <- ldply(csv_files, read.csv)

elec_usage_new

    site      interval_start interval_duration consumption
1  house 2001-01-01 00:00:00               900      10.144
2  house 2001-01-01 00:15:00               900       9.552
3  house 2001-01-01 00:30:00               900      14.275
4  house 2001-01-01 00:45:00               900      10.035
5   pool 2001-01-01 00:00:00               900       8.100
6   pool 2001-01-01 00:15:00               900       7.786
7   pool 2001-01-01 00:30:00               900       9.701
8   pool 2001-01-01 00:45:00               900      11.676
9   shed 2001-01-01 00:00:00               900       8.875
10  shed 2001-01-01 00:15:00               900      11.644
11  shed 2001-01-01 00:30:00               900       7.657
12  shed 2001-01-01 00:45:00               900      10.572
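The pass-through of extra arguments is most useful for settings that are not defaults. As a sketch, the call below passes stringsAsFactors = FALSE on to read.csv(), which keeps text columns as character vectors on R versions before 4.0.0 (on later versions this is already the default). The directory in_dir and the files in it are made up for the example.

```r
library(plyr)

# hypothetical scratch directory with two small, made-up csv files
in_dir <- file.path(tempdir(), "ldply_args_sketch")
dir.create(in_dir, showWarnings = FALSE)
write.csv(
  data.frame(site = "house", interval_start = "2001-01-01 00:00:00", consumption = 10.1),
  file = file.path(in_dir, "house.csv"), row.names = FALSE
)
write.csv(
  data.frame(site = "shed", interval_start = "2001-01-01 00:00:00", consumption = 8.9),
  file = file.path(in_dir, "shed.csv"), row.names = FALSE
)

csv_paths <- list.files(path = in_dir, pattern = "\\.csv$", full.names = TRUE)

# everything after .fun is handed on to read.csv() for each file
usage <- ldply(.data = csv_paths, .fun = read.csv, stringsAsFactors = FALSE)

str(usage)
```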

Parting thoughts

These are but a few ways to use the plyr package to write cleaner, more understandable (for the initiated), less bug-prone code.
