Correlations of Multiple Variables

Jeffrey D. Walker
2012-12-13


Revisions

2012-12-16
added correlation p-value to final table using cor.test()
changed my_cor function so it can be used in either method (split/lapply, ddply)
added sub-sections to Apply Step added visualization section at end


Question

I've got a data set with 10 years of monthly mean measurements for a series of stations for various parameters. I'd like to run Spearman's Rho on many subsets of data and get the estimate for rho as well as the associated p-value. Is there an easy way to run the test multiple times with out specifying each subset for each run? In other words one run would be station A for June temp vs. DO; second run would be same but for July; then Station B etc..

Approach

This analysis is a classic example of the split-apply-combine strategy. The steps are

  1. split the dataset by the unique combinations of three factors (e.g. Site, Parameter, Month)
  2. apply the cor() function to each subset of the data
  3. combine the resulting correlation coefficient values in a single dataframe

Like all things in R, there are multiple ways of doing things. The first method uses only functions found in the R core library (primarily split and lappy). The second method uses the plyr package (and the ddply function).


Example Data

For this tutorial, we create a dummy dataset that contains the variables: Site, Parameter, Month, Year, Value. There are three sites: A, B, and C and three parameters: X, Y, Z. We can create a dataset that includes the unique combinations of these four variables using the expand.grid() function. We assign random values from a standard normal distribution to the Value column using rnorm().

df <- expand.grid(Site = c("A", "B", "C"), Parameter = c("X", "Y", "Z"), Month = c(1:12), 
    Year = c(2000:2009))
set.seed(123)  # set the random number generator seed so results are reproducible
df$Value <- rnorm(nrow(df))
head(df)
##   Site Parameter Month Year    Value
## 1    A         X     1 2000 -0.56048
## 2    B         X     1 2000 -0.23018
## 3    C         X     1 2000  1.55871
## 4    A         Y     1 2000  0.07051
## 5    B         Y     1 2000  0.12929
## 6    C         Y     1 2000  1.71506
summary(df)
##  Site    Parameter     Month            Year          Value       
##  A:360   X:360     Min.   : 1.00   Min.   :2000   Min.   :-2.810  
##  B:360   Y:360     1st Qu.: 3.75   1st Qu.:2002   1st Qu.:-0.625  
##  C:360   Z:360     Median : 6.50   Median :2004   Median : 0.021  
##                    Mean   : 6.50   Mean   :2004   Mean   : 0.025  
##                    3rd Qu.: 9.25   3rd Qu.:2007   3rd Qu.: 0.665  
##                    Max.   :12.00   Max.   :2009   Max.   : 3.241

The Split Step

There are multiple ways of splitting datasets into multiple subsets. The most fundamental way is to use the function split(x, f, ...) which takes an object x such as a dataframe, and splits it based on the levels of a factor vector f. If we want to split the dataset according the Site variable (which is a factor with levels corresponding to the unique Site codes), we would call:

df.split.site <- split(df, df$Site)
str(df.split.site, max.level = 2)
## List of 3
##  $ A:'data.frame':   360 obs. of  5 variables:
##   ..$ Site     : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ Parameter: Factor w/ 3 levels "X","Y","Z": 1 2 3 1 2 3 1 2 3 1 ...
##   ..$ Month    : int [1:360] 1 1 1 2 2 2 3 3 3 4 ...
##   ..$ Year     : int [1:360] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##   ..$ Value    : num [1:360] -0.5605 0.0705 0.4609 -0.4457 0.4008 ...
##   ..- attr(*, "out.attrs")=List of 2
##  $ B:'data.frame':   360 obs. of  5 variables:
##   ..$ Site     : Factor w/ 3 levels "A","B","C": 2 2 2 2 2 2 2 2 2 2 ...
##   ..$ Parameter: Factor w/ 3 levels "X","Y","Z": 1 2 3 1 2 3 1 2 3 1 ...
##   ..$ Month    : int [1:360] 1 1 1 2 2 2 3 3 3 4 ...
##   ..$ Year     : int [1:360] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##   ..$ Value    : num [1:360] -0.23 0.129 -1.265 1.224 0.111 ...
##   ..- attr(*, "out.attrs")=List of 2
##  $ C:'data.frame':   360 obs. of  5 variables:
##   ..$ Site     : Factor w/ 3 levels "A","B","C": 3 3 3 3 3 3 3 3 3 3 ...
##   ..$ Parameter: Factor w/ 3 levels "X","Y","Z": 1 2 3 1 2 3 1 2 3 1 ...
##   ..$ Month    : int [1:360] 1 1 1 2 2 2 3 3 3 4 ...
##   ..$ Year     : int [1:360] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##   ..$ Value    : num [1:360] 1.559 1.715 -0.687 0.36 -0.556 ...
##   ..- attr(*, "out.attrs")=List of 2

Using the str() function to show the structure of the result, we see that it is a list of length 3, with elements corresponding to each level of the Site factor. Each of these three elements is a data frame corresponding to the rows with the respective Site. So for example, the element 'A' of df.split.site is a dataframe with the same columns as the original dataset but only containing rows for site A. Remember to reference the list elements using double brackets [[…]] instead of single brackets […], which are used for vectors or data frames.

head(df.split.site[["A"]])
##    Site Parameter Month Year    Value
## 1     A         X     1 2000 -0.56048
## 4     A         Y     1 2000  0.07051
## 7     A         Z     1 2000  0.46092
## 10    A         X     2 2000 -0.44566
## 13    A         Y     2 2000  0.40077
## 16    A         Z     2 2000  1.78691
summary(df.split.site[["A"]])
##  Site    Parameter     Month            Year          Value        
##  A:360   X:120     Min.   : 1.00   Min.   :2000   Min.   :-2.6017  
##  B:  0   Y:120     1st Qu.: 3.75   1st Qu.:2002   1st Qu.:-0.6258  
##  C:  0   Z:120     Median : 6.50   Median :2004   Median :-0.0161  
##                    Mean   : 6.50   Mean   :2004   Mean   :-0.0057  
##                    3rd Qu.: 9.25   3rd Qu.:2007   3rd Qu.: 0.5834  
##                    Max.   :12.00   Max.   :2009   Max.   : 2.8322

We can also split using multiple factors, such as Site and Month. The element names for the resulting list is a single string constructed by concatenating the site with the month (e.g. for site A and month 1, the element key is 'A.1').

df.split.site_month <- split(df, df[, c("Site", "Month")])
str(df.split.site_month[1:5], max.level = 2)  # only show first 5 elements
## List of 5
##  $ A.1:'data.frame': 30 obs. of  5 variables:
##   ..$ Site     : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ Parameter: Factor w/ 3 levels "X","Y","Z": 1 2 3 1 2 3 1 2 3 1 ...
##   ..$ Month    : int [1:30] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ Year     : int [1:30] 2000 2000 2000 2001 2001 2001 2002 2002 2002 2003 ...
##   ..$ Value    : num [1:30] -0.5605 0.0705 0.4609 -0.3802 0.608 ...
##   ..- attr(*, "out.attrs")=List of 2
##  $ B.1:'data.frame': 30 obs. of  5 variables:
##   ..$ Site     : Factor w/ 3 levels "A","B","C": 2 2 2 2 2 2 2 2 2 2 ...
##   ..$ Parameter: Factor w/ 3 levels "X","Y","Z": 1 2 3 1 2 3 1 2 3 1 ...
##   ..$ Month    : int [1:30] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ Year     : int [1:30] 2000 2000 2000 2001 2001 2001 2002 2002 2002 2003 ...
##   ..$ Value    : num [1:30] -0.23 0.129 -1.265 0.919 -1.618 ...
##   ..- attr(*, "out.attrs")=List of 2
##  $ C.1:'data.frame': 30 obs. of  5 variables:
##   ..$ Site     : Factor w/ 3 levels "A","B","C": 3 3 3 3 3 3 3 3 3 3 ...
##   ..$ Parameter: Factor w/ 3 levels "X","Y","Z": 1 2 3 1 2 3 1 2 3 1 ...
##   ..$ Month    : int [1:30] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ Year     : int [1:30] 2000 2000 2000 2001 2001 2001 2002 2002 2002 2003 ...
##   ..$ Value    : num [1:30] 1.5587 1.7151 -0.6869 -0.5753 -0.0556 ...
##   ..- attr(*, "out.attrs")=List of 2
##  $ A.2:'data.frame': 30 obs. of  5 variables:
##   ..$ Site     : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ Parameter: Factor w/ 3 levels "X","Y","Z": 1 2 3 1 2 3 1 2 3 1 ...
##   ..$ Month    : int [1:30] 2 2 2 2 2 2 2 2 2 2 ...
##   ..$ Year     : int [1:30] 2000 2000 2000 2001 2001 2001 2002 2002 2002 2003 ...
##   ..$ Value    : num [1:30] -0.446 0.401 1.787 -0.641 0.118 ...
##   ..- attr(*, "out.attrs")=List of 2
##  $ B.2:'data.frame': 30 obs. of  5 variables:
##   ..$ Site     : Factor w/ 3 levels "A","B","C": 2 2 2 2 2 2 2 2 2 2 ...
##   ..$ Parameter: Factor w/ 3 levels "X","Y","Z": 1 2 3 1 2 3 1 2 3 1 ...
##   ..$ Month    : int [1:30] 2 2 2 2 2 2 2 2 2 2 ...
##   ..$ Year     : int [1:30] 2000 2000 2000 2001 2001 2001 2002 2002 2002 2003 ...
##   ..$ Value    : num [1:30] 1.224 0.111 0.498 -0.85 -0.947 ...
##   ..- attr(*, "out.attrs")=List of 2

The Apply Step

Now that the dataframe is split according to Site and Month, we want to compute the correlation coefficient between each pair of Parameters (e.g. X vs. Y, X vs. Z, Y vs. Z) using the data for each year. To figure out what the apply function should be, we extract one of the subsets from the split data frame:

sub.A1 <- df.split.site_month[["A.1"]]
head(sub.A1, 20)
##     Site Parameter Month Year    Value
## 1      A         X     1 2000 -0.56048
## 4      A         Y     1 2000  0.07051
## 7      A         Z     1 2000  0.46092
## 109    A         X     1 2001 -0.38023
## 112    A         Y     1 2001  0.60796
## 115    A         Z     1 2001  0.51941
## 217    A         X     1 2002 -0.44116
## 220    A         Y     1 2002 -1.28472
## 223    A         Z     1 2002  1.10985
## 325    A         X     1 2003 -0.52291
## 328    A         Y     1 2003  0.63296
## 331    A         Z     1 2003  1.01756
## 433    A         X     1 2004  0.30004
## 436    A         Y     1 2004 -1.07742
## 439    A         Z     1 2004 -2.22499
## 541    A         X     1 2005 -0.27325
## 544    A         Y     1 2005 -1.19736
## 547    A         Z     1 2005 -1.19862
## 649    A         X     1 2006  0.74081
## 652    A         Y     1 2006  0.10719

Note that the Site and Month variables only have values of 'A' and '1', respectively, but that all combinations of Parameter and Year remain.

Compute Spearman's Rho

To compute Spearman's rho correlation coefficient, we can use the cor(x, y=NULL, method=c('pearson','kendall','spearman')) function. By specifying the argument method='spearman' the function will return the value for Spearman's rho (see the docs by running ?cor in the R command line). However, the cor() function requires either two numeric vectors for arguments x and y or a single matrix/data.frame for x only. Because our current subset data frame is in the so-called long format, we can either create individual vectors for each parameter using the subset() function or convert the data to a wide format so that the values for each Parameter is a column. Using subset, we can compute the pairwise correlation coefficients:

sub.A1.X <- subset(sub.A1, Parameter == "X")$Value
sub.A1.Y <- subset(sub.A1, Parameter == "Y")$Value
sub.A1.Z <- subset(sub.A1, Parameter == "Z")$Value
print(sub.A1.X)
##  [1] -0.5605 -0.3802 -0.4412 -0.5229  0.3000 -0.2732  0.7408  1.0128
##  [9] -0.1633 -1.4987
cor(sub.A1.X, sub.A1.Y, method = "spearman")  # correlation between X and Y for Site A and Month 1
## [1] 0.1273
cor(sub.A1.X, sub.A1.Z, method = "spearman")  # correlation between X and Z for Site A and Month 1
## [1] -0.1394
cor(sub.A1.Y, sub.A1.Z, method = "spearman")  # correlation between Y and Z for Site A and Month 1
## [1] 0.07879

The cor() function also takes a dataframe and computes the correlation between each pair of numeric columns. To use this, we first convert the subsetted dataframe into the long format using the dcast(data, formula) function from the reshape2 package.

library(reshape2)
sub.A1.wide <- dcast(sub.A1, Year ~ Parameter, value.var = "Value")
print(sub.A1.wide)
##    Year       X        Y       Z
## 1  2000 -0.5605  0.07051  0.4609
## 2  2001 -0.3802  0.60796  0.5194
## 3  2002 -0.4412 -1.28472  1.1098
## 4  2003 -0.5229  0.63296  1.0176
## 5  2004  0.3000 -1.07742 -2.2250
## 6  2005 -0.2732 -1.19736 -1.1986
## 7  2006  0.7408  0.10719  0.4806
## 8  2007  1.0128 -0.19589  0.6166
## 9  2008 -0.1633  1.13105 -1.1392
## 10 2009 -1.4987 -1.09849 -0.3227

Now we can pass in the wide format of the sub-dataset (excluding the Year column since we don't want the correlations computed against the year):

sub.A1.cor <- cor(sub.A1.wide[, c("X", "Y", "Z")], method = "spearman")
print(sub.A1.cor)
##         X       Y        Z
## X  1.0000 0.12727 -0.13939
## Y  0.1273 1.00000  0.07879
## Z -0.1394 0.07879  1.00000

The resulting correlation matrix is symmetric since cor(X, Y) = cor(Y, X). We can exclude the duplicated values as well as the diagonal by using the lower.tri(x, diag=TRUE) function to set all values on the diagonal and in the lower triangle to NA. Note that lower.tri() returns a boolean matrix that is the same shape as the data frame, so it is used to index the original data frame:

sub.A1.cor.tri <- lower.tri(sub.A1.cor, diag = TRUE)
print(sub.A1.cor.tri)
##      [,1]  [,2]  [,3]
## [1,] TRUE FALSE FALSE
## [2,] TRUE  TRUE FALSE
## [3,] TRUE  TRUE  TRUE
sub.A1.cor[sub.A1.cor.tri] <- NA
print(sub.A1.cor)
##    X      Y        Z
## X NA 0.1273 -0.13939
## Y NA     NA  0.07879
## Z NA     NA       NA

Now we can use the melt() function (which is like the sibling of dcast()) from the reshape2 package to convert this matrix to long format. The na.rm=TRUE argument will drop all values that are NA.

sub.A1.cor <- melt(sub.A1.cor, na.rm = TRUE)
print(sub.A1.cor)
##   Var1 Var2    value
## 4    X    Y  0.12727
## 7    X    Z -0.13939
## 8    Y    Z  0.07879

Significance Level (p-value)

The cor() function only returns the estimated value of rho, so to get the p-values, we use the cor.test() function, which returns an object of class htest containing various components such as estimate, p.value, etc… (see cor.test() help).

cor.A1.XY <- cor.test(sub.A1.X, sub.A1.Y, method = "spearman")  # correlation between X and Y for Site A and Month 1
print(cor.A1.XY)
## 
##  Spearman's rank correlation rho
## 
## data:  sub.A1.X and sub.A1.Y 
## S = 144, p-value = 0.7329
## alternative hypothesis: true rho is not equal to 0 
## sample estimates:
##    rho 
## 0.1273
str(cor.A1.XY)
## List of 8
##  $ statistic  : Named num 144
##   ..- attr(*, "names")= chr "S"
##  $ parameter  : NULL
##  $ p.value    : num 0.733
##  $ estimate   : Named num 0.127
##   ..- attr(*, "names")= chr "rho"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "rho"
##  $ alternative: chr "two.sided"
##  $ method     : chr "Spearman's rank correlation rho"
##  $ data.name  : chr "sub.A1.X and sub.A1.Y"
##  - attr(*, "class")= chr "htest"

We can thus extract the value of spearman's rho and the associated p-value by:

cor.A1.XY.rho <- cor.A1.XY$estimate
cor.A1.XY.pval <- cor.A1.XY$p.value
print(c(cor.A1.XY.rho, cor.A1.XY.pval))
##    rho        
## 0.1273 0.7329

Custom Correlation Function

Because there are multiple steps involved with computing the pairwise correlation for each subset, it would be easiest to wrap the entire apply step in a single function, which is given a data frame with columns Year, Parameter, Value and returns the correlation coefficient values as a data frame with columns ParameterX, ParameterY, Rho, PValue.

my_cor <- function(x) {
    params <- levels(x$Parameter)  # get list of parameter codes
    params.grid <- expand.grid(params, params)  # make list of parameter pairs (XX, XY, XZ, YX, ...)
    cor.results <- data.frame()
    for (i in 1:nrow(params.grid)) {
        # loop through each row of params.grid
        x.param <- as.character(params.grid$Var1)[i]  # get the x parameter as a character
        y.param <- as.character(params.grid$Var2)[i]  # get the y parameter as a character
        x.values <- subset(x, Parameter == x.param)$Value  # extract x values
        y.values <- subset(x, Parameter == y.param)$Value  # extract x values
        xy.cor <- cor.test(x.values, y.values, method = "spearman")
        cor.results <- rbind(cor.results, data.frame(ParameterX = x.param, ParameterY = y.param, 
            Rho = xy.cor$estimate, PValue = xy.cor$p.value, row.names = NULL))
    }
    return(cor.results)
}

We then test the function with the sub.A1 dataframe, which gives the same result as we computed above.

my_cor(sub.A1)
##   ParameterX ParameterY      Rho PValue
## 1          X          X  1.00000 0.0000
## 2          Y          X  0.12727 0.7329
## 3          Z          X -0.13939 0.7072
## 4          X          Y  0.12727 0.7329
## 5          Y          Y  1.00000 0.0000
## 6          Z          Y  0.07879 0.8380
## 7          X          Z -0.13939 0.7072
## 8          Y          Z  0.07879 0.8380
## 9          Z          Z  1.00000 0.0000

Now we want to apply this function to each subset dataframe in the list df.split.site_month using the lapply(X, FUN, ...) where X is the list of subset data frames and FUN is our correlation function my_cor().

df.cor.list <- lapply(df.split.site_month, my_cor)
print(df.cor.list[1:6])  # only first 6
## $A.1
##   ParameterX ParameterY      Rho PValue
## 1          X          X  1.00000 0.0000
## 2          Y          X  0.12727 0.7329
## 3          Z          X -0.13939 0.7072
## 4          X          Y  0.12727 0.7329
## 5          Y          Y  1.00000 0.0000
## 6          Z          Y  0.07879 0.8380
## 7          X          Z -0.13939 0.7072
## 8          Y          Z  0.07879 0.8380
## 9          Z          Z  1.00000 0.0000
## 
## $B.1
##   ParameterX ParameterY     Rho PValue
## 1          X          X  1.0000 0.0000
## 2          Y          X  0.0303 0.9457
## 3          Z          X -0.4667 0.1782
## 4          X          Y  0.0303 0.9457
## 5          Y          Y  1.0000 0.0000
## 6          Z          Y -0.1273 0.7329
## 7          X          Z -0.4667 0.1782
## 8          Y          Z -0.1273 0.7329
## 9          Z          Z  1.0000 0.0000
## 
## $C.1
##   ParameterX ParameterY     Rho PValue
## 1          X          X  1.0000 0.0000
## 2          Y          X  0.3818 0.2790
## 3          Z          X  0.0303 0.9457
## 4          X          Y  0.3818 0.2790
## 5          Y          Y  1.0000 0.0000
## 6          Z          Y -0.4424 0.2042
## 7          X          Z  0.0303 0.9457
## 8          Y          Z -0.4424 0.2042
## 9          Z          Z  1.0000 0.0000
## 
## $A.2
##   ParameterX ParameterY      Rho  PValue
## 1          X          X  1.00000 0.00000
## 2          Y          X -0.62424 0.06025
## 3          Z          X -0.15152 0.68181
## 4          X          Y -0.62424 0.06025
## 5          Y          Y  1.00000 0.00000
## 6          Z          Y -0.01818 0.97284
## 7          X          Z -0.15152 0.68181
## 8          Y          Z -0.01818 0.97284
## 9          Z          Z  1.00000 0.00000
## 
## $B.2
##   ParameterX ParameterY      Rho    PValue
## 1          X          X  1.00000 0.0000000
## 2          Y          X  0.90303 0.0008802
## 3          Z          X -0.06667 0.8647535
## 4          X          Y  0.90303 0.0008802
## 5          Y          Y  1.00000 0.0000000
## 6          Z          Y -0.24848 0.4915555
## 7          X          Z -0.06667 0.8647535
## 8          Y          Z -0.24848 0.4915555
## 9          Z          Z  1.00000 0.0000000
## 
## $C.2
##   ParameterX ParameterY      Rho PValue
## 1          X          X  1.00000 0.0000
## 2          Y          X -0.01818 0.9728
## 3          Z          X -0.55152 0.1043
## 4          X          Y -0.01818 0.9728
## 5          Y          Y  1.00000 0.0000
## 6          Z          Y  0.40606 0.2474
## 7          X          Z -0.55152 0.1043
## 8          Y          Z  0.40606 0.2474
## 9          Z          Z  1.00000 0.0000

The lapply() function returns a list with the same elements containing the result of my_cor() for each subset.


The Combine Step

The result of the apply step left us with a list of data frames by Site and Month where each list element is a data frame of correlation coefficients between each combination of parameters. Now we want to combine them all into a single data frame with columns [Site, Month, ParameterX, ParameterY, Rho, PValue]. The easiest way is to create an empty data frame, then loop through the list df.cor.list and append each element to final data frame using rbind().

df.cor <- data.frame()
for (site.month in names(df.cor.list)) {
    df.cor.list.element <- df.cor.list[[site.month]]
    df.cor.list.element$Site.Month <- site.month
    df.cor <- rbind(df.cor, df.cor.list.element)
}

Lastly, we want to split the Site.Month column into two separate columns for Site and Month using the colsplit(string, pattern, names) function. The pattern argument to colsplit is regular expression, so instead of specifying '.' as the pattern to split by full stop in each value, we need to escape the full stop which is a special character for regular expressions as '.'.

df.cor <- cbind(df.cor, colsplit(df.cor$Site.Month, "\\.", c("Site", "Month")))
df.cor$Site.Month <- NULL  # drop Site.Month column
df.cor <- df.cor[, c("Site", "Month", "ParameterX", "ParameterY", "Rho", "PValue")]  # reorder columns
head(df.cor, 10)
##    Site Month ParameterX ParameterY      Rho PValue
## 1     A     1          X          X  1.00000 0.0000
## 2     A     1          Y          X  0.12727 0.7329
## 3     A     1          Z          X -0.13939 0.7072
## 4     A     1          X          Y  0.12727 0.7329
## 5     A     1          Y          Y  1.00000 0.0000
## 6     A     1          Z          Y  0.07879 0.8380
## 7     A     1          X          Z -0.13939 0.7072
## 8     A     1          Y          Z  0.07879 0.8380
## 9     A     1          Z          Z  1.00000 0.0000
## 10    B     1          X          X  1.00000 0.0000

And there we have it. Note that the result includes the full correlation matrix, so there are duplicate values (e.g. includes the correlation for both [X,Y] and [Y,X]). These could be removed in the my_cor() function if only upper triangle of the correlation matrix is needed. It depends on what the next step in the analysis is.


A Better Way

In R, there are almost always more than one way to perform a task. When it comes to the split-apply-combine strategy, the plyr package can make life much easier. plyr is comprised of a series of so-called **ply() where each of the two *s can be 'a', 'l', or 'd' for 'array', 'list', 'dataframe'. The first letter is the input format while the second letter is the output format. So alply() takes an array and returns a list. Generally, we stick to ddply() which takes a dataframe and returns a dataframe.

The ddply(.data, .variables, .fun, ...) function takes a data frame as .data, a collection of variables that define how to split the data frame as .variables and a function to apply to each subset as .fun. The function is usually summarise which is a special function that lets you chain multiple functions in the apply step.

As an example, if we wanted to compute the overall mean value for each Parameter computed across all Sites, Months and Years then we would run:

library(plyr)
ddply(df, .(Parameter), summarise, Mean = mean(Value))
##   Parameter     Mean
## 1         X -0.03492
## 2         Y  0.07217
## 3         Z  0.03878

Breaking this down, we pass in the original data frame df with columns [Site, Parameter, Month, Year, Value], then tell ddply to split according to the column named Parameter, then summarise by computing the mean of Value and storing the result in a column called Mean. If we wanted to compute the mean by Parameter and Site we would add Site to the variable collection.

ddply(df, .(Parameter, Site), summarise, Mean = mean(Value))
##   Parameter Site      Mean
## 1         X    A -0.222071
## 2         X    B  0.001607
## 3         X    C  0.115702
## 4         Y    A  0.118508
## 5         Y    B  0.062336
## 6         Y    C  0.035663
## 7         Z    A  0.086383
## 8         Z    B -0.027461
## 9         Z    C  0.057406

The ddply function also lets us compute multiple values from each subset by adding additional functions after summarise. So to compute the standard deviation and count as well as the mean of each subset, we run:

ddply(df, .(Parameter, Site), summarise, Mean = mean(Value), StDev = sd(Value), 
    N = length(Value))
##   Parameter Site      Mean  StDev   N
## 1         X    A -0.222071 0.8765 120
## 2         X    B  0.001607 1.1087 120
## 3         X    C  0.115702 1.0203 120
## 4         Y    A  0.118508 0.9678 120
## 5         Y    B  0.062336 0.9977 120
## 6         Y    C  0.035663 0.9724 120
## 7         Z    A  0.086383 0.9016 120
## 8         Z    B -0.027461 0.9484 120
## 9         Z    C  0.057406 1.1135 120

Since we have already defined a function my_cor() that takes a subset of the original dataframe and returns a data frame of the pairwise correlation estimates and p-values, we simply pass this function to ddply.

df.cor.plyr <- ddply(df, .(Site, Month), my_cor)
head(df.cor.plyr)
##   Site Month ParameterX ParameterY      Rho PValue
## 1    A     1          X          X  1.00000 0.0000
## 2    A     1          Y          X  0.12727 0.7329
## 3    A     1          Z          X -0.13939 0.7072
## 4    A     1          X          Y  0.12727 0.7329
## 5    A     1          Y          Y  1.00000 0.0000
## 6    A     1          Z          Y  0.07879 0.8380

This result is the same as that computed above using split and lapply, but was acheived somewhat more easily and in fewer lines using ddply.

head(df.cor.plyr, 10)  # using plyr
##    Site Month ParameterX ParameterY      Rho PValue
## 1     A     1          X          X  1.00000 0.0000
## 2     A     1          Y          X  0.12727 0.7329
## 3     A     1          Z          X -0.13939 0.7072
## 4     A     1          X          Y  0.12727 0.7329
## 5     A     1          Y          Y  1.00000 0.0000
## 6     A     1          Z          Y  0.07879 0.8380
## 7     A     1          X          Z -0.13939 0.7072
## 8     A     1          Y          Z  0.07879 0.8380
## 9     A     1          Z          Z  1.00000 0.0000
## 10    A     2          X          X  1.00000 0.0000
head(df.cor, 10)  # using split, lapply
##    Site Month ParameterX ParameterY      Rho PValue
## 1     A     1          X          X  1.00000 0.0000
## 2     A     1          Y          X  0.12727 0.7329
## 3     A     1          Z          X -0.13939 0.7072
## 4     A     1          X          Y  0.12727 0.7329
## 5     A     1          Y          Y  1.00000 0.0000
## 6     A     1          Z          Y  0.07879 0.8380
## 7     A     1          X          Z -0.13939 0.7072
## 8     A     1          Y          Z  0.07879 0.8380
## 9     A     1          Z          Z  1.00000 0.0000
## 10    B     1          X          X  1.00000 0.0000

Sanity Check

It's always a good idea to check the results visually, as this is much easier than combing through the resulting data frame. There are few functions and packages in R that can be used to visualize correlations as so-called correlograms (see Friendly2002, and Quick-R). One of the most popular is the corrgram function which is in the corrgram package. Because this function requires the wide format (with individual columns for each Paramater), we have to use the dcast() function from the reshape2 package.

library(corrgram)
df.wide <- dcast(df, Site + Month + Year ~ Parameter, value.var = "Value")
head(df.wide)
##   Site Month Year       X        Y       Z
## 1    A     1 2000 -0.5605  0.07051  0.4609
## 2    A     1 2001 -0.3802  0.60796  0.5194
## 3    A     1 2002 -0.4412 -1.28472  1.1098
## 4    A     1 2003 -0.5229  0.63296  1.0176
## 5    A     1 2004  0.3000 -1.07742 -2.2250
## 6    A     1 2005 -0.2732 -1.19736 -1.1986

We can then generate a corrgram for a given site and month by subsetting the data frame.

params <- levels(df$Parameter)
df.wide.A1 <- subset(df.wide, Site == "A" & Month == 1)
corrgram(df.wide.A1[, params], lower.panel = panel.shade, upper.panel = panel.pie, 
    text.panel = panel.txt, main = "Parameter Correlations by Site, Month")

plot of chunk unnamed-chunk-23

This is useful, but since we would like to check all the data, we can generate a pdf containing a correlogram for each site and month by using a nested for loop.

pdf("corrgram.pdf", width = 8, height = 8)
for (site in levels(df$Site)) {
    for (month in levels(as.factor(df$Month))) {
        print(corrgram(subset(df.wide, Site == site & Month == month)[, params], 
            lower.panel = panel.shade, upper.panel = panel.pie, text.panel = panel.txt, 
            main = paste("Parameter Correlations for Site=", site, ", Month=", 
                month, sep = "")))
    }
}
dev.off()

We can also show all of the correlation coefficients in one plot using the ggplot2 package. Note that transparency (i.e. alpha level) is defined as 1-PValue so that more significant results come up less transparent.

library(ggplot2)
ggplot(df.cor.plyr, aes(x = factor(Month), y = Rho, fill = Site, alpha = (1 - 
    PValue))) + geom_bar(stat = "identity", position = "dodge", color = "grey50") + 
    facet_wrap(~ParameterX + ParameterY) + labs(x = "Month", y = "Spearman's Rho") + 
    ggtitle("Correlation between Parameters by Site and Month")

plot of chunk unnamed-chunk-25

Since its difficult to interpret the p-values with a continuous range (which are statistically significant?), we can bin the p-values in: [ <0.05, 0.05<0.10, >0.10] using the cut function.

df.cor.plyr$PValue.Bin <- cut(df.cor.plyr$PValue, breaks = c(0, 0.05, 0.1, 1), 
    include.lowest = TRUE)
summary(df.cor.plyr$PValue.Bin)
##   [0,0.05] (0.05,0.1]    (0.1,1] 
##        120         10        194
ggplot(df.cor.plyr, aes(x = factor(Month), y = Rho, fill = Site, alpha = PValue.Bin)) + 
    geom_bar(stat = "identity", position = "dodge", color = "grey50") + facet_wrap(~ParameterX + 
    ParameterY) + labs(x = "Month", y = "Spearman's Rho") + scale_alpha_manual("P-Value", 
    values = c(1, 0.5, 0), labels = c("p <= 0.05", "0.05 < p <= 0.10", "p > 0.10")) + 
    ggtitle("Correlation between Parameters by Site and Month")

plot of chunk unnamed-chunk-27