Jeffrey D. Walker
2012-12-13
2012-12-16
added correlation p-value to final table using cor.test()
changed my_cor function so it can be used in either method (split/lapply, ddply)
added sub-sections to Apply Step
added visualization section at end
I've got a data set with 10 years of monthly mean measurements for a series of stations for various parameters. I'd like to run Spearman's Rho on many subsets of data and get the estimate for rho as well as the associated p-value. Is there an easy way to run the test multiple times with out specifying each subset for each run? In other words one run would be station A for June temp vs. DO; second run would be same but for July; then Station B etc..
This analysis is a classic example of the split-apply-combine strategy. The steps are
cor() function to each subset of the dataLike all things in R, there are multiple ways of doing things. The first method uses only functions found in the R core library (primarily split and lappy). The second method uses the plyr package (and the ddply function).
For this tutorial, we create a dummy dataset that contains the variables: Site, Parameter, Month, Year, Value. There are three sites: A, B, and C and three parameters: X, Y, Z. We can create a dataset that includes the unique combinations of these four variables using the expand.grid() function. We assign random values from a standard normal distribution to the Value column using rnorm().
df <- expand.grid(Site = c("A", "B", "C"), Parameter = c("X", "Y", "Z"), Month = c(1:12),
Year = c(2000:2009))
set.seed(123) # set the random number generator seed so results are reproducible
df$Value <- rnorm(nrow(df))
head(df)
## Site Parameter Month Year Value
## 1 A X 1 2000 -0.56048
## 2 B X 1 2000 -0.23018
## 3 C X 1 2000 1.55871
## 4 A Y 1 2000 0.07051
## 5 B Y 1 2000 0.12929
## 6 C Y 1 2000 1.71506
summary(df)
## Site Parameter Month Year Value
## A:360 X:360 Min. : 1.00 Min. :2000 Min. :-2.810
## B:360 Y:360 1st Qu.: 3.75 1st Qu.:2002 1st Qu.:-0.625
## C:360 Z:360 Median : 6.50 Median :2004 Median : 0.021
## Mean : 6.50 Mean :2004 Mean : 0.025
## 3rd Qu.: 9.25 3rd Qu.:2007 3rd Qu.: 0.665
## Max. :12.00 Max. :2009 Max. : 3.241
There are multiple ways of splitting datasets into multiple subsets. The most fundamental way is to use the function split(x, f, ...) which takes an object x such as a dataframe, and splits it based on the levels of a factor vector f. If we want to split the dataset according the Site variable (which is a factor with levels corresponding to the unique Site codes), we would call:
df.split.site <- split(df, df$Site)
str(df.split.site, max.level = 2)
## List of 3
## $ A:'data.frame': 360 obs. of 5 variables:
## ..$ Site : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ Parameter: Factor w/ 3 levels "X","Y","Z": 1 2 3 1 2 3 1 2 3 1 ...
## ..$ Month : int [1:360] 1 1 1 2 2 2 3 3 3 4 ...
## ..$ Year : int [1:360] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## ..$ Value : num [1:360] -0.5605 0.0705 0.4609 -0.4457 0.4008 ...
## ..- attr(*, "out.attrs")=List of 2
## $ B:'data.frame': 360 obs. of 5 variables:
## ..$ Site : Factor w/ 3 levels "A","B","C": 2 2 2 2 2 2 2 2 2 2 ...
## ..$ Parameter: Factor w/ 3 levels "X","Y","Z": 1 2 3 1 2 3 1 2 3 1 ...
## ..$ Month : int [1:360] 1 1 1 2 2 2 3 3 3 4 ...
## ..$ Year : int [1:360] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## ..$ Value : num [1:360] -0.23 0.129 -1.265 1.224 0.111 ...
## ..- attr(*, "out.attrs")=List of 2
## $ C:'data.frame': 360 obs. of 5 variables:
## ..$ Site : Factor w/ 3 levels "A","B","C": 3 3 3 3 3 3 3 3 3 3 ...
## ..$ Parameter: Factor w/ 3 levels "X","Y","Z": 1 2 3 1 2 3 1 2 3 1 ...
## ..$ Month : int [1:360] 1 1 1 2 2 2 3 3 3 4 ...
## ..$ Year : int [1:360] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## ..$ Value : num [1:360] 1.559 1.715 -0.687 0.36 -0.556 ...
## ..- attr(*, "out.attrs")=List of 2
Using the str() function to show the structure of the result, we see that it is a list of length 3, with elements corresponding to each level of the Site factor. Each of these three elements is a data frame corresponding to the rows with the respective Site. So for example, the element 'A' of df.split.site is a dataframe with the same columns as the original dataset but only containing rows for site A. Remember to reference the list elements using double brackets [[…]] instead of single brackets […], which are used for vectors or data frames.
head(df.split.site[["A"]])
## Site Parameter Month Year Value
## 1 A X 1 2000 -0.56048
## 4 A Y 1 2000 0.07051
## 7 A Z 1 2000 0.46092
## 10 A X 2 2000 -0.44566
## 13 A Y 2 2000 0.40077
## 16 A Z 2 2000 1.78691
summary(df.split.site[["A"]])
## Site Parameter Month Year Value
## A:360 X:120 Min. : 1.00 Min. :2000 Min. :-2.6017
## B: 0 Y:120 1st Qu.: 3.75 1st Qu.:2002 1st Qu.:-0.6258
## C: 0 Z:120 Median : 6.50 Median :2004 Median :-0.0161
## Mean : 6.50 Mean :2004 Mean :-0.0057
## 3rd Qu.: 9.25 3rd Qu.:2007 3rd Qu.: 0.5834
## Max. :12.00 Max. :2009 Max. : 2.8322
We can also split using multiple factors, such as Site and Month. The element names for the resulting list is a single string constructed by concatenating the site with the month (e.g. for site A and month 1, the element key is 'A.1').
df.split.site_month <- split(df, df[, c("Site", "Month")])
str(df.split.site_month[1:5], max.level = 2) # only show first 5 elements
## List of 5
## $ A.1:'data.frame': 30 obs. of 5 variables:
## ..$ Site : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ Parameter: Factor w/ 3 levels "X","Y","Z": 1 2 3 1 2 3 1 2 3 1 ...
## ..$ Month : int [1:30] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ Year : int [1:30] 2000 2000 2000 2001 2001 2001 2002 2002 2002 2003 ...
## ..$ Value : num [1:30] -0.5605 0.0705 0.4609 -0.3802 0.608 ...
## ..- attr(*, "out.attrs")=List of 2
## $ B.1:'data.frame': 30 obs. of 5 variables:
## ..$ Site : Factor w/ 3 levels "A","B","C": 2 2 2 2 2 2 2 2 2 2 ...
## ..$ Parameter: Factor w/ 3 levels "X","Y","Z": 1 2 3 1 2 3 1 2 3 1 ...
## ..$ Month : int [1:30] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ Year : int [1:30] 2000 2000 2000 2001 2001 2001 2002 2002 2002 2003 ...
## ..$ Value : num [1:30] -0.23 0.129 -1.265 0.919 -1.618 ...
## ..- attr(*, "out.attrs")=List of 2
## $ C.1:'data.frame': 30 obs. of 5 variables:
## ..$ Site : Factor w/ 3 levels "A","B","C": 3 3 3 3 3 3 3 3 3 3 ...
## ..$ Parameter: Factor w/ 3 levels "X","Y","Z": 1 2 3 1 2 3 1 2 3 1 ...
## ..$ Month : int [1:30] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ Year : int [1:30] 2000 2000 2000 2001 2001 2001 2002 2002 2002 2003 ...
## ..$ Value : num [1:30] 1.5587 1.7151 -0.6869 -0.5753 -0.0556 ...
## ..- attr(*, "out.attrs")=List of 2
## $ A.2:'data.frame': 30 obs. of 5 variables:
## ..$ Site : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ Parameter: Factor w/ 3 levels "X","Y","Z": 1 2 3 1 2 3 1 2 3 1 ...
## ..$ Month : int [1:30] 2 2 2 2 2 2 2 2 2 2 ...
## ..$ Year : int [1:30] 2000 2000 2000 2001 2001 2001 2002 2002 2002 2003 ...
## ..$ Value : num [1:30] -0.446 0.401 1.787 -0.641 0.118 ...
## ..- attr(*, "out.attrs")=List of 2
## $ B.2:'data.frame': 30 obs. of 5 variables:
## ..$ Site : Factor w/ 3 levels "A","B","C": 2 2 2 2 2 2 2 2 2 2 ...
## ..$ Parameter: Factor w/ 3 levels "X","Y","Z": 1 2 3 1 2 3 1 2 3 1 ...
## ..$ Month : int [1:30] 2 2 2 2 2 2 2 2 2 2 ...
## ..$ Year : int [1:30] 2000 2000 2000 2001 2001 2001 2002 2002 2002 2003 ...
## ..$ Value : num [1:30] 1.224 0.111 0.498 -0.85 -0.947 ...
## ..- attr(*, "out.attrs")=List of 2
Now that the dataframe is split according to Site and Month, we want to compute the correlation coefficient between each pair of Parameters (e.g. X vs. Y, X vs. Z, Y vs. Z) using the data for each year. To figure out what the apply function should be, we extract one of the subsets from the split data frame:
sub.A1 <- df.split.site_month[["A.1"]]
head(sub.A1, 20)
## Site Parameter Month Year Value
## 1 A X 1 2000 -0.56048
## 4 A Y 1 2000 0.07051
## 7 A Z 1 2000 0.46092
## 109 A X 1 2001 -0.38023
## 112 A Y 1 2001 0.60796
## 115 A Z 1 2001 0.51941
## 217 A X 1 2002 -0.44116
## 220 A Y 1 2002 -1.28472
## 223 A Z 1 2002 1.10985
## 325 A X 1 2003 -0.52291
## 328 A Y 1 2003 0.63296
## 331 A Z 1 2003 1.01756
## 433 A X 1 2004 0.30004
## 436 A Y 1 2004 -1.07742
## 439 A Z 1 2004 -2.22499
## 541 A X 1 2005 -0.27325
## 544 A Y 1 2005 -1.19736
## 547 A Z 1 2005 -1.19862
## 649 A X 1 2006 0.74081
## 652 A Y 1 2006 0.10719
Note that the Site and Month variables only have values of 'A' and '1', respectively, but that all combinations of Parameter and Year remain.
To compute Spearman's rho correlation coefficient, we can use the cor(x, y=NULL, method=c('pearson','kendall','spearman')) function. By specifying the argument method='spearman' the function will return the value for Spearman's rho (see the docs by running ?cor in the R command line). However, the cor() function requires either two numeric vectors for arguments x and y or a single matrix/data.frame for x only. Because our current subset data frame is in the so-called long format, we can either create individual vectors for each parameter using the subset() function or convert the data to a wide format so that the values for each Parameter is a column. Using subset, we can compute the pairwise correlation coefficients:
sub.A1.X <- subset(sub.A1, Parameter == "X")$Value
sub.A1.Y <- subset(sub.A1, Parameter == "Y")$Value
sub.A1.Z <- subset(sub.A1, Parameter == "Z")$Value
print(sub.A1.X)
## [1] -0.5605 -0.3802 -0.4412 -0.5229 0.3000 -0.2732 0.7408 1.0128
## [9] -0.1633 -1.4987
cor(sub.A1.X, sub.A1.Y, method = "spearman") # correlation between X and Y for Site A and Month 1
## [1] 0.1273
cor(sub.A1.X, sub.A1.Z, method = "spearman") # correlation between X and Z for Site A and Month 1
## [1] -0.1394
cor(sub.A1.Y, sub.A1.Z, method = "spearman") # correlation between Y and Z for Site A and Month 1
## [1] 0.07879
The cor() function also takes a dataframe and computes the correlation between each pair of numeric columns. To use this, we first convert the subsetted dataframe into the long format using the dcast(data, formula) function from the reshape2 package.
library(reshape2)
sub.A1.wide <- dcast(sub.A1, Year ~ Parameter, value.var = "Value")
print(sub.A1.wide)
## Year X Y Z
## 1 2000 -0.5605 0.07051 0.4609
## 2 2001 -0.3802 0.60796 0.5194
## 3 2002 -0.4412 -1.28472 1.1098
## 4 2003 -0.5229 0.63296 1.0176
## 5 2004 0.3000 -1.07742 -2.2250
## 6 2005 -0.2732 -1.19736 -1.1986
## 7 2006 0.7408 0.10719 0.4806
## 8 2007 1.0128 -0.19589 0.6166
## 9 2008 -0.1633 1.13105 -1.1392
## 10 2009 -1.4987 -1.09849 -0.3227
Now we can pass in the wide format of the sub-dataset (excluding the Year column since we don't want the correlations computed against the year):
sub.A1.cor <- cor(sub.A1.wide[, c("X", "Y", "Z")], method = "spearman")
print(sub.A1.cor)
## X Y Z
## X 1.0000 0.12727 -0.13939
## Y 0.1273 1.00000 0.07879
## Z -0.1394 0.07879 1.00000
The resulting correlation matrix is symmetric since cor(X, Y) = cor(Y, X). We can exclude the duplicated values as well as the diagonal by using the lower.tri(x, diag=TRUE) function to set all values on the diagonal and in the lower triangle to NA. Note that lower.tri() returns a boolean matrix that is the same shape as the data frame, so it is used to index the original data frame:
sub.A1.cor.tri <- lower.tri(sub.A1.cor, diag = TRUE)
print(sub.A1.cor.tri)
## [,1] [,2] [,3]
## [1,] TRUE FALSE FALSE
## [2,] TRUE TRUE FALSE
## [3,] TRUE TRUE TRUE
sub.A1.cor[sub.A1.cor.tri] <- NA
print(sub.A1.cor)
## X Y Z
## X NA 0.1273 -0.13939
## Y NA NA 0.07879
## Z NA NA NA
Now we can use the melt() function (which is like the sibling of dcast()) from the reshape2 package to convert this matrix to long format. The na.rm=TRUE argument will drop all values that are NA.
sub.A1.cor <- melt(sub.A1.cor, na.rm = TRUE)
print(sub.A1.cor)
## Var1 Var2 value
## 4 X Y 0.12727
## 7 X Z -0.13939
## 8 Y Z 0.07879
The cor() function only returns the estimated value of rho, so to get the p-values, we use the cor.test() function, which returns an object of class htest containing various components such as estimate, p.value, etc… (see cor.test() help).
cor.A1.XY <- cor.test(sub.A1.X, sub.A1.Y, method = "spearman") # correlation between X and Y for Site A and Month 1
print(cor.A1.XY)
##
## Spearman's rank correlation rho
##
## data: sub.A1.X and sub.A1.Y
## S = 144, p-value = 0.7329
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1273
str(cor.A1.XY)
## List of 8
## $ statistic : Named num 144
## ..- attr(*, "names")= chr "S"
## $ parameter : NULL
## $ p.value : num 0.733
## $ estimate : Named num 0.127
## ..- attr(*, "names")= chr "rho"
## $ null.value : Named num 0
## ..- attr(*, "names")= chr "rho"
## $ alternative: chr "two.sided"
## $ method : chr "Spearman's rank correlation rho"
## $ data.name : chr "sub.A1.X and sub.A1.Y"
## - attr(*, "class")= chr "htest"
We can thus extract the value of spearman's rho and the associated p-value by:
cor.A1.XY.rho <- cor.A1.XY$estimate
cor.A1.XY.pval <- cor.A1.XY$p.value
print(c(cor.A1.XY.rho, cor.A1.XY.pval))
## rho
## 0.1273 0.7329
Because there are multiple steps involved with computing the pairwise correlation for each subset, it would be easiest to wrap the entire apply step in a single function, which is given a data frame with columns Year, Parameter, Value and returns the correlation coefficient values as a data frame with columns ParameterX, ParameterY, Rho, PValue.
my_cor <- function(x) {
params <- levels(x$Parameter) # get list of parameter codes
params.grid <- expand.grid(params, params) # make list of parameter pairs (XX, XY, XZ, YX, ...)
cor.results <- data.frame()
for (i in 1:nrow(params.grid)) {
# loop through each row of params.grid
x.param <- as.character(params.grid$Var1)[i] # get the x parameter as a character
y.param <- as.character(params.grid$Var2)[i] # get the y parameter as a character
x.values <- subset(x, Parameter == x.param)$Value # extract x values
y.values <- subset(x, Parameter == y.param)$Value # extract x values
xy.cor <- cor.test(x.values, y.values, method = "spearman")
cor.results <- rbind(cor.results, data.frame(ParameterX = x.param, ParameterY = y.param,
Rho = xy.cor$estimate, PValue = xy.cor$p.value, row.names = NULL))
}
return(cor.results)
}
We then test the function with the sub.A1 dataframe, which gives the same result as we computed above.
my_cor(sub.A1)
## ParameterX ParameterY Rho PValue
## 1 X X 1.00000 0.0000
## 2 Y X 0.12727 0.7329
## 3 Z X -0.13939 0.7072
## 4 X Y 0.12727 0.7329
## 5 Y Y 1.00000 0.0000
## 6 Z Y 0.07879 0.8380
## 7 X Z -0.13939 0.7072
## 8 Y Z 0.07879 0.8380
## 9 Z Z 1.00000 0.0000
Now we want to apply this function to each subset dataframe in the list df.split.site_month using the lapply(X, FUN, ...) where X is the list of subset data frames and FUN is our correlation function my_cor().
df.cor.list <- lapply(df.split.site_month, my_cor)
print(df.cor.list[1:6]) # only first 6
## $A.1
## ParameterX ParameterY Rho PValue
## 1 X X 1.00000 0.0000
## 2 Y X 0.12727 0.7329
## 3 Z X -0.13939 0.7072
## 4 X Y 0.12727 0.7329
## 5 Y Y 1.00000 0.0000
## 6 Z Y 0.07879 0.8380
## 7 X Z -0.13939 0.7072
## 8 Y Z 0.07879 0.8380
## 9 Z Z 1.00000 0.0000
##
## $B.1
## ParameterX ParameterY Rho PValue
## 1 X X 1.0000 0.0000
## 2 Y X 0.0303 0.9457
## 3 Z X -0.4667 0.1782
## 4 X Y 0.0303 0.9457
## 5 Y Y 1.0000 0.0000
## 6 Z Y -0.1273 0.7329
## 7 X Z -0.4667 0.1782
## 8 Y Z -0.1273 0.7329
## 9 Z Z 1.0000 0.0000
##
## $C.1
## ParameterX ParameterY Rho PValue
## 1 X X 1.0000 0.0000
## 2 Y X 0.3818 0.2790
## 3 Z X 0.0303 0.9457
## 4 X Y 0.3818 0.2790
## 5 Y Y 1.0000 0.0000
## 6 Z Y -0.4424 0.2042
## 7 X Z 0.0303 0.9457
## 8 Y Z -0.4424 0.2042
## 9 Z Z 1.0000 0.0000
##
## $A.2
## ParameterX ParameterY Rho PValue
## 1 X X 1.00000 0.00000
## 2 Y X -0.62424 0.06025
## 3 Z X -0.15152 0.68181
## 4 X Y -0.62424 0.06025
## 5 Y Y 1.00000 0.00000
## 6 Z Y -0.01818 0.97284
## 7 X Z -0.15152 0.68181
## 8 Y Z -0.01818 0.97284
## 9 Z Z 1.00000 0.00000
##
## $B.2
## ParameterX ParameterY Rho PValue
## 1 X X 1.00000 0.0000000
## 2 Y X 0.90303 0.0008802
## 3 Z X -0.06667 0.8647535
## 4 X Y 0.90303 0.0008802
## 5 Y Y 1.00000 0.0000000
## 6 Z Y -0.24848 0.4915555
## 7 X Z -0.06667 0.8647535
## 8 Y Z -0.24848 0.4915555
## 9 Z Z 1.00000 0.0000000
##
## $C.2
## ParameterX ParameterY Rho PValue
## 1 X X 1.00000 0.0000
## 2 Y X -0.01818 0.9728
## 3 Z X -0.55152 0.1043
## 4 X Y -0.01818 0.9728
## 5 Y Y 1.00000 0.0000
## 6 Z Y 0.40606 0.2474
## 7 X Z -0.55152 0.1043
## 8 Y Z 0.40606 0.2474
## 9 Z Z 1.00000 0.0000
The lapply() function returns a list with the same elements containing the result of my_cor() for each subset.
The result of the apply step left us with a list of data frames by Site and Month where each list element is a data frame of correlation coefficients between each combination of parameters. Now we want to combine them all into a single data frame with columns [Site, Month, ParameterX, ParameterY, Rho, PValue]. The easiest way is to create an empty data frame, then loop through the list df.cor.list and append each element to final data frame using rbind().
df.cor <- data.frame()
for (site.month in names(df.cor.list)) {
df.cor.list.element <- df.cor.list[[site.month]]
df.cor.list.element$Site.Month <- site.month
df.cor <- rbind(df.cor, df.cor.list.element)
}
Lastly, we want to split the Site.Month column into two separate columns for Site and Month using the colsplit(string, pattern, names) function. The pattern argument to colsplit is regular expression, so instead of specifying '.' as the pattern to split by full stop in each value, we need to escape the full stop which is a special character for regular expressions as '.'.
df.cor <- cbind(df.cor, colsplit(df.cor$Site.Month, "\\.", c("Site", "Month")))
df.cor$Site.Month <- NULL # drop Site.Month column
df.cor <- df.cor[, c("Site", "Month", "ParameterX", "ParameterY", "Rho", "PValue")] # reorder columns
head(df.cor, 10)
## Site Month ParameterX ParameterY Rho PValue
## 1 A 1 X X 1.00000 0.0000
## 2 A 1 Y X 0.12727 0.7329
## 3 A 1 Z X -0.13939 0.7072
## 4 A 1 X Y 0.12727 0.7329
## 5 A 1 Y Y 1.00000 0.0000
## 6 A 1 Z Y 0.07879 0.8380
## 7 A 1 X Z -0.13939 0.7072
## 8 A 1 Y Z 0.07879 0.8380
## 9 A 1 Z Z 1.00000 0.0000
## 10 B 1 X X 1.00000 0.0000
And there we have it. Note that the result includes the full correlation matrix, so there are duplicate values (e.g. includes the correlation for both [X,Y] and [Y,X]). These could be removed in the my_cor() function if only upper triangle of the correlation matrix is needed. It depends on what the next step in the analysis is.
In R, there are almost always more than one way to perform a task. When it comes to the split-apply-combine strategy, the plyr package can make life much easier. plyr is comprised of a series of so-called **ply() where each of the two *s can be 'a', 'l', or 'd' for 'array', 'list', 'dataframe'. The first letter is the input format while the second letter is the output format. So alply() takes an array and returns a list. Generally, we stick to ddply() which takes a dataframe and returns a dataframe.
The ddply(.data, .variables, .fun, ...) function takes a data frame as .data, a collection of variables that define how to split the data frame as .variables and a function to apply to each subset as .fun. The function is usually summarise which is a special function that lets you chain multiple functions in the apply step.
As an example, if we wanted to compute the overall mean value for each Parameter computed across all Sites, Months and Years then we would run:
library(plyr)
ddply(df, .(Parameter), summarise, Mean = mean(Value))
## Parameter Mean
## 1 X -0.03492
## 2 Y 0.07217
## 3 Z 0.03878
Breaking this down, we pass in the original data frame df with columns [Site, Parameter, Month, Year, Value], then tell ddply to split according to the column named Parameter, then summarise by computing the mean of Value and storing the result in a column called Mean. If we wanted to compute the mean by Parameter and Site we would add Site to the variable collection.
ddply(df, .(Parameter, Site), summarise, Mean = mean(Value))
## Parameter Site Mean
## 1 X A -0.222071
## 2 X B 0.001607
## 3 X C 0.115702
## 4 Y A 0.118508
## 5 Y B 0.062336
## 6 Y C 0.035663
## 7 Z A 0.086383
## 8 Z B -0.027461
## 9 Z C 0.057406
The ddply function also lets us compute multiple values from each subset by adding additional functions after summarise. So to compute the standard deviation and count as well as the mean of each subset, we run:
ddply(df, .(Parameter, Site), summarise, Mean = mean(Value), StDev = sd(Value),
N = length(Value))
## Parameter Site Mean StDev N
## 1 X A -0.222071 0.8765 120
## 2 X B 0.001607 1.1087 120
## 3 X C 0.115702 1.0203 120
## 4 Y A 0.118508 0.9678 120
## 5 Y B 0.062336 0.9977 120
## 6 Y C 0.035663 0.9724 120
## 7 Z A 0.086383 0.9016 120
## 8 Z B -0.027461 0.9484 120
## 9 Z C 0.057406 1.1135 120
Since we have already defined a function my_cor() that takes a subset of the original dataframe and returns a data frame of the pairwise correlation estimates and p-values, we simply pass this function to ddply.
df.cor.plyr <- ddply(df, .(Site, Month), my_cor)
head(df.cor.plyr)
## Site Month ParameterX ParameterY Rho PValue
## 1 A 1 X X 1.00000 0.0000
## 2 A 1 Y X 0.12727 0.7329
## 3 A 1 Z X -0.13939 0.7072
## 4 A 1 X Y 0.12727 0.7329
## 5 A 1 Y Y 1.00000 0.0000
## 6 A 1 Z Y 0.07879 0.8380
This result is the same as that computed above using split and lapply, but was acheived somewhat more easily and in fewer lines using ddply.
head(df.cor.plyr, 10) # using plyr
## Site Month ParameterX ParameterY Rho PValue
## 1 A 1 X X 1.00000 0.0000
## 2 A 1 Y X 0.12727 0.7329
## 3 A 1 Z X -0.13939 0.7072
## 4 A 1 X Y 0.12727 0.7329
## 5 A 1 Y Y 1.00000 0.0000
## 6 A 1 Z Y 0.07879 0.8380
## 7 A 1 X Z -0.13939 0.7072
## 8 A 1 Y Z 0.07879 0.8380
## 9 A 1 Z Z 1.00000 0.0000
## 10 A 2 X X 1.00000 0.0000
head(df.cor, 10) # using split, lapply
## Site Month ParameterX ParameterY Rho PValue
## 1 A 1 X X 1.00000 0.0000
## 2 A 1 Y X 0.12727 0.7329
## 3 A 1 Z X -0.13939 0.7072
## 4 A 1 X Y 0.12727 0.7329
## 5 A 1 Y Y 1.00000 0.0000
## 6 A 1 Z Y 0.07879 0.8380
## 7 A 1 X Z -0.13939 0.7072
## 8 A 1 Y Z 0.07879 0.8380
## 9 A 1 Z Z 1.00000 0.0000
## 10 B 1 X X 1.00000 0.0000
It's always a good idea to check the results visually, as this is much easier than combing through the resulting data frame. There are few functions and packages in R that can be used to visualize correlations as so-called correlograms (see Friendly2002, and Quick-R). One of the most popular is the corrgram function which is in the corrgram package. Because this function requires the wide format (with individual columns for each Paramater), we have to use the dcast() function from the reshape2 package.
library(corrgram)
df.wide <- dcast(df, Site + Month + Year ~ Parameter, value.var = "Value")
head(df.wide)
## Site Month Year X Y Z
## 1 A 1 2000 -0.5605 0.07051 0.4609
## 2 A 1 2001 -0.3802 0.60796 0.5194
## 3 A 1 2002 -0.4412 -1.28472 1.1098
## 4 A 1 2003 -0.5229 0.63296 1.0176
## 5 A 1 2004 0.3000 -1.07742 -2.2250
## 6 A 1 2005 -0.2732 -1.19736 -1.1986
We can then generate a corrgram for a given site and month by subsetting the data frame.
params <- levels(df$Parameter)
df.wide.A1 <- subset(df.wide, Site == "A" & Month == 1)
corrgram(df.wide.A1[, params], lower.panel = panel.shade, upper.panel = panel.pie,
text.panel = panel.txt, main = "Parameter Correlations by Site, Month")
This is useful, but since we would like to check all the data, we can generate a pdf containing a correlogram for each site and month by using a nested for loop.
pdf("corrgram.pdf", width = 8, height = 8)
for (site in levels(df$Site)) {
for (month in levels(as.factor(df$Month))) {
print(corrgram(subset(df.wide, Site == site & Month == month)[, params],
lower.panel = panel.shade, upper.panel = panel.pie, text.panel = panel.txt,
main = paste("Parameter Correlations for Site=", site, ", Month=",
month, sep = "")))
}
}
dev.off()
We can also show all of the correlation coefficients in one plot using the ggplot2 package. Note that transparency (i.e. alpha level) is defined as 1-PValue so that more significant results come up less transparent.
library(ggplot2)
ggplot(df.cor.plyr, aes(x = factor(Month), y = Rho, fill = Site, alpha = (1 -
PValue))) + geom_bar(stat = "identity", position = "dodge", color = "grey50") +
facet_wrap(~ParameterX + ParameterY) + labs(x = "Month", y = "Spearman's Rho") +
ggtitle("Correlation between Parameters by Site and Month")
Since its difficult to interpret the p-values with a continuous range (which are statistically significant?), we can bin the p-values in: [ <0.05, 0.05<0.10, >0.10] using the cut function.
df.cor.plyr$PValue.Bin <- cut(df.cor.plyr$PValue, breaks = c(0, 0.05, 0.1, 1),
include.lowest = TRUE)
summary(df.cor.plyr$PValue.Bin)
## [0,0.05] (0.05,0.1] (0.1,1]
## 120 10 194
ggplot(df.cor.plyr, aes(x = factor(Month), y = Rho, fill = Site, alpha = PValue.Bin)) +
geom_bar(stat = "identity", position = "dodge", color = "grey50") + facet_wrap(~ParameterX +
ParameterY) + labs(x = "Month", y = "Spearman's Rho") + scale_alpha_manual("P-Value",
values = c(1, 0.5, 0), labels = c("p <= 0.05", "0.05 < p <= 0.10", "p > 0.10")) +
ggtitle("Correlation between Parameters by Site and Month")