Data screening

For a first data screening want to do the following:

Get data from eCO2 experiments
Data availability per experiment
Data availability per variable
Long versus wide issue

Let’s check first if the data is ready to use for these steps.

But before we start:

Some variable names end with an underscore (ANPP_, Asat_, SLA_, coarse_root_biomass_, LMA_, soil_NH4+_, N_uptake_). Needed to delete that manually.
Replaced lots of superscripted numbers (actually appeared as superscript when opening the CSV file in my text editor) with plain number (e.g., m2)
Replaced units given like g_m-2 with g/m2, or kg_ha-1 with kg/ha, or g/m2*yr with g/m2/yr (assuming that’s correct if that’s the way it would be entered in R), _g/m2 with g/m2, cm2/m2*d with cm2/m2/d, … I’ll leave it with that for now.
Replaced Vcmax_normalized_to_25°C with Vcmax25, and Jmax_normalized_to_25°C with Jmax25 (problematic degree sign)
Removed weird symbol in units, e.g., ‘mol_CO2/m2/s to mol_CO2/m2/s
Removed line break in date string

All that is saved as a new file NewData_wide_CORRECTED2.csv

Load the corrected file, which is based on the one Kevin sent on 1.3.2019.

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

df <- read_csv("~/data/gcme/data_received_190325/NewData_wide_CORRECTED2.csv") %>%
  
  # something is wrong in columns ambient_Sd, ambient_Se, elevated...
  mutate( ambient_Sd  = as.numeric(ambient_Sd),  ambient_Se  = as.numeric(ambient_Se), 
          elevated_Sd = as.numeric(elevated_Sd), elevated_Se = as.numeric(elevated_Se) )

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   exp_nam = col_character(),
##   prev_name = col_character(),
##   factors = col_character(),
##   treatment = col_character(),
##   TT_Nut_Detail = col_character(),
##   Data_type = col_character(),
##   Unit = col_character(),
##   Sampling_date = col_character(),
##   ambient_Se = col_character(),
##   elevated_Se = col_character()
## )

## See spec(...) for full column specifications.

  # XXX testing
  # filter(exp_nam == "SwissFACE_lolium2" & Data_type == "aboveground_biomass")
  # filter(exp_nam == "SwissFACE_trifolium2" & Data_type == "aboveground_biomass")

# save experiments names
df_experiments <- df %>% select(exp_nam, prev_name) %>% distinct()

0. Issues

Sampling date

This is not read in correctly. Either continue doing manual edits (open CSV in editor and check whether it’s in a consistent format, at least within experiments), or do something clever if sampling date formats are at least consistent within experiments: e.g., reading the table experiment per experiment (making use of automatic interpretation of the sampling date format by read_csv()) and then stack all experiments’ data together again.

Duplicate rows

Some rows are simply duplicates. This is dangerous.

keyvars <- c("exp_nam", "factors", "treatment", "Data_type", "Unit", "Sampling_date", "Start_Year", "Year", "n_plots")
valuevars <-  c("ambient", "ambient_Se", "ambient_Sd", "elevated", "elevated_Se", "elevated_Sd")

df %>% 
  filter(exp_nam == "SwissFACE_trifolium2" & Data_type == "aboveground_biomass") %>% 
  filter(treatment=="cf") %>% 
  select(one_of(keyvars, valuevars)) %>% 
  print()

## # A tibble: 2 x 15
##   exp_nam factors treatment Data_type Unit  Sampling_date Start_Year  Year
##   <chr>   <chr>   <chr>     <chr>     <chr> <chr>              <dbl> <dbl>
## 1 SwissF… cf      cf        abovegro… g/m2… <NA>                1993    NA
## 2 SwissF… cf      cf        abovegro… g/m2… <NA>                1993    NA
## # … with 7 more variables: n_plots <dbl>, ambient <dbl>, ambient_Se <dbl>,
## #   ambient_Sd <dbl>, elevated <dbl>, elevated_Se <dbl>, elevated_Sd <dbl>

This may be remediated automatically by using only columns that are distinct w.r.t. valuevars and keyvars (basically all columns in original data, except ALIAS).

print(nrow(df))

## [1] 36974

df <- df %>% 
  distinct_at(vars(one_of(keyvars, valuevars))) %>%
  mutate( id=1:n() )   # create new ID key (before this was 'ALIAS')
print(nrow(df))

## [1] 36347

df %>% 
  filter(exp_nam == "SwissFACE_trifolium2" & Data_type == "aboveground_biomass") %>% 
  filter(treatment=="cf") %>% 
  select(one_of(keyvars, valuevars)) %>% 
  print()

## # A tibble: 1 x 15
##   exp_nam factors treatment Data_type Unit  Sampling_date Start_Year  Year
##   <chr>   <chr>   <chr>     <chr>     <chr> <chr>              <dbl> <dbl>
## 1 SwissF… cf      cf        abovegro… g/m2… <NA>                1993    NA
## # … with 7 more variables: n_plots <dbl>, ambient <dbl>, ambient_Se <dbl>,
## #   ambient_Sd <dbl>, elevated <dbl>, elevated_Se <dbl>, elevated_Sd <dbl>

Still problems?

df %>% 
  filter(exp_nam == "SwissFACE_lolium2" & Data_type == "belowground_biomass") %>% 
  filter(treatment=="cf", Year==6) %>% 
  select(one_of(keyvars, valuevars)) %>% 
  print()

## # A tibble: 2 x 15
##   exp_nam factors treatment Data_type Unit  Sampling_date Start_Year  Year
##   <chr>   <chr>   <chr>     <chr>     <chr> <chr>              <dbl> <dbl>
## 1 SwissF… cf      cf        belowgro… g/m2  11/1/1998           1993     6
## 2 SwissF… cf      cf        belowgro… g/m2  1-Nov-98            1993     6
## # … with 7 more variables: n_plots <dbl>, ambient <dbl>, ambient_Se <dbl>,
## #   ambient_Sd <dbl>, elevated <dbl>, elevated_Se <dbl>, elevated_Sd <dbl>

Yes. In this case, the problem has to do with Sampling_date. I suspect that the problem with Sampling_date is that it was manually entered and consistency may be respected within experiments (and variables?), but not across variables. When read into R with readr::read_csv(), different formats are interpreted always with the same function. A possible solution to this is to read data separately for each experiment and variable and stack them together again (bind_rows()).

TODO:

First, do some manual cleaning of Sampling_date as far as possible and reasonable. Then, develop a separate read-in by experiments and variables, stack data again, and write the combined data to file.

Distinct ambients

The way the wide format is designed, the ambient values should be the same for different rows corresponding to different treatments (for a given experiment, variable, and sampling year). This is not always the case. It might be that these are actually different data from multiple samplings. However, this is not identifiable, but should be. An example:

df %>% 
  filter(exp_nam == "SwissFACE_lolium2" & Data_type == "Rh" ) %>% 
  print()

## # A tibble: 2 x 16
##   exp_nam factors treatment Data_type Unit  Sampling_date Start_Year  Year
##   <chr>   <chr>   <chr>     <chr>     <chr> <chr>              <dbl> <dbl>
## 1 SwissF… cf      c         Rh        g_SO… 2002                1993    10
## 2 SwissF… cf      cf        Rh        g_SO… 2002                1993    10
## # … with 8 more variables: n_plots <dbl>, ambient <dbl>, ambient_Se <dbl>,
## #   ambient_Sd <dbl>, elevated <dbl>, elevated_Se <dbl>,
## #   elevated_Sd <dbl>, id <int>

TODO:

I haven’t figured out how to do a bulk check on this and resolve it. It doesn’t affect too many entries. Hence, it might have to be cleaned by hand.

4. Long versus wide

An example conversion

How to perform a conversion of our current wide format to a long format? Let’s do it for one example. An example for how the wide format is implemented in the current dataset:

df_wide <- df %>% 
  filter(exp_nam=="SwissFACE_lolium2", Data_type=="aboveground_biomass") %>%
  select(exp_nam, factors, treatment, Data_type) %>% 
  distinct() %>%
  mutate( id=1:nrow(.),
          ambient=rep("A", 3), elevated=c("B", "C", "D"),
          ambient_Sd=rep("sdA", 3), elevated_Sd=c("sdB", "sdC", "sdD"),
          ambient_Se=rep("seA", 3), elevated_Se=c("seB", "seC", "seD")) %>%
  print()

## # A tibble: 3 x 11
##   exp_nam factors treatment Data_type    id ambient elevated ambient_Sd
##   <chr>   <chr>   <chr>     <chr>     <int> <chr>   <chr>    <chr>     
## 1 SwissF… cf      c         abovegro…     1 A       B        sdA       
## 2 SwissF… cf      cf        abovegro…     2 A       C        sdA       
## 3 SwissF… cf      f         abovegro…     3 A       D        sdA       
## # … with 3 more variables: elevated_Sd <chr>, ambient_Se <chr>,
## #   elevated_Se <chr>

To convert this to a long format, first get all available factors and create a vector. The available factors are:

df %>% 
  filter(!(factors %in% c("x","x99"))) %>% 
  select(factors) %>% 
  unique() %>% 
  unlist() %>% 
  unname() %>% 
  print()

##  [1] "cwd"  "c"    "cf"   "cfd"  "cfw"  "ci"   "cfi"  "cw"   "cdi"  "cfwi"
## [11] "cd"   "i"    "cdf"  "df"   "f"    "fi"   "co"   "o"    "cwi"  "w"   
## [21] "d"    "wd"   "fd"   "fw"   "wi"   "wdi"  "fdi"

Create a vector with all single factors

factors_avl <- df %>%
  filter(!(factors %in% c("x","x99"))) %>%      # Something weird with this
  select(factors) %>% 
  unique() %>% 
  mutate( factors_sep = strsplit(factors, NULL) ) %>% 
  select(factors_sep) %>% 
  unlist() %>% 
  unname() %>% 
  unique() %>% 
  print()

## [1] "c" "w" "d" "f" "i" "o"

Convert from wide to long, adding columns for all factors for which we have ambient and elevated levels’ data.

wide_to_long_gcme <- function( df_wide, keyvars ){
  
  require(stringr)
  
  ## if level is 'elevated', turn all factor levels that are part of 'treatment' to TRUE
  switch_factor <- function(df){
    if (df$level=="elevated"){
      for (ifactor in strsplit(df$treatment, NULL) %>% unlist()){
        df[[ifactor]] <- TRUE
      }
    }
    return(df)
  }
  
  ## First get available factors
  factors_avl <- df_wide %>%
    filter(!(factors %in% c("x","x99"))) %>%      # Something weird with this
    select(factors) %>% 
    unique() %>% 
    mutate( factors_sep = strsplit(factors, NULL) ) %>% 
    select(factors_sep) %>% 
    unlist() %>% 
    unname() %>% 
    unique()  
  
  df_long <- df_wide %>%
    
    ## gather 'mean' based on columns: 'ambient', 'elevated'
    select( -ambient_Sd, -elevated_Sd, -ambient_Se, -elevated_Se ) %>%
    tidyr::gather(level, mean, c(ambient, elevated)) %>%
    
    ## gather 'sd' based on columns: 'ambient_Sd', 'elevated_Sd'
    left_join( 
      select( df_wide, -ambient, -elevated, -ambient_Se, -elevated_Se ) %>%
        tidyr::gather(level, sd, c(ambient_Sd, elevated_Sd)) %>%
        mutate( level = stringr::str_split(.$level, "_") %>% purrr::map(., 1) %>% unlist() ),
      by = c(keyvars) ) %>% 
    
    ## gather 'se' based on columns: 'ambient_Se', 'elevated_Se'
    left_join( 
      select( df_wide, -ambient, -elevated, -ambient_Sd, -elevated_Sd ) %>%
        tidyr::gather(level, se, c(ambient_Se, elevated_Se)) %>%
        mutate( level = stringr::str_split(.$level, "_") %>% purrr::map(., 1) %>% unlist() ),
      by = c(keyvars) ) %>% 
    
    ## magically add columns corresponding to all available levels of 'factors'
    `is.na<-`(factors_avl) %>% 
    
    ## fill all new factor columns with FALSE
    dplyr::mutate_at( factors_avl, ~FALSE ) 
  
  ## Determine new factor columns based on information in 'treatment' and 'level'
  df_long <- purrr::map_dfr(as.list(1:nrow(df_long)), ~switch_factor(df_long[.,]))
  
  ## Now we have duplicate ambients, treat them separately and stack ambient and elevated together again
  df_long <- df_long %>% filter(level=="ambient") %>% 
    
    # WARNING: This assumes that if values are identical, they are repetitions, ignoring differences w.r.t Sampling_Date
    distinct( exp_nam, factors, Data_type, Unit, Year, mean, sd, se, .keep_all=TRUE ) %>%
    
    # bind them together again. also cleaning replicated rows (note treatment added as column!)
    bind_rows( ., 
               filter(df_long, level=="elevated") %>% 
                 distinct( exp_nam, factors, treatment, Data_type, Unit, Year, mean, sd, se, .keep_all=TRUE )
                 ) %>% 
    
    ## abandon columns, now obsolete with boolean columns for factors 
    select(-treatment, -level, -id)
  
  return(df_long)
}

# run it
keyvars <- c("id", "exp_nam", "factors", "level", "treatment", "Data_type")
valuevars <-  c("ambient", "ambient_Se", "ambient_Sd", "elevated", "elevated_Se", "elevated_Sd")

df_long_example <- df_wide %>%
  select(one_of(keyvars, valuevars)) %>%
  wide_to_long_gcme( keyvars ) %>%
  print()

## Loading required package: stringr

## # A tibble: 4 x 8
##   exp_nam          factors Data_type          mean  sd    se    c     f    
##   <chr>            <chr>   <chr>              <chr> <chr> <chr> <lgl> <lgl>
## 1 SwissFACE_loliu… cf      aboveground_bioma… A     sdA   seA   FALSE FALSE
## 2 SwissFACE_loliu… cf      aboveground_bioma… B     sdB   seB   TRUE  FALSE
## 3 SwissFACE_loliu… cf      aboveground_bioma… C     sdC   seC   TRUE  TRUE 
## 4 SwissFACE_loliu… cf      aboveground_bioma… D     sdD   seD   FALSE TRUE

Converting the entire table

Let’s apply this to the full data frame. First, keep only some of the columns:

keyvars <- c("id", "exp_nam", "factors", "treatment", "level", "Data_type", "Unit", "Start_Year", "Year",  "n_plots")
valuevars <-  c("ambient", "ambient_Se", "ambient_Sd", "elevated", "elevated_Se", "elevated_Sd")

df_long <- df %>% 
  select(one_of(keyvars, valuevars)) %>%
  # filter(exp_nam=="BioCON" & Data_type=="leaf_N" & Year==4) %>%  # xxx debug
  wide_to_long_gcme(keyvars) %>% 
  write_csv(path="~/data/gcme/data_received_190325/NewData_LONG.csv")

Thus, we have a long-format data table, that has less than twice as many rows as the original table, because the “absolute ambient” is no longer repeated:

print(nrow(df)*2)
nrow(df_long)

Test

Test, whether the conversion was doing alright.

check_cf <- function(df_wide, df_long, verbose=FALSE){

  ## check cf experiments
  df_cf <- df_wide %>% filter(treatment %in% c("f","c","cf","fc"))

  out <- tibble()

  all_exp <- df_cf$exp_nam %>% unique

  for (iexp in all_exp){

    df_sub <- df_cf %>% filter(exp_nam==iexp)

    all_var <- df_sub$Data_type %>% unique

    for (ivar in all_var){

      df_subsub <- df_sub %>% filter(Data_type==ivar)

      all_var_unit <- df_subsub$Unit %>% unique

      for (iunit in all_var_unit){

        df_subsubsub <- df_subsub %>% filter(Unit==iunit)

        # elevated-f treatment
        correct_f <- df_subsubsub %>% 
          filter(treatment=="f") %>% 
          select(exp_nam, Data_type, Unit, Year, mean=elevated, se=elevated_Se, sd=elevated_Sd) %>%
          all_equal( ., df_long %>% filter(f & !c & exp_nam==iexp & Data_type==ivar & Unit==iunit) %>% select(exp_nam, Data_type, Unit, Year, mean, se, sd) )

        # elevated-c treatment
        correct_c <- df_subsubsub %>% 
          filter(treatment=="c") %>% 
          select(exp_nam, Data_type, Unit, Year, mean=elevated, se=elevated_Se, sd=elevated_Sd) %>%
          all_equal( ., df_long %>% filter(!f & c & exp_nam==iexp & Data_type==ivar & Unit==iunit) %>% select(exp_nam, Data_type, Unit, Year, mean, se, sd) )

        # elevated-cf treatment
        correct_cf <- df_subsubsub %>% 
          filter(treatment=="cf") %>% 
          select(exp_nam, Data_type, Unit, Year, mean=elevated, se=elevated_Se, sd=elevated_Sd) %>%
          all_equal( ., df_long %>% filter(f & c & exp_nam==iexp & Data_type==ivar & Unit==iunit) %>% select(exp_nam, Data_type, Unit, Year, mean, se, sd) )

        # (absolute) control: WARNING: DUPLICATED ROWS
        correct_0 <- df_subsubsub %>% 
          distinct(exp_nam, Data_type, Unit, Year, ambient, ambient_Se, ambient_Sd, .keep_all=TRUE) %>% 
          select(exp_nam, Data_type, Unit, Year, mean=ambient, se=ambient_Se, sd=ambient_Sd) %>%
          all_equal( ., df_long %>% filter(!f & !c & exp_nam==iexp & Data_type==ivar & Unit==iunit) %>% select(exp_nam, Data_type, Unit, Year, mean, se, sd) )
        
        if (verbose){
        if (!identical(correct_0, TRUE)){
          rlang::warn(correct_0)
          print("WIDE:")
          df_subsubsub %>%
            distinct(exp_nam, Data_type, Unit, Year, ambient, ambient_Se, ambient_Sd, .keep_all=TRUE) %>%
            select(exp_nam, Data_type, Unit, Year, mean=ambient, se=ambient_Se, sd=ambient_Sd) %>%
            print()
          print("LONG:")
          df_long %>%
            filter(!f & !c & exp_nam==iexp & Data_type==ivar & Unit==iunit) %>%
            select(exp_nam, Data_type, Unit, Year, mean, se, sd) %>%
            print()
          print("---------")
        }
        }

        out <- bind_rows(out, c(factors="cf", exp_nam=iexp, Data_type=ivar, Unit=iunit, correct_0=correct_0, correct_c=correct_c, correct_f=correct_f, correct_cf=correct_cf))

      }
    }
  }

  return(out)
}
#out <- check_cf( df, df_long) %>% head()

Issue:

There are still lots of problems with identifying the ambients (see column correct_0 in out). See for example:

## note row 18 is missing in wide
out <- check_cf( filter(df, exp_nam=="BioCON", Data_type=="leaf_N", Year==4), df_long, verbose = TRUE)

TODO:

Make sure the wide-to-long conversion works fine. That is, out should have TRUE in all columns.

Converting back with a twist

In the original data, ambient was the “absolute ambient”. For example in the cf treatment, ambient was no treatment at all (control), and elevated was c \(\times\) f. We may re-design a wide version of the table and allow flexibility in what treatments are compared to each other, i.e., what ambient means. I would argue that this depends whether we are interested in a single effect or in interactions. In our case, we are interested primarily in the CO2 effect. Hence, in a factorial experiment cf, we may want to compare c to 0 (an “absolute ambient”), and cf to f and thus have more data available to evaluate the CO2-only effect.

Let’s implement this, by converting our example back to a CO2-wide format:

long_to_wide_gcme <- function(df_long, keyvar){

  joinvars <- names(df_long)[-which(names(df_long) %in% c("mean", "sd", "se", keyvar))]
  factors_all <- df_long %>%
    filter(!(factors %in% c("x","x99"))) %>%      # Something weird with this
    select(factors) %>% 
    unique() %>% 
    mutate( factors_sep = strsplit(factors, NULL) ) %>% 
    select(factors_sep) %>% 
    unlist() %>% 
    unname() %>% 
    unique()
  factorvars <- factors_all[-which(factors_all==keyvar)]
    
  # take all data where factor 'c' is TRUE
  df_wide <- dplyr::filter(df_long, eval(parse(text=keyvar)) ) %>% 
    
    # call this 'elevated'
    dplyr::rename(elevated=mean, elevated_sd=sd, elevated_se=se) %>% 
    
    # remove column 'c', is no longer used
    dplyr::select(-keyvar) %>% 
    
    # merge this with the corresponding row where all other factors are the same, while 'c' is FALSE
    left_join( 
      
      # take all data where factor 'c' is FALSE
      dplyr::filter(df_long, !(eval(parse(text=keyvar)))) %>% 

        # call this 'ambient'
        dplyr::rename(ambient=mean, ambient_sd=sd, ambient_se=se) %>% 

        # remove column 'c', is no longer used
        dplyr::select(-keyvar),
      
      ## merge by all other columns
      by = joinvars ) %>% 
    
    # order columns
    select( c(joinvars, ambient, elevated, ambient_se, elevated_se, ambient_sd, elevated_sd)  )

  return(df_wide)
}

Run it

df_long_example %>%
  long_to_wide_gcme("c") %>%
  print()

## # A tibble: 2 x 10
##   exp_nam factors Data_type f     ambient elevated ambient_se elevated_se
##   <chr>   <chr>   <chr>     <lgl> <chr>   <chr>    <chr>      <chr>      
## 1 SwissF… cf      abovegro… FALSE A       B        seA        seB        
## 2 SwissF… cf      abovegro… TRUE  D       C        seD        seC        
## # … with 2 more variables: ambient_sd <chr>, elevated_sd <chr>

The same can be done w.r.t. the fertilisation treatment, with two rows corresponding now to whether or not CO2 was elevated.

df_long_example %>%
  long_to_wide_gcme("f") %>%
  print()

## # A tibble: 2 x 10
##   exp_nam factors Data_type c     ambient elevated ambient_se elevated_se
##   <chr>   <chr>   <chr>     <lgl> <chr>   <chr>    <chr>      <chr>      
## 1 SwissF… cf      abovegro… TRUE  B       C        seB        seC        
## 2 SwissF… cf      abovegro… FALSE A       D        seA        seD        
## # … with 2 more variables: ambient_sd <chr>, elevated_sd <chr>

Let’s apply this to the full data frame.

keyvars <- c("ALIAS", "exp_nam", "factors", "Data_type", "Unit", "Sampling_date", "n_plots")
valuevars <-  c("mean", "sd", "se")

df_wide_c <- df_long %>% 
  # select(one_of(keyvars, valuevars, factors_avl)) %>%
  # select(-prev_name, -TT_Nut_Detail) %>% 
  long_to_wide_gcme("c") %>% 
  write_csv(path="~/data/gcme/data_received_190325/NewData_WIDE_c.csv")

TODO:

Develop a test to check the new-wide format against the long or against the original-wide table.

Data selection

The initial draft of the figures we want to draw looks like this:

The following is a list of variables we need according to this draft, and the corresponding (possibly) relevant variables (Data_type) in our dataset.

Add two columns to the data. One for the new standard variable name (varnam), taken from CSV files. And one for this specific study where multiple variables (based on varnames) are pooled because they contain relevant and comparable information when evaluating response ratios (named my_varnam).

ANPP

ANPP: my_varnam = my_anpp, and varnam %in% c("anpp", "anpp_bm").

Collect information about experiments providing data for any of c("anpp", "anpp_bm"). Use file table_var_exp_names_anpp.csv and manually add a new row varnam there, translating available variable names (Data_type) into new, unified variable names. Add additional information from the experiments table (table_var_exp_names_experiments.csv) and from the data frame df_experiments, which is used to associate new and old experiments names.

df_anpp <- read_csv("~/data/gcme/data_received_190325/table_var_exp_names_anpp.csv") %>% 
  select(prev_name=expnam, varnam, Data_type, Source_Reference) %>%
  distinct() %>% 
  filter(varnam %in% c("anpp", "anpp_bm")) %>% 
  
  ## add 'exp_nam' (new experiment name) from data table
  left_join( df_experiments, by="prev_name" ) %>% 
  
  ## add experiments meta information from experiments table
  left_join( 
    read_csv("~/data/gcme/data_received_190325/table_var_exp_names_experiments.csv") %>% 
      select(prev_name, Fumigation_type, Cquantity_Control, Cquantity_Treatment, Vegetation_type, Plants_specs, Start_Year, End_Year, Remarks),
    by = c("prev_name")
    )

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   ALIAS = col_double(),
##   Data_Number = col_double(),
##   ambient = col_double(),
##   ambient_Se = col_double(),
##   ambient_Sd = col_double(),
##   elevated = col_double(),
##   elevated_Se = col_double(),
##   elevated_Sd = col_double()
## )

## See spec(...) for full column specifications.

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Experiment Number` = col_double(),
##   `N Quantity control` = col_double(),
##   `P Quantity control` = col_double(),
##   `K Quantity control` = col_double(),
##   `K Distribution` = col_logical()
## )

## See spec(...) for full column specifications.

Add my_varnam to the data table, defined by my_varnam = ifelse( varnam %in% c("anpp", "anpp_bm"), "my_anpp", my_varnam ). WARNING: This assumes that variable names ‘Data_type’ are identical in old and new dataset (manually replaced white spaces with underscores in table_var_exp_names_anpp.csv, first)

df <- df %>% 
  mutate( varnam=NA ) %>% 
  left_join( select( df_anpp, -prev_name ), by=c("exp_nam", "Data_type") ) %>% 
  distinct(id, .keep_all=TRUE) %>% 
  rename( varnam = varnam.x ) %>% 
  rowwise() %>% 
  mutate( varnam = ifelse(!is.na(varnam.y), varnam.y, varnam ) ) %>% 
  select( -varnam.y ) %>% 
  mutate( my_varnam = NA ) %>% 
  mutate( my_varnam = ifelse( varnam %in% c("anpp", "anpp_bm"), "my_anpp", my_varnam ) )

Test analysis and plot of ANPP data. Calculate the response ratio of ANPP (mean and variance) for each experiment. To get that, we first need to calcuate the means and standard deviation for the ambient and elevated levels, pooling multiple measurements (years, sampling dates), each given with mean \(\mu_i\), number \(N_i\) (replicates/plots), and standard deviation \(\sigma_i\) or standard error. For the function metafor::escalc(), we need standard deviations (\(SD\)). Calculate them for those rows where only standard errors \(SE\) are given as: \[ SD = SE \sqrt{N} \] Then, the pooled means are: \[ \mu = \frac{ \sum_i N_i \mu_i }{\sum_i N_i} \] and standard deviations: \[ \sigma = \sqrt{ \frac{\sum_i (N_i-1)\sigma_i^2}{\sum_i (N_i-1)} } \]

QUESTION TO CESAR:

Is there no component in \(\sigma\) that accounts for the variance between the sample means?

library(metafor)   # see ?dat.curtis1998 for an example with CO2 manipulation data

## Loading required package: Matrix

## Loading 'metafor' package (version 2.0-0). For an overview 
## and introduction to the package please type: help(metafor).

mean_pooled <- function( x, n ){
  sum(x * n)/sum(n)
}
sd_pooled <- function( sd, n ){
  sqrt( sum( (n-1)*sd^2 ) / sum(n-1) )
}

## aggregate by variable and experiment, pooling multiple years, sampling dates, and plots/replicates and calculate log response ratio
df_c_anpp_agg <- df %>%                                         
  filter(treatment=="c") %>%   # filter experiments with only manipulated CO2 (no other factors manipulated, strong reduction of data)
  filter(my_varnam=="my_anpp") %>% 
  mutate( my_ambient_sd = ambient_Sd, my_elevated_sd = elevated_Sd ) %>%   # get standard deviation for all data
  rowwise() %>% 
  mutate( my_ambient_sd   = ifelse( is.na(my_ambient_sd),  ambient_Se  * sqrt(n_plots), my_ambient_sd ),
          my_elevated_sd  = ifelse( is.na(my_elevated_sd), elevated_Se * sqrt(n_plots), my_elevated_sd )) %>%

  ## pool all measurements (multiple years, sampling dates and plots) by variable and experiment for meta analysis
  group_by( exp_nam, treatment, my_varnam ) %>%                        # prepare for aggregation
  summarise( ambient     = mean_pooled( ambient, n_plots ), 
             elevated    = mean_pooled( elevated, n_plots ),
             ambient_sd  = sd_pooled( my_ambient_sd, n_plots ),
             elevated_sd = sd_pooled( my_elevated_sd, n_plots ),
             n_plots = sum(n_plots) ) %>% 

  ## Get logarithm of response ratio and its variance
  metafor::escalc( 
    measure = "ROM", 
    m1i = elevated, sd1i = elevated_sd, n1i = n_plots, 
    m2i = ambient,  sd2i = ambient_sd,  n2i = n_plots, 
    data=., 
    append = TRUE, var.names = c("logr", "logr_var") ) %>% 
  as_tibble() %>% 
  mutate( logr_se = sqrt(logr_var)/sqrt(n_plots) )

## aggregate by variable, doing a meta-analysis of the log response ratios, with experiment as random factor (method="DL")
agg_meta <- function(df, groupvar){
  out_meta <- df %>% dplyr::filter(my_varnam==eval(parse_character(groupvar))) %>% 
    metafor::rma( logr, logr_var, method = "DL", data = . )
  
  # transform back
  out_meta_scaled <- predict( out_meta, transf=exp )
  
  tibble(
    my_varnam=groupvar, 
    middle = out_meta$b[1,1], 
    ymin   = out_meta$ci.lb, 
    ymax   = out_meta$ci.ub,
    
    middle_scaled = out_meta_scaled$pred, 
    ymin_scaled   = out_meta_scaled$ci.lb, 
    ymax_scaled   = out_meta_scaled$ci.ub
  )
}

# do meta-analysis all variables
df_meta <- purrr::map_dfr(as.list(unique(df_c_anpp_agg$my_varnam)), ~agg_meta(df_c_anpp_agg, .))

# Plot
library(ggplot2)
df_c_anpp_agg %>%
  ggplot( aes(x=my_varnam, y=logr)) +
  geom_jitter( color = rgb(0,0,0,0.3), aes( size = 1/logr_se ), position = position_jitter(w = 0.2, h = 0) ) +
  geom_crossbar( data = df_meta, aes(x=my_varnam, y=middle, ymin=ymin, ymax=ymax), fill = "grey80", alpha = 0.6, width = 0.5 ) +
  coord_flip() +
  ylim(-1,1)

This shows the log-response ratio of ANPP, each dot for one experiment, multiple years, sampling dates, and plots pooled. The size of the dots represents the standard error of the response ratio (the smaller, the bigger). The box represents the meta-analysis mean and confidence interval of the log response ratio.

And some information about the data used for my_anpp (only top 20 rows of a total of 110 available rows)

df_anpp %>% select(-prev_name, -Cquantity_Control) %>% distinct(varnam, Data_type, exp_nam, .keep_all=TRUE) %>% slice(1:10) %>% knitr::kable()

varnam	Data_type	Source_Reference	exp_nam	Fumigation_type	Cquantity_Treatment	Vegetation_type	Plants_specs	Start_Year	End_Year	Remarks
anpp	ANPP	Adair_et_al.,_2009	BioCON	FACE	ambient, 560 (+180)	perennial grassland	Plots were seeded initially with different biodiversity ratios between C3, C4, perennial grasses, forbs and legumes.	1998	NA	NA
anpp	ANPP	Schlesinger_et_al.,_2006	DUKE	FACE	ambient, +200	Pinus taeda	Pinus taeda forest (98%), other 2% is Acer rubrum, Ulmus alata, Liquidambar styraciflua, Liriodendron tulipifera and Cercis canadensis	1996	NA	NA
anpp	ANPP	Liberloo_et_al.,_2005	EUROFACE4_pa	FACE	ambient, 550	Populus alba	hardwood cuttings (3 year SRC)	1999	NA	NA
anpp	ANPP	Liberloo_et_al.,_2005	EUROFACE4_pe	FACE	ambient, 550	Populus x euramericana	hardwood cuttings (3 year SRC)	1999	NA	NA
anpp	ANPP	Liberloo_et_al.,_2005	EUROFACE4_pn	FACE	ambient, 550	Populus nigra	hardwood cuttings (3 year SRC)	1999	NA	NA
anpp	ANPP	Hochstrasser_et_al._2012	GiFACE	FACE	450–>ambient+20%	temperate, mixed-species pasture	12 grass sp, 2 legumes, 15 non-leguminous herbs; characterized as an Arrhenatheretum elatioris Br.Bl. Filipendula ulmaria sub-community	1998	NA	NA
anpp	ANPP	Dukes_et_al.,_2005	JRBP_FACE1	FACE	ambient, +300	California annual grassland	Avena barbata and A. fatua, Bromus hordeaceus and Lolium multiflorum dominate the site	1998	NA	NA
anpp	ANPP	Dukes_et_al.,_2005	JRBP_FACE2	FACE	ambient, 680	California annual grassland	Avena barbata and A. fatua, Bromus hordeaceus and Lolium multiflorum dominate the site	1998	NA	NA
anpp	ANPP	Henry_et_al.,_2006	JRBP_FACE3	FACE	ambient, +300	California annual grassland	Avena barbata and A. fatua, Bromus hordeaceus and Lolium multiflorum dominate the site	1998	NA	NA
anpp	ANPP	Dijkstra_et_al.,_2002	MI	OTC	410, 760	Shrub Oak system	Quercus myrtifolia (76%), Q. Geminata (15%), Q. Chapmanii (7%), Serenoa repense and Lyonia ferreginea	1996	NA	NA

Issue:

There are un-interpretable characters in files, concerning (most problematically) the experiments names. The problem is now, that these are in all data that is read in from multiple CSVs. Changing this by hand has to make sure that they are changed the same way in multiple files.
Some old experiment names could not be associated with new experiments name because the old names given in the Excel table (sent 25.2.19) are not detected in the new data table (sent 25.3.19). This old-new name association is based on the columns exp_nam and prev_name in the new data table. For the following old experiment names, no new names could be found:

df_anpp %>% filter(is.na(exp_nam)) %>% select(prev_name) %>% distinct() %>% unlist() %>% unname() %>% print()

## [1] "N\x8cntuna_c"

CO2 experimental data analysis

Beni Stocker

2019-03-28

Data screening

0. Issues

Sampling date

Duplicate rows

Distinct ambients

4. Long versus wide

An example conversion

Converting the entire table

Test

Converting back with a twist

Data selection

ANPP