Goal

The objective here is to make a clean reproducible pipeline to transform Netlogo output into a tidy rectangular dataframe. Here, I import behavior space data (i.e., experiment results) from Netlogo to R. Netlogo’s behavior space data comes in a wonky format, and the goal is to do all tidying in R (vs. on spreadsheet) so it’s easy to feed through new data. The data I’ve imported here does several ‘runs’ (multiple treatments and replicates) of an simulation testing effects of phenology on species interactions. There is summary data for each run and there’s also time series info on size for each turtle/agent.

Import Data

First, be sure you export behavior space data from Netlogo in “table” format. When the raw data is exported as a spreadsheet, it comes in a really gnarly format. R can’t read it because it’s not even close to rectangular, big header, mix of space delimited and bracket delimited. It’s a mess. The table is still a bit messy, but much easier to wrangle.

test <- read.table("bs2.csv",
                   header = T,  
                   sep = ',',        # define the separator between columns
                   skip = 6,         # 6 header rows we don't need
                   quote = "\"",     # correct the column separator
                   fill = T)         # add blank fields if rows have unequal length

# [,1:12] are parameter values (some consistent across all, some representing trts)
# [,13:ncol] are the output we designated
str(test)

## 'data.frame':    60 obs. of  16 variables:
##  $ X.run.number.              : int  2 4 3 1 5 6 8 7 9 10 ...
##  $ show.label.                : Factor w/ 1 level "false": 1 1 1 1 1 1 1 1 1 1 ...
##  $ n.fishes                   : int  50 50 50 50 50 50 50 50 50 50 ...
##  $ sprout.tick                : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ var.hatch.dflies           : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ mean.hatch.dflies          : int  10 10 10 10 15 15 15 15 25 25 ...
##  $ n.dflies                   : int  50 50 50 50 50 50 50 50 50 50 ...
##  $ growth.per.patch           : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ grass.grow.rate            : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ var.hatch.fishes           : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ grass.death.rate           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ X.step.                    : int  149 149 149 149 149 149 149 149 149 149 ...
##  $ n.meta.dflies              : int  50 48 49 48 50 50 50 50 22 22 ...
##  $ n.meta.fishes              : int  12 10 10 10 9 7 12 9 22 24 ...
##  $ X.dfly.size.list..of.dflies: Factor w/ 60 levels "[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"| __truncated__,..: 58 60 57 56 52 42 50 48 40 37 ...
##  $ X.fish.size.list..of.fishes: Factor w/ 60 levels "[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"| __truncated__,..: 14 4 15 9 11 31 3 8 30 28 ...

Inspect Data

First, let’s rename columns and drop some that we don’t need. I keep some columns that are the same across all model runs because it will be useful to call them later, especially since different runs of the experiment might have differetn values. Some columns will always be the same, or are irrelevant. These I drop.

For this model, I had two turtle types (fishes and dflies), and varied the timing of hatching and densities of each. Outputs I collected from Netlogo were number of survivors of each turtle type and a vector of sizes for each individual turtle.

# select and rename columns. new_name = old name. 
# renaming everything I keep, so can rename with select
test <- test %>% 
  select(run_num = X.run.number.,                  # each run number is an experimental unit
         total_time = X.step.,                     # length of time the experiment ran
         n_fish = n.fishes,                        # two types of "turtles" here: fish & dfly
         n_dfly = n.dflies,                        # n_fish/dfly is the initial number
         sync_dfly = var.hatch.dflies,             # amount of variation in ind hatching timing
         mean_dfly = mean.hatch.dflies,            # mean hatching timing
         surv_dfly = n.meta.dflies,                # number that 'metamorphed', i.e., size >10
         sync_fish = var.hatch.fishes,             # amount of variation in ind hatching timing
         surv_fish = n.meta.fishes,                # number that 'metamorphed', i.e., size >10
         sizes_dfly = X.dfly.size.list..of.dflies, # list of size for each ind for each time step
         sizes_fish = X.fish.size.list..of.fishes) # list of size for each ind for each time step

# Extract parameter values-- these are the 'treatments' and inputs of the model for each run
# I'll rejoin the output data with params later, after some summarizing/processing
# Here, add qualitative levels for the treatment parameters. This will be different for each BS
params <- test[,1:9]
params$mean_dfly_num <- params$mean_dfly
params$sync_dfly_num <- params$sync_dfly
params$sync_dfly <- as.factor(params$sync_dfly)
params$mean_dfly <- as.factor(params$mean_dfly)
levels(params$sync_dfly) <- c("high", "med", "low")
levels(params$mean_dfly) <- c("dfly v early", "dfly early", "same", "fish early", "fish v early")

Parse Individuals

The ‘sizes_x’ columns have a time series of size for each individual (so each ‘sizes_x’ column has 50 individuals x 150 time steps values). We ultimately want one row per individual per time step. Let’s first separate by individual.

# First, separate so that each individual has it's own column
test <- test %>% 
  separate(sizes_dfly,                                      # separate sizes_dfly
           into = paste("d_", c(1:test$n_dfly), sep = ""),  # levels for new var
           sep = "]") %>%                                   # every ] marks a new ind
  separate(sizes_fish, 
           into = paste("f_", c(1:test$n_fish), sep = ""), 
           sep = "]")

# Next, I'll gather individuals so that they appear in rows
# I separate this step by species because it's easier for me to catch errors

## FIRST, DRAGONFLIES 
# Select only unique dfly individuals and gather to long format
test_dfly <- test %>%
  select(run_num, d_1:d_50) %>% 
  unique(.) %>% 
  gather(d_1:d_50, key = "dfly_id", value = 'size_dfly') %>% 
  arrange(run_num)

# Remove the brackets
test_dfly$size_dfly <- gsub("[", "", test_dfly$size_dfly, fixed = T)

## NEXT, FISH
# Select only unique fish individuals and gather to long format
test_fish <- test %>%
  select(run_num, f_1:f_50) %>% 
  unique(.) %>% 
  gather(f_1:f_50, key = "fish_id", value = 'size_fish') %>% 
  arrange(run_num)

# Remove the brackets
test_fish$size_fish <- gsub("[", "", test_fish$size_fish, fixed = T)

Parse Time Series

Now, do the same separate and gather operations to parse the time vector and put in long format.

## FIRST, DRAGONFLIES 
# Separate the size time series- each space represents a new time point
test_dfly <- test_dfly %>% 
  separate(size_dfly, into = as.character(c(0:149)), sep = " ") %>% 
  select(-c(as.character(0))) # I drop this because there's only a value for ind 1

# Gather to long format, so we now have one line per dfly per time step
test_dfly <- test_dfly %>% 
  group_by(run_num) %>% 
  gather(as.character(1:149), key = "time", value = 'current_size_dfly') %>%
  arrange(run_num, dfly_id)

## NEXT, FISH
# Separate the size time series- each space represents a new time point
test_fish <- test_fish %>% 
  separate(size_fish, into = as.character(c(0:149)), sep = " ") %>% 
  select(-c(as.character(0))) # I drop this because there's only a value for ind 1

# Gather to long format, so we now have one line per dfly per time step
test_fish <- test_fish %>% 
  group_by(run_num) %>% 
  gather(as.character(1:149), key = "time", value = 'current_size_fish') %>%
  arrange(run_num, fish_id)

Compile & Tidy

Now, join the two species’ dataframes together, rejoin with the treatment information, and write a csv. This csv has one line per individual per time step. So nrow should be max(run_num) x (ndfly + nfish) x total_time

# Paste dfs together and remove redundant variables
test <- cbind(test_dfly, test_fish)
test <- test %>% 
  select(-c(run_num1, time1)) 

# This is clunkier than it needs to be, but allows some double checking
test$dfly <- paste(test$dfly_id, test$current_size_dfly)
test$fish <- paste(test$fish_id, test$current_size_fish)
test1 <- test %>% 
  select(run_num, time, dfly, fish)
test2 <- test1 %>% 
  gather(key = "species", "size", dfly, fish)
test2$time <- as.numeric(test2$time)

test2 <- test2 %>% 
  separate(size, into = c("id", "size"), sep = " ") %>% 
  separate(id, into = c("sp", "id"), sep = "_") %>% 
  select(-sp) %>% 
  arrange(run_num, id, time)
test2$size <- as.numeric(test2$size)

# Rejoin with treatment identifiers for each run number
test <- left_join(test2, params, by = "run_num")
ind_time <- test %>%
  select(run_num, n_fish, n_dfly, sync_fish, sync_dfly, mean_dfly,
         surv_fish, surv_dfly, time, species,
         id, size)

write.csv(ind_time, "individual_time.csv")

Summarize by Individual

Now, I want to make some summarised datasets. First with one row per individual, then with one row per run number

ind_time <- read.csv("individual_time.csv", header = T)

ind <- ind_time %>% 
  group_by(run_num, species, id, sync_fish, sync_dfly, mean_dfly, surv_dfly, surv_fish) %>%
  summarise(max_size = max(size),
            meta = if_else(max_size >= 10, 1, 0),
            hatch_date = min(time[size > 0]),
            # use base R ifelse here, so we can have 2 data types for T/F
            meta_date = ifelse(meta == 1, min(time[size >= 10]), NA), 
            death_date = ifelse(meta == 0, min(time[size = max(max_size)]), NA),
            growth_time = meta_date - hatch_date)

# Currently, number of metamorphs doesn't match raw data. think it might be a >= mismatch. check.
check <- ind %>% 
  group_by(run_num, species, surv_dfly, surv_fish) %>%
  summarise(no_meta = sum(meta))
ggplot(subset(check, subset = (species == "fish")), aes(x = surv_fish, y = no_meta)) +
  geom_point() + theme_bw() +
  geom_abline(slope = 1, intercept = 0)

Summarize by Treatment

Now, make a df with 1 row per unique treatment with means/se’s

trt <- params %>% 
  group_by(sync_dfly, mean_dfly, mean_dfly_num, sync_dfly_num) %>% 
  summarise(mean_surv_dfly = mean(surv_dfly),
            mean_surv_fish = mean(surv_fish),
            se_surv_dfly   = sd(surv_dfly)/sqrt(length(surv_dfly)),
            se_surv_fish   = sd(surv_fish)/sqrt(length(surv_fish)))

ind$sync_dfly <- factor(ind$sync_dfly, levels = c("low", "med", "high"))
ind$mean_dfly <- factor(ind$mean_dfly, levels = c("dfly v early", "dfly early", "same", "fish early", "fish v early"))
trt$sync_dfly <- factor(trt$sync_dfly, levels = c("low", "med", "high"))
trt$mean_dfly <- factor(trt$mean_dfly, levels = c("dfly v early", "dfly early", "same", "fish early", "fish v early"))

Plots

Here are some diagnostic plots to get a quick look at the data. These will give an idea of what parameter space is reasonable and interesting to explore in future behavior space runs.

Netlogo to R Pipeline

Shannon Carter

March 11, 2019