Bird Trait Networks example

Let’s have a look at an example of documentation I created with R Markdown, using the data I compiled for the Bird Trait Networks project.

So I have compiled my large dataset and I want to start exploring it. The plain text chunks in an R Markdown document (.Rmd) are a great space to document the procedures and methods used to produce the data. I’ve spared you that here for my dataset, but with your own example dataset you should use this space to include as much useful detail as you can, so that the methods used to generate your data are as understandable as possible.

Data

Let’s first have a look at what we are dealing with, starting by loading the data and the metadata:

### SETTINGS ##############################################################
input.folder <- "~/Documents/WORK/ACCE Data management course/workflow/inputs/exercises/metadata/"

### FILES #################################################################

# stringsAsFactors = FALSE keeps text columns (e.g. levels) as character strings
meta <- read.csv(paste0(input.folder, "metadata.csv"), stringsAsFactors = FALSE)
dd   <- read.csv(paste0(input.folder, "data.csv"))

### PACKAGES #################################################################

library(knitr) # needed for function kable

So now the data have been loaded into the R environment. I’ll use the kable function to make a cool HTML table of the first 30 rows of the data:

kable(head(dd, 30), caption = "Table 1: Sample of the Bird Trait Networks dataset")

Table 1: Sample of the Bird Trait Networks dataset
species	max.altitude	inc	dev.mode	courtship.feed.m	song.dur	breed.system
Abroscopus_albogularis	NA	NA	NA	NA	NA	NA
Abroscopus_superciliaris	NA	NA	NA	NA	NA	NA
Acanthagenys_rufogularis	NA	NA	NA	NA	NA	NA
Acanthidops_unicolor	NA	NA	NA	NA	NA	NA
Acanthis_flammea	1400	10.0	1	1	19.8	1
Acanthis_hornemanni	NA	12.0	NA	1	NA	NA
Acanthisitta_chloris	NA	19.5	2	NA	NA	4
Acanthiza_apicalis	NA	NA	NA	NA	NA	NA
Acanthiza_chrysorrhoa	NA	NA	NA	NA	NA	NA
Acanthiza_lineata	NA	NA	NA	NA	NA	NA
Acanthiza_nana	NA	NA	NA	NA	NA	NA
Acanthiza_pusilla	NA	NA	NA	NA	NA	NA
Acanthiza_reguloides	NA	NA	NA	NA	NA	NA
Acanthiza_uropygialis	NA	NA	NA	NA	NA	NA
Acanthorhynchus_superciliosus	NA	NA	NA	NA	NA	NA
Acanthorynchus_tenuirostris	NA	NA	NA	NA	NA	NA
Accipiter_badius	NA	30.0	2	NA	NA	1
Accipiter_bicolor	NA	NA	NA	NA	NA	NA
Accipiter_brevipes	NA	32.5	2	NA	NA	1
Accipiter_cirrocephalus	NA	NA	NA	NA	NA	NA
Accipiter_cooperii	NA	24.0	2	NA	NA	5
Accipiter_fasciatus	NA	30.0	NA	NA	NA	NA
Accipiter_gentilis	NA	33.0	2	NA	NA	1
Accipiter_melanoleucus	NA	37.5	2	NA	NA	1
Accipiter_nisus	1930	34.0	2	NA	NA	5
Accipiter_novaehollandiae	NA	NA	NA	NA	NA	NA
Accipiter_striatus	NA	34.0	2	NA	NA	1
Acridotheres_cristatellus	NA	15.0	NA	NA	NA	NA
Acridotheres_tristis	NA	15.5	2	0	NA	1
Acrocephalus_agricola	NA	NA	NA	NA	NA	NA
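
Table 1 already hints at how sparse the data are: many of the first 30 species have no values for any trait. A minimal sketch of how that sparsity could be quantified per variable follows. It uses a small stand-in data frame built from four rows of Table 1 rather than the full dd, and the name dd.sample is hypothetical.

```r
# Stand-in for dd: four rows and three traits copied from Table 1
# (the real dataset is much larger)
dd.sample <- data.frame(
  species      = c("Abroscopus_albogularis", "Acanthis_flammea",
                   "Acanthis_hornemanni", "Accipiter_nisus"),
  max.altitude = c(NA, 1400, NA, 1930),
  inc          = c(NA, 10, 12, 34),
  dev.mode     = c(NA, 1, NA, 2),
  stringsAsFactors = FALSE
)

# number of non-missing observations per column
n.obs <- colSums(!is.na(dd.sample))

# proportion of missing values per column
prop.na <- colMeans(is.na(dd.sample))

print(n.obs)
print(prop.na)
```

Running the same two lines on the full dataset would give the n for every variable in one go.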

Exploratory plots

To begin with, I want to do some basic sanity checks. First, I might want to check the distribution of the data for each variable, which can help me identify outliers or other data entry errors. Here the metadata table can be very useful. Let’s have a quick look at it:

Table 2: Variable metadata
code	orig.vname	cat	descr	scores	levels	type	units
max.altitude	Altitude	ECOLOGY	Maximum altitudinal distribution	NA	NA	con	m
inc	Incubation period	LIFE-HISTORY	Incubation period	NA	NA	con	days
dev.mode	Developmental mode	LIFE-HISTORY	Developmental mode	1;2;3	Altricial;Semiprecocial;Precocial	cat	NA
courtship.feed.m	Courtship feeding (by the male)	SEXUAL SELECTION	Courtship feeding (by the male)	0;1	FALSE;TRUE	bin	NA
song.dur	Song duration	BEHAVIORAL	Song duration	NA	NA	con	seconds
breed.system	Breeding system	BEHAVIORAL	Which adult(s) provides the majority of care:	1;2;3;4;5	Pair;Female;Male;Cooperative;Occassional	cat	NA
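
The code behind Table 2 is not shown here. One way such a table could be rendered, assuming the metadata sits in a data frame like meta, is the same kable call used for Table 1. The sketch below uses a two-row stand-in copied from Table 2 so that it runs on its own. The name meta.sample is hypothetical, and only some of the columns are included.

```r
library(knitr)

# Stand-in for meta: two rows and six columns copied from Table 2
meta.sample <- data.frame(
  code   = c("inc", "dev.mode"),
  descr  = c("Incubation period", "Developmental mode"),
  scores = c(NA, "1;2;3"),
  levels = c(NA, "Altricial;Semiprecocial;Precocial"),
  type   = c("con", "cat"),
  units  = c("days", NA),
  stringsAsFactors = FALSE
)

# build the table; printing it in an .Rmd chunk renders it in the document
tab <- kable(meta.sample, caption = "Table 2: Variable metadata")
tab
```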

Descriptive plot axes labels

Firstly, while the coded variable names are reasonably descriptive, I want to make sure the variables are clearly specified in their plots. I also want to include units. All this makes the plots, and therefore the data, more understandable to both me and my collaborators. So I can use information in the metadata to construct more informative axis labels.

For this I’ve created the function axisLabel that takes a single row of the metadata dataframe (the row containing the information for the variable I want to create the axis label for) and combines information in columns descr and units.

### FUNCTIONS ##############################################################

# Function takes a dataframe consisting of a single variable's metadata row.
# The row must have a column named `descr` containing the variable description
# and a column named `units` containing the units (NA if the variable is unitless)
axisLabel <- function(metadata){
  
  # select description for variable
  descr <- metadata$descr
  
  # select units for variable if applicable and place in parentheses
  units <- if(is.na(metadata$units)){NULL}else{
    paste(" (", metadata$units, ")", sep ="")}
  
  # combine description and units to create axis label
  label <- paste(descr, units, sep = "")
  
  # return label
  return(label)
    
}

You are welcome to copy and use this function. Just make sure you supply the function argument metadata with a dataframe that has a single row and the appropriate data in appropriately named columns descr and units. You can of course edit the function to change these requirements.
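
As a quick illustration of the expected input, the sketch below calls the function on two hand-built one-row data frames based on rows of Table 2, one with units and one without. The function body is repeated in compact form only so the example runs on its own, and row.inc and row.dev are hypothetical names.

```r
# compact restatement of axisLabel, repeated so this example is self-contained
axisLabel <- function(metadata){
  units <- if(is.na(metadata$units)){NULL}else{
    paste(" (", metadata$units, ")", sep = "")}
  paste(metadata$descr, units, sep = "")
}

# one-row metadata frames based on Table 2
row.inc <- data.frame(descr = "Incubation period", units = "days",
                      stringsAsFactors = FALSE)
row.dev <- data.frame(descr = "Developmental mode", units = NA,
                      stringsAsFactors = FALSE)

axisLabel(metadata = row.inc)  # "Incubation period (days)"
axisLabel(metadata = row.dev)  # "Developmental mode"
```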

The right plot for the right data type

I am also dealing with a variety of data types including continuous, integer, categorical and binary. So I can use the type column in the metadata to determine the right plot type to use for each data type.
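
The mapping from data type to plot type can be sketched as a small lookup. The helper below, plotType, is a hypothetical name and not part of the original code. It assumes the type codes con, cat and bin listed in Table 2, plus int for integer variables.

```r
# hypothetical helper: map a metadata type code to a plot type
plotType <- function(type){
  switch(type,
         con = ,              # falls through to the next value
         int = "histogram",
         cat = ,              # falls through to the next value
         bin = "barplot",
         stop("unknown data type: ", type))
}

plotType("con")  # "histogram"
plotType("bin")  # "barplot"
```

Stopping on an unknown code, rather than silently skipping the variable, makes typos in the metadata show up early.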

More informative labels for categorical/binary variables

Finally, because my categorical/binary variables are coded, I can use the information in the levels metadata column to provide more informative barplot labels.

To produce a series of plots for my dataset, I write a for() loop and nest if() conditional statements in it to choose the more appropriate of the two plot types for each variable. I also calculate n, the number of observations available for each variable, and include it in each plot, because it’s a basic check I always make.

CODE & PLOTS

vars <- c("max.altitude", "inc", "dev.mode", "courtship.feed.m", "song.dur", "breed.system")

for(var in vars){
  
  # subset master dataset to variable data to plot. omit NAs
  x <- na.omit(dd[,var])
  
  # subset metadata to a single variable row
  var.meta <- meta[meta$code == var,]
  
  # use axisLabel function to create axis label for variable
  xlabel <- axisLabel(metadata = var.meta)
  
  
  
  # Use variable metadata to determine the most appropriate plot for the data type
  # _________________________________________________________________________
  
  ################################################ 
  ### Plotting continuous or integer variables ###
  ################################################
  
  #  if data type is continuous or integer, use histogram
  
  if(var.meta$type %in% c("con", "int")){
    
    hist(x, xlab = xlabel, main = paste("n =", length(x)), col = "gray")
    
  }
  
  
  ################################################ 
  ### Plotting categorical or binary variables ###
  ################################################
  
  #  if data type is binary or categorical, use barplot
  if(var.meta$type %in% c("bin", "cat")){
    
    # split the string containing levels to get category labels for your codes
    # skip this step if your data categories are not coded
    levels <- strsplit(x = var.meta$levels,
                       split = ";")
    
    # plot table of category frequencies
    barplot(table(dd[,var]), main = paste("n =", length(x)), xlab = xlabel, ylab = "Frequency"
            , names.arg = levels[[1]] # delete this line if you skipped the previous step
            )
  }
  
}
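
One caveat with names.arg: the labels are matched to the bars by position, so the call fails if a level listed in the metadata never occurs in the data, because table() then returns fewer counts than there are labels. A possible workaround is to convert the codes to a factor first. The sketch below assumes the scores and levels columns list codes and labels in the same order, as they do in Table 2. The vector x holds the non-missing dev.mode values from the 30 rows shown in Table 1, where code 3 (Precocial) does not occur.

```r
# metadata strings for dev.mode, copied from Table 2
scores <- "1;2;3"
labs   <- "Altricial;Semiprecocial;Precocial"

# non-missing dev.mode values from the 30 rows in Table 1
x <- c(1, 2, 2, 2, 2, 2, 2, 2, 2, 2)

# convert codes to a labelled factor so every level is kept
f <- factor(x,
            levels = strsplit(scores, split = ";")[[1]],
            labels = strsplit(labs, split = ";")[[1]])

# table() now reports a zero count for the unobserved level
counts <- table(f)
print(counts)

# bar names come from the factor labels, so names.arg is not needed
barplot(counts, main = paste("n =", length(x)), ylab = "Frequency")
```

An empty bar for an unobserved category is also informative in its own right when checking coverage.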

Data exploration in R Markdown

Anna Krystalli

12 April 2016