Let’s have a look at some example documentation I created with R Markdown, using the data I compiled for the Bird Trait Networks project.
I have compiled my large dataset and I want to start exploring it. The plain text chunks in an R Markdown document (.Rmd) are a great space to document any procedures and methods used to produce the data. I’ve spared you that here for my dataset, but with your own example dataset you should use this space to include as much useful detail as you can, so that the methods used to generate your data are as understandable as possible.
Let’s first have a look at what we are dealing with and start by loading the data and the metadata:
### SETTINGS ##############################################################
input.folder <- "~/Documents/WORK/ACCE Data management course/workflow/inputs/exercises/metadata/"
### FILES #################################################################
meta <- read.csv(paste0(input.folder, "metadata.csv"), stringsAsFactors = FALSE)
dd <- read.csv(paste0(input.folder, "data.csv"))
### PACKAGES ##############################################################
library(knitr) # needed for function kable
Now the data has been loaded into the R environment. I’ll use the kable function to make a cool HTML table of the first 30 rows of the data:
kable(head(dd, 30), caption = "Table 1: Sample of the Bird Trait Networks dataset")
| species | max.altitude | inc | dev.mode | courtship.feed.m | song.dur | breed.system |
|---|---|---|---|---|---|---|
| Abroscopus_albogularis | NA | NA | NA | NA | NA | NA |
| Abroscopus_superciliaris | NA | NA | NA | NA | NA | NA |
| Acanthagenys_rufogularis | NA | NA | NA | NA | NA | NA |
| Acanthidops_unicolor | NA | NA | NA | NA | NA | NA |
| Acanthis_flammea | 1400 | 10.0 | 1 | 1 | 19.8 | 1 |
| Acanthis_hornemanni | NA | 12.0 | NA | 1 | NA | NA |
| Acanthisitta_chloris | NA | 19.5 | 2 | NA | NA | 4 |
| Acanthiza_apicalis | NA | NA | NA | NA | NA | NA |
| Acanthiza_chrysorrhoa | NA | NA | NA | NA | NA | NA |
| Acanthiza_lineata | NA | NA | NA | NA | NA | NA |
| Acanthiza_nana | NA | NA | NA | NA | NA | NA |
| Acanthiza_pusilla | NA | NA | NA | NA | NA | NA |
| Acanthiza_reguloides | NA | NA | NA | NA | NA | NA |
| Acanthiza_uropygialis | NA | NA | NA | NA | NA | NA |
| Acanthorhynchus_superciliosus | NA | NA | NA | NA | NA | NA |
| Acanthorynchus_tenuirostris | NA | NA | NA | NA | NA | NA |
| Accipiter_badius | NA | 30.0 | 2 | NA | NA | 1 |
| Accipiter_bicolor | NA | NA | NA | NA | NA | NA |
| Accipiter_brevipes | NA | 32.5 | 2 | NA | NA | 1 |
| Accipiter_cirrocephalus | NA | NA | NA | NA | NA | NA |
| Accipiter_cooperii | NA | 24.0 | 2 | NA | NA | 5 |
| Accipiter_fasciatus | NA | 30.0 | NA | NA | NA | NA |
| Accipiter_gentilis | NA | 33.0 | 2 | NA | NA | 1 |
| Accipiter_melanoleucus | NA | 37.5 | 2 | NA | NA | 1 |
| Accipiter_nisus | 1930 | 34.0 | 2 | NA | NA | 5 |
| Accipiter_novaehollandiae | NA | NA | NA | NA | NA | NA |
| Accipiter_striatus | NA | 34.0 | 2 | NA | NA | 1 |
| Acridotheres_cristatellus | NA | 15.0 | NA | NA | NA | NA |
| Acridotheres_tristis | NA | 15.5 | 2 | 0 | NA | 1 |
| Acrocephalus_agricola | NA | NA | NA | NA | NA | NA |
To begin with, I want to do some basic sanity checks. First, I might want to check the distribution of the data for each variable. This can help me identify outliers or other data entry errors. Here the metadata table can be very useful. Let’s have a quick look at it:
| code | orig.vname | cat | descr | scores | levels | type | units |
|---|---|---|---|---|---|---|---|
| max.altitude | Altitude | ECOLOGY | Maximum altitudinal distribution | NA | NA | con | m |
| inc | Incubation period | LIFE-HISTORY | Incubation period | NA | NA | con | days |
| dev.mode | Developmental mode | LIFE-HISTORY | Developmental mode | 1;2;3 | Altricial;Semiprecocial;Precocial | cat | NA |
| courtship.feed.m | Courtship feeding (by the male) | SEXUAL SELECTION | Courtship feeding (by the male) | 0;1 | FALSE;TRUE | bin | NA |
| song.dur | Song duration | BEHAVIORAL | Song duration | NA | NA | con | seconds |
| breed.system | Breeding system | BEHAVIORAL | Which adult(s) provides the majority of care: | 1;2;3;4;5 | Pair;Female;Male;Cooperative;Occassional | cat | NA |
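
Before leaning on the metadata, it can be worth confirming that it actually lines up with the data. The following is a minimal sketch of such a check, not part of the original workflow: dd_toy and meta_toy are hypothetical stand-ins for dd and meta, filled with a few values copied from the two tables above, and the assumption is that species is the only non-trait column.

```r
# Toy stand-ins for `dd` and `meta` (hypothetical names; values copied from the tables above)
dd_toy <- data.frame(
  species      = c("Acanthis_flammea", "Accipiter_nisus"),
  max.altitude = c(1400, 1930),
  inc          = c(10, 34),
  dev.mode     = c(1, 2),
  stringsAsFactors = FALSE
)
meta_toy <- data.frame(
  code  = c("max.altitude", "inc", "dev.mode"),
  type  = c("con", "con", "cat"),
  units = c("m", "days", NA),
  stringsAsFactors = FALSE
)

# Trait columns are everything except the species identifier (assumption)
trait_cols <- setdiff(names(dd_toy), "species")

# Data columns with no metadata row, and metadata rows with no data column
undocumented <- setdiff(trait_cols, meta_toy$code)
unused       <- setdiff(meta_toy$code, trait_cols)

# Codes should be unique so that each lookup returns exactly one metadata row
duplicated_codes <- meta_toy$code[duplicated(meta_toy$code)]
```

If all three vectors come back empty, every variable has exactly one metadata row, which is what a later lookup of the form meta[meta$code == var,] relies on.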
Firstly, while the coded variable names are reasonably descriptive, I want to make sure the variables are clearly specified in their plots. I also want to include units. All this makes the plots and therefore the data more understandable by both myself and my collaborators. So I can use information in the metadata to construct more informative axis labels.
For this I’ve created the function axisLabel that takes a single row of the metadata dataframe (the row containing the information for the variable I want to create the axis label for) and combines information in columns descr and units.
### FUNCTIONS ##############################################################
# function takes a dataframe consisting of a single variable metadata row.
# Row must have a column named `descr` containing the variable description
# and a column named `units` containing the units (NA if variable is unitless)
axisLabel <- function(metadata){
  # select description for variable
  descr <- metadata$descr
  # select units for variable if applicable and place in parentheses
  units <- if(is.na(metadata$units)){NULL}else{
    paste(" (", metadata$units, ")", sep = "")}
  # combine description and units to create axis label
  label <- paste(descr, units, sep = "")
  # return label
  return(label)
}
You are welcome to copy and use this function. Just make sure you supply the function argument metadata with a dataframe that has a single row and the appropriate data in appropriately named columns descr and units. You can of course edit the function to change these requirements.
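As a quick illustration of what the function returns, here is a small usage sketch. The function is repeated in compact form so the snippet runs on its own, and the two single-row data frames inc.meta and dev.meta are hypothetical stand-ins built from values in the metadata table (one variable with units, one without).

```r
# axisLabel as described above, repeated in compact form so the snippet is self-contained
axisLabel <- function(metadata){
  units <- if(is.na(metadata$units)){NULL}else{
    paste(" (", metadata$units, ")", sep = "")}
  paste(metadata$descr, units, sep = "")
}

# Single-row metadata data frames (hypothetical names; values from the metadata table)
inc.meta <- data.frame(descr = "Incubation period", units = "days",
                       stringsAsFactors = FALSE)
dev.meta <- data.frame(descr = "Developmental mode", units = NA,
                       stringsAsFactors = FALSE)

inc.label <- axisLabel(inc.meta)  # "Incubation period (days)"
dev.label <- axisLabel(dev.meta)  # "Developmental mode"
```

Because paste() drops a NULL argument, a unitless variable gets just its description with no empty parentheses.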
I am also dealing with a variety of data types, including continuous, integer, categorical and binary. So I can use the type column in the metadata to determine the right plot type to use for each data type.
Finally, because my categorical/binary variables are coded, I can use the information in the levels metadata column to provide more informative barplot labels.
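One way to turn coded values into labelled categories is sketched below. It assumes the scores and levels metadata columns list the codes and their labels in the same order, separated by semicolons, as they do in the metadata table. The vector codes is a hypothetical sample of breed.system values taken from the data table, and score.vec, label.vec, breed and counts are illustrative names.

```r
# Metadata strings for breed.system, copied from the metadata table
scores <- "1;2;3;4;5"
levels.str <- "Pair;Female;Male;Cooperative;Occassional"

# A few breed.system values from the sample data table (NA = no data)
codes <- c(1, NA, 4, 1, 1, 5)

# Split the ";"-separated strings into vectors of codes and labels
score.vec <- strsplit(scores, split = ";")[[1]]
label.vec <- strsplit(levels.str, split = ";")[[1]]

# Convert the codes to a labelled factor; every documented level is kept
breed <- factor(codes, levels = score.vec, labels = label.vec)

# Frequency table with one entry per documented level, including zero counts
counts <- table(breed)
```

Going through factor() keeps categories that happen to have no observations, so the number of bars in a barplot of counts always matches the number of labels.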
To produce a series of plots for my dataset, I write a for() loop and nest if() conditional statements in it to specify which of the two plot types to use for each variable. I also want to calculate and include n, the number of observations available for each variable, in each plot because it’s a basic check I always make.
vars <- c("max.altitude", "inc", "dev.mode", "courtship.feed.m", "song.dur", "breed.system")
for(var in vars){
  # subset master dataset to variable data to plot. omit NAs
  x <- na.omit(dd[,var])
  # subset metadata to a single variable row
  var.meta <- meta[meta$code == var,]
  # use axisLabel function to create axis label for variable
  xlabel <- axisLabel(metadata = var.meta)
  # Use variable metadata to determine the most appropriate plot for the data type
  # _________________________________________________________________________
  ################################################
  ### Plotting continuous or integer variables ###
  ################################################
  # if data type is continuous or integer, use histogram
  if(var.meta$type %in% c("con", "int")){
    hist(x, xlab = xlabel, main = paste("n =", length(x)), col = "gray")
  }
  ################################################
  ### Plotting categorical or binary variables ###
  ################################################
  # if data type is binary or categorical, use barplot
  if(var.meta$type %in% c("bin", "cat")){
    # split the strings containing codes and levels to get a label for each code
    # if your data categories are not coded, skip this step and use
    # barplot(table(x), ...) without the names.arg argument
    codes <- strsplit(x = var.meta$scores, split = ";")[[1]]
    labels <- strsplit(x = var.meta$levels, split = ";")[[1]]
    # plot table of category frequencies. the factor keeps every documented
    # category, so the bars match the labels even if a code never occurs
    barplot(table(factor(x, levels = codes)), main = paste("n =", length(x)),
            xlab = xlabel, ylab = "Frequency",
            names.arg = labels)
  }
}
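The same metadata lookup can also be used to tabulate n and the chosen plot type for every variable before any plotting is done. The sketch below is an illustration under assumptions rather than part of the original analysis: dd_toy and meta_toy are hypothetical subsets built from values in the tables above, and plotType is a hypothetical helper that mirrors the type tests used in the loop.

```r
# Toy subsets of `dd` and `meta` (hypothetical names; values copied from the tables above)
dd_toy <- data.frame(
  species          = c("Acanthis_flammea", "Acanthis_hornemanni",
                       "Acanthisitta_chloris", "Accipiter_nisus"),
  max.altitude     = c(1400, NA, NA, 1930),
  inc              = c(10, 12, 19.5, 34),
  courtship.feed.m = c(1, 1, NA, NA),
  stringsAsFactors = FALSE
)
meta_toy <- data.frame(
  code = c("max.altitude", "inc", "courtship.feed.m"),
  type = c("con", "con", "bin"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map a metadata type to the plot the loop would draw
plotType <- function(type){
  if(type %in% c("con", "int")){
    "histogram"
  }else if(type %in% c("bin", "cat")){
    "barplot"
  }else{
    NA_character_
  }
}

# One row per variable: number of non-missing observations and plot type
check <- data.frame(
  code = meta_toy$code,
  n    = vapply(meta_toy$code, function(v) sum(!is.na(dd_toy[[v]])),
                integer(1), USE.NAMES = FALSE),
  plot = vapply(meta_toy$type, plotType, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
```

A variable whose type is not one of the four expected values gets NA in the plot column, which flags metadata rows that the loop would silently skip.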
metadata.csv for your dataset. Here’s a link to the .Rmd file that created this .html file.