Great work this summer in collecting mountains of data! I now want to make sure that we are all on the same page with how these data (and metadata) are handled for both analysis and long-term storage. I’m coordinating data management and the associated best practices to ensure that we meet the goals of our management plan, but each of you will help APECS reach those goals by assisting me in the process for the data that you have collected and are curating.
First, you should be aware that both grants under the APECS umbrella have a data management plan. They are not complex but serve as a promise to ourselves and to NSF that we will be responsible with the expensive data that we were funded to collect. The plans outline:
Onwards! Best practices for data management with APECS:
After fieldwork, labwork, etc., collate the raw data into a spreadsheet. Be mindful, however, of how those data are entered/organized into a working dataset. A spiffy dataset is a “clean” dataset (see below). It saves a lot of time to enter it in such a way that the data are (1) readily accessible to whatever statistical software you plan to use and (2) readily accessible to future users. For archival, the data will eventually need to be structured in a format that also happens to be appropriate for R.
What is a clean dataset? Essentially, each variable (dependent and independent) should have its own column, and each row within a column should contain a single piece of information. A column can include multiple factors to help characterize a single datapoint – for example, a “sea_otter_region” column can have “high”, “mid”, and “low” factors within it. Further examples/guidelines are described below, paired with the following image:
A. Column titles and text in rows: Some software, like R, does not like spaces in column/variable titles; instead, use an underscore or period between words/characters. Spaces in rows/factors are typically just fine.
B. It’s easiest to keep lats and longs in decimal degrees.
C. For dates, please annotate what format your data are in, especially in the metadata, but it is also useful in column titles, e.g. “MM.DD.YY” or “YYYY.MM.DD”. Date data are notorious troublemakers in both analysis and archival, especially for international access (the R sketch after this list shows one way to parse an annotated date column).
D. These columns are examples of how multiple factors can be included in one column; notice how the information is nested, i.e. each replicate is nested within its respective transect, which is nested within its respective site. Depending on your preference, these different factors can occupy separate columns (as shown with the density data in columns K and L).
E. Note that if there are site-level data, they need to be filled in for EVERY unique datapoint; it’s OK if it looks repetitive.
F. Two issues here: (1) running calculations in Excel and (2) saving files with calculation sections (see bottom of columns). Unless you’re running data analysis in Excel (please don’t do this), there should be no calculations saved in the spreadsheet (they won’t be saved in a .csv, anyway). If you’re using R for analysis, please avoid calculating anything in Excel, even simple sums or means…mistakes will happen. In this example, it is better to leave the two density columns as raw and calculate their mean using code in R (this is reproducible and more reliable; see the sketch after this list). The second issue will be misinterpreted by your statistical software: whatever platform you use incorporates information from the entire column it is given, and so would treat “Averages”, “Nossuk”, “N Pass”, and “Shinaku” as factors in the sediment columns, and the associated mean values as new datapoints in the pits column. Just don’t do these things.
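To tie A–F together, here is a minimal sketch in R of what a clean datasheet buys you; the file name and the “density1”/“density2” column names are placeholders for whatever your own datasheet actually uses:

```r
# A minimal sketch, assuming a clean .csv with the (placeholder) column names below
eelgrass <- read.csv("eelgrass_transects_2018.csv", stringsAsFactors = FALSE)

# Parse the date column using the format annotated in its title/metadata (here MM.DD.YY)
eelgrass$date <- as.Date(eelgrass$date_MM.DD.YY, format = "%m.%d.%y")

# Calculate the mean of the two raw density columns in R instead of in Excel
eelgrass$density_mean <- rowMeans(eelgrass[, c("density1", "density2")], na.rm = TRUE)

head(eelgrass)
```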
Neat. Please see me or Wendel (I just volunteered him) if you have any questions about this.
It’s necessary to check ALL of the entered data to make sure that it is accurate…hopefully this goes faster than the initial entry. Even after verifying everything, I like to go through the dataset in R to see if there are any odd entries or values that don’t make sense before fully accepting that the QC/QA process is complete. There are a lot of ways to check data, but I at least like to make sure that functions like “summary()” work without errors, and I manually sort each column to check for errors (e.g. odd placement of “NA” or an accidental character, like a “.” instead of a “0” or “NA”).
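As a rough example of that kind of pass in R (the column names are only illustrative and follow the example metadata table below):

```r
# A quick QC pass in R; file and column names are placeholders
eelgrass <- read.csv("eelgrass_transects_2018.csv", stringsAsFactors = FALSE)

summary(eelgrass)                  # ranges, NA counts, and obviously impossible values
str(eelgrass)                      # did a numeric column sneak in as character?
sort(unique(eelgrass$site_name))   # typos or duplicate spellings in factor columns
which(is.na(eelgrass$n_pits))      # rows with missing counts, to double-check against datasheets
```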
It is also critical to build metadata for each datasheet, for both APECS and for data archival. What do I mean by metadata? Two things:
Short, direct definitions: This often involves transposing the column titles that you have and providing an informative definition/description of each factor in your datasheet (example below). This can be saved as a separate tab in an Excel document but will ultimately need to be a separate .csv file to pair with your data for archival.
| Factor | Description |
|---|---|
| site_name | Identifying name of the field site |
| longitude | Longitude (E) of the site in decimal degrees |
| latitude | Latitude (N) of the site in decimal degrees |
| date_MM.DD.YY | Date of sampling expressed as month, day, and year (MM/DD/YY) |
| so_region | Regional categorization of sea otter presence; 3 levels: “high”, “mid”, “low” |
| transect | Transect location within each site relative to the seagrass bed |
| replicate | Replicate number within each transect |
| sed_primary | Qualitative characterization of the primary sediment type (e.g. sand, mud) |
| sed_secondary | Qualitative characterization of the secondary sediment type |
| n_pits | Number of pits counted within the transect replicate |
| n_otters1 | Number of sea otters counted in the first boat survey |
| n_otters2 | Number of sea otters counted in the second boat survey |
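
If it helps, the metadata definitions can be built and written out from R so that the .csv stays paired with the data file; the entries below are abbreviated from the table above, and the output file name is just a placeholder:

```r
# Build the metadata definitions as their own data frame (abbreviated example)
metadata <- data.frame(
  Factor      = c("site_name", "longitude", "latitude", "n_pits"),
  Description = c("Identifying name of the field site",
                  "Longitude (E) of the site in decimal degrees",
                  "Latitude (N) of the site in decimal degrees",
                  "Number of pits counted within the transect replicate"),
  stringsAsFactors = FALSE
)

# Save as a separate .csv to pair with the datasheet for archival
write.csv(metadata, "eelgrass_transects_2018_metadata.csv", row.names = FALSE)
```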
The two most important things to remember when storing data are to keep a Masterfile of each datasheet (after QC/QA) that will not be manipulated, and to not lose the data. There are two places where we store data so that it is safe and accessible by APECS members: the shared Google Drive and the GitHub platforms that APECS curates. Backing up data on personal/lab hard drives and computers is also appropriate (if not inevitable). Keep in mind, however, that it is important to be wary and conscious of multiple file versions. In the end, APECS really only wants two versions of each datasheet (i.e. raw and clean) for APECS-related storage and analysis, and that can get messy if you have 20 different versions of the same file.
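One way to keep it to just those two versions is to never touch the Masterfile by hand and to produce the clean file entirely in code; the file paths and the cleaning steps here are only illustrative:

```r
# Read the untouched Masterfile (raw); never edit it by hand
raw <- read.csv("raw/eelgrass_transects_2018_MASTER.csv", stringsAsFactors = FALSE)

# Do all cleaning in code so it is reproducible (illustrative steps only)
clean <- raw
clean$date <- as.Date(clean$date_MM.DD.YY, format = "%m.%d.%y")
clean <- clean[!is.na(clean$site_name), ]

# Write the single clean version used for analysis and archival
write.csv(clean, "clean/eelgrass_transects_2018_clean.csv", row.names = FALSE)
```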
Generally, you will work with Tiff to upload all APECS-collected data into our preferred data repositories. In our data management plan with NSF, we committed to archiving our data with BCO-DMO (Biological and Chemical Oceanography Data Management Office), but we are also submitting to KNB (Knowledge Network for Biocomplexity). The latter is more user-friendly, and the archived data are accessible to BCO-DMO.
The important prep information for data archival was largely covered in the section about data structure and metadata. These are critical for archival. The following files should be submitted for archival:
There isn’t much more to say here other than archival is serious business, and it’s best to be thorough because you don’t know who will access your dataset, what they’ll use it for, or whether you’ll still be alive if they have questions. You can look at an example from last year, where Tiff archived 2017 data using a series of submissions to KNB. Note: KNB has been updating their user platform a lot recently, so even Tiff needs to give it a fresh browse to make sure that the functionality is as she knows it.