A primer for scientists
Digital Research Alliance of Canada
Monday, January 6, 2025
Agenda
Why do we care about sharing data?
Current issues with data
Principles of sharing data
General guidelines for dataset deposits
Data submission checklist
Canadian generalist repositories
Some reasons to share research data are:
Avoid unnecessary or costly experiments by using previous research results.
Validate research findings: Independent verification of scientific results and conclusions (by replicating research workflows).
Repurpose data: Use the data for new research questions or in combination with other datasets. They are also extremely valuable as educational resources.
Build upon previous work: to accelerate scientific discovery and meta-analysis by avoiding duplication of efforts or reliance on irreproducible research.
The Government of Canada promotes research data management (RDM) in its Tri-Agency Research Data Management Policy.
Through its federal funding agencies, the Government of Canada seeks to implement data management plans (DMPs) and the sharing of research data to maximize the benefits to society.
Depositing a dataset in a repository is NOT ONLY an exercise in meeting the requirements of funding agencies and journals. It is an ethical and professional responsibility of researchers to ensure reproducible science, and the access and reuse of scientific data.
Efficiency → Collaborative work → Reproducibility/impact
Rigorous peer review → Validation and reproducibility → Open science
Transparency → Accountability → Return on investment
Current issues with data
Data availability statement
“The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.”
When data is shared, more often than not we observe that it:
Lacks comprehensive metadata and readme file(s) explaining the context, methodology, and structure of the dataset.
Presents a disorganized structure that makes its reuse impossible.
Is treated only as a supplement of research articles.
Principles of sharing data
The following are essential aspects researchers must consider when sharing data:
Your dataset should be a standalone resource.
Your dataset should be discoverable and understandable.
Your dataset must be reusable by the community.
Datasets as standalone objects
Regardless of whether the dataset is linked to a scientific publication, it must be understandable and independently navigable.
Findable
Accessible
Interoperable
Reusable
General guidelines for dataset deposits
The title must reflect the nature and content of the dataset.
Original: PiPaw2.0
Better: Home cage based motor learning platform PiPaw2.0
Original: Foliar Functional Trait Mapping
Better: Foliar Functional Trait Mapping of a mixed temperate forest using imaging spectroscopy
Original: Covariation in Width and Depth in Bedrock Rivers Data Archive
Better: Data archive for width and depth covariation within the bedrock Fraser Canyon, British Columbia, Canada
Caution
The title of your dataset IS NOT the same as the title of your research article
The description must reflect the nature, content and methods of the dataset. The use of numerous keywords is recommended to increase its discoverability.
Original: This dataset provides climate data (19 bioclimate variables as defined by worldclim) that were generated using the Biosim 11 software at a spatial resolution of 9 km across Canada between 1980-2020.
Suggested: This dataset provides climate data (19 bioclimate variables as defined by worldclim) that were generated using the Biosim 11 software at a spatial resolution of 9 km across Canada between 1980-2020. Please refer to https://www.worldclim.org/data/bioclim.html for information about the variables. The dataset contains: the annual mean temperature, mean diurnal range, isothermality, temperature seasonality, maximum temperature of warmest month, minimum temperature of coldest month, temperature annual range, mean temperature of wettest quarter, mean temperature of driest quarter, mean temperature of warmest quarter, mean temperature of coldest quarter, annual precipitation, precipitation of wettest month, precipitation of driest month, precipitation seasonality (coefficient of variation), precipitation of wettest quarter, precipitation of driest quarter, precipitation of warmest quarter, precipitation of coldest quarter.
Original: Exposure to neuromodulatory chemicals in the polychaete marine worm, Capitella teleta, has been used to assess changes in locomotory behaviour in adult and juvenile life stages. Worms were exposed to nicotine, fluoxetine, apomorphine, and phenobarbital and had their distance moved, maximum velocity, time to/at the edge of the arena, and time to first move measured.
Suggested: Compounds such as pharmaceuticals and pesticides act as neurochemicals in aquatic organisms. This repository contains the raw data from a study investigating the effects of neuromodulatory chemicals in the marine polychaete worm Capitella teleta. We investigated the effects of nicotine, fluoxetine, apomorphine and phenobarbital, which are known to interact with acetylcholine, serotonin, dopamine and GABA pathways. We measured locomotory behaviour using a high throughput multi-well plate assay, using parameters such as total distance moved, time spent moving, time spent at the edge and maximum velocity. We also performed RNA extraction and sequencing with juvenile and adult worms to determine if genes in the pathway were expressed. We share gene sequences, alignments, motif searching, and phylogenetic analysis files for each receptor (with acetylcholine, serotonin, dopamine and GABA) and videos, together with raw .fasta files for RNA sequencing and R code for processing/analysis.
To find relevant keywords, ask yourself the following question:
What terms can a reuser use in a search field to find my record?
Define an organized scheme for your data at the beginning (best) or during your research (not bad).
Think about
Overall, ensure that the schema is logical and consistent. An external user must be able to understand the directory structure.
The (main) readme file is a guide to understanding the dataset and enabling its reuse or execution.
FRDR users can use our [text] or [web] template to generate a readme file for submission to FRDR.
Additional resources are:
- Creating a README file
- Readme.so
- Readme.ai
In general, a dataset readme file shows:
To refresh your memory
The dataset is a separate object (from the research article). Methods and tools for data collection MUST NOT be relegated to the research article.
A set of instructions and software for opening, handling and reproducing research pipelines.
Sharing and access information detailing permissions and terms of use.
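As a rough illustration of the readme elements above, a skeleton file could be generated with a few lines of Python. This is only a sketch: the section headings and the function names `build_readme` and `write_readme` are assumptions made for this example and are not the FRDR template.

```python
from pathlib import Path

# Hypothetical section headings based on the points above (not the official FRDR template).
SECTIONS = [
    ("General information", "Title, authors, contact, dates and place of data collection."),
    ("Methods", "Methods and tools used for data collection and processing."),
    ("File overview", "Directory structure, file names and file formats."),
    ("Software and instructions", "Software and steps needed to open, handle and reproduce the pipeline."),
    ("Sharing and access", "License, permissions and terms of use."),
]


def build_readme(title: str) -> str:
    """Return a plain-text readme skeleton for a dataset called `title`."""
    lines = [title, "=" * len(title), ""]
    for heading, hint in SECTIONS:
        # Each section gets an underlined heading and a bracketed prompt to fill in.
        lines += [heading, "-" * len(heading), f"[{hint}]", ""]
    return "\n".join(lines)


def write_readme(dataset_dir: Path, title: str) -> Path:
    """Write README.txt at the top level of the dataset folder and return its path."""
    dataset_dir.mkdir(parents=True, exist_ok=True)
    path = dataset_dir / "README.txt"
    path.write_text(build_readme(title), encoding="utf-8")
    return path
```

The bracketed prompts are placeholders: the value of the readme comes from the researcher replacing them with real, dataset-specific information.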
An organized scheme is the key to understanding data structure.
The data must be organized logically and hierarchically according to the characteristics of each dataset.
Sharing the input/raw data is a research integrity and data management best practice. The Data_Input/ folder can contain:
The Metadata/ folder contains information about the listed data files to ensure understanding and usability. It may list:
The aim of these resources is to support the reuse of the data by providing a faithful and sufficient description of the variables.
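One possible way to start a codebook/data dictionary is to draft it from the data file itself and then complete it by hand. The sketch below assumes the data is a CSV file with a header row; the function name `draft_data_dictionary` and the output columns are illustrative choices, not a required format.

```python
import csv
from pathlib import Path


def draft_data_dictionary(data_file: Path, out_file: Path) -> list[str]:
    """Write a codebook skeleton with one row per column of `data_file`.

    Returns the list of variable names found in the header row.
    """
    with data_file.open(newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        first_row = next(reader, [])  # used only to show an example value

    with out_file.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["variable", "example_value", "description", "units"])
        for i, name in enumerate(header):
            example = first_row[i] if i < len(first_row) else ""
            # Description and units are left blank for the researcher to fill in.
            writer.writerow([name, example, "", ""])
    return header
```

The script only lists the variables; the descriptions and units still have to be written by someone who knows the data.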
The Data_Analysis/ folder contains the processed files, that is, those used to generate the research results.
Like the input data, these files should be accompanied by a codebook/data dictionary. They can also be accompanied by Data_Appendix files that show basic descriptive statistics or data distributions.
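As a minimal sketch of what a Data_Appendix file could be built from, the function below summarizes the numeric columns of a CSV file using only the Python standard library. The function name `describe_numeric_columns` and the choice of statistics are assumptions for illustration.

```python
import csv
import statistics
from pathlib import Path


def describe_numeric_columns(data_file: Path) -> dict:
    """Return n, mean, min and max for each fully numeric column of a CSV file."""
    with data_file.open(newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))

    summary = {}
    columns = list(rows[0].keys()) if rows else []
    for column in columns:
        try:
            values = [float(row[column]) for row in rows]
        except (TypeError, ValueError):
            continue  # skip columns that contain text or missing cells
        summary[column] = {
            "n": len(values),
            "mean": statistics.mean(values),
            "min": min(values),
            "max": max(values),
        }
    return summary
```

Columns with any non-numeric or empty cell are skipped entirely in this sketch; a real appendix would probably report missing values rather than ignore them.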
A Data_Intermediate/ can contain intermediate processed data, or pre-processed files as part of an analysis pipeline. For example, image ‘masks’ and machine learning classifiers that are used to further process images.
Although most scientists may be more comfortable with GUIs, the current research landscape requires the use of scripts and (analysis) code to ensure the reproducibility of research results.
Tip
Coding should be considered an essential skill, as well as other methods such as animal surgery, patch clamp, or flow cytometry.
Caution
The data you get from your measurements may not be formatted and organized in a way that allows you to analyse it and generate results.
The Scripts_Processing folder may contain scripts/code that prepare (or transform) the raw data (images, tables) for analysis and save the results in Data_Analysis/.
Examples of workflows:
Tip
You may want to save the generated intermediate files in the Data_Intermediate/ folder.
You will create several processing scripts. Logical naming conventions are the key to linking the input/output data to the processing scripts.
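To make the idea of linking files through names concrete, here is a small sketch. It assumes a hypothetical convention in which every script and every file it produces starts with the same two-digit step number (for example `01_clean_tables.py` and `01_clean_tables.csv`); the convention, the pattern and the function names are illustrative only.

```python
import re
from typing import Optional

# Assumed convention: "NN_description.ext", where NN is the pipeline step number.
STEP_PATTERN = re.compile(r"^(\d{2})_[a-z0-9_]+\.[a-z0-9]+$")


def step_of(filename: str) -> Optional[str]:
    """Return the two-digit step number of a file name, or None if it does not follow the convention."""
    match = STEP_PATTERN.match(filename)
    return match.group(1) if match else None


def link_scripts_to_outputs(scripts: list, outputs: list) -> dict:
    """Map each script name to the output files that share its step number."""
    links = {script: [] for script in scripts}
    for output in outputs:
        for script in scripts:
            step = step_of(script)
            if step is not None and step == step_of(output):
                links[script].append(output)
    return links
```

Files that do not follow the convention (for example `Notes.txt`) are simply not linked to any script, which makes them easy to spot and rename.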
The Scripts_Analysis folder hosts scripts/code to generate results that may be in the form of:
Tip
In general, these scripts import and process the analysis data.
The Scripts/ folder can also contain a master script that executes all other scripts, creating a fully automated pipeline.
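A master script can be very short. The sketch below assumes that the other scripts are Python files and that their alphabetical order is the intended execution order (for example through numeric prefixes); the function name `run_pipeline` is hypothetical. Pipelines written in R or other languages would need the corresponding interpreter instead of `sys.executable`.

```python
import subprocess
import sys
from pathlib import Path


def run_pipeline(script_dirs: list) -> list:
    """Run every .py script in each folder, in name order, and return the scripts executed.

    `check=True` stops the pipeline at the first script that fails.
    """
    executed = []
    for directory in script_dirs:
        for script in sorted(Path(directory).glob("*.py")):
            subprocess.run([sys.executable, str(script)], check=True)
            executed.append(script)
    return executed


# Example call (folder names as used in this presentation):
# run_pipeline([Path("Scripts_Processing"), Path("Scripts_Analysis")])
```

Running each script in its own process keeps the steps independent, so a reuser can also run a single step on its own.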
The Output/ folder contains subfolders storing the files generated by the analysis scripts in the form of:
Sharing the output resulting from computations/code is one of the best commitments to open and reproducible science. It is also a way to preserve material for future use in an organized way.
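The folder layout described in this section can be created in one step at the start of a project. The sketch below uses the folder names mentioned above (with the analysis scripts folder spelled Scripts_Analysis) and assumes, for simplicity, that all folders sit side by side at the top level of the dataset; whether the script folders are nested under Scripts/ is a project decision. The function name `create_scaffold` is hypothetical.

```python
from pathlib import Path

# Folder names follow this presentation; the values summarize each folder's role.
FOLDERS = {
    "Data_Input": "raw/input data",
    "Metadata": "codebooks and data dictionaries",
    "Data_Intermediate": "pre-processed files (e.g., image masks)",
    "Data_Analysis": "processed files used to generate results",
    "Scripts_Processing": "scripts that prepare the raw data",
    "Scripts_Analysis": "scripts that generate results",
    "Output": "files generated by the analysis scripts",
}


def create_scaffold(root: Path) -> list:
    """Create the dataset folders under `root` (existing folders are left untouched)."""
    created = []
    for name in FOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```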
Data submission checklist
When you submit your data to a repository (FRDR), make sure it meets these characteristics:
Your folders and files are organized in a clear and structured way (understandable to the community): Use standardized file formats (e.g., CSV, TIFF) and check for consistency in naming conventions.
The metadata/readme is as complete as possible and can be understood as a standalone object that provides data collection methods, processing steps, and relevant context.
Verify independent usability: Data must be complete and understandable (including any necessary instructions for data interpretation) without the need for the accompanying research article.
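Parts of this checklist can be automated before submission. The following is a sketch under stated assumptions: the list of accepted file extensions and the "no spaces in file names" rule are examples chosen for illustration, not FRDR requirements, and the function name `check_deposit` is hypothetical. Whether the readme is complete and understandable still needs a human reader.

```python
from pathlib import Path

# Assumed examples of open formats; adjust to the formats common in your field.
OPEN_FORMATS = {".csv", ".txt", ".tif", ".tiff", ".md", ".json", ".py", ".r"}


def check_deposit(root: Path) -> list:
    """Return a sorted list of human-readable problems found in the dataset folder."""
    problems = []
    files = [p for p in root.rglob("*") if p.is_file()]

    # A readme file is expected at the top level of the dataset.
    if not any(p.parent == root and p.name.lower().startswith("readme") for p in files):
        problems.append("No readme file at the top level.")

    for p in files:
        rel = p.relative_to(root).as_posix()
        if " " in p.name:
            problems.append(f"Space in file name: {rel}")
        if p.suffix.lower() not in OPEN_FORMATS:
            problems.append(f"Check file format: {rel}")
    return sorted(problems)
```

An empty list means that none of these simple checks failed; it does not mean the dataset is ready for deposit.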
When do I start organizing my data for sharing?
We recommend implementing RDM practices early and throughout the research process. Organizing data after years of chaotic data management is not a good idea.
When do I share my data?
Your data can be shared at any time during the research process. You do not have to wait until a research article is published to share your data.
What if my dataset does not fit into protocols such as TIER 4?
You do not need to worry about this. The most important thing is that your dataset is well documented, logically organized, and has naming conventions that make it understandable to potential reusers.
Is my data citable?
Of course it is. Your dataset gets a DOI, which makes it a citable object independent of your research article. In fact, if you publish your dataset before your article, you can even cite your datasets in your research.
How can others use my dataset?
That depends on the license you use. We recommend a CC-BY 4.0 license, which allows broad reuse of the data.
Where do I share my data?
You can share your data in specialized or generalist repositories like The Federated Research Data Repository (FRDR) or Borealis.
Be aware that the dataset is a research object that serves the public and the scientific community, and that can be used (and cited) independently of the research article.
Better yet, think of articles as supplements to your dataset!!!
Canadian generalist repositories
The Federated Research Data Repository (FRDR) is a national platform for Canadian researchers to discover, store, and share research data.
Our goals:
Improve data discoverability (in partnership with Lunaris).
Promote open science practices and the reuse of research data.
Ensure the long-term preservation of valuable research data.
FRDR is for Canadian researchers
FRDR supports a wide range of disciplines and data types, providing a robust infrastructure for management and dissemination of research data across Canada.
FRDR ensures the long-term preservation, accessibility and usability of datasets through its curation and preservation team.
FRDR supports funding agencies' requirements related to open access to data (and research data management plans).
FRDR promotes dataset visibility and reuse across a wide range of disciplines.
FRDR supports large datasets, making it an ideal repository for data-intensive research.
FRDR supports researchers in data management best practices.
FRDR supports researchers and institutions
FRDR has competent staff to guide researchers and institutions to ensure that datasets are valuable and comply with FAIR principles.
At FRDR, we aim for datasets to be standalone objects (independent of research articles) with potential social, research or educational uses.
Borealis is a Canadian research data repository supported by academic libraries, research institutions, and the Digital Research Alliance of Canada.
Features:
Built on Dataverse open-source software hosted by Scholars Portal / University of Toronto Libraries.
Integrated with single sign-on login for Canadian Institutions (Canadian Access Federation).
Indexed in DataCite Search, Google Dataset Search, and Lunaris for discoverability.
File preview to explore files directly in the browser.
Data explorer tool to visualize variables in tabular data files (e.g., SPSS, Excel, CSV).
GitHub integration using GitHub Actions.
Contact us to ensure that your data is well prepared and can be effectively shared with the research community.
Depositing data - Daniel Manrique-Castano, Ph.D