A primer for researchers
Digital Research Alliance of Canada
Wednesday, September 25, 2024
Agenda
Basic principles for handling research data
Working with data tables
Working with images
Sharing data handling/analysis pipelines
Organizing and sharing data
Data sharing checklist
Research data comes in many flavors and shapes (tables, images, videos, text).
In all cases, it is essential that the dataset has a clear structure and is understandable by others.
Tip
Try to put yourself in the shoes of an outside observer when structuring the data.
Use comprehensive metadata (readme/codebook) to describe and contextualize research files.
Implement coding pipelines (R, Python) to transform the raw data into clean data for analysis.
Tip
Following these practices ensures organized, clean, and validated datasets.
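As an illustration of such a pipeline, here is a minimal R sketch; the file paths, column names, and cleaning steps are hypothetical and should be adapted to your data:

```r
# Minimal raw -> clean pipeline sketch (paths and columns are hypothetical)
library(tidyverse)

raw <- read_csv("Data_Input/cells_raw.csv")

clean <- raw %>%
  rename_with(tolower) %>%                # harmonize column names
  filter(!is.na(cells)) %>%               # drop incomplete records
  mutate(condition = factor(condition))   # encode the grouping variable

write_csv(clean, "Data_Analysis/cells_clean.csv")
```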
Agenda
Basic principles for handling research data
Working with data tables
Working with images
Sharing data handling/analysis pipelines
Organizing and sharing data
Data sharing checklist
Despite being the most common file format (.xls) for recording and storing data, tables are often the most poorly organized and least reusable objects in research.
In a wide format table, each subject occupies a single row and each variable is an individual column: subject, Id1, Id2, Var1, Var2, Time1, Time2, Time3.
Tip
Here, columns serve as responses or predictors in a regression. Example:
Cells_3D ~ Cells_1D + Cells_2D.
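As a sketch, such a wide-format regression in R could look like this (the data frame wide_df and its columns are hypothetical):

```r
# Hypothetical wide-format regression: model day-3 cell counts
# from counts at days 1 and 2 (one row per subject)
fit_wide <- lm(Cells_3D ~ Cells_1D + Cells_2D, data = wide_df)
summary(fit_wide)
```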
In a long format table, each subject occupies multiple rows, with each observation recorded in a separate row:
subject (repeated), Id1, Id2 (repeated), Time (1, 2, 3).
Tip
Useful when analyzing time-lapse data, grouping different condition variables in a single column. Example:
Cells ~ TimePoint (1D, 2D, 3D).
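A corresponding long-format model might be sketched as follows (long_df is hypothetical):

```r
# Hypothetical long-format model: cell counts as a function of time point
# (one row per subject and time point)
fit_long <- lm(Cells ~ TimePoint, data = long_df)
summary(fit_long)
```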
Long-format is usually the first choice for data analysis.
You can use R (or Python) and Quarto to convert from long to wide table format, or vice versa. Check this tutorial.
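As a minimal sketch of such a conversion with tidyr (column names are illustrative and match the examples above):

```r
library(tidyr)

# Wide -> long: gather the Time columns into TimePoint/Cells pairs
long_df <- pivot_longer(wide_df,
                        cols      = starts_with("Time"),
                        names_to  = "TimePoint",
                        values_to = "Cells")

# Long -> wide: spread TimePoint back into one column per time point
wide_df2 <- pivot_wider(long_df,
                        names_from  = "TimePoint",
                        values_from = "Cells")
```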
Your tables are unintelligible if they are not accompanied by codebooks/readme files describing their content. Recommended formats are .txt, .pdf, .md.
Agenda
Basic principles for handling research data
Working with data tables
Working with images
Sharing data handling/analysis pipelines
Organizing and sharing data
Data sharing checklist
You can easily transform proprietary files (.czi) into open formats (.tif) using, for example, FIJI scripts.
Caution
Saving .czi images as .tif using FIJI results in the loss of the metadata archived within the .czi file.
Export the technical metadata from proprietary images (e.g., .czi) as .txt or .csv files (this can be applied to all images in a batch).
Generate descriptive readme files to explain the provenance and naming conventions of the images.
Agenda
Basic principles for handling research data
Working with data tables
Working with images
Sharing data handling/analysis pipelines
Organizing and sharing data
Data sharing checklist
We live in a pandemic of fraudulent and irreproducible science.
This worrying landscape demands that, as researchers of integrity, we employ best scientific practices when sharing research data and analysis procedures.
Tip
Scientists have a huge amount of resources available to help them in this process.
Handle data tables and variables using the R Tidyverse.
Process Flow cytometry files/data using R FlowCore from BioConductor.
Analyze RNA-seq data using R DESeq2 from BioConductor.
Perform state-of-the-art statistical modeling using brms (see the sketch below).
And all other things you can imagine…
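As an example of the brms entry above, here is a minimal sketch under the toy cell-count setup from the tables section (formula, data, and family are illustrative):

```r
library(brms)

# Hypothetical multilevel model: cell counts by time point,
# with a varying intercept per subject
fit <- brm(Cells ~ TimePoint + (1 | subject),
           data   = long_df,
           family = gaussian())
summary(fit)
```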
With GitHub or GitLab you can:
Store your code/data in a secure place and share it with collaborators and the public.
Keep a history of changes and version your code (v 1.0, 1.2, 2.0).
Link/render your code on different platforms (e.g., an Open Science Framework repository).
With your code you support other researchers and contribute to a culture of open and reproducible science.
Agenda
Basic principles for handling research data
Working with data tables
Working with images
Sharing data handling/analysis pipelines
Organizing and sharing data
Data sharing checklist
At the beginning (optimal) or during (not bad) your research, define an organized scheme for data.
Think about
Overall, ensure the scheme is logical and consistent. An external user must be able to understand your dataset structure.
README files are guides to understand datasets and tables.
There are templates/resources to guide the generation of readme files:
- Creating a README file
- Readme.so
- Readme.ai
Generally, a dataset readme file showcases:
A dataset identifier showing information such as title, authors, data collection date, and geographic information.
A map of files/folders defining the content and hierarchy of folders and subfolders, together with naming conventions.
Methodological information showcasing methods for data collection/generation, analysis, and experimental conditions.
A set of instructions and the software needed for opening and handling the files and reproducing the research pipelines.
Sharing and access information detailing permissions and conditions of use.
Please note
A dataset is a standalone object. Methodological information MUST NOT be relegated to associated research articles.
An organized scheme is the key to understanding the data structure.
The data folder must be organized logically and hierarchically according to the characteristics of each dataset.
Sharing the input/raw data is a best research integrity practice. The Data_Input folder contains the raw, unmodified data files.
Codebooks/data dictionaries provide descriptive details about the dataset files.
The aim of these resources is to support the reuse of the data by providing a faithful and sufficient description of the variables.
A Data_Analysis folder contains the processed files used to generate research results.
Apart from a codebook/data dictionary, these files may be accompanied by a Data_Appendix showcasing basic descriptive statistics.
A Data_Intermediate folder may contain mid-step processed data, or pre-processed files produced during the analysis workflow. Examples include image 'masks' and machine learning classifiers.
A Scripts_Processing folder may contain scripts or code to transform the raw data (images, tables) for analysis.
Examples of workflows:
Caution
The data you obtain from measurements may not be formatted for analysis.
You will generate several processing scripts. Logical naming conventions are key to linking the input/output data with the processing scripts.
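One way to make that link explicit is to mirror file stems across folders, as in this hypothetical sketch (all names are illustrative):

```r
# Hypothetical convention: the processing script and its input/output
# share the same stem, so the pipeline is self-documenting.
# Scripts_Processing/01_clean_cell_counts.R reads and writes:
input  <- "Data_Input/cell_counts_raw.csv"
output <- "Data_Analysis/cell_counts_clean.csv"

cells <- read.csv(input)
# ... cleaning steps go here ...
write.csv(cells, output, row.names = FALSE)
```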
The Scripts_Analysis folder hosts the scripts/code used to generate results.
Tip
Analysis scripts import and handle the Analysis data to produce research results.
The Output folder contains the files generated by the analysis scripts.
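Putting the scheme together, here is a minimal R sketch that scaffolds the folders described in this section (folder names follow these slides; adapt them to your project):

```r
# Create the dataset folder scheme described above
folders <- c("Data_Input", "Data_Intermediate", "Data_Analysis",
             "Scripts_Processing", "Scripts_Analysis", "Output")
for (f in folders) dir.create(f, showWarnings = FALSE)
```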
Agenda
Basic principles for handling research data
Working with data tables
Working with images
Sharing data handling/analysis pipelines
Organizing and sharing data
Data sharing checklist
When you share data, make sure it meets these characteristics:
Folders and files are organized in a structured way: Use standardized file formats (e.g., CSV, TIFF) and check for consistency in naming conventions.
The metadata/readme allows the dataset to be understood as a standalone object, providing data collection methods, processing steps, and relevant context.
Ideally, the dataset also contains the reproducible workflows used to process the data and generate the research results.
Be aware that the dataset is a research object that serves the public and the scientific community, and that it can be used (and cited) independently of the research article.
Think of research articles as supplements to your dataset!
Organize and handle research data - Daniel Manrique-Castano, Ph.D