Organize and handle research data

A primer for researchers

Daniel Manrique-Castano, Ph.D.

Digital Research Alliance of Canada

Wednesday, September 25, 2024

Basic principles for handling research data

Agenda

  1. Basic principles for handling research data

  2. Working with data tables

  3. Working with images

  4. Sharing data handling/analysis pipelines

  5. Organizing and sharing data

  6. Data sharing checklist

Make datasets understandable

Research data comes in many shapes and forms (tables, images, videos, text).

In all cases, it is essential that the dataset has a clear structure and is understandable by others.

Tip

Try to put yourself in the shoes of an outside observer when structuring the data.

Outside observers generally cannot make sense of research data without guidance

Four best practices for organizing a dataset

  1. Use consistent naming conventions that accurately describe a file’s content and make the relationships between files clear (see the sketch after this list):
  • A1.tif → Exp_MouseID_Day_Condition_Marker.tif
  • CellsTable.xls → Widefield_5x_Cortex_NeuN_Counts.csv
  2. Use open, accessible file formats:
  • .tif for images (preserves the metadata).
  • .csv for tables (non-proprietary).
  • .png or .svg for graphs (preserve quality).
  • .txt or .pdf for documentation (non-proprietary).
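A consistent name can even be assembled programmatically so it never drifts between files. A minimal sketch in Python; every field and value below is hypothetical:

```python
# Build a descriptive file name from experiment fields (all values hypothetical)
experiment = "Exp02"
mouse_id = "M12"
day = "D07"
condition = "MCAO"
marker = "NeuN"

file_name = f"{experiment}_{mouse_id}_{day}_{condition}_{marker}.tif"
print(file_name)  # -> Exp02_M12_D07_MCAO_NeuN.tif
```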

Four best practices for organizing a dataset

  3. Use comprehensive metadata (readme/codebook) to describe and contextualize research files.

  4. Implement coding pipelines (R, Python) to transform the raw data into clean data for analysis (see the sketch below).
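A minimal sketch of such a pipeline in Python/pandas; the file and column names (MouseID, Condition, Counts, Area_mm2) are hypothetical:

```python
import pandas as pd

# Read the raw table (hypothetical file and column names)
raw = pd.read_csv("Widefield_5x_Cortex_NeuN_Counts.csv")

# Clean: drop incomplete records, enforce types, derive an analysis variable
clean = (
    raw.dropna(subset=["MouseID", "Counts"])
       .astype({"MouseID": "string", "Condition": "category"})
       .assign(Counts_per_mm2=lambda d: d["Counts"] / d["Area_mm2"])
)

# Save the clean table for analysis; the raw file stays untouched
clean.to_csv("NeuN_Counts_Clean.csv", index=False)
```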

Codebook example (https://domstat.med.ucla.edu/)

Tip

Following these practices ensures organized, clean, and validated datasets.

Working with data tables

Agenda

  1. Basic principles for handling research data

  2. Working with data tables

  3. Working with images

  4. Sharing data handling/analysis pipelines

  5. Organizing and sharing data

  6. Data sharing checklist

Tables are the core of scientific data

Despite being the most common format (.xls) for recording and storing data, tables are often the most poorly organized and least reusable objects in research.

from https://dansteer.wordpress.com/

Courtesy of researcher

Examples from published research

Zhao et al. (2024). Nature Comm. DOI: 10.1038/s41467-024-50836-6

Balinda et al. (2024). Nature Comm. DOI: 10.1038/s41467-024-50558-9

Examples from Crystal Lewis (2024)

Lewis (2024). DOI: 10.1201/9781032622835-3

Examples from Crystal Lewis (2024)

Lewis (2024). DOI: 10.1201/9781032622835-3

Building accessible data tables

A typical long-format data table organizes the information by rows and columns

Columns

  • Identifier variables: animal ID, time point, condition (factors or characters).
  • Analysis variables: score, area, number of cells, etc. (numerical or categorical).
  • Variables created during processing (proportions, ratios, etc.).

Rows

  • Variable values: entries for each column (variable). Each row corresponds to a unique observation (see the sketch below).
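As a sketch, such a long-format table can be written down directly in Python/pandas (all values hypothetical):

```python
import pandas as pd

# Identifier variables (AnimalID, TimePoint, Condition) plus one analysis
# variable (Cells); each row is a unique observation (hypothetical values)
long_table = pd.DataFrame({
    "AnimalID":  ["M1", "M1", "M2", "M2"],
    "TimePoint": ["1D", "2D", "1D", "2D"],
    "Condition": ["Sham", "Sham", "MCAO", "MCAO"],
    "Cells":     [120, 95, 310, 280],
})
print(long_table)
```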

Wide and long table formats

A typical wide-format data table, from Lewis (2024). DOI: 10.1201/9781032622835-3

In a wide-format table, each subject occupies a single row and each variable is an individual column: subject, Id1, Id2, Var1, Var2, Time1, Time2, Time3.

Tip

Here, columns act as responses or predictors in a regression. Example:

Cells_3D ~ Cells_1D + Cells_2D.

Wide and long table formats

A typical long-format data table, from Lewis (2024). DOI: 10.1201/9781032622835-3

In a long-format table, each subject occupies multiple rows, with each observation recorded in its own row:

subject (repeated), Id1, Id2 (repeated), Time (1, 2, 3).

Tip

Useful when analyzing time-lapse data, as it groups different condition variables in a single column. Example:

Cells ~ TimePoint (1D, 2D, 3D).

Long-format is usually the first choice for data analysis.

The best of all…

You can use R (or Python) and Quarto to convert from long to wide table format, or vice versa. Check this tutorial, or the pandas sketch below.

Long to wide format (https://tavareshugo.github.io/)
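In Python/pandas, the conversion is a single call in each direction. A sketch reusing the hypothetical long_table from above:

```python
# Long -> wide: one column per time point
wide = (long_table
        .pivot(index=["AnimalID", "Condition"],
               columns="TimePoint", values="Cells")
        .reset_index())

# Wide -> long: gather the time-point columns back into rows
long_again = wide.melt(id_vars=["AnimalID", "Condition"],
                       var_name="TimePoint", value_name="Cells")
```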

Provide metadata (readme file)

Your tables are unintelligible if they are not accompanied by codebooks/readme files describing their content. Recommended formats are .txt, .pdf, and .md. A starting point can even be scripted (see the sketch below).

Example of a readme file
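One way to start a codebook is to draft its skeleton from the table itself and then annotate each entry by hand. A sketch (file names hypothetical):

```python
import pandas as pd

# Draft a codebook skeleton from a clean table (hypothetical file names)
table = pd.read_csv("NeuN_Counts_Clean.csv")

with open("NeuN_Counts_Codebook.txt", "w") as codebook:
    for column in table.columns:
        codebook.write(
            f"{column} ({table[column].dtype}): "
            f"e.g. {table[column].iloc[0]} -- describe this variable here\n"
        )
```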

Working with images

Agenda

  1. Basic principles for handling research data

  2. Working with data tables

  3. Working with images

  4. Sharing data handling/analysis pipelines

  5. Organizing and sharing data

  6. Data sharing checklist

When handling research images, please consider:

Manrique-Castano et al. (2024). DOI: 10.17605/OSF.IO/3VG8J
  • When possible, convert proprietary files (e.g., .czi) to open formats with no compression (.tif).
  • Share technical (acquisition parameters) and descriptive (context and content) metadata along with the images.
  • Use FIJI, Python, or related coding/scripting software (preferred) to document image transformations (resizing, background subtraction, etc.).
  • Extract information/perform analysis using coding/scripting software to ensure reproducibility. Please avoid manual counts/analysis.

Transform images to open formats

FIJI script to save .czi images as .tif. From Manrique-Castano et al. (2024). DOI: 10.17605/OSF.IO/3VG8J

You can easily transform your proprietary files (.czi) to open formats (.tif) using, for example, FIJI scripts.

Caution

Saving .czi images as .tif using FIJI will result in metadata loss (the metadata remains archived within the original .czi file).
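A scripted batch conversion is an alternative to doing this image by image. A sketch assuming the third-party Python packages aicsimageio and tifffile are installed, with a hypothetical folder layout:

```python
from pathlib import Path
from aicsimageio import AICSImage  # third-party reader for .czi and similar formats
import tifffile

# Convert every .czi file in a folder to uncompressed .tif (hypothetical path)
for czi_path in Path("Data_Input/Images").glob("*.czi"):
    img = AICSImage(czi_path)
    tifffile.imwrite(czi_path.with_suffix(".tif"), img.data)
```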

Keep track of metadata

Technical

Export technical metadata from proprietary images (e.g., .czi) as .txt or .csv files (this can be applied to all images in a batch; see the sketch below).

Example of technical metadata in FIJI: Image -> Show Info
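Outside FIJI, the same export can be scripted in batch. A sketch assuming the aicsimageio package (paths hypothetical):

```python
from pathlib import Path
from aicsimageio import AICSImage

# Save each image's embedded technical metadata as a sidecar .txt file
for czi_path in Path("Data_Input/Images").glob("*.czi"):
    metadata = AICSImage(czi_path).metadata  # raw, format-specific metadata object
    czi_path.with_suffix(".txt").write_text(str(metadata))
```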

Descriptive

Generate descriptive readme files to explain the provenance and naming conventions of the images.

Sharing data handling/analysis pipelines

Agenda

  1. Basic principles for handling research data

  2. Working with data tables

  3. Working with images

  4. Sharing data handling/analysis pipelines

  5. Organizing and sharing data

  6. Data sharing checklist

A worrying research landscape

We live in a pandemic of fraudulent and irreproducible science.

Increase in the number of retracted articles in the last three decades

This worrying landscape demands that, as researchers of integrity, we employ best scientific practices when sharing research data and analysis procedures.

Tip

Scientists have a huge amount of resources available to help them in this process.

Partners to handle analysis pipelines

RStudio/Quarto (R + Python)

RStudio/Quarto screen

GitHub (Version control)

GitHub screen

With RStudio (R and Python) you can…

RStudio/Quarto (R + Python)

RStudio/Quarto screen

Keep track with version control

GitHub screen

With GitHub or GitLab you can:

  • Store your code/data in a secure place and share it with collaborators and the public.

  • Keep a history of changes and version your code (v 1.0, 1.2, 2.0).

  • Link/render your code in different platforms (e.g., the Open Science Framework repository).

  • Share your code to support other researchers and contribute to a culture of open and reproducible science.

Global supporting communities

Organizing and sharing data

Agenda

  1. Basic principles for handling research data

  2. Working with data tables

  3. Working with images

  4. Sharing data handling/analysis pipelines

  5. Organizing and sharing data

  6. Data sharing checklist

1. Define a dataset schema/roadmap

At the beginning of your research (optimal) or during it (not bad), define an organized schema for your data.

Think about

  • Folder/directory structures
  • File types/formats
  • Logical, descriptive naming conventions

Overall, ensure the schema is logical and consistent. An external user must be able to understand your dataset structure (a folder scaffold can even be scripted; see the sketch below).
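A minimal sketch in Python; the folder names follow the TIER-style structure described later in this deck, and the root folder is hypothetical:

```python
from pathlib import Path

# Create a consistent dataset skeleton (hypothetical root folder)
root = Path("MyDataset")
for folder in ["Data_Input", "Data_Intermediate", "Data_Analysis",
               "Scripts_Processing", "Scripts_Analysis", "Output"]:
    (root / folder).mkdir(parents=True, exist_ok=True)

# An empty readme as a reminder to document the structure
(root / "README.txt").touch()
```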

2. Write a readme file

README files are guides to understand datasets and tables.

From https://github.com/twbs/bootstrap-rubygem

There are templates/resources to guide the generation of readme files:
  • Creating a README file
  • Readme.so
  • Readme.ai

Contents of a readme file

Generally, a dataset readme file showcases:

  • Dataset identifiers showing information such as title, authors, data collection date, and geographic information.

  • A map of files/folders defining the content and hierarchy of folders and subfolders, together with naming conventions.

  • Methodological information showcasing methods for data collection/generation, analysis, and experimental conditions.

  • A set of instructions and software for opening and handling the files and reproducing research pipelines.

  • Sharing and access information detailing permissions and conditions of use.

Please note

A dataset is a standalone object. Methodological information MUST NOT be relegated to associated research articles.

3. Organize dataset folders

An organized schema is the key to understanding the data structure.

From pexels.com

File structure

Diving into the folder tree

Tip

Plan/define directory structures, file formats, and naming conventions.

For example, TIER 4.0 is a systematic template for standardizing and increasing the transparency and reproducibility of research data. Users can download a folder structure and adapt it to specific cases.

Folder tree

Organizing a data folder

The data folder must be organized logically and hierarchically according to the characteristics of each dataset.

Input data

Sharing the input/raw data is a best practice for research integrity. The Data_Input folder contains:

  1. Data files (stored in subfolders if necessary)
  • Original images (.tif, .czi)
  • Output files from a measuring device (.txt, .csv, .pdf)
  • Original registration datasheets (.png, .csv, .xlsx)

Folder tree

  2. A metadata file/folder

This folder contains descriptive details about the dataset files:

  • README files: showcase identifiers and methodological/technical details.
  • Codebooks / data dictionaries: explain the content of tables. They are generally .txt, .csv, or .xlsx files.

The aim of these resources is to support the reuse of the data by providing a faithful and sufficient description of the variables.

Analysis data

A Data_Analysis folder contains the processed files used to generate research results.

Apart from a codebook/data dictionary, these files may be accompanied by a Data_Appendix showcasing basic descriptive statistics.

Folder tree

Intermediate data (Optional)

A Data_Intermediate folder may contain mid-step processed data or pre-processed files produced during the analysis workflow. Examples include image ‘masks’ and machine learning classifiers.

Processing scripts

A Scripts_Processing folder may contain scripts or code to transform the raw data (images, tables) for analysis.

Examples of workflows (sketched in code below):

  • Dropping variables (subsetting the dataset).
  • Generating new variables (performing computations, calculating means, etc.).
  • Combining different information sources (merging tables or files).
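A sketch of these three workflows in Python/pandas; the tables and column names are hypothetical:

```python
import pandas as pd

counts = pd.read_csv("Data_Input/Counts.csv")     # hypothetical raw tables
animals = pd.read_csv("Data_Input/Animals.csv")

# Dropping variables: keep only the columns needed for analysis
subset = counts[["AnimalID", "TimePoint", "Cells"]]

# Generating new variables: a per-animal mean across time points
means = subset.groupby("AnimalID", as_index=False)["Cells"].mean()

# Combining information sources: merge counts with animal-level metadata
merged = subset.merge(animals, on="AnimalID", how="left")
merged.to_csv("Data_Analysis/Counts_Merged.csv", index=False)
```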

Caution

The data you obtain from measurements may not be formatted for analysis.

Keep in mind

You will generate several processing scripts. Logical naming conventions are the key to linking the input/output data with the processing scripts.

Analysis scripts

The Scripts_Analysis folder hosts scripts/code to generate results. They may take the form of:

  • Batch processing scripts (FIJI, QuPath, CellProfiler)
  • Quarto (.qmd) or Markdown (.md) documents
  • Jupyter notebooks (.ipynb) or Python scripts (.py)
  • MATLAB files (.m)

Folder tree

Tip

Analysis scripts import and handle the Analysis data to produce research results.

The output folder

The Output folder contains files generated by analysis scripts in the form of:

  • Images
  • Figures
  • Tables
  • Statistical models

Folder tree

Data sharing checklist

Agenda

  1. Basic principles for handling research data

  2. Working with data tables

  3. Working with images

  4. Sharing data handling/analysis pipelines

  5. Organizing and sharing data

  6. Data sharing checklist

Sharing data (in repositories)

When you share data, make sure it meets these characteristics:

  1. Folders and files are organized in a structured way: Use standardized file formats (e.g., CSV, TIFF) and check for consistency in naming conventions.

  2. The metadata/readme allows the dataset to be understood as a standalone object, documenting data collection methods, processing steps, and relevant context.

  3. Ideally, the dataset contains the reproducible workflows used to process the data and generate the research results.

In summary

Be aware that the dataset is a research object that serves the public and the scientific community, and that it can be used (and cited) independently of the research article.

Why not?

Think of research articles as supplements to your dataset!

Find more supporting material

Visit us at

https://www.frdr-dfdr.ca/repo/

or contact us