Organize and handle research data

A primer for researchers

Daniel Manrique-Castano, Ph.D.

Digital Research Alliance of Canada

Wednesday, September 25, 2024

Basic principles for handling research data

Agenda

  1. Basic principles for handling research data

  2. Working with data tables

  3. Working with images

  4. Sharing data handling/analysis pipelines

  5. Organizing and sharing data

  6. Data sharing checklist

Make datasets understandable

Research data comes in many shapes and forms (tables, images, videos, text).

In all cases, it is essential that the dataset has a clear structure and is understandable by others.

Tip

Try to put yourself in the shoes of an outside observer when structuring the data.

Outside observers generally cannot make sense of research data without guidance

Four best practices for organizing a dataset

  1. Use consistent naming conventions that accurately describe a file’s content and make the relationships between files clear (see the sketch after this list):
  • A1.tif → Exp_MouseID_Day_Condition_Marker.tif
  • CellsTable.xls → Widefield_5x_Cortex_NeuN_Counts.csv
  2. Use open, accessible file formats:
  • .tif for images (preserves the metadata).
  • .csv for tables (non-proprietary).
  • .png or .svg for graphs (preserve quality).
  • .txt or .pdf for documentation (non-proprietary).
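A consistent name can even be assembled programmatically so it never drifts between files. A minimal sketch in Python; every field and value below is hypothetical:

```python
# Build a descriptive file name from experiment fields (all values hypothetical)
experiment = "Exp02"
mouse_id = "M12"
day = "D07"
condition = "MCAO"
marker = "NeuN"

file_name = f"{experiment}_{mouse_id}_{day}_{condition}_{marker}.tif"
print(file_name)  # -> Exp02_M12_D07_MCAO_NeuN.tif
```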

Four best practices for organizing a dataset

  3. Use comprehensive metadata (readme/codebook) to describe and contextualize research files.

  4. Implement coding pipelines (R, Python) to transform the raw data into clean data for analysis (see the sketch below).
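A minimal sketch of such a pipeline in Python/pandas; the file and column names (MouseID, Condition, Counts, Area_mm2) are hypothetical:

```python
import pandas as pd

# Read the raw table (hypothetical file and column names)
raw = pd.read_csv("Widefield_5x_Cortex_NeuN_Counts.csv")

# Clean: drop incomplete records, enforce types, derive an analysis variable
clean = (
    raw.dropna(subset=["MouseID", "Counts"])
       .astype({"MouseID": "string", "Condition": "category"})
       .assign(Counts_per_mm2=lambda d: d["Counts"] / d["Area_mm2"])
)

# Save the clean table for analysis; the raw file stays untouched
clean.to_csv("NeuN_Counts_Clean.csv", index=False)
```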

Codebook example (https://domstat.med.ucla.edu/)

Tip

Following these practices ensures organized, clean, and validated datasets.

Working with data tables

Agenda

  1. Basic principles for handling research data

  2. Working with data tables

  3. Working with images

  4. Sharing data handling/analysis pipelines

  5. Organizing and sharing data

  6. Data sharing checklist

Tables are the core of scientific data

Despite being the most common format (.xls) for recording and storing data, tables are often the most poorly organized and least reusable objects in research.

from https://dansteer.wordpress.com/

Courtesy of researcher

Examples from published research

Zhao et al. (2024). Nature Comm. DOI: 10.1038/s41467-024-50836-6

Balinda et al. (2024). Nature Comm. DOI: 10.1038/s41467-024-50558-9

Examples from Crystal Lewis (2024)

Lewis (2024). DOI: 10.1201/9781032622835-3

Examples from Crystal Lewis (2024)

Lewis (2024). DOI: 10.1201/9781032622835-3

Building accessible data tables

A typical long-format data table organizes the information by rows and columns

Columns

  • Identifier variables: animal ID, time point, condition (factors or characters).
  • Analysis variables: score, area, number of cells, etc. (numerical or categorical).
  • Variables created during processing (proportions, ratios, etc.).

Rows

  • Variable values: entries for each column (variable). Each row corresponds to a unique observation (see the sketch below).
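As a sketch, such a long-format table can be written down directly in Python/pandas (all values hypothetical):

```python
import pandas as pd

# Identifier variables (AnimalID, TimePoint, Condition) plus one analysis
# variable (Cells); each row is a unique observation (hypothetical values)
long_table = pd.DataFrame({
    "AnimalID":  ["M1", "M1", "M2", "M2"],
    "TimePoint": ["1D", "2D", "1D", "2D"],
    "Condition": ["Sham", "Sham", "MCAO", "MCAO"],
    "Cells":     [120, 95, 310, 280],
})
print(long_table)
```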

Wide and long table formats

A typical wide-format data table, from Lewis (2024). DOI: 10.1201/9781032622835-3

In a wide-format table, each subject occupies a single row and each variable is an individual column: subject, Id1, Id2, Var1, Var2, Time1, Time2, Time3.

Tip

Here, columns act as responses or predictors in a regression. Example:

Cells_3D ~ Cells_1D + Cells_2D.

Wide and long table formats

A typical long-format data table, from Lewis (2024). DOI: 10.1201/9781032622835-3

In a long-format table, each subject occupies multiple rows, with each observation recorded in its own row:

subject (repeated), Id1, Id2 (repeated), Time (1, 2, 3).

Tip

Useful when analyzing time-lapse data, as it groups different condition variables in a single column. Example:

Cells ~ TimePoint (1D, 2D, 3D).

Long-format is usually the first choice for data analysis.

The best of all…

You can use R (or Python) and Quarto to convert from long to wide table format, or vice versa. Check this tutorial, or the pandas sketch below.

Long to wide format (https://tavareshugo.github.io/)
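In Python/pandas, the conversion is a single call in each direction. A sketch reusing the hypothetical long_table from above:

```python
# Long -> wide: one column per time point
wide = (long_table
        .pivot(index=["AnimalID", "Condition"],
               columns="TimePoint", values="Cells")
        .reset_index())

# Wide -> long: gather the time-point columns back into rows
long_again = wide.melt(id_vars=["AnimalID", "Condition"],
                       var_name="TimePoint", value_name="Cells")
```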

Provide metadata (readme file)

Your tables are unintelligible if they are not accompanied by codebooks/readme files describing their content. Recommended formats are .txt, .pdf, and .md. A starting point can even be scripted (see the sketch below).

Example of a readme file
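One way to start a codebook is to draft its skeleton from the table itself and then annotate each entry by hand. A sketch (file names hypothetical):

```python
import pandas as pd

# Draft a codebook skeleton from a clean table (hypothetical file names)
table = pd.read_csv("NeuN_Counts_Clean.csv")

with open("NeuN_Counts_Codebook.txt", "w") as codebook:
    for column in table.columns:
        codebook.write(
            f"{column} ({table[column].dtype}): "
            f"e.g. {table[column].iloc[0]} -- describe this variable here\n"
        )
```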

Working with images

Agenda

  1. Basic principles for handling research data

  2. Working with data tables

  3. Working with images

  4. Sharing data handling/analysis pipelines

  5. Organizing and sharing data

  6. Data sharing checklist

When handling research images, please consider:

Manrique-Castano et al. (2024). DOI: 10.17605/OSF.IO/3VG8J
  • When possible, convert proprietary files (e.g., .czi) to open formats with no compression (.tif).
  • Share technical (acquisition parameters) and descriptive (context and content) metadata along with the images.
  • Use FIJI, Python, or related coding/scripting software (preferred) to document image transformations (resizing, background subtraction, etc.).
  • Extract information/perform analysis using coding/scripting software to ensure reproducibility. Please avoid manual counts/analysis.

Transform images to open formats

FIJI script to save .czi images as .tif. From Manrique-Castano et al. (2024). DOI: 10.17605/OSF.IO/3VG8J

You can easily transform your proprietary files (.czi) to open formats (.tif) using, for example, FIJI scripts.

Caution

Saving .czi images as .tif using FIJI will result in metadata loss (the metadata remains archived within the original .czi file).
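A scripted batch conversion is an alternative to doing this image by image. A sketch assuming the third-party Python packages aicsimageio and tifffile are installed, with a hypothetical folder layout:

```python
from pathlib import Path
from aicsimageio import AICSImage  # third-party reader for .czi and similar formats
import tifffile

# Convert every .czi file in a folder to uncompressed .tif (hypothetical path)
for czi_path in Path("Data_Input/Images").glob("*.czi"):
    img = AICSImage(czi_path)
    tifffile.imwrite(czi_path.with_suffix(".tif"), img.data)
```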

Keep track of metadata

Technical

Export technical metadata from proprietary images (e.g., .czi) as .txt or .csv files (this can be applied to all images in a batch; see the sketch below).

Example of technical metadata in FIJI: Image -> Show Info
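Outside FIJI, the same export can be scripted in batch. A sketch assuming the aicsimageio package (paths hypothetical):

```python
from pathlib import Path
from aicsimageio import AICSImage

# Save each image's embedded technical metadata as a sidecar .txt file
for czi_path in Path("Data_Input/Images").glob("*.czi"):
    metadata = AICSImage(czi_path).metadata  # raw, format-specific metadata object
    czi_path.with_suffix(".txt").write_text(str(metadata))
```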

Descriptive

Generate descriptive readme files to explain the provenance and naming conventions of the images.

Sharing data handling/analysis pipelines

Agenda

  1. Basic principles for handling research data

  2. Working with data tables

  3. Working with images

  4. Sharing data handling/analysis pipelines

  5. Organizing and sharing data

  6. Data sharing checklist

A worrying research landscape

We live in a pandemic of fraudulent and irreproducible science.

Increase in the number of retracted articles in the last three decades

This worrying landscape demands that, as researchers of integrity, we employ best scientific practices when sharing research data and analysis procedures.

Tip

Scientists have a huge amount of resources available to help them in this process.

Partners to handle analysis pipelines

RStudio/Quarto (R + Python)

RStudio/Quarto screen

GitHub (Version control)

GitHub screen

With RStudio (R and Python) you can…

RStudio/Quarto (R + Python)

RStudio/Quarto screen

Keep track with version control

GitHub screen

With GitHub or GitLab you can:

  • Store your code/data in a secure place and share it with collaborators and the public.

  • Keep a history of changes and version your code (v 1.0, 1.2, 2.0).

  • Link/render your code in different platforms (e.g., the Open Science Framework repository).

  • Share your code to support other researchers and contribute to a culture of open and reproducible science.

Global supporting communities

Organizing and sharing data

Agenda

  1. Basic principles for handling research data

  2. Working with data tables

  3. Working with images

  4. Sharing data handling/analysis pipelines

  5. Organizing and sharing data

  6. Data sharing checklist

1. Define a dataset schema/roadmap

At the beginning of your research (optimal) or during it (not bad), define an organized schema for your data.

Think about

  • Folder/directory structures
  • File types/formats
  • Logical, descriptive naming conventions

Overall, ensure the schema is logical and consistent. An external user must be able to understand your dataset structure (a folder scaffold can even be scripted; see the sketch below).
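A minimal sketch in Python; the folder names follow the TIER-style structure described later in this deck, and the root folder is hypothetical:

```python
from pathlib import Path

# Create a consistent dataset skeleton (hypothetical root folder)
root = Path("MyDataset")
for folder in ["Data_Input", "Data_Intermediate", "Data_Analysis",
               "Scripts_Processing", "Scripts_Analysis", "Output"]:
    (root / folder).mkdir(parents=True, exist_ok=True)

# An empty readme as a reminder to document the structure
(root / "README.txt").touch()
```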

2. Write a readme file

README files are guides to understand datasets and tables.

From https://github.com/twbs/bootstrap-rubygem

There are templates/resources to guide the generation of readme files:
  • Creating a README file
  • Readme.so
  • Readme.ai

Contents of a readme file

Generally, a dataset readme file showcases:

  • Dataset identifiers showing information such as title, authors, data collection date, and geographic information.

  • A map of files/folders defining the content and hierarchy of folders and subfolders, together with naming conventions.

  • Methodological information showcasing methods for data collection/generation, analysis, and experimental conditions.

  • A set of instructions and software for opening and handling the files and reproducing research pipelines.

  • Sharing and access information detailing permissions and conditions of use.

Please note

A dataset is a standalone object. Methodological information MUST NOT be relegated to associated research articles.

3. Organize dataset folders

An organized schema is the key to understanding the data structure.

From pexels.com

File structure

Diving into the folder tree

Tip

Plan/define directory structures, file formats, and naming conventions.

For example, TIER 4.0 is a systematic template for standardizing and increasing the transparency and reproducibility of research data. Users can download a folder structure and adapt it to specific cases.

Folder tree

Organizing a data folder

The data folder must be organized logically and hierarchically according to the characteristics of each dataset.

Input data

Sharing the input/raw data is a best practice for research integrity. The Data_Input folder contains:

  1. Data files (stored in subfolders if necessary)
  • Original images (.tif, .czi)
  • Output files from a measuring device (.txt, .csv, .pdf)
  • Original registration datasheets (.png, .csv, .xlsx)

Folder tree

  2. A metadata file/folder

This folder contains descriptive details about the dataset files:

  • README files: showcase identifiers and methodological/technical details.
  • Codebooks / data dictionaries: explain the content of tables. They are generally .txt, .csv, or .xlsx files.

The aim of these resources is to support the reuse of the data by providing a faithful and sufficient description of the variables.

Analysis data

A Data_Analysis folder contains the processed files used to generate research results.

Apart from a codebook/data dictionary, these files may be accompanied by a Data_Appendix showcasing basic descriptive statistics.

Folder tree

Intermediate data (Optional)

A Data_Intermediate folder may contain mid-step processed data or pre-processed files produced during the analysis workflow. Examples include image ‘masks’ and machine learning classifiers.

Processing scripts

A Scripts_Processing folder may contain scripts or code to transform the raw data (images, tables) for analysis.

Examples of workflows (sketched in code below):

  • Dropping variables (subsetting the dataset).
  • Generating new variables (performing computations, calculating means, etc.).
  • Combining different information sources (merging tables or files).
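A sketch of these three workflows in Python/pandas; the tables and column names are hypothetical:

```python
import pandas as pd

counts = pd.read_csv("Data_Input/Counts.csv")     # hypothetical raw tables
animals = pd.read_csv("Data_Input/Animals.csv")

# Dropping variables: keep only the columns needed for analysis
subset = counts[["AnimalID", "TimePoint", "Cells"]]

# Generating new variables: a per-animal mean across time points
means = subset.groupby("AnimalID", as_index=False)["Cells"].mean()

# Combining information sources: merge counts with animal-level metadata
merged = subset.merge(animals, on="AnimalID", how="left")
merged.to_csv("Data_Analysis/Counts_Merged.csv", index=False)
```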

Caution

The data you obtain from measurements may not be formatted for analysis.

Keep in mind

You will generate several processing scripts. Logical naming conventions are the key to linking the input/output data with the processing scripts.

Analysis scripts

The Scripts_Analysis folder hosts scripts/code to generate results. They may take the form of:

  • Batch processing scripts (FIJI, QuPath, CellProfiler)
  • Quarto (.qmd) or Markdown (.md) documents
  • Jupyter notebooks (.ipynb) or Python scripts (.py)
  • MATLAB files (.m)

Folder tree

Tip

Analysis scripts import and handle the Analysis data to produce research results.

The output folder

The Output folder contains files generated by analysis scripts in the form of:

  • Images
  • Figures
  • Tables
  • Statistical models

Folder tree

Data sharing checklist

Agenda

  1. Basic principles for handling research data

  2. Working with data tables

  3. Working with images

  4. Sharing data handling/analysis pipelines

  5. Organizing and sharing data

  6. Data sharing checklist

Sharing data (in repositories)

When you share data, make sure it meets these characteristics:

  1. Folders and files are organized in a structured way: Use standardized file formats (e.g., CSV, TIFF) and check for consistency in naming conventions.

  2. The metadata/readme allows the dataset to be understood as a standalone object, documenting data collection methods, processing steps, and relevant context.

  3. Ideally, the dataset contains the reproducible workflows used to process the data and generate the research results.

In summary

Be aware that the dataset is a research object that serves the public and the scientific community, and that it can be used (and cited) independently of the research article.

Why not?

Think of research articles as supplements to your dataset!

Find more supporting material

Visit us at

https://www.frdr-dfdr.ca/repo/

or contact us