Reproducibility proves difficult when we are given only the data and not the code that was used to analyze it. Citation: Abstruse Goose
In “Data Without Software Are Just Numbers” (linked below), Davenport et al. describe the harrowing reality of published data: if we cannot run the same analysis the authors performed on their data set, the data loses its value and turns into an arbitrary Excel sheet of numbers. What we need is for the scientists publishing their magnificent work in high-impact journals like Nature and Science to include the software and code they used to analyze that work. This would increase the reliability of their findings and help with the problem of reproducibility. Publishing software alongside the data makes for “sustainable” science, not magical, unicorn science that works perfectly for the authors but cannot be reproduced after the research is published. In the meantime, we also need to teach our researchers, such as doctoral fellows who do research full-time, how to manage and analyze data correctly. Luckily, Nature, a well-known science journal, has taken heed of this advice and now requires authors to submit their software when applicable. A program in the UK is also fighting to require researchers to publish their data along with their results, all to mitigate the reproducibility problem. (Davenport et al.)
[The link to the paper!](https://datascience.codata.org/articles/10.5334/dsj-2020-003/)
# Rough estimate of how many scientific journals exist
est_of_scientific_journals <- 30000
# Journals we know of that require the data behind a paper to be published: Nature
nature <- 1
proportion_of_journals_that_we_know_that_force_data_to_be_published <- nature/est_of_scientific_journals
proportion_of_journals_that_we_know_that_force_data_to_be_published
## [1] 3.333333e-05
The main points in this article bring attention to a growing concern in our data-centered society. All research, results, and decisions must be backed by data, so we have a lot of it! However, if we do not have the means to correctly analyze this data, we must learn quickly; ignorance is not an excuse for sloppy practice. Furthermore, if we do not publish our data and our analysis method alongside the methods section of our research papers, then we are lying to our readers. Lying by omission. Readers cannot see how we came to our conclusions, and thus we lose credibility. However, many journals have yet to join this train. Researchers all around the world publish their results but refuse to publish their data and their analysis methods. Why? Because they are afraid of others stealing their ideas? Because they don’t want the competition using this resource and getting ahead? These are not good enough excuses to jeopardize sustainable science.
#REMEMBER
proportion_of_journals_that_we_know_that_force_data_to_be_published
## [1] 3.333333e-05
In my Data Science class last week, my instructor presented one of the most famous reproducibility snafus in the Data Science world: Baggerly and Coombes vs. a too-good-to-be-true biology paper. The paper documented the discovery of a viable cancer therapeutic for childhood leukemia, tailored to individuals based on their DNA, and it was already being trialed on human subjects when Baggerly and Coombes noticed that they could not replicate the results in the published article. When they asked the authors for help or clues about how the data had been processed, they were met with crickets. Finally, using an approach Baggerly coined “Forensic Bioinformatics,” B & C were able to replicate the results only by re-implementing two key errors in the data analysis. First, they had to reproduce an “off-by-one” error in which each gene label was matched with the data in the cell above it, not adjacent to it. Second, in coding the treatment as 1 or 2, someone confused the meanings of 1 and 2 as the experiment went on and coded the treatment incorrectly, so B & C had to replicate this mistake as well in order to reach the same conclusions as the published paper. It turns out that the cancer therapeutic was NOT beneficial for childhood leukemia AT ALL. The worst part of the matter: unveiling this finding was not enough to stop the trials. Even after B & C fought for the trials to end, they did not stop until it was uncovered that the PI of the project had lied on his CV. People’s lives were at risk because the original authors did not publish their analysis method with their data, and therefore no one could check their analysis.
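To make that “off-by-one” error concrete, here is a minimal sketch with made-up numbers (not the actual study data) showing how sliding the gene labels by a single row silently attributes a result to the wrong gene:

# Toy example of an off-by-one label shift (hypothetical data, for illustration only)
genes      <- c("GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E")
expression <- c(0.2, 4.8, 1.1, 9.6, 0.7)

# Correct pairing: each value sits next to its own gene
correct <- data.frame(gene = genes, value = expression)

# Off-by-one pairing: the labels slide down one row, as in a spreadsheet
# copy-and-paste gone wrong, so each value is matched with the gene above it
off_by_one <- data.frame(gene = c(NA, genes[-length(genes)]), value = expression)

correct$gene[which.max(correct$value)]       # "GENE_D" -- the right answer
off_by_one$gene[which.max(off_by_one$value)] # "GENE_C" -- the wrong gene entirely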
To avoid the kind of copy-and-paste errors made in the chemotherapeutics paper, do your analysis in R, not Excel. Make sure you know what your data looks like: if there is a header, read in the data with the header. AND ALWAYS TAKE GOOD NOTES OF WHAT YOUR VARIABLES MEAN! Check out below what happens when we read in our data incorrectly (and, after that, a sketch of how labelled factors can save you from the 1-vs-2 treatment mix-up).
# Read the file with its header row, so the first row becomes the column names
data1 <- read.csv("This_data_does_exist.csv", header = TRUE)
mean(data1[[1]])
## [1] 18.26667
# Read the same file but ignore the header, so the column names get read in as data
data2 <- read.csv("This_data_does_exist.csv", header = FALSE)
mean(data2[[1]])  # the column is now character, so the mean is meaningless
## Warning in mean.default(data2[[1]]): argument is not numeric or logical:
## returning NA
## [1] NA
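And for the second error in the chemotherapeutics story, the 1-vs-2 treatment mix-up, here is a minimal sketch (again with made-up numbers) of why a labelled factor is safer than a bare numeric code:

# Hypothetical response measurements for six subjects (illustration only)
response <- c(10, 12, 11, 25, 27, 26)

# Fragile coding: what does 1 mean? What does 2 mean? Only your notes know,
# and if someone swaps the meanings mid-study, nothing in the data warns you.
treatment_numeric <- c(1, 1, 1, 2, 2, 2)
tapply(response, treatment_numeric, mean)  # group means for "1" and "2"

# Safer coding: the label travels with the data itself
treatment_factor <- factor(c("control", "control", "control",
                             "drug", "drug", "drug"))
tapply(response, treatment_factor, mean)   # control mean = 11, drug mean = 26

If the group labels are explicit strings, a swapped code is visible the moment you print or plot the data, instead of hiding inside a 1 or a 2.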
Too long? Didn’t read the whole article? No worries. I will summarize here:
# The punch line: you cannot reproduce an analysis whose data were never published
data <- read.csv("This_data_wasnt_published.csv")
## Warning in file(file, "rt"): cannot open file
## 'This_data_wasnt_published.csv': No such file or directory
## Error in file(file, "rt"): cannot open the connection
Davenport, J.H., Grant, J. and Jones, C.M., 2020. Data Without Software Are Just Numbers. Data Science Journal, 19(1), p.3. DOI: http://doi.org/10.5334/dsj-2020-003