The Ethics of Reproducibility in Computational Research

Sumeet Kulkarni shares his ethics & society case study, which he completed as part of our Young Scientist Program.

I was motivated to choose this topic because my own BMSIS YSP project involved computational modelling of bubbles growing in lava. The vesicles these bubbles leave behind when the lava solidifies into rock are important proxies for determining the atmospheric pressure at that epoch. This method has already been put to use in field studies of vesicle size distributions in volcanic rocks, placing bounds on prehistoric atmospheric pressures [1]. My computational model aims to study the physics behind why it works. Hypothetically, it could cast doubt on these important experimental findings if it reveals a dependence on additional parameters in the bubble growth process. But who tests the credibility of my own code? Would I be obliged to share it? I ask more such questions, and seek answers to them, here.

The concept of reproducibility is one of the pillars of the scientific method. Any experimental finding should stand the test of being repeated in a fresh environment, using similar or different apparatus or methods. Just like the insight that led Einstein to the Special Theory of Relativity, viz. ‘all laws of physics remain the same for all inertial observers,’ one must expect all scientific results to be reproduced identically (to a certain level of precision) by independent experimenters, thereby removing instrumental, environmental and human bias.

The inability to do so severely undermines the scientific integrity of a new finding. This is especially crucial in areas of applied research like medicine and pharmaceutics. For instance, countless trials must be undertaken to ensure that a newly synthesized drug correctly inhibits a particular molecular pathway, and countless experiments must be performed to confirm that this pathway is responsible for a particular disease. Even in areas of basic research, failure to reproduce has consigned several potentially path-breaking discoveries to the bin. Prominent examples include Joseph Weber’s claimed detection of gravitational waves using a bar detector in 1968, and the more recent claim by the OPERA experiment of neutrinos travelling faster than light.

While the ethics of reproducibility in experimental science is a comprehensive topic in itself (indeed, there have been several controversies associated with it [2]), we turn our attention here to computational research. What does reproducibility imply in a computational study? Why is it important? Let’s take a practical example: imagine a researcher simulating the binding of an enzyme to a protein using molecular dynamics. Their code might provide new insights into the enzyme’s action and open a new avenue for the synthesis of drugs that act in conditions where the enzyme is ineffective. A large amount of funding for this research, as well as for the follow-up clinical trials, may rest on the integrity of what the code outputs. This logically calls for an independent evaluation and implementation of the code by a peer. But does this actually happen? Do researchers share their code publicly, or amongst their peers, when publishing a study? Should they be morally obliged to do so? What standards of reproducibility should we apply to computational research?

At first thought, it would seem prudent to make it mandatory for everyone to share the computational code behind a particular finding. All code would be open source, for anyone to test and verify the results. However, there are some deterrents to this. Many computational studies analyse huge datasets on large computing clusters; even if the code were made accessible, it could not be tested by individuals without access to comparable hardware. With large datasets there is also the question of where reproducibility applies. Take a large astronomical survey like the Sloan Digital Sky Survey (SDSS), which contains enormous numbers of sky images. Reproducing this dataset is plainly impossible, so sharing its analysis code simply amounts to a reanalysis of the same data. A further complication arises from the nitty-gritty of the programming language used, which can hinder readability for the peers who want to run the code.

There is also a subtle difference in shareability between experimental and computational work. Experimental papers usually contain a detailed section discussing the materials and methods used in the study. Computational papers, too, mention the ‘materials’ (details of the computing machines used) and the ‘methods’ (particular algorithms or statistical analysis techniques). Unlike experimental papers, however, there is no established practice of sharing the details of how a code works when publishing computational research. It may well happen that a researcher uses an ingenious coding technique, perhaps an innovative way to simplify a problem, that may influence the outcome of their future work. This is another source of reluctance towards making all code public.

Now that we have discussed the various issues at stake, it is time to address what is ethically right, and how to make it viable. Pragmatism argues against making it mandatory to share all code publicly, for the reasons discussed above; computational code should stand on the same footing as other intellectual property. However, in the interests of scientific integrity and the global endeavour to push the boundaries of research, the ethical premise of utilitarianism calls for each individual to make code available to peers. Utilitarianism holds that an action is ethically correct if it benefits the greatest number. Sharing code with peers whenever practical, and better still publicly, is surely beneficial to the entire scientific community, whereas withholding it puts research that builds upon the work at risk of inaccuracy and of straying out of scope.

Are steps being taken to achieve this unified ethical conduct? Yes indeed! Today we find plenty of research code being shared and documented on websites like GitHub, and the user-friendly way many open-source projects are presented there is a big positive. Many research groups also have agreements to make their data publicly available after a certain period of time, and many among them also provide code for exploring the data. Since code is ultimately intellectual property, any work done using shared code must properly cite its author(s). Journal publishers should also encourage computational scientists to use platforms like GitHub and to share their code alongside the publication. This would be a firm step towards enhancing reproducibility in computational science.
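As a small illustration of the kind of habit such platforms make easy, the sketch below shows one common reproducibility practice: fixing the random seed of a simulation and recording provenance information (seed, interpreter version, platform) alongside the result, so a peer can re-run the exact computation. The script and its Monte Carlo toy calculation are hypothetical examples, not code from any study discussed above.

```python
import json
import platform
import random
import sys

SEED = 42  # fixed seed: a peer re-running the script gets identical random draws


def run_simulation(n_samples, seed=SEED):
    """Toy stand-in for a real computation: estimate pi by Monte Carlo sampling."""
    rng = random.Random(seed)  # private RNG, so other code cannot disturb the sequence
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples


def provenance(seed=SEED):
    """Record enough of the environment for someone else to reproduce the run."""
    return {
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "seed": seed,
    }


if __name__ == "__main__":
    estimate = run_simulation(100_000)
    # Publishing this JSON alongside the paper's figures lets a reviewer
    # re-run the same computation and compare outputs directly.
    print(json.dumps({"pi_estimate": estimate, **provenance()}, indent=2))
```

Because the seed is pinned, two runs of `run_simulation` return bit-identical results, which is exactly the property an independent reviewer needs in order to verify a published number.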

[1] S. Som et al., “Earth’s air pressure 2.7 billion years ago constrained to less than half of modern levels,” Nature Geoscience, 2016.
[2] “1,500 scientists lift the lid on reproducibility,” Nature, 2016. doi:10.1038/533452a
[3] R. Peng, “Reproducible Research in Computational Science,” Science, 2011.