Debunking myths around open data

By Elina Takola | April 12, 2024

 

Introduction

Scientific research has led to multiple advancements and methodological innovations. However, modern scientists work under constant time pressure to produce many publications and statistically significant results, and so they sometimes resort to questionable research practices. In a survey that examined how widespread these practices are in the field of Ecology and Evolutionary Biology, the majority of participants admitted to having used a questionable practice in the past: 64% of the respondents had reported only the statistically significant results of an analysis (cherry picking) in at least one publication, 42% had collected additional data after checking the significance of their results (p-hacking), and 51% admitted to presenting an unexpected result as their initial hypothesis (HARKing).

The aforementioned practices can be avoided by adopting the tenets of open science (Fig. 1). Pre-prints, open data and open code promote the transparency of decisions, methods, and result interpretations.

 

Figure 1. Components of open science (Gallagher et al. 2020).

 

Who is sharing their data?

Many. Open data is a practice that is supported by the scientific community and scientific journals. For example, Nature Ecology & Evolution, Nature Climate Change, Science, American Naturalist, Evolution, Proceedings of the Royal Society B: Biological Sciences and Ecology Letters are a few of the journals that ask authors to share their primary data when submitting a manuscript. In addition, most European funding schemes, such as Horizon Europe, require funding recipients to share their data by the end of the funding period. Moreover, some scientific publishers have either created data journals or added data editors to their editorial teams. For example, Nature Portfolio has created Scientific Data, a journal that includes only data publications.

In 2021, a new learned society was founded that focuses its activities on the advocacy of open science practices in the field of ecology and evolutionary biology. SORTEE (Society for Open, Reliable and Transparent Ecology and Evolutionary biology) consists of students, scientists, and professionals devoted to improving ecology and evolutionary biology. Member registration in SORTEE ranges from 0 to 20 €/USD. Society members have access to a variety of materials, tools and seminars, as well as a free registration for its annual conference.

Despite the existence of a supportive and dynamic community around open science, in practice, data sharing depends on the will and initiative of each individual scientist. We do not have to wait for a journal to ask before we share our data. The shortest answer I can give to the question “Who is sharing their data?” is “I am, along with the majority of my collaborators”.

 

Why are we sharing our data?

Data sharing has numerous positive effects on research. Open data facilitate the reproducibility and replicability of scientific results and increase their robustness and reliability. A quantitative study on the impact of open practices showed that publishing research data increases citations by 25%. Moreover, transparent research increases the trust of the general public. In parallel, transparency of scientific studies improves the quality of meta-analyses and reduces publication bias, also known as the file-drawer problem (the disproportionate publication of statistically significant results).

Another reason to practice open science is that the scientific profession is based largely on the reputation and credibility of individual scientists. When a scientist adopts open science practices, they are perceived as more reliable by the community. Yet, the opposite is not necessarily true: a scientist who does not adopt open science practices is not necessarily perceived as less reliable. At the institutional level, institutions that adopt open science practices have a greater probability of attracting funding. Thus, open science works primarily through positive motivation rather than through mandatory rules or punishment for non-compliance.

It is fair for science to be exposed to social control and public discourse. The responsibility for ineffective communication of scientific results lies primarily with the scientific community. Overall, the positive effects of open data contribute towards fairer science.

 

“Yes, but…”: Scepticisms on open data

“The preparation of open data takes more time”

  • Correct, but the positive effects described above make up for this time.

“We don’t need Science Police.”

  • I completely agree! We don’t need Science Police. What we need is honesty and transparency.

“I am scared that someone will take my data/steal my work.”

  • Open data have a usage license and are therefore legally protected.
  • Other scientists can use your data, provided that they cite them (as they would a published paper), which leads to the aforementioned increase in citations.
  • Ecological data are intellectual property, but the ultimate goal of their creation is to contribute to the collective knowledge about the natural environment. Scientists should ask themselves “why are we doing science?” and what their goal as professionals is. It is counterproductive to collect primary data every time we want to publish a paper, because doing so takes a considerable amount of time and resources. The same time and resources could be allocated to developing or learning a new methodology, to a more complex statistical analysis, or even to additional fieldwork that enriches existing data. If we want ecology to move forward as a field, it is necessary to build on pre-existing work. As an analogy, it would be counterproductive if we had to reinvent the wheel every time we wanted to build a car.
  • It is hard to imagine the fields of ecology and environmental science without open data platforms, such as GBIF, IUCN, Natura2000, PREDICTS, TRY plant trait database, WorldClim, Copernicus and many others.
  • Who will steal them and for what? Our fellow scientists are our colleagues and they are not after us. But even if someone does ‘steal’ our data, they can’t get very far. They may publish a paper, maybe more. Fine. Sooner or later, they will be uncovered by their colleagues, which will have consequences. The risk to their career is disproportionately large compared with the effort they save by not adding a line to the reference list of their papers.
  • “How would I know if my data have been stolen?” Although there are still no official tools or software for checking ‘data plagiarism’, there are algorithms that can detect similarities between datasets or other irregularities. Artificial intelligence is also developing quickly, so better solutions may soon be available. In the meantime, there are practical guidelines for dealing with such cases. In addition, the scientific community has the ability to self-regulate, and the contribution of whistle-blowers is crucial.
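
To make the idea of detecting similarities between datasets more concrete, here is a minimal sketch in Python. It is an illustration only, not an established tool: it assumes both datasets share the same column order, and the function names `row_fingerprints` and `jaccard_similarity`, as well as the example records, are hypothetical. Each row is normalised and hashed, and the two sets of hashes are then compared with the Jaccard index (shared rows divided by all distinct rows).

```python
import hashlib


def row_fingerprints(rows):
    """Hash each row after normalising whitespace and letter case."""
    prints = set()
    for row in rows:
        normalised = "|".join(str(value).strip().lower() for value in row)
        prints.add(hashlib.sha256(normalised.encode("utf-8")).hexdigest())
    return prints


def jaccard_similarity(rows_a, rows_b):
    """Share of distinct rows common to both datasets (0 = none, 1 = identical)."""
    a, b = row_fingerprints(rows_a), row_fingerprints(rows_b)
    if not a and not b:
        return 0.0  # two empty datasets: nothing to compare
    return len(a & b) / len(a | b)


# Hypothetical records: (species, date, count)
mine = [("Parus major", "2021-05-01", 14), ("Lynx lynx", "2021-05-02", 2)]
suspect = [("parus major ", "2021-05-01", 14), ("Sitta europaea", "2021-05-03", 5)]

similarity = jaccard_similarity(mine, suspect)
print(f"Row overlap: {similarity:.2f}")
```

A check like this only catches rows that are copied exactly, apart from spacing and capitalisation. Rounded values, renamed species or reordered columns would need fuzzier matching, so a high score is a reason to look more closely, not proof of misuse.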

“I am afraid that someone will find a mistake in my data.”

  • To err is human, and recognizing and correcting mistakes increases the reliability of a scientist. Open science practices help to identify errors in data and code, thus increasing the quality of the final results.

“I am familiar with open science practices, but where I work/study we haven’t adopted them.”

  • Unfortunately, academia has very strong research traditions and universities are often hesitant to modernize and modify the way they operate. However, systemic change is rarely “top-down” without any “bottom-up” pressure. There are currently many calls from multi-authored studies advocating for change in research conduct (e.g. reproducibility studies in ecology or calls for transparency in the authorship contribution). As a result, open science practices are more likely to be introduced in research by under- or post-graduate students and early career researchers.

“I want to use my data in another publication later.”

  • One more reason to publish them. Once published, data can be cited like any other source.

“I want to add more data later.”

  • No problem. There are no rules regarding when to make your data open. The recommendation is to do it upon publication. Alternatively, data repositories offer the possibility to create versions of a dataset.

“The general public cannot understand scientific data.”

  • Open data are usually aimed at the scientific community and/or individuals with expertise. Their publication helps other scientists who want to use them for their own research, instead of communicating separately with every single data author. At the same time, other professional groups can also use them: policy advisors, industry professionals, NGO members and others.
  • The dissemination of scientific data and results to the general public can be done through educational and outreach activities or through graphical user interfaces (GUIs).
  • The complexity of concepts and terminologies in ecology is not a good argument against data sharing, given that the CERN experiment has its own portal with petabytes of open data. CERN scientists were not afraid of their data being misunderstood by the general public; therefore, natural scientists have nothing to worry about. (“It’s not rocket science…”)

 

“I am an ecologist, and I would like to share my data but I don’t know how.”

No problem. The basic rules of open data are simple: Findability, Accessibility, Interoperability and Reusability (FAIR data principles). In other words, other scientists should be able to find and open the data files, connect them with other data and/or software and use them for further analyses. Additionally, there are guidelines that apply specifically to institutions.

The process of making your data open is:

  1. Consider whether your data contain sensitive information (e.g. geographic information for an endangered species). If yes, you will need to remove those data or anonymize them.
  2. Create a metadata file (or a README file). What is shown in each column? What do acronyms mean? Who is the data creator? What are the units of each measure?
  3. Select a data repository. Some charge a fee (e.g. Dryad) while others are free (e.g. Open Science Framework, Zenodo, Figshare). It is important that the repository provides a DOI for the data file. In some repositories there is a storage limit, but usually it is a few dozen gigabytes.
  4. Select a usage license for your data (platforms usually offer Creative Commons licenses).
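
Steps 1 and 2 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed workflow: the column names (`species`, `latitude`, `longitude`, `count`), the list of sensitive species, the rounding precision, the example records, and the helper names `coarsen_sensitive_coordinates` and `build_readme` are all hypothetical. Rounding coordinates is only one possible way to anonymize locations; for some species it may be safer to remove them entirely.

```python
import csv
import io

# Hypothetical example records; real data would be read from a file.
RAW_CSV = """species,latitude,longitude,count
Lynx lynx,51.33962,12.37129,2
Parus major,51.35214,12.43380,14
"""

# Assumption: the data owner decides which species count as sensitive.
SENSITIVE_SPECIES = {"Lynx lynx"}


def coarsen_sensitive_coordinates(rows, sensitive, decimals=1):
    """Round coordinates of sensitive species so exact sites are not exposed."""
    cleaned = []
    for row in rows:
        row = dict(row)  # copy, so the input rows are left untouched
        if row["species"] in sensitive:
            row["latitude"] = str(round(float(row["latitude"]), decimals))
            row["longitude"] = str(round(float(row["longitude"]), decimals))
        cleaned.append(row)
    return cleaned


def build_readme(title, creator, columns):
    """Return README text describing each column and its unit."""
    lines = [f"Title: {title}", f"Creator: {creator}", "", "Columns:"]
    for name, (description, unit) in columns.items():
        lines.append(f"  {name}: {description} [unit: {unit}]")
    return "\n".join(lines) + "\n"


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
open_rows = coarsen_sensitive_coordinates(rows, SENSITIVE_SPECIES)

readme = build_readme(
    title="Example survey counts",
    creator="Jane Doe (hypothetical)",
    columns={
        "species": ("Scientific name", "none"),
        "latitude": ("Latitude, coarsened for sensitive species", "decimal degrees"),
        "longitude": ("Longitude, coarsened for sensitive species", "decimal degrees"),
        "count": ("Individuals observed per visit", "individuals"),
    },
)
print(readme)
```

The README answers the questions from step 2 (what each column shows, who created the data, which units are used) in plain text, so it stays readable without any special software.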

The scientific community has already started adopting open science practices. Many studies provide guidelines on how to maximize the usefulness of open data, and free seminars and informational materials are available (e.g. from SORTEE or other institutions on video platforms).

 

Epilogue

Ecology is on its way to becoming a “big data” science and many questions remain unanswered. However, commonly used ecological databases still contain important gaps (e.g. regarding species richness or population abundances). The practice of open data has already been adopted by a large number of scientists who are willing to rethink their relationship with society. It has been made clear that scientific results cannot remain in file drawers or behind paywalls. To conclude, in an era where conspiracy theories thrive, science cannot continue to be a black box.

 

Dr. Elina Takola, postdoctoral researcher (elina.takola@ufz.de)

Department of Computational Landscape Ecology, UFZ—Helmholtz Centre for Environmental Research, Permoserstrasse 15, Leipzig, 04318, Germany