Complexities of reuse and synthesis in the open data landscape

By SORTEE | November 14, 2023

Open data offers immense opportunities for ecologists and evolutionary biologists. The more good-quality data are available, the more questions can be answered, at broader spatial and temporal scales and with greater taxonomic generality. However, making use of open data is far from straightforward.

At this year’s SORTEE conference, Rose Trappes and Alfredo Sánchez-Tójar co-organised a productive unconference to tackle this complex topic. They invited three experienced panellists, Matt Grainger, Antica Culina and Benno Simmons, to discuss the opportunities and challenges of data reuse and data synthesis in the fast-moving world of open data. We heard from researchers experienced in reusing and synthesising data, as well as those active in creating open data resources for ecologists and evolutionary biologists.

As anyone who has tried to reuse or synthesise open data can tell you, finding, accessing, and making sense of open data brings many challenges. These range from unintuitive variable names and missing units, through incomplete or non-standard metadata, to poor-quality data and corrupt or missing files [1,2]. Then there are the difficulties of combining data from different sources, which means grappling with all sorts of heterogeneity.

One major message was the need for more readily reusable data. Achieving this will include:

Discussing data and metadata standards and the importance of complete and understandable information about how data were produced.
Implementing version control (e.g., GBIF versioning) and methods for linking data to track provenance, to deal with data duplication and errors in datasets.
Educating data producers on how to make data accessible and usable (complete, clear, etc.) [3,4].
Convincing funders and journals to enforce data-sharing (and code-sharing) policies [5].
Campaigning for funders and universities to provide more resources to ensure that data producers can actually follow best practices.
Considering how to best acknowledge data producers (and software package developers), to recognise and encourage their contributions.
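
To make the point about complete and understandable metadata concrete, here is a minimal sketch of the kind of check a data producer could run before sharing a dataset. It is an illustration only and not something presented at the unconference: the column names, the data dictionary, and the required fields ("description" and "unit") are all hypothetical and do not follow any formal metadata standard.

```python
# Hypothetical required fields for each column's metadata entry.
# These are illustrative assumptions, not a formal standard.
REQUIRED_FIELDS = ("description", "unit")


def find_metadata_gaps(columns, data_dictionary):
    """Map each column name to the list of required metadata fields it lacks."""
    gaps = {}
    for column in columns:
        entry = data_dictionary.get(column, {})
        missing = []
        for field in REQUIRED_FIELDS:
            value = entry.get(field)
            # Treat absent, None, and blank values as missing.
            if value is None or not str(value).strip():
                missing.append(field)
        if missing:
            gaps[column] = missing
    return gaps


# Hypothetical dataset columns and an incomplete data dictionary.
columns = ["site_id", "bm", "temp"]
data_dictionary = {
    "site_id": {"description": "Unique sampling site identifier", "unit": "none"},
    "bm": {"description": "Body mass of the individual"},  # unit not recorded
    # "temp" has no entry at all
}

gaps = find_metadata_gaps(columns, data_dictionary)
for column, missing in gaps.items():
    print(f"{column}: missing {', '.join(missing)}")
```

A check like this only catches absent information. Whether a description is actually understandable to someone outside the original project still needs a human reader.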

We also discussed some key opportunities for open data to enhance synthesis studies:

Getting big data producers (e.g., WWF) to release their data is one way to massively increase the amount of data available for reuse and synthesis.
Greater consensus around what to measure and how to measure it will lead to data that are much easier to synthesise and more meaningful for important research questions. Ecologists could adapt methods from medicine for developing ‘core domain sets’ or ‘core outcomes,’ where researchers agree on what should be measured and how across studies [6].

There are of course limitations to be recognised. Ecological systems are inherently very complex and diverse, which might limit the ability to standardise data, naming, methods, and so on across studies. In addition, there are (perceived) costs and benefits of data-sharing that need to be balanced [1] and which require changes in incentives [7]. More generally, we may have to accept that we cannot fix all the problems with open data.

Nevertheless, there was a sense that the open data landscape in ecology and evolutionary biology is actually improving. We are heading in the right direction, and we should continue pushing for better practices.

A final proposal was to start a group of evidence synthesists with the support of SORTEE. This group could keep track of the issues its members encounter while doing synthesis, as well as which data should be reported, and how, to be useful for generating further knowledge. They could use these records to construct a wish list of best practices for open data and, more generally, open materials (e.g., code), based on their experiences in data reuse and synthesis.

References

[1] Soeharjono, S. and Roche, D.G. (2021) Reported Individual Costs and Benefits of Sharing Open Data among Canadian Academic Faculty in Ecology and Evolution. BioScience 71, 750–756 https://doi.org/10.1093/biosci/biab024
[2] Roche, D.G. et al. (2022) Slow improvement to the archiving quality of open datasets shared by researchers in ecology and evolution. Proceedings of the Royal Society B: Biological Sciences 289, 20212780 https://doi.org/10.1098/rspb.2021.2780
[3] Gerstner, K. et al. (2017) Will your paper be used in a meta‐analysis? Make the reach of your research broader and longer lasting. Methods Ecol Evol 8, 777–784 https://doi.org/10.1111/2041-210X.12758
[4] Hennessy, E.A. et al. (2022) Ensuring Prevention Science Research is Synthesis-Ready for Immediate and Lasting Scientific Impact. Prev Sci 23, 809–820 https://doi.org/10.1007/s11121-021-01279-8
[5] Culina, A. et al. (2020) Low availability of code in ecology: A call for urgent action. PLOS Biology 18, e3000763 https://doi.org/10.1371/journal.pbio.3000763
[6] Reed, M.S. et al. (2022) Peatland core domain sets: building consensus on what should be measured in research and monitoring. Mires Peat 28, 1–21 https://doi.org/10.19189/MaP.2021.OMB.StA.2340
[7] O’Dea, R.E. et al. (2021) Towards open, reliable, and transparent ecology and evolutionary biology. BMC Biol 19, 68 https://doi.org/10.1186/s12915-021-01006-3