Good, clean, and FAIR data for all!

FAIR principles

The FAIR principles are a set of guidelines aimed at improving the management and stewardship of data, focusing on data Findability, Accessibility, Interoperability, and Reusability. They were introduced in 2014 and published in 2016 in the Nature journal Scientific Data, followed by the launch of the GO FAIR initiative in 2018.

The primary goal of the FAIR principles is to help present and future users get value out of data effectively. The cost of not having FAIR data (for scientific research) is estimated at EUR 10 to 20 billion annually across the European Union.

Implementing the FAIR principles presents challenges and requires dedicated resources, including financial investment, time, effort, skills, and infrastructure. It is therefore essential to expect a return on investment grounded in reliable metrics.

FAIR metrics

Numerous initiatives propose metrics or frameworks that assess the FAIRness status and maturity level of data. However, much like implementing the FAIR principles themselves, deploying FAIR metrics requires significant resources and effort.
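To make the idea of assessing FAIRness concrete, here is a minimal sketch of the kind of automated spot-checks such frameworks run. It is written in Python as an assumption (the text names no tooling), and the two tests and the example record are purely illustrative; this is not an implementation of any published FAIR maturity framework.

```python
import urllib.request

# Hypothetical FAIRness spot-checks -- illustrative only, not an
# implementation of any published FAIR maturity framework.

def identifier_resolves(identifier_url: str) -> bool:
    """F1/A1: does the (meta)data identifier resolve over a standard protocol?"""
    try:
        request = urllib.request.Request(identifier_url, method="HEAD")
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status < 400
    except OSError:
        return False

def declares_licence(metadata: dict) -> bool:
    """R1.1: is a clear usage licence declared in the metadata?"""
    return bool(metadata.get("license"))

# Example record; the DOI is that of the 2016 FAIR principles paper.
record = {
    "identifier": "https://doi.org/10.1038/sdata.2016.18",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}
print(identifier_resolves(record["identifier"]), declares_licence(record))
```

Real assessment frameworks go much further (vocabularies, provenance, protocol openness, and so on), but even checks this small make the cost and benefit of FAIR metrics tangible.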

Decomposition of FAIR principles

The FAIR principles decompose as follows (a metadata sketch follows the list):

  • Findability: Data should be easy to find for both humans and machines. This includes the use of unique identifiers (like DOIs) and metadata that describe the data.
    • F1. (meta)data are assigned a globally unique and eternally persistent identifier.
    • F2. data are described with rich metadata.
    • F3. (meta)data are registered or indexed in a searchable resource.
    • F4. metadata specify the data identifier.
  • Accessibility: Once found, data should be accessible. This means that the data should be retrievable using standard protocols, and any restrictions on access should be clearly stated.
    • A1. (meta)data are retrievable by their identifier using a standardized communications protocol.
      • A1.1 the protocol is open, free, and universally implementable.
      • A1.2 the protocol allows for an authentication and authorization procedure, where necessary.
    • A2. metadata are accessible, even when the data are no longer available.
  • Interoperability: Data should be able to integrate with other datasets and systems. This involves using common standards and formats that facilitate data exchange and collaboration.
    • I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
    • I2. (meta)data use vocabularies that follow FAIR principles.
    • I3. (meta)data include qualified references to other (meta)data.
  • Reusability: Data should be well-documented and accompanied by clear licenses that allow for reuse. This includes providing sufficient metadata to understand the context and limitations of the data.
    • R1. meta(data) are richly described with a plurality of accurate and relevant attributes.
      • R1.1. (meta)data are released with a clear and accessible data usage license.
      • R1.2. (meta)data are associated with their provenance.
      • R1.3. (meta)data meet domain-relevant community standards.
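As a concrete, entirely hypothetical illustration, the sketch below shows a minimal metadata record (as a Python dict, loosely JSON-LD shaped) with comments mapping individual fields to the sub-principles above. The DOI, URLs, and values are invented for the example.

```python
# Hypothetical metadata record; every identifier and value is invented.
dataset_metadata = {
    # F1: globally unique, persistent identifier
    "@id": "https://doi.org/10.1234/example-dataset",
    # F2: rich, descriptive metadata
    "title": "Example water-quality measurements, 2020-2023",
    "description": "Monthly sensor readings from an imaginary river network.",
    # F4: the metadata specify the data identifier
    "identifier": "https://doi.org/10.1234/example-dataset",
    # A1: retrievable over HTTP(S), an open, free, standardized protocol
    "accessURL": "https://data.example.org/water-quality",
    # I1/I2: a shared, broadly applicable vocabulary (here, W3C DCAT)
    "conformsTo": "http://www.w3.org/ns/dcat#Dataset",
    # I3: qualified references to other (meta)data
    "references": ["https://doi.org/10.5678/related-dataset"],
    # R1.1: clear and accessible usage licence
    "license": "https://creativecommons.org/licenses/by/4.0/",
    # R1.2: provenance
    "provenance": "Produced by the (fictional) ACME river sensor network.",
}
```

Note that F3 (registration in a searchable resource) is not a field at all: it is satisfied by depositing this record in an indexed catalogue or repository.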

FAIR is about data, not systems

The FAIR principles are fundamentally focused on enhancing the quality and usability of data rather than the technology used to search, store, or manage it. These principles emphasize structuring data in a way that promotes its discovery and reuse by humans and machines alike. While IT systems may facilitate these processes, they do not inherently possess FAIR characteristics.

Labelling an IT system as “FAIR” creates misconceptions about what the FAIR principles truly mean. While specific infrastructures and technologies are indeed needed to support the FAIRification of data, the principles themselves cannot be confined to the systems that manage the data. This mislabelling can lead to a superficial understanding of FAIR, akin to the term “fake agile,” which refers to organizations that claim to use Agile methodologies without truly embracing their core values.

It is certainly more convenient to apply a FAIR label to existing non-FAIR data (or, even worse, to systems) and assert, “My data/system is FAIR! It says so right here,” than to invest the necessary resources and embark on the challenging, long-term journey of data FAIRification. If we truly wish to reap the benefits that FAIR data promises, we must understand that data FAIRification requires a significant cultural shift in how data is managed and valued, not just a technological upgrade.

Data FAIRification

First, we need to distinguish two different exercises:

  1. FAIRification of legacy data can be a substantial and resource-intensive exercise in data archaeology. It requires you to:

    • thoroughly inventory existing datasets
    • evaluate feasibility (incl. access, quality, foreseen added-value)
    • establish data FAIRification pipelines
    • run selected data in the pipeline
  2. FAIRification of current/future data comes down to building the plane while flying it (a pipeline skeleton is sketched after this list):

    • establish data FAIRification pipelines
    • iteratively run data through the pipeline
    • implement the pipeline as the main data practice
    • get data FAIR-by-design
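
Neither exercise prescribes specific tooling. As one possible sketch, the Python skeleton below expresses a FAIRification pipeline as a sequence of stages; the stage names, the Record structure, and all values are assumptions for illustration, not a prescribed design.

```python
from dataclasses import dataclass, field

# Hypothetical pipeline skeleton; stage names and the Record structure
# are illustrative assumptions, not a standard.

@dataclass
class Record:
    data_uri: str
    metadata: dict = field(default_factory=dict)

def assign_identifier(record: Record) -> Record:
    # Findable: mint or reuse a globally unique, persistent identifier.
    record.metadata.setdefault(
        "@id", f"https://id.example.org/{hash(record.data_uri) & 0xFFFF:04x}")
    return record

def enrich_metadata(record: Record) -> Record:
    # Interoperable/Reusable: add licence and provenance annotations.
    record.metadata.setdefault(
        "license", "https://creativecommons.org/licenses/by/4.0/")
    record.metadata.setdefault("provenance", f"derived from {record.data_uri}")
    return record

def publish(record: Record) -> Record:
    # Accessible/Findable: register in a searchable catalogue (stubbed here).
    print(f"registered {record.metadata['@id']}")
    return record

PIPELINE = (assign_identifier, enrich_metadata, publish)

def fairify(record: Record) -> Record:
    """Run one record through the pipeline; loop over an inventory to scale."""
    for stage in PIPELINE:
        record = stage(record)
    return record

fairify(Record(data_uri="file:///legacy/measurements-1998.csv"))
```

For legacy data, the same pipeline is fed from the inventory built in the first exercise; for current and future data, it becomes the default path every new dataset takes, which is what makes the data FAIR-by-design.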

Next, we need to operationalize FAIR principles. The Swiss National Science Foundation (SNSF) has provided a comprehensive resource that illustrates how the FAIR principles can be operationalized through specific actions that enhance the structure and content of data.

URIs and metadata

The FAIR principles are fundamentally about
1) uniquely identifying data and
2) annotating/enriching data with metadata.

Let’s have a closer look (an RDF sketch follows the list):

  1. Findable:
    • GUPRI (Globally Unique, Persistent, Resolvable Identifier)
    • Rich metadata (including the GUPRI) usable by search engines
  2. Accessible: rich and persistent metadata including
    • Open, free communication protocol
    • Authorisation and access procedures
  3. Interoperable: rich metadata uses
    • Standard knowledge representation
    • Standard vocabulary
    • Links to open reference data
  4. Reusable: rich metadata describing
    • License terms
    • Provenance
    • Domain-specific standards
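To ground these four facets, here is one possible sketch using Python’s rdflib library (an assumption; any RDF tooling would do). It annotates an invented dataset identifier with the standard DCAT, Dublin Core, and PROV vocabularies; every URL and DOI in it is made up.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, PROV, RDF

# Hypothetical example: the DOI and URLs are invented; DCAT, Dublin Core,
# and PROV are real W3C/DCMI vocabularies.
dataset = URIRef("https://doi.org/10.1234/example-dataset")  # the GUPRI

g = Graph()
g.bind("dcat", DCAT)
g.bind("dcterms", DCTERMS)
g.bind("prov", PROV)

# Interoperable: standard knowledge representation (RDF) and vocabularies.
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Example dataset")))
# Reusable: licence terms and provenance.
g.add((dataset, DCTERMS.license,
       URIRef("https://creativecommons.org/licenses/by/4.0/")))
g.add((dataset, PROV.wasDerivedFrom,
       URIRef("https://data.example.org/raw/sensor-dump")))
# Interoperable: qualified link to open reference data.
g.add((dataset, DCTERMS.references,
       URIRef("https://doi.org/10.5678/reference-dataset")))

print(g.serialize(format="turtle"))
```

Accessibility is then largely a matter of serving this metadata over HTTP(S) at the identifier itself, so that the metadata and the access procedure remain available even if the data are withdrawn (A2).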

FAIRification of legacy data

FAIRification of existing/historical data can be achieved with two approaches: modify the original data directly, or annotate a virtualized representation of it.

We recommend annotating virtualized data. This approach is typically less resource-intensive: it only involves adding metadata to virtualized instances of the legacy data, creating a new representation (sketched in code after the steps below):

  1. Virtualize the legacy data in a knowledge graph.
  2. Create a new representation of your existing data without directly touching or altering the original datasets.
  3. Enrich this virtual data layer with appropriate annotations (FAIR metadata).
  4. Expose your newly FAIRified legacy data.
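
The sketch below illustrates these four steps under stated assumptions: a hypothetical legacy CSV (the file name and column names are invented) is virtualized into an RDF graph with Python’s rdflib and enriched with FAIR annotations, while the original file is only ever read, never altered.

```python
import csv

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF, XSD

# Hypothetical base URI for the new, virtual representation.
EX = Namespace("https://data.example.org/legacy/")

def virtualize(csv_path: str) -> Graph:
    g = Graph()
    g.bind("dcterms", DCTERMS)
    # Steps 1-2: build a new representation with new URIs; the source file
    # is only read, never modified.
    with open(csv_path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            subject = EX[f"measurement/{i}"]
            g.add((subject, RDF.type, EX["Measurement"]))
            g.add((subject, EX["value"],                    # invented column name
                   Literal(row["value"], datatype=XSD.decimal)))
    # Step 3: enrich the virtual layer with FAIR annotations.
    g.add((EX["dataset"], DCTERMS.source, Literal(csv_path)))  # provenance
    g.add((EX["dataset"], DCTERMS.license,
           URIRef("https://creativecommons.org/licenses/by/4.0/")))
    return g

# Step 4: expose the FAIRified layer, e.g. as Turtle behind a SPARQL endpoint.
# The call is commented out because the CSV path is fictitious:
# print(virtualize("legacy/measurements.csv").serialize(format="turtle"))
```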

By choosing the virtualization approach, organizations can streamline their FAIRification processes. Having a FAIRified virtual data layer minimizes the challenges associated with legacy data management, such as compatibility issues and the need for extensive data cleansing.