6.6 Ontologies and controlled vocabulary

Ontologies provide definitions of terms and their relationships to other terms in a human-interpretable way and thereby create a semantic model of the concepts used within a specific research domain. Put simply, an ontology shows the terms used in a subject area and how they relate to one another. By referring to ontology terms, you can ensure that the terms in your data are always interpreted the same way and are clearly understood by others. For example, ontologies can be particularly useful for filling in Darwin Core terms like measurementType or measurementMethod (see section Terms of class Measurement or fact). Besides ontologies, you can also refer to terms listed in a thesaurus, which can be seen as a domain-specific dictionary. Unlike with ontologies, searching for a specific term across different thesauri can be cumbersome, as there are no look-up services that query several thesauri simultaneously. Nevertheless, thesauri are helpful for filling in the keywords of your metadata, and we recommend using them. Here are a few examples of thesauri tailored to biodiversity or ecological data:

GEMET - General Multilingual Environmental Thesaurus
EnvThes - Thesaurus for long term ecological research, monitoring and experiments
UNESCO Thesaurus

6.6.1 Tools to help you

The Ontology Lookup Service (OLS) is a repository for biomedical ontologies but it also holds plenty of terms and ontologies relevant for ecology. You can search across ontologies for specific terms or filter for certain ontologies. There is also an API available to facilitate the use of OLS in workflows/programs.
Ontobee is a linked data server and another option to browse through around 260 different ontologies and directly search for specific terms.
If you are more interested in finding an ontology dedicated to a specific domain, looking directly at the OBO Foundry can be helpful. The OBO Foundry (Open Biological and Biomedical Ontology Foundry) is tailored to the biological sciences and develops and maintains ontologies. It is not searchable for individual terms but provides information on each ontology.
If you have chosen an ontology and want to assess how FAIR it is before using it, you can use FOOPS!, an ontology pitfall scanner for FAIR. Given the URI of an ontology, it assesses how well the ontology matches the FAIR principles.
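As a rough illustration of using the OLS API in a workflow, the sketch below builds a search URL in R. The endpoint (https://www.ebi.ac.uk/ols4/api/search) and the query parameters q and ontology are assumptions based on the public OLS documentation and may change; the helper ols_search_url() is a hypothetical name introduced here, and "bud burst" and "envo" are only example inputs.

```r
# Hypothetical helper: build an OLS search URL for a term,
# optionally restricted to one ontology (assumed parameters: q, ontology).
ols_search_url <- function(term, ontology = NULL) {
  url <- paste0(
    "https://www.ebi.ac.uk/ols4/api/search?q=",
    utils::URLencode(term, reserved = TRUE)
  )
  if (!is.null(ontology)) {
    url <- paste0(url, "&ontology=", utils::URLencode(ontology, reserved = TRUE))
  }
  url
}

search_url <- ols_search_url("bud burst", ontology = "envo")
search_url

# With an internet connection, the result could then be fetched, for example with
# the jsonlite package (assuming the matches sit in response$docs of the JSON):
# hits <- jsonlite::fromJSON(search_url)$response$docs
```

Keeping the URL construction separate from the request makes it easy to inspect the query before sending it.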

6.7 Biological taxonomies

There is a diversity of biological taxonomies that you can use to query taxonomic information for the taxa occurring in your dataset. In this guide we cannot cover all of them but we want to provide some more information on a selected set of taxonomies.

6.7.1 GBIF Backbone taxonomy

The GBIF backbone taxonomy, as the name indicates, forms the basis for indexing the species occurrence records stored at GBIF and aims to cover all the species that GBIF deals with. It further aims to bring together and organise all the different taxon names. Taxa are assembled from a hierarchical list of 105 sources, using the Catalogue of Life (COL) as a starting point, which tightly links these two taxonomies. Species not found in the COL are then added from the remaining sources, which are checked afterwards, making the GBIF backbone taxonomy relatively wide-ranging.

6.7.2 Catalogue of Life (COL)

The Catalogue of Life is an international community for listing species and aims to create a consistent and up-to-date list of currently accepted species across all known taxonomic groups, which is freely accessible. Besides listing taxa, it aims to show all scientific names a taxon is referenced by.

6.7.3 Encyclopedia of Life (EOL)

The Encyclopedia of Life aims to gather knowledge about life on earth and make it globally, openly and freely accessible to everyone. Besides taxonomic information, it also provides details on food webs and other ecological aspects of taxa. The community behind it consists of open access biodiversity knowledge providers, such as museums, libraries and universities.

6.7.4 Integrated Taxonomic Information System (ITIS)

ITIS is an authoritative system that contains information about taxa and their relationships. It is openly available and is used as the taxonomic backbone for the Encyclopedia of Life and within the Catalogue of Life. It aims to provide a comprehensive taxonomy of species worldwide to facilitate the sharing of biodiversity data.

6.7.5 World Register of Marine Species (WoRMS)

WoRMS is an authoritative classification and catalogue of marine taxa, managed by taxonomists and thematic experts. It includes both accepted names and synonyms, which helps with interpreting the taxonomic literature. It is the recommended biological taxonomy to retrieve information from when publishing data to OBIS.

6.7.6 Tools to help you

If you want to retrieve taxonomic information directly from one of the aforementioned taxonomies, the R package taxize (Chamberlain et al., 2022) is helpful, as it uses the APIs of each of these taxonomies. With taxize you can do many different operations, for example, passing in a list of taxa and retrieving their taxonomic classification or their identifiers from one of the taxonomies (e.g., get_gbifid_() retrieves the taxon information from the GBIF backbone taxonomy). If you want to check whether the species names in your data are up to date and spelled correctly, or if you only have common names but no scientific names in your data, you can use the global name resolving function of taxize (gnr_resolve()). The Global Names Resolver is a service provided by the EOL and shows you which names could be matched to your input name and in which taxonomies or data sources they can be found.

Bud burst:

One of the tree species in the bud burst data is the Pedunculate oak (Quercus robur). To get detailed taxonomic information for this species, we query it from the GBIF backbone taxonomy. As the results are returned as a list, we additionally bind the rows into a data frame using the dplyr package (Wickham et al., 2023).

taxonInfo <- taxize::get_gbifid_(sci = "Quercus robur") |>
  dplyr::bind_rows()
  
taxonInfo

##   usagekey             scientificname    rank   status matchtype canonicalname
## 1  2878688           Quercus robur L. species ACCEPTED     EXACT Quercus robur
## 2  8206510 Quercus robur (Ten.) A.DC. species  SYNONYM     EXACT Quercus robur
## 3  7911626   Quercus robur Asso, 1779 species DOUBTFUL     EXACT Quercus robur
## 4  7586523        Quercus robur Pall. species DOUBTFUL     EXACT Quercus robur
##   confidence kingdom       phylum   order   family   genus       species
## 1         97 Plantae Tracheophyta Fagales Fagaceae Quercus Quercus robur
## 2         97 Plantae Tracheophyta Fagales Fagaceae Quercus Quercus robur
## 3         96 Plantae Tracheophyta Fagales Fagaceae Quercus Quercus robur
## 4         96 Plantae Tracheophyta Fagales Fagaceae Quercus Quercus robur
##   kingdomkey phylumkey classkey orderkey familykey genuskey specieskey synonym
## 1          6   7707728      220     1354      4689  2877951    2878688   FALSE
## 2          6   7707728      220     1354      4689  2877951    2878688    TRUE
## 3          6   7707728      220     1354      4689  2877951    7911626   FALSE
## 4          6   7707728      220     1354      4689  2877951    7586523   FALSE
##           class acceptedusagekey
## 1 Magnoliopsida               NA
## 2 Magnoliopsida          2878688
## 3 Magnoliopsida               NA
## 4 Magnoliopsida               NA
##                                                                                  note
## 1                                                                                <NA>
## 2  Similarity: name=110; authorship=0; classification=-2; rank=6; status=0; score=114
## 3 Similarity: name=110; authorship=0; classification=-2; rank=6; status=-5; score=109
## 4 Similarity: name=110; authorship=0; classification=-2; rank=6; status=-5; score=109

There are four matches for our taxon in the GBIF backbone taxonomy. In this example, they differ in their scientific name and the author information given there. If we look at the column “status”, it becomes clear that only one of the matches contains the accepted scientific name, while the second match is a synonym and the others are considered “doubtful”. We therefore want to filter the results for only the matches that have the status “accepted” and the matchtype “exact”, which means that the canonical name matches our input name letter by letter. We again use the package dplyr to filter the data, which leaves us with one match.

taxonInfo |> 
  dplyr::filter(status == "ACCEPTED" & matchtype == "EXACT")

##   usagekey   scientificname    rank   status matchtype canonicalname confidence
## 1  2878688 Quercus robur L. species ACCEPTED     EXACT Quercus robur         97
##   kingdom       phylum   order   family   genus       species kingdomkey
## 1 Plantae Tracheophyta Fagales Fagaceae Quercus Quercus robur          6
##   phylumkey classkey orderkey familykey genuskey specieskey synonym
## 1   7707728      220     1354      4689  2877951    2878688   FALSE
##           class acceptedusagekey note
## 1 Magnoliopsida               NA <NA>

If you have more than one taxon in your data, you can also directly query the taxonomic information for a number of species at once.

tree_species <- c("Quercus robur", "Quercus rubra", "Larix kaempferi", "Pinus sylvestris", "Betula pendula")

taxize::get_gbifid_(sci = tree_species) |>
  dplyr::bind_rows() |>
  dplyr::filter(status == "ACCEPTED" & matchtype == "EXACT")

##   usagekey                   scientificname    rank   status matchtype
## 1  2878688                 Quercus robur L. species ACCEPTED     EXACT
## 2  2880539                 Quercus rubra L. species ACCEPTED     EXACT
## 3  2686157 Larix kaempferi (Lamb.) Carrière species ACCEPTED     EXACT
## 4  5285637              Pinus sylvestris L. species ACCEPTED     EXACT
## 5  5331916              Betula pendula Roth species ACCEPTED     EXACT
##      canonicalname confidence kingdom       phylum   order     family   genus
## 1    Quercus robur         97 Plantae Tracheophyta Fagales   Fagaceae Quercus
## 2    Quercus rubra         97 Plantae Tracheophyta Fagales   Fagaceae Quercus
## 3  Larix kaempferi         97 Plantae Tracheophyta Pinales   Pinaceae   Larix
## 4 Pinus sylvestris         98 Plantae Tracheophyta Pinales   Pinaceae   Pinus
## 5   Betula pendula         99 Plantae Tracheophyta Fagales Betulaceae  Betula
##            species kingdomkey phylumkey classkey orderkey familykey genuskey
## 1    Quercus robur          6   7707728      220     1354      4689  2877951
## 2    Quercus rubra          6   7707728      220     1354      4689  2877951
## 3  Larix kaempferi          6   7707728      194      640      3925  2686156
## 4 Pinus sylvestris          6   7707728      194      640      3925  2684241
## 5   Betula pendula          6   7707728      220     1354      4688  2875008
##   specieskey synonym         class acceptedusagekey note
## 1    2878688   FALSE Magnoliopsida               NA <NA>
## 2    2880539   FALSE Magnoliopsida               NA <NA>
## 3    2686157   FALSE     Pinopsida               NA <NA>
## 4    5285637   FALSE     Pinopsida               NA <NA>
## 5    5331916   FALSE Magnoliopsida               NA <NA>

If you query taxonomic information from a taxonomy, you should always check manually whether the taxa are identified correctly. Not all taxa are present in all taxonomies, and some names are so similar that different taxa are mistaken for the same taxon in the name matching process.
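To narrow down which names need such a manual check, one option is to count the accepted exact matches per input name and flag every name that does not have exactly one. The sketch below uses base R and a small hand-made list in place of a live query. It assumes that get_gbifid_() returns a list of data frames named by the input names; the misspelled name "Quercus roburr" and its fuzzy match are invented for illustration.

```r
# Hypothetical stand-in for the list returned by taxize::get_gbifid_();
# only the columns needed for the check are included.
matches <- list(
  "Quercus robur" = data.frame(
    usagekey  = c(2878688, 8206510),
    status    = c("ACCEPTED", "SYNONYM"),
    matchtype = c("EXACT", "EXACT")
  ),
  "Quercus roburr" = data.frame(
    usagekey  = 2878688,
    status    = "ACCEPTED",
    matchtype = "FUZZY"
  )
)

# Count accepted exact matches for each input name
n_accepted_exact <- vapply(
  matches,
  function(df) sum(df$status == "ACCEPTED" & df$matchtype == "EXACT"),
  integer(1)
)

# Names with zero or several accepted exact matches need a manual look
needs_review <- names(n_accepted_exact)[n_accepted_exact != 1]
needs_review
```

A name that passes this check can still be matched to the wrong taxon, so this only prioritises the manual check and does not replace it.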

6.8 Creating GUIDs

A globally unique identifier (GUID) is a text string of 36 characters that can be used, among other things, to assign a unique identifier to each data record. It was established as a variation of the Universally Unique Identifier (UUID), but the two terms are now used synonymously. In contrast to other persistent identifiers (PIDs) that are assigned at the data level, such as DOIs, GUIDs do not have to be issued by a central authority but can be created individually using specific algorithms or generators. There are different types of GUID; for more information see here.

Structure of GUIDs: A GUID is built as follows:

{XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}

where each X is a hexadecimal digit, meaning a number from 0 to 9 or a letter from A to F. This structure ensures an extremely low probability of duplication, making each GUID globally unique.
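The structure described above can be checked with a regular expression. The following base R sketch tests whether strings follow the 8-4-4-4-12 pattern of hexadecimal digits (without the surrounding braces); the function name is_guid() is a hypothetical helper and the input values are just examples.

```r
# Hypothetical helper: TRUE for strings that follow the 8-4-4-4-12
# hexadecimal pattern of a GUID, FALSE otherwise.
is_guid <- function(x) {
  pattern <- paste0(
    "^[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-",
    "[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}$"
  )
  grepl(pattern, x)
}

is_guid(c("09aa6a9d-bccc-4eba-a98c-191e3cd09322", "not-a-guid"))
```

This only checks the format. It cannot tell whether an identifier is actually unique within or across datasets.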

To assign GUIDs to your data you can:

use an online GUID generator, for example https://www.uuidgenerator.net
use the R package uuid (Urbanek & Ts’o, 2023) and its function UUIDgenerate()

Note: Once you have created a UUID for your dataset, it should not change anymore.
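As a minimal sketch of the second option, the code below assigns one identifier to each record of a small data frame using uuid::UUIDgenerate(). The data frame and the column name recordID are made up for illustration.

```r
library(uuid)

# Hypothetical records that need identifiers
records <- data.frame(
  species = c("Quercus robur", "Quercus rubra", "Betula pendula")
)

# Generate one UUID per record
records$recordID <- replicate(nrow(records), uuid::UUIDgenerate())

records
```

Because the identifiers should stay stable, it makes sense to generate them once and store them with the data, rather than regenerating them every time a script is run.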

6.9 Cross-referencing other resources in the data

Some contents of datasets do not exist in isolation. To provide context to your data and to properly link the data to related resources, your data should refer to these by cross-referencing, for example, their unique identifier.

If you, for example, have two datasets belonging to two different experiments conducted in the same study area, these datasets can be linked with a cross-reference to give context to them (see CLUE example below). This type of information can be added to the EML metadata file by using the <additionalMetadata> term and its subterm <metadata>. The <metadata> term can then be filled with the Dublin Core term dc:relation that ideally is filled with the URI/DOI of the dataset that should be linked. Note that it is important to add the Dublin Core namespace xmlns:dc="http://purl.org/dc/terms/" to the other namespaces in the <eml> element at the top of the document.

CLUE data:

As described for the CLUE data, two different experiments have been performed in the same plots, resulting in two different datasets that are closely linked through the experimental plots. As described above, the EML file of the dataset belonging to the first experiment can link to the dataset of the second experiment by using the Dublin Core term dc:relation in the <additionalMetadata> field of the EML file to reference its DOI.

Note: As the CLUE data is not published yet, there is no DOI, so this is a mock-up DOI.

<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns:dc="http://purl.org/dc/terms/"
         xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.2"
         packageId="09aa6a9d-bccc-4eba-a98c-191e3cd09322" system="uuid"
         xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd">
  <dataset>
    <title xml:lang="en">CLUE field data - Vegetation cover under 4 different treatments - Terrestrial Ecology/NIOO-KNAW</title>
  </dataset>
  <additionalMetadata>
    <metadata>
      <dc:relation>10.3xxxx/5S3xxx</dc:relation>
    </metadata>
  </additionalMetadata>
</eml:eml>
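To check that such a cross-reference can be read back by a program, the EML file can be parsed and queried for dc:relation. The sketch below uses the R package xml2, which is not otherwise used in this guide, on an abbreviated copy of the example with the mock-up DOI. The XPath lookup only works if the dc namespace is declared on the root element.

```r
library(xml2)

# Abbreviated copy of the EML example (shortened title, mock-up DOI)
eml_text <- '<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
         xmlns:dc="http://purl.org/dc/terms/"
         packageId="09aa6a9d-bccc-4eba-a98c-191e3cd09322" system="uuid">
  <dataset>
    <title xml:lang="en">CLUE field data</title>
  </dataset>
  <additionalMetadata>
    <metadata>
      <dc:relation>10.3xxxx/5S3xxx</dc:relation>
    </metadata>
  </additionalMetadata>
</eml:eml>'

eml_doc <- xml2::read_xml(eml_text)

# Collect the namespaces declared in the document, including "dc"
ns <- xml2::xml_ns(eml_doc)

# Extract the identifier(s) of the related dataset(s)
related <- xml2::xml_text(xml2::xml_find_all(eml_doc, ".//dc:relation", ns))
related
```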

Identifiers for the same entity or concept are likely different in different resources and therefore the different identifiers should be mapped to one another. The simplest way of mapping is through a comma- or tab-separated file where each row describes a single entity and the columns provide the identifiers in each dataset. There are other ways of doing this and a more detailed description of these mapping methods can be found here.

Crickets:

The European field cricket (Gryllus campestris Linnaeus, 1758) is listed in several taxonomies with different identifiers. To integrate these different sources, the links can be described in a mapping file (e.g., CSV, see below), where each row represents a species and the columns contain the identifiers from each taxonomy.

scientificName,GBIF_ID,COL_ID,BOL_ID
"Gryllus campestris Linnaeus, 1758",1716462,9GQRY,208632

Note: All IDs are the real IDs from the respective taxonomies but we have not done this mapping in the cricket data ourselves (i.e., it cannot be found in the code on GitHub).
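As a sketch of how such a mapping file could be used, the base R code below reads the mapping and joins it to a small table of records that only carry the GBIF identifier. The observations table and its recordID values are hypothetical, and all identifiers are read as text so that they are not altered by numeric conversion.

```r
# Mapping file content, here given inline instead of being read from disk
mapping_csv <- 'scientificName,GBIF_ID,COL_ID,BOL_ID
"Gryllus campestris Linnaeus, 1758",1716462,9GQRY,208632'

# Read all columns as character to keep identifiers unchanged
id_mapping <- read.csv(text = mapping_csv, colClasses = "character")

# Hypothetical records that only know the GBIF identifier
observations <- data.frame(
  recordID = c("obs-1", "obs-2"),
  GBIF_ID  = c("1716462", "1716462")
)

# Add the identifiers of the other taxonomies to each record
linked <- merge(observations, id_mapping, by = "GBIF_ID", all.x = TRUE)
linked
```

Using all.x = TRUE keeps records whose identifier is missing from the mapping, so gaps in the mapping show up as NA instead of silently dropping rows.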

This structured approach ensures that data referencing the same entity across different systems is linked effectively, facilitating data integration and analysis.

References

Chamberlain, S., Szoecs, E., Foster, Z., & Arendsee, Z. (2022). Taxize: Taxonomic information from around the web. https://docs.ropensci.org/taxize/

Urbanek, S., & Ts’o, T. (2023). Uuid: Tools for generating and handling of UUIDs. https://www.rforge.net/uuid

Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). Dplyr: A grammar of data manipulation. https://CRAN.R-project.org/package=dplyr