6.6 Ontologies and controlled vocabulary
Ontologies provide definitions of terms and their relationships to other terms in a human-interpretable way and thereby create a semantic model of the concepts that are used within a specific research domain. In simpler words, ontologies are a way of showing the terms used in a subject area and how they relate to one another. With these references you can ensure that terms in your data are always interpreted the same way and clearly understood by others. For example, ontologies can be particularly useful for filling in Darwin Core terms like measurementType or measurementMethod (see section Terms of class Measurement or fact). Besides ontologies, you can also refer to terms listed in a thesaurus, which can be seen as a domain-specific dictionary. Searching for a specific term across different thesauri can be more cumbersome than searching across ontologies, as there are no look-up services that query several thesauri simultaneously. However, thesauri can be quite helpful for filling in the keywords of your metadata, and we recommend using them. We therefore provide a few examples of thesauri tailored to biodiversity or ecological data:
GEMET - General Multilingual Environmental Thesaurus
EnvThes - Thesaurus for long term ecological research, monitoring and experiments
6.6.1 Tools to help you
The Ontology Lookup Service (OLS) is a repository for biomedical ontologies but it also holds plenty of terms and ontologies relevant for ecology. You can search across ontologies for specific terms or filter for certain ontologies. There is also an API available to facilitate the use of OLS in workflows/programs.
Ontobee is a linked data server and another option to browse through around 260 different ontologies and directly search for specific terms.
If you are more interested in finding an ontology dedicated to a specific domain, looking directly at the OBO Foundry can be helpful. The OBO Foundry (Open Biological and Biomedical Ontology Foundry) is tailored to biological sciences and develops and maintains ontologies. It is not searchable for individual terms but provides information on each ontology.
If you have chosen an ontology and want to assess how FAIR it is before using it, you can use FOOPS!, an ontology pitfall scanner for FAIR. Given the URI of an ontology, it assesses how well the ontology complies with the FAIR principles.
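As a rough illustration of using the OLS API from a workflow, the sketch below only assembles a search URL. The endpoint `https://www.ebi.ac.uk/ols4/api/search` and the parameter names `q`, `ontology` and `rows` are assumptions based on the OLS web service and should be checked against its current documentation. The helper name `build_ols_search_url` is hypothetical.

```r
# Hypothetical helper: build a search URL for the Ontology Lookup Service.
# The endpoint and parameter names are assumptions, not taken from this guide.
build_ols_search_url <- function(term, ontology = NULL, rows = 10) {
  base_url <- "https://www.ebi.ac.uk/ols4/api/search"
  # Percent-encode the search term so that spaces and special characters are safe
  query <- paste0("q=", utils::URLencode(term, reserved = TRUE))
  if (!is.null(ontology)) {
    # Optionally restrict the search to a single ontology
    query <- paste0(query, "&ontology=", utils::URLencode(ontology, reserved = TRUE))
  }
  paste0(base_url, "?", query, "&rows=", rows)
}

ols_url <- build_ols_search_url("leaf area", ontology = "envo", rows = 5)
ols_url
```

The resulting URL could then be passed to a JSON reader such as `jsonlite::fromJSON()` to retrieve the matching terms. That step needs an internet connection and is therefore left out here.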
6.7 Biological taxonomies
There is a diversity of biological taxonomies that you can use to query taxonomic information for the taxa occurring in your dataset. In this guide we cannot cover all of them but we want to provide some more information on a selected set of taxonomies.
6.7.1 GBIF Backbone taxonomy
The GBIF backbone taxonomy, as the name indicates, forms the basis for indexing the species occurrence records stored at GBIF and aims to cover all the species that GBIF deals with. It further aims to bring all the different taxon names together and organise them. Taxa are assembled from a hierarchical list of 105 sources, using the Catalogue of Life (COL) as a starting point and thereby tightly linking these two taxonomies. Species not found in the COL are then assembled from the remaining sources, which are checked afterwards, making the GBIF backbone taxonomy relatively wide-ranging.
6.7.2 Catalogue of Life (COL)
The Catalogue of Life is an international community for listing species and aims to create a consistent and up-to-date list of currently accepted species across all known taxonomic groups, which is freely accessible. Besides listing taxa, it aims to show all scientific names a taxon is referenced by.
6.7.3 Encyclopedia of Life (EOL)
The Encyclopedia of Life aims to gather knowledge about life on earth and make it globally, openly and freely accessible to everyone. Besides taxonomic information, it also provides details on food webs and other ecological aspects of taxa. The community behind it consists of open access biodiversity knowledge providers, such as museums, libraries and universities.
6.7.4 Integrated Taxonomic Information System (ITIS)
ITIS is an authoritative system that contains information about taxa and their relationships. It provides a comprehensive and openly available taxonomy and is used as the taxonomic backbone for the Encyclopedia of Life and within the Catalogue of Life. It aims to provide a comprehensive taxonomy of species worldwide to allow sharing of biodiversity data.
6.7.5 World Register of Marine Species (WoRMS)
WoRMS is an authoritative classification and catalogue of marine taxa managed by taxonomists and thematic experts. It includes accepted names and synonyms, which allows the taxonomic literature to be interpreted. It is the recommended biological taxonomy to retrieve information from when publishing data to OBIS.
6.7.6 Tools to help you
If you want to retrieve taxonomic information directly from one of the aforementioned taxonomies, the R package taxize (Chamberlain et al. (2022)) provides access to the APIs of each of them. With taxize you can do plenty of different operations, for example, passing in a list of taxa and retrieving their taxonomic classification or their identifiers from one of the taxonomies (e.g., get_gbifid_() retrieves the taxon information from the GBIF backbone taxonomy). If you want to check whether the species names you use in your data are up to date, whether they are spelled correctly, or if you only have common names but not scientific names in your data, you can use the global name resolving function of taxize (gnr_resolve()). The Global Names Resolver is a service provided by the EOL and shows you which names could be matched to your input name and in which taxonomies or data sources they can be found.
Bud burst:
One of the tree species in the bud burst data is the Pedunculate oak (Quercus robur). To get detailed taxonomic information for this species, we query it from the GBIF backbone taxonomy. As the results are presented in a list, we additionally bind the rows into a data frame using the dplyr package (Wickham, François, et al. (2023)).
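The call behind the output below is not reproduced in this section. Based on the multi-species query shown later in this section, it presumably looks like the following sketch. The wrapper name `query_gbif_backbone` is hypothetical and only serves to keep the network call in one place.

```r
# Hypothetical helper (not from the guide): query the GBIF backbone taxonomy
# for one or more scientific names and stack the per-name results into one
# data frame. Calling it needs the taxize and dplyr packages and an internet
# connection, because the GBIF API is queried live.
query_gbif_backbone <- function(taxa) {
  taxize::get_gbifid_(sci = taxa) |>
    dplyr::bind_rows()
}

# Not run here because it contacts the GBIF API:
# query_gbif_backbone("Quercus robur")
```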
## usagekey scientificname rank status matchtype canonicalname
## 1 2878688 Quercus robur L. species ACCEPTED EXACT Quercus robur
## 2 8206510 Quercus robur (Ten.) A.DC. species SYNONYM EXACT Quercus robur
## 3 7911626 Quercus robur Asso, 1779 species DOUBTFUL EXACT Quercus robur
## 4 7586523 Quercus robur Pall. species DOUBTFUL EXACT Quercus robur
## confidence kingdom phylum order family genus species
## 1 97 Plantae Tracheophyta Fagales Fagaceae Quercus Quercus robur
## 2 97 Plantae Tracheophyta Fagales Fagaceae Quercus Quercus robur
## 3 96 Plantae Tracheophyta Fagales Fagaceae Quercus Quercus robur
## 4 96 Plantae Tracheophyta Fagales Fagaceae Quercus Quercus robur
## kingdomkey phylumkey classkey orderkey familykey genuskey specieskey synonym
## 1 6 7707728 220 1354 4689 2877951 2878688 FALSE
## 2 6 7707728 220 1354 4689 2877951 2878688 TRUE
## 3 6 7707728 220 1354 4689 2877951 7911626 FALSE
## 4 6 7707728 220 1354 4689 2877951 7586523 FALSE
## class acceptedusagekey
## 1 Magnoliopsida NA
## 2 Magnoliopsida 2878688
## 3 Magnoliopsida NA
## 4 Magnoliopsida NA
## note
## 1 <NA>
## 2 Similarity: name=110; authorship=0; classification=-2; rank=6; status=0; score=114
## 3 Similarity: name=110; authorship=0; classification=-2; rank=6; status=-5; score=109
## 4 Similarity: name=110; authorship=0; classification=-2; rank=6; status=-5; score=109
There are four matches of our taxon in the GBIF backbone taxonomy. In this example, they differ in their scientific name and the author information given there. If we look at the column “status”, it becomes clear that only one of the matches contains the accepted scientific name, while the second match is a synonym and the others are considered “doubtful”. We therefore want to keep only the matches that have the status “accepted” and the matchtype “exact”, which means that the canonical name matches our input name letter by letter. We again use the package dplyr to filter the data, which leaves us with one match.
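The filtering step can be sketched as follows. To keep the example runnable offline, the four matches are retyped from the output above with only a subset of columns, and the object names are illustrative. The filter condition is the same one used in the multi-species query later in this section.

```r
# Four matches retyped from the output above (subset of columns), so that the
# sketch runs without querying the GBIF API.
quercus_matches <- data.frame(
  usagekey = c(2878688, 8206510, 7911626, 7586523),
  scientificname = c("Quercus robur L.", "Quercus robur (Ten.) A.DC.",
                     "Quercus robur Asso, 1779", "Quercus robur Pall."),
  status = c("ACCEPTED", "SYNONYM", "DOUBTFUL", "DOUBTFUL"),
  matchtype = c("EXACT", "EXACT", "EXACT", "EXACT")
)

# Keep only accepted names whose canonical name matched the input exactly
accepted_match <- quercus_matches |>
  dplyr::filter(status == "ACCEPTED" & matchtype == "EXACT")

accepted_match
```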
## usagekey scientificname rank status matchtype canonicalname confidence
## 1 2878688 Quercus robur L. species ACCEPTED EXACT Quercus robur 97
## kingdom phylum order family genus species kingdomkey
## 1 Plantae Tracheophyta Fagales Fagaceae Quercus Quercus robur 6
## phylumkey classkey orderkey familykey genuskey specieskey synonym
## 1 7707728 220 1354 4689 2877951 2878688 FALSE
## class acceptedusagekey note
## 1 Magnoliopsida NA <NA>
If you have more than one taxon in your data, you can also directly query the taxonomic information for a number of species at once.
tree_species <- c("Quercus robur", "Quercus rubra", "Larix kaempferi", "Pinus sylvestris", "Betula pendula")
taxize::get_gbifid_(sci = tree_species) |>
  dplyr::bind_rows() |>
  dplyr::filter(status == "ACCEPTED" & matchtype == "EXACT")
## usagekey scientificname rank status matchtype
## 1 2878688 Quercus robur L. species ACCEPTED EXACT
## 2 2880539 Quercus rubra L. species ACCEPTED EXACT
## 3 2686157 Larix kaempferi (Lamb.) Carrière species ACCEPTED EXACT
## 4 5285637 Pinus sylvestris L. species ACCEPTED EXACT
## 5 5331916 Betula pendula Roth species ACCEPTED EXACT
## canonicalname confidence kingdom phylum order family genus
## 1 Quercus robur 97 Plantae Tracheophyta Fagales Fagaceae Quercus
## 2 Quercus rubra 97 Plantae Tracheophyta Fagales Fagaceae Quercus
## 3 Larix kaempferi 97 Plantae Tracheophyta Pinales Pinaceae Larix
## 4 Pinus sylvestris 98 Plantae Tracheophyta Pinales Pinaceae Pinus
## 5 Betula pendula 99 Plantae Tracheophyta Fagales Betulaceae Betula
## species kingdomkey phylumkey classkey orderkey familykey genuskey
## 1 Quercus robur 6 7707728 220 1354 4689 2877951
## 2 Quercus rubra 6 7707728 220 1354 4689 2877951
## 3 Larix kaempferi 6 7707728 194 640 3925 2686156
## 4 Pinus sylvestris 6 7707728 194 640 3925 2684241
## 5 Betula pendula 6 7707728 220 1354 4688 2875008
## specieskey synonym class acceptedusagekey note
## 1 2878688 FALSE Magnoliopsida NA <NA>
## 2 2880539 FALSE Magnoliopsida NA <NA>
## 3 2686157 FALSE Pinopsida NA <NA>
## 4 5285637 FALSE Pinopsida NA <NA>
## 5 5331916 FALSE Magnoliopsida NA <NA>
However, whenever you query taxonomic information from a taxonomy, you should always check manually whether the taxa have been identified correctly. Not all taxa are present in all taxonomies, and some names are so similar that different taxa are mistaken for the same taxon during name matching.
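One simple aid for this manual check is to list the input names that did not come back as an accepted, exact match. The sketch below uses the five canonical names from the output above plus one deliberately misspelled, made-up name ("Quercus roburr") for illustration.

```r
# Names submitted to the lookup; the last entry is a made-up misspelling
submitted <- c("Quercus robur", "Quercus rubra", "Larix kaempferi",
               "Pinus sylvestris", "Betula pendula", "Quercus roburr")

# Canonical names returned as accepted, exact matches (from the output above)
matched <- c("Quercus robur", "Quercus rubra", "Larix kaempferi",
             "Pinus sylvestris", "Betula pendula")

# Names without an accepted, exact match need a closer look by hand
needs_review <- setdiff(submitted, matched)
needs_review
```

A name that does match still deserves a quick look at its classification columns (kingdom, family and so on), since a letter-by-letter match does not guarantee that the intended taxon was found.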
6.8 Creating GUIDs
A globally unique identifier (GUID) is a text string of 36 characters that can be used, among other things, to assign a unique identifier to each data record. It was established as a variation of the Universally Unique Identifier (UUID), but the two terms are now used synonymously. In contrast to other persistent identifiers (PIDs) that are assigned at the dataset level, such as DOIs, GUIDs do not have to be issued by a central authority but can be created individually using specific algorithms or generators. There are different types of GUID; for more information see here.
Structure of GUIDs: A GUID is built as follows:
{XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}
where each X is a hexadecimal digit, meaning a number from 0 to 9 or a letter from A to F. This structure ensures an extremely low probability of duplication, making each GUID globally unique.
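The structure can be checked with a regular expression. This is a minimal sketch that tests only the 8-4-4-4-12 layout of hexadecimal digits, without the optional surrounding braces. The function name `is_guid` is hypothetical, and the first test string is the packageId used in the EML example later in this chapter.

```r
# Hypothetical helper: TRUE if a string follows the 8-4-4-4-12 hexadecimal layout
is_guid <- function(x) {
  pattern <- "^[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}$"
  grepl(pattern, x)
}

guid_check <- is_guid(c("09aa6a9d-bccc-4eba-a98c-191e3cd09322", "not-a-guid"))
guid_check
```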
To assign GUIDs to your data you can:
use an online GUID generator, for example https://www.uuidgenerator.net
use the R package uuid (Urbanek & Ts’o (2023)) and its function UUIDgenerate()
Note: Once you have created a UUID for your dataset, it should not change anymore.
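A minimal sketch of assigning identifiers with the uuid package is shown below. The three-record table and the column name `recordID` are hypothetical. The identifiers are generated once and stored with the data, so that they stay stable afterwards.

```r
# Hypothetical three-record table; the column names are illustrative
records <- data.frame(
  species = c("Quercus robur", "Quercus rubra", "Betula pendula"),
  stringsAsFactors = FALSE
)

# Generate one identifier per record. Do this once and save the result with
# the data instead of regenerating the identifiers on every run.
records$recordID <- uuid::UUIDgenerate(n = nrow(records))

records
```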
6.9 Cross-referencing other resources in the data
Some contents of datasets do not exist in isolation. To provide context to your data and to properly link the data to related resources, your data should refer to these by cross-referencing, for example, their unique identifier.
If you, for example, have two datasets belonging to two different experiments conducted in the same study area, these datasets can be linked with a cross-reference to give context to them (see CLUE example below). This type of information can be added to the EML metadata file by using the <additionalMetadata> term and its subterm <metadata>. The <metadata> term can then be filled with the Dublin Core term dc:relation, which ideally contains the URI/DOI of the dataset that should be linked. Note that it is important to add the Dublin Core namespace xmlns:dc="http://purl.org/dc/terms/" to the other namespaces in the <eml> element at the top of the document.
CLUE data:
As described for the CLUE data, two different experiments have been performed in the same plots, resulting in two different datasets that are closely linked through the experimental plots. As described above, the EML file of the dataset belonging to the first experiment can link to the dataset of the second experiment by using the Dublin Core term dc:relation in the <additionalMetadata> field of the EML file to reference its DOI.
Note: As the CLUE data is not published yet, there is no DOI, so this is a mock-up DOI.
<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:dc="http://purl.org/dc/terms/"
  xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.2"
  packageId="09aa6a9d-bccc-4eba-a98c-191e3cd09322" system="uuid"
  xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd">
  <dataset>
    <title xml:lang="en">CLUE field data - Vegetation cover under 4 different treatments - Terrestrial Ecology/NIOO-KNAW</title>
  </dataset>
  <additionalMetadata>
    <metadata>
      <dc:relation>10.3xxxx/5S3xxx</dc:relation>
    </metadata>
  </additionalMetadata>
</eml:eml>
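To check that such a cross-reference can be read back, the sketch below parses a shortened version of the EML snippet with the xml2 package and extracts the dc:relation value. It is only an illustration under the assumption that xml2 is installed, and the object names are hypothetical. Note that the closing tag has to match the opening `eml:eml` element for the document to parse.

```r
library(xml2)

# Shortened version of the EML example, keeping only what is needed here
eml_text <- '<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
  xmlns:dc="http://purl.org/dc/terms/">
  <dataset>
    <title xml:lang="en">CLUE field data - Vegetation cover under 4 different treatments - Terrestrial Ecology/NIOO-KNAW</title>
  </dataset>
  <additionalMetadata>
    <metadata>
      <dc:relation>10.3xxxx/5S3xxx</dc:relation>
    </metadata>
  </additionalMetadata>
</eml:eml>'

eml_doc <- read_xml(eml_text)

# Look up the dc:relation element using the namespaces declared in the document
relation <- xml_text(xml_find_first(eml_doc, "//dc:relation", xml_ns(eml_doc)))
relation
```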
Identifiers for the same entity or concept are likely different in different resources and therefore the different identifiers should be mapped to one another. The simplest way of mapping is through a comma- or tab-separated file where each row describes a single entity and the columns provide the identifiers in each dataset. There are other ways of doing this and a more detailed description of these mapping methods can be found here.
Crickets:
The European field cricket (Gryllus campestris Linnaeus, 1758) is listed in several taxonomies with different identifiers. To integrate these different sources, the links can be described in a mapping file (e.g., CSV, see below), where each row represents a species and the columns contain the identifiers from each taxonomy.
scientificName,GBIF_ID,COL_ID,BOL_ID
"Gryllus campestris Linnaeus, 1758",1716462,9GQRY,208632
Note: All IDs are the real IDs from the respective taxonomies but we have not done this mapping in the cricket data ourselves (i.e., it cannot be found in the code on GitHub).
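As a minimal sketch of how such a mapping file can be used, the code below reads the mapping from text and joins it to an observation table that is keyed by the GBIF identifier. The observation table and its column `individualCount` are hypothetical and only serve as an illustration.

```r
# Mapping file content as shown above, read from text instead of from disk
mapping <- read.csv(
  text = 'scientificName,GBIF_ID,COL_ID,BOL_ID
"Gryllus campestris Linnaeus, 1758",1716462,9GQRY,208632',
  stringsAsFactors = FALSE
)

# Hypothetical observations that only carry the GBIF identifier
observations <- data.frame(
  GBIF_ID = c(1716462, 1716462),
  individualCount = c(3, 5)
)

# Join on the shared identifier to bring in the identifiers of the other taxonomies
linked <- merge(observations, mapping, by = "GBIF_ID")
linked
```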
This structured approach ensures that data referencing the same entity across different systems is linked effectively, facilitating data integration and analysis.