Meeting Report: Hackathon-Workshop on Darwin Core and MIxS Standards Alignment (February 2012)

The Global Biodiversity Information Facility and the Genomic Standards Consortium convened a joint workshop at the University of Oxford, 27-29 February 2012, with a small group of experts from Europe, USA, China and Japan, to continue the alignment of the Darwin Core with the MIxS and related genomics standards. Several reference mappings were produced as well as test expressions of MIxS in RDF. The use and management of controlled vocabulary terms was considered in relation to both GBIF and the GSC, and tools for working with terms were reviewed. Extensions for publishing genomic biodiversity data to the GBIF network via a Darwin Core Archive were prototyped and work begun on preparing translations of the Darwin Core to Japanese and Chinese. Five genomic repositories were identified for engagement to begin the process of testing the publishing of genomic data to the GBIF network commencing with the SILVA rRNA database.

Background
The Global Biodiversity Information Facility [1] (GBIF) Strategic Plan 2012-2016 [2] highlights the challenge and opportunity of making accessible information on the estimated 90% of the planet's biodiversity that is still to be discovered and shared, the currency of which will primarily be genomic biodiversity data. To this end, GBIF is collaborating with the Genomic Standards Consortium [3] (GSC) Biodiversity Working Group (GBWG) on common issues, principally the alignment of standards. On 27-29 February 2012, GBIF led a joint hackathon-workshop on species-level biodiversity and genomic data standards, with the aim of aligning and harmonizing efforts in these related domains. The workshop also contributed to the ongoing work and workshop series of the RCN4GSC [4], a Research Coordination Network (RCN) project for the GSC funded by the USA National Science Foundation, which seeks to promote the integration of genomic standards with ecological and species-level standards. Hosted at the Oxford e-Research Centre [5], the workshop brought together a small group of experts from Europe, USA, China and Japan.

Purposes of the Meeting
The goals of the workshop were to continue the process of aligning the Darwin Core [6] with the MIxS [7] and related genomic standards (e.g. ABCDDNA [8] and WFCC [9]), advance issues on vocabulary/ontology management including multilingual aspects, develop a DwC-A extension for serving genomic data, and identify suitable genomic data repositories with which to engage on connecting to the GBIF network.

Participants
The participants (see appended list) were chosen for their technical knowledge of the various standards and genomic databases. It was not the intention to have every major genomic repository represented.

Vocabulary and Ontology Management
The use and management of controlled vocabulary terms was considered in relation to both GBIF and the GSC, and tools for working with terms were reviewed. The MIxS standard [10] is maintained in a rela-

DwC-A for genomic data
During the workshop, two extensions for publishing genomic biodiversity data to the GBIF network via a DwC-A were prototyped. Both of these use a DwC "occurrence" as the core data type. The extensions are "MIxS Sample" and "TaxonAbundance".

Genomic repositories
Five genomic repositories were identified for engagement to begin the process of testing the publishing of genomic data to the GBIF network:
As a representative of SILVA (PY) was participating in the meeting, it was possible to explore in some detail the structure of this database and its mapping to the DwC-A format. In the weeks immediately following the workshop, a prototype export was completed and is currently being processed by GBIF.
SILVA data was represented as a Darwin Core Occurrence core plus two extensions: Literature References and Identification History. As the core requires a taxonomic designation for each sequence, the SILVA classification was chosen. The core contains rRNA sequence metadata parsed by SILVA from EMBL-ENA and mapped to the relevant Darwin Core properties. For example, the "collection_date" field is represented by verbatimEventDate, while "country" corresponds to locality. The Literature References extension contains the publication title, identifier and journal, as well as author information, where these were present alongside the sequence records. Finally, the Identification History extension was used to represent the different taxonomic opinions for the sequences, i.e., the SILVA classification and the Ribosomal Database Project II (RDP-II) classification.
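The mapping described above can be illustrated with a minimal sketch. It is not the export code used for SILVA. Only the two field correspondences (collection_date to verbatimEventDate, country to locality) come from the text. The input keys "accession", "silva_classification" and "rdp_classification", the use of scientificName for the core taxonomic designation, and the extension keys "identifiedAs" and "source" are assumptions made for illustration. The "id"/"coreid" pairing follows the usual Darwin Core Archive convention of linking extension rows to a core row.

```python
# Field correspondences named in the report; everything else is hypothetical.
SILVA_TO_DWC = {
    "collection_date": "verbatimEventDate",
    "country": "locality",
}


def to_core_row(record):
    """Build one Occurrence core row from a SILVA-like record (a dict)."""
    row = {
        "id": record["accession"],  # assumed record identifier
        # Assumption: the SILVA classification fills the taxonomic designation.
        "scientificName": record["silva_classification"],
    }
    for source_field, dwc_term in SILVA_TO_DWC.items():
        value = record.get(source_field)
        if value:  # skip fields that are missing or empty
            row[dwc_term] = value
    return row


def to_identification_rows(record):
    """Build Identification History rows, one per available classification."""
    rows = []
    candidates = (
        ("SILVA", "silva_classification"),
        ("RDP-II", "rdp_classification"),
    )
    for source, field in candidates:
        name = record.get(field)
        if name:
            rows.append(
                {"coreid": record["accession"], "identifiedAs": name, "source": source}
            )
    return rows


# Made-up example record, for illustration only.
example = {
    "accession": "AB000001",
    "silva_classification": "Bacteria;Firmicutes",
    "rdp_classification": "Bacteria;Firmicutes;Bacilli",
    "collection_date": "12-Mar-2008",
    "country": "Japan",
}
core_row = to_core_row(example)
identification_rows = to_identification_rows(example)
```

Keeping the alternative classifications in extension rows, rather than in extra core columns, lets a sequence carry any number of taxonomic opinions without changing the core layout.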
Thanks to the development of next-generation sequencers, the number of sequences of microbial genes and genomes has grown explosively in recent years. In the meantime, pipelines for the annotation of sequences (e.g. IMG [16], RAST [17], MiGAP [18]) have been developed and made available via the Internet to relieve the bottleneck in data mining of sequences. Our next step, as a community, is to approach the developers of these pipelines to ensure conformance to the standards. This will greatly improve the quality and interoperability of diverse databases and contribute to the efficient re-use of data.

Conclusions / Outcomes
A GBIF community site has been established to act as a focal point for the group to continue collaborations: http://community.gbif.org/pg/groups/22216/genomic-biodiversity-data/. Membership is open to all (requires login) and all workshop participants have received an invitation to join. Several follow-on action items were identified and are being dealt with by the parties listed.
The following tasks have been identified as the next steps in building on the outcomes of the workshop: