Publishing Data

Data Repositories

Broadly, a data repository is a place where data are stored. Hakai researchers and affiliates are encouraged to publish data to two categories of data repository:

Hakai Institute Data Catalogue and Repository. Data produced or aggregated by Hakai employees, affiliates, postdocs or students funded by the Tula Foundation should be hosted on the Hakai Institute GitHub Repository for long-term archiving, and a metadata record created in the Hakai Data Catalogue using the Hakai metadata intake form.
Domain-specific Repositories. Wherever possible, data should be published to open-access, domain-specific data repositories in addition to being hosted in the Hakai GitHub repository and recorded in the Hakai Data Catalogue. These are data repositories that have well-developed community standards for data and metadata. Examples include CIOOS, ERDDAP, OBIS, GenBank, SOCAT, and HydroShare. You can discover a relevant domain-specific repository for your data using https://www.re3data.org/.

Hakai Institute Data Catalogue and Data Repository

Metadata (data about data) are stored in the Hakai Data Catalogue as metadata records, which data providers create using the Hakai Metadata Intake Form. A metadata record contains important, standardized information about a dataset that ensures the data are broadly discoverable and accessible. The Hakai Data Catalogue is not itself a database or data repository but rather an index of where datasets or data packages are stored. Therefore, in the Resources section of the metadata intake form you must provide a link to where your data package is stored. Figure 1 details the workflow for creating metadata records for the Hakai Data Catalogue and publishing data. It is recommended that Hakai data packages be hosted in the Hakai Institute GitHub Repository according to the data package content recommendations below. See Hosting Data on GitHub for more information.

Figure 1. Hakai Data Publishing Decision Tree for creating a metadata record in the Hakai Data Catalogue and publishing data to domain-specific, open-access data repositories.

Hakai data package content recommendations:

Describe thoroughly the field, lab, and data processing protocols used to produce your data. This could be done in a Readme.txt, a .pdf, or a previous publication defining your methods.
Create a Data Dictionary (.txt or .csv file). This describes each variable in every table of your data package. For each variable, include its name, units, and description.
Assign a version to your data package using a major.minor version number, e.g. v2.1. The version of the data package should match the version given in the metadata record in the Hakai Data Catalogue.
Create a changelog in a .txt file to keep track of what changes have occurred since the last version. Follow this guide to keep a changelog.
Include all your data tables as plain text files (.csv, .txt, .tsv).
Include any scripts that were used to clean or filter the raw data or to calculate values in the final data package, as well as any example scripts for joining data.
If applicable, the data package should contain an Archive folder to house older/previous versions of the data package or time-series data.
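As an illustration of the Data Dictionary recommendation above, the following is a minimal sketch, not an official Hakai tool. It reads the header row of each data table and writes a skeleton dictionary with one row per variable, leaving units and descriptions blank for the data provider to fill in. The function name and the output column names (table, variable, units, description) are assumptions chosen for this example.

```python
import csv
from pathlib import Path

# Column names for the dictionary file; an assumption for this sketch,
# not a prescribed Hakai schema.
DICTIONARY_FIELDS = ["table", "variable", "units", "description"]


def build_data_dictionary(table_paths, out_path):
    """Write a data dictionary skeleton with one row per variable per table.

    Only the header row of each table is read. Units and descriptions are
    left empty and must be completed by hand.
    """
    rows = []
    for table_path in table_paths:
        table_path = Path(table_path)
        with table_path.open(newline="", encoding="utf-8") as f:
            header = next(csv.reader(f), [])  # empty list for an empty file
        for variable in header:
            rows.append(
                {
                    "table": table_path.name,
                    "variable": variable,
                    "units": "",
                    "description": "",
                }
            )

    with Path(out_path).open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=DICTIONARY_FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Calling build_data_dictionary(["ctd.csv"], "data_dictionary.csv") on a hypothetical table named ctd.csv would produce a file that can then be edited by hand. Generating the skeleton from the tables themselves helps keep variable names in the dictionary consistent with the data.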

Optional:

Include literature referenced, equipment manuals, anything relevant to your methods.
If your data package has numerous tables that fit together in a relational database structure, include a diagram such as an Entity Relationship Diagram to describe hierarchical relationships of tables.
Include a folder containing the standardized data that is published to an open-access, domain-specific repository.
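The content recommendations above can be checked before a data package is shared. The sketch below is one possible approach under stated assumptions: the file names README.txt, data_dictionary.csv, and CHANGELOG.txt are hypothetical (this guide does not prescribe exact names), and the catalogue version is passed in by hand rather than read from the Hakai Data Catalogue.

```python
import re
from pathlib import Path

# Accepts major.minor labels such as "v2.1" or "2.1".
VERSION_PATTERN = re.compile(r"^v?(\d+)\.(\d+)$")
DATA_SUFFIXES = {".csv", ".txt", ".tsv"}
# Hypothetical file names, assumed for this sketch only.
EXPECTED_FILES = ("README.txt", "data_dictionary.csv", "CHANGELOG.txt")


def parse_version(label):
    """Return (major, minor) for a label like 'v2.1'; raise ValueError otherwise."""
    match = VERSION_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"not a major.minor version: {label!r}")
    return int(match.group(1)), int(match.group(2))


def check_package(package_dir, package_version, catalogue_version):
    """Return a list of problems found in the top level of a data package."""
    package_dir = Path(package_dir)
    problems = []

    for name in EXPECTED_FILES:
        if not (package_dir / name).is_file():
            problems.append(f"missing {name}")

    # Data tables are plain-text files other than the documentation files.
    tables = [
        p
        for p in package_dir.iterdir()
        if p.is_file()
        and p.suffix.lower() in DATA_SUFFIXES
        and p.name not in EXPECTED_FILES
    ]
    if not tables:
        problems.append("no plain-text data tables found")

    if parse_version(package_version) != parse_version(catalogue_version):
        problems.append(
            f"package version {package_version} does not match "
            f"catalogue version {catalogue_version}"
        )
    return problems
```

An empty list means that none of the checked problems were found. The check covers only the presence of files and the version label, so it does not replace a review of the protocols, dictionary, or changelog themselves.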

Domain-specific Repositories/Databases/Knowledge Bases

In addition to storing data packages in the Hakai GitHub Data Repository, each Hakai affiliated project should determine whether (a subset of) their data can (or should) be mobilized to a global repository. Publishing data to a domain-specific repository may require special formatting or structuring to ensure that data are interoperable with other datasets on the platform. The effort to transform your data can be worthwhile because it increases the reach of your science and should result in more citations and recognition for your scientific work. It is also an important facet of the work Hakai does in the public interest.

Publishing Data to CIOOS

Datasets produced by Hakai that include Essential Ocean Variables (EOVs) should be published to the Canadian Integrated Ocean Observing System (CIOOS). The workflow for publishing a metadata record to the CIOOS Data Catalogue starts with the same process as publishing to the Hakai Data Catalogue: complete the Hakai metadata intake form and publish a data package in the Hakai GitHub repository. Often, only a subset of the full data package can be mobilized to a domain-specific repository. It is therefore recommended to create a data package for the complete, processed data and store it in the Hakai GitHub repository. The data package should include a separate folder that contains the standardized data file, which is then published to the identified repository.

The metadata intake form will identify which metadata records should be published to the CIOOS Data Catalogue based on the presence of EOVs. However, there is the additional step of transforming data and including links to standardized data in the metadata record. CIOOS requires data to be hosted on specific platforms, which in turn require the data to be transformed and standardized so that they are interoperable. CIOOS recommends the Ocean Biodiversity Information System (OBIS) for biological species occurrences and ERDDAP for physical and biogeochemical oceanographic data. These repositories will require data to be standardized to a platform-specific format and may require additional metadata and/or the creation of an additional metadata record in that repository. If that is the case, ensure that the metadata record in the Hakai Data Catalogue includes a link to the domain-specific metadata record as well as a link to the Hakai GitHub Repository where the complete dataset is hosted.

Workflows for publishing data to domain-specific repositories have been outlined in other sections (see e.g. 05 - OBIS and GBIF Best-practices for publishing biological occurrence data to OBIS). To begin the process of transforming your dataset for a domain-specific repository, contact data@hakai.org for guidance.

Restricted data

Restricted data might include, for example, data that are (partially) collected in collaboration with First Nation partners or on their ancestral lands, or that include sensitive information, e.g. on endangered species. Data providers might therefore choose not to make these data publicly accessible. When a dataset contains restricted or sensitive data that should not be publicly accessible, this needs to be disclosed in a data management plan (DMP). Alternatively, this information can be captured in a research agreement or a memorandum of understanding (MoU) between Hakai Institute (affiliates) and the involved parties. It is recommended that metadata records in the Hakai Data Catalogue link to a compressed data package that contains only the publicly accessible data, and that the limitations of the dataset be described in the relevant field of the Metadata Intake Form (e.g. endangered species occurrence data is omitted from the dataset).
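Preparing the publicly accessible data package could look like the following minimal sketch, which follows the endangered species example above. The function name, the default column name scientific_name, and the list of restricted values are all hypothetical. Which records are restricted is decided in the DMP, research agreement, or MoU, not by a script.

```python
import csv


def write_public_subset(source_path, public_path, restricted_values,
                        column="scientific_name"):
    """Copy a table to public_path, omitting rows whose value in `column`
    is in restricted_values. Returns the number of rows withheld.

    The column name and the restricted values are assumptions for this
    sketch; they must come from the project's own agreements.
    """
    restricted = set(restricted_values)
    withheld = 0
    with open(source_path, newline="", encoding="utf-8") as src, \
            open(public_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise ValueError(f"column {column!r} not found in {source_path}")
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row[column] in restricted:
                withheld += 1  # counted so the omission can be documented
            else:
                writer.writerow(row)
    return withheld
```

The returned count can help when describing the limitations of the dataset in the Metadata Intake Form. The unfiltered source table should be kept out of the public data package.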

In the case of sensitive data that First Nations have collected in collaboration with Hakai, the Local Contexts initiative may be a suitable avenue to support the right to data sovereignty and self-determination in how datasets may be licensed or restricted.
