Frequently Asked Questions

General Topics

The NFDI is a non-profit association that aims to manage research data systematically, preserve it in the long term, and make it accessible both nationally and internationally. For more information, please refer to the NFDI website: NFDI | Nationale Forschungsdateninfrastruktur e. V.

NFDI4Microbiota is a consortium within the NFDI that specializes in microbiological data. It is made up of ten German research institutions with high expertise in microbiology. The consortium aims to advance microbiological research through digital transformation. For more information, please refer to the NFDI4Microbiota Knowledge Base.

If you would like to get involved with NFDI4Microbiota, please visit our website: NFDI4Microbiota-Home. There, you will find information on participation options, like our ambassador program, as well as on the services and infrastructures we offer, including ARUNA Object Storage, the Cloud-based Workflow Manager (CloWM), training events, the Knowledge Base and the Helpdesk.

If you would like to register as a participant in NFDI4Microbiota, please follow the instructions on this page: Participants.

The NFDI4Microbiota Ambassador program aims to connect and train early-career researchers within the microbiology research community. Our goal is to help these researchers expand their networks and teach them best practices for handling data, metadata standards, standardized bioinformatic workflows, and related topics. For more information on the program, please refer to this page: The Ambassador program.

If you would like to become an NFDI4Microbiota ambassador, please register here: NFDI4Microbiota Ambassador Registration

We welcome questions from all individuals working with microbial data – whether students, early-career researchers, senior scientists, or data stewards. Support is provided irrespective of the organism (e.g. bacteria, archaea, eukaryotic microbes or viruses), environment (e.g. soil, aquatic, host-associated or plant), or data type (e.g. nucleic acid sequences, protein data, functional genomics, image data).

NFDI4Microbiota Services

You can find news, events and newsletters in the ‘Newsroom’ tab of the main NFDI4Microbiota page: NFDI4Microbiota-Home. You can also subscribe to our Newsletter and follow us on LinkedIn, Mastodon, and Bluesky.

NFDI4Microbiota supports a variety of microbial data, including, but not limited to, nucleic acid sequences, protein data, functional genomics and image data. You can find a list of common microbiology data types in our Knowledge Base: Research Data.

NFDI4Microbiota offers a range of specialized services and tools to support you throughout the research data lifecycle. These include a Data Management Plan (DMP) template, a collection of experimental protocols, recommendations on metadata standards, and the databases StrainInfo and VirJenDB. You can find out more about all our services on our website: NFDI4Microbiota Services.

No, all of NFDI4Microbiota’s services and platforms are offered free of charge, since we are funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG).

The Cloud-based Workflow Manager (CloWM)

The Cloud-based Workflow Manager (CloWM) is a fully open platform that the research community and non-profit organisations can use free of charge. To make your workflow available on the CloWM platform, you must first apply for the developer role by emailing info@clowm.bi.denbi.de. The workflow must also be written in the Nextflow workflow language and adhere to the NF-Core standard, meaning that every step must be containerized to guarantee maximum portability, and the workflow must be well documented. Please note that workflows must undergo a review process by workflow reviewers to ensure that CloWM compute resources are used appropriately and not misused.
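To give a rough idea of what the NF-Core standard asks for, here is a minimal sketch of a containerized Nextflow (DSL2) process; the process name, container tag and parameter are illustrative examples, not taken from CloWM:

```nextflow
// Illustrative sketch only: a single containerized step in Nextflow DSL2.
// The container tag and parameter names are examples, not CloWM defaults.
process FASTQC {
    container 'biocontainers/fastqc:v0.11.9_cv8'  // every step runs in a container for portability

    input:
    path reads

    output:
    path "*_fastqc.html"

    script:
    """
    fastqc ${reads}
    """
}

workflow {
    FASTQC(Channel.fromPath(params.reads))
}
```

In a full NF-Core-compliant workflow, each such process would additionally come with documentation, a test profile and version reporting.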

In the unlikely event that CloWM is no longer operational or available, this would not directly affect the availability of your workflows. Even if a workflow is only registered on the CloWM platform, it remains stored in its original GitHub or GitLab repository, from which it can be accessed, downloaded and modified completely independently of CloWM.

NF-Core, EPI2ME Labs and CloWM all adhere to the same workflow standard, known as the NF-Core standard. If developers follow this standard and the recommended best practices, the workflow should run everywhere. The workflows can also technically be added to other public workflow collections, such as nf-core; note, however, that these collections have their own workflow review processes.

Currently, no data encryption is used on the platform. However, the platform provides secure, personalized access.

Although we do not explicitly prohibit this, we would like to point out that we do not take responsibility for any user data stored on the platform.

CloWM relies on a highly scalable execution layer powered by de.NBI cloud resources, one of the largest academic clouds in Europe. Consequently, there is almost no limit on resources. If you are planning something that is particularly demanding in terms of computing or storage, or if the current quotas are insufficient for your needs, please contact info@clowm.bi.denbi.de.

None. Workflows can be executed via a user-friendly web interface.

The Cloud-based Workflow Manager (CloWM /klaʊm/) is an open-source, free-of-charge platform designed to streamline data-intensive scientific analysis. It offers a seamless integration of four core components:

  1. Curated Workflows: Scientific workflows written in the Nextflow DSL.
  2. Robust Data Storage: Secure and reliable data persistence.
  3. Highly Scalable Compute: An elastic layer for demanding analysis tasks.
  4. User-friendly Interface: An intuitive experience for managing and running analyses.

CloWM already contains several best-practice workflows covering a broad spectrum of research fields and tasks, such as metagenomics (WGS and 16S), human variation analysis, infectious diseases (SARS-CoV-2, Mpox, influenza A+B), metabarcoding, genome assembly, transcriptomics, basecalling, phylogenomics, and many more hopefully coming soon! Some of these workflows are exclusively available on CloWM, while others are sourced from well-known open-source repositories such as nf-core or EPI2ME Labs.

Training

In general, anyone can attend the training events advertised on our website, unless otherwise specified. Some events are organized for specific research groups or institutions only, and these are indicated as ‘closed’. You can view upcoming training events here: Training.

We currently offer on-demand training on topics related to research data management, such as data management plans (DMPs), electronic lab notebooks (ELNs), data organisation, data documentation, data sharing and publishing, and data discovery and reuse.

Training materials can be found on our Zenodo Community (have a look at both the lessons and the presentations).

Research Data Management (RDM)

Research Data Management (RDM) is the care and maintenance required to (1) obtain high-quality data, (2) make the data available and usable in the long term and (3) make research results reproducible beyond the research project. For more information on RDM, check out our Knowledge Base: The NFDI4Microbiota Knowledge Base.

For general recommendations on metadata standards, please refer to our Knowledge Base. We have also collected important technical, biological and environmental minimal metadata suggestions on our GitHub page.

Third parties, such as research funders, institutions and publishers, may have specific requirements regarding how researchers should handle their data. One example of such a requirement in the field of microbiology is Nature Microbiology’s policy on reporting standards and availability of data, materials, code and protocols.

  • RDM platforms:

    • BExIS2 by NFDI4Biodiversity at FSU Jena
    • Coscine by RWTH Aachen
      • Coscine is a research data management platform for your research projects. It adheres to the FAIR principles, offering structured storage, metadata management, collaborative working and long-term storage of your research project data in accordance with good scientific practice. To get started, simply register with your university account or ORCID and create a project.
      • Coscine is available for use by employees of participating universities or research institutions in North Rhine-Westphalia (NRW). Usage is also permitted by third parties who have been invited by an employee of a participating university or research institution in NRW to collaborate on a Coscine project.
      • For more information, check out the Coscine documentation page: About Coscine - Documentation | Coscine
    • GfBio consortium services
    • Research Data Management Competence Base (RDM Compas) by KonsortSWD (social, behavioural, educational and economic sciences)
  • Tools:

    • bio.tools (ELIXIR): essential scientific and technical information on software tools, databases and services for bioinformatics and the life sciences.

    • The Research Data Management toolkit for Life Sciences (RDMkit by ELIXIR)

    • ToolPool Gesundheitsforschung (TMF): The TMF portal was launched in 2017 and is operated by the Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF). It provides a collection of IT infrastructure-related products for networked medical research, offered both by the TMF itself and by other providers such as companies and research institutions. Of the more than 80 products, over half are software tools; other product categories include eServices, reports and expert opinions, working materials and checklists, consultancy services, and training courses. Products can be filtered by category, topic, project phase, keywords, provider and year, and similar products can be compared using a feature matrix. Each product page provides information about the product’s use in projects, testimonials from other users, and references. Anyone can submit a new product; each submission is then reviewed by a team of TMF members against a set of criteria before being added to the portal.

      To use the portal, follow this link. Many offerings are free and can be accessed directly from the portal. Software products usually require local installation and configuration.

  • SOPs:

  • Caliskan, A., Dangwal, S., & Dandekar, T. (2023). Metadata integrity in bioinformatics: Bridging the gap between data and knowledge. Computational and Structural Biotechnology Journal, 21, 4895–4913. https://doi.org/10.1016/j.csbj.2023.10.006
  • Egli, A., Schrenzel, J., & Greub, G. (2020). Digital microbiology. Clinical Microbiology and Infection, 26(10), 1324–1331. https://doi.org/10.1016/j.cmi.2020.06.023
  • Kyrpides, N. C., Eloe-Fadrosh, E. A., & Ivanova, N. N. (2016). Microbiome Data Science: Understanding Our Microbial Planet. Trends in Microbiology, 24(6), 425–427. https://doi.org/10.1016/j.tim.2016.02.011
  • Nasr, E., Amato, P., Bernt, M., Bhardwaj, A., Blankenberg, D., Brites, D., Cumbo, F., Do, K., Ferrari, E., Griffin, T. J., Gruening, B., Hiltemann, S., Hyde, C. J., Jagtap, P., Mehta, S., Métris, K. L., Momin, S., Oba, A., Pavloudi, C., … Batut, B. (2024). The Microbiology Galaxy Lab: A community-driven gateway to tools, workflows, and training for reproducible and FAIR analysis of microbial data. Cold Spring Harbor Laboratory. https://doi.org/10.1101/2024.12.23.629682
  • Zhou, R., Ng, S. K., Sung, J. J. Y., Goh, W. W. B., & Wong, S. H. (2023). Data pre-processing for analyzing microbiome data – A mini review. Computational and Structural Biotechnology Journal, 21, 4804–4815. https://doi.org/10.1016/j.csbj.2023.10.001

The FAIR Principles can be applied not only to datasets, but also to metadata, services, infrastructure and any other research object. Please refer to the GO FAIR website for a step-by-step guide on how to FAIRify your data (https://www.go-fair.org/fair-principles/fairification-process), as well as for guidelines on making services (e.g. bioinformatics tools) and infrastructure (e.g. databases) FAIR.

Plan

A Data Management Plan (DMP) is a formal and living document that defines responsibilities and provides guidance. It describes data and data management during a project as well as measures for archiving and making the data and research results available, usable and understandable after the project has ended.

The NFDI4Microbiota DMP template is available on Zenodo.

There is no need to use a simple text editor anymore; many different tools are available for writing a DMP. These tools offer similar functions and benefits, differing mainly in the DMP specifications requested by different funding agencies. Using a DMP tool makes managing a DMP and collaborating much easier.

The Research Data Management Organiser (RDMO) is the most common DMP tool used in Germany. It is an open-source web application developed to support the structured and collaborative planning and implementation of RDM. It allows users to create DMPs in text format and offers templates for questionnaires, project descriptions, tasks, and DMPs. Input is collected through a structured interview, and all responses are stored in a database. Question catalogues can be modified without losing information, and many questions allow dataset-specific answers. Key features include versioning, import/export functions, collaborative editing, snapshots, a timeline of RDM-related tasks, and notifications for upcoming events. DMP4NFDI offers demonstrations on how to set up DMPs using RDMO.

DMPonline was developed by the Digital Curation Centre in the UK. It is an open-source, web-based tool designed for researchers, primarily those working on UK-funded projects, though it is also used internationally. DMPonline enables users to create, review, and share DMPs that comply with institutional and funder requirements.

The Data Stewardship Wizard (DSW) was developed by ELIXIR Netherlands and ELIXIR Czech Republic. It is an open-source, dynamic web-based system aimed at data stewards who support researchers in creating machine-readable DMPs. The DSW is recommended by the Horizon Europe Programme Guide. It features user-friendly questionnaires, a variety of built-in templates, and the ability to develop custom templates. Various ELIXIR nodes offer training on how to use the DSW.

Other DMP tools include ARBOS, DataPLAN, DataWiz, DMPRoadMap, DMPTool, GFBio DMPT and TUB-DMP. A comprehensive guide to DMP tools is available on Zenodo.

Collect

Microbial data are highly heterogeneous, as are the methods used to collect them. The following list comprises examples of microbial data and the collection method(s) associated with each:

  • Microbiome data: High-Throughput Sequencing (HTS), Next Generation Sequencing (NGS)
  • Crystallographic data for small molecules: Single crystal X-ray diffraction
  • Protein sequences: Mass spectrometry, Edman degradation using a protein sequenator
  • Nucleic acid sequences: (RT-)PCR, sequencing, …
  • Linked genotype and phenotype data: High-throughput genotyping and ongoing patient care/clinical trial 
  • Macromolecular structures: Diffraction, electron cryo-microscopy
  • Clinical data: Ongoing patient care, clinical trial
  • Functional genomics and gene expression data: High-throughput functional genomics experiments
  • Standardized bacterial information: Culture collections, species descriptions

Protocols for collecting microbial data can be found, for example, on the NFDI4Microbiota protocols.io workspace and on the websites of the International Human Microbiome Standards (IHMS) and the Earth Microbiome Project (EMP). IHMS’s protocols focus on the collection, identification, extraction, sequencing and analysis of faecal samples. The EMP’s protocols focus on the extraction and sequencing of DNA from environmental samples. The NFDI4Microbiota’s protocols.io instance aims to collect relevant protocols from the community for the community.

To select an ELN, we recommend that you define selection criteria that reflect the needs of your institution and labs. You can then use these criteria to compare the available ELNs with your requirements, for example, by entering the criteria into the ELN Finder. The ELN Finder is a tool developed by the University and State Library Darmstadt and ZB MED – Information Centre for Life Sciences. It is an interactive tool for filtering ELNs based on 40 criteria.

Important criteria to consider include discipline, whether the ELN is proprietary or open-source, whether it is a cloud-computing service (SaaS) or self-hosted, and performance and stability. Other important criteria to consider when selecting an ELN include your lab’s established practices and preferences, your institution’s ELN policy, the security level needed for your data and your budget.

Once you have selected an ELN, you need to licence it and introduce it to your institution’s research groups. This involves ensuring that all technical requirements are met (e.g. a stable wireless connection), creating and implementing a distribution plan, training users, and setting up support services. Finally, you will need to monitor the application.

If you would like to find out more about working with eLabFTW, take a look at this demo with eLabFTW or this video tutorial from ZB MED, which explores both eLabFTW and Labfolder. You can also request NFDI4Microbiota training on how to work with eLabFTW if you are interested.

If your home institution does not support an open-source ELN and your research group would like to set up a proprietary solution, we can offer insights and suggestions based on our experience with eLabJournal, for instance.

For data collection that does not require wet lab experiments, there are alternative documentation methods for both the collection and subsequent analysis, such as README files, literate programming, narrative descriptions, data dictionaries and codebooks.

Process

The most common formats for microbial data are as follows:

  • FASTA (*.fasta, *.fas, *.fa, *.fna, *.ffn, *.faa, *.mpfa, *.frn) for nucleotide and protein sequences.
  • FASTQ (*.fq, *.fastq) for raw biological sequences and their corresponding quality scores.
  • General Feature Format (GFF) (*.gff, *.gff3) for sequence annotations.
  • Sequence Alignment Map (SAM) (*.sam) and Binary Alignment Map (BAM) (*.bam) for biological sequences aligned to a reference sequence.
  • Variant Call Format (VCF) (*.vcf) for gene sequence variations.

For more information on suitable file formats for long-term archiving, please refer to the Digital Preservation page of our Knowledge Base.
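To make the FASTA format concrete, here is a minimal Python sketch that parses records from a FASTA-formatted string. This is a hand-rolled parser for illustration only; in practice, a tested library such as Biopython is preferable. The sequence names and contents are made up:

```python
# Minimal FASTA parser for illustration; real projects should prefer
# a tested library such as Biopython's SeqIO.
def parse_fasta(text):
    """Return a dict mapping record IDs to sequences."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current_id = line[1:].split()[0]  # ID is the first word of the header
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    return {rid: "".join(parts) for rid, parts in records.items()}

# Hypothetical two-record FASTA input.
example = """>seq1 hypothetical 16S fragment
ACGTACGT
ACGT
>seq2
TTGACA"""

print(parse_fasta(example))
# → {'seq1': 'ACGTACGTACGT', 'seq2': 'TTGACA'}
```

Note how the multi-line sequence of seq1 is joined into a single string, which is why FASTA line wrapping does not affect the parsed result.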

Analyzing Data

If you would like to learn more about using Bash, Python and R for data analysis, please check our training calendar for upcoming events on these topics, or take a look at the training materials we have published on Zenodo. The Carpentries’ teaching materials are another excellent resource.

One way to share your scripts is to archive your GitHub repository on Zenodo and assign it a Digital Object Identifier (DOI). To do so, follow these steps:

  1. Create a Zenodo account.
  2. Create a Binder-ready repository on GitHub (see here for instructions).
  3. Make sure your repository is ready to be published.
  4. Create a Zenodo DOI for your repository (see here for best practices).
  5. Create a Binder link for your Zenodo DOI (see here for the form).

Workflow Standards and Provenance

We currently offer a guided evaluation system for provenance standards for your workflows. If you are interested in rating your workflows and tools, please reach out to the Helpdesk. We recommend evaluating five different aspects individually to check whether they are easily reproducible and verifiable (see below). More detailed information on the guidelines and rating system can be found in our ‘Provenance Guidelines for Workflow and Tool Developers’.

(1) Improve reproducibility
(2) Version report (output file: versions.yml)
(3) Data and metadata management (output file: provenance.yml)
(4) Documentation
(5) Validation, cooperation and sharing
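As an illustration of aspect (2), a versions.yml file in the nf-core style records the exact tool versions used by each workflow step, so that an analysis can be reproduced with the same software later. The module and tool names below are hypothetical:

```yaml
# Hypothetical example of an nf-core-style versions.yml output file.
# Each top-level key is a workflow step; each entry is a tool and its version.
FASTQC:
  fastqc: 0.11.9
BOWTIE2_ALIGN:
  bowtie2: 2.4.4
  samtools: 1.15
```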

Metagenomic samples are inherently complex as they contain a mixture of DNA sequences from multiple organisms and sometimes from various environmental sources, including genetic sequences from contaminants and the host organism (e.g. humans, animals or plants).
Preprocessing these samples by removing the contaminants is a critical step before conducting further analyses. You can find our suggestions for improving data quality in the NFDI4Microbiota Knowledge Base.

The “nf-core/mag + CAMI 2 Best practices” pipeline is accessible via the scalable, cloud-based workflow management platform CloWM: https://clowm.bi.denbi.de/. This workflow offers a range of functionalities for processing metagenomic data, including metagenome assembly, genome and taxonomic binning, and taxonomic profiling. These functionalities are based on the best practices determined through the systematic benchmarking efforts of international initiatives such as CAMI (the Critical Assessment of Metagenome Interpretation Initiative). These recommendations are published and will be updated according to new benchmarking results published on the CAMI benchmarking portal (https://cami-challenge.org). Contact us via the Helpdesk if you would like to speak to our experts. Furthermore, CloWM offers other curated, best-practice workflows that cover a broad range of research fields and tasks, including metagenomics (WGS and 16S), human variation analysis, infectious diseases (SARS-CoV-2, Mpox and influenza A+B), metabarcoding, genome assembly, transcriptomics, base calling, phylogenomics and more.

Preserving Data

A proper backup and storage strategy for any type of research should include the following:

  • Consult your local IT or library staff to learn about backup and storage options.
  • Follow the 3-2-1 rule when backing up your data: keep 3 copies of any important file; store your files on 2 different types of media; and keep at least 1 copy offsite or in the cloud.
  • Back up versions.
  • Use incremental backups or specialized storage systems for large data sets.
  • Generate checksums for files and compare them after data transfers to verify data integrity. A checksum is a small, fixed-size value computed from the contents of a file; recomputing it later and comparing it with the original value reveals whether the data has been altered or corrupted.
  • Plan for regular updates and migration to newer technologies.
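The checksum step above can be sketched in Python with the standard hashlib module. The file name is illustrative, and the "transfer" is simulated by simply reading the same file twice:

```python
import hashlib

def sha256_checksum(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks
    so that large sequencing files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Illustrative usage: create a small file, then verify its integrity
# by comparing checksums "before" and "after" a transfer.
with open("reads.fastq", "wb") as fh:
    fh.write(b"@read1\nACGT\n+\nIIII\n")

before = sha256_checksum("reads.fastq")
# ... transfer the file, then recompute the checksum on the receiving side ...
after = sha256_checksum("reads.fastq")
assert before == after, "checksum mismatch: file corrupted in transfer"
print("transfer verified")
```

The chunked reading matters in practice: hashing a multi-gigabyte FASTQ file whole would otherwise require loading it entirely into memory.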

As a researcher, if you want to ensure that your data is preserved in the long term, you must handle it sustainably. This includes complying with community standards (e.g. your discipline’s metadata standard), providing curated and extensive metadata and contextual information for your data (e.g. comments, detailed descriptions of methods, units and formats, and user licences), organizing your data, validating your data (i.e. cleaning and quality-controlling your data), and using acceptable file formats.

Sharing Data

If you want to share your microbial data while you are still working on your research project, you can use tools such as Academic Torrents, B2DROP or the Open Science Framework (OSF). There are also Git-based tools, such as GitHub, GitLab and DataLad. For large datasets, take a look at Git-annex and Git Large File Storage, which provide file management and versioning systems without requiring you to check the file contents into Git. If you are working with health-related data, take a look at the Framework for the Responsible Sharing of Genomic and Health-Related Data. This framework centers on human rights and is intended for researchers, clinicians, data generators and others. Its foundational principles are to respect individuals, families and communities, advance research and scientific knowledge, promote health and well-being, and foster trust, integrity and reciprocity.

Publishing in Open Access means that the research publication or data is publicly and freely available, allowing anyone to access it.

If you are looking for a trustworthy repository for your microbial data, please refer to our Knowledge Base page on Data Repositories or visit re3data.org.

Reusing Data

If you are looking for existing, trusted microbial datasets to reuse, please refer to our ‘Resources to Facilitate Data Reuse in Microbiology’ list on our Knowledge Base page on Data Reuse.

Microbial data per se cannot be copyrighted, but it can be made available under a data license. The best way to let others know how they may reuse your data is to publish it under an appropriate Creative Commons license. Creative Commons provides a wide array of licenses under which data can be published, for example when publishing in repositories such as Zenodo. The data can be placed in the public domain (CC0 license) or require attribution and acknowledgement of the data creator or publisher (CC BY 4.0). Some licenses also restrict use, for example to non-commercial purposes (CC BY-NC), to prevent derivative works (CC BY-ND), or to combine both restrictions (CC BY-NC-ND). For more information, please visit the licenses section of the MetadataStandards repository or the Creative Commons website directly.