BioExcel Centre of Excellence Going Strong: New Funding Secured

Rossen Apostolov, PDC

BioExcel, the leading European Centre of Excellence for Computational Biomolecular Research, was established in 2015 with support from the European H2020 program. The centre is coordinated by PDC, with the KTH GROMACS team providing the scientific lead. The successful execution of all the centre’s activities since its foundation has enabled us to secure a new, larger round of funding for another three years of operation. Here we present our plans for the centre’s upcoming activities!

Driving Computational Biomolecular Research

The life sciences in general, and biomolecular research in particular, have grown to be one of the major users of large-scale compute infrastructures. The potential impact of, and requirements for, computing in these areas are extreme. Software for the life sciences has made significant progress in the last decade. However, for applications to take full advantage of the upcoming exascale systems, the research communities face immense challenges in co-designing software and algorithms suitable for the new generations of hardware that are evolving. The life sciences are areas where European applications lead the world. However, there had not been sufficient interaction between the code developers, the researchers using the software and the organisations providing the computing resources. This had, in turn, affected the productivity of the researchers, as well as the utilization of the European HPC infrastructures. Indeed, part of the initial motivation for BioExcel was to bridge the gaps between geographically diverse developers, researchers and infrastructure providers by establishing BioExcel as a focal point for interactions.

Based on the successes of the first phase of the project, BioExcel is continuing its mission to advance science and technology in the life sciences by:

pushing the performance, efficiency, scalability, and usability of several key software packages towards exascale;
improving the usability of existing applications and tools to support the convergence of high performance computing (HPC), high throughput computing (HTC), and high performance data analytics (HPDA) via the development of workflows combining HPC simulations with data management and analytics;
significantly expanding the range of training, commercial applications and services that are offered;
making publications, libraries, workflows, codes, and training material developed by BioExcel available via open-source and open-access principles;
making BioExcel sustainable so researchers in academia and industry can rely on having access to BioExcel’s open resources in the longer term; and
collaborating with international initiatives to strengthen the links between the research communities and leverage worldwide expertise.

BioExcel provides a wide range of products and services for computational biomolecular researchers.

World-Leading Exascale Biomolecular Software

BioExcel has a strong focus on getting life sciences applications ready to take advantage of the superior speed of the exascale systems that will be available soon. We host the development of several of the most widely used European HPC codes: GROMACS for molecular dynamics simulations of biomolecular systems, HADDOCK for integrative modelling of biomolecular complexes, and the combined QM/MM capabilities of CP2K. GROMACS and CP2K are among the 12 PRACE benchmark codes – this is significant because they already sustain multi-petaflop performance – and HADDOCK is one of the most widely used data-driven codes. To support researchers further, BioExcel has invested considerable effort in the development of related tools, such as PMX to prepare free energy calculations, POWERFIT for docking into cryo-EM maps, and RELION for the reconstruction of single-particle cryo-EM data (our GPU-accelerated cryo-EM reconstruction has been used in 31 Nature and 12 Science publications). The codes are used throughout the biotechnology, pharmaceutical and chemical industries. Several SMEs that use our software have started close collaborations with BioExcel.

BioExcel engages extensively in co-design projects with HPC vendors and system designers to improve the performance of various codes on specific types of hardware. GROMACS is one of the most highly tuned codes in the world, and is presently involved in co-design projects with AMD/OpenCL, NVIDIA/CUDA, Intel, IBM/Power9, and Arm. HADDOCK is one of the best examples of a data-driven application, for which co-design efforts focus on exploiting new storage, network and memory caching technologies in collaboration with companies such as Intel and Seagate. In upcoming projects, the QM/MM capabilities of CP2K will be improved (which will be a big advantage for the life sciences research communities) and CP2K will also be tuned for future exascale systems in collaboration with Intel, NVIDIA, and Cray amongst others. The codes will be optimised for the EuroHPC systems (including the upcoming new European accelerators). New memory and storage technologies for ensembles and high-throughput computing will be exploited, and sub-checkpointing will be employed to improve the resilience of the codes against hardware failures. BioExcel will also improve the scaling and usability of the codes, and make them easier to handle by developing and offering efficient workflows, distributed as container images from the BioExcel software hub.

BioExcel codes drive European business research in the life sciences. HADDOCK is currently cited in about 300 papers per year, GROMACS in about 4,000 papers per year, and CP2K in about 500 papers per year. The HADDOCK portal sees over 30,000 submissions yearly from users around the world (10 million jobs are submitted to EOSC HTC resources). The core applications are distributed as open-source and free software to ensure that they can be used freely by non-profit organisations, and also have business-friendly licensing to allow for commercial reuse and linking. The codes are already used by all of the top ten pharmaceutical companies, plus a number of SMEs, and there are several vendors selling cloud resources or hardware tailored to the codes.

This is a stylized image of the structure of the Zika virus (PDB ID: 5IRE). BioExcel core applications are routinely used by researchers for understanding the function of biomolecules and the development of novel drugs.

Convergence of HPC/HPDA and Improved Usability

Workflows are vital for the efficient usage of exascale resources and the improved productivity of researchers. BioExcel has already developed a framework for FAIR (Findable, Accessible, Interoperable and Reusable) workflows, and a well-defined set of best practices in collaboration with ELIXIR (CWL, bio.tools, and interoperability platforms); these have been tested with pharmaceutical companies (which generated a range of scientific success stories of biological interest). BioExcel is now continuing this work by combining HPC compute engines with high performance data analytics (HPDA) and machine learning methods for the automated retrieval and deposition of data. The portfolio of platforms we support includes major ones such as CWL, KNIME, and Galaxy, as well as managers specifically designed for HPC (PyCOMPSs, Nextflow, CWLEXEC). Interoperability and reproducibility will be strengthened using containers (Docker, Singularity) coupled to repositories such as BioConda, BioContainers and the BioExcel hub. The integration of HPC-HPDA tools will support the simultaneous execution of simulations and the online analysis of the resulting data. This will, in a natural way, help to attract members of the bio- and cheminformatics research communities to HPC, since they commonly use workflows as their working environment for automation.
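To make the idea of running a simulation and analysing its output at the same time more concrete, here is a minimal Python sketch. It is not BioExcel code and does not use any of the workflow managers named above; the functions simulate, analyse and run, and the toy "energy" values, are purely hypothetical stand-ins. The sketch only illustrates the general pattern: one stage produces frames while a second stage consumes them as they arrive, so no complete trajectory has to be stored before analysis begins.

```python
import queue
import threading


def simulate(n_steps, out):
    """Stand-in 'simulation': emits one toy energy value per step."""
    energy = 100.0
    for step in range(n_steps):
        energy *= 0.9  # hypothetical relaxation, not a real force field
        out.put((step, energy))
    out.put(None)  # sentinel: tells the analysis stage the run is over


def analyse(inp, results):
    """Online analysis: updates a running mean as each frame arrives."""
    count, mean = 0, 0.0
    while True:
        item = inp.get()
        if item is None:
            break
        _, energy = item
        count += 1
        mean += (energy - mean) / count  # incremental mean, no frame storage
    results["frames"] = count
    results["mean_energy"] = mean


def run(n_steps=5):
    """Run the producer and the consumer concurrently and return the summary."""
    frames = queue.Queue(maxsize=2)  # small buffer keeps both stages in step
    results = {}
    consumer = threading.Thread(target=analyse, args=(frames, results))
    consumer.start()
    simulate(n_steps, frames)
    consumer.join()
    return results


if __name__ == "__main__":
    print(run())
```

In a production setting the two stages would typically be separate jobs coordinated by a workflow manager, and the in-memory queue would be replaced by whatever streaming or storage layer the infrastructure provides; the bounded queue here simply stands in for that coupling.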

User-Driven Development

BioExcel’s activities are guided by input from users (namely researchers using biomolecular codes) as the purpose of the centre is to make computational research in the life sciences easier to undertake and more efficient. For example, application and workflow development is being done in tandem with use cases taken from the wider academic and industrial research communities. Consequently, these use cases will become well-documented, best-practice examples of how to scale difficult problems, not only with more nodes, but also by including disruptive changes in algorithms. Ensuring that users run applications with optimal settings would double the resource efficiency for many projects. In addition, the development road maps for the core applications take into account the needs of researchers via input provided by the wider research communities.

Addressing the Skills Gap

BioExcel’s mission is to enable researchers to fully exploit the power of data and computing e-infrastructures by providing support and training for both non-expert and advanced users. BioExcel pioneered an advanced training program (which was praised by reviewers) based on a competency-based training needs analysis in collaboration with the broader life sciences research community. The resulting training resources are openly available at the BioExcel Knowledge Resource Centre. BioExcel will continue to expand the training program by delivering new educational webinars and courses (face-to-face and online), as well as running best practice and knowledge-exchange workshops on cutting-edge topics and support forums focused around the core applications. BioExcel will also continue to make targeted support teams available to help researchers solve specific problems. Our educational webinars (of which 34 will have been produced by the end of 2018) have proved to be a popular and efficient way to share expert knowledge. In addition, BioExcel will set up a dedicated support group working with academic and industrial researchers in synergy with other support structures (such as PRACE HLSTs). BioExcel’s online and virtual training products and services will make much-needed training available to more and more researchers all around the world as we expand the range of online tutorials for various software tools to guide researchers at their own pace. BioExcel can help researchers who are just beginning to use HPC with training on topics such as HPC readiness and introductory modules in biomolecular simulations, as well as assisting highly experienced HPC researchers with advanced training and support on specialist topics such as MD, QM/MM, and workflows. In addition, the centre provides customised training for industry researchers by arrangement.

Collaboration and Community Links

BioExcel is based on the premise that the significant developments in science are achieved by strengthening the interactions within the whole research ecosystem, consisting of software developers, academic and industrial researchers and research infrastructure providers. To facilitate that vital intra-ecosystem communication, we continue to work on forging wider connections within the life sciences research communities (both in Europe and internationally), so that the BioExcel Centre of Excellence serves as a hub, that is, a focal point for the worldwide life sciences research communities. For example, thanks to our close collaborations with the developers of other major life sciences codes (like NAMD and AMBER), we expect to be able to make these resources open to all life sciences applications in the near future. This would mean that researchers would be able to choose a code based either on their preference or its performance for a specific type of problem. BioExcel will also continue to work with other centres of excellence, relevant EU initiatives (namely PRACE, ELIXIR, INSTRUCT, and the EOSC-Hub), and international ones (like MolSSI) to organise joint training events targeting other important codes in biomolecular research.

Long-Term Operations

In 2019 BioExcel will be establishing a commercial arm for delivering life sciences products and services to both the academic and business research communities. For example, companies in Europe, particularly in the pharmaceutical industry, would benefit significantly from having access to appropriate workflows and training. BioExcel’s intellectual property plan will ensure that the products resulting from BioExcel’s activities (such as code, libraries, workflows, publications, and training material) are open source and open access following the FAIR (Findable, Accessible, Interoperable and Reusable) data principles.

All academic and industrial researchers involved in computational biomolecular research are warmly invited to contact BioExcel to find out how to take advantage of the many opportunities for support and joint activities, or to discuss new directions for collaboration. BioExcel has representatives in Finland, Germany, the Netherlands, Norway, Malta, Spain, Sweden and the UK who are happy to provide guidance and assistance with your work.