Large-Scale Computation Made Possible for Sensitive Human Data
Ann-Charlotte Sonnhammer, SNIC, Peter Ankerstål, UPPMAX, and Sverker Lundin, SciLifeLab
There have been dramatic technological advancements within genomics in recent years. These advancements are transforming the way researchers can investigate the genetic basis of disease. This development is also impacting healthcare by bringing forth more precise methods for diagnosis and treatment.
Swedish researchers are in a unique position to contribute to the understanding of disease. Sweden combines a tradition of technology development with bio-banks holding large collections of tissue and cell samples, well-organized and comprehensive population and disease registries, and longitudinal population-based epidemiologic studies in different geographical regions of the country.
Now that human whole-genome sequencing is possible on a large scale, the amount of data is increasing dramatically, which in turn demands large resources for data-intensive computations. In addition, human genetic data is, by its very nature, personal and sensitive, and the Swedish National Infrastructure for Computing (SNIC) lacked a suitable computer system for dealing with this type of data. To address this, SNIC was granted funding to set up and maintain computing and storage resources to handle sensitive personal data generated by large-scale molecular experiments. The funding comes from the Swedish Research Council and the Knut and Alice Wallenberg Foundation, with co-funding from Uppsala University and the Science for Life Laboratory (SciLifeLab). A new SNIC project, called SNIC SENS, was initiated to set up and maintain the resources, with the Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) and PDC as the participating SNIC centres.
The aim of the SNIC SENS project is to set up and maintain a production resource for the Swedish National Genomics Infrastructure (NGI), and a national SNIC resource for handling sensitive personal data that originates from large-scale molecular experiments, such as next-generation sequencing. Since the data is sensitive personal data, the information and IT security work, together with the legal considerations, has been an essential part of the project. The Security and Safety Division at Uppsala University has been of great help, as have the Legal Affairs Division of Uppsala University and the Legal Department at the KTH Royal Institute of Technology. The information security work started before the procurement process, where it provided important input to the system requirements, and has continued since then.
The production system for NGI is called Irma and consists of 250 compute nodes with 2 x 8 cores each and 256 GB RAM per node. The storage system associated with Irma is called Lupus and is a 1 PB Lustre file system with a peak write performance of 25 GB/s. Irma is a “regular” high-performance computing (HPC) cluster with fast storage and an InfiniBand interconnect for MPI and file traffic. Irma was put into production in March 2016.
The new national computing resource within SNIC is called Bianca. It consists of 200 compute nodes with 2 x 8 cores each, 128 GB RAM per node and 4 TB of node-local storage. The storage system associated with Bianca is called Castor, and it is a 6 PB Gluster file system with a peak write performance of 25 GB/s. The first users in the pilot phase started in mid-December 2016. Bianca was put into production on the 7th of April 2017, and inaugurated on the 24th of April 2017.
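To put the per-node figures for Irma and Bianca in perspective, the aggregate capacity of each cluster can be worked out directly from the numbers above. The short Python sketch below does that arithmetic. It assumes that "2 x 8 cores" means two sockets with eight cores each, and it uses decimal units (1 TB = 1000 GB); the totals are derived here for illustration and are not figures stated by the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Per-node figures as quoted in the article."""
    name: str
    nodes: int
    sockets_per_node: int   # assumption: "2 x 8" = 2 sockets x 8 cores
    cores_per_socket: int
    ram_gb_per_node: int

    @property
    def total_cores(self) -> int:
        return self.nodes * self.sockets_per_node * self.cores_per_socket

    @property
    def total_ram_gb(self) -> int:
        return self.nodes * self.ram_gb_per_node


irma = Cluster("Irma", nodes=250, sockets_per_node=2,
               cores_per_socket=8, ram_gb_per_node=256)
bianca = Cluster("Bianca", nodes=200, sockets_per_node=2,
                 cores_per_socket=8, ram_gb_per_node=128)

# Bianca also has 4 TB of node-local storage on every compute node.
bianca_local_storage_tb = bianca.nodes * 4

for cluster in (irma, bianca):
    print(f"{cluster.name}: {cluster.total_cores} cores, "
          f"{cluster.total_ram_gb / 1000:.1f} TB RAM in total")
print(f"Bianca node-local storage: {bianca_local_storage_tb} TB in total")
```

Under these assumptions Irma comes to 4000 cores and Bianca to 3200 cores, with Bianca's node-local disks adding up to 800 TB alongside the 6 PB Castor file system.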
Both parts of the project use the secure backup resource at PDC, which plays an important part in ensuring that the information security guidelines are fulfilled. The project has given SNIC and the participating SNIC centres new knowledge that can be applied in future projects.
One of the first groups of users on Bianca was the SweGen project. It is one of the projects in the Swedish Genomes Program of the National Projects initiative at SciLifeLab, funded by the Knut and Alice Wallenberg Foundation. The aim of the SweGen project is to construct a reference dataset for the genetics research community and clinical genetics laboratories. A high-quality genetic variant database for the Swedish population is being established from the genomes of one thousand individuals selected to reflect the genetic structure and geographical distribution of the Swedish population. The variant frequencies have been made available at swefreq.nbis.se.
Letting researchers from several universities and research institutes analyse sequencing data on a national resource has required stronger IT security and the development of appropriate routines and procedures. Due to the security concerns that arise when working with sensitive personal data, the UPPMAX team has worked hard to maintain full separation between projects. Moreover, data transfers are logged in such a way that it should be impossible to extract any data from the system without leaving a complete audit trail. In order to achieve this, Bianca provides a fully virtualized and compartmentalized environment based on OpenStack, where every project gets its own virtualized version of an UPPMAX cluster, so no login or compute node is ever shared between projects.
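The idea of a transfer log that cannot be quietly altered can be illustrated with a small, purely hypothetical Python sketch. It is not a description of how Bianca implements its logging, which the article does not detail. The sketch uses a generic technique, a hash-chained log, in which every entry stores the SHA-256 hash of the previous one, so that editing or removing an earlier entry is detected on verification. The function names, record fields and example values are all invented for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(record: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_entry(log: list, project: str, user: str,
                 filename: str, direction: str) -> dict:
    """Append a transfer record that is chained to the previous entry."""
    record = {
        "project": project,
        "user": user,
        "file": filename,
        "direction": direction,  # e.g. "in" or "out"
        "prev": log[-1]["hash"] if log else GENESIS,
    }
    entry = dict(record, hash=_digest(record))
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Return True only if no entry has been altered, removed or reordered."""
    prev_hash = GENESIS
    for entry in log:
        record = {k: v for k, v in entry.items() if k != "hash"}
        if record["prev"] != prev_hash or _digest(record) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical usage: two transfers for an invented project name.
demo_log = []
append_entry(demo_log, "example-project", "alice", "variants.vcf.gz", "in")
append_entry(demo_log, "example-project", "alice", "summary.tsv", "out")
print(verify(demo_log))
```

A real system would also need to protect the log itself, for example by storing it outside the project's virtual cluster, which fits the compartmentalized design described above.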
While designing and implementing the resources, the UPPMAX team carried out several internal and external security assessments of the system. The efforts to maintain and improve security will continue throughout the entire lifetime of the resources.