
BioExcel Assists Release of Workflow Manager for Large Genetic Studies

Rossen Apostolov, PDC

The BioExcel Center of Excellence has assisted in the recent release of a new version of the COMPSs programming model. COMPSs powers GUIDANCE, a pipeline for large-scale genetic studies, by parallelizing it at the task level and enabling it to run on distributed computing platforms.

To understand the significance of this, note that the computational requirements of large genetic studies keep growing alarmingly, both in capacity and in complexity. For example, the full analysis of the genotypes of thousands of individuals involves thousands of different types of tasks, each of which has specific computational requirements. To address these needs, the BioExcel partner team at the Barcelona Supercomputing Center (BSC) developed GUIDANCE, a modular compilation of programs for performing complete genetic association analyses. GUIDANCE makes it possible for researchers to perform all the steps involved in a large-scale genome- and phenome-wide association analysis in a single execution. It also lets users perform the steps in a modular way, with optional user intervention.

The GUIDANCE implementation is based on COMPSs, a task-based programming framework that facilitates the development and execution of parallel applications and workflows on distributed infrastructures such as high performance computing (HPC) clusters, grids and clouds. As a result, GUIDANCE can be integrated into multiple parallel platforms. COMPSs is able to parallelize, at the task level, sequential applications written in Python, Java and C/C++. At execution time, COMPSs schedules, balances and organizes all the necessary subtasks to ensure efficient usage of the computing resources. It also takes care of the data transfers between tasks when those tasks are distributed across remote nodes.
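The idea of task-level parallelism described above can be illustrated with a minimal sketch. It does not use the COMPSs API and is not taken from GUIDANCE. It uses Python's standard concurrent.futures module as a stand-in on a single machine, and the function names (association_task, merge_task, run_workflow) and the toy data are hypothetical. The sketch shows only the general pattern: independent tasks are submitted together, and a later step waits for all of their results.

```python
from concurrent.futures import ThreadPoolExecutor


def association_task(chromosome, genotypes):
    """Hypothetical per-chromosome task: count samples carrying a variant."""
    return chromosome, sum(1 for g in genotypes if g > 0)


def merge_task(partial_results):
    """Hypothetical merge step that depends on every per-chromosome task."""
    return dict(partial_results)


def run_workflow(data, max_workers=4):
    # The per-chromosome tasks are independent, so they are all submitted
    # up front and the pool is free to run them concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(association_task, chrom, geno)
                   for chrom, geno in data.items()]
        # Synchronization point: the merge needs every partial result.
        partial = [future.result() for future in futures]
    return merge_task(partial)


# Toy genotype codes (0 = no variant allele), invented for illustration.
toy_data = {
    "chr1": [0, 1, 2, 0],
    "chr2": [0, 0, 1],
    "chr3": [2, 2, 1, 1, 0],
}
summary = run_workflow(toy_data)
```

In a framework such as COMPSs, the scheduling, load balancing and data transfers between remote nodes are handled by the runtime rather than by an explicit pool as in this sketch.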

As an example, GUIDANCE was used for a recent genetic study identifying the locations of genes associated with type 2 diabetes. The study was based on the reanalysis of seventy thousand publicly available genetic samples and led to the identification of seven new gene locations associated with the illness. These included variants of low and rare frequency in the population that could only have been found using the GUIDANCE methodology (see image).

COMPSs is used to manage workflows for studies of genetic data. These results are from a genetic study about diabetes using data from seventy thousand individuals.

Both COMPSs and GUIDANCE are open source and can be used free of charge by biomolecular research communities. If you would like more information about using COMPSs and GUIDANCE, please contact us or visit our workflow support forums at Ask.BioExcel.eu.