Supporting Research Software Development

Dirk Pleiter, PDC

As the digitalisation of science progresses, a growing number of research and engineering teams rely on software [4]. Scientific progress therefore depends on this software being continuously enhanced according to the needs of those teams while also being maintained sustainably. As the lifetimes of the computer architectures on which this software runs are typically much shorter than the lifetime of the software itself, this software must be not only portable but also performance portable. This means that it must be possible to run the software on different computer architectures and to achieve a similar level of efficiency on each of them. These observations apply in particular to research software used on high-performance computing (HPC) systems. While all modern HPC architectures are extremely parallel, the parallelism may be at very different levels. For instance, some systems come with a larger number of less powerful, CPU-only nodes, while other systems feature fat nodes with several graphics processing units (GPUs) that are used as compute accelerators. Software applications therefore need to adapt to the scale of the different architectures on which they may be used and to run efficiently irrespective of the inherent parallelism of those architectures.

Most people in academia would probably agree that research software is becoming increasingly relevant to research and development. However, there is frequently a dichotomy between being an expert in a given research field and being an expert in developing and/or using HPC software for that particular domain. This distinction creates a need for support that assists research domain experts in the use and/or development of software for their domain. This article reflects on this evolving need and, in particular, on whether there is currently sufficient support within academia for developing and maintaining such software. For Sweden, now is an ideal time to consider this question. The pending changes to the Swedish HPC research infrastructure offer an opportunity to trigger improvements in how Sweden provides the necessary support and makes progress in this area. (This task has been referred to as the provision of application expert support services.) Supporting the development of performance portable research software is not a standalone process; it should be regarded as an effort to link research infrastructures with researchers in a way that leads to the production of high-quality research. In the case of an HPC research infrastructure, that link is primarily via the people who develop or maintain suitable research software or who help researchers make the best use of that software on the available HPC systems.

At this stage, there is no simple answer to the question of whether there is adequate support, either globally or in Sweden, for developing research software. This article focuses on only one aspect of this question, namely the role of the people who develop such software. Too often their role and the significance of their contributions are not fully recognised. Depending on the particular research and engineering domain, these people often lack opportunities for scientific recognition through publications, and career paths are mostly absent. Part of the problem has been that there was no universal concept or term for this type of work. (In the Swedish context, the term application experts has been used for people undertaking, among other things, these kinds of tasks.) One big step towards overcoming this situation was establishing the concept of a research software engineer (RSE). This move towards a global concept started more than ten years ago in the UK, and the momentum of the RSE movement has been growing since then, with international attention to the importance of RSEs increasing significantly in recent times. A Nordic RSE initiative was started in 2018 (see www.pdc.kth.se/publications/pdc-newsletter-articles/2018-no-1/the-nordic-research-software-engineer-initiative-1.823934 ), and for more information about the general development of the RSE movement, see the overview in [2]. For details about a very recent activity addressing this area, see the RSE-HPC-2022 workshop ( us-rse.org/rse-hpc-2022 ) that was held during the recent SC22 supercomputing conference. Meanwhile, RSEs continue to be a vital part of the research communities utilising HPC, and organisations have been established that have hired dozens of RSEs (explicitly as RSEs, as application experts, or with other job titles).

While promoting the concept of RSEs is an essential step towards creating awareness of the important work done by the many such people in academia, it does not define a specific job profile for an RSE. Although a focus on developing research software is a common denominator of the work done by RSEs, there are many options regarding the roles, skills and primary tasks of any given RSE. In addition, there are many options for where to place RSEs within academic organisations and how to organise the engagement of RSEs with the researchers they support. To complicate matters, what constitutes a good choice for many of these options is highly context-dependent and can, for example, vary significantly with the type and area of research being done and with the computer architectures being used. Some of the possible choices are discussed below.

Let us start with the question of how to engage RSEs with research teams. From the perspective of the research team, short-term assignments with a duration of a few months would enable RSEs to join the team temporarily and address tasks with a narrow scope. For example, an RSE could provide advice on porting code from one HPC system to another or on refactoring code (that is, rewriting code to make it portable or making significant changes to the code structure to make it easier to maintain). As part of such assignments, RSEs could assist in the implementation of proofs-of-concept, for instance, to test whether a certain way of rewriting code would have the desired result. While such short-term engagements allow, in principle, for the dynamic assignment of RSEs, long-term assignments will be much more important, particularly those related to the implementation of new codes or the refactoring of existing codes to support new architectures. For example, assignments to enable codes for GPU-accelerated HPC systems like Dardel or future EuroHPC systems, including the upcoming EuroHPC exascale system JUPITER, will be significant, and such tasks may take several years.

Secondly, the question of the requisite skill profiles for RSEs depends heavily on the expected outcomes of each particular task or assignment. Developing research software for HPC is particularly demanding because of the diversity of skills that it requires. Knowledge of and experience in the best practices of modern software engineering, which include, for example, the use of modern C++, are sufficient only for implementing software components that have already been well specified by others. An understanding of modern computer architectures is needed for performance engineering tasks. However, only the combination of a domain-specific background and knowledge of the relevant numerical algorithms will give an RSE the abilities necessary for playing a leading role in the development of a complex research code. Such a background will also simplify interactions with researchers in that domain, as both sides share an understanding of the problems that are being addressed with a given code or workflow.

Thirdly, let us consider the question of where to place RSEs within academic organisations. Answering this question involves trade-offs. On the one hand, an RSE should work closely with researchers in a specific domain to develop a good understanding of their needs and requirements. Allowing an RSE to act as a member of a research team or community can be an important factor in attracting people who are highly qualified for such tasks. On the other hand, RSEs also need a range of skills that are relevant across multiple research areas because they relate more to computer systems than to specific research topics. Some examples are advanced software engineering skills and the ability to transform and port code to use compute accelerators like GPUs, to adopt software quality standards, or to implement standards for findable, accessible, interoperable and reusable (FAIR) software. In many cases, these types of skills need to be continuously improved as HPC systems evolve. Hence RSEs would benefit from working together with other RSEs and participating in joint upskilling efforts. In fact, various universities and other organisations have established dedicated organisational units for their RSEs, which have been shown to work successfully. (See [3] for some case studies.) Note that the latter approach would still allow RSEs to be placed in research groups for part of their time. This has worked well for various PDC staff members involved in software development. In this context, engagement in research software development community organisations has also proven to be useful in countries like the UK (see society-rse.org ) and the US (see us-rse.org ).

The realisation of the concept of research software engineering has come a long way in the past ten years, and the benefits of research software engineering can now be considered well established. However, as a discipline, research software engineering is still in relative infancy, with many choices for specific realisations yet to be explored. The academic HPC research infrastructure in Sweden is currently undergoing an important transition towards consolidation of hardware resources. This makes it possible to reduce the cost of provisioning hardware. It also allows for a transition towards a more service-oriented provisioning of HPC and other resources based on a separation between the lower-level generic layer of e-infrastructure services (provided by the underlying physical HPC and data storage resources) and the higher-level services, which can be domain-specific services as described in [1]. Such higher-level services can include, for instance, web-based services that are specific to particular research domains and also domain-specific support services (such as those provided by RSEs with expertise in a particular research area). The important task of improving, broadening and further organising the support available to Swedish academic HPC users (by improving the support provided for research software development) needs to be addressed in parallel with the transitions that are happening in the Swedish HPC infrastructure for academic research. Sweden currently has a great opportunity to exploit synergies between the ongoing evolution of its HPC research infrastructure and the strengthening of the support provided to help researchers make the best possible use of that infrastructure, thereby benefiting Swedish research as a whole.

References

[1] S. Alam, J. Bartolome, M. Carpene, K. Happonen, J.-C. Lafourcriere, D. Pleiter, Fenix: a Pan-European federation of supercomputing and cloud e-infrastructure services. Communications of the ACM. 2022. doi.org/10.1145/3511802
[2] J. Cohen et al., The Four Pillars of Research Software Engineering. IEEE Software. 2021. doi.org/10.1109/MS.2020.2973362
[3] D. S. Katz et al., Research Software Development & Management in Universities: Case Studies from Manchester’s RSDS Group, Illinois’ NCSA, and Notre Dame’s CRC. IEEE/ACM 14th International Workshop on Software Engineering for Science. 2019. doi.org/10.1109/SE4Science.2019.00009
[4] J. Switters, D. Osimo, Recognising the Importance of Software in Research - Research Software Engineers (RSEs), a UK Example. Publications Office of the European Union. 2019. doi.org/10.2777/787013