Research Software Engineer Teams: Organising the Most Advanced Level of User Support at the SeRC Universities

Dan Henningson, SeRC Director, & Erik Lindahl, SeRC Co-Director

This short article describes how researchers in the Swedish e-Science Research Centre (SeRC) have successfully developed and improved simulation software that utilises the fastest computers in the world. As the organisation of the computational infrastructure in Sweden is changing rapidly, we believe that the lessons that have been learned by SeRC researchers may provide inspiration for Swedish e-science researchers, e-infrastructure providers and university leaders as they tackle the vitally important task of organising the most advanced forms of supercomputer user support in the future.

What is SeRC?

The Swedish e-Science Research Centre (SeRC) has been funded by the government’s Strategic Research Area (SRA) initiative since 2010. The universities involved in SeRC are the KTH Royal Institute of Technology, Linköping University, Stockholm University and the Karolinska Institute. Some of SeRC’s most important achievements have been summarised in a brochure that is available on the SeRC website: e-science.se. The centre funds about 40 principal investigators (PIs) through its Multidisciplinary Collaboration Programs (MCPs) in several areas. In each MCP, there are a number of collaborating projects, where each of the more application-oriented projects must clearly define how collaboration with a more method-development-oriented project is achieved, and vice versa. An important aspect within each MCP is the involvement of key e-science experts, often designated as RSEs (Research Software Engineers). In this way, substantial collaboration has been achieved between method-development researchers and researchers from application areas, directly contributing to the success of SeRC.

In connection with the national computer infrastructure moving from the Swedish National Infrastructure for Computing (SNIC) to the National Academic Infrastructure for Supercomputing in Sweden (NAISS), there are also extensive changes in funding and responsibility for the organisation of user support. User support can be broadly divided into three levels. Responsibility for simpler user support (such as creating accounts) will lie entirely with NAISS, according to the NAISS contract with the Swedish Research Council (VR). The intermediate level of support will, for most Swedish universities, be organised through NAISS by means of co-funding from the universities. In contrast, the most advanced level of user support will be the direct responsibility of the universities themselves, in terms of both organisation and funding. This is a large change: previously, all levels of support were organised through SNIC, and the universities only contributed funding.

One advantage of the new approach is that, when it comes to advanced user support, the universities have the strategic freedom to invest in “strong” areas where the research has significant impact. This approach should also result in a stronger correlation between investments and benefits to local researchers. For this new approach to work well, there is a need for a broader discussion about how intermediate-level user support should be distinguished from advanced user support and how advanced user support should be handled by the universities and computer centres.

SeRC has helped build several strong environments comprising both outstanding research and infrastructure since 2010 at the KTH Royal Institute of Technology (KTH), Linköping University (LiU), Stockholm University (SU) and the Karolinska Institute (KI). To kick-start the next stage of the process, we have held a strategy meeting on advanced user support with SeRC PIs, Swedish e-infrastructure leaders and research software engineers at PDC. This short article summarises some of the lessons learned and invites others to start thinking about this important topic. It should not be seen as a final product but as an initiative to start a process where these matters are discussed among e-science researchers in Sweden.

Different Levels of User Support Need Different Staff and Organisations

Historically, Sweden has had no clear division between basic routine support, more specialised requests, and long-term software development. This has led to an emphasis on general expertise for handling matters of moderate complexity, but, compared to other infrastructures, it has been very rare for Swedish infrastructure experts to develop specific expertise to a level where they define international infrastructure frontiers. With the new funding streams in Sweden, it will be more straightforward to define support responsibilities according to the standard support levels of the Information Technology Infrastructure Library (ITIL) framework.

  • Tier-1 support is the traditional helpdesk handling routine matters, troubleshooting minor difficulties, and being the first source of assistance that occasionally escalates an issue to the next level. This work is funded entirely by the NAISS grant from VR and corresponds both to staff working on specific computing resources and to staff with broader national responsibility.
  • Tier-2 support covers intermediate-level services performed by staff with extensive experience, including solving unknown errors that have been escalated by Tier-1, training, providing assistance with installing and benchmarking, and undertaking general programming, such as helping with workflows or data management, perhaps up to several weeks of assistance for an area with an important need. This level of support covers most of the application experts previously funded by SNIC. As the amount of time per case is limited – but there are many cases – this type of support needs to be managed and prioritised by infrastructure staff using a national ticketing system. NAISS is finalising agreements whereby the universities that will fund this level of support will contribute 1-4 million SEK each to provide nationally available application experts whose work will be steered and prioritised entirely by NAISS. Defining these scientific areas and how researchers will collaborate with these application experts is an important topic for the first NAISS User Forum in December this year.
  • Tier-3 support consists of highly advanced tasks where staff whose profiles are dedicated to specific research areas will spend months or years writing new software, developing methodology, co-designing activities, and taking extensive responsibility for evolving the infrastructure in close collaboration with researchers who use the software. This category of support staff is expected to achieve independent international recognition from the impact of their work, and, according to the NAISS agreement with VR, these staff are entirely the responsibility of the higher education institutions. This level of support will be organised and steered independently of NAISS, and the resources available to different researchers and research fields will depend on investments by them and their universities. Some educational institutions – in particular KTH – have declared that their previous substantial co-funding to SNIC will now be directed towards this advanced level of support. It is important that other institutions follow this lead, and the SeRC universities (KTH, LiU, SU and KI) are well on the way to establishing and coordinating such activities. In the discussions so far, it has become apparent that confusion between the tier-2 and tier-3 levels of support needs to be avoided, and we thus use the international designation “Research Software Engineer” (RSE) for the most advanced (tier-3) level of expertise.

Organisation of RSEs and SeRC’s Role

It is critical that the most advanced (RSE) user support is organised based on research areas and that RSE resources are prioritised long-term and controlled by the people whose research activities have a significant impact. However, this approach engenders an expectation that those researchers should take a share of the responsibility for funding the RSE organisation that provides the tier-3 support. SeRC has, in fact, built up such a support organisation over the last decade or so, and that organisation has contributed to creating research environments that have attracted several grants from the European Research Council (ERC) and the Knut and Alice Wallenberg Foundation (KAW), Centre of Excellence grants from VR, research grants from businesses, as well as co-funding from other infrastructure organisations and hardware vendors. These research environments encompass some of the SeRC universities’ most internationally cited researchers.

SeRC’s existing RSEs define the highest level of expertise in the infrastructure, and since they also have broad high-performance computing (HPC) experience, we have seen that they are often asked to help with general infrastructure tasks. However, the RSE role is separate from tier-2 support tasks and administration. The RSE work should be organised in application domain teams with at least one scientifically responsible PI, and the practical RSE work is expected to be divided equally between the research groups and the common RSE environment at a centre. The long-term prioritisation between different scientific areas must be determined by both the likely scientific impact and the demonstrated ability of the researchers to contribute funding to the common RSE environment. Together with the team lead, the scientifically responsible researchers should report annually on how the support is used, for example, by giving details of how it defines the international infrastructure frontier, how the experts share their achievements through presentations and by organising symposia at international HPC infrastructure venues such as PASC, ISC and SC, and how they build cross-disciplinary interactions with other RSE teams.

Lessons Learned from SeRC’s Programs

Some of our ideas are based on experiences from SeRC’s Multidisciplinary Collaboration Programs (MCPs), where research teams include researchers working on applications, method development, and core infrastructure. One of the most successful programs is the SeRC Efficient Simulation Software Initiative (SESSI), where researchers in the areas of Computational Fluid Dynamics (CFD) and Molecular Dynamics (MD) have been collaborating with computer scientists and programmers. The program has been running for a decade, with the main goal of further developing the GROMACS MD software and the derivatives of the Nek5000 CFD code to make them ready for exascale computer hardware, particularly by making them run efficiently on graphics processing units (GPUs) and other types of accelerators. While the work of these teams had already had an extensive scientific impact, there is a broadly shared view that the work in SESSI required the RSEs to become much more professional and to identify themselves internationally in the RSE role rather than only working in a research group. This, in turn, has led to dozens of new grants for infrastructure and has also been a key enabler for high-profile funding, for example, from ERC and KAW. The SESSI program has also resulted in significant cross-disciplinary information sharing that led to software being extensively rewritten, including a completely new code, Neko, that scales on the largest available GPU machines and that is a finalist for the 2023 Gordon Bell prize at the time of writing.

What are the success factors that we can identify in this program, and how can they be translated into requirements for the organisation of the RSE teams at the SeRC universities?

  1. The recruitment and identification of competent researchers (from the professors and senior scientists leading the research to the researchers, postdocs and Ph.D. students performing much of the detailed work) is vital.
  2. Starting with concrete code and concrete science cases is a key element; otherwise, the different members of the team will not be able to work together efficiently but will tend to work on their own individual research problems instead of the complex multidisciplinary task of developing software that can actually solve an important scientific problem. There needs to be a critical path for advancing the science case, not just work on prototypes or general HPC activities, and progress has to be measured against this goal.
  3. Having a long-term strategic vision is important as it takes time to achieve impact, and hence, it is imperative to focus on achievements that will have a major impact not just on science, but also within the international infrastructure landscape. This, in turn, requires a long-term commitment of resources, in particular resources that act as bridging funding when, for example, external EU project funding fluctuates. The new approach (whereby universities directly provide funding for RSEs) should strive to have a similar long-term nature, and the long-term perspective must be maintained through a continued focus on concrete science cases and on what the advanced research groups need. There needs to be an organisation for the RSEs at the computer centres that the domain scientists can work with in the long term.
  4. Shorter-term complementary funding is needed for the RSE teams in order to keep the work focused on concrete science cases. This funding is primarily obtained by writing and executing successful applications for additional funding. The shorter-term project funding is also very important for ensuring that the researchers and the RSE teams maintain sufficient excellence and produce scientific results of sufficient quality. All of the MCPs initiated and funded within SeRC have had the requirement that SeRC cannot be the sole funder of the research, and this has been important for the scientific excellence of the work that has been performed.
  5. The focus needs to be on internationally leading software. As mentioned, it is necessary to engage internationally excellent researchers in the software development processes, but the software itself also has to be internationally leading, meaning that it has to be of top quality, have a large user base, or feature some other high-impact aspect that means that the people working on the software (in contrast to the scientific applications) will participate in international research and infrastructure collaborations. An excellent example is the GROMACS software, which clearly excels in both aspects. This means that one focus of the RSE team is the impact of the software itself as a success factor within academic research and education or in industry.

We believe that these success factors for the development of the codes for MD and CFD are generally applicable but may need to be developed further to also include the needs of additional areas within e-science.