Application-Driven Exascale: The JUPITER Benchmark Suite (2024)


Andreas Herten, Sebastian Achilles, Damian Alvarez, Jayesh Badwaik, Eric Behle, Mathis Bode,
Thomas Breuer, Daniel Caviedes-Voullième, Mehdi Cherti, Adel Dabah, Salem El Sayed,
Wolfgang Frings, Ana Gonzalez-Nicolas, Eric B. Gregory, Kaveh Haghighi Mood, Thorsten Hater,
Jenia Jitsev, Chelsea Maria John, Jan H. Meinke, Catrin I. Meyer, Pavel Mezentsev, Jan-Oliver Mirus,
Stepan Nassyr, Carolin Penke, Manoel Römmer, Ujjwal Sinha, Benedikt von St. Vieth, Olaf Stein,
Estela Suarez, Dennis Willsch, Ilya Zhukov
Jülich Supercomputing Centre
Forschungszentrum Jülich
Jülich, Germany

Abstract

Benchmarks are essential in the design of modern HPC installations, as they define key aspects of system components. Beyond synthetic workloads, it is crucial to include real applications that represent user requirements in benchmark suites, to guarantee high usability and widespread adoption of a new system. Given the significant investments in leadership-class supercomputers of the exascale era, this is even more important and necessitates alignment with a vision of Open Science and reproducibility. In this work, we present the JUPITER Benchmark Suite, which incorporates 16 applications from various domains. It was designed for and used in the procurement of JUPITER, the first European exascale supercomputer. We identify requirements and challenges and outline the project and software infrastructure setup. We provide descriptions and scalability studies of selected applications and a set of key takeaways. The JUPITER Benchmark Suite is released as open source software with this work at github.com/FZJ-JSC/jubench.

Index Terms:

Benchmark, Procurement, Exascale, System Design, System Architecture, GPU, Accelerator

I Introduction

The field of High Performance Computing (HPC) is governed by the interplay of capability and demand driving each other forward. During the design and purchase phase of supercomputer procurements for public research, the capability of a machine is usually assessed not only by theoretical, system-inherent numbers, but also by effective numbers relating to actual workloads. These workloads are traditionally benchmark programs that test specific aspects of the system design — like the floating-point throughput, memory bandwidth, or internode latency. While these synthetic benchmarks are well-suited for the assessment of distinct features, for a more integrated and realistic perspective, they should be complemented by application benchmarks. Application benchmarks use state-of-the-art scientific applications to assess the performance of integrated designs. Complex application profiles utilize various types of hardware resources dynamically during the benchmark’s runtime, showcasing real-world strengths and limitations of the system.

This paper introduces the JUPITER Benchmark Suite, a comprehensive collection of 23 benchmark programs meticulously documented and designed to support the procurement of JUPITER, Europe’s first exascale supercomputer. On top of 7 synthetic benchmarks, 16 application benchmarks were developed in close collaboration with domain scientists to ensure relevance and rigor. Additionally, this paper offers valuable insights into the state-of-the-practice of exascale procurement, shedding light on the challenges and methodologies involved.

Preparations for the procurement of JUPITER were launched in early 2022 and finally came to fruition with the awarding of the contract in October 2023. JUPITER is funded by the EuroHPC Joint Undertaking (50 %), Germany’s Federal Ministry for Education and Research (25 %), and the Ministry of Culture and Science of the State of North Rhine-Westphalia of Germany (25 %), and is hosted at Jülich Supercomputing Centre (JSC) of Forschungszentrum Jülich. As part of the procurement, the benchmark suite was developed to motivate the system design and evaluate the proposals submitted in response to the Request for Proposals. The suite focuses on application benchmarks to ensure high practical usability of the system. This work presents the JUPITER Benchmark Suite in detail, highlighting design choices and project setup, describing the benchmark workloads, and releasing them as open source software. The suite includes 23 benchmarks across different domains, each with unique characteristics such as compute-intensive, memory-intensive, and I/O-intensive workloads. The benchmarks are grouped into three categories: Base, representing a mixed base workload for the system; High-Scaling, highlighting scalability to the full exascale system; and synthetic, determining various key hardware design features. The benchmark suite represents a first step towards Continuous Benchmarking to detect system anomalies during the production phase of JUPITER.

The main contributions of this paper are:

  • An in-depth description of the use of benchmarks in HPC procurement, including relevant background information.

  • The description of a novel methodology to assess exascale system designs in the form of High-Scaling benchmarks.

  • The development of suitable benchmark workloads based on a representative set of scientific problems, applications, and synthetic codes.

  • Scalability results on the preparation system for all application benchmarks of the suite.

  • Insights and best practices learned from application scaling and the procurement process in general.

  • Release of the full JUPITER Benchmark Suite as open source software.

After providing background information (Section II) and details on the used infrastructure (Section III), the suite’s individual benchmarks are presented in Section IV. Key takeaways are provided in Section V, followed by a conclusion and outlook in Section VI.

II Background

II-A Requirements

The main requirements of the benchmark suite stem from the need to represent existing and upcoming user communities in the system design process. This ensures a fitting design and fosters adoption of the system by users. The suite must cover the wide user portfolio of the HPC center, containing typical applications from various domains utilizing the current HPC infrastructure, and also represent expected future workloads. Moreover, it is essential to ensure diversity in terms of methods, programming languages, and execution models, since such diversity is an inherent characteristic of the application portfolio for upcoming large-scale systems.

The context of a procurement poses high requirements for replicability, reproducibility, and reusability[Fehr16RRR]. Replicability, i.e., the seamless execution on the same hardware by the developer, is an elementary requirement to guarantee robustness. Beyond that, reproducibility describes the seamless execution on different hardware by someone else, making it a key requirement in the context of the procurement since both the site and the system provider must be able to run the suite to obtain the same results. Ensuring reusability, i.e., designing the framework for easy adaptation for a variety of tasks in the future, is essential to justify the substantial investment involved in the creation of the suite.

Vendors participating in the procurement process must also invest considerable time in system-specific adjustments while meeting the procurement’s requirements and complying with its rules, usually within short time scales. Therefore, it is in everyone’s best interest to have clear, well-defined guidelines and, ideally, to leverage an existing benchmark suite to build on established expertise.

II-B JUPITER Procurement Scheme

The procurement for the JUPITER system uses a Total-Cost-of-Ownership-based (TCO) value-for-money approach, in which the number of executed reference workloads over the lifespan of the system determines the value. Given the size of state-of-the-art supercomputers as well as their corresponding power consumption, costs for electricity and cooling are a substantial part of the overall project budget. Using a mixture of application benchmarks as well as synthetic benchmarks, operational costs are computed in a well-defined procedure. While synthetic benchmarks allow for assessing key performance characteristics, they do not allow a realistic assessment of resource consumption during the lifetime of the system for TCO. Therefore, a greater emphasis is placed on application performance rather than on synthetic tests.
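
To illustrate the general idea of a TCO-based value-for-money metric — not the actual, well-defined procurement procedure, whose details are not reproduced here — a minimal sketch with hypothetical cost and throughput figures could look as follows:

```python
# Illustrative sketch only: the cost model, names, and all numbers below are
# assumptions; the real procedure is defined in the procurement documents.

def value_for_money(workloads_per_day: float,
                    lifetime_years: float,
                    capex_eur: float,
                    power_kw: float,
                    energy_price_eur_per_kwh: float) -> float:
    """Executed reference workloads over the system lifetime per Euro of TCO."""
    hours = lifetime_years * 365 * 24
    opex_eur = power_kw * hours * energy_price_eur_per_kwh  # electricity/cooling share
    total_workloads = workloads_per_day * lifetime_years * 365
    return total_workloads / (capex_eur + opex_eur)

# Example: two hypothetical proposals with different throughput and power draw
print(value_for_money(1000, 6, 250e6, 15e3, 0.25))
print(value_for_money(1100, 6, 250e6, 18e3, 0.25))
```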

Given the targeted system performance of 1 EFLOP/s at 64-bit precision, an additional novel benchmark type focusing on the scale of the system is introduced — High-Scaling benchmarks. A subset of applications able to fully utilize JUPITER was identified, and use cases were defined to make them part of this novel category. In the course of the paper, we will refer to either High-Scaling benchmarks or Base benchmarks to differentiate between the two.

The JUPITER system is envisioned to consist of two connected sub-systems, modules in the Modular Supercomputing Architecture (MSA) concept[etp4hpcMSA]. JUPITER Booster is the exascale-level module utilizing massive parallelism through GPUs for maximum performance with high energy efficiency. JUPITER Cluster is a smaller general-purpose module employing state-of-the-art CPUs for applications with lower inherent parallelism and stronger memory demands. Both compute modules are procured jointly, together with a third module made of high-bandwidth flash storage. The benchmark suite has dedicated benchmarks for all modules, partly even benchmarks spanning Cluster and Booster, dubbed MSA benchmarks. Execution targets of the benchmarks are listed in the last columns of Table II.

During the procurement, the results of the execution of the benchmark suite for a given system proposal are weighted and combined to compute a value-for-money metric. The outcomes are compared and incorporated with other aspects into the final assessment of the system proposals.

II-C Implementation for JUPITER

[Figure 1]

The previous two sections outline general requirements for the JUPITER Benchmark Suite. Their assessment and implementation are laid out in the following and visually sketched in Figure 1.

Based on an analysis of current and previous compute-time allocations on predecessor systems, the suite covers a variety of scientific areas and includes applications from the domains of Weather and Climate, Neuroscience, Quantum Physics, Material Design, Biology, and others. Beyond that, diversity in workloads is realized: Artificial-Intelligence-based (AI) methods as well as classical simulations; codes based on C/C++ and Fortran; OpenACC and CUDA. Various application profiles are included, such as memory-bound or sparse computations. Future trends of workloads, e.g., the uptake of machine learning algorithms, are inferred from general trends in research communities and from recent changes in allocations on the predecessor system.

The requirement of replicability is met by streamlining and automating benchmark execution employing the JUBE framework[jube]. Reproducibility is ensured by extensive documentation of all components and the verification of computational results. Thorough testing ensures stable execution in different environments, e.g., with a varying number of compute devices, and lowers risks of commitments by vendors. Reusability is accomplished by a modular design, effective project management workflows, clear licensing, and open source publication. To ensure a high quality of the suite, the benchmarks are standardized through a well-defined setup, with consistent directory structures, uniform descriptions, and similar JUBE configurations. The infrastructure aspects are discussed further in Section III.

For each of the Base benchmarks, i.e., the sixteen benchmarks used for the TCO/value-for-money calculation, a Figure-of-Merit (FOM) is identified and normalized to a time-metric. In most cases, the FOM is the runtime of either the full application or a part of it. In case the application focuses on rates, the time-metric is obtained by pre-defining the number of iterations and converting with the rate. Each benchmark is executed on a certain number of nodes of the preparation system (see Section III-A) in order to create a reference execution time. The number of nodes is usually selected to be 8, but deviations are possible due to workload-inherent aspects. This is a trade-off between resource economy and agility in benchmark creation (fewer nodes are more productive), and robustness towards anticipated generational leaps in the hardware of the envisioned system (with too few nodes, there is a latent danger that application profiles change severely once the requested runtime can be achieved on a single node). The time-metric, determined on the reference number of nodes, is the value to be improved upon and committed to by proposals of system designs. The number of nodes used to surpass the time-metric can be freely specified by the proposal, but is typically smaller than the reference number of nodes.
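
As a minimal illustration of this normalization (numbers are hypothetical; whether a rate is given in iterations per second or seconds per iteration is benchmark-specific):

```python
# Minimal sketch of the FOM normalization described above; numbers are hypothetical.

def time_metric_from_rate(rate_iterations_per_s: float, n_iterations_fixed: int) -> float:
    """Convert a rate-type FOM into a time-metric by fixing the iteration count."""
    return n_iterations_fixed / rate_iterations_per_s

def time_metric_from_runtime(runtime_s: float) -> float:
    """For most benchmarks, the FOM is simply the (partial) application runtime."""
    return runtime_s

# Reference execution on, e.g., 8 nodes of the preparation system yields the
# value that proposals must improve upon (typically on fewer nodes).
reference_time = time_metric_from_rate(rate_iterations_per_s=2.5, n_iterations_fixed=1000)
print(f"time-metric to commit against: {reference_time:.1f} s")
```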

Five applications used in the Base benchmarks are capable of scaling to the full scale of the preparation system. They form the additional High-Scaling category and define a set of benchmarks that aim to compare the proposed system designs with the preparation system for large-scale executions. By requirement, the future system is known to achieve 1 EFLOP/s High-Performance Linpack (HPL) performance, implying a theoretical peak performance larger than that. In JUPITER’s procurement, runtimes must be committed for the High-Scaling benchmarks using a 1 EFLOP/s (theoretical peak performance) sub-partition. For each high-scaling application, a workload is defined to fill a 50 PFLOP/s (theoretical) sub-partition of the preparation system (about 640 nodes) and a 20× larger sub-partition of the future system (20 × 50 PFLOP/s = 1 EFLOP/s). Some benchmarks have algorithmic limitations, like a requirement of powers of two in node counts; in such cases, the closest smaller compatible number of nodes is taken (for example, 512 nodes). The final assessment is based on the ratio of the runtime value committed for the future 1 EFLOP/s (theoretical) sub-partition and the reference value. When the benchmark’s specific workload configuration fills up a large portion of the GPU memory on the preparation system, there is a danger that on a proposed system the scaled-up version could become limited by available memory and not showcase an accelerator’s full compute capability. This is due to the trend of a growing imbalance between the advancement of compute power and memory. To give more flexibility in system design, up to four reference variants of the respective workload are prepared, taking up 25 % (tiny, T), 50 % (small, S), 75 % (medium, M), and 100 % (large, L) of the available GPU memory on the preparation system (40 GB), respectively. The system proposal may choose the variant that best exploits the available memory on the proposed accelerator after scale-up.
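
The sizing logic of the memory variants and the scale-up factor can be summarized in a short sketch; the per-GPU memory, GPU count per node, and reference node count are taken from the text, everything else is illustrative:

```python
# Sketch of the High-Scaling sizing rules described above.

GPU_MEM_GB = 40          # per-GPU memory on the preparation system (A100, 40 GB)
GPUS_PER_NODE = 4
REFERENCE_NODES = 640    # ~50 PFLOP/s (theoretical) sub-partition; some benchmarks use 512
SCALE_FACTOR = 20        # 20 x 50 PFLOP/s = 1 EFLOP/s (theoretical)

# Memory variants: fraction of the available GPU memory filled by the workload
variants = {"T": 0.25, "S": 0.50, "M": 0.75, "L": 1.00}

for name, fraction in variants.items():
    per_gpu = fraction * GPU_MEM_GB
    total = per_gpu * GPUS_PER_NODE * REFERENCE_NODES
    print(f"variant {name}: {per_gpu:4.0f} GB/GPU, "
          f"~{total / 1e3:5.1f} TB aggregate on the reference sub-partition")

print(f"target sub-partition on the future system: {SCALE_FACTOR}x the reference workload")
```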

The seven synthetic benchmarks are selected to test individual features of the hardware components, such as compute performance, memory bandwidth, I/O throughput, and network design. Each benchmark has an individual FOM with unique rules and possibly sub-benchmarks, evaluated distinctly.

II-D Related Work

Benchmarks have always played an important role in HPC. One objective is the assessment and comparison of technical solutions, whether at the scale of world-leading cluster systems[top500, Dongarra03HPL] or on a smaller node-level scale [Sorokin20, Xu15NAS, Bienia08PARSEC]. Benchmarks can be specific to one application topic[Herten19ICEI, Albers22beNNch, Farrell21MLPerf] or cover multiple areas in the form of suites [Dongarra12HPCChallenge, Stratton12Parboil, Danalis10SHOC, Che09Rodinia].

Foundational work focused on identifying common patterns across various applications, with the objective of classifying them based on similar computational and communication characteristics [asanovic06dwarfs]. These patterns, referred to as dwarfs, are defined at a high level of abstraction and are intended to capture the most relevant workloads in high-performance computing. Table I categorizes the benchmarks of the JUPITER Benchmark Suite according to these dwarfs, and also gives the predominant scientific domain a benchmark represents. While dwarfs can be viewed as a set of blueprints for application-inspired synthetic benchmarks, their simplicity by design prevents them from fully capturing all dynamics of real applications. To address this issue in supercomputer co-design, more complex and versatile computational motifs, termed octopodes, are proposed by Matsuoka et al. [MatsuokaRethinkingProxy].

TABLE I: Domains and dwarf categorization of the benchmarks. Dwarf columns: Dense LA, Sparse LA, Spectral, Particle, Structured Grid, Unstructured Grid, Monte Carlo; a bullet (•) marks each dwarf covered by a benchmark.

Benchmark | Domain | Dwarfs
Amber* | MD | • •
Arbor | Neurosci. | • • •
Chroma-QCD | QCD | • • •
GROMACS | MD | • •
ICON | Climate | • •
JUQCS | QC | •
nekRS | CFD | • • •
ParFlow* | Earth Sys. | • •
PIConGPU | Plasma | • •
Quantum Espresso | Materials Sci. | • • •
SOMA* | Polymer Sys. | • •
MMoCLIP | AI (MM) | •
Megatron-LM | AI (LLM) | •
ResNet* | AI (Vision) | •
DynQCD | QCD | • • •
NAStJA | Biology | • •
Graph500 | Graph | Graph Traversal (Dwarf 9)
HPCG | CG | •
HPL | LA | •
IOR | Filesys. | Input/Output
LinkTest | Network | P2P, Topology
OSU | Network | Message Exchange, DMA
STREAM | Memory | Regular Access

  • *

    The benchmarks were prepared for the procurement, but not actually used.

  • §

    The following abbreviations are used: MD - Molecular Dynamics; QCD - Quantum Chromodynamics; QC - Quantum Computing; CFD - Computational Fluid Dynamics; MM - Multi-Modal; LLM - Large Language Model; LA - Linear Algebra; P2P - Point-to-Point; DMA - Direct Memory Access.

SPEC[Juckeland15SPEC, Boehm18SPEC] is one of the most extensive, well-known benchmarking initiatives aimed at commercial users. The benchmark suite SPECaccel2023 uses the offloading APIs OpenACC and OpenMP to measure performance with computationally intensive parallel applications, following the principle “same source code for all”. Benchmarks for the use case of system design and exascale procurement[Malaya23Frontier] have specific requirements (see Section II-A) and are typically not made publicly available due to concerns regarding sensitive information and the elaborate implementation effort. The PRACE Unified European Applications Benchmark Suite[ueabs2024] represents a step towards a culture of open sharing, but its future support is uncertain. Other notable efforts include the CORAL-2 benchmarks[coral2] used for the procurement of the three exascale systems in the US, and the recently developed NERSC-10 Benchmark Suite[nersc10] used in an ongoing procurement. HPC centers can benefit from each other’s experience, driven by a spirit of Open Science, reproducibility, and sustainable software development[Fehr16RRR]. The integration of DevOps principles, such as Continuous Benchmarking, is gaining popularity to support these aims[Pearce23Benchpark, Adamson24Jacamar].

III Benchmark Infrastructure

TABLE II: Application features and execution targets of the benchmarks. Superscripts on High-Scale node counts denote the available memory variants (T, S, M, L); bullets (•) in the Module/Device column mark the execution targets.

Benchmark | Progr. Language, [Libraries,] Prog. Models | Licence | Nodes (Base) | Nodes (High-Scale) | Module/Device
Amber* | Fortran, CUDA | Custom | 1 | | •
Arbor | C++, CUDA/HIP | BSD-3-Clause | 8 | 642^(T,S,M,L) | •
Chroma-QCD | C++, QUDA, CUDA/HIP | JLab | 8 | 512^(S,M,L) | •
GROMACS | C++, CUDA/SYCL | LGPLv2.1 | 3/128 | | •
ICON | Fortran/C, OpenACC/CUDA/HIP | BSD-3-Clause | 120/300 | | •
JUQCS | Fortran, CUDA/OpenMP | None | 8 | 512^(S,L) | • •
nekRS | C++/C, OCCA, CUDA/HIP/SYCL | BSD-3-Clause | 8 | 642^(S,M,L) | •
ParFlow* | C, Hypre, CUDA/HIP | LGPL | 4 | | •
PIConGPU | C++, Alpaka, CUDA/HIP | GPLv3+ | 8 | 640^(S,M,L) | •
Quantum Espresso | Fortran, ELPA, OpenACC/CUF | GPL | 8 | | •
SOMA* | C, OpenACC | LGPL | 8 | | •
MMoCLIP | Python, PyTorch, CUDA/ROCm1 | MIT | 8 | | •
Megatron-LM | Python, PyTorch/Apex, CUDA/ROCm1 | BSD-3-Clause | 96 | | •
ResNet* | Python, TensorFlow, CUDA/ROCm1 | Apache-2.0 | 10 | | •
DynQCD | C, OpenMP | None | 8 | | •
NAStJA | C++, MPI | MPL-2.0 | 8 | | •
Graph500 | C, MPI | MIT | 4/16/all | | •
HPCG | C++, OpenMP, CUDA/HIP | BSD-3-Clause | 1/4/all | | • •
HPL | C, BLAS, OpenMP, CUDA/HIP | BSD-4-Clause | 1/16/all | | • •
IOR | C, MPI | GPLv2 | -/>64§ | | • •
LinkTest | C++, MPI/SIONlib | BSD-4-Clause+ | all | | • • •
OSU | C, MPI, CUDA | BSD | 1/2 | | • • •
STREAM | C, CUDA/ROCm/OpenACC | Custom | 1 | | • •

  • *

    The benchmarks were prepared for the procurement, but not actually used.

  • 1

    For PyTorch and TensorFlow, CUDA and ROCm backends are available; through extensions, also backends for Intel GPUs exist (not in mainline repositories).

  • §

    IOR features two sub-benchmarks, easy and hard. The number of nodes is a free parameter in easy. In hard, it can also be chosen freely, as long as more than 64 nodes are taken.

III-A Preparation Systems

The JUPITER Benchmark Suite was prepared on JUWELS, in particular JUWELS Booster, a top 20 system[top500nov2023] hosted at JSC[juwels]. JUWELS Booster was installed in 2020 and provides a performance of 73 PFLOP/s theoretical peak and 44 PFLOP/s for the HPL. The system is connected to the JUST5 storage system[graf2021just]. JUWELS Booster provides 936 GPU nodes integrated into 39 Eviden BullSequana XH2000 racks, with 2 racks (48 nodes) building a cell in the DragonFly+ topology of the high-speed interconnect. Each node has 4 NVIDIA A100 GPUs and 4 NVIDIA Mellanox InfiniBand HDR200 adapters, with one adapter available per GPU. Two AMD EPYC Rome 7402 CPUs (2 × 24 cores) are connected to 512 GB of DDR4 memory.

Preparations for the High-Scaling benchmarks utilized a 50 PFLOP/s (theoretical) sub-partition of JUWELS Booster.

JUWELS provides general software dependencies through EasyBuild[hoste2012easybuild]. Reproducibility is achieved by either using upstream installation recipes (easyconfigs) or by upstreaming custom recipes.

III-B JUBE

Every benchmark is implemented in the JUBE[jube] workflow environment to facilitate productive development and reproducibility. In benchmark-specific definition files (JUBE scripts), parameters and execution steps (compilation, computation, data processing, verification) are defined. These are then interpreted by the JUBE runtime, which resolves dependencies and eventually submits jobs for execution to the batch system. By inheriting from system-specific definition files (platform.xml), batch submission templates are populated and independence of the underlying system is achieved. The various sub-benchmarks and variants are implemented by tags, which select different versions of parameter definitions. After execution, the benchmark results are presented by JUBE in a concise tabular form, including the FOM.

Within the JUPITER procurement, the JUBE scripts are part of the documentation. They exactly define execution parameters and instructions with descriptive annotations. A textual documentation is provided as part of the benchmark-accompanying description.

III-C Descriptions

Beyond the execution reference through JUBE scripts, each benchmark is accompanied by an extensive description. All descriptions are normalized, using an identical structure with similar language. Example parts are information about the source and the compilation, execution parameters and rules, detailed instructions for execution and verification, sample results, and concluding commitment requests. In all relevant sections, references to the JUBE scripts are made in addition to the textual descriptions. For the vendors, the use of JUBE is recommended but not mandatory.

PDFs generated from the benchmark descriptions are part of the committed procurement documentation, including hashes of archived benchmark repositories.

III-D Git and Submodules

All components of a benchmark are available in a single Git repository as a single source of truth. A common structure is established, containing description, JUBE scripts, auxiliary scripts, benchmark results, and the source code of the benchmarked application. Utilizing the attached issue tracker, project management and collaboration are facilitated.

By default, the sources are included as references in the form of Git Submodules. Submodules enable a direct linkage to well-defined versions of source code, but do not unnecessarily clutter the benchmark repository with static, potentially extensive copies. They are well integrated into the Git workflow and easily updated. In cases where inclusion as a Git Submodule is not possible, scripts and detailed instructions for download are provided.

For delivery as part of the procurement specification package, each benchmark repository is archived as a tar file.

If too large for inclusion in the Git repository, input data is provided as a separate download, including a verifying hash.

III-E Project Management

The benchmark suite development efforts were supported efficiently by clear project management workflows over several months.

A core team of HPC specialists and scientific researchers initiated the process early, curating a list of potential benchmarks based on their expertise and experience with existing HPC systems, while also incorporating insights from previous procurements. In close collaboration with domain scientists, this list was gradually refined to ensure a balanced and diverse selection that met the requirements outlined in subsectionII-A.

Competent teams, each led by a team captain, were responsible for individual applications. GitLab issues were used to document biweekly meetings and track per-application progress in the form of a pre-defined checklist with 11 points (ranging from source code availability through JUBE integration to description creation). Hack days facilitated collaboration while running the benchmarks on the preparation system.

IV Benchmarks

This section describes the JUPITER Benchmark Suite, containing 23 benchmarks: 7 synthetic and 16 application benchmarks. In the procurement process, the number of application benchmarks was reduced to 12. Table II gives an overview of all benchmarks, including application features and execution targets.

With this work, the JUPITER Benchmark Suite is released as open source software at https://github.com/FZJ-JSC/jubench, with individual repositories for each benchmark[zenodo_amber, zenodo_arbor, zenodo_chroma, zenodo_gromacs, zenodo_icon, zenodo_juqcs, zenodo_nekrs, zenodo_parflow, zenodo_picongpu, zenodo_qe, zenodo_soma, zenodo_mmoclip, zenodo_megatron, zenodo_resnet, zenodo_dynqcd, zenodo_nastja, zenodo_graph500, zenodo_hpcg, zenodo_hpl, zenodo_icon, zenodo_linktest, zenodo_osu, zenodo_stream, zenodo_streamgpu].

IV-A Application Benchmarks

In the following, we describe in detail eleven of the application benchmarks, divided into Base and High-Scaling benchmarks. The benchmarks are based on prominent workloads in the HPC community and were specifically developed for the suite. Given their complex computational dynamics, a certain level of technical and scientific background is necessary and will be provided accordingly. The remaining benchmarks, Amber, ParFlow, SOMA, ResNet, and DynQCD, are briefly introduced first for completeness; they either use closed-source software (DynQCD) or were ultimately not used for the JUPITER procurement (the others).

  • Amber[AmberGPU, AmberTools] is a popular commercial molecular dynamics code for biomolecules. The Satellite Tobacco Mosaic Virus (STMV) case from the Amber20 benchmark suite[AmberBenchmark] (1,067,095 atoms) is chosen. The code is mainly optimized for single GPU calculations and is not intended to scale beyond a single node.

  • ParFlow[ashby1996parallel, jones2001newton] is a massively-parallel, open source, integrated hydrology model for surface and subsurface flow simulation. The ClayL test from ParFlow’s test suite (simulating infiltration into clay soil) is selected, with a problem size of 1008 × 1008 × 240 cells[Hokkanen2021].

  • SOMA[soma] performs Monte Carlo simulations for the “Single Chain in Mean Field” model[daoulas2006single], studying the behaviour of soft coarse-grained polymer chains in a solution.

  • ResNet[he2015deep] uses convolutions and residual connections for training deep neural networks and serves as a reference model in computer vision tasks. The suite includes ResNet50, implemented in TensorFlow with Horovod.

  • DynQCD[dynqcd] is a CPU-only code which performs numerical simulations for Lattice Quantum Chromodynamics (LQCD). The benchmark generates 600 quark propagators using a conjugate gradient solver for sparse LQCD fermion matrices, with high demands on the memory sub-system.

IV-A1 Base Benchmarks (Selection)

The Base benchmarks are designed to incentivize system designs that optimize the time to solution. They are first executed on a reference number of nodes on the preparation system (see Section II-C). Figure 2 gives an overview of application runtimes and the respective strong-scaling behaviors for surrounding node counts. While the absolute number can be used to judge system designs quantitatively, the strong-scaling behavior can be used as an additional data point to understand the overall design qualitatively. Note the example for reading the graph in the figure caption.

[Figure 2]
GROMACS

GROMACS[Berendsen1995, pall_tackling_2014, Abraham2015, Pall2020] is a versatile package to perform molecular dynamics simulations, focusing on biochemical molecules and soft condensed matter systems. The application integrates Newton’s equations of motion for systems with hundreds to millions of particles and provides time-resolved trajectories. Two biological systems from the Unified European Applications Benchmark Suite (UEABS)[ueabs2024] are used, test cases A and C. Test case A simulates a GluCl ion channel embedded in a membrane. Test case C contains 27 replicas of the STMV with about 28,000,000 atoms and allows testing the scalability of system-supplied Fast Fourier Transform (FFT) libraries.

ICON

The ICOsahedral Non-hydrostatic model (ICON)[Zaengl2015] is a modelling framework for weather, climate, and environmental prediction used for operational weather forecasting at the German Weather Service. ICON also provides an Earth System Model for climate simulations, i.e., a general circulation model of the atmosphere, including a land module[Giorgetta2018] and an ocean model[Korn2022]. While the atmosphere part has been ported to GPUs[Giorgetta2022], the ocean component still runs on CPUs only. ICON is available under a permissive open source licence. The JUPITER benchmark case is based on the atmosphere component, with global forecast simulations in two resolutions, resulting in two sub-benchmarks: R02B09 (5 km grid point distance) and R02B10 (2.5 km grid point distance)[Hohenegger2023]. The coarser resolution is targeted for execution on 120 nodes, the finer resolution on 300 nodes. While reasonable scaling to 2× the node count (240 nodes and 600 nodes, respectively) is possible, this is not the usual mode of operation for ICON. These simulations are crucial for ICON’s development towards a storm-resolving climate model with 1 km resolution or even less[Stevens2020]. A unique aspect of the ICON benchmark is its large input dataset: R02B09 requires 1.8 TB of data, R02B10 needs 4.5 TB. Therefore, the ICON benchmark also tests the performance of I/O operations on a system.

Megatron-LM

Megatron-LM[shoeybi2020megatronlm] is a prominent codebase in Natural Language Processing (NLP), known for its vast scale and performance capabilities. It employs the transformer model architecture[vaswani2023attention] and leverages various parallelization techniques and optimizations[narayanan2021efficient, korthikanti2022reducing, dao2023flashattention2, rajbhandari2020zero] through PyTorch to achieve high hardware utilization with excellent efficiency. The training of various recent open source GPT-like models was carried out with Megatron-LM[John2023OGX], making this benchmark crucial to assess a system’s capability to handle disruptive generative AI workloads. The benchmark trains a 175-billion-parameter model, converting the usual throughput metric (tokens per unit time) to a hypothetical time-to-solution FOM by training 20 million tokens.
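
As an illustration of this conversion (the throughput value below is hypothetical; only the 20-million-token budget is part of the benchmark definition):

```python
# Sketch: convert the usual training throughput into the benchmark's time FOM.

TOKEN_BUDGET = 20_000_000          # tokens to be trained for the FOM (from the benchmark)
measured_tokens_per_s = 35_000.0   # hypothetical aggregate throughput of a run

fom_seconds = TOKEN_BUDGET / measured_tokens_per_s
print(f"hypothetical time-to-solution FOM: {fom_seconds:.0f} s")
```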

MMoCLIP

Contrastive Language-Image Pre-training (CLIP)[radford2021learning] is a method that conducts self-supervised learning on weakly-aligned image-text pairs with open vocabulary language, resulting in language-vision foundation models. The approach enables usage of substantially increased datasets, like web-scale datasets[Schuhmann2021, Schuhmann2022]. The OpenCLIP[ilharco_gabriel_2021_5143773, Cherti2023] codebase, an open source implementation of CLIP, enables efficiently distributed training of CLIP models by using multiple data parallelism schemes through PyTorch, scaling to more than a thousand GPUs. Due to its strong transferability and robustness, OpenCLIP is used in diverse multi-modal learning approaches and downstream applications[ldm, liu2023visual, sun2023emu], and efficient training is crucial for the machine learning community dealing with open, fully reproducible foundation models.

The MMoCLIP benchmark is curated from OpenCLIP. It trains a ViT-L-14 model on a synthetic dataset of 3,200,000 image-text pairs and records the total training time as the FOM.

Quantum ESPRESSO

Quantum ESPRESSO (QE)[QE1, QE2, QEGPU] is an open source, density-functional-theory-based electronic structure software used both in academia and industry. QE calculates different material properties using a plane wave basis set and pseudo-potentials and can exploit novel accelerators well[QEexa]. The dominant kernel in QE performs a three-dimensional FFT, which is usually memory-bound and becomes communication-bound for large systems[QEexa].

For the benchmark suite, the Car-Parrinello Molecular Dynamics model was chosen. The benchmark is based on a use case created in the MaX project[MaXrepo, MaX] and performs calculations for a slab of ZrO2 with 792 atoms.

NAStJA

The Neoteric Autonomous Stencil code for Jolly Algorithms (NAStJA) is a massively-parallel simulation framework of biological tissues using a Cellular Potts Model[BerghoffNastja, Graner1992]. This model relies on nearest-neighbour interactions and is parallelized by dividing the overall workload into multiple sub-regions, called blocks. Each block is treated independently by an MPI process, with boundaries being exchanged. Using NAStJA, tissues composed of thousands to millions of cells can be simulated at subcellular resolution[berghoff2020cells]. As a test case, adhesion-driven cell sorting is used, a common process in tissue development and segregation[Steinberg1962]. The benchmark investigates the first 50 Monte Carlo (MC) steps of a system of size 720 × 720 × 1152 µm³, containing roughly 600,000 cells. NAStJA utilizes MPI for parallelization/distribution and is one of the few CPU-only benchmarks in the suite. The application exhibits an irregular memory access pattern at each iteration, which is not suitable for GPU execution.

IV-A2 High-Scaling Benchmarks

In this section, we describe the five High-Scaling benchmarks in detail.

[Figure 3]

Figure 3 juxtaposes the weak-scaling behaviours of the applications over a wide range of node counts, using the reference HPC system JUWELS Booster.

Arbor

Arbor is a library for simulating biophysically realistic neural networks, bridging the gap between point and nanoscale models[arbor]. Developed in the HBP[hbp], it aims at efficient use of modern HPC hardware behind an intuitive interface. Neurons are modelled by morphology, ion channels, and connections. Arbor is written in C++ with a Python interface and available under a permissive open source license. The user-centric description is discretized and aggregated to optimize data layouts for individual hardware. At runtime, the cable equation is integrated alternating with a system of ODEs for the channels. Users model channels via a domain-specific language that must be compiled for the target hardware. Communication is performed, concurrently with time evolution, every n steps, determined by neural delays. The benchmark is parameterized to fill the GPU memory in the variants T, S, M, L. To differentiate from point models, it is weighted heavily towards computation, emphasized by a sparse connectivity. A complex cell from the Allen Institute was selected and adapted to random morphologies of fixed depth[allen]. Cells are organized into rings propagating a single spike. Rings are interconnected to place load on the network without altering dynamics, yielding a deterministic, scalable workload. Profiling shows two cost centers: 52 % ion channels and 33 % cable equation; communication is hidden completely. The Base version runs on 32 JUWELS Booster nodes, filling all 4 GPUs’ memory. This was scaled up to the full Booster to verify efficient resource usage and extrapolated to 1 EFLOP/s. The number of generated spikes is used for validation.

Chroma

Chroma[Edwards:2004sx] is an all-purpose application for LQCD computations. It is compiled on top of the USQCD software stack[usqcd], which provides LQCD-specific libraries for communication, data-parallelism, I/O, and, importantly, sparse linear solver libraries optimized for different architectures.Key libraries used in this benchmark include QMP for MPI wrapping, QDP-JIT for data-parallelism and parallel-I/O via QIO, and the GPU-targeted QUDA solver library[Clark:2009wm].Chroma and the USQCD stack are open source and community-developed. Chroma is one of the most widely used LQCD suites and is representative of LQCD codes in general.

LQCD calculations generally depend heavily on solving very large, regular, sparse linear systems (typically of dimension 10^6 to 10^9). Due to the regularity of the data and the calculation, LQCD lends itself to many levels of concurrency.

The Chroma LQCD benchmark in the JUPITER Benchmark Suite contains the representative benchmarks for the Hybrid Monte Carlo (HMC) component of LQCD simulations. In the benchmark, a number of HMC update trajectories are performed using the 3+1 flavours of Clover Wilson fermions — three light quark flavours with identical mass, and a fourth flavour with heavier mass — and the Lüscher-Weisz gauge action. The 4D lattice is initialized with a random SU(3) element on each link. Checkpointing is disabled by a source patch to remove the I/O overhead from the calculations. The benchmark also contains a fix to Chroma, allowing simulation of 4D lattice volumes greater than 2^31 sites, among the largest LQCD simulations anywhere to date. The benchmark performance is sensitive to the decomposition configuration used for distributing the 4D lattice to different tasks and to the affinity between the CPU, GPU, NUMA domains, and the network controller for each task.

The benchmark is validated by comparing the output with a reference solution, with a tolerance of 10^-10 for the Base benchmark and 10^-8 for the High-Scaling benchmarks.

The relevant metric (FOM) is the total time spent in HMC updates, excluding the first update, which includes the overhead for tuning QUDA parameters. Therefore, a minimum of two updates must be prescribed.
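
A minimal sketch of this FOM accounting, with hypothetical per-update wall times:

```python
# Sketch: total time in HMC updates, excluding the first (QUDA autotuning overhead).
hmc_update_times_s = [412.7, 301.2, 300.8, 301.5]  # hypothetical per-update wall times

assert len(hmc_update_times_s) >= 2, "at least two updates must be prescribed"
fom = sum(hmc_update_times_s[1:])   # first update excluded from the FOM
print(f"Chroma FOM: {fom:.1f} s over {len(hmc_update_times_s) - 1} updates")
```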

JUQCS

JUQCS is a massively parallel simulator for universal gate-based Quantum Computers (QCs)[DeRaedt2007JUQCS] written in Fortran 90 using MPI, OpenMP, and CUDA. During the past decades, JUQCS has been used to benchmark some of the largest supercomputers worldwide, including the Sunway TaihuLight and the K computer[DeRaedt2019JUQCS] as well as JUWELS Booster[Willsch2022JUQCSG], and it was part of Google’s quantum supremacy demonstration[Google2019QuantumSupremacy]. Various versions of JUQCS are available in binary form as part of a container[JUQCSDocker]; a light version with available sources was created specifically for this benchmark suite[JUQCSLight]. JUQCS simulates an n-qubit gate-based QC by iteratively updating a rank-n tensor of 2^n complex numbers (state vector) stored in double precision and distributed over the supercomputer’s memory. The total available memory determines the size of the largest QC that can be simulated. For instance, a universal simulation of n = 45 qubits requires a little over 16 × 2^45 B = 0.5 PiB. Many operations require the transfer of half of all memory, i.e., 2^n/2 complex double-precision numbers, across the network, which can help to assess the performance of a supercomputer’s communication network[DeRaedt2019JUQCS, Willsch2022JUQCSG]. As Figure 3 shows, the deviation of JUQCS w.r.t. the theoretically expected linear scaling (green triangles) reveals both a drop in performance from intra-node to inter-node GPU communication (from 1 to 2 nodes) and another drop when communication enters the large-scale regime at 256 nodes.

All present JUQCS benchmarks simulate successive applications of a single-qubit quantum gate that requires large memory transfers. The Base benchmark simulates n = 36 qubits requiring 1 TiB of GPU memory. The High-Scaling benchmark contains two memory variants: a large memory variant with n = 42 qubits requiring 64 TiB (L) and a small memory variant with n = 41 qubits requiring 32 TiB of GPU memory (S). Rules are given for extrapolation to an exascale setup using n = 46 (L) or n = 45 qubits (S). For all test cases, verification is done using the theoretically known results[DeRaedt2019JUQCS].
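
The memory footprints quoted above follow directly from the 16 × 2^n bytes of the state vector; a short sketch reproduces the benchmark sizes:

```python
# Sketch: JUQCS state-vector memory, 16 bytes (complex double) per amplitude, 2**n amplitudes.

def state_vector_bytes(n_qubits: int) -> int:
    return 16 * 2**n_qubits

for n in (36, 41, 42, 45, 46):
    tib = state_vector_bytes(n) / 2**40
    print(f"n = {n:2d} qubits -> {tib:8.1f} TiB state vector")
# n = 36 -> 1 TiB (Base), n = 41 -> 32 TiB (S), n = 42 -> 64 TiB (L),
# n = 45/46 -> extrapolated exascale setups (0.5 PiB / 1 PiB).
```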

In addition, an MSA version of the JUQCS benchmark simulates n = 34 qubits on both JUWELS Cluster and Booster simultaneously. The total amount of memory is split into two parts, with 128 GiB residing on the CPU nodes and 128 GiB residing on the GPU nodes. MPI is used for communication between the Cluster and the Booster, and the number of MPI tasks is similarly split into two. On the Cluster, each MPI task launches 12 OpenMP threads, with one thread per CPU core. On the Booster, each MPI task controls one of the GPUs.

nekRS

nekRS[Fischer2022] is a fast CFD solver designed for GPUs that solves the low-Mach Navier-Stokes equations (NSEs), potentially coupled with multiphysics effects. nekRS has been run at scale on many large supercomputers, featuring excellent time-to-solution due to its high GPU throughput, and was nominated for the 2023 Gordon Bell Award[Merzari2023]. nekRS uses high-order spectral elements[Patera1984] in which the solution, data, and test functions are represented as locally structured N-th-order tensor product polynomials on a set of E globally unstructured curvilinear hexahedral brick elements. There are two key benefits to this strategy. First, high-order polynomial expansions significantly reduce the number of unknowns (n ≈ E N³) needed to achieve engineering tolerances. Second, the locally structured forms allow tensor product sum factorization, which yields low O(n) storage cost and O(nN) work complexity[Orszag1980]. The leading-order O(nN) work terms can be cast as small dense matrix-matrix products with good computational intensity[Deville2002]. nekRS is written in C++ and the kernels are implemented using the portable Open Concurrent Compute Abstraction (OCCA) library[occa] for abstraction between different parallel languages/hardware architectures.

The benchmark case is derived from a Rayleigh-Bénard convection (RBC) application[Samuel2024, Bode2024] which simulates turbulence induced by a temperature gradient — a typical case executed at scale. The simulation domain is a sheet, much more extended in the periodic directions than in the wall-bounded direction. The chosen polynomial order is 9, with 600 time steps per run. Verification is based on pre-computed results and derived tolerances. The High-Scaling benchmark variants use between 28,836,900 (small, ~11,229 per GPU) and 57,760,000 (large, ~22,492 per GPU) elements, which is more than the minimum number of elements required for the "strong scaling limit" of 7000 to 8000 elements per GPU. The Base benchmark case uses 719,104 elements, resulting in 22,472 elements per GPU.
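
A short sketch relates these element counts to the per-GPU load and to the approximate number of unknowns n ≈ E N³ given above (GPU counts follow from the node counts in the text; the Base case uses 8 nodes):

```python
# Sketch: nekRS problem sizing, using n ~ E * N**3 unknowns per field (from the text).

POLY_ORDER = 9
GPUS_PER_NODE = 4

cases = {
    "Base (8 nodes)":          (719_104, 8),
    "High-Scaling S (642)":    (28_836_900, 642),
    "High-Scaling L (642)":    (57_760_000, 642),
}

for name, (elements, nodes) in cases.items():
    gpus = nodes * GPUS_PER_NODE
    per_gpu = elements / gpus
    unknowns = elements * POLY_ORDER**3
    print(f"{name}: {per_gpu:8.0f} elements/GPU, ~{unknowns:.2e} unknowns")
```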

PIConGPU

PIConGPU is an open source, fully relativistic particle-in-cell (PIC) code designed for studying laser-plasma interactions and astrophysical phenomena. It uses the PIC algorithm with several key components, namely particle initialization, charge calculations using grid interpolation, field calculations using densities, and time-marching due to the Lorentz force. This approach allows particles to interact via fields on the grid rather than through direct pairwise interactions, reducing the computational steps from N² to N for N particles. PIConGPU employs a unique data model with asynchronous data transfers to handle the computational challenge. It can simulate complex plasma systems with billions of particles on GPU clusters[PIConGPU2013]. PIConGPU is developed with a hardware-agnostic approach using the Alpaka library[ZenkerAsHES2016, Worpitz2015], providing outstanding performance across all supported platforms, like CPUs, AMD and NVIDIA GPUs, and FPGAs[PIConGPU_alpaka]. The benchmark suite uses a 3D test case simulating the Kelvin-Helmholtz Instability (KHI), a non-relativistic shear-flow instability, utilizing a pre-ionized hydrogen plasma with periodic boundary conditions. While relevant for various research communities, the nature of shear flows and the use of periodic boundary conditions do not impose a significant load imbalance throughout the simulation. Therefore, the performance of the code is determined by its structure rather than the physics of the problem. In the KHI use case, the number of particles per cell is kept constant at 25, using as many cells as the GPU memory allows. A grid size of (4096, 2048, 1024) is chosen for the small memory variant, and extended to (4096, 2048, 2048) (M) and (4096, 4096, 2560) (L) for the larger variants. PIConGPU employs domain decomposition for distribution, dividing the computational domain into smaller subdomains along three dimensions. To distribute along these three dimensions, the maximum number of nodes that can be utilized is limited to 640, rather than 642.
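
The particle load follows from the grid sizes and the fixed 25 particles per cell; a minimal sketch:

```python
# Sketch: PIConGPU KHI workload sizes, 25 macro-particles per cell (from the text).

PARTICLES_PER_CELL = 25

grids = {
    "S": (4096, 2048, 1024),
    "M": (4096, 2048, 2048),
    "L": (4096, 4096, 2560),
}

for variant, (nx, ny, nz) in grids.items():
    cells = nx * ny * nz
    particles = cells * PARTICLES_PER_CELL
    print(f"variant {variant}: {cells:.2e} cells, {particles:.2e} macro-particles")
```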

IV-B Synthetic Benchmarks

The JUPITER Benchmark Suite also includes seven well-known synthetic benchmarks: Graph500, HPCG, HPL, IOR, LinkTest, OSU, and STREAM (including a GPU variant). The IOR and LinkTest implementations are presented in the following, highlighting some unique aspects of the setup.

IOR

IOR is the de facto standard for measuring I/O performance and is used by the IO500[kunkel:2018:io500] to compare the I/O characteristics of storage systems. The benchmark provides a large list of parameters such as block size, transfer size, API, and task reordering, which in turn allows simulating multiple I/O patterns. To target the high-bandwidth, NVMe-based JUPITER storage module, the upper and lower bounds on the mean read and write bandwidth are the focus in the benchmark suite. Similar to the IO500, two variants of IOR are implemented, Easy and Hard. The Easy variant requires a transfer size of 16 MiB, with each process writing to its own file. The Hard variant uses a transfer size of 4 KiB and a block size of 4 KiB, with all processes writing and reading a single file. This setup forces multiple processes to write to the same file system data block, stressing the file system with lock contention. The remaining parameters were selected to avoid caching effects. The number of nodes is a free parameter (with a lower bound) to allow for optimization on the level of parallelism that the underlying file system can provide.
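
A sketch of the two access patterns and a mean-bandwidth calculation; only the transfer and block sizes and the file layout are taken from the text, the remaining values are placeholders:

```python
# Sketch of the two IOR variants as used in the suite; only the transfer/block sizes and the
# file layout come from the text, everything else is a placeholder.

MiB = 1 << 20
KiB = 1 << 10

ior_easy = {
    "transfer_size": 16 * MiB,   # large sequential transfers
    "file_per_process": True,    # each process writes its own file
}
ior_hard = {
    "transfer_size": 4 * KiB,    # small transfers ...
    "block_size": 4 * KiB,       # ... into a single shared file
    "file_per_process": False,   # forces lock contention on shared file system blocks
}

def aggregate_bandwidth(bytes_per_process: int, n_processes: int, runtime_s: float) -> float:
    """Mean bandwidth in GiB/s over all processes."""
    return bytes_per_process * n_processes / runtime_s / (1 << 30)

# Hypothetical run: 64 GiB per process, 4096 processes, 300 s wall time
print(aggregate_bandwidth(bytes_per_process=64 * 1024 * MiB, n_processes=4096, runtime_s=300.0))
```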

LinkTest

LinkTest[LinkTest] is designed to test point-to-point connections between processes in serial or parallel mode and is capable of handling very large numbers of processes. It is an essential tool in network operations, used mostly internally by system administrators for acceptance testing, maintenance, and troubleshooting.

The JUPITER Benchmark Suite utilizes LinkTest’s bisection test to concisely evaluate the interconnectivity between parts of the system’s network, quantified by a single metric. In the bisection test, a number of test processes (one per high-speed network adapter) is separated into two equal halves of the system, and messages are bounced between partnering processes in parallel (bidirectional mode). To achieve optimal bandwidth, the message size is set to 16 MiB. The assessment is made mainly based on the minimum bisection bandwidth.
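
The bisection evaluation can be sketched as follows; the pairing scheme is a simplification of LinkTest's internal logic, and the per-pair timings are hypothetical:

```python
# Sketch of a bisection-bandwidth evaluation: processes are split into two halves,
# rank i in the first half exchanges 16 MiB messages with rank i + P/2, and the
# assessment uses the minimum bandwidth over all pairs. Timings are hypothetical.

MESSAGE_SIZE = 16 * (1 << 20)   # 16 MiB
N_ITERATIONS = 100

def pair_bandwidth(elapsed_s: float) -> float:
    """Bidirectional bandwidth of one pair in GiB/s."""
    bytes_moved = 2 * MESSAGE_SIZE * N_ITERATIONS   # messages bounced back and forth
    return bytes_moved / elapsed_s / (1 << 30)

# Hypothetical per-pair timings (seconds), measured in parallel across the bisection
pair_times = [0.143, 0.139, 0.151, 0.162, 0.148]

bandwidths = [pair_bandwidth(t) for t in pair_times]
print(f"minimum bisection bandwidth per pair: {min(bandwidths):.2f} GiB/s")
```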

V Lessons Learned

We developed the JUPITER Benchmark Suite building upon our experience from previous HPC system procurements. The suite constitutes a substantial expansion of those earlier endeavours and should be considered a living object that will continue to evolve over time. In the following, we summarize the lessons learned, covering first the perspective of application developers, then the benchmark suite creation, and finally the overall procurement process.

V-A Application Development

The applications of the JUPITER Benchmark Suite not only need to be executed on current large-scale resources like JUWELS Booster, but also need to be extrapolated to larger future resources, amplifying scaling challenges.

To understand the performance characteristics on a future system better, it proved useful for some application developers to create models of their applications. JUQCS, for example, has a non-trivial weak scaling behaviour. In the benchmark, the execution time is reported in relation to an ideal scenario, enabling comparability. A model was developed for nekRS to predict the performance of a later part of the simulation early in the process, allowing much shorter and more resource-efficient benchmarks. During scalability studies for PIConGPU, a model for the scaling behaviour could be developed, informing valid simulation parameters for the benchmark setup.

While the approximate scale of the future system is known, the details of the setup are not. When domain decomposition is important for performance, preliminary studies are usually employed to determine the best parameters for production runs. However, decomposition studies are impossible in the benchmark context, especially for an unknown system design. Through labour- and resource-intensive investigation, estimates, rules, or scripts for ideal domain decomposition were devised, e.g., for Chroma-QCD, PIConGPU, NAStJA, and DynQCD. This also documented the experience of individual researchers, improving reproducibility. To understand different scaling regimes of the application, a network communication model was developed for JUQCS. The model can be employed to understand topological aspects of the high-speed network of current and future systems, for example with respect to congestion.

The preparation for future system designs had a direct effect on application development. For example, it became apparent to the Arbor developers that they need to optimize memory usage, as memory capacity and bandwidth will continue to grow more slowly than compute performance. During benchmark preparation, they also needed to trade highly-valued user experience for scalability, as the approach of referring to connection endpoints with labels did not scale as required. A short-term solution (using local indexing) was found for the suite, and a hash-based solution is being developed upstream. On a similar note, the Chroma-QCD authors needed to patch the code to facilitate execution at the envisioned scale. Unexpected effects can occur depending on the extrapolation method to the future system. For Chroma-QCD, it was found that the employed benchmark is not guaranteed to converge, and a cut-off after a certain number of iterations is a more robust approach.

Result verification is essential for a benchmark suite to ensure the validity of submitted results. Yet, the experience with verification during the suite preparation varied. Some results could be verified either exactly (JUQCS), or within a certain numerical limit by comparing to a pre-computed solution (Chroma-QCD); more involved simulations were verified by extracting key metrics from the computed solution for comparison to a model (ICON, nekRS). The verification of some applications with iterative algorithms, which were stopped before convergence, relied on framework-inherent verification and required key data in the output (PIConGPU, Megatron-LM) — arguably the weakest form of verification.
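
A minimal sketch of the three verification levels mentioned above (exact comparison, comparison against a pre-computed solution within a numerical limit, and checking an extracted key metric against a model value), using illustrative function names, tolerances, and data:

\begin{lstlisting}[language=Python]
# Illustrative verification helpers for the three levels described above.
import numpy as np

def verify_exact(result, reference):
    """Exact comparison, e.g. for deterministic outputs."""
    return np.array_equal(result, reference)

def verify_tolerance(result, reference, rtol=1e-6, atol=1e-12):
    """Comparison to a pre-computed solution within a numerical limit."""
    return np.allclose(result, reference, rtol=rtol, atol=atol)

def verify_key_metric(metric_value, expected, rel_window=0.05):
    """Check an extracted key metric against a model value (5 % window)."""
    return abs(metric_value - expected) <= rel_window * abs(expected)

reference = np.linspace(0.0, 1.0, 5)   # stands in for a pre-computed solution
output = reference + 1e-9              # stands in for the benchmark output
print(verify_exact(output, reference))       # False: not bit-wise identical
print(verify_tolerance(output, reference))   # True: within numerical limits
print(verify_key_metric(0.51, 0.50))         # True: within the 5 % window
\end{lstlisting}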

V-B Benchmark Design

Creating a vast benchmark suite that captures the status quo of workloads, and bringing it to a mature level, is a resource-intensive endeavour. Beyond the human resources, compute resources on the reference system constitute a significant investment. To use these resources efficiently, it is important to design benchmarks with runtimes as short as possible while keeping the problem size as large as needed. Short runtimes also enable swift turn-around times for rapid prototyping – especially useful in a large suite. The size of the input data and of the files generated at runtime should be minimized to ensure that the benchmark suite is easy to handle. For reproducibility, non-core application parts like pre-processing or data staging should be kept short. Modelling domain decomposition effects, also beyond typical production execution profiles, consumes further resources.

Preparing for target system designs with unknown details and scales beyond the available resources is a demanding task. Care should be taken to consider future hardware trends in the benchmark design. To explicitly accommodate different compute-to-memory ratios, up to four memory variants of the benchmarks were introduced. On the preparation system, the memory variants can be used to study artificially-limited compute profiles and determine possible bottleneck shifts on future systems. With unknown hardware, algorithmic behaviour might shift as well, and iterations may not converge. A more robust approach is to not compute until convergence, but to stop after a predetermined number of iterations.
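
A minimal sketch of the fixed-iteration approach, assuming a generic iterative solver: the benchmark runs a predetermined number of steps and records the final residual for later verification, rather than iterating until a convergence threshold that may never be reached. The function names and the toy iteration are illustrative.

\begin{lstlisting}[language=Python]
# Illustrative fixed-iteration benchmark loop: run exactly `n_iterations` steps
# and record the final residual for verification, instead of iterating until a
# convergence threshold that might never be reached on unknown hardware.
def run_benchmark(step, initial_state, n_iterations=500):
    state, residual = initial_state, float("inf")
    for _ in range(n_iterations):
        state, residual = step(state)
    # The residual is not a stopping criterion; it is only reported so that
    # the verification stage can check it lies in a plausible range.
    return state, residual

# Toy example: damped fixed-point iteration x <- x - 0.1 * (x - 3).
final_state, final_residual = run_benchmark(
    lambda x: (x - 0.1 * (x - 3.0), abs(x - 3.0)), initial_state=10.0)
print(final_state, final_residual)
\end{lstlisting}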

Some parameters in the benchmark suite are free to choose, like the number of nodes; others are fixed, for example simulation parameters. Thorough execution rules and modification guidelines determine the envisioned outcome and need to be developed as part of the suite. Beyond the general rules, individual benchmarks may explicitly deviate to either loosen or tighten them. Parameter validation should be part of the overall verification process, further underscoring the importance of the verification task.
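
As a sketch of such parameter validation (not the suite’s actual mechanism; all parameter names and limits are made up for illustration), a submitted run configuration could be checked against the benchmark definition as follows:

\begin{lstlisting}[language=Python]
# Illustrative parameter validation: fixed simulation parameters must match the
# benchmark definition exactly; free parameters may only vary within limits.
FIXED = {"grid_size": 2048, "timestep": 1e-4, "physics_model": "les"}
FREE_LIMITS = {"nodes": (1, 8192), "gpus_per_node": (1, 4)}

def validate(submitted: dict) -> list:
    errors = []
    for key, value in FIXED.items():
        if submitted.get(key) != value:
            errors.append(f"fixed parameter {key!r} was changed")
    for key, (low, high) in FREE_LIMITS.items():
        if not low <= submitted.get(key, low) <= high:
            errors.append(f"free parameter {key!r} outside [{low}, {high}]")
    return errors

submission = {"grid_size": 2048, "timestep": 1e-4, "physics_model": "les",
              "nodes": 4096, "gpus_per_node": 4}
print(validate(submission))   # -> [] (valid submission)
\end{lstlisting}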

V-C Procurement

Using application benchmarks in the procurement of HPC systems is essential to realistically represent user requirements when deciding the configuration of the future system. An additional challenge was posed by the particular system architecture of JUPITER, in which two compute modules of very disparate sizes, a small CPU-only module and an exascale GPU-accelerated module, are coupled together with a shared storage module. It took some discussion to find the right number and balance of CPU and GPU benchmarks, which ended up at a ratio of about 1:5.

To formulate and develop the benchmarks, it proved fruitful to collaborate closely with the domain researchers intending to utilize the system. Established relationships in joint projects are especially productive, while engaging new user domains is more difficult. A fundamental limitation of our approach is the reliance on existing application codes executed on current systems. By design, disruptive approaches are not well covered, and a tendency to favour evolutionary technology is introduced. However, considering the effort associated with adopting new technologies among HPC users, this focus on incremental developments is justified. Still, predicting future system usage trends is crucial — like the AI applications in the suite, which aim to represent a user domain expected to gain importance over time. However, the rapidly evolving software and algorithms in this domain make it hard to accurately estimate their future needs. It is therefore important to consider the most recent breakthroughs in AI beyond the HPC context, including commercially-driven domains.

The time window for the development of the JUPITER Benchmark Suite was limited by the constraints of the procurement process. The endeavour started several months before the procurement and required dedicated work by tens of people. Clear management structures and collaboration platforms were essential for this extensive collaboration. In particular, transparent communication with all bidders was crucial, which was possible thanks to the dialogue phase that was part of the procurement. The suite development fostered collaboration, team-building, and knowledge-sharing. Code and environment optimizations were openly shared between benchmark developers and vendors, iteratively improving the benchmarks further. The suite itself is now open source and can benefit the wider HPC community.

VI Conclusions and Future Work

In this paper, we presented the JUPITER Benchmark Suite, which has been successfully employed in the procurement process of the exascale system JUPITER. The suite has served as a valuable tool in assessing HPC system performance during the procurement and beyond. The benchmark applications (see Section IV) were chosen to represent the workloads on the future system after careful consideration of requirements and constraints. Bidders used this suite to test different technologies, put together their proposals, and prepare and commit the associated performance numbers. The procuring entity selected the vendor based on these values, together with multiple additional evaluation criteria. At the time of writing, the JUPITER system is being installed. The benchmark suite will be employed again during the acceptance procedure.

The JUPITER Benchmark Suite lays a foundation to further extend and automate HPC system benchmarking. Facilitated by the reusable design of the suite, Continuous Benchmarking will be realized as future work, employing the CI/CD features of GitLab in conjunction with novel tools such as Jacamar [Adamson24Jacamar]. By running the suite at regular intervals (e.g., after maintenance windows), we will ensure that the system does not suffer performance degradation over its lifetime or after updates. Application optimization for JUPITER will continue during the system deployment and installation phase, utilizing the experience gained and the tools created. We will strive for further improvements regarding the reproducibility of individual benchmarks, including a focus on verification. Also, individual technical enhancements are in progress (for example, using git-annex for the large input data).

Acknowledgment

Many benchmarks presented in this work have roots in projects that received funding from national and European sources; for example, Quantum Espresso (from the MaX project[maxcoe]) or nekRS (from the CoEC project[coec]). The authors thank the funding agencies for their support and the colleagues from the projects for being available for advice when designing the suite.

The authors would like to thank the following people for their contributions: Max Holicki and Yannik Müller, for their contributions to the LinkTest benchmark; Jonathan Windgassen and Christian Witzler for their support with the nekRS benchmark; Hans De Raedt for the discussions and help with the JUQCS benchmark.

The double-anonymous review of this publication was enabled by the Anonymous GitHub service at https://anonymous.4open.science/. The authors thank the creator, also for the immediate, last-minute support.

Finally, the authors would like to thank EuroHPC JU, BMBF, MKW-NRW, and GCS for the supportive atmosphere, productive discussions, and for the financial support of the JUPITER project.

\printbibliography
