Andreas Herten \orcidlink{0000-0002-7150-2505}, Sebastian Achilles \orcidlink{0000-0002-1943-6803}, Damian Alvarez \orcidlink{0000-0002-8455-3598}, Jayesh Badwaik \orcidlink{0000-0002-5252-8179}, Eric Behle \orcidlink{0000-0001-6616-8867}, Mathis Bode \orcidlink{0000-0001-9922-9742}, Thomas Breuer \orcidlink{0000-0003-3979-4795}, Daniel Caviedes-Voullième \orcidlink{0000-0001-7871-7544}, Mehdi Cherti \orcidlink{0000-0001-5865-0469}, Adel Dabah \orcidlink{0000-0001-9175-469X}, Salem El Sayed \orcidlink{0000-0002-7217-6027}, Wolfgang Frings \orcidlink{0000-0002-9463-9264}, Ana Gonzalez-Nicolas \orcidlink{0000-0003-2869-8255}, Eric B. Gregory \orcidlink{0000-0001-9408-5719}, Kaveh Haghighi Mood \orcidlink{0000-0002-8578-4961}, Thorsten Hater \orcidlink{0000-0002-6249-7169}, Jenia Jitsev \orcidlink{0000-0002-1221-7851}, Chelsea Maria John \orcidlink{0000-0003-3777-7393}, Jan H. Meinke \orcidlink{0000-0003-2831-9761}, Catrin I. Meyer \orcidlink{0000-0002-9271-6174}, Pavel Mezentsev \orcidlink{0009-0000-4114-5482}, Jan-Oliver Mirus \orcidlink{0009-0006-7975-1393}, Stepan Nassyr \orcidlink{0000-0002-0035-244X}, Carolin Penke \orcidlink{0000-0002-4043-3885}, Manoel Römmer \orcidlink{0009-0007-3513-5932}, Ujjwal Sinha \orcidlink{0000-0002-4609-0940}, Benedikt von St. Vieth \orcidlink{0000-0002-5386-633X}, Olaf Stein \orcidlink{0000-0002-6684-7103}, Estela Suarez \orcidlink{0000-0003-0748-7264}, Dennis Willsch \orcidlink{0000-0003-3855-5100}, Ilya Zhukov \orcidlink{0000-0003-4650-3773}
Jülich Supercomputing Centre
Forschungszentrum Jülich
Jülich, Germany
Abstract
Benchmarks are essential in the design of modern HPC installations, as they define key aspects of system components. Beyond synthetic workloads, it is crucial to include real applications that represent user requirements in benchmark suites, to guarantee high usability and widespread adoption of a new system. Given the significant investments in leadership-class supercomputers of the exascale era, this is even more important and necessitates alignment with a vision of Open Science and reproducibility. In this work, we present the JUPITER Benchmark Suite, which incorporates 16 applications from various domains. It was designed for and used in the procurement of JUPITER, the first European exascale supercomputer. We identify requirements and challenges and outline the project and software infrastructure setup. We provide descriptions and scalability studies of selected applications and a set of key takeaways. The JUPITER Benchmark Suite is released as open source software with this work at github.com/FZJ-JSC/jubench.
Index Terms:
Benchmark, Procurement, Exascale, System Design, System Architecture, GPU, Accelerator
I Introduction
The field of High Performance Computing (HPC) is governed by the interplay of capability and demand driving each other forward. During the design and purchase phase of supercomputer procurements for public research, the capability of a machine is usually assessed not only by theoretical, system-inherent numbers, but also by effective numbers relating to actual workloads. These workloads are traditionally benchmark programs that test specific aspects of the system design, such as floating-point throughput, memory bandwidth, or inter-node latency. While these synthetic benchmarks are well-suited for the assessment of distinct features, they should be complemented by application benchmarks for a more integrated and realistic perspective. Application benchmarks use state-of-the-art scientific applications to assess the performance of integrated designs. Complex application profiles utilize various types of hardware resources dynamically during the benchmark’s runtime, showcasing real-world strengths and limitations of the system.
This paper introduces the JUPITER Benchmark Suite, a comprehensive collection of 23 benchmark programs meticulously documented and designed to support the procurement of JUPITER, Europe’s first exascale supercomputer. On top of 7 synthetic benchmarks, 16 application benchmarks were developed in close collaboration with domain scientists to ensure relevance and rigor. Additionally, this paper offers valuable insights into the state-of-the-practice of exascale procurement, shedding light on the challenges and methodologies involved.
Preparations for the procurement of JUPITER were launched in early 2022 and finally came to fruition with the awarding of the contract in October 2023. JUPITER is funded by the EuroHPC Joint Undertaking (50 %), Germany’s Federal Ministry for Education and Research (25 %), and the Ministry of Culture and Science of the State of North Rhine-Westphalia of Germany (25 %), and is hosted at Jülich Supercomputing Centre (JSC) of Forschungszentrum Jülich. As part of the procurement, the benchmark suite was developed to motivate the system design and to evaluate the proposals submitted in response to the Request for Proposals. The suite focuses on application benchmarks to ensure high practical usability of the system. This work presents the JUPITER Benchmark Suite in detail, highlighting design choices and project setup, describing the benchmark workloads, and releasing them as open source software. The suite includes 23 benchmarks across different domains, each with unique characteristics, such as compute-intensive, memory-intensive, and I/O-intensive workloads. The benchmarks are grouped into three categories: Base, representing a mixed base workload for the system; High-Scaling, highlighting scalability to the full exascale system; and synthetic, determining various key hardware design features. The benchmark suite represents a first step towards Continuous Benchmarking to detect system anomalies during the production phase of JUPITER.
The main contributions of this paper are:
- An in-depth description of the use of benchmarks in HPC procurement, including relevant background information.
- The description of a novel methodology to assess exascale system designs in the form of High-Scaling benchmarks.
- The development of suitable benchmark workloads based on a representative set of scientific problems, applications, and synthetic codes.
- Scalability results on the preparation system for all application benchmarks of the suite.
- Insights and best practices learned from application scaling and the procurement process in general.
- Release of the full JUPITER Benchmark Suite as open source software.
After providing background information (Section II) and details on the infrastructure used (Section III), the suite’s individual benchmarks are presented in Section IV. Key takeaways are provided in Section V, followed by a conclusion and outlook in Section VI.
II Background
II-A Requirements
The main requirements of the benchmark suite stem from the need to represent existing and upcoming user communities in the system design process. This ensures a fitting design and fosters adoption of the system by users. The suite must cover the wide user portfolio of the HPC center, containing typical applications from various domains utilizing the current HPC infrastructure, and also represent expected future workloads. Moreover, it is essential to ensure diversity in terms of methods, programming languages, and execution models, since such diversity is an inherent characteristic of the application portfolio for upcoming large-scale systems.
The context of a procurement poses high requirements for replicability, reproducibility, and reusability[Fehr16RRR]. Replicability, i.e., the seamless execution on the same hardware by the developer, is an elementary requirement to guarantee robustness. Beyond that, reproducibility describes the seamless execution on different hardware by someone else, making it a key requirement in the context of the procurement since both the site and the system provider must be able to run the suite to obtain the same results. Ensuring reusability, i.e., designing the framework for easy adaptation for a variety of tasks in the future, is essential to justify the substantial investment involved in the creation of the suite.
Vendors participating in the procurement process must also invest considerable time in system-specific adjustments while meeting the procurement’s requirements and complying with its rules, usually within short timescales. Therefore, it is in everyone’s best interest to have clear, well-defined guidelines and, ideally, to leverage an existing benchmark suite to build on established expertise.
II-B JUPITER Procurement Scheme
The procurement for the JUPITER system uses a Total-Cost-of-Ownership-based (TCO) value-for-money approach, in which the number of reference workloads executed over the lifespan of the system determines the value. Given the size of state-of-the-art supercomputers and their corresponding power consumption, costs for electricity and cooling are a substantial part of the overall project budget. Using a mixture of application benchmarks and synthetic benchmarks, operational costs are computed in a well-defined procedure. While synthetic benchmarks allow for assessing key performance characteristics, they do not permit a realistic assessment of resource consumption over the lifetime of the system, as needed for the TCO calculation. Therefore, a greater emphasis is placed on application performance than on synthetic tests.
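As a schematic illustration only (our simplified reading, not the cost model used in the procurement), the TCO entering this approach can be thought of as

\[ \mathrm{TCO} \approx C_{\text{invest}} + T_{\text{lifetime}} \cdot \bar{P} \cdot c_{\text{energy}} + C_{\text{maint}}, \]

where \(\bar{P}\) denotes the average power draw estimated from the benchmark mix, \(c_{\text{energy}}\) the effective price of electricity including cooling overheads, and \(C_{\text{invest}}\) and \(C_{\text{maint}}\) the investment and maintenance costs.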
Given the targeted system performance of 1 EFLOP/s at FP64 precision, an additional, novel benchmark type focusing on the scale of the system is introduced: High-Scaling benchmarks. A subset of applications able to fully utilize JUPITER was identified, and use cases were defined to make them part of this novel category. In the course of the paper, we refer to either High-Scaling benchmarks or Base benchmarks to differentiate between the two.
The JUPITER system is envisioned to consist of two connected sub-systems, modules in the Modular Supercomputing Architecture (MSA) concept [etp4hpcMSA]. JUPITER Booster is the exascale-level module, utilizing massive parallelism through GPUs for maximum performance with high energy efficiency. JUPITER Cluster is a smaller general-purpose module employing state-of-the-art CPUs for applications with lower inherent parallelism and stronger memory demands. Both compute modules are procured jointly, together with a third module made of high-bandwidth flash storage. The benchmark suite has dedicated benchmarks for all modules, including benchmarks spanning Cluster and Booster, dubbed MSA benchmarks. Execution targets of the benchmarks are listed in the last columns of Table II.
During the procurement, the results of the execution of the benchmark suite for a given system proposal are weighted and combined to compute a value-for-money metric. The outcomes are compared and incorporated with other aspects into the final assessment of the system proposals.
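In a simplified, hypothetical form (the actual weights and rules are internal to the procurement), such a metric follows the idea of counting executable reference workloads per cost:

\[ \text{value-for-money} \sim \frac{1}{\mathrm{TCO}} \sum_{i} w_i \, \frac{T_{\text{lifetime}}}{t_i^{\text{committed}}}, \]

where \(t_i^{\text{committed}}\) is the committed time-metric of Base benchmark \(i\) and \(w_i\) its weight; lower committed runtimes and a lower TCO both improve a proposal’s score.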
II-C Implementation for JUPITER
The previous two sections outline general requirements for the JUPITER Benchmark Suite. Their assessment and implementation are laid out in the following and visually sketched in Figure 1.
Based on an analysis of current and previous compute-time allocations on predecessor systems, the suite covers a variety of scientific areas and includes applications from the domains of Weather and Climate, Neuroscience, Quantum Physics, Material Design, Biology, and others. Beyond that, diversity in workloads is realized: Artificial-Intelligence-based (AI) methods as well as classical simulations, codes based on C/C++ and Fortran, OpenACC and CUDA. Various application profiles are included, such as memory-bound or sparse computations. Future trends of workloads, e.g., the uptake of machine learning algorithms, are inferred from general trends in research communities and from recent changes in allocations on the predecessor system.
The requirement of replicability is met by streamlining and automating benchmark execution employing the JUBE framework [jube]. Reproducibility is ensured by extensive documentation of all components and the verification of computational results. Thorough testing ensures stable execution in different environments, e.g., with a varying number of compute devices, and lowers risks of commitments by vendors. Reusability is accomplished by a modular design, effective project management workflows, clear licensing, and open source publication. To ensure a high quality of the suite, the benchmarks are standardized through a well-defined setup, with consistent directory structures, uniform descriptions, and similar JUBE configurations. The infrastructure aspects are discussed further in Section III.
For each of the Base benchmarks, i.e., the sixteen benchmarks used for the TCO/value-for-money calculation, a Figure-of-Merit (FOM) is identified and normalized to a time-metric. In most cases, the FOM is the runtime of either the full application or a part of it. If the application instead reports a rate, the time-metric is obtained by pre-defining the number of iterations and multiplying with the rate. Each benchmark is executed on a certain number of nodes of the preparation system (see Section III-A) in order to create a reference execution time. The number of nodes is usually selected to be 8, but deviations are possible due to workload-inherent aspects. This is a trade-off between resource economy and agility in benchmark creation (fewer nodes are more productive) and robustness towards anticipated generational leaps in the hardware of the envisioned system (too few nodes bear the latent danger of severely changed application profiles once the requested runtime can be achieved with a single node). The time-metric, determined on the reference number of nodes, is the value to be improved upon and committed to by proposals of system designs. The number of nodes used to surpass the time-metric can be freely specified by the proposal, but is typically smaller than the reference number of nodes.
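As an illustration of this normalization (the symbols are ours, not prescribed by the suite), a rate-based FOM translates into a time-metric via the fixed iteration count:

\[ t_{\text{FOM}} = N_{\text{iter}} \cdot t_{\text{iter}} = \frac{N_{\text{iter}}}{r}, \]

with \(N_{\text{iter}}\) the pre-defined number of iterations, \(t_{\text{iter}}\) the measured time per iteration, and \(r = 1/t_{\text{iter}}\) the corresponding rate. The \(t_{\text{FOM}}\) obtained on the reference node count is the value a proposal commits to improve upon.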
Five applications used in the Base benchmarks are capable of scaling to the full scale of the preparation system. They form the additional High-Scaling category and define a set of benchmarks that aim to compare the proposed system designs with the preparation system for large-scale executions. By requirement, the future system must achieve a High-Performance Linpack (HPL) performance of 1 EFLOP/s, implying a theoretical peak performance larger than that. In JUPITER’s procurement, runtimes must be committed for the High-Scaling benchmarks using a 1 EFLOP/s (theoretical peak performance) sub-partition. For each High-Scaling application, a workload is defined that fills a 50 PFLOP/s (theoretical) sub-partition of the preparation system and a correspondingly larger sub-partition of the future system. (Some benchmarks have algorithmic limitations, such as requiring power-of-two node counts; in these cases, the closest smaller compatible number of nodes is taken.) The final assessment is based on the ratio of the runtime value committed for the future 1 EFLOP/s (theoretical) sub-partition and the reference value. When the benchmark’s specific workload configuration fills up a large portion of the GPU memory on the preparation system, there is a danger that, on a proposed system, the scaled-up version becomes limited by available memory and does not showcase an accelerator’s full compute capability. This is due to the trend of a growing imbalance between the advancement of compute power and memory capacity. To give more flexibility in system design, up to four reference variants of the respective workload are prepared, taking up 25 % (tiny, T), 50 % (small, S), 75 % (medium, M), and 100 % (large, L) of the available GPU memory on the preparation system (40 GB), respectively. The system proposal may choose the variant that best exploits the available memory on the proposed accelerator after scale-up.
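Schematically, and in our own notation rather than that of the procurement documents, the High-Scaling assessment and the memory variants can be summarized as

\[ s = \frac{t_{\text{committed}}\,(\text{1 EFLOP/s sub-partition, proposed system})}{t_{\text{reference}}\,(\text{50 PFLOP/s sub-partition, preparation system})}, \qquad m_{\text{T}}, m_{\text{S}}, m_{\text{M}}, m_{\text{L}} = 10, 20, 30, 40~\text{GB per GPU}, \]

where a smaller \(s\) indicates a better large-scale result, and \(m_{\text{T}}\) to \(m_{\text{L}}\) are the per-GPU memory footprints of the four workload variants on the preparation system’s 40 GB GPUs.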
The seven synthetic benchmarks are selected to test individual features of the hardware components, such as compute performance, memory bandwidth, I/O throughput, and network design. Each benchmark has an individual FOM with unique rules and possibly sub-benchmarks, evaluated distinctly.
II-D Related Work
Benchmarks have always played an important role in HPC. One objective is the assessment and comparison of technical solutions, whether at the scale of world-leading cluster systems[top500, Dongarra03HPL] or on a smaller node-level scale [Sorokin20, Xu15NAS, Bienia08PARSEC]. Benchmarks can be specific to one application topic[Herten19ICEI, Albers22beNNch, Farrell21MLPerf] or cover multiple areas in the form of suites [Dongarra12HPCChallenge, Stratton12Parboil, Danalis10SHOC, Che09Rodinia].
Foundational work focused on identifying common patterns across various applications, with the objective of classifying them based on similar computational and communication characteristics [asanovic06dwarfs]. These patterns, referred to as dwarfs, are defined at a high level of abstraction and are intended to capture the most relevant workloads in high-performance computing. Table II categorizes the benchmarks of the JUPITER Benchmark Suite according to these dwarfs, and also gives the predominant scientific domain each benchmark represents. While dwarfs can be viewed as a set of blueprints for application-inspired synthetic benchmarks, their simplicity by design prevents them from fully capturing all dynamics of real applications. To address this issue in supercomputer co-design, more complex and versatile computational motifs, termed octopodes, are proposed by Matsuoka et al. [MatsuokaRethinkingProxy].