Earlier studies have demonstrated GPU sharing through virtualization as a means to increase utilization. Further studies identified that memory capacity is frequently over-provisioned in modern HPC systems and thus proposed disaggregating memory modules in a manner similar to hyperscale datacenters. This is further motivated by the observation that many HPC applications already share memory across nodes. Later studies take the first steps toward showing the promise of resource disaggregation in HPC for increasing resource utilization. Related studies demonstrate that emerging high-bandwidth-density silicon-photonic communication technology is a promising means to deliver resource disaggregation with high bandwidth density and low latency. However, photonic link bandwidth and efficiency alone cannot fully satisfy the bandwidth, energy, and latency requirements of fully flexible, fine-grain, system-wide disaggregation, making system-wide disaggregation impractical in many cases. For this reason, a few studies argue for rack-level disaggregation of GPUs. In addition, other studies rely on electrical networks, in some cases with the aid of software-defined networking. Finally, how the software stack should adapt to best take advantage of resource disaggregation remains a significant challenge for both the OS and the job scheduler.

Cori is an open-science, top-20 HPC system operated by NERSC that supports the diverse set of workloads in scope for the United States Department of Energy.
Cori supports on the order of one thousand projects and a few thousand users, and executes a vastly diverse set of representative single-core to full-machine workloads from fusion energy, material science, climate research, physics, computer science, and many other science domains. In this work, we use Cori as a model system to represent common HPC requirements that are also typical of many other HPC systems in the sciences. We recognize that HPC systems with vastly different workloads or hardware configurations should consider repeating an analysis similar to ours, but the broad user base of this system should provide a microcosm of potential requirements across a wide variety of systems across the world. Furthermore, over the 45-year history of the NERSC HPC facility and 12 generations of systems with diverse architectures, the workload has evolved very slowly despite substantial changes to the underlying system architecture, because the requirements of the scientific mission of such facilities take precedence over the architecture of the facility. Cori is an approximately 30 Pflop/s Cray XC40 system and consists of 2,388 Intel Xeon “Haswell” and 9,688 Intel Xeon Phi “Knights Landing” (KNL) compute nodes. Each Haswell node has two 2.3 GHz 16-core Intel Xeon E5-2698 v3 CPUs with two hyperthreads per core. Each KNL node has a single Intel Xeon Phi 7250 CPU with 68 cores at 1.4 GHz and four hyperthreads per core. Haswell nodes have eight 16 GB 2,133 MHz DDR4 modules with a peak transfer rate of 17 GB/s per module. KNL nodes have 96 GB of 2,400 MHz DDR4 and 16 GB of MCDRAM. Therefore, Haswell nodes have 128 GB of memory while KNL nodes have 112 GB. Each XC40 cabinet housing Haswell and KNL nodes has three chassis; each chassis has 16 compute blades with four nodes per blade.
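As a quick arithmetic check on the Haswell node memory configuration above, the following minimal Python sketch derives the per-node capacity and the aggregate peak module transfer rate from the quoted figures. The aggregate rate is simply the sum of per-module peaks and is an assumption for illustration, not a measured node-level bandwidth.

```python
# Per-node memory figures for a Haswell node, using the values quoted above.
modules_per_node = 8            # DDR4 DIMMs per Haswell node
capacity_per_module_gb = 16     # GB per 2,133 MHz DDR4 module
peak_rate_per_module_gbs = 17   # GB/s peak transfer rate per module

node_capacity_gb = modules_per_node * capacity_per_module_gb      # 128 GB
node_peak_rate_gbs = modules_per_node * peak_rate_per_module_gbs  # 136 GB/s (sum of module peaks; assumption)

print(f"Haswell node: {node_capacity_gb} GB capacity, "
      f"{node_peak_rate_gbs} GB/s aggregate peak module rate")
```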
Similar to many modern systems, hard disk space is already disaggregated in Cori by placing hard disks in a common part of the system; therefore, we do not study file system disaggregation. Nodes connect to Aries routers through network interface controllers (NICs) using PCIe 3.0 links with 16 GB/s sustained bandwidth per direction. The four nodes in each blade share a NIC. Cray’s Aries in Cori employs a Dragonfly topology with over 45 TB/s of bisection bandwidth and support for adaptive routing. An Aries router connects eight NIC ports to 40 network ports. Network ports operate at rates of 4.7 to 5.25 GB/s per direction. Currently, Cori uses SLURM version 20.02.6 and Cray OS 7.0.UP01. When submitting a job, users request a type of node, a number of nodes, and a job duration. Users do not indicate expected memory or CPU usage, but depending on their expectations they can choose “high memory” or “shared” job queues. The maximum job duration that can be requested is 48 h.

We collect system-wide data using the Lightweight Distributed Metric Service (LDMS) on Cori. LDMS samples statistics in every node every second and also records which job ID each node is allocated to. Statistics include information from the OS level, such as CPU idle times and memory capacity usage, as well as hardware counters at the CPUs and NICs. After collection, statistics are aggregated into data files that we process to produce our results. We also calibrate counters by comparing their values against workloads with known metric values, by comparing to the literature, and by comparing similar metrics derived through different means, such as hardware counters, the OS, or the job scheduler. For generating graphs, we use Python 3.9.1 and pandas 1.2.2. We sample Cori from April 1 to April 20 of 2021. However, NIC bandwidth and memory capacity are sampled from April 5 to April 19 due to the large volume of data.
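To make the collection pipeline above concrete, here is a minimal, hypothetical pandas sketch of how per-node, per-second samples could be aggregated after collection. The file name and column names (timestamp, node, job_id, mem_used_pct) are illustrative assumptions, not the actual LDMS schema used on Cori.

```python
import pandas as pd

# Hypothetical export of aggregated LDMS samples: one row per node per second.
samples = pd.read_csv(
    "ldms_samples.csv",  # illustrative file name
    usecols=["timestamp", "node", "job_id", "mem_used_pct"],
    parse_dates=["timestamp"],
)

# Keep only samples taken while the node was allocated to a job.
allocated = samples.dropna(subset=["job_id"])

# Per-node mean and maximum memory occupancy over the sampling window.
per_node = allocated.groupby("node")["mem_used_pct"].agg(["mean", "max"])
print(per_node.describe())
```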
Sampling more frequently than 1s or for longer periods would produce even more data to analyze than the few TBs that our analysis already processes. For this reason, sampling for more than three weeks at a time is impractical. Still, our three-week sampling period is long enough that the workload we capture is typical of Cori. Unfortunately, since we cannot capture application executable files and input data sets, and because we find job names to be an unreliable indicator of the algorithm a job executes, we do not have precise application information to identify the exact applications that executed in our captured date ranges. For the same reason, we cannot replay application execution in a system simulator. To focus on larger jobs and disregard test or debug jobs, we also generate statistics only for nodes that are allocated to jobs that use at least four nodes and last at least 2 h. Our data does not include the few tens of high-memory or interactive nodes in Cori. Because statistics are sampled no more frequently than every second, any maximum values we report, such as in Table 1, are calculated within each second boundary and thus do not necessarily capture peaks that last for less than one second. Therefore, our analysis focuses on sustained workload behavior. Memory occupancy and NIC bandwidth are expressed as utilization percentages of the aforementioned per-node maximums. We do not express memory bandwidth as a percentage, because the memory module in which data is placed and the CPU that accesses the data affect the maximum possible bandwidth. We present most data using node-wide cumulative distribution function (CDF) graphs that sample each metric at each node individually at every sampling interval; a sketch of this construction appears after this paragraph. Each sample from each node is added to the dataset. This includes idling nodes, but NERSC systems consistently assign well over 99% of nodes to jobs at any time. This method is useful for displaying how each metric varies across nodes and time, and is insensitive to job size and duration, because it disregards job IDs. Table 1 summarizes our sampled data from Cori. Larger maximum utilizations for KNL nodes for some metrics are partly due to the larger number of KNL nodes in Cori.

Figure 1 shows area plots of the size of the jobs that nodes are assigned to, as a percentage of total nodes. For each node, we determine the size of the job that the node is assigned to and record that size value for each node in the system individually. Results are shown as a function of time, with one sample every 30s. We only show this data for a subset of our sampling period because of the large volume of data, particularly for KNL nodes, since there are 4× more KNL nodes than Haswell nodes. There are no KNL jobs larger than 2,048 nodes in the illustrated period. As shown, the majority of jobs easily fit within a rack in Cori. In our 20-day sample period, 77.5% of jobs on Haswell and 40.5% on KNL nodes request only one node. Similarly, 86.4% of jobs on Haswell and 75.9% on KNL nodes request no more than four nodes. 41% of jobs on Haswell and 21% on KNL nodes use at least four nodes and execute for at least 2 h. Also, job duration is multiple orders of magnitude larger than the reconfiguration delay of modern hardware technologies that can implement resource disaggregation, such as photonic switch fabrics that can reconfigure in a few tens of microseconds down to a few tens of nanoseconds.
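The sketch referenced above combines the four-node/two-hour job filter with the node-wide CDF construction. It is an illustrative reconstruction under assumed column names (job_id, n_nodes, duration_h, plus a per-sample metric column), not the actual analysis code.

```python
import numpy as np
import pandas as pd

def node_wide_cdf(samples: pd.DataFrame, jobs: pd.DataFrame, metric: str):
    """Return (sorted values, CDF) for a metric, counting every per-node,
    per-interval sample as one data point.

    `jobs` is assumed to have columns job_id, n_nodes, duration_h;
    `samples` has one row per node per sampling interval with columns
    job_id and `metric`. These column names are illustrative assumptions.
    """
    # Keep only jobs that use at least four nodes and last at least two hours.
    kept = jobs.loc[(jobs["n_nodes"] >= 4) & (jobs["duration_h"] >= 2.0), "job_id"]
    filtered = samples[samples["job_id"].isin(kept)]

    # Every remaining sample from every node enters the dataset; job IDs are
    # then disregarded, making the CDF insensitive to job size and duration.
    values = np.sort(filtered[metric].to_numpy())
    cdf = np.arange(1, len(values) + 1) / len(values)
    return values, cdf
```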
Our data set includes a handful of jobs on both Haswell and KNL nodes that use special reservations to exceed the normal maximum number of hours for a job. In Alibaba’s systems, batch jobs are shorter, with 99% of jobs under two minutes and 50% under ten seconds.

Figure 2 shows a CDF of node-wide memory occupied by user applications, as calculated from “proc” reports of total minus available memory. Figure 2 also shows a CDF of the maximum memory occupancy among all nodes, sampled every 1s. Occupied memory is expressed as percentage utilization of the capacity available to user applications, and thus does not include memory reserved for the firmware and kernel binary code. As shown by the red example lines, three quarters of the time, Haswell nodes use no more than 17.4% of on-node memory and KNL nodes 50.1%. Looking at the maximum occupancy among all nodes in each sampling period, half of the time that maximum is no more than 11.9% for Haswell nodes and 59.85% for KNL nodes. Looking at individual jobs instead of nodes, 74.63% of Haswell jobs and 86.04% of KNL jobs never use more than 50% of on-node memory. Our analysis is in line with past work. In four Lawrence Livermore National Laboratory clusters, approximately 75% of the time, no more than 20% of memory is used. Other work observed that many HPC applications use in the range of hundreds of MBs per computation core. Interestingly, Alibaba’s published data show that a similar observation is only true for machines that execute batch jobs, because batch jobs tend to be short and small. In contrast, for online jobs or co-located batch and online jobs, the minimum average machine memory utilization is 80%. These observations suggest that memory capacity is overprovisioned on average. Overprovisioning of memory capacity is mostly a result of sizing memory per node to satisfy memory-intensive applications or phases of applications that execute infrequently but are considered important, and of allocating memory to nodes statically.

Figure 3 shows CDFs of bi-directional node-wide memory bandwidth for data transfers to and from memory. These results are calculated from the last-level cache (LLC) line size and the load and store LLC misses reported by “perf” counters, and cross-checked against PAPI counters. Due to counter accessibility limitations, LLC statistics are only available for Haswell nodes. In Haswell, hardware prefetching predominantly occurs at cache levels below the LLC; thus, prefetches are captured by the LLC counters we sample. As shown, three quarters of the time Haswell nodes use at most 0.46 GB/s of memory bandwidth, and 16.2% of the time they use more than 1 GB/s. The distribution shows a long tail, indicative of bursty behavior. In addition, aggregate read bandwidth is 4.4× larger than aggregate write bandwidth, and the majority of bursty behavior comes from reads. If we focus on individual jobs, then 30.9% of jobs never use more than 1 GB/s per node. To focus on the worst case, Figure 3 also shows a CDF of the maximum memory bandwidth among all Haswell nodes, sampled in 1s periods. As shown, half of the time the maximum bandwidth among all Haswell nodes is at least 17.6 GB/s, and three quarters of the time it is no more than 31.5 GB/s. In addition, the sustained system-wide maximum memory bandwidth over 30s time windows within our overall sampling period rarely exceeds 40 GB/s, while 30s average values rarely exceed 2 GB/s. Inevitably, our analysis is sensitive to the configuration of Cori’s Haswell nodes.
Faster or more computation cores per node would place more pressure on memory bandwidth, as is already evident in GPUs.
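A minimal sketch of the two derivations used above: memory occupancy from “proc” (total minus available memory) and per-node memory bandwidth from LLC miss counts multiplied by the cache-line size. The 64-byte line size matches Haswell; the helper names and the example counter values are illustrative assumptions, not measured data.

```python
def memory_occupancy_pct(mem_total_kb: int, mem_available_kb: int) -> float:
    """Occupied memory as a percentage of capacity, following the
    total-minus-available approach described above."""
    return 100.0 * (mem_total_kb - mem_available_kb) / mem_total_kb

def memory_bandwidth_gbs(llc_load_misses: int, llc_store_misses: int,
                         interval_s: float = 1.0, line_bytes: int = 64) -> float:
    """Approximate bi-directional memory bandwidth over one sampling interval:
    (load misses + store misses) * cache-line size / interval."""
    bytes_moved = (llc_load_misses + llc_store_misses) * line_bytes
    return bytes_moved / interval_s / 1e9

# Example with made-up counter values for a single 1 s interval.
print(memory_occupancy_pct(131_072_000, 108_000_000))   # ~17.6 %
print(memory_bandwidth_gbs(6_000_000, 1_500_000))       # ~0.48 GB/s
```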