ComnetCo Blog

Optimizing a Goldilocks AI Compute Infrastructure

Segment 1. Optimized Networking

In a world rife with hype, hand-waving, and broad brushstrokes about AI, it would appear you can easily accelerate innovation and reduce costs with a pre-built computer architecture. In the real world, however, buying and deploying an HPC system to accelerate AI model training is not that simple. Yes, it would be nice to find a system on the shelf somewhere that meets your needs—one that supports your users with enough power and within your budget—with a scalable, cost-effective, and modular architecture. But the number of users you need to support is surely different from the requirements of the enterprise or research institution down the road. And if your teams are under pressure to train AI models quickly, you need a robust computing infrastructure with high-performance hardware, fast networking and storage interconnects, and optimized software frameworks.

In other words, to do your work efficiently and stay within your budget, what you really need is a compute infrastructure that is neither under-provisioned nor over-provisioned. Rather, you need a system right-sized and optimized to meet your unique requirements. So how does your organization purchase and deploy such a Goldilocks HPC system? Architecting an AI compute infrastructure with just the right mix of power and capabilities—one that won’t gobble up more electricity for processing and cooling than you can afford—requires more than simply getting your hands on a wheelbarrow full of state-of-the-art GPUs. It is an immensely complex undertaking that generally requires working with experienced partners.

Fortunately, Hewlett Packard Enterprise (HPE) has amassed unparalleled expertise through years of delivering powerful IT systems for enterprises and supercomputers for researchers. Add to this the wealth of technology assets and expertise gained in acquisitions such as Silicon Graphics (SGI) and Cray, and you can see why HPE is a leader in HPC systems used by researchers and engineers to speed time to insight everywhere from drug discovery and medicine to energy to aerospace.

In the case of Cray, this institutional know-how extends all the way back to 1964, when Seymour Cray designed the world’s first supercomputer, the CDC 6600. In 1972, Cray founded Cray Research, where he led a small team of engineers in Chippewa Falls, Wisconsin, that developed the Cray-1. That supercomputer ranked as the world’s fastest system from 1976 to 1982. A masterpiece of engineering, the Cray-1 rewrote compute technology from processing to cooling to packaging. Cray-1 systems were widely used in government and university laboratories for large-scale scientific applications, such as simulating complex physical phenomena.

Since acquiring Cray in 2019, HPE has invested heavily in research at Chippewa Falls, which has resulted in record-breaking innovations such as the world’s first exascale supercomputers. Once considered theoretically impossible to build or power, exascale systems can perform over a quintillion (1,000,000,000,000,000,000) calculations per second, or 1 exaflop. A Cray exascale AI system built for the US Department of Energy’s Lawrence Livermore National Laboratory in California zooms through calculations at 1.72 exaflops, making it the fastest computer on the planet. In fact, only HPE has successfully built exascale systems, and those Cray EX supercomputers hold the top three spots on the Top500 list.

In addition to assimilating and enhancing the world-class technologies that came with SGI, HPE has invested heavily in continuing this legacy of innovation. Today, HPE continues to lead advances in AI for supercomputing, which includes developing HPE Cray EX systems purpose-built for AI workloads. For instance, in collaboration with Intel, HPE designed Aurora, an HPE Cray supercomputer purpose-built to efficiently handle AI workloads for Argonne National Laboratory. Aurora is the second exascale system designed and delivered by HPE, and it also ranks as the world’s third-fastest supercomputer.

As one example of innovation that helps deliver optimized AI systems, HPE is on the bleeding edge of technologies like direct liquid cooling (DLC), a hot topic now that the density of today’s system-on-chip (SoC) devices produces incredible amounts of heat. For example, the widely used NVIDIA H200 GPU is manufactured on a cutting-edge 4 nm process node to pack 80 billion transistors onto a single sliver of silicon. Just one of these GPUs running at full speed can consume up to 700 watts of power, which generates significant heat. If you have eight H200s in a server, that’s 5,600 watts, which pushes the limits of traditional air-cooling techniques. To address this challenge, HPE has pioneered new DLC technologies as the most effective way to cool next-generation AI systems. In 2024, HPE introduced a 100% fanless direct liquid cooling system architecture for large-scale AI deployments, which reduces the cooling power required per server blade by 37%, lowering electricity costs, carbon emissions, and data center fan noise.
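
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It uses the wattage figures cited above, while the cooling-overhead fractions are purely illustrative assumptions (not measured HPE values) meant to show how a 37% reduction in cooling power plays out per server.

    # Back-of-the-envelope heat-load estimate for an 8-GPU server.
    # GPU wattage comes from the figures above; the cooling-overhead
    # fractions are illustrative assumptions, not measured HPE values.
    GPU_WATTS = 700              # max draw of a single H200-class GPU
    GPUS_PER_SERVER = 8

    gpu_heat_watts = GPU_WATTS * GPUS_PER_SERVER   # 5,600 W of heat per server
    air_cooling_overhead = 0.20                    # assumed share spent on air cooling
    dlc_cooling_overhead = 0.20 * (1 - 0.37)       # same figure reduced by 37%

    air_total = gpu_heat_watts * (1 + air_cooling_overhead)
    dlc_total = gpu_heat_watts * (1 + dlc_cooling_overhead)

    print(f"GPU heat per server:     {gpu_heat_watts:,.0f} W")
    print(f"With air cooling (est.): {air_total:,.0f} W")
    print(f"With DLC (est.):         {dlc_total:,.0f} W")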

AI workloads have pushed the limits of air cooling, which is why Aurora uses direct liquid cooling to dissipate the heat generated by powerful Intel blades.

Fitting together the pieces of an optimized AI compute system

A custom-tuned system begins with the customer’s spec, but getting a system that meets your AI needs up and running requires more than partnering with a leading maker like HPE. It also requires trusted experts who can help to navigate the entire deployment process—from pre-sales scoping studies to deployment to after-deployment support. This includes right-sizing a system for your needs — because not everyone needs or can afford a massively powerful machine like Aurora.

As a trusted HPE Solution Provider with a specialization in HPC & AI, ComnetCo offers customers 20 years of experience. This includes deploying systems ranging from some of the world’s fastest supercomputers to smaller systems that speed time-to-insight at leading universities, engineering departments, and teaching hospitals.

This expertise includes navigating the vast range of solutions available from HPE and other partners, such as NVIDIA’s ConnectX InfiniBand SmartNICs, which deliver industry-leading performance and efficiency for AI systems. Fast memory linked by fast networking is also key to efficient AI performance, because the machine must sometimes access many billions of data points in an LLM with lightning speed.

ComnetCo also provides expertise in linking the various components of an AI system—e.g., I/O, memory, CPUs, and GPUs—using open industry standards, an important aspect of an efficient architecture that avoids vendor lock-in. For instance, as an NVIDIA Elite Solution Provider, ComnetCo delivers strong core competencies in Compute, Networking, and Visualization. This includes deploying systems with state-of-the-art networking technologies, such as NVIDIA NICs.

ComnetCo uses decades of experience to link the various pieces of an AI compute system, using open standards as the glue.

Decisions, decisions, decisions

In short, network optimization involves an incredible number of decisions, covering everything from network speed to network density (NICs per node) to cable lengths. In making these design choices, ComnetCo experts weigh the trade-offs associated with each decision. For example, in what cases is it worth cutting down to four NICs per node rather than eight? Should your system’s InfiniBand (IB) fabric run at 200 Gb/s or 400 Gb/s? Could rearranging nodes and switches within rack layouts lower cabling costs, given that cables past a certain length must be optical, which is significantly more expensive than copper? Speed and density are generally determined by AI workload type and size, while optimizing cable lengths requires an understanding of your unique environment to frame out the maximum number of nodes per rack and then arrange them in the most cost-effective way.
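
As an illustration of how one of these trade-offs can be framed, the short Python sketch below compares total cabling cost for two hypothetical rack layouts. The copper-reach threshold, cable prices, and layouts are made-up placeholder values, not vendor quotes, and this is not ComnetCo’s actual design tooling.

    # Illustrative cabling-cost comparison for two hypothetical rack layouts.
    # The copper reach, cable prices, and layouts are placeholder assumptions.
    COPPER_MAX_M = 3.0        # assumed max reach for passive copper cables (meters)
    COPPER_COST = 150         # assumed cost per copper cable (USD)
    OPTICAL_COST = 900        # assumed cost per optical cable (USD)

    def cabling_cost(cable_lengths_m):
        """Cables longer than the copper reach must be optical."""
        return sum(COPPER_COST if length <= COPPER_MAX_M else OPTICAL_COST
                   for length in cable_lengths_m)

    # Layout A: nodes spread out, so many long runs back to the switches.
    layout_a = [2.0] * 48 + [12.0] * 80
    # Layout B: nodes and leaf switches co-located, so mostly short runs.
    layout_b = [2.0] * 112 + [12.0] * 16

    print(f"Layout A cabling: ${cabling_cost(layout_a):,}")   # $79,200
    print(f"Layout B cabling: ${cabling_cost(layout_b):,}")   # $31,200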

Carefully balancing these network considerations is one way in which ComnetCo experts optimize a system’s networking, allowing for strategic cost-cutting that maximizes the money spent on compute versus other cluster components. In other words, optimizing your AI compute system and its networking means more bang for the buck.

ComnetCo Blog

What to Expect When You’re Expecting a Supercomputer

In May 2022, the US Department of Energy’s newest supercomputer, Frontier, became fully operational at Oak Ridge National Laboratory (ORNL) in Tennessee. Built by Hewlett Packard Enterprise (HPE), it replaced Fugaku — the result of a collaboration between Fujitsu and Japan’s RIKEN Center for Computational Science (R-CCS) — as the fastest machine on the Top500 list. At the time, the Top500 authors called it the “only true exascale* machine on the list.” Since then, another HPE machine, Aurora, has become the second exascale system in the world, sitting just below Frontier on the Top500 list at number two.

The HPE Cray Frontier supercomputer — the fastest machine on the planet.

Only a decade ago, many experts doubted we could ever get to exascale computing—i.e., 10^18 flops. For one thing, they believed that an exascale computer would require 100 megawatts of electrical power to operate, making it impractical. Nonetheless, today a fully operational exascale computer is busy at the Tennessee lab—as well as a second machine at the US DoE’s Argonne National Laboratory—helping researchers tackle problems of national importance that could not be addressed by existing supercomputing platforms. Some of these scientific challenges include enhancing nuclear reactor efficiency and safety, uncovering the underlying genetics of diseases, and further integrating artificial intelligence (AI) with data analytics, modeling, and simulation.

The HPE Cray Aurora supercomputer — the fastest AI machine on the planet.

The latest verified exascale-class supercomputer, Aurora is an HPE Cray supercomputer with additional compute and accelerator infrastructure provided by Intel. It sits just below Frontier on the Top500 list of the fastest supercomputers in the world.

HPC systems come right sized for a wide range of needs

Not everyone needs a $600 million exascale computer that fills a room larger than two professional basketball courts. However, a growing number of organizations are looking to take advantage of High Performance Computing. Some plan to upgrade existing HPC systems while others will deploy a supercomputer for the first time.

HPC systems range from clusters of high-speed computer servers to purpose-built supercomputers that employ millions of processors or processor cores. When people think of these massive machines, they typically think of GPUs and CPUs. But it is only when those processors are linked together with cutting-edge networking fabrics that they become capable of processing massive amounts of data to solve the most computationally complex problems. They enable the organizations running them to achieve everything from advancing human knowledge of the universe to creating significant competitive advantages.

Some of the forces driving increased adoption of HPC include better productivity and faster results with greater accuracy. Specifically, supercomputers and HPC systems offer unique capabilities, such as quickly performing modeling and simulation of the world around us, both theoretically and physically, including simulations that rely on solving partial differential equations. HPC can also power applications like graph database analytics, which offers the potential to solve problems previously thought to be unsolvable. These include such tasks as deanonymizing the Bitcoin blockchain to uncover perpetrators of cyberextortion, cryptocurrency exchange hacks, and terrorist and WMD financing. And in the exploding area of artificial intelligence, machines tailored to the needs of AI can, for example, digest the massive data sets behind the large language models (LLMs) that enable generative AI.

Some of these challenging, data-intensive problems include:

  • Modeling and simulation
  • Electromagnetic simulation
  • Computational fluid dynamics (CFD)
  • Finite element method (FEM)
  • Computational chemistry
  • Complex graph database analytics
  • Oil and gas exploration
  • Molecular modeling and drug discovery
  • Nuclear fusion research
  • Cryptanalysis
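
As a tiny, self-contained taste of the modeling-and-simulation workloads listed above, the sketch below steps a one-dimensional heat equation forward in time with an explicit finite-difference scheme. The grid size and diffusivity are arbitrary illustrative values; production HPC codes solve the same kinds of PDEs over billions of grid points distributed across thousands of nodes.

    # Minimal 1-D heat-equation solver (u_t = alpha * u_xx) using an
    # explicit finite-difference scheme. Real HPC simulations apply the
    # same idea to enormous 3-D grids spread across many nodes.
    import numpy as np

    alpha, nx, nt = 0.01, 101, 2000     # diffusivity, grid points, time steps
    dx = 1.0 / (nx - 1)
    dt = 0.4 * dx * dx / alpha          # respects the explicit stability limit

    u = np.zeros(nx)
    u[nx // 2] = 1.0                    # start with a heat spike in the middle

    for _ in range(nt):
        u[1:-1] += alpha * dt / dx**2 * (u[2:] - 2 * u[1:-1] + u[:-2])

    print(f"peak temperature after {nt} steps: {u.max():.4f}")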

AI also advances these HPC applications with machine learning and deep learning. These workloads are driving innovation in fields ranging from healthcare and life sciences to energy, climate science, government research, and financial services.

In healthcare and life sciences, applications include everything from genomics to molecular modeling to image analysis for faster and more accurate cancer diagnosis, as well as personalized medicine for more targeted treatment.

Government agencies, green energy researchers, and traditional oil & gas companies apply the massive processing power of HPC to applications such as seismic data processing, reservoir simulation and modeling, and wind energy simulation. HPC simulations help these users predict, for example, where they can find oil reserves or whether a reservoir taps into one on a neighboring property.

HPC systems also provide ideal platforms for processing the vast amounts of data required in weather forecasting and climate change modeling. This capability proves indispensable in many government applications as well, including AI-based large-scale satellite image analysis, defense research, and intelligence work.

In financial services, fraud detection, risk analysis, and Monte Carlo simulation represent just some of the more common applications.

Who is the typical HPC buyer?

With the availability of high-powered cloud computing coupled with AI, organizations are getting a taste of the possibilities offered by HPC. Consequently, whereas not long ago the typical HPC buyer resided in the ivory tower, the universe of users is rapidly expanding. From manufacturing to aerospace to pharmaceuticals, commercial buyers typically apply the machine to a single task. Universities and government research centers, the other main buyers of supercomputers, almost always have multiple users accessing the machine’s processing power, not only for interdisciplinary research but also for novel use cases.

What are the various HPC architectures?

Not all HPC systems are created equal. They vary greatly in their components and in how those components are packaged together. A system’s components typically include a CPU and an accelerator such as an FPGA or GPU, along with memory, storage, and networking. HPC nodes, or servers, can be organized in a variety of architectures working in unison, with clustered nodes breaking a problem into pieces and parallel computing combining enough processing power to handle complex computational tasks holistically. Parallel computing architectures allow HPC clusters to execute large workloads by splitting them into separate computational tasks that are carried out at the same time. A supercomputer, by contrast, is a single machine (even if spread across multiple racks), essentially a mainframe computer on steroids, in which the processors and storage are designed to work as one extremely powerful computer. The two architectures are often difficult to distinguish, which is why the industry sometimes defines supercomputers simply as HPC systems above a certain price point.

NOTE: Interdisciplinary research can lead to faster time to insight by bringing together diverse schools of science. Supercomputing offers a huge advantage here by facilitating collaboration and cross-pollination of schools of thought.
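
As a minimal sketch of the parallel-computing pattern described above, the Python snippet below splits one large workload into chunks that are processed at the same time. The standard multiprocessing pool here merely stands in for the thousands of interconnected nodes (and MPI-style communication) a real HPC cluster would use, and the workload itself is a placeholder.

    # Toy illustration of splitting a large workload into chunks that are
    # computed concurrently -- the same pattern an HPC cluster applies
    # across thousands of nodes, typically via MPI rather than one machine.
    from multiprocessing import Pool

    def process_chunk(chunk):
        """Stand-in for a real computational kernel (simulation, analytics, ...)."""
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        workload = list(range(1_000_000))
        n_workers = 8
        size = len(workload) // n_workers
        chunks = [workload[i * size:(i + 1) * size] for i in range(n_workers)]

        with Pool(n_workers) as pool:
            partials = pool.map(process_chunk, chunks)   # chunks run in parallel

        print("combined result:", sum(partials))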

What is the right environment for a supercomputer?

When you work with Hewlett Packard Enterprise (HPE) and ComnetCo, the engagement does not stop at choosing the right compute architecture; it includes looking at how to combine all resources in the right environment.

The design phase even includes looking at questions like “Can the local facility support the power requirements?” A typical supercomputer consumes anywhere from 1 to 10 megawatts of power, or enough electricity to power almost 10,000 homes. This includes the electricity needed not only to run the machine but also to cool it. Furthermore, that electric power needs to be stable. You can have conditioned power and then run it through a UPS, which makes sure you have clean power, but what happens when a hundred-year flood forces you to switch over to a generator? Finally, a comprehensive design assessment should even include analyzing the cost of various fuels—e.g., natural gas vs. diesel.
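
For a rough sense of the numbers involved, the short sketch below estimates total facility load from the IT load by applying a power usage effectiveness (PUE) factor. The PUE value and the average-household figure are common rule-of-thumb assumptions used only for illustration, not measurements for any specific site.

    # Rough facility-power estimate for a supercomputer deployment.
    # The PUE and household figures are rule-of-thumb assumptions only.
    it_load_mw = 10.0      # IT load at the top of the 1-10 MW range cited above
    pue = 1.3              # assumed power usage effectiveness (cooling, UPS losses, ...)
    avg_home_kw = 1.2      # assumed average household draw in kilowatts

    facility_mw = it_load_mw * pue
    homes_equiv = facility_mw * 1000 / avg_home_kw

    print(f"Total facility load: {facility_mw:.1f} MW")
    print(f"Roughly equivalent to {homes_equiv:,.0f} average homes")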

Getting Started

The first step in buying and deploying an HPC system involves partnering with a leading maker and trusted experts who can help you navigate the entire deployment process—from pre-sales scoping studies to purchase to deployment to after-deployment support. When engaging with HPE and ComnetCo, we sit down with you and consider your unique needs.

A good place to start is asking yourselves: what are you trying to achieve? The answer to that question will help determine how much compute power you need. Once we have a handle on the capacity and speed your researchers require, we collaborate with you to decide what type of architecture you will need—scale-out or scale-up. A scale-out architecture essentially allows you to combine multiple machines into a single system with a larger memory pool, while scale-up enables you to increase the performance of your existing machine and, in many cases, extend its lifecycle. We also help you determine whether that calls for an HPC cluster or a full-blown supercomputer. These are not easy questions to answer because, just as no two snowflakes are alike, virtually every HPC system comes down to a custom build based on unique needs—even at the level of individual nodes.
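
As a toy illustration of how those two directions differ in practice, the sketch below sizes a hypothetical in-memory working set both ways. Every capacity figure here is an illustrative assumption, not a recommendation for any particular HPE product.

    # Toy sizing comparison of scale-out vs. scale-up for an in-memory
    # working set. All capacities below are illustrative assumptions.
    working_set_tb = 4.0           # data the researchers need resident in memory

    # Scale-out: combine many standard nodes into one larger memory pool.
    node_memory_tb = 0.5           # assumed memory per cluster node
    nodes_needed = -(-working_set_tb // node_memory_tb)   # ceiling division
    print(f"Scale-out: {int(nodes_needed)} nodes of {node_memory_tb} TB each")

    # Scale-up: one shared-memory system sized (or expanded) to hold it all.
    scale_up_memory_tb = 6.0       # assumed single-system memory capacity
    fits = working_set_tb <= scale_up_memory_tb
    print(f"Scale-up: single {scale_up_memory_tb} TB system, working set fits: {fits}")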

What will the engagement look like?

Once we have assessed the basic needs, such as processing power, storage, and electric power and cooling requirements, we look for potential pitfalls and how to avoid them. This ability to spot hurdles early so you can avoid them constitutes one of the key advantages of working with a leading manufacturer together with ComnetCo experts. Using HPE Cray and HPE Superdome Flex systems, ComnetCo has deployed everything from highly efficient HPC clusters to some of the world’s fastest supercomputers, such as three Top100 machines at Idaho National Laboratory.

One of the first questions buyers ask is “How long will it take to deploy our system?”. This varies depending on whether you plan to purchase an HPC cluster or a factory-built supercomputer. The charts below show you the typical phases and timelines, start to finish, involved in both types of deployments.

Case in Point: Sawtooth

Let us look at the example of a very fast supercomputer. Named after a central Idaho mountain range, the Sawtooth supercomputer at Idaho National Laboratory (INL) went online in 2019. At a cost of $19.2 million, the system ranked #37 on the 2019 Top500 list of the fastest supercomputers in the world. That is the highest ranking ever reached by an INL supercomputer.

As you can see, deploying an HPC system or supercomputer requires careful planning, in concert with guidance from experienced experts, on everything from choosing a ‘right-sized’ system down to critical advice on the right interconnects. In addition to working with the world’s leading manufacturer of HPC systems—Hewlett Packard Enterprise—ComnetCo also offers more than 25 years of experience in deploying systems of all sizes. Plus, once your system comes online, you can count on the backing of these two leading companies for reliable, responsive, complete-lifecycle system support.

* As of May 2023

ComnetCo Blog

A new AI model allows researchers to share insights, not data.

In the world of artificial intelligence (AI), there’s a new kid on the block. As if all the myriad branches of AI were not confusing enough, in addition to everything from deep learning to fuzzy logic we now have “swarm learning.” As a form of machine learning, it basically facilitates training models at the edge, so the edge devices get smarter and also train their peers.

But swarm learning also puts two new twists on standard machine learning that make it very exciting for a range of applications: it works as a decentralized model, and it links the edge devices with blockchain technology. This means researchers can share insights without sharing data, thus enabling collaboration while preserving privacy.

A natural evolution of ML

Actually, swarm learning is not a completely new concept; it evolves from other forms of AI. The journal Nature debuted swarm learning in May 2021, and the authors described the concept as the fourth step in a progression of machine learning concepts. First, there is local learning, where data and computation reside at different, disconnected locations. Next comes cloud-based centralized learning. In the third evolution, federated learning, computing is performed at the point where data is created, collected, and stored, with parameter settings orchestrated by a central parameter server.

In the fourth phase of this evolution, swarm learning makes it possible to share just the neural network learnings (model parameters and inferences) from many distributed edge nodes, all linked over a blockchain. In other words, by sharing insights derived from AI analytics performed at the edge, researchers in different jurisdictions can collaborate without sharing the actual data. This new distributed approach eliminates the need for centralized coordination and a parameter server, which would otherwise be a potential threat vector for bad actors to corrupt or manipulate confidential data. The individual edge nodes almost literally become a swarm, securely exchanging learning parameters using blockchain technology.
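
To make the idea concrete, here is a minimal conceptual sketch, in Python, of the swarm pattern under heavy simplification: several simulated nodes each train on their own private data and then merge only their model parameters with their peers, with no central parameter server. This is not the HPE Swarm Learning product or its API; the blockchain coordination layer is omitted entirely, and the model, data, and node count are invented for illustration.

    # Conceptual sketch of swarm-style learning: each node trains on its own
    # local data and shares only model parameters with its peers, never the
    # raw data. The blockchain layer used by real swarm learning is omitted;
    # everything here is a toy NumPy simulation.
    import numpy as np

    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0])          # ground-truth linear model

    def local_data(n=200):
        """A node's private dataset; it never leaves the node."""
        X = rng.normal(size=(n, 2))
        y = X @ true_w + rng.normal(scale=0.1, size=n)
        return X, y

    def local_train(w, X, y, lr=0.05, steps=20):
        """A few gradient-descent steps on the node's own data."""
        for _ in range(steps):
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w = w - lr * grad
        return w

    nodes = [local_data() for _ in range(5)]          # five independent "hospitals"
    weights = [np.zeros(2) for _ in nodes]

    for _ in range(10):
        # 1) every node learns from its own data only
        weights = [local_train(w, X, y) for w, (X, y) in zip(weights, nodes)]
        # 2) peers exchange and average parameters (the shared "insight");
        #    no raw data moves and no central parameter server is involved
        merged = np.mean(weights, axis=0)
        weights = [merged.copy() for _ in weights]

    print("swarm model:", np.round(merged, 3), "vs. true model:", true_w)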

Concept of Swarm Learning

Increase accuracy and reduce biases in AI models

In a recent ComnetCo / Hewlett Packard Enterprise (HPE) white paper, Dr. Eng Lim Goh, HPE’s Senior Vice President & Chief Technology Officer for Artificial Intelligence, described the shift: “More and more, we are thinking at some point a smart edge device should not only be running a trained AI/ML learning model given to it by humans, but should also be doing learning on its own based on the data it’s collecting,” explained Dr. Goh. “This is the next forward-thinking concept.”

Dr. Goh also described to us his recent collaboration with the World Health Organization (WHO) on the potential for swarm learning to solve a huge challenge in medicine. 

Medicine is an inherently decentralized field. Hospitals around the globe want to utilize the massive amounts of data collected from edge devices within the world of medical IoT. “However, one catch here is that each sensor will be looking at its own compartmentalized data, and therefore will be highly biased towards the data it’s seeing,” explained Dr. Goh. “Eliminating this bias by sharing AI/ML outcomes from many edge devices is one reason why we came up with a concept called ‘Swarm Learning’.”

Swarm learning leverages the security of blockchain smart contracts to work collaboratively with peers and improve model insights. In fact, the authors of the original Nature article showed that swarm learning classifiers outperformed those developed at individual sites.

In addition to better accuracy, swarm learning is also more efficient. By putting machine learning at the edge, or the near edge, the data remains at the source, preventing the inefficient movement—or duplication—of data to a core or central location.

Enabling collaboration while protecting privacy

The beauty of swarm learning is that it allows the insights generated from data to be shared without sharing the source data itself. Because the data is never moved from its sources, privacy is preserved.

As in federated learning, the machine learning method is applied locally at the data collection source. Only the insights inferred from that data are shared between the nodes. Protecting privacy by not exposing patients’ private data is not only critical for maintaining compliance with data privacy laws but is also a basic duty of researchers. According to the National Institutes of Health (NIH), “Protecting patients involved in research from harm and preserving their rights is essential to ethical research.”1

This new approach means, for example, that hospitals can share insights derived from applying AI at the edge without risking exposure of protected patient data to bad actors. Plus, there is no central custodian aggregating all the data, and leveraging a blockchain helps to ensure data integrity. This protection is critical not only for preserving patient privacy but also for safeguarding confidential data from prying hackers, such as nation states spying on vaccine research. This decentralized, distributed model shares only the insights gleaned from the data, often derived using AI machine learning models.

1 https://www.ncbi.nlm.nih.gov/books/NBK9579/