ComnetCo Blog

What to Expect When You’re Expecting a Supercomputer

In May 2022, the US Department of Energy’s newest supercomputer, Frontier, became fully operational at Oak Ridge National Laboratory (ORNL) in Tennessee. Built by Hewlett Packard Enterprise (HPE), it replaced Fugaku, the result of a collaboration between Fujitsu and Japan’s RIKEN Center for Computational Science (R-CCS), as the fastest machine on the Top500 list. At the time, the list’s authors called it the “only true exascale* machine on the list.” Since then, another HPE machine, Aurora, has become the second exascale system in the world, sitting just below Frontier at number two on the Top500 list.

The HPE Cray Frontier supercomputer — the fastest machine on the planet.

Only a decade ago, many experts doubted we could ever reach exascale computing, i.e., 10¹⁸ flops. For one thing, they believed that an exascale computer would require 100 megawatts of electrical power to operate, making it impractical. Today, however, a fully operational exascale computer is busy at the Tennessee lab, as is a second machine at the US DoE’s Argonne National Laboratory, helping researchers tackle problems of national importance that could not be addressed by existing supercomputing platforms. Some of these scientific challenges include enhancing nuclear reactor efficiency and safety, uncovering the underlying genetics of diseases, and further integrating artificial intelligence (AI) with data analytics, modeling, and simulation.

The HPE Cray Aurora supercomputer — the fastest AI machine on the planet.

Aurora, the latest verified exascale-class supercomputer, is an HPE Cray system with additional compute and accelerator infrastructure provided by Intel. It sits just below Frontier on the Top500 list of the fastest supercomputers in the world.

HPC systems come right-sized for a wide range of needs

Not everyone needs a $600 million exascale computer that fills a room larger than two professional basketball courts. However, a growing number of organizations are looking to take advantage of High Performance Computing. Some plan to upgrade existing HPC systems while others will deploy a supercomputer for the first time.

HPC systems range from clusters of high-speed computer servers to purpose-built supercomputers that employ millions of processors or processor cores. When people think of these massive machines, they typically think of GPUs and CPUs; linked together with cutting-edge networking fabrics, those processors become capable of churning through massive amounts of data to solve the most complex computational problems. They enable the organizations running them to achieve everything from advancing human knowledge of the universe to creating significant competitive advantages.

Some of the forces driving increased adoption of HPC include better productivity and faster results with greater accuracy. Specifically, supercomputers and HPC systems offer unique capabilities such as quickly modeling and simulating the world around us, both theoretically and physically, including simulations that rely on solving partial differential equations. HPC can also power applications like graph database analytics, which offers the potential to solve problems previously thought to be unsolvable, such as deanonymizing the Bitcoin blockchain to uncover perpetrators of cyberextortion, cryptocurrency exchange hacks, and terrorist and WMD financing. And in the exploding area of artificial intelligence, machines tailored to the needs of AI can, for example, digest the massive data sets behind the large language models (LLMs) that enable generative AI.

Some of these challenging, data-intensive problems include:

  • Modeling and simulation
  • Electromagnetic simulation
  • Computational fluid dynamics (CFD)
  • Finite element method (FEM)
  • Computational chemistry
  • Complex graph database analytics
  • Oil and gas exploration
  • Molecular modeling and drug discovery
  • Nuclear fusion research
  • Cryptoanalysis
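To make items like CFD, FEM, and other PDE-based simulation concrete, here is a minimal, single-node Python sketch (illustrative only, not production HPC code) of the finite-difference stencil update at the heart of many of these workloads; on a real supercomputer the same kind of update runs over a decomposed 3-D grid across thousands of nodes.

```python
import numpy as np

# Toy 1-D heat equation solver (u_t = alpha * u_xx) using an explicit
# finite-difference stencil -- the kind of kernel that CFD and FEM codes
# scale up to billions of grid points on HPC systems.

alpha = 0.01               # thermal diffusivity (arbitrary illustrative value)
nx, nt = 100, 500          # grid points and time steps
dx = 1.0 / (nx - 1)
dt = 0.4 * dx**2 / alpha   # keep the explicit scheme stable

u = np.zeros(nx)
u[nx // 2] = 1.0           # initial heat spike in the middle of the domain

for _ in range(nt):
    # Update interior points from their left/right neighbors (the stencil).
    u[1:-1] += alpha * dt / dx**2 * (u[2:] - 2 * u[1:-1] + u[:-2])

print("peak temperature after diffusion:", u.max())
```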

AI also advances these HPC applications through machine learning and deep learning. These workloads are driving innovation in fields ranging from genomics to molecular modeling to image analysis, enabling faster, more accurate cancer diagnosis and more targeted, personalized medicine.

Government agencies, green energy researchers, and traditional oil & gas companies apply the massive processing power of HPC to applications such as seismic data processing, reservoir simulation and modeling, and wind energy simulation. HPC simulations help these users predict, for example, where they can find oil reserves or whether a reservoir may tap into one on a neighboring property.

HPC systems provide ideal platforms for processing the vast amounts of data required in weather forecasting and climate change modeling. This capability also proves indispensable in many government applications, including AI-based large-scale satellite image analysis, defense research, and intelligence work.

Fraud detection, risk analysis, and Monte Carlo simulation represent some of the other common data-intensive applications.
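As a rough illustration of what a Monte Carlo risk workload looks like, the short Python sketch below estimates a simple value-at-risk figure; every input (portfolio size, return distribution, scenario count) is a made-up assumption, and production risk engines run vastly larger scenario counts across many nodes.

```python
import numpy as np

# Toy Monte Carlo risk simulation: estimate the 1-day 99% value-at-risk (VaR)
# of a portfolio by sampling many hypothetical daily returns.
rng = np.random.default_rng(seed=42)

portfolio_value = 1_000_000.0   # hypothetical portfolio value, in dollars
mu, sigma = 0.0005, 0.02        # assumed daily mean return and volatility
n_scenarios = 1_000_000         # HPC systems push this into the billions

simulated_returns = rng.normal(mu, sigma, n_scenarios)
losses = -portfolio_value * simulated_returns

# 99% VaR: the loss exceeded in only 1% of simulated scenarios.
var_99 = np.percentile(losses, 99)
print(f"Estimated 1-day 99% VaR: ${var_99:,.0f}")
```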

Who is the typical HPC buyer?

With the availability of high-powered cloud computing coupled with AI, organizations are getting a taste of the possibilities offered by HPC. Consequently, whereas not long ago the typical HPC buyer was in the ivory tower, the universe of users is rapidly expanding. From manufacturing to aerospace to pharmaceuticals, commercial buyers typically apply the machine to a single task. Universities and government research centers, the other main buyers of supercomputers, almost always have multiple users accessing the machine’s processing power not only for interdisciplinary research but also for novel use cases.

What are the various HPC architectures?

Not all HPC systems are created equal. They vary greatly in their components and in how those components are packaged together. A system’s components typically include a CPU and an accelerator such as an FPGA or GPU, along with memory, storage, and networking. An HPC cluster is built from many nodes, or servers, working in unison: a parallel computing architecture breaks a large workload into separate computational tasks that the nodes carry out at the same time, combining enough processing power to handle problems no single server could tackle. A supercomputer, by contrast, is a single machine (even if spread across multiple racks), essentially a mainframe computer on steroids, in which the processors and storage are designed to work as one extremely powerful computer. The two architectures are often difficult to distinguish, so the industry sometimes defines supercomputers as HPC systems above a certain price point.

NOTE: Interdisciplinary research can lead to faster time to insight by bringing together diverse schools of science. Supercomputing offers a huge advantage here by facilitating collaboration and cross-pollination of schools of thought.
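To illustrate the split-and-run-in-parallel pattern described above, here is a minimal, single-machine Python sketch using the standard multiprocessing module; a real HPC cluster applies the same idea across thousands of nodes with technologies such as MPI and a high-speed interconnect.

```python
from multiprocessing import Pool

import numpy as np

def partial_sum_of_squares(chunk: np.ndarray) -> float:
    """Work on one chunk of the problem independently."""
    return float(np.sum(chunk ** 2))

if __name__ == "__main__":
    # Break one large workload into independent pieces...
    data = np.arange(10_000_000, dtype=np.float64)
    chunks = np.array_split(data, 8)

    # ...process the pieces at the same time on separate cores...
    with Pool(processes=8) as pool:
        partials = pool.map(partial_sum_of_squares, chunks)

    # ...and combine the partial results into one answer.
    print("combined result:", sum(partials))
```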

What is the right environment for a supercomputer?

When you work with Hewlett Packard Enterprise (HPE) and ComnetCo, the engagement does not stop at choosing the right compute architecture; it includes looking at how to combine all resources in the right environment.

The design phase even includes questions like “Can the local facility support the power requirements?” A typical supercomputer consumes anywhere from 1 to 10 megawatts of power, enough electricity to power almost 10,000 homes. That figure includes the electricity needed not only to run the machine but also to cool it. Furthermore, that electric power needs to be stable. You can have conditioned power run through a UPS to ensure it is clean, but what happens when a hundred-year flood forces you to switch over to a generator? Finally, a comprehensive design assessment should even include analyzing the cost of various fuels, e.g., natural gas vs. diesel.
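To put those power figures in perspective, here is a quick back-of-envelope Python sketch; the per-home demand and electricity price are illustrative assumptions, not HPE or ComnetCo figures.

```python
# Back-of-envelope math for a 10 MW system (the upper end of the 1-10 MW range).
system_power_mw = 10.0      # assumed total draw, including cooling
avg_home_demand_kw = 1.2    # assumed average household demand
price_per_kwh = 0.10        # assumed electricity price, USD

homes_equivalent = system_power_mw * 1_000 / avg_home_demand_kw
annual_energy_kwh = system_power_mw * 1_000 * 24 * 365
annual_cost_usd = annual_energy_kwh * price_per_kwh

print(f"Roughly {homes_equivalent:,.0f} homes' worth of continuous demand")
print(f"About {annual_energy_kwh / 1e6:,.1f} GWh and ${annual_cost_usd / 1e6:,.1f}M per year")
```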

Getting Started

The first step in buying and deploying an HPC system involves partnering with a leading maker and trusted experts who can help you navigate the entire deployment process—from pre-sales scoping studies to purchase to deployment to after-deployment support. When you engage with HPE and ComnetCo, we sit down with you and consider your unique needs.

A good place to start is asking yourselves: what are you trying to achieve? The answer to that question helps determine how much compute power you need. Once we have a handle on the capacity and speed your researchers require, we collaborate with you to decide what type of architecture you will need: scale out or scale up, cluster or full-blown supercomputer. A scale-out architecture essentially allows you to combine multiple machines into a single system with a larger memory pool, while scale-up enables you to increase the performance of your existing machine and, in many cases, extend its lifecycle. These are not easy questions to answer because, just as no two snowflakes are alike, virtually every HPC system comes down to a custom build based on unique needs, even at the level of individual nodes.

What will the engagement look like?

Once we have assessed the basic needs, such as processing power, storage, and electric power and cooling requirements, we look for potential pitfalls and how to avoid them. This ability to see hurdles early so you can avoid them constitutes one of the key advantages of working with a leading manufacturer together with ComnetCo experts. Using HPE Cray Superdome and Superdome Flex servers, ComnetCo has deployed everything from highly efficient HPC clusters to some of the world’s fastest supercomputers, such as three Top100 machines at Idaho National Laboratory.

One of the first questions buyers ask is “How long will it take to deploy our system?”. This varies depending on whether you plan to purchase an HPC cluster or a factory-built supercomputer. The charts below show you the typical phases and timelines, start to finish, involved in both types of deployments.

Case in Point: Sawtooth

Let us look at the example of a very fast supercomputer. Named after a central Idaho mountain range, the Sawtooth supercomputer at Idaho National Laboratory (INL) went online in 2019. At a cost of $19.2 million, the system ranked #37 on the 2019 Top500 list of the fastest supercomputers in the world. That is the highest ranking ever reached by an INL supercomputer.

As you can see, deploying an HPC system or supercomputer requires careful planning in concert with guidance from experienced experts on everything from choosing a ‘right-sized’ system down to critical advice on the right interconnects. In addition to working with the world’s leading manufacturer of HPC systems—Hewlett Packard Enterprise—ComnetCo also offers more than 25 years of experience in deploying systems of all sizes. Plus, once your system comes online, you can count on the backing of these two leading companies for reliable and responsive complete lifecycle system support.

* As of May 2023

ComnetCo Blog

A new AI model allows researchers to share insights, not data.

In the world of artificial intelligence (AI), there’s a new kid on the block. As if all the myriad branches of AI were not confusing enough, in addition to everything from deep learning to fuzzy logic we now have “swarm learning.” As a form of machine learning, it basically facilitates training models at the edge, so the edge devices get smarter and also train their peers.

But swarm learning also puts two new twists on standard machine learning that make it very exciting for a range of applications: it works as a decentralized model, and it links the edge devices with blockchain technology. This means researchers can share insights without sharing data, thus enabling collaboration while preserving privacy.

A natural evolution of ML

Actually, it is not a completely new concept but an evolution of other forms of AI. The journal Nature debuted swarm learning in May 2021, and the authors described the concept as the fourth step in a progression of machine learning approaches. First, there is local learning, where data and computation remain at separate, disconnected locations. Next comes cloud-based centralized learning. In the third evolution, federated learning, computing is performed at the point where data is created, collected, and stored, with parameter settings orchestrated by a central parameter server.

In the fourth phase of this evolution, swarm learning makes it possible to share just the insights inferred by the neural networks running on many distributed edge nodes, all linked over a blockchain. In other words, by sharing insights derived from AI analytics performed at the edge, researchers in different jurisdictions can collaborate without sharing the actual data. This distributed approach eliminates the need for centralized coordination and a central parameter server, which would otherwise be a potential threat vector for bad actors to corrupt or manipulate confidential data. The individual edge nodes almost literally become a swarm, securely exchanging learned parameters using blockchain technology.
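As a rough, highly simplified Python sketch of this pattern (not HPE’s actual Swarm Learning implementation), the example below has each node train on its own private data and then merge model parameters with its peers; the blockchain coordination and secure parameter exchange that real swarm learning depends on are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_locally(weights, local_data, local_labels, lr=0.1):
    """One gradient-descent step of a simple linear model on local data only."""
    preds = local_data @ weights
    grad = local_data.T @ (preds - local_labels) / len(local_labels)
    return weights - lr * grad

# Three "hospitals", each with private data that never leaves its node.
nodes = [(rng.normal(size=(50, 4)), rng.normal(size=50)) for _ in range(3)]
weights = [np.zeros(4) for _ in nodes]

for _ in range(20):  # a few swarm rounds
    # 1) Each node learns from its own data.
    weights = [train_locally(w, X, y) for w, (X, y) in zip(weights, nodes)]
    # 2) Peers exchange and average parameters -- insights, not raw data.
    merged = np.mean(weights, axis=0)
    weights = [merged.copy() for _ in nodes]

print("shared model weights:", np.round(merged, 3))
```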

Concept of Swarm Learning

Increase accuracy and reduce biases in AI models

In a recent ComnetCo / Hewlett Packard Enterprise (HPE) white paper, Dr. Eng Lim Goh, HPE’s Senior Vice President & Chief Technology Officer for Artificial Intelligence, explained the thinking behind the approach: “More and more, we are thinking at some point a smart edge device should not only be running a trained AI/ML learning model given to it by humans, but should also be doing learning on its own based on the data it’s collecting,” said Dr. Goh. “This is the next forward-thinking concept.”

Dr. Goh also described to us his recent collaboration with the World Health Organization (WHO) on the potential for swarm learning to solve a huge challenge in medicine. 

Medicine is an inherently decentralized field. Hospitals around the globe want to utilize the massive amounts of data collected from edge devices within the world of medical IoT. “However, one catch here is that each sensor will be looking at its own compartmentalized data, and therefore will be highly biased towards the data it’s seeing,” explained Dr. Goh. “Eliminating this bias by sharing AI/ML outcomes from many edge devices is one reason why we came up with a concept called ‘Swarm Learning’.”

Swarm learning leverages the security of blockchain smart contracts to work collaboratively with peers and improve model insights. In fact, the authors of the original Nature article showed that swarm learning classifiers outperformed those developed at individual sites.

In addition to better accuracy, swarm learning is also more efficient. By putting machine learning at the edge, or the near edge, the data remains at the source preventing the inefficient movement of data—or data duplication—to the core or central location.

Enabling collaboration while protecting privacy

The beauty of swarm learning is that it allows the insights generated from data to be shared without sharing the source data itself. Because the data never moves from its source, privacy is preserved.

As in federated learning, the machine learning method is applied locally at the data collection source; only the insights inferred from that data are shared between the nodes. Protecting privacy by not exposing private patient data is not only critical for maintaining compliance with data privacy laws but is also a basic duty of researchers. According to the National Institutes of Health (NIH), “Protecting patients involved in research from harm and preserving their rights is essential to ethical research.”1

This new approach means, for example, that hospitals can share insights derived from applying AI at the edge without risking exposure of protected patient data to bad actors. Plus, there is no central custodian that aggregates all the data, and leveraging a blockchain helps to ensure data integrity. This protection is critical not only for preserving patient privacy but also for safeguarding confidential data from prying hackers, such as nation states spying on vaccine research. This decentralized, distributed model shares only the insights gleaned from the data, often derived using AI machine learning models.

1 https://www.ncbi.nlm.nih.gov/books/NBK9579/