The race for “exascale”

On May 27, 2022, the HPC (high-performance computing) community announced with great fanfare the arrival of the first “exascale” supercomputer, i.e. one capable of performing 10¹⁸ “FLOPS”, or one billion billion operations per second (on real numbers in floating-point representation, to be precise).

The new supercomputer, Frontier, operated by the US Department of Energy at Oak Ridge National Laboratory in Tennessee with several million cores, supplanted the Japanese supercomputer Fugaku, which dropped to second position in the TOP500 ranking of the most powerful machines.

Frontier, not content with being (for now) the most powerful computer in the world, also ranks well in terms of energy efficiency… at least relative to its computing power, because it consumes enormous amounts of energy, the equivalent of a city of several tens of thousands of inhabitants. And the problem does not stop at Frontier, since it is only the flagship of a flourishing global fleet of several thousand supercomputers.

A long-distance confrontation

This American return to the front of the race highlights a new battleground between the American and Chinese superpowers, which the Europeans watch from the sidelines. China had indeed created a surprise in 2017 by snatching first place from the United States: we then witnessed a mass arrival, with more than 200 Chinese supercomputers in the TOP500. Today, the leading Chinese machine has been relegated to sixth place, and China has chosen to withdraw its machines from the ranking.

In 2008, the Roadrunner supercomputer at the American Los Alamos National Laboratory was the first to reach one “petaflops”, or one million billion FLOPS (10¹⁵). Exascale then became a strategic objective for the Americans, even though the goal seemed technically unattainable.

To achieve exascale, it was necessary to rethink the architecture of the previous, petaflops generation. For example, at these extreme scales, the reliability of millions of components becomes a crucial question. Like a grain of sand jamming a gear, the failure of a single element can prevent the entire machine from functioning.

The “energy wall”

But the US Department of Energy (DoE) added a constraint to this technological development by imposing a maximum power of 20 megawatts for deploying exascale, a constraint known as the “energy wall”. The American Exascale Computing initiative was funded with more than a billion dollars in 2016.

To break through this “energy wall”, it was necessary to rethink all the software layers (from the operating system to the applications) and to design new algorithms for managing heterogeneous computing resources: standard processors and accelerators, memory hierarchies, and interconnects in particular.

In the end, Frontier's electricity consumption was measured at 21.1 megawatts, for an efficiency of 52.23 gigaflops per watt, which corresponds to roughly 150 tonnes of CO2 emissions per day given the energy mix of Tennessee, where the platform is located. Normalized for performance, this is just below the 20-megawatt wall set by the DoE target: dividing Frontier's 21.1 megawatts by its 1.102 exaflops gives 19.15 megawatts per exaflop.
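
To make the arithmetic explicit, here is a minimal back-of-the-envelope sketch in Python; the carbon intensity assumed for the Tennessee grid is an illustrative round value chosen to be consistent with the 150-tonne figure, not an official number.

    # Back-of-the-envelope check of the figures quoted above.
    PEAK_FLOPS = 1.102e18      # Frontier's measured performance: 1.102 exaflops
    POWER_W = 21.1e6           # measured power draw: 21.1 MW
    CARBON_KG_PER_KWH = 0.30   # ASSUMED carbon intensity of the Tennessee mix

    efficiency_gflops_per_w = PEAK_FLOPS / POWER_W / 1e9      # ~52.23
    mw_per_exaflop = (POWER_W / 1e6) / (PEAK_FLOPS / 1e18)    # ~19.15

    energy_kwh_per_day = POWER_W / 1e3 * 24                   # ~506,400 kWh
    co2_tonnes_per_day = energy_kwh_per_day * CARBON_KG_PER_KWH / 1e3

    print(f"{efficiency_gflops_per_w:.2f} GFlops/W")
    print(f"{mw_per_exaflop:.2f} MW per exaflop (below the 20 MW wall)")
    print(f"~{co2_tonnes_per_day:.0f} tonnes of CO2 per day")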

This places Frontier second in the Green500, the ranking of supercomputers that deliver the most operations (FLOPS) per watt. This ranking was launched in 2013 and marks the birth of the community's concern for energy issues. Frontier's place in the Green500 is good news: its performance gain is also accompanied by a gain in energy efficiency.

Overly optimistic estimates

But these estimates of digital energy consumption are, as is often the case, underestimates: they only take usage into account and neglect the significant share due to the manufacture of the supercomputer and its associated infrastructure, such as buildings, as well as its future dismantling. My research experience and that of my academic and industrial colleagues lead us to estimate that usage represents only about half of the total energy cost, taken over an average lifespan of five years. There are few studies on the subject, because of the systemic difficulty of the exercise and the scarcity of data, but let us cite a recent study measuring the consumption of one hour of computation on a core of the Dahu computing platform, which concludes that usage accounts for barely 30% of the total energy cost.
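
As a hypothetical illustration of this life-cycle reasoning, the sketch below extrapolates a total energy figure from a usage figure, assuming the 30% usage share of the study cited above; the usage energy value itself is invented for the example.

    # Hypothetical extrapolation: if usage is only 30% of the total
    # life-cycle energy, the full footprint is ~3.3x the usage alone.
    usage_share = 0.30               # usage share from the cited study
    usage_energy_kwh = 1_000_000     # HYPOTHETICAL usage energy over 5 years

    total_energy_kwh = usage_energy_kwh / usage_share
    manufacture_and_dismantling_kwh = total_energy_kwh - usage_energy_kwh

    print(f"total life-cycle energy: {total_energy_kwh:,.0f} kWh")
    print(f"hidden share: {manufacture_and_dismantling_kwh:,.0f} kWh")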

In addition, technological improvements that allow energy savings generate an overall surplus of consumption: this is known as the “rebound effect”. New features and increased usage ultimately result in increased energy consumption. A recent example in computer science is that of natural language processing (NLP) models, which gain new capabilities as computing performance increases.

The tree that hides the forest

The technological progress needed to achieve exascale is indisputable, but its direct and indirect contribution to global warming remains significant, despite optimists who argue that it is a drop in the bucket compared to the 40 billion tonnes of CO2 emitted each year by all human activities.

Moreover, it is not a question of just one supercomputer: Frontier is the tree that hides the forest. Indeed, we have long observed in the community that the progress obtained by building a new generation of high-performance machines spreads rapidly: new platforms very quickly replace those already deployed in university computing centers or in companies. When the replacement is premature, the effective life of the replaced machines is shortened and their environmental impact increases.

The TOP500 represents only part of the galaxy of HPC platforms deployed around the world. Their number is very difficult to estimate because many platforms fly under the radar: a large number of large-scale platforms belong to private companies, and many smaller-scale platforms are deployed locally.

A small study based directly on TOP500 data shows that the effective performance of the most powerful platform has been multiplied by 33 over the last ten years (the average performance of the 500 machines has progressed only by a factor of 20). Over the same period, the energy efficiency of the Green500 leader has improved by barely a factor of 15 (and by 18 on average). The overall balance in terms of energy consumed is therefore negative: it has, in the end, increased.
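
A minimal sketch of the arithmetic behind this conclusion, using the ratios quoted above: since power draw is performance divided by energy efficiency, performance gains that outpace efficiency gains necessarily increase total consumption.

    # Power draw scales as performance / efficiency, so these decade-scale
    # ratios imply that absolute consumption has grown.
    perf_gain_top1, perf_gain_avg = 33, 20   # TOP500 performance multipliers
    eff_gain_top1, eff_gain_avg = 15, 18     # Green500 efficiency multipliers

    power_growth_top1 = perf_gain_top1 / eff_gain_top1   # ~2.2x more power
    power_growth_avg = perf_gain_avg / eff_gain_avg      # ~1.1x more power

    print(f"top machine: {power_growth_top1:.1f}x the power of ten years ago")
    print(f"average machine: {power_growth_avg:.1f}x")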

What to do with this progress?

A counter-argument can be put forward: progress towards ever more powerful platforms could make it possible to find technical solutions for combating climate change. This way of thinking is representative of the mindset of our technocentric society, but it is unfortunately almost impossible to measure the impact of these new technologies on reducing the carbon footprint. Indeed, most of the time, such assessments focus on the usage phase and ignore the side costs, such as the manufacture of new equipment.

One can legitimately wonder what mechanism drives this race for performance. One reason cited by the designers of Frontier is scientific progress: the more complex the phenomena we seek to model and understand, the more simulations are needed, and the only way to run those simulations is to build ever more powerful HPC platforms…

_______

By Denis Trystram, University Professor in computer science, Grenoble Alpes University (UGA)

The original version of this article was published on The Conversation.