High Performance Computing (HPC) can enable humans to understand the world around us, from atomic behaviour to how the universe is expanding. While HPC used to be associated with a single “supercomputer,” today’s HPC environments are built from hundreds to thousands of individual servers connected by a high-speed, low-latency networking fabric. To take advantage of thousands of separate computing elements, or “cores,” simultaneously, applications need to be designed and implemented to work in parallel, sharing intermediate results as required. For example, a weather prediction application will divide the Earth’s atmosphere into thousands of 3D cells and simulate the physics of wind, water bodies, atmospheric pressure, and other phenomena within each cell. Each cell then needs to communicate the results of the previous simulation step to its neighbours. The more processing power is available, the smaller each cell can be and the more accurate the simulated physics.
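This cell-and-neighbour pattern is commonly called domain decomposition with halo (or ghost-cell) exchange. The Python sketch below is purely illustrative and not taken from any real weather code: a toy diffusion stencil stands in for the actual atmospheric physics, and all subdomains run in one process, whereas a production HPC application would place each subdomain on its own node and implement the halo copies as network messages (for example via MPI).

```python
# Purely illustrative sketch of domain decomposition with halo exchange.
# A toy diffusion stencil stands in for real atmospheric physics, and all
# subdomains live in one process; on a real cluster each subdomain would sit
# on its own node and the halo copies would be network (e.g. MPI) messages.
import numpy as np

def exchange_halos(subdomains):
    """Copy boundary rows between neighbouring subdomains (the communication step)."""
    for upper, lower in zip(subdomains[:-1], subdomains[1:]):
        upper[-1, :] = lower[1, :]   # lower neighbour's first interior row -> upper's halo
        lower[0, :] = upper[-2, :]   # upper neighbour's last interior row  -> lower's halo

def step(sub, dt=0.1):
    """Update interior cells from their four neighbours (a toy 'physics' kernel)."""
    interior = sub[1:-1, 1:-1]
    laplacian = (sub[:-2, 1:-1] + sub[2:, 1:-1] +
                 sub[1:-1, :-2] + sub[1:-1, 2:] - 4 * interior)
    sub[1:-1, 1:-1] = interior + dt * laplacian

# Four 100x100 subdomains (think: horizontal strips of one larger grid),
# each padded with a one-cell halo of ghost values around it.
rng = np.random.default_rng(0)
subdomains = [np.pad(rng.random((100, 100)), 1) for _ in range(4)]

for _ in range(50):                  # time steps
    exchange_halos(subdomains)       # share results with neighbours first...
    for sub in subdomains:
        step(sub)                    # ...then advance each subdomain locally
```

Shrinking the cells simply means more subdomains and more halo traffic per step, which is why the networking fabric matters as much as raw compute.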
Until recently, HPC workloads ran almost exclusively on CPUs from Intel and AMD. Over time, these CPUs became faster and incorporated more cores. More recently, however, a highly optimised accelerator has become an integral part of HPC systems: the graphics processing unit (GPU), which has increased the performance of certain applications by more than an order of magnitude. GPUs can be found in many of the world’s TOP500 HPC systems. Thousands of applications have been modified to take advantage of thousands of GPU cores simultaneously, with impressive results.
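As a purely illustrative example of what “modifying an application for GPUs” can look like, the sketch below moves a small NumPy computation onto a GPU with CuPy, a library whose API mirrors NumPy. The workload and library choice are assumptions of this sketch, not a description of any particular application mentioned in this article.

```python
# Illustrative only: offloading a small NumPy computation to a GPU with CuPy.
# Real HPC applications apply the same idea to far larger kernels
# (linear algebra, FFTs, stencils) spread across many GPUs.
import numpy as np
import cupy as cp   # requires an NVIDIA GPU and the cupy package

x_cpu = np.random.random((10_000_000,)).astype(np.float32)

# CPU version
y_cpu = np.sqrt(x_cpu) * 2.0 + 1.0

# GPU version
x_gpu = cp.asarray(x_cpu)            # copy the data over PCIe into GPU memory
y_gpu = cp.sqrt(x_gpu) * 2.0 + 1.0   # executed by thousands of GPU threads
result = cp.asnumpy(y_gpu)           # copy the result back to the CPU

assert np.allclose(result, y_cpu, rtol=1e-5)
```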
HPC is becoming integrated into enterprise workflows rather than remaining an entirely separate computing infrastructure. An HPC system will typically be composed of many CPUs, a significant number of GPUs, solid-state drives (SSDs), fast networking, and an entire software ecosystem. An efficient HPC system strikes a balance between the CPUs, GPUs, high-speed memory, and the storage system, all working together. This balance is important because the most expensive resources in an HPC system are the CPUs and GPUs. A well-designed system delivers data to them fast enough that they are never “starved” of work.
Universities and research labs worldwide continue to invest in and build very high-end HPC systems to tackle the most complex challenges humankind faces today. Research labs and educational institutions are awash with the massive amounts of data now readily available to researchers. Areas of high interest that use HPC systems include bioinformatics, cosmology, biology, climate research, mechanical simulation, and financial services.
Ghent University – Turning seven-hour AI experiments into 40 minutes
IDLab is a research lab at Ghent University and the University of Antwerp. The lab wanted to extend its research areas to include AI Robotics, IoT, and data mining. To accomplish this, IDLab determined that new servers were required, with the ability to house several GPUs within a single server enclosure to ensure maximum performance. Early tests indicated that existing applications could run up to ten times faster on GPUs than with CPU-only execution. The shorter run times allow researchers to develop better AI algorithms and get results faster than ever before.
One of the new servers’ requirements was the ability to run multiple jobs on the same server without affecting other applications’ performance. A powerful server was needed with the compute, memory, and GPU capacity to allow this; the challenge was to increase performance by 10x to keep up with current demands. IDLab chose powerful GPU servers that were specifically designed to handle next-generation AI applications. These servers contained two NVIDIA HGX-2 boards that could accommodate eight GPU boards in one server, matched with the appropriate CPU power.
To run AI-based algorithms, the researchers needed to complete various jobs faster so that they could iterate on these algorithms in a timely manner. The chosen server solution helped them cut experiment times from nearly seven hours down to around 40 minutes while still delivering high-quality results.
Understanding and tracking COVID-19 at Goethe University Frankfurt
Another case where high-performance servers were needed to optimise research processes was at Goethe University Frankfurt. Its supercomputing centre is known worldwide for giving a wide range of researchers access to one of the fastest systems in Europe. The architects of the new system determined that the new servers would need to incorporate several GPUs in addition to high-core-count CPUs. A critical design requirement was a very fast communication path between CPU and GPU, utilising the PCIe Gen 4 bus. Goethe University chose servers based on AMD EPYC processors and Radeon Instinct MI50 GPUs for this new HPC system. This combination allows massive amounts of data to be moved between the CPU and the GPUs extremely quickly, at up to 64 GB/second. Once the GPUs have completed their tasks, the results can quickly be sent back to the CPU. Goethe University researchers have been using this new system to track the COVID-19 pandemic worldwide, among other research initiatives. Understanding how COVID-19 spreads throughout a population allows authorities to put policies and action plans in place and to be prepared for similar challenges in the future.
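As a rough sanity check of the 64 GB/second figure, assuming it refers to the combined bidirectional bandwidth of a PCIe Gen 4 x16 link (an assumption of this sketch, not a statement from Goethe University), the back-of-the-envelope calculation below lands at roughly that number.

```python
# Rough sanity check of the "up to 64 GB/second" figure, assuming it means the
# combined bidirectional bandwidth of a PCIe Gen 4 x16 link (an assumption).
lanes = 16
gts_per_lane = 16                 # PCIe 4.0: 16 gigatransfers/s per lane
encoding = 128 / 130              # 128b/130b line-encoding overhead
gbytes_per_dir = lanes * gts_per_lane * encoding / 8   # bits -> bytes
print(f"{gbytes_per_dir:.1f} GB/s per direction, "
      f"{2 * gbytes_per_dir:.1f} GB/s in both directions combined")
# -> roughly 31.5 GB/s per direction, ~63 GB/s bidirectional
```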
Molecular dynamics simulations at Lawrence Livermore National Lab
In the United States, Lawrence Livermore National Laboratory (LLNL) recently expanded a large-scale HPC system to over 11 petaflops. The system is being used to search for treatments and a vaccine for COVID-19, and it also supports computational workloads in genomics and other scientific disciplines. The Corona system, named for the total solar eclipse of 2017, was recently outfitted with a large number of servers containing both AMD EPYC CPUs and Radeon Instinct GPUs. The additional computing capacity will, for example, allow researchers to better handle the computationally intensive molecular dynamics simulations that are critical to understanding the structure and function of the virus, which is ultimately the basis for finding a cure for COVID-19. Molecular dynamics codes have been adapted to take advantage of GPUs, increasing their performance significantly.
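To give a feel for why molecular dynamics is so compute-hungry and why it maps well onto GPUs, the sketch below (not LLNL’s code; a generic Lennard-Jones toy model with made-up parameters) evaluates the all-pairs forces for N atoms, an O(N²) operation that has to be repeated every time step.

```python
# Minimal sketch (not LLNL's code) of the all-pairs interactions that make
# molecular dynamics so compute-hungry: a naive force evaluation for N atoms
# touches N*(N-1)/2 pairs every time step, which maps well onto GPU parallelism.
import numpy as np

def lennard_jones_forces(pos, epsilon=1.0, sigma=1.0):
    """Return per-atom Lennard-Jones forces for positions pos of shape (N, 3)."""
    diff = pos[:, None, :] - pos[None, :, :]          # pairwise displacement vectors
    r2 = np.sum(diff**2, axis=-1)                     # squared pair distances
    np.fill_diagonal(r2, np.inf)                      # ignore self-interaction
    inv_r6 = (sigma**2 / r2)**3
    # F_i = sum_j 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r^2 * (r_i - r_j)
    coeff = 24.0 * epsilon * (2.0 * inv_r6**2 - inv_r6) / r2
    return np.sum(coeff[:, :, None] * diff, axis=1)   # total force on each atom

positions = np.random.default_rng(1).random((512, 3)) * 10.0
forces = lennard_jones_forces(positions)              # shape (512, 3)
```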
Breaking down the walls of limited computing capabilities
HPC in research and academic institutions allows a wide range of researchers to focus on new science without being held back by outdated servers. GPUs are increasingly being used to reduce the time to complete many tasks, enabling new algorithms to be developed and iterated on. Robust HPC systems are being used to investigate a wide range of scientific problems that were previously out of reach due to limited computing capabilities.
Author: Martin Galle, Director FAE at Supermicro