This lesson is still being designed and assembled (Pre-Alpha version)

HPC Intro

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • Question 1

Objectives
  • Objective 1

Simple, inexpensive computing tasks are typically performed sequentially, i.e., instructions are completed one after another in the order that they appear in the code, which is the default paradigm in most programming languages. For larger workloads made up of many independent tasks, it is often more efficient to take advantage of the intrinsically parallel nature of modern processors, which are designed to execute multiple processes simultaneously. Many common programming languages, including Python, support software that is executed in parallel, where multiple CPU cores perform tasks independently.
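As a minimal sketch (using only the Python standard library; the `square` function is just a toy stand-in for a more expensive, independent computation), the example below runs the same work sequentially and then distributed over four worker processes with the `multiprocessing` module:

```python
# Sequential vs. parallel execution of independent tasks in Python.
from multiprocessing import Pool

def square(x):
    """Toy stand-in for a more expensive, independent computation."""
    return x * x

if __name__ == "__main__":
    inputs = range(10)

    # Sequential: each call finishes before the next one starts.
    sequential_results = [square(x) for x in inputs]

    # Parallel: the same calls are distributed across 4 worker processes.
    with Pool(processes=4) as pool:
        parallel_results = pool.map(square, inputs)

    assert sequential_results == parallel_results
```

For a function this cheap, the overhead of starting worker processes outweighs the gain; parallelism pays off when each task does substantial independent work.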

In modern computing, parallel programming has become increasingly essential as computational tasks become more demanding. From protein folding in experimental drug development to galaxy formation and evolution, complex simulations rely on parallel computing to solve some of the most difficult problems in science. Parallel programming, hardware architecture, and systems administration come together in the multidisciplinary field of high-performance computing (HPC). In contrast to running code locally on your home machine, high-performance computing involves connecting to a cluster of computers elsewhere in the world that are networked together in order to run many operations in parallel.

Intro

Computer Architectures

Historically, computer architectures can be divided into two categories: von Neumann and Harvard. In the former, a computer system contains the following components:

  • a Memory Unit (MU), which stores both data and instructions
  • an Arithmetic Logic Unit (ALU), which performs calculations
  • a Control Unit (CU), which interprets instructions and coordinates the other components
  • Input/Output (I/O) devices, which exchange data with the outside world

The ALU takes in data from the MU and performs calculations, while the CU interprets instructions and directs the flow of data to and from the I/O devices, as shown in the diagram below. Because the MU holds both data and instructions and shares a single pathway to the processor, data transfer creates a performance bottleneck, often called the von Neumann bottleneck.

![von Neumann diagram](../fig/vonneumann.png)
<br>
<sub>Diagram of von Neumann architecture, from (\cite{https://onlinelibrary.wiley.com/doi/book/10.1002/9780470932025})</sub>

The Harvard architecture is a variant of the von Neumann design in which instruction and data storage are physically separated, allowing simultaneous access to instructions and data. This partially overcomes the von Neumann bottleneck, and most modern central processing units (CPUs) adopt a form of this architecture.

![Harvard diagram](../fig/harvard.png)
<br>
<sub>Diagram of Harvard architecture, from (\cite{https://onlinelibrary.wiley.com/doi/book/10.1002/9780470932025})</sub>

Performance

Computational performance is largely determined by three components:

  • the processor (CPU), which sets how fast instructions are executed and how many can run in parallel
  • the memory (RAM), which sets how much data can be held ready for immediate use
  • the storage (hard drive or SSD), which sets how much data can be kept and how quickly it can be read and written

Processing astronomical data, building models, and running simulations requires significant computational power. The laptop or PC you’re using right now probably has between 8 and 32 GB of RAM, a processor with 4-10 cores, and a hard drive that can store between 256 GB and 1 TB of data. But what happens if you need to process a dataset that is larger than 1 TB, if the model that has to be loaded into RAM is larger than 32 GB, or if the simulation you are running would take a month to finish on your CPU? You need a bigger computer, or many computers working in parallel.
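A quick back-of-the-envelope estimate often tells you which regime you are in before you try to load anything. The sketch below (the array shape is an invented example, not a real dataset) estimates the in-memory size of a dense array of 64-bit floats:

```python
# Rough estimate of the memory footprint of a dense array of 64-bit floats.
# The shape below is a made-up example, not a real dataset.
n_rows, n_cols = 2_000_000, 5_000   # e.g., 2 million spectra x 5000 channels
bytes_per_value = 8                 # one float64 = 8 bytes

total_bytes = n_rows * n_cols * bytes_per_value
total_gb = total_bytes / 1024**3

print(f"Estimated size: {total_gb:.1f} GB")  # ~74.5 GB, far more than a 32 GB laptop can hold
```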

Flynn’s Taxonomy: a framework for parallel computing

When we talk about parallel computing, it’s helpful to have a framework to classify different types of computer architectures. The most common one is Flynn’s Taxonomy, which was proposed in 1966 (\cite{https://ieeexplore.ieee.org/document/1447203}). It gives a simple vocabulary for describing how computers handle instruction and data streams, and will help us understand why certain programming models are better suited to certain problems.

Flynn’s taxonomy uses four words:

  • Single
  • Multiple
  • Instruction
  • Data

Single or Multiple describes how many simultaneous streams there are, while Instruction and Data name the two kinds of streams a computer processes.

These are combined to describe four main architectures (\cite{https://onlinelibrary.wiley.com/doi/book/10.1002/9780470932025}). For a thorough overview of these, you can refer to the HiPowered book. Let us go over them briefly:

  1. SISD (Single Instruction, Single Data): one processing unit executes a single instruction stream on a single data stream; this is the classic sequential computer.
  2. SIMD (Single Instruction, Multiple Data): the same instruction is applied simultaneously to many data elements, as in vector processors and GPUs.
  3. MISD (Multiple Instruction, Single Data): several instruction streams operate on the same data stream; this design is rare in practice.
  4. MIMD (Multiple Instruction, Multiple Data): multiple processors execute different instructions on different data; most modern parallel systems, including clusters, belong to this category.
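In everyday scientific Python, the distinction between SISD-style and SIMD-style execution is easy to see. The sketch below (assuming NumPy is installed) computes the same result with an element-by-element loop and with a single vectorized operation that can exploit the CPU’s SIMD units:

```python
import numpy as np

data = np.random.rand(1_000_000)

# SISD-style: one instruction operates on one data element at a time.
squared_loop = np.empty_like(data)
for i in range(data.size):
    squared_loop[i] = data[i] * data[i]

# SIMD-style: a single vectorized operation is applied to many elements at once.
squared_vectorized = data * data

assert np.allclose(squared_loop, squared_vectorized)
```

The vectorized version is typically far faster, which is why array-oriented libraries are often a scientist’s first step toward parallelism.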

In addition to this classification, parallel computers can also be organized by how their memory is arranged:

  1. Multiprocessors: Computers with shared memory.
  2. Multicomputers: Computers with distributed memory.

SIMD in Practice: GPUs

An important example of SIMD architecture in modern computing is the GPU (Graphics Processing Unit).

GPUs were originally designed for computer graphics, which is an inherently parallel task (e.g., calculating the color of millions of pixels at once). Researchers soon realized this massive parallelism could be used for general-purpose scientific computing, including physics simulations and training AI models, leading to the term GPGPU (General-Purpose GPU). GPUs allow for significant speedups in data-parallel workloads. The trade-off is that GPUs have a different memory hierarchy (with less cache per core than CPUs), so performance can be limited for algorithms that require frequent or irregular communication between threads.

A CPU consists of a few very powerful cores optimized for complex, sequential tasks. A GPU, in contrast, is made of thousands of simpler cores that are masters of efficiency for data-parallel problems. Because of this, nearly all modern supercomputers are hybrid systems that use both CPUs and GPUs, leveraging the strengths of each.
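As a sketch of what GPU offloading looks like from Python (assuming a CUDA-capable NVIDIA GPU and the third-party CuPy package, which mirrors much of the NumPy API), the same element-wise computation can be moved to the GPU with only a few changes:

```python
import numpy as np

# CuPy is a NumPy-like library that runs array operations on NVIDIA GPUs.
# This sketch assumes a CUDA-capable GPU and that cupy is installed.
import cupy as cp

x_cpu = np.random.rand(10_000_000)

x_gpu = cp.asarray(x_cpu)          # copy the array to GPU memory
y_gpu = cp.sqrt(x_gpu) + x_gpu**2  # thousands of GPU cores process the elements in parallel
y_cpu = cp.asnumpy(y_gpu)          # copy the result back to host (CPU) memory

print(y_cpu[:5])
```

Note that the data must be copied between host and GPU memory; for small arrays this transfer can cost more than the computation saves.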

Supercomputers vs. Computing Clusters

In the early days of HPC, a “supercomputer” was often a single, monolithic machine with custom vector processors. Today that has completely changed: the vast majority of systems are clusters. Let us define some terms associated with this:

  • Core: an individual processing unit within a CPU; modern CPUs contain many cores.
  • Node: a single computer within the cluster, with its own CPUs (and possibly GPUs), memory, and local storage.
  • Cluster: a collection of nodes connected by a high-speed network and managed as a single system.
  • Supercomputer: today, usually just a very large cluster; the term refers to scale rather than to a fundamentally different design.
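On a cluster (a distributed-memory system), the usual pattern is that the same program is launched as many cooperating processes spread across the nodes, which communicate by passing messages. The sketch below assumes the third-party mpi4py package and an MPI library are available, and that the script is launched with something like `mpirun -n 4 python hello_mpi.py` (the script name is just an example):

```python
# Minimal MPI "hello world": each process reports its rank and node.
# Assumes mpi4py and an MPI library are installed; launch with, e.g.:
#   mpirun -n 4 python hello_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()              # unique ID of this process
size = comm.Get_size()              # total number of processes in the job
name = MPI.Get_processor_name()     # typically the hostname of the node

print(f"Process {rank} of {size} running on node {name}")
```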

Network Topology for Clusters

Since a cluster is just a collection of nodes, the way these nodes are connected (called the network topology) is critical to performance. If any program needs to send data between nodes frequently, a slow or inefficient network will create a major bottleneck.

Common topologies for HPC include:

  • Mesh and Torus: nodes are arranged in a regular grid and connected to their nearest neighbours; a torus additionally wraps the edges around.
  • Tree and Fat tree: nodes sit at the leaves of a hierarchy of switches, with higher-bandwidth links towards the root to avoid congestion.
  • Dragonfly: nodes are grouped into tightly connected sets of routers, with each group linked to the others to keep the number of hops small.

Other topologies, which are less common in HPC systems, include Bus, Ring, Star, Hypercube, Fully connected, Crossbar, and Multistage interconnection networks. More information can be found in the HiPowered book.

Never Run Computations on the Login Node!

When you connect to an HPC cluster, you land on a login node. This node is a shared resource for all users to compile code, manage files, and submit jobs to the workload manager. It is not designed for heavy computation! Running an intensive program on the login node will slow it down for everyone and is a classic mistake for new users. Your job must be submitted through the workload manager (e.g., using sbatch in SLURM) to run on the compute nodes.
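As a sketch of what a job submission looks like under SLURM (the job name, resource values, and `my_script.py` are placeholders, and the available partitions and limits depend on your cluster), a minimal batch script is submitted with `sbatch job.sh`:

```bash
#!/bin/bash
#SBATCH --job-name=my_analysis        # name shown in the queue
#SBATCH --nodes=1                     # number of nodes requested
#SBATCH --ntasks=1                    # number of tasks (processes)
#SBATCH --cpus-per-task=4             # CPU cores for this task
#SBATCH --mem=8G                      # memory per node
#SBATCH --time=01:00:00               # wall-clock limit (HH:MM:SS)
#SBATCH --output=my_analysis_%j.log   # %j is replaced by the job ID

# Everything below this line runs on a compute node, not the login node.
python my_script.py
```

The directives at the top tell the workload manager what resources to reserve; the commands after them run only once the job has been scheduled onto a compute node.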

File system

HPC clusters use a few different locations and formats for storage:

  • Home directory (/home): small and often backed up; intended for code, configuration files, and small amounts of data.
  • Scratch or work space: large, fast storage for data that is actively being processed; typically not backed up and often purged on a regular schedule.
  • Local node storage (e.g., /tmp): disk attached directly to a compute node; fast, but only accessible while your job is running on that node.

\cite{https://www.hpc.iastate.edu/guides/nova/storage} \cite{https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=140938}

Which computer for which task?

Key Points

  • Keypoint 1