
Resource requirements

Overview

Teaching: 30 min
Exercises: 10 min
Questions
  • What is the difference between requesting CPU and GPU resources with Slurm?

  • How can I optimize my Slurm script to request the most appropriate resources for my specific task?

Objectives
  • Understand different types of computational workloads and their resource requirements

  • Write optimized Slurm job scripts for sequential, parallel, and GPU workloads

  • Monitor and analyze resource utilization

  • Apply best practices for efficient resource allocation

Understanding Resource Requirements

Different computational tasks have varying resource requirements. Understanding these patterns is crucial for efficient HPC usage.

Types of Workloads

CPU-bound workloads: Tasks that primarily use computational power

Real-world astronomy example:
Calculating theoretical stellar evolution tracks using the MESA code.
Each stellar model requires computationally intensive numerical integration of the stellar structure equations over millions of time steps, so the runtime is dominated by CPU work.


Memory-bound workloads: Tasks limited by memory capacity or memory access speed

Real-world astronomy example:
Processing full-sky Cosmic Microwave Background (CMB) maps from Planck at high resolution.
The HEALPix maps are large, and operations like spherical harmonic transforms require large amounts of RAM to store intermediate matrices.


I/O-bound workloads: Tasks limited by disk or network operations

Real-world astronomy example:
Stacking thousands of raw Rubin Observatory images to improve signal-to-noise ratio for faint galaxy detection.
The bottleneck is reading the large image files from storage and writing processed results back.


GPU-accelerated workloads: Tasks that can exploit the massive parallelism of graphics processors

Real-world astronomy example:
Training a convolutional neural network to classify transient events in ZTF light curves.
The matrix multiplications and convolutions benefit greatly from GPU acceleration.


Types of Jobs and Resources

When you run work on an HPC cluster, your job’s type determines how it will be scheduled and what resources it will use. Broadly, jobs fall into three categories: serial, parallel, and GPU-accelerated.

Once you know your job type, you can select the correct SLURM partition (queue) and request the right resources:

Job Type   SLURM Partition   Key SLURM Options          Example Use Case
Serial     computes_thin     --partition, no MPI        Single-thread tensor calc
Parallel   computes_thin     -N, -n, mpirun             MPI simulation
GPU        gpu               --gpus, --cpus-per-task    Deep learning training
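
The following minimal batch-script sketches illustrate how the options in the table above might be combined for each job type. The partition names (computes_thin, gpu) are taken from the table, but the job names, resource amounts, and program names (my_serial_analysis.py, my_mpi_sim, train_cnn.py) are placeholders to replace with values appropriate for your cluster and workload.

A serial job uses a single task on one core:

  #!/bin/bash
  #SBATCH --job-name=serial_calc
  #SBATCH --partition=computes_thin
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=1
  #SBATCH --mem=4G
  #SBATCH --time=01:00:00

  # Single-threaded program; replace with your own script or executable
  python my_serial_analysis.py

A parallel (MPI) job requests several tasks, possibly spread across multiple nodes:

  #!/bin/bash
  #SBATCH --job-name=mpi_sim
  #SBATCH --partition=computes_thin
  #SBATCH -N 2
  #SBATCH -n 64
  #SBATCH --time=04:00:00

  # Launch one MPI rank per allocated task
  mpirun -np "$SLURM_NTASKS" ./my_mpi_sim

A GPU job requests one or more GPUs plus a few CPU cores for data loading:

  #!/bin/bash
  #SBATCH --job-name=gpu_train
  #SBATCH --partition=gpu
  #SBATCH --gpus=1
  #SBATCH --cpus-per-task=8
  #SBATCH --mem=32G
  #SBATCH --time=08:00:00

  # Training script assumed to use a CUDA-enabled framework such as PyTorch
  python train_cnn.py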

Choosing the Right Node

Decision chart for choosing node types

Exercise: Classify the Job Type

Using the decision chart above, classify each of the following HPC tasks into CPU-bound, Memory-bound, I/O-bound, or GPU-accelerated, and suggest the most appropriate node type.

  1. Running a grid of supernova explosion models, varying the star mass, explosion energy, circumstellar density, and radius. Each model runs independently on a separate CPU.

  2. Simulating the galactic stellar population across many spatial grid points for the entire galactic plane. Each grid point is independent; results are aggregated at the end.

  3. Running the 2D-Hybrid pipeline for periodicity detection on ZTF DR6 quasar light curves:
    • Processing objects in batches of 1000 per job.
    • Each object is analyzed with wavelet + cross-band methods.
  4. Simulating and fitting microlensing light curves, where each fit is run on a separate CPU, and runtime varies between light curves.

  5. Distributing difference-imaging runs for transient detection, either with PyTorch on GPUs or with MPI/HTCondor on CPUs.

Discussion:
Are some tasks “mixed-type”? Which resources are more critical for performance? How would you prioritize CPU vs GPU vs memory for these workloads?

Solutions

  1. Grid of supernova explosion models
    • Type: CPU-bound (parallel)
    • Reason: Each model is independent; heavy numerical computations per model.
    • Node type: Regular node (multi-core CPU).
    • SLURM options: Use multiple CPUs; each job in the array runs a separate model with --array=1-100 (see the job-array sketch after this list).
  2. Galactic stellar population simulation
    • Type: CPU-bound (embarrassingly parallel)
    • Reason: Each spatial grid point is independent; computation-heavy but not memory-intensive.
    • Node type: Regular node or multi-node if very large. CPUs are sufficient.
    • SLURM options: Distribute grid points across CPU cores; use MPI or job arrays with --ntasks=64.
  3. 2D-Hybrid pipeline for ZTF quasar light curves
    • Type: CPU-bound (parallel)
    • Reason: Wavelet + cross-band analysis is CPU-only; each batch processed independently.
    • Node type: Regular node (multi-core CPU).
    • SLURM options: Array jobs with --cpus-per-task for intra-batch parallelization.
  4. Microlensing light curve fitting
    • Type: CPU-bound (parallel, varying runtimes)
    • Reason: Each light curve fit is independent; some fits take longer than others.
    • Node type: Regular node, or an SMP node if shared memory is needed for multithreaded fits.
    • SLURM options: Array jobs with --array=1-999 for fitting all light curves.
  5. Difference imaging runs with PyTorch / MPI / HTCondor
    • Type: GPU-accelerated (if using PyTorch) or CPU-bound (MPI/HTCondor alternative)
    • Reason: GPU use accelerates image subtraction; MPI/HTCondor distributes CPU tasks efficiently.
    • Node type: GPU node (for PyTorch) or regular nodes (for MPI/HTCondor).
    • SLURM options: --gpus for GPU tasks, -N/-n for MPI tasks.
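
To make solution 1 concrete, here is a minimal sketch of a job-array script for the supernova model grid. The partition name comes from the table earlier in this lesson, while params.txt and run_sn_model.py are hypothetical: params.txt would hold one line per parameter combination (mass, explosion energy, circumstellar density, radius), and run_sn_model.py would run a single explosion model.

  #!/bin/bash
  #SBATCH --job-name=sn_grid
  #SBATCH --partition=computes_thin
  #SBATCH --array=1-100
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=1
  #SBATCH --time=02:00:00

  # Each array task receives a unique SLURM_ARRAY_TASK_ID (1-100), which
  # selects one line, i.e. one parameter combination, from params.txt
  PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)

  # Run one independent model per array task (placeholder driver script)
  python run_sn_model.py $PARAMS

The same pattern applies to solutions 3 and 4: only the array range, the parameter file, and the per-task command change.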

Key Points

  • Different computational models (sequential, parallel, GPU) significantly impact runtime and efficiency.

  • Sequential CPU execution is simple but inefficient for large parameter spaces.

  • Parallel CPU (e.g., MPI or OpenMP) reduces runtime by distributing tasks but is limited by CPU core counts and communication overhead.

  • GPU computing can drastically accelerate tasks with massively parallel workloads like grid-based simulations.

  • Choosing the right computational model depends on the problem structure, resource availability, and cost-efficiency.

  • Effective Slurm job scripts should match the workload to the hardware: CPUs for serial/parallel, GPUs for highly parallelizable tasks.

  • Monitoring tools (like nvidia-smi, seff, top) help validate whether the resource request matches the actual usage; see the example commands below.

  • Optimizing resource usage minimizes wait times in shared environments and improves overall throughput.
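
As a rough illustration of that monitoring step, the commands below show how you might check whether a job’s resource request matched its actual usage. The job ID 123456 is a placeholder; substitute the ID reported when you submit with sbatch.

  # CPU and memory efficiency summary for a completed job
  seff 123456

  # Detailed accounting: elapsed time, peak memory, total CPU time, state
  sacct -j 123456 --format=JobID,Elapsed,MaxRSS,TotalCPU,State

  # On the compute node of a running GPU job: current GPU utilization and memory
  nvidia-smi

  # On the compute node of a running CPU job: live per-process CPU and memory use
  top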