
Resource requirements

Overview

Teaching: 30 min
Exercises: 10 min
Questions
  • What is the difference between requesting CPU and GPU resources with Slurm?

  • How can I optimize my Slurm script to request the most appropriate resources for my specific task?

Objectives
  • Understand different types of computational workloads and their resource requirements

  • Write optimized Slurm job scripts for sequential, parallel, and GPU workloads

  • Monitor and analyze resource utilization

  • Apply best practices for efficient resource allocation

Understanding Resource Requirements

Different computational tasks have varying resource requirements. Understanding these patterns is crucial for efficient HPC usage.

Types of Workloads

CPU-bound workloads: Tasks that primarily use computational power

Real-world astronomy example:
Calculating theoretical stellar evolution tracks using the MESA code.
Each stellar model requires computationally intensive numerical integration of the stellar structure equations over millions of time steps, so the runtime is dominated by CPU work.


Memory-bound workloads: Tasks limited by memory capacity or memory access speed

Real-world astronomy example:
Processing full-sky Cosmic Microwave Background (CMB) maps from Planck at high resolution.
The HEALPix maps are large, and operations like spherical harmonic transforms require large amounts of RAM to store intermediate matrices.


I/O-bound workloads: Tasks limited by disk or network operations

Real-world astronomy example:
Stacking thousands of raw Rubin Observatory images to improve signal-to-noise ratio for faint galaxy detection.
The bottleneck is reading the large image files from storage and writing processed results back.


GPU-accelerated workloads: Tasks that can exploit the massive parallelism of graphics processors

Real-world astronomy example:
Training a convolutional neural network to classify transient events in ZTF light curves.
The matrix multiplications and convolutions benefit greatly from GPU acceleration.


Types of Jobs and Resources

When you run work on an HPC cluster, your job’s type determines how it will be scheduled and what resources it will use. Broadly, jobs fall into three categories: serial, parallel, and GPU-accelerated.

Once you know your job type, you can select the correct SLURM partition (queue) and request the right resources:

Job Type   SLURM Partition   Key SLURM Options          Example Use Case
Serial     computes_thin     --partition, no MPI        Single-thread tensor calc
Parallel   computes_thin     -N, -n, mpirun             MPI simulation
GPU        gpu               --gpus, --cpus-per-task    Deep learning training
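
The following minimal batch-script sketches illustrate how the options in the table above might be combined for each job type. The partition names (computes_thin, gpu) are taken from the table, but the job names, resource amounts, and program names (my_serial_analysis.py, my_mpi_sim, train_cnn.py) are placeholders to replace with values appropriate for your cluster and workload.

A serial job uses a single task on one core:

  #!/bin/bash
  #SBATCH --job-name=serial_calc
  #SBATCH --partition=computes_thin
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=1
  #SBATCH --mem=4G
  #SBATCH --time=01:00:00

  # Single-threaded program; replace with your own script or executable
  python my_serial_analysis.py

A parallel (MPI) job requests several tasks, possibly spread across multiple nodes:

  #!/bin/bash
  #SBATCH --job-name=mpi_sim
  #SBATCH --partition=computes_thin
  #SBATCH -N 2
  #SBATCH -n 64
  #SBATCH --time=04:00:00

  # Launch one MPI rank per allocated task
  mpirun -np "$SLURM_NTASKS" ./my_mpi_sim

A GPU job requests one or more GPUs plus a few CPU cores for data loading:

  #!/bin/bash
  #SBATCH --job-name=gpu_train
  #SBATCH --partition=gpu
  #SBATCH --gpus=1
  #SBATCH --cpus-per-task=8
  #SBATCH --mem=32G
  #SBATCH --time=08:00:00

  # Training script assumed to use a CUDA-enabled framework such as PyTorch
  python train_cnn.py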

Choosing the Right Node

Decision chart for choosing node types

Exercise: Classify the Job Type

Using the decision chart above, classify each of the following HPC tasks into CPU-bound, Memory-bound, I/O-bound, or GPU-accelerated, and suggest the most appropriate node type.

  1. Running a grid of supernova explosion models, varying the star mass, explosion energy, circumstellar density, and radius. Each model runs independently on a separate CPU.

  2. Simulating the galactic stellar population across many spatial grid points for the entire galactic plane. Each grid point is independent; results are aggregated at the end.

  3. Running the 2D-Hybrid pipeline for periodicity detection on ZTF DR6 quasar light curves:
    • Processing objects in batches of 1000 per job.
    • Each object is analyzed with wavelet + cross-band methods.
  4. Simulating and fitting microlensing light curves, where each fit is run on a separate CPU, and runtime varies between light curves.

  5. Distributing difference-imaging runs for transient detection, either with PyTorch on GPUs or with MPI/HTCondor on CPUs.

Discussion:
Are some tasks “mixed-type”? Which resources are more critical for performance? How would you prioritize CPU vs GPU vs memory for these workloads?

Solutions

  1. Grid of supernova explosion models
    • Type: CPU-bound (parallel)
    • Reason: Each model is independent; heavy numerical computations per model.
    • Node type: Regular node (multi-core CPU).
    • SLURM options: Use multiple CPUs; each job in the array runs a separate model with --array=1-100 (see the job-array sketch after this list).
  2. Galactic stellar population simulation
    • Type: CPU-bound (embarrassingly parallel)
    • Reason: Each spatial grid point is independent; computation-heavy but not memory-intensive.
    • Node type: Regular node or multi-node if very large. CPUs are sufficient.
    • SLURM options: Distribute grid points across CPU cores; use MPI or job arrays with --ntasks=64.
  3. 2D-Hybrid pipeline for ZTF quasar light curves
    • Type: CPU-bound (parallel)
    • Reason: Wavelet + cross-band analysis is CPU-only; each batch processed independently.
    • Node type: Regular node (multi-core CPU).
    • SLURM options: Array jobs with --cpus-per-task for intra-batch parallelization.
  4. Microlensing light curve fitting
    • Type: CPU-bound (parallel, varying runtimes)
    • Reason: Each light curve fit is independent; some fits take longer than others.
    • Node type: Regular node, or an SMP node if shared memory is needed for multithreaded fits.
    • SLURM options: Array jobs with --array=1-999 for fitting all light curves.
  5. Difference imaging runs with PyTorch / MPI / HTCondor
    • Type: GPU-accelerated (if using PyTorch) or CPU-bound (MPI/HTCondor alternative)
    • Reason: GPU use accelerates image subtraction; MPI/HTCondor distributes CPU tasks efficiently.
    • Node type: GPU node (for PyTorch) or regular nodes (for MPI/HTCondor).
    • SLURM options: --gpus for GPU tasks, -N/-n for MPI tasks.
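
To make solution 1 concrete, here is a minimal sketch of a job-array script for the supernova model grid. The partition name comes from the table earlier in this lesson, while params.txt and run_sn_model.py are hypothetical: params.txt would hold one line per parameter combination (mass, explosion energy, circumstellar density, radius), and run_sn_model.py would run a single explosion model.

  #!/bin/bash
  #SBATCH --job-name=sn_grid
  #SBATCH --partition=computes_thin
  #SBATCH --array=1-100
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=1
  #SBATCH --time=02:00:00

  # Each array task receives a unique SLURM_ARRAY_TASK_ID (1-100), which
  # selects one line, i.e. one parameter combination, from params.txt
  PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)

  # Run one independent model per array task (placeholder driver script)
  python run_sn_model.py $PARAMS

The same pattern applies to solutions 3 and 4: only the array range, the parameter file, and the per-task command change.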

Key Points

  • Different computational models (sequential, parallel, GPU) significantly impact runtime and efficiency.

  • Sequential CPU execution is simple but inefficient for large parameter spaces.

  • Parallel CPU (e.g., MPI or OpenMP) reduces runtime by distributing tasks but is limited by CPU core counts and communication overhead.

  • GPU computing can drastically accelerate tasks with massively parallel workloads like grid-based simulations.

  • Choosing the right computational model depends on the problem structure, resource availability, and cost-efficiency.

  • Effective Slurm job scripts should match the workload to the hardware: CPUs for serial/parallel, GPUs for highly parallelizable tasks.

  • Monitoring tools (like nvidia-smi, seff, top) help validate whether the resource request matches the actual usage; see the example commands below.

  • Optimizing resource usage minimizes wait times in shared environments and improves overall throughput.
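
As a rough illustration of that monitoring step, the commands below show how you might check whether a job’s resource request matched its actual usage. The job ID 123456 is a placeholder; substitute the ID reported when you submit with sbatch.

  # CPU and memory efficiency summary for a completed job
  seff 123456

  # Detailed accounting: elapsed time, peak memory, total CPU time, state
  sacct -j 123456 --format=JobID,Elapsed,MaxRSS,TotalCPU,State

  # On the compute node of a running GPU job: current GPU utilization and memory
  nvidia-smi

  # On the compute node of a running CPU job: live per-process CPU and memory use
  top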