This lesson is still being designed and assembled (Pre-Alpha version)

Introduction to High Performance Computing for astronomical software development

Setting the Scene

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What are we teaching in this course?

  • What motivated the selection of topics covered in the course?

Objectives
  • Setting the scene and expectations

  • Making sure everyone has all the necessary software installed

Introduction

The course is organised into the following sections:

Section 1: HPC basics

Section 2: HPC Bura

Before We Start

A few notes before we start.

Prerequisite Knowledge

This is an intermediate-level software development course intended for people who have already been developing code in Python (or other languages) and applying it to their own problems after gaining basic software development skills. You are therefore expected to have some prerequisite knowledge of the topics covered, as outlined at the beginning of the lesson. Check out this quiz to help you test your prior knowledge and determine whether this course is for you.

Setup, Common Issues & Fixes

Have you set up and installed all the tools and accounts required for this course? Check the list of common issues, fixes & tips if you experience any problems running any of the tools you installed - your issue may be solved there.

Compulsory and Optional Exercises

Exercises are a crucial part of this course and the narrative. They are used to reinforce the points taught and give you an opportunity to practice things on your own. Please do not be tempted to skip exercises as that will get your local software project out of sync with the course and break the narrative. Exercises that are clearly marked as “optional” can be skipped without breaking things but we advise you to go through them too, if time allows. All exercises contain solutions but, wherever possible, try and work out a solution on your own.

Outdated Screenshots

Throughout this lesson we will make use of and show content from various interfaces, e.g. websites, locally installed software, the command line, etc. These are evolving tools and platforms, always adding new features and new visual elements. Screenshots in the lesson may therefore become out of sync, referring to or showing content that no longer exists or differs from what you see on your machine. If during the lesson you find screenshots that no longer match what you see, or that differ significantly from it, please open an issue describing what you see and how it differs from the lesson content. Feel free to add as many screenshots as necessary to clarify the issue.

Let Us Know About the Issues

The original materials were adapted specifically for this workshop. They have not been used before, and it is possible that they contain typos, code errors, or unclear or under-explained passages. Please let us know about any such issues; it will help us improve the materials and make the next workshop better.

$ cd ~/InterPython_Workshop_Example/data
$ ls -l
total 24008
-rw-rw-r-- 1 alex alex 23686283 Jan 10 20:29 kepler_RRLyr.csv
-rw-rw-r-- 1 alex alex   895553 Jan 10 20:29 lsst_RRLyr.pkl
-rw-rw-r-- 1 alex alex   895553 Jan 10 20:29 lsst_RRLyr_protocol_4.pkl
...

Exercise

Exercise task

Solution

Exercise solution

 code example
 ...

Key Points

  • Keypoint 1

  • Keypoint 2


Section 1: HPC basics

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • Question 1

Objectives
  • Objective 1

Section overview, what it’s about, tools we’ll use, info we’ll learn.

Key Points

  • Keypoint 1


HPC Intro

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • Question 1

Objectives
  • Objective 1

Simple, inexpensive computing tasks are typically performed sequentially, i.e., instructions are completed one after another in the order they appear in the code, which is the default paradigm in most programming languages. For larger workloads that require many operations to be executed, it is often more efficient to take advantage of the intrinsically parallel nature of most modern processors, which are designed to execute multiple processes simultaneously. Many common programming languages, including Python, support software that executes in parallel, where multiple CPU cores perform tasks independently.

In modern computing, parallel programming has become increasingly essential as computational tasks become more demanding. From protein folding in experimental drug development to galaxy formation and evolution, complex simulations rely on parallel computing to solve some of the most difficult problems in science. Parallel programming, hardware architecture, and systems administration come together in the multidisciplinary field of high-performance computing (HPC). In contrast to running code locally on your own machine, high-performance computing involves connecting to a cluster of computers, often elsewhere in the world, that are networked together in order to run many operations in parallel.

Intro

Computer Architectures

Historically, computer architectures can be divided into two categories, von Neumann and Harvard. In the former, a computer system contains the following components: a memory unit (MU) holding both data and instructions, an arithmetic logic unit (ALU), a control unit (CU), and input/output (I/O) devices.

The ALU takes in data from the MU and performs calculations, while the CU interprets instructions and directs the flow of data to and from the I/O devices, as shown in the diagram below. The MU contains all of the data and instructions, which creates a performance bottleneck related to data transfer.

![von Neumann diagram](../fig/vonneumann.png)
<br>
<sub>Diagram of von Neumann architecture, from (\cite{https://onlinelibrary.wiley.com/doi/book/10.1002/9780470932025})</sub>

The Harvard architecture is a variant of the von Neumann design in which instruction and data storage are physically separated, allowing simultaneous access to instructions and data. This partially overcomes the von Neumann bottleneck, and most modern central processing units (CPUs) adopt this architecture.

![Harvard diagram](../fig/harvard.png)
<br>
<sub>Diagram of Harvard architecture, from (\cite{https://onlinelibrary.wiley.com/doi/book/10.1002/9780470932025})</sub>

Performance

Computational performance is largely determined by three components: processing speed (how fast the CPU can execute instructions), memory (how much data fits in RAM and how quickly it can be accessed), and storage (how much data can be kept on disk and how quickly it can be read and written).

Processing astronomical data, building models and running simulations requires significant computational power. The laptop or PC you’re using right now probably has between 8 and 32 GB of RAM, a processor with 4-10 cores, and a hard drive that can store between 256 GB and 1 TB of data. But what happens if you need to process a dataset that is larger than 1 TB, or if the model that has to be loaded into RAM is larger than 32 GB, or if the simulation you are running would take a month to compute on your CPU? You need a bigger computer, or you need many computers working in parallel.

Flynn’s Taxonomy: a framework for parallel computing

When we talk about parallel computing, it’s helpful to have a framework to classify different types of computer architectures. The most common one is Flynn’s Taxonomy, which was proposed in 1966 (\cite{https://ieeexplore.ieee.org/document/1447203}). It gives a simple vocabulary for describing how computers handle tasks, and will help us in understanding how certain programming models are better for certain problems.

Flynn’s taxonomy uses four words: Single, Multiple, Instruction, and Data.

These are combined to describe four main architectures (\cite{https://onlinelibrary.wiley.com/doi/book/10.1002/9780470932025}): SISD, SIMD, MISD, and MIMD. For a thorough overview of these, you can refer to the HIPOWERED book. Let us go over them briefly:

  • SISD (Single Instruction, Single Data): one processing unit executes one instruction stream on one data stream; the classic sequential computer.
  • SIMD (Single Instruction, Multiple Data): one instruction stream is applied to many data elements at once, as in vector units and GPUs.
  • MISD (Multiple Instruction, Single Data): multiple instruction streams operate on a single data stream; rarely used in practice.
  • MIMD (Multiple Instruction, Multiple Data): multiple processors execute independent instruction streams on independent data; this describes most modern multicore machines and clusters.
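To build intuition for the SISD/SIMD distinction, here is a minimal Python sketch: the explicit loop touches one element per step (SISD-style), while the NumPy expression states a single operation over the whole array, which NumPy can map onto vectorized (SIMD) hardware instructions.

import numpy as np

b = np.random.rand(1_000_000)
c = np.random.rand(1_000_000)

# SISD-style: one instruction applied to one data element at a time
a_loop = np.empty_like(b)
for i in range(b.size):
    a_loop[i] = b[i] + c[i]

# SIMD-style: one operation expressed over the whole data set at once
a_vec = b + c

assert np.allclose(a_loop, a_vec)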

In addition to these, parallel computers can also be classified by how their memory is organized:

  1. Multiprocessors: Computers with shared memory.
  2. Multicomputers: Computers with distributed memory.

SIMD in Practice: GPUs

An important example of SIMD architecture in modern computing is the GPU (Graphics Processing Unit).

GPUs were originally designed for computer graphics, which is an inherently parallel task (e.g., calculating the color of millions of pixels at once). Researchers soon realized this massive parallelism could be used for general-purpose scientific computing, including physics simulations and training AI models, leading to the term GPGPU (General-Purpose GPU). These allow for significant speedups in “data-parallel” models. The trade-off is that GPUs have a different memory hierarchy (with less cache per core compared to CPUs), meaning performance can be limited by algorithms that require frequent or irregular communication between threads.

A CPU consists of a few very powerful cores optimized for complex, sequential tasks. A GPU, in contrast, is made of thousands of simpler cores that are masters of efficiency for data-parallel problems. Because of this, nearly all modern supercomputers are hybrid systems that use both CPUs and GPUs, leveraging the strengths of each.

Supercomputers vs. Computing Clusters

In the early days of HPC, a “supercomputer” was often a single, monolithic machine with custom vector processors. Today, that has completely changed: the vast majority of systems are clusters. Let us define some terms associated with this:

  • Node: an individual computer within the cluster, with its own processors and memory.
  • Core: an individual processing unit within a CPU; each node typically contains several multi-core CPUs.
  • Cluster: a collection of nodes connected by a fast network and used together as a single system.

Network Topology for Clusters

Since a cluster is just a collection of nodes, the way these nodes are connected (called the network topology) is critical to performance. If any program needs to send data between nodes frequently, a slow or inefficient network will create a major bottleneck.

Common topologies for HPC include meshes and tori, trees (in particular the fat tree), and dragonfly networks.

Other topologies, less common in HPC systems, include Bus, Ring, Star, Hypercube, Fully connected, Crossbar and Multistage interconnection networks. More information can be found in the HiPowered book.

Never Run Computations on the Login Node!

When you connect to an HPC cluster, you land on a login node. This node is a shared resource for all users to compile code, manage files, and submit jobs to the workload manager. It is not designed for heavy computation! Running an intensive program on the login node will slow it down for everyone and is a classic mistake for new users. Your job must be submitted through the workload manager (e.g., using sbatch in SLURM) to run on the compute nodes.
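For example, rather than launching a program directly, you hand a batch script to the scheduler and then watch the queue (a minimal SLURM sketch; the script name is illustrative):

$ sbatch my_job.sh        # submit the batch script to the scheduler
Submitted batch job 123456
$ squeue -u $USER         # check the state of your queued and running jobs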

File system

HPC clusters provide a few different storage locations, each with a different purpose: a home directory (small, usually backed up, for code and configuration), scratch or work space (large and fast, for active data, but often purged regularly and not backed up), and project or archive storage for long-term, high-volume data.

\cite{https://www.hpc.iastate.edu/guides/nova/storage} \cite{https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=140938}
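Two quick commands for checking space (the exact quota tools differ from site to site, and the directory name is illustrative):

$ df -h $HOME       # free space on the file system holding your home directory
$ du -sh data/      # total size of a directory tree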

Which computer for which task?

Key Points

  • Keypoint 1


Bura access

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • Question 1

Objectives
  • Objective 1

Intro

Paragraph 1

Exercise

Exercise task

Solution

Exercise solution

 code example
 ...

Key Points

  • Keypoint 1


Command line basics

Overview

Teaching: XX min
Exercises: YY min
Questions
  • What command line skills do I need to work with data on High Performance Computing (HPC) systems?

Objectives
  • Learn essential CLI commands used in data management and processing on HPC

The top 10 basic commands to learn

CLI stands for Command Line Interface.

It is a way to interact with a computer program by typing text commands into a terminal or console window, instead of using a graphical user interface (GUI) with buttons and menus.

When working with large datasets, pipeline logs, and configuration files, mastering the command line is essential. Whether you’re navigating a High Performance Computing (HPC) repository, inspecting files, or debugging processing failures, these Unix commands will be indispensable.

The following are general-purpose commands, and we may add LSST-specific notes where applicable.

Working with LSST data often involves accessing large-scale datasets stored in hierarchical directories, using symbolic links for shared data, and scripting reproducible data analysis pipelines. These are the fundamental commands every LSST astronomer should know.

File Preparation: needed to run the later exercises

# Make a dummy data directory and populate it
mkdir -p 1.IntroHPC/1.CLI
echo "dummy input" > 1.IntroHPC/1.CLI/test.in
echo "file list" > 1.IntroHPC/1.CLI/test.files
touch 1.IntroHPC/1.CLI/14si.pspnc

Directory and File Operations

Setup (run once before these examples):

$ mkdir -p lsst_data/raw
$ cd lsst_data
$ touch image01.fits
$ echo "instrument: LATISS" > config.yaml
$ echo -e "INFO: Init\nFATAL: Calibration failed" > job.log

ls

List contents of a directory. Useful flags: -a shows hidden entries, -l uses the long (detailed) format, and -F marks directories and links.

$ ls -alF

pwd, cd

To check and change the current directory:

$ pwd
$ cd lsst_data/raw

mkdir, tree

Create directories and visualize structure:

$ mkdir -p repo/gen3/raw/20240101
$ tree repo/gen3

File Manipulation

cp, mv, rm

Basic operations:

$ cp image01.fits image02.fits
$ mv image02.fits image_raw.fits
$ rm image_raw.fits

ln

Create symbolic links to avoid data duplication:

$ ln -s /datasets/lsst/raw/image01.fits ./image01.fits

Viewing and Extracting Data

cat, less, grep

View and search YAML config or log files:

$ cat config.yaml
$ less job.log
$ grep "FATAL" job.log

Permissions and Metadata

chmod, chown, stat

Manage and inspect file attributes:

$ chmod 644 config.yaml
$ stat image01.fits

LSST-Specific Use Cases

Familiarity with bash, grep, find, and awk will accelerate your workflow.


Exercises

Exercise 1: Set up LSST-style directory

  1. Create a folder structure:
    lsst_cli/
    ├── visit001/
    │   ├── raw/
    │   ├── calexp/
    │   └── logs/
    ├── visit002/
    │   ├── raw/
    │   ├── calexp/
    │   └── logs/
    
  2. Populate each raw/ with image01.fits, and create a symbolic link to calexp.fits in calexp/.

  3. Add a process.yaml and log file in each logs/.

Use tree to verify.

Exercise 2: Analyze Logs

Using grep and less, identify all lines with “WARNING” or “FATAL” in the log files across visits.


Further Learning

Explore the following CLI tools in more detail:

ls

List all the files in a directory. Linux, like many operating systems, organizes data into files and directories (also called folders).

$ ls
file0a  file0b  folder1  folder2 link0a  link2a

Some terminals offer color output so you can differentiate normal files from folders. You can make the difference clearer with:

$ ls -aCF
./  ../  file0a  file0b  folder1/  folder2/ link0a@  link2a@

You will see two extra entries, "." and "..". Those are special folders that refer to the current folder and the folder one level up in the tree. Directories carry the suffix "/". Symbolic links, a kind of shortcut to other files or directories, are indicated with the symbol "@".

Another option to get more information about the files in the system is:

$ ls -al
total 16
drwxr-xr-x    5 andjelka  staff   160 Jun 16 08:53 .
drwxr-xr-x+ 273 andjelka  staff  8736 Jun 16 08:52 ..
-rw-r--r--    1 andjelka  staff    19 Jun 16 08:53 config.yaml
-rw-r--r--    1 andjelka  staff     0 Jun 16 08:53 image01.fits
-rw-r--r--    1 andjelka  staff    37 Jun 16 08:53 job.log

The characters in the first column indicate the file type and permissions. The first character is “d” for directories, “l” for symbolic links and “-” for normal files. The next 3 characters are the “read”, “write” and “execute” permissions for the owner. The next 3 are for the group, and the final 3 are for others. “Execute” on a file means it can be run as a script or binary executable; on a directory it means that you can enter it and access its contents.
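For example, granting the owner execute permission changes the first permission triplet (the file name and output shown are illustrative):

$ chmod u+x run_pipeline.sh
$ ls -l run_pipeline.sh
-rwxr--r--  1 user  staff  120 Jun 16 09:00 run_pipeline.sh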

cp

This command copies the contents of one file into another file. For example

$ cp file0b file0c

rm

This command deletes a file. For example

$ rm file0c

There is no such thing as a trash folder on an HPC system. Deleting a file should be considered an irreversible operation.

Recursive deletes can be done with

$ rm -rf folder_to_delete

Be extremely cautious when deleting files recursively. You cannot damage the system, since you cannot delete files that you do not own, but you can delete all of your own files forever.

mv

This command moves a file from one directory to another. It can also be used to rename files or directories.

$ mv file0b file0c

pwd

It is easy to get lost when you move through complex directory structures. pwd tells you the current directory.

$ pwd
/Users/andjelka/Documents/LSST/interpython/interpython_hpc

cd

This command moves you to the directory given as an argument; if no argument is given, it returns you to your home directory.

$ cd folder1

cat and tac

When you want to see the contents of a text file, the command cat displays them on the screen. It is also useful when you want to concatenate the contents of several files.

$ cat star_A_lc.csv
time,brightness
0.0,90.5
0.5,91.1
1.0,88.9
1.5,92.2
2.0,89.3
2.5,90.8
3.0,87.7...

To concatenate files you use the symbol ">", indicating that you want to redirect the output of a command into a file:

$ cat file1 file2 file3 > file_all

The command tac shows a file in reverse, starting from the last line back to the first.

more and less

Text files, such as those produced by simulations, are sometimes too large to fit on one screen. The command more shows a file one screen at a time. The command less offers more functionality and should be the tool of choice for viewing large text files.

$ less OUT

ln

This command allows you to create links between files. Used wisely, it can save you time when you frequently navigate to deeply nested directories. By default it creates hard links. Hard links are like copies, but they reference the same place on disk. Symbolic links are better in many cases because they can cross file systems and partitions. To create a symbolic link:

$ ln -s file1 link_to_file1

grep

The grep command extracts from its input the lines containing a specified string or regular expression. It is a powerful command for extracting specific information from large files. Consider for example

$ grep time  star_A_lc.csv
time,brightness
$ grep 88.9  star_A_lc.csv
1.0,88.9
  ...

Create a light curve directory with empty CSV files using the touch command (or use the provided CSV files):

mkdir -p lightcurves
cd lightcurves
touch star_A_lc.csv star_B_lc.csv star_C_lc.csv
ln -s star_A_lc.csv brightest_star.csv

ls – List Light Curve Files

List files:

$ ls
star_A_lc.csv  star_B_lc.csv  star_C_lc.csv  brightest_star.csv

Use -F and -a for extra detail:

$ ls -aF
./  ../  star_A_lc.csv  star_B_lc.csv  star_C_lc.csv  brightest_star.csv@

Long format with metadata:

$ ls -al
-rw-r--r--  1 user  staff  1024 Jun 16 09:00 star_A_lc.csv
lrwxr-xr-x  1 user  staff    15 Jun 16 09:01 brightest_star.csv -> star_A_lc.csv

cp – Copy a Light Curve File

$ cp star_B_lc.csv backup_star_B.csv

rm – Delete a Corrupted Light Curve

$ rm star_C_lc.csv

mv – Rename Light Curve

$ mv star_B_lc.csv star_B_epoch1.csv

pwd – Show Working Directory

$ pwd
/home/user/...../lightcurves

cd – Move between directories

$ cd ../images

cat and tac – Inspect or Reverse Light Curve

cat star_A_lc.csv
tac star_A_lc.csv

Combine curves:

cat star_A_lc.csv star_B_epoch1.csv > merged_lc.csv

more and less – View Long Curves

$ less star_A_lc.csv

ln – Create Alias for Light Curve

ln -s star_B_epoch1.csv variable_star.csv

grep – Extract Brightness Above Threshold

grep ',[89][0-9]\.[0-9]*' star_A_lc.csv

Regular expressions offer a way to specify text patterns that can vary in several ways, allowing commands such as grep to extract matching strings efficiently. The pattern above matches a comma followed by a value between 80.0 and 99.9..., i.e. brightness values of 80 or above. We will see more about regular expressions on our third day, devoted to data processing.

More commands

The 10 commands above will give you enough tools to move files around and navigate the directory tree. The GNU Core Utilities are the basic file, shell and text manipulation utilities of the GNU operating system. These are the core utilities that are expected to exist on every operating system.

If you want to know about the whole set of coreutils execute:

info coreutils

Each command has its own manual. You can access those manuals with

man <COMMAND>

Output of entire files

cat                    Concatenate and write files
tac                    Concatenate and write files in reverse
nl                     Number lines and write files
od                     Write files in octal or other formats
base64                 Transform data into printable data

Formatting file contents

fmt                    Reformat paragraph text
numfmt                 Reformat numbers
pr                     Paginate or columnate files for printing
fold                   Wrap input lines to fit in specified width

Output of parts of files

head                   Output the first part of files
tail                   Output the last part of files
split                  Split a file into fixed-size pieces
csplit                 Split a file into context-determined pieces

Summarizing files

wc                     Print newline, word, and byte counts
sum                    Print checksum and block counts
cksum                  Print CRC checksum and byte counts
md5sum                 Print or check MD5 digests
sha1sum                Print or check SHA-1 digests
sha2 utilities         Print or check SHA-2 digests

Operating on sorted files

sort                   Sort text files
shuf                   Shuffle text files
uniq                   Uniquify files
comm                   Compare two sorted files line by line
ptx                    Produce a permuted index of file contents
tsort                  Topological sort

Operating on fields

cut                    Print selected parts of lines
paste                  Merge lines of files
join                   Join lines on a common field

Operating on characters

tr                     Translate, squeeze, and/or delete characters
expand                 Convert tabs to spaces
unexpand               Convert spaces to tabs

Directory listing

ls                     List directory contents
dir                    Briefly list directory contents
vdir                   Verbosely list directory contents
dircolors              Color setup for 'ls'

Basic operations

cp                     Copy files and directories
dd                     Convert and copy a file
install                Copy files and set attributes
mv                     Move (rename) files
rm                     Remove files or directories
shred                  Remove files more securely

Special file types

link                   Make a hard link via the link syscall
ln                     Make links between files
mkdir                  Make directories
mkfifo                 Make FIFOs (named pipes)
mknod                  Make block or character special files
readlink               Print value of a symlink or canonical file name
rmdir                  Remove empty directories
unlink                 Remove files via unlink syscall

Changing file attributes

chown                  Change file owner and group
chgrp                  Change group ownership
chmod                  Change access permissions
touch                  Change file timestamps

Disk usage

df                     Report file system disk space usage
du                     Estimate file space usage
stat                   Report file or file system status
sync                   Synchronize data on disk with memory
truncate               Shrink or extend the size of a file

Printing text

echo                   Print a line of text
printf                 Format and print data
yes                    Print a string until interrupted

Conditions

false                  Do nothing, unsuccessfully
true                   Do nothing, successfully
test                   Check file types and compare values
expr                   Evaluate expressions
tee                    Redirect output to multiple files or processes

File name manipulation

basename               Strip directory and suffix from a file name
dirname                Strip last file name component
pathchk                Check file name validity and portability
mktemp                 Create temporary file or directory
realpath               Print resolved file names

Working context

pwd                    Print working directory
stty                   Print or change terminal characteristics
printenv               Print all or some environment variables
tty                    Print file name of terminal on standard input

User information

id                     Print user identity
logname                Print current login name
whoami                 Print effective user ID
groups                 Print group names a user is in
users                  Print login names of users currently logged in
who                    Print who is currently logged in

System context

arch                   Print machine hardware name
date                   Print or set system date and time
nproc                  Print the number of processors
uname                  Print system information
hostname               Print or set system name
hostid                 Print numeric host identifier
uptime                 Print system uptime and load

Modified command

chroot                 Run a command with a different root directory
env                    Run a command in a modified environment
nice                   Run a command with modified niceness
nohup                  Run a command immune to hangups
stdbuf                 Run a command with modified I/O buffering
timeout                Run a command with a time limit

Process control

kill                   Sending a signal to processes

Delaying

sleep                  Delay for a specified time

Numeric operations

factor                 Print prime factors
seq                    Print numeric sequences

Exercise: Using the Command Line Interface

  1. Create 4 folders A, B, C, D and inside each of them create three more: X, Y and Z. At the end you should have 12 subfolders. Use the command tree to check that you created the correct tree.

Solution

You should get:

$ tree
.
├── A
│   ├── X
│   ├── Y
│   └── Z
├── B
│   ├── X
│   ├── Y
│   └── Z
├── C
│   ├── X
│   ├── Y
│   └── Z
└── D
   ├── X
   ├── Y
   └── Z
  2. Let’s copy some files into those folders. In the data folder 1.IntroHPC/1.CLI there are 3 files: test.in, test.files and 14si.pspnc. Using the command line tools, create copies of test.in and test.files inside each of those folders, and a symbolic link to 14si.pspnc. Both test.in and test.files are text files that we want to edit, while 14si.pspnc is just a relatively big file that we only need to read for the simulation; we do not want to make copies of it, just symbolic links, to save disk space.

Solution

Step-by-step CLI commands:

# Step 1: Create the main folders
mkdir -p A/X A/Y A/Z B/X B/Y B/Z C/X C/Y C/Z D/X D/Y D/Z

# Step 2: Confirm structure
tree

Output should be:

.
├── A
│   ├── X
│   ├── Y
│   └── Z
├── B
│   ├── X
│   ├── Y
│   └── Z
├── C
│   ├── X
│   ├── Y
│   └── Z
└── D
    ├── X
    ├── Y
    └── Z

File Preparation:

# Make a dummy data directory and populate it
mkdir -p 1.IntroHPC/1.CLI
echo "dummy input" > 1.IntroHPC/1.CLI/test.in
echo "file list" > 1.IntroHPC/1.CLI/test.files
touch 1.IntroHPC/1.CLI/14si.pspnc
for folder in A B C D; do
  for sub in X Y Z; do
    cp 1.IntroHPC/1.CLI/test.in $folder/$sub/
    cp 1.IntroHPC/1.CLI/test.files $folder/$sub/
    ln -s ../../../1.IntroHPC/1.CLI/14si.pspnc $folder/$sub/14si.pspnc
  done
done

Verify

tree A
cat A/X/test.in
ls -l A/X/14si.pspnc

Midnight Commander

GNU Midnight Commander is a visual file manager. mc is a rich full-screen text-mode application that lets you copy, move and delete files and whole directory trees. Sometimes a text-based user interface is convenient; to use mc, just enter the command in the terminal

mc

Several keystrokes can be used to work with mc; most of them involve the F1 to F10 keys. On a Mac you need to press the “fn” key; on GNOME (Linux), you need to disable the interpretation of the function keys in gnome-terminal.

Exercise: Using the Command Line Interface

Use mc to create a folder E and subfolders X, Y and Z, copy the same files as we did for the previous exercise.

Exercise: Create LSST-style Visit Directory Structure

Use the CLI to create the following:

lsst_cli/
├── visit001/
│   ├── raw/
│   ├── calexp/
│   └── logs/
├── visit002/
│   ├── raw/
│   ├── calexp/
│   └── logs/

Then:

  1. Populate each raw/ with image01.fits, and create a symbolic link to calexp.fits in each calexp/.
  2. Add a process.yaml and a job.log file in each logs/.
  3. Use tree to verify the structure.

Exercise: Analyze Simulated Pipeline Logs

Use grep to find all lines in all job.log files containing “FATAL” or “WARNING”.

$ grep -rE 'FATAL|WARNING' lsst_cli/


Key Points

  • Basic CLI skills enable efficient navigation and manipulation of data repositories

  • Use man to explore arguments for command-line tools


HPC facilities

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • Question 1

Objectives
  • Objective 1

What are the IDACs

IDACs roster

A table with IDAC websites, CPU/GPU/storage capacity data, status (operational, under construction, planned, ...), LSST and other survey data stored, access information (command line/GUI), access policy (automated upon registration, personal contact needed, restricted to certain countries, etc.), and additional information (e.g. no Jupyter, or best suited for LSST epoch image analysis).

Key Points

  • Keypoint 1


Section 2: HPC Bura

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • Question 1

Objectives
  • Objective 1

Section overview, what it’s about, tools we’ll use, info we’ll learn.

Key Points

  • Keypoint 1


Slurm

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • Question 1

Objectives
  • Objective 1

Intro

Paragraph 1

Key Points

  • Keypoint 1


Intro for computing nodes and resources

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • Question 1

Objectives
  • Objective 1

Intro

Paragraph 1

Key Points

  • Keypoint 1


Intro code examples

Overview

Teaching: 30 min
Exercises: 20 min
Questions
  • What is the difference between serial and parallel code?

  • How do CPU and GPU programs differ?

  • What tools and programming models are used for HPC development?

Objectives
  • Understand the structure of CPU and GPU code examples.

  • Identify differences between serial, multi-threaded, and GPU-accelerated code.

  • Recognize common programming models like OpenMP, MPI, and CUDA.

  • Appreciate performance trade-offs and profiling basics.

Motivation for HPC Coding

Most users begin with simple serial code, which runs sequentially on one processor. However, for problems involving large data sets, high resolution simulations, or time-critical tasks, serial execution quickly becomes inefficient.

Parallel programming allows us to split work across multiple CPUs or even GPUs. High-Performance Computing (HPC) relies on this concept to solve problems faster.

Figure Suggestion:

Plot showing execution time of serial vs parallel implementation for increasing problem sizes (e.g., matrix size or loop iterations).

Serial Code Example (CPU)

Introduction to NumPy

Before diving into parallel computing or GPU acceleration, it’s important to understand how performance can already be improved significantly on a CPU using efficient libraries.

Example: Summing the elements of a large array using Serial Computation

import numpy as np
import time

array = np.random.rand(10**7)
start = time.time()
total = np.sum(array)
end = time.time()
print(f"Sum: {total}, Time taken: {end - start:.4f} seconds")

Exercise:

Modify the above to use a manual loop with for instead of np.sum, and compare the performance.

Solution

Replace np.sum(array) with a manual loop using for.
Note: This will be much slower due to Python’s loop overhead.

import numpy as np
import time

array = np.random.rand(10**7)
start = time.time()
total = 0.0
for value in array:
    total += value
end = time.time()
print(f"Sum: {total}, Time taken: {end - start:.4f} seconds")

This gives you a baseline for how optimized np.sum is compared to native Python loops.

Reference:

Carpentries Python loops lesson


Parallel CPU Programming

Introduction to OpenMP and MPI

Parallel programming on CPUs is primarily achieved through two widely-used models:

OpenMP (Open Multi-Processing)

OpenMP is used for shared-memory parallelism. It enables multi-threading where each thread has access to the same memory space. It is ideal for multicore processors on a single node.

OpenMP was first introduced in October 1997 as a collaborative effort between hardware vendors, software developers, and academia. The goal was to standardize a simple, portable API for shared-memory parallel programming in C, C++, and Fortran. Over time, OpenMP has evolved to support nested parallelism, Single Instruction Multiple Data (vectorization), and offloading to GPUs, while remaining easy to integrate into existing code through compiler directives.

OpenMP is now maintained by the OpenMP Architecture Review Board, which includes organizations like Arm, AMD, IBM, Intel, Cray, HP, Fujitsu, Nvidia, NEC, Red Hat, Texas Instruments, and Oracle Corporation. OpenMP allows you to parallelize loops in C/C++ or Fortran using compiler directives.

Example: Running a loop in parallel using OpenMP

#include <omp.h>
#define N 1000000
double a[N], b[N], c[N];
int main(void) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }
    return 0;
}

Since C programming is not a prerequisite for this workshop, let’s break down the parallel loop code in detail.

Requirements: a C compiler with OpenMP support (for example GCC, where OpenMP is enabled with the -fopenmp flag) and the omp.h header shown above.
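A minimal way to compile and run it, assuming GCC and a source file named vector_add.c:

$ gcc -fopenmp vector_add.c -o vector_add   # enable OpenMP at compile time
$ OMP_NUM_THREADS=4 ./vector_add            # run with 4 threads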

Explanation of the code

  • #include <omp.h>: Includes the OpenMP API header needed for all OpenMP functions and directives.
  • #pragma omp parallel for: A compiler directive that tells the compiler to parallelize the for loop that follows.
  • The for loop itself performs element-wise addition of two arrays (b and c), storing the result in array a.

How OpenMP Executes This

  1. OpenMP detects available CPU cores (e.g., 4 or 8).
  2. It splits the loop into chunks — one for each thread.
  3. Each core runs its chunk simultaneously (in parallel).
  4. The threads synchronize automatically once all work is done.

Output

The output is stored in array a, which will contain the sum of corresponding elements from arrays b and c. The execution is faster than running the loop sequentially.

Real-World Analogy

Suppose you need to send 100 emails:

  • Without OpenMP: One person sends all 100 emails one by one.
  • With OpenMP: 4 people each send 25 emails at the same time — finishing in a quarter of the time.

Exercise: Parallelization Challenge

Consider this loop:

for (int i = 1; i < N; i++) {
  a[i] = a[i-1] + b[i];
}

Can this be parallelized with OpenMP? Why or why not?

Solution

No, this cannot be safely parallelized because each iteration depends on the result of the previous iteration (a[i-1]).

OpenMP requires loop iterations to be independent for parallel execution. Here, since each a[i] relies on a[i-1], the loop has a sequential dependency, also known as a loop-carried dependency.

This prevents naive parallelization with OpenMP’s #pragma omp parallel for.

However, this type of problem can be parallelized using more advanced techniques like a parallel prefix sum (scan) algorithm, which restructures the computation to allow parallel execution in logarithmic steps instead of linear.
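For intuition, the recurrence above is exactly a running (prefix) sum. A minimal Python sketch (assuming, for illustration, that a[0] starts out equal to b[0]) showing that the sequential loop matches the equivalent scan:

import numpy as np

N = 8
b = np.arange(1.0, N + 1)   # example data
a = np.empty(N)
a[0] = b[0]                 # assumption for this sketch: a[0] starts as b[0]

# the sequential, loop-carried recurrence from the exercise
for i in range(1, N):
    a[i] = a[i - 1] + b[i]

# the same result as a prefix sum (scan), which has parallel algorithms
assert np.allclose(a, np.cumsum(b))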

MPI (Message Passing Interface)

MPI is used for distributed-memory parallelism. Processes run on separate memory spaces (often on different nodes) and communicate via message passing. It is suitable for large-scale HPC clusters.

MPI emerged earlier, in the early 1990s, as the need for a standardized message-passing interface became clear in the growing field of distributed-memory computing. Before MPI, various parallel systems used their own vendor-specific libraries, making code difficult to port across machines.

In June 1994, the first official MPI standard (MPI-1) was published by the MPI Forum, a collective of academic institutions, government labs, and industry partners. Since then, MPI has become the de facto standard for scalable parallel computing across multiple nodes, and it has continued to evolve through MPI-2, MPI-3, MPI-4 and, most recently, MPI-5, released on June 5, 2025; successive versions have added features such as parallel I/O and dynamic process management.

Example: Implementation of MPI using the mpi4py library in python

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

data = rank ** 2
all_data = comm.gather(data, root=0)
if rank == 0:
    print(all_data)

Explanation of the code

This example demonstrates a basic use of mpi4py to perform a gather operation using the MPI.COMM_WORLD communicator.

Each process:

  • Determines its rank (an integer from 0 to N-1, where N is the number of processes).
  • Computes rank ** 2 (the square of its rank).
  • Uses comm.gather() to send the result to the root process (rank 0).

Only the root process gathers the data and prints the complete list.

Example Output (4 processes):

  • Rank 0 computes 0² = 0
  • Rank 1 computes 1² = 1
  • Rank 2 computes 2² = 4
  • Rank 3 computes 3² = 9

The root process (rank 0) gathers all results and prints:

[0, 1, 4, 9]

Other ranks do not print anything.

This example illustrates point-to-root communication — useful when one process needs to collect and process results from all workers.

Note:

You won’t be able to run this code in your current environment. This example requires a Slurm job submission script to launch MPI processes across nodes. Detailed instructions on how to configure Slurm scripts and request resources are provided in Section 2: HPC Bura - Resource Optimization .

Typically, one would run this file from a Slurm script that requests the required resources, using a command like:

mpirun -n 4 python your_script.py

Exercise:

Modify serial array summation using OpenMP (C) or multiprocessing (Python).
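Solution

A possible sketch using Python’s standard multiprocessing module (the choice of 4 worker processes is illustrative):

from multiprocessing import Pool
import numpy as np

def partial_sum(chunk):
    # each worker sums its own slice of the array
    return chunk.sum()

if __name__ == "__main__":
    array = np.random.rand(10**7)
    chunks = np.array_split(array, 4)   # one chunk per worker
    with Pool(processes=4) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(f"Sum: {total}")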

References:


GPU Programming Concepts

GPUs, or Graphics Processing Units, are composed of thousands of lightweight processing cores that are optimized for handling multiple operations simultaneously. This parallel architecture makes them particularly effective for data-parallel problems, where the same operation is performed independently across large datasets such as matrix multiplications, vector operations, or image processing tasks.

Originally designed to accelerate the rendering of complex graphics and visual effects in computer games, GPUs are inherently well-suited for high-throughput computations involving large tensors and multidimensional arrays. Their architecture enables them to perform numerous arithmetic operations in parallel, which has made them increasingly valuable in scientific computing, deep learning, and simulations.

Even without explicit parallel programming, many modern libraries and frameworks (such as TensorFlow, PyTorch, and CuPy) can automatically leverage GPU acceleration to significantly improve performance. However, to fully exploit the computational power of GPUs, especially in high-performance computing (HPC) environments, explicit parallelization is often employed.

Introduction to CUDA

In HPC systems, CUDA (Compute Unified Device Architecture), a parallel computing platform and programming model developed by NVIDIA is the most widely used platform for GPU programming. CUDA allows developers to write highly parallel code that runs directly on the GPU, providing fine-grained control over memory usage, thread management, and performance optimization. It allows developers to harness the power of NVIDIA GPUs for general-purpose computing, known as GPGPU (General-Purpose computing on Graphics Processing Units).

A Brief History

How CUDA Works

CUDA allows developers to write C, C++, Fortran, and Python code that runs on the GPU. Work is organized hierarchically: individual threads execute the kernel function, threads are grouped into blocks, and blocks are grouped into a grid that covers the whole problem.

This hierarchical design allows fine-grained control over memory and computation.
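To make this hierarchy concrete, here is a minimal sketch using the Numba API from the example below (like the other CUDA examples in this episode, it needs a machine with an NVIDIA GPU); cuda.grid(1), used later, is shorthand for the index composed here by hand:

from numba import cuda
import numpy as np

@cuda.jit
def fill_with_index(out):
    # global thread index composed from the block/thread hierarchy;
    # equivalent to i = cuda.grid(1)
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if i < out.size:  # guard: the grid may contain more threads than elements
        out[i] = i

out = cuda.device_array(1000, dtype=np.float32)
fill_with_index[4, 256](out)   # grid of 4 blocks × 256 threads = 1024 threads
print(out.copy_to_host()[:5])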

Key Features

A CUDA program includes: host code that runs on the CPU and manages data, one or more kernels that run on the GPU, and explicit memory transfers between host and device.

Checking CUDA availability before running code

from numba import cuda

if cuda.is_available():
    print("CUDA is available!")
    print(f"Detected GPU: {cuda.get_current_device().name}")
else:
    print("CUDA is NOT available.")

High-Level Libraries for Portability

High-level libraries allow easier GPU programming in Python:

Example: Add vectors utilising CUDA with the Numba Python library

from numba import cuda
import numpy as np
import time

@cuda.jit
def add_vectors(a, b, c):
    i = cuda.grid(1)
    if i < a.size:
        c[i] = a[i] + b[i]

# Setup input arrays
N = 1_000_000
a = np.arange(N, dtype=np.float32)
b = np.arange(N, dtype=np.float32)
c = np.zeros_like(a)

# Copy arrays to device
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.device_array_like(a)

# Configure the kernel
threads_per_block = 256
blocks_per_grid = (N + threads_per_block - 1) // threads_per_block

# Launch the kernel
start = time.time()
add_vectors[blocks_per_grid, threads_per_block](d_a, d_b, d_c)
cuda.synchronize()  # Wait for GPU to finish
gpu_time = time.time() - start

# Copy result back to host
d_c.copy_to_host(out=c)

# Verify results
print("First 5 results:", c[:5])
print("Time taken on GPU:", gpu_time, "seconds")

Note:

This code also requires GPU access and Slurm job submission to be executed properly. You will revisit this exercise after completing Section 2: HPC Bura - Resource Optimization , which introduces how to configure resources and submit jobs.

Exercise:

Write a Numba or CuPy version of vector addition and compare speed with NumPy.

References:



CPU vs GPU Architecture

Figure Suggestion:

Diagram comparing CPU vs GPU architecture, e.g., from CUDA C Programming Guide

Comparing CPU and GPU Approaches

| Feature     | CPU (OpenMP/MPI)           | GPU (CUDA)                                   |
|-------------|----------------------------|----------------------------------------------|
| Cores       | Few (2–64)                 | Thousands (1024–10000+)                      |
| Memory      | Shared / distributed       | Device-local (needs transfer)                |
| Programming | Easier to debug            | Requires more setup                          |
| Performance | Good for logic-heavy tasks | Excellent for large, data-parallel problems  |

Exercise:

Show which parts of the code execute on GPU vs CPU (host vs device). Read about concepts like memory copy and kernel launch.

Reference: NVIDIA CUDA Samples

Figure:

Bar chart showing performance on matrix multiplication or vector addition.


Code Profiling (Optional)

To understand and improve performance, profiling tools are essential.
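Before reaching for specialised tools, Python’s built-in facilities already locate most bottlenecks; a minimal sketch timing a function and breaking its cost down per call (the function here is just an illustration):

import cProfile
import time
import numpy as np

def work():
    a = np.random.rand(10**6)
    return np.sum(a)

start = time.perf_counter()
work()
print(f"Elapsed: {time.perf_counter() - start:.4f} s")

# per-function breakdown, sorted by cumulative time
cProfile.run("work()", sort="cumulative")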

Exercise:

Time your serial and parallel code. Where is the bottleneck?

Optional Reference: NVIDIA Nsight Tools


Summary


Key Points

  • Serial code is limited to a single thread of execution, while parallel code uses multiple cores or nodes.

  • OpenMP and MPI are popular for parallel CPU programming; CUDA is used for GPU programming.

  • High-level libraries like Numba and CuPy make GPU acceleration accessible from Python.


Resource optimization

Overview

Teaching: 30 min
Exercises: 10 min
Questions
  • What is the difference between requesting CPU and GPU resources with Slurm?

  • How can I optimize my Slurm script to request the most suitable resources for my specific task?

Objectives
  • Understand different types of computational workloads and their resource requirements

  • Write optimized Slurm job scripts for sequential, parallel, and GPU workloads

  • Monitor and analyze resource utilization

  • Apply best practices for efficient resource allocation

Understanding Resource Requirements

Different computational tasks have varying resource requirements. Understanding these patterns is crucial for efficient HPC usage.

Types of Workloads

CPU-bound workloads: Tasks that primarily use computational power

Memory-bound workloads: Tasks limited by memory access speed

I/O-bound workloads: Tasks limited by disk or network operations

GPU-accelerated workloads: Tasks that can utilize parallel processing


Types of Jobs and Resources

| Job Type | SLURM Partition | Key SLURM Options        | Example Use Case          |
|----------|-----------------|--------------------------|---------------------------|
| Serial   | serial          | --partition, no MPI      | Single-thread tensor calc |
| Parallel | defaultq        | -N, -n, mpirun           | MPI simulation            |
| GPU      | gpu             | --gpus, --cpus-per-task  | Deep learning training    |

Choosing the Right Node


Example

To understand how we can utilise the different resources available on the HPC for the same computational task, we take the example of a Python code that calculates the gravitational deflection angle, defined in the following way:

Deflection Angle Formula

For light passing near a massive object, the deflection angle (α) in the weak-field approximation is given by:

α = 4GM / (c²b)

Where:

Computational Task Description

Compute the deflection angle over a grid of:

  • 10,000 masses, from 1 to 1000 solar masses, and
  • 10,000 impact parameters, from 10⁹ to 10¹² metres.

Generate a 2D array where each entry corresponds to the deflection angle for a specific pair of mass and impact parameter. Now let us look at how to implement this for the different resources available on the HPC.

Sequential Job Optimization

Sequential jobs run on a single CPU core and are suitable for tasks that cannot be parallelized.

Sequential Job Script Explained

#!/bin/bash
#SBATCH -J jobname                    # Job name for identification
#SBATCH -o outfile.%J                 # Standard output file (%J = job ID)
#SBATCH -e errorfile.%J               # Standard error file (%J = job ID)
#SBATCH --partition=serial            # Use serial queue for single-core jobs
./[programme executable name]          # Execute your program

Script breakdown: -J names the job, -o and -e set the output and error files (%J is replaced by the job ID), and --partition=serial sends the job to the single-core queue; the final line runs your executable.

Example: Gravitational Deflection Angle Sequential CPU

import numpy as np
import time
import matplotlib.pyplot as plt
import os
import matplotlib.colors as colors

# Constants
G = 6.67430e-11
c = 299792458
M_sun = 1.98847e30

# Parameter grid
mass_grid = np.linspace(1, 1000, 10000)  # Solar masses
impact_grid = np.linspace(1e9, 1e12, 10000)  # meters

result = np.zeros((len(mass_grid), len(impact_grid)))

# Timing
start = time.time()

# Sequential computation
for i, M in enumerate(mass_grid):
    for j, b in enumerate(impact_grid):
        result[i, j] = (4 * G * M * M_sun) / (c**2 * b)

end = time.time()

print(f"CPU Sequential time: {end - start:.3f} seconds")

np.save("result_cpu.npy", result)
np.save("mass_grid_cpu.npy", mass_grid)
np.save("impact_grid_cpu.npy", impact_grid)

# Load data
result = np.load("result_cpu.npy")
mass_grid = np.load("mass_grid_cpu.npy")
impact_grid = np.load("impact_grid_cpu.npy")

# Create meshgrid
M, B = np.meshgrid(mass_grid / 1.989e30, impact_grid / 1e9, indexing='ij')

# Create output directory
os.makedirs("plots", exist_ok=True)

plt.figure(figsize=(8,6))
pcm = plt.pcolormesh(B, M, result,
                      norm=colors.LogNorm(vmin=result[result > 0].min(), vmax=result.max()),
                      shading='auto', cmap='plasma')

plt.colorbar(pcm, label='Deflection Angle (radians, log scale)')
plt.xlabel('Impact Parameter (Gm)')
plt.ylabel('Mass (Solar Masses)')
plt.title('Gravitational Deflection Angle - CPU')

plt.tight_layout()
plt.savefig("plots/deflection_angle_cpu.png", dpi=300)
plt.close()

print("CPU plot saved in 'plots/deflection_angle_cpu.png'")

Sequential Job Script for the Example

#!/bin/bash
#SBATCH --job-name=HPC_WS_SCPU # Provide a name for the job 
#SBATCH --output=HPC_WS_SCPU_%j.out # Request the output file along with the job number
#SBATCH --error=HPC_WS_SCPU_%j.err # Request the error file along with the job number
#SBATCH --partition=serial
#SBATCH --nodes=1 # Request one CPU node
#SBATCH --ntasks=1 # Request 1 core from the CPU node
#SBATCH --time=01:00:00 # Set time limit for the job
#SBATCH --mem=16G #Request 16GB memory 

# Load required modules
module purge # Remove the list of pre loaded modules
module load Python/3.9.1
module list

# Create a python virtual environment 
python3 -m venv name_of_your_venv

# Activate your Python environment
source name_of_your_venv/bin/activate

echo "Starting Gravitational Lensing Deflection calculation of Sequential CPU..."
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $SLURM_NODELIST"

# Run the Python script (with logging)
python Gravitational_Deflection_Angle_SCPU.py

echo "Job completed at $(date)"

Exercise: Profile Your Code

Run the sequential code. Use htop to monitor resource usage. Identify whether it is CPU-bound or memory-bound.

Parallel Job Optimization

Parallel jobs can utilize multiple CPU cores across one or more nodes to accelerate computation.

Parallel Job Script Explained

#!/bin/bash
#SBATCH -J jobname                    # Job name
#SBATCH -o outfile.%J                 # Output file
#SBATCH -e errorfile.%J               # Error file
#SBATCH --partition=defaultq          # Parallel job queue
#SBATCH -N 2                          # Number of compute nodes
#SBATCH --ntasks-per-node=24          # MPI tasks (CPU cores) per node
mpirun -np 48 ./mpi_program           # Run with 48 MPI processes (2 nodes × 24 cores)

Changes from the sequential script: the job now uses the parallel queue (defaultq), requests multiple nodes and multiple tasks, and launches the program through mpirun so that one MPI process runs per allocated core.

Example: Gravitational Deflection Angle Parallel CPU

from mpi4py import MPI
import numpy as np
import time
import os 
import matplotlib.pyplot as plt
import matplotlib.colors as colors

# MPI setup
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Constants
G = 6.67430e-11
c = 299792458
M_sun = 1.98847e30

# Parameter grid (same on all ranks)
mass_grid = np.linspace(1, 1000, 10000)  # Solar masses
impact_grid = np.linspace(1e9, 1e12, 10000)  # meters

# Distribute mass grid among ranks
chunk_size = len(mass_grid) // size
start_idx = rank * chunk_size
end_idx = (rank + 1) * chunk_size if rank != size - 1 else len(mass_grid)

local_mass = mass_grid[start_idx:end_idx]
local_result = np.zeros((len(local_mass), len(impact_grid)))

# Timing
local_start = time.time()

# Compute local chunk
for i, M in enumerate(local_mass):
    for j, b in enumerate(impact_grid):
        local_result[i, j] = (4 * G * M * M_sun) / (c**2 * b)

local_end = time.time()
print(f"Rank {rank} local time: {local_end - local_start:.3f} seconds")

# Gather results on the root rank
result = None
if rank == 0:
    result = np.zeros((len(mass_grid), len(impact_grid)))

comm.Gather(local_result, result, root=0)

if rank == 0:
    total_time = local_end - local_start
    print(f"MPI total time (wall time): {total_time:.3f} seconds")
    np.save("result_mpi.npy", result)
    np.save("mass_grid_mpi.npy", mass_grid)
    np.save("impact_grid_mpi.npy", impact_grid)

# Only the root rank produces the plot; the other ranks are done
if rank != 0:
    raise SystemExit(0)

# Load data
result = np.load("result_mpi.npy")
mass_grid = np.load("mass_grid_mpi.npy")
impact_grid = np.load("impact_grid_mpi.npy")

# Create meshgrid
M, B = np.meshgrid(mass_grid / 1.989e30, impact_grid / 1e9, indexing='ij')

# Create output directory
os.makedirs("plots", exist_ok=True)

plt.figure(figsize=(8,6))
pcm = plt.pcolormesh(B, M, result,
                      norm=colors.LogNorm(vmin=result[result > 0].min(), vmax=result.max()),
                      shading='auto', cmap='plasma')

plt.colorbar(pcm, label='Deflection Angle (radians, log scale)')
plt.xlabel('Impact Parameter (Gm)')
plt.ylabel('Mass (Solar Masses)')
plt.title('Gravitational Deflection Angle - MPI')

plt.tight_layout()
plt.savefig("plots/deflection_angle_mpi.png", dpi=300)
plt.close()

print("MPI plot saved in 'plots/deflection_angle_mpi.png'")

Parallel Job Script for the Example

#!/bin/bash
#SBATCH --job-name=HPC_WS_PCPU # Provide a name for the job 
#SBATCH --output=HPC_WS_PCPU_%j.out # Request the output file along with the job number
#SBATCH --error=HPC_WS_PCPU_%j.err # Request the error file along with the job number
#SBATCH --partition=defaultq 
#SBATCH --nodes=2 # Request two CPU nodes
#SBATCH --ntasks=4 # Request 2 cores from each CPU node
#SBATCH --time=01:00:00 # Set time limit for the job
#SBATCH --mem=16G #Request 16GB memory 

# Load required modules
module purge # Remove the list of pre loaded modules
module load Python/3.9.1
module load openmpi4/default
module list # List the modules

# Create a python virtual environment 
python3 -m venv name_of_your_venv

# Activate your Python virtual environment
source name_of_your_venv/bin/activate

echo "Starting Gravitational Lensing Deflection calculation of Sequential CPU..."
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $SLURM_NODELIST"

# Run the Python script with MPI (with logging)
mpirun -np 4 python Gravitational_Lensing_PCPU.py

echo "Job completed at $(date)"

Exercise: Optimize Parallel Performance

Compile the OpenMP version with different thread counts. Submit jobs with varying --cpus-per-task values (see the sketch below). Plot performance vs. thread count.
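One way to sweep thread counts from the shell (the job script name and the range of counts are illustrative; command-line options override the script’s own #SBATCH values):

for t in 1 2 4 8 16; do
  sbatch --cpus-per-task=$t job.sh   # one submission per thread count
done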

GPU Job Optimization

GPU jobs leverage graphics processing units for massively parallel computations.

GPU Job Script Explained

#!/bin/bash
#SBATCH --nodes=1                     # Single node (GPUs are node-local)
#SBATCH --ntasks-per-node=1           # One task per node
#SBATCH --cpus-per-task=4             # CPU cores to support GPU
#SBATCH -o output-%J.out              # Output file with job ID
#SBATCH -e error-%J.err               # Error file with job ID
#SBATCH --partition=gpu               # GPU-enabled partition
#SBATCH --mem=32G                     # Memory allocation
#SBATCH --gpus-per-node=1             # Number of GPUs requested
./[programme executable name]          # GPU program execution

GPU-specific parameters: --partition=gpu selects the GPU-enabled queue, --gpus-per-node sets how many GPUs you need, and --cpus-per-task reserves CPU cores to feed the GPU with data.

Example: CUDA Implementation

import numpy as np
from numba import cuda
import time
import matplotlib.pyplot as plt
import os
import matplotlib.colors as colors


# Constants
G = 6.67430e-11
c = 299792458

# Parameter grid
mass_grid = np.linspace(1e30, 1e33, 10000)
impact_grid = np.linspace(1e9, 1e12, 10000)

mass_grid_device = cuda.to_device(mass_grid)
impact_grid_device = cuda.to_device(impact_grid)
result_device = cuda.device_array((len(mass_grid), len(impact_grid)))

# CUDA kernel
@cuda.jit
def compute_deflection(mass_array, impact_array, result):
    i, j = cuda.grid(2)
    if i < mass_array.size and j < impact_array.size:
        M = mass_array[i]
        b = impact_array[j]
        result[i, j] = (4 * G * M) / (c**2 * b)

# Setup thread/block dimensions
threadsperblock = (16, 16)
blockspergrid_x = (mass_grid.size + threadsperblock[0] - 1) // threadsperblock[0]
blockspergrid_y = (impact_grid.size + threadsperblock[1] - 1) // threadsperblock[1]
blockspergrid = (blockspergrid_x, blockspergrid_y)

# Run the kernel
start = time.time()
compute_deflection[blockspergrid, threadsperblock](mass_grid_device, impact_grid_device, result_device)
cuda.synchronize()
end = time.time()

result = result_device.copy_to_host()

print(f"CUDA time: {end - start:.3f} seconds")

# Save the result and grids
np.save("result_cuda.npy", result)
np.save("mass_grid_cuda.npy", mass_grid)
np.save("impact_grid_cuda.npy", impact_grid)

print("Result and grids saved as .npy files.")

# Load data
result = np.load("result_cuda.npy")
mass_grid = np.load("mass_grid_cuda.npy")
impact_grid = np.load("impact_grid_cuda.npy")

# Create meshgrid
M, B = np.meshgrid(mass_grid / 1.989e30, impact_grid / 1e9, indexing='ij')

# Create output directory
os.makedirs("plots", exist_ok=True)

plt.figure(figsize=(8,6))
pcm = plt.pcolormesh(B, M, result,
                      norm=colors.LogNorm(vmin=result[result > 0].min(), vmax=result.max()),
                      shading='auto', cmap='plasma')

plt.colorbar(pcm, label='Deflection Angle (radians, log scale)')
plt.xlabel('Impact Parameter (Gm)')
plt.ylabel('Mass (Solar Masses)')
plt.title('Gravitational Deflection Angle - CUDA')

plt.tight_layout()
plt.savefig("plots/deflection_angle_cuda.png", dpi=300)
plt.close()

print("CUDA plot saved in 'plots/deflection_angle_cuda.png'")

GPU Job Script for the Example

#!/bin/bash
#SBATCH --job-name=HPC_WS_GPU  # Provide a name for the job 
#SBATCH --output=HPC_WS_GPU_%j.out
#SBATCH --error=HPC_WS_GPU_%j.err
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4 # Number of CPUs for data preparation 
#SBATCH --mem=32G # Memory allocation
#SBATCH --gpus-per-node=1
#SBATCH --time=06:00:00

# --------- Load Environment ---------
module load Python/3.9.1
module load cuda/11.2
module list

# Activate your Python virtual environment
source name_of_your_venv/bin/activate

# --------- Run the Python Script ---------
python Gravitational_Lensing_GPU.py

Exercise: GPU vs CPU Comparison

Run the tensor operations script on both CPU and GPU. Compare execution times and memory usage. Calculate the speedup factor.

Resource Monitoring and Performance Analysis

Monitoring Job Performance

#!/bin/bash

#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --job-name=ResourceMonitor
#SBATCH --output=ResourceMonitor_%j.out
#SBATCH --time=00:10:00  # 10 minutes max (5 for monitoring + buffer)

# --------- Configuration ---------
LOG_FILE="resource_monitor.log"
INTERVAL=30    # Interval between logs in seconds
DURATION=300   # Total duration in seconds (5 minutes)
ITERATIONS=$((DURATION / INTERVAL))

# --------- Start Monitoring ---------
echo "Starting Resource Monitoring for $DURATION seconds (~$((DURATION/60)) minutes)..."
echo "Logging to: $LOG_FILE"
echo "------ Monitoring Started at $(date) ------" >> "$LOG_FILE"

# --------- System Info Check ---------
echo "==== System Info Check ====" | tee -a "$LOG_FILE"
echo "Hostname: $(hostname)" | tee -a "$LOG_FILE"

# Check NVIDIA driver and GPU presence
if command -v nvidia-smi &> /dev/null; then
    echo "✅ nvidia-smi is available." | tee -a "$LOG_FILE"
    if nvidia-smi &>> "$LOG_FILE"; then
        echo "✅ GPU detected and driver is working." | tee -a "$LOG_FILE"
    else
        echo "⚠️ NVIDIA-SMI failed. Check GPU node or driver issues." | tee -a "$LOG_FILE"
    fi
else
    echo "❌ nvidia-smi is not installed." | tee -a "$LOG_FILE"
fi

echo "Checking for NVIDIA GPU presence on PCI bus..." | tee -a "$LOG_FILE"
if lspci | grep -i nvidia &>> "$LOG_FILE"; then
    echo "✅ NVIDIA GPU found on PCI bus." | tee -a "$LOG_FILE"
else
    echo "❌ No NVIDIA GPU detected on this node." | tee -a "$LOG_FILE"
fi

echo "" | tee -a "$LOG_FILE"

# --------- Trap CTRL+C for Clean Exit ---------
trap "echo 'Stopping monitoring...'; echo '------ Monitoring Ended at $(date) ------' >> \"$LOG_FILE\"; exit" SIGINT SIGTERM

# --------- Monitoring Loop ---------
for ((i=1; i<=ITERATIONS; i++)); do
    echo "========================== $(date) ==========================" >> "$LOG_FILE"

    # GPU usage monitoring
    echo "--- GPU Usage (nvidia-smi) ---" >> "$LOG_FILE"
    nvidia-smi 2>&1 | grep -v "libnvidia-ml.so" >> "$LOG_FILE"
    echo "" >> "$LOG_FILE"

    # CPU and Memory monitoring
    echo "--- CPU and Memory Usage (top) ---" >> "$LOG_FILE"
    top -b -n 1 | head -20 >> "$LOG_FILE"
    echo "" >> "$LOG_FILE"

    sleep $INTERVAL
done

echo "------ Monitoring Ended at $(date) ------" >> "$LOG_FILE"
echo "✅ Resource monitoring completed."

Understanding Outputs - top - CPU and Memory Monitoring

Example Output:

--- CPU and Memory Usage (top) ---
top - 17:53:49 up 175 days,  9:41,  0 users,  load average: 1.01, 1.06, 1.08
Tasks: 765 total,   1 running, 764 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.2 us,  0.1 sy,  0.0 ni, 97.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 515188.2 total, 482815.2 free,  17501.5 used,  14871.5 buff/cache
MiB Swap:   4096.0 total,   4072.2 free,     23.8 used. 493261.3 avail Mem

Explanation:

Header Line - System Uptime and Load Average

top - 17:53:49 up 175 days,  9:41,  0 users,  load average: 1.01, 1.06, 1.08

The three load averages give the mean number of runnable processes over the last 1, 5, and 15 minutes; values near the node's core count indicate full utilization.

Task Summary

Tasks: 765 total,   1 running, 764 sleeping,   0 stopped,   0 zombie

This counts every process on the node: only one is actively running here; the rest are sleeping, waiting for work or I/O.

CPU Usage

%Cpu(s):  2.2 us,  0.1 sy,  0.0 ni, 97.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

Field   Meaning
us      User CPU time - 2.2%
sy      System (kernel) time - 0.1%
ni      Nice (priority) - 0.0%
id      Idle - 97.7%
wa      Waiting for I/O - 0.0%
hi      Hardware interrupts - 0.0%
si      Software interrupts - 0.0%
st      Steal time (virtualization) - 0.0%

Memory Usage

MiB Mem : 515188.2 total, 482815.2 free,  17501.5 used,  14871.5 buff/cache

Field        Meaning
total        Total RAM (515188.2 MiB)
free         Free RAM (482815.2 MiB)
used         Used by programs (17501.5 MiB)
buff/cache   Disk cache and buffers (14871.5 MiB)

Swap Usage

MiB Swap:   4096.0 total,   4072.2 free,     23.8 used. 493261.3 avail Mem

Field        Meaning
total        Swap space available (4096.0 MiB)
free         Free swap (4072.2 MiB)
used         Swap used (23.8 MiB)
avail Mem    Memory available for new tasks without swapping (493261.3 MiB)

Understanding Outputs - nvidia-smi GPU Monitoring

Example nvidia-smi Output:

------ Wed Jul  2 17:12:23 IST 2025 ------
Wed Jul  2 17:12:23 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------|
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 NVL                On  |   00000000:AB:00.0 Off |                    0 |
| N/A   37C    P0             86W /  400W |    1294MiB /  95830MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2234986      C   python                                       1284MiB |
+-----------------------------------------------------------------------------------------+

Explanation of nvidia-smi Output:

GPU Summary Header

The first line of the table reports the nvidia-smi and driver version (560.35.05) and the highest CUDA version that driver supports (12.6).

GPU Info Section

Field                  Meaning
GPU                    GPU index number (0)
Name                   GPU model: NVIDIA H100 NVL
Persistence-M          Persistence Mode: On (reduces init overhead)
Bus-Id                 PCI bus ID location
Disp.A                 Display Active: Off (no display connected)
Volatile Uncorr. ECC   GPU memory error count (0 = no errors)
Fan                    Fan speed (N/A — passive cooling)
Temp                   Temperature (37C — healthy)
Perf                   Performance state (P0 = maximum performance)
Pwr:Usage/Cap          Power usage (86W of 400W max)
Memory-Usage           1294MiB used / 95830MiB total
GPU-Util               GPU utilization (0% — idle)
Compute M.             Compute mode (Default)
MIG M.                 Multi-Instance GPU mode (Disabled)

Processes Section

Field          Meaning
GPU            GPU ID (0)
PID            Process ID (2234986)
Type           Type of process: C (compute)
Process Name   Process name (python)
GPU Memory     1284MiB used by this process
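As an alternative to parsing the full table, nvidia-smi can also emit compact, machine-readable samples via its standard query options; a minimal sketch (the 30-second interval is just an example):

nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total \
           --format=csv -l 30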

Performance Comparison Script

import matplotlib.pyplot as plt

# Extracted timings from the printed output
methods = ['Sequential (CPU)', 'MPI (Parallel CPU)', 'CUDA (GPU)']
times = [70.430, 13.507, 0.341]  # Replace with the timings printed by your own runs of the scripts above

plt.figure(figsize=(10, 6))
bars = plt.bar(methods, times, color=['blue', 'green', 'red'])
plt.ylabel('Execution Time (seconds)')
plt.title('Performance Comparison: CPU vs MPI vs GPU')

# Add labels above bars
for bar, time in zip(bars, times):
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 1,
             f'{time:.3f}s', ha='center', va='bottom')

plt.tight_layout()
plt.savefig('performance_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

Exercise: Resource Efficiency Analysis

Run the above Python script to create a comparative analysis of the different methods you used in this tutorial and understand the efficiency of the different resources.

Example Solution

Performance Comparison

This plot shows the execution time comparison between CPU, MPI, and GPU implementations.

Best Practices and Common Pitfalls

Resource Allocation Best Practices

  1. Match resources to workload requirements
    • Don’t request more resources than you can use
    • Consider memory requirements carefully
    • Use appropriate partitions/queues
  2. Test with small jobs first
    • Validate your scripts with shorter runs
    • Check resource utilization before scaling up
  3. Monitor and optimize
    • Use profiling tools to identify bottlenecks
    • Adjust resource requests based on actual usage (see the sketch below)
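For example, once a job has finished, Slurm’s accounting tools report how much of the request was actually used; a minimal sketch (123456 is a placeholder job ID):

seff 123456
sacct -j 123456 --format=JobID,Elapsed,AllocCPUS,MaxRSS,State

seff summarises CPU and memory efficiency; sacct reports elapsed time and peak memory (MaxRSS), which you can compare against your --time and --mem requests.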

Common Mistakes to Avoid

  1. Over-requesting resources
    # Bad: Requesting 32 cores for sequential code
    #SBATCH --cpus-per-task=32
    ./sequential_program
       
    # Good: Match core count to parallelization
    #SBATCH --cpus-per-task=1
    ./sequential_program
    
  2. Memory allocation errors
    # Bad: Not specifying memory for memory-intensive jobs
    #SBATCH --partition=defaultq
       
    # Good: Specify adequate memory
    #SBATCH --partition=defaultq
    #SBATCH --mem=16G
    
  3. GPU job inefficiencies
    # Bad: Too many CPU cores for GPU job
    #SBATCH --cpus-per-task=32
    #SBATCH --gpus-per-node=1
       
    # Good: Balanced CPU-GPU ratio
    #SBATCH --cpus-per-task=4
    #SBATCH --gpus-per-node=1
    

Summary

Resource optimization in HPC involves understanding your workload characteristics and matching them with appropriate resource allocations. Key takeaways:

  • Match resource requests (cores, memory, GPUs, walltime) to what your workload can actually use.

  • Test with small jobs first and scale up once utilization looks reasonable.

  • Monitor jobs with tools such as seff, top, and nvidia-smi, and adjust future requests based on actual usage.

Efficient resource utilization not only improves your job performance but also ensures fair access to shared HPC resources for all users.


Revisit Earlier Exercises

Now that you’ve learned how to submit jobs using Slurm and request computational resources effectively, revisit the following exercises from the earlier lesson: MPI with mpi4py and GPU with numba-cuda.

Try running them now on your cluster using the appropriate Slurm script and resource flags.

Solution 1: Slurm Submission Script for Exercise MPI with mpi4py

The following script can be used to submit your MPI-based Python program (mpi_hpc_ws.py) on an HPC cluster using Slurm:

#!/bin/bash
#SBATCH --job-name=mpi_hpc_ws
#SBATCH --output=mpi_%j.out
#SBATCH --error=mpi_%j.err
#SBATCH --partition=defaultq
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --time=00:10:00
#SBATCH --mem=16G

# Load required modules
module purge
module load Python/3.9.1
module list


# Create a Python virtual environment
python3 -m venv name_of_your_venv

# Activate your Python environment
source name_of_your_venv/bin/activate

# Run the MPI job
mpirun -np 4 python mpi_hpc_ws.py

Make sure your virtual environment has mpi4py installed and that your system has access to the OpenMPI runtime via mpirun. Adjust the number of nodes and tasks to match your cluster’s policies.
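If mpi4py is not yet installed, a minimal one-time setup on a login node might look like the following sketch (the venv name and module version are placeholders matching the script above):

module load Python/3.9.1
python3 -m venv name_of_your_venv
source name_of_your_venv/bin/activate
pip install mpi4py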

Solution 2: Slurm Submission Script for Exercise GPU with numba-cuda

The following script can be used to submit a GPU-accelerated Python job (numba_cuda_test.py) using Slurm:

#!/bin/bash
#SBATCH --job-name=Numba_Cuda
#SBATCH --output=Numba_Cuda_%j.out
#SBATCH --error=Numba_Cuda_%j.err
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gpus-per-node=1
#SBATCH --time=00:10:00

# --------- Load Environment ---------
module load Python/3.9.1
module load cuda/11.2
module list

# Activate your virtual environment (name_of_venv refers to the name of your virtual environment)
source name_of_venv/bin/activate

# --------- Check whether the GPU is available ---------
python -c "from numba import cuda; print('CUDA Available:', cuda.is_available())"

# --------- Run the Python Script ---------
python numba_cuda_test.py

Make sure your virtual environment includes the numba-cuda Python library so that the script can access the GPU.
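If it is not yet installed, a minimal setup inside the activated virtual environment might look like this (package name as referenced in this lesson; on some numba versions the CUDA target is bundled with numba itself):

pip install numba numba-cuda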

Key Points

  • Different computational models (sequential, parallel, GPU) significantly impact runtime and efficiency.

  • Sequential CPU execution is simple but inefficient for large parameter spaces.

  • Parallel CPU (e.g., MPI or OpenMP) reduces runtime by distributing tasks but is limited by CPU core counts and communication overhead.

  • GPU computing can drastically accelerate tasks with massively parallel workloads like grid-based simulations.

  • Choosing the right computational model depends on the problem structure, resource availability, and cost-efficiency.

  • Effective Slurm job scripts should match the workload to the hardware: CPUs for serial/parallel, GPUs for highly parallelizable tasks.

  • Monitoring tools (like nvidia-smi, seff, top) help validate whether the resource request matches the actual usage.

  • Optimizing resource usage minimizes wait times in shared environments and improves overall throughput.


Wrap-up

Overview

Teaching: 15 min
Exercises: 0 min
Questions
  • Looking back at what was covered and how different pieces fit together

  • Where are some advanced topics and further reading available?

Objectives
  • Put the course in context with future learning.

Summary

Further Resources

Below are some additional resources to help you continue learning:

Key Points

  • Keypoint 1