Setting the Scene
Overview
Teaching: 10 min
Exercises: 0 min
Questions
What are we teaching in this course?
What motivated the selection of topics covered in the course?
Objectives
Setting the scene and expectations
Making sure everyone has all the necessary software installed
Introduction
The course is organised into the following sections:
Section 1: HPC basics
Section 2: HPC Bura
Before We Start
A few notes before we start.
Prerequisite Knowledge
This is an intermediate-level software development course intended for people who have already been developing code in Python (or other languages) and applying it to their own problems after gaining basic software development skills. You are therefore expected to have some prerequisite knowledge of the topics covered, as outlined at the beginning of the lesson. Check out this quiz to help you test your prior knowledge and determine whether this course is for you.
Setup, Common Issues & Fixes
Have you set up and installed all the tools and accounts required for this course? Check the list of common issues, fixes & tips if you experience any problems running any of the tools you installed - your issue may be solved there.
Compulsory and Optional Exercises
Exercises are a crucial part of this course and the narrative. They are used to reinforce the points taught and give you an opportunity to practice things on your own. Please do not be tempted to skip exercises as that will get your local software project out of sync with the course and break the narrative. Exercises that are clearly marked as “optional” can be skipped without breaking things but we advise you to go through them too, if time allows. All exercises contain solutions but, wherever possible, try and work out a solution on your own.
Outdated Screenshots
Throughout this lesson we will make use of, and show content from, various interfaces, e.g. websites, locally installed software, the command line, etc. These are evolving tools and platforms, always adding new features and new visual elements. Screenshots in the lesson may therefore become out of sync, referring to or showing content that no longer exists or differs from what you see on your machine. If during the lesson you find screenshots that no longer match what you see, please open an issue describing what you see and how it differs from the lesson content. Feel free to add as many screenshots as necessary to clarify the issue.
Let Us Know About the Issues
The original materials were adapted specifically for this workshop. They have not been used before, and they may contain typos, code errors, or underexplained or unclear passages. Please let us know about any such issues; it will help us improve the materials and make the next workshop better.
$ cd ~/InterPython_Workshop_Example/data
$ ls -l
total 24008
-rw-rw-r-- 1 alex alex 23686283 Jan 10 20:29 kepler_RRLyr.csv
-rw-rw-r-- 1 alex alex 895553 Jan 10 20:29 lsst_RRLyr.pkl
-rw-rw-r-- 1 alex alex 895553 Jan 10 20:29 lsst_RRLyr_protocol_4.pkl
...
Exercise
Exercise task
Solution
Exercise solution
code example ...
Key Points
Keypoint 1
Keypoint 2
Section 1: HPC basics
Overview
Teaching: 5 min
Exercises: 0 min
Questions
Question 1
Objectives
Objective 1
Section overview, what it’s about, tools we’ll use, info we’ll learn.
- Intro into HPC calculations and how they differ from usual ones
- Introducing code examples (CPU and GPU ones)
- A recap of terminal commands useful for remote work and HPC (+practical session: terminal commands)
- Different types of HPC facilities, how to choose one, LSST HPC infrastructure
Key Points
Keypoint 1
HPC Intro
Overview
Teaching: 5 min
Exercises: 0 min
Questions
Question 1
Objectives
Objective 1
Simple, inexpensive computing tasks are typically performed sequentially, i.e. instructions are completed one after another in the order they appear in the code; this is the default paradigm in most programming languages. For larger tasks that require many operations to be executed, it is often more efficient to take advantage of the intrinsically parallel nature of most processors, which are designed to execute multiple processes simultaneously. Many common programming languages, including Python, support software that is executed in parallel, where multiple CPU cores are employed to perform tasks independently.
In modern computing, parallel programming has become more and more essential as computational tasks become more demanding. From protein folding in experimental drug development to galaxy formation and evolution, complex simulations rely on parallel computing to solve some of the most difficult problems in science. Parallel programming, hardware architecture, and systems administration come together in the multidisciplinary field of high-performance computing (HPC). In contrast to running code locally on your home machine, high-performance computing involves connecting to a cluster of computers elsewhere in the world that are networked together in order to run many operations in parallel.
Intro
Computer Architectures
Historically, computer architectures can be divided into two categories – von Neumann and Harvard. In the former, a computer system contains the following components:
- Arithmetic/logic unit (ALU)
- Control unit (CU)
- Memory unit (MU)
- Input/output (I/O) devices
The ALU takes in data from the MU and performs calculations, while the CU interprets instructions and directs the flow of data to and from the I/O devices, as shown in the diagram below. The MU contains all of the data and instructions, which creates a performance bottleneck related to data transfer.

Diagram of von Neumann architecture, from (\cite{https://onlinelibrary.wiley.com/doi/book/10.1002/9780470932025})
The Harvard architecture is a variant of the von Neumann design in which instruction and data storage are physically separated, allowing simultaneous access to instructions and data. This partially overcomes the von Neumann bottleneck, and most modern central processing units (CPUs) adopt this architecture.

Diagram of Harvard architecture, from (\cite{https://onlinelibrary.wiley.com/doi/book/10.1002/9780470932025})
Performance
Computational performance is largely determined by three components:
- CPU: CPU performance is quantified by frequency, or "clock speed". This determines how quickly a CPU executes the instructions passed to it, in terms of CPU cycles per second. For example, a CPU with a clock speed of 3.5 GHz performs 3.5 billion cycles each second. Some CPUs have multiple cores that support parallelization by executing multiple instructions simultaneously (\cite{https://www.intel.com/content/www/us/en/gaming/resources/cpu-clock-speed.html}).
- RAM: Random access memory (RAM) is a computer's short-term memory and stores the data a computer needs to run applications and open files. Faster RAM allows data to flow to and from your CPU more rapidly, and more RAM capacity helps the CPU complete complex operations simultaneously (\cite{https://www.intel.com/content/www/us/en/tech-tips-and-tricks/computer-ram.html}).
- Hard drive: In contrast to RAM, a computer's hard drive is for long-term data storage. Hard drives are characterized by their capacity and performance. Higher-capacity drives can hold more data, and higher-performance drives read and write data faster. Hard disk drives (HDDs) tend to offer more capacity at a lower cost, while solid state drives (SSDs) offer better performance and reliability.
Processing astronomical data, building models and running simulations requires significant computational power. The laptop or PC you're using right now probably has between 8 and 32 GB of RAM, a processor with 4-10 cores, and a hard drive that can store between 256 GB and 1 TB of data. But what happens if you need to process a dataset larger than 1 TB, if the model that has to be loaded into RAM is larger than 32 GB, or if the simulation you are running would take a month to compute on your CPU? You need a bigger computer, or many computers working in parallel.
Flynn’s Taxonomy: a framework for parallel computing
When we talk about parallel computing, it’s helpful to have a framework to classify different types of computer architectures. The most common one is Flynn’s Taxonomy, which was proposed in 1966 (\cite{https://ieeexplore.ieee.org/document/1447203}). It gives a simple vocabulary for describing how computers handle tasks, and will help us in understanding how certain programming models are better for certain problems.
Flynn’s taxonomy uses four words:
- Single
- Instruction
- Multiple
- Data
These are combined to describe four main architectures (\cite{https://onlinelibrary.wiley.com/doi/book/10.1002/9780470932025}). For a thorough overview of these, you can refer to the HIPOWERED book. Let us go over them briefly:
- SISD (Single Instruction, Single Data): This is a traditional serial computer, and is also called a von Neumann computer. It executes one instruction at a time on a single piece of data. Your laptop, when running a simple, non-parallel program, is acting as a SISD machine.
- SIMD (Single Instruction, Multiple Data): This is a parallel architecture where multiple processors all execute the same instruction at the same time, but each one works on a different piece of data. This is the key to massive data parallelism.
- MISD (Multiple Instruction, Single Data): Each processor uses a different instruction on the same piece of data. This architecture is very rare in practice.
- MIMD (Multiple Instruction, Multiple Data): This is the most common type of parallel computer today. It has multiple processors, and each one can execute different instructions on different data, all at the same time. This is the architecture of a multi-core processor and of entire computing clusters.
In addition to these, parallel computers can also be organized by their memory model:
- Multiprocessors: Computers with shared memory.
- Multicomputers: Computers with distributed memory.
SIMD in Practice: GPUs
An important example of SIMD architecture in modern computing is the GPU (Graphics Processing Unit).
GPUs were originally designed for computer graphics, which is an inherently parallel task (e.g., calculating the color of millions of pixels at once). Researchers soon realized this massive parallelism could be used for general-purpose scientific computing, including physics simulations and training AI models, leading to the term GPGPU (General-Purpose GPU). These allow for significant speedups in "data-parallel" models. The trade-off is that GPUs have a different memory hierarchy (with less cache per core compared to CPUs), meaning performance can be limited by algorithms that require frequent or irregular communication between threads.
A CPU consists of a few very powerful cores optimized for complex, sequential tasks. A GPU, in contrast, is made of thousands of simpler cores that are masters of efficiency for data-parallel problems. Because of this, nearly all modern supercomputers are hybrid systems that use both CPUs and GPUs, leveraging the strengths of each.
Supercomputers vs. Computing Clusters
In the early days of HPC, a "supercomputer" was often a single, monolithic machine with custom vector processors. Today, that has completely changed: the vast majority of systems are clusters. Let us define some terms associated with this:
- Cluster: A cluster is a collection of many individual, standard (SISD) computers (often called nodes) connected by a very fast, high-performance network. A modern supercomputer is a massive cluster. These are classified as multicomputers as they were originally built by connecting multiple SISD computers.
- Node: A node is a single computer within the cluster. It has its own processors (CPUs), memory (RAM), and sometimes its own accelerators (GPUs). A typical compute node in a cluster today has two CPUs with multiple cores each.
- Workload Manager (or Scheduler): The entire cluster is managed by a special piece of software called a workload manager or scheduler, such as SLURM or PBS. Its job is to manage all the resources, handle a queue of jobs from many users, and decide when and where jobs will run. When submitting a job, it is the scheduler which reserves a set of nodes for the job for a certain amount of time.
Network Topology for Clusters
Since a cluster is just a collection of nodes, the way these nodes are connected (called the network topology) is critical to performance. If any program needs to send data between nodes frequently, a slow or inefficient network will create a major bottleneck.
Common topologies for HPC include:
- Mesh: Nodes are arranged in a two- or three-dimensional grid, with each node connected to its nearest neighbors. This structure is illustrated in Figure 1 below, which shows examples of a 2D mesh, a 3D mesh, and a 2D torus (where the edges of the mesh wrap around to connect the boundaries, forming a torus).
Figure 1: 2D and 3D meshes: a) 2D mesh, b) 3D mesh, c) 2D torus.
- Fat Tree: The fat tree topology, shown in Figure 2, is widely used in large clusters. It is a hierarchical tree structure, but with "fatter" (higher bandwidth) links closer to the root to prevent network congestion when many nodes communicate simultaneously.
Figure 2: Fat tree topology.
Other topologies, which are less common for HPC, include Bus, Ring, Star, Hypercube, Fully connected, Crossbar and Multistage interconnection. More information can be found in the HiPowered book.
Never Run Computations on the Login Node!
When you connect to an HPC cluster, you land on a login node. This node is a shared resource for all users to compile code, manage files, and submit jobs to the workload manager. It is not designed for heavy computation! Running an intensive program on the login node will slow it down for everyone and is a classic mistake for new users. Your job must be submitted through the workload manager (e.g., using `sbatch` in SLURM) to run on the compute nodes.
File system
HPC clusters use a few different locations and formats for storage.
- Home directories: HPC clusters allocate personal storage to individual users, though typically with limited capacity. This is a good place to store scripts and configuration files.
- Scratch: Scratch space is temporary storage that offers significantly larger capacity for active jobs and processing; it is not backed up and is usually deleted after job completion. Using scratch space is appropriate for:
  - Jobs that require large storage capacity while running
  - Data sets that do not fit in personal storage but are not permanently needed
  - Jobs that need higher-performance storage than provided by personal storage
- Shared: Shared storage is accessible to multiple users. These spaces tend to be allocated to members of a research group as a common working directory and are continuously backed up.
(\cite{https://www.hpc.iastate.edu/guides/nova/storage}, \cite{https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=140938})
Which computer for which task?
- If you have an algorithm that requires the output from step A to start step B… (sequential code)
- If you have an algorithm that performs the same operation on a large volume of homogeneous data… (parallelizable code)
- If you have an algorithm that operates on vectors or matrices… (vectorization)
Key Points
Keypoint 1
Bura access
Overview
Teaching: 5 min
Exercises: 0 min
Questions
Question 1
Objectives
Objective 1
- HPC Bura, how to access it. Different authorization schemes used by the astronomical HPC facilities (+practical session: logging in to the Bura)
Intro
Paragraph 1
Exercise
Exercise task
Solution
Exercise solution
code example ...
Key Points
Keypoint 1
Command line basics
Overview
Teaching: XX min
Exercises: YY min
Questions
What command line skills do I need to work with data on High Performance Computing (HPC) systems?
Objectives
Learn essential CLI commands used in data management and processing on HPC
The top 10 basic commands to learn
CLI stands for Command Line Interface.
It is a way to interact with a computer program by typing text commands into a terminal or console window, instead of using a graphical user interface (GUI) with buttons and menus.
When working with large datasets, pipeline logs, and configuration files, mastering the command line is essential. Whether you're navigating a High Performance Computing (HPC) repo, inspecting files, or debugging processing failures, these Unix commands will be indispensable.
The following are general-purpose commands, and we may add LSST-specific notes where applicable.
Working with LSST data often involves accessing large-scale datasets stored in hierarchical directories, using symbolic links for shared data, and scripting reproducible data analysis pipelines. These are the fundamental commands every LSST astronomer should know.
File Preparation: needed for the later exercises
# Make a dummy data directory and populate it
mkdir -p 1.IntroHPC/1.CLI
echo "dummy input" > 1.IntroHPC/1.CLI/test.in
echo "file list" > 1.IntroHPC/1.CLI/test.files
touch 1.IntroHPC/1.CLI/14si.pspnc
Directory and File Operations
Setup (run once before these examples):
```bash
mkdir -p lsst_data/raw
cd lsst_data
touch image01.fits
echo "instrument: LATISS" > config.yaml
echo -e "INFO: Init\nFATAL: Calibration failed" > job.log
```
ls
List contents of a directory. Useful flags:
- `-l`: long format
- `-a`: include hidden files
- `-F`: append indicator (e.g. `/` for directory, `@` for symlink)
$ ls -alF
- `-a`: Show all files, including hidden ones (those starting with `.`, like `.bashrc`)
- `-C`: Display in columns
- `-F`: Append file type indicators: `/` for directories, `@` for symbolic links, `*` for executables
pwd, cd
To check and change the current directory:
$ pwd
$ cd raw
mkdir, tree
Create directories and visualize structure:
$ mkdir -p repo/gen3/raw/20240101
$ tree repo/gen3
File Manipulation
cp, mv, rm
Basic operations:
$ cp image01.fits image02.fits
$ mv image02.fits image_raw.fits
$ rm image_raw.fits
ln
Create symbolic links to avoid data duplication:
$ ln -s /datasets/lsst/raw/image01.fits ./image01.fits
Viewing and Extracting Data
cat, less, grep
View and search YAML config or log files:
$ cat config.yaml
$ less job.log
$ grep "FATAL" job.log
Permissions and Metadata
chmod, chown, stat
Manage and inspect file attributes:
$ chmod 644 config.yaml
$ stat image01.fits
LSST-Specific Use Cases
Familiarity with `bash`, `grep`, `find`, and `awk` will accelerate your workflow.
Exercises
Exercise 1: Set up LSST-style directory
- Create a folder structure:
lsst_cli/
├── visit001/
│   ├── raw/
│   ├── calexp/
│   └── logs/
├── visit002/
│   ├── raw/
│   ├── calexp/
│   └── logs/
- Populate each `raw/` with `image01.fits`, and create a symbolic link to `calexp.fits` in `calexp/`.
- Add a `process.yaml` and a log file in each `logs/`.
Use `tree` to verify.
Exercise 2: Analyze Logs
Using `grep` and `less`, identify all lines with "WARNING" or "FATAL" in the log files across visits.
Further Learning
Explore additional CLI tools:
- `awk`, `cut`, `xargs`
- `eups`, `conda` for environment setup
ls
List all the files in a directory. Linux, like many operating systems, organizes data into files and directories (also called folders).
$ ls
file0a file0b folder1 folder2 link0a link2a
Some terminals offer colored output so you can differentiate normal files from folders. You can make the difference clearer with:
$ ls -aCF
./ ../ file0a file0b folder1/ folder2/ link0a@ link2a@
You will see two extra directories, `.` and `..`. These are special folders that refer to the current folder and the folder one level up in the tree.
Directories have the suffix `/`. Symbolic links, a kind of shortcut to other files or directories, are indicated with the symbol `@`.
Another option to get more information about the files in the system is:
$ ls -al
total 16
drwxr-xr-x 5 andjelka staff 160 Jun 16 08:53 .
drwxr-xr-x+ 273 andjelka staff 8736 Jun 16 08:52 ..
-rw-r--r-- 1 andjelka staff 19 Jun 16 08:53 config.yaml
-rw-r--r-- 1 andjelka staff 0 Jun 16 08:53 image01.fits
-rw-r--r-- 1 andjelka staff 37 Jun 16 08:53 job.log
The characters in the first column indicate the permissions. The first character is "d" for directories, "l" for symbolic links and "-" for normal files. The next 3 characters are the "read", "write" and "execute" permissions for the owner. The next 3 are for the group, and the final 3 are for others. "Execute" for a file means that the file can be run as a script or binary executable; for a directory it means that you can enter it and access its contents.
cp
This command copies the contents of one file into another file. For example
$ cp file0b file0c
rm
This command deletes a file. For example
$ rm file0c
There is no such thing as a trash folder on an HPC system. Deleting a file should be considered an irreversible operation.
Recursive deletes can be done with
$ rm -rf folder_to_delete
Be extremely cautious when deleting files recursively. You cannot damage the system, since you cannot delete files that you do not own. However, you can delete all of your own files forever.
mv
This command moves a file from one directory to another. It can also be used to rename files or directories.
$ mv file0b file0c
pwd
It is easy to get lost when you move through complex directory structures. pwd will tell you the current directory.
$ pwd
/Users/andjelka/Documents/LSST/interpython/interpython_hpc
cd
This command moves you to the directory indicated as an argument; if no argument is given, it returns you to your home directory.
$ cd folder1
cat and tac
When you want to see the contents of a text file, the command cat displays the contents on the screen. It is also useful when you want to concatenate the contents of several files.
$ cat star_A_lc.csv
time,brightness
0.0,90.5
0.5,91.1
1.0,88.9
1.5,92.2
2.0,89.3
2.5,90.8
3.0,87.7...
To concatenate files you need to use the symbol `>`, indicating that you want to redirect the output of a command into a file:
$ cat file1 file2 file3 > file_all
The command tac shows a file in reverse, starting from the last line back to the first one.
more and less
Sometimes text files, such as those produced by simulations, are too large to view on one screen; the command `more` shows a file one screen at a time. The command `less` offers more functionality and should be the tool of choice for viewing large text files.
$ less OUT
ln
This command allows you to create links between files. Used wisely, it can save you time when you frequently navigate to deeply nested directories. By default it creates hard links. Hard links are like copies, but they reference the same place on disk. Symbolic links are better in many cases because they can cross file systems and partitions. To create a symbolic link:
$ ln -s file1 link_to_file1
grep
The grep command extracts from its input the lines containing a specified string or regular expression. It is a powerful command for extracting specific information from large files. Consider for example:
$ grep time star_A_lc.csv
time,brightness
$ grep 88.9 star_A_lc.csv
1.0,88.9
...
Create a light curve directory with empty CSV files using the touch command (or use the provided CSV files):
mkdir -p lightcurves
cd lightcurves
touch star_A_lc.csv star_B_lc.csv star_C_lc.csv
ln -s star_A_lc.csv brightest_star.csv
ls – List Light Curve Files
List files:
$ ls
star_A_lc.csv star_B_lc.csv star_C_lc.csv brightest_star.csv
Use `-F` and `-a` for extra detail:
$ ls -aF
./ ../ star_A_lc.csv star_B_lc.csv star_C_lc.csv brightest_star.csv@
Long format with metadata:
$ ls -al
-rw-r--r-- 1 user staff 1024 Jun 16 09:00 star_A_lc.csv
lrwxr-xr-x 1 user staff 15 Jun 16 09:01 brightest_star.csv -> star_A_lc.csv
cp – Copy a Light Curve File
$ cp star_B_lc.csv backup_star_B.csv
rm – Delete a Corrupted Light Curve
$ rm star_C_lc.csv
mv – Rename Light Curve
$ mv star_B_lc.csv star_B_epoch1.csv
pwd – Show Working Directory
$ pwd
/home/user/...../lightcurves
cd – Move Between Directories
$ cd ../images
cat and tac – Inspect or Reverse Light Curve
cat star_A_lc.csv
tac star_A_lc.csv
Combine curves:
cat star_A_lc.csv star_B_epoch1.csv > merged_lc.csv
more and less – View Long Curves
$ less star_A_lc.csv
ln – Create Alias for Light Curve
ln -s star_B_epoch1.csv variable_star.csv
grep – Extract Brightness Above Threshold
grep ',[89][0-9]\.[0-9]*' star_A_lc.csv
Regular expressions offer ways to specify text strings that can vary in several ways, and allow commands such as grep to extract those strings efficiently. We will see more about regular expressions on our third day, devoted to data processing.
More commands
The 10 commands above will give you enough tools to move files around and navigate the directory tree. The GNU Core Utilities are the basic file, shell and text manipulation utilities of the GNU operating system. These are the core utilities expected to exist on every operating system.
If you want to know about the whole set of coreutils execute:
info coreutils
Each command has its own manual. You can access those manuals with
man <COMMAND>
Output of entire files
- cat: Concatenate and write files
- tac: Concatenate and write files in reverse
- nl: Number lines and write files
- od: Write files in octal or other formats
- base64: Transform data into printable data
Formatting file contents
- fmt: Reformat paragraph text
- numfmt: Reformat numbers
- pr: Paginate or columnate files for printing
- fold: Wrap input lines to fit in specified width
Output of parts of files
- head: Output the first part of files
- tail: Output the last part of files
- split: Split a file into fixed-size pieces
- csplit: Split a file into context-determined pieces
Summarizing files
- wc: Print newline, word, and byte counts
- sum: Print checksum and block counts
- cksum: Print CRC checksum and byte counts
- md5sum: Print or check MD5 digests
- sha1sum: Print or check SHA-1 digests
- sha2 utilities: Print or check SHA-2 digests
Operating on sorted files
- sort: Sort text files
- shuf: Shuffle text files
- uniq: Uniquify files
- comm: Compare two sorted files line by line
- ptx: Produce a permuted index of file contents
- tsort: Topological sort
Operating on fields
- cut: Print selected parts of lines
- paste: Merge lines of files
- join: Join lines on a common field
Operating on characters
- tr: Translate, squeeze, and/or delete characters
- expand: Convert tabs to spaces
- unexpand: Convert spaces to tabs
Directory listing
- ls: List directory contents
- dir: Briefly list directory contents
- vdir: Verbosely list directory contents
- dircolors: Color setup for 'ls'
Basic operations
- cp: Copy files and directories
- dd: Convert and copy a file
- install: Copy files and set attributes
- mv: Move (rename) files
- rm: Remove files or directories
- shred: Remove files more securely
Special file types
- link: Make a hard link via the link syscall
- ln: Make links between files
- mkdir: Make directories
- mkfifo: Make FIFOs (named pipes)
- mknod: Make block or character special files
- readlink: Print value of a symlink or canonical file name
- rmdir: Remove empty directories
- unlink: Remove files via unlink syscall
Changing file attributes
- chown: Change file owner and group
- chgrp: Change group ownership
- chmod: Change access permissions
- touch: Change file timestamps
Disk usage
- df: Report file system disk space usage
- du: Estimate file space usage
- stat: Report file or file system status
- sync: Synchronize data on disk with memory
- truncate: Shrink or extend the size of a file
Printing text
- echo: Print a line of text
- printf: Format and print data
- yes: Print a string until interrupted
Conditions
- false: Do nothing, unsuccessfully
- true: Do nothing, successfully
- test: Check file types and compare values
- expr: Evaluate expressions
- tee: Redirect output to multiple files or processes
File name manipulation
- basename: Strip directory and suffix from a file name
- dirname: Strip last file name component
- pathchk: Check file name validity and portability
- mktemp: Create temporary file or directory
- realpath: Print resolved file names
Working context
- pwd: Print working directory
- stty: Print or change terminal characteristics
- printenv: Print all or some environment variables
- tty: Print file name of terminal on standard input
User information
- id: Print user identity
- logname: Print current login name
- whoami: Print effective user ID
- groups: Print group names a user is in
- users: Print login names of users currently logged in
- who: Print who is currently logged in
System context
- arch: Print machine hardware name
- date: Print or set system date and time
- nproc: Print the number of processors
- uname: Print system information
- hostname: Print or set system name
- hostid: Print numeric host identifier
- uptime: Print system uptime and load
Modified command
- chroot: Run a command with a different root directory
- env: Run a command in a modified environment
- nice: Run a command with modified niceness
- nohup: Run a command immune to hangups
- stdbuf: Run a command with modified I/O buffering
- timeout: Run a command with a time limit
Process control
- kill: Send a signal to processes
Delaying
- sleep: Delay for a specified time
Numeric operations
- factor: Print prime factors
- seq: Print numeric sequences
Exercise: Using the Command Line Interface
- Create 4 folders `A`, `B`, `C`, `D`, and inside each of them create three more: `X`, `Y` and `Z`. At the end you should have 12 subfolders. Use the command tree to check that you created the correct tree.

Solution
You should get:
$ tree
.
├── A
│   ├── X
│   ├── Y
│   └── Z
├── B
│   ├── X
│   ├── Y
│   └── Z
├── C
│   ├── X
│   ├── Y
│   └── Z
└── D
    ├── X
    ├── Y
    └── Z
- Let's copy some files into those folders. In the data folder `1.IntroHPC/1.CLI` there are 3 files: `test.in`, `test.files` and `14si.pspnc`. Using the command line tools, create copies of `test.in` and `test.files` inside each of those folders, and a symbolic link for `14si.pspnc`. Both `test.in` and `test.files` are text files that we want to edit, but `14si.pspnc` is just a relatively big file that we only need to use for the simulation; we do not want to make copies of it, just symbolic links, to save disk space.
Solution
Step-by-step CLI commands:
# Step 1: Create the main folders
mkdir -p A/X A/Y A/Z B/X B/Y B/Z C/X C/Y C/Z D/X D/Y D/Z
# Step 2: Confirm structure
tree
Output should be:
.
├── A
│   ├── X
│   ├── Y
│   └── Z
├── B
│   ├── X
│   ├── Y
│   └── Z
├── C
│   ├── X
│   ├── Y
│   └── Z
└── D
    ├── X
    ├── Y
    └── Z
File Preparation:
# Make a dummy data directory and populate it
mkdir -p 1.IntroHPC/1.CLI
echo "dummy input" > 1.IntroHPC/1.CLI/test.in
echo "file list" > 1.IntroHPC/1.CLI/test.files
touch 1.IntroHPC/1.CLI/14si.pspnc
Copy and link files
for folder in A B C D; do
  for sub in X Y Z; do
    cp 1.IntroHPC/1.CLI/test.in $folder/$sub/
    cp 1.IntroHPC/1.CLI/test.files $folder/$sub/
    # The link target is resolved relative to the directory containing the link ($folder/$sub/)
    ln -s ../../1.IntroHPC/1.CLI/14si.pspnc $folder/$sub/14si.pspnc
  done
done
Verify
tree A
cat A/X/test.in
ls -l A/X/14si.pspnc
Midnight Commander
GNU Midnight Commander is a visual file manager. mc is a rich, full-screen, text-mode application that allows you to copy, move and delete files and whole directory trees. Sometimes using a text-based user interface is convenient; to use mc, just enter the command in the terminal:
mc
There are several keystrokes that can be used to work with mc; most of them come from the F1 to F10 keys. On a Mac you need to press the "fn" key; on GNOME (Linux), you need to disable the interpretation of the function keys in gnome-terminal.
Exercise: Using the Command Line Interface
Use mc to create a folder E and subfolders X, Y and Z, copy the same files as we did for the previous exercise.
Exercise: Create LSST-style Visit Directory Structure
Use the CLI to create the following:
lsst_cli/
├── visit001/
│ ├── raw/
│ ├── calexp/
│ └── logs/
├── visit002/
│ ├── raw/
│ ├── calexp/
│ └── logs/
Then:
- Add dummy files `image01.fits` into each `raw/` folder.
- Create symbolic links from `calexp/calexp.fits` to `../raw/image01.fits`.
- Create YAML files in each `logs/` folder with config info, and dummy `job.log` files with `WARNING` and `FATAL` strings.
Exercise: Analyze Simulated Pipeline Logs
Use `grep` to find all lines in all `job.log` files containing "FATAL" or "WARNING".
$ grep -rE 'FATAL|WARNING' lsst_cli/
Key Points
Basic CLI skills enable efficient navigation and manipulation of data repositories
Use man to explore arguments for command-line tools
HPC facilities
Overview
Teaching: 5 min
Exercises: 0 min
Questions
Question 1
Objectives
Objective 1
What are the IDACs
- IDACs idea: computational facilities from all over the world contribute their CPU hours and storage space
- Different types of IDACs: full DR storage, light IDACs, computation-only…
IDACs roster
A table with IDAC website, CPUs/GPU/Storage space data, Status (operational, construction, planned…), LSST and other surveys data stored, access info (command line/GUI), access policy (automated upon registration, personal contact needed, restricted to certain countries, etc), additional information (e.g. no Jupyter or best suited for LSST epoch image analysis).
Key Points
Keypoint 1
Section 2: HPC Bura
Overview
Teaching: 5 min
Exercises: 0 min
Questions
Question 1
Objectives
Objective 1
Section overview, what it’s about, tools we’ll use, info we’ll learn.
- HPC Bura, how to access it. Different authorization schemes used by the astronomical HPC facilities (+practical session: logging in to the Bura)
- Intro for computing nodes and resources
- Slurm as a workload manager (+practical session: how to use Slurm)
- Resource optimization (+practical session: running CPU and GPU code examples)
Key Points
Keypoint 1
Slurm
Overview
Teaching: 5 min
Exercises: 0 min
Questions
Question 1
Objectives
Objective 1
- Slurm as a workload manager (+practical session: how to use Slurm)
Intro
Paragraph 1
Key Points
Keypoint 1
Intro for computing nodes and resources
Overview
Teaching: 5 min
Exercises: 0 min
Questions
Question 1
Objectives
Objective 1
- Intro for computing nodes and resources
Intro
Paragraph 1
Key Points
Keypoint 1
Intro code examples
Overview
Teaching: 30 min
Exercises: 20 min
Questions
What is the difference between serial and parallel code?
How do CPU and GPU programs differ?
What tools and programming models are used for HPC development?
Objectives
Understand the structure of CPU and GPU code examples.
Identify differences between serial, multi-threaded, and GPU-accelerated code.
Recognize common programming models like OpenMP, MPI, and CUDA.
Appreciate performance trade-offs and profiling basics.
Motivation for HPC Coding
Most users begin with simple serial code, which runs sequentially on one processor. However, for problems involving large data sets, high resolution simulations, or time-critical tasks, serial execution quickly becomes inefficient.
Parallel programming allows us to split work across multiple CPUs or even GPUs. High-Performance Computing (HPC) relies on this concept to solve problems faster.
Figure Suggestion:
Plot showing execution time of serial vs parallel implementation for increasing problem sizes (e.g., matrix size or loop iterations).
Serial Code Example (CPU)
Introduction to NumPy
Before diving into parallel computing or GPU acceleration, it’s important to understand how performance can already be improved significantly on a CPU using efficient libraries.
- One of the most widely used tools for this in Python is NumPy. NumPy provides a fast and memory-efficient way to handle large numerical datasets using multi-dimensional arrays and vectorized operations.
- While regular Python lists are flexible, they are not optimized for heavy numerical tasks. Looping through data element by element can quickly become a bottleneck as the problem size grows.
- NumPy solves this problem by providing a powerful N-dimensional array object and tools for performing operations on these arrays efficiently.
- Under the hood, NumPy uses optimized C code, so operations are much faster than using standard Python loops.
- NumPy also supports vectorized operations, which means you can apply functions to entire arrays without writing explicit loops (see the small example below). This not only improves performance but also leads to cleaner and more readable code.
- Using NumPy on the CPU is often the first step toward writing efficient scientific code.
- It's a strong foundation before we move on to parallel computing or GPU acceleration. Now we'll see an example of how a simple numerical operation is implemented using NumPy on a single CPU core.
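To make the idea of vectorization concrete before the summation example, here is a small illustration (added for this lesson, using a made-up array of fluxes) comparing an explicit Python loop with the equivalent array expression:

```python
import numpy as np

# Hypothetical example: convert an array of fluxes to magnitudes
flux = np.random.rand(1_000_000) + 0.1   # offset avoids log of zero

# Loop version: one element at a time (slow in pure Python)
mag_loop = np.empty_like(flux)
for i in range(flux.size):
    mag_loop[i] = -2.5 * np.log10(flux[i])

# Vectorized version: the same operation applied to the whole array at once
mag_vec = -2.5 * np.log10(flux)

print(np.allclose(mag_loop, mag_vec))   # True: same results, very different speed
```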
Example: Summing the elements of a large array using Serial Computation
import numpy as np
import time
array = np.random.rand(10**7)
start = time.time()
total = np.sum(array)
end = time.time()
print(f"Sum: {total}, Time taken: {end - start:.4f} seconds")
Exercise:
Modify the above to use a manual loop with `for` instead of `np.sum`, and compare the performance.
Solution
Replace `np.sum(array)` with a manual loop using `for`.
Note: This will be much slower due to Python's loop overhead.

import numpy as np
import time

array = np.random.rand(10**7)

start = time.time()
total = 0.0
for value in array:
    total += value
end = time.time()

print(f"Sum: {total}, Time taken: {end - start:.4f} seconds")

This gives you a baseline for how optimized `np.sum` is compared to native Python loops.
Reference:
Parallel CPU Programming
Introduction to OpenMP and MPI
Parallel programming on CPUs is primarily achieved through two widely-used models:
OpenMP (Open Multi-Processing)
OpenMP is used for shared-memory parallelism. It enables multi-threading where each thread has access to the same memory space. It is ideal for multicore processors on a single node.
OpenMP was first introduced in October 1997 as a collaborative effort between hardware vendors, software developers, and academia. The goal was to standardize a simple, portable API for shared-memory parallel programming in C, C++, and Fortran. Over time, OpenMP has evolved to support nested parallelism, Single Instruction Multiple Data (vectorization), and offloading to GPUs, while remaining easy to integrate into existing code through compiler directives.
OpenMP is now maintained by the OpenMP Architecture Review Board, which includes organizations like Arm, AMD, IBM, Intel, Cray, HP, Fujitsu, Nvidia, NEC, Red Hat, Texas Instruments, and Oracle Corporation. OpenMP allows you to parallelize loops in C/C++ or Fortran using compiler directives.
Example: Running a loop in parallel using OpenMP
#include <omp.h>
#pragma omp parallel for
for (int i = 0; i < N; i++) {
a[i] = b[i] + c[i];
}
Since C programming is not a prerequisite for this workshop, let’s break down the parallel loop code in detail.
Requirements:
- Add `#include <omp.h>` to your code
- Compile with the `-fopenmp` flag
Explanation of the code
- `#include <omp.h>`: Includes the OpenMP API header needed for all OpenMP functions and directives.
- `#pragma omp parallel for`: A compiler directive that tells the compiler to parallelize the `for` loop that follows.
- The `for` loop itself performs element-wise addition of two arrays (`b` and `c`), storing the result in array `a`.

How OpenMP Executes This
- OpenMP detects available CPU cores (e.g., 4 or 8).
- It splits the loop into chunks — one for each thread.
- Each core runs its chunk simultaneously (in parallel).
- The threads synchronize automatically once all work is done.
Output
The output is stored in array `a`, which will contain the sum of corresponding elements from arrays `b` and `c`. The execution is faster than running the loop sequentially.

Real-World Analogy
Suppose you need to send 100 emails:
- Without OpenMP: One person sends all 100 emails one by one.
- With OpenMP: 4 people each send 25 emails at the same time — finishing in a quarter of the time.
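Since C programming is not a prerequisite for this workshop, here is a rough Python analogue of the same idea using the standard `multiprocessing` module (a sketch added for illustration, not part of the original materials; for an operation this cheap the process start-up overhead usually outweighs the gain, but the chunking mirrors how OpenMP splits loop iterations between threads):

```python
import numpy as np
from multiprocessing import Pool

def add_chunk(args):
    # Each worker receives one chunk of b and c and returns their element-wise sum
    b_chunk, c_chunk = args
    return b_chunk + c_chunk

if __name__ == "__main__":
    N = 1_000_000
    n_workers = 4
    b = np.random.rand(N)
    c = np.random.rand(N)

    # Split the work into one chunk per worker, like OpenMP splitting a loop
    b_parts = np.array_split(b, n_workers)
    c_parts = np.array_split(c, n_workers)

    with Pool(n_workers) as pool:
        parts = pool.map(add_chunk, list(zip(b_parts, c_parts)))

    a = np.concatenate(parts)
    print("First 5 results:", a[:5])
```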
Exercise: Parallelization Challenge
Consider this loop:
for (int i = 1; i < N; i++) { a[i] = a[i-1] + b[i]; }
Can this be parallelized with OpenMP? Why or why not?
Solution
No, this cannot be safely parallelized, because each iteration depends on the result of the previous iteration (`a[i-1]`).
OpenMP requires loop iterations to be independent for parallel execution. Here, since each `a[i]` relies on `a[i-1]`, the loop has a sequential dependency, also known as a loop-carried dependency. This prevents naive parallelization with OpenMP's `#pragma omp parallel for`.
However, this type of problem can be parallelized using more advanced techniques like a parallel prefix sum (scan) algorithm, which restructures the computation to allow parallel execution in logarithmic steps instead of linear.
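In Python you rarely need to implement such a scan yourself: NumPy's cumulative sum computes the same dependent recurrence with an optimized library routine (a small sketch, assuming the recurrence starts with a[0] = b[0]):

```python
import numpy as np

b = np.array([1.0, 2.0, 3.0, 4.0])

# Sequential version of the dependent loop: a[i] = a[i-1] + b[i], with a[0] = b[0]
a_loop = np.empty_like(b)
a_loop[0] = b[0]
for i in range(1, b.size):
    a_loop[i] = a_loop[i - 1] + b[i]

# The same result expressed as a prefix sum (scan)
a_scan = np.cumsum(b)

print(a_loop, a_scan)   # both give [ 1.  3.  6. 10.]
```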
MPI (Message Passing Interface)
MPI is used for distributed-memory parallelism. Processes run on separate memory spaces (often on different nodes) and communicate via message passing. It is suitable for large-scale HPC clusters.
MPI emerged earlier, in the early 1990s, as the need for a standardized message-passing interface became clear in the growing field of distributed-memory computing. Before MPI, various parallel systems used their own vendor-specific libraries, making code difficult to port across machines.
In June 1994, the first official MPI standard (MPI-1) was published by the MPI Forum, a collective of academic institutions, government labs, and industry partners. Since then, MPI has become the de facto standard for scalable parallel computing across multiple nodes, and it continues to evolve with versions such as MPI-2, MPI-3, MPI-4, and most recently MPI-5 (released on June 5, 2025), which add support for features like parallel I/O and dynamic process management.
Example: Implementation of MPI using the mpi4py library in python
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
data = rank ** 2
all_data = comm.gather(data, root=0)
if rank == 0:
    print(all_data)
Explanation of the code
This example demonstrates a basic use of `mpi4py` to perform a gather operation using the `MPI.COMM_WORLD` communicator.
Each process:
- Determines its rank (an integer from 0 to N-1, where N is the number of processes).
- Computes `rank ** 2` (the square of its rank).
- Uses `comm.gather()` to send the result to the root process (rank 0).
Only the root process gathers the data and prints the complete list.
Example Output (4 processes):
- Rank 0 computes 0² = 0
- Rank 1 computes 1² = 1
- Rank 2 computes 2² = 4
- Rank 3 computes 3² = 9
The root process (rank 0) gathers all results and prints:
[0, 1, 4, 9]
Other ranks do not print anything.
This example illustrates point-to-root communication — useful when one process needs to collect and process results from all workers.
Note:
You won’t be able to run this code in your current environment. This example requires a Slurm job submission script to launch MPI processes across nodes. Detailed instructions on how to configure Slurm scripts and request resources are provided in Section 2: HPC Bura - Resource Optimization .
Typically, one would run this file from a Slurm script that requests the required resources, using a command like:
mpirun -n 4 python your_script.py
Exercise:
Modify the serial array summation to use OpenMP (C) or `multiprocessing` (Python).
References:
GPU Programming Concepts
GPUs, or Graphics Processing Units, are composed of thousands of lightweight processing cores that are optimized for handling multiple operations simultaneously. This parallel architecture makes them particularly effective for data-parallel problems, where the same operation is performed independently across large datasets such as matrix multiplications, vector operations, or image processing tasks.
Originally designed to accelerate the rendering of complex graphics and visual effects in computer games, GPUs are inherently well-suited for high-throughput computations involving large tensors and multidimensional arrays. Their architecture enables them to perform numerous arithmetic operations in parallel, which has made them increasingly valuable in scientific computing, deep learning, and simulations.
Even without explicit parallel programming, many modern libraries and frameworks (such as TensorFlow, PyTorch, and CuPy) can automatically leverage GPU acceleration to significantly improve performance. However, to fully exploit the computational power of GPUs, especially in high-performance computing (HPC) environments, explicit parallelization is often employed.
Introduction to CUDA
In HPC systems, CUDA (Compute Unified Device Architecture), a parallel computing platform and programming model developed by NVIDIA, is the most widely used platform for GPU programming. CUDA allows developers to write highly parallel code that runs directly on the GPU, providing fine-grained control over memory usage, thread management, and performance optimization. It allows developers to harness the power of NVIDIA GPUs for general-purpose computing, known as GPGPU (General-Purpose computing on Graphics Processing Units).
A Brief History
- Introduced by NVIDIA in 2006, CUDA was the first platform to provide direct access to the GPU’s virtual instruction set and parallel computational elements.
- Before CUDA, GPUs were primarily used for rendering graphics, and general-purpose computations required indirect use through graphics APIs like OpenGL or DirectX.
- CUDA revolutionized scientific computing, deep learning, and high-performance computing (HPC) by enabling massive parallelism and accelerating workloads previously limited to CPUs.
How CUDA Works
CUDA allows developers to write C, C++, Fortran, and Python code that runs on the GPU.
- A CUDA program typically runs on both the CPU (host) and the GPU (device).
- Computational tasks (kernels) are written to execute in parallel across thousands of lightweight CUDA threads.
- These threads are organized hierarchically into:
- Grids of Blocks
- Blocks of Threads
This hierarchical design allows fine-grained control over memory and computation.
Key Features
- Massive parallelism with thousands of concurrent threads
- Unified memory architecture for seamless CPU-GPU data access
- Built-in libraries for BLAS, FFT, random number generation, and more (e.g., cuBLAS, cuFFT, cuRAND)
- Tooling support including profilers, debuggers, and performance analyzers (e.g., Nsight, CUDA-GDB)
A CUDA program includes:
- Host code: Runs on the CPU, manages memory, and launches kernels.
- Device code (kernel): Runs on the GPU.
- Memory management: Host/device memory allocations and transfers.
Checking CUDA availability before running code
from numba import cuda

if cuda.is_available():
    print("CUDA is available!")
    print(f"Detected GPU: {cuda.get_current_device().name}")
else:
    print("CUDA is NOT available.")
High-Level Libraries for Portability
High-level libraries allow easier GPU programming in Python:
- Numba: JIT compiler for Python; supports GPU via `@cuda.jit`
- CuPy: NumPy-like API for NVIDIA GPUs
- Dask: Parallel computing with familiar APIs
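As a point of comparison, the vector addition shown below with Numba could also be written with CuPy's NumPy-like API (a sketch added here, assuming CuPy is installed and an NVIDIA GPU is available; it is not part of the original example):

```python
import cupy as cp   # requires an NVIDIA GPU and a matching CuPy build

N = 1_000_000
a = cp.arange(N, dtype=cp.float32)   # arrays are allocated directly on the GPU
b = cp.arange(N, dtype=cp.float32)

c = a + b                            # element-wise addition runs as a GPU kernel

# Move the result back to the host as a NumPy array for inspection
c_host = cp.asnumpy(c)
print("First 5 results:", c_host[:5])
```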
Example: Add vectors utilising CUDA using the Numba Python library
from numba import cuda
import numpy as np
import time
@cuda.jit
def add_vectors(a, b, c):
    i = cuda.grid(1)
    if i < a.size:
        c[i] = a[i] + b[i]
# Setup input arrays
N = 1_000_000
a = np.arange(N, dtype=np.float32)
b = np.arange(N, dtype=np.float32)
c = np.zeros_like(a)
# Copy arrays to device
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.device_array_like(a)
# Configure the kernel
threads_per_block = 256
blocks_per_grid = (N + threads_per_block - 1) // threads_per_block
# Launch the kernel
start = time.time()
add_vectors[blocks_per_grid, threads_per_block](d_a, d_b, d_c)
cuda.synchronize() # Wait for GPU to finish
gpu_time = time.time() - start
# Copy result back to host
d_c.copy_to_host(out=c)
# Verify results
print("First 5 results:", c[:5])
print("Time taken on GPU:", gpu_time, "seconds")
Note:
This code also requires GPU access and Slurm job submission to be executed properly. You will revisit this exercise after completing Section 2: HPC Bura - Resource Optimization , which introduces how to configure resources and submit jobs.
Exercise:
Write a Numba or CuPy version of vector addition and compare speed with NumPy.
References:
CPU vs GPU Architecture
- CPUs: Few powerful cores, better for sequential tasks.
- GPUs: Many lightweight cores, ideal for parallel workloads.
Figure Suggestion:
Diagram comparing CPU vs GPU architecture, e.g., from CUDA C Programming Guide
Comparing CPU and GPU Approaches
Feature | CPU (OpenMP/MPI) | GPU (CUDA) |
---|---|---|
Cores | Few (2–64) | Thousands (1024–10000+) |
Memory | Shared / distributed | Device-local (needs transfer) |
Programming | Easier to debug | Requires more setup |
Performance | Good for logic-heavy tasks | Excellent for large, data-parallel problems |
Exercise:
Show which parts of the code execute on GPU vs CPU (host vs device). Read about concepts like memory copy and kernel launch.
Reference: NVIDIA CUDA Samples
Figure:
Bar chart showing performance on matrix multiplication or vector addition.
Code Profiling (Optional)
To understand and improve performance, profiling tools are essential.
- CPU: `gprof`, `perf`, `cProfile`
- GPU: `nvprof`, Nsight Systems, Nsight Compute
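For Python code, a minimal profiling run with the built-in `cProfile` module might look like this (a sketch; `compute()` is a placeholder for whatever function you want to profile):

```python
import cProfile
import pstats
import numpy as np

def compute():
    # Placeholder workload: replace with your own function
    a = np.random.rand(2_000_000)
    return np.sum(np.sqrt(a))

cProfile.run("compute()", "profile.out")        # write raw profiling stats to a file
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(5)   # show the 5 most time-consuming calls
```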
Exercise:
Time your serial and parallel code. Where is the bottleneck?
Optional Reference: NVIDIA Nsight Tools
Summary
- Serial code is simple but doesn’t scale well.
- Use OpenMP and MPI for parallelism on CPUs.
- Use CUDA (or high-level wrappers like Numba/CuPy) for GPU programming.
- Always profile your code to understand performance.
- Choose your tool based on problem size, complexity, and hardware.
Key Points
Serial code is limited to a single thread of execution, while parallel code uses multiple cores or nodes.
OpenMP and MPI are popular for parallel CPU programming; CUDA is used for GPU programming.
High-level libraries like Numba and CuPy make GPU acceleration accessible from Python.
Resource optimization
Overview
Teaching: 30 min
Exercises: 10 min
Questions
What is the difference between requesting for CPU and GPU resources using Slurm?
How can I optimize my Slurm script to request the best resources for my specific task?
Objectives
Understand different types of computational workloads and their resource requirements
Write optimized Slurm job scripts for sequential, parallel, and GPU workloads
Monitor and analyze resource utilization
Apply best practices for efficient resource allocation
Understanding Resource Requirements
Different computational tasks have varying resource requirements. Understanding these patterns is crucial for efficient HPC usage.
Types of Workloads
CPU-bound workloads: Tasks that primarily use computational power
- Mathematical calculations, simulations, data processing
- Benefit from more CPU cores and higher clock speeds
Memory-bound workloads: Tasks limited by memory access speed
- Large dataset processing, in-memory databases
- Require sufficient RAM and fast memory access
I/O-bound workloads: Tasks limited by disk or network operations
- File processing, database queries, data transfer
- Benefit from fast storage and network connections
GPU-accelerated workloads: Tasks that can utilize parallel processing
- Machine learning, scientific simulations, image processing
- Require appropriate GPU resources and memory
Types of Jobs and Resources
Job Type | SLURM Partition | Key SLURM Options | Example Use Case |
---|---|---|---|
Serial | `serial` | `--partition`, no MPI | Single-thread tensor calc |
Parallel | `defaultq` | `-N`, `-n`, `mpirun` | MPI simulation |
GPU | `gpu` | `--gpus`, `--cpus-per-task` | Deep learning training |
Choosing the Right Node
- GPU Node: For massively parallel computations on GPUs (e.g., CUDA, TensorFlow, PyTorch).
- SMP Node: For jobs needing large shared memory (big matrices, in-memory data) or multi-threaded code (OpenMP, R, Python multiprocessing).
- Regular Node: For MPI-based distributed jobs or simple CPU tasks.
Decision chart for Choosing Nodes:
Example
To understand how we can utilise the different resources available on the HPC for the same computational task, we take the example of a Python code which calculates the gravitational deflection angle, defined in the following way:
Deflection Angle Formula
For light passing near a massive object, the deflection angle (α) in the weak-field approximation is given by:
α = 4GM / (c²b)
Where:
- G = Gravitational constant (6.67430 × 10⁻¹¹ m³ kg⁻¹ s⁻²)
- M = Mass of the lensing object (in kilograms)
- c = Speed of light (299792458 m/s)
- b = Impact parameter (the closest approach distance of the light ray to the mass, in meters)
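As a quick sanity check of the formula (a small example added here, not part of the original task description), one solar mass and an impact parameter of roughly one solar radius give the classic ~1.75 arcsecond deflection of starlight grazing the Sun:

```python
G = 6.67430e-11        # m^3 kg^-1 s^-2
c = 299792458.0        # m / s
M_sun = 1.98847e30     # kg
b = 6.96e8             # m, roughly the solar radius

alpha = 4 * G * M_sun / (c**2 * b)
print(f"alpha = {alpha:.3e} rad = {alpha * 206264.8:.2f} arcsec")   # ~1.75 arcsec
```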
Computational Task Description
Compute the deflection angle over a grid of:
- Mass values: From 1 to 1000 solar masses (10³⁰ to 10³³ kg)
- Impact parameters: From 10⁹ to 10¹² meters
Generate a 2D array where each entry corresponds to the deflection angle for a specific pair of mass and impact parameter. Now we will look at how we will implement this for the different resources available on the HPC.
Sequential Job Optimization
Sequential jobs run on a single CPU core and are suitable for tasks that cannot be parallelized.
Sequential Job Script Explained
#!/bin/bash
#SBATCH -J jobname # Job name for identification
#SBATCH -o outfile.%J # Standard output file (%J = job ID)
#SBATCH -e errorfile.%J # Standard error file (%J = job ID)
#SBATCH --partition=serial # Use serial queue for single-core jobs
./[programme executable name] # Execute your program
Script breakdown:
- `#!/bin/bash`: Specifies bash shell for script execution
- `#SBATCH -J jobname`: Sets a descriptive job name for easy identification in the queue
- `#SBATCH -o outfile.%J`: Redirects standard output to a file with the job ID
- `#SBATCH -e errorfile.%J`: Redirects error messages to a separate file
- `#SBATCH --partition=serial`: Specifies the queue/partition for sequential jobs
Example: Gravitational Deflection Angle Sequential CPU
import numpy as np
import time
import matplotlib.pyplot as plt
import os
import matplotlib.colors as colors
# Constants
G = 6.67430e-11
c = 299792458
M_sun = 1.98847e30
# Parameter grid
mass_grid = np.linspace(1, 1000, 10000) # Solar masses
impact_grid = np.linspace(1e9, 1e12, 10000) # meters
result = np.zeros((len(mass_grid), len(impact_grid)))
# Timing
start = time.time()
# Sequential computation
for i, M in enumerate(mass_grid):
    for j, b in enumerate(impact_grid):
        result[i, j] = (4 * G * M * M_sun) / (c**2 * b)
end = time.time()
print(f"CPU Sequential time: {end - start:.3f} seconds")
np.save("result_cpu.npy", result)
np.save("mass_grid_cpu.npy", mass_grid)
np.save("impact_grid_cpu.npy", impact_grid)
# Load data
result = np.load("result_cpu.npy")
mass_grid = np.load("mass_grid_cpu.npy")
impact_grid = np.load("impact_grid_cpu.npy")
# Create meshgrid
M, B = np.meshgrid(mass_grid, impact_grid / 1e9, indexing='ij')  # mass_grid is already in solar masses
# Create output directory
os.makedirs("plots", exist_ok=True)
plt.figure(figsize=(8,6))
pcm = plt.pcolormesh(B, M, result,
norm=colors.LogNorm(vmin=result[result > 0].min(), vmax=result.max()),
shading='auto', cmap='plasma')
plt.colorbar(pcm, label='Deflection Angle (radians, log scale)')
plt.xlabel('Impact Parameter (Gm)')
plt.ylabel('Mass (Solar Masses)')
plt.title('Gravitational Deflection Angle - CPU')
plt.tight_layout()
plt.savefig("plots/deflection_angle_cpu.png", dpi=300)
plt.close()
print("CPU plot saved in 'plots/deflection_angle_cpu.png'")
Sequential Job Script for the Example
#!/bin/bash
#SBATCH --job-name=HPC_WS_SCPU # Provide a name for the job
#SBATCH --output=HPC_WS_SCPU_%j.out # Request the output file along with the job number
#SBATCH --error=HPC_WS_SCPU_%j.err # Request the error file along with the job number
#SBATCH --partition=serial
#SBATCH --nodes=1 # Request one CPU node
#SBATCH --ntasks=1 # Request 1 core from the CPU node
#SBATCH --time=01:00:00 # Set a time limit of 1 hour for the job
#SBATCH --mem=16G # Request 16 GB of memory
# Load required modules
module purge # Remove the list of pre loaded modules
module load Python/3.9.1
module list
# Create a python virtual environment
python3 -m venv name_of_your_venv
# Activate your Python environment
source name_of_your_venv/bin/activate
# Install the packages the script needs into the fresh environment
# (assumption: the node can reach a package index; otherwise load a module
# or use a pre-built environment instead)
pip install numpy matplotlib
echo "Starting Gravitational Lensing Deflection calculation of Sequential CPU..."
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $SLURM_NODELIST"
# Run the Python script (with logging)
python Gravitational_Deflection_Angle_SCPU.py
echo "Job completed at $(date)"
Exercise: Profile Your Code
Run the sequential code and use htop to monitor resource usage while it executes. Identify whether the job is CPU-bound or memory-bound; a sketch of an in-code alternative follows below.
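If you would also like timing information from inside the code (complementing htop), one possible approach is Python's built-in cProfile module. The sketch below profiles a reduced version of the nested-loop calculation; the function name and the smaller grid sizes are illustrative only:
import cProfile
import pstats
import numpy as np

G = 6.67430e-11
c = 299792458
M_sun = 1.98847e30

def deflection_grid(n_mass=1000, n_impact=1000):
    # Same nested-loop computation as the sequential example, on a smaller grid
    mass_grid = np.linspace(1, 1000, n_mass)
    impact_grid = np.linspace(1e9, 1e12, n_impact)
    result = np.zeros((n_mass, n_impact))
    for i, M in enumerate(mass_grid):
        for j, b in enumerate(impact_grid):
            result[i, j] = (4 * G * M * M_sun) / (c**2 * b)
    return result

# Profile the call and report the five most time-consuming entries
cProfile.run("deflection_grid()", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(5)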
Parallel Job Optimization
Parallel jobs can utilize multiple CPU cores across one or more nodes to accelerate computation.
Parallel Job Script Explained
#!/bin/bash
#SBATCH -J jobname # Job name
#SBATCH -o outfile.%J # Output file
#SBATCH -e errorfile.%J # Error file
#SBATCH --partition=defaultq # Parallel job queue
#SBATCH -N 2 # Number of compute nodes
#SBATCH --ntasks-per-node=24 # Number of MPI tasks (CPU cores) per node
mpirun -np 48 ./mpi_program # Run with 48 MPI processes (2 nodes × 24 cores)
Changes from the sequential script:
- #SBATCH --partition=defaultq: Uses the default (parallel) partition
- #SBATCH -N 2: Requests 2 compute nodes
- #SBATCH --ntasks-per-node=24: Specifies 24 MPI tasks (CPU cores) per node
- mpirun -np 48: Launches 48 MPI processes in total (2 × 24)
Example: Gravitational Deflection Angle Parallel CPU
from mpi4py import MPI
import numpy as np
import time
import os
import matplotlib.pyplot as plt
import matplotlib.colors as colors
# MPI setup
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
# Constants
G = 6.67430e-11
c = 299792458
M_sun = 1.98847e30
# Parameter grid (same on all ranks)
mass_grid = np.linspace(1, 1000, 10000) # Solar masses
impact_grid = np.linspace(1e9, 1e12, 10000) # meters
# Distribute mass grid among ranks
chunk_size = len(mass_grid) // size
start_idx = rank * chunk_size
end_idx = (rank + 1) * chunk_size if rank != size - 1 else len(mass_grid)
local_mass = mass_grid[start_idx:end_idx]
local_result = np.zeros((len(local_mass), len(impact_grid)))
# Timing
local_start = time.time()
# Compute local chunk
for i, M in enumerate(local_mass):
    for j, b in enumerate(impact_grid):
        local_result[i, j] = (4 * G * M * M_sun) / (c**2 * b)
local_end = time.time()
print(f"Rank {rank} local time: {local_end - local_start:.3f} seconds")
# Gather results
result = None
if rank == 0:
    result = np.zeros((len(mass_grid), len(impact_grid)))
comm.Gather(local_result, result, root=0)
if rank == 0:
    total_time = local_end - local_start
    print(f"MPI total time (wall time): {total_time:.3f} seconds")
    # Save the result and grids (np.save returns None, so don't reassign the arrays)
    np.save("result_mpi.npy", result)
    np.save("mass_grid_mpi.npy", mass_grid)
    np.save("impact_grid_mpi.npy", impact_grid)
    # Load data
    result = np.load("result_mpi.npy")
    mass_grid = np.load("mass_grid_mpi.npy")
    impact_grid = np.load("impact_grid_mpi.npy")
    # Create meshgrid (mass_grid is already in solar masses)
    M, B = np.meshgrid(mass_grid, impact_grid / 1e9, indexing='ij')
    # Create output directory
    os.makedirs("plots", exist_ok=True)
    plt.figure(figsize=(8, 6))
    pcm = plt.pcolormesh(B, M, result,
                         norm=colors.LogNorm(vmin=result[result > 0].min(), vmax=result.max()),
                         shading='auto', cmap='plasma')
    plt.colorbar(pcm, label='Deflection Angle (radians, log scale)')
    plt.xlabel('Impact Parameter (Gm)')
    plt.ylabel('Mass (Solar Masses)')
    plt.title('Gravitational Deflection Angle - MPI')
    plt.tight_layout()
    plt.savefig("plots/deflection_angle_mpi.png", dpi=300)
    plt.close()
    print("MPI plot saved in 'plots/deflection_angle_mpi.png'")
Parallel Job Script for the Example
#!/bin/bash
#SBATCH --job-name=HPC_WS_PCPU # Provide a name for the job
#SBATCH --output=HPC_WS_PCPU_%j.out # Request the output file along with the job number
#SBATCH --error=HPC_WS_PCPU_%j.err # Request the error file along with the job number
#SBATCH --partition=defaultq
#SBATCH --nodes=2 # Request two CPU nodes
#SBATCH --ntasks=4 # Request 4 MPI tasks in total (2 per node)
#SBATCH --time=01:00:00 # Set time limit for the job
#SBATCH --mem=16G # Request 16 GB of memory
# Load required modules
module purge # Remove the list of preloaded modules
module load Python/3.9.1
module load openmpi4/default
module list # List the modules
# Create a python virtual environment
python3 -m venv name_of_your_venv
# Activate your Python virtual environment
source name_of_your_venv/bin/activate
echo "Starting Gravitational Lensing Deflection calculation of Sequential CPU..."
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $SLURM_NODELIST"
# Run the Python script with MPI (with logging)
mpirun -np 4 python Gravitational_Lensing_PCPU.py
echo "Job completed at $(date)"
Exercise: Optimize Parallel Performance
Run the parallel version with different task or thread counts. Submit jobs with varying --ntasks (or --cpus-per-task for a threaded build) values, then plot performance versus core count; a plotting sketch follows below.
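One way to present the measurements is a simple scaling plot of speedup against core count. The sketch below is a template only: the core counts and timings are hypothetical placeholders to be replaced with the wall times from your own job outputs:
import matplotlib.pyplot as plt

# Hypothetical placeholders - replace with your submitted core counts and measured wall times (s)
cores = [1, 2, 4, 8, 16]
times = [70.4, 36.1, 18.9, 10.2, 6.0]

speedup = [times[0] / t for t in times]

plt.figure(figsize=(6, 4))
plt.plot(cores, speedup, 'o-', label='measured')
plt.plot(cores, cores, '--', color='grey', label='ideal (linear)')
plt.xlabel('Number of cores')
plt.ylabel('Speedup relative to 1 core')
plt.title('Parallel scaling')
plt.legend()
plt.tight_layout()
plt.savefig('parallel_scaling.png', dpi=300)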
GPU Job Optimization
GPU jobs leverage graphics processing units for massively parallel computations.
GPU Job Script Explained
#!/bin/bash
#SBATCH --nodes=1 # Single node (GPUs are node-local)
#SBATCH --ntasks-per-node=1 # One task per node
#SBATCH --cpus-per-task=4 # CPU cores to support GPU
#SBATCH -o output-%J.out # Output file with job ID
#SBATCH -e error-%J.err # Error file with job ID
#SBATCH --partition=gpu # GPU-enabled partition
#SBATCH --mem 32G # Memory allocation
#SBATCH --gpus-per-node=1 # Number of GPUs requested
./[programme executable name] # GPU program execution
GPU-specific parameters:
- --partition=gpu: Specifies GPU-enabled compute nodes
- --gpus-per-node=1: Requests one GPU per node
- --mem 32G: Allocates sufficient memory for GPU operations
- --cpus-per-task=4: Provides CPU cores to feed data to the GPU
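Before committing to a long GPU run it can be useful to confirm, from inside the job allocation, that the GPU is actually visible to Python. A minimal check with Numba (the library used in the CUDA example below) might look like this; treat it as a sketch rather than part of the example itself:
from numba import cuda

if cuda.is_available():
    cuda.detect()  # print the devices the CUDA runtime can see
    free_mem, total_mem = cuda.current_context().get_memory_info()
    print(f"GPU memory: {free_mem / 1e9:.1f} GB free of {total_mem / 1e9:.1f} GB")
else:
    print("No CUDA-capable GPU detected in this allocation")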
Example: CUDA Implementation
import numpy as np
from numba import cuda
import time
import matplotlib.pyplot as plt
import os
import matplotlib.colors as colors
# Constants
G = 6.67430e-11
c = 299792458
# Parameter grid
mass_grid = np.linspace(1e30, 1e33, 10000)
impact_grid = np.linspace(1e9, 1e12, 10000)
mass_grid_device = cuda.to_device(mass_grid)
impact_grid_device = cuda.to_device(impact_grid)
result_device = cuda.device_array((len(mass_grid), len(impact_grid)))
# CUDA kernel
@cuda.jit
def compute_deflection(mass_array, impact_array, result):
    i, j = cuda.grid(2)
    if i < mass_array.size and j < impact_array.size:
        M = mass_array[i]
        b = impact_array[j]
        result[i, j] = (4 * G * M) / (c**2 * b)
# Setup thread/block dimensions
threadsperblock = (16, 16)
blockspergrid_x = (mass_grid.size + threadsperblock[0] - 1) // threadsperblock[0]
blockspergrid_y = (impact_grid.size + threadsperblock[1] - 1) // threadsperblock[1]
blockspergrid = (blockspergrid_x, blockspergrid_y)
# Run the kernel
start = time.time()
compute_deflection[blockspergrid, threadsperblock](mass_grid_device, impact_grid_device, result_device)
cuda.synchronize()
end = time.time()
result = result_device.copy_to_host()
print(f"CUDA time: {end - start:.3f} seconds")
# Save the result and grids
np.save("result_cuda.npy", result)
np.save("mass_grid_cuda.npy", mass_grid)
np.save("impact_grid_cuda.npy", impact_grid)
print("Result and grids saved as .npy files.")
# Load data
result = np.load("result_cuda.npy")
mass_grid = np.load("mass_grid_cuda.npy")
impact_grid = np.load("impact_grid_cuda.npy")
# Create meshgrid
M, B = np.meshgrid(mass_grid / 1.989e30, impact_grid / 1e9, indexing='ij')
# Create output directory
os.makedirs("plots", exist_ok=True)
plt.figure(figsize=(8,6))
pcm = plt.pcolormesh(B, M, result,
                     norm=colors.LogNorm(vmin=result[result > 0].min(), vmax=result.max()),
                     shading='auto', cmap='plasma')
plt.colorbar(pcm, label='Deflection Angle (radians, log scale)')
plt.xlabel('Impact Parameter (Gm)')
plt.ylabel('Mass (Solar Masses)')
plt.title('Gravitational Deflection Angle - CUDA')
plt.tight_layout()
plt.savefig("plots/deflection_angle_cuda.png", dpi=300)
plt.close()
print("CUDA plot saved in 'plots/deflection_angle_cuda.png'")
GPU Job Script for the Example
#!/bin/bash
#SBATCH --job-name=HPC_WS_GPU # Provide a name for the job
#SBATCH --output=HPC_WS_GPU_%j.out
#SBATCH --error=HPC_WS_GPU_%j.err
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4 # Number of CPUs for data preparation
#SBATCH --mem=32G # Memory allocation
#SBATCH --gpus-per-node=1
#SBATCH --time=06:00:00
# --------- Load Environment ---------
module load Python/3.9.1
module load cuda/11.2
module list
# Activate your Python virtual environment
source name_of_your_venv/bin/activate
# --------- Run the Python Script ---------
python Gravitational_Lensing_GPU.py
Exercise: GPU vs CPU Comparison
Run the deflection-angle script on both CPU and GPU. Compare execution times and memory usage, and calculate the speedup factor; a small sketch for the arithmetic follows below.
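The speedup factor is simply the ratio of the two wall times. A tiny helper like the following (with your own measured times substituted for the placeholder values) does the arithmetic:
# Placeholder values - replace with the wall times printed by your CPU and GPU runs (seconds)
t_cpu = 70.430
t_gpu = 0.341

speedup = t_cpu / t_gpu
print(f"GPU speedup over sequential CPU: {speedup:.1f}x")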
Resource Monitoring and Performance Analysis
Monitoring Job Performance
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --job-name=ResourceMonitor
#SBATCH --output=ResourceMonitor_%j.out
#SBATCH --time=00:10:00 # 10 minutes max (5 for monitoring + buffer)
# --------- Configuration ---------
LOG_FILE="resource_monitor.log"
INTERVAL=30 # Interval between logs in seconds
DURATION=300 # Total duration in seconds (5 minutes)
ITERATIONS=$((DURATION / INTERVAL))
# --------- Start Monitoring ---------
echo "Starting Resource Monitoring for $DURATION seconds (~$((DURATION/60)) minutes)..."
echo "Logging to: $LOG_FILE"
echo "------ Monitoring Started at $(date) ------" >> "$LOG_FILE"
# --------- System Info Check ---------
echo "==== System Info Check ====" | tee -a "$LOG_FILE"
echo "Hostname: $(hostname)" | tee -a "$LOG_FILE"
# Check NVIDIA driver and GPU presence
if command -v nvidia-smi &> /dev/null; then
echo "✅ nvidia-smi is available." | tee -a "$LOG_FILE"
if nvidia-smi &>> "$LOG_FILE"; then
echo "✅ GPU detected and driver is working." | tee -a "$LOG_FILE"
else
echo "⚠️ NVIDIA-SMI failed. Check GPU node or driver issues." | tee -a "$LOG_FILE"
fi
else
echo "❌ nvidia-smi is not installed." | tee -a "$LOG_FILE"
fi
echo "Checking for NVIDIA GPU presence on PCI bus..." | tee -a "$LOG_FILE"
if lspci | grep -i nvidia &>> "$LOG_FILE"; then
echo "✅ NVIDIA GPU found on PCI bus." | tee -a "$LOG_FILE"
else
echo "❌ No NVIDIA GPU detected on this node." | tee -a "$LOG_FILE"
fi
echo "" | tee -a "$LOG_FILE"
# --------- Trap CTRL+C for Clean Exit ---------
trap "echo 'Stopping monitoring...'; echo '------ Monitoring Ended at $(date) ------' >> \"$LOG_FILE\"; exit" SIGINT SIGTERM
# --------- Monitoring Loop ---------
for ((i=1; i<=ITERATIONS; i++)); do
echo "========================== $(date) ==========================" >> "$LOG_FILE"
# GPU usage monitoring
echo "--- GPU Usage (nvidia-smi) ---" >> "$LOG_FILE"
nvidia-smi 2>&1 | grep -v "libnvidia-ml.so" >> "$LOG_FILE"
echo "" >> "$LOG_FILE"
# CPU and Memory monitoring
echo "--- CPU and Memory Usage (top) ---" >> "$LOG_FILE"
top -b -n 1 | head -20 >> "$LOG_FILE"
echo "" >> "$LOG_FILE"
sleep $INTERVAL
done
echo "------ Monitoring Ended at $(date) ------" >> "$LOG_FILE"
echo "✅ Resource monitoring completed."
Understanding Outputs - top
CPU and Memory Monitoring
Example Output:
--- CPU and Memory Usage (top) ---
top - 17:53:49 up 175 days, 9:41, 0 users, load average: 1.01, 1.06, 1.08
Tasks: 765 total, 1 running, 764 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.2 us, 0.1 sy, 0.0 ni, 97.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 515188.2 total, 482815.2 free, 17501.5 used, 14871.5 buff/cache
MiB Swap: 4096.0 total, 4072.2 free, 23.8 used. 493261.3 avail Mem
Explanation:
Header Line - System Uptime and Load Average
top - 17:53:49 up 175 days, 9:41, 0 users, load average: 1.01, 1.06, 1.08
- 17:53:49 - Current time.
- up 175 days, 9:41 - How long the system has been running.
- 0 users - Number of users logged in.
- load average - System load over 1, 5, and 15 minutes. A load of 1.00 means one CPU core is fully utilized.
Task Summary
Tasks: 765 total, 1 running, 764 sleeping, 0 stopped, 0 zombie
- 765 total - Total processes.
- 1 running - Actively running.
- 764 sleeping - Waiting for input or tasks.
- 0 stopped - Stopped processes.
- 0 zombie - Zombie processes (defunct).
CPU Usage
%Cpu(s): 2.2 us, 0.1 sy, 0.0 ni, 97.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
Field | Meaning |
---|---|
us | User CPU time - 2.2% |
sy | System (kernel) time - 0.1% |
ni | Nice (priority) - 0.0% |
id | Idle - 97.7% |
wa | Waiting for I/O - 0.0% |
hi | Hardware interrupts - 0.0% |
si | Software interrupts - 0.0% |
st | Steal time (virtualization) - 0.0% |
Memory Usage
MiB Mem : 515188.2 total, 482815.2 free, 17501.5 used, 14871.5 buff/cache
Field | Meaning |
---|---|
total | Total RAM (515188.2 MiB) |
free | Free RAM (482815.2 MiB) |
used | Used by programs (17501.5 MiB) |
buff/cache | Disk cache and buffers (14871.5 MiB) |
Swap Usage
MiB Swap: 4096.0 total, 4072.2 free, 23.8 used. 493261.3 avail Mem
Field | Meaning |
---|---|
total | Swap space available (4096 MiB) |
free | Free swap (4072.2 MiB) |
used | Swap used (23.8 MiB) |
avail Mem | Available memory for new tasks (493261.3 MiB) |
- These explanations cover each of the parameters reported in the top output.
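The same CPU and memory figures can also be collected from inside a running Python job, which is handy if you want to log usage alongside your science output. A small sketch using the psutil library (assuming it is installed in your virtual environment) could be:
import psutil

# Overall CPU utilisation, averaged over a one-second sample
cpu_percent = psutil.cpu_percent(interval=1)

# System memory in MiB, matching the fields reported by top
mem = psutil.virtual_memory()
total_mib = mem.total / 2**20
used_mib = mem.used / 2**20
avail_mib = mem.available / 2**20

print(f"CPU: {cpu_percent:.1f}% | Mem: {used_mib:.0f}/{total_mib:.0f} MiB used, "
      f"{avail_mib:.0f} MiB available")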
Understanding Outputs - nvidia-smi
GPU Monitoring
Example nvidia-smi Output:
------ Wed Jul 2 17:12:23 IST 2025 ------
Wed Jul 2 17:12:23 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05 Driver Version: 560.35.05 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------|
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 NVL On | 00000000:AB:00.0 Off | 0 |
| N/A 37C P0 86W / 400W | 1294MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2234986 C python 1284MiB |
+-----------------------------------------------------------------------------------------+
…
Explanation of nvidia-smi
Output:
GPU Summary Header
- NVIDIA-SMI Version: 560.35.05 — Monitoring tool version.
- Driver Version: 560.35.05 — NVIDIA driver version installed.
- CUDA Version: 12.6 — CUDA toolkit compatibility version.
GPU Info Section
Field | Meaning |
---|---|
GPU | GPU index number (0) |
Name | GPU model: NVIDIA H100 NVL |
Persistence-M | Persistence Mode: On (reduces init overhead) |
Bus-Id | PCI bus ID location |
Disp.A | Display Active: Off (no display connected) |
Volatile Uncorr. ECC | GPU memory error count (0 = no errors) |
Fan | Fan speed (N/A — passive cooling) |
Temp | Temperature (37C — healthy) |
Perf | Performance state (P0 = maximum performance) |
Pwr:Usage/Cap | Power usage (86W of 400W max) |
Memory-Usage | 1294MiB used / 95830MiB total |
GPU-Util | GPU utilization (0% — idle) |
Compute M. | Compute mode (Default) |
MIG M. | Multi-Instance GPU mode (Disabled) |
Processes Section
Field | Meaning |
---|---|
GPU | GPU ID (0) |
PID | Process ID (2234986) |
Type | Type of process: C (compute) |
Process Name | Process name (python) |
GPU Memory | 1284MiB used by this process |
- These explanations cover each of the parameters reported in the nvidia-smi output.
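If you prefer this information in a compact, machine-readable form (for example to log GPU utilisation periodically from a Python monitoring script), nvidia-smi also supports a query mode. The wrapper below is a sketch that assumes nvidia-smi is on the PATH of the GPU node:
import subprocess

# Ask nvidia-smi for selected fields as CSV, without header or units
query = "utilization.gpu,memory.used,memory.total,temperature.gpu"
cmd = ["nvidia-smi", f"--query-gpu={query}", "--format=csv,noheader,nounits"]

output = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
for line in output.strip().splitlines():  # one line per GPU
    util, mem_used, mem_total, temp = [field.strip() for field in line.split(",")]
    print(f"GPU util {util}% | memory {mem_used}/{mem_total} MiB | {temp} C")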
Performance Comparison Script
import matplotlib.pyplot as plt
# Extracted timings from the printed output
methods = ['Sequential (CPU)', 'MPI (PCPU)', 'CUDA (GPU)']
times = [70.430, 13.507, 0.341] # Replace the times with the times printed by running the above scripts
plt.figure(figsize=(10, 6))
bars = plt.bar(methods, times, color=['blue', 'green', 'red'])
plt.ylabel('Execution Time (seconds)')
plt.title('Performance Comparison: CPU vs MPI vs GPU')
# Add labels above bars
for bar, time in zip(bars, times):
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 1,
             f'{time:.3f}s', ha='center', va='bottom')
plt.tight_layout()
plt.savefig('performance_comparison.png', dpi=300, bbox_inches='tight')
plt.show()
Exercise: Resource Efficiency Analysis
Run the above Python script to create a comparative analysis of the different methods you used in this tutorial and to understand the efficiency of the different resources.
Example Solution
This plot shows the execution time comparison between CPU, MPI, and GPU implementations.
Best Practices and Common Pitfalls
Resource Allocation Best Practices
- Match resources to workload requirements
- Don’t request more resources than you can use
- Consider memory requirements carefully
- Use appropriate partitions/queues
- Test with small jobs first
- Validate your scripts with shorter runs
- Check resource utilization before scaling up
- Monitor and optimize
- Use profiling tools to identify bottlenecks
- Adjust resource requests based on actual usage
Common Mistakes to Avoid
- Over-requesting resources
# Bad: Requesting 32 cores for sequential code
#SBATCH --cpus-per-task=32
./sequential_program

# Good: Match core count to parallelization
#SBATCH --cpus-per-task=1
./sequential_program
- Memory allocation errors
# Bad: Not specifying memory for memory-intensive jobs
#SBATCH --partition=defaultq

# Good: Specify adequate memory
#SBATCH --partition=defaultq
#SBATCH --mem=16G
- GPU job inefficiencies
# Bad: Too many CPU cores for GPU job
#SBATCH --cpus-per-task=32
#SBATCH --gpus-per-node=1

# Good: Balanced CPU-GPU ratio
#SBATCH --cpus-per-task=4
#SBATCH --gpus-per-node=1
Summary
Resource optimization in HPC involves understanding your workload characteristics and matching them with appropriate resource allocations. Key takeaways:
- Profile your code to understand resource requirements
- Use sequential jobs for single-threaded applications
- Leverage parallel computing for scalable workloads
- Utilize GPUs for massively parallel computations
- Monitor performance and adjust allocations accordingly
- Avoid common pitfalls like over-requesting resources
Efficient resource utilization not only improves your job performance but also ensures fair access to shared HPC resources for all users.
Revisit Earlier Exercises
Now that you’ve learned how to submit jobs using Slurm and request computational resources effectively, revisit the following exercises from the earlier lesson:
Try running them now on your cluster using the appropriate Slurm script and resource flags.
Solution 1: Slurm Submission Script for Exercise MPI with mpi4py
The following script can be used to submit your MPI-based Python program (mpi_hpc_ws.py) on an HPC cluster using Slurm:
#!/bin/bash
#SBATCH --job-name=mpi_hpc_ws
#SBATCH --output=mpi_%j.out
#SBATCH --error=mpi_%j.err
#SBATCH --partition=defaultq
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --time=00:10:00
#SBATCH --mem=16G

# Load required modules
module purge
module load Python/3.9.1
module load openmpi4/default
module list

# Create a python virtual environment
python3 -m venv name_of_your_venv

# Activate your Python environment
source name_of_your_venv/bin/activate

# Run the MPI job
mpirun -np 4 python mpi_hpc_ws.py
Make sure your virtual environment has mpi4py installed and that your system has access to the OpenMPI runtime via mpirun. Adjust the number of nodes and tasks depending on the cluster policies.
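To confirm that mpi4py and the MPI runtime cooperate before tackling the full exercise, you could launch a minimal rank report with mpirun -np 4 python check_mpi.py. The file name check_mpi.py and its contents are only a sketch for this quick test, not the exercise script itself:
# check_mpi.py - hypothetical file used only for this quick test
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

print(f"Hello from rank {rank} of {size} on {MPI.Get_processor_name()}")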
Solution 2: Slurm Submission Script for Exercise GPU with numba-cuda
The following script can be used to submit a GPU-accelerated Python job (numba_cuda_test.py) using Slurm:
#!/bin/bash
#SBATCH --job-name=Numba_Cuda
#SBATCH --output=Numba_Cuda_%j.out
#SBATCH --error=Numba_Cuda_%j.err
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gpus-per-node=1
#SBATCH --time=00:10:00

# --------- Load Environment ---------
module load Python/3.9.1
module load cuda/11.2
module list

# Activate virtual environment
source name_of_venv/bin/activate # Here name_of_venv refers to the name of your virtual environment

# --------- Check whether the GPU is available ---------
python -c "from numba import cuda; print('CUDA Available:', cuda.is_available())"

# --------- Run the Python Script ---------
python numba_cuda_test.py
Make sure your virtual environment includes the numba-cuda Python library to access the GPU.
Key Points
Different computational models (sequential, parallel, GPU) significantly impact runtime and efficiency.
Sequential CPU execution is simple but inefficient for large parameter spaces.
Parallel CPU (e.g., MPI or OpenMP) reduces runtime by distributing tasks but is limited by CPU core counts and communication overhead.
GPU computing can drastically accelerate tasks with massively parallel workloads like grid-based simulations.
Choosing the right computational model depends on the problem structure, resource availability, and cost-efficiency.
Effective Slurm job scripts should match the workload to the hardware: CPUs for serial/parallel, GPUs for highly parallelizable tasks.
Monitoring tools (like nvidia-smi, seff, top) help validate whether the resource request matches the actual usage.
Optimizing resource usage minimizes wait times in shared environments and improves overall throughput.
Wrap-up
Overview
Teaching: 15 min
Exercises: 0 min
Questions
Looking back at what was covered and how different pieces fit together
Where are some advanced topics and further reading available?
Objectives
Put the course in context with future learning.
Summary
Further Resources
Below are some additional resources to help you continue learning:
- A comprehensive HPC manual
- Carpentries HPC workshop
- Using Python in an HPC environment course
- Foundations of Astronomical Data Science Carpentries Workshop
- A previous InterPython workshop materials, covering collaborative usage of GitHub, Programming Paradigms, Software Architecture and many more
- CodeRefinery courses on FAIR (Findable, Accessible, Interoperable, and Reusable) software practices
- Python documentation
- GitHub Actions documentation
Key Points
Keypoint 1