Distributed Workflows with MPI
Fuzzball supports distributed jobs that use MPI.
MPI, or “Message Passing Interface,” is a standard in high performance computing that enables communication between CPU cores within or across compute nodes. A program that uses MPI to distribute work across processes can often improve its performance by adding more cores, and many scientific applications are designed to take advantage of MPI on HPC systems. Several implementations of the MPI standard are available. Fuzzball provides an MPI wrapper that sets up MPI jobs on a given number of nodes with a given MPI implementation. Two implementations, covering the vast majority of use cases, are supported: MPICH and OpenMPI.
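For context, an MPI “hello world” of the kind run in the examples below is only a few lines of C. The following is an illustrative sketch, not necessarily the source used in the container images referenced below; it assumes an MPI compiler wrapper such as mpicc is available.
/* mpi_hello.c - illustrative MPI hello world (not the exact source used below).
 * Build:  mpicc mpi_hello.c -o mpi_hello
 * Run:    mpirun -np 4 ./mpi_hello
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                    /* start the MPI runtime         */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's rank           */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* total number of ranks         */
    MPI_Get_processor_name(name, &name_len);   /* hostname of the current node  */

    printf("Hello world! Processor %s, Rank %d of %d\n", name, rank, size);

    MPI_Finalize();
    return 0;
}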
Here is a basic Fuzzfile showing how to use single-node MPI (OpenMPI) on Fuzzball. This example uses the container openmpi-hello-world. To see how this container was created, as well as the mpi_hello_world.c source file, see this GitHub repo.
version: v1
jobs:
  mpi-hello-world:
    env:
      - PATH=/usr/lib64/openmpi/bin:/usr/local/bin:/usr/bin:/bin
    image:
      uri: oras://docker.io/anderbubble/openmpi-hello-world.sif
    command:
      - /bin/sh
      - '-c'
      - mpi_hello_world
    resource:
      cpu:
        cores: 4
        affinity: NUMA
      memory:
        size: 1GB
    multinode:
      nodes: 1
      implementation: openmpi
The multinode section lets you specify that multiple copies of your command should be distributed across cores. The nodes option sets the number of compute nodes. This example is slightly counterintuitive: MPI runs 4 processes on a single node because we’ve requested 1 node with 4 cores. The implementation option sets the type of MPI to use. Valid implementations are openmpi for OpenMPI and mpich for MPICH. You can also select gasnet for programs written in an alternate parallel programming language that runs on GASNet, such as Chapel.
The MPI program itself should not be executed with mpirun. When the multinode field is used for MPI-based execution, the contents of the command field are automatically run with mpirun prefixed to them. Adding mpirun to the command in this example would fail because the effective command line would become mpirun mpirun mpi_hello_world, and mpirun cannot be called recursively.
After running this job, you will see logs like these showing the hostname, rank, CPU number, NUMA node, and mount namespace of each MPI process (one way a program could gather the non-MPI fields is sketched after the log output).
Hello world! Processor mpi-hello-world, Rank 2 of 4, CPU 2, NUMA node 0, Namespace mnt:[4026532665]
Hello world! Processor mpi-hello-world, Rank 0 of 4, CPU 3, NUMA node 0, Namespace mnt:[4026532665]
Hello world! Processor mpi-hello-world, Rank 1 of 4, CPU 1, NUMA node 0, Namespace mnt:[4026532665]
Hello world! Processor mpi-hello-world, Rank 3 of 4, CPU 0, NUMA node 0, Namespace mnt:[4026532665]
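The CPU, NUMA node, and namespace fields in these logs are reported by the example program itself rather than by MPI. As a rough sketch (assuming Linux with glibc 2.29 or newer; this is not necessarily how the container’s program does it), similar information could be gathered like this:
/* report_placement.c - sketch of how the extra log fields could be gathered.
 * Assumes Linux with glibc >= 2.29 for getcpu(); not the container's actual source.
 */
#define _GNU_SOURCE
#include <sched.h>      /* getcpu()   */
#include <stdio.h>
#include <unistd.h>     /* readlink() */

int main(void)
{
    unsigned int cpu = 0, node = 0;
    char ns[64] = "unknown";

    getcpu(&cpu, &node);                                /* current CPU and its NUMA node */
    ssize_t n = readlink("/proc/self/ns/mnt", ns, sizeof(ns) - 1);
    if (n > 0)
        ns[n] = '\0';                                   /* e.g. "mnt:[4026532665]"       */

    printf("CPU %u, NUMA node %u, Namespace %s\n", cpu, node, ns);
    return 0;
}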
A similar Fuzzfile running the same workflow with the MPICH flavor of MPI is presented below:
version: v1
jobs:
  mpi-hello-world:
    env:
      - PATH=/usr/lib64/mpich/bin:/usr/local/bin:/usr/bin:/bin
    image:
      uri: oras://docker.io/anderbubble/mpich-hello-world.sif
    command:
      - /bin/sh
      - '-c'
      - mpi_hello_world
    resource:
      cpu:
        cores: 4
        affinity: NUMA
      memory:
        size: 1GB
    multinode:
      nodes: 1
      implementation: mpich
An example workflow that utilizes OpenMPI on multiple nodes appears below:
version: v1
jobs:
  mpi-hello-world:
    env:
      - PATH=/usr/lib64/openmpi/bin:/usr/local/bin:/usr/bin:/bin
    image:
      uri: oras://docker.io/anderbubble/openmpi-hello-world.sif
    command:
      - /bin/sh
      - '-c'
      - mpi_hello_world
    resource:
      cpu:
        cores: 2
        affinity: NUMA
      memory:
        size: 1GB
    multinode:
      nodes: 4
      implementation: openmpi
In this case, we use the multinode.nodes field to request 4 compute nodes. We also use the resource section to request 2 cores and 1 GB of RAM per compute node. In total, this job will claim 8 cores and 4 GB of memory across 4 compute nodes.
The logs from this job look like this:
Hello world! Processor mpi-hello-world-0, Rank 0 of 8, CPU 1, NUMA node 0, Namespace mnt:[4026532667]
Hello world! Processor mpi-hello-world-0, Rank 1 of 8, CPU 0, NUMA node 0, Namespace mnt:[4026532667]
Hello world! Processor mpi-hello-world-3, Rank 6 of 8, CPU 0, NUMA node 0, Namespace mnt:[4026532669]
Hello world! Processor mpi-hello-world-2, Rank 3 of 8, CPU 1, NUMA node 0, Namespace mnt:[4026532667]
Hello world! Processor mpi-hello-world-2, Rank 2 of 8, CPU 0, NUMA node 0, Namespace mnt:[4026532667]
Hello world! Processor mpi-hello-world-3, Rank 7 of 8, CPU 0, NUMA node 0, Namespace mnt:[4026532669]
Hello world! Processor mpi-hello-world-1, Rank 4 of 8, CPU 1, NUMA node 0, Namespace mnt:[4026532667]
Hello world! Processor mpi-hello-world-1, Rank 5 of 8, CPU 0, NUMA node 0, Namespace mnt:[4026532667]
If you are running across a high-speed fabric (like InfiniBand or Omni-Path), you will need the appropriate libraries installed inside the container. Tools like libfabric, libibverbs, etc., along with the appropriate driver-specific libraries, can generally be installed into Rocky Linux-based containers with the following code block.
# dnf -y groupinstall "InfiniBand Support" \
    && dnf -y install opa-basic-tools libpsm2 libfabric libfabric-devel
The libfabric-devel package is probably only necessary if you are building an application.