Distributed Workflows with MPI
Fuzzball supports distributed jobs that use MPI.
MPI, or “Message Passing Interface,” is a standard in high performance computing that allows for communication between CPU cores within or between compute nodes. A program that uses MPI to run processes across multiple cores can often improve its performance as more cores are added. Many scientific applications are designed to make use of MPI to improve their performance on HPC systems. Several different implementations of the MPI standard are available. Fuzzball provides an MPI wrapper that sets up MPI jobs on a given number of nodes with a given MPI implementation. Two implementations, covering the vast majority of use cases, are supported: MPICH and OpenMPI.
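As a concrete illustration of message passing, the short C sketch below (not the source used in the examples that follow) has rank 0 send one integer to rank 1 over MPI_COMM_WORLD. It can be compiled with an MPI compiler wrapper such as mpicc.
/* Minimal MPI message-passing sketch: rank 0 sends an integer to rank 1.
 * Build (for example): mpicc mpi_send_example.c -o mpi_send_example
 * Run (for example):   mpirun -n 2 ./mpi_send_example */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes total? */

    if (size >= 2) {
        if (rank == 0) {
            /* Send one int to rank 1 with message tag 0. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receive one int from rank 0 with message tag 0. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Rank 1 received %d from rank 0\n", value);
        }
    }

    MPI_Finalize();
    return 0;
}
The same code works whether the two ranks land on one node or on different nodes; the MPI implementation handles the transport.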
Here is a basic Fuzzfile showing how to use single-node MPI (OpenMPI) on Fuzzball. This example uses the openmpi-hello-world container. To see how this container was created, as well as the mpi_hello_world.c source file, see this GitHub repo.
version: v1
jobs:
  mpi-hello-world:
    env:
      - PATH=/usr/lib64/openmpi/bin:/usr/local/bin:/usr/bin:/bin
    image:
      uri: oras://docker.io/anderbubble/openmpi-hello-world.sif
    command:
      - /bin/sh
      - '-c'
      - mpi_hello_world
    resource:
      cpu:
        cores: 4
        affinity: NUMA
      memory:
        size: 1GB
    multinode:
      nodes: 1
      implementation: openmpi
The multinode section lets you specify that multiple copies of your command should be distributed across cores. The nodes option sets the number of compute nodes. This example is slightly counterintuitive: MPI runs 4 processes on a single node because we’ve requested 1 node with 4 cores. The implementation option sets the type of MPI to use. Valid implementations are openmpi for OpenMPI and mpich for MPICH. You can also select gasnet for programs written in an alternate parallel programming language that runs on GASNet, such as Chapel.
The MPI program itself is not executed with mpirun. When the multinode field is used for MPI-based execution, the contents of the command field are run with mpirun prefixed to them. Running the command in these examples with mpirun would fail, because the effective command line would end up being mpirun mpirun mpi_hello_world, and mpirun can’t be called recursively.
After running this job you will see logs like these showing the hostname, rank, CPU number, NUMA node, and mount namespace of each MPI process:
Hello world! Processor mpi-hello-world, Rank 2 of 4, CPU 2, NUMA node 0, Namespace mnt:[4026532665]
Hello world! Processor mpi-hello-world, Rank 0 of 4, CPU 3, NUMA node 0, Namespace mnt:[4026532665]
Hello world! Processor mpi-hello-world, Rank 1 of 4, CPU 1, NUMA node 0, Namespace mnt:[4026532665]
Hello world! Processor mpi-hello-world, Rank 3 of 4, CPU 0, NUMA node 0, Namespace mnt:[4026532665]
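For reference, output in this format could be produced with a program along the lines of the following sketch. This is an illustration only, not the actual mpi_hello_world.c from the linked repository, and it assumes Linux with glibc 2.29 or newer for getcpu().
/* Sketch of a hello-world program that reports rank, CPU, NUMA node, and
 * mount namespace, similar to the log lines shown above. */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>      /* getcpu() */
#include <stdio.h>
#include <unistd.h>     /* readlink() */

int main(int argc, char **argv)
{
    int rank, size, name_len;
    unsigned int cpu = 0, node = 0;
    char name[MPI_MAX_PROCESSOR_NAME];
    char ns[64] = "unknown";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of ranks */
    MPI_Get_processor_name(name, &name_len); /* usually the hostname */

    getcpu(&cpu, &node);                     /* current CPU and NUMA node */

    /* Mount namespace of this process, e.g. "mnt:[4026532665]". */
    ssize_t n = readlink("/proc/self/ns/mnt", ns, sizeof(ns) - 1);
    if (n > 0)
        ns[n] = '\0';

    printf("Hello world! Processor %s, Rank %d of %d, CPU %u, NUMA node %u, Namespace %s\n",
           name, rank, size, cpu, node, ns);

    MPI_Finalize();
    return 0;
}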
A similar Fuzzfile running the same workflow with the MPICH flavor of MPI is presented below:
version: v1
jobs:
  mpi-hello-world:
    env:
      - PATH=/usr/lib64/mpich/bin:/usr/local/bin:/usr/bin:/bin
    image:
      uri: oras://docker.io/anderbubble/mpich-hello-world.sif
    command:
      - /bin/sh
      - '-c'
      - mpi_hello_world
    resource:
      cpu:
        cores: 4
        affinity: NUMA
      memory:
        size: 1GB
    multinode:
      nodes: 1
      implementation: mpich
An example workflow that utilizes OpenMPI on multiple nodes appears below:
version: v1
jobs:
  mpi-hello-world:
    env:
      - PATH=/usr/lib64/openmpi/bin:/usr/local/bin:/usr/bin:/bin
    image:
      uri: oras://docker.io/anderbubble/openmpi-hello-world.sif
    command:
      - /bin/sh
      - '-c'
      - mpi_hello_world
    resource:
      cpu:
        cores: 2
        affinity: NUMA
      memory:
        size: 1GB
    multinode:
      nodes: 4
      implementation: openmpi
In this case, we use the multinode.nodes field to request 4 compute nodes. We also use the resource section to request 2 cores and 1 GB of RAM per compute node. So in total, this job will claim 8 cores and 4 GB of memory across 4 compute nodes.
The logs from this job look like this:
Hello world! Processor mpi-hello-world-0, Rank 0 of 8, CPU 1, NUMA node 0, Namespace mnt:[4026532667]
Hello world! Processor mpi-hello-world-0, Rank 1 of 8, CPU 0, NUMA node 0, Namespace mnt:[4026532667]
Hello world! Processor mpi-hello-world-3, Rank 6 of 8, CPU 0, NUMA node 0, Namespace mnt:[4026532669]
Hello world! Processor mpi-hello-world-2, Rank 3 of 8, CPU 1, NUMA node 0, Namespace mnt:[4026532667]
Hello world! Processor mpi-hello-world-2, Rank 2 of 8, CPU 0, NUMA node 0, Namespace mnt:[4026532667]
Hello world! Processor mpi-hello-world-3, Rank 7 of 8, CPU 0, NUMA node 0, Namespace mnt:[4026532669]
Hello world! Processor mpi-hello-world-1, Rank 4 of 8, CPU 1, NUMA node 0, Namespace mnt:[4026532667]
Hello world! Processor mpi-hello-world-1, Rank 5 of 8, CPU 0, NUMA node 0, Namespace mnt:[4026532667]
If you are running across a high-speed fabric (like InfiniBand or Omni-Path), you will need the appropriate libraries installed inside the container. Tools like libfabric and libibverbs, along with the appropriate driver-specific libraries, can generally be installed into Rocky Linux-based containers with the following code block.
# dnf -y groupinstall "InfiniBand Support" \
    && dnf -y install opa-basic-tools libpsm2 libfabric libfabric-devel
The libfabric-devel package is probably only necessary if you are building an application.