Distributed Workflows with MPI
Fuzzball supports distributed jobs that use MPI.
MPI, or “Message Passing Interface,” is a standard in high performance computing that allows for communication between CPU cores within or between compute nodes. A program that uses MPI to run processes across multiple cores can often improve its performance as more cores are added. Many scientific applications are designed to make use of MPI to improve their performance on HPC systems. Several different implementations of the MPI standard are available. Fuzzball provides an MPI wrapper that sets up MPI jobs on a given number of nodes with a given MPI implementation. Two implementations, covering the vast majority of use cases, are supported: MPICH and OpenMPI.
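As a concrete illustration of message passing, the short C sketch below (not the source used in the examples that follow) has rank 0 send one integer to rank 1 over MPI_COMM_WORLD. It can be compiled with an MPI compiler wrapper such as mpicc.
/* Minimal MPI message-passing sketch: rank 0 sends an integer to rank 1.
 * Build (for example): mpicc mpi_send_example.c -o mpi_send_example
 * Run (for example):   mpirun -n 2 ./mpi_send_example */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes total? */

    if (size >= 2) {
        if (rank == 0) {
            /* Send one int to rank 1 with message tag 0. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receive one int from rank 0 with message tag 0. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Rank 1 received %d from rank 0\n", value);
        }
    }

    MPI_Finalize();
    return 0;
}
The same code works whether the two ranks land on one node or on different nodes; the MPI implementation handles the transport.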
Here is a basic Fuzzfile showing how to use single-node MPI (OpenMPI) on Fuzzball. This example uses the openmpi-hello-world container. To see how this container was created, as well as the mpi_hello_world.c source file, see this GitHub repo.
version: v1
jobs:
  mpi-hello-world:
    env:
      - PATH=/usr/lib64/openmpi/bin:/usr/local/bin:/usr/bin:/bin
    image:
      uri: oras://docker.io/anderbubble/openmpi-hello-world.sif
    command:
      - /bin/sh
      - '-c'
      - mpi_hello_world
    resource:
      cpu:
        cores: 4
        affinity: NUMA
      memory:
        size: 1GB
    multinode:
      nodes: 1
      implementation: openmpi
The multinode section lets you specify that multiple copies of your command should be distributed across cores. The nodes option sets the number of compute nodes. This example is slightly counterintuitive: MPI runs 4 processes on a single node because we’ve requested 1 node with 4 cores. The implementation option sets the type of MPI to use. Valid implementations are openmpi for OpenMPI and mpich for MPICH. You can also select gasnet for programs written in an alternate parallel programming language that runs on GASNet, such as Chapel.
The MPI program itself is not executed with mpirun. When the multinode field is used for MPI-based execution, the contents of the command field are run with mpirun prefixed to them. Running the command in these examples with mpirun would fail, because the effective command line would end up being mpirun mpirun mpi_hello_world, and mpirun can’t be called recursively.
After running this job you will see logs like these showing the hostname, rank, CPU number, NUMA node, and mount namespace of each MPI process:
Hello world! Processor mpi-hello-world, Rank 2 of 4, CPU 2, NUMA node 0, Namespace mnt:[4026532665]
Hello world! Processor mpi-hello-world, Rank 0 of 4, CPU 3, NUMA node 0, Namespace mnt:[4026532665]
Hello world! Processor mpi-hello-world, Rank 1 of 4, CPU 1, NUMA node 0, Namespace mnt:[4026532665]
Hello world! Processor mpi-hello-world, Rank 3 of 4, CPU 0, NUMA node 0, Namespace mnt:[4026532665]
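For reference, output in this format could be produced with a program along the lines of the following sketch. This is an illustration only, not the actual mpi_hello_world.c from the linked repository, and it assumes Linux with glibc 2.29 or newer for getcpu().
/* Sketch of a hello-world program that reports rank, CPU, NUMA node, and
 * mount namespace, similar to the log lines shown above. */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>      /* getcpu() */
#include <stdio.h>
#include <unistd.h>     /* readlink() */

int main(int argc, char **argv)
{
    int rank, size, name_len;
    unsigned int cpu = 0, node = 0;
    char name[MPI_MAX_PROCESSOR_NAME];
    char ns[64] = "unknown";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of ranks */
    MPI_Get_processor_name(name, &name_len); /* usually the hostname */

    getcpu(&cpu, &node);                     /* current CPU and NUMA node */

    /* Mount namespace of this process, e.g. "mnt:[4026532665]". */
    ssize_t n = readlink("/proc/self/ns/mnt", ns, sizeof(ns) - 1);
    if (n > 0)
        ns[n] = '\0';

    printf("Hello world! Processor %s, Rank %d of %d, CPU %u, NUMA node %u, Namespace %s\n",
           name, rank, size, cpu, node, ns);

    MPI_Finalize();
    return 0;
}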
A similar Fuzzfile running the same workflow with the MPICH flavor of MPI is presented below:
version: v1
jobs:
  mpi-hello-world:
    env:
      - PATH=/usr/lib64/mpich/bin:/usr/local/bin:/usr/bin:/bin
    image:
      uri: oras://docker.io/anderbubble/mpich-hello-world.sif
    command:
      - /bin/sh
      - '-c'
      - mpi_hello_world
    resource:
      cpu:
        cores: 4
        affinity: NUMA
      memory:
        size: 1GB
    multinode:
      nodes: 1
      implementation: mpich
An example workflow that utilizes OpenMPI on multiple nodes appears below:
version: v1
jobs:
  mpi-hello-world:
    env:
      - PATH=/usr/lib64/openmpi/bin:/usr/local/bin:/usr/bin:/bin
    image:
      uri: oras://docker.io/anderbubble/openmpi-hello-world.sif
    command:
      - /bin/sh
      - '-c'
      - mpi_hello_world
    resource:
      cpu:
        cores: 2
        affinity: NUMA
      memory:
        size: 1GB
    multinode:
      nodes: 4
      implementation: openmpi
In this case, we use the multinode.nodes field to request 4 compute nodes. We also use the resource section to request 2 cores and 1 GB of RAM per compute node. So in total, this job will claim 8 cores and 4 GB of memory across 4 compute nodes.
The logs from this job look like this:
Hello world! Processor mpi-hello-world-0, Rank 0 of 8, CPU 1, NUMA node 0, Namespace mnt:[4026532667]
Hello world! Processor mpi-hello-world-0, Rank 1 of 8, CPU 0, NUMA node 0, Namespace mnt:[4026532667]
Hello world! Processor mpi-hello-world-3, Rank 6 of 8, CPU 0, NUMA node 0, Namespace mnt:[4026532669]
Hello world! Processor mpi-hello-world-2, Rank 3 of 8, CPU 1, NUMA node 0, Namespace mnt:[4026532667]
Hello world! Processor mpi-hello-world-2, Rank 2 of 8, CPU 0, NUMA node 0, Namespace mnt:[4026532667]
Hello world! Processor mpi-hello-world-3, Rank 7 of 8, CPU 0, NUMA node 0, Namespace mnt:[4026532669]
Hello world! Processor mpi-hello-world-1, Rank 4 of 8, CPU 1, NUMA node 0, Namespace mnt:[4026532667]
Hello world! Processor mpi-hello-world-1, Rank 5 of 8, CPU 0, NUMA node 0, Namespace mnt:[4026532667]
If you are running across a high-speed fabric (like InfiniBand or Omni-Path), you will need the appropriate libraries installed inside the container. Tools like libfabric and libibverbs, along with the appropriate driver-specific libraries, can generally be installed into Rocky Linux-based containers with the following code block.
# dnf -y groupinstall "InfiniBand Support" \
    && dnf -y install opa-basic-tools libpsm2 libfabric libfabric-devel
The libfabric-devel package is probably only necessary if you are building an application.