Using GPUs with Fuzzball
Fuzzball intelligently passes GPUs from the host through to the containers that run your jobs. Once an administrator has added and configured resources that include GPUs, you can access them simply by requesting them in your Workflow.
Consider the following Fuzzfile:
version: v1
jobs:
  gpu-check:
    image:
      uri: docker://rockylinux:9
    command:
      - /bin/sh
      - '-c'
      - nvidia-smi
    resource:
      cpu:
        cores: 1
      memory:
        size: 1GB
      devices:
        nvidia.com/gpu: 1
Note that the gpu-check job in this workflow is based on the official Rocky Linux (v9) container on Docker Hub. This container does not have a GPU driver or any NVIDIA-related software installed. However, the Fuzzfile specifies that this job should run the program nvidia-smi to check the status of any GPUs that are present.
The resource section has a devices field that allows you to specify that the job should use GPUs. Since our administrator has configured GPUs under the name nvidia.com/gpu, we can request one. Fuzzball makes all of the driver-related software available inside the container at runtime. Note that the nvidia-smi command completes without trouble even though the Rocky Linux container has no NVIDIA-related software installed:
Thu Aug 15 20:27:38 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06 Driver Version: 555.42.06 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla V100-SXM2-16GB Off | 00000000:00:1E.0 Off | 0 |
| N/A 42C P0 40W / 300W | 1MiB / 16384MiB | 2% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
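If a node offers more than one GPU, you can request additional devices by raising the count in the devices field. The sketch below is a hypothetical variation of the Fuzzfile above: the job name multi-gpu-check is illustrative, it assumes the same nvidia.com/gpu device name and a node with at least two GPUs, and it uses nvidia-smi -L to list the devices passed through to the container.

version: v1
jobs:
  multi-gpu-check:
    image:
      uri: docker://rockylinux:9
    command:
      - /bin/sh
      - '-c'
      - nvidia-smi -L    # list each GPU visible inside the container
    resource:
      cpu:
        cores: 1
      memory:
        size: 1GB
      devices:
        nvidia.com/gpu: 2    # request two GPUs instead of one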
Not only is it unnecessary to install NVIDIA drivers into containers that will run GPU workflows, it is considered bad practice to do so. Under the hood, Fuzzball finds the libraries and binaries associated with the NVIDIA driver and injects them into the container. This ensures that the injected libraries match the kernel modules loaded on the host. If you install GPU drivers into the container, the software inside may find and use them instead, breaking the correspondence between the libraries and the installed kernel modules.
Note that the NVIDIA driver and CUDA are not the same thing, which can be confusing because CUDA is often packaged with the driver. You should install the version of CUDA that your GPU-accelerated software requires in your container, but you should not install the driver.
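For example, if your application needs the CUDA toolkit, you might start from one of NVIDIA's CUDA images on Docker Hub rather than installing anything driver-related yourself. The sketch below is a hypothetical Fuzzfile in the same format as above: the image tag is only an example (choose one that matches the CUDA version your software needs), and nvcc --version simply confirms that the toolkit shipped with the image, while Fuzzball still injects the host's driver libraries at runtime.

version: v1
jobs:
  cuda-check:
    image:
      uri: docker://nvidia/cuda:12.5.0-devel-ubuntu22.04    # example tag; match your software's CUDA version
    command:
      - /bin/sh
      - '-c'
      - nvcc --version && nvidia-smi    # toolkit comes from the image, driver is injected by Fuzzball
    resource:
      cpu:
        cores: 1
      memory:
        size: 1GB
      devices:
        nvidia.com/gpu: 1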