Fuzzball Documentation

Workflow Submission

Once you have configured Fuzzball to use either Slurm or PBS as a resource provisioner and created provisioner definitions, you can submit workflows to run on your HPC cluster. This guide covers workflow creation, submission, monitoring, and management when Slurm or PBS is used as the backend provisioner.

For more general information about writing Fuzzfiles and running workflows, see the Quickstart Guide and the section on Creating Fuzzfiles.

Workflows Running Through Slurm and PBS

A Fuzzball workflow (specified in a Fuzzfile) defines one or more jobs with their resource requirements, container images, commands, and dependencies. When you submit your Fuzzfile to Fuzzball using a batch scheduling system as a backend, it does the following:

  1. matches job resource requirements to available Slurm or PBS provisioner definitions
  2. submits jobs to Slurm or PBS using sbatch or qsub commands and directs them to start Substrate
  3. waits for Substrate registration from compute nodes
  4. schedules and executes jobs on the registered Substrate nodes
  5. collects outputs, logs, and status information

Fuzzball requests compute resources to be allocated exclusively. This prevents Orchestrate from trying to launch multiple instances of Substrate on a single node (which is not supported).
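
For illustration, exclusive allocation at the Slurm level corresponds to the --exclusive flag on sbatch. The command below is a rough sketch only; the exact sbatch or qsub invocation Fuzzball generates is an internal detail and may differ:

$ sbatch --exclusive --cpus-per-task=4 --mem=8G --wrap 'fuzzball-substrate ...' # Illustrative sketch, not the literal command Fuzzball runs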

Simple Workflow Example

Here’s a basic workflow that runs on Fuzzball through Slurm or PBS:

version: v1
jobs:
  hello-hpc:
    image:
      uri: "docker://alpine:3.16"
    command:
      - "/bin/sh"
      - "-c"
      - "echo 'Hello from the HPC cluster!' && hostname && nproc && free -h"
    resource:
      cpu:
        cores: 2
      memory:
        size: "4 GiB"

Save this as hello-hpc.fz and submit:

$ fuzzball workflow submit --file hello-hpc.fz
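
You can then follow the workflow's progress with commands that appear later in this guide, substituting your workflow's ID:

$ fuzzball workflow get <workflow-id> # Check workflow and job status

$ fuzzball workflow logs <workflow-id> --job hello-hpc # View the job's output once it has run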

Timeout Policy

Set execution timeouts to prevent runaway jobs:

jobs:
  my-job:
    image:
      uri: "docker://alpine:3.16"
    command: ["long-running-task.sh"]
    policy:
      timeout:
        execute: "2h"  # Kill job after 2 hours
    resource:
      cpu:
        cores: 4
      memory:
        size: "8 GiB"

Timeout format: <number><unit>, where unit is s (seconds), m (minutes), or h (hours). For example: 90s, 30m, 2h.

Job timeouts interact with provisioner TTL settings:

  • Provisioner TTL: Maximum runtime set in the provisioner definition (hard limit)
  • Job Timeout: Maximum execution time for a specific job (soft limit)
  • The effective timeout is the minimum of these two values

Example with provisioner TTL:

# Provisioner definition with 1-hour TTL
definitions:
  - id: "slurm-compute-small"
    provisioner: "slurm"
    ttl: 24000  # seconds (6 hours 40 minutes maximum)
    spec:
      cpu: 4
      mem: "8 GiB"

If a job requests an 8-hour timeout but uses this provisioner, it will be terminated after 6 hours and 40 minutes (the provisioner TTL). Always ensure job timeouts fall within the provisioner’s TTL limit.
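
For example, a job targeting this provisioner can set its execution timeout safely below the TTL, reusing the timeout syntax shown above:

jobs:
  long-job:
    image:
      uri: "docker://alpine:3.16"
    command: ["long-running-task.sh"]
    policy:
      timeout:
        execute: "6h"  # Safely below the provisioner's 6 hour 40 minute TTL
    resource:
      cpu:
        cores: 4
      memory:
        size: "8 GiB"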

Cancelling Workflows

Cancel a running workflow:

$ fuzzball workflow cancel <workflow-id>

This will:

  1. Cancel all running jobs within a workflow via Slurm’s scancel or PBS’s qdel
  2. Signal Substrate processes to terminate
  3. Clean up allocated resources
  4. Update workflow status to CANCELLED
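
To confirm the cancellation, query the workflow again (the same command used in the troubleshooting section below):

$ fuzzball workflow get <workflow-id> # Status should now report CANCELLED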

Troubleshooting

Workflow Not Starting

Symptom: Workflow stuck in PENDING state

Possible Causes:

  1. No matching provisioner definition
  2. Slurm/PBS queue is full
  3. Partition is down or unavailable

Solutions:

$ fuzzball admin provisioner list # Check provisioner definitions

$ squeue # Check Slurm queue

$ sinfo -p compute # Check partition status

$ fuzzball workflow get <workflow-id> # Review workflow requirements vs available definitions
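
A common reason no definition matches is a job resource request that exceeds every provisioner spec. As a hypothetical illustration against the slurm-compute-small definition above (cpu: 4, mem: 8 GiB), this job would remain PENDING because no definition can satisfy it:

jobs:
  too-big:
    image:
      uri: "docker://alpine:3.16"
    command: ["/bin/sh", "-c", "echo 'never scheduled'"]
    resource:
      cpu:
        cores: 16        # Exceeds the 4 cores offered by the definition
      memory:
        size: "64 GiB"   # Exceeds the 8 GiB offered by the definition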

Jobs Failing Immediately

Symptom: Jobs start but fail within seconds

Possible Causes:

  1. Container image not found or inaccessible
  2. Command or script errors
  3. Insufficient resources (memory, disk)
  4. Missing dependencies or libraries

Solutions:

$ fuzzball workflow logs <workflow-id> --job <job-name> # Check job logs for errors

$ docker pull <image-uri> # Verify the container image (drop the docker:// prefix when using the Docker CLI)

$ docker run <image-uri> <command> # Test the command locally

$ fuzzball workflow jobs <workflow-id> # Check resource allocation
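
As a concrete example, the hello-hpc job from earlier can be exercised locally before resubmitting. Note that the docker:// prefix used in the Fuzzfile URI is dropped when using the Docker CLI:

$ docker pull alpine:3.16 # Pull the same image referenced in the Fuzzfile

$ docker run --rm alpine:3.16 /bin/sh -c 'echo "Hello from the HPC cluster!" && hostname && nproc && free -h' # Run the same command the job would execute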

Substrate Not Connecting

Symptom: Slurm job starts but Substrate doesn’t register

Possible Causes:

  1. Substrate binary not installed on compute nodes
  2. Network connectivity issues
  3. Authentication or firewall problems
  4. Substrate configuration errors

Solutions:

$ ssh slurm-head 'ssh <compute-node> ps aux | grep fuzzball-substrate' # Check if Substrate is running on Slurm node

$ ssh slurm-head 'ssh <compute-node> journalctl -u fuzzball-substrate' # Check Substrate logs on compute node

$ ssh slurm-head 'ssh <compute-node> ping <orchestrator-host>' # Verify network connectivity from compute node to Orchestrate

$ ssh slurm-head cat slurm-<job-id>.out # Check Slurm job output
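
Because the journalctl check above assumes Substrate runs as a fuzzball-substrate systemd unit, it can also help to inspect the unit's state directly (this assumes systemd manages Substrate on the compute nodes):

$ ssh slurm-head 'ssh <compute-node> systemctl status fuzzball-substrate' # Check whether the Substrate service is active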

Performance Issues

Symptom: Jobs running slower than expected

Possible Causes:

  1. Resource contention with other Slurm jobs
  2. CPU affinity not optimal
  3. Storage I/O bottlenecks
  4. Network latency

Solutions:

$ ssh slurm-head 'squeue -o "%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R"' # Check for contention from other queued and running jobs

$ ssh slurm-head 'sacct -j <job-id> --format=JobID,MaxRSS,MaxVMSize,CPUTime,Elapsed' # Monitor job resource usage

Adjust CPU affinity in the workflow. Setting affinity: "SOCKET" or "NUMA" can improve performance, as sketched below.
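
A sketch of where the affinity setting could sit in a job's resource block; the exact field name and placement are assumptions here, so consult the Fuzzfile reference for the authoritative schema:

resource:
  cpu:
    cores: 8
    affinity: "NUMA"   # Assumed placement; "SOCKET" is the other value suggested above
  memory:
    size: "16 GiB"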

Use local scratch space for I/O-intensive operations:

volumes:
  - name: "scratch"
    mount_path: "/scratch"
    storage_class: "ephemeral"  # Faster than network storage

For more troubleshooting guidance, see the Troubleshooting Guide.