Workflow Submission
Once you have configured Fuzzball to use either Slurm or PBS as a resource provisioner and created provisioner definitions, you can submit workflows to run on your HPC cluster. This guide covers workflow creation, submission, monitoring, and management when Slurm or PBS serves as the backend provisioner.
For more general information about writing Fuzzfiles and running workflows, see the Quickstart Guide and the section on Creating Fuzzfiles.
A Fuzzball workflow (specified in a Fuzzfile) defines one or more jobs with their resource requirements, container images, commands, and dependencies. When you submit a Fuzzfile to Fuzzball with a batch scheduling system as the backend, Fuzzball does the following:
- matches job resource requirements to available Slurm or PBS provisioner definitions
- submits jobs to Slurm or PBS using sbatch or qsub commands and directs them to start Substrate
- waits for Substrate registration from compute nodes
- schedules and executes jobs on the registered Substrate nodes
- collects outputs, logs, and status information
Fuzzball requests exclusive allocation of compute resources. This prevents Orchestrate from trying to launch multiple instances of Substrate on a single node (which is not supported).
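In practice you can watch this handoff from both sides; the sketch below assumes a Slurm backend and uses only commands that appear later in this guide (the Fuzzfile path and workflow ID are placeholders):
$ fuzzball workflow submit --file <fuzzfile>   # Fuzzball matches the jobs to provisioner definitions
$ squeue                                       # The sbatch-launched allocation appears in the Slurm queue
$ fuzzball workflow get <workflow-id>          # Jobs run once Substrate registers from the allocated nodes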
Here’s a basic workflow that runs on Fuzzball through Slurm or PBS:
version: v1
jobs:
  hello-hpc:
    image:
      uri: "docker://alpine:3.16"
    command:
      - "/bin/sh"
      - "-c"
      - "echo 'Hello from the HPC cluster!' && hostname && nproc && free -h"
    resource:
      cpu:
        cores: 2
      memory:
        size: "4 GiB"
Save this as hello-hpc.fz and submit:
$ fuzzball workflow submit --file hello-hpc.fz
Set execution timeouts to prevent runaway jobs:
jobs:
  my-job:
    image:
      uri: "docker://alpine:3.16"
    command: ["long-running-task.sh"]
    policy:
      timeout:
        execute: "2h" # Kill job after 2 hours
    resource:
      cpu:
        cores: 4
      memory:
        size: "8 GiB"
Timeout format: <number><unit> where unit is s (seconds), m (minutes), or h (hours).
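For example, the same two-hour limit can be written with any supported unit; a minimal sketch of just the policy block (values are illustrative):
policy:
  timeout:
    execute: "7200s" # 7200 seconds
    # equivalently "120m" (120 minutes) or "2h" (2 hours)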
Job timeouts interact with provisioner TTL settings:
- Provisioner TTL: Maximum runtime set in the provisioner definition (hard limit)
- Job Timeout: Maximum execution time for a specific job (soft limit)
- The effective timeout is the minimum of these two values
Example with provisioner TTL:
# Provisioner definition with a 6-hour-40-minute TTL
definitions:
  - id: "slurm-compute-small"
    provisioner: "slurm"
    ttl: 24000 # 6 hour 40 minute maximum
    spec:
      cpu: 4
      mem: "8 GiB"
If a job requests an 8-hour timeout but uses this provisioner, it will be terminated after 6 hours and 40 minutes (the provisioner TTL). Always ensure job timeouts are within the provisioner’s TTL limits.
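To stay inside that limit, set the job’s execute timeout at or below the provisioner TTL; a minimal Fuzzfile sketch in which the job name, image, and command are placeholders:
jobs:
  long-analysis:
    image:
      uri: "docker://alpine:3.16"
    command: ["run-analysis.sh"] # placeholder command
    policy:
      timeout:
        execute: "6h" # 21600 seconds, below the 24000-second provisioner TTL
    resource:
      cpu:
        cores: 4
      memory:
        size: "8 GiB"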
Cancel a running workflow:
$ fuzzball workflow cancel <workflow-id>
This will:
- Cancel all running jobs within a workflow via Slurm’s scancel or PBS’s qdel
- Signal Substrate processes to terminate
- Clean up allocated resources
- Update workflow status to CANCELLED
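To confirm that a cancellation has fully propagated, you can check both the Fuzzball and scheduler views; a minimal sketch using commands shown elsewhere in this guide (the workflow ID is a placeholder):
$ fuzzball workflow get <workflow-id>   # Status should now report CANCELLED
$ squeue                                # The backing Slurm allocation should no longer be listed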
Symptom: Workflow stuck in PENDING state
Possible Causes:
- No matching provisioner definition
- Slurm/PBS queue is full
- Partition is down or unavailable
Solutions:
$ fuzzball admin provisioner list # Check provisioner definitions
$ squeue # Check Slurm queue
$ sinfo -p compute # Check partition status
$ fuzzball workflow get <workflow-id> # Review workflow requirements vs available definitions
Symptom: Jobs start but fail within seconds
Possible Causes:
- Container image not found or inaccessible
- Command or script errors
- Insufficient resources (memory, disk)
- Missing dependencies or libraries
Solutions:
$ fuzzball workflow logs <workflow-id> --job <job-name> # Check job logs for errors
$ docker pull <image-uri> # Verify container image
$ docker run <image-uri> <command> # Test command locally
$ fuzzball workflow jobs <workflow-id> # Check resource allocation
Symptom: Slurm job starts but Substrate doesn’t register
Possible Causes:
- Substrate binary not installed on compute nodes
- Network connectivity issues
- Authentication or firewall problems
- Substrate configuration errors
Solutions:
$ ssh slurm-head 'ssh <compute-node> ps aux | grep fuzzball-substrate' # Check if Substrate is running on Slurm node
$ ssh slurm-head 'ssh <compute-node> journalctl -u fuzzball-substrate' # Check Substrate logs on compute node
$ ssh slurm-head 'ssh <compute-node> ping <orchestrator-host>' # Verify network connectivity from compute node to Orchestrate
$ ssh slurm-head cat slurm-<job-id>.out # Check Slurm job output
Symptom: Jobs running slower than expected
Possible Causes:
- Resource contention with other Slurm jobs
- CPU affinity not optimal
- Storage I/O bottlenecks
- Network latency
Solutions:
$ ssh slurm-head 'squeue -o "%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R"' # Check node load
$ ssh slurm-head 'sacct -j <job-id> --format=JobID,MaxRSS,MaxVMSize,CPUTime,Elapsed' # Monitor job resource usage
Adjust CPU affinity in the workflow. Use affinity: "SOCKET" or "NUMA" for better performance.
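As a sketch of how that might look in a job’s resource block, assuming affinity is set alongside cores under resource.cpu (check the Fuzzfile reference for your Fuzzball version to confirm the field placement):
resource:
  cpu:
    cores: 8
    affinity: "NUMA" # assumed placement; "SOCKET" is the other option mentioned above
  memory:
    size: "16 GiB"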
Use local scratch space for I/O-intensive operations:
volumes:
  - name: "scratch"
    mount_path: "/scratch"
    storage_class: "ephemeral" # Faster than network storage
For more troubleshooting guidance, see the Troubleshooting Guide.