Fuzzfile Syntax Guide
Workflows are the fundamental unit of work in Fuzzball, orchestrating container execution, data movement, and resource allocation across compute clusters. They are described by Fuzzfiles — YAML 1.2 documents with a well-defined structure and syntax. A workflow definition consists of several top-level sections that together define the complete execution lifecycle: how data flows into and out of the workflow, what compute operations to perform, what resources those operations require, and how jobs depend on each other. The top-level sections are:
version: v1
files:
defaults:
annotations:
volumes:
jobs:
services:
Each section serves a specific purpose:
- version (required): Specifies the workflow syntax version (currently always v1)
- files (optional): Defines inline file content that can be referenced in volumes and jobs
- defaults (optional): Sets default configurations (mounts, environment, policy, resources) for jobs
- annotations (optional): Provides workflow-level metadata for scheduling and placement
- volumes (optional): Defines storage volumes with data ingress (importing) and egress (exporting) capabilities
- jobs: Specifies the computational work to be performed, organized as a directed acyclic graph (DAG)
- services: Defines long-running container services that support jobs or provide external endpoints
A workflow must define at least one job or service.
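For orientation, here is a minimal but complete Fuzzfile: only the required version field plus a single job. The job name, image, and command are illustrative.

version: v1

jobs:
  hello:
    image:
      uri: docker://alpine:latest
    command: [echo, "hello from Fuzzball"]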
version

Required. Denotes the workflow syntax version. Currently the only allowed value is v1.
Structure:
version: v1
Example:
version: v1
files

Optional. Defines inline file contents that can be referenced in volume ingress and jobs using a file://<arbitrary_name> URI scheme. This is useful for embedding small configuration files, scripts, or data directly in the workflow definition rather than storing them externally. Note that some content may have to be base64 encoded to ensure that the resulting YAML file is valid.
Structure:
files:
<file-name>: |
<file-contents>
<another-file-name>: |
<more-contents>
Fields:
- <file-name>: An arbitrary name to identify this inline file. This name is used in file:// URIs to reference the content. Can be any valid YAML key.
- <file-contents>: The actual content of the file. Use the | literal block scalar for multi-line content.
Example:
files:
config.ini: |
[settings]
debug=true
timeout=30
data.csv: |
id,name,value
1,alpha,42
2,beta,3.14
Usage:
Inline files can be referenced in two ways:
In volume ingress - to copy the content into a volume:

volumes:
  my:
    reference: volume://user/ephemeral
    ingress:
      - source:
          uri: file://config.ini # References the inline file
        destination:
          uri: file://config/app.ini # Destination in the volume

In job files - to bind mount the content directly into a container:

jobs:
  my-job:
    image:
      uri: docker://alpine:latest
    files:
      /etc/app/config.ini: file://config.ini # Mounted at this path in container
    command: [cat, /etc/app/config.ini]
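For content that is awkward to embed verbatim (binary data, or text that would break the surrounding YAML), one option is to store it base64 encoded and decode it inside the job, as the note above suggests. The sketch below is illustrative rather than prescriptive; the file and job names are arbitrary, and the encoded string is simply "hello world".

files:
  payload.b64: |
    aGVsbG8gd29ybGQK

jobs:
  decode-payload:
    image:
      uri: docker://alpine:latest
    files:
      /tmp/payload.b64: file://payload.b64
    script: |
      #!/bin/sh
      # Decode the bind-mounted base64 content before using it
      base64 -d /tmp/payload.b64 > /tmp/payload.txt
      cat /tmp/payload.txt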
defaults

Optional. Sets default configurations that are applied to all jobs in the workflow. Individual jobs can override these defaults by specifying their own values. This is useful for reducing repetition when multiple jobs share common settings like volume mounts or environment variables.
Structure:
defaults:
job:
env: # Optional
- <VAR>=<value>
mounts: # Optional
<volume-name>:
location: <path>
policy: # Optional
timeout:
execute: <duration>
retry:
attempts: <number>
resource: # Optional
cpu:
cores: <number>
affinity: <CORE|SOCKET|NUMA>
sockets: <number>
threads: <boolean>
memory:
size: <size>
by-core: <boolean>
devices:
<device-type>: <count>
exclusive: <boolean>
annotations:
<key>: <value>
Fields:
- job: Container for job-level defaults
  - env (optional): List of environment variables in KEY=VALUE format that will be set in all job containers
  - mounts (optional): Map of volume names to mount specifications. Each mount defines:
    - location: Absolute path in the container where the volume will be mounted
  - policy (optional): Default execution policies for all jobs
    - timeout (optional): Time limits for job execution
      - execute: Maximum duration (e.g., 2h, 30m, 1h30m)
    - retry (optional): Retry behavior on failure
      - attempts: Number of retry attempts before permanent failure
  - resource (optional): Default hardware resource requirements
    - cpu (required if resource is specified): CPU requirements
      - cores: Number of CPU cores
      - affinity: Binding strategy - CORE, SOCKET, or NUMA
      - sockets (optional): Number of CPU sockets
      - threads (optional): Whether to expose hardware threads (hyperthreading)
    - memory (required if resource is specified): Memory requirements
      - size: Amount of RAM (e.g., 4GB, 512MB)
      - by-core (optional): If true, size is per CPU core
    - devices (optional): Map of device types to counts (e.g., nvidia.com/gpu: 1)
    - exclusive (optional): If true, job gets exclusive access to the node
    - annotations (optional): Custom key-value pairs for advanced scheduling
Example:
defaults:
job:
env:
- SCRATCH=/scratch
- LOG_LEVEL=info
mounts:
scratch:
location: /scratch
policy:
timeout:
execute: 1h
retry:
attempts: 2
resource:
cpu:
cores: 2
affinity: CORE
threads: false
memory:
size: 4GB
by-core: false
In this example, all jobs will automatically:
- Have the SCRATCH and LOG_LEVEL environment variables set
- Mount the scratch volume at /scratch
- Time out after 1 hour and retry up to 2 times on failure
- Request 2 CPU cores and 4GB of memory

Individual jobs can override any of these defaults by specifying their own values. Default and job-specified environment variables are merged, with job variables taking precedence over defaults.
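To illustrate the override and merge behavior, the hypothetical job below, defined alongside the defaults above, raises its own resource request and adds to the environment; its LOG_LEVEL replaces the default value, while SCRATCH and the default mounts, timeout, and retry policy are inherited unchanged.

jobs:
  heavy-step:
    image:
      uri: docker://alpine:latest
    command: [env]
    env:
      - LOG_LEVEL=debug # Overrides the default LOG_LEVEL=info
      - EXTRA_FLAG=1 # Added on top of the inherited defaults
    resource:
      cpu:
        cores: 8 # Overrides the default of 2 cores
        affinity: CORE
      memory:
        size: 16GB # Overrides the default of 4GB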
annotations

Optional. Provides workflow-level metadata as key-value pairs. Currently used by the scheduler to influence job placement decisions, such as placing jobs in specific topology zones or node pools.
Structure:
annotations:
<key>: <value>
<another-key>: <another-value>
Fields:
- <key>: String key identifying the annotation
- <value>: String value for the annotation
Example:
annotations:
scheduling.fuzzball.io/node-pool: gpu-nodes
organization: research-team
Common Use Cases:
- Topology placement: Control which availability zone or region jobs run in
- Node pool selection: Target specific sets of compute nodes
- Organizational metadata: Tag workflows for tracking, billing, or auditing
Note that annotation behavior is determined by cluster configuration and may vary between deployments and provisioners.
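As a sketch of how this can combine with per-job annotations (described in the jobs section), a workflow-level node-pool hint can sit alongside a job-level device annotation. The keys shown reuse those from the examples in this guide, but whether and how they are honored depends entirely on your cluster configuration.

annotations:
  scheduling.fuzzball.io/node-pool: gpu-nodes # Workflow-level placement hint

jobs:
  gpu-check:
    image:
      uri: docker://alpine:latest
    command: [echo, "scheduled onto the gpu-nodes pool"]
    resource:
      cpu:
        cores: 4
      memory:
        size: 16GB
      devices:
        nvidia.com/gpu: 1
      annotations:
        nvidia.com/gpu.model: NVIDIA L40 # Job-level node selection hint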
volumes

Optional. Defines storage volumes that jobs can access. Volumes provide persistent or ephemeral storage for workflow data and support data ingress (importing data at workflow start) and egress (exporting data at workflow end).
Structure:
volumes:
<volume-name>:
reference: <volume-uri> # Required
ingress: # Optional
- source:
uri: <source-uri>
secret: <secret-ref> # Optional
destination:
uri: <dest-uri>
policy: # Optional
timeout:
execute: <duration>
retry:
attempts: <number>
egress: # Optional
- source:
uri: <source-uri>
destination:
uri: <dest-uri>
secret: <secret-ref> # Optional
policy: # Optional
timeout:
execute: <duration>
retry:
attempts: <number>
Fields:
- <volume-name>: Arbitrary name for the volume, used to reference it in job mounts sections (e.g., scratch, data)
- reference (required): Volume URI from a storage class. Common formats:
  - volume://user/ephemeral - Ephemeral volume that exists only for the workflow’s lifetime
  - volume://user/persistent/<name> - Persistent volume that survives workflow completion. Note that <name> will only have an effect if your storage class is configured to use user-configured volume names.
- ingress (optional): List of files to import into the volume at workflow start
  - source (required): Where to fetch the data from
    - uri: Source location. Supported schemes: s3://, http://, https://, file:// (references inline files from the files section)
    - secret (optional): Reference to credentials for accessing the source (format: secret://<scope>/<name>)
  - destination (required): Where to place the data in the volume
    - uri: Destination path in the volume (format: file://<path>). When jobs mount this volume, the path will be relative to the mount point
  - policy (optional): Execution policies for the transfer
    - timeout (optional): Time limit for the transfer
      - execute: Maximum duration (e.g., 5m, 1h)
    - retry (optional): Retry behavior on failure
      - attempts: Number of retry attempts
- egress (optional): List of files to export from the volume at workflow end. Structure is similar to ingress, but source is the file in the volume and destination is the external storage location
Example:
volumes:
scratch:
reference: volume://user/ephemeral
ingress:
- source:
uri: s3://my-bucket/input-data.tar.gz
secret: secret://account/AWS_CREDENTIALS
destination:
uri: file://inputs/data.tar.gz
policy:
timeout:
execute: 10m
retry:
attempts: 3
- source:
uri: file://config.ini # References inline file
destination:
uri: file://config/app.ini
egress:
- source:
uri: file://results/output.tar.gz
destination:
uri: s3://my-bucket/results/output.tar.gz
secret: secret://account/AWS_CREDENTIALS
data:
reference: volume://user/persistent/research-data
In this example:
- scratch is ephemeral, downloads data from S3 at start, and uploads results at end
- data is persistent and survives workflow completion with no data transfers
Path Resolution Example:
If a job mounts scratch at /scratch, a file ingressed to file://inputs/data.tar.gz
will be available in the job container at /scratch/inputs/data.tar.gz.
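A minimal sketch of such a job follows; the job name and command are illustrative, and the scratch volume is the one defined in the example above.

jobs:
  inspect-input:
    image:
      uri: docker://alpine:latest
    mounts:
      scratch:
        location: /scratch
    # The file ingressed to file://inputs/data.tar.gz appears under the mount point
    command: [ls, -l, /scratch/inputs/data.tar.gz]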
jobs

Optional (but at least one job or service is required). Defines the computational work to be performed as a directed acyclic graph (DAG) of compute steps. Jobs can run scientific software, data processing scripts, or any containerized application. Jobs are the fundamental execution unit in a workflow.
Structure:
jobs:
<job-name>:
image: # Required
uri: <image-uri>
secret: <secret-ref> # Optional
decryption-secret: <secret-ref> # Optional
command: [<arg1>, <arg2>, ...] # One of command or script required (mutually exclusive with script)
script: | # One of command or script required (mutually exclusive with command)
<script-content>
args: [<arg1>, <arg2>, ...] # Optional
env: # Optional
- <VAR>=<value>
cwd: <working-directory> # Optional
mounts: # Optional
<volume-name>:
location: <mount-path>
files: # Optional
<container-path>: <file-uri>
resource: # Optional
cpu: # Required if resource specified
cores: <number>
affinity: <CORE|SOCKET|NUMA>
sockets: <number> # Optional
threads: <boolean> # Optional
memory: # Required if resource specified
size: <size>
by-core: <boolean> # Optional
devices: # Optional
<device-type>: <count>
exclusive: <boolean> # Optional
annotations: # Optional
<key>: <value>
policy: # Optional
timeout:
execute: <duration>
retry:
attempts: <number>
requires: [<job1>, <job2>, ...] # Optional (deprecated, use depends-on)
depends-on: # Optional
- name: <job-name>
status: <RUNNING|FINISHED>
description: <text> # Optional
multinode: # Optional (mutually exclusive with task-array, network)
nodes: <number>
implementation: <ompi|openmpi|mpich|gasnet>
procs-per-node: <number> # Optional
network: # Optional (mutually exclusive with multinode, task-array)
isolated: <boolean> # Required (must be true)
expose-tcp: [<number>, <number>, ...] # Optional
expose-udp: [<number>, <number>, ...] # Optional
task-array: # Optional (mutually exclusive with multinode, network)
start: <number>
end: <number>
concurrency: <number> # Optional
Fields:
- <job-name>: Arbitrary name identifying this job (e.g., preprocess-data, train-model). Must be a valid DNS subdomain and must be unique within a workflow. This name appears in fuzzball workflow status and is used in dependency specifications
- image (required): Container image specification
  - uri: Image location. Supported schemes: docker:// for OCI containers, oras:// for SIF images (e.g., oras://depot.ciq.com/fuzzball/fuzzball-applications/curl.sif:latest)
  - secret (optional): Credentials for private registries (format: secret://<scope>/<name>)
  - decryption-secret (optional): Secret to decrypt encrypted SIF images
- command (required, mutually exclusive with script): List of arguments for the container entrypoint (e.g., [python3, script.py, --input, /data])
- script (required, mutually exclusive with command): Multi-line shell script to execute. Must start with a shebang line (e.g., #!/bin/bash)
- args (optional): Additional arguments passed to command or script
- env (optional): List of environment variables in KEY=VALUE format
- cwd (optional): Working directory for the job. Must be an absolute path. Defaults to the image’s working directory or /
- mounts (optional): Map of volume names (from the volumes section) to mount specifications
  - location: Absolute path in the container where the volume will be mounted
- files (optional): Map of container paths to inline file URIs. Bind mounts inline files directly into the container (e.g., /etc/config.ini: file://my-config)
- resource (optional): Hardware resource requirements for scheduling
  - cpu (required if resource specified): CPU requirements
    - cores: Number of CPU cores (must be > 0)
    - affinity: Binding strategy - CORE (any cores), SOCKET (same socket), or NUMA (same NUMA domain). Defaults to CORE
    - sockets (optional): Number of physical CPU sockets
    - threads (optional): Whether to expose hardware threads (hyperthreading)
  - memory (required if resource specified): Memory requirements
    - size: Amount of RAM with units (e.g., 4GB, 512MB, 2GiB)
    - by-core (optional): If true, size is per CPU core, i.e. total size is cores * size
  - devices (optional): Map of device types to counts (e.g., nvidia.com/gpu: 2)
  - exclusive (optional): If true, job gets exclusive node access
  - annotations (optional): Custom key-value pairs for advanced node selection (e.g., CPU architecture, GPU model)
- policy (optional): Execution policies
  - timeout (optional): Time limits
    - execute: Maximum job duration (e.g., 2h, 30m)
  - retry (optional): Retry behavior
    - attempts: Number of retries on failure
- requires (optional, deprecated): List of job names that must complete before this job starts. Use depends-on instead
- depends-on (optional): List of concrete dependencies with status requirements. Note that depending on a job array to finish will wait for all tasks to finish.
  - name: Job or service name to depend on
  - status: Required status - FINISHED (job/service completed) or RUNNING (job/service is running)
  - description (optional): Human-readable explanation of the dependency
- multinode (optional, mutually exclusive with task-array and network): Multi-node parallel execution
  - nodes: Number of nodes to allocate
  - implementation: MPI/communication implementation - ompi, openmpi, mpich, or gasnet
  - procs-per-node (optional): Processes per node. Defaults to the number of allocated CPUs
- network (optional, mutually exclusive with multinode and task-array): If present, the job runs inside its own (isolated) network namespace
  - isolated: Must be true when the network section is specified
  - expose-tcp (optional): List of container TCP ports to expose
  - expose-udp (optional): List of container UDP ports to expose
- task-array (optional, mutually exclusive with multinode and network): Embarrassingly parallel execution
  - start: Starting task ID (inclusive, must be > 0)
  - end: Ending task ID (inclusive, must be >= start)
  - concurrency (optional): Maximum number of parallel tasks (max 200). Each task receives its task ID in the $FB_TASK_ID environment variable
Example:
jobs:
preprocess:
image:
uri: docker://python:3.11
script: |
#! /bin/bash
python3 preprocess.py --input /data/raw/${FB_TASK_ID} --output /data/processed/${FB_TASK_ID}
env:
- PYTHONUNBUFFERED=1
mounts:
data:
location: /data
resource:
cpu:
cores: 4
affinity: NUMA
memory:
size: 8GB
policy:
timeout:
execute: 30m
retry:
attempts: 2
task-array:
start: 1
end: 1000
concurrency: 100
train-multinode:
image:
uri: docker://nvcr.io/nvidia/pytorch:24.01-py3
secret: secret://user/NGC_API_KEY
    command: [python, -m, torch.distributed.run, train.py, --input, /data/processed]
depends-on:
- name: preprocess
status: FINISHED
resource:
cpu:
cores: 32
affinity: SOCKET
memory:
size: 128GB
devices:
nvidia.com/gpu: 4
annotations:
        nvidia.com/gpu.model: NVIDIA L40
multinode:
nodes: 4
implementation: openmpi
procs-per-node: 4
This example demonstrates:
- preprocess: A task-array job with resource requests and a retry policy
- train-multinode: Multi-node MPI job with GPUs and explicit dependencies
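The network option for jobs is not shown above. A minimal sketch of a job running in an isolated network namespace follows; the image, exposed port, and command are illustrative assumptions.

jobs:
  isolated-net-demo:
    image:
      uri: docker://alpine:latest
    # Runs in its own network namespace; 8080 is an arbitrary example port
    command: [cat, /proc/net/dev]
    network:
      isolated: true
      expose-tcp: [8080]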
services

Optional (but at least one job or service is required). Defines long-running container services that support jobs or provide external endpoints. Unlike jobs, which complete and exit, services run continuously for the workflow duration (or until dependent jobs finish). Services are useful for databases, web servers, message queues, interactive computing (e.g., Jupyter or RStudio), AI inference servers, or any persistent service that jobs need to access.
Structure:
services:
<service-name>:
image: # Required
uri: <image-uri>
secret: <secret-ref> # Optional
decryption-secret: <secret-ref> # Optional
command: [<arg1>, <arg2>, ...] # One of command or script required (mutually exclusive with script)
script: | # One of command or script required (mutually exclusive with command)
<script-content>
args: [<arg1>, <arg2>, ...] # Optional
env: # Optional
- <VAR>=<value>
cwd: <working-directory> # Optional
mounts: # Optional
<volume-name>:
location: <mount-path>
files: # Optional
<container-path>: <file-uri>
resource: # Optional
cpu:
cores: <number>
affinity: <CORE|SOCKET|NUMA>
sockets: <number> # Optional
threads: <boolean> # Optional
memory:
size: <size>
by-core: <boolean> # Optional
devices: # Optional
<device-type>: <count>
exclusive: <boolean> # Optional
annotations: # Optional
<key>: <value>
requires: [<job1>, <svc1>, ...] # Optional (deprecated, use depends-on)
depends-on: # Optional
- name: <job-or-service-name>
status: <RUNNING|FINISHED>
description: <text> # Optional
multinode: # Optional
nodes: <number>
implementation: <ompi|openmpi|mpich|gasnet>
procs-per-node: <number> # Optional
network: # Optional (mutually exclusive with multinode)
ports:
- name: <port-name>
port: <port-number>
protocol: <tcp|udp> # Optional
endpoints: # Optional
- name: <endpoint-name>
port-name: <port-name> # References port name above
protocol: <http|https|grpc|grpcs|tcp|tls>
type: <subdomain|path>
scope: <endpoint-scope> # One of: user, group, organization, public
persist: <boolean> # Optional
readiness-probe: # Optional
exec: # One of: exec, http-get, tcp-socket, grpc
command: [<arg1>, <arg2>]
http-get:
path: <path>
port: <port-number>
scheme: <HTTP|HTTPS> # Optional
http-headers: # Optional
- name: <header-name>
value: <header-value>
tcp-socket:
port: <port-number>
grpc:
port: <port-number>
service: <service-name> # Optional
initial-delay-seconds: <seconds> # Optional
period-seconds: <seconds> # Optional
timeout-seconds: <seconds> # Optional
success-threshold: <number> # Optional
failure-threshold: <number> # Optional
Fields:
Services share many fields with jobs (image, command, script, args, env, cwd, mounts, files, resource, depends-on, multinode). See the jobs section for details on these common fields. Service-specific fields are:
- <service-name>: Arbitrary name identifying this service. Must be a valid DNS subdomain and must be unique within a workflow.
- network (optional): Network configuration for service exposure. If present and the list of exposed ports is not empty, the service runs inside its own (isolated) network namespace.
  - ports: List of ports the service listens on
    - name: Identifier for this port (used in endpoints)
    - port: Port number (1-65535)
    - protocol (optional): tcp or udp. Defaults to tcp
  - endpoints (optional): List of external endpoints to create
    - name: Endpoint identifier
    - port-name: References a port name from the ports list
    - protocol: Protocol - http, https, grpc, grpcs, tcp, or tls
    - type: Endpoint style - subdomain (creates <name>.<workflow-id>.<account>.fuzzball) or path (creates /endpoints/<account>/<workflow-id>/<name>)
    - scope: Determines who can access the endpoint. One of user (only the workflow creator), group (anyone in the same account), organization (anyone in the same organization), or public (anyone, without authentication). Defaults to group if not specified.
- persist (optional): If true, the service continues running even after all dependent jobs finish, until the workflow is cancelled. If false (default), the service stops when no jobs/services depend on it.
- readiness-probe (optional): Kubernetes-style health check to determine when the service is ready. Service status transitions from STARTED to RUNNING only after the probe succeeds
  - exec: Run a command in the container. Success if the exit code is 0
    - command: Command to execute
  - http-get: HTTP GET request. Success if the status code is 200-399
    - path: HTTP path
    - port: Port number
    - scheme (optional): HTTP or HTTPS
    - http-headers (optional): Custom HTTP headers
  - tcp-socket: TCP connection attempt. Success if the connection establishes
    - port: Port number
  - grpc: gRPC health check. Success per the gRPC health checking protocol
    - port: Port number
    - service (optional): gRPC service name
  - initial-delay-seconds (optional): Delay before the first probe
  - period-seconds (optional): Frequency of probes
  - timeout-seconds (optional): Probe timeout
  - success-threshold (optional): Consecutive successes needed
  - failure-threshold (optional): Consecutive failures before marking the service unhealthy
Example:
services:
postgres:
image:
uri: docker://postgres:16
env:
- POSTGRES_PASSWORD=secret
- POSTGRES_DB=myapp
mounts:
db:
location: /var/lib/postgresql/data
resource:
cpu:
cores: 4
memory:
size: 8GB
network:
ports:
- name: postgres
port: 5432
protocol: tcp
readiness-probe:
tcp-socket:
port: 5432
initial-delay-seconds: 5
period-seconds: 10
persist: true
api-server:
image:
uri: docker://mycompany/api:v1.2.3
secret: secret://account/REGISTRY_CREDS
env:
- DATABASE_URL=postgresql://postgres:5432/myapp
depends-on:
- name: postgres
status: RUNNING
description: "API needs database connection"
resource:
cpu:
cores: 2
memory:
size: 4GB
network:
ports:
- name: http
port: 8080
endpoints:
- name: api
port-name: http
protocol: https
type: subdomain
scope: group
readiness-probe:
http-get:
path: /health
port: 8080
initial-delay-seconds: 10
period-seconds: 5
failure-threshold: 3
jobs:
data-processor:
image:
uri: docker://mycompany/processor:latest
command: [python, process.py, --api-url, http://api-server:8080]
depends-on:
- name: api-server
status: RUNNING
resource:
cpu:
cores: 8
memory:
size: 16GB
This example shows:
- postgres: Persistent database service with readiness probe
- api-server: REST API depending on postgres, exposed via an HTTPS subdomain endpoint at group scope
- data-processor: Job that depends on api-server being running before it starts
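The example above uses subdomain endpoints with tcp-socket and http-get probes. As a complementary sketch, the service below exposes a path-type endpoint at user scope and uses an exec readiness probe. The image, launch flags, probe command, and port are assumptions for illustration rather than a tested recipe; authentication setup for the notebook is omitted.

services:
  notebook:
    image:
      uri: docker://jupyter/minimal-notebook:latest
    script: |
      #!/bin/bash
      # Launch flags are illustrative and may differ between Jupyter versions
      jupyter lab --ip=0.0.0.0 --port=8888 --no-browser
    resource:
      cpu:
        cores: 2
      memory:
        size: 4GB
    network:
      ports:
        - name: http
          port: 8888
      endpoints:
        - name: lab
          port-name: http
          protocol: http
          type: path
          scope: user
    readiness-probe:
      exec:
        # Succeeds once the Jupyter server is up enough to answer CLI queries
        command: [jupyter, lab, list]
      initial-delay-seconds: 15
      period-seconds: 10
    persist: true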