Provisioner Configuration
Provisioner definitions specify the compute resources available through Slurm or PBS and how workflows should request them. These definitions are created through Fuzzball’s provisioner configuration system and enable fine-grained control over resource allocation, access policies, and cost tracking.
Unlike AWS provisioning where instance types are pre-defined, Slurm and PBS provisioner definitions are flexible specifications that describe the resources available on your cluster. When a workflow is submitted, Fuzzball matches each job’s resource requirements against available provisioner definitions and submits resource requests directly to Slurm or PBS.
Here’s a simple provisioner definition for a Slurm cluster:
definitions:
- id: "slurm-compute-small"
provisioner: "slurm"
ttl: 24000
exclusive: "job"
spec:
cpu: 4
mem: "8GiB"
cost_per_hour: 0.5
This definition specifies a compute resource with 4 CPU cores and 8 GiB of memory, tracked at $0.50
per hour for billing purposes. The ttl (time-to-live) of 24000 seconds (6 hours 40 minutes) sets a
maximum runtime for jobs, and exclusive: "job" ensures job-level resource isolation.
Do not set the TTL below 24000 seconds. Fuzzball has hardcoded time limits for the jobs it creates to pull OCI containers, and setting a TTL lower than 24000 can make it impossible to run jobs.
| Field | Type | Description | Example |
|---|---|---|---|
| id | string | Unique identifier for this provisioner definition | "slurm-compute-small" |
| provisioner | string | Must be "slurm" for Slurm resources | "slurm" |
| spec.cpu | integer | Number of CPU cores | 4 |
| spec.mem | string | Memory size with unit | "8GiB" |
| ttl | integer | Time-to-live in seconds (maximum job runtime) | 24000 |
| Field | Type | Description | Example |
|---|---|---|---|
| exclusive | string | Resource exclusivity level: "job" or "workflow" | "job" |
| spec.cost_per_hour | float | Cost per hour for billing/tracking | 0.5 |
| spec.partition | string | Slurm partition to submit jobs to | "compute" |
| spec.account | string | Slurm account to charge | "research-group" |
| spec.qos | string | Quality of Service level | "normal" |
| spec.constraint | string | Node feature constraints | "gpu" |
| spec.exclude | string | Nodes to exclude | "node[01-05]" |
| spec.prefer | string | Preferred nodes | "node[10-20]" |
| spec.reservation | string | Reservation to use | "maintenance" |
| spec.comment | string | Job comment | "Workflow execution" |
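These optional fields can be combined as needed. A sketch that pulls several of them together (the id, partition, and account values are illustrative, not prescribed):
definitions:
  - id: "slurm-compute-tuned"
    provisioner: "slurm"
    ttl: 24000
    exclusive: "job"
    spec:
      cpu: 8
      mem: "16GiB"
      partition: "compute"
      account: "research-group"
      qos: "normal"
      cost_per_hour: 1.0
      comment: "Workflow execution"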
Memory sizes must include a unit suffix:
"1GiB"- Gibibytes (1024³ bytes)"1GB"- Gigabytes (1000³ bytes)"1MiB"- Mebibytes (1024² bytes)"1MB"- Megabytes (1000² bytes)
The ttl parameter sets a maximum runtime for jobs using this provisioner definition:
definitions:
- id: "slurm-short"
provisioner: "slurm"
ttl: 24000 # 6 hour 40 minute maximum
spec:
cpu: 4
mem: "8GiB"
- id: "slurm-long"
provisioner: "slurm"
ttl: 86400 # 24 hours maximum
spec:
cpu: 8
mem: "16GiB"
Behavior:
- Jobs exceeding the TTL are automatically terminated
- TTL is specified in seconds
- A 60-second buffer is added for graceful shutdown
- If not specified, jobs have no time limit (subject to Slurm configuration)
Use Cases:
- Prevent runaway jobs from consuming resources indefinitely
- Match Slurm partition time limits
- Enforce different time limits for different resource tiers
- Support cost management by limiting job duration
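To derive a ttl for a target wall time, multiply the number of hours by 3600. For example (the id below is illustrative):
definitions:
  - id: "slurm-twelve-hour"
    provisioner: "slurm"
    ttl: 43200 # 12 h × 3600 s/h = 43200 s
    spec:
      cpu: 8
      mem: "16GiB"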
The exclusive parameter controls resource isolation:
definitions:
- id: "slurm-exclusive-job"
provisioner: "slurm"
exclusive: "job" # Resources exclusive to individual jobs
ttl: 24000
spec:
cpu: 8
mem: "16GiB"
- id: "slurm-exclusive-workflow"
provisioner: "slurm"
exclusive: "workflow" # Resources exclusive to entire workflow
ttl: 24000
spec:
cpu: 16
mem: "32GiB"
Options:
"job": Each job gets exclusive access to its allocated resources"workflow": All jobs in a workflow share exclusive access to resources- Not specified: Resources may be shared with other Slurm jobs
Use Cases:
- Job-level exclusivity: Ideal for performance-sensitive workloads requiring consistent resources
- Workflow-level exclusivity: Useful for multi-job workflows that need dedicated nodes
- Shared resources: Default behavior for cost-efficient resource utilization
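For the shared (default) case, simply omit exclusive. A minimal sketch:
definitions:
  - id: "slurm-shared" # illustrative id
    provisioner: "slurm"
    ttl: 24000
    # no "exclusive" key, so resources may be shared with other Slurm jobs
    spec:
      cpu: 4
      mem: "8GiB"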
You can create multiple provisioner definitions to offer different resource tiers:
definitions:
# Small compute nodes
- id: "slurm-compute-small"
provisioner: "slurm"
ttl: 24000
exclusive: "job"
spec:
cpu: 4
mem: "8GiB"
partition: "compute"
cost_per_hour: 0.5
# Large compute nodes
- id: "slurm-compute-large"
provisioner: "slurm"
ttl: 28800 # 8 hours
exclusive: "job"
spec:
cpu: 32
mem: "128GiB"
partition: "compute"
cost_per_hour: 2.0
# GPU nodes
- id: "slurm-gpu"
provisioner: "slurm"
ttl: 43200 # 12 hours
exclusive: "workflow"
spec:
cpu: 16
mem: "64GiB"
partition: "gpu"
constraint: "gpu"
cost_per_hour: 5.0
Organize definitions by Slurm partitions for different hardware or user groups:
definitions:
# General compute partition
- id: "slurm-general"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 8
mem: "16GiB"
partition: "general"
qos: "normal"
cost_per_hour: 0.75
# High-priority partition
- id: "slurm-priority"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 8
mem: "16GiB"
partition: "priority"
qos: "high"
cost_per_hour: 1.5
# Development/testing partition
- id: "slurm-devel"
provisioner: "slurm"
spec:
cpu: 2
mem: "4GiB"
partition: "devel"
qos: "testing"
cost_per_hour: 0.1
Map provisioner definitions to different Slurm accounts for charge-back:
definitions:
# Engineering department
- id: "slurm-engineering"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 16
mem: "32GiB"
account: "engineering"
partition: "compute"
cost_per_hour: 1.0
# Research department
- id: "slurm-research"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 16
mem: "32GiB"
account: "research"
partition: "compute"
cost_per_hour: 1.0
# Marketing department
- id: "slurm-marketing"
provisioner: "slurm"
spec:
cpu: 4
mem: "8GiB"
account: "marketing"
partition: "shared"
cost_per_hour: 0.5
Use Slurm constraints to target specific hardware features:
definitions:
# Haswell nodes
- id: "slurm-haswell"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 20
mem: "64GiB"
constraint: "haswell"
partition: "compute"
cost_per_hour: 0.8
# Broadwell nodes
- id: "slurm-broadwell"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 28
mem: "128GiB"
constraint: "broadwell"
partition: "compute"
cost_per_hour: 1.2
# AMD EPYC nodes
- id: "slurm-epyc"
provisioner: "slurm"
spec:
cpu: 64
mem: "256GiB"
constraint: "epyc"
partition: "compute"
cost_per_hour: 2.5
# GPU nodes with V100
- id: "slurm-v100"
provisioner: "slurm"
spec:
cpu: 16
mem: "96GiB"
constraint: "v100"
partition: "gpu"
cost_per_hour: 8.0
# GPU nodes with A100
- id: "slurm-a100"
provisioner: "slurm"
spec:
cpu: 32
mem: "256GiB"
constraint: "a100"
partition: "gpu"
cost_per_hour: 15.0
Control job placement with exclude and prefer directives:
definitions:
# Exclude maintenance nodes
- id: "slurm-stable"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 8
mem: "16GiB"
partition: "compute"
exclude: "node[01-05]" # Exclude nodes undergoing maintenance
cost_per_hour: 0.75
# Prefer newer hardware
- id: "slurm-preferred"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 16
mem: "32GiB"
partition: "compute"
prefer: "node[100-200]" # Prefer recently upgraded nodes
cost_per_hour: 1.0
The recommended method is to use Fuzzball’s provisioner configuration system. Create a YAML file with your definitions and apply it through the admin API or CLI.
Example configuration file (slurm-provisioners.yaml):
apiVersion: fuzzball.io/v1alpha1
kind: CentralConfig
metadata:
name: slurm-provisioners
spec:
provisioner:
definitions:
- id: "slurm-compute-small"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 4
mem: "8GiB"
partition: "compute"
cost_per_hour: 0.5
- id: "slurm-compute-large"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 32
mem: "128GiB"
partition: "compute"
cost_per_hour: 2.0
Apply using the Fuzzball CLI:
$ fuzzball admin provisioner create --config slurm-provisioners.yaml
Fuzzball provides a catalog tool that can automatically discover resources from your Slurm cluster and generate provisioner definitions. This is useful for initial setup but not required for ongoing operation.
Unlike AWS provisioning where instance types must be cataloged, Slurm resources are discovered dynamically at runtime through the Substrate. The catalog tool is provided as a convenience for generating initial definitions based on actual cluster capabilities.
Build the cataloger using the fuzzy build system:
$ fuzzy build binary cataloger
Generate definitions for a specific partition:
$ ./build/cataloger provisioner cataloger slurm definition generate \
--catalog-path /tmp/slurm-catalog \
--id "slurm-gpu-discovery" \
--ssh-host "slurm-head.example.com" \
--username "fuzzball-service" \
--private-key-path "/etc/fuzzball/slurm-key" \
--public-key "$(cat /etc/fuzzball/slurm-key.pub)" \
--partition "gpu"The catalog tool will:
- Connect to your Slurm cluster via SSH
- Submit a test job to provision the Substrate
- Query the Substrate for available resources
- Generate a provisioner definition with discovered CPU, memory, and hardware specs
- Save the definition to the specified catalog path
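The output is an ordinary provisioner definition like the ones shown above. For the GPU partition in this example it might look roughly like the following, with the CPU and memory values reflecting whatever was discovered on the node (the exact fields emitted may vary):
definitions:
  - id: "slurm-gpu-discovery"
    provisioner: "slurm"
    spec:
      cpu: 16       # discovered core count (illustrative)
      mem: "96GiB"  # discovered memory (illustrative)
      partition: "gpu"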
When a workflow is submitted, Fuzzball matches workflow resource requirements against provisioner definitions:
Workflow requesting exactly what a definition offers:
Workflow (workflow.yaml):
version: v1
jobs:
my-job:
name: "compute-job"
image:
uri: "docker://alpine:latest"
command: ["echo", "Hello"]
resource:
cpu:
cores: 4
memory:
size: "8GiB"
Provisioner Definition:
definitions:
- id: "slurm-compute-small"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 4
mem: "8GiB"
This workflow will match the slurm-compute-small definition.
Workflow requesting less than what a definition offers:
Workflow:
resource:
cpu:
cores: 2
memory:
size: "4GiB"
Provisioner Definition:
definitions:
- id: "slurm-compute-small"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 4
mem: "8GiB"
This workflow can use the slurm-compute-small definition, consuming only 2 cores and 4 GiB.
Workflow requesting more than any definition offers:
Workflow:
resource:
cpu:
cores: 64
memory:
size: "512GiB"
If no provisioner definition offers 64 cores and 512 GiB, the workflow will fail with a “no matching provisioner definition” error.
Solution: Create a provisioner definition that meets or exceeds the requirements:
definitions:
- id: "slurm-compute-xlarge"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 64
mem: "512GiB"
partition: "himem"
cost_per_hour: 5.0
Use access policies to control which users or organizations can use specific provisioner definitions. This is configured through the central configuration system’s policy expressions.
Example: Restrict GPU nodes to specific organizations:
definitions:
- id: "slurm-gpu"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 16
mem: "64GiB"
partition: "gpu"
constraint: "gpu"
cost_per_hour: 5.0
access_policy: |
req.organization.id in ["org-research", "org-engineering"]
See the Provisioner Administration documentation for details on access policies and expression syntax.
The cost_per_hour field enables billing and cost tracking:
definitions:
- id: "slurm-compute"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 8
mem: "16GiB"
cost_per_hour: 0.75
When workflows run using this definition:
- Usage is tracked per hour of resource allocation
- Costs are calculated as: cost_per_hour × hours_used
- Billing reports aggregate costs by organization, account, or user
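For example, a workflow job that holds resources from this definition for 4 hours is tracked at 4 × $0.75 = $3.00.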
Set cost_per_hour to reflect your internal charge-back rates or actual operational costs.
Create a few standard definitions covering common use cases:
definitions:
- id: "slurm-small"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 4
mem: "8GiB"
cost_per_hour: 0.5
- id: "slurm-medium"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 16
mem: "32GiB"
cost_per_hour: 1.0
- id: "slurm-large"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 32
mem: "128GiB"
cost_per_hour: 2.0
Align definitions with your existing partition and account structure. Create one definition per partition/tier combination:
definitions:
- id: "slurm-compute-standard"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 8
mem: "16GiB"
partition: "compute"
qos: "normal"
- id: "slurm-compute-priority"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 8
mem: "16GiB"
partition: "compute"
qos: "high"
Choose IDs that clearly indicate the resource type:
# Good: Clear and descriptive
- id: "slurm-gpu-v100-32gb"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 16
mem: "32GiB"
constraint: "v100"
# Avoid: Unclear or too generic
- id: "def-1"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 16
mem: "32GiB"
Add comments to explain partition choices, constraints, or costs:
definitions:
# GPU partition with V100 GPUs for deep learning workloads
# Cost reflects amortized hardware and power costs
- id: "slurm-gpu-v100"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 16
mem: "96GiB"
partition: "gpu"
constraint: "v100"
cost_per_hour: 8.0
comment: "V100 GPU nodes for deep learning"
Create test definitions in a development partition:
definitions:
- id: "slurm-test"
provisioner: "slurm"
ttl: 24000
spec:
cpu: 2
mem: "4GiB"
partition: "devel"
qos: "testing"
cost_per_hour: 0.0 # Free for testing
Submit test workflows to verify configuration before promoting to production.
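A minimal test workflow that fits this definition could look like the following, mirroring the workflow format shown in the matching examples above (the job name and image are placeholders):
version: v1
jobs:
  test-job:
    name: "provisioner-test"
    image:
      uri: "docker://alpine:latest"
    command: ["echo", "Hello"]
    resource:
      cpu:
        cores: 2
      memory:
        size: "4GiB"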
After creating provisioner definitions, verify they’re available:
$ fuzzball admin provisioner list # List all provisioner definitions
$ fuzzball admin provisioner get slurm-compute-small # Show details of a specific definition
$ fuzzball workflow submit --file test-workflow.yaml # Test with a simple workflow
Symptom: Workflow fails with “no matching provisioner definition”
Solutions:
- Verify definition exists: fuzzball admin provisioner list
- Check workflow resource requirements match definition specs
- Ensure definition has provisioner: "slurm"
- Verify access policies allow your user/organization
Symptom: Workflow submitted but jobs never start on Slurm
Solutions:
- Check if partition exists: sinfo -p compute
- Verify account has access: sacctmgr show assoc where account=research-group
- Check QoS is valid: sacctmgr show qos
- Review Slurm logs: /var/log/slurm/slurmctld.log
Symptom: Jobs fail or get killed due to resource constraints
Solutions:
- Ensure definition specs match actual node capabilities
- Verify memory units (GiB vs GB)
- Check CPU count matches Slurm node configuration
- Use catalog tool to discover actual resources
For more troubleshooting guidance, see the Troubleshooting Guide.
Here’s a simple provisioner definition for a PBS cluster:
definitions:
- id: "pbs-compute-small"
provisioner: "pbs"
ttl: 24000
exclusive: "job"
spec:
cpu: 4
mem: "8GiB"
cost_per_hour: 0.5
This definition specifies a compute resource with 4 CPU cores and 8 GiB of memory, tracked at $0.50
per hour for billing purposes. The ttl (time-to-live) of 24000 seconds (6 hours 40 minutes) sets a
maximum runtime for jobs and is automatically converted to PBS walltime. The exclusive: "job" setting
ensures job-level resource isolation via an automatic place=free:excl directive.
Do not set the TTL below 24000 seconds. Fuzzball has hardcoded time limits for the jobs it creates to pull OCI containers, and setting a TTL lower than 24000 can make it impossible to run jobs.
| Field | Type | Description | Example |
|---|---|---|---|
| id | string | Unique identifier for this provisioner definition | "pbs-compute-small" |
| provisioner | string | Must be "pbs" for PBS resources | "pbs" |
| spec.cpu | integer | Number of CPU cores | 4 |
| spec.mem | string | Memory size with unit | "8GiB" |
| ttl | integer | Time-to-live in seconds (maximum job runtime) | 24000 |
| Field | Type | Description | Example |
|---|---|---|---|
| exclusive | string | Resource exclusivity level: "job" or "workflow" | "job" |
| spec.cost_per_hour | float | Cost per hour for billing/tracking | 0.5 |
| spec.queue | string | PBS queue to submit jobs to | "workq" |
| spec.gpus | integer | Number of GPUs requested | 1 |
Memory sizes must include a unit suffix:
"1GiB"- Gibibytes (1024³ bytes)"1GB"- Gigabytes (1000³ bytes)"1MiB"- Mebibytes (1024² bytes)"1MB"- Megabytes (1000² bytes)
The ttl parameter sets a maximum runtime for jobs and is automatically converted to PBS walltime:
definitions:
- id: "pbs-short"
provisioner: "pbs"
ttl: 24000 # 6 hour 40 minute maximum
spec:
cpu: 4
mem: "8GiB"
- id: "pbs-long"
provisioner: "pbs"
ttl: 86400 # 24 hours maximum
spec:
cpu: 8
mem: "16GiB"
Behavior:
- Jobs exceeding the TTL are automatically terminated
- TTL is specified in seconds and converted to PBS walltime format
- A 60-second buffer is added for graceful shutdown
- The PBS provisioner automatically handles walltime conversion
- If not specified, jobs have no time limit (subject to PBS queue configuration)
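For example, ttl: 24000 corresponds to a walltime request of 06:40:00 and ttl: 86400 to 24:00:00, before the 60-second shutdown buffer is accounted for.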
Use Cases:
- Prevent runaway jobs from consuming resources indefinitely
- Match PBS queue time limits
- Enforce different time limits for different resource tiers
- Support cost management by limiting job duration
The exclusive parameter controls resource isolation and automatically adds PBS placement directives:
definitions:
- id: "pbs-exclusive-job"
provisioner: "pbs"
ttl: 24000
exclusive: "job" # Resources exclusive to individual jobs
spec:
cpu: 8
mem: "16GiB"
- id: "pbs-exclusive-workflow"
provisioner: "pbs"
ttl: 24000
exclusive: "workflow" # Resources exclusive to entire workflow
spec:
cpu: 16
mem: "32GiB"
Options:
"job": Each job gets exclusive access to its allocated resources (adds#PBS -l place=free:excl)"workflow": All jobs in a workflow share exclusive access to resources- Not specified: Resources may be shared with other PBS jobs
Important: The PBS provisioner automatically adds the place=free:excl directive for exclusive job placement, ensuring nodes are not shared.
Use Cases:
- Job-level exclusivity: Ideal for performance-sensitive workloads requiring consistent resources
- Workflow-level exclusivity: Useful for multi-job workflows that need dedicated nodes
- Shared resources: Default behavior for cost-efficient resource utilization
For GPU-enabled compute nodes, specify the number of GPUs:
definitions:
- id: "pbs-gpu"
provisioner: "pbs"
ttl: 43200 # 12 hours
exclusive: "job"
spec:
cpu: 16
mem: "64GiB"
gpus: 2 # Request 2 GPUs
queue: "gpu"
cost_per_hour: 8.0
The GPU specification is translated to the appropriate PBS resource request format based on your PBS implementation (Torque or OpenPBS).
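With OpenPBS, for example, gpus: 2 would typically surface as an ngpus=2 chunk in the generated select statement; the exact form depends on your site's PBS version and configuration.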
You can create multiple provisioner definitions to offer different resource tiers:
definitions:
# Small compute nodes
- id: "pbs-compute-small"
provisioner: "pbs"
ttl: 24000
exclusive: "job"
spec:
cpu: 4
mem: "8GiB"
queue: "workq"
cost_per_hour: 0.5
# Large compute nodes
- id: "pbs-compute-large"
provisioner: "pbs"
ttl: 28800 # 8 hours
exclusive: "job"
spec:
cpu: 32
mem: "128GiB"
queue: "workq"
cost_per_hour: 2.0
# GPU nodes
- id: "pbs-gpu"
provisioner: "pbs"
ttl: 43200 # 12 hours
exclusive: "workflow"
spec:
cpu: 16
mem: "64GiB"
gpus: 2
queue: "gpu"
cost_per_hour: 8.0
Organize definitions by PBS queues for different hardware or user groups:
definitions:
# General compute queue
- id: "pbs-general"
provisioner: "pbs"
ttl: 24000 # 6 hours 40 minutes
spec:
cpu: 8
mem: "16GiB"
queue: "workq"
cost_per_hour: 0.75
# High-priority queue
- id: "pbs-priority"
provisioner: "pbs"
ttl: 28800 # 8 hours
spec:
cpu: 8
mem: "16GiB"
queue: "high"
cost_per_hour: 1.5
# Development/testing queue
- id: "pbs-devel"
provisioner: "pbs"
ttl: 24000 # 6 hours 40 minutes
spec:
cpu: 2
mem: "4GiB"
queue: "devel"
cost_per_hour: 0.1
Configure definitions for specific hardware capabilities:
definitions:
# Standard CPU nodes
- id: "pbs-cpu-standard"
provisioner: "pbs"
ttl: 24000
spec:
cpu: 20
mem: "64GiB"
queue: "workq"
cost_per_hour: 0.8
# High-memory nodes
- id: "pbs-himem"
provisioner: "pbs"
ttl: 28800
spec:
cpu: 28
mem: "256GiB"
queue: "himem"
cost_per_hour: 2.5
# GPU nodes with V100
- id: "pbs-v100"
provisioner: "pbs"
ttl: 43200
spec:
cpu: 16
mem: "96GiB"
gpus: 2
queue: "gpu"
cost_per_hour: 8.0
# GPU nodes with A100
- id: "pbs-a100"
provisioner: "pbs"
ttl: 86400
spec:
cpu: 32
mem: "256GiB"
gpus: 4
queue: "gpu-a100"
cost_per_hour: 15.0
The recommended method is to use Fuzzball’s provisioner configuration system. Create a YAML file with your definitions and apply it through the admin API or CLI.
Example configuration file (pbs-provisioners.yaml):
apiVersion: fuzzball.io/v1alpha1
kind: CentralConfig
metadata:
name: pbs-provisioners
spec:
provisioner:
definitions:
- id: "pbs-compute-small"
provisioner: "pbs"
ttl: 24000
exclusive: "job"
spec:
cpu: 4
mem: "8GiB"
queue: "workq"
cost_per_hour: 0.5
- id: "pbs-compute-large"
provisioner: "pbs"
ttl: 28800
exclusive: "job"
spec:
cpu: 32
mem: "128GiB"
queue: "workq"
cost_per_hour: 2.0
Apply using the Fuzzball CLI:
$ fuzzball admin provisioner create --config pbs-provisioners.yaml
When a workflow is submitted, Fuzzball matches workflow resource requirements against provisioner definitions:
Workflow requesting exactly what a definition offers:
Workflow (workflow.fz):
version: v1
jobs:
my-job:
image:
uri: "docker://alpine:latest"
command: ["echo", "Hello"]
resource:
cpu:
cores: 4
memory:
size: "8GiB"
Provisioner Definition:
definitions:
- id: "pbs-compute-small"
provisioner: "pbs"
ttl: 24000
spec:
cpu: 4
mem: "8GiB"
This workflow will match the pbs-compute-small definition.
Workflow requesting less than what a definition offers:
Workflow:
resource:
cpu:
cores: 2
memory:
size: "4GiB"
Provisioner Definition:
definitions:
- id: "pbs-compute-small"
provisioner: "pbs"
ttl: 24000
spec:
cpu: 4
mem: "8GiB"
This workflow can use the pbs-compute-small definition, consuming only 2 cores and 4 GiB.
Workflow requesting more than any definition offers:
Workflow:
resource:
cpu:
cores: 64
memory:
size: "512GiB"
If no provisioner definition offers 64 cores and 512 GiB, the workflow will fail with a “no matching provisioner definition” error.
Solution: Create a provisioner definition that meets or exceeds the requirements:
definitions:
- id: "pbs-compute-xlarge"
provisioner: "pbs"
ttl: 86400
spec:
cpu: 64
mem: "512GiB"
queue: "himem"
cost_per_hour: 5.0
Use access policies to control which users or organizations can use specific provisioner definitions. This is configured through the central configuration system’s policy expressions.
Example: Restrict GPU nodes to specific organizations:
definitions:
- id: "pbs-gpu"
provisioner: "pbs"
ttl: 43200
spec:
cpu: 16
mem: "64GiB"
gpus: 2
queue: "gpu"
cost_per_hour: 8.0
access_policy: |
req.organization.id in ["org-research", "org-engineering"]
See the Provisioner Administration documentation for details on access policies and expression syntax.
The cost_per_hour field enables billing and cost tracking:
definitions:
- id: "pbs-compute"
provisioner: "pbs"
ttl: 24000
spec:
cpu: 8
mem: "16GiB"
cost_per_hour: 0.75
When workflows run using this definition:
- Usage is tracked per hour of resource allocation
- Costs are calculated as: cost_per_hour × hours_used
- Billing reports aggregate costs by organization, account, or user
Set cost_per_hour to reflect your internal charge-back rates or actual operational costs.
Create a few standard definitions covering common use cases:
definitions:
- id: "pbs-small"
provisioner: "pbs"
ttl: 24000
spec:
cpu: 4
mem: "8GiB"
cost_per_hour: 0.5
- id: "pbs-medium"
provisioner: "pbs"
ttl: 24000
spec:
cpu: 16
mem: "32GiB"
cost_per_hour: 1.0
- id: "pbs-large"
provisioner: "pbs"
ttl: 28800
spec:
cpu: 32
mem: "128GiB"
cost_per_hour: 2.0
Align definitions with your existing queue structure. Create one definition per queue/tier combination:
definitions:
- id: "pbs-workq-standard"
provisioner: "pbs"
ttl: 24000
spec:
cpu: 8
mem: "16GiB"
queue: "workq"
- id: "pbs-high-priority"
provisioner: "pbs"
ttl: 24000
spec:
cpu: 8
mem: "16GiB"
queue: "high"
Choose IDs that clearly indicate the resource type:
# Good: Clear and descriptive
- id: "pbs-gpu-v100-32gb"
provisioner: "pbs"
ttl: 24000
spec:
cpu: 16
mem: "32GiB"
gpus: 2
queue: "gpu"
# Avoid: Unclear or too generic
- id: "def-1"
provisioner: "pbs"
ttl: 24000
spec:
cpu: 16
mem: "32GiB"
The TTL must not be set lower than 24000 (6 hours and 40 minutes) because Fuzzball sets a time limit of 6 hours for jobs to pull OCI containers. Setting it lower may cause jobs to become unschedulable. TTL should also not be set longer than the configured max walltime for a cluster.
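Queue walltime limits can be checked with qstat -Qf queue-name (see the troubleshooting notes below) before choosing a value.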
definitions:
# Short jobs (testing, quick analysis)
- id: "pbs-quick"
provisioner: "pbs"
ttl: 24000 # 6 hours 40 minutes
spec:
cpu: 4
mem: "8GiB"
# Medium jobs (standard processing)
- id: "pbs-standard"
provisioner: "pbs"
ttl: 144000 # 40 hours
spec:
cpu: 16
mem: "32GiB"
# Long jobs (simulations, training)
- id: "pbs-long"
provisioner: "pbs"
ttl: 864000 # 10 days
spec:
cpu: 32
mem: "128GiB"
Create test definitions in a development queue:
definitions:
- id: "pbs-test"
provisioner: "pbs"
ttl: 24000
spec:
cpu: 2
mem: "4GiB"
queue: "devel"
cost_per_hour: 0.0 # Free for testing
Submit test workflows to verify configuration before promoting to production.
After creating provisioner definitions, verify they’re available:
$ fuzzball admin provisioner list # List all provisioner definitions
$ fuzzball admin provisioner get pbs-compute-small # Show details of a specific definition
$ fuzzball workflow submit --file test-workflow.fz # Test with a simple workflow
Symptom: Workflow fails with “no matching provisioner definition”
Solutions:
- Verify definition exists: fuzzball admin provisioner list
- Check workflow resource requirements match definition specs
- Ensure definition has provisioner: "pbs"
- Verify access policies allow your user/organization
Symptom: Workflow submitted but jobs never start on PBS
Solutions:
- Check if queue exists: qstat -Q or qstat -q
- Check queue limits and availability
- Review PBS logs: /var/log/pbs/server_logs
- Verify user permissions with PBS administrator
Symptom: Jobs fail or get killed due to resource constraints
Solutions:
- Ensure definition specs match actual node capabilities.
- Verify memory units (GiB vs GB).
- Check CPU count matches PBS node configuration.
- Use catalog tool to discover actual resources.
- Check PBS node properties: pbsnodes -a.
Symptom: Jobs terminated with “walltime exceeded” message
Solutions:
- Increase TTL in provisioner definition.
- Check queue walltime limits: qstat -Qf queue-name.
- Ensure the job completes within the TTL minus the 60-second buffer.
- Consider splitting long-running jobs into smaller tasks.
For more troubleshooting guidance, see the Troubleshooting Guide.