Fuzzball Documentation

Provisioner Configuration

Provisioner definitions specify the compute resources available through Slurm or PBS and how workflows should request them. These definitions are created through Fuzzball’s provisioner configuration system and enable fine-grained control over resource allocation, access policies, and cost tracking.

Unlike AWS provisioning where instance types are pre-defined, Slurm and PBS provisioner definitions are flexible specifications that describe the resources available on your cluster. When a workflow is submitted, Fuzzball matches each job’s resource requirements against available provisioner definitions and submits resource requests directly to Slurm or PBS.

Instructions for both schedulers are provided below, Slurm first and then PBS; refer to the section that matches your environment.

Basic Provisioner Definition

Here’s a simple provisioner definition for a Slurm cluster:

definitions:
  - id: "slurm-compute-small"
    provisioner: "slurm"
    ttl: 24000
    exclusive: "job"
    spec:
      cpu: 4
      mem: "8GiB"
      cost_per_hour: 0.5

This definition specifies a compute resource with 4 CPU cores and 8 GiB of memory, tracked at $0.50 per hour for billing purposes. The ttl (time-to-live) of 24000 seconds (6 hours 40 minutes) sets a maximum runtime for jobs, and exclusive: "job" ensures job-level resource isolation.

Do not set the TTL to less than 24000 seconds. Fuzzball has hardcoded time limits for the jobs it creates to pull OCI containers, and a TTL below 24000 seconds can make it impossible to run jobs.

Configuration Parameters

Required Fields

Field       | Type    | Description                                        | Example
id          | string  | Unique identifier for this provisioner definition  | "slurm-compute-small"
provisioner | string  | Must be "slurm" for Slurm resources                | "slurm"
spec.cpu    | integer | Number of CPU cores                                | 4
spec.mem    | string  | Memory size with unit                              | "8GiB"
ttl         | integer | Time-to-live in seconds (maximum job runtime)      | 24000

Optional Fields

Field              | Type   | Description                                      | Example
exclusive          | string | Resource exclusivity level: "job" or "workflow"  | "job"
spec.cost_per_hour | float  | Cost per hour for billing/tracking               | 0.5
spec.partition     | string | Slurm partition to submit jobs to                | "compute"
spec.account       | string | Slurm account to charge                          | "research-group"
spec.qos           | string | Quality of Service level                         | "normal"
spec.constraint    | string | Node feature constraints                         | "gpu"
spec.exclude       | string | Nodes to exclude                                 | "node[01-05]"
spec.prefer        | string | Preferred nodes                                  | "node[10-20]"
spec.reservation   | string | Reservation to use                               | "maintenance"
spec.comment       | string | Job comment                                      | "Workflow execution"

Memory Units

Memory sizes must include a unit suffix:

  • "1GiB" - Gibibytes (1024³ bytes)
  • "1GB" - Gigabytes (1000³ bytes)
  • "1MiB" - Mebibytes (1024² bytes)
  • "1MB" - Megabytes (1000² bytes)

Time-to-Live (TTL)

The ttl parameter sets a maximum runtime for jobs using this provisioner definition:

definitions:
  - id: "slurm-short"
    provisioner: "slurm"
    ttl: 24000  # 6 hour 40 minute maximum
    spec:
      cpu: 4
      mem: "8GiB"

  - id: "slurm-long"
    provisioner: "slurm"
    ttl: 86400  # 24 hours maximum
    spec:
      cpu: 8
      mem: "16GiB"

Behavior:

  • Jobs exceeding the TTL are automatically terminated
  • TTL is specified in seconds
  • A 60-second buffer is added for graceful shutdown
  • If not specified, jobs have no time limit (subject to Slurm configuration)
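
For example, ttl: 24000 corresponds to a maximum runtime of 6 hours and 40 minutes (24000 ÷ 3600 = 6 hours, with the remaining 2400 seconds equal to 40 minutes).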

Use Cases:

  • Prevent runaway jobs from consuming resources indefinitely
  • Match Slurm partition time limits
  • Enforce different time limits for different resource tiers
  • Support cost management by limiting job duration
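
For instance, to line a definition up with a partition's time limit, set the TTL to at most that limit in seconds. The sketch below assumes a partition named "short" with a 12-hour MaxTime; the id and resource numbers are illustrative:

definitions:
  # "short" partition enforces MaxTime=12:00:00, i.e. 43200 seconds
  - id: "slurm-short-partition"
    provisioner: "slurm"
    ttl: 43200  # stay within the partition's 12-hour limit
    spec:
      cpu: 8
      mem: "16GiB"
      partition: "short"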

Resource Exclusivity

The exclusive parameter controls resource isolation:

definitions:
  - id: "slurm-exclusive-job"
    provisioner: "slurm"
    exclusive: "job"  # Resources exclusive to individual jobs
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"

  - id: "slurm-exclusive-workflow"
    provisioner: "slurm"
    exclusive: "workflow"  # Resources exclusive to entire workflow
    ttl: 24000
    spec:
      cpu: 16
      mem: "32GiB"

Options:

  • "job": Each job gets exclusive access to its allocated resources
  • "workflow": All jobs in a workflow share exclusive access to resources
  • Not specified: Resources may be shared with other Slurm jobs

Use Cases:

  • Job-level exclusivity: Ideal for performance-sensitive workloads requiring consistent resources
  • Workflow-level exclusivity: Useful for multi-job workflows that need dedicated nodes
  • Shared resources: Default behavior for cost-efficient resource utilization

Multiple Provisioner Definitions

You can create multiple provisioner definitions to offer different resource tiers:

definitions:
  # Small compute nodes
  - id: "slurm-compute-small"
    provisioner: "slurm"
    ttl: 24000
    exclusive: "job"
    spec:
      cpu: 4
      mem: "8GiB"
      partition: "compute"
      cost_per_hour: 0.5

  # Large compute nodes
  - id: "slurm-compute-large"
    provisioner: "slurm"
    ttl: 28800  # 8 hours
    exclusive: "job"
    spec:
      cpu: 32
      mem: "128GiB"
      partition: "compute"
      cost_per_hour: 2.0

  # GPU nodes
  - id: "slurm-gpu"
    provisioner: "slurm"
    ttl: 43200  # 12 hours
    exclusive: "workflow"
    spec:
      cpu: 16
      mem: "64GiB"
      partition: "gpu"
      constraint: "gpu"
      cost_per_hour: 5.0

Partition-Based Configuration

Organize definitions by Slurm partitions for different hardware or user groups:

definitions:
  # General compute partition
  - id: "slurm-general"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"
      partition: "general"
      qos: "normal"
      cost_per_hour: 0.75

  # High-priority partition
  - id: "slurm-priority"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"
      partition: "priority"
      qos: "high"
      cost_per_hour: 1.5

  # Development/testing partition
  - id: "slurm-devel"
    provisioner: "slurm"
    spec:
      cpu: 2
      mem: "4GiB"
      partition: "devel"
      qos: "testing"
      cost_per_hour: 0.1

Account-Based Configuration

Map provisioner definitions to different Slurm accounts for charge-back:

definitions:
  # Engineering department
  - id: "slurm-engineering"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 16
      mem: "32GiB"
      account: "engineering"
      partition: "compute"
      cost_per_hour: 1.0

  # Research department
  - id: "slurm-research"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 16
      mem: "32GiB"
      account: "research"
      partition: "compute"
      cost_per_hour: 1.0

  # Marketing department
  - id: "slurm-marketing"
    provisioner: "slurm"
    spec:
      cpu: 4
      mem: "8GiB"
      account: "marketing"
      partition: "shared"
      cost_per_hour: 0.5

Hardware-Specific Configuration

Use Slurm constraints to target specific hardware features:

definitions:
  # Haswell nodes
  - id: "slurm-haswell"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 20
      mem: "64GiB"
      constraint: "haswell"
      partition: "compute"
      cost_per_hour: 0.8

  # Broadwell nodes
  - id: "slurm-broadwell"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 28
      mem: "128GiB"
      constraint: "broadwell"
      partition: "compute"
      cost_per_hour: 1.2

  # AMD EPYC nodes
  - id: "slurm-epyc"
    provisioner: "slurm"
    spec:
      cpu: 64
      mem: "256GiB"
      constraint: "epyc"
      partition: "compute"
      cost_per_hour: 2.5

  # GPU nodes with V100
  - id: "slurm-v100"
    provisioner: "slurm"
    spec:
      cpu: 16
      mem: "96GiB"
      constraint: "v100"
      partition: "gpu"
      cost_per_hour: 8.0

  # GPU nodes with A100
  - id: "slurm-a100"
    provisioner: "slurm"
    spec:
      cpu: 32
      mem: "256GiB"
      constraint: "a100"
      partition: "gpu"
      cost_per_hour: 15.0

Node Exclusion and Preference

Control job placement with exclude and prefer directives:

definitions:
  # Exclude maintenance nodes
  - id: "slurm-stable"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"
      partition: "compute"
      exclude: "node[01-05]"  # Exclude nodes undergoing maintenance
      cost_per_hour: 0.75

  # Prefer newer hardware
  - id: "slurm-preferred"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 16
      mem: "32GiB"
      partition: "compute"
      prefer: "node[100-200]"  # Prefer recently upgraded nodes
      cost_per_hour: 1.0

Creating Provisioner Definitions

Using the Provisioner Configuration API

The recommended method is to use Fuzzball’s provisioner configuration system. Create a YAML file with your definitions and apply it through the admin API or CLI.

Example configuration file (slurm-provisioners.yaml):

apiVersion: fuzzball.io/v1alpha1
kind: CentralConfig
metadata:
  name: slurm-provisioners
spec:
  provisioner:
    definitions:
      - id: "slurm-compute-small"
        provisioner: "slurm"
        ttl: 24000
        spec:
          cpu: 4
          mem: "8GiB"
          partition: "compute"
          cost_per_hour: 0.5

      - id: "slurm-compute-large"
        provisioner: "slurm"
        ttl: 24000
        spec:
          cpu: 32
          mem: "128GiB"
          partition: "compute"
          cost_per_hour: 2.0

Apply using the Fuzzball CLI:

$ fuzzball admin provisioner create --config slurm-provisioners.yaml

Using the Catalog Tool (Optional)

Fuzzball provides a catalog tool that can automatically discover resources from your Slurm cluster and generate provisioner definitions. This is useful for initial setup but not required for ongoing operation.

Unlike AWS provisioning where instance types must be cataloged, Slurm resources are discovered dynamically at runtime through the Substrate. The catalog tool is provided as a convenience for generating initial definitions based on actual cluster capabilities.

Building the Cataloger Tool

Build the cataloger using the fuzzy build system:

$ fuzzy build binary cataloger

Discovering Resources

Generate definitions for a specific partition:

$ ./build/cataloger provisioner cataloger slurm definition generate \
    --catalog-path /tmp/slurm-catalog \
    --id "slurm-gpu-discovery" \
    --ssh-host "slurm-head.example.com" \
    --username "fuzzball-service" \
    --private-key-path "/etc/fuzzball/slurm-key" \
    --public-key "$(cat /etc/fuzzball/slurm-key.pub)" \
    --partition "gpu"

The catalog tool will:

  1. Connect to your Slurm cluster via SSH
  2. Submit a test job to provision the Substrate
  3. Query the Substrate for available resources
  4. Generate a provisioner definition with discovered CPU, memory, and hardware specs
  5. Save the definition to the specified catalog path
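
For illustration only, a generated definition might resemble the following; the CPU and memory values here are assumptions rather than actual tool output, since the real values depend on what the cataloger discovers on your cluster:

definitions:
  - id: "slurm-gpu-discovery"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 16
      mem: "96GiB"
      partition: "gpu"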

Workflow Resource Matching

When a workflow is submitted, Fuzzball matches workflow resource requirements against provisioner definitions:

Exact Match

Workflow requesting exactly what a definition offers:

Workflow (workflow.yaml):

version: v1
jobs:
  my-job:
    name: "compute-job"
    image:
      uri: "docker://alpine:latest"
    command: ["echo", "Hello"]
    resource:
      cpu:
        cores: 4
      memory:
        size: "8GiB"

Provisioner Definition:

definitions:
  - id: "slurm-compute-small"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 4
      mem: "8GiB"

This workflow will match the slurm-compute-small definition.

Subset Match

Workflow requesting less than what a definition offers:

Workflow:

resource:
  cpu:
    cores: 2
  memory:
    size: "4GiB"

Provisioner Definition:

definitions:
  - id: "slurm-compute-small"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 4
      mem: "8GiB"

This workflow can use the slurm-compute-small definition, consuming only 2 cores and 4 GiB.

No Match

Workflow requesting more than any definition offers:

Workflow:

resource:
  cpu:
    cores: 64
  memory:
    size: "512GiB"

If no provisioner definition offers 64 cores and 512 GiB, the workflow will fail with a “no matching provisioner definition” error.

Solution: Create a provisioner definition that meets or exceeds the requirements:

definitions:
  - id: "slurm-compute-xlarge"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 64
      mem: "512GiB"
      partition: "himem"
      cost_per_hour: 5.0

Access Control

Use access policies to control which users or organizations can use specific provisioner definitions. This is configured through the central configuration system’s policy expressions.

Example: Restrict GPU nodes to specific organizations:

definitions:
  - id: "slurm-gpu"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 16
      mem: "64GiB"
      partition: "gpu"
      constraint: "gpu"
      cost_per_hour: 5.0
    access_policy: |
      req.organization.id in ["org-research", "org-engineering"]

See the Provisioner Administration documentation for details on access policies and expression syntax.

Cost Tracking

The cost_per_hour field enables billing and cost tracking:

definitions:
  - id: "slurm-compute"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"
      cost_per_hour: 0.75

When workflows run using this definition:

  • Usage is tracked per hour of resource allocation
  • Costs are calculated as: cost_per_hour × hours_used
  • Billing reports aggregate costs by organization, account, or user
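
For example, a job that holds resources under this definition for 4 hours is recorded as 0.75 × 4 = $3.00.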

Set cost_per_hour to reflect your internal charge-back rates or actual operational costs.

Best Practices

1. Start with Standard Tiers

Create a few standard definitions covering common use cases:

definitions:
  - id: "slurm-small"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 4
      mem: "8GiB"
      cost_per_hour: 0.5

  - id: "slurm-medium"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 16
      mem: "32GiB"
      cost_per_hour: 1.0

  - id: "slurm-large"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 32
      mem: "128GiB"
      cost_per_hour: 2.0

2. Map to Existing Slurm Structure

Align definitions with your existing partition and account structure. Create one definition per partition/tier combination:

definitions:
  - id: "slurm-compute-standard"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"
      partition: "compute"
      qos: "normal"

  - id: "slurm-compute-priority"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"
      partition: "compute"
      qos: "high"

3. Use Descriptive IDs

Choose IDs that clearly indicate the resource type:

# Good: Clear and descriptive
- id: "slurm-gpu-v100-32gb"
  provisioner: "slurm"
  ttl: 24000
  spec:
    cpu: 16
    mem: "32GiB"
    constraint: "v100"

# Avoid: Unclear or too generic
- id: "def-1"
  provisioner: "slurm"
  ttl: 24000
  spec:
    cpu: 16
    mem: "32GiB"

4. Document Custom Configurations

Add comments to explain partition choices, constraints, or costs:

definitions:
  # GPU partition with V100 GPUs for deep learning workloads
  # Cost reflects amortized hardware and power costs
  - id: "slurm-gpu-v100"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 16
      mem: "96GiB"
      partition: "gpu"
      constraint: "v100"
      cost_per_hour: 8.0
      comment: "V100 GPU nodes for deep learning"

5. Test Definitions Before Production

Create test definitions in a development partition:

definitions:
  - id: "slurm-test"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 2
      mem: "4GiB"
      partition: "devel"
      qos: "testing"
      cost_per_hour: 0.0  # Free for testing

Submit test workflows to verify configuration before promoting to production.
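
A minimal test workflow, using the workflow schema shown earlier on this page and sized to fit the slurm-test definition above, might look like the following; the job name, image, and command are placeholders:

version: v1
jobs:
  test-job:
    image:
      uri: "docker://alpine:latest"
    command: ["echo", "provisioner test"]
    resource:
      cpu:
        cores: 2
      memory:
        size: "4GiB"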

Validation

After creating provisioner definitions, verify they’re available:

# List all provisioner definitions
$ fuzzball admin provisioner list

# Show details of a specific definition
$ fuzzball admin provisioner get slurm-compute-small

# Test with a simple workflow
$ fuzzball workflow submit --file test-workflow.yaml

Troubleshooting

Definition Not Found

Symptom: Workflow fails with “no matching provisioner definition”

Solutions:

  1. Verify definition exists: fuzzball admin provisioner list
  2. Check workflow resource requirements match definition specs
  3. Ensure definition has provisioner: "slurm"
  4. Verify access policies allow your user/organization

Jobs Not Starting

Symptom: Workflow submitted but jobs never start on Slurm

Solutions:

  1. Check if partition exists: sinfo -p compute
  2. Verify account has access: sacctmgr show assoc where account=research-group
  3. Check QoS is valid: sacctmgr show qos
  4. Review Slurm logs: /var/log/slurm/slurmctld.log

Resource Mismatches

Symptom: Jobs fail or get killed due to resource constraints

Solutions:

  1. Ensure definition specs match actual node capabilities
  2. Verify memory units (GiB vs GB)
  3. Check CPU count matches Slurm node configuration
  4. Use catalog tool to discover actual resources

For more troubleshooting guidance, see the Troubleshooting Guide.

Basic Provisioner Definition

Here’s a simple provisioner definition for a PBS cluster:

definitions:
  - id: "pbs-compute-small"
    provisioner: "pbs"
    ttl: 24000
    exclusive: "job"
    spec:
      cpu: 4
      mem: "8GiB"
      cost_per_hour: 0.5

This definition specifies a compute resource with 4 CPU cores and 8 GiB of memory, tracked at $0.50 per hour for billing purposes. The ttl (time-to-live) of 24000 seconds (6 hours 40 minutes) sets a maximum runtime for jobs and is automatically converted to PBS walltime. Setting exclusive: "job" ensures job-level resource isolation by automatically adding the place=free:excl placement directive.

Do not set the TTL to less than 24000 seconds. Fuzzball has hardcoded time limits for the jobs it creates to pull OCI containers, and a TTL below 24000 seconds can make it impossible to run jobs.

Configuration Parameters

Required Fields

Field       | Type    | Description                                        | Example
id          | string  | Unique identifier for this provisioner definition  | "pbs-compute-small"
provisioner | string  | Must be "pbs" for PBS resources                    | "pbs"
spec.cpu    | integer | Number of CPU cores                                | 4
spec.mem    | string  | Memory size with unit                              | "8GiB"
ttl         | integer | Time-to-live in seconds (maximum job runtime)      | 24000

Optional Fields

Field              | Type    | Description                                      | Example
exclusive          | string  | Resource exclusivity level: "job" or "workflow"  | "job"
spec.cost_per_hour | float   | Cost per hour for billing/tracking               | 0.5
spec.queue         | string  | PBS queue to submit jobs to                      | "workq"
spec.gpus          | integer | Number of GPUs requested                         | 1

Memory Units

Memory sizes must include a unit suffix:

  • "1GiB" - Gibibytes (1024³ bytes)
  • "1GB" - Gigabytes (1000³ bytes)
  • "1MiB" - Mebibytes (1024² bytes)
  • "1MB" - Megabytes (1000² bytes)

Time-to-Live (TTL) and Walltime

The ttl parameter sets a maximum runtime for jobs and is automatically converted to PBS walltime:

definitions:
  - id: "pbs-short"
    provisioner: "pbs"
    ttl: 24000  # 6 hour 40 minute maximum
    spec:
      cpu: 4
      mem: "8GiB"

  - id: "pbs-long"
    provisioner: "pbs"
    ttl: 86400  # 24 hours maximum
    spec:
      cpu: 8
      mem: "16GiB"

Behavior:

  • Jobs exceeding the TTL are automatically terminated
  • TTL is specified in seconds and converted to PBS walltime format
  • A 60-second buffer is added for graceful shutdown
  • The PBS provisioner automatically handles walltime conversion
  • If not specified, jobs have no time limit (subject to PBS queue configuration)
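
For example, ttl: 24000 corresponds to 6 hours 40 minutes (24000 ÷ 3600 = 6 hours, with the remaining 2400 seconds equal to 40 minutes), which in HH:MM:SS walltime notation is 06:40:00.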

Use Cases:

  • Prevent runaway jobs from consuming resources indefinitely
  • Match PBS queue time limits
  • Enforce different time limits for different resource tiers
  • Support cost management by limiting job duration

Resource Exclusivity

The exclusive parameter controls resource isolation and automatically adds PBS placement directives:

definitions:
  - id: "pbs-exclusive-job"
    provisioner: "pbs"
    ttl: 24000
    exclusive: "job"  # Resources exclusive to individual jobs
    spec:
      cpu: 8
      mem: "16GiB"

  - id: "pbs-exclusive-workflow"
    provisioner: "pbs"
    ttl: 24000
    exclusive: "workflow"  # Resources exclusive to entire workflow
    spec:
      cpu: 16
      mem: "32GiB"

Options:

  • "job": Each job gets exclusive access to its allocated resources (adds #PBS -l place=free:excl)
  • "workflow": All jobs in a workflow share exclusive access to resources
  • Not specified: Resources may be shared with other PBS jobs

Important: The PBS provisioner automatically adds the place=free:excl directive for exclusive job placement, ensuring nodes are not shared.

Use Cases:

  • Job-level exclusivity: Ideal for performance-sensitive workloads requiring consistent resources
  • Workflow-level exclusivity: Useful for multi-job workflows that need dedicated nodes
  • Shared resources: Default behavior for cost-efficient resource utilization

GPU Configuration

For GPU-enabled compute nodes, specify the number of GPUs:

definitions:
  - id: "pbs-gpu"
    provisioner: "pbs"
    ttl: 43200  # 12 hours
    exclusive: "job"
    spec:
      cpu: 16
      mem: "64GiB"
      gpus: 2  # Request 2 GPUs
      queue: "gpu"
      cost_per_hour: 8.0

The GPU specification is translated to the appropriate PBS resource request format based on your PBS implementation (Torque or OpenPBS).

Multiple Provisioner Definitions

You can create multiple provisioner definitions to offer different resource tiers:

definitions:
  # Small compute nodes
  - id: "pbs-compute-small"
    provisioner: "pbs"
    ttl: 24000
    exclusive: "job"
    spec:
      cpu: 4
      mem: "8GiB"
      queue: "workq"
      cost_per_hour: 0.5

  # Large compute nodes
  - id: "pbs-compute-large"
    provisioner: "pbs"
    ttl: 28800  # 8 hours
    exclusive: "job"
    spec:
      cpu: 32
      mem: "128GiB"
      queue: "workq"
      cost_per_hour: 2.0

  # GPU nodes
  - id: "pbs-gpu"
    provisioner: "pbs"
    ttl: 43200  # 12 hours
    exclusive: "workflow"
    spec:
      cpu: 16
      mem: "64GiB"
      gpus: 2
      queue: "gpu"
      cost_per_hour: 8.0

Queue-Based Configuration

Organize definitions by PBS queues for different hardware or user groups:

definitions:
  # General compute queue
  - id: "pbs-general"
    provisioner: "pbs"
    ttl: 24000  # 6 hours 40 minutes
    spec:
      cpu: 8
      mem: "16GiB"
      queue: "workq"
      cost_per_hour: 0.75

  # High-priority queue
  - id: "pbs-priority"
    provisioner: "pbs"
    ttl: 28800  # 8 hours
    spec:
      cpu: 8
      mem: "16GiB"
      queue: "high"
      cost_per_hour: 1.5

  # Development/testing queue
  - id: "pbs-devel"
    provisioner: "pbs"
    ttl: 24000  # 6 hours 40 minutes
    spec:
      cpu: 2
      mem: "4GiB"
      queue: "devel"
      cost_per_hour: 0.1

Hardware-Specific Configuration

Configure definitions for specific hardware capabilities:

definitions:
  # Standard CPU nodes
  - id: "pbs-cpu-standard"
    provisioner: "pbs"
    ttl: 24000
    spec:
      cpu: 20
      mem: "64GiB"
      queue: "workq"
      cost_per_hour: 0.8

  # High-memory nodes
  - id: "pbs-himem"
    provisioner: "pbs"
    ttl: 28800
    spec:
      cpu: 28
      mem: "256GiB"
      queue: "himem"
      cost_per_hour: 2.5

  # GPU nodes with V100
  - id: "pbs-v100"
    provisioner: "pbs"
    ttl: 43200
    spec:
      cpu: 16
      mem: "96GiB"
      gpus: 2
      queue: "gpu"
      cost_per_hour: 8.0

  # GPU nodes with A100
  - id: "pbs-a100"
    provisioner: "pbs"
    ttl: 86400
    spec:
      cpu: 32
      mem: "256GiB"
      gpus: 4
      queue: "gpu-a100"
      cost_per_hour: 15.0

Creating Provisioner Definitions

Using the Provisioner Configuration API

The recommended method is to use Fuzzball’s provisioner configuration system. Create a YAML file with your definitions and apply it through the admin API or CLI.

Example configuration file (pbs-provisioners.yaml):

apiVersion: fuzzball.io/v1alpha1
kind: CentralConfig
metadata:
  name: pbs-provisioners
spec:
  provisioner:
    definitions:
      - id: "pbs-compute-small"
        provisioner: "pbs"
        ttl: 24000
        exclusive: "job"
        spec:
          cpu: 4
          mem: "8GiB"
          queue: "workq"
          cost_per_hour: 0.5

      - id: "pbs-compute-large"
        provisioner: "pbs"
        ttl: 28800
        exclusive: "job"
        spec:
          cpu: 32
          mem: "128GiB"
          queue: "workq"
          cost_per_hour: 2.0

Apply using the Fuzzball CLI:

$ fuzzball admin provisioner create --config pbs-provisioners.yaml

Workflow Resource Matching

When a workflow is submitted, Fuzzball matches workflow resource requirements against provisioner definitions:

Exact Match

Workflow requesting exactly what a definition offers:

Workflow (workflow.fz):

version: v1
jobs:
  my-job:
    image:
      uri: "docker://alpine:latest"
    command: ["echo", "Hello"]
    resource:
      cpu:
        cores: 4
      memory:
        size: "8GiB"

Provisioner Definition:

definitions:
  - id: "pbs-compute-small"
    provisioner: "pbs"
    ttl: 24000
    spec:
      cpu: 4
      mem: "8GiB"

This workflow will match the pbs-compute-small definition.

Subset Match

Workflow requesting less than what a definition offers:

Workflow:

resource:
  cpu:
    cores: 2
  memory:
    size: "4GiB"

Provisioner Definition:

definitions:
  - id: "pbs-compute-small"
    provisioner: "pbs"
    ttl: 24000
    spec:
      cpu: 4
      mem: "8GiB"

This workflow can use the pbs-compute-small definition, consuming only 2 cores and 4 GiB.

No Match

Workflow requesting more than any definition offers:

Workflow:

resource:
  cpu:
    cores: 64
  memory:
    size: "512GiB"

If no provisioner definition offers 64 cores and 512 GiB, the workflow will fail with a “no matching provisioner definition” error.

Solution: Create a provisioner definition that meets or exceeds the requirements:

definitions:
  - id: "pbs-compute-xlarge"
    provisioner: "pbs"
    ttl: 86400
    spec:
      cpu: 64
      mem: "512GiB"
      queue: "himem"
      cost_per_hour: 5.0

Access Control

Use access policies to control which users or organizations can use specific provisioner definitions. This is configured through the central configuration system’s policy expressions.

Example: Restrict GPU nodes to specific organizations:

definitions:
  - id: "pbs-gpu"
    provisioner: "pbs"
    ttl: 43200
    spec:
      cpu: 16
      mem: "64GiB"
      gpus: 2
      queue: "gpu"
      cost_per_hour: 8.0
    access_policy: |
      req.organization.id in ["org-research", "org-engineering"]

See the Provisioner Administration documentation for details on access policies and expression syntax.

Cost Tracking

The cost_per_hour field enables billing and cost tracking:

definitions:
  - id: "pbs-compute"
    provisioner: "pbs"
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"
      cost_per_hour: 0.75

When workflows run using this definition:

  • Usage is tracked per hour of resource allocation
  • Costs are calculated as: cost_per_hour × hours_used
  • Billing reports aggregate costs by organization, account, or user
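
For example, a workflow that holds resources under this definition for 8 hours accrues 0.75 × 8 = $6.00.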

Set cost_per_hour to reflect your internal charge-back rates or actual operational costs.

Best Practices

1. Start with Standard Tiers

Create a few standard definitions covering common use cases:

definitions:
  - id: "pbs-small"
    provisioner: "pbs"
    ttl: 24000
    spec:
      cpu: 4
      mem: "8GiB"
      cost_per_hour: 0.5

  - id: "pbs-medium"
    provisioner: "pbs"
    ttl: 24000
    spec:
      cpu: 16
      mem: "32GiB"
      cost_per_hour: 1.0

  - id: "pbs-large"
    provisioner: "pbs"
    ttl: 28800
    spec:
      cpu: 32
      mem: "128GiB"
      cost_per_hour: 2.0

2. Map to Existing PBS Structure

Align definitions with your existing queue structure. Create one definition per queue/tier combination:

definitions:
  - id: "pbs-workq-standard"
    provisioner: "pbs"
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"
      queue: "workq"

  - id: "pbs-high-priority"
    provisioner: "pbs"
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"
      queue: "high"

3. Use Descriptive IDs

Choose IDs that clearly indicate the resource type:

# Good: Clear and descriptive
- id: "pbs-gpu-v100-32gb"
  provisioner: "pbs"
  ttl: 24000
  spec:
    cpu: 16
    mem: "32GiB"
    gpus: 2
    queue: "gpu"

# Avoid: Unclear or too generic
- id: "def-1"
  provisioner: "pbs"
  ttl: 24000
  spec:
    cpu: 16
    mem: "32GiB"

4. Set Appropriate TTL Values

The TTL must not be set lower than 24000 seconds (6 hours and 40 minutes) because Fuzzball sets a time limit of 6 hours for the jobs it creates to pull OCI containers; a lower TTL may cause jobs to become unschedulable. The TTL should also not exceed the maximum walltime configured for your cluster.

definitions:
  # Short jobs (testing, quick analysis)
  - id: "pbs-quick"
    provisioner: "pbs"
    ttl: 24000  # 6 hours 40 minutes
    spec:
      cpu: 4
      mem: "8GiB"

  # Medium jobs (standard processing)
  - id: "pbs-standard"
    provisioner: "pbs"
    ttl: 144000  # 40 hours
    spec:
      cpu: 16
      mem: "32GiB"

  # Long jobs (simulations, training)
  - id: "pbs-long"
    provisioner: "pbs"
    ttl: 864000  # 10 days
    spec:
      cpu: 32
      mem: "128GiB"

5. Test Definitions Before Production

Create test definitions in a development queue:

definitions:
  - id: "pbs-test"
    provisioner: "pbs"
    ttl: 24000
    spec:
      cpu: 2
      mem: "4GiB"
      queue: "devel"
      cost_per_hour: 0.0  # Free for testing

Submit test workflows to verify configuration before promoting to production.
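
A minimal test workflow, using the workflow schema shown earlier on this page and sized to fit the pbs-test definition above, might look like the following; the job name, image, and command are placeholders:

version: v1
jobs:
  test-job:
    image:
      uri: "docker://alpine:latest"
    command: ["echo", "pbs provisioner test"]
    resource:
      cpu:
        cores: 2
      memory:
        size: "4GiB"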

Validation

After creating provisioner definitions, verify they’re available:

# List all provisioner definitions
$ fuzzball admin provisioner list

# Show details of a specific definition
$ fuzzball admin provisioner get pbs-compute-small

# Test with a simple workflow
$ fuzzball workflow submit --file test-workflow.fz

Troubleshooting

Definition Not Found

Symptom: Workflow fails with “no matching provisioner definition”

Solutions:

  1. Verify definition exists: fuzzball admin provisioner list
  2. Check workflow resource requirements match definition specs
  3. Ensure definition has provisioner: "pbs"
  4. Verify access policies allow your user/organization

Jobs Not Starting

Symptom: Workflow submitted but jobs never start on PBS

Solutions:

  1. Check if queue exists: qstat -Q or qstat -q
  2. Check queue limits and availability
  3. Review PBS logs: /var/log/pbs/server_logs
  4. Verify user permissions with PBS administrator

Resource Mismatches

Symptom: Jobs fail or get killed due to resource constraints

Solutions:

  1. Ensure definition specs match actual node capabilities.
  2. Verify memory units (GiB vs GB).
  3. Check CPU count matches PBS node configuration.
  4. Use catalog tool to discover actual resources.
  5. Check PBS node properties: pbsnodes -a.

Walltime Exceeded

Symptom: Jobs terminated with “walltime exceeded” message

Solutions:

  1. Increase TTL in provisioner definition.
  2. Check queue walltime limits: qstat -Qf queue-name.
  3. Ensure job completes within TTL minus 60-second buffer.
  4. Consider splitting long-running jobs into smaller tasks.

For more troubleshooting guidance, see the Troubleshooting Guide.