Fuzzball Documentation

Provisioner Configuration

Provisioner definitions specify the compute resources available through Slurm or PBS and how workflows should request them. These definitions are created through Fuzzball’s provisioner configuration system and enable fine-grained control over resource allocation, access policies, and cost tracking.

Unlike AWS provisioning where instance types are pre-defined, Slurm and PBS provisioner definitions are flexible specifications that describe the resources available on your cluster. When a workflow is submitted, Fuzzball matches each job’s resource requirements against available provisioner definitions and submits resource requests directly to Slurm or PBS.

Instructions for both schedulers are provided below, Slurm first and then PBS; refer to the section that matches your environment.

Basic Provisioner Definition

Here’s a simple provisioner definition for a Slurm cluster:

definitions:
  - id: "slurm-compute-small"
    provisioner: "slurm"
    ttl: 24000
    exclusive: "job"
    spec:
      cpu: 4
      mem: "8GiB"
      cost_per_hour: 0.5

This definition specifies a compute resource with 4 CPU cores and 8 GiB of memory, tracked at $0.50 per hour for billing purposes. The ttl (time-to-live) of 24000 seconds (6 hours 40 minutes) sets a maximum runtime for jobs, and exclusive: "job" ensures job-level resource isolation.

Do not set the TTL to less than 24000 seconds. Fuzzball has hardcoded time limits for the jobs it creates to pull OCI containers, and a TTL below 24000 seconds can make it impossible to run jobs.

Configuration Parameters

Required Fields

Field       | Type    | Description                                        | Example
id          | string  | Unique identifier for this provisioner definition  | "slurm-compute-small"
provisioner | string  | Must be "slurm" for Slurm resources                | "slurm"
spec.cpu    | integer | Number of CPU cores                                | 4
spec.mem    | string  | Memory size with unit                              | "8GiB"
ttl         | integer | Time-to-live in seconds (maximum job runtime)      | 24000

Optional Fields

Field              | Type   | Description                                      | Example
exclusive          | string | Resource exclusivity level: "job" or "workflow"  | "job"
spec.cost_per_hour | float  | Cost per hour for billing/tracking               | 0.5
spec.partition     | string | Slurm partition to submit jobs to                | "compute"
spec.account       | string | Slurm account to charge                          | "research-group"
spec.qos           | string | Quality of Service level                         | "normal"
spec.constraint    | string | Node feature constraints                         | "gpu"
spec.exclude       | string | Nodes to exclude                                 | "node[01-05]"
spec.prefer        | string | Preferred nodes                                  | "node[10-20]"
spec.reservation   | string | Reservation to use                               | "maintenance"
spec.comment       | string | Job comment                                      | "Workflow execution"

Memory Units

Memory sizes must include a unit suffix:

  • "1GiB" - Gibibytes (1024³ bytes)
  • "1GB" - Gigabytes (1000³ bytes)
  • "1MiB" - Mebibytes (1024² bytes)
  • "1MB" - Megabytes (1000² bytes)

Time-to-Live (TTL)

The ttl parameter sets a maximum runtime for jobs using this provisioner definition:

definitions:
  - id: "slurm-short"
    provisioner: "slurm"
    ttl: 24000  # 6 hour 40 minute maximum
    spec:
      cpu: 4
      mem: "8GiB"

  - id: "slurm-long"
    provisioner: "slurm"
    ttl: 86400  # 24 hours maximum
    spec:
      cpu: 8
      mem: "16GiB"

Behavior:

  • Jobs exceeding the TTL are automatically terminated
  • TTL is specified in seconds
  • A 60-second buffer is added for graceful shutdown
  • If not specified, jobs have no time limit (subject to Slurm configuration)
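
For example, ttl: 24000 corresponds to a maximum runtime of 6 hours and 40 minutes (24000 ÷ 3600 = 6 hours, with the remaining 2400 seconds equal to 40 minutes).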

Use Cases:

  • Prevent runaway jobs from consuming resources indefinitely
  • Match Slurm partition time limits
  • Enforce different time limits for different resource tiers
  • Support cost management by limiting job duration
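
For instance, to line a definition up with a partition's time limit, set the TTL to at most that limit in seconds. The sketch below assumes a partition named "short" with a 12-hour MaxTime; the id and resource numbers are illustrative:

definitions:
  # "short" partition enforces MaxTime=12:00:00, i.e. 43200 seconds
  - id: "slurm-short-partition"
    provisioner: "slurm"
    ttl: 43200  # stay within the partition's 12-hour limit
    spec:
      cpu: 8
      mem: "16GiB"
      partition: "short"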

Resource Exclusivity

The exclusive parameter controls resource isolation:

definitions:
  - id: "slurm-exclusive-job"
    provisioner: "slurm"
    exclusive: "job"  # Resources exclusive to individual jobs
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"

  - id: "slurm-exclusive-workflow"
    provisioner: "slurm"
    exclusive: "workflow"  # Resources exclusive to entire workflow
    ttl: 24000
    spec:
      cpu: 16
      mem: "32GiB"

Options:

  • "job": Each job gets exclusive access to its allocated resources
  • "workflow": All jobs in a workflow share exclusive access to resources
  • Not specified: Resources may be shared with other Slurm jobs

Use Cases:

  • Job-level exclusivity: Ideal for performance-sensitive workloads requiring consistent resources
  • Workflow-level exclusivity: Useful for multi-job workflows that need dedicated nodes
  • Shared resources: Default behavior for cost-efficient resource utilization

Multiple Provisioner Definitions

You can create multiple provisioner definitions to offer different resource tiers:

definitions:
  # Small compute nodes
  - id: "slurm-compute-small"
    provisioner: "slurm"
    ttl: 24000
    exclusive: "job"
    spec:
      cpu: 4
      mem: "8GiB"
      partition: "compute"
      cost_per_hour: 0.5

  # Large compute nodes
  - id: "slurm-compute-large"
    provisioner: "slurm"
    ttl: 28800  # 8 hours
    exclusive: "job"
    spec:
      cpu: 32
      mem: "128GiB"
      partition: "compute"
      cost_per_hour: 2.0

  # GPU nodes
  - id: "slurm-gpu"
    provisioner: "slurm"
    ttl: 43200  # 12 hours
    exclusive: "workflow"
    spec:
      cpu: 16
      mem: "64GiB"
      partition: "gpu"
      constraint: "gpu"
      cost_per_hour: 5.0

Partition-Based Configuration

Organize definitions by Slurm partitions for different hardware or user groups:

definitions:
  # General compute partition
  - id: "slurm-general"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"
      partition: "general"
      qos: "normal"
      cost_per_hour: 0.75

  # High-priority partition
  - id: "slurm-priority"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"
      partition: "priority"
      qos: "high"
      cost_per_hour: 1.5

  # Development/testing partition
  - id: "slurm-devel"
    provisioner: "slurm"
    spec:
      cpu: 2
      mem: "4GiB"
      partition: "devel"
      qos: "testing"
      cost_per_hour: 0.1

Account-Based Configuration

Map provisioner definitions to different Slurm accounts for charge-back:

definitions:
  # Engineering department
  - id: "slurm-engineering"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 16
      mem: "32GiB"
      account: "engineering"
      partition: "compute"
      cost_per_hour: 1.0

  # Research department
  - id: "slurm-research"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 16
      mem: "32GiB"
      account: "research"
      partition: "compute"
      cost_per_hour: 1.0

  # Marketing department
  - id: "slurm-marketing"
    provisioner: "slurm"
    spec:
      cpu: 4
      mem: "8GiB"
      account: "marketing"
      partition: "shared"
      cost_per_hour: 0.5

Hardware-Specific Configuration

Use Slurm constraints to target specific hardware features:

definitions:
  # Haswell nodes
  - id: "slurm-haswell"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 20
      mem: "64GiB"
      constraint: "haswell"
      partition: "compute"
      cost_per_hour: 0.8

  # Broadwell nodes
  - id: "slurm-broadwell"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 28
      mem: "128GiB"
      constraint: "broadwell"
      partition: "compute"
      cost_per_hour: 1.2

  # AMD EPYC nodes
  - id: "slurm-epyc"
    provisioner: "slurm"
    spec:
      cpu: 64
      mem: "256GiB"
      constraint: "epyc"
      partition: "compute"
      cost_per_hour: 2.5

  # GPU nodes with V100
  - id: "slurm-v100"
    provisioner: "slurm"
    spec:
      cpu: 16
      mem: "96GiB"
      constraint: "v100"
      partition: "gpu"
      cost_per_hour: 8.0

  # GPU nodes with A100
  - id: "slurm-a100"
    provisioner: "slurm"
    spec:
      cpu: 32
      mem: "256GiB"
      constraint: "a100"
      partition: "gpu"
      cost_per_hour: 15.0

Node Exclusion and Preference

Control job placement with exclude and prefer directives:

definitions:
  # Exclude maintenance nodes
  - id: "slurm-stable"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"
      partition: "compute"
      exclude: "node[01-05]"  # Exclude nodes undergoing maintenance
      cost_per_hour: 0.75

  # Prefer newer hardware
  - id: "slurm-preferred"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 16
      mem: "32GiB"
      partition: "compute"
      prefer: "node[100-200]"  # Prefer recently upgraded nodes
      cost_per_hour: 1.0

Creating Provisioner Definitions

Using the Provisioner Configuration API

The recommended method is to use Fuzzball’s provisioner configuration system. Create a YAML file with your definitions and apply it through the admin API or CLI.

Example configuration file (slurm-provisioners.yaml):

apiVersion: fuzzball.io/v1alpha1
kind: CentralConfig
metadata:
  name: slurm-provisioners
spec:
  provisioner:
    definitions:
      - id: "slurm-compute-small"
        provisioner: "slurm"
        ttl: 24000
        spec:
          cpu: 4
          mem: "8GiB"
          partition: "compute"
          cost_per_hour: 0.5

      - id: "slurm-compute-large"
        provisioner: "slurm"
        ttl: 24000
        spec:
          cpu: 32
          mem: "128GiB"
          partition: "compute"
          cost_per_hour: 2.0

Apply using the Fuzzball CLI:

$ fuzzball admin provisioner create --config slurm-provisioners.yaml

Using the Catalog Tool (Optional)

Fuzzball provides a catalog tool that can automatically discover resources from your Slurm cluster and generate provisioner definitions. This is useful for initial setup but not required for ongoing operation.

Unlike AWS provisioning where instance types must be cataloged, Slurm resources are discovered dynamically at runtime through the Substrate. The catalog tool is provided as a convenience for generating initial definitions based on actual cluster capabilities.

Building the Cataloger Tool

Build the cataloger using the fuzzy build system:

$ fuzzy build binary cataloger

Discovering Resources

Generate definitions for a specific partition:

$ ./build/cataloger provisioner cataloger slurm definition generate \
    --catalog-path /tmp/slurm-catalog \
    --id "slurm-gpu-discovery" \
    --ssh-host "slurm-head.example.com" \
    --username "fuzzball-service" \
    --private-key-path "/etc/fuzzball/slurm-key" \
    --public-key "$(cat /etc/fuzzball/slurm-key.pub)" \
    --partition "gpu"

The catalog tool will:

  1. Connect to your Slurm cluster via SSH
  2. Submit a test job to provision the Substrate
  3. Query the Substrate for available resources
  4. Generate a provisioner definition with discovered CPU, memory, and hardware specs
  5. Save the definition to the specified catalog path
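
For illustration only, a generated definition might resemble the following; the CPU and memory values here are assumptions rather than actual tool output, since the real values depend on what the cataloger discovers on your cluster:

definitions:
  - id: "slurm-gpu-discovery"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 16
      mem: "96GiB"
      partition: "gpu"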

Workflow Resource Matching

When a workflow is submitted, Fuzzball matches workflow resource requirements against provisioner definitions:

Exact Match

Workflow requesting exactly what a definition offers:

Workflow (workflow.yaml):

version: v1
jobs:
  my-job:
    name: "compute-job"
    image:
      uri: "docker://alpine:latest"
    command: ["echo", "Hello"]
    resource:
      cpu:
        cores: 4
      memory:
        size: "8GiB"

Provisioner Definition:

definitions:
  - id: "slurm-compute-small"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 4
      mem: "8GiB"

This workflow will match the slurm-compute-small definition.

Subset Match

Workflow requesting less than what a definition offers:

Workflow:

resource:
  cpu:
    cores: 2
  memory:
    size: "4GiB"

Provisioner Definition:

definitions:
  - id: "slurm-compute-small"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 4
      mem: "8GiB"

This workflow can use the slurm-compute-small definition, consuming only 2 cores and 4 GiB.

No Match

Workflow requesting more than any definition offers:

Workflow:

resource:
  cpu:
    cores: 64
  memory:
    size: "512GiB"

If no provisioner definition offers 64 cores and 512 GiB, the workflow will fail with a “no matching provisioner definition” error.

Solution: Create a provisioner definition that meets or exceeds the requirements:

definitions:
  - id: "slurm-compute-xlarge"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 64
      mem: "512GiB"
      partition: "himem"
      cost_per_hour: 5.0

Access Control

Use access policies to control which users or organizations can use specific provisioner definitions. This is configured through the central configuration system’s policy expressions.

Example: Restrict GPU nodes to specific organizations:

definitions:
  - id: "slurm-gpu"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 16
      mem: "64GiB"
      partition: "gpu"
      constraint: "gpu"
      cost_per_hour: 5.0
    access_policy: |
      req.organization.id in ["org-research", "org-engineering"]

See the Provisioner Administration documentation for details on access policies and expression syntax.

Cost Tracking

The cost_per_hour field enables billing and cost tracking:

definitions:
  - id: "slurm-compute"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"
      cost_per_hour: 0.75

When workflows run using this definition:

  • Usage is tracked per hour of resource allocation
  • Costs are calculated as: cost_per_hour × hours_used
  • Billing reports aggregate costs by organization, account, or user
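
For example, a job that holds resources under this definition for 4 hours is recorded as 0.75 × 4 = $3.00.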

Set cost_per_hour to reflect your internal charge-back rates or actual operational costs.

Best Practices

1. Start with Standard Tiers

Create a few standard definitions covering common use cases:

definitions:
  - id: "slurm-small"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 4
      mem: "8GiB"
      cost_per_hour: 0.5

  - id: "slurm-medium"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 16
      mem: "32GiB"
      cost_per_hour: 1.0

  - id: "slurm-large"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 32
      mem: "128GiB"
      cost_per_hour: 2.0

2. Map to Existing Slurm Structure

Align definitions with your existing partition and account structure. Create one definition per partition/tier combination:

definitions:
  - id: "slurm-compute-standard"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"
      partition: "compute"
      qos: "normal"

  - id: "slurm-compute-priority"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"
      partition: "compute"
      qos: "high"

3. Use Descriptive IDs

Choose IDs that clearly indicate the resource type:

# Good: Clear and descriptive
- id: "slurm-gpu-v100-32gb"
  provisioner: "slurm"
  ttl: 24000
  spec:
    cpu: 16
    mem: "32GiB"
    constraint: "v100"

# Avoid: Unclear or too generic
- id: "def-1"
  provisioner: "slurm"
  ttl: 24000
  spec:
    cpu: 16
    mem: "32GiB"

4. Document Custom Configurations

Add comments to explain partition choices, constraints, or costs:

definitions:
  # GPU partition with V100 GPUs for deep learning workloads
  # Cost reflects amortized hardware and power costs
  - id: "slurm-gpu-v100"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 16
      mem: "96GiB"
      partition: "gpu"
      constraint: "v100"
      cost_per_hour: 8.0
      comment: "V100 GPU nodes for deep learning"

5. Test Definitions Before Production

Create test definitions in a development partition:

definitions:
  - id: "slurm-test"
    provisioner: "slurm"
    ttl: 24000
    spec:
      cpu: 2
      mem: "4GiB"
      partition: "devel"
      qos: "testing"
      cost_per_hour: 0.0  # Free for testing

Submit test workflows to verify configuration before promoting to production.
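
A minimal test workflow, using the workflow schema shown earlier on this page and sized to fit the slurm-test definition above, might look like the following; the job name, image, and command are placeholders:

version: v1
jobs:
  test-job:
    image:
      uri: "docker://alpine:latest"
    command: ["echo", "provisioner test"]
    resource:
      cpu:
        cores: 2
      memory:
        size: "4GiB"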

Validation

After creating provisioner definitions, verify they’re available:

# List all provisioner definitions
$ fuzzball admin provisioner list

# Show details of a specific definition
$ fuzzball admin provisioner get slurm-compute-small

# Test with a simple workflow
$ fuzzball workflow submit --file test-workflow.yaml

Troubleshooting

Definition Not Found

Symptom: Workflow fails with “no matching provisioner definition”

Solutions:

  1. Verify definition exists: fuzzball admin provisioner list
  2. Check workflow resource requirements match definition specs
  3. Ensure definition has provisioner: "slurm"
  4. Verify access policies allow your user/organization

Jobs Not Starting

Symptom: Workflow submitted but jobs never start on Slurm

Solutions:

  1. Check if partition exists: sinfo -p compute
  2. Verify account has access: sacctmgr show assoc where account=research-group
  3. Check QoS is valid: sacctmgr show qos
  4. Review Slurm logs: /var/log/slurm/slurmctld.log

Resource Mismatches

Symptom: Jobs fail or get killed due to resource constraints

Solutions:

  1. Ensure definition specs match actual node capabilities
  2. Verify memory units (GiB vs GB)
  3. Check CPU count matches Slurm node configuration
  4. Use catalog tool to discover actual resources

For more troubleshooting guidance, see the Troubleshooting Guide.

Basic Provisioner Definition

Here’s a simple provisioner definition for a PBS cluster:

definitions:
  - id: "pbs-compute-small"
    provisioner: "pbs"
    ttl: 24000
    exclusive: "job"
    spec:
      cpu: 4
      mem: "8GiB"
      cost_per_hour: 0.5

This definition specifies a compute resource with 4 CPU cores and 8 GiB of memory, tracked at $0.50 per hour for billing purposes. The ttl (time-to-live) of 24000 seconds (6 hours 40 minutes) sets a maximum runtime for jobs and is automatically converted to PBS walltime. Setting exclusive: "job" ensures job-level resource isolation by automatically adding the place=free:excl placement directive.

Do not set the TTL to less than 24000 seconds. Fuzzball has hardcoded time limits for the jobs it creates to pull OCI containers, and a TTL below 24000 seconds can make it impossible to run jobs.

Configuration Parameters

Required Fields

Field       | Type    | Description                                        | Example
id          | string  | Unique identifier for this provisioner definition  | "pbs-compute-small"
provisioner | string  | Must be "pbs" for PBS resources                    | "pbs"
spec.cpu    | integer | Number of CPU cores                                | 4
spec.mem    | string  | Memory size with unit                              | "8GiB"
ttl         | integer | Time-to-live in seconds (maximum job runtime)      | 24000

Optional Fields

Field              | Type    | Description                                      | Example
exclusive          | string  | Resource exclusivity level: "job" or "workflow"  | "job"
spec.cost_per_hour | float   | Cost per hour for billing/tracking               | 0.5
spec.queue         | string  | PBS queue to submit jobs to                      | "workq"
spec.gpus          | integer | Number of GPUs requested                         | 1

Memory Units

Memory sizes must include a unit suffix:

  • "1GiB" - Gibibytes (1024³ bytes)
  • "1GB" - Gigabytes (1000³ bytes)
  • "1MiB" - Mebibytes (1024² bytes)
  • "1MB" - Megabytes (1000² bytes)

Time-to-Live (TTL) and Walltime

The ttl parameter sets a maximum runtime for jobs and is automatically converted to PBS walltime:

definitions:
  - id: "pbs-short"
    provisioner: "pbs"
    ttl: 24000  # 6 hour 40 minute maximum
    spec:
      cpu: 4
      mem: "8GiB"

  - id: "pbs-long"
    provisioner: "pbs"
    ttl: 86400  # 24 hours maximum
    spec:
      cpu: 8
      mem: "16GiB"

Behavior:

  • Jobs exceeding the TTL are automatically terminated
  • TTL is specified in seconds and converted to PBS walltime format
  • A 60-second buffer is added for graceful shutdown
  • The PBS provisioner automatically handles walltime conversion
  • If not specified, jobs have no time limit (subject to PBS queue configuration)
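
For example, ttl: 24000 corresponds to 6 hours 40 minutes (24000 ÷ 3600 = 6 hours, with the remaining 2400 seconds equal to 40 minutes), which in HH:MM:SS walltime notation is 06:40:00.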

Use Cases:

  • Prevent runaway jobs from consuming resources indefinitely
  • Match PBS queue time limits
  • Enforce different time limits for different resource tiers
  • Support cost management by limiting job duration

Resource Exclusivity

The exclusive parameter controls resource isolation and automatically adds PBS placement directives:

definitions:
  - id: "pbs-exclusive-job"
    provisioner: "pbs"
    ttl: 24000
    exclusive: "job"  # Resources exclusive to individual jobs
    spec:
      cpu: 8
      mem: "16GiB"

  - id: "pbs-exclusive-workflow"
    provisioner: "pbs"
    ttl: 24000
    exclusive: "workflow"  # Resources exclusive to entire workflow
    spec:
      cpu: 16
      mem: "32GiB"

Options:

  • "job": Each job gets exclusive access to its allocated resources (adds #PBS -l place=free:excl)
  • "workflow": All jobs in a workflow share exclusive access to resources
  • Not specified: Resources may be shared with other PBS jobs

Important: The PBS provisioner automatically adds the place=free:excl directive for exclusive job placement, ensuring nodes are not shared.

Use Cases:

  • Job-level exclusivity: Ideal for performance-sensitive workloads requiring consistent resources
  • Workflow-level exclusivity: Useful for multi-job workflows that need dedicated nodes
  • Shared resources: Default behavior for cost-efficient resource utilization

GPU Configuration

For GPU-enabled compute nodes, specify the number of GPUs:

definitions:
  - id: "pbs-gpu"
    provisioner: "pbs"
    ttl: 43200  # 12 hours
    exclusive: "job"
    spec:
      cpu: 16
      mem: "64GiB"
      gpus: 2  # Request 2 GPUs
      queue: "gpu"
      cost_per_hour: 8.0

The GPU specification is translated to the appropriate PBS resource request format based on your PBS implementation (Torque or OpenPBS).

Multiple Provisioner Definitions

You can create multiple provisioner definitions to offer different resource tiers:

definitions:
  # Small compute nodes
  - id: "pbs-compute-small"
    provisioner: "pbs"
    ttl: 24000
    exclusive: "job"
    spec:
      cpu: 4
      mem: "8GiB"
      queue: "workq"
      cost_per_hour: 0.5

  # Large compute nodes
  - id: "pbs-compute-large"
    provisioner: "pbs"
    ttl: 28800  # 8 hours
    exclusive: "job"
    spec:
      cpu: 32
      mem: "128GiB"
      queue: "workq"
      cost_per_hour: 2.0

  # GPU nodes
  - id: "pbs-gpu"
    provisioner: "pbs"
    ttl: 43200  # 12 hours
    exclusive: "workflow"
    spec:
      cpu: 16
      mem: "64GiB"
      gpus: 2
      queue: "gpu"
      cost_per_hour: 8.0

Queue-Based Configuration

Organize definitions by PBS queues for different hardware or user groups:

definitions:
  # General compute queue
  - id: "pbs-general"
    provisioner: "pbs"
    ttl: 24000  # 6 hours 40 minutes
    spec:
      cpu: 8
      mem: "16GiB"
      queue: "workq"
      cost_per_hour: 0.75

  # High-priority queue
  - id: "pbs-priority"
    provisioner: "pbs"
    ttl: 28800  # 8 hours
    spec:
      cpu: 8
      mem: "16GiB"
      queue: "high"
      cost_per_hour: 1.5

  # Development/testing queue
  - id: "pbs-devel"
    provisioner: "pbs"
    ttl: 24000  # 6 hours 40 minutes
    spec:
      cpu: 2
      mem: "4GiB"
      queue: "devel"
      cost_per_hour: 0.1

Hardware-Specific Configuration

Configure definitions for specific hardware capabilities:

definitions:
  # Standard CPU nodes
  - id: "pbs-cpu-standard"
    provisioner: "pbs"
    ttl: 24000
    spec:
      cpu: 20
      mem: "64GiB"
      queue: "workq"
      cost_per_hour: 0.8

  # High-memory nodes
  - id: "pbs-himem"
    provisioner: "pbs"
    ttl: 28800
    spec:
      cpu: 28
      mem: "256GiB"
      queue: "himem"
      cost_per_hour: 2.5

  # GPU nodes with V100
  - id: "pbs-v100"
    provisioner: "pbs"
    ttl: 43200
    spec:
      cpu: 16
      mem: "96GiB"
      gpus: 2
      queue: "gpu"
      cost_per_hour: 8.0

  # GPU nodes with A100
  - id: "pbs-a100"
    provisioner: "pbs"
    ttl: 86400
    spec:
      cpu: 32
      mem: "256GiB"
      gpus: 4
      queue: "gpu-a100"
      cost_per_hour: 15.0

Creating Provisioner Definitions

Using the Provisioner Configuration API

The recommended method is to use Fuzzball’s provisioner configuration system. Create a YAML file with your definitions and apply it through the admin API or CLI.

Example configuration file (pbs-provisioners.yaml):

apiVersion: fuzzball.io/v1alpha1
kind: CentralConfig
metadata:
  name: pbs-provisioners
spec:
  provisioner:
    definitions:
      - id: "pbs-compute-small"
        provisioner: "pbs"
        ttl: 24000
        exclusive: "job"
        spec:
          cpu: 4
          mem: "8GiB"
          queue: "workq"
          cost_per_hour: 0.5

      - id: "pbs-compute-large"
        provisioner: "pbs"
        ttl: 28800
        exclusive: "job"
        spec:
          cpu: 32
          mem: "128GiB"
          queue: "workq"
          cost_per_hour: 2.0

Apply using the Fuzzball CLI:

$ fuzzball admin provisioner create --config pbs-provisioners.yaml

Workflow Resource Matching

When a workflow is submitted, Fuzzball matches workflow resource requirements against provisioner definitions:

Exact Match

Workflow requesting exactly what a definition offers:

Workflow (workflow.fz):

version: v1
jobs:
  my-job:
    image:
      uri: "docker://alpine:latest"
    command: ["echo", "Hello"]
    resource:
      cpu:
        cores: 4
      memory:
        size: "8GiB"

Provisioner Definition:

definitions:
  - id: "pbs-compute-small"
    provisioner: "pbs"
    ttl: 24000
    spec:
      cpu: 4
      mem: "8GiB"

This workflow will match the pbs-compute-small definition.

Subset Match

Workflow requesting less than what a definition offers:

Workflow:

resource:
  cpu:
    cores: 2
  memory:
    size: "4GiB"

Provisioner Definition:

definitions:
  - id: "pbs-compute-small"
    provisioner: "pbs"
    ttl: 24000
    spec:
      cpu: 4
      mem: "8GiB"

This workflow can use the pbs-compute-small definition, consuming only 2 cores and 4 GiB.

No Match

Workflow requesting more than any definition offers:

Workflow:

resource:
  cpu:
    cores: 64
  memory:
    size: "512GiB"

If no provisioner definition offers 64 cores and 512 GiB, the workflow will fail with a “no matching provisioner definition” error.

Solution: Create a provisioner definition that meets or exceeds the requirements:

definitions:
  - id: "pbs-compute-xlarge"
    provisioner: "pbs"
    ttl: 86400
    spec:
      cpu: 64
      mem: "512GiB"
      queue: "himem"
      cost_per_hour: 5.0

Access Control

Use access policies to control which users or organizations can use specific provisioner definitions. This is configured through the central configuration system’s policy expressions.

Example: Restrict GPU nodes to specific organizations:

definitions:
  - id: "pbs-gpu"
    provisioner: "pbs"
    ttl: 43200
    spec:
      cpu: 16
      mem: "64GiB"
      gpus: 2
      queue: "gpu"
      cost_per_hour: 8.0
    access_policy: |
      req.organization.id in ["org-research", "org-engineering"]

See the Provisioner Administration documentation for details on access policies and expression syntax.

Cost Tracking

The cost_per_hour field enables billing and cost tracking:

definitions:
  - id: "pbs-compute"
    provisioner: "pbs"
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"
      cost_per_hour: 0.75

When workflows run using this definition:

  • Usage is tracked per hour of resource allocation
  • Costs are calculated as: cost_per_hour × hours_used
  • Billing reports aggregate costs by organization, account, or user
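
For example, a workflow that holds resources under this definition for 8 hours accrues 0.75 × 8 = $6.00.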

Set cost_per_hour to reflect your internal charge-back rates or actual operational costs.

Best Practices

1. Start with Standard Tiers

Create a few standard definitions covering common use cases:

definitions:
  - id: "pbs-small"
    provisioner: "pbs"
    ttl: 24000
    spec:
      cpu: 4
      mem: "8GiB"
      cost_per_hour: 0.5

  - id: "pbs-medium"
    provisioner: "pbs"
    ttl: 24000
    spec:
      cpu: 16
      mem: "32GiB"
      cost_per_hour: 1.0

  - id: "pbs-large"
    provisioner: "pbs"
    ttl: 28800
    spec:
      cpu: 32
      mem: "128GiB"
      cost_per_hour: 2.0

2. Map to Existing PBS Structure

Align definitions with your existing queue structure. Create one definition per queue/tier combination:

definitions:
  - id: "pbs-workq-standard"
    provisioner: "pbs"
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"
      queue: "workq"

  - id: "pbs-high-priority"
    provisioner: "pbs"
    ttl: 24000
    spec:
      cpu: 8
      mem: "16GiB"
      queue: "high"

3. Use Descriptive IDs

Choose IDs that clearly indicate the resource type:

# Good: Clear and descriptive
- id: "pbs-gpu-v100-32gb"
  provisioner: "pbs"
  ttl: 24000
  spec:
    cpu: 16
    mem: "32GiB"
    gpus: 2
    queue: "gpu"

# Avoid: Unclear or too generic
- id: "def-1"
  provisioner: "pbs"
  ttl: 24000
  spec:
    cpu: 16
    mem: "32GiB"

4. Set Appropriate TTL Values

The TTL must not be set lower than 24000 seconds (6 hours and 40 minutes) because Fuzzball sets a time limit of 6 hours for the jobs it creates to pull OCI containers; a lower TTL may cause jobs to become unschedulable. The TTL should also not exceed the maximum walltime configured for your cluster.

definitions:
  # Short jobs (testing, quick analysis)
  - id: "pbs-quick"
    provisioner: "pbs"
    ttl: 24000  # 6 hours 40 minutes
    spec:
      cpu: 4
      mem: "8GiB"

  # Medium jobs (standard processing)
  - id: "pbs-standard"
    provisioner: "pbs"
    ttl: 144000  # 40 hours
    spec:
      cpu: 16
      mem: "32GiB"

  # Long jobs (simulations, training)
  - id: "pbs-long"
    provisioner: "pbs"
    ttl: 864000  # 10 days
    spec:
      cpu: 32
      mem: "128GiB"

5. Test Definitions Before Production

Create test definitions in a development queue:

definitions:
  - id: "pbs-test"
    provisioner: "pbs"
    ttl: 24000
    spec:
      cpu: 2
      mem: "4GiB"
      queue: "devel"
      cost_per_hour: 0.0  # Free for testing

Submit test workflows to verify configuration before promoting to production.
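
A minimal test workflow, using the workflow schema shown earlier on this page and sized to fit the pbs-test definition above, might look like the following; the job name, image, and command are placeholders:

version: v1
jobs:
  test-job:
    image:
      uri: "docker://alpine:latest"
    command: ["echo", "pbs provisioner test"]
    resource:
      cpu:
        cores: 2
      memory:
        size: "4GiB"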

Validation

After creating provisioner definitions, verify they’re available:

# List all provisioner definitions
$ fuzzball admin provisioner list

# Show details of a specific definition
$ fuzzball admin provisioner get pbs-compute-small

# Test with a simple workflow
$ fuzzball workflow submit --file test-workflow.fz

Troubleshooting

Definition Not Found

Symptom: Workflow fails with “no matching provisioner definition”

Solutions:

  1. Verify definition exists: fuzzball admin provisioner list
  2. Check workflow resource requirements match definition specs
  3. Ensure definition has provisioner: "pbs"
  4. Verify access policies allow your user/organization

Jobs Not Starting

Symptom: Workflow submitted but jobs never start on PBS

Solutions:

  1. Check if queue exists: qstat -Q or qstat -q
  2. Check queue limits and availability
  3. Review PBS logs: /var/log/pbs/server_logs
  4. Verify user permissions with PBS administrator

Resource Mismatches

Symptom: Jobs fail or get killed due to resource constraints

Solutions:

  1. Ensure definition specs match actual node capabilities.
  2. Verify memory units (GiB vs GB).
  3. Check CPU count matches PBS node configuration.
  4. Use catalog tool to discover actual resources.
  5. Check PBS node properties: pbsnodes -a.

Walltime Exceeded

Symptom: Jobs terminated with “walltime exceeded” message

Solutions:

  1. Increase TTL in provisioner definition.
  2. Check queue walltime limits: qstat -Qf queue-name.
  3. Ensure job completes within TTL minus 60-second buffer.
  4. Consider splitting long-running jobs into smaller tasks.

For more troubleshooting guidance, see the Troubleshooting Guide.