
Provisioner Configuration Details

The Fuzzball provisioner configuration system provides cluster administrators with a flexible, YAML-based approach to manage provisioner definitions, resource policies, and system-wide settings. This configuration system uses expression-based conditions and policies to control resource allocation and node matching.

Overview

The centralized configuration file is a YAML document that defines the following:

  • Global settings: Node annotations and software license tokens
  • Provisioner definitions: Static, AWS, and Slurm resource definitions
  • Conditional policy expressions: Resource allocation rules and user/job-based access controls

The expression language itself is documented in the Expr Language Definition documentation.

After creating your provisioner configuration file, you can apply it with the fuzzball admin config command like so:

$ fuzzball admin config set ./provision-config.yaml

The substrate daemon running on compute nodes must be restarted after applying changes to the provisioner configuration.

Configuration Structure

# Global node annotations applied to all nodes
nodeAnnotations:
  global.annotation: "cluster-wide-value"
  environment: "production"

# Software license token limits
softwareTokens:
  matlab: 20
  ansys: 10

# Resource provisioner definitions
definitions:
  - id: compute-nodes
    provisioner: static
    # ... provisioner-specific configuration

Global Node Annotations

Node annotations are key-value pairs applied globally to all cluster nodes:

nodeAnnotations:
  cluster.name: "hpc-cluster-01"
  datacenter: "us-west-2"
  environment: "production"

Software Tokens

Software tokens (for allocating software licenses) are on the roadmap but were not yet implemented at the time of this writing.

Manage concurrent usage limits for licensed software:

softwareTokens:
  matlab: 25          # Maximum 25 concurrent MATLAB sessions
  ansys: 15           # Maximum 15 concurrent ANSYS jobs
  comsol: 8           # Maximum 8 concurrent COMSOL licenses

Provisioner Definitions

Provisioner definitions must group compute resources by identical hardware specifications. Each definition should represent nodes with the same CPU, memory, GPU, and network hardware configuration. Nodes with different hardware capabilities require separate provisioner definitions to ensure accurate resource matching and allocation. See Hardware Grouping Requirements for detailed examples and best practices.

Static Provisioner

Static provisioners define physical or pre-provisioned compute resources with condition-based matching:

Standard Compute Nodes
definitions:
  - id: compute-nodes
    annotations:
      node.type: "compute"
      hardware.generation: "gen3"
    provisioner: static
    provisionerSpec:
      condition: |-
        let regex = "compute-[0-9]+";
        hostname() matches regex &&
        uname.machine == "x86_64" &&
        cpuinfo.vendor_id == "GenuineIntel"        
      costPerHour: 0.25
    policy: |-
      request.owner.organization_id == "9deebe97-7866-4024-82a3-346348851944" &&
      request.job_resource.cpu.cores <= 32      
GPU Compute Nodes
definitions:
  - id: gpu-nodes
    provisioner: static
    provisionerSpec:
      condition: |-
        hostname() matches "gpu-[0-9]+" &&
        modalias.match("pci:v000010DEd*sv*sd*bc03sc*i*")        
      costPerHour: 2.50
    policy: |-
      request.job_resource.devices["nvidia.com/gpu"] > 0 &&
      request.owner.account_id in ["7d77efb9-baed-4fbf-9c25-ee51ec4cd93f", "49e57c28-57a9-4204-85ca-c8ecd7922af5"]      
High-Memory Nodes
definitions:
  - id: highmem-nodes
    provisioner: static
    provisionerSpec:
      condition: |-
        hostname() matches "mem-[0-9]+" &&
        cpuinfo.cpu_cores >= 64        
      costPerHour: 1.75
    policy: |-
      request.job_kind == "job" &&
      request.job_resource.exclusive == true &&
      request.job_resource.mem.bytes >= (512 * 1024 * 1024 * 1024)      

AWS Provisioner

AWS provisioners support dynamic instance provisioning with instance type expansion; a catalog of instance types is provided transparently by the AWS provisioner backend. When using wildcard patterns (e.g., t3.*, c5.*), the system automatically expands these into individual definitions for each matching instance type. The ${spec.instanceType} placeholder in the definition ID is replaced with the actual instance type during expansion.

definitions:
  # Spot instances for short workloads
  - id: aws-${spec.instanceType}-spot
    provisioner: aws
    provisionerSpec:
      instanceType: t3.*
      spot: true
    policy: |-
      request.job_ttl <= 3600      

  # On-demand instances for production workloads
  - id: aws-${spec.instanceType}
    provisioner: aws
    provisionerSpec:
      instanceType: c5.*
      spot: false
    policy: |-
      request.job_kind == "service" ||
      request.job_annotations["sla.tier"] == "production"      

  # GPU instances for machine learning
  - id: aws-${spec.instanceType}-gpu
    provisioner: aws
    provisionerSpec:
      instanceType: p3.*
      spot: false
    policy: |-
      request.job_resource.devices["nvidia.com/gpu"] > 0 &&
      request.owner.organization_id == "9deebe97-7866-4024-82a3-346348851944"      
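
For illustration, the spot definition above (instanceType: t3.*) is expanded into one definition per matching catalog entry, conceptually similar to the following; the actual set of instance types (and the resulting IDs) is determined by the backend's catalog:

# Conceptual result of expanding "instanceType: t3.*" (illustrative only;
# these definitions are generated automatically, not written by hand)
definitions:
  - id: aws-t3.micro-spot
    provisioner: aws
    provisionerSpec:
      instanceType: t3.micro
      spot: true
    policy: |-
      request.job_ttl <= 3600      

  - id: aws-t3.large-spot
    provisioner: aws
    provisionerSpec:
      instanceType: t3.large
      spot: true
    policy: |-
      request.job_ttl <= 3600      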

Slurm Provisioner

Slurm provisioners integrate with existing Slurm clusters:

definitions:
  - id: slurm-compute
    provisioner: slurm
    provisionerSpec:
      costPerHour: 0.30
      cpu: 16
      memory: "64GiB"
      partition: "compute"
    policy: |-
      request.job_resource.cpu.cores <= 16 &&
      request.job_resource.mem.bytes <= (64 * 1024 * 1024 * 1024)      

  - id: slurm-gpu
    provisioner: slurm
    provisionerSpec:
      costPerHour: 1.80
      cpu: 8
      memory: "32GiB"
      partition: "gpu"
    policy: |-
      request.job_resource.devices["nvidia.com/gpu"] > 0      
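
The byte values in these policies simply restate the memory limits declared in provisionerSpec; for example, the 64GiB cap from the first definition works out as:

# 64GiB expressed in bytes, the unit used by request.job_resource.mem.bytes
64 * 1024 * 1024 * 1024 = 68719476736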

Static Provisioner Conditions

Static provisioner conditions use the expression language to match nodes based on system attributes. Conditions are evaluated by nodes at startup or when the configuration changes.

Available Expression Variables

System Information (uname)

uname.sysname      # Operating system name (e.g., "Linux")
uname.nodename     # Network node hostname
uname.release      # Operating system release
uname.version      # Operating system version
uname.machine      # Hardware machine type (e.g., "x86_64", "aarch64")
uname.domainname   # Network domain name

Operating System Details (osrelease)

osrelease.name              # OS name (e.g., "Ubuntu")
osrelease.id                # OS identifier (e.g., "ubuntu")
osrelease.id_like           # Similar OS identifiers (e.g., "debian")
osrelease.version           # OS version string
osrelease.version_id        # OS version identifier
osrelease.version_codename  # OS version codename (e.g., "jammy")

CPU Information (cpuinfo)

cpuinfo.vendor_id    # CPU vendor (e.g., "GenuineIntel", "AuthenticAMD")
cpuinfo.cpu_family   # CPU family number
cpuinfo.model        # CPU model number
cpuinfo.model_name   # CPU model name string
cpuinfo.microcode    # Microcode version
cpuinfo.cpu_cores    # Number of physical CPU cores

Hardware Detection (modalias)

modalias.match(alias_pattern)  # Match hardware modalias patterns

Network Information

hostname()  # Returns current hostname as string

Modalias Pattern Examples

# NVIDIA GPU (any model)
modalias.match("pci:v000010DEd*sv*sd*bc03sc*i*")

# Specific NVIDIA GPU models
modalias.match("pci:v000010DEd00001B06sv*sd*bc03sc*i*")  # GTX 1080 Ti
modalias.match("pci:v000010DEd00001E07sv*sd*bc03sc*i*")  # RTX 2080 Ti

# Intel Ethernet controllers
modalias.match("pci:v00008086d*sv*sd*bc02sc00i*")

# Mellanox InfiniBand adapters
modalias.match("pci:v000015B3d*sv*sd*bc0Csc06i*")

You can also list the modalias of every PCI device on a node to find the value to match for a specific device:

$ IFS=$'\n'; for d in $(lspci); do modalias=$(cat /sys/bus/pci/devices/0000\:${d%% *}/modalias); echo "$modalias -> ${d#* }"; done

pci:v00008086d00004641sv00001D05sd00001174bc06sc00i00 -> Host bridge: Intel Corporation 12th Gen Core Processor Host Bridge/DRAM Registers (rev 02)
pci:v00008086d0000460Dsv00000000sd00000000bc06sc04i00 -> PCI bridge: Intel Corporation 12th Gen Core Processor PCI Express x16 Controller #1 (rev 02)
pci:v00008086d000046A6sv00001D05sd00001174bc03sc00i00 -> VGA compatible controller: Intel Corporation Alder Lake-P GT2 [Iris Xe Graphics] (rev 0c)
...

Condition Examples

Operating System Matching

condition: |-
  osrelease.id == "ubuntu" &&
  osrelease.version_id >= "20.04"  
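
Note that osrelease.version_id is a string, so >= is a lexicographic comparison ("9.04" >= "20.04" evaluates to true, for instance). When ordering across major versions matters, an explicit allow-list is safer:

condition: |-
  osrelease.id == "ubuntu" &&
  osrelease.version_id in ["20.04", "22.04", "24.04"]  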

CPU Architecture and Vendor

condition: |-
  uname.machine == "x86_64" &&
  cpuinfo.vendor_id == "GenuineIntel" &&
  cpuinfo.cpu_cores >= 16  

Hostname Pattern Matching

condition: |-
  let compute_regex = "compute-[0-9]{3}";
  let gpu_regex = "gpu-[0-9]{2}";
  hostname() matches compute_regex || hostname() matches gpu_regex  
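
Assuming standard Expr semantics for the matches operator, patterns are unanchored regular expressions, so "compute-[0-9]{3}" would also match a hostname like "precompute-0001". Anchor the pattern when an exact hostname match is intended:

condition: |-
  hostname() matches "^(compute-[0-9]{3}|gpu-[0-9]{2})$"  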

Hardware Device Detection

condition: |-
  // Match NVIDIA GPU devices
  modalias.match("pci:v000010DEd*sv*sd*bc03sc*i*") &&
  cpuinfo.cpu_cores >= 8  

Complex Multi-Condition Logic

condition: |-
  let is_compute_node = hostname() matches "compute-[0-9]+";
  let is_intel_cpu = cpuinfo.vendor_id == "GenuineIntel";
  let is_ubuntu = osrelease.id == "ubuntu";
  let has_enough_cores = cpuinfo.cpu_cores >= 16;

  is_compute_node && is_intel_cpu && is_ubuntu && has_enough_cores  

Definition Policies

Definition policies control which users and jobs can access specific provisioner definitions. Policies are evaluated for each allocation request and must return a boolean result.

Available Policy Variables

Request Owner Information

request.owner.id              # User ID
request.owner.organization_id # Organization ID
request.owner.email           # User email address
request.owner.cluster_id      # Cluster ID
request.owner.account_id      # Account ID

Job Information

request.job_kind         # Job type: "job", "service", "internal"
request.job_ttl          # Job time-to-live in seconds
request.job_annotations  # Job annotation key-value pairs (map[string]string)
request.multinode_job    # Boolean: true for multi-node jobs
request.task_array_job   # Boolean: true for task array jobs

Resource Requirements

# CPU Resources
request.job_resource.cpu.affinity  # CPU affinity: "none", "core", "socket", "numa"
request.job_resource.cpu.cores     # Number of CPU cores requested
request.job_resource.cpu.threads   # Boolean: hyperthreading enabled
request.job_resource.cpu.sockets   # Number of CPU sockets

# Memory Resources
request.job_resource.mem.bytes     # Memory in bytes
request.job_resource.mem.by_core   # Boolean: memory allocation per core

# Device Resources
request.job_resource.devices       # Device requests (map[string]uint32)
request.job_resource.exclusive     # Boolean: exclusive node access

Policy Examples

User and Organization Access Control

policy: |-
  request.owner.organization_id == "research" &&
  request.owner.account_id in ["2f0a8f4e-0a16-47d5-b541-05d3f9f44910", "c602cf05-7604-4f11-a690-79552b1fdbdd"]  

Resource-Based Restrictions

policy: |-
  request.job_resource.cpu.cores <= 32 &&
  request.job_resource.mem.bytes <= (256 * 1024 * 1024 * 1024) &&
  !request.job_resource.exclusive  

Job Type and Duration Policies

policy: |-
  request.job_kind == "job" &&
  request.job_ttl >= 300 &&   // at least 5 minutes
  request.job_ttl <= 86400    // at most 24 hours  

GPU Access Control

policy: |-
  let gpu_count = request.job_resource.devices["nvidia.com/gpu"];
  gpu_count > 0 && gpu_count <= 4 &&
  request.owner.organization_id == "280abb59-b765-4cdd-a538-6ab8f9b7927c"  

Annotation-Based Policies

policy: |-
  request.job_annotations["priority"] == "high" &&
  request.job_annotations["project"] in ["proj-a", "proj-b"] &&
  request.owner.email matches "@ciq\\.com$"  

Multi-Node Job Restrictions

policy: |-
  request.multinode_job ?
  request.job_resource.cpu.cores >= 4 &&
  request.owner.account_id == "092403fe-12ef-4465-bce4-18292fec13c8"
  :
  request.job_resource.cpu.cores <= 16  

Time-Based Access

policy: |-
  let current_hour = time.Now().Hour();
  let is_business_hours = current_hour >= 9 && current_hour <= 17;

  request.job_annotations["priority"] == "low" ? !is_business_hours : true  

Complete Configuration Example

# Global cluster settings
nodeAnnotations:
  cluster.name: "hpc-cluster-prod"
  datacenter: "us-east-1"
  environment: "production"

softwareTokens:
  matlab: 50
  ansys: 25
  abaqus: 15

definitions:
  # Static compute nodes
  - id: compute-standard
    annotations:
      node.type: "compute"
      performance.tier: "standard"
    provisioner: static
    provisionerSpec:
      condition: |-
        hostname() matches "compute-[0-9]{3}" &&
        cpuinfo.vendor_id == "GenuineIntel" &&
        cpuinfo.cpu_cores >= 16 &&
        osrelease.id == "ubuntu"        
      costPerHour: 0.40
    policy: |-
      request.owner.organization_id in ["research", "engineering"] &&
      request.job_resource.cpu.cores <= 32      

  # GPU nodes for ML workloads
  - id: gpu-ml
    provisioner: static
    provisionerSpec:
      condition: |-
        hostname() matches "gpu-[0-9]{2}" &&
        modalias.match("pci:v000010DEd*sv*sd*bc03sc*i*")        
      costPerHour: 3.20
    policy: |-
      request.job_resource.devices["nvidia.com/gpu"] > 0 &&
      request.owner.organization_id == "ai-ml" &&
      request.job_annotations["framework"] in ["pytorch", "tensorflow"]      

  # AWS spot instances for batch processing
  - id: aws-${spec.instanceType}-spot
    provisioner: aws
    provisionerSpec:
      instanceType: c5.*
      spot: true
    policy: |-
      request.job_annotations["cost.optimization"] == "enabled" &&
      request.job_ttl >= 1800 &&
      request.job_kind == "job"      

  # Slurm partition integration
  - id: slurm-highmem
    provisioner: slurm
    provisionerSpec:
      costPerHour: 1.25
      cpu: 32
      memory: "256GiB"
      partition: "highmem"
    policy: |-
      request.job_resource.mem.bytes >= (128 * 1024 * 1024 * 1024) &&
      request.owner.account_id == "memory-intensive"      

Hardware Grouping Requirements

Provisioner Hardware Grouping

Provisioner definitions must group nodes by identical hardware specifications. This ensures accurate resource allocation and prevents scheduling conflicts.

Correct Hardware Grouping Examples

definitions:
  # Group 1: Intel Xeon nodes with 32 cores, 128GB RAM
  - id: intel-compute-32c-128g
    provisioner: static
    provisionerSpec:
      condition: |-
        hostname() matches "compute-(0[1-9]|1[0-9]|20)" &&
        cpuinfo.vendor_id == "GenuineIntel" &&
        cpuinfo.cpu_cores == 32 &&
        cpuinfo.model_name matches "Xeon"        
      costPerHour: 0.40

  # Group 2: Intel Xeon nodes with 64 cores, 256GB RAM
  - id: intel-compute-64c-256g
    provisioner: static
    provisionerSpec:
      condition: |-
        hostname() matches "compute-(2[1-9]|3[0-9]|40)" &&
        cpuinfo.vendor_id == "GenuineIntel" &&
        cpuinfo.cpu_cores == 64 &&
        cpuinfo.model_name matches "Xeon"        
      costPerHour: 0.80

  # Group 3: AMD EPYC nodes with 32 cores, 128GB RAM
  - id: amd-compute-32c-128g
    provisioner: static
    provisionerSpec:
      condition: |-
        hostname() matches "amd-(0[1-9]|1[0-5])" &&
        cpuinfo.vendor_id == "AuthenticAMD" &&
        cpuinfo.cpu_cores == 32 &&
        cpuinfo.model_name matches "EPYC"        
      costPerHour: 0.35

  # Group 4: GPU nodes with NVIDIA V100s
  - id: gpu-v100-nodes
    provisioner: static
    provisionerSpec:
      condition: |-
        hostname() matches "gpu-[01-10]" &&
        modalias.match("pci:v000010DEd00001DB4sv*sd*bc03sc*i*")        
      costPerHour: 3.20

  # Group 5: GPU nodes with NVIDIA A100s
  - id: gpu-a100-nodes
    provisioner: static
    provisionerSpec:
      condition: |-
        hostname() matches "gpu-[11-20]" &&
        modalias.match("pci:v000010DEd000020F1sv*sd*bc03sc*i*")        
      costPerHour: 4.50

Incorrect Hardware Grouping (Avoid This)

definitions:
  # BAD: Mixing different hardware in one definition
  - id: mixed-compute-nodes
    provisioner: static
    provisionerSpec:
      condition: |-
        # This condition matches nodes with different hardware specs
        hostname() matches "compute-[0-9]+" ||
        hostname() matches "amd-[0-9]+" ||
        hostname() matches "gpu-[0-9]+"        
      costPerHour: 0.50  # Single cost for different hardware types

Why Hardware Grouping Matters

  1. Accurate Resource Reporting: Jobs receive consistent resource allocations within a hardware group.
  2. Cost Management: Different hardware types have different operational costs.
  3. Performance Predictability: Similar hardware provides consistent performance characteristics.
  4. Scheduling Efficiency: The scheduler can make better placement decisions with homogeneous groups.
  5. Resource Policies: Access controls can be tailored to specific hardware capabilities.

Hardware Grouping Criteria

Group nodes based on these key characteristics:

# CPU Specifications
cpuinfo.vendor_id     # Intel vs AMD
cpuinfo.cpu_cores     # Core count differences
cpuinfo.model_name    # CPU generation/model

# Memory Configuration
# (Check via system commands, not available in expressions)
# - Total memory capacity
# - Memory speed/type (DDR4 vs DDR5)

# GPU Hardware
modalias.match()      # GPU vendor/model detection
# - GPU memory capacity
# - GPU architecture (V100, A100, RTX, etc.)

# Network Hardware
modalias.match()      # Network interface detection
# - Ethernet speed (1Gb, 10Gb, 25Gb)
# - InfiniBand capabilities
# - RDMA support

# Storage Configuration
# - Local SSD vs spinning disk
# - NVMe vs SATA interfaces
# - Storage capacity tiers

# Template for hardware grouping naming convention
definitions:
  - id: {cpu_vendor}-{core_count}c-{memory_gb}g-{special_features}
    # Examples:
    # intel-32c-128g-standard
    # amd-64c-256g-highmem
    # intel-32c-128g-v100x4
    # amd-32c-128g-a100x8-infiniband