Provisioner Configuration Details
The Fuzzball provisioner configuration system provides cluster administrators with a flexible, YAML-based approach to manage provisioner definitions, resource policies, and system-wide settings. This configuration system uses expression-based conditions and policies to control resource allocation and node matching.
The centralized configuration file is a YAML document that defines the following:
- Global settings: Node annotations and software license tokens
- Provisioner definitions: Static, AWS, and Slurm resource definitions
- Condition and policy expressions: Resource allocation rules and user/job-based access controls
Documentation for the expression language can be found in the Expr Language Definition documentation.
After creating your provisioner configuration file, you can apply it with the fuzzball admin config command like so:
$ fuzzball admin config set ./provision-config.yaml
The substrate daemon running on compute nodes must be restarted after applying changes to the provisioner configuration.
# Global node annotations applied to all nodes
nodeAnnotations:
global.annotation: "cluster-wide-value"
environment: "production"
# Software license token limits
softwareTokens:
matlab: 20
ansys: 10
# Resource provisioner definitions
definitions:
- id: compute-nodes
provisioner: static
# ... provisioner-specific configuration
Node annotations are key-value pairs applied globally to all cluster nodes:
nodeAnnotations:
cluster.name: "hpc-cluster-01"
datacenter: "us-west-2"
environment: "production"
Software tokens (to allocate licenses) are on the roadmap but not yet implemented at the time of this writing.
Manage concurrent usage limits for licensed software:
softwareTokens:
matlab: 25 # Maximum 25 concurrent MATLAB sessions
ansys: 15 # Maximum 15 concurrent ANSYS jobs
comsol: 8 # Maximum 8 concurrent COMSOL licenses
Provisioner definitions must group compute resources by identical hardware specifications. Each definition should represent nodes with the same CPU, memory, GPU, and network hardware configuration. Nodes with different hardware capabilities require separate provisioner definitions to ensure accurate resource matching and allocation. See Hardware Grouping Requirements for detailed examples and best practices.
Static provisioners define physical or pre-provisioned compute resources with condition-based matching:
definitions:
- id: compute-nodes
annotations:
node.type: "compute"
hardware.generation: "gen3"
provisioner: static
provisionerSpec:
condition: |-
let regex = "compute-[0-9]+";
hostname() matches regex &&
uname.machine == "x86_64" &&
cpuinfo.vendor_id == "GenuineIntel"
costPerHour: 0.25
policy: |-
request.owner.organization_id == "9deebe97-7866-4024-82a3-346348851944" &&
request.job_resource.cpu.cores <= 32
definitions:
- id: gpu-nodes
provisioner: static
provisionerSpec:
condition: |-
hostname() matches "gpu-[0-9]+" &&
modalias.match("pci:v000010DEd*sv*sd*bc03sc*i*")
costPerHour: 2.50
policy: |-
request.job_resource.devices["nvidia.com/gpu"] > 0 &&
request.owner.account_id in ["7d77efb9-baed-4fbf-9c25-ee51ec4cd93f", "49e57c28-57a9-4204-85ca-c8ecd7922af5"]
definitions:
- id: highmem-nodes
provisioner: static
provisionerSpec:
condition: |-
  hostname() matches "mem-[0-9]+" &&
  cpuinfo.cpu_cores >= 64
costPerHour: 1.75
policy: |-
  request.job_kind == "job" &&
  request.job_resource.mem.bytes >= (512 * 1024 * 1024 * 1024) &&
  request.job_resource.exclusive == true
AWS provisioners support dynamic instance provisioning with instance type expansion; a catalog of instance types is transparently provided by the AWS provisioner backend. When you use wildcard patterns (e.g., t3.*, c5.*), the system automatically expands them into individual definitions, one for each matching instance type. The ${spec.instanceType} placeholder in the definition ID is replaced with the actual instance type during expansion.
definitions:
# Spot instances for short workloads
- id: aws-${spec.instanceType}-spot
provisioner: aws
provisionerSpec:
instanceType: t3.*
spot: true
policy: |-
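// TTL is expressed in seconds, so 3600 caps spot jobs at one hour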
request.job_ttl <= 3600
# On-demand instances for production workloads
- id: aws-${spec.instanceType}
provisioner: aws
provisionerSpec:
instanceType: c5.*
spot: false
policy: |-
request.job_kind == "service" ||
request.job_annotations["sla.tier"] == "production"
# GPU instances for machine learning
- id: aws-${spec.instanceType}-gpu
provisioner: aws
provisionerSpec:
instanceType: p3.*
spot: false
policy: |-
request.job_resource.devices["nvidia.com/gpu"] > 0 &&
request.owner.organization_id == "9deebe97-7866-4024-82a3-346348851944"
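For illustration, the backend expands a wildcard definition into one definition per matching catalog entry. Assuming the catalog includes t3.micro and t3.large (the instance names here are illustrative), the first definition above would yield:
definitions:
  # Expanded automatically from "aws-${spec.instanceType}-spot"
  # (policy carried over unchanged)
  - id: aws-t3.micro-spot
    provisioner: aws
    provisionerSpec:
      instanceType: t3.micro
      spot: true
  - id: aws-t3.large-spot
    provisioner: aws
    provisionerSpec:
      instanceType: t3.large
      spot: true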
Slurm provisioners integrate with existing Slurm clusters:
definitions:
- id: slurm-compute
provisioner: slurm
provisionerSpec:
costPerHour: 0.30
cpu: 16
memory: "64GiB"
partition: "compute"
policy: |-
request.job_resource.cpu.cores <= 16 &&
request.job_resource.mem.bytes <= (64 * 1024 * 1024 * 1024)
- id: slurm-gpu
provisioner: slurm
provisionerSpec:
costPerHour: 1.80
cpu: 8
memory: "32GiB"
partition: "gpu"
policy: |-
request.job_resource.devices["nvidia.com/gpu"] > 0
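For reference, the byte thresholds in these policies follow directly from the GiB sizes in the provisionerSpec:
# 64 GiB = 64 * 1024 * 1024 * 1024 = 68719476736 bytes
# 32 GiB = 32 * 1024 * 1024 * 1024 = 34359738368 bytes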
Static provisioner conditions use the expression language to match nodes based on system attributes. Conditions are evaluated by nodes at startup and whenever the configuration changes.
uname.sysname # Operating system name (e.g., "Linux")
uname.nodename # Network node hostname
uname.release # Operating system release
uname.version # Operating system version
uname.machine # Hardware machine type (e.g., "x86_64", "aarch64")
uname.domainname # Network domain name
osrelease.name # OS name (e.g., "Ubuntu")
osrelease.id # OS identifier (e.g., "ubuntu")
osrelease.id_like # Similar OS identifiers (e.g., "debian")
osrelease.version # OS version string
osrelease.version_id # OS version identifier
osrelease.version_codename # OS version codename (e.g., "jammy")
cpuinfo.vendor_id # CPU vendor (e.g., "GenuineIntel", "AuthenticAMD")
cpuinfo.cpu_family # CPU family number
cpuinfo.model # CPU model number
cpuinfo.model_name # CPU model name string
cpuinfo.microcode # Microcode version
cpuinfo.cpu_cores # Number of physical CPU cores
modalias.match(alias_pattern) # Match hardware modalias patterns
hostname() # Returns current hostname as string
# NVIDIA GPU (any model)
modalias.match("pci:v000010DEd*sv*sd*bc03sc*i*")
# Specific NVIDIA GPU models
modalias.match("pci:v000010DEd00001B06sv*sd*bc03sc*i*") # GTX 1080 Ti
modalias.match("pci:v000010DEd00001E07sv*sd*bc03sc*i*") # RTX 2080 Ti
# Intel Ethernet controllers
modalias.match("pci:v00008086d*sv*sd*bc02sc00i*")
# Mellanox InfiniBand adapters
modalias.match("pci:v000015B3d*sv*sd*bc0Csc06i*")
You can also list the modalias of every PCI device on a node, which makes it easy to find the pattern for a specific device:
IFS=$'\n'; for d in $(lspci); do modalias=$(cat /sys/bus/pci/devices/0000\:${d%% *}/modalias); echo "$modalias -> ${d#* }"; done
pci:v00008086d00004641sv00001D05sd00001174bc06sc00i00 -> Host bridge: Intel Corporation 12th Gen Core Processor Host Bridge/DRAM Registers (rev 02)
pci:v00008086d0000460Dsv00000000sd00000000bc06sc04i00 -> PCI bridge: Intel Corporation 12th Gen Core Processor PCI Express x16 Controller #1 (rev 02)
pci:v00008086d000046A6sv00001D05sd00001174bc03sc00i00 -> VGA compatible controller: Intel Corporation Alder Lake-P GT2 [Iris Xe Graphics] (rev 0c)
...
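For example, to match the Iris Xe graphics controller from the listing above, keep its vendor, device, and base-class fields and wildcard the rest (a minimal sketch):
condition: |-
  // Intel Alder Lake-P GT2 [Iris Xe Graphics], taken from the lspci listing above
  modalias.match("pci:v00008086d000046A6sv*sd*bc03sc*i*")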
condition: |-
osrelease.id == "ubuntu" &&
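// note: this is a lexicographic string comparison; adequate for Ubuntu's two-digit version IDs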
osrelease.version_id >= "20.04"
condition: |-
uname.machine == "x86_64" &&
cpuinfo.vendor_id == "GenuineIntel" &&
cpuinfo.cpu_cores >= 16
condition: |-
let compute_regex = "compute-[0-9]{3}";
let gpu_regex = "gpu-[0-9]{2}";
hostname() matches compute_regex || hostname() matches gpu_regex
condition: |-
// Match NVIDIA GPU devices
modalias.match("pci:v000010DEd*sv*sd*bc03sc*i*") &&
cpuinfo.cpu_cores >= 8
condition: |-
let is_compute_node = hostname() matches "compute-[0-9]+";
let is_intel_cpu = cpuinfo.vendor_id == "GenuineIntel";
let is_ubuntu = osrelease.id == "ubuntu";
let has_enough_cores = cpuinfo.cpu_cores >= 16;
is_compute_node && is_intel_cpu && is_ubuntu && has_enough_cores
Definition policies control which users and jobs can access specific provisioner definitions. Policies are evaluated for each allocation request and must return a boolean result.
request.owner.id # User ID
request.owner.organization_id # Organization ID
request.owner.email # User email address
request.owner.cluster_id # Cluster ID
request.owner.account_id # Account ID
request.job_kind # Job type: "job", "service", "internal"
request.job_ttl # Job time-to-live in seconds
request.job_annotations # Job annotation key-value pairs (map[string]string)
request.multinode_job # Boolean: true for multi-node jobs
request.task_array_job # Boolean: true for task array jobs
# CPU Resources
request.job_resource.cpu.affinity # CPU affinity: "none", "core", "socket", "numa"
request.job_resource.cpu.cores # Number of CPU cores requested
request.job_resource.cpu.threads # Boolean: hyperthreading enabled
request.job_resource.cpu.sockets # Number of CPU sockets
# Memory Resources
request.job_resource.mem.bytes # Memory in bytes
request.job_resource.mem.by_core # Boolean: memory allocation per core
# Device Resources
request.job_resource.devices # Device requests (map[string]uint32)
request.job_resource.exclusive # Boolean: exclusive node access
policy: |-
request.owner.organization_id == "research" &&
request.owner.account_id in ["2f0a8f4e-0a16-47d5-b541-05d3f9f44910", "c602cf05-7604-4f11-a690-79552b1fdbdd"]
policy: |-
request.job_resource.cpu.cores <= 32 &&
request.job_resource.mem.bytes <= (256 * 1024 * 1024 * 1024) &&
!request.job_resource.exclusive
policy: |-
request.job_kind == "job" &&
request.job_ttl >= 300 &&
request.job_ttl <= 86400
policy: |-
let gpu_count = request.job_resource.devices["nvidia.com/gpu"];
gpu_count > 0 && gpu_count <= 4 &&
request.owner.organization_id == "280abb59-b765-4cdd-a538-6ab8f9b7927c"
policy: |-
request.job_annotations["priority"] == "high" &&
request.job_annotations["project"] in ["proj-a", "proj-b"] &&
request.owner.email endsWith "@ciq.com"
policy: |-
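// multi-node jobs: at least 4 cores and a specific account; single-node jobs: at most 16 cores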
request.multinode_job ?
request.job_resource.cpu.cores >= 4 &&
request.owner.account_id == "092403fe-12ef-4465-bce4-18292fec13c8"
:
request.job_resource.cpu.cores <= 16
policy: |-
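// allow low-priority jobs only outside business hours (09:00-17:00); all others always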
let current_hour = time.Now().Hour();
let is_business_hours = current_hour >= 9 && current_hour <= 17;
request.job_annotations["priority"] == "low" ? !is_business_hours : true
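If a device key may be absent from a request entirely, you can guard the lookup with the expression language's in operator before comparing counts. This defensive variant is a sketch, not taken from the examples above:
policy: |-
  // only compare the GPU count when the job actually requests that device key
  "nvidia.com/gpu" in request.job_resource.devices &&
  request.job_resource.devices["nvidia.com/gpu"] <= 4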
# Global cluster settings
nodeAnnotations:
cluster.name: "hpc-cluster-prod"
datacenter: "us-east-1"
environment: "production"
softwareTokens:
matlab: 50
ansys: 25
abaqus: 15
definitions:
# Static compute nodes
- id: compute-standard
annotations:
node.type: "compute"
performance.tier: "standard"
provisioner: static
provisionerSpec:
condition: |-
hostname() matches "compute-[0-9]{3}" &&
cpuinfo.vendor_id == "GenuineIntel" &&
cpuinfo.cpu_cores >= 16 &&
osrelease.id == "ubuntu"
costPerHour: 0.40
policy: |-
request.owner.organization_id in ["research", "engineering"] &&
request.job_resource.cpu.cores <= 32
# GPU nodes for ML workloads
- id: gpu-ml
provisioner: static
provisionerSpec:
condition: |-
hostname() matches "gpu-[0-9]{2}" &&
modalias.match("pci:v000010DEd*sv*sd*bc03sc*i*")
costPerHour: 3.20
policy: |-
request.job_resource.devices["nvidia.com/gpu"] > 0 &&
request.owner.organization_id == "ai-ml" &&
request.job_annotations["framework"] in ["pytorch", "tensorflow"]
# AWS spot instances for batch processing
- id: aws-${spec.instanceType}-spot
provisioner: aws
provisionerSpec:
instanceType: c5.*
spot: true
policy: |-
request.job_annotations["cost.optimization"] == "enabled" &&
request.job_ttl >= 1800 &&
request.job_kind == "job"
# Slurm partition integration
- id: slurm-highmem
provisioner: slurm
provisionerSpec:
costPerHour: 1.25
cpu: 32
memory: "256GiB"
partition: "highmem"
policy: |-
request.job_resource.mem.bytes >= (128 * 1024 * 1024 * 1024) &&
request.owner.account_id == "memory-intensive"
Provisioner definitions must group nodes by identical hardware specifications. This ensures accurate resource allocation and prevents scheduling conflicts.
definitions:
# Group 1: Intel Xeon nodes with 32 cores, 128GB RAM
- id: intel-compute-32c-128g
provisioner: static
provisionerSpec:
condition: |-
hostname() matches "^compute-(0[1-9]|1[0-9]|20)$" &&  // compute-01 through compute-20
cpuinfo.vendor_id == "GenuineIntel" &&
cpuinfo.cpu_cores == 32 &&
cpuinfo.model_name contains "Xeon"
costPerHour: 0.40
# Group 2: Intel Xeon nodes with 64 cores, 256GB RAM
- id: intel-compute-64c-256g
provisioner: static
provisionerSpec:
condition: |-
hostname() matches "^compute-(2[1-9]|3[0-9]|40)$" &&  // compute-21 through compute-40
cpuinfo.vendor_id == "GenuineIntel" &&
cpuinfo.cpu_cores == 64 &&
cpuinfo.model_name contains "Xeon"
costPerHour: 0.80
# Group 3: AMD EPYC nodes with 32 cores, 128GB RAM
- id: amd-compute-32c-128g
provisioner: static
provisionerSpec:
condition: |-
hostname() matches "^amd-(0[1-9]|1[0-5])$" &&  // amd-01 through amd-15
cpuinfo.vendor_id == "AuthenticAMD" &&
cpuinfo.cpu_cores == 32 &&
cpuinfo.model_name contains "EPYC"
costPerHour: 0.35
# Group 4: GPU nodes with NVIDIA V100s
- id: gpu-v100-nodes
provisioner: static
provisionerSpec:
condition: |-
hostname() matches "^gpu-(0[1-9]|10)$" &&  // gpu-01 through gpu-10
modalias.match("pci:v000010DEd00001DB4sv*sd*bc03sc*i*")
costPerHour: 3.20
# Group 5: GPU nodes with NVIDIA A100s
- id: gpu-a100-nodes
provisioner: static
provisionerSpec:
condition: |-
hostname() matches "^gpu-(1[1-9]|20)$" &&  // gpu-11 through gpu-20
modalias.match("pci:v000010DEd000020F1sv*sd*bc03sc*i*")
costPerHour: 4.50
definitions:
# BAD: Mixing different hardware in one definition
- id: mixed-compute-nodes
provisioner: static
provisionerSpec:
condition: |-
// This condition matches nodes with different hardware specs
hostname() matches "compute-[0-9]+" ||
hostname() matches "amd-[0-9]+" ||
hostname() matches "gpu-[0-9]+"
costPerHour: 0.50 # Single cost for different hardware types
- Accurate Resource Reporting: Jobs receive consistent resource allocations within a hardware group.
- Cost Management: Different hardware types have different operational costs.
- Performance Predictability: Similar hardware provides consistent performance characteristics.
- Scheduling Efficiency: The scheduler can make better placement decisions with homogeneous groups.
- Resource Policies: Access controls can be tailored to specific hardware capabilities.
Group nodes based on these key characteristics:
# CPU Specifications
cpuinfo.vendor_id # Intel vs AMD
cpuinfo.cpu_cores # Core count differences
cpuinfo.model_name # CPU generation/model
# Memory Configuration
# (Check via system commands, not available in expressions)
# - Total memory capacity
# - Memory speed/type (DDR4 vs DDR5)
# GPU Hardware
modalias.match() # GPU vendor/model detection
# - GPU memory capacity
# - GPU architecture (V100, A100, RTX, etc.)
# Network Hardware
modalias.match() # Network interface detection
# - Ethernet speed (1Gb, 10Gb, 25Gb)
# - InfiniBand capabilities
# - RDMA support
# Storage Configuration
# - Local SSD vs spinning disk
# - NVMe vs SATA interfaces
# - Storage capacity tiers
# Template for hardware grouping naming convention
definitions:
- id: {cpu_vendor}-{core_count}c-{memory_gb}g-{special_features}
# Examples:
# intel-32c-128g-standard
# amd-64c-256g-highmem
# intel-32c-128g-v100x4
# amd-32c-128g-a100x8-infiniband
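Putting the naming convention together with the grouping attributes above, a definition for Intel nodes with four V100 GPUs might look like this (the costPerHour value is illustrative):
definitions:
  - id: intel-32c-128g-v100x4
    provisioner: static
    provisionerSpec:
      condition: |-
        cpuinfo.vendor_id == "GenuineIntel" &&
        cpuinfo.cpu_cores == 32 &&
        // NVIDIA V100: same device ID as the V100 grouping example above
        modalias.match("pci:v000010DEd00001DB4sv*sd*bc03sc*i*")
      costPerHour: 3.20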