Troubleshooting
This guide covers common issues you may encounter when using Fuzzball with Slurm or PBS integration, with detailed diagnostic steps and solutions. Slurm issues are covered first, followed by their PBS equivalents.
Slurm
Symptom:
Error: Failed to connect to SSH host: dial tcp: connection refused
Diagnostic Steps:
Verify SSH host is reachable:
$ ping slurm-head.example.com
Check SSH port:
$ nc -zv slurm-head.example.com 22
Test SSH connection manually:
$ ssh fuzzball-service@slurm-head.example.com
Check Orchestrate logs:
$ kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator | grep -i "ssh"
Check connection timeout setting:
$ kubectl get fuzzballorchestrate fuzzball -n fuzzball-system -o yaml | grep connectionTimeout
Solutions:
Firewall blocking: Ensure firewall allows SSH connections from Orchestrate on Slurm head node
# firewall-cmd --permanent --add-service=ssh
# firewall-cmd --reload
Wrong SSH port: Verify sshPort in configuration matches the actual SSH port. Check the SSH port on the head node:
# netstat -tlnp | grep sshd
DNS resolution: Use an IP address instead of a hostname if DNS is unavailable
sshHost: "192.168.1.100" # Use IP directly
Connection timeout too short: Increase the timeout for high-latency environments
spec:
  orchestrator:
    provisioner:
      slurm:
        connectionTimeout: 90 # Increase to 90 seconds
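If the cause is still unclear, a small script can triage all three layers (ICMP, TCP port, SSH auth) in one pass. This is a minimal sketch using the example host and user from above; adjust for your site:
#!/bin/bash
# ssh-triage.sh - quick SSH connectivity triage (example host/user)
HOST=slurm-head.example.com
PORT=22
USER=fuzzball-service
ping -c 1 -W 2 "$HOST" >/dev/null 2>&1 && echo "ICMP: ok" || echo "ICMP: failed (may be blocked by policy)"
nc -z -w 5 "$HOST" "$PORT" && echo "TCP $PORT: ok" || echo "TCP $PORT: failed (firewall or wrong port)"
ssh -o BatchMode=yes -o ConnectTimeout=5 -p "$PORT" "$USER@$HOST" true && echo "SSH auth: ok" || echo "SSH auth: failed"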
Symptom:
Error: ssh: handshake failed: ssh: unable to authenticate
Diagnostic Steps:
Verify credentials for password auth (enter password manually when prompted):
$ ssh fuzzball-service@slurm-head.example.com
Or for key auth:
$ ssh -i /path/to/key fuzzball-service@slurm-head.example.com
Check SSH logs on head node (RHEL/CentOS):
# tail -f /var/log/secure
Or on Ubuntu/Debian:
# tail -f /var/log/auth.log
Verify public key is installed:
$ ssh fuzzball-service@slurm-head.example.com 'cat ~/.ssh/authorized_keys'
Solutions:
Password expired: Update password on Slurm head node
# passwd fuzzball-service
Public key not installed: Copy public key to authorized_keys
$ ssh-copy-id -i fuzzball-key.pub fuzzball-service@slurm-head.example.com
Wrong key format: Ensure the private key is in PEM format. Convert to PEM if needed:
$ ssh-keygen -p -m PEM -f fuzzball-key
Key permissions: Fix file permissions
$ chmod 600 ~/.ssh/id_rsa
$ chmod 644 ~/.ssh/id_rsa.pub
$ chmod 700 ~/.ssh
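If key authentication still fails after these steps, comparing fingerprints confirms whether the private key Orchestrate holds matches any key installed on the head node (the key path is the example used above):
$ ssh-keygen -lf fuzzball-key
$ ssh fuzzball-service@slurm-head.example.com 'ssh-keygen -lf ~/.ssh/authorized_keys'
# The fingerprint printed by the first command should appear in the second command's output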
Symptom:
Error: host key mismatch for slurm-head.example.com
Diagnostic Steps:
Get current host key:
$ ssh-keyscan -t rsa slurm-head.example.com
Compare with configured key:
$ kubectl get fuzzballorchestrate fuzzball -n fuzzball-system -o yaml | grep sshHostPublicKey
Solutions:
Update host key in configuration:
spec:
  orchestrator:
    provisioner:
      slurm:
        sshHostPublicKey: "slurm-head.example.com ssh-rsa AAAAB3NzaC1yc2E..."
Temporary workaround (not recommended for production):
spec:
  orchestrator:
    provisioner:
      slurm:
        skipHostKeyVerification: true
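Assuming the field path shown above, the scanned key can be written into the CRD in one step. This is a sketch using kubectl patch; verify the field path against your installed CRD before relying on it:
$ KEY=$(ssh-keyscan -t rsa slurm-head.example.com 2>/dev/null)
$ kubectl patch fuzzballorchestrate fuzzball -n fuzzball-system --type=merge \
    -p "{\"spec\":{\"orchestrator\":{\"provisioner\":{\"slurm\":{\"sshHostPublicKey\":\"$KEY\"}}}}}"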
Symptom:
Error: sbatch: command not found
Diagnostic Steps:
Check if Slurm commands exist:
$ ssh fuzzball-service@slurm-head.example.com 'which sbatch'
$ ssh fuzzball-service@slurm-head.example.com 'which squeue'
$ ssh fuzzball-service@slurm-head.example.com 'which scancel'
Check user’s PATH:
$ ssh fuzzball-service@slurm-head.example.com 'echo $PATH'
Find Slurm binaries:
$ ssh fuzzball-service@slurm-head.example.com 'sudo find / -name sbatch 2>/dev/null'
Solutions:
Set binaryPath in configuration:
spec:
  orchestrator:
    provisioner:
      slurm:
        binaryPath: "/opt/slurm/bin"
Add to user PATH: On the Slurm head node, add to ~/.bashrc or ~/.profile:
export PATH=$PATH:/opt/slurm/bin
Create symlinks (if admin):
# sudo ln -s /opt/slurm/bin/sbatch /usr/local/bin/sbatch
# sudo ln -s /opt/slurm/bin/squeue /usr/local/bin/squeue
# sudo ln -s /opt/slurm/bin/scancel /usr/local/bin/scancel
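A short loop verifies every Slurm command the provisioner may call in one pass (the command list here is illustrative):
$ for cmd in sbatch squeue scancel sacct scontrol; do
    ssh fuzzball-service@slurm-head.example.com "command -v $cmd || echo MISSING: $cmd"
  done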
Symptom:
Error: sbatch: error: Access denied
Diagnostic Steps:
Check user permissions:
$ ssh fuzzball-service@slurm-head.example.com 'sacctmgr show user fuzzball-service'
Verify account access:
$ ssh fuzzball-service@slurm-head.example.com 'sacctmgr show assoc where user=fuzzball-service'
Check partition access:
$ ssh fuzzball-service@slurm-head.example.com 'sinfo -p compute'
Solutions:
Add user to Slurm database:
# sudo sacctmgr add user fuzzball-service account=default
Grant access to partition:
# sudo sacctmgr add account default partition=compute
Check AllowAccounts in slurm.conf (should show AllowAccounts=ALL or include your account):
# sudo grep "PartitionName=compute" /etc/slurm/slurm.conf
Symptom: Workflow status stuck in PENDING, no error messages
Diagnostic Steps:
Check Slurm controller status:
$ ssh slurm-head 'scontrol ping'
Check pending jobs:
$ ssh slurm-head 'squeue -t PD'
Check controller logs:
$ ssh slurm-head 'sudo tail -f /var/log/slurm/slurmctld.log'
Check Orchestrate logs:
$ kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator | grep -A 10 "ProvisionSubstrate"
Solutions:
Restart Slurm controller (if unresponsive):
# sudo systemctl restart slurmctld
Increase timeout in configuration:
policy:
  timeout:
    execute: "5m" # Increase if jobs are slow to start
Check resource availability:
$ ssh slurm-head 'sinfo -Nel'
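squeue's %R field prints the scheduler's reason for each pending job, such as (Resources) or (Priority), which usually explains the hold directly:
$ ssh slurm-head 'squeue -t PD -u fuzzball-service -o "%.18i %.9P %.20j %.8u %.6D %R"'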
Symptom:
Error: sbatch: error: Batch job submission failed: Invalid partition name specified
Diagnostic Steps:
List available partitions:
$ ssh slurm-head 'sinfo -s'
Check partition in provisioner definition:
$ fuzzball admin provisioner get slurm-compute-small
Solutions:
Use correct partition name:
definitions: - id: "slurm-compute" provisioner: "slurm" spec: cpu: 4 mem: "8GiB" partition: "compute" # Must match actual partition nameRemove partition specification to use default:
definitions: - id: "slurm-compute" provisioner: "slurm" spec: cpu: 4 mem: "8GiB" # partition not specified = use default
Symptom:
Error: sbatch: error: Batch job submission failed: Invalid qos specification
Diagnostic Steps:
List available QoS:
$ ssh slurm-head 'sacctmgr show qos format=name,priority'
Check user’s allowed QoS:
$ ssh slurm-head 'sacctmgr show assoc where user=fuzzball-service format=qos'
Solutions:
Use valid QoS:
spec:
  orchestrator:
    provisioner:
      slurm:
        options:
          qos: "normal" # Must be a valid QoS
Remove QoS specification to use default:
spec:
  orchestrator:
    provisioner:
      slurm:
        options:
          # qos not specified = use default
Symptom:
Error: fuzzball-substrate: command not found
Diagnostic Steps:
Check if Substrate is installed:
$ ssh slurm-head 'ssh <compute-node> which fuzzball-substrate'
Check compute node PATH:
$ ssh slurm-head 'ssh <compute-node> echo $PATH'
Find Substrate binary:
$ ssh slurm-head 'ssh <compute-node> find / -name fuzzball-substrate 2>/dev/null'
Solutions:
Install Substrate on all compute nodes. On each compute node:
# sudo cp fuzzball-substrate /usr/local/bin/
# sudo chmod +x /usr/local/bin/fuzzball-substrate
Use absolute path in script: Modify the Substrate invocation if needed
Add to PATH for all users. On compute nodes, add to /etc/environment:
PATH=/usr/local/bin:/usr/bin:/bin
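With SSH access to the compute nodes, a loop can push the binary everywhere at once. A sketch, assuming illustrative node names and sudo rights on each node:
$ for node in node01 node02 node03; do
    scp fuzzball-substrate "$node":/tmp/ &&
    ssh "$node" 'sudo install -m 755 /tmp/fuzzball-substrate /usr/local/bin/'
  done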
Symptom:
Error: sudo: sorry, user fuzzball-service is not allowed to execute '/usr/local/bin/fuzzball-substrate serve' as root
Diagnostic Steps:
Check sudo permissions:
$ ssh slurm-head 'ssh <compute-node> sudo -l'
Check sudoers configuration:
$ ssh slurm-head 'ssh <compute-node> sudo cat /etc/sudoers.d/fuzzball-service'
Solutions:
Add sudo permission for Substrate. On all compute nodes, create /etc/sudoers.d/fuzzball-service:
# echo "fuzzball-service ALL=(ALL) NOPASSWD: /usr/local/bin/fuzzball-substrate" | sudo tee /etc/sudoers.d/fuzzball-service # sudo chmod 440 /etc/sudoers.d/fuzzball-serviceUse wildcards for flexibility. Allow Substrate with any arguments:
fuzzball-service ALL=(ALL) NOPASSWD: /usr/local/bin/fuzzball-substrate *Validate sudoers file:
# sudo visudo -c
Symptom: Slurm job starts but Substrate never connects to Orchestrate
Diagnostic Steps:
Check if Substrate process is running:
$ ssh slurm-head 'ssh <compute-node> ps aux | grep fuzzball-substrate'
Check Substrate logs:
$ ssh slurm-head 'cat slurm-<job-id>.out'
Check network connectivity:
$ ssh slurm-head 'ssh <compute-node> ping <orchestrator-host>'
$ ssh slurm-head 'ssh <compute-node> nc -zv <orchestrator-host> <orchestrator-port>'
Check firewall rules on compute node:
# sudo iptables -L -n -v | grep <orchestrator-port>
Solutions:
Allow outbound connections from compute nodes. On compute nodes:
# sudo firewall-cmd --permanent --add-port=<orchestrator-port>/tcp
# sudo firewall-cmd --reload
Check Orchestrate endpoint in Substrate configuration:
$ ssh slurm-head 'ssh <compute-node> cat /etc/fuzzball/substrate.yaml'
Verify DNS resolution:
$ ssh slurm-head 'ssh <compute-node> nslookup <orchestrator-host>'
Use IP address if DNS is problematic. In Substrate configuration:
orchestrator:
  endpoint: "192.168.1.10:8080" # Use IP instead of hostname
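To confirm the endpoint the Substrate will actually use, run this on the compute node to extract it from the config and test it directly. A sketch, assuming the substrate.yaml layout shown above:
$ ep=$(awk '/endpoint:/ {gsub(/"/,""); print $2}' /etc/fuzzball/substrate.yaml)
$ nc -zv "${ep%:*}" "${ep#*:}"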
Symptom:
Error: No matching provisioner definition found for resource requirements
Diagnostic Steps:
Check workflow resource requirements:
$ cat workflow.yaml | grep -A 5 "resource:"List provisioner definitions:
$ fuzzball admin provisioner listCompare requirements to available resources:
$ fuzzball admin provisioner get slurm-compute-small
Solutions:
Reduce workflow resource requests:
resource:
  cpu:
    cores: 4 # Reduced from 64
  memory:
    size: "8GiB" # Reduced from 512GiB
Create larger provisioner definition:
definitions:
  - id: "slurm-compute-xlarge"
    provisioner: "slurm"
    spec:
      cpu: 64
      mem: "512GiB"
      partition: "himem"
Use Slurm node features to target appropriate nodes:
definitions:
  - id: "slurm-himem"
    provisioner: "slurm"
    spec:
      cpu: 32
      mem: "256GiB"
      constraint: "himem"
Symptom: Job killed by Slurm with “Out of memory” error
Diagnostic Steps:
Check actual memory usage:
$ ssh slurm-head 'sacct -j <job-id> --format=JobID,MaxRSS,MaxVMSize'
Check memory limit:
$ ssh slurm-head 'scontrol show job <job-id> | grep Mem'
Solutions:
Increase memory in workflow:
resource: memory: size: "32GiB" # Increased from 16GiBCreate provisioner definition with more memory:
definitions: - id: "slurm-highmem" provisioner: "slurm" spec: cpu: 16 mem: "128GiB" partition: "himem"Optimize application memory usage: Review application logs and optimize code
Symptom: Jobs running much slower than expected
Diagnostic Steps:
Check node load:
$ ssh slurm-head 'ssh <compute-node> uptime'
Check other jobs on node:
$ ssh slurm-head 'squeue -w <compute-node> -o "%.18i %.9P %.20j %.8u %.2t %.10M"'
Check CPU allocation:
$ ssh slurm-head 'sacct -j <job-id> --format=JobID,AllocCPUS,CPUTime,Elapsed'
Solutions:
Use CPU affinity:
resource:
  cpu:
    cores: 16
    affinity: "SOCKET" # Bind to CPU sockets
Use exclusive node allocation (if needed):
spec:
  orchestrator:
    provisioner:
      slurm:
        options:
          exclusive: "true"
Request specific node features:
definitions:
  - id: "slurm-fast"
    provisioner: "slurm"
    spec:
      cpu: 32
      mem: "64GiB"
      constraint: "haswell|broadwell" # Newer CPUs
Symptom: Workflow status remains PENDING indefinitely
Diagnostic Steps:
Check workflow details:
$ fuzzball workflow get <workflow-id>
Check provisioner status:
$ fuzzball admin provisioner list
Check Slurm queue:
$ ssh slurm-head 'squeue -u fuzzball-service'
Check orchestrator logs:
$ kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator --tail=100
Solutions:
Check for provisioner definition mismatch: Ensure workflow resources match available definitions
Wait for Slurm resources: Jobs may be queued waiting for compute nodes
Cancel and resubmit:
$ fuzzball workflow cancel <workflow-id>
$ fuzzball workflow submit --file workflow.yaml
Symptom: Dependent jobs start before their dependencies complete
Diagnostic Steps:
Check job dependencies in workflow:
$ cat workflow.yaml | grep -A 5 "depends_on"Check job execution order:
$ fuzzball workflow jobs <workflow-id>
Solutions:
Verify dependency specification:
jobs:
  job-a:
    name: "first-job"
    # ... job definition
  job-b:
    name: "second-job"
    depends_on:
      - job-a # Must match job ID, not name
    # ... job definition
Use consistent job identifiers: Ensure dependency references use correct job IDs
Symptom: Jobs are killed before completion with timeout errors
Diagnostic Steps:
Check provisioner TTL setting:
$ fuzzball admin provisioner get slurm-compute-small | grep ttl
Check job timeout policy:
$ cat workflow.yaml | grep -A 3 "timeout:"
Check actual job runtime:
$ ssh slurm-head 'sacct -j <job-id> --format=JobID,Elapsed,Timelimit'
Solutions:
Increase provisioner TTL:
definitions: - id: "slurm-compute-long" provisioner: "slurm" ttl: 28800 # Increase to 8 hours spec: cpu: 8 mem: "16GiB"Reduce job timeout to fit within TTL:
jobs: my-job: policy: timeout: execute: "1h" # Ensure it's less than provisioner TTL
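Note that ttl in these definitions is expressed in seconds while job timeouts are duration strings, so the two are easy to misalign. A quick sanity check (the 8-hour figure matches the example above):
$ echo $(( 8 * 3600 ))   # 28800 seconds = the ttl above
# Keep the job's execute timeout below this, leaving headroom for node startup and teardown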
Symptom:
Error: Failed to pull container image: authentication required
Diagnostic Steps:
Check image exists:
$ docker pull <image-uri>
Check registry credentials:
$ fuzzball secret list | grep registry
Check Substrate image pull logs:
$ ssh slurm-head 'cat slurm-<job-id>.out | grep -i "pull"'
Solutions:
Use public images for testing:
image: uri: "docker://alpine:3.16" # Public imageConfigure registry credentials:
$ fuzzball secret create registry-creds \ --username <username> \ --password <password> \ --server registry.example.comUse image pull secrets:
image: uri: "docker://registry.example.com/myapp:v1" pullSecret: "registry-creds"
Symptom: Long delay between job submission and execution
Diagnostic Steps:
Check Slurm queue time:
$ ssh slurm-head 'sacct -j <job-id> --format=JobID,Submit,Start,Elapsed'
Check scheduler logs:
$ ssh slurm-head 'sudo tail -f /var/log/slurm/slurmctld.log'
Check compute node availability:
$ ssh slurm-head 'sinfo -Nel'
Solutions:
Use priority QoS:
spec: orchestrator: provisioner: slurm: options: qos: "high"Reserve nodes for Fuzzball. Create reservation:
$ ssh slurm-head 'sudo scontrol create reservation \ starttime=now duration=infinite \ user=fuzzball-service nodes=node[01-10] \ reservationname=fuzzball'Use reservation:
spec: orchestrator: provisioner: slurm: options: reservation: "fuzzball"
Symptom: Substrate spends long time pulling images
Solutions:
Use local registry mirror:
image: uri: "docker://local-mirror.example.com/alpine:3.16"Pre-pull common images on all compute nodes:
# sudo apptainer pull docker://alpine:3.16 # sudo apptainer pull docker://ubuntu:22.04Use smaller base images. Use alpine instead of ubuntu:
image: uri: "docker://alpine:3.16" # ~5MB instead of ubuntu:22.04 (~77MB)
Enable debug logging. Orchestrator:
# In FuzzballOrchestrate CRD
spec:
orchestrator:
logLevel: "debug"
Substrate: On compute nodes, edit the Substrate config:
# cat <<EOF | sudo tee /etc/fuzzball/substrate.yaml
logging:
level: debug
EOF
Create a script to gather diagnostic data:
#!/bin/bash
# diagnostic-collect.sh
echo "=== Fuzzball Configuration ==="
kubectl get fuzzballorchestrate fuzzball -n fuzzball-system -o yaml
echo "=== Orchestrate Logs ==="
kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator --tail=500
echo "=== Provisioner Definitions ==="
fuzzball admin provisioner list
echo "=== Slurm Status ==="
ssh slurm-head 'scontrol show config | head -20'
ssh slurm-head 'sinfo'
ssh slurm-head 'squeue'
echo "=== Recent Slurm Jobs ==="
ssh slurm-head 'sacct -u fuzzball-service --starttime $(date -d "1 hour ago" +%Y-%m-%d) --format=JobID,JobName,State,ExitCode,Elapsed'
echo "=== Compute Node Status ==="
ssh slurm-head 'scontrol show nodes'
echo "=== Recent Workflows ==="
fuzzball workflow list --limit 10
Test Slurm integration manually with a simple batch job (note the quoted heredoc delimiter, so $(hostname) and $(date) run inside the job rather than at creation time):
$ cat > test.sh << 'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
echo "Hostname: $(hostname)"
echo "Date: $(date)"
echo "CPU info:"
lscpu | head -20
echo "Memory info:"
free -h
echo "Substrate check:"
which fuzzball-substrate
fuzzball-substrate --version
EOF
$ scp test.sh slurm-head:
$ ssh slurm-head 'sbatch test.sh'
If you’re unable to resolve an issue:
Collect diagnostic information using the script above
Check Fuzzball documentation at https://docs.fuzzball.io
Review Slurm documentation at https://slurm.schedmd.com
Contact CIQ support with:
- Problem description
- Steps to reproduce
- Diagnostic information
- Configuration files (with sensitive data redacted)
- Relevant log excerpts
Community resources:
- Fuzzball GitHub: https://github.com/ctrliq/fuzzball
- CIQ contact: https://ciq.com/company/contact-us/
Quick reference:
| Error Message | Likely Cause | Solution |
|---|---|---|
| connection refused | SSH port blocked or wrong | Check firewall and SSH configuration |
| connection timeout | Connection timeout too short | Increase connectionTimeout in config |
| authentication failed | Wrong credentials | Verify username/password or SSH key |
| command not found | Slurm not in PATH | Set binaryPath in configuration |
| Invalid partition | Wrong partition name | Check available partitions with sinfo |
| Access denied | No Slurm permissions | Add user to Slurm database |
| No matching provisioner | Resource mismatch | Create appropriate provisioner definition |
| Out of memory | Insufficient memory | Increase memory in workflow or definition |
| timeout / TIMEOUT state | Job exceeded TTL or timeout | Increase provisioner TTL or reduce job timeout |
| DUE TO TIME LIMIT | Slurm walltime exceeded | Check provisioner TTL and job timeout settings |
PBS
Symptom:
Error: Failed to connect to SSH host: dial tcp: connection refused
Diagnostic Steps:
Verify SSH host is reachable:
$ ping pbs-head.example.com
Check SSH port:
$ nc -zv pbs-head.example.com 22
Test SSH connection manually:
$ ssh fuzzball-service@pbs-head.example.com
Check orchestrator logs:
$ kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator | grep -i "ssh"
Check connection timeout setting:
$ kubectl get fuzzballorchestrate fuzzball -n fuzzball-system -o yaml | grep connectionTimeout
Solutions:
Firewall blocking: Ensure firewall allows SSH connections from Orchestrate on PBS head node
# firewall-cmd --permanent --add-service=ssh
# firewall-cmd --reload
Wrong SSH port: Verify sshPort in configuration matches the actual SSH port. Check the SSH port on the head node:
# netstat -tlnp | grep sshd
DNS resolution: Use an IP address instead of a hostname if DNS is unavailable
sshHost: "192.168.1.100" # Use IP directly
Connection timeout too short: Increase the timeout for high-latency environments
spec:
  orchestrator:
    provisioner:
      pbs:
        connectionTimeout: 90 # Increase to 90 seconds
Symptom:
Error: ssh: handshake failed: ssh: unable to authenticate
Diagnostic Steps:
Verify credentials for password auth (enter password manually when prompted):
$ ssh fuzzball-service@pbs-head.example.com
Or for key auth:
$ ssh -i /path/to/key fuzzball-service@pbs-head.example.com
Check SSH logs on head node (RHEL/CentOS):
# tail -f /var/log/secure
Or on Ubuntu/Debian:
# tail -f /var/log/auth.log
Verify public key is installed:
$ ssh fuzzball-service@pbs-head.example.com 'cat ~/.ssh/authorized_keys'
Solutions:
Password expired: Update password on PBS head node
# passwd fuzzball-service
Public key not installed: Copy public key to authorized_keys
$ ssh-copy-id -i fuzzball-key.pub fuzzball-service@pbs-head.example.com
Wrong key format: Ensure the private key is in PEM format. Convert to PEM if needed:
$ ssh-keygen -p -m PEM -f fuzzball-key
Key permissions: Fix file permissions
$ chmod 600 ~/.ssh/id_rsa
$ chmod 644 ~/.ssh/id_rsa.pub
$ chmod 700 ~/.ssh
Symptom:
Error: ssh: handshake failed: ssh: host key mismatch
Diagnostic Steps:
Get current host key:
$ ssh-keyscan pbs-head.example.com
Compare with configuration: Check whether sshHostPublicKey in the FuzzballOrchestrate CRD matches
Solutions:
Update host key in configuration with the output from ssh-keyscan
Disable host key verification (not recommended for production):
spec:
  orchestrator:
    provisioner:
      pbs:
        skipHostKeyVerification: true
Symptom:
Error: qsub: command not found
Diagnostic Steps:
Check if PBS commands exist:
$ ssh fuzzball-service@pbs-head.example.com 'which qsub'
$ ssh fuzzball-service@pbs-head.example.com 'which qstat'
$ ssh fuzzball-service@pbs-head.example.com 'which qdel'
Check user’s PATH:
$ ssh fuzzball-service@pbs-head.example.com 'echo $PATH'
Find PBS binaries:
$ ssh fuzzball-service@pbs-head.example.com 'sudo find / -name qsub 2>/dev/null'
Solutions:
Set binaryPath in configuration:
spec:
  orchestrator:
    provisioner:
      pbs:
        binaryPath: "/opt/pbs/bin"
Add to user PATH: On the PBS head node, add to ~/.bashrc or ~/.profile:
export PATH=$PATH:/opt/pbs/bin
Create symlinks (if admin):
# sudo ln -s /opt/pbs/bin/qsub /usr/local/bin/qsub
# sudo ln -s /opt/pbs/bin/qstat /usr/local/bin/qstat
# sudo ln -s /opt/pbs/bin/qdel /usr/local/bin/qdel
Symptom:
Error: qsub: Unauthorized Request
Diagnostic Steps:
Check user permissions:
$ ssh fuzzball-service@pbs-head.example.com 'qstat -B'
Verify queue access:
$ ssh fuzzball-service@pbs-head.example.com 'qstat -Q'
Check PBS server configuration:
$ ssh fuzzball-service@pbs-head.example.com 'qmgr -c "print server"'
Solutions:
Add user to PBS allowed users:
# sudo qmgr -c "set server authorized_users += fuzzball-service@*"Grant queue access:
# sudo qmgr -c "set queue workq enabled = True" # sudo qmgr -c "set queue workq started = True"Check queue ACLs (should show allowed users or be disabled):
# sudo qmgr -c "print queue workq"
Symptom: Workflow status stuck in PENDING, no error messages
Diagnostic Steps:
Check PBS server status:
$ ssh pbs-head 'qstat -B'
Check pending jobs:
$ ssh pbs-head 'qstat -i'
Check PBS server logs:
$ ssh pbs-head 'sudo tail -f /var/spool/pbs/server_logs/$(date +%Y%m%d)'
Check Orchestrate logs:
$ kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator | grep -A 10 "ProvisionSubstrate"
Solutions:
Restart PBS server (if unresponsive):
# sudo systemctl restart pbs
Increase timeout in configuration:
policy:
  timeout:
    execute: "5m" # Increase if jobs are slow to start
Check resource availability:
$ ssh pbs-head 'pbsnodes -a'
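Rather than scanning the full pbsnodes -a output, these narrower views surface problem nodes quickly:
$ ssh pbs-head 'pbsnodes -l'    # only nodes that are down, offline, or in an unknown state
$ ssh pbs-head 'pbsnodes -aS'   # one-line state and resource summary per node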
Symptom:
Error: qsub: Unknown queue
Diagnostic Steps:
List available queues:
$ ssh pbs-head 'qstat -Q'
Check queue in provisioner definition:
$ fuzzball admin provisioner get pbs-workq-small
Solutions:
Use correct queue name:
definitions: - id: "pbs-compute" provisioner: "pbs" spec: cpu: 4 mem: "8GiB" queue: "workq" # Must match actual queue nameRemove queue specification to use default:
definitions: - id: "pbs-compute" provisioner: "pbs" spec: cpu: 4 mem: "8GiB" # queue not specified = use default
Symptom:
Error: fuzzball-substrate: command not found
Diagnostic Steps:
Check if Substrate is installed:
$ ssh pbs-head 'ssh <compute-node> which fuzzball-substrate'
Check compute node PATH:
$ ssh pbs-head 'ssh <compute-node> echo $PATH'
Find Substrate binary:
$ ssh pbs-head 'ssh <compute-node> find / -name fuzzball-substrate 2>/dev/null'
Solutions:
Install Substrate on all compute nodes. On each compute node:
# sudo cp fuzzball-substrate /usr/local/bin/
# sudo chmod +x /usr/local/bin/fuzzball-substrate
Use absolute path in script: Modify the Substrate invocation if needed
Add to PATH for all users. On compute nodes, add to /etc/environment:
PATH=/usr/local/bin:/usr/bin:/bin
Symptom:
Error: sudo: sorry, user fuzzball-service is not allowed to execute '/usr/local/bin/fuzzball-substrate serve' as root
Diagnostic Steps:
Check sudo permissions:
$ ssh pbs-head 'ssh <compute-node> sudo -l'
Check sudoers configuration:
$ ssh pbs-head 'ssh <compute-node> sudo cat /etc/sudoers.d/fuzzball-service'
Solutions:
Add sudo permission for Substrate. On all compute nodes, create /etc/sudoers.d/fuzzball-service:
# echo "fuzzball-service ALL=(ALL) NOPASSWD: /usr/local/bin/fuzzball-substrate" | sudo tee /etc/sudoers.d/fuzzball-service # sudo chmod 440 /etc/sudoers.d/fuzzball-serviceUse wildcards for flexibility. Allow Substrate with any arguments:
fuzzball-service ALL=(ALL) NOPASSWD: /usr/local/bin/fuzzball-substrate *Validate sudoers file:
# sudo visudo -c
Symptom: PBS job starts but Substrate never connects to Orchestrate
Diagnostic Steps:
Check if Substrate process is running:
$ ssh pbs-head 'ssh <compute-node> ps aux | grep fuzzball-substrate'
Check Substrate logs:
$ ssh pbs-head 'cat pbs-<job-id>.out'
Check network connectivity:
$ ssh pbs-head 'ssh <compute-node> ping <orchestrator-host>'
$ ssh pbs-head 'ssh <compute-node> nc -zv <orchestrator-host> <orchestrator-port>'
Check firewall rules on compute node:
# sudo iptables -L -n -v | grep <orchestrator-port>
Solutions:
Allow outbound connections from compute nodes. On compute nodes:
# sudo firewall-cmd --permanent --add-port=<orchestrator-port>/tcp
# sudo firewall-cmd --reload
Check Orchestrate endpoint in Substrate configuration:
$ ssh pbs-head 'ssh <compute-node> cat /etc/fuzzball/substrate.yaml'
Verify DNS resolution:
$ ssh pbs-head 'ssh <compute-node> nslookup <orchestrator-host>'
Use IP address if DNS is problematic. In Substrate configuration:
orchestrator:
  endpoint: "192.168.1.10:8080" # Use IP instead of hostname
Symptom:
Error: No matching provisioner definition found for resource requirements
Diagnostic Steps:
Check workflow resource requirements:
$ cat workflow.yaml | grep -A 5 "resource:"List provisioner definitions:
$ fuzzball admin provisioner listCompare requirements to available resources:
$ fuzzball admin provisioner get pbs-workq-small
Solutions:
Reduce workflow resource requests:
resource:
  cpu:
    cores: 4 # Reduced from 64
  memory:
    size: "8GiB" # Reduced from 512GiB
Create larger provisioner definition:
definitions:
  - id: "pbs-workq-xlarge"
    provisioner: "pbs"
    spec:
      cpu: 64
      mem: "512GiB"
      queue: "bigmem"
Use PBS node features to target appropriate nodes:
definitions:
  - id: "pbs-himem"
    provisioner: "pbs"
    spec:
      cpu: 32
      mem: "256GiB"
      select: "mem>256GB"
Symptom: Job killed by PBS with “Out of memory” error
Diagnostic Steps:
Check actual memory usage:
$ ssh pbs-head 'qstat -f <job-id> | grep resources_used.mem'
Check memory limit:
$ ssh pbs-head 'qstat -f <job-id> | grep Resource_List.mem'
Solutions:
Increase memory in workflow:
resource: memory: size: "32GiB" # Increased from 16GiBCreate provisioner definition with more memory:
definitions: - id: "pbs-highmem" provisioner: "pbs" spec: cpu: 16 mem: "128GiB" queue: "bigmem"Optimize application memory usage: Review application logs and optimize code
Symptom: Jobs running much slower than expected
Diagnostic Steps:
Check node load:
$ ssh pbs-head 'ssh <compute-node> uptime'
Check other jobs on node:
$ ssh pbs-head 'qstat -n | grep <compute-node>'
Check CPU allocation:
$ ssh pbs-head 'qstat -f <job-id> | grep resources_used.cpupercent'
Solutions:
Use CPU affinity:
resource:
  cpu:
    cores: 16
    affinity: "SOCKET" # Bind to CPU sockets
Use exclusive node allocation (if needed):
definitions:
  - id: "pbs-exclusive"
    provisioner: "pbs"
    exclusive: "job"
    spec:
      cpu: 32
      mem: "64GiB"
      queue: "workq"
Request specific node features:
definitions:
  - id: "pbs-fast"
    provisioner: "pbs"
    spec:
      cpu: 32
      mem: "64GiB"
      select: "host=node[10-20]"
Symptom: Workflow status remains PENDING indefinitely
Diagnostic Steps:
Check workflow details:
$ fuzzball workflow get <workflow-id>
Check provisioner status:
$ fuzzball admin provisioner list
Check PBS queue:
$ ssh pbs-head 'qstat -u fuzzball-service'
Check Orchestrate logs:
$ kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator --tail=100
Solutions:
Check for provisioner definition mismatch: Ensure workflow resources match available definitions
Wait for PBS resources: Jobs may be queued waiting for compute nodes
Cancel and resubmit:
$ fuzzball workflow cancel <workflow-id>
$ fuzzball workflow submit --file workflow.yaml
Symptom: Dependent jobs start before their dependencies complete
Diagnostic Steps:
Check job dependencies in workflow:
$ cat workflow.yaml | grep -A 5 "depends_on"Check job execution order:
$ fuzzball workflow jobs <workflow-id>
Solutions:
Verify dependency specification:
jobs:
  job-a:
    name: "first-job"
    # ... job definition
  job-b:
    name: "second-job"
    depends_on:
      - job-a # Must match job ID, not name
    # ... job definition
Use consistent job identifiers: Ensure dependency references use correct job IDs
Symptom: Jobs are killed before completion with timeout errors
Diagnostic Steps:
Check provisioner TTL setting:
$ fuzzball admin provisioner get pbs-workq-small | grep ttl
Check job timeout policy:
$ cat workflow.yaml | grep -A 3 "timeout:"
Check actual job runtime:
$ ssh pbs-head 'qstat -f <job-id> | grep resources_used.walltime'
Solutions:
Increase provisioner TTL:
definitions: - id: "pbs-workq-long" provisioner: "pbs" ttl: 28800 # Increase to 8 hours spec: cpu: 8 mem: "16GiB"Reduce job timeout to fit within TTL:
jobs: my-job: policy: timeout: execute: "1h" # Ensure it's less than provisioner TTL
Symptom:
Error: Failed to pull container image: authentication required
Diagnostic Steps:
Check image exists:
$ docker pull <image-uri>
Check registry credentials:
$ fuzzball secret list | grep registry
Check Substrate image pull logs:
$ ssh pbs-head 'cat pbs-<job-id>.out | grep -i "pull"'
Solutions:
Use public images for testing:
image: uri: "docker://alpine:3.16" # Public imageConfigure registry credentials:
$ fuzzball secret create registry-creds \ --username <username> \ --password <password> \ --server registry.example.comUse image pull secrets:
image: uri: "docker://registry.example.com/myapp:v1" pullSecret: "registry-creds"
Symptom: Long delay between job submission and execution
Diagnostic Steps:
Check PBS queue time:
$ ssh pbs-head 'qstat -f <job-id> | grep -E "qtime|stime"'
Check PBS server logs:
$ ssh pbs-head 'sudo tail -f /var/spool/pbs/server_logs/$(date +%Y%m%d)'
Check compute node availability:
$ ssh pbs-head 'pbsnodes -a'
Solutions:
Use priority queues:
spec:
  orchestrator:
    provisioner:
      pbs:
        defaultQueue: "priority" # Use higher priority queue
Reserve nodes for Fuzzball. Create PBS reservation (pbs_rsub expects datetimes in [[[[CC]YY]MM]DD]hhmm format, not epoch seconds):
$ ssh pbs-head 'sudo pbs_rsub -U fuzzball-service -N fuzzball -l select=10 -R $(date +%Y%m%d%H%M) -E $(date -d "+1 year" +%Y%m%d%H%M)'
Use reservation:
spec:
  orchestrator:
    provisioner:
      pbs:
        options:
          reservation: "fuzzball"
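After creating the reservation, confirm it exists and note its name for the configuration:
$ ssh pbs-head 'pbs_rstat'      # brief listing of all reservations
$ ssh pbs-head 'pbs_rstat -F'   # full details, including ACLs and assigned nodes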
Symptom: Substrate spends long time pulling images
Solutions:
Use local registry mirror:
image: uri: "docker://local-mirror.example.com/alpine:3.16"Pre-pull common images on all compute nodes:
# sudo apptainer pull docker://alpine:3.16 # sudo apptainer pull docker://ubuntu:22.04Use smaller base images. Use alpine instead of ubuntu:
image: uri: "docker://alpine:3.16" # ~5MB instead of ubuntu:22.04 (~77MB)
Enable debug logging. Orchestrator:
# In FuzzballOrchestrate CRD
spec:
orchestrator:
logLevel: "debug"
Substrate: On compute nodes, edit the Substrate config:
# cat <<EOF | sudo tee /etc/fuzzball/substrate.yaml
logging:
level: debug
EOF
Create a script to gather diagnostic data:
#!/bin/bash
# diagnostic-collect.sh
echo "=== Fuzzball Configuration ==="
kubectl get fuzzballorchestrate fuzzball -n fuzzball-system -o yaml
echo "=== Orchestrate Logs ==="
kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator --tail=500
echo "=== Provisioner Definitions ==="
fuzzball admin provisioner list
echo "=== PBS Status ==="
ssh pbs-head 'qstat -B'
ssh pbs-head 'qstat -Q'
ssh pbs-head 'pbsnodes -a'
echo "=== Recent PBS Jobs ==="
ssh pbs-head 'qstat -x -u fuzzball-service'
echo "=== Compute Node Status ==="
ssh pbs-head 'pbsnodes -aSv'
echo "=== Recent Workflows ==="
fuzzball workflow list --limit 10
Test PBS integration manually with a simple batch job (note the quoted heredoc delimiter, so $(hostname) and $(date) run inside the job rather than at creation time):
$ cat > test.sh << 'EOF'
#!/bin/bash
#PBS -l select=1:ncpus=1
echo "Hostname: $(hostname)"
echo "Date: $(date)"
echo "CPU info:"
lscpu | head -20
echo "Memory info:"
free -h
echo "Substrate check:"
which fuzzball-substrate
fuzzball-substrate --version
EOF
$ scp test.sh pbs-head:
$ ssh pbs-head 'qsub test.sh'
If you’re unable to resolve an issue:
Collect diagnostic information using the script above
Check Fuzzball documentation at https://docs.fuzzball.io
Review PBS documentation at https://www.openpbs.org/documentation
Contact CIQ support with:
- Problem description
- Steps to reproduce
- Diagnostic information
- Configuration files (with sensitive data redacted)
- Relevant log excerpts
Community resources:
- Fuzzball GitHub: https://github.com/ctrliq/fuzzball
- CIQ contact: https://ciq.com/company/contact-us/
Quick reference:
| Error Message | Likely Cause | Solution |
|---|---|---|
| connection refused | SSH port blocked or wrong | Check firewall and SSH configuration |
| connection timeout | Connection timeout too short | Increase connectionTimeout in config |
| authentication failed | Wrong credentials | Verify username/password or SSH key |
| command not found | PBS not in PATH | Set binaryPath in configuration |
| Unknown queue | Wrong queue name | Check available queues with qstat -Q |
| Unauthorized Request | No PBS permissions | Add user to PBS authorized users |
| No matching provisioner | Resource mismatch | Create appropriate provisioner definition |
| Out of memory | Insufficient memory | Increase memory in workflow or definition |
| timeout / TIMEOUT state | Job exceeded TTL or timeout | Increase provisioner TTL or reduce job timeout |
| walltime exceeded | PBS walltime exceeded | Check provisioner TTL and job timeout settings |