
Troubleshooting

This guide covers common issues you may encounter when using Fuzzball with Slurm or PBS integration and provides detailed troubleshooting steps and solutions.

Troubleshooting information for Slurm is provided first, followed by the equivalent information for PBS. Skip to the part of this page that matches your environment.

Slurm Troubleshooting

Connection Issues

SSH Connection Failed

Symptom:

Error: Failed to connect to SSH host: dial tcp: connection refused

Diagnostic Steps:

  1. Verify SSH host is reachable:

    $ ping slurm-head.example.com
  2. Check SSH port:

    $ nc -zv slurm-head.example.com 22
  3. Test SSH connection manually:

    $ ssh fuzzball-service@slurm-head.example.com
  4. Check Orchestrate logs:

    $ kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator | grep -i "ssh"
  5. Check connection timeout setting:

    $ kubectl get fuzzballorchestrate fuzzball -n fuzzball-system -o yaml | grep connectionTimeout

Solutions:

  • Firewall blocking: Ensure the firewall on the Slurm head node allows SSH connections from Orchestrate

    # firewall-cmd --permanent --add-service=ssh
    
    # firewall-cmd --reload
  • Wrong SSH port: Verify sshPort in configuration matches actual SSH port. Check SSH port on head node:

    # netstat -tlnp | grep sshd
  • DNS resolution: Use IP address instead of hostname if DNS is unavailable

    sshHost: "192.168.1.100"  # Use IP directly
    
  • Connection timeout too short: Increase timeout for high-latency environments

    spec:
      orchestrator:
        provisioner:
          slurm:
            connectionTimeout: 90  # Increase to 90 seconds
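
If you want to open SSH only to the Orchestrate host rather than zone-wide, firewalld rich rules can restrict the source address. A sketch, with 192.168.1.10 as a placeholder for the address Orchestrate connects from:

    # firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.168.1.10/32" service name="ssh" accept'

    # firewall-cmd --reload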
    

Authentication Failed

Symptom:

Error: ssh: handshake failed: ssh: unable to authenticate

Diagnostic Steps:

  1. Verify credentials for password auth (enter password manually when prompted):

    $ ssh fuzzball-service@slurm-head.example.com

    Or for key auth:

    $ ssh -i /path/to/key fuzzball-service@slurm-head.example.com
  2. Check SSH logs on head node (RHEL/CentOS):

    # tail -f /var/log/secure

    Or on Ubuntu/Debian:

    # tail -f /var/log/auth.log
  3. Verify public key is installed:

    $ ssh fuzzball-service@slurm-head.example.com 'cat ~/.ssh/authorized_keys'

Solutions:

  • Password expired: Update password on Slurm head node

    # passwd fuzzball-service
  • Public key not installed: Copy public key to authorized_keys

    $ ssh-copy-id -i fuzzball-key.pub fuzzball-service@slurm-head.example.com
  • Wrong key format: Ensure private key is in PEM format. Convert to PEM if needed:

    $ ssh-keygen -p -m PEM -f fuzzball-key
  • Key permissions: Fix file permissions

    $ chmod 600 ~/.ssh/id_rsa
    
    $ chmod 644 ~/.ssh/id_rsa.pub
    
    $ chmod 700 ~/.ssh
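
If you need to provision the key pair from scratch, the following sketch generates a passphrase-less PEM-format key and installs it, reusing the fuzzball-key filename from the examples above (confirm that a passphrase-less key is acceptable for your site):

    $ ssh-keygen -t rsa -b 4096 -m PEM -N "" -C "fuzzball-service" -f fuzzball-key

    $ ssh-copy-id -i fuzzball-key.pub fuzzball-service@slurm-head.example.com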

Host Key Verification Failed

Symptom:

Error: host key mismatch for slurm-head.example.com

Diagnostic Steps:

  1. Get current host key:

    $ ssh-keyscan -t rsa slurm-head.example.com
  2. Compare with configured key:

    $ kubectl get fuzzballorchestrate fuzzball -n fuzzball-system -o yaml | grep sshHostPublicKey

Solutions:

  • Update host key in configuration:

    spec:
      orchestrator:
        provisioner:
          slurm:
            sshHostPublicKey: "slurm-head.example.com ssh-rsa AAAAB3NzaC1yc2E..."
    
  • Temporary workaround (not recommended for production):

    spec:
      orchestrator:
        provisioner:
          slurm:
            skipHostKeyVerification: true
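
Rather than disabling verification, you can refresh the configured key in place. A sketch using kubectl patch, assuming the CRD field path shown above (inspect the ssh-keyscan output before applying it):

    $ HOSTKEY=$(ssh-keyscan -t rsa slurm-head.example.com 2>/dev/null)

    $ kubectl patch fuzzballorchestrate fuzzball -n fuzzball-system --type merge \
        -p "{\"spec\":{\"orchestrator\":{\"provisioner\":{\"slurm\":{\"sshHostPublicKey\":\"$HOSTKEY\"}}}}}"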
    

Slurm Command Issues

Command Not Found

Symptom:

Error: sbatch: command not found

Diagnostic Steps:

  1. Check if Slurm commands exist:

    $ ssh fuzzball-service@slurm-head.example.com 'which sbatch'
    
    $ ssh fuzzball-service@slurm-head.example.com 'which squeue'
    
    $ ssh fuzzball-service@slurm-head.example.com 'which scancel'
  2. Check user’s PATH:

    $ ssh fuzzball-service@slurm-head.example.com 'echo $PATH'
  3. Find Slurm binaries:

    $ ssh fuzzball-service@slurm-head.example.com 'sudo find / -name sbatch 2>/dev/null'

Solutions:

  • Set binaryPath in configuration:

    spec:
      orchestrator:
        provisioner:
          slurm:
            binaryPath: "/opt/slurm/bin"
    
  • Add to user PATH: on the Slurm head node, add the following to ~/.bashrc or ~/.profile:

    export PATH=$PATH:/opt/slurm/bin
    
  • Create symlinks (if admin):

    # sudo ln -s /opt/slurm/bin/sbatch /usr/local/bin/sbatch
    
    # sudo ln -s /opt/slurm/bin/squeue /usr/local/bin/squeue
    
    # sudo ln -s /opt/slurm/bin/scancel /usr/local/bin/scancel
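
After setting binaryPath, a quick loop can confirm that the commands checked above are all present. A sketch, assuming the /opt/slurm/bin example path:

    $ for cmd in sbatch squeue scancel; do
          ssh fuzzball-service@slurm-head.example.com "test -x /opt/slurm/bin/$cmd && echo OK: $cmd || echo MISSING: $cmd"
      done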

Permission Denied

Symptom:

Error: sbatch: error: Access denied

Diagnostic Steps:

  1. Check user permissions:

    $ ssh fuzzball-service@slurm-head.example.com 'sacctmgr show user fuzzball-service'
  2. Verify account access:

    $ ssh fuzzball-service@slurm-head.example.com 'sacctmgr show assoc where user=fuzzball-service'
  3. Check partition access:

    $ ssh fuzzball-service@slurm-head.example.com 'sinfo -p compute'

Solutions:

  • Add user to Slurm database:

    # sudo sacctmgr add user fuzzball-service account=default
  • Grant access to partition:

    # sudo sacctmgr add account default partition=compute
  • Check AllowAccounts in slurm.conf (should show AllowAccounts=ALL or include your account):

    # sudo grep "PartitionName=compute" /etc/slurm/slurm.conf

Job Submission Issues

Job Submission Hangs

Symptom: Workflow status stuck in PENDING, no error messages

Diagnostic Steps:

  1. Check Slurm controller status:

    $ ssh slurm-head 'scontrol ping'
  2. Check pending jobs:

    $ ssh slurm-head 'squeue -t PD'
  3. Check controller logs:

    $ ssh slurm-head 'sudo tail -f /var/log/slurm/slurmctld.log'
  4. Check Orchestrate logs:

    $ kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator | grep -A 10 "ProvisionSubstrate"

Solutions:

  • Restart Slurm controller (if unresponsive):

    # sudo systemctl restart slurmctld
  • Increase timeout in configuration:

    policy:
      timeout:
        execute: "5m"  # Increase if jobs are slow to start
    
  • Check resource availability:

    $ ssh slurm-head 'sinfo -Nel'
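
The reason column from squeue usually explains why a submission is sitting in the queue (for example Resources, Priority, or ReqNodeNotAvail):

    $ ssh slurm-head 'squeue -u fuzzball-service -t PD -o "%.18i %.9P %.8u %.2t %.10M %R"'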

Job Fails with Invalid Partition

Symptom:

Error: sbatch: error: Batch job submission failed: Invalid partition name specified

Diagnostic Steps:

  1. List available partitions:

    $ ssh slurm-head 'sinfo -s'
  2. Check partition in provisioner definition:

    $ fuzzball admin provisioner get slurm-compute-small

Solutions:

  • Use correct partition name:

    definitions:
      - id: "slurm-compute"
        provisioner: "slurm"
        spec:
          cpu: 4
          mem: "8GiB"
          partition: "compute"  # Must match actual partition name
    
  • Remove partition specification to use default:

    definitions:
      - id: "slurm-compute"
        provisioner: "slurm"
        spec:
          cpu: 4
          mem: "8GiB"
          # partition not specified = use default
    

Job Fails with Invalid QoS

Symptom:

Error: sbatch: error: Batch job submission failed: Invalid qos specification

Diagnostic Steps:

  1. List available QoS:

    $ ssh slurm-head 'sacctmgr show qos format=name,priority'
  2. Check user’s allowed QoS:

    $ ssh slurm-head 'sacctmgr show assoc where user=fuzzball-service format=qos'

Solutions:

  • Use valid QoS:

    spec:
      orchestrator:
        provisioner:
          slurm:
            options:
              qos: "normal"  # Must be a valid QoS
    
  • Remove QoS specification to use default:

    spec:
      orchestrator:
        provisioner:
          slurm:
            options:
              # qos not specified = use default
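
If the service user has simply not been granted the QoS, an administrator can add it to the user's association and verify the result. A sketch using sacctmgr (adjust the QoS name to your site):

    # sudo sacctmgr modify user where name=fuzzball-service set qos+=normal

    $ sacctmgr show assoc where user=fuzzball-service format=user,qos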
    

Substrate Issues

Substrate Binary Not Found

Symptom:

Error: fuzzball-substrate: command not found

Diagnostic Steps:

  1. Check if Substrate is installed:

    $ ssh slurm-head 'ssh <compute-node> which fuzzball-substrate'
  2. Check compute node PATH:

    $ ssh slurm-head 'ssh <compute-node> printenv PATH'
  3. Find Substrate binary:

    $ ssh slurm-head 'ssh <compute-node> find / -name fuzzball-substrate 2>/dev/null'

Solutions:

  • Install Substrate on all compute nodes. On each compute node:

    # sudo cp fuzzball-substrate /usr/local/bin/
    
    # sudo chmod +x /usr/local/bin/fuzzball-substrate
  • Use absolute path in script: Modify Substrate invocation if needed

  • Add to PATH for all users. On compute nodes, add to /etc/environment:

    PATH=/usr/local/bin:/usr/bin:/bin
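
On clusters without a configuration management tool, a simple loop can push the binary to every compute node. A sketch only, with node01 through node03 as placeholder hostnames and root SSH access assumed; adapt to your tooling:

    $ for node in node01 node02 node03; do
          scp fuzzball-substrate root@$node:/usr/local/bin/
          ssh root@$node 'chmod 755 /usr/local/bin/fuzzball-substrate'
      done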
    

Substrate Permission Denied

Symptom:

Error: sudo: sorry, user fuzzball-service is not allowed to execute '/usr/local/bin/fuzzball-substrate serve' as root

Diagnostic Steps:

  1. Check sudo permissions:

    $ ssh slurm-head 'ssh <compute-node> sudo -l'
  2. Check sudoers configuration:

    $ ssh slurm-head 'ssh <compute-node> sudo cat /etc/sudoers.d/fuzzball-service'

Solutions:

  • Add sudo permission for Substrate. On all compute nodes, create /etc/sudoers.d/fuzzball-service:

    # echo "fuzzball-service ALL=(ALL) NOPASSWD: /usr/local/bin/fuzzball-substrate" | sudo tee /etc/sudoers.d/fuzzball-service
    
    # sudo chmod 440 /etc/sudoers.d/fuzzball-service
  • Use wildcards for flexibility. Allow Substrate with any arguments:

    fuzzball-service ALL=(ALL) NOPASSWD: /usr/local/bin/fuzzball-substrate *
    
  • Validate sudoers file:

    # sudo visudo -c
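
To confirm the rule works non-interactively (as a provisioned job would need), run sudo with the -n flag, which fails immediately if a password would be required:

    $ ssh slurm-head 'ssh <compute-node> sudo -n /usr/local/bin/fuzzball-substrate --version'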

Substrate Not Registering

Symptom: Slurm job starts but Substrate never connects to Orchestrate

Diagnostic Steps:

  1. Check if Substrate process is running:

    $ ssh slurm-head 'ssh <compute-node> ps aux | grep fuzzball-substrate'
  2. Check Substrate logs:

    $ ssh slurm-head 'cat slurm-<job-id>.out'
  3. Check network connectivity:

    $ ssh slurm-head 'ssh <compute-node> ping <orchestrator-host>'
    
    $ ssh slurm-head 'ssh <compute-node> nc -zv <orchestrator-host> <orchestrator-port>'
  4. Check firewall rules on compute node:

    # sudo iptables -L -n -v | grep <orchestrator-port>

Solutions:

  • Allow outbound connections from compute nodes. On compute nodes:

    # sudo firewall-cmd --permanent --add-port=<orchestrator-port>/tcp
    
    # sudo firewall-cmd --reload
  • Check Orchestrate endpoint in Substrate configuration:

    $ ssh slurm-head 'ssh <compute-node> cat /etc/fuzzball/substrate.yaml'
  • Verify DNS resolution:

    $ ssh slurm-head 'ssh <compute-node> nslookup <orchestrator-host>'
  • Use IP address if DNS is problematic. In Substrate configuration:

    orchestrator:
      endpoint: "192.168.1.10:8080"  # Use IP instead of hostname
    

Resource Issues

Insufficient Resources

Symptom:

Error: No matching provisioner definition found for resource requirements

Diagnostic Steps:

  1. Check workflow resource requirements:

    $ cat workflow.yaml | grep -A 5 "resource:"
  2. List provisioner definitions:

    $ fuzzball admin provisioner list
  3. Compare requirements to available resources:

    $ fuzzball admin provisioner get slurm-compute-small

Solutions:

  • Reduce workflow resource requests:

    resource:
      cpu:
        cores: 4  # Reduced from 64
      memory:
        size: "8GiB"  # Reduced from 512GiB
    
  • Create larger provisioner definition:

    definitions:
      - id: "slurm-compute-xlarge"
        provisioner: "slurm"
        spec:
          cpu: 64
          mem: "512GiB"
          partition: "himem"
    
  • Use Slurm node features to target appropriate nodes:

    definitions:
      - id: "slurm-himem"
        provisioner: "slurm"
        spec:
          cpu: 32
          mem: "256GiB"
          constraint: "himem"
    

Memory Limit Exceeded

Symptom: Job killed by Slurm with “Out of memory” error

Diagnostic Steps:

  1. Check actual memory usage:

    $ ssh slurm-head 'sacct -j <job-id> --format=JobID,MaxRSS,MaxVMSize'
  2. Check memory limit:

    $ ssh slurm-head 'scontrol show job <job-id> | grep Mem'

Solutions:

  • Increase memory in workflow:

    resource:
      memory:
        size: "32GiB"  # Increased from 16GiB
    
  • Create provisioner definition with more memory:

    definitions:
      - id: "slurm-highmem"
        provisioner: "slurm"
        spec:
          cpu: 16
          mem: "128GiB"
          partition: "himem"
    
  • Optimize application memory usage: Review application logs and optimize code

CPU Contention

Symptom: Jobs running much slower than expected

Diagnostic Steps:

  1. Check node load:

    $ ssh slurm-head 'ssh <compute-node> uptime'
  2. Check other jobs on node:

    $ ssh slurm-head 'squeue -w <compute-node> -o "%.18i %.9P %.20j %.8u %.2t %.10M"'
  3. Check CPU allocation:

    $ ssh slurm-head 'sacct -j <job-id> --format=JobID,AllocCPUS,CPUTime,Elapsed'

Solutions:

  • Use CPU affinity:

    resource:
      cpu:
        cores: 16
        affinity: "SOCKET"  # Bind to CPU sockets
    
  • Use exclusive node allocation (if needed):

    spec:
      orchestrator:
        provisioner:
          slurm:
            options:
              exclusive: "true"
    
  • Request specific node features:

    definitions:
      - id: "slurm-fast"
        provisioner: "slurm"
        spec:
          cpu: 32
          mem: "64GiB"
          constraint: "haswell|broadwell"  # Newer CPUs
    

Workflow Issues

Workflow Stuck in PENDING

Symptom: Workflow status remains PENDING indefinitely

Diagnostic Steps:

  1. Check workflow details:

    $ fuzzball workflow get <workflow-id>
  2. Check provisioner status:

    $ fuzzball admin provisioner list
  3. Check Slurm queue:

    $ ssh slurm-head 'squeue -u fuzzball-service'
  4. Check orchestrator logs:

    $ kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator --tail=100

Solutions:

  • Check for provisioner definition mismatch: Ensure workflow resources match available definitions

  • Wait for Slurm resources: Jobs may be queued waiting for compute nodes

  • Cancel and resubmit:

    $ fuzzball workflow cancel <workflow-id>
    
    $ fuzzball workflow submit --file workflow.yaml

Jobs Execute Out of Order

Symptom: Dependent jobs start before their dependencies complete

Diagnostic Steps:

  1. Check job dependencies in workflow:

    $ cat workflow.yaml | grep -A 5 "depends_on"
  2. Check job execution order:

    $ fuzzball workflow jobs <workflow-id>

Solutions:

  • Verify dependency specification:

    jobs:
      job-a:
        name: "first-job"
        # ... job definition
    
      job-b:
        name: "second-job"
        depends_on:
          - job-a  # Must match job ID, not name
        # ... job definition
    
  • Use consistent job identifiers: Ensure dependency references use correct job IDs

Job Terminated by TTL

Symptom: Jobs are killed before completion with timeout errors

Diagnostic Steps:

  1. Check provisioner TTL setting:

    $ fuzzball admin provisioner get slurm-compute-small | grep ttl
  2. Check job timeout policy:

    $ cat workflow.yaml | grep -A 3 "timeout:"
  3. Check actual job runtime:

    $ ssh slurm-head 'sacct -j <job-id> --format=JobID,Elapsed,Timelimit'

Solutions:

  • Increase provisioner TTL:

    definitions:
      - id: "slurm-compute-long"
        provisioner: "slurm"
        ttl: 28800  # Increase to 8 hours
        spec:
          cpu: 8
          mem: "16GiB"
    
  • Reduce job timeout to fit within TTL:

    jobs:
      my-job:
        policy:
          timeout:
            execute: "1h"  # Ensure it's less than provisioner TTL
    

Container Pull Failures

Symptom:

Error: Failed to pull container image: authentication required

Diagnostic Steps:

  1. Check image exists:

    $ docker pull <image-uri>
  2. Check registry credentials:

    $ fuzzball secret list | grep registry
  3. Check Substrate image pull logs:

    $ ssh slurm-head 'cat slurm-<job-id>.out | grep -i "pull"'

Solutions:

  • Use public images for testing:

    image:
      uri: "docker://alpine:3.16"  # Public image
    
  • Configure registry credentials:

    $ fuzzball secret create registry-creds \
      --username <username> \
      --password <password> \
      --server registry.example.com
  • Use image pull secrets:

    image:
      uri: "docker://registry.example.com/myapp:v1"
      pullSecret: "registry-creds"
    

Performance Issues

Slow Job Startup

Symptom: Long delay between job submission and execution

Diagnostic Steps:

  1. Check Slurm queue time:

    $ ssh slurm-head 'sacct -j <job-id> --format=JobID,Submit,Start,Elapsed'
  2. Check scheduler logs:

    $ ssh slurm-head 'sudo tail -f /var/log/slurm/slurmctld.log'
  3. Check compute node availability:

    $ ssh slurm-head 'sinfo -Nel'

Solutions:

  • Use priority QoS:

    spec:
      orchestrator:
        provisioner:
          slurm:
            options:
              qos: "high"
    
  • Reserve nodes for Fuzzball. Create reservation:

    $ ssh slurm-head 'sudo scontrol create reservation \
      starttime=now duration=infinite \
      user=fuzzball-service nodes=node[01-10] \
      reservationname=fuzzball'

    Use reservation:

    spec:
      orchestrator:
        provisioner:
          slurm:
            options:
              reservation: "fuzzball"
    

Slow Container Image Pulls

Symptom: Substrate spends a long time pulling images

Solutions:

  • Use local registry mirror:

    image:
      uri: "docker://local-mirror.example.com/alpine:3.16"
    
  • Pre-pull common images on all compute nodes:

    # sudo apptainer pull docker://alpine:3.16
    
    # sudo apptainer pull docker://ubuntu:22.04
  • Use smaller base images. Use alpine instead of ubuntu:

    image:
      uri: "docker://alpine:3.16"  # ~5MB instead of ubuntu:22.04 (~77MB)
    

Debugging Tools

Enable Debug Logging

Orchestrator:

# In FuzzballOrchestrate CRD
spec:
  orchestrator:
    logLevel: "debug"

Substrate: on compute nodes, edit the Substrate config:

# cat <<EOF | sudo tee /etc/fuzzball/substrate.yaml
logging:
  level: debug
EOF
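
One way to apply the log level without editing the full CRD, followed by a restart in case the change is not picked up automatically (a sketch; your operator may reconcile the change on its own):

    $ kubectl patch fuzzballorchestrate fuzzball -n fuzzball-system --type merge \
        -p '{"spec":{"orchestrator":{"logLevel":"debug"}}}'

    $ kubectl rollout restart -n fuzzball-system deployment/fuzzball-orchestrator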

Collect Diagnostic Information

Create a script to gather diagnostic data:

#!/bin/bash
# diagnostic-collect.sh

echo "=== Fuzzball Configuration ==="
kubectl get fuzzballorchestrate fuzzball -n fuzzball-system -o yaml

echo "=== Orchestrate Logs ==="
kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator --tail=500

echo "=== Provisioner Definitions ==="
fuzzball admin provisioner list

echo "=== Slurm Status ==="
ssh slurm-head 'scontrol show config | head -20'
ssh slurm-head 'sinfo'
ssh slurm-head 'squeue'

echo "=== Recent Slurm Jobs ==="
ssh slurm-head 'sacct -u fuzzball-service --starttime $(date -d "1 hour ago" +%Y-%m-%dT%H:%M:%S) --format=JobID,JobName,State,ExitCode,Elapsed'

echo "=== Compute Node Status ==="
ssh slurm-head 'scontrol show nodes'

echo "=== Recent Workflows ==="
fuzzball workflow list --limit 10

Interactive Debugging

Test Slurm integration interactively:

$ cat > test.sh << 'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

echo "Hostname: $(hostname)"
echo "Date: $(date)"
echo "CPU info:"
lscpu | head -20
echo "Memory info:"
free -h
echo "Substrate check:"
which fuzzball-substrate
fuzzball-substrate --version
EOF

$ scp test.sh slurm-head:

$ ssh slurm-head 'sbatch test.sh'
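
Once the job is submitted, check its state and read the output file (the slurm-<job-id>.out name matches the Substrate log checks earlier in this guide):

    $ ssh slurm-head 'squeue -u fuzzball-service'

    $ ssh slurm-head 'cat slurm-<job-id>.out'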

Getting Help

If you’re unable to resolve an issue:

  1. Collect diagnostic information using the script above

  2. Check Fuzzball documentation at https://docs.fuzzball.io

  3. Review Slurm documentation at https://slurm.schedmd.com

  4. Contact CIQ support with:

    • Problem description
    • Steps to reproduce
    • Diagnostic information
    • Configuration files (with sensitive data redacted)
    • Relevant log excerpts
  5. Community resources:

Common Error Messages

Error Message               Likely Cause                    Solution
connection refused          SSH port blocked or wrong       Check firewall and SSH configuration
connection timeout          Connection timeout too short    Increase connectionTimeout in config
authentication failed       Wrong credentials               Verify username/password or SSH key
command not found           Slurm not in PATH               Set binaryPath in configuration
Invalid partition           Wrong partition name            Check available partitions with sinfo
Access denied               No Slurm permissions            Add user to Slurm database
No matching provisioner     Resource mismatch               Create appropriate provisioner definition
Out of memory               Insufficient memory             Increase memory in workflow or definition
timeout / TIMEOUT state     Job exceeded TTL or timeout     Increase provisioner TTL or reduce job timeout
DUE TO TIME LIMIT           Slurm walltime exceeded         Check provisioner TTL and job timeout settings

PBS Troubleshooting

Connection Issues

SSH Connection Failed

Symptom:

Error: Failed to connect to SSH host: dial tcp: connection refused

Diagnostic Steps:

  1. Verify SSH host is reachable:

    $ ping pbs-head.example.com
  2. Check SSH port:

    $ nc -zv pbs-head.example.com 22
  3. Test SSH connection manually:

    $ ssh fuzzball-service@pbs-head.example.com
  4. Check orchestrator logs:

    $ kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator | grep -i "ssh"
  5. Check connection timeout setting:

    $ kubectl get fuzzballorchestrate fuzzball -n fuzzball-system -o yaml | grep connectionTimeout

Solutions:

  • Firewall blocking: Ensure the firewall on the PBS head node allows SSH connections from Orchestrate

    # firewall-cmd --permanent --add-service=ssh
    
    # firewall-cmd --reload
  • Wrong SSH port: Verify sshPort in configuration matches actual SSH port. Check SSH port on head node:

    # netstat -tlnp | grep sshd
  • DNS resolution: Use IP address instead of hostname if DNS is unavailable

    sshHost: "192.168.1.100"  # Use IP directly
    
  • Connection timeout too short: Increase timeout for high-latency environments

    spec:
      orchestrator:
        provisioner:
          pbs:
            connectionTimeout: 90  # Increase to 90 seconds
    

Authentication Failed

Symptom:

Error: ssh: handshake failed: ssh: unable to authenticate

Diagnostic Steps:

  1. Verify credentials for password auth (enter password manually when prompted):

    $ ssh fuzzball-service@pbs-head.example.com

    Or for key auth:

    $ ssh -i /path/to/key fuzzball-service@pbs-head.example.com
  2. Check SSH logs on head node (RHEL/CentOS):

    # tail -f /var/log/secure

    Or on Ubuntu/Debian:

    # tail -f /var/log/auth.log
  3. Verify public key is installed:

    $ ssh fuzzball-service@pbs-head.example.com 'cat ~/.ssh/authorized_keys'

Solutions:

  • Password expired: Update password on PBS head node

    # passwd fuzzball-service
  • Public key not installed: Copy public key to authorized_keys

    $ ssh-copy-id -i fuzzball-key.pub fuzzball-service@pbs-head.example.com
  • Wrong key format: Ensure private key is in PEM format. Convert to PEM if needed:

    $ ssh-keygen -p -m PEM -f fuzzball-key
  • Key permissions: Fix file permissions

    $ chmod 600 ~/.ssh/id_rsa
    
    $ chmod 644 ~/.ssh/id_rsa.pub
    
    $ chmod 700 ~/.ssh

Host Key Verification Failed

Symptom:

Error: ssh: handshake failed: ssh: host key mismatch

Diagnostic Steps:

  1. Get current host key:

    $ ssh-keyscan pbs-head.example.com
  2. Compare with configuration: Check if sshHostPublicKey in FuzzballOrchestrate CRD matches

Solutions:

  • Update host key in configuration with output from ssh-keyscan

  • Disable host key verification (not recommended for production):

    spec:
      orchestrator:
        provisioner:
          pbs:
            skipHostKeyVerification: true
    

Command Issues

Command Not Found

Symptom:

Error: qsub: command not found

Diagnostic Steps:

  1. Check if PBS commands exist:

    $ ssh fuzzball-service@pbs-head.example.com 'which qsub'
    
    $ ssh fuzzball-service@pbs-head.example.com 'which qstat'
    
    $ ssh fuzzball-service@pbs-head.example.com 'which qdel'
  2. Check user’s PATH:

    $ ssh fuzzball-service@pbs-head.example.com 'echo $PATH'
  3. Find PBS binaries:

    $ ssh fuzzball-service@pbs-head.example.com 'sudo find / -name qsub 2>/dev/null'

Solutions:

  • Set binaryPath in configuration:

    spec:
      orchestrator:
        provisioner:
          pbs:
            binaryPath: "/opt/pbs/bin"
    
  • Add to user PATH: on the PBS head node, add the following to ~/.bashrc or ~/.profile:

    export PATH=$PATH:/opt/pbs/bin
    
  • Create symlinks (if admin):

    # sudo ln -s /opt/pbs/bin/qsub /usr/local/bin/qsub
    
    # sudo ln -s /opt/pbs/bin/qstat /usr/local/bin/qstat
    
    # sudo ln -s /opt/pbs/bin/qdel /usr/local/bin/qdel

Permission Denied

Symptom:

Error: qsub: Unauthorized Request

Diagnostic Steps:

  1. Check user permissions:

    $ ssh fuzzball-service@pbs-head.example.com 'qstat -B'
  2. Verify queue access:

    $ ssh fuzzball-service@pbs-head.example.com 'qstat -Q'
  3. Check PBS server configuration:

    $ ssh fuzzball-service@pbs-head.example.com 'qmgr -c "print server"'

Solutions:

  • Add user to PBS allowed users:

    # sudo qmgr -c "set server authorized_users += fuzzball-service@*"
  • Grant queue access:

    # sudo qmgr -c "set queue workq enabled = True"
    
    # sudo qmgr -c "set queue workq started = True"
  • Check queue ACLs (should show allowed users or be disabled):

    # sudo qmgr -c "print queue workq"

Job Submission Issues

Job Submission Hangs

Symptom: Workflow status stuck in PENDING, no error messages

Diagnostic Steps:

  1. Check PBS server status:

    $ ssh pbs-head 'qstat -B'
  2. Check pending jobs:

    $ ssh pbs-head 'qstat -i'
  3. Check PBS server logs:

    $ ssh pbs-head 'sudo tail -f /var/spool/pbs/server_logs/$(date +%Y%m%d)'
  4. Check Orchestrate logs:

    $ kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator | grep -A 10 "ProvisionSubstrate"

Solutions:

  • Restart PBS server (if unresponsive):

    # sudo systemctl restart pbs
  • Increase timeout in configuration:

    policy:
      timeout:
        execute: "5m"  # Increase if jobs are slow to start
    
  • Check resource availability:

    $ ssh pbs-head 'pbsnodes -a'
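
Also confirm that scheduling is enabled on the server and that the PBS daemons are healthy (a quick sketch; the attribute names come from standard qstat -Bf output):

    $ ssh pbs-head 'qstat -Bf | grep -Ei "server_state|scheduling"'

    $ ssh pbs-head 'sudo systemctl status pbs'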

Job Fails with Invalid Queue

Symptom:

Error: qsub: Unknown queue

Diagnostic Steps:

  1. List available queues:

    $ ssh pbs-head 'qstat -Q'
  2. Check queue in provisioner definition:

    $ fuzzball admin provisioner get pbs-workq-small

Solutions:

  • Use correct queue name:

    definitions:
      - id: "pbs-compute"
        provisioner: "pbs"
        spec:
          cpu: 4
          mem: "8GiB"
          queue: "workq"  # Must match actual queue name
    
  • Remove queue specification to use default:

    definitions:
      - id: "pbs-compute"
        provisioner: "pbs"
        spec:
          cpu: 4
          mem: "8GiB"
          # queue not specified = use default
    

Substrate Issues

Substrate Binary Not Found

Symptom:

Error: fuzzball-substrate: command not found

Diagnostic Steps:

  1. Check if Substrate is installed:

    $ ssh pbs-head 'ssh <compute-node> which fuzzball-substrate'
  2. Check compute node PATH:

    $ ssh pbs-head 'ssh <compute-node> printenv PATH'
  3. Find Substrate binary:

    $ ssh pbs-head 'ssh <compute-node> find / -name fuzzball-substrate 2>/dev/null'

Solutions:

  • Install Substrate on all compute nodes. On each compute node:

    # sudo cp fuzzball-substrate /usr/local/bin/
    
    # sudo chmod +x /usr/local/bin/fuzzball-substrate
  • Use absolute path in script: Modify Substrate invocation if needed

  • Add to PATH for all users. On compute nodes, add to /etc/environment:

    PATH=/usr/local/bin:/usr/bin:/bin
    

Substrate Permission Denied

Symptom:

Error: sudo: sorry, user fuzzball-service is not allowed to execute '/usr/local/bin/fuzzball-substrate serve' as root

Diagnostic Steps:

  1. Check sudo permissions:

    $ ssh pbs-head 'ssh <compute-node> sudo -l'
  2. Check sudoers configuration:

    $ ssh pbs-head 'ssh <compute-node> sudo cat /etc/sudoers.d/fuzzball-service'

Solutions:

  • Add sudo permission for Substrate. On all compute nodes, create /etc/sudoers.d/fuzzball-service:

    # echo "fuzzball-service ALL=(ALL) NOPASSWD: /usr/local/bin/fuzzball-substrate" | sudo tee /etc/sudoers.d/fuzzball-service
    
    # sudo chmod 440 /etc/sudoers.d/fuzzball-service
  • Use wildcards for flexibility. Allow Substrate with any arguments:

    fuzzball-service ALL=(ALL) NOPASSWD: /usr/local/bin/fuzzball-substrate *
    
  • Validate sudoers file:

    # sudo visudo -c

Substrate Not Registering

Symptom: PBS job starts but Substrate never connects to Orchestrate

Diagnostic Steps:

  1. Check if Substrate process is running:

    $ ssh pbs-head 'ssh <compute-node> ps aux | grep fuzzball-substrate'
  2. Check Substrate logs:

    $ ssh pbs-head 'cat pbs-<job-id>.out'
  3. Check network connectivity:

    $ ssh pbs-head 'ssh <compute-node> ping <orchestrator-host>'
    
    $ ssh pbs-head 'ssh <compute-node> nc -zv <orchestrator-host> <orchestrator-port>'
  4. Check firewall rules on compute node:

    # sudo iptables -L -n -v | grep <orchestrator-port>

Solutions:

  • Allow outbound connections from compute nodes. On compute nodes:

    # sudo firewall-cmd --permanent --add-port=<orchestrator-port>/tcp
    
    # sudo firewall-cmd --reload
  • Check Orchestrate endpoint in Substrate configuration:

    $ ssh pbs-head 'ssh <compute-node> cat /etc/fuzzball/substrate.yaml'
  • Verify DNS resolution:

    $ ssh pbs-head 'ssh <compute-node> nslookup <orchestrator-host>'
  • Use IP address if DNS is problematic. In Substrate configuration:

    orchestrator:
      endpoint: "192.168.1.10:8080"  # Use IP instead of hostname
    

Resource Issues

Insufficient Resources

Symptom:

Error: No matching provisioner definition found for resource requirements

Diagnostic Steps:

  1. Check workflow resource requirements:

    $ cat workflow.yaml | grep -A 5 "resource:"
  2. List provisioner definitions:

    $ fuzzball admin provisioner list
  3. Compare requirements to available resources:

    $ fuzzball admin provisioner get pbs-workq-small

Solutions:

  • Reduce workflow resource requests:

    resource:
      cpu:
        cores: 4  # Reduced from 64
      memory:
        size: "8GiB"  # Reduced from 512GiB
    
  • Create larger provisioner definition:

    definitions:
      - id: "pbs-workq-xlarge"
        provisioner: "pbs"
        spec:
          cpu: 64
          mem: "512GiB"
          queue: "bigmem"
    
  • Use PBS node features to target appropriate nodes:

    definitions:
      - id: "pbs-himem"
        provisioner: "pbs"
        spec:
          cpu: 32
          mem: "256GiB"
          select: "mem>256GB"
    

Memory Limit Exceeded

Symptom: Job killed by PBS with “Out of memory” error

Diagnostic Steps:

  1. Check actual memory usage:

    $ ssh pbs-head 'qstat -f <job-id> | grep resources_used.mem'
  2. Check memory limit:

    $ ssh pbs-head 'qstat -f <job-id> | grep Resource_List.mem'

Solutions:

  • Increase memory in workflow:

    resource:
      memory:
        size: "32GiB"  # Increased from 16GiB
    
  • Create provisioner definition with more memory:

    definitions:
      - id: "pbs-highmem"
        provisioner: "pbs"
        spec:
          cpu: 16
          mem: "128GiB"
          queue: "bigmem"
    
  • Optimize application memory usage: Review application logs and optimize code

CPU Contention

Symptom: Jobs running much slower than expected

Diagnostic Steps:

  1. Check node load:

    $ ssh pbs-head 'ssh <compute-node> uptime'
  2. Check other jobs on node:

    $ ssh pbs-head 'qstat -n | grep <compute-node>'
  3. Check CPU allocation:

    $ ssh pbs-head 'qstat -f <job-id> | grep resources_used.cpupercent'

Solutions:

  • Use CPU affinity:

    resource:
      cpu:
        cores: 16
        affinity: "SOCKET"  # Bind to CPU sockets
    
  • Use exclusive node allocation (if needed):

    definitions:
      - id: "pbs-exclusive"
        provisioner: "pbs"
        exclusive: "job"
        spec:
          cpu: 32
          mem: "64GiB"
          queue: "workq"
    
  • Request specific node features:

    definitions:
      - id: "pbs-fast"
        provisioner: "pbs"
        spec:
          cpu: 32
          mem: "64GiB"
          select: "host=node[10-20]"
    

Workflow Issues

Workflow Stuck in PENDING

Symptom: Workflow status remains PENDING indefinitely

Diagnostic Steps:

  1. Check workflow details:

    $ fuzzball workflow get <workflow-id>
  2. Check provisioner status:

    $ fuzzball admin provisioner list
  3. Check PBS queue:

    $ ssh pbs-head 'qstat -u fuzzball-service'
  4. Check Orchestrate logs:

    $ kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator --tail=100

Solutions:

  • Check for provisioner definition mismatch: Ensure workflow resources match available definitions

  • Wait for PBS resources: Jobs may be queued waiting for compute nodes

  • Cancel and resubmit:

    $ fuzzball workflow cancel <workflow-id>
    
    $ fuzzball workflow submit --file workflow.yaml

Jobs Execute Out of Order

Symptom: Dependent jobs start before their dependencies complete

Diagnostic Steps:

  1. Check job dependencies in workflow:

    $ cat workflow.yaml | grep -A 5 "depends_on"
  2. Check job execution order:

    $ fuzzball workflow jobs <workflow-id>

Solutions:

  • Verify dependency specification:

    jobs:
      job-a:
        name: "first-job"
        # ... job definition
    
      job-b:
        name: "second-job"
        depends_on:
          - job-a  # Must match job ID, not name
        # ... job definition
    
  • Use consistent job identifiers: Ensure dependency references use correct job IDs

Job Terminated by TTL

Symptom: Jobs are killed before completion with timeout errors

Diagnostic Steps:

  1. Check provisioner TTL setting:

    $ fuzzball admin provisioner get pbs-workq-small | grep ttl
  2. Check job timeout policy:

    $ cat workflow.yaml | grep -A 3 "timeout:"
  3. Check actual job runtime:

    $ ssh pbs-head 'qstat -f <job-id> | grep resources_used.walltime'

Solutions:

  • Increase provisioner TTL:

    definitions:
      - id: "pbs-workq-long"
        provisioner: "pbs"
        ttl: 28800  # Increase to 8 hours
        spec:
          cpu: 8
          mem: "16GiB"
    
  • Reduce job timeout to fit within TTL:

    jobs:
      my-job:
        policy:
          timeout:
            execute: "1h"  # Ensure it's less than provisioner TTL
    

Container Pull Failures

Symptom:

Error: Failed to pull container image: authentication required

Diagnostic Steps:

  1. Check image exists:

    $ docker pull <image-uri>
  2. Check registry credentials:

    $ fuzzball secret list | grep registry
  3. Check Substrate image pull logs:

    $ ssh pbs-head 'cat pbs-<job-id>.out | grep -i "pull"'

Solutions:

  • Use public images for testing:

    image:
      uri: "docker://alpine:3.16"  # Public image
    
  • Configure registry credentials:

    $ fuzzball secret create registry-creds \
      --username <username> \
      --password <password> \
      --server registry.example.com
  • Use image pull secrets:

    image:
      uri: "docker://registry.example.com/myapp:v1"
      pullSecret: "registry-creds"
    

Performance Issues

Slow Job Startup

Symptom: Long delay between job submission and execution

Diagnostic Steps:

  1. Check PBS queue time:

    $ ssh pbs-head 'qstat -f <job-id> | grep -E "qtime|stime"'
  2. Check PBS server logs:

    $ ssh pbs-head 'sudo tail -f /var/spool/pbs/server_logs/$(date +%Y%m%d)'
  3. Check compute node availability:

    $ ssh pbs-head 'pbsnodes -a'

Solutions:

  • Use priority queues:

    spec:
      orchestrator:
        provisioner:
          pbs:
            defaultQueue: "priority"  # Use higher priority queue
    
  • Reserve nodes for Fuzzball. Create PBS reservation:

    $ ssh pbs-head 'sudo pbs_rsub -U fuzzball-service -N fuzzball -l select=10 -R $(date +%s) -E $(date -d "+1 year" +%s)'

    Use reservation:

    spec:
      orchestrator:
        provisioner:
          pbs:
            options:
              reservation: "fuzzball"
    

Slow Container Image Pulls

Symptom: Substrate spends a long time pulling images

Solutions:

  • Use local registry mirror:

    image:
      uri: "docker://local-mirror.example.com/alpine:3.16"
    
  • Pre-pull common images on all compute nodes:

    # sudo apptainer pull docker://alpine:3.16
    
    # sudo apptainer pull docker://ubuntu:22.04
  • Use smaller base images. Use alpine instead of ubuntu:

    image:
      uri: "docker://alpine:3.16"  # ~5MB instead of ubuntu:22.04 (~77MB)
    

Debugging Tools

Enable Debug Logging

Orchestrator:

# In FuzzballOrchestrate CRD
spec:
  orchestrator:
    logLevel: "debug"

Substrate: on compute nodes, edit the Substrate config:

# cat <<EOF | sudo tee /etc/fuzzball/substrate.yaml
logging:
  level: debug
EOF

Collect Diagnostic Information

Create a script to gather diagnostic data:

#!/bin/bash
# diagnostic-collect.sh

echo "=== Fuzzball Configuration ==="
kubectl get fuzzballorchestrate fuzzball -n fuzzball-system -o yaml

echo "=== Orchestrate Logs ==="
kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator --tail=500

echo "=== Provisioner Definitions ==="
fuzzball admin provisioner list

echo "=== PBS Status ==="
ssh pbs-head 'qstat -B'
ssh pbs-head 'qstat -Q'
ssh pbs-head 'pbsnodes -a'

echo "=== Recent PBS Jobs ==="
ssh pbs-head 'qstat -x -u fuzzball-service'

echo "=== Compute Node Status ==="
ssh pbs-head 'pbsnodes -aSv'

echo "=== Recent Workflows ==="
fuzzball workflow list --limit 10

Interactive Debugging

Test PBS integration interactively:

$ cat << 'EOF' > test.sh
#!/bin/bash
#PBS -l select=1:ncpus=1

echo "Hostname: $(hostname)"
echo "Date: $(date)"
echo "CPU info:"
lscpu | head -20
echo "Memory info:"
free -h
echo "Substrate check:"
which fuzzball-substrate
fuzzball-substrate --version
EOF

$ scp test.sh pbs-head:

$ ssh pbs-head 'qsub test.sh'
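
Once the job finishes, check its record and read the output file. By default PBS names stdout <script-name>.o<sequence-number>; confirm the exact path with the job's Output_Path attribute if it differs at your site:

    $ ssh pbs-head 'qstat -x <job-id>'

    $ ssh pbs-head 'cat test.sh.o<job-id>'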

Getting Help

If you’re unable to resolve an issue:

  1. Collect diagnostic information using the script above

  2. Check Fuzzball documentation at https://docs.fuzzball.io

  3. Review PBS documentation at https://www.openpbs.org/documentation

  4. Contact CIQ support with:

    • Problem description
    • Steps to reproduce
    • Diagnostic information
    • Configuration files (with sensitive data redacted)
    • Relevant log excerpts
  5. Community resources:

Common Error Messages

Error Message               Likely Cause                    Solution
connection refused          SSH port blocked or wrong       Check firewall and SSH configuration
connection timeout          Connection timeout too short    Increase connectionTimeout in config
authentication failed       Wrong credentials               Verify username/password or SSH key
command not found           PBS not in PATH                 Set binaryPath in configuration
Unknown queue               Wrong queue name                Check available queues with qstat -Q
Unauthorized Request        No PBS permissions              Add user to PBS authorized users
No matching provisioner     Resource mismatch               Create appropriate provisioner definition
Out of memory               Insufficient memory             Increase memory in workflow or definition
timeout / TIMEOUT state     Job exceeded TTL or timeout     Increase provisioner TTL or reduce job timeout
walltime exceeded           PBS walltime exceeded           Check provisioner TTL and job timeout settings