
Troubleshooting

This guide covers common issues you may encounter when using Fuzzball with Slurm or PBS integration and provides detailed troubleshooting steps and solutions.

Troubleshooting information for Slurm is provided first, followed by the equivalent information for PBS. Skip to the part of this page that matches your environment.

Slurm Troubleshooting

Connection Issues

SSH Connection Failed

Symptom:

Error: Failed to connect to SSH host: dial tcp: connection refused

Diagnostic Steps:

  1. Verify SSH host is reachable:

    $ ping slurm-head.example.com
  2. Check SSH port:

    $ nc -zv slurm-head.example.com 22
  3. Test SSH connection manually:

    $ ssh fuzzball-service@slurm-head.example.com
  4. Check Orchestrate logs:

    $ kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator | grep -i "ssh"
  5. Check connection timeout setting:

    $ kubectl get fuzzballorchestrate fuzzball -n fuzzball-system -o yaml | grep connectionTimeout

Solutions:

  • Firewall blocking: Ensure the firewall on the Slurm head node allows SSH connections from Orchestrate

    # firewall-cmd --permanent --add-service=ssh
    
    # firewall-cmd --reload
  • Wrong SSH port: Verify sshPort in configuration matches actual SSH port. Check SSH port on head node:

    # netstat -tlnp | grep sshd
  • DNS resolution: Use IP address instead of hostname if DNS is unavailable

    sshHost: "192.168.1.100"  # Use IP directly
    
  • Connection timeout too short: Increase timeout for high-latency environments

    spec:
      orchestrator:
        provisioner:
          slurm:
            connectionTimeout: 90  # Increase to 90 seconds
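
If you want to open SSH only to the Orchestrate host rather than zone-wide, firewalld rich rules can restrict the source address. A sketch, with 192.168.1.10 as a placeholder for the address Orchestrate connects from:

    # firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.168.1.10/32" service name="ssh" accept'

    # firewall-cmd --reload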
    

Authentication Failed

Symptom:

Error: ssh: handshake failed: ssh: unable to authenticate

Diagnostic Steps:

  1. Verify credentials for password auth (enter password manually when prompted):

    $ ssh fuzzball-service@slurm-head.example.com

    Or for key auth:

    $ ssh -i /path/to/key fuzzball-service@slurm-head.example.com
  2. Check SSH logs on head node (RHEL/CentOS):

    # tail -f /var/log/secure

    Or on Ubuntu/Debian:

    # tail -f /var/log/auth.log
  3. Verify public key is installed:

    $ ssh fuzzball-service@slurm-head.example.com 'cat ~/.ssh/authorized_keys'

Solutions:

  • Password expired: Update password on Slurm head node

    # passwd fuzzball-service
  • Public key not installed: Copy public key to authorized_keys

    $ ssh-copy-id -i fuzzball-key.pub fuzzball-service@slurm-head.example.com
  • Wrong key format: Ensure private key is in PEM format. Convert to PEM if needed:

    $ ssh-keygen -p -m PEM -f fuzzball-key
  • Key permissions: Fix file permissions

    $ chmod 600 ~/.ssh/id_rsa
    
    $ chmod 644 ~/.ssh/id_rsa.pub
    
    $ chmod 700 ~/.ssh
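
If you need to provision the key pair from scratch, the following sketch generates a passphrase-less PEM-format key and installs it, reusing the fuzzball-key filename from the examples above (confirm that a passphrase-less key is acceptable for your site):

    $ ssh-keygen -t rsa -b 4096 -m PEM -N "" -C "fuzzball-service" -f fuzzball-key

    $ ssh-copy-id -i fuzzball-key.pub fuzzball-service@slurm-head.example.com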

Host Key Verification Failed

Symptom:

Error: host key mismatch for slurm-head.example.com

Diagnostic Steps:

  1. Get current host key:

    $ ssh-keyscan -t rsa slurm-head.example.com
  2. Compare with configured key:

    $ kubectl get fuzzballorchestrate fuzzball -n fuzzball-system -o yaml | grep sshHostPublicKey

Solutions:

  • Update host key in configuration:

    spec:
      orchestrator:
        provisioner:
          slurm:
            sshHostPublicKey: "slurm-head.example.com ssh-rsa AAAAB3NzaC1yc2E..."
    
  • Temporary workaround (not recommended for production):

    spec:
      orchestrator:
        provisioner:
          slurm:
            skipHostKeyVerification: true
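
Rather than disabling verification, you can refresh the configured key in place. A sketch using kubectl patch, assuming the CRD field path shown above (inspect the ssh-keyscan output before applying it):

    $ HOSTKEY=$(ssh-keyscan -t rsa slurm-head.example.com 2>/dev/null)

    $ kubectl patch fuzzballorchestrate fuzzball -n fuzzball-system --type merge \
        -p "{\"spec\":{\"orchestrator\":{\"provisioner\":{\"slurm\":{\"sshHostPublicKey\":\"$HOSTKEY\"}}}}}"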
    

Slurm Command Issues

Command Not Found

Symptom:

Error: sbatch: command not found

Diagnostic Steps:

  1. Check if Slurm commands exist:

    $ ssh fuzzball-service@slurm-head.example.com 'which sbatch'
    
    $ ssh fuzzball-service@slurm-head.example.com 'which squeue'
    
    $ ssh fuzzball-service@slurm-head.example.com 'which scancel'
  2. Check user’s PATH:

    $ ssh fuzzball-service@slurm-head.example.com 'echo $PATH'
  3. Find Slurm binaries:

    $ ssh fuzzball-service@slurm-head.example.com 'sudo find / -name sbatch 2>/dev/null'

Solutions:

  • Set binaryPath in configuration:

    spec:
      orchestrator:
        provisioner:
          slurm:
            binaryPath: "/opt/slurm/bin"
    
  • Add to user PATH: on the Slurm head node, add the following to ~/.bashrc or ~/.profile:

    export PATH=$PATH:/opt/slurm/bin
    
  • Create symlinks (if admin):

    # sudo ln -s /opt/slurm/bin/sbatch /usr/local/bin/sbatch
    
    # sudo ln -s /opt/slurm/bin/squeue /usr/local/bin/squeue
    
    # sudo ln -s /opt/slurm/bin/scancel /usr/local/bin/scancel
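
After setting binaryPath, a quick loop can confirm that the commands checked above are all present. A sketch, assuming the /opt/slurm/bin example path:

    $ for cmd in sbatch squeue scancel; do
          ssh fuzzball-service@slurm-head.example.com "test -x /opt/slurm/bin/$cmd && echo OK: $cmd || echo MISSING: $cmd"
      done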

Permission Denied

Symptom:

Error: sbatch: error: Access denied

Diagnostic Steps:

  1. Check user permissions:

    $ ssh fuzzball-service@slurm-head.example.com 'sacctmgr show user fuzzball-service'
  2. Verify account access:

    $ ssh fuzzball-service@slurm-head.example.com 'sacctmgr show assoc where user=fuzzball-service'
  3. Check partition access:

    $ ssh fuzzball-service@slurm-head.example.com 'sinfo -p compute'

Solutions:

  • Add user to Slurm database:

    # sudo sacctmgr add user fuzzball-service account=default
  • Grant access to partition:

    # sudo sacctmgr add account default partition=compute
  • Check AllowAccounts in slurm.conf (should show AllowAccounts=ALL or include your account):

    # sudo grep "PartitionName=compute" /etc/slurm/slurm.conf

Job Submission Issues

Job Submission Hangs

Symptom: Workflow status stuck in PENDING, no error messages

Diagnostic Steps:

  1. Check Slurm controller status:

    $ ssh slurm-head 'scontrol ping'
  2. Check pending jobs:

    $ ssh slurm-head 'squeue -t PD'
  3. Check controller logs:

    $ ssh slurm-head 'sudo tail -f /var/log/slurm/slurmctld.log'
  4. Check Orchestrate logs:

    $ kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator | grep -A 10 "ProvisionSubstrate"

Solutions:

  • Restart Slurm controller (if unresponsive):

    # sudo systemctl restart slurmctld
  • Increase timeout in configuration:

    policy:
      timeout:
        execute: "5m"  # Increase if jobs are slow to start
    
  • Check resource availability:

    $ ssh slurm-head 'sinfo -Nel'
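
The reason column from squeue usually explains why a submission is sitting in the queue (for example Resources, Priority, or ReqNodeNotAvail):

    $ ssh slurm-head 'squeue -u fuzzball-service -t PD -o "%.18i %.9P %.8u %.2t %.10M %R"'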

Job Fails with Invalid Partition

Symptom:

Error: sbatch: error: Batch job submission failed: Invalid partition name specified

Diagnostic Steps:

  1. List available partitions:

    $ ssh slurm-head 'sinfo -s'
  2. Check partition in provisioner definition:

    $ fuzzball admin provisioner get slurm-compute-small

Solutions:

  • Use correct partition name:

    definitions:
      - id: "slurm-compute"
        provisioner: "slurm"
        spec:
          cpu: 4
          mem: "8GiB"
          partition: "compute"  # Must match actual partition name
    
  • Remove partition specification to use default:

    definitions:
      - id: "slurm-compute"
        provisioner: "slurm"
        spec:
          cpu: 4
          mem: "8GiB"
          # partition not specified = use default
    

Job Fails with Invalid QoS

Symptom:

Error: sbatch: error: Batch job submission failed: Invalid qos specification

Diagnostic Steps:

  1. List available QoS:

    $ ssh slurm-head 'sacctmgr show qos format=name,priority'
  2. Check user’s allowed QoS:

    $ ssh slurm-head 'sacctmgr show assoc where user=fuzzball-service format=qos'

Solutions:

  • Use valid QoS:

    spec:
      orchestrator:
        provisioner:
          slurm:
            options:
              qos: "normal"  # Must be a valid QoS
    
  • Remove QoS specification to use default:

    spec:
      orchestrator:
        provisioner:
          slurm:
            options:
              # qos not specified = use default
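
If the service user has simply not been granted the QoS, an administrator can add it to the user's association and verify the result. A sketch using sacctmgr (adjust the QoS name to your site):

    # sudo sacctmgr modify user where name=fuzzball-service set qos+=normal

    $ sacctmgr show assoc where user=fuzzball-service format=user,qos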
    

Substrate Issues

Substrate Binary Not Found

Symptom:

Error: fuzzball-substrate: command not found

Diagnostic Steps:

  1. Check if Substrate is installed:

    $ ssh slurm-head 'ssh <compute-node> which fuzzball-substrate'
  2. Check compute node PATH:

    $ ssh slurm-head 'ssh <compute-node> printenv PATH'
  3. Find Substrate binary:

    $ ssh slurm-head 'ssh <compute-node> find / -name fuzzball-substrate 2>/dev/null'

Solutions:

  • Install Substrate on all compute nodes. On each compute node:

    # sudo cp fuzzball-substrate /usr/local/bin/
    
    # sudo chmod +x /usr/local/bin/fuzzball-substrate
  • Use absolute path in script: Modify Substrate invocation if needed

  • Add to PATH for all users. On compute nodes, add to /etc/environment:

    PATH=/usr/local/bin:/usr/bin:/bin
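
On clusters without a configuration management tool, a simple loop can push the binary to every compute node. A sketch only, with node01 through node03 as placeholder hostnames and root SSH access assumed; adapt to your tooling:

    $ for node in node01 node02 node03; do
          scp fuzzball-substrate root@$node:/usr/local/bin/
          ssh root@$node 'chmod 755 /usr/local/bin/fuzzball-substrate'
      done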
    

Substrate Permission Denied

Symptom:

Error: sudo: sorry, user fuzzball-service is not allowed to execute '/usr/local/bin/fuzzball-substrate serve' as root

Diagnostic Steps:

  1. Check sudo permissions:

    $ ssh slurm-head 'ssh <compute-node> sudo -l'
  2. Check sudoers configuration:

    $ ssh slurm-head 'ssh <compute-node> sudo cat /etc/sudoers.d/fuzzball-service'

Solutions:

  • Add sudo permission for Substrate. On all compute nodes, create /etc/sudoers.d/fuzzball-service:

    # echo "fuzzball-service ALL=(ALL) NOPASSWD: /usr/local/bin/fuzzball-substrate" | sudo tee /etc/sudoers.d/fuzzball-service
    
    # sudo chmod 440 /etc/sudoers.d/fuzzball-service
  • Use wildcards for flexibility. Allow Substrate with any arguments:

    fuzzball-service ALL=(ALL) NOPASSWD: /usr/local/bin/fuzzball-substrate *
    
  • Validate sudoers file:

    # sudo visudo -c
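
To confirm the rule works non-interactively (as a provisioned job would need), run sudo with the -n flag, which fails immediately if a password would be required:

    $ ssh slurm-head 'ssh <compute-node> sudo -n /usr/local/bin/fuzzball-substrate --version'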

Substrate Not Registering

Symptom: Slurm job starts but Substrate never connects to Orchestrate

Diagnostic Steps:

  1. Check if Substrate process is running:

    $ ssh slurm-head 'ssh <compute-node> ps aux | grep fuzzball-substrate'
  2. Check Substrate logs:

    $ ssh slurm-head 'cat slurm-<job-id>.out'
  3. Check network connectivity:

    $ ssh slurm-head 'ssh <compute-node> ping <orchestrator-host>'
    
    $ ssh slurm-head 'ssh <compute-node> nc -zv <orchestrator-host> <orchestrator-port>'
  4. Check firewall rules on compute node:

    # sudo iptables -L -n -v | grep <orchestrator-port>

Solutions:

  • Allow outbound connections from compute nodes. On compute nodes:

    # sudo firewall-cmd --permanent --add-port=<orchestrator-port>/tcp
    
    # sudo firewall-cmd --reload
  • Check Orchestrate endpoint in Substrate configuration:

    $ ssh slurm-head 'ssh <compute-node> cat /etc/fuzzball/substrate.yaml'
  • Verify DNS resolution:

    $ ssh slurm-head 'ssh <compute-node> nslookup <orchestrator-host>'
  • Use IP address if DNS is problematic. In Substrate configuration:

    orchestrator:
      endpoint: "192.168.1.10:8080"  # Use IP instead of hostname
    

Resource Issues

Insufficient Resources

Symptom:

Error: No matching provisioner definition found for resource requirements

Diagnostic Steps:

  1. Check workflow resource requirements:

    $ cat workflow.yaml | grep -A 5 "resource:"
  2. List provisioner definitions:

    $ fuzzball admin provisioner list
  3. Compare requirements to available resources:

    $ fuzzball admin provisioner get slurm-compute-small

Solutions:

  • Reduce workflow resource requests:

    resource:
      cpu:
        cores: 4  # Reduced from 64
      memory:
        size: "8GiB"  # Reduced from 512GiB
    
  • Create larger provisioner definition:

    definitions:
      - id: "slurm-compute-xlarge"
        provisioner: "slurm"
        spec:
          cpu: 64
          mem: "512GiB"
          partition: "himem"
    
  • Use Slurm node features to target appropriate nodes:

    definitions:
      - id: "slurm-himem"
        provisioner: "slurm"
        spec:
          cpu: 32
          mem: "256GiB"
          constraint: "himem"
    

Memory Limit Exceeded

Symptom: Job killed by Slurm with “Out of memory” error

Diagnostic Steps:

  1. Check actual memory usage:

    $ ssh slurm-head 'sacct -j <job-id> --format=JobID,MaxRSS,MaxVMSize'
  2. Check memory limit:

    $ ssh slurm-head 'scontrol show job <job-id> | grep Mem'

Solutions:

  • Increase memory in workflow:

    resource:
      memory:
        size: "32GiB"  # Increased from 16GiB
    
  • Create provisioner definition with more memory:

    definitions:
      - id: "slurm-highmem"
        provisioner: "slurm"
        spec:
          cpu: 16
          mem: "128GiB"
          partition: "himem"
    
  • Optimize application memory usage: Review application logs and optimize code

CPU Contention

Symptom: Jobs running much slower than expected

Diagnostic Steps:

  1. Check node load:

    $ ssh slurm-head 'ssh <compute-node> uptime'
  2. Check other jobs on node:

    $ ssh slurm-head 'squeue -w <compute-node> -o "%.18i %.9P %.20j %.8u %.2t %.10M"'
  3. Check CPU allocation:

    $ ssh slurm-head 'sacct -j <job-id> --format=JobID,AllocCPUS,CPUTime,Elapsed'

Solutions:

  • Use CPU affinity:

    resource:
      cpu:
        cores: 16
        affinity: "SOCKET"  # Bind to CPU sockets
    
  • Use exclusive node allocation (if needed):

    spec:
      orchestrator:
        provisioner:
          slurm:
            options:
              exclusive: "true"
    
  • Request specific node features:

    definitions:
      - id: "slurm-fast"
        provisioner: "slurm"
        spec:
          cpu: 32
          mem: "64GiB"
          constraint: "haswell|broadwell"  # Newer CPUs
    

Workflow Issues

Workflow Stuck in PENDING

Symptom: Workflow status remains PENDING indefinitely

Diagnostic Steps:

  1. Check workflow details:

    $ fuzzball workflow get <workflow-id>
  2. Check provisioner status:

    $ fuzzball admin provisioner list
  3. Check Slurm queue:

    $ ssh slurm-head 'squeue -u fuzzball-service'
  4. Check orchestrator logs:

    $ kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator --tail=100

Solutions:

  • Check for provisioner definition mismatch: Ensure workflow resources match available definitions

  • Wait for Slurm resources: Jobs may be queued waiting for compute nodes

  • Cancel and resubmit:

    $ fuzzball workflow cancel <workflow-id>
    
    $ fuzzball workflow submit --file workflow.yaml

Jobs Execute Out of Order

Symptom: Dependent jobs start before their dependencies complete

Diagnostic Steps:

  1. Check job dependencies in workflow:

    $ cat workflow.yaml | grep -A 5 "depends_on"
  2. Check job execution order:

    $ fuzzball workflow jobs <workflow-id>

Solutions:

  • Verify dependency specification:

    jobs:
      job-a:
        name: "first-job"
        # ... job definition
    
      job-b:
        name: "second-job"
        depends_on:
          - job-a  # Must match job ID, not name
        # ... job definition
    
  • Use consistent job identifiers: Ensure dependency references use correct job IDs

Job Terminated by TTL

Symptom: Jobs are killed before completion with timeout errors

Diagnostic Steps:

  1. Check provisioner TTL setting:

    $ fuzzball admin provisioner get slurm-compute-small | grep ttl
  2. Check job timeout policy:

    $ cat workflow.yaml | grep -A 3 "timeout:"
  3. Check actual job runtime:

    $ ssh slurm-head 'sacct -j <job-id> --format=JobID,Elapsed,Timelimit'

Solutions:

  • Increase provisioner TTL:

    definitions:
      - id: "slurm-compute-long"
        provisioner: "slurm"
        ttl: 28800  # Increase to 8 hours
        spec:
          cpu: 8
          mem: "16GiB"
    
  • Reduce job timeout to fit within TTL:

    jobs:
      my-job:
        policy:
          timeout:
            execute: "1h"  # Ensure it's less than provisioner TTL
    

Container Pull Failures

Symptom:

Error: Failed to pull container image: authentication required

Diagnostic Steps:

  1. Check image exists:

    $ docker pull <image-uri>
  2. Check registry credentials:

    $ fuzzball secret list | grep registry
  3. Check Substrate image pull logs:

    $ ssh slurm-head 'cat slurm-<job-id>.out | grep -i "pull"'

Solutions:

  • Use public images for testing:

    image:
      uri: "docker://alpine:3.16"  # Public image
    
  • Configure registry credentials:

    $ fuzzball secret create registry-creds \
      --username <username> \
      --password <password> \
      --server registry.example.com
  • Use image pull secrets:

    image:
      uri: "docker://registry.example.com/myapp:v1"
      pullSecret: "registry-creds"
    

Performance Issues

Slow Job Startup

Symptom: Long delay between job submission and execution

Diagnostic Steps:

  1. Check Slurm queue time:

    $ ssh slurm-head 'sacct -j <job-id> --format=JobID,Submit,Start,Elapsed'
  2. Check scheduler logs:

    $ ssh slurm-head 'sudo tail -f /var/log/slurm/slurmctld.log'
  3. Check compute node availability:

    $ ssh slurm-head 'sinfo -Nel'

Solutions:

  • Use priority QoS:

    spec:
      orchestrator:
        provisioner:
          slurm:
            options:
              qos: "high"
    
  • Reserve nodes for Fuzzball. Create reservation:

    $ ssh slurm-head 'sudo scontrol create reservation \
      starttime=now duration=infinite \
      user=fuzzball-service nodes=node[01-10] \
      reservationname=fuzzball'

    Use reservation:

    spec:
      orchestrator:
        provisioner:
          slurm:
            options:
              reservation: "fuzzball"
    

Slow Container Image Pulls

Symptom: Substrate spends a long time pulling images

Solutions:

  • Use local registry mirror:

    image:
      uri: "docker://local-mirror.example.com/alpine:3.16"
    
  • Pre-pull common images on all compute nodes:

    # sudo apptainer pull docker://alpine:3.16
    
    # sudo apptainer pull docker://ubuntu:22.04
  • Use smaller base images. Use alpine instead of ubuntu:

    image:
      uri: "docker://alpine:3.16"  # ~5MB instead of ubuntu:22.04 (~77MB)
    

Debugging Tools

Enable Debug Logging

Orchestrator:

# In FuzzballOrchestrate CRD
spec:
  orchestrator:
    logLevel: "debug"

Substrate: on compute nodes, edit the Substrate config:

# cat <<EOF | sudo tee /etc/fuzzball/substrate.yaml
logging:
  level: debug
EOF
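
One way to apply the log level without editing the full CRD, followed by a restart in case the change is not picked up automatically (a sketch; your operator may reconcile the change on its own):

    $ kubectl patch fuzzballorchestrate fuzzball -n fuzzball-system --type merge \
        -p '{"spec":{"orchestrator":{"logLevel":"debug"}}}'

    $ kubectl rollout restart -n fuzzball-system deployment/fuzzball-orchestrator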

Collect Diagnostic Information

Create a script to gather diagnostic data:

#!/bin/bash
# diagnostic-collect.sh

echo "=== Fuzzball Configuration ==="
kubectl get fuzzballorchestrate fuzzball -n fuzzball-system -o yaml

echo "=== Orchestrate Logs ==="
kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator --tail=500

echo "=== Provisioner Definitions ==="
fuzzball admin provisioner list

echo "=== Slurm Status ==="
ssh slurm-head 'scontrol show config | head -20'
ssh slurm-head 'sinfo'
ssh slurm-head 'squeue'

echo "=== Recent Slurm Jobs ==="
ssh slurm-head 'sacct -u fuzzball-service --starttime $(date -d "1 hour ago" +%Y-%m-%dT%H:%M:%S) --format=JobID,JobName,State,ExitCode,Elapsed'

echo "=== Compute Node Status ==="
ssh slurm-head 'scontrol show nodes'

echo "=== Recent Workflows ==="
fuzzball workflow list --limit 10

Interactive Debugging

Test Slurm integration interactively:

$ cat > test.sh << 'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

echo "Hostname: $(hostname)"
echo "Date: $(date)"
echo "CPU info:"
lscpu | head -20
echo "Memory info:"
free -h
echo "Substrate check:"
which fuzzball-substrate
fuzzball-substrate --version
EOF

$ scp test.sh slurm-head:

$ ssh slurm-head 'sbatch test.sh'
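
Once the job is submitted, check its state and read the output file (the slurm-<job-id>.out name matches the Substrate log checks earlier in this guide):

    $ ssh slurm-head 'squeue -u fuzzball-service'

    $ ssh slurm-head 'cat slurm-<job-id>.out'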

Getting Help

If you’re unable to resolve an issue:

  1. Collect diagnostic information using the script above

  2. Check Fuzzball documentation at https://docs.fuzzball.io

  3. Review Slurm documentation at https://slurm.schedmd.com

  4. Contact CIQ support with:

    • Problem description
    • Steps to reproduce
    • Diagnostic information
    • Configuration files (with sensitive data redacted)
    • Relevant log excerpts
  5. Community resources:

Common Error Messages

Error Message               Likely Cause                    Solution
connection refused          SSH port blocked or wrong       Check firewall and SSH configuration
connection timeout          Connection timeout too short    Increase connectionTimeout in config
authentication failed       Wrong credentials               Verify username/password or SSH key
command not found           Slurm not in PATH               Set binaryPath in configuration
Invalid partition           Wrong partition name            Check available partitions with sinfo
Access denied               No Slurm permissions            Add user to Slurm database
No matching provisioner     Resource mismatch               Create appropriate provisioner definition
Out of memory               Insufficient memory             Increase memory in workflow or definition
timeout / TIMEOUT state     Job exceeded TTL or timeout     Increase provisioner TTL or reduce job timeout
DUE TO TIME LIMIT           Slurm walltime exceeded         Check provisioner TTL and job timeout settings

PBS Troubleshooting

Connection Issues

SSH Connection Failed

Symptom:

Error: Failed to connect to SSH host: dial tcp: connection refused

Diagnostic Steps:

  1. Verify SSH host is reachable:

    $ ping pbs-head.example.com
  2. Check SSH port:

    $ nc -zv pbs-head.example.com 22
  3. Test SSH connection manually:

    $ ssh fuzzball-service@pbs-head.example.com
  4. Check orchestrator logs:

    $ kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator | grep -i "ssh"
  5. Check connection timeout setting:

    $ kubectl get fuzzballorchestrate fuzzball -n fuzzball-system -o yaml | grep connectionTimeout

Solutions:

  • Firewall blocking: Ensure the firewall on the PBS head node allows SSH connections from Orchestrate

    # firewall-cmd --permanent --add-service=ssh
    
    # firewall-cmd --reload
  • Wrong SSH port: Verify sshPort in configuration matches actual SSH port. Check SSH port on head node:

    # netstat -tlnp | grep sshd
  • DNS resolution: Use IP address instead of hostname if DNS is unavailable

    sshHost: "192.168.1.100"  # Use IP directly
    
  • Connection timeout too short: Increase timeout for high-latency environments

    spec:
      orchestrator:
        provisioner:
          pbs:
            connectionTimeout: 90  # Increase to 90 seconds
    

Authentication Failed

Symptom:

Error: ssh: handshake failed: ssh: unable to authenticate

Diagnostic Steps:

  1. Verify credentials for password auth (enter password manually when prompted):

    $ ssh fuzzball-service@pbs-head.example.com

    Or for key auth:

    $ ssh -i /path/to/key fuzzball-service@pbs-head.example.com
  2. Check SSH logs on head node (RHEL/CentOS):

    # tail -f /var/log/secure

    Or on Ubuntu/Debian:

    # tail -f /var/log/auth.log
  3. Verify public key is installed:

    $ ssh fuzzball-service@pbs-head.example.com 'cat ~/.ssh/authorized_keys'

Solutions:

  • Password expired: Update password on PBS head node

    # passwd fuzzball-service
  • Public key not installed: Copy public key to authorized_keys

    $ ssh-copy-id -i fuzzball-key.pub fuzzball-service@pbs-head.example.com
  • Wrong key format: Ensure private key is in PEM format. Convert to PEM if needed:

    $ ssh-keygen -p -m PEM -f fuzzball-key
  • Key permissions: Fix file permissions

    $ chmod 600 ~/.ssh/id_rsa
    
    $ chmod 644 ~/.ssh/id_rsa.pub
    
    $ chmod 700 ~/.ssh

Host Key Verification Failed

Symptom:

Error: ssh: handshake failed: ssh: host key mismatch

Diagnostic Steps:

  1. Get current host key:

    $ ssh-keyscan pbs-head.example.com
  2. Compare with configuration: Check if sshHostPublicKey in FuzzballOrchestrate CRD matches

Solutions:

  • Update host key in configuration with output from ssh-keyscan

  • Disable host key verification (not recommended for production):

    spec:
      orchestrator:
        provisioner:
          pbs:
            skipHostKeyVerification: true
    

Command Issues

Command Not Found

Symptom:

Error: qsub: command not found

Diagnostic Steps:

  1. Check if PBS commands exist:

    $ ssh fuzzball-service@pbs-head.example.com 'which qsub'
    
    $ ssh fuzzball-service@pbs-head.example.com 'which qstat'
    
    $ ssh fuzzball-service@pbs-head.example.com 'which qdel'
  2. Check user’s PATH:

    $ ssh fuzzball-service@pbs-head.example.com 'echo $PATH'
  3. Find PBS binaries:

    $ ssh fuzzball-service@pbs-head.example.com 'sudo find / -name qsub 2>/dev/null'

Solutions:

  • Set binaryPath in configuration:

    spec:
      orchestrator:
        provisioner:
          pbs:
            binaryPath: "/opt/pbs/bin"
    
  • Add to user PATH: on the PBS head node, add the following to ~/.bashrc or ~/.profile:

    export PATH=$PATH:/opt/pbs/bin
    
  • Create symlinks (if admin):

    # sudo ln -s /opt/pbs/bin/qsub /usr/local/bin/qsub
    
    # sudo ln -s /opt/pbs/bin/qstat /usr/local/bin/qstat
    
    # sudo ln -s /opt/pbs/bin/qdel /usr/local/bin/qdel

Permission Denied

Symptom:

Error: qsub: Unauthorized Request

Diagnostic Steps:

  1. Check user permissions:

    $ ssh fuzzball-service@pbs-head.example.com 'qstat -B'
  2. Verify queue access:

    $ ssh fuzzball-service@pbs-head.example.com 'qstat -Q'
  3. Check PBS server configuration:

    $ ssh fuzzball-service@pbs-head.example.com 'qmgr -c "print server"'

Solutions:

  • Add user to PBS allowed users:

    # sudo qmgr -c "set server authorized_users += fuzzball-service@*"
  • Grant queue access:

    # sudo qmgr -c "set queue workq enabled = True"
    
    # sudo qmgr -c "set queue workq started = True"
  • Check queue ACLs (should show allowed users or be disabled):

    # sudo qmgr -c "print queue workq"

Job Submission Issues

Job Submission Hangs

Symptom: Workflow status stuck in PENDING, no error messages

Diagnostic Steps:

  1. Check PBS server status:

    $ ssh pbs-head 'qstat -B'
  2. Check pending jobs:

    $ ssh pbs-head 'qstat -i'
  3. Check PBS server logs:

    $ ssh pbs-head 'sudo tail -f /var/spool/pbs/server_logs/$(date +%Y%m%d)'
  4. Check Orchestrate logs:

    $ kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator | grep -A 10 "ProvisionSubstrate"

Solutions:

  • Restart PBS server (if unresponsive):

    # sudo systemctl restart pbs
  • Increase timeout in configuration:

    policy:
      timeout:
        execute: "5m"  # Increase if jobs are slow to start
    
  • Check resource availability:

    $ ssh pbs-head 'pbsnodes -a'
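
Also confirm that scheduling is enabled on the server and that the PBS daemons are healthy (a quick sketch; the attribute names come from standard qstat -Bf output):

    $ ssh pbs-head 'qstat -Bf | grep -Ei "server_state|scheduling"'

    $ ssh pbs-head 'sudo systemctl status pbs'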

Job Fails with Invalid Queue

Symptom:

Error: qsub: Unknown queue

Diagnostic Steps:

  1. List available queues:

    $ ssh pbs-head 'qstat -Q'
  2. Check queue in provisioner definition:

    $ fuzzball admin provisioner get pbs-workq-small

Solutions:

  • Use correct queue name:

    definitions:
      - id: "pbs-compute"
        provisioner: "pbs"
        spec:
          cpu: 4
          mem: "8GiB"
          queue: "workq"  # Must match actual queue name
    
  • Remove queue specification to use default:

    definitions:
      - id: "pbs-compute"
        provisioner: "pbs"
        spec:
          cpu: 4
          mem: "8GiB"
          # queue not specified = use default
    

Substrate Issues

Substrate Binary Not Found

Symptom:

Error: fuzzball-substrate: command not found

Diagnostic Steps:

  1. Check if Substrate is installed:

    $ ssh pbs-head 'ssh <compute-node> which fuzzball-substrate'
  2. Check compute node PATH:

    $ ssh pbs-head 'ssh <compute-node> printenv PATH'
  3. Find Substrate binary:

    $ ssh pbs-head 'ssh <compute-node> find / -name fuzzball-substrate 2>/dev/null'

Solutions:

  • Install Substrate on all compute nodes. On each compute node:

    # sudo cp fuzzball-substrate /usr/local/bin/
    
    # sudo chmod +x /usr/local/bin/fuzzball-substrate
  • Use absolute path in script: Modify Substrate invocation if needed

  • Add to PATH for all users. On compute nodes, add to /etc/environment:

    PATH=/usr/local/bin:/usr/bin:/bin
    

Substrate Permission Denied

Symptom:

Error: sudo: sorry, user fuzzball-service is not allowed to execute '/usr/local/bin/fuzzball-substrate serve' as root

Diagnostic Steps:

  1. Check sudo permissions:

    $ ssh pbs-head 'ssh <compute-node> sudo -l'
  2. Check sudoers configuration:

    $ ssh pbs-head 'ssh <compute-node> sudo cat /etc/sudoers.d/fuzzball-service'

Solutions:

  • Add sudo permission for Substrate. On all compute nodes, create /etc/sudoers.d/fuzzball-service:

    # echo "fuzzball-service ALL=(ALL) NOPASSWD: /usr/local/bin/fuzzball-substrate" | sudo tee /etc/sudoers.d/fuzzball-service
    
    # sudo chmod 440 /etc/sudoers.d/fuzzball-service
  • Use wildcards for flexibility. Allow Substrate with any arguments:

    fuzzball-service ALL=(ALL) NOPASSWD: /usr/local/bin/fuzzball-substrate *
    
  • Validate sudoers file:

    # sudo visudo -c

Substrate Not Registering

Symptom: PBS job starts but Substrate never connects to Orchestrate

Diagnostic Steps:

  1. Check if Substrate process is running:

    $ ssh pbs-head 'ssh <compute-node> ps aux | grep fuzzball-substrate'
  2. Check Substrate logs:

    $ ssh pbs-head 'cat pbs-<job-id>.out'
  3. Check network connectivity:

    $ ssh pbs-head 'ssh <compute-node> ping <orchestrator-host>'
    
    $ ssh pbs-head 'ssh <compute-node> nc -zv <orchestrator-host> <orchestrator-port>'
  4. Check firewall rules on compute node:

    # sudo iptables -L -n -v | grep <orchestrator-port>

Solutions:

  • Allow outbound connections from compute nodes. On compute nodes:

    # sudo firewall-cmd --permanent --add-port=<orchestrator-port>/tcp
    
    # sudo firewall-cmd --reload
  • Check Orchestrate endpoint in Substrate configuration:

    $ ssh pbs-head 'ssh <compute-node> cat /etc/fuzzball/substrate.yaml'
  • Verify DNS resolution:

    $ ssh pbs-head 'ssh <compute-node> nslookup <orchestrator-host>'
  • Use IP address if DNS is problematic. In Substrate configuration:

    orchestrator:
      endpoint: "192.168.1.10:8080"  # Use IP instead of hostname
    

Resource Issues

Insufficient Resources

Symptom:

Error: No matching provisioner definition found for resource requirements

Diagnostic Steps:

  1. Check workflow resource requirements:

    $ cat workflow.yaml | grep -A 5 "resource:"
  2. List provisioner definitions:

    $ fuzzball admin provisioner list
  3. Compare requirements to available resources:

    $ fuzzball admin provisioner get pbs-workq-small

Solutions:

  • Reduce workflow resource requests:

    resource:
      cpu:
        cores: 4  # Reduced from 64
      memory:
        size: "8GiB"  # Reduced from 512GiB
    
  • Create larger provisioner definition:

    definitions:
      - id: "pbs-workq-xlarge"
        provisioner: "pbs"
        spec:
          cpu: 64
          mem: "512GiB"
          queue: "bigmem"
    
  • Use PBS node features to target appropriate nodes:

    definitions:
      - id: "pbs-himem"
        provisioner: "pbs"
        spec:
          cpu: 32
          mem: "256GiB"
          select: "mem>256GB"
    

Memory Limit Exceeded

Symptom: Job killed by PBS with “Out of memory” error

Diagnostic Steps:

  1. Check actual memory usage:

    $ ssh pbs-head 'qstat -f <job-id> | grep resources_used.mem'
  2. Check memory limit:

    $ ssh pbs-head 'qstat -f <job-id> | grep Resource_List.mem'

Solutions:

  • Increase memory in workflow:

    resource:
      memory:
        size: "32GiB"  # Increased from 16GiB
    
  • Create provisioner definition with more memory:

    definitions:
      - id: "pbs-highmem"
        provisioner: "pbs"
        spec:
          cpu: 16
          mem: "128GiB"
          queue: "bigmem"
    
  • Optimize application memory usage: Review application logs and optimize code

CPU Contention

Symptom: Jobs running much slower than expected

Diagnostic Steps:

  1. Check node load:

    $ ssh pbs-head 'ssh <compute-node> uptime'
  2. Check other jobs on node:

    $ ssh pbs-head 'qstat -n | grep <compute-node>'
  3. Check CPU allocation:

    $ ssh pbs-head 'qstat -f <job-id> | grep resources_used.cpupercent'

Solutions:

  • Use CPU affinity:

    resource:
      cpu:
        cores: 16
        affinity: "SOCKET"  # Bind to CPU sockets
    
  • Use exclusive node allocation (if needed):

    definitions:
      - id: "pbs-exclusive"
        provisioner: "pbs"
        exclusive: "job"
        spec:
          cpu: 32
          mem: "64GiB"
          queue: "workq"
    
  • Request specific node features:

    definitions:
      - id: "pbs-fast"
        provisioner: "pbs"
        spec:
          cpu: 32
          mem: "64GiB"
          select: "host=node[10-20]"
    

Workflow Issues

Workflow Stuck in PENDING

Symptom: Workflow status remains PENDING indefinitely

Diagnostic Steps:

  1. Check workflow details:

    $ fuzzball workflow get <workflow-id>
  2. Check provisioner status:

    $ fuzzball admin provisioner list
  3. Check PBS queue:

    $ ssh pbs-head 'qstat -u fuzzball-service'
  4. Check Orchestrate logs:

    $ kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator --tail=100

Solutions:

  • Check for provisioner definition mismatch: Ensure workflow resources match available definitions

  • Wait for PBS resources: Jobs may be queued waiting for compute nodes

  • Cancel and resubmit:

    $ fuzzball workflow cancel <workflow-id>
    
    $ fuzzball workflow submit --file workflow.yaml

Jobs Execute Out of Order

Symptom: Dependent jobs start before their dependencies complete

Diagnostic Steps:

  1. Check job dependencies in workflow:

    $ cat workflow.yaml | grep -A 5 "depends_on"
  2. Check job execution order:

    $ fuzzball workflow jobs <workflow-id>

Solutions:

  • Verify dependency specification:

    jobs:
      job-a:
        name: "first-job"
        # ... job definition
    
      job-b:
        name: "second-job"
        depends_on:
          - job-a  # Must match job ID, not name
        # ... job definition
    
  • Use consistent job identifiers: Ensure dependency references use correct job IDs

Job Terminated by TTL

Symptom: Jobs are killed before completion with timeout errors

Diagnostic Steps:

  1. Check provisioner TTL setting:

    $ fuzzball admin provisioner get pbs-workq-small | grep ttl
  2. Check job timeout policy:

    $ cat workflow.yaml | grep -A 3 "timeout:"
  3. Check actual job runtime:

    $ ssh pbs-head 'qstat -f <job-id> | grep resources_used.walltime'

Solutions:

  • Increase provisioner TTL:

    definitions:
      - id: "pbs-workq-long"
        provisioner: "pbs"
        ttl: 28800  # Increase to 8 hours
        spec:
          cpu: 8
          mem: "16GiB"
    
  • Reduce job timeout to fit within TTL:

    jobs:
      my-job:
        policy:
          timeout:
            execute: "1h"  # Ensure it's less than provisioner TTL
    

Container Pull Failures

Symptom:

Error: Failed to pull container image: authentication required

Diagnostic Steps:

  1. Check image exists:

    $ docker pull <image-uri>
  2. Check registry credentials:

    $ fuzzball secret list | grep registry
  3. Check Substrate image pull logs:

    $ ssh pbs-head 'cat pbs-<job-id>.out | grep -i "pull"'

Solutions:

  • Use public images for testing:

    image:
      uri: "docker://alpine:3.16"  # Public image
    
  • Configure registry credentials:

    $ fuzzball secret create registry-creds \
      --username <username> \
      --password <password> \
      --server registry.example.com
  • Use image pull secrets:

    image:
      uri: "docker://registry.example.com/myapp:v1"
      pullSecret: "registry-creds"
    

Performance Issues

Slow Job Startup

Symptom: Long delay between job submission and execution

Diagnostic Steps:

  1. Check PBS queue time:

    $ ssh pbs-head 'qstat -f <job-id> | grep -E "qtime|stime"'
  2. Check PBS server logs:

    $ ssh pbs-head 'sudo tail -f /var/spool/pbs/server_logs/$(date +%Y%m%d)'
  3. Check compute node availability:

    $ ssh pbs-head 'pbsnodes -a'

Solutions:

  • Use priority queues:

    spec:
      orchestrator:
        provisioner:
          pbs:
            defaultQueue: "priority"  # Use higher priority queue
    
  • Reserve nodes for Fuzzball. Create PBS reservation:

    $ ssh pbs-head 'sudo pbs_rsub -U fuzzball-service -N fuzzball -l select=10 -R $(date +%s) -E $(date -d "+1 year" +%s)'

    Use reservation:

    spec:
      orchestrator:
        provisioner:
          pbs:
            options:
              reservation: "fuzzball"
    

Slow Container Image Pulls

Symptom: Substrate spends a long time pulling images

Solutions:

  • Use local registry mirror:

    image:
      uri: "docker://local-mirror.example.com/alpine:3.16"
    
  • Pre-pull common images on all compute nodes:

    # sudo apptainer pull docker://alpine:3.16
    
    # sudo apptainer pull docker://ubuntu:22.04
  • Use smaller base images. Use alpine instead of ubuntu:

    image:
      uri: "docker://alpine:3.16"  # ~5MB instead of ubuntu:22.04 (~77MB)
    

Debugging Tools

Enable Debug Logging

Orchestrator:

# In FuzzballOrchestrate CRD
spec:
  orchestrator:
    logLevel: "debug"

Substrate: on compute nodes, edit the Substrate config:

# cat <<EOF | sudo tee /etc/fuzzball/substrate.yaml
logging:
  level: debug
EOF

Collect Diagnostic Information

Create a script to gather diagnostic data:

#!/bin/bash
# diagnostic-collect.sh

echo "=== Fuzzball Configuration ==="
kubectl get fuzzballorchestrate fuzzball -n fuzzball-system -o yaml

echo "=== Orchestrate Logs ==="
kubectl logs -n fuzzball-system deployment/fuzzball-orchestrator --tail=500

echo "=== Provisioner Definitions ==="
fuzzball admin provisioner list

echo "=== PBS Status ==="
ssh pbs-head 'qstat -B'
ssh pbs-head 'qstat -Q'
ssh pbs-head 'pbsnodes -a'

echo "=== Recent PBS Jobs ==="
ssh pbs-head 'qstat -x -u fuzzball-service'

echo "=== Compute Node Status ==="
ssh pbs-head 'pbsnodes -aSv'

echo "=== Recent Workflows ==="
fuzzball workflow list --limit 10

Interactive Debugging

Test PBS integration interactively:

$ cat << 'EOF' > test.sh
#!/bin/bash
#PBS -l select=1:ncpus=1

echo "Hostname: $(hostname)"
echo "Date: $(date)"
echo "CPU info:"
lscpu | head -20
echo "Memory info:"
free -h
echo "Substrate check:"
which fuzzball-substrate
fuzzball-substrate --version
EOF

$ scp test.sh pbs-head:

$ ssh pbs-head 'qsub test.sh'
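
Once the job finishes, check its record and read the output file. By default PBS names stdout <script-name>.o<sequence-number>; confirm the exact path with the job's Output_Path attribute if it differs at your site:

    $ ssh pbs-head 'qstat -x <job-id>'

    $ ssh pbs-head 'cat test.sh.o<job-id>'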

Getting Help

If you’re unable to resolve an issue:

  1. Collect diagnostic information using the script above

  2. Check Fuzzball documentation at https://docs.fuzzball.io

  3. Review PBS documentation at https://www.openpbs.org/documentation

  4. Contact CIQ support with:

    • Problem description
    • Steps to reproduce
    • Diagnostic information
    • Configuration files (with sensitive data redacted)
    • Relevant log excerpts
  5. Community resources:

Common Error Messages

Error Message               Likely Cause                    Solution
connection refused          SSH port blocked or wrong       Check firewall and SSH configuration
connection timeout          Connection timeout too short    Increase connectionTimeout in config
authentication failed       Wrong credentials               Verify username/password or SSH key
command not found           PBS not in PATH                 Set binaryPath in configuration
Unknown queue               Wrong queue name                Check available queues with qstat -Q
Unauthorized Request        No PBS permissions              Add user to PBS authorized users
No matching provisioner     Resource mismatch               Create appropriate provisioner definition
Out of memory               Insufficient memory             Increase memory in workflow or definition
timeout / TIMEOUT state     Job exceeded TTL or timeout     Increase provisioner TTL or reduce job timeout
walltime exceeded           PBS walltime exceeded           Check provisioner TTL and job timeout settings