Parallel Workflows with Job Arrays
Fuzzball supports parallel workloads through task arrays.
A parallel workload consists of some number of independent tasks that follow a similar pattern. For example, imagine you have 1000 data files, and processing each file requires 1 core and takes about an hour. If you processed all of these files sequentially, the job would take around 42 days to complete. Now imagine you have a powerful workstation with 32 cores, allowing you to process 32 input files simultaneously. As files finish processing, other files take their place and begin processing. Using this strategy, you can decrease the time needed to process the entire data set to around 31.5 hours. If you had access to 1000 cores, you could reduce the processing time to a single hour!
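If you want to sanity-check those numbers, a quick back-of-the-envelope calculation (shown here as a plain shell sketch run on any machine, not part of Fuzzball) gives roughly the same figures.

files=1000                                   # number of input files, ~1 core-hour each
echo "sequential: $(( files / 24 )) days"    # 1000 hours is roughly 41-42 days
echo "32 cores:   $(awk -v f="$files" 'BEGIN { print f / 32 }') hours"   # ~31.25 hours
echo "1000 cores: 1 hour"                    # every file is processed at once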
Here is a basic example Fuzzfile illustrating task arrays in Fuzzball.
version: v1
jobs:
  rollcall:
    image:
      uri: oras://docker.io/godlovedc/lolcow:sif
    command:
      - /bin/sh
      - '-c'
      - 'echo "cow #${FB_TASK_ID}" reporting for duty | cowsay'
    resource:
      cpu:
        cores: 1
      memory:
        size: 1GB
    task-array:
      start: 1
      end: 6
      concurrency: 3
The task-array section lets you specify that multiple copies of your job should run in parallel. Each task has a task ID, and you can set the range of IDs using the start and end fields. Jobs contain an environment variable called FB_TASK_ID that allows you to reference the task ID of the currently running task. The concurrency field allows you to specify how many tasks should run in parallel (assuming that enough resources exist in the cluster).
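To make the mechanics concrete, here is a rough local approximation, in plain Bash rather than Fuzzball, of what the six tasks in the rollcall job each run (this assumes cowsay is installed on your machine; in the real workflow the tasks run in parallel, three at a time, rather than in a loop):

# Approximate the six rollcall tasks by looping over the task IDs 1 through 6.
# Fuzzball sets FB_TASK_ID for each task automatically; here we set it by hand.
for FB_TASK_ID in 1 2 3 4 5 6; do
  echo "cow #${FB_TASK_ID} reporting for duty" | cowsay
done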
Screenshots from the Fuzzball GUI may give you a better idea of how this task array runs. Note that Fuzzball queues up 3 tasks in the array at a time, respecting the concurrency field.
After the job runs, you can check the logs to see how the FB_TASK_ID variable is used to change the standard output in each task.
It is a very common pattern to have a collection of files and to need to do something to all of them. Consider this collection of files in the directory some-files.
$ ls some-files/
file10.txt file2.txt file3.txt file4.txt file5.txt file6.txt
file7.txt file8.txt file999.txt file9.txt
$ cat some-files/*.txt
this is file 10
this is file 2
this is file 3
this is file 4
this is file 5
this is file 6
this is file 7
this is file 8
this is file 999
this is file 9
Note that the filenames are numbered, but number 1 is missing and the last file is numbered 999. The commands in our job cannot simply iterate over file numbers.
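If you would like to follow along without access to the original archive, you can recreate an equivalent test set locally and package it the same way (the archive name matches the some-files.tar.gz used by the Fuzzfile below):

# Recreate the test files shown above and bundle them into some-files.tar.gz.
mkdir -p some-files
for n in 2 3 4 5 6 7 8 9 10 999; do
  echo "this is file $n" > "some-files/file${n}.txt"
done
tar czvf some-files.tar.gz some-files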
Here is a generic Fuzzfile that will let you iterate through a directory of files and perform some action on each of them in a parallel workflow. In this example, we simply echo the contents of each file as ASCII art, but you can change the command to do whatever you want.
version: v1
jobs:
  untar:
    mounts:
      v1:
        location: /tmp
    cwd: /tmp
    command:
      - /bin/sh
      - '-c'
      - tar xvf some-files.tar.gz
    image:
      uri: docker://alpine
    resource:
      cpu:
        cores: 1
      memory:
        size: 1GB
  process-files:
    requires:
      - untar
    mounts:
      v1:
        location: /tmp
    cwd: /tmp/
    command:
      - /bin/bash
      - '-c'
      - cd some-files && v=$(ls) && a=($v) && cat "${a[$FB_TASK_ID-1]}" | cowsay
    image:
      uri: oras://docker.io/godlovedc/lolcow:sif
    resource:
      cpu:
        cores: 1
      memory:
        size: 1GB
    task-array:
      start: 1
      end: 10
      concurrency: 10
volumes:
  v1:
    ingress:
      - source:
          uri: s3://co-ciq-misc-support/godloved/some-files.tar.gz
          secret: secret://user/GODLOVED_S3
        destination:
          uri: file://some-files.tar.gz
    reference: volume://user/ephemeral
The key to this strategy is the command that appears in the process-files job.
cd some-files && v=$(ls) && a=($v) && cat "${a[$FB_TASK_ID-1]}" | cowsay
After moving into the correct directory, the files are listed and captured in a variable (v). Then v is converted to a Bash array (a) that can be indexed to access the correct file. Because Bash arrays are zero-indexed while the task IDs in this workflow run from 1 to 10, the index is offset by one ($FB_TASK_ID-1).
The typical /bin/sh invocation has been replaced in this Fuzzfile with /bin/bash. This allows us to use a Bash array to index the file list by task ID.
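As a standalone illustration, you can reproduce the indexing trick in plain Bash against a local copy of the some-files directory, hard-coding a task ID instead of letting Fuzzball supply it:

# Stand-in for a single task of the array: pick the Nth file from the listing.
cd some-files
v=$(ls)                      # newline-separated list of file names
a=($v)                       # word-split into a zero-based Bash array
FB_TASK_ID=3                 # example value; Fuzzball sets this automatically
cat "${a[$FB_TASK_ID-1]}"    # task 3 reads the third entry (index 2): file3.txt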
This job produces logs like the following: