Executing a BLAST Workflow
The Fuzzfile below uses BLAST tooling to pull one set of sequences for querying and another set for building a database, builds the database, and then executes the query. It generates the output file results/blastp.out, which can be saved through data egress to a path defined in an S3 URI. Different jobs within the workflow use BLAST and edirect containers published by NCBI on Docker Hub.
version: v1
volumes:
  blast-volume:
    name: blast-volume
    reference: volume://user/ephemeral
    # egress:
    #   - source:
    #       uri: file://results/blastp.out
    #     destination:
    #       uri:
    #       secret:
jobs:
  create-dirs:
    image:
      uri: docker://ncbi/blast:2.12.0
    command: [mkdir, blastdb, queries, fasta, results, blastdb_custom]
    cwd: /data
    resource:
      cpu:
        cores: 1
        affinity: NUMA
      memory:
        size: 1GiB
    mounts:
      blast-volume:
        location: /data
  retrieve-query-sequence:
    image:
      uri: docker://ncbi/edirect:20.6
    command: ["/bin/sh", "-c", "efetch -db protein -format fasta -id P01349 > P01349.fsa"]
    cwd: /data/queries
    resource:
      cpu:
        cores: 1
        affinity: NUMA
      memory:
        size: 1GiB
    mounts:
      blast-volume:
        location: /data
    requires: [create-dirs]
  retrieve-database-sequences:
    image:
      uri: docker://ncbi/edirect:20.6
    command: ["/bin/sh", "-c", "efetch -db protein -format fasta -id Q90523,P80049,P83981,P83982,P83983,P83977,P83984,P83985,P27950 > nurse-shark-proteins.fsa"]
    cwd: /data/fasta
    resource:
      cpu:
        cores: 1
        affinity: NUMA
      memory:
        size: 1GiB
    mounts:
      blast-volume:
        location: /data
    requires: [create-dirs]
  make-blast-database:
    image:
      uri: docker://ncbi/blast:2.12.0
    command: ["/bin/sh", "-c", "makeblastdb -in /data/fasta/nurse-shark-proteins.fsa -dbtype prot -parse_seqids -out nurse-shark-proteins -title 'Nurse shark proteins' -taxid 7801 -blastdb_version 5"]
    cwd: /data/fasta
    resource:
      cpu:
        cores: 1
        affinity: NUMA
      memory:
        size: 1GiB
    mounts:
      blast-volume:
        location: /data
    requires: [retrieve-query-sequence, retrieve-database-sequences]
  run-blast:
    image:
      uri: docker://ncbi/blast:2.12.0
    command: [blastp, -num_threads, 8, -query, /data/queries/P01349.fsa, -db, /data/fasta/nurse-shark-proteins, -out, /data/results/blastp.out]
    cwd: /data
    resource:
      cpu:
        cores: 8
        affinity: NUMA
      memory:
        size: 30GiB
    mounts:
      blast-volume:
        location: /data
    requires: [make-blast-database]
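To actually save results/blastp.out to object storage, the commented-out egress section can be uncommented and filled in along these lines. This is only a sketch: the destination bucket path and the secret name holding your S3 credentials are placeholders you must replace with your own values.

```yaml
volumes:
  blast-volume:
    name: blast-volume
    reference: volume://user/ephemeral
    egress:
      - source:
          uri: file://results/blastp.out
        destination:
          # Placeholder bucket and key -- substitute your own S3 URI.
          uri: s3://my-bucket/blast/blastp.out
          # Placeholder name of a secret containing S3 credentials.
          secret: my-s3-secret
```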
You can run this workflow either through the GUI or the CLI.

If you click “Workflow Editor” and “Create New”, you will see a blank page in the workflow editor. Now you can either click the ellipsis (...) menu in the lower right and select “Edit YAML” or simply press e on your keyboard. An editor with a Fuzzfile stub will appear. You can delete the current contents and copy and paste the workflow definition from above.

Pressing “save” will return you to the interactive workflow editor, and you will now see the BLAST workflow graph instead of a blank editor page. The Fuzzball GUI automatically validates the YAML file for syntax errors.
Note that the two retrieve-* jobs can proceed in parallel if enough resources are available.
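This parallelism follows from the requires fields, which form a dependency graph among the jobs. As an illustration (using the standard tsort utility, which is not part of Fuzzball), feeding the workflow's dependency edges to tsort produces a valid serial execution order; the two retrieve-* jobs may appear in either order because nothing orders them relative to each other:

```shell
# Each line is an edge "prerequisite dependent"; tsort prints the
# jobs in a dependency-respecting order. create-dirs must come
# first and run-blast last, while the two retrieve-* jobs are
# unordered with respect to each other -- which is exactly what
# allows Fuzzball to run them in parallel.
tsort <<'EOF'
create-dirs retrieve-query-sequence
create-dirs retrieve-database-sequences
retrieve-query-sequence make-blast-database
retrieve-database-sequences make-blast-database
make-blast-database run-blast
EOF
```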
Submitting your workflow to Fuzzball with the GUI is easy. Simply press the triangular “Start Workflow” button in the lower right corner of the workflow editor. You will be prompted to provide an optional descriptive name for your workflow.
Now you can click on “Start Workflow” in the lower right corner of the dialog box and your workflow will be submitted. If you click “Go to Status” you can view the workflow status page. The screenshot below shows the status page for a hello world workflow submission.
To retrieve logs produced by this workflow, select a job within the workflow, such as make-blast-database, and click the “Logs” option on the right.
To run this workflow through the CLI you will need access to the Fuzzball CLI. You can install it using the Fuzzball CLI installation instructions.

First, create a Fuzzfile blast.yaml with the contents above using the text editor of your choice. You can then start this workflow using the CLI by running the following command:
$ fuzzball workflow start blast.yaml
Workflow "8ae68827-4bce-45c6-ab0c-9f086a8052fb" started.
You can monitor the workflow’s status by running the following command:
$ fuzzball workflow describe <workflow uuid>
Name: blast.yaml
Email: bphan@ciq.co
UserId: e554e134-bd2d-455b-896e-bc24d8d9f81e
Status: STAGE_STATUS_FINISHED
Created: 2024-06-18 09:37:23AM
Started: 2024-06-18 09:37:23AM
Finished: 2024-06-18 09:44:21AM
Error:
Stages:
KIND | STATUS | NAME | STARTED | FINISHED
Workflow | Finished | 8ae68827-4bce-45c6-ab0c-9f086a8052fb | 2024-06-18 09:37:23AM | 2024-06-18 09:44:21AM
Volume | Finished | blast-volume | 2024-06-18 09:37:24AM | 2024-06-18 09:37:45AM
Image | Finished | docker://ncbi/blast:2.12.0 | 2024-06-18 09:37:24AM | 2024-06-18 09:41:20AM
Job | Finished | create-dirs | 2024-06-18 09:41:35AM | 2024-06-18 09:41:40AM
Job | Finished | retrieve-query-sequence | 2024-06-18 09:41:56AM | 2024-06-18 09:42:04AM
Job | Finished | retrieve-database-sequences | 2024-06-18 09:41:58AM | 2024-06-18 09:42:06AM
Job | Finished | make-blast-database | 2024-06-18 09:42:21AM | 2024-06-18 09:42:28AM
Job | Finished | run-blast | 2024-06-18 09:43:38AM | 2024-06-18 09:43:45AM
File | Finished | file://results/blastp.out -> s3://co-ciq-m... | 2024-06-18 09:44:00AM | 2024-06-18 09:44:05AM
You can view outputs logged by the workflow using the fuzzball workflow log command, providing the workflow UUID and job name. For example, executing the following command will output logs from the make-blast-database job in the workflow:
$ fuzzball workflow log <workflow uuid> make-blast-database
Building a new DB, current time: 06/18/2024 16:42:26
New DB name: /data/fasta/nurse-shark-proteins
New DB title: Nurse shark proteins
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 7 sequences in 0.244187 seconds.
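As an aside, the sequence count reported by makeblastdb can be cross-checked by counting FASTA header lines in the input file, since each record begins with a ">" line. A quick sketch using grep on a tiny inline example file (the sequences here are made up for illustration):

```shell
# Each FASTA record starts with a ">" header line, so counting
# those lines gives the number of sequences makeblastdb will see.
# Demonstrated on a two-record example file.
cat > demo.fsa <<'EOF'
>seq1 hypothetical example protein
MKTAYIAKQR
>seq2 hypothetical example protein
GSSGSSG
EOF
grep -c '^>' demo.fsa   # prints 2
```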