Saving BLAST Results to AWS S3
The pre-installed BLAST workflow template obtains query sequences with efetch from NCBI or curl from
any https URL and saves results to a configurable path under the persistent DataVolume. What if
you wanted to fetch query sequences from AWS S3 and save results back to AWS S3 instead, and you did
not need the ability to create custom BLAST databases? To achieve that we will first
follow the workflow catalog documentation to create
a copy of the BLAST workflow and call it Blast S3. Then open the detail view on that Application
and make the following changes:
Since we are removing some functionality, this template will be simpler: instead of using a job to fetch the query data and saving results to a persistent volume, we will use the ingress and egress functionality of the volumes. You can use the following template:
{{- $dbpath := list "/data" .BlastDbPath | join "/" }}
{{- $dbname := .BlastDbName }}
version: v1
volumes:
  data:
    reference: {{.DataVolume}}
  scratch:
    reference: {{.ScratchVolume}}
    ingress:
      - source:
          uri: "{{.S3Uri}}/{{.RunName}}.fa"
          secret: {{.S3Secret}}
        destination:
          uri: "file://{{.RunName}}.fa"
    egress:
      - source:
          uri: "file://{{.RunName}}.blast.out"
        destination:
          uri: "{{.S3Uri}}/{{.RunName}}.blast.out"
          secret: {{.S3Secret}}
jobs:
  fetch-db:
    image:
      uri: {{.WorkflowContainer}}
    mounts:
      data:
        location: /data
      scratch:
        location: /scratch
    command:
      - /bin/bash
      - "-c"
      - |
        mkdir -p "{{$dbpath}}" && cd "{{$dbpath}}" || exit 1
        # fix update_blastdb.pl if it's from a conda container
        ubdb="$(type -p update_blastdb.pl)"
        curl="$(type -p curl)"
        [[ -z "$ubdb" || -z "$curl" ]] && exit 1
        if [[ "$ubdb" =~ conda ]]; then
          sed "s:^my \\\$curl.*$:my \$curl = '$curl';:" "${ubdb}" > update_blastdb.pl
        else
          cp "${ubdb}" update_blastdb.pl
        fi
        chmod 750 update_blastdb.pl
        if ./update_blastdb.pl --showall | grep -q {{$dbname}} ; then
          echo "{{$dbname}} is a public database available from NCBI"
          now=$(date +%s)
          if [[ -e {{$dbname}}__ ]]; then
            last=$(cat {{$dbname}}__)
            if (( (now - last) < 86400 )) ; then
              echo " {{$dbname}} is current - update skipped."
              exit
            fi
          fi
          echo $now > {{$dbname}}__
          echo " updating/downloading {{$dbname}}"
          ./update_blastdb.pl --num_threads=2 --decompress {{$dbname}} && exit 0 || exit 1
        else
          echo "{{$dbname}} is not a public database"
          exit 1
        fi
    resource:
      cpu:
        cores: 1
        threads: true
      memory:
        size: 4GiB
    policy:
      timeout:
        execute: {{.BlastFetchTimeout}}
  run-blast:
    image:
      uri: {{.WorkflowContainer}}
    mounts:
      data:
        location: /data
      scratch:
        location: /scratch
    cwd: /scratch
    command:
      - /bin/sh
      - "-c"
      - |
        {{.BlastCmd}} -num_threads {{.BlastCores}} -query "{{.RunName}}.fa" -db {{$dbpath}}/{{$dbname}} -out {{.RunName}}.blast.out {{.BlastOpts}} || exit 1
        cat {{.RunName}}.blast.out
    resource:
      cpu:
        cores: {{.BlastCores}}
        affinity: NUMA
      memory:
        size: {{.BlastMemory}}
    policy:
      timeout:
        execute: {{.BlastQueryTimeout}}
    requires: [fetch-db]
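To illustrate how the ingress and egress sections render, here is a sketch of the scratch volume with hypothetical parameter values: S3Uri set to s3://my-bucket/blast, S3Secret set to secret://mygroup/aws-creds, and a run named run42 (only the ingress and egress portions are shown; the bucket, secret, and run name are illustrative, not defaults). The query sequences must already exist at the ingress source URI before the run starts, and the results are uploaded to the egress destination URI when the run finishes:

  scratch:
    ingress:
      - source:
          uri: "s3://my-bucket/blast/run42.fa"
          secret: secret://mygroup/aws-creds
        destination:
          uri: "file://run42.fa"
    egress:
      - source:
          uri: "file://run42.blast.out"
        destination:
          uri: "s3://my-bucket/blast/run42.blast.out"
          secret: secret://mygroup/aws-creds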
- Remove the following parameters: CustomBlastDbFetchCmd, CustomBlastDbOptions, CustomBlastDbName, CustomBlastDbTimeout, RetrieveQuerySequencesCmd, and BlastOutputPath
- Add 2 new parameters:
  - Name: S3Uri
    Type: Text
    Description: AWS URI prefix for blast inputs and outputs
    Default: A URI like s3://<bucket>/<path...>
  - Name: S3Secret
    Type: Text
    Description: Secret from the secret store with AWS credentials
    Default: A URI like secret://<scope>/<name>
Once you save the modified BLAST workflow template, you can run it analogously to the stock BLAST workflow template, using the appropriate parameters.
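For example, using the same hypothetical values as above (S3Uri s3://my-bucket/blast and a run named run42), one way to stage the query sequences and retrieve the results is with the AWS CLI:

# upload the query sequences before starting the Blast S3 workflow
aws s3 cp run42.fa s3://my-bucket/blast/run42.fa

# download the results after the workflow completes
aws s3 cp s3://my-bucket/blast/run42.blast.out .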