Running the WEP Pipeline on the USDF Batch System¶
This guide shows how to run the WEP pipeline in batch mode on the Rubin US Data Facility (USDF). For more background on batch processing with the Rubin Science Pipelines and on using the USDF nodes, see the USDF batch processing documentation.
Steps¶
Run `allocateNodes.py` to get “glide-in” slots in the batch processing system. An example command looks like the following:

```bash
allocateNodes.py -v --dynamic -n 1 -c 2 -m 1-00:00:00 -q roma,milano -g 900 s3df
```
From the USDF documentation:
- `s3df` is specified as the target platform.
- The `-q roma,milano` option specifies that the glide-in jobs may run in either the roma or milano partition.
- The `-n 1 -c 2` options request 1 individual glide-in slot of size 2 cores (each glide-in is a Slurm job that obtains a partial node). For a single pair of images on a single sensor, such as the AuxTel data in this example, 2 cores on a single node are all you need. For an exposure with all four wavefront sensor pairs, at most 8 tasks will run concurrently, so there should be no need to request more than 8 cores on a node for CWFS visits. Furthermore, to minimize idle core time, a `-c` value of 4 would be preferable, since the later tasks in the pipeline run on pairs of wavefront sensors together. For a visit across the focal plane with a large number of science sensors, a `-c` value equal to the number of sensors, up to a maximum of 32, would be reasonable; since these are requests for a partial node, `-c` values above 32 may be harder to schedule. If submitting multiple jobs to run multiple visits, increase the `-n` parameter to procure more glide-in slots (sizing sketches follow this list).
- The maximum possible run time is set to 1 day via `-m 1-00:00:00`. The glide-in Slurm jobs may not run for the full day, however, because the `-g 900` option specifies a condor glide-in shutdown time of 900 seconds (15 minutes). This means that the HTCondor daemons will shut themselves down after 15 minutes of inactivity (for example, after the workflow is complete), and the glide-in Slurm jobs will exit at that time to avoid wasting idle resources.
- The `--dynamic` option requests that the HTCondor slots be dynamic, partitionable slots; this is the recommended setting, as it supports possible multi-core jobs in the workflow.
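To make this sizing guidance concrete, here are two example invocations. These are sketches that simply reuse the partition, wall-time, and shutdown values from the command above.

```bash
# Full CWFS visit (four wavefront sensor pairs): -c 4 keeps cores busy once
# the pipeline narrows to tasks that run on wavefront-sensor pairs together.
allocateNodes.py -v --dynamic -n 1 -c 4 -m 1-00:00:00 -q roma,milano -g 900 s3df

# Multiple visits submitted as separate workflows: request more glide-in
# slots via -n rather than larger individual slots.
allocateNodes.py -v --dynamic -n 4 -c 4 -m 1-00:00:00 -q roma,milano -g 900 s3df
```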
Now one can run the ctrl_bps_htcondor workflow. Make sure that the environment needed to run the WEP pipeline is set up in the current terminal session (a sketch follows the configuration file below) and run:

```bash
bps submit bps_wep_test.yaml
```
where `bps_wep_test.yaml` looks like:

```yaml
# This is the WEP pipeline configuration file.
pipelineYaml: "/sdf/home/b/brycek/u/dev-repos/observing/latissWepPipeline.yaml"

# These are identifiers when looking at the bps report to check on status
project: AOS
campaign: DM-38273

# Instructions to the butler
payload:
  # This will be the name of the collection after the added prefix of 'u/$USER/'
  payloadName: latiss_wep_test
  # This is the location of the butler repository with the data you need
  butlerConfig: /sdf/group/rubin/repo/embargo/butler.yaml
  # These are the collections in that repository with the data you want
  inCollection: LATISS/calib/unbounded,LATISS/raw/all,LATISS/runs/quickLook
  # Specify a set of images with the query here (one intra-focal and one extra-focal per job)
  dataQuery: exposure in (2023020200434, 2023020200435)
```
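The environment setup mentioned above might look like the following. This is a minimal sketch: the stack load script path, weekly tag, and ts_wep checkout location are assumptions that will differ per user.

```bash
# Minimal environment sketch (paths and tag are illustrative assumptions).
source /sdf/group/rubin/sw/tag/w_latest/loadLSST.bash  # load the shared LSST stack
setup lsst_distrib                                     # eups setup of the Science Pipelines
setup -k -r /path/to/ts_wep                            # optionally add a local ts_wep checkout
```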
If you are unfamiliar with the WEP pipeline configuration files, see WEP Configuration.
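Before submitting, it can also help to confirm that the exposures referenced in `dataQuery` exist in the target repository. The following is a sketch using the butler command line; check `butler query-dimension-records --help` for the exact options in your Science Pipelines version.

```bash
# List the exposure records named in dataQuery (sketch; verify flags locally).
butler query-dimension-records /sdf/group/rubin/repo/embargo exposure \
    --where "instrument='LATISS' AND exposure IN (2023020200434, 2023020200435)"
```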
Additional Notes¶
- Check on job status with `bps report` (a short sketch appears at the end of these notes).
- You can override the data query in the batch yaml file from the command line with `-d`. For example:

  ```bash
  bps submit -d "exposure.observation_type='cwfs' and exposure.day_obs=20220912 and exposure.seq_num in (96..97)" bps_wep_test.yaml
  ```
- The pipeline will run, and the output can be found in the butler repository specified by `butlerConfig`, under the collection specified by `payloadName` in `bps_wep_test.yaml`. To override `butlerConfig` on the command line use the `-b` option, and to override the output collection use the `-o` option of `bps submit`. For more options and information on `bps submit`, see the ctrl_bps documentation.
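As an illustration of these notes, the commands below check status and then resubmit with the overrides described above. The repository path and collection name are illustrative assumptions.

```bash
# Summary of submitted runs and their current status.
bps report

# Resubmit with a different butler repo (-b) and output collection (-o).
# The repo path and collection name are illustrative assumptions.
bps submit -b /sdf/group/rubin/repo/main/butler.yaml \
    -o u/$USER/latiss_wep_test2 bps_wep_test.yaml
```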