Chain of restartable jobs
Make sure you have Kerberos tickets
Currently the longest job length allowed on Ekman is 10 days, i.e. 14400 minutes. The ticket must remain valid for the time the job spends waiting in the queue as well as for its run time. If you anticipate that your job will wait in the queue for, say, one day before starting, you need to get an eleven-day Kerberos ticket before submitting such a job.
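The arithmetic above can be sketched as a small shell helper. This is only an illustration: the function name required_ticket_days is hypothetical, and the kinit line it prints assumes a Heimdal kinit that accepts -f (forwardable) and -l (lifetime); check the flags and the principal that apply at your site before using it.

```shell
#!/bin/bash
# Sketch: estimate the Kerberos ticket lifetime needed for one job.
# required_ticket_days is a hypothetical helper, not part of any PDC tool.

# Usage: required_ticket_days WALLTIME_MINUTES QUEUE_WAIT_MINUTES
# Prints the number of whole days (rounded up) the ticket must last.
required_ticket_days() {
    local walltime_min=$1
    local wait_min=$2
    local total_min=$(( walltime_min + wait_min ))
    # Integer ceiling division: there are 1440 minutes in a day.
    echo $(( (total_min + 1439) / 1440 ))
}

# A 10-day job (14400 minutes) expected to wait one day (1440 minutes).
days=$(required_ticket_days 14400 1440)
echo "Ticket lifetime needed: ${days} days"

# Assumption: Heimdal kinit syntax; adjust to your site before running it.
echo "Possible command: kinit -f -l ${days}d"
```

Rounding up to whole days errs on the side of a ticket that is slightly too long rather than one that expires while the job is still queued.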
How to chain jobs
There are at least two ways in which you may chain your jobs together:
- Submit a new job at the end of the job script.
- Submit several jobs in a sequence and restrict each job to not start before the previous job has exited. This is done with the "-F" option to esubmit.
How to submit one job from another
Create a sufficiently long ticket before submitting the first job in the chain. Decide in which directory to run your application. Your input files must be available there, and you need enough disk space and quota. You also need to put a test in your job script that decides, before the job exits, whether the next job should be submitted.
Here is an example:
    #!/bin/bash
    source /pdc/modules/etc/init/bash

    # Working directory (where all the data goes)
    RUNDIR=/cfs/testscratch/l/lenzkar/test

    # Go to the run directory on the scratch or nobackup disk;
    # stop here if it is not reachable.
    cd "$RUNDIR" || exit 1

    # Do my thing. (Here the programs of the code are started.)
    # Replace the sleep statement with your own computation.
    # Depending on the outcome of the computation, decide whether to resubmit.
    sleep 3600
    need_to_resubmit=1

    # Is the decision to resubmit or not?
    if [ "$need_to_resubmit" -le 0 ]; then
        exit
    fi

    # The next job script to be submitted
    runscript=$RUNDIR/resubmit_script

    # Prepare to submit the next job
    module add easy
    rsh=/usr/heimdal/bin/rsh
    shost=ekman.pdc.kth.se
    esubmit_program=`type -p esubmit`

    # Submit the next job on host $shost
    date
    echo ${rsh} -F ${shost} ${esubmit_program} -n 1 -t 10080 $runscript
    ${rsh} -F ${shost} ${esubmit_program} -n 1 -t 10080 $runscript
    exit
To start the chain, submit the first job yourself:
    module add easy
    esubmit -n 1 -t 10080 ./resubmit_script
The same Kerberos ticket is used all the way. When the remaining lifetime of the ticket becomes shorter than the requested walltime, the esubmit call will fail with a message like this:
    esubmit: failed: tight ticket life? expire in 56m04s (request is 168h.)
    esubmit: info: use -f to override.
Of course you need to adapt the script to your application,
and that includes changing the walltime specification and
the number of nodes in the esubmit lines etc.
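
One way to adapt the resubmission test is to let the application signal completion through a marker file. The sketch below is only an illustration under that assumption: the file name FINISHED and the function check_resubmit are hypothetical, and your application may offer a better criterion (for example a convergence value in its output).

```shell
#!/bin/bash
# Sketch: decide whether to resubmit based on a marker file.
# Assumption: the application creates a file named FINISHED in the run
# directory once the whole computation is complete (hypothetical name).

# Usage: check_resubmit RUNDIR
# Prints 1 if another job is needed, 0 if the computation has finished.
check_resubmit() {
    local rundir=$1
    if [ -e "$rundir/FINISHED" ]; then
        echo 0
    else
        echo 1
    fi
}

# Demonstration in a temporary directory.
demo_dir=$(mktemp -d)
echo "before marker: $(check_resubmit "$demo_dir")"
touch "$demo_dir/FINISHED"
echo "after marker: $(check_resubmit "$demo_dir")"
rm -rf "$demo_dir"
```

In the job script, the fixed assignment need_to_resubmit=1 could then be replaced by need_to_resubmit=$(check_resubmit "$RUNDIR"), so that the chain stops by itself once the marker file appears.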


