
Chain of restartable jobs

If you have a restartable application that checkpoints itself and can restart cleanly from the checkpoint, you can make use of those 30 days by running it as, for example, several seven-day jobs chained together so that only one of them is active at a time.

Make sure you have Kerberos tickets

Currently the longest job allowed on Ekman is 10 days, i.e. 14400 minutes. If you anticipate that your job will wait in the queue for, say, one day before starting, you need to get an eleven-day Kerberos ticket before submitting such a job.
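
The eleven-day figure is simply the walltime plus the expected queue wait, rounded up to whole days. A minimal sketch of that arithmetic in shell follows; the function name `ticket_days_needed` is a hypothetical helper and not part of any PDC tool, and how you actually request a ticket of that length depends on the Kerberos setup at your site.

```shell
#!/bin/bash
# Hypothetical helper: days of ticket lifetime needed to cover the
# job's walltime plus the expected queue wait (both given in minutes).
ticket_days_needed() {
    local walltime_min=$1
    local queue_wait_min=$2
    local total_min=$(( walltime_min + queue_wait_min ))
    # Round up to whole days (1440 minutes per day).
    echo $(( (total_min + 1439) / 1440 ))
}

# A 10-day job (14400 minutes) expected to wait one day (1440 minutes):
ticket_days_needed 14400 1440
```

For a chain of jobs, the same calculation applies to the total time until the last job in the chain has been submitted, since the same ticket is used for every resubmission.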

How to chain jobs

There are at least two ways in which you may chain your jobs together:

  • Submit a new job at the end of the job script.
  • Submit several jobs in a sequence and restrict each job to not start before the previous job has exited. This is done with the "-F" option to esubmit.

How to submit one job from another

Create a sufficiently long-lived ticket before submitting the first job in the chain. Decide in which directory to run your application: your input files must be available there, and you need enough space and quota. You also need to decide on a test to put in your job script that determines whether the next job should be submitted before the current one exits.

Here is an example:

#! /bin/bash
source /pdc/modules/etc/init/bash
# Working directory (where all the data goes)
RUNDIR=/cfs/testscratch/l/lenzkar/test
# Go to the run directory on the scratch or nobackup disk
cd "$RUNDIR" || exit 1
# Do my thing. (Here the application is started.)
# Replace the sleep statement with your own application.
sleep 3600
# Depending on the outcome of the computation, decide whether to resubmit:
# 1 = submit the next job, 0 = stop the chain.
need_to_resubmit=1
# Exit without resubmitting if no further job is needed
if [ "$need_to_resubmit" -le 0 ]; then
        exit
fi
# The next job script to be submitted
runscript=$RUNDIR/resubmit_script
# Prepare to submit the next job
module add easy
rsh=/usr/heimdal/bin/rsh
shost=ekman.pdc.kth.se
esubmit_program=$(type -p esubmit)
# Submit the next job on host $shost (the command is echoed first for the log)
date
echo "${rsh}" -F "${shost}" "${esubmit_program}" -n 1 -t 10080 "$runscript"
"${rsh}" -F "${shost}" "${esubmit_program}" -n 1 -t 10080 "$runscript"
exit

It is started in this way:

module add easy
esubmit -n 1 -t 10080 ./resubmit_script

The same Kerberos ticket is used throughout the chain. When its remaining
lifetime is too short for the requested walltime, the esubmit call fails
with a message like this:

 esubmit: failed: tight ticket life? expire in 56m04s (request is 168h.) 
 esubmit: info: use -f to override.

Of course you need to adapt the script to your application,
and that includes changing the walltime specification and
the number of nodes in the esubmit lines etc.
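
As one possible adaptation of the resubmission test, the sketch below assumes that your application writes a marker file once the whole computation is complete. The file name `FINISHED` and the function name `decide_resubmit` are hypothetical; substitute whatever completion signal your application actually provides.

```shell
#!/bin/bash
# Hypothetical test: resubmit unless the application has left a
# completion marker (assumed name: FINISHED) in the run directory.
decide_resubmit() {
    local rundir=$1
    if [ -e "$rundir/FINISHED" ]; then
        echo 0    # computation complete, do not resubmit
    else
        echo 1    # only a checkpoint so far, submit the next job
    fi
}

# Usage inside the job script, after the application has exited:
need_to_resubmit=$(decide_resubmit "${RUNDIR:-.}")
echo "need_to_resubmit=$need_to_resubmit"
```

Using a file-based signal keeps the decision independent of the application's exit code, which may also be nonzero when the job is killed at the walltime limit.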
