SLURM - Tips & Tricks

Add StartTime to a job

The earliest possible start time of a job can be set during submission. This can be useful during maintenance periods without a proper downtime (aka 'playing the queue lottery').

During submission, add --begin and a time to set a start time. The default is 00:00:00, i.e. the job starts immediately, if possible. An example:

sbatch --begin=2018-06-16T02:00:00 jobscript.sh

Some alternatives follow. For complete information run man sbatch and search for “--begin”.

  --begin=16:00
  --begin=now+1hour
  --begin=now+60           (seconds by default)
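
To check when the scheduler currently expects a deferred job to start, squeue can be used with the --start option (12345 is a placeholder for your own job ID):

  # show the expected start time of a pending job
  squeue -j 12345 --start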

Modify StartTime of a waiting job

The earliest possible start time of a job can be set during submission, but also while the job is already waiting in the queue. This can be useful during maintenance periods without a proper downtime (aka 'playing the queue lottery').

Modify a job's information by stating your job ID and a start time.

scontrol update JobId=12345 StartTime=2018-06-16T02:00:00
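
Afterwards you can check whether the new start time has been applied, for example:

scontrol show job 12345 | grep StartTime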

Search for “StartTime” in the manual of scontrol to get complete information.

Modify the TimeLimit of a waiting job

Before a job has started, the requested resources can be modified by the user. To update the TimeLimit of a job, run the following scontrol command. The ID “12345” should be replaced with your own job's ID.

$ scontrol update JobId=12345 TimeLimit=2-20:00:00
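
To see the time limit currently set for your job, before or after the change, the job description can be inspected (again with 12345 as a placeholder):

$ scontrol show job 12345 | grep TimeLimit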

For more information check the manual of scontrol.

$ man scontrol
  

What is the current TimeLimit for the queues

If you don't know the current time limit for the queues, run:

$ sinfo  -o "%12P %.10A %.11l"
PARTITION    NODES(A/I)   TIMELIMIT
gpu                 0/2  4-20:00:00
big*              136/0  2-20:00:00
short              6/10     1:00:00
longrun            10/1 14-20:00:00
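
To query only a single partition, sinfo also accepts the -p option, e.g. for the longrun partition shown above:

$ sinfo -p longrun -o "%12P %.10A %.11l"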

For more information check the manual of sinfo.

$ man sinfo

Change Number of Nodes of a waiting job

If the number of nodes available in a partition has changed, you can modify the number of nodes of a waiting job with scontrol, but you then have to adjust the number of CPUs as well!

Modify a job's information by stating your job ID and the minimum and maximum number of nodes. Then calculate the resulting number of CPUs and specify it as well.

scontrol update JobId=12345 NumNodes=8-8 NumCPUs=128 
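
NumCPUs has to match the new node count. As a purely illustrative calculation: on nodes with 16 CPU cores each (replace 16 with the actual core count of your cluster's nodes), 8 nodes correspond to 8 x 16 = 128 CPUs. The values stored for the job can be checked with:

scontrol show job 12345 | grep -E 'NumNodes|NumCPUs'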

Search for “NumNodes” in the manual of scontrol to get complete information.

Change the Partition of a waiting job

In case you want to change the partition of an already submitted job, you can use scontrol to change the queue in which the job shall run. This is useful when you find a “free” spot in another queue (partition).

scontrol update JobId=12345 Partition=longrun

Replace 12345 with your own job's ID. Change longrun to any partition listed by the sinfo command.
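
To spot such a free slot, the sinfo output shown above helps; the NODES(A/I) column lists the allocated and idle nodes of each partition:

$ sinfo -o "%12P %.10A"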

You cannot change another user's job.

Search for “update” in the manual of scontrol to get complete information.

A job that waits for a previous job to finish

The partition for long runs is often full. However, many programs can continue a computation from a previously saved state.

This capability can be used to run long computations even in the big queue.

In detail, this means submitting several jobs, as many as are needed to cover the total computing time. The first job is started as usual. Each following job, however, has to wait for the previous one. This can be enabled with the argument sbatch -d singleton.

Note, however, that due to the current load-balancing strategy, waiting jobs delay your own subsequently submitted jobs even if those do not depend on any other job.

Here is an example to try out:

The job script job.sh:

#!/usr/bin/bash
#SBATCH -J dependRun        # job name; -d singleton matches jobs by this name
#SBATCH -p sw01_short       # partition to submit to

sleep 60                    # placeholder for the actual computation

The first job is submitted:

[user@login]$ sbatch job.sh

Each following job is started with an additional option:

[user@login]$ sbatch -d singleton job.sh

Attention! The option -d singleton only works for jobs that share the same job name.
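
If you need several chained runs, you can simply submit the same script repeatedly; every submission after the first waits for the previous dependRun job because all of them share that name. A small sketch (three repetitions chosen arbitrarily):

  [user@login]$ sbatch job.sh
  [user@login]$ for i in 1 2 3; do sbatch -d singleton job.sh; done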

Alternatively, the option -d afterany:[jobid] can be used. This option adds a dependency on an arbitrary job. With afterany, the dependent job is started as soon as the specified job has finished, regardless of whether it finished successfully or not.

[user@login]$ sbatch job.sh
Submitted batch job 12987
[user@login]$ sbatch -d afterany:12987 job.sh
[user@login]$ squeue -u user
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             12988 sw01_shor dependRu  user    PD       0:00      1 (Dependency)
             12987 sw01_shor dependRu  user     R       0:37      1 c001
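
For longer chains it becomes tedious to copy the job IDs by hand. A minimal sketch, assuming your sbatch version supports the --parsable flag (which prints only the ID of the submitted job):

  # submit the first job and remember its ID
  JOBID=$(sbatch --parsable job.sh)
  # chain three more jobs, each waiting for the previous one to finish
  for i in 1 2 3; do
      JOBID=$(sbatch --parsable -d afterany:${JOBID} job.sh)
  done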

There are further options, which are explained in detail in man sbatch.

Open a shell within a running job

Assume your currently running job has the ID 12345. You can then open a shell inside that running job, for example to check CPU and memory usage. Run this line on the login node:

$ srun --pty -u --jobid=12345 bash -i

This way you will open a bash shell on your job's head node.
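
Inside that shell, ordinary Linux tools (nothing SLURM-specific) can be used to inspect the job, for example:

$ top -u $USER     # CPU and memory usage of your processes
$ free -h          # overall memory usage of the node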

Interactive job

To start an interactive job, run this command:

srun -p short -N 1 --pty /bin/bash 

This will open a shell within a new job. However, this will only work when there are free resources. Otherwise this command will wait until resources become free, or until you kill it.
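
The resource request can be extended with the usual srun options; as an illustration (partition names and limits depend on your site's configuration):

srun -p short -N 1 -c 4 --time=00:30:00 --pty /bin/bash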
