First Login
This page gives an overview of how to use the Neumann-Cluster. Refer to the specific guides for details.
In short
- ssh is used to connect to 141.44.8.59
- create an SSH key pair on your computer
- send the following details to Dr. Schulenburg (see below) with the subject WWW-t100:
  - public key of your key pair
  - short description of your project
  - approximate required resources
- if you are a student, get permission from your supervisor
In case of questions regarding Neumann, contact Dr. Schulenburg at joerg.schulenburg@ovgu.de or call Tel. -58408.
To connect to Neumann, you need to be inside the university's IP range. This is automatically the case when you are connected to the university network. From outside, use the VPN.
With an existing account (here user), connect to the IP address 141.44.8.59 with ssh to log in to the head node (login node) of Neumann.
$ ssh user@141.44.8.59
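If you do not want to type the IP address every time, you can add a host entry to the ssh configuration on your own computer. This is only a convenience sketch; the alias neumann is an arbitrary example name and nothing on the cluster depends on it.
- ~/.ssh/config (on your own computer)
Host neumann
    HostName 141.44.8.59
    # replace "user" below with your actual account name
    User user
Afterwards, ssh neumann is enough to connect.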
To get an account on Neumann, see the Neumann-Cluster page. There you will find which pieces of information should be sent to the administrator. This includes some brief details on your project, the approximate required resources, and the public part of an SSH key pair on your computer. Send this information to Dr. Jörg Schulenburg (joerg.schulenburg@ovgu.de). If you are a student, you need to ask your supervisor to vouch for you by sending a brief mail stating that you are allowed to use Neumann.
Generate SSH key pair
Before your first login, you created a key pair on your own computer. You have to repeat this step on Neumann itself. This ensures that programs (or you) can log in to the individual compute nodes.
To create a key pair, use ssh-keygen. Beware: do not set a passphrase for this key. The key will be of no use if you set a passphrase here. Otherwise, go with the default settings.
[user@login ~]$ ssh-keygen
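If you are worried about accidentally setting a passphrase, the key can also be generated non-interactively. This is just an equivalent one-liner using the usual default file name; adjust the path if yours differs.
[user@login ~]$ ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa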
Now you have created a new key pair (id_rsa and id_rsa.pub, or similar). This key needs to be added to your authorized_keys so that the compute nodes know you (they share the same storage as the login node). Use the following line:
[user@login ~]$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
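SSH is strict about file permissions: if the .ssh directory or the authorized_keys file is writable by other users, the key is ignored. If the login to a compute node still fails, it may help to tighten the permissions (a general SSH remedy, not specific to Neumann):
[user@login ~]$ chmod 700 ~/.ssh
[user@login ~]$ chmod 600 ~/.ssh/authorized_keys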
If you are a user of one of the following programs, you have to do the above steps. For other users, these steps might be optional.
- CD-Adapco StarCCM+
- ANSYS Fluent
- …
Create a work directory
There are two directories relevant for your work on Neumann: /home/user and /scratch/tmp/.
In the home directory, you should not put any big files or a large number of files. It is only meant as storage for your own programs, log files, and job scripts. Moreover, home has only about 1 TB of storage, which is shared among all users. On top of that, if home gets filled up completely by your files, all running jobs on the cluster will be affected! Therefore, keep your home directory as clean as possible.
To run simulations, big computations, and heavy file manipulations, use the scratch directory. It not only has much more storage (currently 290 TB), it is also much faster: it is rated at 8 GB/s.
To check how much space is left, use the program df:
[user@login ~]$ df -h ~
Filesystem             Size  Used Avail Use% Mounted on
controller:/home/vc-a  914G  592G  323G  65% /home
[user@login ~]$ df -h /scratch/tmp/
Filesystem             Size  Used Avail Use% Mounted on
beegfs_nodev           292T  4.7T  287T   2% /mnt/beegfs
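df shows the state of the whole file system. To see how much of it is taken up by your own files, du can be used (shown here for home; the same works for your scratch directory once it exists, and it may take a while on large directory trees):
[user@login ~]$ du -sh ~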
For your own account, you should create a work directory on scratch in the following way:
[user@login ~]$ mkdir /scratch/tmp/${USER}
(${USER} is a variable which is automatically replaced with your actual account name.)
From now on, you can change into your own work directory with cd /scratch/tmp/user (replace user with your user name). Beware that the default settings allow all users to read your scratch directory (but not your home).
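If you do not want other users to be able to read your data, you can tighten the permissions of your work directory yourself. This is an ordinary Unix permission change, not a Neumann-specific feature; whether it is appropriate depends on whether colleagues need access to your files.
[user@login ~]$ chmod 700 /scratch/tmp/${USER}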
Submit a Job
The following short job script does practically nothing, but serves as an example of a full job script. All it does is run sleep for 5 minutes. The program sleep does nothing but let the computer wait until the time is over (without any CPU load).
- job.sh
#!/bin/bash
#SBATCH -J testRun        # jobname displayed by squeue
#SBATCH -N 1
#SBATCH --time 00:05:00   # set walltime, what you expect as maximum, in 24h format
#SBATCH -p short

srun sleep 5m
This script can be submitted to SLURM by calling sbatch:
[user@login ~]$ sbatch job.sh
Submitted batch job 14875
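By default, SLURM writes everything the job prints to a file slurm-<jobid>.out in the directory of submission (the scontrol example further below shows the corresponding StdOut path). For this test job the file stays empty, but for real jobs it is the first place to look for errors:
[user@login ~]$ cat slurm-14875.out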
If there is a free node, the job will run immediately. The job starts in the directory from which it was submitted, e.g. your home directory. The job script should then change into scratch by itself (if necessary).
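A minimal sketch of such a script, assuming the work directory /scratch/tmp/${USER} from above already exists (the file name, job name, partition, and walltime are only examples):
- job_scratch.sh
#!/bin/bash
#SBATCH -J scratchRun
#SBATCH -N 1
#SBATCH --time 00:05:00
#SBATCH -p short

# change into the scratch work directory before doing any real work
# (assumes /scratch/tmp/${USER} was created as described above)
cd /scratch/tmp/${USER}

# the actual computation would start here; sleep is only a placeholder
srun sleep 5m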
Most often, the cluster is not free (which is good in terms of utilization). In this case a submitted job will be enqueued. You can check the current queue with the command squeue:
[user@login ~]$ squeue
  JOBID PARTITION     NAME   USER ST       TIME  NODES NODELIST(REASON)
  14726       big     LSUB  user1 PD       0:00    140 (Resources)
  14743       big   job-01  user2 PD       0:00     90 (Priority)
  14821       big      BTP  user3 PD       0:00     40 (Dependency)
  14824       big      act  user3 PD       0:00     30 (Resources)
  14827       big    sqt10  user4 PD       0:00     70 (Resources)
  14710       big   RGB_D4  user4  R 4-02:36:20      1 c116
  14875     short  testRun   user  R       0:05      1 c002
  14411 sw04_long    MISLC  user6 PD       0:00      4 (Dependency)
  13212 sw04_long    MISLC  user6  R 9-21:22:05      4 c[088,107-109]
  14437 sw04_long      z_t  user5  R    3:48:26      7 c[045-046,065...
You will find your test run with the name testRun in that list, together with some more details on each job, such as the requested resources, the compute nodes, or the queue status.
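On a busy cluster the full list gets long. squeue can be restricted to your own jobs with the standard -u option:
[user@login ~]$ squeue -u ${USER}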
As an alternative to squeue, you can use squeue_all or qstat. If you are familiar with other resource managers, you can find a translation chart for SLURM.
Control a submitted job
If you have to cancel a job early or want to take a job out of the queue, use the command scancel. It takes a job ID (here 12345) to identify which job should be canceled. Of course, this only works for your own jobs (admins excepted).
[user@login ~]$ scancel 12345
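scancel also accepts a user name instead of a job ID, which cancels all of your jobs at once (a standard SLURM option; use it with care):
[user@login ~]$ scancel -u ${USER}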
To get all information on a submitted job, you can use the command scontrol in the following way:
[user@login ~]$ scontrol show job 12345
JobId=12345 Name=BTP
   UserId=user(1000) GroupId=user(1000)
   Priority=4294892716 Nice=0 Account=(null) QOS=(null)
   JobState=PENDING Reason=Dependency Dependency=singleton
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=4-19:00:00 TimeMin=N/A
   SubmitTime=2016-07-22T13:39:14 EligibleTime=Unknown
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=big AllocNode:Sid=c502:32676
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=40-40 NumCPUs=640 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=16:0:*:* CoreSpec=0
   MinCPUsNode=16 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/user/job.sh
   WorkDir=/home/user
   StdErr=/home/user/slurm-12345.out
   StdIn=/dev/null
   StdOut=/home/user/slurm-12345.out
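scontrol only knows about jobs that are still in the system. For jobs that have already finished, the accounting tool sacct can be asked instead, provided job accounting is enabled on the cluster (an assumption worth checking with the administrator):
[user@login ~]$ sacct -j 12345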
That's it?
No, of course not. There are many more features not covered by this short introduction. Please see the detail pages, such as sbatch, for comprehensive information. You can also find an overview of all guides here.