
First Login

This page gives an overview of how to start using the Neumann-Cluster. Refer to the specific guides for details.

In short

  • ssh is used to connect to 141.44.8.59
  • create an SSH key pair for your computer
  • send the following details to Dr. Schulenburg (see below) with the subject WWW-t100
    • public key of your key pair
    • short description of your project
    • approximate required resources
  • if you are a student, ask your supervisor to vouch for you

In case of questions regarding Neumann, contact Dr. Schulenburg at joerg.schulenburg@ovgu.de, or call Tel. -58408.

To connect to Neumann, you need to be inside the university's IP range. This is automatically the case when you are connected to the university network. From outside, use a VPN.

With an existing account (here user), connect to the IP address 141.44.8.59 with ssh to log into the head-node (login node) of Neumann.

$ ssh user@141.44.8.59
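
If you connect frequently, an entry in your local ~/.ssh/config saves typing. This is only a convenience sketch: the host alias neumann is an arbitrary example name and user stands for your actual account.

~/.ssh/config
Host neumann
    HostName 141.44.8.59
    User user

$ ssh neumann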

To get an account on Neumann, refer to the Neumann-Cluster page. There you will find which pieces of information should be sent to the administrator. These include a brief description of your project, the approximate required resources, and the public part of an SSH key pair created on your computer. Send this information to Dr. Jörg Schulenburg (joerg.schulenburg@ovgu.de). If you are a student, you also need to ask your supervisor to vouch for you by sending a brief mail which states that you are allowed to use Neumann.

Read the Connection Guide if you don't know how to use ssh or how to create a key pair, or if you use Windows.

Generate SSH key pair

Before the first login you already created a key pair on your own computer. You have to repeat this step on Neumann, too. This step ensures that programs (or you) can log in to the individual compute nodes.

To create a key pair, use ssh-keygen. Beware: do not set a passphrase for this key, since the key will be of no use with a passphrase here. Go with the default settings.

[user@login ~]$ ssh-keygen
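
If you prefer a non-interactive call, the following line is a sketch that creates the key with the default name and an empty passphrase in one step (assuming you have not created ~/.ssh/id_rsa on Neumann before):

[user@login ~]$ ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa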

Now you have created a new key pair (id_rsa and id_rsa.pub, or similar). The public key needs to be added to your authorized_keys so that the compute nodes know you (they share the same storage as the login node). Use the following line.

[user@login ~]$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
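
ssh is picky about file permissions, so if the node-to-node login fails later, overly open modes on these files are the first thing to check. The following lines are a quick sanity check, not an official requirement: they tighten the permissions and show that your new public key ended up in authorized_keys.

[user@login ~]$ chmod 600 ~/.ssh/authorized_keys
[user@login ~]$ tail -n 1 ~/.ssh/authorized_keys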

If you are a user of one of the following programs, you have to do the above steps. For other users, these steps might be optional.

  • CD-Adapco StarCCM+
  • ANSYS Fluent

Create a work directory

There are two directories relevant for your work on Neumann, /home/user and /scratch/tmp/. In the home directory, you should not put any big files or a large number of files. It is only a storage place for your own programs, log files and job scripts. Moreover, home has only about 1 TB of storage, which is shared among all users. On top of that, if home gets filled completely by your files, all running jobs on the cluster will be affected! Therefore keep your home directory as clean as possible.

To run simulations, big computations and heavy file manipulations, use the scratch directory. It does not just have much more storage (currently 290 TB), it is also much faster: it is rated at 8 GB/s!

The file storage on the server is protected against data loss. However, there is no backup system to restore deleted files! Make sure you back up important files yourself.
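
One way to do this is to pull important results from scratch to your own machine with rsync, run from your local computer. The paths below are placeholders for illustration only:

$ rsync -avz user@141.44.8.59:/scratch/tmp/user/results/ ./neumann-backup/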

To check how much space is left, use the program df:

[user@login ~]$ df -h ~
Filesystem             Size  Used Avail Use% Mounted on
controller:/home/vc-a  914G  592G  323G  65% /home

[user@login ~]$ df -h /scratch/tmp/
Filesystem      Size  Used Avail Use% Mounted on
beegfs_nodev    292T  4.7T  287T   2% /mnt/beegfs
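
To see how much of the shared home space your own files occupy, du prints a summary (this can take a moment if you have many files):

[user@login ~]$ du -sh ~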

For your own account, you should create a work directory on scratch in the following way:

[user@login ~]$ mkdir /scratch/tmp/${USER}

(${USER} is a variable which is automatically replaced with your actual account name)

From now on, you can change into your own work directory with 'cd /scratch/tmp/user' (replace user with your user name). Beware that the default settings allow all users to read your scratch directory (but not your home).
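
If you do not want other users to read your files on scratch, you can restrict the permissions of your work directory yourself. This is a suggestion, not a site requirement; check with the administrators if colleagues are supposed to share the directory.

[user@login ~]$ chmod 700 /scratch/tmp/${USER}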

Submit a Job

The following short job script practically does nothing, but it shall serve as an example of a full job script. All this script does is run sleep for 5 minutes. The program sleep does nothing but let the computer wait until the time is over (without any CPU load).

job.sh
#!/bin/bash
#SBATCH -J testRun # jobname displayed by squeue
#SBATCH -N 1 # number of nodes
#SBATCH --time 00:05:00 # set walltime, what you expect as maximum, in 24h format
#SBATCH -p short # partition (queue) to submit to
srun sleep 5m # the actual work: wait for 5 minutes

This script can be submitted to SLURM by calling sbatch:

[user@login ~]$ sbatch job.sh
Submitted batch job 14875

If there is a free node, the job will start immediately.

Owing to the system's configuration, you should always submit job scripts from the home directory. The job script should then change into scratch by itself (if necessary).
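
A sketch of how such a script could look: it is submitted from home and switches into the work directory created above before doing any real work. The file name and the workload are arbitrary examples.

job_scratch.sh
#!/bin/bash
#SBATCH -J scratchRun # jobname displayed by squeue
#SBATCH -N 1 # number of nodes
#SBATCH --time 00:10:00 # set walltime, what you expect as maximum, in 24h format
#SBATCH -p short
cd /scratch/tmp/${USER} # change into the work directory on scratch
srun hostname > nodes.txt # example workload: record the allocated node name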

Most often, the cluster is not free (which is good in terms of utilization). In this case a submitted job will be enqueued. You can check the current queue with the command squeue:

[user@login ~]$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
14726       big     LSUB    user1 PD       0:00    140 (Resources)
14743       big   job-01    user2 PD       0:00     90 (Priority)
14821       big      BTP    user3 PD       0:00     40 (Dependency)
14824       big      act    user3 PD       0:00     30 (Resources)
14827       big    sqt10    user4 PD       0:00     70 (Resources)
14710       big   RGB_D4    user4  R 4-02:36:20      1 c116
14875     short  testRun    user   R       0:05      1 c002
14411 sw04_long    MISLC    user6 PD       0:00      4 (Dependency)
13212 sw04_long    MISLC    user6  R 9-21:22:05      4 c[088,107-109]
14437 sw04_long      z_t    user5  R    3:48:26      7 c[045-046,065...

You will find your test run with the name testRun in that list, along with some more details on each job, such as requested resources, compute nodes, or queue status.

As alternatives to squeue, you can use squeue_all or qstat. If you are familiar with other resource managers, you can find a translation chart for SLURM.
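
On a busy cluster the full list gets long; squeue can be restricted to your own jobs with the -u option:

[user@login ~]$ squeue -u ${USER}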

Control a submitted job

If you have to cancel a job early, or if you want to take a job out of the queue, use the command scancel. It takes a job ID (here 12345) to identify which job should be canceled. Of course, this only works with your own jobs (admins excepted).

[user@login ~]$ scancel 12345
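
scancel also accepts other selectors; for example, the -u option cancels all of your own pending and running jobs at once (use with care):

[user@login ~]$ scancel -u ${USER}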

To get all information on a submitted job, you can use the command scontrol in the following way:

[user@login ~]$ scontrol show job 12345
JobId=12345 Name=BTP
   UserId=user(1000) GroupId=user(1000)
   Priority=4294892716 Nice=0 Account=(null) QOS=(null)
   JobState=PENDING Reason=Dependency Dependency=singleton
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=4-19:00:00 TimeMin=N/A
   SubmitTime=2016-07-22T13:39:14 EligibleTime=Unknown
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=big AllocNode:Sid=c502:32676
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=40-40 NumCPUs=640 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=16:0:*:* CoreSpec=0
   MinCPUsNode=16 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/user/job.sh
   WorkDir=/home/user
   StdErr=/home/user/slurm-12345.out
   StdIn=/dev/null
   StdOut=/home/user/slurm-12345.out
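
The output is quite long; if you are only interested in a few fields, for example the job state and the planned start time, you can filter it with grep:

[user@login ~]$ scontrol show job 12345 | grep -E 'JobState|StartTime'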

That's it?

No, of course not. There are many more features not covered by this short introduction. Please use the detail pages, such as sbatch, to find comprehensive information. You can also find an overview of all guides here.
