Neumann Cluster

This page is dedicated to information collected by users of the Neumann HPC-Cluster.
For general information and news provided by Dr. Schulenburg, visit the Neumann-Cluster's homepage.

Quick introduction

  1. Understand how to work with a terminal, e.g. do the UNIX tutorial.
  2. Understand how to setup SSH connections, e.g. read the Remote Connection Guide.
  3. Ask your supervisor to request an account for you. If you are an employee or similar, you can ask Dr. Schulenburg directly.
    He will need the public part of your SSH key, or will send you a password on request.
  4. Read the Neumann Introduction written by Laszlo Daroczy (2016)

Quick SLURM Reference

program   explanation
sbatch    Submits jobs with a job script, e.g. sbatch -p big myjob.sh to run myjob.sh in the big queue
squeue    Shows the current queue
scancel   Cancels waiting and running jobs, e.g. scancel 32611 cancels the job with ID 32611
srun      Runs interactive and parallel jobs, e.g. srun -p short -N 1 --pty /bin/bash to start an interactive job
sinfo     Shows the current status of the queues
module    Loads pre-configured settings for certain programs, such as compilers and MPI
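
A minimal job script might look like the following sketch (the queue, time limit, module and program name are placeholders; adjust them to your case):

#!/bin/bash
#SBATCH -p big                  # queue/partition
#SBATCH -N 1                    # number of nodes
#SBATCH --time=01:00:00         # wall-clock limit
#SBATCH --mem=120000            # memory per node in MB
#SBATCH -J myjob                # job name

module load starCCM/13.02.013   # example module (version taken from the log example further down)
cd /scratch/tmp/$USER           # assumed scratch layout, see "HPC etiquette" below
srun ./my_program               # my_program is a placeholder

Submit it from your home directory with sbatch myjob.sh and monitor it with squeue.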

For advanced usage, have a look at SLURM - Tips & Tricks

All Topics regarding Neumann

list created automatically

HPC etiquette

Don't waste resources

  • Use as few resources as possible. The cluster is needed by dozens of university members!
  • Check the performance and scalability of your programs. The parallel efficiency should not be less than 0.5 to avoid wasting resources unnecessarily.
  • Use the --nice option if you submit many individual jobs at once (see man sbatch).
  • Use array jobs for a large number of small jobs, with a limit on the number of concurrently running tasks (e.g. --array=1-100%10; see the sketch after this list).
  • Limit the requested memory (e.g. #SBATCH --mem=120000).
  • No Pre-/Postprocessing in the main queues! For interactive work, the short queue is available.
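
A sketch of an array job with a concurrency limit (array range, limit and task name are placeholders):

#!/bin/bash
#SBATCH -p big
#SBATCH --array=1-100%10        # 100 tasks, at most 10 running at the same time
#SBATCH --nice=100              # lower priority so other users are not delayed
#SBATCH --mem=2000              # request only the memory you actually need (MB)

# each task receives its own index via SLURM_ARRAY_TASK_ID
./my_task input_${SLURM_ARRAY_TASK_ID}.dat   # my_task is a placeholder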

Use the scratch directory as working directory

  • Your scratch folder should be the working directory and contain most of the data. A lot of disk space and read/write capacity is available there. The home folder, however, is reserved for quick-access configuration files.
  • SLURM scripts should be located in the home directory and submitted from there (see the sketch below).
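
A common pattern inside the job script, assuming the scratch path follows the /scratch/tmp/<username> layout seen elsewhere on this page:

# submitted from ~, but the job itself works in scratch
WORKDIR=/scratch/tmp/$USER/my_case      # my_case is a placeholder
mkdir -p "$WORKDIR"
cp ~/input/case.dat "$WORKDIR"/         # stage input (placeholder path)
cd "$WORKDIR"
./my_solver case.dat                    # my_solver is a placeholder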

Clean up after yourself

  • Release unused nodes when your work is complete.
  • Remove unnecessary files.
  • Do not leave any zombie programs behind.

Backup your data

  • No backups of your Neumann data are carried out automatically: back up your results and critical files to your own computer (see the example below).
  • Administrators may delete old files if they deem it necessary.
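
For example, results can be pulled from scratch to your own machine with rsync (host name and paths are placeholders; use the login node address you normally connect to):

# run this on your own computer, not on Neumann
rsync -av <login-node>:/scratch/tmp/<username>/my_case/results/ ./neumann-backup/my_case/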

Do not cause the login node to crash

  • The login node controls the entire cluster, including SLURM which runs jobs and the queues.
  • The login node is to be used only:
    • to submit and monitor jobs
    • to edit scripts and files
    • to prepare data (scp, cp, mv, …)
  • For any resource-intensive task (computing, heavy data processing, interactive sessions etc.), reserve a node with an sbatch script or an interactive srun session (see the example below).
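
For example, instead of post-processing on the login node, reserve a node on the short queue (this mirrors the srun command from the SLURM reference above):

# reserve one node interactively on the short queue
srun -p short -N 1 --pty /bin/bash
# ...then run the heavy processing on that node, not on the login node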

Don't share passwords

  • Do not share your password with anyone
  • Never request the passwords of other users (colleagues, students etc.)
  • Each person using Neumann should have their own username: this enhances security and resource attribution. Contact the administrator every time a new person is to use Neumann.

Frequent Problems

DOS line breaks

Error Message Example:

$ sbatch script_name.sh
sbatch: error: Batch script contains DOS line breaks (\r\n) 
sbatch: error: instead of expected UNIX line breaks (\n)

Solution

  • convert your script with
    $ dos2unix script_name.sh

    alternatively use

    $ sed -i 's/^M//' script_name.sh

    The character ^M is a single special character. To type it, press and hold CTRL, press and release v, then (still holding CTRL) press m. A quick way to check whether DOS line breaks remain is shown below.
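
To check whether a script still has DOS line breaks (a general Linux check, not specific to Neumann):

$ file script_name.sh            # the output mentions "CRLF line terminators" if DOS breaks remain
$ cat -A script_name.sh | head   # DOS lines end in ^M$ instead of just $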

Cause

  • Microsoft Windows (DOS) marks line breaks differently (\r\n) than UNIX-like systems such as Linux and macOS (\n).
  • Some programs, such as sbatch, cannot handle both formats.

More information on this problem here.

Stray Process (StarCCM)

Error Message Example:

c013.vc-a.217Received 643681 out-of-context eager message(s) from stray process PID=8216 running on host 172.16.0.14 (LID 0x61, ptype=0x1, subop=0x22, elapsed=35800.987s)  (err=49)

Solution

  • Log into c014 via ssh and run top to see which user has left a stray process behind (see the sketch below).
  • Ask the respective user to clean up c014, e.g. by using the clean-up script with a machinefile containing only c014 on one line.
  • If contacting that user is not possible or does not help, send a request to Dr. Schulenburg to clean up the user's processes on the respective node(s).
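
If the stray process turns out to be your own, a quick way to find and stop it (these commands only affect processes owned by your account):

ssh c014
pgrep -u "$USER" -l star      # list your processes whose name contains "star"
pkill -u "$USER" starccm      # stop your own leftover StarCCM processes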

Cause

  • A lost StarCCM process is still running on c014 (IP address: 172.16.0.14).
  • The lost process tries to communicate with c013.
  • c013 is now running a different simulation.

More information on this problem here.

MPI Init

Error Message Example:

c014.vc-a.170can't open /dev/ipath, network down (err=26)
starccm+: Rank 0:170: MPI_Init: psm_ep_open() failed
starccm+: Rank 0:170: MPI_Init: Can't initialize RDMA device
starccm+: Rank 0:170: MPI_Init: Internal Error: Cannot initialize RDMA protocol

Solution

  • Log into c014 via ssh and run top to see which user has left a stray process behind (see the sketch below).
  • Ask the respective user to clean up c014, e.g. by using the clean-up script with a machinefile containing only c014 on one line.
  • If contacting that user is not possible or does not help, send a request to Dr. Schulenburg to clean up the user's processes on the respective node(s).
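
A leftover MPI daemon can be spotted the same way (only your own processes are listed):

ssh c014
pgrep -u "$USER" -lf mpi      # list your processes whose command line mentions mpi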

Cause

  • A lost MPI daemon is still running on c014.
  • It prevents new MPI connections from initializing (hence the /dev/ipath and RDMA errors above).

More information on this problem here.

No access to compute nodes

Symptoms

  1. mpirun, StarCCM, Fluent or manual connections might complain:
    mpirun: Warning one or more remote shell commands exited with non-zero status, which may indicate a remote access problem.
    error: Design STAR-CCM+ simulation completed
    Server process ended unexpectedly (return code 255)
    mpirun: Warning one or more remote shell commands exited with non-zero status, which may indicate a remote access problem.
  2. mpirun might complain:
    mpirun: Warning one or more remote shell commands exited with non-zero status, which may indicate a remote access problem.
  3. Logging in to compute nodes (e.g. ssh node001) does not work but asks for your password instead.

Solution

  • Log in to Neumann (the login node).
  • Create a second key pair there. Press Enter three times (accepting the default file name and an empty passphrase) after running this command:
    ssh-keygen
  • Add the new key to authorized_keys:
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
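
If the compute nodes still ask for a password, overly open permissions on ~/.ssh can also cause the key to be rejected; a common fix (general OpenSSH behaviour, not Neumann-specific):

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys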

Cause

  • Compute nodes are on a separate subnetwork behind the login node.
  • Compute nodes read your authorized_keys file to authenticate you when connections are made between nodes.
  • The required key pair is not created automatically, or it might have been overwritten.

Command not found

Symptoms

Error Message Example:

/var/spool/slurmd/job11105/slurm_script: line 106: starccm+: command not found

Solution

  • Check that the module loaded correctly. If you used the templates here, check the slurm.log file and see which modules have been loaded:
    Loaded Modules: starCCM/13.02.013
    pwd=/scratch/tmp/
  • Store the sbatch script in your home directory (~) and submit it from there (see the commands below).
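
To verify the environment by hand on the login node (the module name starCCM/13.02.013 is taken from the log example above and may differ on your system):

module avail starCCM          # list the available versions of the module
module load starCCM/13.02.013
module list                   # confirm it is loaded
which starccm+                # should now print the full path of the executable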

Cause

  • The shell does not know where to look for the program (in this case starccm+).
  • Module files are used to set up the environment for programs such as starccm+. If loading the module fails, or there is a typo in the module load command, the environment lacks the information where the program is located.

Undefined slave list

Symptoms

Error Message Example:

Checking whether servers are clean (of user processes)
ERROR === On server node001 are user processes running: "      1 nscd
      1 ntp"
Finally Running on 0 processes on 0 servers.
Starting local server: /cluster/apps/starCCM/starccm+_12.02.011/STAR-CCM+12.02.011-R8/star/bin/starccm+ -licpath 1999@flex.cd-adapco.com -power -podkey XXXXXXXXXXXXXXXXXXXXXX -collab -np 0 -machinefile /scratch/tmp/seengel/machinefile.11102.txt -server -rsh /usr/bin/ssh /scratch/tmp/seengel/star.sim
Error: Undefined slave list
error: Design STAR-CCM+ simulation completed
Server process ended unexpectedly (return code 1)

Solution

  • Check the requested nodes for user processes.
  • Check the sbatch log file for the ERROR lines and see whether the reported "users" are actually users or system processes. If the latter is the case, add them to the filter in the templates where system users such as root, munge, dbus, … are removed:
    sed '/root/d;/munge/d;/dbus/d;/ldap/d;'

Cause

  • The job-starccm.sh script currently has a feature that detects whether user processes are left on the requested nodes. If it finds processes that are not filtered out, it removes the affected servers from the machinefile so that StarCCM can run without further complications. If the feature's filter is not working correctly, for example due to changes on the system, the script fails in this way, too. A sketch of such a check is shown below.
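
For illustration, a minimal sketch of this kind of check (the machinefile name is a placeholder and the filter is the sed line from the Solution above; the actual job-starccm.sh may differ):

# report leftover non-system users with processes on each node of the machinefile
while read -r node; do
  users=$(ssh -n "$node" ps -eo user= | sort -u \
            | sed '/root/d;/munge/d;/dbus/d;/ldap/d;')
  [ -n "$users" ] && echo "node $node still has user processes: $users"
done < machinefile.txt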

