OBSOLETE

# StarCCM+ jobs during the Neumann maintenance

## Symptoms

StarCCM+ starts up on the reserved servers but does not respond. The processes run at 100%, but nothing is written to the log file, or only sparse information. No connection via tunnel is possible.

In a previous configuration, StarCCM+ could not run on more than approx. 70 nodes in parallel. To solve this issue, the software, such as MPI, InfiniBand and a few other components, has been updated. The crucial difference between the servers' software versions is the presence of the program lspci. It is used by StarCCM+ to determine the right communication paths (I believe).

This maintenance results in two versions of servers existing simultaneously. StarCCM+ cannot run on an inconsistent set of servers: if your job is allocated a mix of old and new servers, it will fail to start up.
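A job script could test its own allocation for such a mix before launching StarCCM+. The following is only a minimal sketch: `check_consistency` is a hypothetical helper, not part of StarCCM+ or Slurm, and the input format (`<node> <new|old>` per line) is an assumption made for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not part of StarCCM+ or Slurm).
# Reads lines of the form "<node> <new|old>" on stdin and prints
# "consistent" if every node reports the same version, else "mixed".
check_consistency() {
    awk '!seen[$2]++ { kinds++ }
         END { if (kinds <= 1) print "consistent"; else print "mixed" }'
}

# Possible use inside a job script (assumes password-less ssh to the nodes;
# note that an unreachable node would be counted as "old" here):
# for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
#     if ssh "$node" test -e /usr/sbin/lspci; then
#         echo "$node new"
#     else
#         echo "$node old"
#     fi
# done | check_consistency
```

If the result is "mixed", the job script could exit early instead of leaving StarCCM+ hanging on the reserved servers.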

## Exclude old servers

To avoid this problem, exclude all servers which still have the old version. This can be done at submit time by passing the list of servers to the -x (or --exclude) argument:

    sbatch -x c[005,006,007,...] myjobscript.sh

For details on this argument, see the manual page: man sbatch.

The current set of old servers (10/2017) can be found in the following file:

old.lst
c[005,006,007,008,010,011,012,013,014,015]
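To see which individual servers such an expression covers, it can be expanded into one name per line. On a Slurm system, `scontrol show hostnames` does this. The function below is a pure-bash sketch for simple comma lists only; `expand_nodes` is a hypothetical name, and ranges such as c[005-010] are not handled.

```shell
#!/bin/bash
# Hypothetical helper: expands a simple comma list such as c[005,006]
# into one node name per line. Ranges (c[005-010]) are NOT handled.
expand_nodes() {
    local expr=$1
    local prefix=${expr%%\[*}   # text before the first '['
    local body=${expr#*\[}      # text after the first '['
    body=${body%\]}             # drop the trailing ']'
    local IFS=,                 # split the number list on commas
    local num
    for num in $body; do
        echo "${prefix}${num}"
    done
}

# Example call:
# expand_nodes 'c[005,006,010]'
```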

For ease of use, you can download old.lst, copy it to your home directory (~), and run:

    sbatch -x $(cat ~/old.lst) myjobscript.sh

By 11/2017, the remaining old servers have been drained from the list of available servers and are awaiting maintenance. Inconsistency between servers therefore no longer occurs, and the exclude option is no longer needed.

## Get a recent list

To create the most recent list of old servers, run the following script on Neumann.

sortNodeVersions.sh

    #!/bin/bash
    ## Script for Neumann Cluster to check whether lspci is present on servers
    ## lspci is only available on newest servers
    ##
    ## SE

    rm -f tempNewNodes tempOldNodes notresponding.lst
    echo "Check presence of 'lspci' on following servers"

    for n in {001..172}
    do
        ## Check if server is available or drained for maintenance (not available)
        ONLINE=$(scontrol show node c$n | grep -c Reason)
        if [ "$ONLINE" -eq 0 ]
        then
            ## If available, check if lspci is present, but don't wait too long
            echo "checking c$n"
            STATUS=$(timeout 4s ssh c$n stat /usr/sbin/lspci 2>&1)
            RESPOND=$(echo $STATUS | wc -w)
            FOUND=$(echo $STATUS | grep -c cannot)
            ## If not responded, then skip, else decide whether old or new server
            if [ "$RESPOND" -eq 0 ]
            then
                echo "c$n didn't respond"
                echo $n >> notresponding.lst
            else
                if [ "$FOUND" -eq 0 ]
                then
                    echo c$n >> tempNewNodes
                else
                    echo c$n >> tempOldNodes
                fi
            fi
        else
            echo "c$n not available"
            echo $n >> notresponding.lst
        fi
    done
    wait
    echo "Survey done"

    ## Sort server's numbers and delete temps
    cat tempNewNodes | cut -d'c' -f 2 | sort | paste -s -d ',' > new.lst
    cat tempOldNodes | cut -d'c' -f 2 | sort | paste -s -d ',' > old.lst
    rm -f tempNewNodes tempOldNodes

    ## Add prefix and suffix to server number list
    echo "c[$(cat new.lst)]" > new.lst
    echo "c[$(cat old.lst)]" > old.lst
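When no old servers are found, the script leaves old.lst containing just c[], and an unquoted empty expansion after -x would make sbatch treat the job script name as the exclude list. A small wrapper could guard against both cases. This is only a sketch: `sbatch_cmd` is a hypothetical name, and it prints the command line instead of running it so that it can be inspected first.

```shell
#!/bin/bash
# Hypothetical wrapper: prints the sbatch command line for a job script,
# adding -x only if the list file exists and names at least one server.
# Arguments: <list file> <job script>
sbatch_cmd() {
    local list_file=$1 job_script=$2
    local nodes=""
    if [ -r "$list_file" ]; then
        nodes=$(cat "$list_file")
    fi
    if [ -n "$nodes" ] && [ "$nodes" != "c[]" ]; then
        echo "sbatch -x $nodes $job_script"
    else
        echo "sbatch $job_script"
    fi
}

# Example: inspect the command, then run it by hand.
# sbatch_cmd ~/old.lst myjobscript.sh
```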