OBSOLETE

StarCCM+ jobs during the Neumann maintenance

Symptoms

StarCCM+ starts on the reserved servers but does not respond. The processes run at 100% CPU, but nothing is written to the log file, or only a few sparse lines. No connection via tunnel is possible.

About

With the previous configuration, StarCCM+ could not run on more than approx. 70 nodes in parallel. To solve this issue, the software stack (MPI, InfiniBand and a few other components) has been updated. The crucial difference between the old and new server software versions is the presence of the program lspci, which StarCCM+ uses to determine the right communication paths (I believe).

This maintenance results in two versions of servers existing at the same time. StarCCM+ cannot run on an inconsistent set of servers: it will fail to start up if your job is allocated a mix of old and new servers.
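A job script could guard against a mixed allocation before launching StarCCM+. The following is only a minimal sketch, not part of the original setup: it assumes that the presence of /usr/sbin/lspci is a sufficient marker for the new server version, and the helper name all_same is hypothetical.

```shell
#!/bin/bash
## Hypothetical helper: succeeds if all lines on stdin are identical
## (or if there is no input at all), fails otherwise.
all_same() {
    sort -u | awk 'END { exit (NR > 1) }'
}

## Possible use inside a job script (assumption: one srun task per node
## is enough to probe every allocated server):
##
##   srun --ntasks-per-node=1 \
##       sh -c '[ -e /usr/sbin/lspci ] && echo yes || echo no' \
##       | all_same || { echo "Mixed old/new servers, aborting" >&2; exit 1; }
```

If the check fails, the job can stop early instead of hanging at 100% load without output.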

Exclude old servers

To avoid this problem, exclude all servers that still have the old version. This can be done on submit by passing the list of servers to the -x (or --exclude) argument.

sbatch -x c[005,006,007,...] myjobscript.sh

For details on this argument, check the manual man sbatch.

The current set of old servers (10/2017) can be found in the following file:

old.lst
c[005,006,007,008,010,011,012,013,014,015]

For ease of use, you can download old.lst, copy it to your home directory (~), and run:

sbatch -x $(cat ~/old.lst) myjobscript.sh
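If old.lst is missing or lists no servers, the command above passes an empty or invalid argument to -x. A small wrapper can avoid that. This is a sketch under assumptions: the function name exclude_opt is hypothetical, and the file is assumed to hold a single node list in the c[...] form shown above.

```shell
#!/bin/bash
## Hypothetical helper: print an sbatch exclude option built from a
## node-list file, or print nothing if the file is missing, empty,
## or contains only the empty list "c[]".
exclude_opt() {
    local file="$1"
    local nodes
    if [ ! -s "$file" ]; then
        return 0
    fi
    ## Strip whitespace and newlines so the list is a single word
    nodes=$(tr -d '[:space:]' < "$file")
    if [ -z "$nodes" ] || [ "$nodes" = "c[]" ]; then
        return 0
    fi
    printf -- '--exclude=%s\n' "$nodes"
}

## Possible use on submit:
##   sbatch $(exclude_opt ~/old.lst) myjobscript.sh
```

With an empty list the wrapper prints nothing, so sbatch is called without any exclude option.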

As of 11/2017, the remaining old servers have been drained (removed from the list of available servers) and are awaiting maintenance. Inconsistent sets of servers can therefore no longer occur, and the exclude option is no longer needed.

Get a recent list

To create the most recent list of old servers, run the following script on Neumann.

sortNodeVersions.sh
#!/bin/bash
## Script for Neumann Cluster to check whether lspci is present on servers
## lspci is only available on newest servers
## 
## SE
 
 
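rm -f tempNewNodes tempOldNodes notresponding.lst
## Create empty temp files so the sorting step works even if one category has no servers
touch tempNewNodes tempOldNodes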
 
echo "Check presence of 'lspci' on following servers"
 
for n in {001..172}
do
        ## Check if server is available or drained for maintenance (not available)
        ONLINE=$(scontrol show node c$n | grep -c Reason)
        if [ $ONLINE -eq 0 ]
        then
                ## If available, check if lspci is present, but don't wait too long
                echo "checking c$n"
                STATUS=$(timeout 4s ssh "c$n" stat /usr/sbin/lspci 2>&1)
                RESPOND=$(echo "$STATUS" | wc -w)
                FOUND=$(echo "$STATUS" | grep -c cannot)
 
                ## If not responded, then skip, else decide whether old or new server
                if [ $RESPOND -eq 0 ]
                then
                        echo "c$n didn't respond"
                        echo $n >> notresponding.lst
                else
                        if [ $FOUND -eq 0 ]
                        then
                                echo c$n >> tempNewNodes
                        else
                                echo c$n >> tempOldNodes
                        fi
                fi
 
        else
                echo "c$n not available"
                echo $n >> notresponding.lst
        fi 
done
 
wait
 
echo "Survey done"
 
 
## Sort server numbers and delete temps
cut -d'c' -f 2 tempNewNodes | sort | paste -s -d ',' - > new.lst
cut -d'c' -f 2 tempOldNodes | sort | paste -s -d ',' - > old.lst
rm -f tempNewNodes tempOldNodes
 
## Add prefix and suffix to server number list
echo "c[$(cat new.lst)]" > new.lst
echo "c[$(cat old.lst)]" > old.lst
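The last two steps of the script (sorting the numbers and wrapping them in c[...]) can also be written as one function. This is a sketch only; the name format_nodelist is hypothetical, and it assumes one server name per line on stdin, each with the prefix c as in the temp files above.

```shell
#!/bin/bash
## Hypothetical helper: read server names (c005, c010, ...) from stdin
## and print them as one sorted list in the form c[005,010,...].
## Prints nothing if there is no input.
format_nodelist() {
    local numbers
    ## Drop the leading "c", sort numerically, join with commas
    numbers=$(sed 's/^c//' | sort -n | paste -s -d ',' -)
    if [ -n "$numbers" ]; then
        printf 'c[%s]\n' "$numbers"
    fi
}

## Possible use at the end of the survey:
##   format_nodelist < tempOldNodes > old.lst
```

Unlike the original final step, this writes nothing at all when no servers were found, instead of the empty list c[].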
guide/starccm/star_and_neumann_during_update.txt · Last modified: 2018/01/04 14:37 by seengel
CC Attribution-Share Alike 3.0 Unported