StarCCM+ MPI Problem on startup

Recently the following errors have appeared frequently. They indicate that there might be a ghost process on c052.

c052.vc-a.31ipath_userinit: assign_context command failed: Network is down
c052.vc-a.31can't open /dev/ipath, network down (err=26)
starccm+: Rank 0:31: MPI_Init: psm_ep_open() failed
starccm+: Rank 0:31: MPI_Init: Can't initialize RDMA device
starccm+: Rank 0:31: MPI_Init: Internal Error: Cannot initialize RDMA protocol
MPI Application rank 31 exited before MPI_Init() with status 1
starccm+ Terminated
starccm+ Terminated
...

This problem is known. The cause is old starccm processes that could not be stopped correctly. Unfortunately, it is not yet known what creates these starccm ghost processes.
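The error header itself names the affected node: the prefix c052.vc-a.31 means rank 31 on node c052. When skimming long job logs, a small sketch like the following (a hypothetical helper, not part of the cluster tooling) pulls that node name out:

```shell
#!/bin/bash
# Extract the node name from an MPI error prefix such as "c052.vc-a.31...".
# The part before the first dot is the node; the trailing number is the rank.
err='c052.vc-a.31ipath_userinit: assign_context command failed: Network is down'
node=$(echo "$err" | cut -d. -f1)
echo "check for ghost processes on $node"
```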

Another problem with ghost processes is that they may still try to communicate with their "old friends", i.e. the other nodes from the machinefile list. In this case your error message looks like this:

c013.vc-a.217Received 690121 out-of-context eager message(s) from stray process PID=8216 running on host 172.16.0.14 (LID 0x61, ptype=0x1, subop=0x22, elapsed=38383.939s)  (err=49)
c013.vc-a.241Received 690121 out-of-context eager message(s) from stray process PID=8216 running on host 172.16.0.14 (LID 0x61, ptype=0x1, subop=0x22, elapsed=38383.946s)  (err=49)
c013.vc-a.193Received 690122 out-of-context eager message(s) from stray process PID=8216 running on host 172.16.0.14 (LID 0x61, ptype=0x1, subop=0x22, elapsed=38383.952s)  (err=49)
c018.vc-a.220Received 690122 out-of-context eager message(s) from stray process PID=8216 running on host 172.16.0.14 (LID 0x61, ptype=0x1, subop=0x22, elapsed=38383.870s)  (err=49)
c020.vc-a.246Received 690122 out-of-context eager message(s) from stray process PID=8216 running on host 172.16.0.14 (LID 0x61, ptype=0x1, subop=0x22, elapsed=38384.077s)  (err=49)
c031.vc-a.344Received 1767581 out-of-context eager message(s) from stray process PID=13693 running on host 172.16.0.143 (LID 0x7a, ptype=0x1, subop=0x22, elapsed=38395.089s)  (err=49)
c031.vc-a.224Received 883794 out-of-context eager message(s) from stray process PID=13693 running on host 172.16.0.143 (LID 0x7a, ptype=0x1, subop=0x22, elapsed=38395.217s)  (err=49)
c032.vc-a.273Received 883795 out-of-context eager message(s) from stray process PID=13693 running on host 172.16.0.143 (LID 0x7a, ptype=0x1, subop=0x22, elapsed=38395.185s)  (err=49)
c031.vc-a.32Received 883796 out-of-context eager message(s) from stray process PID=13693 running on host 172.16.0.143 (LID 0x7a, ptype=0x1, subop=0x22, elapsed=38395.339s)  (err=49)

This means that process PID=13693 from node c143 (it has the IP 172.16.0.143) tries to reach nodes c031 and c032 and causes a problem with Star-CCM+! The same problem appears on nodes c013, c018, and c020 because of the ghost process PID=8216 from node c014.
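The mapping from IP address to node name can be scripted. The sketch below is a hypothetical helper; it assumes the convention visible above that 172.16.0.N belongs to node c&lt;N&gt; (zero-padded to three digits). It extracts the stray PID and the offending node from one of these log lines:

```shell
#!/bin/bash
# Parse a "stray process" log line: pull out the PID and the sender's IP,
# then map the last octet of the IP to a node name (172.16.0.14 -> c014).
line='c013.vc-a.217Received 690121 out-of-context eager message(s) from stray process PID=8216 running on host 172.16.0.14 (LID 0x61, ptype=0x1, subop=0x22, elapsed=38383.939s)  (err=49)'

pid=$(echo "$line" | grep -o 'PID=[0-9]*' | cut -d= -f2)
ip=$(echo "$line" | grep -o 'host [0-9.]*' | awk '{print $2}')
node=$(printf 'c%03d' "${ip##*.}")

echo "ghost process PID=$pid on node $node"
```

This tells you in one line whom to contact about which node, instead of decoding the IP by hand.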

To test whether this problem is still present on your servers, run the following script with your machinefile:

./count_uprocs.sh machinefile.txt
count_uprocs.sh
#!/bin/bash
# Count user processes on every server listed in the machinefile ($1).
echo "Count User processes on servers"
for s in $(cat "$1")
do
        echo "checking $s:"
        # Skip nodes that Slurm marks with a Reason (down/drained).
        ONLINE=$(scontrol show node "$s" | grep -c Reason)
        if [ "$ONLINE" -eq 0 ]; then
                # List process owners, dropping system users and our own ssh/ps.
                STATUS=$(timeout 4s ssh "$s" ps -aux | tail -n +2 | sed '/sshd/d;/ps -aux/d' | cut -d' ' -f1 | sort | sed '/root/d;/munge/d;/dbus/d;/ldap/d;/nslcd/d;/postfix/d' | uniq -c)
                RESPOND=$(echo "$STATUS" | wc -w)

                if [ "$RESPOND" -eq 0 ]; then echo "";
                else echo "$STATUS"; fi
        else echo "not available"; fi
done

echo "Survey done"

It will write output such as:

Count User processes on servers
checking c001:
      1 kerikous
      1 khairatb
      1 kinzel
     20 richter
checking c002:
     17 richter
checking c003:
     17 richter
Survey done.

If you see more than one user, something unusual is going on: either someone has logged in manually on that node, or there is a ghost process from starccm.
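If you save the survey output, a small awk filter can flag the suspicious nodes automatically. This is only a sketch of the idea, run here on the sample output above:

```shell
#!/bin/bash
# Flag every node where the survey reported more than one user.
survey='checking c001:
      1 kerikous
      1 khairatb
      1 kinzel
     20 richter
checking c002:
     17 richter
checking c003:
     17 richter'

flagged=$(echo "$survey" | awk '
    /^checking/ { if (node != "" && users > 1) print node " has " users " users"
                  node = $2; sub(/:$/, "", node); users = 0; next }
    NF == 2     { users++ }
    END         { if (node != "" && users > 1) print node " has " users " users" }')
echo "$flagged"
```

On the sample above this prints only "c001 has 4 users", so you can skip the nodes that look normal.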

To check this, log in to the corresponding node and run top:

ssh c001
top

You will see something like the snippet below. To close top, press q.


top - 13:50:57 up 73 days, 21:57,  0 users,  load average: 16.00, 16.01, 16.00
Tasks:  40 total,   1 running,  39 sleeping,   0 stopped,   0 zombie
%Cpu(s): 23.1 us,  0.5 sy,  0.0 ni, 76.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 26405563+total, 19848259+free,  6584296 used, 58988748 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 24720867+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 4894 richter   20   0 12.175g 309860  10320 S 106.7  0.1 224:19.39 heis36_144_cmpl
 4896 richter   20   0 12.175g 311176  10436 S 106.7  0.1 224:19.58 heis36_144_cmpl
 4898 richter   20   0 12.176g 335208  10536 S 106.7  0.1 224:19.40 heis36_144_cmpl
 4900 richter   20   0 12.176g 323612  10412 S 106.7  0.1 224:19.41 heis36_144_cmpl
 4901 richter   20   0 12.175g 239916  10448 S 106.7  0.1 224:19.50 heis36_144_cmpl
 4903 richter   20   0 12.174g 258580  10372 S 106.7  0.1 224:19.37 heis36_144_cmpl
 4904 richter   20   0 12.176g 325208  10368 S 106.7  0.1 224:19.54 heis36_144_cmpl
 4905 richter   20   0 12.177g 280736  10500 S 106.7  0.1 224:19.44 heis36_144_cmpl
 4906 richter   20   0 12.180g 271244  10720 S 106.7  0.1 224:19.42 heis36_144_cmpl
 4893 richter   20   0 12.181g 308036  12204 S 100.0  0.1 224:18.66 heis36_144_cmpl
 4895 richter   20   0 12.175g 310424  10464 S 100.0  0.1 224:19.46 heis36_144_cmpl
 4897 richter   20   0 12.174g 350412  10440 S 100.0  0.1 224:19.55 heis36_144_cmpl
 4899 richter   20   0 12.175g 300800  10472 S 100.0  0.1 224:19.39 heis36_144_cmpl
 4902 richter   20   0 12.175g 297440  10376 S 100.0  0.1 224:19.50 heis36_144_cmpl
 4907 richter   20   0 12.178g 247768  10608 S 100.0  0.1 224:19.50 heis36_144_cmpl
 4908 richter   20   0 12.179g 313328  10648 S 100.0  0.1 224:19.41 heis36_144_cmpl
 5962 root      20   0  138080   4608   3444 S   0.0  0.0   0:00.00 sshd
 5964 khairatb  20   0  138080   2076    912 S   0.0  0.0   0:00.00 sshd
 5965 khairatb  20   0   15440   2100   1676 S   0.0  0.0   0:00.00 bash
 6317 root      20   0  138080   4608   3444 S   0.0  0.0   0:00.00 sshd
 6319 kerikous  20   0  138080   2076    912 S   0.0  0.0   0:00.00 sshd
 6320 kerikous  20   0   15572   2176   1680 S   0.0  0.0   0:00.00 bash
 6836 root      20   0  138080   4604   3444 S   0.0  0.0   0:00.00 sshd
 6838 kinzel    20   0  138080   2068    908 S   0.0  0.0   0:00.00 sshd
 6839 kinzel    20   0   15548   2064   1560 S   0.0  0.0   0:00.00 bash
 7187 root      20   0  138080   4608   3444 S   0.0  0.0   0:00.00 sshd

In the COMMAND column you can see that the users kerikous, khairatb, and kinzel are logged in via ssh and are running a bash shell. (sshd is the SSH daemon, the remote-side program that handles ssh connections, and bash is the default command-line interface.)

If you find star-ccm+, starccm+, or mpid processes of a user who has not reserved the current node, you have found a ghost process. You can either ask Dr. Schulenburg to remove the process; however, this sometimes takes a bit of time. It is often simpler to ask the user to kill the processes themselves: tell them to log in to Neumann and run the following lines to kill the remaining programs on that node. Replace c001 with the node in question.

ssh c001 pkill -9 star-ccm+
ssh c001 pkill -9 starccm+
ssh c001 pkill -9 mpid
Warning: if you have a NEW job running on some of the nodes from your old machinefile, the next lines will kill that job as well!

To test whether there is still a ghost process on one of the nodes from your old machinefile list, and to kill any you find at the same time, execute the following script:

./killandcount.sh oldmachinefile.txt
killandcount.sh
#!/bin/bash
# Kill leftover Star-CCM+ processes and count remaining user processes
# on every server listed in the old machinefile ($1).
echo "Manual kill and count of Star-CCM+ processes"
for s in $(cat "$1")
do
        echo "cleaning $s"
        # Skip nodes that Slurm marks with a Reason (down/drained).
        ONLINE=$(scontrol show node "$s" | grep -c Reason)
        if [ "$ONLINE" -eq 0 ]; then
                STATUS=$(timeout 4s ssh "$s" ps -aux | tail -n +2 | sed '/sshd/d;/ps -aux/d' | cut -d' ' -f1 | sort | sed '/root/d;/munge/d;/dbus/d;/ldap/d;/nslcd/d;/postfix/d' | uniq -c)
                RESPOND=$(echo "$STATUS" | wc -w)
                # Kill Star-CCM+ and its MPI daemons, then clean up leftover
                # shared-memory files (the glob is escaped so it expands
                # on the remote node, not locally).
                timeout 4s ssh "$s" pkill -9 starccm+ 2>/dev/null
                timeout 4s ssh "$s" pkill -9 star-ccm+ 2>/dev/null
                timeout 4s ssh "$s" pkill -9 mpid 2>/dev/null
                timeout 4s ssh "$s" rm -v /dev/shm/\* 2>/dev/null

                if [ "$RESPOND" -eq 0 ]; then echo "";
                else echo "$STATUS"; fi
        else echo "not available"; fi
done

echo "clean up done! Have a nice day"
guide/starccm/starmpiproblem.txt · Last modified: 2018/01/26 16:06 by nlichten
CC Attribution-Share Alike 3.0 Unported