2 Down CMMs on 61000 chassis
problem description
symptoms
1. "asg stat" shows 2 CMMs down and 0/2 up, as seen here:
[Expert@61k_fw-ch01-01]# asg stat
--------------------------------------------------------------------------
| System Status                                                          |
--------------------------------------------------------------------------
| Up time                        | 1 year, 38 days, 23:14:58 hours       |
--------------------------------------------------------------------------
| Current CPUs load average      | 1 %                                   |
| Concurrent connections         | 5176                                  |
| Health                         | CMMs 2 Down                           |
--------------------------------------------------------------------------
| Chassis 1                      | STANDBY UP / Required                 |
|                                | SGMs 12 / 12                          |
|                                | Ports 2 / 2                           |
|                                | Fans 6 / 6                            |
|                                | SSMs 2 / 2                            |
|                                | CMMs 0 / 2 (!)                        |
|                                | Power Supplies 5 / 5                  |
2. bay 1 (bottom) CMM has a red status light
The bay 1 CMM had a red status light. I am not sure which LED (act, ctr, pwe, mjr, hs, mnr) was red; I did not get that detail from the onsite person helping me. The bay 2 LEDs were normal, none red.
3. inter-chassis network connectivity to CMMs failing
There were no console cables plugged into the CMM cards for this device, so no troubleshooting could be done from the console. The active CMM was not reachable via the CMM IPs 198.51.100.33 or 198.51.100.233. These are the IPs used on all 61000 devices for communication with the CMMs. The packet captures below show the CMMs not responding to ARP requests.
listing of the chassis ports for CMM connectivity...
[Expert@61k_fw-ch01-01]# ifconfig -a |grep -A 1 CIN
eth1-CIN  Link encap:Ethernet  HWaddr 00:1C:7F:20:14:7C
          inet addr:198.51.100.1  Bcast:198.51.100.127  Mask:255.255.255.128
eth2-CIN  Link encap:Ethernet  HWaddr 00:1C:7F:20:14:7D
          inet addr:198.51.100.201  Bcast:198.51.100.255  Mask:255.255.255.128
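As a quick sanity check on which CIN interface should reach which CMM address, the netmask arithmetic can be sketched in shell. This is only an illustration: ip_to_int and same_subnet are hypothetical helper names (not part of any Check Point tooling), and the addresses are the ones from the listing above.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    # Split the address on dots into $1..$4 (local to this function).
    set -- $(printf '%s' "$1" | tr '.' ' ')
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if addresses $1 and $2 are in the same network under netmask $3.
same_subnet() {
    [ $(( $(ip_to_int "$1") & $(ip_to_int "$3") )) -eq \
      $(( $(ip_to_int "$2") & $(ip_to_int "$3") )) ]
}

# Example calls with the addresses from the listing:
#   same_subnet 198.51.100.1   198.51.100.33  255.255.255.128   # eth1-CIN side
#   same_subnet 198.51.100.201 198.51.100.233 255.255.255.128   # eth2-CIN side
```

With a 255.255.255.128 mask, 198.51.100.33 falls in the eth1-CIN network and 198.51.100.233 in the eth2-CIN network, so ARP for each CMM address would be expected on its own interface.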
Packet captures taken on the CMM networks show ARP requests but no replies. Neither CMM appears to be responding on its network connection.
[Expert@61k_fw-ch01-01]# tcpdump -i eth2-CIN
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth2-CIN, link-type EN10MB (Ethernet), capture size 96 bytes
05:37:25.865315 arp who-has 198.51.100.233 tell 198.51.100.204
05:37:26.864981 arp who-has 198.51.100.233 tell 198.51.100.204
05:37:27.864873 arp who-has 198.51.100.233 tell 198.51.100.204
05:37:29.467760 arp who-has 198.51.100.233 tell 198.51.100.204
...
[Expert@61k_fw-ch01-01]# tcpdump -i eth1-CIN
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1-CIN, link-type EN10MB (Ethernet), capture size 96 bytes
05:38:01.206437 arp who-has 198.51.100.33 tell 198.51.100.1
05:38:02.206092 arp who-has 198.51.100.33 tell 198.51.100.1
05:38:04.143829 arp who-has 198.51.100.33 tell 198.51.100.1
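To put a number on "requests but no replies", the capture text can be tallied with a small filter. This is a minimal sketch, assuming tcpdump prints "who-has <ip> tell" for requests and "<ip> is-at" for replies, as in the captures above. arp_summary is a hypothetical helper name.

```shell
# Read tcpdump text on stdin and count ARP requests for, and replies from,
# the address given as $1.
arp_summary() {
    awk -v ip="$1" '
        # Request line: "... who-has <ip> tell <sender>"
        index($0, "who-has " ip " tell") { req++ }
        # Reply line: "... <ip> is-at <mac>"
        index($0, " " ip " is-at ")      { rep++ }
        END { printf "requests=%d replies=%d\n", req, rep }
    '
}
```

Usage would look something like `tcpdump -n -c 20 -i eth1-CIN arp | arp_summary 198.51.100.33`. A CMM that is answering should show a replies count greater than 0.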
problem resolution
Due to the lack of CMM console cables and telnet/ssh connectivity, we resorted to physically resetting the cards one at a time: first bay 1, then bay 2. After resetting bay 1, there was no change in status. After resetting bay 2, the red error status light on bay 1 went green, and the CMM status changed from 0/2 up to 2/2 up, as seen below.
[Expert@61k_fw-ch01-01]# asg stat | grep -i -E "chassis|cmms"
| Chassis 1                      | STANDBY UP / Required                 |
|                                | CMMs 2 / 2                            |
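For checking the CMM count from a script after a reset, rather than by eye, the "asg stat" text can be parsed. This is a minimal sketch, assuming the row keeps the "CMMs <up> / <required>" layout shown above and that only the first such row matters (a single chassis). cmm_counts and cmms_ok are hypothetical helper names, not asg options.

```shell
# Read "asg stat" output on stdin and print "<up> <required>" for the first
# "CMMs <up> / <required>" row. The "Health | CMMs 2 Down" row has no "/"
# and is therefore skipped.
cmm_counts() {
    awk '{
        for (i = 1; i + 3 <= NF; i++)
            if ($i == "CMMs" && $(i + 2) == "/") { print $(i + 1), $(i + 3); exit }
    }'
}

# Succeed only if a CMMs row was found and up == required.
cmms_ok() {
    set -- $(cmm_counts)
    [ "$#" -eq 2 ] && [ "$1" -eq "$2" ]
}
```

Usage would be along the lines of `asg stat | cmms_ok || echo "CMMs still down"`.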
root cause
root cause undetermined