All the physical machines in phx2/qa have -mgmt interfaces.
We should monitor them. Some information gathering would be required before adding them. Check each -mgmt host from dns and confirm:
For any that are not answering, we need to work on fixing them.
Then add checks for:
Pinging from noc01:
{{{
ping bnfs01-mgmt : OK
ping ;s390-hub-mgmt : ERR
ping tape02-mgmt : OK
ping tape02 : OK
ping backup01-mgmt : OK
ping backup03-mgmt : OK
ping bc01-mgmt : OK
ping bc02-mgmt : OK
ping bxen04-mgmt : OK
ping virthost-comm01-mgmt : OK
ping bxen03-mgmt : OK
ping bvirthost01-mgmt : ERR
ping virthost-comm01-mgmt : ERR
ping download01-mgmt : OK
ping download02-mgmt : OK
ping download03-mgmt : OK
ping download04-mgmt : OK
ping download05-mgmt : OK
ping junk01-mgmt : OK
ping junk02-mgmt : OK
ping junk05-mgmt : OK
ping ;ppc-comm01-mgmt : ERR
ping ;ppc-comm02-mgmt : ERR
ping qa01-mgmt : OK
ping qa02-mgmt : OK
ping qa03-mgmt : OK
ping qa04-mgmt : OK
ping qa05-mgmt : OK
ping qa06-mgmt : OK
ping qa07-mgmt : OK
ping qa08-mgmt : OK
ping ppc04-mgmt : OK
ping sign-vault01-mgmt : OK
ping sign-vaultXX-mgmt : OK
ping unknown00-mgmt : OK
ping unknown01-mgmt : ERR
ping unknown02-mgmt : ERR
ping unknown03-mgmt : ERR
ping unknown04-mgmt : OK
ping unknown05-mgmt : OK
ping unknown06-mgmt : OK
ping unknown07-mgmt : ERR
ping unknown08-mgmt : ERR
ping unknown09-mgmt : ERR
ping virthost01-mgmt : OK
ping virthost02-mgmt : ERR
ping virthost03-mgmt : OK
ping virthost13-mgmt : OK
ping xen01-mgmt : OK
ping junk05-mgmt : OK
ping xen03-mgmt : OK
ping xen04-mgmt : ERR
ping xen05-mgmt : OK
ping xen08-mgmt : ERR
ping xen09-mgmt : OK
ping junk03-mgmt : OK
ping xen15-mgmt : OK
ping xen16-mgmt : ERR
ping xen17-mgmt : ERR
ping xen18-mgmt : ERR
ping xen19-mgmt : ERR
}}}
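For anyone wanting to re-run this survey, here is a minimal sketch of how a report in the same `check host : STATUS` format could be produced. It is not the script that generated the results in this ticket. The function names (`check_ping`, `check_tcp`, `run_checks`) are hypothetical, and it assumes that an "http" or "https" test only means a TCP connection to port 80 or 443 succeeds; the original tests may have done more than that.

```python
import socket
import subprocess


def check_ping(host, timeout=2):
    """Return True if a single ICMP echo request gets a reply (system ping)."""
    try:
        proc = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The ping binary is missing or could not be started.
        return False
    return proc.returncode == 0


def check_tcp(host, port, timeout=2):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


# Assumption: "http" and "https" are plain port checks on 80 and 443.
PROBES = {
    "ping": check_ping,
    "http": lambda host: check_tcp(host, 80),
    "https": lambda host: check_tcp(host, 443),
}


def run_checks(hosts, probes=PROBES):
    """Run every probe against every host and return report lines."""
    lines = []
    for name, probe in probes.items():
        for host in hosts:
            status = "OK" if probe(host) else "ERR"
            lines.append(f"{name} {host} : {status}")
    return lines
```

Calling `run_checks(["bnfs01-mgmt", "tape02-mgmt"])` from noc01 would print one line per check and host. The probes are passed in as a dictionary so that a single check type can be re-run on its own.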
HTTP tests:
{{{
http bnfs01-mgmt : OK
http ;s390-hub-mgmt : ERR
http tape02-mgmt : OK
http tape02 : OK
http backup01-mgmt : OK
http backup03-mgmt : OK
http bc01-mgmt : OK
http bc02-mgmt : OK
http bxen04-mgmt : OK
http virthost-comm01-mgmt : OK
http bxen03-mgmt : OK
http bvirthost01-mgmt : ERR
http virthost-comm01-mgmt : OK
http download01-mgmt : OK
http download02-mgmt : OK
http download03-mgmt : OK
http download04-mgmt : OK
http download05-mgmt : OK
http junk01-mgmt : OK
http junk02-mgmt : OK
http junk05-mgmt : OK
http ;ppc-comm01-mgmt : ERR
http ;ppc-comm02-mgmt : ERR
http qa01-mgmt : OK
http qa02-mgmt : OK
http qa03-mgmt : OK
http qa04-mgmt : OK
http qa05-mgmt : OK
http qa06-mgmt : OK
http qa07-mgmt : OK
http qa08-mgmt : OK
http ppc04-mgmt : OK
http sign-vault01-mgmt : OK
http sign-vaultXX-mgmt : OK
http unknown00-mgmt : ERR
http unknown01-mgmt : ERR
http unknown02-mgmt : ERR
http unknown03-mgmt : ERR
http unknown04-mgmt : ERR
http unknown05-mgmt : ERR
http unknown06-mgmt : ERR
http unknown07-mgmt : ERR
http unknown08-mgmt : ERR
http unknown09-mgmt : ERR
http virthost01-mgmt : OK
http virthost02-mgmt : ERR
http virthost03-mgmt : OK
http virthost13-mgmt : OK
http xen01-mgmt : OK
http junk05-mgmt : OK
http xen03-mgmt : OK
http xen04-mgmt : ERR
http xen05-mgmt : OK
http xen08-mgmt : ERR
http xen09-mgmt : OK
http junk03-mgmt : OK
http xen15-mgmt : OK
http xen16-mgmt : ERR
http xen17-mgmt : ERR
http xen18-mgmt : ERR
http xen19-mgmt : ERR
}}}
HTTPS tests:
{{{
https bnfs01-mgmt : ERR
https ;s390-hub-mgmt : ERR
https tape02-mgmt : ERR
https tape02 : ERR
https backup01-mgmt : OK
https backup03-mgmt : ERR
https bc01-mgmt : OK
https bc02-mgmt : ERR
https bxen04-mgmt : ERR
https virthost-comm01-mgmt : ERR
https bxen03-mgmt : ERR
https bvirthost01-mgmt : ERR
https virthost-comm01-mgmt : ERR
https download01-mgmt : OK
https download02-mgmt : ERR
https download03-mgmt : OK
https download04-mgmt : OK
https download05-mgmt : OK
https junk01-mgmt : OK
https junk02-mgmt : OK
https junk05-mgmt : OK
https ;ppc-comm01-mgmt : ERR
https ;ppc-comm02-mgmt : ERR
https qa01-mgmt : OK
https qa02-mgmt : OK
https qa03-mgmt : OK
https qa04-mgmt : OK
https qa05-mgmt : OK
https qa06-mgmt : OK
https qa07-mgmt : OK
https qa08-mgmt : OK
https ppc04-mgmt : ERR
https sign-vault01-mgmt : OK
https sign-vaultXX-mgmt : ERR
https unknown00-mgmt : ERR
https unknown01-mgmt : ERR
https unknown02-mgmt : ERR
https unknown03-mgmt : ERR
https unknown04-mgmt : ERR
https unknown05-mgmt : ERR
https unknown06-mgmt : ERR
https unknown07-mgmt : ERR
https unknown08-mgmt : ERR
https unknown09-mgmt : ERR
https virthost01-mgmt : ERR
https virthost02-mgmt : ERR
https virthost03-mgmt : ERR
https virthost13-mgmt : OK
https xen01-mgmt : ERR
https junk05-mgmt : OK
https xen03-mgmt : OK
https xen04-mgmt : ERR
https xen05-mgmt : OK
https xen08-mgmt : ERR
https xen09-mgmt : OK
https junk03-mgmt : OK
https xen15-mgmt : OK
https xen16-mgmt : ERR
https xen17-mgmt : ERR
https xen18-mgmt : ERR
https xen19-mgmt : ERR
}}}
Great. ;)
Please leave the following out of monitoring:
ping ;s390-hub-mgmt : ERR
ping ;ppc-comm01-mgmt : ERR
ping ;ppc-comm02-mgmt : ERR
ping unknown01-mgmt : ERR
ping unknown02-mgmt : ERR
ping unknown03-mgmt : ERR
ping unknown07-mgmt : ERR
ping unknown08-mgmt : ERR
ping unknown09-mgmt : ERR

http ;s390-hub-mgmt : ERR
http ;ppc-comm01-mgmt : ERR
http ;ppc-comm02-mgmt : ERR
http unknown00-mgmt : ERR
http unknown01-mgmt : ERR
http unknown02-mgmt : ERR
http unknown03-mgmt : ERR
http unknown04-mgmt : ERR
http unknown05-mgmt : ERR
http unknown06-mgmt : ERR
http unknown07-mgmt : ERR
http unknown08-mgmt : ERR
http unknown09-mgmt : ERR

https ;s390-hub-mgmt : ERR
https ;ppc-comm01-mgmt : ERR
https ;ppc-comm02-mgmt : ERR
https ppc04-mgmt : ERR
https sign-vaultXX-mgmt : ERR
https unknown00-mgmt : ERR
https unknown01-mgmt : ERR
https unknown02-mgmt : ERR
https unknown03-mgmt : ERR
https unknown04-mgmt : ERR
https unknown05-mgmt : ERR
https unknown06-mgmt : ERR
https unknown07-mgmt : ERR
https unknown08-mgmt : ERR
https unknown09-mgmt : ERR
These have been cleaned up/removed from dns now:
ping xen08-mgmt : ERR
ping xen16-mgmt : ERR
ping xen17-mgmt : ERR
ping xen18-mgmt : ERR
ping xen19-mgmt : ERR
I've fixed https on the following:
bnfs01-mgmt backup03-mgmt bc02-mgmt bxen04-mgmt virthost-comm01-mgmt bxen03-mgmt download02-mgmt
Also, tape02/tape02-mgmt has no https and is the same machine, so monitor just tape02-mgmt for ping and http.
After filtering, the following hosts are not pingable from noc01 (DNS seems OK):
bvirthost01-mgmt: ERR
virthost02-mgmt: ERR
xen04-mgmt: ERR
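The filtering step can be sketched roughly as below. This is an illustration of the idea, not the procedure actually used. The helper names (`parse_results`, `failing_hosts`) are hypothetical, the sample is a hand-picked subset of the ping results from this ticket, and the leading `;` on some host names is treated as noise and stripped, which is an assumption about what it means.

```python
def parse_results(text):
    """Parse 'check host : STATUS' lines into (check, host, status) tuples."""
    results = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        left, _, status = line.rpartition(":")
        parts = left.split()
        if len(parts) != 2:
            continue  # not a result line, ignore it
        check, host = parts
        # Assumption: a leading ';' is a marker, not part of the host name.
        results.append((check, host.lstrip(";"), status.strip()))
    return results


def failing_hosts(results, check, skip):
    """Hosts that fail `check` and are not in `skip`, in first-seen order."""
    failing = []
    for name, host, status in results:
        if name != check or status != "ERR":
            continue
        if host in skip or host in failing:
            continue
        failing.append(host)
    return failing


# Subset of the ping results, one entry per line.
SAMPLE = """
ping bnfs01-mgmt : OK
ping ;s390-hub-mgmt : ERR
ping bvirthost01-mgmt : ERR
ping unknown01-mgmt : ERR
ping virthost02-mgmt : ERR
ping xen04-mgmt : ERR
ping xen08-mgmt : ERR
"""

# Hosts to leave out of monitoring, plus hosts already removed from DNS.
SKIP = {"s390-hub-mgmt", "unknown01-mgmt", "xen08-mgmt"}

print(failing_hosts(parse_results(SAMPLE), "ping", SKIP))
# prints ['bvirthost01-mgmt', 'virthost02-mgmt', 'xen04-mgmt']
```

Keeping the skip list as a plain set makes it easy to drop a host from it once its mgmt interface has been fixed.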
Manual verification:
{{{
[athmane@noc01 ~]$ ping bvirthost01-mgmt.phx2.fedoraproject.org
PING bvirthost01-mgmt.phx2.fedoraproject.org (10.5.126.224) 56(84) bytes of data.
From noc01.phx2.fedoraproject.org (10.5.126.41) icmp_seq=2 Destination Host Unreachable
From noc01.phx2.fedoraproject.org (10.5.126.41) icmp_seq=3 Destination Host Unreachable
From noc01.phx2.fedoraproject.org (10.5.126.41) icmp_seq=4 Destination Host Unreachable
^C
--- bvirthost01-mgmt.phx2.fedoraproject.org ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 3710ms

[athmane@noc01 ~]$ ping virthost02-mgmt.phx2.fedoraproject.org
PING virthost02-mgmt.phx2.fedoraproject.org (10.5.126.223) 56(84) bytes of data.
From noc01.phx2.fedoraproject.org (10.5.126.41) icmp_seq=1 Destination Host Unreachable
From noc01.phx2.fedoraproject.org (10.5.126.41) icmp_seq=2 Destination Host Unreachable
From noc01.phx2.fedoraproject.org (10.5.126.41) icmp_seq=3 Destination Host Unreachable
^C
--- virthost02-mgmt.phx2.fedoraproject.org ping statistics ---
5 packets transmitted, 0 received, +3 errors, 100% packet loss, time 4541ms

[athmane@noc01 ~]$ ping xen04-mgmt.phx2.fedoraproject.org
PING xen04-mgmt.phx2.fedoraproject.org (10.5.126.204) 56(84) bytes of data.
From noc01.phx2.fedoraproject.org (10.5.126.41) icmp_seq=2 Destination Host Unreachable
From noc01.phx2.fedoraproject.org (10.5.126.41) icmp_seq=3 Destination Host Unreachable
From noc01.phx2.fedoraproject.org (10.5.126.41) icmp_seq=4 Destination Host Unreachable
^C
--- xen04-mgmt.phx2.fedoraproject.org ping statistics ---
6 packets transmitted, 0 received, +3 errors, 100% packet loss, time 5920ms
}}}
Yes, all three of those should work, but do not. ;(
We are going to need to power cycle those machines to get the mgmt to reset and hopefully come up. So, I would like to add them to monitoring, then ack the alert so we know they are pending problems to be fixed and we know when they come back up. ;)