#4541 Setup Monitoring for Taskotron Production System
Closed: Fixed None Opened 9 years ago by tflink.

Before moving Taskotron to production, we'd like to have some monitoring in place so that there are notifications when stuff is not working.

There will be more clients to add later but the initial setup will have 4 client machines.

== HTTP Pings ==
Simple http pings to make sure that front-facing services are still running:
* http://resultsdb01.qa.fedoraproject.org/resultsdb (resultsdb_frontend check)
* http://resultsdb01.qa.fedoraproject.org/resultsdb_api (resultsdb check)
* http://taskotron01.qa.fedoraproject.org/taskmaster/ (buildbot check)
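As a rough illustration of what such an HTTP ping amounts to, here is a minimal Python sketch that follows the usual Nagios plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL). The function names `classify_http_status` and `check_url`, the 10 second timeout, and the status-to-state policy are all assumptions made for this example; in practice the stock check_http plugin would most likely be used instead.

```python
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_http_status(status):
    """Map an HTTP status code to a (nagios_exit_code, label) pair.

    Assumed policy for this sketch: 2xx/3xx is OK, 4xx is WARNING,
    anything else is CRITICAL.
    """
    if 200 <= status < 400:
        return OK, "OK"
    if 400 <= status < 500:
        return WARNING, "WARNING"
    return CRITICAL, "CRITICAL"


def check_url(url, timeout=10):
    """Fetch url and return (exit_code, message) without raising."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        status = exc.code
    except OSError as exc:
        # Covers URLError, refused connections, DNS failures and timeouts.
        return CRITICAL, f"CRITICAL: {url} unreachable ({exc})"
    code, label = classify_http_status(status)
    return code, f"{label}: {url} returned HTTP {status}"
```

A wrapper script would call `check_url()` on one of the URLs above, print the message, and exit with the returned code so that nagios can pick up the state.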

== System Pings ==
Simple ping to systems to make sure that they're still running and have network:
* taskotron01.qa.fedoraproject.org
* resultsdb01.qa.fedoraproject.org
* taskotron-client22.qa.fedoraproject.org
* taskotron-client23.qa.fedoraproject.org
* taskotron-client24.qa.fedoraproject.org
* taskotron-client25.qa.fedoraproject.org
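For illustration only, the system ping check could be sketched in Python as below. The helper names (`ping_command`, `is_alive`, `summarize`) are hypothetical, and the `-c`/`-W` flags assume the Linux iputils `ping`; nagios would normally do this with its own check_ping plugin.

```python
import subprocess

# Hosts taken from the list above.
HOSTS = [
    "taskotron01.qa.fedoraproject.org",
    "resultsdb01.qa.fedoraproject.org",
    "taskotron-client22.qa.fedoraproject.org",
    "taskotron-client23.qa.fedoraproject.org",
    "taskotron-client24.qa.fedoraproject.org",
    "taskotron-client25.qa.fedoraproject.org",
]


def ping_command(host, count=3, timeout=5):
    """Build a ping command line (assumes Linux iputils flags)."""
    return ["ping", "-c", str(count), "-W", str(timeout), host]


def is_alive(host, run=subprocess.run):
    """Return True if ping exits with status 0 for host.

    The run parameter exists only so the call can be stubbed out.
    """
    result = run(
        ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def summarize(results):
    """Turn a {host: reachable} dict into a Nagios-style (code, message)."""
    down = sorted(host for host, up in results.items() if not up)
    if down:
        return 2, "CRITICAL: no ping reply from " + ", ".join(down)
    return 0, f"OK: all {len(results)} hosts replied"
```

Calling `summarize({host: is_alive(host) for host in HOSTS})` would give a single state for the whole group, although separate per-host checks are probably more useful for notifications.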

== buildbot nagios plugin ==

buildbot has a nagios plugin in the upstream git repo. If possible, we'd like to have that plugin added and pointed at the production buildmaster on taskotron01.qa.fedoraproject.org

The nagios plugin is available at: https://github.com/buildbot/buildbot/blob/master/master/contrib/check_buildbot.py


What we typically have is nrpe set up on the servers (via the nagios_client role in ansible), and we check a variety of items through it: disk space, processes, ssh, swap, etc.
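To give a feel for the kind of threshold check that runs over nrpe, here is a minimal Python sketch of a disk space check. It is not what the nagios_client role actually deploys; the function names and the 20%/10% free-space thresholds are assumptions chosen for the example.

```python
import shutil


def classify_free_space(percent_free, warn=20.0, crit=10.0):
    """Map percent free space to a Nagios-style (exit_code, label) pair.

    The warn/crit thresholds are assumed values, not the real config.
    """
    if percent_free < crit:
        return 2, "CRITICAL"
    if percent_free < warn:
        return 1, "WARNING"
    return 0, "OK"


def check_disk(path="/", warn=20.0, crit=10.0):
    """Measure free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Pseudo-filesystems can report zero size; treat that as UNKNOWN.
        return 3, f"DISK UNKNOWN: {path} reports zero total size"
    percent_free = 100.0 * usage.free / usage.total
    code, label = classify_free_space(percent_free, warn, crit)
    return code, f"DISK {label}: {path} has {percent_free:.1f}% free"
```

The nagios server asks the nrpe daemon on the client to run a check like this locally and reads back the exit code and the one-line message.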

We may need to have the RHIT firewall tweaked to allow this from noc01 -> qa network, but it might already be allowed.

The url checks should be pretty simple to add.

Are there any external URLs to check as well? We have an external nagios that checks connectivity of external URLs, if desired.

The buildbot plugin will need to be packaged up. Perhaps we could see if anyone on the infra list would like to do that to help out?

Oh, I didn't think about the external URLs. For production, they would be:
* https://taskotron.fedoraproject.org/taskmaster/
* https://taskotron.fedoraproject.org/resultsdb/
* https://taskotron.fedoraproject.org/resultsdb_api/

I'll take a look at other services which are using nagios and add that to the taskotron roles. Is there anything which needs to be done on the nagios server/collector? I suspect that we'll find out pretty quickly if there are firewall issues.

I wonder if submitting a patch to the buildbot package to add a buildbot-nagios subpackage would be the best way forward. The monitoring plugin is distributed with the source and that would be one less package to maintain, assuming that the plugin doesn't need to be modified.

I've added the external URLs to nagios-external.

(after falling down a rabbit hole noticing we weren't monitoring all our proxies right)

I think everything is done now here... can we close this out and just add the buildbot plugin once we have it somewhere?

yeah, that works for me. Closing ticket
