Ticket #2516 (closed outage: fixed)

Opened 3 years ago

Last modified 3 years ago

PHX2 Netapp problems

Reported by: smooge
Owned by: smooge
Priority: major
Milestone:
Component: Systems
Version:
Severity: The Sky Is Falling
Keywords: meeting
Cc:
Blocked By:
Blocking:
Sensitive:

Description

phenomenon

Due to problems with NFS storage in our PHX2 facility, several of our services are running at diminished capacity. We are working with our provider and engineers to resolve the issue as soon as possible.

To convert UTC to your local time, take a look at http://fedoraproject.org/wiki/Infrastructure/UTCHowto or run:

date -d '2010-12-15 22:00 UTC'

Reason for outage: NFS operations against the filer are peaking below expected rates, causing hangs on NFS clients.
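A client-side check for this kind of hang can be sketched as below. This is a minimal illustration and not a tool used during the outage: `check_path` and its timeout are hypothetical names and values. It runs `stat()` in a child process so that a call stuck on an unresponsive filer blocks the child rather than the caller.

```python
import subprocess
import sys

# Child command: stat the path in a separate process so that a hung NFS
# call blocks the child, not the caller.
_STAT_SNIPPET = "import os, sys; os.stat(sys.argv[1])"


def check_path(path, timeout_s=5.0):
    """Return "ok", "error", or "timeout" for a stat() of `path`.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", _STAT_SNIPPET, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked on a hard NFS mount may not
        # react to the signal, so we do not wait for it to exit.
        proc.kill()
        return "timeout"
    return "ok" if returncode == 0 else "error"
```

Calling `check_path` on each NFS mount point of a client would report "timeout" for mounts where the filer is not answering, without hanging the checking script itself.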

Affected Services:

BFO - http://boot.fedoraproject.org/
Buildsystem - http://koji.fedoraproject.org/
CVS / Source Control
Main Website - http://fedoraproject.org/
Mirror List - https://mirrors.fedoraproject.org/
Mirror Manager - https://admin.fedoraproject.org/mirrormanager/
Package Database - https://admin.fedoraproject.org/pkgdb/

Unaffected Services:

Ticket Link:

Contact Information:

Please join #fedora-admin on irc.freenode.net or reply to this email to track the status of this outage.

reason

recommendation

Change History

comment:1 Changed 3 years ago by smooge

  • Status changed from new to assigned

Mitigations being worked on:

  1. Move fi-repo from NFS to disks on puppet.
  2. Move the lookaside cache to disks on the EqualLogic.

Still trying to figure out next steps.

comment:2 Changed 3 years ago by smooge

<skvidal> okay
<skvidal> 1. performance problems - those are likely to continue since we've not removed any load
<skvidal> 2. nothing-works-not-even-a-mount problem appears to have been some dns issues which we are expecting an explanation on "soon" - but the changes, thus far, do appear to be solving them
<skvidal> the next steps are:
<skvidal> a. see if the performance issues get better w/o the svn repos adding load
<skvidal> b. if the answer to a is yes - see if we can limp along through to the new year so we don't have to play silly buggers over the holiday
<skvidal> c. if the answer to a is no then come up with a new plan
<skvidal> long-ish term (mid feb) is to transition to a new netapp and magically solve all our problems (and find some new ones)
<skvidal> wow, echoing silence as a reply
<skvidal> fantastic

comment:3 Changed 3 years ago by smooge

  • Resolution set to fixed
  • Status changed from assigned to closed

The outage seems to have been resolved by cleaning up a bad DNS connection. It looks like at some point in November a change caused the RHIT servers to stop getting DNS from Fedora. When the phx2.fedoraproject.org tables timed out, systems trying to get new mounts failed and other operations stopped.

DNS problems were corrected and other jobs that were causing high CPU usage on the server were removed. Traffic seems to be moving back to normal.
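Since the failed mounts traced back to name resolution, a quick check that the names a client depends on still resolve can catch this class of problem early. The sketch below is only an illustration: `resolve` and `unresolved` are hypothetical helpers, and the list of hostnames to check is left to the caller.

```python
import socket


def resolve(hostname):
    """Return the sorted addresses for `hostname`, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return sorted({info[4][0] for info in infos})


def unresolved(hostnames):
    """Return the subset of `hostnames` that did not resolve."""
    return [name for name in hostnames if not resolve(name)]
```

Note that `socket.getaddrinfo` uses the system resolver and can itself block for the resolver's timeout when DNS is broken, so a check like this is better run from monitoring than from anything latency-sensitive.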
