Ticket #1059 (closed outage: fixed)

Opened 5 years ago

Last modified 5 years ago

db3 outage

Reported by: ricky Owned by: mmcgrath
Priority: blocker Milestone:
Component: Systems Version:
Severity: The Sky Is Falling Keywords:
Cc: robert@…, rakesh@… Blocked By:
Blocking: Sensitive:

Description

db3 went down around 2008-12-16 08:10 UTC. I had to kick it on the PDU, and I have now confirmed file corruption/loss on at least the / filesystem. I have paged Mike to take a look at this issue.

Change History

comment:1 Changed 5 years ago by robert

  • Cc robert@… added

comment:2 Changed 5 years ago by mmcgrath

  • Status changed from new to assigned

K, I'm on it now. We've heard there are issues on / (not a big deal, we store no data there), though we've actually found issues on /backup. Normally not a problem, but db1 is running from there right now. I'm in the process of bringing db1 back up.

comment:3 Changed 5 years ago by mmcgrath

So initial thoughts almost have to be "db3 can't handle the load of running two databases". The problem with that theory is that db1, under normal load (which it's been getting), isn't terribly busy. db3 is a pretty beefy box. Also, load was always fairly low, and it never swapped. Aside from some disk issues, there's nothing indicating what caused the crash. The RSAII management card is not reporting any faults.

comment:4 follow-up: ↓ 6 Changed 5 years ago by mmcgrath

RSA card finally throwing some errors including:

aacraid: Host adapter abort request (0,0,5,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,3,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,4,0)
aacraid: Host adapter abort request (0,0,1,0)
aacraid: Host adapter reset request. SCSI hang ?
AAC: Host adapter BLINK LED 0xa
AAC0: adapter kernel panic'd a.
AAC0: adapter kernel panic'd a.

and:

SERVPROC 12/16/08, 11:00:08 Hard Drive 3 Fault
SERVPROC 12/16/08, 10:59:59 Hard Drive 5 Fault
SERVPROC 12/16/08, 10:59:59 Hard Drive 2 Fault
SERVPROC 12/16/08, 10:59:59 Hard Drive 1 Fault
SERVPROC 12/16/08, 10:59:58 Hard Drive 0 Fault
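For anyone triaging a similar log, a quick tally of the aacraid abort requests per device tuple can show whether the aborts are spread across drives or concentrated on a few. The following is only a minimal sketch: the function name `tally_aborts` and the sample string are illustrative, and reading the four numbers as host, channel, id and lun is an assumption about the driver's message format, not something stated in this ticket.

```python
import re
from collections import Counter

# Matches messages like: "Host adapter abort request (0,0,3,0)"
ABORT_RE = re.compile(r"Host adapter abort request \((\d+),(\d+),(\d+),(\d+)\)")


def tally_aborts(log_text):
    """Count abort requests per 4-number tuple found in log_text.

    Works whether the messages are one per line or run together on one line.
    The tuple is assumed (not confirmed here) to be (host, channel, id, lun).
    """
    return Counter(
        tuple(int(group) for group in match.groups())
        for match in ABORT_RE.finditer(log_text)
    )


# Short illustrative sample, not the full log from this ticket.
sample = (
    "aacraid: Host adapter abort request (0,0,5,0) "
    "aacraid: Host adapter abort request (0,0,3,0) "
    "aacraid: Host adapter abort request (0,0,3,0) "
    "aacraid: Host adapter reset request. SCSI hang ?"
)

for target, count in tally_aborts(sample).most_common():
    print(target, count)
```

Pasting the full kernel output into `tally_aborts` in place of `sample` gives the per-tuple counts for the real log.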

comment:5 Changed 5 years ago by mmcgrath

IBM is going to replace the RAID controller and mobo. The mobo is going to take a while to get in. I've tried repeatedly to rebuild the array and bring postgres back online, and it's failed every time. However, in degraded mode I seem to be able to read from the arrays. The last backup did work as required, so at worst we're looking at 9 hours of data loss.

I've decided it's best (it is the middle of the night right now) to just take the downtime and go for 0 hours of data loss. I'm syncing the raw data files off now. At 83G it will take a while (the drives are borked). It's about 5% done. I'll have a better estimate of how long it will take soon.
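A rough estimate for a copy like this can be extrapolated from the progress so far. This is a back-of-the-envelope sketch, assuming the transfer rate stays constant (unlikely on failing drives); the function name and the 15-minute elapsed time are hypothetical, since the ticket does not say how long the first 5% took.

```python
def estimate_remaining_seconds(total_bytes, fraction_done, elapsed_seconds):
    """Extrapolate remaining copy time, assuming a constant transfer rate."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    copied = total_bytes * fraction_done
    rate = copied / elapsed_seconds          # bytes per second so far
    return (total_bytes - copied) / rate


# 83G total and 5% done come from the comment above;
# the 15 minutes of elapsed time is a made-up figure for illustration.
remaining = estimate_remaining_seconds(83 * 1024**3, 0.05, 15 * 60)
print(f"{remaining / 3600:.2f} hours remaining")
```

On degraded arrays the rate can swing a lot, so an estimate like this is worth recomputing as the copy progresses.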

comment:6 in reply to: ↑ 4 Changed 5 years ago by mmcgrath

Replying to mmcgrath:

RSA card finally throwing some errors including:

aacraid: Host adapter reset request. SCSI hang ?
AAC: Host adapter BLINK LED 0xa
AAC0: adapter kernel panic'd a.
AAC0: adapter kernel panic'd a.

Actually, the above errors were from the kernel; the others were from the RSA card (just to avoid confusion).

comment:7 Changed 5 years ago by mmcgrath

Ok, db3's files are copied off. There was corruption of:

/var/lib/pgsql/data/base/19461/pg_internal.init

But that seems to be the only file.
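One way to check a copy like this is to compare checksums of every file under the source and destination trees. The ticket does not say how the corrupt file was found, so the following is just a sketch of one possible approach; `file_digest` and `find_mismatches` are hypothetical helper names.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(src_root, dst_root):
    """List relative paths under src_root whose copy in dst_root differs.

    Files that are missing or unreadable on either side are also reported.
    """
    bad = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            try:
                if file_digest(src) != file_digest(dst):
                    bad.append(rel)
            except OSError:
                bad.append(rel)  # missing or unreadable counts as a mismatch
    return sorted(bad)
```

Called as `find_mismatches("/var/lib/pgsql/data", copy_dir)`, where `copy_dir` is wherever the files were synced to, it returns the list of files to look at. Note that this only verifies the copy against the source; if the source itself was already damaged on disk, both sides will match.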

comment:8 Changed 5 years ago by mmcgrath

Ok.

  1. postgres files moved from db3 to xen3 in guest 'db3tmp'
  2. postgres turned back on
  3. koji turned back on
  4. a build is processing now.
  5. db3 is currently at IP 10.8.34.188
  6. db3tmp is at 10.8.34.213
  7. db3 has had its network interface disabled, it won't come back up on reboot
  8. puppet has been disabled on both

The current plan is to wait until the IBM tech gets on site. He'll replace the motherboard and backplane. Once we're confident it's working properly we'll schedule another outage, probably asap, to copy those files back.

Differences between xen3 and db3: xen3 only has one other app on it, app4, which we can safely disable if we need to. db3 has 18G RAM; db3tmp (on xen3) has 20G RAM. The biggest difference is that xen3 just has a single RAID1 mirror for its data, whereas db3 had a RAID1 array for its logs and a RAID10 array for the data. IO is likely to be the biggest cause of issues until we move back.

comment:9 Changed 5 years ago by rakesh

  • Cc rakesh@… added

comment:10 Changed 5 years ago by mmcgrath

The IBM guy left; he replaced the backplane but not the motherboard. I'm going to run some fscks, let the arrays get back in shape, and put some general load on it for the next 24 hours (provided koji continues to run OK where it is).

comment:11 Changed 5 years ago by mmcgrath

  • Status changed from assigned to closed
  • Resolution set to fixed

K, db3's arrays have synced. Everything looks good. I'm going to set up bonnie to stress it over the next 12 hours or so. If it's still working, I'll schedule downtime to move back to it.
