#2525 Monitor SIGKILL timer issue and service restart failure

Closed: Invalid None Opened 9 years ago by kieren.

Per an IRC conversation with sgallagh, sssd (1.9.2) failed to SIGKILL sssd_pam, which subsequently prevented the service from being restarted.

Log extract:

(Wed Dec 10 15:16:52 2014) [sssd] [mt_svc_sigkill] (0x0010): [mydomain][933] is not responding to SIGTERM. Sending SIGKILL.
(Wed Dec 10 15:24:18 2014) [sssd] [mt_svc_sigkill] (0x0010): [pam][935] is not responding to SIGTERM. Sending SIGKILL.
(Wed Dec 10 15:24:18 2014) [sssd] [mt_svc_sigkill] (0x0010): Sending signal to child (pam:935) failed! Ignore and pretend child is dead.

IRC log:

18:32 < kieren> if sssd kills a process (like sssd_pam), will it try at some point to restart it itself?
19:01 < sgallagh> kieren: Yes, if SSSD detects the death of (or kills) one of its subprocesses, it *should*
                  immediately relaunch it
19:02 < kieren> sgallagh: great - do you know if that appeared in a particular version?
19:02 < kieren> i have rhel6.4 / sssd 1.9.2 and it didn't seem to respawn it
19:02 < sgallagh> kieren: It was supposed to work that way from the very beginning
19:03 < sgallagh> Actually, it will try three times to restart it, then give up
19:03 < kieren> after "[pam][935] is not responding to SIGTERM. Sending SIGKILL." i got the error "Sending
                signal to child (pam:935) failed! Ignore and pretend child is dead."
19:03 < sgallagh> Wait, what?
19:04 < kieren> then nothing else in the sssd.log
19:04 < sgallagh> That... shouldn't be possibel
19:04 < sgallagh> *possible
19:06 < kieren> which bit shouldn't be possible - the 'ignore and pretend child is dead' bit?
19:07 < sgallagh> "Sending signal to child failed!"
19:13 < sgallagh> OK, unfortunately, we're not printing the reason that kill() fails here.
19:15 < sgallagh> That talloc_free() is likely incorrect.
19:21 < sgallagh> kieren: Please file a bug on this. I'll have a patch ready shortly
19:21 < sgallagh> But there's actually two bugs here.
19:22 < sgallagh> 1) When the child exit handler fires, it doesn't remove the SIGKILL timer
19:22 < sgallagh> 2) The SIGKILL timer talloc_free()s the service, so it doesn't restart.
19:23 < sgallagh> Interestingly, I think it will only have an effect the *second* time the sssd_pam crashes.
19:23 < sgallagh> Because unless there's a race, the child will be restarted before the SIGKILL tries to hit
                  the old PID and then delete the svc object
19:28 < sgallagh> Actually, there's a third bug here too.. an access-after-free() if the kill(SIGTERM) fails...
19:30 < sgallagh> Looks like some pieces of it are fixed in master, but not all
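
For reference, the 19:13 remark about not printing the reason kill() fails can be illustrated with a small standalone C sketch. This is not SSSD code; the helper name kill_errno_after_reap is hypothetical. It forks a child that exits immediately, reaps it, then signals the stale PID and logs errno, which is ESRCH when no process with that PID exists. Passing signal 0 runs the same error checks without delivering a signal.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of SSSD): signal a child that has already
 * been reaped and report why kill() failed.
 * Returns 0 if kill() unexpectedly succeeded, -1 if fork()/waitpid()
 * failed, otherwise the errno value set by kill(). */
int kill_errno_after_reap(int sig)
{
    assert(sig >= 0);

    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        _exit(0);               /* child: exit right away */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid) {
        return -1;              /* could not reap the child */
    }

    /* The PID is stale now; this mirrors a SIGKILL timer firing late. */
    if (kill(pid, sig) == 0) {
        return 0;
    }

    int err = errno;            /* save before any other library call */
    fprintf(stderr, "kill(%ld, %d) failed: [%d] %s\n",
            (long)pid, sig, err, strerror(err));
    return err;
}
```

Logging the saved errno next to the "Sending signal to child failed" message would show whether the target was already gone (ESRCH) or the call was refused for another reason such as EPERM.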

dpal commented 9 years ago

This ticket gave me a good laugh!

sgallagh commented 9 years ago

Patch submitted: https://lists.fedorahosted.org/pipermail/sssd-devel/2014-December/022793.html

The actual reasons turned out to be a little more complex and esoteric. It was a combination of two small bugs: a race condition and an improper talloc_free().

The short version is that there's a race where, if the SIGTERM takes a while to process through, it leaves open a several-second gap where the SIGKILL timer could fire, fail (because the process already exited) and then talloc_free() the service, preventing it from being started again. Ugly and next to impossible to reproduce reliably. I think the patch will fix it, though.
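
A minimal sketch of the general shape of such a fix, assuming a simplified service record with no tevent or talloc involved. This is not the submitted patch; struct svc, the svc_* functions and kill_always_esrch are all hypothetical names used for illustration. The idea is that the child exit handler cancels the pending SIGKILL timer, and the timer callback never discards the service record when kill() fails.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-in for the monitor's service record. */
struct svc {
    pid_t pid;              /* PID of the currently running child */
    bool kill_timer_armed;  /* true while a SIGKILL escalation is pending */
    int restarts;           /* number of times the child was relaunched */
};

/* Signature of kill(2), injectable so the logic can be exercised safely. */
typedef int (*kill_fn)(pid_t pid, int sig);

/* Test double (assumption): behaves like kill() on an already-dead PID. */
int kill_always_esrch(pid_t pid, int sig)
{
    (void)pid;
    (void)sig;
    errno = ESRCH;
    return -1;
}

/* SIGTERM has been sent: arm the SIGKILL escalation timer. */
void svc_terminate_requested(struct svc *s)
{
    s->kill_timer_armed = true;
}

/* Child exit handler: cancel the pending timer before relaunching, so a
 * late timer cannot act on the old PID. */
void svc_child_exited(struct svc *s, pid_t new_pid)
{
    s->kill_timer_armed = false;
    s->pid = new_pid;
    s->restarts++;
}

/* SIGKILL timer callback. Returns true only if SIGKILL was delivered.
 * On failure the record is left intact so the service can be restarted. */
bool svc_sigkill_timer_fired(struct svc *s, kill_fn do_kill)
{
    if (!s->kill_timer_armed) {
        return false;       /* stale timer: the child already exited */
    }
    s->kill_timer_armed = false;
    if (do_kill(s->pid, SIGKILL) != 0) {
        return false;       /* e.g. ESRCH: keep the record, do not free it */
    }
    return true;
}
```

In a real monitor the timer would be a tevent timer owned by the service object and cancelling it would mean freeing the timer event; the boolean flag here only models that ordering.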

owner: somebody => sgallagh
patch: 0 => 1
status: new => assigned

jhrozek commented 9 years ago

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.12.4

jhrozek commented 9 years ago

master: 152251b

resolution: => fixed
status: assigned => closed

jhrozek commented 9 years ago

Fields changed

rhbz: => 0

jhrozek commented 8 years ago

Linked to Bugzilla bug: https://bugzilla.redhat.com/show_bug.cgi?id=1267761 (Red Hat Enterprise Linux 6)

rhbz: 0 => [https://bugzilla.redhat.com/show_bug.cgi?id=1267761 1267761]

preichl commented 8 years ago

I'm afraid that the bug was not fixed completely:

https://bugzilla.redhat.com/show_bug.cgi?id=1276781

resolution: fixed =>
sensitive: => 0
status: closed => reopened

jhrozek commented 8 years ago

Reopened bugs belong to triage.

milestone: SSSD 1.12.4 => NEEDS_TRIAGE

jhrozek commented 8 years ago

We should take a look at the code again but we don't have a reproducer.

milestone: NEEDS_TRIAGE => SSSD 1.13.3
priority: major => minor

jhrozek commented 8 years ago

Linked to Bugzilla bug: https://bugzilla.redhat.com/show_bug.cgi?id=1276781 (Red Hat Enterprise Linux 6)

rhbz: [https://bugzilla.redhat.com/show_bug.cgi?id=1267761 1267761] => [https://bugzilla.redhat.com/show_bug.cgi?id=1267761 1267761], [https://bugzilla.redhat.com/show_bug.cgi?id=1276781 1276781]

jhrozek commented 8 years ago

This ticket still needs work and we need to release 1.13.3 soon.

milestone: SSSD 1.13.3 => SSSD 1.13.4
owner: sgallagh => somebody
status: reopened => new

jhrozek commented 8 years ago

This will be (hopefully) mitigated by some changes being worked on:

- Simo rewrote the watchdog to be in-process
- the cache writes should be less frequent in 1.14 as well
- Pavel is changing the requests talloc hierarchy

Because of the two above and because we don't have a way to reproduce this problem, I'm marking this bug as minor and moving to a release further away. I would prefer to see if we still have issues after 1.14 changes.

milestone: SSSD 1.13.4 => SSSD 1.13.5

jhrozek commented 8 years ago

Fields changed

milestone: SSSD 1.13.5 => SSSD 1.15 beta

jhrozek commented 7 years ago

The watchdog and the DP rewrite make this ticket obsolete in my opinion.

review: 0 => 1
selected: => Not need

jhrozek commented 7 years ago

Bugs like these shouldn't happen with the new talloc hierarchy of the requests. Please reopen if you can reproduce the issue with 1.14 or newer.

resolution: => worksforme
status: new => closed

Metadata Update from @kieren:
- Issue set to the milestone: SSSD Future releases (no date set yet)

7 years ago

pbrezina commented 3 years ago

SSSD is moving from Pagure to GitHub. This means that new issues and pull requests
will be accepted only in SSSD's GitHub repository.

This issue has been cloned to GitHub and is available here:
- https://github.com/SSSD/sssd/issues/3567

If you want to receive further updates on the issue, please navigate to the GitHub issue
and click the subscribe button.

Thank you for understanding. We apologize for any inconvenience.

Metadata

Assignee: None
Tags: None
Blocking: None
Depending on: None
Priority: minor
Milestone: SSSD Future releases (no date set yet)
type: defect
component: Service Monitor
version: 1.9.2
selected: Not need
testsupdated:
patch:
rhbz: https://bugzilla.redhat.com/show_bug.cgi?id=1267761, https://bugzilla.redhat.com/show_bug.cgi?id=1276781
design_review:
review:
changelog: None
keywords: None
coverity: None
mark:
blocking: None
design: None
sensitive: None
blockedby: None
feature_milestone: None
