Issue #2904: sssd_be AD segfaults on missing A record - sssd

Closed: Fixed None Opened 8 years ago by lukebigum.

When using sssd on CentOS 6.6 with the AD backend against a Samba 4.2 AD domain, sssd does not handle a rare failure condition: the SRV records point at a DC, but the A record for that domain controller is missing. sssd_be periodically crashes; it restarts a couple of times but generally does not recover:

Dec 16 09:44:05 host kernel: sssd_be[107682]: segfault at 0 ip 00007fd12c5e018b sp 00007fffba8db420 error 4 in libsss_ad.so[7fd12c5ca000+20000]
Dec 16 09:44:05 host abrtd: Directory 'ccpp-2015-12-16-09:44:05-107682' creation detected
Dec 16 09:44:05 host abrt[107687]: Saved core dump of pid 107682 (/usr/libexec/sssd/sssd_be) to /var/spool/abrt/ccpp-2015-12-16-09:44:05-107682 (1978368 bytes)

The SIGSEGV appears to happen in sss_ldap_init_send(), src/util/sss_ldap.c:331.

Getting into this condition is rare - it's a Samba bug that I'm working on separately. The situation could probably be replicated by poisoning DNS though. My expected behavior would be to give up on this DC, try any other DCs in the Site, then try other DCs in other Sites.
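The expected behaviour described above can be sketched roughly as follows. This is only an illustration of the failover order, not SSSD's actual failover code: the function name `pick_server`, the `sites` mapping, the site names and the injected `resolve` callable are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical type for illustration; not an SSSD name.
# Maps a hostname to an IP address, or None when there is no A record.
Resolver = Callable[[str], Optional[str]]


def pick_server(sites: Dict[str, List[str]],
                preferred_site: str,
                resolve: Resolver) -> Optional[Tuple[str, str]]:
    """Return (hostname, address) of the first DC that resolves, or None."""
    # Preferred site first, then every other site in the order given.
    ordered = [preferred_site] + [s for s in sites if s != preferred_site]
    for site in ordered:
        for host in sites.get(site, []):
            addr = resolve(host)
            if addr is None:
                # The SRV record points here but there is no address:
                # give up on this DC and move on instead of using it.
                continue
            return host, addr
    return None


# Example with made-up data: dc1 has lost its A record, dc2 has not.
records = {"dc1.example.com": None, "dc2.example.com": "192.0.2.10"}
sites = {"SiteA": ["dc1.example.com"], "SiteB": ["dc2.example.com"]}
print(pick_server(sites, "SiteA", records.get))
```

The point of the sketch is only that an unresolvable DC is treated as "try the next one" rather than being passed on with an empty address.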

I have ABRT crashes and cores / backtraces from GDB.

lukebigum commented 8 years ago

ABRT crash report
sssd_be_ccpp.tgz

lukebigum commented 8 years ago

I'm unable to attach the GDB backtraces and cores, so you can download a compressed tarball of them here if you want them (https://files.lmax.com/rmo325). It will survive there for 20 days.

Some more information:

sssd-1.11.6-30.el6_6.4.x86_64

Will attach the conf and log file.

lukebigum commented 8 years ago

attachment
sssd.log

lukebigum commented 8 years ago

attachment
sssd.conf

jhrozek commented 8 years ago

Thank you for the bug report. Is there a way you can test with more recent packages? 6.6 is quite old.

Either 6.7 or Lukas' test repo: https://copr.fedoraproject.org/coprs/lslebodn/sssd-1-12/ or the 6.8 preview repo: https://copr-fe.cloud.fedoraproject.org/coprs/jhrozek/SSSD-6.8-preview/

lukebigum commented 8 years ago

That's probably doable; I'm half expecting the Samba server to delete its own A record at exactly 4pm today, so there's a good chance I'll have an opportunity to try it out. I've synced down your SSSD-6.8-preview repo and will let you know how it goes.

lukebigum commented 8 years ago

As expected, Samba deleted its own A record at 4pm. I've got this sssd version on a VM:

sssd-1.13.2-7.el6.unsupported_preview.x86_64

And it exhibits the exact same symptoms, right down to the same line of code:

#0  sss_ldap_init_send (mem_ctx=<value optimized out>, ev=0x128c560, uri=0x12e39a0 "(null)", addr=0x0, addr_len=128, timeout=6) at src/util/sss_ldap.c:349
        ret = 0
        req = 0x12db5e0
        state = 0x12e43f0
        __FUNCTION__ = "sss_ldap_init_send"
        subreq = 0x12e39a0
        tv = {tv_sec = 19775120, tv_usec = 104}

You can download a GDB core dump and trace from here: https://files.lmax.com/mnfa5p

And I will upload an ABRT crash report that also contains a cut down core.

lukebigum commented 8 years ago

attachment
abrt_sssd-1.13.2-7.el6.unsupported_preview.tar.gz

jhrozek commented 8 years ago

Thank you very much for testing. In that case, I think this is something we should fix in the next upstream version.

mkosek commented 8 years ago

Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1292456

rhbz: => [https://bugzilla.redhat.com/show_bug.cgi?id=1292456 1292456]

mkosek commented 8 years ago

Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1292458

rhbz: [https://bugzilla.redhat.com/show_bug.cgi?id=1292456 1292456] => [https://bugzilla.redhat.com/show_bug.cgi?id=1292456 1292456], [https://bugzilla.redhat.com/show_bug.cgi?id=1292458 1292458]

jhrozek commented 8 years ago

Moving into 1.13.4 as per Dec-17 ticket triage.

milestone: NEEDS_TRIAGE => SSSD 1.13.4

pbrezina commented 8 years ago

Fields changed

owner: somebody => pbrezina
status: new => assigned

pbrezina commented 8 years ago

Hi, unfortunately, the abrt report does not contain the sssd logs for some reason. Can you also attach complete logs (/var/log/sssd) with the debug level set to 0x3ff0, please? Thanks.

The fix is relatively trivial, but I'd also like to see what is happening there so I can choose the proper place to fix it.

lukebigum commented 8 years ago

attachment
sssd.log_debug_0x3ff0

lukebigum commented 8 years ago

Log file added. The failure condition can be replicated easily enough with dnsmasq by sending DNS requests for the DC to nowhere:

dnsmasq --server=/dc.example.com/
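To confirm that a reproducer like this has actually taken effect on the client, one could check which DC hostnames no longer resolve. The following is a minimal sketch using only the Python standard library; the function name `hosts_without_address` is made up for this example, and the injectable `getaddrinfo` parameter exists only so the check can be exercised without touching real DNS.

```python
import socket
from typing import Callable, List


def hosts_without_address(hosts: List[str],
                          getaddrinfo: Callable = socket.getaddrinfo) -> List[str]:
    """Return the hostnames that fail to resolve to any address."""
    missing = []
    for host in hosts:
        try:
            # Port 389 (LDAP) is only there to form a valid lookup.
            getaddrinfo(host, 389, proto=socket.IPPROTO_TCP)
        except socket.gaierror:
            # Resolution failed, e.g. because the A record is gone.
            missing.append(host)
    return missing
```

Calling `hosts_without_address(["dc.example.com"])` on the affected client should return the DC's name while the record is missing and an empty list once resolution works again.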

lukebigum commented 8 years ago

I've just realised that the log is not complete: it doesn't have any backend logging. Is that what you want, the AD backend logs?

pbrezina commented 8 years ago

Hi, unfortunately I am not able to reproduce the issue with either current master or 1.11, neither by using dnsmasq nor by deleting the A record from DNS. Yes, sssd.log is not sufficient; I meant for you to send all logs in /var/log/sssd/*, and as you correctly guessed, I am especially interested in sssd_$domain.log.

lukebigum commented 8 years ago

This is really frustrating... 24 hours ago I could get sssd_be to segfault with dnsmasq as per the last attached log; now I can't. There must be some other condition, in addition to the missing A record, that causes this failure. My environment no longer has it, and I don't know what it is in order to trigger it again.

At this point in time I can't get you the logs you want. What is happening now is that sssd_be fails to resolve the primary Site's DC and then goes looking for other backup DCs (as you'd expect). Auth against a backup DC still doesn't work for some other reason, but it's not crashing any more.

Not sure if you want to try to fix it blindly based on the core dump or close this now.

pbrezina commented 8 years ago

We'll fix it without the logs somehow, but I'd like to know what situation occurred. If you manage to obtain the logs after all, please send them.

pbrezina commented 8 years ago

Hi, I think I see the code area where the bug lies, but I can't identify the exact location without the logs or a reproducer. I sent a patch that prevents the segfault to the list, but if you manage to get those logs, please attach them here. Thanks.

patch: 0 => 1

jhrozek commented 8 years ago

master: 8bd9ec3
sssd-1-13: b32ea7b

resolution: => fixed
status: assigned => closed

Metadata Update from @lukebigum:
- Issue assigned to pbrezina
- Issue set to the milestone: SSSD 1.13.4

7 years ago

pbrezina commented 3 years ago

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/3945

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata

Assignee: pbrezina
Tags: None
Blocking: None
Depending on: None
Priority: major
Milestone: SSSD 1.13.4
type: defect
component: SSSD
version: 1.11.6
selected: None
testsupdated:
patch:
rhbz: https://bugzilla.redhat.com/show_bug.cgi?id=1292456, https://bugzilla.redhat.com/show_bug.cgi?id=1292458
design_review:
review:
changelog: None
keywords: None
coverity: None
mark:
blocking: None
design: None
sensitive: None
blockedby: None
feature_milestone: None
