Issue #131: bind hangs after reload/GSSAPI Error: The referenced context has expired (Success) - bind-dyndb-ldap

bind-dyndb-ldap

#131 bind hangs after reload/GSSAPI Error: The referenced context has expired (Success)

Closed: Fixed None Opened 10 years ago by davix.

What does not work as expected?[[BR]]
After a while (about once in a week) the bind daemon is in the state hang/zombie. The bind daemon seems to be present and accept requests from the clients, but is not answering any dns requests.
Only killing the process with kill -9 can stop the daemon. After starting bind again, it works fine, until the problem occurs again.[[BR]]

How can we reproduce the problem?[[BR]]
About once in a week at 03:xx am.

Do you see any error messages or suspicious information in logs?[[BR]]

/var/log/messages[[BR]]

Example Day1:[[BR]]

Mar 7 01:06:39 ipa named[5434]: LDAP query timed out. Try to adjust "timeout" parameter[[BR]]

Mar 7 01:06:43 ipa named[5434]: LDAP query timed out. Try to adjust "timeout" parameter[[BR]]

Mar 7 01:06:49 ipa named[5434]: LDAP query timed out. Try to adjust "timeout" parameter[[BR]]

Mar 7 01:06:53 ipa named[5434]: LDAP query timed out. Try to adjust "timeout" parameter[[BR]]

Mar 7 01:06:53 ipa named[5434]: LDAP query timed out. Try to adjust "timeout" parameter[[BR]]
[[BR]]

Mar 7 03:21:02 ipa named[5434]: received control channel command 'reload'[[BR]]

Mar 7 03:21:02 ipa named[5434]: loading configuration from '/etc/named.conf'[[BR]]

Mar 7 03:21:02 ipa named[5434]: using default UDP/IPv4 port range: [1024, 65535][[BR]]

Mar 7 03:21:02 ipa named[5434]: using default UDP/IPv6 port range: [1024, 65535][[BR]]

Mar 7 03:21:02 ipa named[5434]: sizing zone task pool based on 6 zones[[BR]]

Mar 7 03:21:02 ipa named[5434]: bind to LDAP server failed: Local error[[BR]]

Mar 7 03:21:02 ipa named[5434]: reloading configuration failed: failure[[BR]]

Mar 7 03:21:02 ipa named[5434]: received SIGHUP signal to reload zones[[BR]]

Mar 7 03:21:02 ipa named[5434]: loading configuration from '/etc/named.conf'[[BR]]

Mar 7 03:21:02 ipa named[5434]: using default UDP/IPv4 port range: [1024, 65535][[BR]]

Mar 7 03:21:02 ipa named[5434]: using default UDP/IPv6 port range: [1024, 65535][[BR]]

Mar 7 03:21:02 ipa named[5434]: sizing zone task pool based on 6 zones[[BR]]

Mar 7 03:21:02 ipa named[5434]: bind to LDAP server failed: Local error[[BR]]

Mar 7 03:21:02 ipa named[5434]: reloading configuration failed: failure[[BR]]

Mar 7 03:22:23 ipa named[5434]: shutting down[[BR]]

Mar 7 03:22:23 ipa named[5434]: stopping command channel on 127.0.0.1#953[[BR]]

Mar 7 03:22:23 ipa named[5434]: stopping command channel on ::1#953[[BR]]

Mar 7 03:22:23 ipa named[5434]: no longer listening on ::#53[[BR]]

Mar 7 03:22:23 ipa named[5434]: no longer listening on 127.0.0.1#53[[BR]]

Mar 7 03:22:23 ipa named[5434]: no longer listening on xxx#53[[BR]]
[[BR]]

Example Day2:[[BR]]

Mar 18 03:33:01 ipa named[24746]: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Ticket expired)[[BR]]

Mar 18 03:33:01 ipa named[24746]: bind to LDAP server failed: Local error[[BR]]

Mar 18 03:33:01 ipa named[24746]: reloading configuration failed: failure[[BR]]

Mar 18 03:33:27 ipa named[24746]: clients-per-query increased to 15[[BR]]

[[BR]]
versions in use[[BR]]

OS:[[BR]]
centos-release-6-5.el6.centos.11.2.x86_64

Bind:[[BR]]
bind-9.8.2-0.17.rc1.el6_4.6.x86_64
bind-dyndb-ldap-2.3-5.el6.x86_64
bind-libs-9.8.2-0.17.rc1.el6_4.6.x86_64
bind-utils-9.8.2-0.17.rc1.el6_4.6.x86_64

IPA:[[BR]]

ipa-server-3.0.0-37.el6.x86_64
ipa-client-3.0.0-37.el6.x86_64[[BR]][[BR]]

Configurations[[BR]]

named.conf:*[[BR]]

dynamic-db "ipa" {
library "ldap.so";
arg "uri ldapi://%2fvar%2frun%2fslapd-xxx.socket";
arg "base cn=dns, dc=xxx";
arg "fake_mname xxx.";
arg "auth_method sasl";
arg "sasl_mech GSSAPI";
arg "sasl_user DNS/xxx";
arg "cache_ttl 600";
arg "zone_refresh 0";
arg "psearch yes";
arg "serial_autoincrement yes";
arg "connections 4";
};

pspacek commented 10 years ago

Hello!

I would like to get details about this problem.

When it happens again, please run following commands on the affected server:

$ rndc trace 20
$ dig @127.0.0.1 some.name.in.your.ldap.zone.example.

It is possible that BIND will not log anything. In any case, please capture coredump from the runnning process:

$ gdb `which named` `pgrep named`
(gdb) gcore

send (compressed) core dump file to e-mail pspacek@redhat.com
attach (compressed) log file /var/named/data/named.run to this ticket or also send it to me privately
add information about yout Kerberos ticket policy to this ticket: {{{ipa krbtpolicy-show}}}

Thank you very much!

pspacek commented 10 years ago

I have received some coredumps:

(gdb) thread apply all bt

Thread 4 (Thread 0x7f7820ba07c0 (LWP 3665)):
#0  0x00007f781e40dc94 in do_sigsuspend (set=<value optimized out>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:63
#1  __sigsuspend (set=<value optimized out>) at ../sysdeps/unix/sysv/linux/sigsuspend.c:78
#2  0x00007f781f5b05a4 in isc__app_ctxrun (ctx0=0x7f781f7d5880) at app.c:680
#3  0x00007f7820bde2f5 in main (argc=<value optimized out>, argv=0x7fff903b87a8) at ./main.c:1085

Thread 3 (Thread 0x7f781ba61700 (LWP 3668)):
#0  0x00007f781e4c3f43 in epoll_wait () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007f781f5bf8d4 in watcher (uap=0x7f7820b67010) at socket.c:3763
#2  0x00007f781ef61851 in start_thread (arg=0x7f781ba61700) at pthread_create.c:301
#3  0x00007f781e4c394d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 2 (Thread 0x7f781c462700 (LWP 3667)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:216
#1  0x00007f781f5c188e in isc_condition_waituntil (c=0x7f7820b65078, m=0x7f7820b65028, t=0x7f7820b6506c) at condition.c:59
#2  0x00007f781f5aeb0d in run (uap=0x7f7820b65010) at timer.c:825
#3  0x00007f781ef61851 in start_thread (arg=0x7f781c462700) at pthread_create.c:301
#4  0x00007f781e4c394d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Thread 1 (Thread 0x7f781ce63700 (LWP 3666)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f781f5ac1c1 in dispatch (uap=0x7f7820b64010) at task.c:960
#2  run (uap=0x7f7820b64010) at task.c:1157
#3  0x00007f781ef61851 in start_thread (arg=0x7f781ce63700) at pthread_create.c:301
#4  0x00007f781e4c394d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

My best guess for the moment is that internal reference counting doesn't work correctly in some corner cases, probably in some error handler. As a result:

A reference counter for some zone/other object doesn't drop to 0 during shutdown
A thread is waiting until all objects referencing the particular thread are freed
=> deadlock

I will dig into it a bit more but this will be very hard to debug without a reproducer. Does it happen every time? Could you provide complete logs to me? (You can sent them privately to pspacek@redhat.com if you don't want to expose logs in the ticket.)

Thanks!

pspacek commented 10 years ago

Guys, can you provide me some more information about this problem? I can't move without further information. A reproducer would be the best thing!

pspacek commented 10 years ago

Guys, I don't have enough information to fix this so I'm closing this ticket with resolution worksforme. Feel free to reopen it if you get more information about the problem.

Have a nice day!

pspacek commented 9 years ago

It seems that I have reproduced it by chance. My log now says:

Aug  9 01:01:01 vm-108 named[1430]: GSSAPI Error: The referenced context has expired (Success)
Aug  9 01:01:01 vm-108 named[1430]: bind to LDAP server failed: Local error
Aug  9 01:01:01 vm-108 named[1430]: reloading configuration failed: failure
Aug  9 01:01:01 vm-108 named[1430]: received SIGHUP signal to reload zones

Version: bind-dyndb-ldap-2.3-5.el6.x86_64

pspacek commented 9 years ago

It seems that logic dealing with creds.times.endtime in krb5_helper.c is broken. Let's fix that.

pspacek commented 9 years ago

This problem was fixed in all maintained branches:

master: 24f05cf
v4: bc5f313
v3: 55c9148
v2: 80f7663

Metadata Update from @pspacek:
- Issue assigned to pspacek
- Issue set to the milestone: Fedora 21

7 years ago

Metadata

Assignee

pspacek

Tags

None

Blocking

None

Depending on

None

Priority

critical

Milestone

Fedora 21

rhbz

https://bugzilla.redhat.com/show_bug.cgi?id=1142176

blockedby

None

blocking

None

fixedby

80f7663f309c0d0b9cb89ed8f8b38301b207360d

fixedin

3.7

proposedmilestone

None

type

defect

component

ldap driver

version

2.3

keywords

nihsyndrome, deadlock

davix, sigbjorn@nixtra.com, freeipa-users@dstipp.mail.coolhack.net, jan.orel@gooddata.com

bind-dyndb-ldap

Source Code

Documentation

#131 bind hangs after reload/GSSAPI Error: The referenced context has expired (Success) Closed: Fixed None Opened 10 years ago by davix.

Metadata

#131 bind hangs after reload/GSSAPI Error: The referenced context has expired (Success)

Closed: Fixed None Opened 10 years ago by davix.