Here is the current status
- During some tests on main (April 9th) I hit a crash on a master instance The crash occured while stopping the replica agreement protocol. The protocol structure has been overwritten. Likely it was freed and reuse to store something looking like an entry/valueset. (gdb) thread 1 [Switching to thread 1 (Thread 0x7f3ad972e700 (LWP 6192))] #0 __pthread_mutex_lock (mutex=0x6e6f69) at pthread_mutex_lock.c:51 51 unsigned int type = PTHREAD_MUTEX_TYPE (mutex); (gdb) where #0 __pthread_mutex_lock (mutex=0x6e6f69) at pthread_mutex_lock.c:51 #1 0x000000370c6235c9 in PR_Lock () from /lib64/libnspr4.so #2 0x00007f3adf44c0e8 in event_occurred (prp=0x16efc20, event=32) at ldap/servers/plugins/replication/repl5_inc_protocol.c:1227 #3 0x00007f3adf44af0e in repl5_inc_run (prp=0x16efc20) at ldap/servers/plugins/replication/repl5_inc_protocol.c:736 #4 0x00007f3adf4528d8 in prot_thread_main (arg=0x1727530) at ldap/servers/plugins/replication/repl5_protocol.c:295 #5 0x000000370c628cf3 in ?? () from /lib64/libnspr4.so #6 0x00000035d7007d14 in start_thread (arg=0x7f3ad972e700) at pthread_create.c:309 #7 0x00000035d6cf168d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 (gdb) frame 2 #2 0x00007f3adf44c0e8 in event_occurred (prp=0x16efc20, event=32) at ldap/servers/plugins/replication/repl5_inc_protocol.c:1227 1227 PR_Lock(prp->lock); (gdb) list 1202,1217 1200 } 1201 1202 /* 1203 * Notify the protocol about some event. Signal the condition 1204 * variable in case the protocol is sleeping. Multiple occurences 1205 * of a single event type are not remembered (e.g. no stack 1206 * of events is maintained). 1207 */ 1208 static void 1209 event_notify(Private_Repl_Protocol *prp, PRUint32 event) 1210 { 1211 PR_ASSERT(NULL != prp); 1212 PR_Lock(prp->lock); 1213 prp->eventbits |= event; 1214 PR_NotifyCondVar(prp->cvar); 1215 PR_Unlock(prp->lock); 1216 } 1217 (gdb) print prp $1 = (Private_Repl_Protocol *) 0x16efc20 (gdb) print *prp $2 = {delete = 0x6c437463656a626f, run = 0x737361, stop = 0, status = 0x21, notify_update = 0x706f74, notify_agmt_changed = 0, notify_window_opened = 0, notify_window_closed = 0x21, update_now = 0x7470697263736564, lock = 0x6e6f69, cvar = 0x0, stopped = 33, terminate = 0, eventbits = 22277696, conn = 0x0, last_acquire_response_code = 0, agmt = 0x21, replica_object = 0x3272657473616d, private = 0x0, replica_acquired = 0, repl50consumer = 0, repl71consumer = 33, repl90consumer = 0, timeout = 16805475} (gdb) print (char *) 0x16efc20 $5 = 0x16efc20 "objectClass" (gdb) x/4g 0x16efc20 0x16efc20: 0x6c437463656a626f 0x0000000000737361 0x16efc30: 0x0000000000000000 0x0000000000000021 (gdb) print (char *) 0x16efc20 $6 = 0x16efc20 "objectClass" Others active threads are looking "normal" (closing the CL). (gdb) info threads Id Target Id Frame 9 Thread 0x7f3ad974f700 (LWP 6191) pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:86 8 Thread 0x7f3ad9f50700 (LWP 6083) pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:166 7 Thread 0x7f3adbf54700 (LWP 6071) 0x00000035d6cea9f3 in select () at ../sysdeps/unix/syscall-template.S:82 6 Thread 0x7f3ab77e6700 (LWP 6226) 0x00000035d6cea9f3 in select () at ../sysdeps/unix/syscall-template.S:82 5 Thread 0x7f3adb753700 (LWP 6072) 0x00000035d6cea9f3 in select () at ../sysdeps/unix/syscall-template.S:82 4 Thread 0x7f3ada751700 (LWP 6074) 0x00000035d6cea9f3 in select () at ../sysdeps/unix/syscall-template.S:82 3 Thread 0x7f3ae30e8800 (LWP 6063) pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:86 2 Thread 0x7f3adaf52700 (LWP 6073) 0x00000035d6cea9f3 in select () at ../sysdeps/unix/syscall-template.S:82 * 1 Thread 0x7f3ad972e700 (LWP 6192) __pthread_mutex_lock (mutex=0x6e6f69) at pthread_mutex_lock.c:51 (gdb) thread 9 [Switching to thread 9 (Thread 0x7f3ad974f700 (LWP 6191))] #0 pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:86 86 subq $WRITERS_WAKEUP, %rdi (gdb) where #0 pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:86 #1 0x00007f3ae31dd9b1 in slapi_rwlock_wrlock (rwlock=0x134d3c0) at ldap/servers/slapd/slapi2nspr.c:267 #2 0x00007f3ae31990a4 in slapd_log_error_proc_internal (subsystem=0x7f3adf48d740 "NSMMReplicationPlugin - changelog program", fmt=0x7f3adf4899a2 "_cl5TrimMain: exiting\n", ap_err=0x7f3ad974eca0, ap_file=0x7f3ad974ec88) at ldap/servers/slapd/log.c:1806 #3 0x00007f3ae3199849 in slapi_log_error (severity=12, subsystem=0x7f3adf48d740 "NSMMReplicationPlugin - changelog program", fmt=0x7f3adf4899a2 "_cl5TrimMain: exiting\n") at ldap/servers/slapd/log.c:2013 #4 0x00007f3adf42e11f in _cl5TrimMain (param=0x0) at ldap/servers/plugins/replication/cl5_api.c:3436 #5 0x000000370c628cf3 in ?? () from /lib64/libnspr4.so #6 0x00000035d7007d14 in start_thread (arg=0x7f3ad974f700) at pthread_create.c:309 #7 0x00000035d6cf168d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 (gdb) thread 3 [Switching to thread 3 (Thread 0x7f3ae30e8800 (LWP 6063))] #0 pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:86 86 subq $WRITERS_WAKEUP, %rdi (gdb) where #0 pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:86 #1 0x00007f3ae31dd9b1 in slapi_rwlock_wrlock (rwlock=0x134d3c0) at ldap/servers/slapd/slapi2nspr.c:267 #2 0x00007f3ae31990a4 in slapd_log_error_proc_internal (subsystem=0x7f3adf48d740 "NSMMReplicationPlugin - changelog program", fmt=0x7f3adf4896d8 "_cl5Close: waiting for threads to exit: %d thread(s) still active\n", ap_err=0x7fffb4c87b80, ap_file=0x7fffb4c87b68) at ldap/servers/slapd/log.c:1806 #3 0x00007f3ae3199849 in slapi_log_error (severity=12, subsystem=0x7f3adf48d740 "NSMMReplicationPlugin - changelog program", fmt=0x7f3adf4896d8 "_cl5Close: waiting for threads to exit: %d thread(s) still active\n") at ldap/servers/slapd/log.c:2013 #4 0x00007f3adf42da4d in _cl5Close () at ldap/servers/plugins/replication/cl5_api.c:3204 #5 0x00007f3adf4294d5 in cl5Close () at ldap/servers/plugins/replication/cl5_api.c:585 #6 0x00007f3adf437b92 in changelog5_cleanup () at ldap/servers/plugins/replication/cl5_init.c:111 #7 0x00007f3adf44f6f6 in multimaster_stop (pb=0x7fffb4c87d60) at ldap/servers/plugins/replication/repl5_init.c:755 #8 0x00007f3ae31ba631 in plugin_call_func (list=0x1484350, operation=210, pb=0x7fffb4c87d60, call_one=1) at ldap/servers/slapd/plugin.c:1453 #9 0x00007f3ae31ba51e in plugin_call_one (list=0x1484350, operation=210, pb=0x7fffb4c87d60) at ldap/servers/slapd/plugin.c:1421 #10 0x00007f3ae31ba431 in plugin_dependency_closeall () at ldap/servers/slapd/plugin.c:1365 #11 0x00007f3ae31ba4c4 in plugin_closeall (close_backends=1, close_globals=1) at ldap/servers/slapd/plugin.c:1408 #12 0x000000000041988d in slapd_daemon (ports=0x7fffb4c882a0) at ldap/servers/slapd/daemon.c:1358 #13 0x0000000000420dab in main (argc=7, argv=0x7fffb4c88428) at ldap/servers/slapd/main.c:1266 - This crash is not reproducible on demand. - This problem is looking similar as https://fedorahosted.org/389/ticket/618 (integrated in main March 22nd) Possibly the fix for ticket618 was not complete
Here are the next steps
- Evaluate if it exist case where the protocol is freed before all replication agreements are stopped.
attachment 0001-Ticket-47325-Crash-at-shutdown-on-a-replica-aggremen.patch
What is the timeout - conn->timeout.tv_sec = agmt_get_timeout(conn->agmt) - ? Maybe we need a much shorter timeout by default?
'''Here is the current status'''
'''Here are the next steps'''
attachment 0002-Ticket-47325-Crash-at-shutdown-on-a-replica-aggremen.patch
git merge ticket47325 Updating 87f5ef5..1cbd6d8 Fast-forward ldap/servers/plugins/replication/repl5_connection.c | 2 +- ldap/servers/plugins/replication/repl5_protocol.c | 4 ++-- ldap/servers/slapd/ldaputil.c | 18 +++++++++++++++--- 3 files changed, 18 insertions(+), 6 deletions(-)
git push origin master Counting objects: 19, done. Delta compression using up to 4 threads. Compressing objects: 100% (10/10), done. Writing objects: 100% (10/10), 1.42 KiB, done. Total 10 (delta 8), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git 87f5ef5..1cbd6d8 master -> master
commit 1cbd6d8 Author: Thierry bordaz (tbordaz) tbordaz@redhat.com Date: Wed Apr 10 18:45:17 2013 +0200
push into 389-ds-base-1.3.1
git push origin 389-ds-base-1.3.1 Counting objects: 25, done. Delta compression using up to 4 threads. Compressing objects: 100% (12/12), done. Writing objects: 100% (12/12), 2.65 KiB, done. Total 12 (delta 8), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git 78f029f..3e8c2a4 389-ds-base-1.3.1 -> 389-ds-base-1.3.1
commit 3e8c2a4 Author: Thierry bordaz (tbordaz) tbordaz@redhat.com Date: Wed Apr 10 18:45:17 2013 +0200
git patch file (master) -- coverity fix 0001-Coverity-fix-13164-Logically-dead-code.patch
Bug description: commit 1cbd6d8 for Ticket 47325 - Crash at shutdown on a replica agreement introduced '13164: Logically dead code'.
Fix description: Since bind_timeout never be NULL, no need to do NULL check. Eliminated the dead code.
Pushed to master: commit 324eb76 Pushed to 389-ds-base-1.3.1: 32d7742
Metadata Update from @tbordaz: - Issue assigned to tbordaz - Issue set to the milestone: 1.3.1
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/662
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
subscribe
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)
Login to comment on this ticket.