#47325 Crash at shutdown on a replica aggrement
Closed: wontfix None Opened 11 years ago by tbordaz.

Here is the current status

- During some tests on main (April 9th) I hit a crash on a master instance
  The crash occured while stopping the replica agreement protocol.
  The protocol structure has been overwritten. Likely it was freed and reuse to store 
  something looking like an entry/valueset.


(gdb) thread 1
[Switching to thread 1 (Thread 0x7f3ad972e700 (LWP 6192))]
#0  __pthread_mutex_lock (mutex=0x6e6f69) at pthread_mutex_lock.c:51
51    unsigned int type = PTHREAD_MUTEX_TYPE (mutex);
(gdb) where
#0  __pthread_mutex_lock (mutex=0x6e6f69) at pthread_mutex_lock.c:51
#1  0x000000370c6235c9 in PR_Lock () from /lib64/libnspr4.so
#2  0x00007f3adf44c0e8 in event_occurred (prp=0x16efc20, event=32) at ldap/servers/plugins/replication/repl5_inc_protocol.c:1227
#3  0x00007f3adf44af0e in repl5_inc_run (prp=0x16efc20) at ldap/servers/plugins/replication/repl5_inc_protocol.c:736
#4  0x00007f3adf4528d8 in prot_thread_main (arg=0x1727530) at ldap/servers/plugins/replication/repl5_protocol.c:295
#5  0x000000370c628cf3 in ?? () from /lib64/libnspr4.so
#6  0x00000035d7007d14 in start_thread (arg=0x7f3ad972e700) at pthread_create.c:309
#7  0x00000035d6cf168d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb) frame 2
#2  0x00007f3adf44c0e8 in event_occurred (prp=0x16efc20, event=32) at ldap/servers/plugins/replication/repl5_inc_protocol.c:1227
1227        PR_Lock(prp->lock);
(gdb) list 1202,1217
1200    }
1201    
1202    /*
1203     * Notify the protocol about some event. Signal the condition
1204     * variable in case the protocol is sleeping. Multiple occurences
1205     * of a single event type are not remembered (e.g. no stack
1206     * of events is maintained).
1207     */
1208    static void
1209    event_notify(Private_Repl_Protocol *prp, PRUint32 event)
1210    {
1211        PR_ASSERT(NULL != prp);
1212        PR_Lock(prp->lock);
1213        prp->eventbits |= event;
1214        PR_NotifyCondVar(prp->cvar);
1215        PR_Unlock(prp->lock);
1216    }
1217    
(gdb) print prp
$1 = (Private_Repl_Protocol *) 0x16efc20
(gdb) print *prp
$2 = {delete = 0x6c437463656a626f, run = 0x737361, stop = 0, status = 0x21, notify_update = 0x706f74, notify_agmt_changed = 0, 
  notify_window_opened = 0, notify_window_closed = 0x21, update_now = 0x7470697263736564, lock = 0x6e6f69, cvar = 0x0, 
  stopped = 33, terminate = 0, eventbits = 22277696, conn = 0x0, last_acquire_response_code = 0, agmt = 0x21, 
  replica_object = 0x3272657473616d, private = 0x0, replica_acquired = 0, repl50consumer = 0, repl71consumer = 33, 
  repl90consumer = 0, timeout = 16805475}
(gdb) print (char *) 0x16efc20
$5 = 0x16efc20 "objectClass"
(gdb) x/4g 0x16efc20
0x16efc20:  0x6c437463656a626f  0x0000000000737361
0x16efc30:  0x0000000000000000  0x0000000000000021
(gdb) print (char *) 0x16efc20
$6 = 0x16efc20 "objectClass"



        Others active threads are looking "normal" (closing the CL).


(gdb) info threads
  Id   Target Id         Frame 
  9    Thread 0x7f3ad974f700 (LWP 6191) pthread_rwlock_wrlock ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:86
  8    Thread 0x7f3ad9f50700 (LWP 6083) pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:166
  7    Thread 0x7f3adbf54700 (LWP 6071) 0x00000035d6cea9f3 in select () at ../sysdeps/unix/syscall-template.S:82
  6    Thread 0x7f3ab77e6700 (LWP 6226) 0x00000035d6cea9f3 in select () at ../sysdeps/unix/syscall-template.S:82
  5    Thread 0x7f3adb753700 (LWP 6072) 0x00000035d6cea9f3 in select () at ../sysdeps/unix/syscall-template.S:82
  4    Thread 0x7f3ada751700 (LWP 6074) 0x00000035d6cea9f3 in select () at ../sysdeps/unix/syscall-template.S:82
  3    Thread 0x7f3ae30e8800 (LWP 6063) pthread_rwlock_wrlock ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:86
  2    Thread 0x7f3adaf52700 (LWP 6073) 0x00000035d6cea9f3 in select () at ../sysdeps/unix/syscall-template.S:82
* 1    Thread 0x7f3ad972e700 (LWP 6192) __pthread_mutex_lock (mutex=0x6e6f69) at pthread_mutex_lock.c:51
(gdb) thread 9
[Switching to thread 9 (Thread 0x7f3ad974f700 (LWP 6191))]
#0  pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:86
86      subq    $WRITERS_WAKEUP, %rdi
(gdb) where
#0  pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:86
#1  0x00007f3ae31dd9b1 in slapi_rwlock_wrlock (rwlock=0x134d3c0) at ldap/servers/slapd/slapi2nspr.c:267
#2  0x00007f3ae31990a4 in slapd_log_error_proc_internal (subsystem=0x7f3adf48d740 "NSMMReplicationPlugin - changelog program", 
    fmt=0x7f3adf4899a2 "_cl5TrimMain: exiting\n", ap_err=0x7f3ad974eca0, ap_file=0x7f3ad974ec88) at ldap/servers/slapd/log.c:1806
#3  0x00007f3ae3199849 in slapi_log_error (severity=12, subsystem=0x7f3adf48d740 "NSMMReplicationPlugin - changelog program", 
    fmt=0x7f3adf4899a2 "_cl5TrimMain: exiting\n") at ldap/servers/slapd/log.c:2013
#4  0x00007f3adf42e11f in _cl5TrimMain (param=0x0) at ldap/servers/plugins/replication/cl5_api.c:3436
#5  0x000000370c628cf3 in ?? () from /lib64/libnspr4.so
#6  0x00000035d7007d14 in start_thread (arg=0x7f3ad974f700) at pthread_create.c:309
#7  0x00000035d6cf168d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f3ae30e8800 (LWP 6063))]
#0  pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:86
86      subq    $WRITERS_WAKEUP, %rdi
(gdb) where
#0  pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:86
#1  0x00007f3ae31dd9b1 in slapi_rwlock_wrlock (rwlock=0x134d3c0) at ldap/servers/slapd/slapi2nspr.c:267
#2  0x00007f3ae31990a4 in slapd_log_error_proc_internal (subsystem=0x7f3adf48d740 "NSMMReplicationPlugin - changelog program", 
    fmt=0x7f3adf4896d8 "_cl5Close: waiting for threads to exit: %d thread(s) still active\n", ap_err=0x7fffb4c87b80, 
    ap_file=0x7fffb4c87b68) at ldap/servers/slapd/log.c:1806
#3  0x00007f3ae3199849 in slapi_log_error (severity=12, subsystem=0x7f3adf48d740 "NSMMReplicationPlugin - changelog program", 
    fmt=0x7f3adf4896d8 "_cl5Close: waiting for threads to exit: %d thread(s) still active\n") at ldap/servers/slapd/log.c:2013
#4  0x00007f3adf42da4d in _cl5Close () at ldap/servers/plugins/replication/cl5_api.c:3204
#5  0x00007f3adf4294d5 in cl5Close () at ldap/servers/plugins/replication/cl5_api.c:585
#6  0x00007f3adf437b92 in changelog5_cleanup () at ldap/servers/plugins/replication/cl5_init.c:111
#7  0x00007f3adf44f6f6 in multimaster_stop (pb=0x7fffb4c87d60) at ldap/servers/plugins/replication/repl5_init.c:755
#8  0x00007f3ae31ba631 in plugin_call_func (list=0x1484350, operation=210, pb=0x7fffb4c87d60, call_one=1)
    at ldap/servers/slapd/plugin.c:1453
#9  0x00007f3ae31ba51e in plugin_call_one (list=0x1484350, operation=210, pb=0x7fffb4c87d60) at ldap/servers/slapd/plugin.c:1421
#10 0x00007f3ae31ba431 in plugin_dependency_closeall () at ldap/servers/slapd/plugin.c:1365
#11 0x00007f3ae31ba4c4 in plugin_closeall (close_backends=1, close_globals=1) at ldap/servers/slapd/plugin.c:1408
#12 0x000000000041988d in slapd_daemon (ports=0x7fffb4c882a0) at ldap/servers/slapd/daemon.c:1358
#13 0x0000000000420dab in main (argc=7, argv=0x7fffb4c88428) at ldap/servers/slapd/main.c:1266




- This crash is not reproducible on demand.

- This problem is looking similar as https://fedorahosted.org/389/ticket/618 (integrated in main March 22nd)
  Possibly the fix for ticket618 was not complete

Here are the next steps

- Evaluate if it exist case where the protocol is freed before all replication
  agreements are stopped.

Here is the current status

Here are the next steps

  • This bug fix should change ​in prot_start(repl5_protocol.c) the thread create flag from PR_UNJOINABLE_THREAD to PR_JOINABLE_THREAD
Here is the current status * testing the fix I hit a hang at shutdown {{{ Thread 2 (Thread 0x7f1af3fff700 (LWP 28542)): #0 0x00000035d6ce8bdf in __GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87 #1 0x000000370fa2b8ad in ldap_int_select (ld=ld@entry=0x7f1ae8000e30, timeout=timeout@entry=0x0) at os-ip.c:1128 #2 0x000000370fa15d35 in wait4msg (result=<optimized out>, timeout=<optimized out>, all=1, msgid=1, ld=0x7f1ae8000e30) at result.c:312 #3 ldap_result (ld=0x7f1ae8000e30, msgid=1, all=1, timeout=<optimized out>, result=<optimized out>) at result.c:117 #4 0x00007f1b3129bc21 in slapi_ldap_bind (ld=0x7f1ae8000e30, bindid=0x7f1ae8009570 "cn=replication manager,o=fr", creds=0x7f1ae8000d70 "secret12", mech=0x0, serverctrls=0x0, returnedctrls=0x7f1af3ffead8, timeout=0x0, msgidp=0x0) at ldap/servers/slapd/ldaputil.c:1129 #5 0x00007f1b2d5782c2 in bind_and_check_pwp (conn=0x7f1aec006280, binddn=0x7f1ae8009570 "cn=replication manager,o=fr", password=0x7f1ae8000d70 "secret12") at ldap/servers/plugins/replication/repl5_connection.c:1784 #6 0x00007f1b2d576f5b in conn_connect (conn=0x7f1aec006280) at ldap/servers/plugins/replication/repl5_connection.c:1150 #7 0x00007f1b2d581cb0 in acquire_replica (prp=0x7f1aec0063e0, prot_oid=0x7f1b2d5bf7d9 "2.16.840.1.113730.3.6.1", ruv=0x7f1af3ffecf8) at ldap/servers/plugins/replication/repl5_protocol_util.c:169 #8 0x00007f1b2d579883 in repl5_inc_run (prp=0x7f1aec0063e0) at ldap/servers/plugins/replication/repl5_inc_protocol.c:789 #9 0x00007f1b2d581248 in prot_thread_main (arg=0x7f1aec0068e0) at ldap/servers/plugins/replication/repl5_protocol.c:296 #10 0x000000370c628cf3 in ?? () from /usr/lib64/libnspr4.so #11 0x00000035d7007d14 in start_thread (arg=0x7f1af3fff700) at pthread_create.c:309 #12 0x00000035d6cf168d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 Thread 1 (Thread 0x7f1b31218800 (LWP 27453)): #0 0x00000035d7008e60 in pthread_join (threadid=139753739515648, thread_return=0x7fff742cb328) at pthread_join.c:93 #1 0x000000370c62934e in PR_JoinThread () from /usr/lib64/libnspr4.so #2 0x00007f1b2d5815d9 in prot_stop (rp=0x7f1aec0068e0) at ldap/servers/plugins/replication/repl5_protocol.c:427 #3 0x00007f1b2d56ff91 in agmt_stop (ra=0x7f1aec005170) at ldap/servers/plugins/replication/repl5_agmt.c:734 #4 0x00007f1b2d574df7 in agmtlist_shutdown () at ldap/servers/plugins/replication/repl5_agmtlist.c:687 #5 0x00007f1b2d57e043 in multimaster_stop (pb=0x7fff742cb470) at ldap/servers/plugins/replication/repl5_init.c:755 #6 0x00007f1b312cec79 in plugin_call_func (list=0x16b5cc0, operation=210, pb=0x7fff742cb470, call_one=1) at ldap/servers/slapd/plugin.c:1453 #7 0x00007f1b312ceb66 in plugin_call_one (list=0x16b5cc0, operation=210, pb=0x7fff742cb470) at ldap/servers/slapd/plugin.c:1421 #8 0x00007f1b312cea72 in plugin_dependency_closeall () at ldap/servers/slapd/plugin.c:1365 #9 0x00007f1b312ceb0c in plugin_closeall (close_backends=1, close_globals=1) at ldap/servers/slapd/plugin.c:1408 #10 0x000000000041a141 in slapd_daemon (ports=0x7fff742cb9c0) at ldap/servers/slapd/daemon.c:1358 #11 0x0000000000421669 in main (argc=7, argv=0x7fff742cbb48) at ldap/servers/slapd/main.c:1267 }}} Here are the next steps - Evaluate how to prevent this hang

What is the timeout - conn->timeout.tv_sec = agmt_get_timeout(conn->agmt) - ?
Maybe we need a much shorter timeout by default?

'''Here is the current status''' - bind_and_check_pwp is calling slapi_ldap_bind without timeout {{{ (gdb) where #0 slapi_ldap_bind (ld=0x7fc24000f4e0, bindid=0x7fc240000ef0 "uid=rmanager,cn=config", creds=0x7fc240000d50 "Secret123", mech=0x0, serverctrls=0x0, returnedctrls=0x7fc29dfeaad8, timeout=0x0, msgidp=0x0) at ldap/servers/slapd/ldaputil.c:1090 #1 0x00007fc2c0675bbe in bind_and_check_pwp (conn=0x7fc27c002700, binddn=0x7fc240000ef0 "uid=rmanager,cn=config", password=0x7fc240000d50 "Secret123") at ldap/servers/plugins/replication/repl5_connection.c:1784 #2 0x00007fc2c0674857 in conn_connect (conn=0x7fc27c002700) at ldap/servers/plugins/replication/repl5_connection.c:1150 #3 0x00007fc2c067f308 in acquire_replica (prp=0x7fc27c002860, prot_oid=0x7fc2c06bc7e9 "2.16.840.1.113730.3.6.1", ruv=0x7fc29dfeacf8) at ldap/servers/plugins/replication/repl5_protocol_util.c:169 #4 0x00007fc2c067756a in repl5_inc_run (prp=0x7fc27c002860) at ldap/servers/plugins/replication/repl5_inc_protocol.c:883 #5 0x00007fc2c067e8d8 in prot_thread_main (arg=0x7fc27c004d40) at ldap/servers/plugins/replication/repl5_protocol.c:295 #6 0x000000370c628cf3 in ?? () from /lib64/libnspr4.so #7 0x00000035d7007d14 in start_thread (arg=0x7fc29dfeb700) at pthread_create.c:309 #8 0x00000035d6cf168d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115 (gdb) print timeout $3 = (struct timeval *) 0x0 }}} - This explains why the RA thread was waiting indefinitely an answer from its per server - This issue is independant from RA being JOINED or not. '''Here are the next steps''' - RA code should call slapi_ldap_bind with a timeout (RA prp->timeout ?) and if it does not, slapi_ldap_bind should implemented an hardcoded one.

'''Here is the current status'''

  • The fix is sent for review
    The fix makes the RA thread joinable at creation. It also fix the risk of unlimited bind timeout of the RA toward its server peer. It uses the default protocol timeout (10m).

'''Here are the next steps'''

  • Wait for review

git merge ticket47325
Updating 87f5ef5..1cbd6d8
Fast-forward
ldap/servers/plugins/replication/repl5_connection.c | 2 +-
ldap/servers/plugins/replication/repl5_protocol.c | 4 ++--
ldap/servers/slapd/ldaputil.c | 18 +++++++++++++++---
3 files changed, 18 insertions(+), 6 deletions(-)

git push origin master
Counting objects: 19, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (10/10), 1.42 KiB, done.
Total 10 (delta 8), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
87f5ef5..1cbd6d8 master -> master

commit 1cbd6d8
Author: Thierry bordaz (tbordaz) tbordaz@redhat.com
Date: Wed Apr 10 18:45:17 2013 +0200

push into 389-ds-base-1.3.1

git push origin 389-ds-base-1.3.1
Counting objects: 25, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (12/12), done.
Writing objects: 100% (12/12), 2.65 KiB, done.
Total 12 (delta 8), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
78f029f..3e8c2a4 389-ds-base-1.3.1 -> 389-ds-base-1.3.1

commit 3e8c2a4
Author: Thierry bordaz (tbordaz) tbordaz@redhat.com
Date: Wed Apr 10 18:45:17 2013 +0200

Bug description: commit 1cbd6d8
for Ticket 47325 - Crash at shutdown on a replica agreement
introduced '13164: Logically dead code'.

Fix description: Since bind_timeout never be NULL, no need to
do NULL check. Eliminated the dead code.

Pushed to master: commit 324eb76
Pushed to 389-ds-base-1.3.1: 32d7742

Metadata Update from @tbordaz:
- Issue assigned to tbordaz
- Issue set to the milestone: 1.3.1

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/662

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

3 years ago

Login to comment on this ticket.

Metadata