Issue #48181: Deadlock in retroCL - 389-ds-base

First analyze {{{ This is looking like a deadlock between thread 14 and thread 13 where both threads acquired lock in the opposite order. Thread 14 is processing a betxn_postop but thread 13 is doing a postop. I wonder if the problem is that the test/set of the changenumber is done outside of the backend lock. Thread 14 holds some db page locks and tries to acquire retrocl plugin lock (retrocl_internal_lock) 800000f6 dd= 6 locks held 108 write locks 57 pid/thread 1863/140490418628352 flags 0 priority 100 800000f6 WRITE 1 HELD userRoot/id2entry.db page 2495 800000f6 READ 1 HELD userRoot/id2entry.db page 2495 800000f6 WRITE 3 HELD changelog/objectclass.db page 1 800000f6 WRITE 1 HELD changelog/changenumber.db page 68 Thread 14 (Thread 0x7fc6797f2700 (LWP 1930)): #0 0x00007fc6b31eff1d in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007fc6b31ea9be in pthread_mutex_lock () from /lib64/libpthread.so.0 #2 0x00007fc6b38420b9 in PR_Lock () from /lib64/libnspr4.so #3 0x00007fc6a7ec99b0 in retrocl_postob () from /usr/lib64/dirsrv/plugins/libretrocl-plugin.so #4 0x00007fc6b546156f in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0 #5 0x00007fc6b5461893 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0 #6 0x00007fc6a965d231 in ldbm_back_modify () from /usr/lib64/dirsrv/plugins/libback-ldbm.so #7 0x00007fc6b544e905 in op_shared_modify () from /usr/lib64/dirsrv/libslapd.so.0 #8 0x00007fc6b544f387 in modify_internal_pb () from /usr/lib64/dirsrv/libslapd.so.0 #9 0x00007fc6a8d354a2 in memberof_fix_memberof_callback () from /usr/lib64/dirsrv/plugins/libmemberof-plugin.so #10 0x00007fc6a8d35f65 in memberof_modop_one_replace_r () from /usr/lib64/dirsrv/plugins/libmemberof-plugin.so #11 0x00007fc6a8d362a6 in memberof_mod_smod_list () from /usr/lib64/dirsrv/plugins/libmemberof-plugin.so #12 0x00007fc6a8d3758d in memberof_postop_modify () from /usr/lib64/dirsrv/plugins/libmemberof-plugin.so #13 0x00007fc6b546156f in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0 #14 0x00007fc6b5461893 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0 #15 0x00007fc6a965d231 in ldbm_back_modify () from /usr/lib64/dirsrv/plugins/libback-ldbm.so #16 0x00007fc6b544e905 in op_shared_modify () from /usr/lib64/dirsrv/libslapd.so.0 #17 0x00007fc6b544fbe7 in do_modify () from /usr/lib64/dirsrv/libslapd.so.0 #18 0x00007fc6b5932925 in connection_threadmain () #19 0x00007fc6b3847cab in _pt_root () from /lib64/libnspr4.so #20 0x00007fc6b31e852a in start_thread () from /lib64/libpthread.so.0 #21 0x00007fc6b2f2422d in clone () from /lib64/libc.so.6 While thread 13 which hold the retrocl plugin lock, tries to acquire a page lock (held by thread 14) Thread 13 (Thread 0x7fc678ff1700 (LWP 1931)): 65 READ 1 WAIT changelog/changenumber.db page 68 Thread 13 (Thread 0x7fc678ff1700 (LWP 1931)): #0 0x00007fc6b31ed590 inpthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fc6adc811fb in __db_hybrid_mutex_suspend () from /lib64/libdb-5.3.so #2 0x00007fc6adc8059b in __db_tas_mutex_lock () from /lib64/libdb-5.3.so #3 0x00007fc6add2aa91 in __lock_get_internal () from /lib64/libdb-5.3.so #4 0x00007fc6add2b657 in __lock_get () from /lib64/libdb-5.3.so #5 0x00007fc6add576af in __db_lget () from /lib64/libdb-5.3.so #6 0x00007fc6adc9e4f0 in __bam_search () from /lib64/libdb-5.3.so #7 0x00007fc6adc89144 in __bamc_search () from /lib64/libdb-5.3.so #8 0x00007fc6adc8b074 in __bamc_get () from /lib64/libdb-5.3.so #9 0x00007fc6add43ba3 in __dbc_iget () from /lib64/libdb-5.3.so #10 0x00007fc6add53092 in __dbc_get_pp () from /lib64/libdb-5.3.so #11 0x00007fc6a9672b70 in ldbm_back_seq () from /usr/lib64/dirsrv/plugins/libback-ldbm.so #12 0x00007fc6b54650f9 in seq_internal_callback_pb () from /usr/lib64/dirsrv/libslapd.so.0 #13 0x00007fc6b54652a9 in slapi_seq_callback () from /usr/lib64/dirsrv/libslapd.so.0 #14 0x00007fc6a7ec8919 in retrocl_update_lastchangenumber () from /usr/lib64/dirsrv/plugins/libretrocl-plugin.so #15 0x00007fc6a7ec89bb in retrocl_assign_changenumber () from /usr/lib64/dirsrv/plugins/libretrocl-plugin.so #16 0x00007fc6a7ec99b5 in retrocl_postob () from /usr/lib64/dirsrv/plugins/libretrocl-plugin.so #17 0x00007fc6b546156f in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0 #18 0x00007fc6b5461893 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0 #19 0x00007fc6a965d231 in ldbm_back_modify () from /usr/lib64/dirsrv/plugins/libback-ldbm.so #20 0x00007fc6b544e905 in op_shared_modify () from /usr/lib64/dirsrv/libslapd.so.0 #21 0x00007fc6b544fbe7 in do_modify () from /usr/lib64/dirsrv/libslapd.so.0 #22 0x00007fc6b5932925 in connection_threadmain () #23 0x00007fc6b3847cab in _pt_root () from /lib64/libnspr4.so #24 0x00007fc6b31e852a in start_thread () from /lib64/libpthread.so.0 #25 0x00007fc6b2f2422d in clone () from /lib64/libc.so.6 Many others threads are waiting for others pages held by thread 14: Thread 4 (Thread 0x7fc6747e8700 (LWP 1940)): Thread 7 (Thread 0x7fc675feb700 (LWP 1937)): Thread 30 (Thread 0x7fc691922700 (LWP 1914)): Thread 28 (Thread 0x7fc690920700 (LWP 1916)): Thread 29 (Thread 0x7fc691121700 (LWP 1915)): 49 READ 1 WAIT changelog/objectclass.db page 1 Thread 13 (Thread 0x7fc678ff1700 (LWP 1931)): 65 READ 1 WAIT changelog/changenumber.db page 68 Thread 26 (Thread 0x7fc67f7fe700 (LWP 1918)): 2 READ 1 WAIT userRoot/id2entry.db page 2495 }}}

tbordaz commented 8 years ago

attachment
pstack.deadlock.gz

tbordaz commented 8 years ago

attachment
db_stat.deadlock.gz

mreynolds commented 8 years ago

So thread 13 is searching the retro cl using ldbm_back_seq and its waiting for a lock on the changenumber index, but how can thread 14 already hold the lock on the changenumber index when it's just entering the retro changelog code?

MemberOf Plugin should not be doing any updates on the cn=changelog backend anyway. Since there is no debuginfo package installed there is no way to tell what is really going on.
There is not enough information to continune investigating this particular retrocl/memberof deadlock. Closing ticket until more information is provided, or a reprodcible testcase is available.

Instead, I will focus on the other retrocl/memberof deadlock ticket: https://fedorahosted.org/389/ticket/47931

lkrispen commented 8 years ago

Replying to [comment:4 mreynolds]:

So thread 13 is searching the retro cl using ldbm_back_seq and its waiting for a lock on the changenumber index, but how can thread 14 already hold the lock on the changenumber index when it's just entering the retro changelog code?
the memeberof plugin couldupdate multiple entries, all updates are done under the transaction of the parent do_modify
so thread 14 could take the cl lock, acquire the index page lock, release the cl lock (but the index keeps lockoed becuse the txn is active), the tries to get the cl lock again.
In between thread 13 could have acquired the cl lock and waits for the index page lock

MemberOf Plugin should not be doing any updates on the cn=changelog backend anyway. Since there is no debuginfo package installed there is no way to tell what is really going on.

memberof updates an e entry with the memberof attribute, this is a valid mod and will be written to the normal and retro cl

There is not enough information to continune investigating this particular retrocl/memberof deadlock. Closing ticket until more information is provided, or a reprodcible testcase is available.

MAybe a testcase could be:
- have two suffixes
- add a large number of users
- add one group
- add all users to this group in one mod
- in parallel do many small updates to the other suffix

Instead, I will focus on the other retrocl/memberof deadlock ticket: https://fedorahosted.org/389/ticket/47931

mreynolds commented 8 years ago

Replying to [comment:5 lkrispen]:

Replying to [comment:4 mreynolds]:

So thread 13 is searching the retro cl using ldbm_back_seq and its waiting for a lock on the changenumber index, but how can thread 14 already hold the lock on the changenumber index when it's just entering the retro changelog code?
the memeberof plugin couldupdate multiple entries, all updates are done under the transaction of the parent do_modify
so thread 14 could take the cl lock, acquire the index page lock, release the cl lock (but the index keeps lockoed becuse the txn is active), the tries to get the cl lock again.

So bdb needs to use monitor locks(reentrant locks) instead of exclusive locks :-) Thanks for explaining the behavior. Retrocl needs to serialize its operations becuase of its internal changenumber count. So it's going to be tricky to work around this and still keep the proper changenumber count when failures occur.

In between thread 13 could have acquired the cl lock and waits for the index page lock

MemberOf Plugin should not be doing any updates on the cn=changelog backend anyway. Since there is no debuginfo package installed there is no way to tell what is really going on.

memberof updates an e entry with the memberof attribute, this is a valid mod and will be written to the normal and retro cl

There is not enough information to continune investigating this particular retrocl/memberof deadlock. Closing ticket until more information is provided, or a reprodcible testcase is available.

MAybe a testcase could be:
- have two suffixes
- add a large number of users
- add one group
- add all users to this group in one mod
- in parallel do many small updates to the other suffix

I'll give this shot.

Instead, I will focus on the other retrocl/memberof deadlock ticket: https://fedorahosted.org/389/ticket/47931

lkrispen commented 8 years ago

So bdb needs to use monitor locks(reentrant locks) instead of exclusive locks :-)

they are, thread 14 is waiting for a PR lock, it can reuse the bdb locks, but thread 13 can't

mreynolds commented 8 years ago

I was able to easily reproduce the deadlock with the exact same stacktraces using ludwig's proposed testcase:

have two suffixes
add a large number of users
add one group
add all users to this group in one mod
in parallel do many small updates to the other suffix

Adding scoping to the retrocl resolved the deadlock.

mreynolds commented 8 years ago

Closing this ticket as a duplicate of

https://fedorahosted.org/389/ticket/47931

tbordaz commented 8 years ago

Reopening this ticket as it may not be a duplicated of https://fedorahosted.org/389/ticket/47931 Testing on 389-ds-base-1.3.4.6-1.fc23.x86_64 where 47931 is fixed http://koji.fedoraproject.org/koji/buildinfo?buildID=710779 A deadlock occurs between two threads taking locks in the opposite order. Thread 37 waiting retrocl_internal_lock but holding retroCL page 1 {{{ Thread 37 (Thread 0x7fcaedffb700 (LWP 14824)): #0 0x00007fcb060ca89d in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007fcb060c4ae6 in pthread_mutex_lock () from /lib64/libpthread.so.0 #2 0x00007fcb0671d949 in PR_Lock () from /lib64/libnspr4.so #3 0x00007fcaf9e48619 in retrocl_postob () from /usr/lib64/dirsrv/plugins/libretrocl-plugin.so #4 0x00007fcb08552969 in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0 #5 0x00007fcb08552bf5 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0 #6 0x00007fcafb5c88b0 in ldbm_back_add () from /usr/lib64/dirsrv/plugins/libback-ldbm.so #7 0x00007fcb084f8dfa in op_shared_add () from /usr/lib64/dirsrv/libslapd.so.0 #8 0x00007fcb084f9663 in add_internal_pb () from /usr/lib64/dirsrv/libslapd.so.0 #9 0x00007fcb084fa385 in slapi_add_internal_pb () from /usr/lib64/dirsrv/libslapd.so.0 #10 0x00007fcafdece6c2 in dna_update_shared_config () from /usr/lib64/dirsrv/plugins/libdna-plugin.so #11 0x00007fcafded166d in dna_update_config_event () from /usr/lib64/dirsrv/plugins/libdna-plugin.so #12 0x00007fcb0851ac3a in eq_loop () from /usr/lib64/dirsrv/libslapd.so.0 #13 0x00007fcb067235cb in _pt_root () from /lib64/libnspr4.so #14 0x00007fcb060c260a in start_thread () from /lib64/libpthread.so.0 #15 0x00007fcb05dfca7d in clone () from /lib64/libc.so.6 }}} Thread 24 waiting for RetroCL page 1 while holding retrocl_internal_lock {{{ Thread 24 (Thread 0x7fcadeff5700 (LWP 14837)): #0 0x00007fcb060c7b10 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fcb002890fb in __db_hybrid_mutex_suspend () from /lib64/libdb-5.3.so #2 0x00007fcb0028852f in __db_tas_mutex_lock () from /lib64/libdb-5.3.so #3 0x00007fcb00332c97 in __lock_get_internal () from /lib64/libdb-5.3.so #4 0x00007fcb00333727 in __lock_get () from /lib64/libdb-5.3.so #5 0x00007fcb0035f4c7 in __db_lget () from /lib64/libdb-5.3.so #6 0x00007fcb002a5385 in __bam_get_root () from /lib64/libdb-5.3.so #7 0x00007fcb002a5696 in __bam_search () from /lib64/libdb-5.3.so #8 0x00007fcb00290f0c in __bamc_search () from /lib64/libdb-5.3.so #9 0x00007fcb002932a4 in __bamc_get () from /lib64/libdb-5.3.so #10 0x00007fcb0034bd14 in __dbc_iget () from /lib64/libdb-5.3.so #11 0x00007fcb0035aef2 in __dbc_get_pp () from /lib64/libdb-5.3.so #12 0x00007fcafb5f879c in ldbm_back_seq () from /usr/lib64/dirsrv/plugins/libback-ldbm.so #13 0x00007fcb08557709 in seq_internal_callback_pb () from /usr/lib64/dirsrv/libslapd.so.0 #14 0x00007fcb085578b9 in slapi_seq_callback () from /usr/lib64/dirsrv/libslapd.so.0 #15 0x00007fcaf9e47489 in retrocl_update_lastchangenumber () from /usr/lib64/dirsrv/plugins/libretrocl-plugin.so #16 0x00007fcaf9e4752b in retrocl_assign_changenumber () from /usr/lib64/dirsrv/plugins/libretrocl-plugin.so #17 0x00007fcaf9e4861e in retrocl_postob () from /usr/lib64/dirsrv/plugins/libretrocl-plugin.so #18 0x00007fcb08552969 in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0 #19 0x00007fcb08552bf5 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0 #20 0x00007fcafb5e325b in ldbm_back_modify () from /usr/lib64/dirsrv/plugins/libback-ldbm.so #21 0x00007fcb08540595 in op_shared_modify () from /usr/lib64/dirsrv/libslapd.so.0 #22 0x00007fcb085418a7 in do_modify () from /usr/lib64/dirsrv/libslapd.so.0 #23 0x0000558aec3962ac in connection_threadmain () #24 0x00007fcb067235cb in _pt_root () from /lib64/libnspr4.so #25 0x00007fcb060c260a in start_thread () from /lib64/libpthread.so.0 #26 0x00007fcb05dfca7d in clone () from /lib64/libc.so.6 }}} A problem may be that the Thread 24 is waiting for retroCL page 1 outside of a txn {{{ Thread 24 locks 4 dd=164 locks held 0 write locks 0 pid/thread 14813/140509301397248 flags 0 priority 100 8 dd=160 locks held 0 write locks 0 pid/thread 14813/140509301397248 flags 0 priority 100 62 dd=105 locks held 0 write locks 0 pid/thread 14813/140509301397248 (7FCADEFF5700)flags 0 priority 100 62 READ 1 WAIT changelog/changenumber.db page 1 78 dd=83 locks held 0 write locks 0 pid/thread 14813/140509301397248 flags 0 priority 100 79 dd=82 locks held 0 write locks 0 pid/thread 14813/140509301397248 flags 0 priority 100 7a dd=81 locks held 0 write locks 0 pid/thread 14813/140509301397248 flags 0 priority 100 800000d0 dd=33 locks held 8 write locks 5 pid/thread 14813/140509301397248 (7FCADEFF5700) flags 0 priority 100 800000d0 WRITE 4 HELD ipaca/vlv#kraallpkitomcatindex.db page 1 800000d0 WRITE 3 HELD ipaca/entryusn.db page 1 800000d0 READ 1 HELD ipaca/entryusn.db page 1 800000d0 WRITE 4 HELD ipaca/requesttype.db page 1 800000d0 READ 1 HELD ipaca/requesttype.db page 1 800000d0 WRITE 4 HELD ipaca/requeststate.db page 1 800000d0 READ 1 HELD ipaca/requeststate.db page 1 800000d0 WRITE 1 HELD ipaca/id2entry.db page 124 Thread 27 locks 8000009f dd=46 locks held 59 write locks 25 pid/thread 14813/140509553080064 (7FCAEDFFB700) flags 0 priority 100 8000009f WRITE 1 HELD changelog/changenumber.db page 1 800000fd dd=46 locks held 16 write locks 12 pid/thread 14813/140509553080064 (7FCAEDFFB700) flags 0 priority 100 800000fd WRITE 1 HELD userRoot/objectclass.db page 5 800000fd WRITE 1 HELD userRoot/objectclass.db page 2 800000fd READ 7 HELD userRoot/entryrdn.db page 12 }}}

tbordaz commented 8 years ago

attachment
hang.db_stat_CA.gz

tbordaz commented 8 years ago

attachment
hang.stack.gz

tbordaz commented 8 years ago

The two attached files are related to last hang https://fedorahosted.org/389/ticket/48181#comment:10

lkrispen commented 8 years ago

part of the fix of #47931 was to allow to configure a scope for the retro changelog. The latest deadlock involves the domain backend, the ipaca and retrock backend.

I think, excluding ipaca from the retrocl could prevent the deadlock

lkrispen commented 8 years ago

Limiting the scope of the Retro CL was suggested in an IPA ticket: https://fedorahosted.org/freeipa/ticket/5538#comment:6

and in followup discussion on freeipa-devel there were no objections

tbordaz commented 8 years ago

Ludwig,

You are right, thread 24 was an upate on 'o=ipaca'

{{{
(gdb) thread 24
[Switching to thread 24 (Thread 0x7fcadeff5700 (LWP 14837))]
#22 0x00007fcafb5e325b in ldbm_back_modify (pb=0x7fcadeff4b00) at ldap/servers/slapd/back-ldbm/ldbm_modify.c:821
821 if ((retval = plugin_call_plugins(pb, SLAPI_PLUGIN_BE_TXN_POST_MODIFY_FN))) {
(gdb) frame 22
#22 0x00007fcafb5e325b in ldbm_back_modify (pb=0x7fcadeff4b00) at ldap/servers/slapd/back-ldbm/ldbm_modify.c:821
821 if ((retval = plugin_call_plugins(pb, SLAPI_PLUGIN_BE_TXN_POST_MODIFY_FN))) {
(gdb) print pb->pb_op->o_params.target_address.udn
$16 = 0x7fcaa001b840 "cn=5,ou=kra,ou=requests,o=kra,o=ipaca"

}}}

And this suffix is not excluded from retrocl

{{{
dn: cn=Retro Changelog Plugin,cn=plugins,cn=config
cn: Retro Changelog Plugin
modifiersName: cn=Directory Manager
modifyTimestamp: 20160114005905Z
nsslapd-attribute: nsuniqueid:targetUniqueId
nsslapd-changelogmaxage: 2d
nsslapd-plugin-depends-on-named: Class of Service
nsslapd-plugin-depends-on-type: database
nsslapd-pluginDescription: Retrocl Plugin
nsslapd-pluginEnabled: on
nsslapd-pluginId: retrocl
nsslapd-pluginInitfunc: retrocl_plugin_init
nsslapd-pluginPath: libretrocl-plugin
nsslapd-pluginType: object
nsslapd-pluginVendor: 389 Project
nsslapd-pluginVersion: 1.3.4.6
nsslapd-pluginbetxn: on
nsslapd-pluginprecedence: 25
objectClass: top
objectClass: nsSlapdPlugin
objectClass: extensibleObject

}}}

tbordaz commented 8 years ago

The deadlock scenario is common when multiple backends log their changes into the RetroCL backend (http://www.port389.org/docs/389ds/design/exclude-backends-from-plugin-operations.html).

Following https://fedorahosted.org/389/ticket/48181#comment:15, freeipa configuration is now excluding 'o=ipaca' from the scope of RetroCL.

I am closing again this ticket as will not fix.

spichugi commented 3 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1512

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

389-ds-base

#48181 Deadlock in retroCL

Closed: wontfix None Opened 8 years ago by tbordaz.

Metadata

389-ds-base

Source Code

#48181 Deadlock in retroCL Closed: wontfix None Opened 8 years ago by tbordaz.

Metadata

#48181 Deadlock in retroCL

Closed: wontfix None Opened 8 years ago by tbordaz.