Packages: freeipa-server-4.1.3-2.fc21.x86_64 389-ds-base-1.3.3.8-1.fc21.x86_64 389-ds-base-libs-1.3.3.8-1.fc21.x86_64
The deadlock occurs between two threads taking locks in opposite order. A thread in betxn_postop (memberof update) holds a DB page (especially one in changelog/changenumber.db) in write mode and tries to acquire the retroCL lock (retrocl_internal_lock). Another thread in the retrocl postop holds the retroCL lock and tries to acquire the same DB page.
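The lock-order inversion described above can be reproduced in miniature. The following is only an illustrative sketch, not the server's code: the two `threading.Lock` objects merely stand in for the changenumber.db page lock and `retrocl_internal_lock`, the function and thread names are hypothetical, and timeouts are used so the demonstration reports the mutual wait instead of hanging.

```python
import threading


def simulate_lock_inversion(timeout=0.2):
    """Model the two threads from the description with plain mutexes."""
    db_page = threading.Lock()        # stand-in for the changenumber.db page lock
    retrocl_lock = threading.Lock()   # stand-in for retrocl_internal_lock
    both_hold_first = threading.Barrier(2)
    both_tried_second = threading.Barrier(2)
    got_second = {}

    def worker(name, first, second):
        with first:
            # Force the bad interleaving: each thread holds its first lock
            # before either one asks for its second lock.
            both_hold_first.wait()
            ok = second.acquire(timeout=timeout)
            got_second[name] = ok
            if ok:
                second.release()
            # Keep the first lock held until both attempts have finished.
            both_tried_second.wait()

    threads = [
        threading.Thread(target=worker,
                         args=("memberof_betxn_postop", db_page, retrocl_lock)),
        threading.Thread(target=worker,
                         args=("retrocl_postop", retrocl_lock, db_page)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return got_second


if __name__ == "__main__":
    print(simulate_lock_inversion())
```

Neither thread obtains its second lock, which is the state the attached stack traces show. Without the timeouts both threads would block forever.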
No identified test case
attachment pstack.deadlock.gz
attachment db_stat.deadlock.gz
So thread 13 is searching the retro CL using ldbm_back_seq and it's waiting for a lock on the changenumber index, but how can thread 14 already hold the lock on the changenumber index when it's just entering the retro changelog code?
MemberOf Plugin should not be doing any updates on the cn=changelog backend anyway. Since there is no debuginfo package installed, there is no way to tell what is really going on. There is not enough information to continue investigating this particular retrocl/memberof deadlock. Closing ticket until more information is provided, or a reproducible testcase is available.
Instead, I will focus on the other retrocl/memberof deadlock ticket: https://fedorahosted.org/389/ticket/47931
Replying to [comment:4 mreynolds]:
> So thread 13 is searching the retro cl using ldbm_back_seq and its waiting for a lock on the changenumber index, but how can thread 14 already hold the lock on the changenumber index when it's just entering the retro changelog code?

The memberof plugin could update multiple entries, and all updates are done under the transaction of the parent do_modify. So thread 14 could take the cl lock, acquire the index page lock, release the cl lock (but the index page stays locked because the txn is still active), then try to get the cl lock again. In between, thread 13 could have acquired the cl lock and be waiting for the index page lock.

> MemberOf Plugin should not be doing any updates on the cn=changelog backend anyway. Since there is no debuginfo package installed there is no way to tell what is really going on.

memberof updates an entry with the memberof attribute; this is a valid mod and will be written to the normal and retro cl.

> There is not enough information to continune investigating this particular retrocl/memberof deadlock. Closing ticket until more information is provided, or a reprodcible testcase is available.

Maybe a testcase could be:
- have two suffixes
- add a large number of users
- add one group
- add all users to this group in one mod
- in parallel do many small updates to the other suffix
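The input data for such a testcase could be generated along these lines. This is a rough sketch under stated assumptions: the suffixes `dc=example,dc=com` and `dc=other,dc=com`, the `ou=people` and `ou=groups` containers, the entry names, and the object classes are all hypothetical placeholders, and the containers and the probe entry are assumed to exist already. The script only builds LDIF text; loading it (for example with an LDAP command-line client) and running the two update streams in parallel is left to the tester.

```python
def build_testcase_ldif(n_users, suffix="dc=example,dc=com"):
    """Return (add_ldif, modify_ldif) for n_users users and one group."""
    user_dns = [f"uid=user{i},ou=people,{suffix}" for i in range(n_users)]
    group_dn = f"cn=biggroup,ou=groups,{suffix}"

    # Entries to add first: all users, then one empty group.
    add_lines = []
    for i, dn in enumerate(user_dns):
        add_lines += [
            f"dn: {dn}",
            "objectClass: inetOrgPerson",
            f"uid: user{i}",
            f"cn: user{i}",
            f"sn: user{i}",
            "",
        ]
    add_lines += [f"dn: {group_dn}", "objectClass: groupOfNames",
                  "cn: biggroup", ""]

    # One single modify that adds every user to the group, so memberof
    # has to update many entries inside one parent transaction.
    mod_lines = [f"dn: {group_dn}", "changetype: modify", "add: member"]
    mod_lines += [f"member: {dn}" for dn in user_dns]
    mod_lines += ["-", ""]
    return "\n".join(add_lines), "\n".join(mod_lines)


def build_small_mods(n_mods, target_dn="cn=probe,dc=other,dc=com"):
    """Return many small modify records against an entry in the other suffix."""
    records = []
    for i in range(n_mods):
        records += [f"dn: {target_dn}", "changetype: modify",
                    "replace: description", f"description: update {i}",
                    "-", ""]
    return "\n".join(records)


if __name__ == "__main__":
    adds, big_mod = build_testcase_ldif(1000)
    print(big_mod.count("member: "), "members in one modify")
```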
Replying to [comment:5 lkrispen]:
> Replying to [comment:4 mreynolds]:
> > So thread 13 is searching the retro cl using ldbm_back_seq and its waiting for a lock on the changenumber index, but how can thread 14 already hold the lock on the changenumber index when it's just entering the retro changelog code?
>
> The memberof plugin could update multiple entries, and all updates are done under the transaction of the parent do_modify. So thread 14 could take the cl lock, acquire the index page lock, release the cl lock (but the index page stays locked because the txn is still active), then try to get the cl lock again.
So bdb needs to use monitor locks (reentrant locks) instead of exclusive locks :-) Thanks for explaining the behavior. Retrocl needs to serialize its operations because of its internal changenumber count. So it's going to be tricky to work around this and still keep the proper changenumber count when failures occur.
> In between, thread 13 could have acquired the cl lock and be waiting for the index page lock.
>
> > MemberOf Plugin should not be doing any updates on the cn=changelog backend anyway. Since there is no debuginfo package installed there is no way to tell what is really going on.
>
> memberof updates an entry with the memberof attribute; this is a valid mod and will be written to the normal and retro cl.
>
> > There is not enough information to continune investigating this particular retrocl/memberof deadlock. Closing ticket until more information is provided, or a reprodcible testcase is available.
>
> Maybe a testcase could be:
> - have two suffixes
> - add a large number of users
> - add one group
> - add all users to this group in one mod
> - in parallel do many small updates to the other suffix
I'll give this a shot.
> So bdb needs to use monitor locks(reentrant locks) instead of exclusive locks :-)

They are. Thread 14 is waiting for a PR lock; it can reuse its bdb locks, but thread 13 can't.
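The point about reentrancy can be illustrated by analogy. In this sketch a Python `threading.RLock` plays the role of a bdb page lock; this is an assumption made purely for illustration, since real bdb locks belong to a transaction (locker) rather than to a Python thread. The owner can take the lock again, while a second thread cannot, which is why reentrancy alone does not help thread 13. It also does not help thread 14, because according to the comment above that thread is blocked on the PR lock, not on a bdb lock.

```python
import threading


def reentrancy_demo(timeout=0.1):
    """Show that a reentrant lock helps only its current owner."""
    page_lock = threading.RLock()   # stand-in for a bdb page lock
    results = {}

    page_lock.acquire()             # the owner takes the lock once
    # Re-acquiring from the owning thread succeeds immediately.
    results["owner_reacquire"] = bool(page_lock.acquire(timeout=timeout))

    def other_thread():
        # A different thread has to wait and, here, times out.
        ok = page_lock.acquire(timeout=timeout)
        results["other_thread"] = bool(ok)
        if ok:
            page_lock.release()

    t = threading.Thread(target=other_thread)
    t.start()
    t.join()

    # Release once per successful acquire by the owner.
    page_lock.release()
    page_lock.release()
    return results


if __name__ == "__main__":
    print(reentrancy_demo())
```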
I was able to easily reproduce the deadlock with the exact same stacktraces using Ludwig's proposed testcase:
Adding scoping to the retrocl resolved the deadlock.
Closing this ticket as a duplicate of https://fedorahosted.org/389/ticket/47931
attachment hang.db_stat_CA.gz
attachment hang.stack.gz
The two attached files relate to the latest hang: https://fedorahosted.org/389/ticket/48181#comment:10
Part of the fix for #47931 was to allow configuring a scope for the retro changelog. The latest deadlock involves the domain backend, the ipaca backend, and the retrocl backend.
I think excluding ipaca from the retrocl scope could prevent the deadlock.
Limiting the scope of the Retro CL was suggested in an IPA ticket: https://fedorahosted.org/freeipa/ticket/5538#comment:6
and in the follow-up discussion on freeipa-devel there were no objections.
Ludwig,
You are right, thread 24 was an update on 'o=ipaca':
{{{
(gdb) thread 24
[Switching to thread 24 (Thread 0x7fcadeff5700 (LWP 14837))]
#22 0x00007fcafb5e325b in ldbm_back_modify (pb=0x7fcadeff4b00) at ldap/servers/slapd/back-ldbm/ldbm_modify.c:821
821 if ((retval = plugin_call_plugins(pb, SLAPI_PLUGIN_BE_TXN_POST_MODIFY_FN))) {
(gdb) frame 22
#22 0x00007fcafb5e325b in ldbm_back_modify (pb=0x7fcadeff4b00) at ldap/servers/slapd/back-ldbm/ldbm_modify.c:821
821 if ((retval = plugin_call_plugins(pb, SLAPI_PLUGIN_BE_TXN_POST_MODIFY_FN))) {
(gdb) print pb->pb_op->o_params.target_address.udn
$16 = 0x7fcaa001b840 "cn=5,ou=kra,ou=requests,o=kra,o=ipaca"
}}}
And this suffix is not excluded from retrocl
{{{
dn: cn=Retro Changelog Plugin,cn=plugins,cn=config
cn: Retro Changelog Plugin
modifiersName: cn=Directory Manager
modifyTimestamp: 20160114005905Z
nsslapd-attribute: nsuniqueid:targetUniqueId
nsslapd-changelogmaxage: 2d
nsslapd-plugin-depends-on-named: Class of Service
nsslapd-plugin-depends-on-type: database
nsslapd-pluginDescription: Retrocl Plugin
nsslapd-pluginEnabled: on
nsslapd-pluginId: retrocl
nsslapd-pluginInitfunc: retrocl_plugin_init
nsslapd-pluginPath: libretrocl-plugin
nsslapd-pluginType: object
nsslapd-pluginVendor: 389 Project
nsslapd-pluginVersion: 1.3.4.6
nsslapd-pluginbetxn: on
nsslapd-pluginprecedence: 25
objectClass: top
objectClass: nsSlapdPlugin
objectClass: extensibleObject
}}}
The deadlock scenario is common when multiple backends log their changes into the RetroCL backend (http://www.port389.org/docs/389ds/design/exclude-backends-from-plugin-operations.html).
Following https://fedorahosted.org/389/ticket/48181#comment:15, the freeipa configuration now excludes 'o=ipaca' from the scope of RetroCL.
I am closing this ticket again as won't fix.
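How a scope restriction keeps 'o=ipaca' updates out of the RetroCL can be sketched as a simple suffix check. This is a hypothetical model, not the plugin's implementation: the function names are invented, the DN handling is deliberately naive (no escaped commas), and the rule that an exclude match wins over an include match is an assumption. The real attribute names and semantics are described in the design page linked above.

```python
def normalize_dn(dn):
    """Naive DN normalization: lowercase and strip spaces around RDNs."""
    return ",".join(part.strip().lower() for part in dn.split(","))


def is_under(dn, suffix):
    """True if dn equals suffix or lies below it in the tree."""
    dn, suffix = normalize_dn(dn), normalize_dn(suffix)
    return dn == suffix or dn.endswith("," + suffix)


def in_retrocl_scope(dn, include=(), exclude=()):
    """Decide whether a change to dn would be logged (assumed semantics)."""
    # Assumption: an excluded suffix always takes precedence.
    if any(is_under(dn, s) for s in exclude):
        return False
    # An empty include list means every non-excluded suffix is in scope.
    return not include or any(is_under(dn, s) for s in include)


if __name__ == "__main__":
    target = "cn=5,ou=kra,ou=requests,o=kra,o=ipaca"
    print(in_retrocl_scope(target, exclude=["o=ipaca"]))
```

With 'o=ipaca' excluded, the update seen in thread 24 would not reach the RetroCL code path at all, so it could never contend for the RetroCL lock.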
Metadata Update from @lkrispen: - Issue assigned to mreynolds - Issue set to the milestone: 1.3.4.4
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/1512
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Invalid)