A bind operation tries to update the target entry. To do this it triggers an internal modify that tries to lock the entry into the entry cache.
The entry lock, is already acquired (so the bind operation hang). But the owner thread is processing an other operation and can not own the lock.
This is looking like an entry was locked in entry cache during an operation. The operation completed (successfully or not) but the entry lock was not released
The problem happened when running ipa tests. IPA version is development version (master+user_life_cycle patches, master HEAD is '''b48cfe0''' ''ldap: Remove IPASimpleLDAPObject'')
DS version is a development version (master+1209573. master HEAD is '''d61a073''' ''Tests to support ticket 48026'')
Hang is possibly dynamic. I ran the attached test script (demo_ulc.sh) several times.
attachment demo_ulc.sh
Error logs contains some errors (possibly related to the previous operation done by thread 15):
{{{ [22/Apr/2015:19:14:35 +0200] - Parent cn=groups,cn=accounts,SUFFIX has no children. (op 0x2, repl_op 0x10) [22/Apr/2015:19:14:36 +0200] ldbm_back_delete - conn=335 op=8 parent_update_on_childchange: old_entry=0x7f97b0047f80, new_entry=0x0, rc=-1 [22/Apr/2015:19:19:53 +0200] - Parent cn=groups,cn=accounts,SUFFIX has no children. (op 0x2, repl_op 0x0) [22/Apr/2015:19:19:53 +0200] ldbm_back_delete - conn=0 op=0 parent_update_on_childchange: old_entry=0x7f97882bfe70, new_entry=0x0, rc=-1 [22/Apr/2015:19:19:53 +0200] managed-entries-plugin - mep_del_post_op: failed to delete managed entry (cn=my_manager,cn=groups,cn=accounts,SUFFIX) - error (1) [22/Apr/2015:19:19:53 +0200] - Parent cn=groups,cn=accounts,SUFFIX has no children. (op 0x2, repl_op 0x0) [22/Apr/2015:19:19:53 +0200] ldbm_back_delete - conn=0 op=0 parent_update_on_childchange: old_entry=0x7f97882bfe70, new_entry=0x0, rc=-1 [22/Apr/2015:19:19:53 +0200] managed-entries-plugin - mep_del_post_op: failed to delete managed entry (cn=my_manager,cn=groups,cn=accounts,SUFFIX) - error (1) [22/Apr/2015:19:24:01 +0200] referint-plugin - _update_all_per_mod: entry cn=ipausers,cn=groups,cn=accounts,SUFFIX: deleting "member: uid=my_manager,cn=users,cn=accounts,SUFFIX" failed (16) [22/Apr/2015:19:24:01 +0200] - SLAPI_PLUGIN_BE_TXN_POST_DELETE_FN plugin returned error code but did not set SLAPI_RESULT_CODE [22/Apr/2015:19:24:01 +0200] managed-entries-plugin - mep_del_post_op: failed to delete managed entry (cn=my_manager,cn=groups,cn=accounts,SUFFIX) - error (32) [22/Apr/2015:19:24:17 +0200] DSRetroclPlugin - replog: an error occured while adding change number 1343, dn = changenumber=1343,cn=changelog: Already exists. [22/Apr/2015:19:24:17 +0200] retrocl-plugin - retrocl_postob: operation failure [68] [22/Apr/2015:19:34:14 +0200] DSRetroclPlugin - replog: an error occured while adding change number 1343, dn = changenumber=1343,cn=changelog: Already exists. [22/Apr/2015:19:34:14 +0200] retrocl-plugin - retrocl_postob: operation failure [68] }}}
The hang was detected at '''22/Apr/2015:19:34:14''' (*mods of the bind 20150422172417Z") but the failure (lock not release) happened before
The test case (demo_ulc.sh) can reproduce (1 time out of 10) this kind of error
{{{ [23/Apr/2015:16:18:32 +0200] - Parent cn=groups,cn=accounts,dc=idm,dc=lab,dc=eng,dc=brq,dc=redhat,dc=com has no children. (op 0x2, repl_op 0x10) [23/Apr/2015:16:18:32 +0200] ldbm_back_delete - conn=507 op=8 parent_update_on_childchange: old_entry=0x7fb01c08dec0, new_entry=0x0, rc=-1 [23/Apr/2015:16:21:14 +0200] - Parent cn=groups,cn=accounts,dc=idm,dc=lab,dc=eng,dc=brq,dc=redhat,dc=com has no children. (op 0x2, repl_op 0x0) [23/Apr/2015:16:21:14 +0200] ldbm_back_delete - conn=0 op=0 parent_update_on_childchange: old_entry=0x7fb034072d30, new_entry=0x0, rc=-1 [23/Apr/2015:16:21:14 +0200] managed-entries-plugin - mep_del_post_op: failed to delete managed entry (cn=my_manager,cn=groups,cn=accounts,dc=idm,dc=lab,dc=eng,dc=brq,dc=redhat,dc=com) - error (1) [23/Apr/2015:16:21:14 +0200] - Parent cn=groups,cn=accounts,dc=idm,dc=lab,dc=eng,dc=brq,dc=redhat,dc=com has no children. (op 0x2, repl_op 0x0) [23/Apr/2015:16:21:14 +0200] ldbm_back_delete - conn=0 op=0 parent_update_on_childchange: old_entry=0x7fb034072d30, new_entry=0x0, rc=-1 [23/Apr/2015:16:21:14 +0200] managed-entries-plugin - mep_del_post_op: failed to delete managed entry (cn=my_manager,cn=groups,cn=accounts,dc=idm,dc=lab,dc=eng,dc=brq,dc=redhat,dc=com) - error (1)
[23/Apr/2015:16:18:32 +0200] conn=507 op=8 DEL dn="cn=my_group,cn=groups,cn=accounts,dc=idm,dc=lab,dc=eng,dc=brq,dc=redhat,dc=com" [23/Apr/2015:16:18:32 +0200] conn=507 op=8 RESULT err=1 tag=107 nentries=0 etime=0 ... [23/Apr/2015:16:21:13 +0200] conn=530 op=20 DEL dn="uid=my_manager,cn=users,cn=accounts,dc=idm,dc=lab,dc=eng,dc=brq,dc=redhat,dc=com" [23/Apr/2015:16:21:14 +0200] conn=530 op=20 RESULT err=1 tag=107 nentries=0 etime=1
}}}
The problem is around the numsubordinates attributes. The failure occurs because when deleting an entry and updating the parent's numbsubordinates an error condition is detected (the parent is supposed to have no child). In fact we can see that this attribute value is invalid:
{{{ ldapsearch -LLL -D "cn=directory manager" -w Secret123 -b "cn=groups,cn=accounts,SUFFIX" -s base numsubordinates dn: cn=groups,cn=accounts,SUFFIX numsubordinates: 2
ldapsearch -D "cn=directory manager" -w Secret123 -b "cn=groups,cn=accounts,SUFFIX" -LLL dn dn: cn=groups,cn=accounts,SUFFIX
dn: cn=admins,cn=groups,cn=accounts,SUFFIX
dn: cn=ipausers,cn=groups,cn=accounts,SUFFIX
dn: cn=editors,cn=groups,cn=accounts,SUFFIX
dn: cn=trust admins,cn=groups,cn=accounts,SUFFIX
dn: cn=my_group,cn=groups,cn=accounts,SUFFIX
dn: cn=stageadm,cn=groups,cn=accounts,SUFFIX
dn: cn=my_manager,cn=groups,cn=accounts,SUFFIX
[24/Apr/2015:11:49:59 +0200] - Parent cn=groups,cn=accounts,SUFFIX has no children. (op 0x2, repl_op 0x10) [24/Apr/2015:11:49:59 +0200] ldbm_back_delete - conn=594 op=8 parent_update_on_childchange: old_entry=0x7f07600c45d0, new_entry=0x0, rc=-1 [24/Apr/2015:11:54:55 +0200] - Parent cn=groups,cn=accounts,SUFFIX has no children. (op 0x2, repl_op 0x10) [24/Apr/2015:11:54:55 +0200] ldbm_back_delete - conn=627 op=8 parent_update_on_childchange: old_entry=0x7f07841919f0, new_entry=0x0, rc=-1
This bug was very likely a duplicate of https://fedorahosted.org/389/ticket/47978, where a thread could forget to unlock an entry in entry cache.
Metadata Update from @tbordaz: - Issue assigned to tbordaz - Issue set to the milestone: 1.3.5 backlog
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/1496
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
subscribe
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Duplicate)
Login to comment on this ticket.