Ticket #387 (closed defect: fixed)

Opened 23 months ago

Last modified 22 months ago

managed entry sometimes doesn't delete the managed entry

Reported by: rmeggins Owned by: rmeggins
Priority: blocker Milestone: 1.2.11.5
Component: Directory Server Version:
Keywords: Cc: amessina@…, ssieb
Blocked By: Blocking:
Review: ack Ticket origin:
Red Hat Bugzilla: 830343

Description

I don't believe this is related to ticket #358

TET managedentry tests pass - IPA server and replica install work

But the http://freeipa.org/page/Testing tests/test_xmlrpc/test_user_plugin.py fails sometimes. The test may pass, but it will leave the group entry behind. During test cleanup, it is supposed to remove the user entries, and the group entry should be removed by mep. But if this fails, the group entry cannot be deleted directly without first removing the mepManagedEntry objectclass. Subsequent tests runs will fail due to the presence of this group entry.

I've been able to reproduce this very seldom. Martin saw this:

find_entry_internal_uniqueid retry count exceeded; uniqueid = (%s)

which suggests either a mishandled deadlock or a cache problem.

At the very least we need some error reporting from mep_del_post_op - there is currently no error checking or logging for the delete.

Attachments

errors-ticket387.txt (207.3 KB) - added by nkinder 23 months ago.
errors log
logging-diff (2.8 KB) - added by mreynolds 23 months ago.
0001-Ticket-387-managed-entry-sometimes-doesn-t-delete-th.patch (4.2 KB) - added by rmeggins 22 months ago.
clean up cache handling in ldbm_back_modify - plus cache removal
0002-Ticket-387-managed-entry-sometimes-doesn-t-delete-th.3.patch (11.5 KB) - added by rmeggins 22 months ago.
preserve the bepre "after" entry and restore it after txn failures

Change History

comment:1 Changed 23 months ago by rmeggins

If you can reproduce this problem easily, please try turning on the LDAP_DEBUG_CACHE (32768) level error logging and see if you can reproduce. This may help narrow down the problem to the cache.
http://port389.org/wiki/FAQ#Troubleshooting

comment:2 Changed 23 months ago by amessina

  • Cc amessina@… added

comment:3 Changed 23 months ago by nkinder

I have been able to reproduce the issue using a debug build from the tip of the 389-ds-base-1.2.11 branch. The issue occurs intermittently. I have had times where it fails after 2 or 3 cycles of the IPA test_user_plugins.py tests, or after 15. With LDAP_DEBUG_CACHE logging enabled, it took nearly 50 test cycles for the issue to occur. This indicates that it is timing related.

When the problem occurs, I am left with a dangling managed entry for tuser1 (the group entry). I dumped id2entry.db4 using dbscan and confirmed that the managed entry does actually exist in the database and not just the entry cache.

I will attach the errors log with LDAP_DEBUG_CACHE logging enabled.

Changed 23 months ago by nkinder

errors log

comment:4 Changed 23 months ago by nkinder

The errors log doesn't show much of interest, though I notice that the dncache return messages related to the tuser1 managed group entry stop once the problem occurs. This might simply be a byproduct of the operations that the test performs related to the tuser1 user entry no longer go through.

One other thing to note is that the managed entry plug-in thinks that the internal delete was successful. I applied a patch from Mark that would log an error in mep_del_post_op() if the internal delete operation of the managed entry returned anything other than LDAP_SUCCESS. That new logging statement never appears in my logfile.

comment:5 Changed 23 months ago by mreynolds

Added more logging, see attachment (logging-diff) to try and test Nathans theory that another operation is deleting an entry while the post op from another thread is adding mep objectclass and attribute.

In the process of reproducing with a script...

Changed 23 months ago by mreynolds

comment:6 Changed 23 months ago by nkinder

  • Red Hat Bugzilla set to [https://bugzilla.redhat.com/show_bug.cgi?id=830343 830343]

Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=830343

comment:7 Changed 22 months ago by ssieb

  • Cc ssieb added

comment:8 Changed 22 months ago by rmeggins

I was able to reproduce this with "plain" 389. I say "plain" because I had to reproduce the ipa dse config, schema, and data tree - basically ipa without the ipa and slapi-nis plugins.

The problem occurs when we retry a modify operation due to a deadlock. The problem is that a be_txn_preop plugin can modify the entry and the mod list. The first problem is that we do not have a good way to keep track of this and restore it. For example, the code currently stores the original entry i.e. the entry before modifying. If we have to retry, we have to reproduce the after entry by applying the mods to the before entry, expanding objectclasses, checking for schema/syntax/etc. If we do this, we may lose any changes made to the "after" entry that were made in a bepreop. The original entry gets written to the db, which is a no-op, and the data we thought was supposed to be written is not, because it was discarded.

bepreop and betxnpreop code can alter the data that goes into the db in two ways - by altering the SLAPI_MODIFY_EXISTING_ENTRY Slapi_Entry*, and by altering the SLAPI_MODIFY_MODS mods list. After the plugins are called, the code applies the mods to the after entry, slapi_schema_expand_objectclasses, slapi_entry_schema_check, slapi_mods_syntax_check, slapi_entry_rdn_values_present, etc.

We should save the "after" entry in the original_entry. That way the betxnpreop code can further modify the entry modified by the bepreop plugins. The problem then comes in with applying the mods to this entry. Since we save the original mods, we would have to reapply all of the mods that we already applied, in addition to any mods made in the betxnpreop. So instead, we just assume that the betxnpreop should not alter any of the original mods in any way, and instead should just append/extend the mods list. Then we can just apply any additional mods added by the betxnpreop (including mods that could undo any previous mods in the list). So a betxnpreop that wants to alter the mods list can do one of the following:

1) alter the mods list instead in the bepreop
2) extend the mods list with undo ops e.g. add a mod/delete to counter an earlier mod/add.

comment:9 Changed 22 months ago by rmeggins

  • Review set to review?

comment:10 follow-up: ↓ 11 Changed 22 months ago by nhosoi

  • Review changed from review? to ack

Your patches look good to me.
I think this is safe, but if you could give me some more details regarding the ec_in_cache case in the patch (0002-Ticket-387-managed-entry-sometimes-doesn-t-delete-th.2.patch), it makes me feel better. :) My question is when an error occurs AND ec is in cache, I don't see it's removed by CACHE_REMOVE, but it's just returned. Could it leave ec ("after" entry with mods applied) in the entry cache when the mod operation fails?

comment:11 in reply to: ↑ 10 Changed 22 months ago by rmeggins

Replying to nhosoi:

Your patches look good to me.

Thanks!

I think this is safe, but if you could give me some more details regarding the ec_in_cache case in the patch (0002-Ticket-387-managed-entry-sometimes-doesn-t-delete-th.2.patch), it makes me feel better. :)

Sure, no problem.

My question is when an error occurs AND ec is in cache, I don't see it's removed by CACHE_REMOVE, but it's just returned. Could it leave ec ("after" entry with mods applied) in the entry cache when the mod operation fails?

Yes, I think you are correct. I do need to cache_remove the ec entry in that case. Another patch coming up.

Changed 22 months ago by rmeggins

clean up cache handling in ldbm_back_modify - plus cache removal

Changed 22 months ago by rmeggins

preserve the bepre "after" entry and restore it after txn failures

comment:12 Changed 22 months ago by rmeggins

  • Review changed from ack to review?

comment:13 Changed 22 months ago by nhosoi

  • Review changed from review? to ack

comment:14 Changed 22 months ago by rmeggins

  • Status changed from new to closed
  • Resolution set to fixed

master:
commit changeset:6c17ec56076d34540929acbcf2f3e65534060a43/389-ds-base
Author: Rich Megginson <rmeggins@…>
Date: Mon Jun 18 12:21:50 2012 -0600

commit changeset:c49496ad6469f5786642c53de6019dba26c751b0/389-ds-base
Author: Rich Megginson <rmeggins@…>
Date: Wed Jun 13 21:03:37 2012 -0600

1.2.11:
commit changeset:8e67aac72ce0802bd9dc7285da340191b71dafc3/389-ds-base
Author: Rich Megginson <rmeggins@…>
Date: Mon Jun 18 12:21:50 2012 -0600

commit changeset:f2818827cf38046a16f9b7f39779ae5ab53338bd/389-ds-base
Author: Rich Megginson <rmeggins@…>
Date: Wed Jun 13 21:03:37 2012 -0600

comment:15 Changed 20 months ago by nkinder

  • screened set to 1

Added initial screened field value.

Note: See TracTickets for help on using tickets.