The code that resets the original entries/mods/etc. when retrying deadlocked transactions in ldbm_back_delete et. al. need to be cache aware when deleting entries that are cached - for example, ldbm_delete.c:485
backentry_free(&e);
The problem is that e is cached, so cannot be freed like this. Freeing cached entries seems to cause problems in the server, assertion failures, other crashes.
Agh.. You are right.
We have to do something like this... cache_find_dn/id if exists, cache_return; then cache_remove otherwise backentry_free
Could this be the cause of the other crash bug, e.g., bz808770?
Replying to [comment:1 nhosoi]:
Agh.. You are right. We have to do something like this... cache_find_dn/id if exists, cache_return; then cache_remove otherwise backentry_free Could this be the cause of the other crash bug, e.g., bz808770?
No, because the txn retry code is only in 1.2.11 - I found it trying to reproduce 808770 with 1.2.11 . . .
Rich, I'm wondering if you are already working on this ticket? This ticket is assigned to me. Is it okay for me to start looking into this?
Replying to [comment:3 nhosoi]:
I'm not working on this. Feel free to take it.
Replying to [comment:4 rmeggins]:
I'm not working on this. Feel free to take it. Thanks!
Rich, do you happen to have a reproducer (for the assertion failure / crashes) handy?
Replying to [comment:6 nhosoi]:
not easily - what we really need is a way to "inject" deadlocks - that is, we need to be able to trigger a deadlock programatically rather than having to put the directory server under a heavy load with many updates + searches at the same time
Replying to [comment:7 rmeggins]:
Replying to [comment:6 nhosoi]: Rich, do you happen to have a reproducer (for the assertion failure / crashes) handy? not easily - what we really need is a way to "inject" deadlocks - that is, we need to be able to trigger a deadlock programatically rather than having to put the directory server under a heavy load with many updates + searches at the same time
I see. Thanks... We may be able to make it happen easier once we disable the backend SERIALLOCK. BTW, the SERIALLOCK is not applied to the replicated operation. Are you getting the assertion failure/crash in the MMR topology?
Replying to [comment:8 nhosoi]:
Replying to [comment:7 rmeggins]: Replying to [comment:6 nhosoi]: Rich, do you happen to have a reproducer (for the assertion failure / crashes) handy? not easily - what we really need is a way to "inject" deadlocks - that is, we need to be able to trigger a deadlock programatically rather than having to put the directory server under a heavy load with many updates + searches at the same time I see. Thanks... We may be able to make it happen easier once we disable the backend SERIALLOCK. BTW, the SERIALLOCK is not applied to the replicated operation. Are you getting the assertion failure/crash in the MMR topology?
My test replicates the same setup as the reported problem, which does use MMR, but I don't think you need to use MMR to reproduce the problem - we just need some way to make calls to db->put() db->get() dbc->c_put() dbc->c_get() and dbc->c_close() return DB_LOCK_DEADLOCK errors without having to generate large loads.
Thanks! I could reproduce the problem quite easily on the standalone server. I'm seeing the assertion failures and crashes. Continue debugging...
git patch file (master) 0001-Trac-Ticket-335-transaction-retries-need-to-be-cache.patch
Fix description: When libdb returns DEADLOCK and backend update function retries the operation, the target entry is reset to the original shape. The target entry could be or could not be in the entry cache. Regardless of the status, the original code just released the entry with backentry_free before going into the next loop, which causes the cache error.
This patch checks the status of the entry. If it is in the entry cache, remove it from the entry cache and add a new entry back to the cache if necessary. To get the accurate cache status of each entry, the output argument cache_res to id2entry_add_ext is added.
Additinally, error checking for the conflict value in index_add_mods was week (curr_attr). This patch is adding the check.
Reviewed by Rich (Thank you!!)
Pushed to master.
$ git merge ticket335 Updating a50f8c0..bddb5a4 Fast-forward ldap/servers/slapd/back-ldbm/cache.c | 2 +- ldap/servers/slapd/back-ldbm/id2entry.c | 16 ++++++--- ldap/servers/slapd/back-ldbm/import-threads.c | 2 +- ldap/servers/slapd/back-ldbm/index.c | 35 +++++++++++--------- ldap/servers/slapd/back-ldbm/ldbm_add.c | 20 ++++++++++- ldap/servers/slapd/back-ldbm/ldbm_delete.c | 33 ++++++++++++++----- ldap/servers/slapd/back-ldbm/ldbm_modify.c | 41 +++++++++++++++++++----- ldap/servers/slapd/back-ldbm/ldbm_modrdn.c | 23 +++++++++++-- ldap/servers/slapd/back-ldbm/proto-back-ldbm.h | 2 +- 9 files changed, 129 insertions(+), 45 deletions(-) $ git push Counting objects: 88, done. Delta compression using up to 4 threads. Compressing objects: 100% (34/34), done. Writing objects: 100% (34/34), 6.17 KiB, done. Total 34 (delta 30), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git a50f8c0..bddb5a4 master -> master
git patch file (master) 0002-Trac-Ticket-335-transaction-retries-need-to-be-cache.patch
Fix description: Commit bddb5a4 includes this fix:
The fix was incomplete. If an add-attempted attribute type itself does not exist in the entry (not only the attribute value) after mods applied, the attribute type/value should not have been indexed. This patch fixes it.
It's been reviewed by Rich on bz832560. (Thanks, Rich!!)
$ git push Counting objects: 13, done. Delta compression using up to 4 threads. Compressing objects: 100% (7/7), done. Writing objects: 100% (7/7), 992 bytes, done. Total 7 (delta 5), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git 79799c9..420c6c2 master -> master
Pushed to 389-ds-base-1.2.11, as well.
$ git push origin ds1211-local:389-ds-base-1.2.11 Counting objects: 13, done. Delta compression using up to 4 threads. Compressing objects: 100% (7/7), done. Writing objects: 100% (7/7), 1.02 KiB, done. Total 7 (delta 5), reused 0 (delta 0) To ssh://git.fedorahosted.org/git/389/ds.git ff11ccc..7d6801f ds1211-local -> 389-ds-base-1.2.11
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=833202
master: commit changeset:420c6c2/389-ds-base Author: Noriko Hosoi nhosoi@totoro.usersys.redhat.com Date: Fri Jun 15 16:32:09 2012 -0700
1.2.11: commit changset:7d6801f/389-ds-base Author: Noriko Hosoi nhosoi@totoro.usersys.redhat.com Date: Fri Jun 15 16:32:09 2012 -0700
Added initial screened field value.
Metadata Update from @nhosoi: - Issue assigned to nhosoi - Issue set to the milestone: 1.2.11.rc1
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/335
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
subscribe
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)
Login to comment on this ticket.