Ticket #335 (closed defect: fixed)

Opened 2 years ago

Last modified 22 months ago

transaction retries need to be cache aware

Reported by: rmeggins Owned by: nhosoi
Priority: critical Milestone: 1.2.11.rc1
Component: Database - General Version:
Keywords: Cc:
Blocked By: Blocking:
Review: ack Ticket origin:
Red Hat Bugzilla: 833202

Description

The code that resets the original entries/mods/etc. when retrying deadlocked transactions in ldbm_back_delete et al. needs to be cache-aware when deleting entries that are cached - for example, ldbm_delete.c:485

			backentry_free(&e);

The problem is that e is cached, so it cannot be freed like this. Freeing cached entries appears to cause problems in the server: assertion failures and other crashes.

Attachments

0001-Trac-Ticket-335-transaction-retries-need-to-be-cache.patch (18.8 KB) - added by nhosoi 2 years ago.
git patch file (master)
0002-Trac-Ticket-335-transaction-retries-need-to-be-cache.patch (2.1 KB) - added by nhosoi 23 months ago.
git patch file (master)

Change History

comment:1 follow-up: ↓ 2 Changed 2 years ago by nhosoi

Agh.. You are right.

We have to do something like this...
cache_find_dn/id
if exists, cache_return; then cache_remove
otherwise backentry_free
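
Roughly, something like this (just a sketch - the calls follow the back-ldbm entry cache API, but the exact signatures and the inst/e variables here are assumed, not taken from actual code):

    struct backentry *cached = cache_find_id(&inst->inst_cache, e->ep_id);
    if (cached != NULL) {
        /* entry is cached: drop the reference taken by the lookup, then
         * remove the entry from the cache and return our reference
         * rather than freeing it directly */
        CACHE_RETURN(&inst->inst_cache, &cached);
        CACHE_REMOVE(&inst->inst_cache, e);
        CACHE_RETURN(&inst->inst_cache, &e);
    } else {
        /* entry was never cached: a plain free is safe */
        backentry_free(&e);
    }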

Could this be the cause of the other crash bug, e.g., bz808770?

comment:2 in reply to: ↑ 1 Changed 2 years ago by rmeggins

Replying to nhosoi:

Agh.. You are right.

We have to do something like this...
cache_find_dn/id
if exists, cache_return; then cache_remove
otherwise backentry_free

Could this be the cause of the other crash bug, e.g., bz808770?

No, because the txn retry code is only in 1.2.11 - I found it trying to reproduce 808770 with 1.2.11...

comment:3 follow-up: ↓ 4 Changed 2 years ago by nhosoi

Rich, I'm wondering if you are already working on this ticket?
This ticket is assigned to me. Is it okay for me to start looking into this?

comment:4 in reply to: ↑ 3 ; follow-up: ↓ 5 Changed 2 years ago by rmeggins

Replying to nhosoi:

Rich, I'm wondering if you are already working on this ticket?
This ticket is assigned to me. Is it okay for me to start looking into this?

I'm not working on this. Feel free to take it.

comment:5 in reply to: ↑ 4 Changed 2 years ago by nhosoi

  • Status changed from new to assigned

Replying to rmeggins:

I'm not working on this. Feel free to take it.

Thanks!

comment:6 follow-up: ↓ 7 Changed 2 years ago by nhosoi

Rich, do you happen to have a reproducer (for the assertion failure / crashes) handy?

comment:7 in reply to: ↑ 6 ; follow-up: ↓ 8 Changed 2 years ago by rmeggins

Replying to nhosoi:

Rich, do you happen to have a reproducer (for the assertion failure / crashes) handy?

Not easily - what we really need is a way to "inject" deadlocks - that is, we need to be able to trigger a deadlock programmatically rather than having to put the directory server under a heavy load with many updates + searches at the same time.

comment:8 in reply to: ↑ 7 ; follow-up: ↓ 9 Changed 2 years ago by nhosoi

Replying to rmeggins:

Replying to nhosoi:

Rich, do you happen to have a reproducer (for the assertion failure / crashes) handy?

Not easily - what we really need is a way to "inject" deadlocks - that is, we need to be able to trigger a deadlock programmatically rather than having to put the directory server under a heavy load with many updates + searches at the same time.

I see. Thanks... We may be able to make it happen more easily once we disable the backend SERIALLOCK. BTW, SERIALLOCK is not applied to replicated operations. Are you getting the assertion failure/crash in an MMR topology?

comment:9 in reply to: ↑ 8 Changed 2 years ago by rmeggins

Replying to nhosoi:

Replying to rmeggins:

Replying to nhosoi:

Rich, do you happen to have a reproducer (for the assertion failure / crashes) handy?

Not easily - what we really need is a way to "inject" deadlocks - that is, we need to be able to trigger a deadlock programmatically rather than having to put the directory server under a heavy load with many updates + searches at the same time.

I see. Thanks... We may be able to make it happen more easily once we disable the backend SERIALLOCK. BTW, SERIALLOCK is not applied to replicated operations. Are you getting the assertion failure/crash in an MMR topology?

My test replicates the same setup as the reported problem, which does use MMR, but I don't think you need MMR to reproduce it - we just need some way to make calls to db->put(), db->get(), dbc->c_put(), dbc->c_get(), and dbc->c_close() return DB_LOCK_DEADLOCK errors without having to generate large loads.
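
For illustration only, an injection hook could look something like this (purely hypothetical - inject_deadlock_every and the wrapper function do not exist anywhere; only db->put() and DB_LOCK_DEADLOCK are the real libdb pieces):

    static int inject_deadlock_every = 0;  /* hypothetical test-only knob, 0 = disabled */
    static int put_calls = 0;

    static int
    dblayer_put_with_injection(DB *db, DB_TXN *txn, DBT *key, DBT *data, u_int32_t flags)
    {
        /* force a deadlock return on every Nth call so the txn retry path
         * can be exercised without generating a heavy load */
        if (inject_deadlock_every > 0 &&
            (++put_calls % inject_deadlock_every) == 0) {
            return DB_LOCK_DEADLOCK;
        }
        return db->put(db, txn, key, data, flags);
    }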

comment:10 Changed 2 years ago by nhosoi

Thanks! I could reproduce the problem quite easily on a standalone server. I'm seeing the assertion failures and crashes. Continuing to debug...

Changed 2 years ago by nhosoi

git patch file (master)

comment:11 Changed 2 years ago by nhosoi

  • Review set to review?

Fix description:
When libdb returns DEADLOCK and a backend update function retries
the operation, the target entry is reset to its original state.
The target entry may or may not be in the entry cache. Regardless
of its status, the original code simply released the entry with
backentry_free before going into the next loop, which caused the
cache errors.

This patch checks the status of the entry. If it is in the entry
cache, it is removed from the entry cache and a new entry is added
back to the cache if necessary. To get the accurate cache status
of each entry, an output argument cache_res is added to
id2entry_add_ext.

Additionally, error checking for the conflict value in index_add_mods
was weak (curr_attr). This patch adds that check.
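
A simplified sketch of the retry path with the new argument (cache_res and id2entry_add_ext are from the patch; the surrounding variable names, the exact signature, and the meaning of cache_res == 0 are abbreviated/assumed here):

    int retval;
    int cache_res = 0;  /* filled in by id2entry_add_ext with the cache add status */

    retval = id2entry_add_ext(be, e, &txn, encrypt, &cache_res);
    if (DB_LOCK_DEADLOCK == retval) {
        /* release e according to its cache status before retrying */
        if (0 == cache_res) {  /* assuming 0 means e was added to the cache */
            /* e is in the entry cache: remove it there instead of freeing */
            CACHE_REMOVE(&inst->inst_cache, e);
            CACHE_RETURN(&inst->inst_cache, &e);
        } else {
            /* e never entered the cache: a direct free is safe */
            backentry_free(&e);
        }
        continue;  /* next iteration of the retry loop */
    }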

comment:12 Changed 2 years ago by rmeggins

  • Review changed from review? to ack

comment:13 Changed 2 years ago by nhosoi

  • Status changed from assigned to closed
  • Resolution set to fixed

Reviewed by Rich (Thank you!!)

Pushed to master.

$ git merge ticket335
Updating a50f8c0..bddb5a4
Fast-forward

ldap/servers/slapd/back-ldbm/cache.c | 2 +-
ldap/servers/slapd/back-ldbm/id2entry.c | 16 ++++++---
ldap/servers/slapd/back-ldbm/import-threads.c | 2 +-
ldap/servers/slapd/back-ldbm/index.c | 35 +++++++++++---------
ldap/servers/slapd/back-ldbm/ldbm_add.c | 20 ++++++++++-
ldap/servers/slapd/back-ldbm/ldbm_delete.c | 33 ++++++++++++++-----
ldap/servers/slapd/back-ldbm/ldbm_modify.c | 41 +++++++++++++++++++-----
ldap/servers/slapd/back-ldbm/ldbm_modrdn.c | 23 +++++++++++--
ldap/servers/slapd/back-ldbm/proto-back-ldbm.h | 2 +-
9 files changed, 129 insertions(+), 45 deletions(-)

$ git push
Counting objects: 88, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (34/34), done.
Writing objects: 100% (34/34), 6.17 KiB, done.
Total 34 (delta 30), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git

a50f8c0..bddb5a4 master -> master

Changed 23 months ago by nhosoi

git patch file (master)

comment:14 Changed 23 months ago by nhosoi

Fix description:
Commit bddb5a45f2cd27705cb6629f436ef9a7e2248677 includes this fix:

Additionally, error checking for the conflict value in index_add_mods
was weak (curr_attr). This patch adds that check.


The fix was incomplete. If the attribute type being added does not
itself exist in the entry (not just the attribute value) after the
mods are applied, the attribute type/value should not be indexed.
This patch fixes it.
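
The added check looks roughly like this (a sketch only - the entry/mod variable names are illustrative, not the exact ones from the patch):

    Slapi_Attr *curr_attr = NULL;

    /* after the mods have been applied, check whether the attribute type
     * itself still exists on the resulting entry */
    slapi_entry_attr_find(newe->ep_entry, mods[i]->mod_type, &curr_attr);
    if (NULL == curr_attr) {
        /* the whole attribute type is gone, so the added values must not
         * be indexed at all; skip to the next mod */
        continue;
    }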

It's been reviewed by Rich on bz832560. (Thanks, Rich!!)

Pushed to master.

$ git push
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 992 bytes, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git

79799c9..420c6c2 master -> master

Pushed to 389-ds-base-1.2.11, as well.

$ git push origin ds1211-local:389-ds-base-1.2.11
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.02 KiB, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git

ff11ccc..7d6801f ds1211-local -> 389-ds-base-1.2.11

comment:15 Changed 22 months ago by nkinder

  • Red Hat Bugzilla set to [https://bugzilla.redhat.com/show_bug.cgi?id=833202 833202]

Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=833202

comment:16 Changed 22 months ago by rmeggins

master:
commit changeset:420c6c289681aa36dfcafba0497de55843aa9157/389-ds-base
Author: Noriko Hosoi <nhosoi@…>
Date: Fri Jun 15 16:32:09 2012 -0700

1.2.11:
commit changeset:7d6801fa139fb78b0c5a15785a59e5009bee7ff0/389-ds-base
Author: Noriko Hosoi <nhosoi@…>
Date: Fri Jun 15 16:32:09 2012 -0700

comment:17 Changed 20 months ago by nkinder

  • screened set to 1

Added initial screened field value.

Note: See TracTickets for help on using tickets.