There is a race condition when looking up the parent of a deleted entry, where the cached entry is replaced by tombstone purging(updating the parents tombstonenumsubordinate attribute). The cached entry gets removed just after retrieving the entry from the cache, but before it can be locked (cache_lock_entry). This results in an operations error when trying to lock the entry that was just retrieved from the cache.
Now I see the cause of the problem... A great catch!
It looks parentsdn was leaking in the error case? Probably, it'd be better to call "slapi_sdn_done(&parentsdn);" once after the common_return label?
And I know this is always hard to determine, but why you think 10 is good instead of 3 or 10000? ;)
Replying to [comment:1 nhosoi]:
Now I see the cause of the problem... A great catch! It looks parentsdn was leaking in the error case? Probably, it'd be better to call "slapi_sdn_done(&parentsdn);" once after the common_return label? And I know this is always hard to determine, but why you think 10 is good instead of 3 or 10000? ;)
We have other retry "magic numbers" - would prefer to use one of those.
Also, does this actually work without doing a sleep or thread yield between loop iterations? If so, it must be because of the db I/O and/or mutexes that are called in id2entry. Seems bad to have a tight loop without a yield to let other threads continue.
Replying to [comment:2 rmeggins]:
Replying to [comment:1 nhosoi]: Now I see the cause of the problem... A great catch! It looks parentsdn was leaking in the error case? Probably, it'd be better to call "slapi_sdn_done(&parentsdn);" once after the common_return label?
Now I see the cause of the problem... A great catch! It looks parentsdn was leaking in the error case? Probably, it'd be better to call "slapi_sdn_done(&parentsdn);" once after the common_return label?
I'll look into this.
And I know this is always hard to determine, but why you think 10 is good instead of 3 or 10000? ;) We have other retry "magic numbers" - would prefer to use one of those.
There is a LDBM CACHE RETRY define set to 1000 I can use.
In all of my tests the very next pass succeeds - it's never gone past one iteration. id2entry does take the cache lock, but I can add a short sleep. I'll work on a new patch...
revision 0001-Ticket-47771-Performing-deletes-during-tombstone-pur.patch
Replying to [comment:3 mreynolds]:
Replying to [comment:2 rmeggins]: Replying to [comment:1 nhosoi]: Now I see the cause of the problem... A great catch! It looks parentsdn was leaking in the error case? Probably, it'd be better to call "slapi_sdn_done(&parentsdn);" once after the common_return label? I'll look into this. And I know this is always hard to determine, but why you think 10 is good instead of 3 or 10000? ;) We have other retry "magic numbers" - would prefer to use one of those. There is a LDBM CACHE RETRY define set to 1000 I can use. Also, does this actually work without doing a sleep or thread yield between loop iterations? If so, it must be because of the db I/O and/or mutexes that are called in id2entry. Seems bad to have a tight loop without a yield to let other threads continue. In all of my tests the very next pass succeeds - it's never gone past one iteration. id2entry does take the cache lock, but I can add a short sleep. I'll work on a new patch...
New patch attached...
ack
git merge ticket47771 Updating ce23d5d..844d09d Fast-forward ldap/servers/slapd/back-ldbm/back-ldbm.h | 1 + ldap/servers/slapd/back-ldbm/cache.c | 6 ++- ldap/servers/slapd/back-ldbm/ldbm_delete.c | 46 +++++++++++++++++++-------- 3 files changed, 37 insertions(+), 16 deletions(-)
git push origin master ce23d5d..844d09d master -> master
commit 844d09d Author: Mark Reynolds mreynolds@redhat.com Date: Tue Apr 8 14:39:47 2014 -0400
0ddf772..48844b2 389-ds-base-1.3.2 -> 389-ds-base-1.3.2 commit 48844b2
203970e..44ecbbe 389-ds-base-1.3.1 -> 389-ds-base-1.3.1 commit 44ecbbec4490b012e4ea5d02dfe9d77ec5526acf
48e2a96..c5f22dd 389-ds-base-1.2.11 -> 389-ds-base-1.2.11 commit c5f22dd
Fixed Cherry-pick issue on 1.2.11
git merge ticket47771 Updating 5e45f45..b5cd247 Fast-forward ldap/servers/slapd/back-ldbm/ldbm_delete.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-)
git push origin 389-ds-base-1.2.11 5e45f45..b5cd247 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
commit b5cd247 Author: Mark Reynolds mreynolds@redhat.com Date: Mon Apr 21 11:32:55 2014 -0400
Fixed double slapi_sdn_done() call - only on 1.2.11 0001-Ticket-47771-Cherry-pick-issue-parentsdn-freed-twice.patch
You have my ack, but I liked your previous way putting slapi_sdn_done in the clean-up area. Another way -- initializing parentsdn with NULL does not help?
Replying to [comment:11 nhosoi]:
It's a local variable not a pointer, but I moved the initialization to the top of the function. So it should be okay now.
Attaching new patch...
Revision 0001-Ticket-47771-Move-parentsdn-initialization-to-avoid-.patch
Yep, that's better. Thanks!
git merge ticket47771 Updating 06a3d0a..991984f Fast-forward ldap/servers/slapd/back-ldbm/ldbm_delete.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-)
git push origin master 06a3d0a..991984f master -> master
commit 991984f Author: Mark Reynolds mreynolds@redhat.com Date: Mon Apr 21 12:43:09 2014 -0400
97412d6..bdb6962 389-ds-base-1.3.2 -> 389-ds-base-1.3.2 commit bdb6962
eb6ea36..4b986ee 389-ds-base-1.3.1 -> 389-ds-base-1.3.1 commit 4b986ee0536114a568c30b808663ee590f0c1a86
b5cd247..00a7594 389-ds-base-1.2.11 -> 389-ds-base-1.2.11 commit 00a7594
Metadata Update from @nhosoi: - Issue assigned to mreynolds - Issue set to the milestone: 1.2.11.30
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/1103
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
subscribe
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)
Login to comment on this ticket.