Ticket #359 (closed defect: fixed)

Opened 2 years ago

Last modified 2 years ago

Database RUV could mismatch the one in changelog under the stress

Reported by: nhosoi Owned by: nhosoi
Priority: major Milestone: 1.2.10
Component: Replication - General Version: 1.2.10
Keywords: Cc: rmeggins, nkinder, mreynolds
Blocked By: Blocking:
Review: ack Ticket origin:
Red Hat Bugzilla: 819643, 821176

Description

Found in a stress test.

On one master (blademtv5), the database RUV element for another master (blademtv1) stopped being updated, while the corresponding RUV in the changelog kept being updated. The mismatch stopped replication of changes originating on blademtv1.

The cause: simultaneous MODRDN operations conflicted and one conflict resolution failed, which left an uncommitted CSN in the CSN list of the RUV element. That uncommitted CSN prevented the max CSN from being obtained, so the database RUV could not be updated.
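
For illustration only, here is a minimal model of how one uncommitted CSN at the head of a pending list can stall the roll-up. The names `pending_csn` and `rollup_strict` are hypothetical, and plain integers stand in for real CSNs; this is not the code in csnpl.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified pending-list entry; not the csnpl.c structure. */
typedef struct {
    unsigned long csn;   /* stand-in for a real CSN */
    bool committed;
} pending_csn;

/*
 * Strict roll-up: consume committed entries from the head and stop at the
 * first uncommitted one. Returns the largest CSN consumed, or 0 when the
 * head entry is uncommitted and nothing can be reported.
 */
unsigned long rollup_strict(const pending_csn *list, size_t n)
{
    unsigned long max = 0;
    for (size_t i = 0; i < n && list[i].committed; i++) {
        if (list[i].csn > max) {
            max = list[i].csn;
        }
    }
    return max;
}
```

In this model, a list such as {10 uncommitted, 11 committed, 12 committed} yields 0 on every call, so a caller that updates the RUV from the return value never advances, however many later CSNs are committed.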

Attachments

0001-Trac-Ticket-359-Database-RUV-could-mismatch-the-one.patch (10.9 KB) - added by nhosoi 2 years ago.
git patch file (master)
0001-Trac-Ticket-359-Database-RUV-could-mismatch-the-one.2.patch (7.5 KB) - added by nhosoi 2 years ago.
git patch file (389-ds-base-1.2.10)
0001-Trac-Ticket-359-Database-RUV-could-mismatch-the-one.3.patch (9.3 KB) - added by nhosoi 2 years ago.
git patch file (master)
0013-Trac-Ticket-359-Database-RUV-could-mismatch-the-one.patch (8.8 KB) - added by nhosoi 2 years ago.
git patch file (389-ds-base-1.2.10)
0001-Trac-Ticket-359-Database-RUV-could-mismatch-the-one-.patch (1.6 KB) - added by rmeggins 2 years ago.
0001-Trac-Ticket-359-Database-RUV-could-mismatch-the-one-.patch - csnplRollUp leak

Change History

Changed 2 years ago by nhosoi

git patch file (master)

comment:1 Changed 2 years ago by nhosoi

  • Review set to review?
Fix description
. csnplRollUp (csnpl.c) - To get the first committed csndata, if
  uncommitted CSNs precede it in the csnpl list, this patch
  skips them and returns the first committed CSN.
. llistRemoveCurrentAndGetNext (llist.c) - When the last item
  in the list is removed, the tail pointer is reinitialized, too.
. multimaster_preop|bepreop_* (repl5_plugins.c) - process_operation
  is moved from multimaster_preop_* to multimaster_bepreop_*. This
  prevents the uncommitted CSN that process_operation sets in the
  csnpl (RUV element) from being left uncommitted; the commit is
  done at the BE_TXN_POST timing.
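
A rough sketch of the first two items, assuming a simplified singly linked pending list. The types and the functions `pl_append` and `pl_take_first_committed` are hypothetical and are not the llist.c or csnpl.c API; they only show skipping uncommitted entries and keeping the tail pointer consistent on removal.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical node and list types; the real llist.c/csnpl.c differ. */
typedef struct node {
    unsigned long csn;      /* stand-in for a real CSN */
    bool committed;
    struct node *next;
} node;

typedef struct {
    node *head;
    node *tail;
} pending_list;

/* Append a CSN at the end. Returns false on allocation failure. */
bool pl_append(pending_list *pl, unsigned long csn, bool committed)
{
    node *n = malloc(sizeof *n);
    if (n == NULL) {
        return false;
    }
    n->csn = csn;
    n->committed = committed;
    n->next = NULL;
    if (pl->tail != NULL) {
        pl->tail->next = n;
    } else {
        pl->head = n;
    }
    pl->tail = n;
    return true;
}

/*
 * Remove the first committed entry, skipping any uncommitted entries ahead
 * of it. Returns the removed CSN, or 0 if no committed entry exists.
 */
unsigned long pl_take_first_committed(pending_list *pl)
{
    node *prev = NULL;
    node *cur = pl->head;
    while (cur != NULL && !cur->committed) {
        prev = cur;
        cur = cur->next;
    }
    if (cur == NULL) {
        return 0;
    }
    if (prev != NULL) {
        prev->next = cur->next;
    } else {
        pl->head = cur->next;
    }
    /* If the removed node was the tail, move the tail back. It becomes
       NULL when the list is now empty, so a later append starts cleanly. */
    if (pl->tail == cur) {
        pl->tail = prev;
    }
    unsigned long csn = cur->csn;
    free(cur);
    return csn;
}
```

Without the tail adjustment, removing the only remaining node would leave `tail` pointing at freed memory, and the next append would write through it.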

comment:2 Changed 2 years ago by rmeggins

  • Review changed from review? to ack

Looks good. I would still like to know what changed in 1.2.10 that caused this problem.

comment:3 Changed 2 years ago by rmeggins

  • Red Hat Bugzilla set to 819643

Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=819643

Changed 2 years ago by nhosoi

git patch file (389-ds-base-1.2.10)

comment:4 Changed 2 years ago by nhosoi

Fix description:
. csnplRollUp (csnpl.c) - To get the first committed csndata, if
  uncommitted CSNs precede it in the csnpl list, this patch
  skips them and returns the first committed CSN.
. llistRemoveCurrentAndGetNext (llist.c) - When the last item
  in the list is removed, the tail pointer is reinitialized, too.
. ldbm_back_add, ldbm_back_modrdn (ldbm_add.c, ldbm_modrdn.c) -
  Make sure SLAPI_RESULT_CODE and SLAPI_PLUGIN_OPRETURN are set
  in general, not just when the transaction is started.
  If an error occurs, the RESULT_CODE triggers removal of the CSN
  from the RUV element.
. plugin_call_func (plugin.c) - When the plugin type is be pre/
  post op, respect the fatal error code (-1) instead of OR-ing the
  results from all the plugins.  The error code -1 is checked
  in ldbm_back_add and ldbm_back_modrdn to distinguish it from the
  URP operation bits.
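
As a sketch of the plugin_call_func item only: the function below combines return codes that have already been collected, treating non-fatal codes as bit flags and keeping a fatal -1 explicit. `combine_results` and `FATAL_RC` are made-up names for illustration; the actual logic in plugin.c is structured differently and runs while the plugins are being called.

```c
#include <stddef.h>

#define FATAL_RC (-1)   /* fatal return value, per the fix description */

/*
 * Combine per-plugin return codes. Non-fatal codes are OR-ed together as
 * bit flags. Once any plugin has returned the fatal code, the result stays
 * FATAL_RC and later codes are not folded into it.
 */
int combine_results(const int *rcs, size_t n)
{
    int result = 0;
    for (size_t i = 0; i < n; i++) {
        if (rcs[i] == FATAL_RC) {
            result = FATAL_RC;
        } else if (result != FATAL_RC) {
            result |= rcs[i];
        }
    }
    return result;
}
```

A caller can then compare the combined value against -1 first and only interpret it as flag bits when it is not fatal.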

Changed 2 years ago by nhosoi

git patch file (master)

comment:7 Changed 2 years ago by nhosoi

  • Cc rmeggins added
  • Status changed from new to assigned
  • Milestone changed from 0.0 NEEDS_TRIAGE to 1.2.10
  • Owner changed from rmeggins to nhosoi

comment:8 Changed 2 years ago by nhosoi

  • Review changed from review? to nack

Sorry, I have to back off.

Found another problem in the patch. :(

[08/May/2012:12:08:54 -0700] NSMMReplicationPlugin - ruv_compare_ruv: RUV [changelog max RUV] does not contain element [{replica 1 ldap://<host>:<port>} 4fa96f07000000010000 4fa96f07000000010000] which is present in RUV [database RUV]


comment:9 follow-up: ↓ 10 Changed 2 years ago by nhosoi

Okay, the problem (ruv_compare_ruv) was introduced prior to my patch.

If you go back to before this commit, the problem disappears.
commit 0f50544b9567907edd0ba645951d7cd325354107
Date: Mon Apr 23 13:36:04 2012 -0400

Ticket #337 - RFE - Improve CLEANRUV functionality

comment:10 in reply to: ↑ 9 ; follow-up: ↓ 11 Changed 2 years ago by rmeggins

Replying to nhosoi:

But I thought you were seeing the original problem for ticket #359 with 1.2.10? I believe that patch only applies to 1.2.11 and later. Or are you saying your patch for #359 works with 1.2.10 but not with 1.2.11 because of #337?

comment:11 in reply to: ↑ 10 ; follow-up: ↓ 12 Changed 2 years ago by nhosoi

Replying to rmeggins:

Sorry about the confusion. But there is something odd going on...
On my F16, I tested both. My local build from master shows the problem (ruv_compare_ruv) once #337 is included. I don't see it with my local build from the 1.2.10 branch with my patch. (See #337; just installing 2 Masters + 1 Hub shows the problem.)

On blademtv5, I installed my local 1.2.10.8 build with my patch (no #337) on top of Michael's test env, which showed this error:
[07/May/2012:18:16:51 -0700] NSMMReplicationPlugin - ruv_compare_ruv: RUV [changelog max RUV] does not contain element [{replica 2 ldap://blademtv4-6.lab.sjc.redhat.com:38001} 4f91ecf6000000020000 4fa2f707000000020000] which is present in RUV [database RUV]
[07/May/2012:18:16:51 -0700] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: for replica o=my_suffix.com there were some differences between the changelog max RUV and the database RUV. If there are obsolete elements in the database RUV, you should remove them using CLEANRUV task. If they are not obsolete, you should check their status to see why there are no changes from those servers in the changelog.

Unfortunately, since the RUV on blademtv5 was already broken, it was hard for me to figure out the problem. So I switched to testing on my machine.

comment:12 in reply to: ↑ 11 Changed 2 years ago by nhosoi

Replying to nhosoi:

I believe the above problem was caused by broken max RUVs in the changelog. To fix it, I had to reinitialize each master and run an update operation on each to refresh its max RUV. Now all the masters dump the expected max RUVs in the changelog when they are shut down, and the following restart no longer issues the ruv_compare_ruv error.
dbid: 0000014d000000000000

max ruv:

{replicageneration} 4f91eb54000000010000
{replica 3} 4fa19249000000030000 4fa19249000000030000
{replica 4} 4faaac03000000040000 4faabc99000000040000
{replica 2} 4faabd9c000100020000 4faabd9c000100020000
{replica 1} 4faabea2000000010000 4faabea2000000010000
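
For reading the dump above: a CSN string is 20 hex digits, which to my understanding break down as an 8-digit timestamp, a 4-digit sequence number, a 4-digit replica ID and a 4-digit sub-sequence number. The helper below is a hypothetical sketch under that assumption (`csn_fields` and `parse_csn` are not names from the server code).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the four assumed CSN fields. */
typedef struct {
    unsigned long timestamp;   /* 8 hex digits */
    unsigned int seqnum;       /* 4 hex digits */
    unsigned int replica_id;   /* 4 hex digits */
    unsigned int subseqnum;    /* 4 hex digits */
} csn_fields;

/* Returns 1 on success, 0 if the string is not exactly 20 hex digits. */
int parse_csn(const char *s, csn_fields *out)
{
    if (strlen(s) != 20 || strspn(s, "0123456789abcdefABCDEF") != 20) {
        return 0;
    }
    return sscanf(s, "%8lx%4x%4x%4x", &out->timestamp, &out->seqnum,
                  &out->replica_id, &out->subseqnum) == 4;
}
```

Applied to 4faabd9c000100020000 from the {replica 2} line, this gives timestamp 0x4faabd9c, sequence number 1, replica ID 2 and sub-sequence number 0, which agrees with the replica ID shown in the element.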

comment:13 Changed 2 years ago by nhosoi

  • Cc nkinder, mreynolds added
  • Review changed from nack to review?

comment:14 Changed 2 years ago by rmeggins

  • Review changed from review? to ack

comment:15 Changed 2 years ago by nhosoi

  • Status changed from assigned to closed
  • Resolution set to fixed

Reviewed by Rich (Thank you!!!)

Pushed to master.

$ git merge trac359
Updating 4d7d59e..f0f74b5
Fast-forward

 ldap/servers/plugins/replication/csnpl.c   |   23 ++++++++++++---------
 ldap/servers/plugins/replication/llist.c   |    8 +++++-
 ldap/servers/plugins/usn/usn.c             |    4 ++-
 ldap/servers/slapd/back-ldbm/ldbm_add.c    |   29 ++++++++++++++-------------
 ldap/servers/slapd/back-ldbm/ldbm_modrdn.c |   29 ++++++++++++++-------------
 ldap/servers/slapd/plugin.c                |   10 +++++++-
 6 files changed, 60 insertions(+), 43 deletions(-)

$ git push
Counting objects: 29, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (15/15), done.
Writing objects: 100% (15/15), 2.36 KiB, done.
Total 15 (delta 12), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git

4d7d59e..f0f74b5 master -> master

Pushed to 389-ds-base-1.2.10 branch.

$ git push origin ds1210-local:389-ds-base-1.2.10
Enter passphrase for key '/home/nhosoi/.ssh/id_rsa':
Counting objects: 29, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (15/15), done.
Writing objects: 100% (15/15), 2.54 KiB, done.
Total 15 (delta 12), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git

4c31c0d..ed1ebf6 ds1210-local -> 389-ds-base-1.2.10

Pushed to 389-ds-base-1.2.11 branch.

$ git push origin ds1211-local:389-ds-base-1.2.11
Enter passphrase for key '/home/nhosoi/.ssh/id_rsa':
Counting objects: 39, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (23/23), done.
Writing objects: 100% (23/23), 3.28 KiB, done.
Total 23 (delta 18), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git

6041d86..c89ea2f ds1211-local -> 389-ds-base-1.2.11

comment:16 Changed 2 years ago by nkinder

  • Red Hat Bugzilla changed from 819643 to 819643, 821176

Linked to Bugzilla bug: https://bugzilla.redhat.com/show_bug.cgi?id=821176 (Red Hat Enterprise Linux 6)

Changed 2 years ago by nhosoi

git patch file (389-ds-base-1.2.10)

comment:17 Changed 2 years ago by nhosoi

  • Review changed from ack to review?

comment:18 Changed 2 years ago by rmeggins

  • Review changed from review? to ack

comment:19 Changed 2 years ago by rmeggins

  • Status changed from closed to reopened
  • Resolution fixed deleted

Changed 2 years ago by rmeggins

0001-Trac-Ticket-359-Database-RUV-could-mismatch-the-one-.patch - csnplRollUp leak

comment:20 Changed 2 years ago by nhosoi

ack on "0001-Trac-Ticket-359-Database-RUV-could-mismatch-the-one-.patch - csnplRollUp leak"

comment:21 Changed 2 years ago by rmeggins

  • Status changed from reopened to closed
  • Resolution set to fixed

e5bdf55..4d982e5 389-ds-base-1.2.10 -> 389-ds-base-1.2.10

commit changeset:4d982e5cfdd3f2091a79cce6be94df998c9736f3/389-ds-base

f9dfeea..b5f3f98 389-ds-base-1.2.11 -> 389-ds-base-1.2.11

commit changeset:b5f3f98fc0a8f94ecf1b4bf0c68d8a17b75a233b/389-ds-base

59ac943..12567ff master -> master

commit changeset:12567ffd8c5504cd3ba7dc7783e7ca1f237c82be/389-ds-base

comment:22 Changed 20 months ago by nkinder

  • screened set to 1

Added initial screened field value.
