Found in a stress test.
On one master (blademtv5), the database RUV element for another master (blademtv1) stopped being updated, while the corresponding RUV in the changelog kept being updated. The mismatch stopped replication of changes originating on blademtv1.
The cause: simultaneous MODRDN operations conflicted and one conflict resolution failed, which left an uncommitted CSN in the CSN list in the RUV element. That uncommitted CSN prevented the server from getting the max CSN needed to update the database RUV.
git patch file (master) 0001-Trac-Ticket-359-Database-RUV-could-mismatch-the-one.patch
{{{
Fix description
. csnplRollUp (csnpl.c) - To get the first committed csndata, if there are preceding uncommitted csn's in the csnpl list, this patch skips them and returns the first committed csn.
. llistRemoveCurrentAndGetNext (llist.c) - when the last item in the list is removed, the tail pointer is initialized, too.
. multimaster_preop|bepreop_ (repl5_plugins.c) - process_operation is moved from multimaster_preop_ to multimaster_bepreop_* so that the uncommitted csn that process_operation sets in the csnpl (RUV element) is not left without being committed; the commit is done at the BE_TXN_POST timing.
}}}
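The csnplRollUp change described above can be pictured with a minimal sketch. It assumes a simplified array-based pending list instead of the real linked list in csnpl.c, and the names `pending_csn` and `first_committed` are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an entry in the CSN pending list. */
typedef struct {
    unsigned long csn; /* illustrative scalar; real CSNs are structured */
    int committed;     /* nonzero once the operation has been committed */
} pending_csn;

/*
 * Return the first committed entry, skipping any uncommitted entries
 * that precede it. Returns NULL if no entry is committed.
 */
static const pending_csn *
first_committed(const pending_csn *list, size_t len)
{
    assert(list != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (list[i].committed) {
            return &list[i];
        }
    }
    return NULL;
}
```

With a list such as {uncommitted 10, committed 11, committed 12}, this sketch returns the entry for 11 rather than giving up at the uncommitted head.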
Looks good. I would still like to know what changed in 1.2.10 that caused this problem.
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=819643
git patch file (389-ds-base-1.2.10) 0001-Trac-Ticket-359-Database-RUV-could-mismatch-the-one.2.patch
{{{
Fix description:
. csnplRollUp (csnpl.c) - To get the first committed csndata, if there are preceding uncommitted csn's in the csnpl list, this patch skips them and returns the first committed csn.
. llistRemoveCurrentAndGetNext (llist.c) - when the last item in the list is removed, the tail pointer is initialized, too.
. ldbm_back_add, ldbm_back_modrdn (ldbm_add.c, ldbm_modrdn.c) - make sure SLAPI_RESULT_CODE and SLAPI_PLUGIN_OPRETURN are set not just when the transaction is started, but in general. If an error occurs, the RESULT_CODE triggers the removal of the CSN from the RUV element.
. plugin_call_func (plugin.c) - when the plugin type is be pre/post op, respect the fatal error code (-1) instead of ORing the results from all the plugins. The error code -1 is checked in ldbm_back_add and ldbm_back_modrdn to distinguish it from the URP operation bits.
}}}
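The return-code rule described for plugin_call_func can be sketched roughly as follows. This is an illustration under assumptions, not the code in plugin.c: the real function invokes plugin callbacks, while this sketch folds an array of already-collected return codes. `BE_PLUGIN_FATAL`, `combine_be_result`, `run_chain`, and `is_fatal` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define BE_PLUGIN_FATAL (-1) /* fatal error code named in the fix description */

/* Fold one plugin's return code into the accumulated result. */
static int
combine_be_result(int acc, int rc)
{
    if (acc == BE_PLUGIN_FATAL || rc == BE_PLUGIN_FATAL) {
        return BE_PLUGIN_FATAL; /* a fatal error is kept as-is */
    }
    return acc | rc; /* otherwise accumulate the operation bits */
}

/* Combine the return codes of a chain of be pre/post op plugins. */
static int
run_chain(const int *rcs, size_t n)
{
    int acc = 0;
    assert(rcs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        acc = combine_be_result(acc, rcs[i]);
    }
    return acc;
}

/* Caller side: test for the fatal code before interpreting any bits. */
static int
is_fatal(int combined)
{
    return combined == BE_PLUGIN_FATAL;
}
```

A caller in this sketch would check `is_fatal()` first and only then treat the value as a set of URP operation bits.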
git patch file (master) 0001-Trac-Ticket-359-Database-RUV-could-mismatch-the-one.3.patch
Note: this link is obsolete: https://fedorahosted.org/389/attachment/ticket/359/0001-Trac-Ticket-359-Database-RUV-could-mismatch-the-one.patch
Please review these 2 links: https://fedorahosted.org/389/attachment/ticket/359/0001-Trac-Ticket-359-Database-RUV-could-mismatch-the-one.2.patch (389-ds-base-1.2.10)
https://fedorahosted.org/389/attachment/ticket/359/0001-Trac-Ticket-359-Database-RUV-could-mismatch-the-one.3.patch (master)
Sorry, I have to back off.
Found another problem in the patch. :(
[08/May/2012:12:08:54 -0700] NSMMReplicationPlugin - ruv_compare_ruv: RUV [changelog max RUV] does not contain element [{replica 1 ldap://<host>:<port>} 4fa96f07000000010000 4fa96f07000000010000] which is present in RUV [database RUV]
Okay, the problem (ruv_compare_ruv) was introduced prior to my patch.
If you go back beyond this commit, the problem disappears.
commit 0f50544
Date: Mon Apr 23 13:36:04 2012 -0400
Ticket #337 - RFE - Improve CLEANRUV functionality
Replying to [comment:9 nhosoi]:
But I thought you were seeing the original problem for ticket #359 with 1.2.10? I believe that patch only applies to 1.2.11 and later. Or are you saying your patch for #359 works with 1.2.10 but not with 1.2.11 because of #337?
Replying to [comment:10 rmeggins]:
Sorry about the confusion. But there is something odd going on... On my F16, I tested both. My local build from master shows the problem (ruv_compare_ruv) after #337 is included. I don't see it with my local build from 1.2.10 branch with my patch. (See #337, just installing 2 Masters + 1 Hub shows the problem.)
On blademtv5, I installed my local 1.2.10.8 build with my patch (no #337) on top of Michael's test env, which showed this error:
[07/May/2012:18:16:51 -0700] NSMMReplicationPlugin - ruv_compare_ruv: RUV [changelog max RUV] does not contain element [{replica 2 ldap://blademtv4-6.lab.sjc.redhat.com:38001} 4f91ecf6000000020000 4fa2f707000000020000] which is present in RUV [database RUV]
[07/May/2012:18:16:51 -0700] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: for replica o=my_suffix.com there were some differences between the changelog max RUV and the database RUV. If there are obsolete elements in the database RUV, you should remove them using CLEANRUV task. If they are not obsolete, you should check their status to see why there are no changes from those servers in the changelog.
Unfortunately, since RUV on the server blademtv5 was broken anyway, it was hard for me to figure out the problem. So, I switched to test it on my machine.
Replying to [comment:11 nhosoi]:
I believe the above problem was caused by broken max RUVs in the changelog. To fix it, I had to reinitialize on each master and then do some update operation on each one to update its max RUV. Now all the masters dump the expected max RUVs to the changelog at shutdown, and the following restart no longer issues the ruv_compare_ruv error.
dbid: 0000014d000000000000
max ruv:
{replicageneration} 4f91eb54000000010000
{replica 3} 4fa19249000000030000 4fa19249000000030000
{replica 4} 4faaac03000000040000 4faabc99000000040000
{replica 2} 4faabd9c000100020000 4faabd9c000100020000
{replica 1} 4faabea2000000010000 4faabea2000000010000
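For reading the CSN strings in a max RUV dump like this one, a small decoder can help. The sketch below assumes the usual 20-hex-digit CSN layout (8 digits of timestamp, then 4 each for sequence number, replica ID, and subsequence number), which agrees with the replica IDs shown in the dump; `csn_fields` and `parse_csn` are hypothetical names, not functions from the server.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical decoded view of a 20-hex-digit CSN string. */
typedef struct {
    unsigned long timestamp; /* seconds since the epoch */
    unsigned int seqnum;     /* sequence number within that second */
    unsigned int rid;        /* replica ID */
    unsigned int subseqnum;  /* subsequence number */
} csn_fields;

/* Parse a CSN string such as "4faabd9c000100020000". Returns 0 on success. */
static int
parse_csn(const char *s, csn_fields *out)
{
    assert(out != NULL);
    /* Require exactly 20 characters, all of them hex digits. */
    if (s == NULL || strlen(s) != 20 ||
        strspn(s, "0123456789abcdefABCDEF") != 20) {
        return -1;
    }
    if (sscanf(s, "%8lx%4x%4x%4x", &out->timestamp, &out->seqnum,
               &out->rid, &out->subseqnum) != 4) {
        return -1;
    }
    return 0;
}
```

Applied to the `{replica 2}` value above, the replica ID field decodes to 2 and the sequence number to 1.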
I'm resurrecting the review request...
Reviewed by Rich (Thank you!!!)
Pushed to master.
$ git merge trac359
Updating 4d7d59e..f0f74b5
Fast-forward
 ldap/servers/plugins/replication/csnpl.c   | 23 ++++++++++++---------
 ldap/servers/plugins/replication/llist.c   | 8 +++++-
 ldap/servers/plugins/usn/usn.c             | 4 ++-
 ldap/servers/slapd/back-ldbm/ldbm_add.c    | 29 ++++++++++++++-------------
 ldap/servers/slapd/back-ldbm/ldbm_modrdn.c | 29 ++++++++++++++-------------
 ldap/servers/slapd/plugin.c                | 10 +++++++-
 6 files changed, 60 insertions(+), 43 deletions(-)
$ git push
Counting objects: 29, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (15/15), done.
Writing objects: 100% (15/15), 2.36 KiB, done.
Total 15 (delta 12), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   4d7d59e..f0f74b5 master -> master
Pushed to 389-ds-base-1.2.10 branch.
$ git push origin ds1210-local:389-ds-base-1.2.10
Enter passphrase for key '/home/nhosoi/.ssh/id_rsa':
Counting objects: 29, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (15/15), done.
Writing objects: 100% (15/15), 2.54 KiB, done.
Total 15 (delta 12), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   4c31c0d..ed1ebf6 ds1210-local -> 389-ds-base-1.2.10
Pushed to 389-ds-base-1.2.11 branch.
$ git push origin ds1211-local:389-ds-base-1.2.11
Enter passphrase for key '/home/nhosoi/.ssh/id_rsa':
Counting objects: 39, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (23/23), done.
Writing objects: 100% (23/23), 3.28 KiB, done.
Total 23 (delta 18), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   6041d86..c89ea2f ds1211-local -> 389-ds-base-1.2.11
Linked to Bugzilla bug: https://bugzilla.redhat.com/show_bug.cgi?id=821176 (''Red Hat Enterprise Linux 6'')
git patch file (389-ds-base-1.2.10) 0013-Trac-Ticket-359-Database-RUV-could-mismatch-the-one.patch
0001-Trac-Ticket-359-Database-RUV-could-mismatch-the-one-.patch - csnplRollUp leak 0001-Trac-Ticket-359-Database-RUV-could-mismatch-the-one-.patch
ack on "0001-Trac-Ticket-359-Database-RUV-could-mismatch-the-one-.patch - csnplRollUp leak"
e5bdf55..4d982e5 389-ds-base-1.2.10 -> 389-ds-base-1.2.10
commit changeset:4d982e5/389-ds-base
f9dfeea..b5f3f98 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
commit changeset:b5f3f98/389-ds-base
59ac943..12567ff master -> master
commit changeset:12567ff/389-ds-base
Added initial screened field value.
I came across this ticket when investigating #49008.
The reason the RUV cannot be updated because of an uncommitted CSN in the pending list is this: when urp decides to ignore an operation, the csn is already in the pending list. urp sets an LDAP result code in the pblock and turns the operation into a NOOP (success). Later, send_ldap_result is called with err=0 and the result code is reset. An uncommitted csn would only be cancelled in process_postop, but since process_postop sees success, it does not cancel it.
In my opinion the correct solution would be to roll up the pending list only until the first uncommitted csn and to move the cancelling of uncommitted csns to the bepostop calls; maybe the error could be handled in write_changelog_and_ruv(), which is always called.
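The "roll up only until the first uncommitted csn" idea might look like the sketch below. It is an illustration under assumptions, using a simplified array-based pending list; `pl_entry` and `rollup_committed_prefix` are hypothetical names and this is not code from csnpl.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified pending-list entry. */
typedef struct {
    unsigned long csn; /* illustrative scalar; real CSNs are structured */
    int committed;     /* nonzero once the operation has been committed */
} pl_entry;

/*
 * Roll up the committed prefix of the list: count the leading committed
 * entries and report the largest CSN among them. The scan stops at the
 * first uncommitted entry, which stays in the list until it is committed
 * or cancelled. Returns the number of entries that could be removed.
 */
static size_t
rollup_committed_prefix(const pl_entry *list, size_t len,
                        unsigned long *max_csn)
{
    size_t n = 0;
    assert(max_csn != NULL);
    *max_csn = 0; /* 0 means "no committed CSN found" in this sketch */
    while (n < len && list[n].committed) {
        if (list[n].csn > *max_csn) {
            *max_csn = list[n].csn;
        }
        n++;
    }
    return n;
}
```

Under this approach an uncommitted entry at the head blocks the roll-up entirely, which is why the proposal pairs it with cancelling uncommitted csns reliably in the bepostop calls.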
Metadata Update from @nhosoi: - Issue assigned to nhosoi - Issue set to the milestone: 1.2.10
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/359
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)