#47304 reinitialization of a master with a disabled agreement hangs
Closed: wontfix None Opened 11 years ago by rcritten.

Tested in the context of IPA.

I installed two masters. On one master I disabled the replication agreement, then did a full re-init. The re-initialization never completes.

To reproduce:

  • ipa-server-install ...
  • ipa-replica-prepare ...
  • ipa-replica-install ...
  • ldapmodify -h replica ... nsds5ReplicaEnabled=off
  • ipa-replica-manage re-initialize --from=master

389-ds-base-1.3.0.2-1.fc18.x86_64

The context of this is backup and restore in a replicated environment. I'm looking for a way to pause replication when doing a restore, so that all non-restored masters can get re-initialized from the restored master.


Anything interesting in the errors log on the supplier or consumer?

Sorry, I've already removed the VMs.

My reason for wanting to do it this way is in the context of restoring a master to a known good state. I want to disable replication on all masters, restore one of them, then re-initialize all the other masters against the restored one.

I can enable replication again, then immediately do a re-initialize operation, but I have the feeling there is a small window where the replication plugin could start sending out updates before re-init happens.

My initialization is failing, not hanging, and it looks like this behavior is by design:

[27/Mar/2013:16:21:30 -0400] NSMMReplicationPlugin - Total update aborted: Replication agreement for "agmt="cn=to master 2" (localhost:16483)" can not be updated while the replica is disabled

Could this failure be the reason "ipa-replica-manage" is hanging? How does it detect the initialization worked/failed?

I guess this comes down to what it means to "disable" a replication agreement. Should that still block a total update? Tough to say, but I think it's a valid request, and it would make disabling replication agreements more robust.

Well I need to do some more investigation on this, but let me know more about how IPA detects the initialization status. Did you really hang, or was the script waiting on a response that never came due to the error above?

Ok, so it looks like we need a new option. A disabled agmt is a disabled agmt, period. There's just too much going on, or not going on, to allow for just total updates.

So what you really need is a start/stop replication agmt feature. This means it would allow updates (incremental/total), but it wouldn't send any updates out. Is that something that would work for you?

Replying to [comment:6 mreynolds]:

Ok, so it looks like we need a new option. A disabled agmt is a disabled agmt, period. There's just too much going on, or not going on, to allow for just total updates.

So what you really need is a start/stop replication agmt feature. This means it would allow updates (incremental/total), but it wouldn't send any updates out. Is that something that would work for you?

What if we had an nsds5ReplDisable: incoming/outgoing/both?

For the hang, yes, we check the replication status and apparently it never errored out (I didn't look at the response code). IPA considered it as still updating.

I'm actually fine having to re-enable replication when I do a re-initialization, it just feels like there would be a window where the db could send out updates in the current two-step process I use (enable, then do a re-init).

Would doing everything in the same update make a difference? Or is there a way I should wipe the changelog/database, then enable replication, then re-init?

Replying to [comment:8 rcritten]:

For the hang, yes, we check the replication status and apparently it never errored out (I didn't look at the response code).

When you say check the replication status what are you referring to? nsds5replicaLastUpdateStatus?

IPA considered it as still updating.

I'm actually fine having to re-enable replication when I do a re-initialization, it just feels like there would be a window where the db could send out updates in the current two-step process I use (enable, then do a re-init).

But even if updates do go out, that replica would get reinitialized anyway. So does it really matter in the end?

Would doing everything in the same update make a difference?

It should reduce the possibility of updates going out.

Or is there a way I should wipe the changelog/database, then enable replication, then re-init?

If you remove the changelog, then that master replica would need to be reinitialized as well. Not sure if I completely understand this sequence.

The IPA code checks for nsds5BeginReplicaRefresh. If it has no value yet we print "Update in progress". I saw this status for over 5 minutes when re-init generally takes 4 seconds. I didn't look at nsds5replicaUpdateInProgress or nsds5ReplicaLastInitStatus.

What I'm afraid of is the case where I have masters A and B. Let's say there have been a lot of recent changes, then I restore A. I don't want B to "catch up" A with the changes.

To do this I disable all replication agreements. I then want to re-initialize B from A but in order to do that I need to re-enable the agreement, then re-init. During that short period I don't want any changes on B to flow to A.

Ok, as for the IPA script and the disabled agmt, I do see nsds5BeginReplicaRefresh is set to "start" even though it failed:

nsds5ReplicaEnabled: off
nsds5replicaLastUpdateStatus: 0 Replica acquired successfully: agreement disabled
nsds5BeginReplicaRefresh: start
nsds5replicaLastInitStatus: 12 Total update aborted: Replication agreement for agmt="cn=MARK" (localhost:22222) can not be updated while the replica is disabled.
(If the suffix is disabled you must enable it then restart the server for replication to take place).

Are you saying nsds5BeginReplicaRefresh is not even present?

As for master B sending updates to master A before it can be initialized, you can always disable the agmt on B that points to A. Keep the repl agmt on A that points to B enabled, and initialize. Then on B re-enable the agmt to A.

Would this work for you?
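The ordering suggested above can be sketched as a small Python helper that only builds the LDIF for each step; it does not talk to a server. The function names, the "A"/"B" labels, and the agreement DNs (`cn=toB`, `cn=toA`) are hypothetical; the attribute names and the DN layout are taken from this ticket. It assumes the last step is applied only after the total update has finished.

```python
def modify_ldif(dn, attr, value):
    """Build one LDIF 'replace' modification for a replication agreement."""
    return "\n".join([
        f"dn: {dn}",
        "changetype: modify",
        f"replace: {attr}",
        f"{attr}: {value}",
        "",
    ])


def reinit_b_from_a(agmt_on_a_to_b, agmt_on_b_to_a):
    """Return (server, ldif) pairs in the order they would be applied."""
    return [
        # 1. On B: stop B from sending anything to A.
        ("B", modify_ldif(agmt_on_b_to_a, "nsds5ReplicaEnabled", "off")),
        # 2. On A: the A -> B agreement stays enabled; start the total update.
        ("A", modify_ldif(agmt_on_a_to_b, "nsds5BeginReplicaRefresh", "start")),
        # 3. On B: re-enable B -> A (assumed: only after step 2 has completed).
        ("B", modify_ldif(agmt_on_b_to_a, "nsds5ReplicaEnabled", "on")),
    ]


if __name__ == "__main__":
    suffix = r"cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config"
    for server, ldif in reinit_b_from_a(f"cn=toB,{suffix}", f"cn=toA,{suffix}"):
        print(f"# apply on master {server}")
        print(ldif)
```

Each printed record could be fed to ldapmodify on the named master. B never has an enabled agreement toward A until it has been re-initialized.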

I've made some code changes that alter the behavior. Now when you try to initialize a disabled agmt you get an error (before, it returned success):

ldapmodify -D cn=dm -w password
dn: cn=MARK,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
changetype: modify
replace: nsds5beginreplicaRefresh
nsds5beginreplicaRefresh: start

modifying entry "cn=MARK,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config"
ldap_modify: Server is unwilling to perform (53)
additional info: Replication agreement is disabled

[root@localhost ldif]# ldapsearch -xLLL -D cn=dm -w password -b "cn=MARK,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config" objectclass=top nsds5beginreplicaRefresh
dn: cn=MARK,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
nsds5beginreplicaRefresh: Total update aborted

I could also simply remove the attribute "nsds5beginreplicaRefresh".

The real question is, what can we do to make the IPA script detect the error?
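One possible answer, as a minimal sketch rather than what IPA actually does: have the polling code look at nsds5replicaLastInitStatus as well as nsds5BeginReplicaRefresh. The function name and return labels are hypothetical. It assumes the status value starts with a numeric code and that a non-zero code means failure, as in the "12 Total update aborted" value shown earlier; it does not guard against a stale status left over from a previous initialization.

```python
import re


def classify_init_state(attrs):
    """Return 'failed', 'in_progress' or 'done' for one agreement entry.

    attrs maps attribute names to a single string value.
    """
    # LDAP attribute names are case-insensitive, so normalize the keys.
    entry = {name.lower(): value for name, value in attrs.items()}

    # Assumption: the status string begins with an integer result code.
    last_init = entry.get("nsds5replicalastinitstatus", "")
    match = re.match(r"\s*(-?\d+)\b", last_init)
    if match and int(match.group(1)) != 0:
        return "failed"

    # The refresh attribute is still present, so the update is not finished.
    if "nsds5beginreplicarefresh" in entry:
        return "in_progress"
    return "done"


if __name__ == "__main__":
    # Attribute values as reported earlier in this ticket.
    disabled_agmt = {
        "nsds5ReplicaEnabled": "off",
        "nsds5BeginReplicaRefresh": "start",
        "nsds5replicaLastInitStatus": (
            '12 Total update aborted: Replication agreement for agmt="cn=MARK" '
            "(localhost:22222) can not be updated while the replica is disabled."
        ),
    }
    print(classify_init_state(disabled_agmt))
```

With this check, the entry reported above would be treated as a failure even though nsds5BeginReplicaRefresh is still set to "start".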

Here is the new behavior....

An error 53 is returned to the client (along with the error text), and the nsds5BeginReplicaRefresh attribute is removed from the replication agreement. Previously, the attribute was never removed and always stayed set to "start".
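A client could take advantage of this behavior roughly as follows. This is a sketch under stated assumptions, not the IPA implementation: `send_start` and `refresh_pending` are hypothetical callables standing in for the real LDAP modify and search, and the timeout is an added safeguard so that a script cannot report "Update in progress" indefinitely.

```python
import time

UNWILLING_TO_PERFORM = 53  # LDAP result code returned for a disabled agreement


class InitFailed(Exception):
    """Raised when the total update is refused or does not finish in time."""


def reinitialize(send_start, refresh_pending, timeout=60.0, interval=1.0,
                 clock=time.monotonic, sleep=time.sleep):
    """Start a total update and wait, with a deadline, for it to finish.

    send_start() -> int        hypothetical: sets nsds5BeginReplicaRefresh to
                               "start" and returns the LDAP result code
    refresh_pending() -> bool  hypothetical: True while the attribute is
                               still present on the agreement
    """
    code = send_start()
    if code == UNWILLING_TO_PERFORM:
        raise InitFailed("server refused the total update (agreement disabled?)")
    if code != 0:
        raise InitFailed(f"modify failed with LDAP result code {code}")

    # Poll until the attribute disappears, but never wait forever.
    deadline = clock() + timeout
    while refresh_pending():
        if clock() >= deadline:
            raise InitFailed(f"no completion after {timeout} seconds")
        sleep(interval)


if __name__ == "__main__":
    try:
        # Simulate the new server behavior for a disabled agreement.
        reinitialize(lambda: UNWILLING_TO_PERFORM, lambda: False)
    except InitFailed as err:
        print(err)
```

Because the attribute is now removed on failure, polling alone would see "not pending" and could mistake a refused update for a finished one; checking the result code of the modify is what distinguishes the two.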

Sending patch out for review...

git merge ticket47304
Updating 9d5dedd..db6bcd7
Fast-forward
ldap/servers/plugins/replication/repl5_agmt.c | 28 +++++++++++++-------
ldap/servers/plugins/replication/repl5_agmtlist.c | 23 ++++++++++------
2 files changed, 32 insertions(+), 19 deletions(-)

git push origin master
Counting objects: 15, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 1.37 KiB, done.
Total 8 (delta 6), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
9d5dedd..db6bcd7 master -> master

commit db6bcd7

Metadata Update from @mreynolds:
- Issue assigned to mreynolds
- Issue set to the milestone: 1.3.1

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/641

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

3 years ago
