#2303 Run CLEANRUV task when completely deleting a replica
Closed: Fixed. Opened 12 years ago by simo.

When a replica is disconnected or removed we need to run the CLEANRUV task in DS to avoid having hanging pointers to the old replica.


It may make sense to have this handled by an IPA plugin. The trigger would be a master server entry disappearing from the common tree. This would allow all servers to prune agreements and run CLEANRUV automatically when a master is removed. But this needs careful consideration.

Utilizing the ipa-replica-manage del command, I was able to successfully delete several problematic/broken/crashed servers from my FreeIPA replica pool... Almost a full month later, I came to find out that I had caused a serious problem, as there were orphaned modifications to my directory.

When I finally got around to performing the RUV cleanup manually, I ran head first into a 389 bug that caused my replica server to segfault; it continued to do so until I reinitialized it from one of the unfixed replica masters.

So it appears that the RUV task is not just a nice-to-have, but rather important, since it appears possible to cause a small amount of misalignment/corruption in the core of the 389 directory.

Putting back into needs triage.

Replying to [comment:7 jraquino]:

> Utilizing the ipa-replica-manage del command, I was able to successfully delete several problematic/broken/crashed servers from my FreeIPA replica pool... Almost a full month later, I came to find out that I had caused a serious problem, as there were orphaned modifications to my directory.
>
> When I finally got around to performing the RUV cleanup manually, I ran head first into a 389 bug that caused my replica server to segfault; it continued to do so until I reinitialized it from one of the unfixed replica masters.
>
> So it appears that the RUV task is not just a nice-to-have, but rather important, since it appears possible to cause a small amount of misalignment/corruption in the core of the 389 directory.

I think having the old replicas in the RUV was not the cause of the crashing. I believe it is, however, the cause of the annoying messages in your errors log about "unable to find CSN xxx" (where the replica ID part of the CSN is the deleted replica), and of "NSMMReplicationPlugin - repl_set_mtn_referrals: could not set referrals for replica - err 20" https://fedorahosted.org/389/ticket/282

I believe running CLEANRUV will clean up these and similar error messages. Running CLEANRUV, however, will not prevent crashes. So yes, running CLEANRUV is better than a "nice to have", but it is not a necessity.

Per IRC conversation with richm, the cleanup task is one that wants to be scripted and performed against all replica partners following the deletion. Otherwise, it requires a single master server to be cleaned up followed by a re-initialization of all replica partners, which is more costly.

JR, I don't follow. If a server gets removed then we need to run the cleanup task on the remote servers that had agreements with it? If so that bumps the scope up a bit as we'd need a 389-ds plugin to catch that.

Replying to [comment:13 rcritten]:

> JR, I don't follow. If a server gets removed then we need to run the cleanup task on the remote servers that had agreements with it?

Yes. Any server that may have the removed server listed as one of the RUV elements.

> If so that bumps the scope up a bit as we'd need a 389-ds plugin to catch that.

How so? That is, how does a 389 plugin running on server A know that server B has been removed as a replica?

Replying to [comment:14 rmeggins]:

> How so? That is, how does a 389 plugin running on server A know that server B has been removed as a replica?

Well, right. Maybe it's my lack of understanding what we need to do for CLEANRUV. I'm still unclear on which host(s) we need to do anything when removing a replica.

My reading of JR's comment was "any host that knew anything about a replica that is deleted needs a task run".

Replying to [comment:16 rcritten]:

> Replying to [comment:14 rmeggins]:
>
> > How so? That is, how does a 389 plugin running on server A know that server B has been removed as a replica?
>
> Well, right. Maybe it's my lack of understanding what we need to do for CLEANRUV. I'm still unclear on which host(s) we need to do anything when removing a replica.
>
> My reading of JR's comment was "any host that knew anything about a replica that is deleted needs a task run".

Once you create a replica, the information about that replica is propagated to all other replicas (eventually, depending on the speed of replication) and stored in the RUV tombstone entry. So when you remove a replica, you should also run the CLEANRUV task to remove the information about that replica from all other replicas.
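
For context, a rough sketch of what kicking off the task on a single replica looks like, based on my understanding that 389-ds runs CLEANRUV when an nsds5task: CLEANRUV<rid> value is written to the replica configuration entry; the host, credentials, suffix DN, and replica ID below are placeholders:

    # Rough sketch only: trigger CLEANRUV for (placeholder) replica ID 7 on
    # one server. Host, credentials and suffix are made-up examples.
    import ldap

    conn = ldap.initialize('ldap://replica1.example.com')
    conn.simple_bind_s('cn=directory manager', 'password')

    # The replica configuration entry lives under cn=mapping tree,cn=config,
    # keyed by the suffix; exact quoting/escaping may differ per deployment.
    replica_dn = 'cn=replica,cn="dc=example,dc=com",cn=mapping tree,cn=config'

    # Writing nsds5task: CLEANRUV<rid> asks the server to purge that replica
    # ID from its local RUV.
    conn.modify_s(replica_dn, [(ldap.MOD_REPLACE, 'nsds5task', b'CLEANRUV7')])
    conn.unbind_s()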

Ok, I was reading JR right then. The big question is: how can I signal all the other replicas to run the task? My plugin thinking was bogus; I was thinking something would be replicated and I could trigger on that, but this happens in cn=config.

We delegate some permissions to search for agreements, and I'm pretty sure write access as well, so I may have all the pieces I need. I guess the algorithm is something like:

    masters = search('cn=masters,cn=ipa,cn=etc,$SUFFIX')
    for master in masters:
        agreements = find_agreements(master)
        for agreement in agreements:
            cleanruv_task(agreement)
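
Purely as an illustration of what the hypothetical find_agreements helper above might do (this is not the attached patch), it could take an already-bound LDAP connection to the master in question and search for nsds5replicationagreement entries under cn=config, e.g. with python-ldap:

    # Illustrative guess at a find_agreements helper; not the attached patch.
    import ldap

    def find_agreements(conn):
        """Return DNs of replication agreements configured on this server."""
        return [dn for dn, attrs in conn.search_s(
            'cn=mapping tree,cn=config',
            ldap.SCOPE_SUBTREE,
            '(objectclass=nsds5replicationagreement)',
            ['nsDS5ReplicaHost'])]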

-Patch attached-

Instructions for testing this Patch:

  1. Set up at least 3 FreeIPA replica servers.

  2. Perform the following search on one of the servers to verify the replica IDs in the tombstone:
    $ ldapsearch -xLLL -D "cn=directory manager" -W -b dc=example,dc=com \
    '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))'

  3. Verify that all 3 servers are present in the replica list:
    $ ipa-replica-manage list

  4. Delete one of the replicas:
    $ ipa-replica-manage del ipa#.example.com

  5. Re-run the tombstone search on all remaining servers to confirm the RUV entry has been cleaned:
    $ ldapsearch -xLLL -D "cn=directory manager" -W -b dc=example,dc=com \
    '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))'

  6. Verify that the replica server has been deleted:
    $ ipa-replica-manage list
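
If it helps, the tombstone check in step 5 can be scripted rather than run by hand on each server; a rough python-ldap sketch, where the host list, credentials, and deleted replica ID are placeholders:

    # Rough sketch of step 5: check each remaining server's RUV tombstone
    # for the deleted replica ID. Hosts, credentials and RID are placeholders.
    import ldap

    HOSTS = ['ipa1.example.com', 'ipa2.example.com']
    DELETED_RID = 7
    FILTER = ('(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)'
              '(objectclass=nstombstone))')

    for host in HOSTS:
        conn = ldap.initialize('ldap://%s' % host)
        conn.simple_bind_s('cn=directory manager', 'password')
        entries = conn.search_s('dc=example,dc=com', ldap.SCOPE_SUBTREE,
                                FILTER, ['nsds50ruv'])
        for dn, attrs in entries:
            ruvs = [v.decode('utf-8') for v in attrs.get('nsds50ruv', [])]
            # Each value looks like '{replica <rid> ldap://host:port} ...';
            # the deleted RID should no longer appear.
            stale = [r for r in ruvs if '{replica %d ' % DELETED_RID in r]
            print(host, 'CLEAN' if not stale else 'STALE: %s' % stale)
        conn.unbind_s()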
(03:22:13 PM) JrAquino: simo: well. i will say that this is going to suck really bad now that the RUV thing isn't making it into 2.2
(03:22:35 PM) JrAquino: simo: we've found that if you clean an ruv on a server and forget to do the same on any of the others. you'll stop replicating with them...

Simo suggested that a 389-ds-base plugin would be a better solution, because the current approach relies on all servers being up and reachable.

The plugin would monitor the list of masters, and if one is removed, the cleanup for that master would be run locally. Because the deletion would be replicated, each master would perform the cleanup as soon as it received that update, ensuring the removed master is cleaned up everywhere.

I agree with the above; however, Rich will need to confirm. My understanding was that the tombstone entries are NOT replicated, and that something would need to run on every replica in the topology.

Rich, is the above doable with a 389 plugin?

The current approach behaves strangely when one of the replicas is down while ipa-replica-manage del is cleaning RUVs. The RUV is then not cleared on that replica (as expected).

However, when I started the replica that had been down during the RUV cleanup, the tombstone was replicated again to all replicas. This would mean that a manual CLEANRUV would have to be run on all replicas again, not just on the one that was down. Are we OK with this?

Replying to [comment:25 mkosek]:

> The current approach behaves strangely when one of the replicas is down while ipa-replica-manage del is cleaning RUVs. The RUV is then not cleared on that replica (as expected).
>
> However, when I started the replica that had been down during the RUV cleanup, the tombstone was replicated again to all replicas. This would mean that a manual CLEANRUV would have to be run on all replicas again, not just on the one that was down. Are we OK with this?

That was unexpected. We need to revisit the behavior of 389 to better understand how to address this issue. Originally, my understanding was that tombstone data was not replicated, but this is not the case... I believe Simo is probably right and we are going to need 389 to detect the deletion of a replica peer and trigger a CLEANRUV task.

The problem of unclean servers causing replication to stop or partially stop is a troubling one, but it seems that attacking this problem from the outside might not be the most efficient method to address the issue.

We need a solution that can address the problem when 'a' server is down, or when 'the' server being deleted is down. It can be very problematic for a server to recover from some downtime only to poison the rest of the replica pool with data that should have been purged.

Moving to next month iteration.

A couple of weeks ago Simo, Rich, Martin, and I had a discussion on how best to proceed with RUV cleanup. The current process is very delicate and can easily be broken if a single replica is not completely cleaned up. In effect, if you miss one (because it is down, slow, unreachable, etc.), then when it comes back online it will simply undo all the cleaning already done.

We agreed that this needs to happen at the 389-ds level. This is being tracked as ticket https://fedorahosted.org/389/ticket/337

This is going to require some code changes to ipa-replica-manage to put the replica into read-only, track the cleaning, etc.

To verify that the CLEANALLRUV task was successful, run this on the remaining replicas to be sure that the removed replica ID is gone:

ldapsearch -xLLL -D "cn=directory manager" -w password -h localhost -b "dc=example,dc=com" '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' nsds50ruv
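
For completeness, my understanding (to be confirmed against the 389-ds documentation) is that the CLEANALLRUV work in that ticket is driven by a task entry under cn=cleanallruv,cn=tasks,cn=config; a hedged sketch of creating one with python-ldap, where the host, credentials, suffix, and replica ID are placeholders:

    # Hedged sketch: launch a CLEANALLRUV task for (placeholder) replica ID 7.
    # The task entry layout reflects my reading of the 389-ds interface and
    # should be verified against the 389-ds documentation before use.
    import ldap
    import ldap.modlist

    conn = ldap.initialize('ldap://ipa1.example.com')
    conn.simple_bind_s('cn=directory manager', 'password')

    task_dn = 'cn=clean 7,cn=cleanallruv,cn=tasks,cn=config'
    task_attrs = {
        'objectclass': [b'top', b'extensibleObject'],
        'cn': [b'clean 7'],
        'replica-base-dn': [b'dc=example,dc=com'],
        'replica-id': [b'7'],
    }
    conn.add_s(task_dn, ldap.modlist.addModlist(task_attrs))
    conn.unbind_s()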

Moving to next milestone. The 389-ds team is revising the CLEANRUV work. Need to wait for that to be done.

Metadata Update from @simo (7 years ago):
- Issue assigned to rcritten
- Issue set to the milestone: FreeIPA 3.0 RC1
