#2303 Run CLEANRUV task when completely deleting a replica
Closed: Fixed. Opened 12 years ago by simo.

When a replica is disconnected or removed we need to run the CLEANRUV task in DS to avoid having hanging pointers to the old replica.


It may make sense to have this handled by an IPA plugin. The trigger would be a master server entry disappearing from the common tree. This would allow all servers to prune agreements and run CLEANRUV automatically when a master is removed. But this needs careful consideration.

Utilizing the ipa-replica-manage del command, I was able to successfully delete several problematic/broken/crashed servers from my FreeIPA replica pool... Almost a full month later, I came to find out that I had caused a serious problem, as there were orphaned modifications to my directory.

When I finally got around to performing the RUV cleanup manually, I ran head first into a 389 bug that caused my replica server to segfault; it continued to do so until I reinitialized it from one of the unfixed replica masters.

So it appears that the RUV task is not just a nice-to-have, but rather important, since it appears possible to cause a small amount of misalignment/corruption in the core of the 389 directory.

Putting back into needs triage.

Replying to [comment:7 jraquino]:

> Utilizing the ipa-replica-manage del command, I was able to successfully delete several problematic/broken/crashed servers from my FreeIPA replica pool... Almost a full month later, I came to find out that I had caused a serious problem, as there were orphaned modifications to my directory.
>
> When I finally got around to performing the RUV cleanup manually, I ran head first into a 389 bug that caused my replica server to segfault; it continued to do so until I reinitialized it from one of the unfixed replica masters.
>
> So it appears that the RUV task is not just a nice-to-have, but rather important, since it appears possible to cause a small amount of misalignment/corruption in the core of the 389 directory.

I think having the old replicas in the RUV was not the cause of the crashing. I believe it is, however, the cause of the annoying messages in your errors log about "unable to find CSN xxx" (where the replica ID part of the CSN is the deleted replica), and of "NSMMReplicationPlugin - repl_set_mtn_referrals: could not set referrals for replica - err 20" https://fedorahosted.org/389/ticket/282

I believe running CLEANRUV will clean up these and similar error messages. Running CLEANRUV, however, will not prevent crashes. So yes, running CLEANRUV is better than a "nice to have", but it is not a necessity.

Per IRC conversation with richm, the cleanup task is one that wants to be scripted and performed against all replica partners following the deletion. Otherwise, it requires a single master server to be cleaned up followed by a re-initialization of all replica partners, which is more costly.

JR, I don't follow. If a server gets removed then we need to run the cleanup task on the remote servers that had agreements with it? If so that bumps the scope up a bit as we'd need a 389-ds plugin to catch that.

Replying to [comment:13 rcritten]:

> JR, I don't follow. If a server gets removed then we need to run the cleanup task on the remote servers that had agreements with it?

Yes. Any server that may have the removed server listed as one of the RUV elements.

> If so that bumps the scope up a bit as we'd need a 389-ds plugin to catch that.

How so? That is, how does a 389 plugin running on server A know that server B has been removed as a replica?

Replying to [comment:14 rmeggins]:

> How so? That is, how does a 389 plugin running on server A know that server B has been removed as a replica?

Well, right. Maybe it's my lack of understanding what we need to do for CLEANRUV. I'm still unclear on which host(s) we need to do anything when removing a replica.

My reading of JR's comment was "any host that knew anything about a replica that is deleted needs a task run".

Replying to [comment:16 rcritten]:

> Replying to [comment:14 rmeggins]:
>
> > How so? That is, how does a 389 plugin running on server A know that server B has been removed as a replica?
>
> Well, right. Maybe it's my lack of understanding what we need to do for CLEANRUV. I'm still unclear on which host(s) we need to do anything when removing a replica.
>
> My reading of JR's comment was "any host that knew anything about a replica that is deleted needs a task run".

Once you create a replica, the information about that replica is propagated to all other replicas (eventually, depending on the speed of replication) and stored in the RUV tombstone entry. So when you remove a replica, you should also run the CLEANRUV task to remove the information about that replica from all other replicas.
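
For context, a rough sketch of what kicking off the task on a single replica looks like, based on my understanding that 389-ds runs CLEANRUV when an nsds5task: CLEANRUV<rid> value is written to the replica configuration entry; the host, credentials, suffix DN, and replica ID below are placeholders:

    # Rough sketch only: trigger CLEANRUV for (placeholder) replica ID 7 on
    # one server. Host, credentials and suffix are made-up examples.
    import ldap

    conn = ldap.initialize('ldap://replica1.example.com')
    conn.simple_bind_s('cn=directory manager', 'password')

    # The replica configuration entry lives under cn=mapping tree,cn=config,
    # keyed by the suffix; exact quoting/escaping may differ per deployment.
    replica_dn = 'cn=replica,cn="dc=example,dc=com",cn=mapping tree,cn=config'

    # Writing nsds5task: CLEANRUV<rid> asks the server to purge that replica
    # ID from its local RUV.
    conn.modify_s(replica_dn, [(ldap.MOD_REPLACE, 'nsds5task', b'CLEANRUV7')])
    conn.unbind_s()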

Ok, I was reading JR right then. The big question is: how can I signal all the other replicas to run the task? My plugin thinking was bogus; I was thinking something would be replicated and I could trigger on that, but this happens in cn=config.

We delegate some permissions to search for agreements, and I'm pretty sure write access as well, so I may have all the pieces I need. I guess the algorithm is something like:

    masters = search('cn=masters,cn=ipa,cn=etc,$SUFFIX')
    for master in masters:
        agreements = find_agreements(master)
        for agreement in agreements:
            cleanruv_task(agreement)
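
Purely as an illustration of what the hypothetical find_agreements helper above might do (this is not the attached patch), it could take an already-bound LDAP connection to the master in question and search for nsds5replicationagreement entries under cn=config, e.g. with python-ldap:

    # Illustrative guess at a find_agreements helper; not the attached patch.
    import ldap

    def find_agreements(conn):
        """Return DNs of replication agreements configured on this server."""
        return [dn for dn, attrs in conn.search_s(
            'cn=mapping tree,cn=config',
            ldap.SCOPE_SUBTREE,
            '(objectclass=nsds5replicationagreement)',
            ['nsDS5ReplicaHost'])]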

-Patch attached-

Instructions for testing this Patch:

  1. Set up at least 3 FreeIPA replica servers.

  2. Perform the following search on one of the servers to verify the replica IDs in the tombstone:
    $ ldapsearch -xLLL -D "cn=directory manager" -W -b dc=example,dc=com \
    '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))'

  3. Verify that all 3 servers are present in the replica list:
    $ ipa-replica-manage list

  4. Delete one of the replicas:
    $ ipa-replica-manage del ipa#.example.com

  5. Re-run the tombstone search on all remaining servers to confirm the RUV entry has been cleaned:
    $ ldapsearch -xLLL -D "cn=directory manager" -W -b dc=example,dc=com \
    '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))'

  6. Verify that the replica server has been deleted:
    $ ipa-replica-manage list
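
If it helps, the tombstone check in step 5 can be scripted rather than run by hand on each server; a rough python-ldap sketch, where the host list, credentials, and deleted replica ID are placeholders:

    # Rough sketch of step 5: check each remaining server's RUV tombstone
    # for the deleted replica ID. Hosts, credentials and RID are placeholders.
    import ldap

    HOSTS = ['ipa1.example.com', 'ipa2.example.com']
    DELETED_RID = 7
    FILTER = ('(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)'
              '(objectclass=nstombstone))')

    for host in HOSTS:
        conn = ldap.initialize('ldap://%s' % host)
        conn.simple_bind_s('cn=directory manager', 'password')
        entries = conn.search_s('dc=example,dc=com', ldap.SCOPE_SUBTREE,
                                FILTER, ['nsds50ruv'])
        for dn, attrs in entries:
            ruvs = [v.decode('utf-8') for v in attrs.get('nsds50ruv', [])]
            # Each value looks like '{replica <rid> ldap://host:port} ...';
            # the deleted RID should no longer appear.
            stale = [r for r in ruvs if '{replica %d ' % DELETED_RID in r]
            print(host, 'CLEAN' if not stale else 'STALE: %s' % stale)
        conn.unbind_s()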
(03:22:13 PM) JrAquino: simo: well. i will say that this is going to suck really bad now that the RUV thing isn't making it into 2.2
(03:22:35 PM) JrAquino: simo: we've found that if you clean an ruv on a server and forget to do the same on any of the others. you'll stop replicating with them...

Simo suggested that a 389-ds-base plugin would be a better solution, because the current approach relies on all servers being up and reachable.

The plugin would monitor the list of masters, and if one is removed, the cleanup for that master would be run locally. Because the deletion would be replicated, each master would perform the cleanup as soon as it received that update, ensuring the removed master is cleaned up everywhere.

I agree with the above; however, Rich will need to confirm. My understanding was that the tombstone entries are NOT replicated, and that something would need to run on every replica in the topology.

Rich, is the above doable with a 389 plugin?

The current approach behaves strangely when one of the replicas is down while ipa-replica-manage del is cleaning RUVs. The RUV is then not cleared on that replica (as expected).

However, when I started the replica that had been down during the RUV cleanup, the tombstone was replicated again to all replicas. This would mean that a manual CLEANRUV would have to be run on all replicas again, not just on the one that was down. Are we OK with this?

Replying to [comment:25 mkosek]:

> The current approach behaves strangely when one of the replicas is down while ipa-replica-manage del is cleaning RUVs. The RUV is then not cleared on that replica (as expected).
>
> However, when I started the replica that had been down during the RUV cleanup, the tombstone was replicated again to all replicas. This would mean that a manual CLEANRUV would have to be run on all replicas again, not just on the one that was down. Are we OK with this?

That was unexpected. We need to revisit the behavior of 389 to better understand how to address this issue. Originally, my understanding was that tombstone data was not replicated, but this is not the case... I believe Simo is probably right and we are going to need 389 to detect the deletion of a replica peer and trigger a CLEANRUV task.

The problem of unclean servers causing replication to stop or partially stop is a troubling one, but it seems that attacking this problem from the outside might not be the most efficient method to address the issue.

We need a solution that can address the problem when 'a' server is down, or when 'the' server being deleted is down. It can be very problematic for a server to recover from some downtime only to poison the rest of the replica pool with data that should have been purged.

Moving to next month iteration.

A couple of weeks ago Simo, Rich, Martin, and I had a discussion on how best to proceed with RUV cleanup. The current process is very delicate and can easily be broken if a single replica is not completely cleaned up. In effect, if you miss one (because it is down, slow, unreachable, etc.), then when it comes back online it will simply undo all the cleaning already done.

We agreed that this needs to happen at the 389-ds level. This is being tracked as ticket https://fedorahosted.org/389/ticket/337

This is going to require some code changes to ipa-replica-manage to put the replica into read-only, track the cleaning, etc.

To verify that the CLEANALLRUV task was successful, run this on the remaining replicas to be sure that the removed replica ID is gone:

ldapsearch -xLLL -D "cn=directory manager" -w password -h localhost -b "dc=example,dc=com" '(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' nsds50ruv
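
For completeness, my understanding (to be confirmed against the 389-ds documentation) is that the CLEANALLRUV work in that ticket is driven by a task entry under cn=cleanallruv,cn=tasks,cn=config; a hedged sketch of creating one with python-ldap, where the host, credentials, suffix, and replica ID are placeholders:

    # Hedged sketch: launch a CLEANALLRUV task for (placeholder) replica ID 7.
    # The task entry layout reflects my reading of the 389-ds interface and
    # should be verified against the 389-ds documentation before use.
    import ldap
    import ldap.modlist

    conn = ldap.initialize('ldap://ipa1.example.com')
    conn.simple_bind_s('cn=directory manager', 'password')

    task_dn = 'cn=clean 7,cn=cleanallruv,cn=tasks,cn=config'
    task_attrs = {
        'objectclass': [b'top', b'extensibleObject'],
        'cn': [b'clean 7'],
        'replica-base-dn': [b'dc=example,dc=com'],
        'replica-id': [b'7'],
    }
    conn.add_s(task_dn, ldap.modlist.addModlist(task_attrs))
    conn.unbind_s()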

Moving to next milestone. The 389-ds team is revising the CLEANRUV work. Need to wait for that to be done.

Metadata Update from @simo (7 years ago):
- Issue assigned to rcritten
- Issue set to the milestone: FreeIPA 3.0 RC1
