#4279 DS replication errors on unittests
Closed: Fixed None Opened 10 years ago by mkosek.

I installed a FreeIPA master on Fedora 19 (freeipa-server-3.3.5-1.fc19.x86_64, 389-ds-base-1.3.1.22-1.fc19.x86_64) and FreeIPA replica with PKI on Fedora 20 (freeipa-server-3.3.5-1.fc20.x86_64, 389-ds-base-1.3.2.16-1.fc20.x86_64)

Then I run our unit test suite on the replica and saw many replication errors.

On F20 replica:

...
[28/Mar/2014:13:03:25 +0100] NSMMReplicationPlugin - Schema agmt="cn=meTovm-119.idm.lab.bos.redhat.com" (vm-119:389) must not be overwritten (set replication log for additional info)
...
[28/Mar/2014:13:04:02 +0100] ldbm_back_modify - Attempt to modify a tombstone entry nsuniqueid=b88baafb-b67011e3-be6ec657-72df99fa,cn=hostgroup2,cn=hostgroups,cn=accounts,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com
...
[28/Mar/2014:13:04:21 +0100] NSMMReplicationPlugin - agmt="cn=meTovm-119.idm.lab.bos.redhat.com" (vm-119:389): Warning: unable to replicate schema: rc=1

On F19 master:

...
[28/Mar/2014:08:03:42 -0400] NSMMReplicationPlugin - agmt="cn=meTovm-236.idm.lab.eng.brq.redhat.com"    (vm-236:389): Consumer failed to replay change (uniqueid b88baaf3-b67011e3-be6ec657-72df99fa, CSN       5335668f000000040000): Server is unwilling to perform (53). Will retry later.
...
(vm-236:389): Consumer failed to replay change (uniqueid dc4ef003-b67011e3-be6ec657-72df99fa, CSN       5335668b000100040000): Server is unwilling to perform (53). Will retry later.
...
[28/Mar/2014:08:21:58 -0400] managed-entries-plugin - mep_pre_op: Unable to fetch origin entry "cn=hg1, cn=hostgroups,cn=accounts,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com".

During the tests I noticed at least one unreplicated change - a user that was already deleted on master was present on the replica. It eventually disappeared after few minutes though.


DS errors in F19 FreeIPA master
errors-f19.gz

DS errors in F20 FreeIPA replica
errors-f20.gz

just before the "tombstone" errors start there are also error messages about schema replication failures.

Could this issue be related to 389 ticket #47721 which addresses schema replication problems between f19 and f20 ?

The schema messages are related to 47721:

[28/Mar/2014:13:03:34 +0100] NSMMReplicationPlugin - Schema agmt="cn=meTovm-119.idm.lab.bos.redhat.com" (vm-119:389) must not be overwritten (set replication log for additional info)

The tombstone messages comes from Ticket 47396 - crash on modrdn of tombstone
This fix was to prevent ldap client to MODRDN tombstones

Oppss click 'submit' too fast.
About 47396, I wonder if the operation should not be successful if coming from a replication session.
Of course legacy version still accept those operations and new versions have to handle them.

Thanks for info Thierry. Let me then assign this ticket to you so that you can verify also this scenario after you fix 47721.

About 47396 - can you/Ludwig/DS team discuss what can we do about it? Do you need a 389 ticket for that? That long slew of errors could cloud any real errors, so IMO we should either lower the level of the log or do a fix similar to what you said.

I tried to reproduce the problem but failed !

I created two DS masters (1.3.4 (to mimic F20) - 1.3.1 (to mimic F19)).
I was unable to update/modrdn a tombstone entry on F19. In fact the fix 47396 is fixed in 1.3.1.

So a direct ldap client MOD would not be allowed on F19 as well as F20. I wonder how this update was introduced.
My only guess is that an IPA plugin generated this operation as an internal MOD and later this MOD was replicated.

Do you know what steps (IPA or DS) I have to do to reproduce this issue ?

  1. Install FreeIPA on F19 - ipa-server-install
  2. Install FreeIPA replica on F20 - ipa-replica-install --setup-ca REPLICA_INFO_FILE (with REPLICA_INFO_FILE created on F19 by ipa-replica-prepare)
  3. When F20 replica is installed, run the FreeIPA unit tests I mentioned in the ticket:
    • mkdir -p ~/.ipa/
    • ln -sf /etc/ipa/default.conf ~/.ipa/default.conf
    • git clone git://git.fedorahosted.org/git/freeipa.git -b ipa-3-3 freeipa
    • cd freeipa
    • make version-update
    • kinit admin
    • ./make-testcert
    • ./make-test

More information about testing can be found in FreeIPA Testing page.

This ticket reports two issues:

- Schema messages are related to https://fedorahosted.org/389/ticket/47721, that is now fixed
- Replication messages (failure of a replicated MOD) is related to https://fedorahosted.org/389/ticket/47787 (under review). 
- Investigating the RC of the replication messages I hit the three following bugs that are under triage:

https://fedorahosted.org/389/ticket/47788

https://fedorahosted.org/389/ticket/47783

https://fedorahosted.org/389/ticket/47784

I will try to reproduce https://fedorahosted.org/freeipa/ticket/4279 to check that https://fedorahosted.org/389/ticket/47787 fix fixes it.

Thanks for info Thierry, I like the progress, I am glad we captured some bugs with this effort.

  • I opened a twin ticker in DS to track the replicated MOD failure (https://fedorahosted.org/389/ticket/47787)

  • I implemented a fix. that is now reviewed

  • test on 389-DS are successful

  • test with freeipa unit tests creates a hang (deadlock ?) that I detailed in https://fedorahosted.org/389/ticket/47787

  • I know that the fix for 47787 was part of the scenario that created the hang. Now I do not know if the fix just reveal a potential hang or was the cause of the hang

  • running freeipa unit tests I can reproduce 389-ds hang: https://fedorahosted.org/389/ticket/47797

  • This hang is not systematic and can be CPU intensive loop (default) or deadlock if txn are created with NOWAIT.

  • The hang is not related to https://fedorahosted.org/389/ticket/47787 but to several threads accessing retroCL.

  • Need to verify that freeipa unit tests (when they do not hang) are successful to MOD a tombstone.

The hang being not systematic, I was able to test successfully the fix with freeipa unit tests.

----------------------------------------------------------------------
Ran 1744 tests in 794.788s

FAILED (SKIP=64, errors=11, failures=1)

During that run, the fix works as it was able to apply successfully a MOD on a tombstone (if coming from replication) and replication continue without problem.

[06/May/2014:06:36:05 -0400] conn=175 fd=97 slot=97 connection from 10.16.78.61 to 10.16.78.61
...
[06/May/2014:06:36:06 -0400] conn=175 op=5 DEL dn="cn=group1,cn=groups,cn=accounts,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com"
...
[06/May/2014:06:36:06 -0400] conn=27 op=88 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
[06/May/2014:06:36:06 -0400] conn=27 op=88 RESULT err=0 tag=120 nentries=0 etime=0
[06/May/2014:06:36:06 -0400] conn=27 op=89 MOD dn="cn=ipausers,cn=groups,cn=accounts,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com"
...
[06/May/2014:06:36:06 -0400] conn=175 op=5 RESULT err=0 tag=107 nentries=0 etime=0 csn=5368bb1b000200090000
[06/May/2014:06:36:06 -0400] conn=27 op=89 RESULT err=0 tag=103 nentries=0 etime=0 csn=5368bb1a000200040000
[06/May/2014:06:36:06 -0400] conn=27 op=90 MOD dn="cn=group1,cn=groups,cn=accounts,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com"
[06/May/2014:06:36:06 -0400] conn=27 op=90 RESULT err=0 tag=103 nentries=0 etime=0 csn=5368bb1a000300040000
[06/May/2014:06:36:06 -0400] conn=27 op=91 MOD dn="cn=ipausers,cn=groups,cn=accounts,dc=idm,dc=lab,dc=bos,dc=redhat,dc=com"
[06/May/2014:06:36:06 -0400] conn=27 op=91 RESULT err=0 tag=103 nentries=0 etime=0 csn=5368bb1a000400040000

Bugs https://fedorahosted.org/389/ticket/47721 and ​https://fedorahosted.org/389/ticket/47787 are both fixed in 1.3.2 branch. Waiting for 1.3.2.17 availability (should be available) to check I can not reproduce the failure with new DS version.

Result of the tests:

F19 master  (389-ds-base-1.3.2.19-1 / freeipa-server-3.3.5-1)
F20 replica (389-ds-base-1.3.1.22-1 / freeipa-server-3.3.5-1)

Unit tests run on replica F20

Replication issue (https://fedorahosted.org/389/ticket/47787) is not reproduced

Schema issue (https://fedorahosted.org/389/ticket/47721). It logs a couple of transient messages that are expected :

[14/Aug/2014:11:39:38 -0400] schema - [C] Local objectClasses must not be overwritten (set replication log for additional info)
[14/Aug/2014:11:45:44 -0400] NSMMReplicationPlugin - [S] Schema agmt="cn=meTovm-116.idm.lab.bos.redhat.com" (vm-116:389) must not be overwritten (set replication log for additional info)

Those messages states that the replica acting as a 'C'onsumer then as a 'S'upplier does not accept/send its schema. But during this phase, it learns what is in extra/missing in the master schema and during the next replication session, schema on both side are in sync.

Both bugs are verified. Closing that bug

Metadata Update from @mkosek:
- Issue assigned to tbordaz
- Issue set to the milestone: FreeIPA 3.3.6 (bug fixing)

7 years ago

Login to comment on this ticket.

Metadata