#47954 db deadlock with busy replication
Closed: wontfix None Opened 9 years ago by pj101.

I've continued testing 389ds v.1.3.2.24 on CentOS 7.

Here is another issue with replication :
i have two servers with multi-master agreements on each of them (the same configuration as in ticket https://fedorahosted.org/389/ticket/47942).

We add/delete a lot of groups (943, to be exact). Each group may contain a large number of referenced entries, up to ~250 (uniqueMember: dn). MemberOf plugin is activated and works fine. Referential integrity plugin is also activated but of course it is of any sense only when deleting groups (or renaming them). It goes on for a long time (20-30 minutes or more). Some time after the beginning of the operations (typically 5-8 minutes) we have replication erros and inconsistency of the replica concerning the entries mentioned in error log.

When adding and deleting groups the supplier is ok. Howevere the consumer has several (from one to four or five) groupe deletions/adds that are not replicated. The error on the supplier:

[12/Nov/2014:16:46:42 +0100] NSMMReplicationPlugin - agmt="cn=Replication from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr" (ldap-model:636): Consumer failed to replay change (uniqueid fa90219d-6a8211e4-a42c901a-94623bee, CSN 546380d6000000020000): Operations error (1). Will retry later.
[12/Nov/2014:16:47:55 +0100] NSMMReplicationPlugin - agmt="cn=Replication from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr" (ldap-model:636): Consumer failed to replay change (uniqueid 1e5367ae-6a8311e4-a42c901a-94623bee, CSN 54638125000000020000): Operations error (1). Will retry later.
[12/Nov/2014:16:53:14 +0100] NSMMReplicationPlugin - agmt="cn=Replication from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr" (ldap-model:636): Consumer failed to replay change (uniqueid f4e70b85-6a8311e4-a42c901a-94623bee, CSN 54638260000000020000): Operations error (1). Will retry later.
[12/Nov/2014:16:55:12 +0100] NSMMReplicationPlugin - agmt="cn=Replication from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr" (ldap-model:636): Consumer failed to replay change (uniqueid 3c6d978a-6a8411e4-a42c901a-94623bee, CSN 546382d6000400020000): Operations error (1). Will retry later.
[12/Nov/2014:16:56:31 +0100] NSMMReplicationPlugin - agmt="cn=Replication from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr" (ldap-model:636): Consumer failed to replay change (uniqueid 6030dd93-6a8411e4-a42c901a-94623bee, CSN 54638325000000020000): Operations error (1). Will retry later.
[12/Nov/2014:16:57:22 +0100] NSMMReplicationPlugin - agmt="cn=Replication from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr" (ldap-model:636): Consumer failed to replay change (uniqueid 83f42395-6a8411e4-a42c901a-94623bee, CSN 5463835d000000020000): Operations error (1). Will retry later.

The corresponding errors on the consumer seem to hint deadlocks in these cases:
[12/Nov/2014:16:46:41 +0100] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=546380d6000000020000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[12/Nov/2014:16:46:41 +0100] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (546380d6000000020000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[12/Nov/2014:16:46:41 +0100] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for cn=LAN452ESP-2014,ou=2014,ou=Cours,ou=Enseignement,ou=Groupes,dc=id,dc=polytechnique,dc=edu (uniqid: fa90219d-6a8211e4-a42c901a-94623bee, optype: 16) to changelog csn 546380d6000000020000
[12/Nov/2014:16:47:54 +0100] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=54638125000000020000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[12/Nov/2014:16:47:54 +0100] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (54638125000000020000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[12/Nov/2014:16:47:54 +0100] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for cn=LAN472EFLE-2014,ou=2014,ou=Cours,ou=Enseignement,ou=Groupes,dc=id,dc=polytechnique,dc=edu (uniqid: 1e5367ae-6a8311e4-a42c901a-94623bee, optype: 16) to changelog csn 54638125000000020000
[12/Nov/2014:16:53:13 +0100] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=54638260000000020000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[12/Nov/2014:16:53:13 +0100] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (54638260000000020000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[12/Nov/2014:16:53:13 +0100] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for cn=MAT471-2014,ou=2014,ou=Cours,ou=Enseignement,ou=Groupes,dc=id,dc=polytechnique,dc=edu (uniqid: f4e70b85-6a8311e4-a42c901a-94623bee, optype: 16) to changelog csn 54638260000000020000
[12/Nov/2014:16:55:11 +0100] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=546382d6000400020000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[12/Nov/2014:16:55:11 +0100] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (546382d6000400020000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[12/Nov/2014:16:55:11 +0100] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for cn=MEC592-2014,ou=2014,ou=Cours,ou=Enseignement,ou=Groupes,dc=id,dc=polytechnique,dc=edu (uniqid: 3c6d978a-6a8411e4-a42c901a-94623bee, optype: 16) to changelog csn 546382d6000400020000
[12/Nov/2014:16:56:29 +0100] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=54638325000000020000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[12/Nov/2014:16:56:29 +0100] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (54638325000000020000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[12/Nov/2014:16:56:29 +0100] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for cn=PHY566-2014,ou=2014,ou=Cours,ou=Enseignement,ou=Groupes,dc=id,dc=polytechnique,dc=edu (uniqid: 6030dd93-6a8411e4-a42c901a-94623bee, optype: 16) to changelog csn 54638325000000020000
[12/Nov/2014:16:57:20 +0100] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=5463835d000000020000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[12/Nov/2014:16:57:20 +0100] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (5463835d000000020000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[12/Nov/2014:16:57:20 +0100] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for cn=PHY651K-2014,ou=2014,ou=Cours,ou=Enseignement,ou=Groupes,dc=id,dc=polytechnique,dc=edu (uniqid: 83f42395-6a8411e4-a42c901a-94623bee, optype: 16) to changelog csn 5463835d000000020000

Each time the failed group entries are different.

The replication agreements look liek this :
0 cn=Replication from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr,cn=replica,cn=dc\3Did\2Cdc\3Dpolytechnique\2Cdc\3Dedu,cn=mapping tree,cn=config
objectClass: top
objectClass: nsDS5ReplicationAgreement
cn: Replication from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr
description: Replication agreement from server ldap-edev.polytechnique.fr to server ldap-model.polytechnique.fr
nsDS5ReplicaHost: ldap-model.polytechnique.fr
nsDS5ReplicaRoot: dc=id,dc=polytechnique,dc=edu
nsDS5ReplicaPort: 636
nsDS5ReplicaTransportInfo: SSL
nsDS5ReplicaBindDN: cn=RepliX,cn=config
nsDS5ReplicaBindMethod: simple
nsDS5ReplicatedAttributeList: (objectclass=) $ EXCLUDE entryusn memberOf
nsDS5ReplicatedAttributeListTotal: (objectclass=
) $ EXCLUDE entryusn
nsds5ReplicaStripAttrs: modifiersName modifyTimestamp internalModifiersName internalModifyTimestamp internalCreatorsname
nsds5replicaBusyWaitTime: 3
nsds5replicaTimeout: 30
nsDS5ReplicaCredentials: {DES}...
nsds5ReplicaEnabled: on
nsds50ruv: {replicageneration} 54636df3000000020000
nsds50ruv: {replica 1 ldap://ldap-model.polytechnique.fr:389} 54637b3c000000010000 54637dce000000010000
nsds50ruv: {replica 2 ldap://ldap-edev.polytechnique.fr:389} 54636ffe000000020000 54637d2d000000020000
nsruvReplicaLastModified: {replica 1 ldap://ldap-model.polytechnique.fr:389} 00000000
nsruvReplicaLastModified: {replica 2 ldap://ldap-edev.polytechnique.fr:389} 00000000
nsds5replicareapactive: 0
nsds5replicaLastUpdateStart: 20141112173922Z
nsds5replicaLastUpdateEnd: 20141112173922Z
nsds5replicaChangesSentSinceStartup: 2:1899/21484
nsds5replicaLastUpdateStatus: 0 Replica acquired successfully: Incremental update succeeded
nsds5replicaUpdateInProgress: FALSE
nsds5replicaLastInitStart: 0
nsds5replicaLastInitEnd: 0

The DNA plug-in is disabled (excluding https://fedorahosted.org/389/ticket/47410)


Looks like a dup of ticket/47409

Hi Rich,
do you propose to change nsslapd-db-deadlock-policy?

or i could try to increase MAX_TRIALS in ./ldap/servers/plugins/replication/cl5.h and see if it helps.

Replying to [comment:2 pj101]:

Hi Rich,
do you propose to change nsslapd-db-deadlock-policy?

Yes. Try setting it as described in the bug https://bugzilla.redhat.com/show_bug.cgi?id=979169

Replying to [comment:3 pj101]:

or i could try to increase MAX_TRIALS in ./ldap/servers/plugins/replication/cl5.h and see if it helps.

I don't think that will help, but if you try it and it works, please let us know.

Changing nsslapd-db-deadlock-policy to 6 does indeed eliminate the problem and teh replicas are always consistent.

However "Retry count exceeded in modify" in error logs are still here, and their number decreases with increasing MAX_TRIALS in ./ldap/servers/plugins/replication/cl5.h and RETRY_TIMES in ./ldap/servers/slapd/back-ldbm/back-ldbm.h.

I'm closing this ticket, it's indeed the duplicate of ticket/47409

Metadata Update from @rmeggins:
- Issue set to the milestone: N/A

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1285

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Duplicate)

3 years ago

Login to comment on this ticket.

Metadata