Issue #374: consumer can go into total update mode for no reason - 389-ds-base

389-ds-base

#374 consumer can go into total update mode for no reason

Closed: wontfix None Opened 11 years ago by rmeggins.

This is probably the root cause of https://bugzilla.redhat.com/show_bug.cgi?id=547503 which we could never reproduce. What we believe is happening is that the same connection is sending two start repl extops, the second one before the first one is sent. If the consumer is heavily loaded, the first one may get swapped out and the second one processed at the "wrong" time - repl_extop.c:1095

    if (NSDS50_REPL_REPLICA_READY != response)
    {
        /*
         * Something went wrong, and we never told the other end that the
         * replica had been acquired, so we'd better release it.
         */
        if (NULL != connext && NULL != connext->replica_acquired)
        {
            Object *r_obj = (Object*)connext->replica_acquired;
            replica_relinquish_exclusive_access((Replica*)object_get_data (r_obj),
                                                connid, opid);
        }
        /* Remove any flags that would indicate repl session in progress */
        if (NULL != connext)
        {
            connext->repl_protocol_version = REPL_PROTOCOL_UNKNOWN;
            connext->isreplicationsession = 0;
        }
        slapi_pblock_set( pb, SLAPI_CONN_IS_REPLICATION_SESSION, &zero );
    }

Two or more connections may share the connext object. The second thread will get the error condition, and release the replica, and set connext->repl_protocol_version = REPL_PROTOCOL_UNKNOWN. If the first thread then gets swapped in, it will hit this - repl_extop.c:923

     if (connext->repl_protocol_version == REPL_PROTOCOL_50_INCREMENTAL)
    {
        /* The supplier ruv may have changed, so let's update the referrals */
        consumer5_set_mapping_tree_state_for_replica(replica, connext->supplier_ruv);
    }
    else /* full protocol */
    {

since the second thread just set it to REPL_PROTOCOL_UNKNOWN, that's not equal to INCREMENTAL, and so when the first thread resumes it will start a total update.

Since the second thread also prematurely releases the exclusive access to the replica, a third thread from another connection can come in an claim access.

rmeggins commented 11 years ago

Should also check the locking in multimaster_extop_EndNSDS50ReplicationRequest - could have some similar race conditions there too

telackey commented 11 years ago

Proposed fix
fdsHEAD.patch

telackey commented 11 years ago

Above is a proposed fix. As described by Rich, the issue is that the consumer connection extension is shared by all operations on that connection. When things are going smoothly, that is not a problem as it is highly unlikely that there will be more than one replication session started on the same connection. If the consumer gets unusually bogged down though, it might be possible for the supplier to timeout waiting on a response to the 'start' extop and decide to send another one. If that happens, the extension state can get muddled as it is manipulated by both 'start' operations, causing one or the other to misinterpret the normal incremental update as a total update.

The patch introduces the idea of "checking out" the consumer extension for the duration of the 'start' or 'end' operation. The phrase "exclusive access" is used to match up with the related functions for the replica, but the lifetime is a bit different. For the consumer connection extension, exclusive access should be relinquished after the current operation is done (eg, at the end of the 'start' or 'end' function), not held for the entire session.

I tested this pretty decently on 1.1.x. I ran it overnight with my magic bug trigging code (attachment to follow) in place without issue. Without the fix, it will only last a few minutes. I also ran it for a few days constantly looping some simple tests without encountering any issues (about a million changes worth) and then did a real total update. Everything seemed good.

However, the attachment is blindly patched against 1.2.x. I looked over each of the files and it looked OK, but I had to make some adjustments (PRint64 connection IDs) that are completely untested.

telackey commented 11 years ago

Trigger the bug (created against 1.1.3)
causebug374.patch

telackey commented 11 years ago

The attachment above is a good way to trigger the bug. It was created against 1.1.3, but ought to be easy to apply to HEAD. In the logs, you want to look for something like this error log from the consumer.

In this instance, op=25 (bold) acquires the replica and then is put to sleep by the demo code. After a couple of minutes, op=26 (italics) begins. It fails to acquire the replica, ''but successfully relinquishes it''! Behind the scenes, it also resets the replication protocol version to REPL_PROTOCOL_UNKNOWN, which is shown in op=25 as it wakes up from it sleep. That causes it to branch into the total update code, as Rich mentioned in the initial report.

Most of the commentary has focused on the latter, but obviously relinquishing a replica that may be in use has a chance to cause a different set of issues. The proposed fix should kill two birds with one stone, by denying access to the connext in the first place.

[31/May/2012:22:48:06 -0500] NSMMReplicationPlugin - conn=0 op=25 repl="dc=example,dc=local": Begin incremental protocol
[31/May/2012:22:48:06 -0500] NSMMReplicationPlugin - conn=0 op=25 repl="dc=example,dc=local": Acquired replica
[31/May/2012:22:48:06 -0500] NSMMReplicationPlugin - '''conn=0 op=25 repl="dc=example,dc=local": 374 - Starting sleep: connext->repl_protocol_version == 2'''
[31/May/2012:22:49:02 -0500] NSMMReplicationPlugin - agmt="cn="Replication to test6.example.local"" (test6:389): Linger timeout has expired on the connection
[31/May/2012:22:49:02 -0500] NSMMReplicationPlugin - agmt="cn="Replication to test6.example.local"" (test6:389): Disconnected from the consumer
[31/May/2012:22:50:10 -0500] NSMMReplicationPlugin - conn=0 op=26 repl="dc=example,dc=local": Begin incremental protocol
[31/May/2012:22:50:10 -0500] NSMMReplicationPlugin - ''conn=0 op=26 repl="dc=example,dc=local": Replica in use locking_purl=conn=0 id=25''
[31/May/2012:22:50:10 -0500] NSMMReplicationPlugin - ''conn=0 op=26 replica="dc=example,dc=local": Unable to acquire replica: error: replica busy locked by conn=0 id=25 for incremental update''
[31/May/2012:22:50:10 -0500] NSMMReplicationPlugin - ''conn=0 op=26 repl="dc=example,dc=local": StartNSDS50ReplicationRequest: response=1 rc=0''
[31/May/2012:22:50:10 -0500] NSMMReplicationPlugin - ''conn=0 op=26 repl="dc=example,dc=local": Released replica held by locking_purl=conn=0 id=25''
[31/May/2012:22:50:10 -0500] NSMMReplicationPlugin - '''conn=0 op=25 repl="dc=example,dc=local": 374 - Finished sleep: connext->repl_protocol_version == 0'''
[31/May/2012:22:50:10 -0500] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=example,dc=local is going offline; disabling replication
[31/May/2012:22:50:10 -0500] NSMMReplicationPlugin - agmt="cn="Replication to test6.example.local"" (test6:389): State: wait_for_changes -> wait_for_changes
[31/May/2012:22:50:10 -0500] NSMMReplicationPlugin - agmt="cn="Replication to test6.example.local"" (test6:389): State: wait_for_changes -> wait_for_changes
[31/May/2012:22:50:10 -0500] NSMMReplicationPlugin - agmt="cn="Replication to test6.example.local"" (test6:389): No linger to cancel on the connection
[31/May/2012:22:50:10 -0500] NSMMReplicationPlugin - agmt="cn="Replication to test6.example.local"" (test6:389): Disconnected from the consumer
[31/May/2012:22:50:11 -0500] NSMMReplicationPlugin - agmt="cn="Replication to test6.example.local"" (test6:389): repl5_inc_stop: protocol stopped after 1 seconds
[31/May/2012:22:50:11 -0500] NSMMReplicationPlugin - conn=0 op=0 repl="dc=example,dc=local": Acquired replica
[31/May/2012:22:50:11 -0500] NSMMReplicationPlugin - replica_disable_replication: replica dc=example,dc=local is acquired
[31/May/2012:22:50:11 -0500] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[31/May/2012:22:50:11 -0500] NSMMReplicationPlugin - conn=0 op=25 repl="dc=example,dc=local": StartNSDS50ReplicationRequest: response=0 rc=0
[31/May/2012:22:50:16 -0500] NSMMReplicationPlugin - conn=0 op=28 repl="dc=example,dc=local": Begin incremental protocol
[31/May/2012:22:50:16 -0500] - csngen_adjust_time: gen state before 4fc83bf30000:1338522610:0:1
[31/May/2012:22:50:16 -0500] NSMMReplicationPlugin - conn=0 op=28 repl="dc=example,dc=local": Replica in use locking_purl=replica_disable_replication
[31/May/2012:22:50:16 -0500] NSMMReplicationPlugin - conn=0 op=28 replica="dc=example,dc=local": Unable to acquire replica: error: replica busy locked by replica_disable_replication for total update
[31/May/2012:22:50:16 -0500] NSMMReplicationPlugin - conn=0 op=28 repl="dc=example,dc=local": StartNSDS50ReplicationRequest: response=1 rc=0
[31/May/2012:22:50:16 -0500] NSMMReplicationPlugin - conn=0 op=28 repl="dc=example,dc=local": Released replica held by locking_purl=replica_disable_replication
[31/May/2012:22:50:16 -0500] NSMMReplicationPlugin - conn=0 op=29 repl="dc=example,dc=local": Replica not in use
[31/May/2012:22:50:16 -0500] - ERROR bulk import abandoned
[31/May/2012:22:50:16 -0500] - import userRoot: Aborting all import threads...
[31/May/2012:22:50:21 -0500] - import userRoot: Import threads aborted.
[31/May/2012:22:50:21 -0500] - import userRoot: Closing files...
[31/May/2012:22:50:21 -0500] - import userRoot: Import failed.

rmeggins commented 11 years ago

set default ticket origin to Community

nkinder commented 11 years ago

Added initial screened field value.

rmeggins commented 11 years ago

To ssh://git.fedorahosted.org/git/389/ds.git
20dc4bc..f1e01ac master -> master
commit changeset:f1e01ac/389-ds-base
Author: Rich Megginson rmeggins@redhat.com
Date: Thu Sep 6 14:14:38 2012 -0600

rmeggins commented 11 years ago

To ssh://git.fedorahosted.org/git/389/ds.git
dcd14f1..4561c4c 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
commit changeset:4561c4c/389-ds-base
Author: Rich Megginson rmeggins@redhat.com
Date: Thu Sep 6 14:14:38 2012 -0600

rmeggins commented 11 years ago

Linked to Bugzilla bug: https://bugzilla.redhat.com/show_bug.cgi?id=852202 (''Red Hat Enterprise Linux 6'')

Metadata Update from @rmeggins:
- Issue assigned to rmeggins
- Issue set to the milestone: 1.2.11.14

7 years ago

spichugi commented 3 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/374

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

3 years ago

Metadata

Assignee

rmeggins

Tags

None

Blocking

None

Depending on

None

Priority

major

Milestone

1.2.11.14

reviewstatus

ack

rhbz

https://bugzilla.redhat.com/show_bug.cgi?id=852202

origin

Community

389-ds-base

Source Code

#374 consumer can go into total update mode for no reason Closed: wontfix None Opened 11 years ago by rmeggins.

Metadata

#374 consumer can go into total update mode for no reason

Closed: wontfix None Opened 11 years ago by rmeggins.