#344 deadlock in replica_write_ruv
Closed: wontfix. Opened 12 years ago by rmeggins.

replica_write_ruv() performs its modify with OP_FLAG_REPL_FIXUP set, and so do replica_create_ruv_tombstone() and replica_replace_ruv_tombstone(). The OP_FLAG_REPL_FIXUP flag causes the backend database lock to be skipped:

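    /* With serial locking enabled, lock the backend unless this is an OP_FLAG_REPL_FIXUP operation */
    if (SERIALLOCK(li) && !operation_is_flag_set(operation, OP_FLAG_REPL_FIXUP)) {
        dblayer_lock_backend(be);
        dblock_acquired = 1;
    }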

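If the event queue fires replica_write_ruv() at just the wrong moment, it will conflict with the update of the same RUV entry from replica_replace_ruv_tombstone() or (less likely) replica_create_ruv_tombstone(). Because neither side takes the backend lock, the two updates are not serialized and can deadlock.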

I think the solution is to always take the database SERIALLOCK. Since inst->inst_db_mutex is now a PRMonitor instead of a plain mutex, it is already re-entrant for the same thread. That re-entrancy was the original intent of the OP_FLAG_REPL_FIXUP flag: to allow the urp database plugins to modify entries. Alternatively, convert the urp be pre/post-op plugins into betxn pre/post-op plugins.


set default ticket origin to Community

Added initial screened field value.

Replying to [ticket:344 rmeggins]:

While making the plugins betxn-aware, I am moving the SERIALLOCK into dblayer_txn_begin, and the lock is now held regardless of the OP_FLAG_REPL_FIXUP flag, as Rich suggested. So this issue should be solved by the ticket #351 fix.

What would be the best scenario to verify the fix? For a week, I ran a fairly heavy stress test, covering add, modify, and delete operations, against a server with the #351 patch applied. The replication topology consisted of 4 masters, 2 hubs, and 4 read-only replicas. Is that good enough to say this bug is solved?

Replying to [comment:5 nhosoi]:

Yes.

Metadata Update from @nhosoi:
- Issue assigned to nhosoi
- Issue set to the milestone: 1.3.0.a1

7 years ago

389-ds-base is moving from Pagure to GitHub. This means that new issues and pull requests will be accepted only in 389-ds-base's GitHub repository.

This issue has been cloned to GitHub and is available here:
- https://github.com/389ds/389-ds-base/issues/344

If you want to receive further updates on the issue, please navigate to the GitHub issue and click the Subscribe button.

Thank you for your understanding. We apologize for any inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Duplicate)

3 years ago
