#47975 Replication flow control: monitoring and tuning
Closed: wontfix 3 years ago by spichugi. Opened 9 years ago by tbordaz.

A replication agreement can hang during a replication session (total or incremental update), depending on how fast the consumer can process the received updates/entries.

Ticket https://fedorahosted.org/389/ticket/47942 implements flow control based on configurable attributes (nsds5ReplicaFlowControlWindow/nsds5ReplicaFlowControlPause).

This new ticket is to enhance this flow control so that:
- the default values (window/pause) match the general-purpose use case
- a procedure is described so that administrators can determine their own tuning
- automatic tuning is implemented, based on the recent updates/entries rate
- flow control events are monitored (e.g. cn=monitor,<RA>)
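
For illustration only, here is a minimal sketch of the window/pause check and of one possible rate-based tuning heuristic. It assumes the window is a limit on sent-but-unacknowledged updates and the pause is a delay in milliseconds. The `flow_ctl` struct, the function names, and the "rate times tolerated backlog" heuristic are all hypothetical and are not taken from the server code.

```c
/* Hypothetical per-session state; NOT the server's real data structure. */
typedef struct {
    int last_message_id_sent;     /* last update sent to the consumer */
    int last_message_id_received; /* last update acknowledged by the consumer */
    int window;                   /* cf. nsds5ReplicaFlowControlWindow (assumed: max unacknowledged updates) */
    int pause_ms;                 /* cf. nsds5ReplicaFlowControlPause (assumed: milliseconds) */
} flow_ctl;

/* Number of updates sent but not yet acknowledged. */
static int flow_ctl_outstanding(const flow_ctl *fc)
{
    return fc->last_message_id_sent - fc->last_message_id_received;
}

/* Pause (in ms) the supplier should take before sending more; 0 means none. */
static int flow_ctl_pause_needed(const flow_ctl *fc)
{
    return flow_ctl_outstanding(fc) > fc->window ? fc->pause_ms : 0;
}

/*
 * Hypothetical auto-tuning heuristic: allow as many unacknowledged updates
 * as the consumer recently acknowledged in backlog_sec seconds, clamped to
 * [min_window, max_window].
 */
static int flow_ctl_suggest_window(double acked_per_sec, double backlog_sec,
                                   int min_window, int max_window)
{
    int w = (int)(acked_per_sec * backlog_sec);
    if (w < min_window) {
        w = min_window;
    }
    if (w > max_window) {
        w = max_window;
    }
    return w;
}
```

With a window of 1000 and a pause of 2000 ms, a session with 1100 unacknowledged updates would pause for 2000 ms, while one with 900 would not pause at all.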


Per triage, push the target milestone to 1.3.6.

Metadata Update from @nhosoi:
- Issue set to the milestone: 1.3.6.0

7 years ago

Metadata Update from @mreynolds:
- Issue close_status updated to: None
- Issue set to the milestone: 1.4 backlog (was: 1.3.6.0)

6 years ago

@msauton, do you think it would be valuable to help with tuning? If yes, what kind of information would you expect?

that may need some debate and thought, but I will try:

with larger IPA deployments, plus replicas and hosts being provisioned in cloud environments, I would say yes, such a feature would help.

often, the static configuration does not fit or scale to bursts of activity: for cache(s), threads, dblocks, and for the online total and incremental updates with nsds5ReplicaFlowControlPause and nsds5ReplicaFlowControlWindow.

the errors log file should have messages with a severity level related to events and trigger conditions (WARN?) and configuration changes (INFO?).
maybe also the possibility of some monitoring output at INFO, so we could collect historical data.

ideally, we would like to see a more general value of entries/second, but we should probably also see some more protocol-related values like
- the replica id and the replication agreement id
- number of entries sent without acknowledgment
- last_message_id_received and last_message_id_sent
- the delay from flowControlPause
- maybe the busywaittime and pausetime
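
As a rough sketch of how the values listed above could be grouped into one monitoring record, with entries/second derived from it: the `flow_monitor` struct and every field and function name below are hypothetical, chosen only to mirror the list, and do not describe an existing cn=monitor schema.

```c
/* Hypothetical monitoring snapshot for one replication agreement. */
typedef struct {
    unsigned replica_id;            /* replica id */
    const char *agreement;          /* replication agreement name */
    int last_message_id_sent;
    int last_message_id_received;
    unsigned long pause_events;     /* number of times flow control paused */
    unsigned long paused_ms_total;  /* cumulated delay from flowControlPause */
    unsigned long entries_sent;     /* entries/updates sent in the interval */
    double elapsed_sec;             /* length of the sampling interval */
} flow_monitor;

/* Entries sent without acknowledgment. */
static int flow_monitor_unacked(const flow_monitor *m)
{
    return m->last_message_id_sent - m->last_message_id_received;
}

/* General entries/second value over the sampling interval. */
static double flow_monitor_rate(const flow_monitor *m)
{
    return m->elapsed_sec > 0.0 ? (double)m->entries_sent / m->elapsed_sec : 0.0;
}

/* Fraction of the interval spent paused by flow control (0.0 to 1.0). */
static double flow_monitor_paused_ratio(const flow_monitor *m)
{
    if (m->elapsed_sec <= 0.0) {
        return 0.0;
    }
    return ((double)m->paused_ms_total / 1000.0) / m->elapsed_sec;
}
```

A paused ratio close to 1.0 over several intervals would suggest that the window is too small or the pause too long for that consumer.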

one detail, possibly a different topic, related to replication logging in general: for example, for a call like
slapi_log_err(SLAPI_LOG_REPL, repl_plugin_name,
"repl5_inc_waitfor_async_results - %d %d\n",
rd->last_message_id_received, rd->last_message_id_sent);

and a resulting message such as
ERR - NSMMReplicationPlugin - repl5_inc_waitfor_async_results - Timed out waiting for responses: 69564 69578
->
the replica id and the replication agreement id would be interesting info to collect, as nowadays we have many more replication agreements.
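
A possible shape for such a message, sketched with plain snprintf so that it stands alone: the helper name `format_timeout_msg`, its parameters, and the exact wording of the prefix are hypothetical. In the server, the resulting text would presumably be passed to slapi_log_err as in the call quoted above.

```c
#include <stdio.h>

/*
 * Hypothetical helper: build the "timed out" message with the agreement
 * name and replica id in front of the two message ids.
 * Returns the snprintf result (length that would have been written).
 */
static int format_timeout_msg(char *buf, size_t len,
                              const char *agreement, unsigned replica_id,
                              int last_message_id_received,
                              int last_message_id_sent)
{
    return snprintf(buf, len,
                    "repl5_inc_waitfor_async_results - agmt=\"%s\" (rid %u): "
                    "Timed out waiting for responses: %d %d\n",
                    agreement, replica_id,
                    last_message_id_received, last_message_id_sent);
}
```

Keeping the two message ids in the same order as the existing message (received, then sent) means existing log parsers would only need to skip the new prefix.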

is the "automatic tuning that would use the recent updates/entries rate" a one time setting, or dynamic, with regular checks?

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1306

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix
- Issue status updated to: Closed (was: Open)

3 years ago
