For example, attempting to do a repl init with a very large 100MB entry: the messages in the errors and access logs are not informative, and the user has no idea how to proceed, e.g. increase nsslapd-maxbersize (but how large? there is no information in the access or errors log), increase nsslapd-cachememsize, or increase nsslapd-import-cachesize (but again, there is no information about how much to increase these).
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1034902
Replying to [ticket:47606 rmeggins]:
In terms of the too-large ber size error, the error is returned from the openldap ber_get_next:
{{{
tagp = ber_get_next( conn->c_sb, &bytes_scanned, ber );
}}}
It returns LBER_OVERFLOW, which is assigned to tagp; the returned bytes_scanned is 0, errno is set to ERANGE, and the returned ber is almost empty.
{{{
p ber
$9 = {ber_opts = {lbo_valid = 2, lbo_options = 0, lbo_debug = 0},
  ber_tag = 48, ber_len = 42237, ber_usertag = 0, ber_buf = 0x0,
  ber_ptr = 0x7f265e9920c7 "", ber_end = 0x0, ber_sos_ptr = 0x0,
  ber_rwptr = 0x7f265e9920c7 "", ber_memctx = 0x0}
}}}
Here's the comment in get_next_from_buffer for the case where the ber is too large:
{{{
/* openldap does not differentiate between length == 0 and length > max -
 * all we know is that there was a problem with the length - assume too big */
err = SLAPD_DISCONNECT_BER_TOO_BIG;
}}}
Since there's no entry size info available in get_next_from_buffer, it just logs the minimal error message in the error log:
{{{
connection - conn=4 fd=65 Incoming BER Element was too long, max allowable
is 20480 bytes. Change the nsslapd-maxbersize attribute in cn=config to increase.
}}}
Do we have a chance to get the entry size info when it fails?
There may be enough information in the ber to use ber_peek_tag(BerElement *ber, ber_len_t *len) to get the len.
git patch file (master) 0001-Ticket-47606-replica-init-bulk-import-errors-should-.patch
Description:

1. maxbersize: If the size of an entry is larger than the consumer's maxbersize, the following error used to be logged:
{{{
Incoming BER Element was too long, max allowable is ### bytes.
Change the nsslapd-maxbersize attribute in cn=config to increase.
}}}
This message does not indicate how large maxbersize needs to be. This patch adds code to retrieve the failed ber size. Revised message:
{{{
Incoming BER Element was @@@ bytes, max allowable is ### bytes.
Change the nsslapd-maxbersize attribute in cn=config to increase.
}}}
Note: There is no lber API that returns the ber size when it fails to handle the ber. This patch borrows the internal structure of the ber and gets the size from it. This could be risky, since the size or layout of the ber structure could change in the openldap lber library.

2. cache size: The bulk import depends upon the nsslapd-cachememsize value in the backend instance entry (e.g., cn=userRoot,cn=ldbm database,cn=plugins,cn=config). If an entry size is larger than the cachememsize, the bulk import used to fail with this message:
{{{
import userRoot: REASON: entry too large (@@@ bytes) for the import
buffer size (### bytes). Try increasing nsslapd-cachememsize.
}}}
Also, that message followed the skipping-entry message:
{{{
import userRoot: WARNING: skipping entry "<DN>"
}}}
but actually it did NOT "skip" the entry and continue the bulk import; it failed there and completely wiped out the backend database. This patch modifies the message as follows:
{{{
import userRoot: REASON: entry too large (@@@ bytes) for the effective
import buffer size (### bytes). Try increasing nsslapd-cachememsize for
the backend instance "userRoot".
}}}
and, as the message says, it now just skips the failed entry and continues the bulk import.
ber_peek_tag doesn't work? or ber_peek_element?
So the way bulk import/repl init works with this patch is that entries will be skipped? If entries are skipped, will the overall operation report success? I'm not sure that's a good thing. If someone does a repl init, it will report success, and they will think everything is fine, only to find out (much?) later that everything is not fine.
I think import should fail hard as soon as there is a problem. If we clearly tell people what's wrong, and how to fix it, that should be sufficient.
git patch file (master) -- revised: including mozldap ber, as well. 0001-Ticket-47606-replica-init-bulk-import-errors-should-.2.patch
Looks good, but I don't think we should skip entries during replica init - I think it should be a hard failure. The user may not know there were problems. The consumer errors log may fill up with thousands of error messages.
Replying to [comment:10 rmeggins]:
Yes, that's true... My concern is that when I was debugging this bug and the bulk import failed, the consumer's DB was completely wiped out, and any attempt to recover the db was quite tricky and painful -- I had to remove the consumer's replica, reinitialize the db using a suffix ldif file, then add back the replication setup to the consumer. It might confuse the customers, as well...
side note: http://www.openldap.org/its/index.cgi (logged in as guest, viewing Incoming/7764)
Date: Fri, 13 Dec 2013 21:42:03 +0000
From: nhosoi@gmail.com
To: openldap-its@OpenLDAP.org
Subject: RFE: library lber method which returns the ber size even if the ber is overflown

Full_Name: Noriko Hosoi
Version: 2.4.35-4
OS: Fedora 18
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (209.132.181.86)
We use the OpenLDAP library in our software. LDAP clients can send a ber that is too large, causing the lber APIs to return LBER_OVERFLOW (or LBER_DEFAULT). We'd like to learn from the error how large a ber size we should prepare for. When we look into the BerElement ber using gdb, ber->ber_len stores the value, but the value is not returned to the caller when the API fails, e.g., on overflow. Could it be possible to have a way to retrieve the ber size under any condition?
Replying to [comment:11 nhosoi]:
> Replying to [comment:10 rmeggins]:
> > Looks good, but I don't think we should skip entries during replica init - I think it should be a hard failure. The user may not know there were problems. The consumer errors log may fill up with thousands of error messages.
>
> Yes, that's true... My concern is when I was debugging this bug and bulk import failed, the consumer's DB is completely wiped out and any attempt to recover the db was quite tricky and painful -- I had to remove the consumer's replica, reinitializing the db using a suffix ldif file, then added back the replication setup to the consumer. It might confuse the customers, as well...
In any case, when bulk import fails, the database is wiped out. That's the point of bulk import/replica init - it wipes out your database and replaces it with the one you specify. But I'm not sure why you had to reconfigure the replica - usually when replica init fails you just retry it, you don't have to reconfigure. That sounds like another bug.
Thanks for your comments, Rich.
The code that skipped a failed entry (instead of failing there) in bulk_import_queue was backed out.
And this is why repl5_tot_waitfor_async_results did not return with "done" when a failure was returned from the connection. 3. In repl5_tot_result_threadmain, when conn_read_result_ex returns non-zero (non-SUCCESS), it sets abort but does not set any error code in rc (the return code), so the result is not considered "finished" in repl5_tot_waitfor_async_results, and it continues waiting until the code reaches the max loop count (about 5 minutes). This patch sets LDAP_CONNECT_ERROR as the return code, along with setting abort, if conn_read_result_ex returns CONN_NOT_CONNECTED. This makes the bulk import finish quickly when it fails.
git patch file (master) -- take 3 0001-Ticket-47606-replica-init-bulk-import-errors-should-.3.patch
Reviewed by Rich (Thank you!!)
Pushed to master: 97cf45d..1119083 master -> master commit 1119083
Pushed to 389-ds-base-1.3.2: 4ea2fde..fde9ed5 389-ds-base-1.3.2 -> 389-ds-base-1.3.2 commit fde9ed5
Pushed to 389-ds-base-1.3.1: ae3daa1..c9d0b6c 389-ds-base-1.3.1 -> 389-ds-base-1.3.1 commit c9d0b6c
Metadata Update from @nhosoi: - Issue assigned to nhosoi - Issue set to the milestone: 1.3.1.17
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/943
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)