#546 Hang during shutdown within replication plugin
Closed: wontfix. Opened 11 years ago by minfrin.

When an attempt is made to use setup-ds.pl to script the configuration of a new LDAP server instance, and the ConfigFile directive is used to add a replication agreement on initial install, you are forced to shut the server down immediately after to initialise the SSL certificates, which cannot be done when the server is running.

On shutdown, the server hangs inside the replication code.

The hang occurs on select:

[root@beachfront ~]# strace -p 4637
Process 4637 attached - interrupt to quit
select(0, NULL, NULL, NULL, {0, 777223}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}^C <unfinished ...>
Process 4637 detached

The backtrace at this point looks like this:

(gdb) bt full
#0 0x00000030548e0d03 in select () at ../sysdeps/unix/syscall-template.S:82
        No locals.
#1 0x00007fb7adeb3ea9 in DS_Sleep (ticks=<value optimized out>) at ldap/servers/slapd/util.c:802
        mSecs = <value optimized out>
        tm = {tv_sec = 0, tv_usec = 426944}
#2 0x00007fb7a6e6b1a8 in repl5_inc_stop (prp=0x142a480) at ldap/servers/plugins/replication/repl5_inc_protocol.c:2002
        return_value = <value optimized out>
        start = 3897664523
        maxwait = 1200000
        now = <value optimized out>
#3 0x00007fb7a6e710db in prot_stop (rp=0x1427c30) at ldap/servers/plugins/replication/repl5_protocol.c:433
        No locals.
#4 0x00007fb7a6e65fb6 in agmt_stop (ra=0x1426710) at ldap/servers/plugins/replication/repl5_agmt.c:664
        rp = 0x1427c30
#5 0x00007fb7a6e670b3 in agmtlist_shutdown () at ldap/servers/plugins/replication/repl5_agmtlist.c:657
        ra = 0x1426710
        ro = 0x13f9fb0
        next_ro = <value optimized out>
#6 0x00007fb7a6e6e1bf in multimaster_stop (pb=<value optimized out>) at ldap/servers/plugins/replication/repl5_init.c:550
        No locals.
#7 0x00007fb7ade8bb8a in plugin_call_func (list=0x11b0c90, operation=210, pb=0x7fff52a4e540, call_one=1) at ldap/servers/slapd/plugin.c:1450
        n = <value optimized out>
        func = 0x7fb7a6e6e160 <multimaster_stop>
        rc = <value optimized out>
        return_value = <value optimized out>
        count = <value optimized out>
#8 0x00007fb7ade8bcab in plugin_call_one () at ldap/servers/slapd/plugin.c:1418
        No locals.
#9 plugin_dependency_closeall () at ldap/servers/slapd/plugin.c:1362
        pb = {pb_backend = 0x0, pb_conn = 0x0, pb_op = 0x0, pb_plugin = 0x11b0c90, pb_opreturn = 0, pb_object = 0x0,
          pb_destroy_fn = 0, pb_requestor_isroot = 0, pb_config_fname = 0x0, pb_config_lineno = 0, pb_config_argc = 0,
          pb_config_argv = 0x0, pb_target_entry = 0x0, pb_existing_dn_entry = 0x0, pb_existing_uniqueid_entry = 0x0,
          pb_parent_entry = 0x0, pb_newparent_entry = 0x0, pb_pre_op_entry = 0x0, pb_post_op_entry = 0x0, pb_seq_type = 0,
          pb_seq_attrname = 0x0, pb_seq_val = 0x0, pb_ldif_file = 0x0, pb_removedupvals = 0, pb_db2index_attrs = 0x0,
          pb_ldif2db_noattrindexes = 0, pb_ldif_printkey = 0, pb_instance_name = 0x0, pb_task = 0x0, pb_task_flags = 0,
          pb_mr_filter_match_fn = 0, pb_mr_filter_index_fn = 0, pb_mr_filter_reset_fn = 0, pb_mr_index_fn = 0,
          pb_mr_oid = 0x0, pb_mr_type = 0x0, pb_mr_value = 0x0, pb_mr_values = 0x0, pb_mr_keys = 0x0,
          pb_mr_filter_reusable = 0, pb_mr_query_operator = 0, pb_mr_usage = 0, pb_pwd_storage_scheme_user_passwd = 0x0,
          pb_pwd_storage_scheme_db_passwd = 0x0, pb_managedsait = 0, pb_internal_op_result = 0,
          pb_plugin_internal_search_op_entries = 0x0, pb_plugin_internal_search_op_referrals = 0x0,
          pb_plugin_identity = 0x0, pb_plugin_config_area = 0x0, pb_parent_txn = 0x0, pb_txn = 0x0,
          pb_txn_ruv_mods_fn = 0, pb_dbsize = 0, pb_ldif_files = 0x0, pb_ldif_include = 0x0, pb_ldif_exclude = 0x0,
          pb_ldif_dump_replica = 0, pb_ldif_dump_uniqueid = 0, pb_ldif_generate_uniqueid = 0, pb_ldif_namespaceid = 0x0,
          pb_ldif_encrypt = 0, pb_operation_notes = 0, pb_slapd_argc = 0, pb_slapd_argv = 0x0, pb_slapd_configdir = 0x0,
          pb_ctrls_arg = 0x0, pb_dse_dont_add_write = 0, pb_dse_add_merge = 0, pb_dse_dont_check_dups = 0,
          pb_dse_is_primary_file = 0, pb_schema_flags = 0, pb_result_code = 0, pb_result_text = 0x0,
          pb_result_matched = 0x0, pb_nentries = 0, urls = 0x0, pb_import_entry = 0x0, pb_import_state = 0,
          pb_destroy_content = 0, pb_dse_reapply_mods = 0, pb_urp_naming_collision_dn = 0x0,
          pb_urp_tombstone_uniqueid = 0x0, pb_server_running = 0, pb_backend_count = 0, pb_pwpolicy_ctrl = 0,
          pb_vattr_context = 0x0, pb_substrlens = 0x0, pb_plugin_enabled = 0, pb_search_ctrls = 0x0,
          pb_mr_index_sv_fn = 0, pb_syntax_filter_normalized = 0, pb_syntax_filter_data = 0x0}
        plugins_closed = <value optimized out>
        index = <value optimized out>
#10 0x0000000000417aa4 in slapd_daemon (ports=0x7fff52a4ece0) at ldap/servers/slapd/daemon.c:874
        tcps = <value optimized out>
        n_tcps = 0x0
        s_tcps = 0x0
        i_unix = 0x0
        fdesp = <value optimized out>
        num_poll = <value optimized out>
        pr_timeout = 250
        time_thread_p = 0x141f350
        threads = <value optimized out>
        in_referral_mode = 0
        connection_table_size = <value optimized out>
#11 0x000000000041ddd8 in main (argc=7, argv=0x7fff52a4f078) at ldap/servers/slapd/main.c:1239
        return_value = 0
        slapdFrontendConfig = <value optimized out>
        ports_info = {n_port = 389, s_port = 636, n_listenaddr = 0x0, s_listenaddr = 0x0, n_socket = 0x1150ee0,
          i_listenaddr = 0x0, i_port = 0, i_socket = 0x0, s_socket = 0x0}

The INF file used to configure the server is as follows:

[General]
# the Unix user that the Directory Server will run as (required)
SuiteSpotUserID=nobody
# the fully qualified host and domain name (required)
FullMachineName=beachfront.example.com
# the base directory where the runtime files are installed (required)
ServerRoot=/usr/lib64/dirsrv
# user ID for console login (optional)
ConfigDirectoryAdminID=admin
# password for ConfigDirectoryAdminID (optional)
ConfigDirectoryAdminPwd=password
# LDAP URL for the Configuration Directory
# the suffix is required and will usually be o=NetscapeRoot (optional)
ConfigDirectoryLdapURL=ldap://host.domain.tld:port/o=NetscapeRoot
# the administrative domain this instance will belong to (optional)
AdminDomain=example.com
# the user/group directory used by the Console (optional)
UserDirectoryLdapURL=ldap://host.domain.tld:port/dc=devel,dc=example,dc=com

[slapd]
# the port number the server will listen to (required)
ServerPort=389
# the base name of the directory that contains the instance
# of this server - will have "slapd-" added to it (required)
ServerIdentifier=beachfront
# the primary suffix for this server (more can be added later) (required)
Suffix=dc=example,dc=com
# the DN for the Directory Administrator (required)
RootDN=cn=Directory Manager
# the password for the RootDN (required)
RootDNPwd=password
# use this LDIF file to initialize the database
# the suffix must be specified in the Suffix directive (optional)
InstallLdifFile=none
# configuration LDIF file
ConfigFile=/etc/dirsrv/slapd-beachfront-replication.ldif
#ConfigFile=/etc/dirsrv/slapd-beachfront-ssl.ldif
# if true (1), configure this new DS instance as a
# Configuration Directory Server (optional)
SlapdConfigForMC=1
# if true (1), register this DS with the Configuration DS (optional)
UseExistingMC=0
# if true (1), do not configure this DS as a user/group directory
# but use the one specified by UserDirectoryLdapURL (optional)
UseExistingUG=0

The /etc/dirsrv/slapd-beachfront-replication.ldif file looks like this:

# enable the changelog for replication
dn: cn=changelog5,cn=config
objectclass: top
objectclass: extensibleObject
cn: changelog5
nsslapd-changelogdir: /var/lib/dirsrv/slapd-beachfront/changelogdb
nsslapd-changelogmaxage: 10d

# create the supplier bind dn
dn: cn=Replication Manager,cn=config
objectClass: inetorgperson
objectClass: person
objectClass: top
cn: Replication Manager
sn: RM
nsIdleTimeout: 0

# enable the supplier replica
dn: cn=replica,cn="dc=example,dc=com",cn=mapping tree,cn=config
objectclass: top
objectclass: nsds5replica
objectclass: extensibleObject
cn: replica
nsds5replicaroot: dc=example,dc=com
nsds5replicaid: 7
nsds5replicatype: 3
nsds5flags: 1
nsds5ReplicaPurgeDelay: 604800
nsds5ReplicaBindDN: cn=Replication Manager,cn=config

# replication agreement A to B
dn: cn=Agreement gatekeeper.example.com,cn=replica,cn="dc=example,dc=com",cn=mapping tree,cn=config
changetype: add
objectclass: top
objectclass: nsds5replicationagreement
cn: Agreement gatekeeper.example.com
nsds5replicahost: gatekeeper.example.com
nsds5replicaport: 636
nsds5ReplicaBindDN: cn=Replication Manager,cn=config
nsds5replicabindmethod: SSLCLIENTAUTH
nsds5ReplicaTransportInfo: SSL
nsds5replicaroot: dc=example,dc=com
description: Replication agreement between beachfront.example.com and gatekeeper.example.com
nsds5BeginReplicaRefresh: start

The server does log the following:

[29/Dec/2012:22:19:22 +0200] - slapd shutting down - signaling operation threads
[29/Dec/2012:22:19:22 +0200] - slapd shutting down - waiting for 28 threads to terminate
[29/Dec/2012:22:19:22 +0200] - slapd shutting down - closing down internal subsystems and plugins
[29/Dec/2012:22:19:40 +0200] slapi_ldap_bind - Error: could not send bind request for id [(anon)] mech [EXTERNAL]: error -1 (Can't contact LDAP server) 0 (unknown) 107 (Transport endpoint is not connected)
[29/Dec/2012:22:19:40 +0200] NSMMReplicationPlugin - agmt="cn=Agreement gatekeeper.example.com" (gatekeeper:636): Replication bind with EXTERNAL auth failed: LDAP error -1 (Can't contact LDAP server) ((null))

(Note: the machine gatekeeper.example.com does not yet exist, so the error message is correct - but the error message appears during the shutdown process)

The server hangs at this point, and the scripted deploy fails.
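
For readers unfamiliar with this pattern: the one-second select timeouts in the strace output, together with the start, now and maxwait locals in the repl5_inc_stop frame, suggest a polling wait that sleeps until some condition holds or a maximum wait elapses. The following Python sketch illustrates that general pattern only. It is not the 389-ds-base code; wait_for_stop, is_stopped and the other parameter names are hypothetical, and the units of maxwait in the backtrace are not stated in this report.

```python
import time
from typing import Callable


def wait_for_stop(
    is_stopped: Callable[[], bool],
    max_wait: float,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_stopped() until it returns True or max_wait seconds elapse.

    Returns True if the condition was met, False if the wait timed out.
    All names here are illustrative assumptions, not server internals.
    """
    start = clock()
    while not is_stopped():
        # Give up once the elapsed time reaches the limit.
        if clock() - start >= max_wait:
            return False
        # Sleep between checks; a caller stuck here would show up in
        # strace as a series of timed-out sleeps, as in the report.
        sleep(poll_interval)
    return True
```

With a loop of this shape, a shutdown looks hung for as long as the condition stays false and the limit has not been reached, which would be consistent with the repeated timeouts shown above.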

minfrin,

Any chance you happened to record the entire thread stack? Curious if there were other replication threads running.

Thanks,
Mark

I cannot reproduce this issue on the latest version of the server. I believe this might have been fixed with ticket 399.

In my test, I stopped the server and added the replication config and agreement (agmt) to the dse.ldif. I then started the server and tried to stop it. I also tried adding the agmt after server startup. I got the same error messages, but the server stopped as expected:

[16/Jan/2013:09:44:05 -0500] slapi_ldap_bind - Error: could not send bind request for id [(anon)] mech [EXTERNAL]: error -1 (Can't contact LDAP server) 0 (unknown) 107 (Transport endpoint is not connected "127.0.0.1")
[16/Jan/2013:09:44:05 -0500] NSMMReplicationPlugin - agmt="cn=Agreement gatekeeper.example.com" (127:636): Replication bind with EXTERNAL auth failed: LDAP error -1 (Can't contact LDAP server) ((null))
[16/Jan/2013:09:44:05 -0500] - SSL alert: SSL client authentication cannot be used (no password). (Netscape Portable Runtime error 0 - unknown)
[16/Jan/2013:09:44:05 -0500] slapi_ldap_bind - Error: could not send bind request for id [(anon)] mech [EXTERNAL]: error -1 (Can't contact LDAP server) 0 (unknown) 107 (Transport endpoint is not connected "127.0.0.1")
[16/Jan/2013:09:44:08 -0500] - SSL alert: SSL client authentication cannot be used (no password). (Netscape Portable Runtime error 0 - unknown)
[16/Jan/2013:09:44:08 -0500] slapi_ldap_bind - Error: could not send bind request for id [(anon)] mech [EXTERNAL]: error -1 (Can't contact LDAP server) 0 (unknown) 107 (Transport endpoint is not connected "127.0.0.1")
[16/Jan/2013:09:44:12 -0500] - slapd shutting down - signaling operation threads
[16/Jan/2013:09:44:12 -0500] - slapd shutting down - waiting for 27 threads to terminate
[16/Jan/2013:09:44:12 -0500] - slapd shutting down - closing down internal subsystems and plugins
[16/Jan/2013:09:44:13 -0500] - Waiting for 4 database threads to stop
[16/Jan/2013:09:44:13 -0500] - All database threads now stopped
[16/Jan/2013:09:44:14 -0500] - slapd stopped.

Can you try with the latest version 1.2.11.17?

I have tested the silent install, using your files (modified to match my system). I cannot reproduce this on 1.2.9.10, 1.2.10.24, or 1.3.0.rc1.

This is a plain system, there are no certificates installed, nothing. There is just 389 installed.

In my logs I see:

[17/Jan/2013:11:02:54 -0500] - 389-Directory/1.2.9.10 B2013.017.160 starting up
[17/Jan/2013:11:02:55 -0500] - SSL alert: SSL client authentication cannot be used (no password). (Netscape Portable Runtime error 0 - unknown)
[17/Jan/2013:11:02:55 -0500] - slapd started. Listening on All Interfaces port 389 for LDAP requests
[17/Jan/2013:11:02:55 -0500] slapi_ldap_bind - Error: could not send bind request for id [(anon)] mech [EXTERNAL]: error -1 (Can't contact LDAP server) 0 (unknown) 107 (Transport endpoint is not connected)
[17/Jan/2013:11:02:55 -0500] NSMMReplicationPlugin - agmt="cn=Agreement gatekeeper.example.com" (127:636): Replication bind with EXTERNAL auth failed: LDAP error -1 (Can't contact LDAP server) ((null))
[17/Jan/2013:11:02:55 -0500] - SSL alert: SSL client authentication cannot be used (no password). (Netscape Portable Runtime error 0 - unknown)
[17/Jan/2013:11:02:55 -0500] slapi_ldap_bind - Error: could not send bind request for id [(anon)] mech [EXTERNAL]: error -1 (Can't contact LDAP server) 0 (unknown) 107 (Transport endpoint is not connected)
[17/Jan/2013:11:02:59 -0500] - SSL alert: SSL client authentication cannot be used (no password). (Netscape Portable Runtime error 0 - unknown)
[17/Jan/2013:11:02:59 -0500] slapi_ldap_bind - Error: could not send bind request for id [(anon)] mech [EXTERNAL]: error -1 (Can't contact LDAP server) 0 (unknown) 107 (Transport endpoint is not connected)
[17/Jan/2013:11:03:00 -0500] - slapd shutting down - signaling operation threads
[17/Jan/2013:11:03:00 -0500] - slapd shutting down - waiting for 27 threads to terminate
[17/Jan/2013:11:03:00 -0500] - slapd shutting down - closing down internal subsystems and plugins
[17/Jan/2013:11:03:01 -0500] - Waiting for 4 database threads to stop
[17/Jan/2013:11:03:02 -0500] - All database threads now stopped
[17/Jan/2013:11:03:02 -0500] - slapd stopped.

Do you have certmap.conf set up somewhere? Is there a step I am missing? I noticed you have your SSL config file commented out in your INF file - might this be something I need to reproduce the problem?
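
The clean shutdowns in the logs above end with "slapd stopped.", whereas the log in the original report never gets past "closing down internal subsystems and plugins". A scripted deploy could check for that difference. The sketch below is one possible way to do it in Python, assuming the error log contains the message text exactly as quoted in this ticket; the function name shutdown_completed and the two marker constants are hypothetical.

```python
# Marker strings copied from the log excerpts quoted in this ticket.
SHUTDOWN_START = "slapd shutting down - signaling operation threads"
SHUTDOWN_DONE = "slapd stopped."


def shutdown_completed(log_text: str) -> bool:
    """Return True if the most recent shutdown in the log finished.

    A shutdown counts as finished when a line containing SHUTDOWN_DONE
    appears after the last line containing SHUTDOWN_START.
    """
    lines = log_text.splitlines()
    last_start = None
    for i, line in enumerate(lines):
        if SHUTDOWN_START in line:
            last_start = i
    if last_start is None:
        # No shutdown was attempted in this log.
        return False
    return any(SHUTDOWN_DONE in line for line in lines[last_start + 1:])
```

A deploy script might read the instance's error log after issuing the stop, call this function, and treat a False result after a reasonable wait as a hung shutdown.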

While working on a similar ticket (ticket 558) I reproduced this issue. I am going to close this out as a duplicate.

Metadata Update from @mreynolds:
- Issue assigned to mreynolds
- Issue set to the milestone: 1.3.1

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/546

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Duplicate)

3 years ago
