#485 Dirsrv deadlock locking up IPA
Closed: wontfix. Opened 11 years ago by rmeggins.

https://bugzilla.redhat.com/show_bug.cgi?id=863576 (Red Hat Enterprise Linux 6)

Description of problem:
Dirsrv deadlock in abandon

Version-Release number of selected component (if applicable):
ipa-server-2.2.0-16.el6.x86_64
389-ds-base-1.2.11.13-1.el6.x86_64

How reproducible: Have not reproduced yet...

Test Env:
2 IPA Master Servers
2 IPA Clients
10K users
10 groups, 1k users per group
Sudo command setup

Activities:
- Manually triggered admin operations through the UI (deleting sudo rules)
- Hourly scheduled ssh sudo runtime client load (10 threads)

The system handled the load fine, then began failing in the early morning,
around 5:00 AM, and became inoperative. We could not kinit and were unable to
restart. The cause was determined to be a DirSrv deadlock.

Additional info:
Dev Noriko Hosoi indicates:

This is a self-deadlock in abandon... I see 3 threads waiting for the
same mutex. The first one, thread 23, already acquired the mutex in
do_abandon, but it tries to grab it again in
pagedresults_free_one_msgid.

Thread 23 (Thread 0x7f446b458700 (LWP 22023)):
#0 0x0000003ffec0e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003ffec093be in _L_lock_995 () from /lib64/libpthread.so.0
#2 0x0000003ffec09326 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000003001c23f79 in PR_Lock () from /lib64/libnspr4.so
#4 0x00000030014889f9 in pagedresults_free_one_msgid (conn=0x7f4470e68670,
msgid=31) at ldap/servers/slapd/pagedresults.c:272
#5 0x000000000040d4ad in do_abandon (pb=0xa56ee70)
at ldap/servers/slapd/abandon.c:156
#6 0x0000000000414104 in connection_dispatch_operation ()
at ldap/servers/slapd/connection.c:646
#7 connection_threadmain () at ldap/servers/slapd/connection.c:2338
#8 0x0000003001c299e3 in ?? () from /lib64/libnspr4.so
#9 0x0000003ffec07851 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003ffe8e767d in clone () from /lib64/libc.so.6
(gdb) p conn
$6 = (Connection *) 0x7f4470e68670
(gdb) p *conn
$7 = {c_sb = 0x88b1610, c_sd = 84, c_ldapversion = 3,
c_dn = 0xa760d90
"fqdn=sti-high-3.testrelm.com,cn=computers,cn=accounts,dc=testrelm,dc=com",
c_isroot = 0, c_isreplication_session = 0,
c_authtype = 0x995b480 "SASL GSSAPI", c_external_dn = 0x0,
c_external_authtype = 0x30014cf896 "none", cin_addr = 0xae24a80,
cin_destaddr = 0xa607410, c_domain = 0x0, c_ops = 0xae9e640,
c_gettingber = 0, c_currentber = 0x0, c_starttime = 1349254852,
c_connid = 140400, c_opsinitiated = 59, c_opscompleted = 57,
c_threadnumber = 2, c_refcnt = 3, c_mutex = 0x824f3e0,
c_pdumutex = 0x89d5060, c_idlesince = 1349254862, c_private = 0x890d590,
c_flags = 34, c_needpw = 0, c_client_cert = 0x0, c_prfd = 0x85bac50,
c_ci = 84, c_fdi = -1, c_next = 0x7f4470e682c8, c_prev = 0x7f4470e69ed0,
c_bi_backend = 0x0, c_extension = 0xb145440, c_sasl_conn = 0xb5e9990,
c_local_ssf = 0, c_sasl_ssf = 56, c_ssl_ssf = 0, c_unix_local = 0,
c_local_valid = 0, c_local_uid = 0, c_local_gid = 0, c_pagedresults = {
prl_maxlen = 64, prl_count = 6, prl_list = 0x94f5c30},
c_push_io_layer_cb = 0, c_pop_io_layer_cb = 0, c_io_layer_cb_data = 0x0}

Thread 19 (Thread 0x7f4468c54700 (LWP 22027)):
#0 0x0000003ffec0e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003ffec093be in _L_lock_995 () from /lib64/libpthread.so.0
#2 0x0000003ffec09326 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000003001c23f79 in PR_Lock () from /lib64/libnspr4.so
#4 0x00000030014883fd in pagedresults_set_search_result
(conn=0x7f4470e68670,
sr=0x0, locked=0, index=0) at ldap/servers/slapd/pagedresults.c:360
#5 0x00007f447b1819fc in ldbm_back_next_search_entry_ext (pb=0xb02fca0,
use_extension=0) at ldap/servers/slapd/back-ldbm/ldbm_search.c:1426
#6 0x00000030014854c1 in iterate (pb=0xb02fca0, be=<value optimized out>,
pnentries=0x7f4468c51394, pagesize=1000, pr_statp=0x7f4468c51388,
send_result=1) at ldap/servers/slapd/opshared.c:1250
#7 0x00000030014859d7 in send_results_ext (pb=0xb02fca0,
nentries=0x7f4468c51394, pagesize=1000, pr_stat=0x7f4468c51388,
send_result=1) at ldap/servers/slapd/opshared.c:1651
#8 0x00000030014866cb in op_shared_search (pb=0xb02fca0, send_result=1)
at ldap/servers/slapd/opshared.c:828
#9 0x00000000004263a4 in do_search (pb=<value optimized out>)
at ldap/servers/slapd/search.c:400
#10 0x000000000041426a in connection_dispatch_operation ()
at ldap/servers/slapd/connection.c:621
#11 connection_threadmain () at ldap/servers/slapd/connection.c:2338
#12 0x0000003001c299e3 in ?? () from /lib64/libnspr4.so
#13 0x0000003ffec07851 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003ffe8e767d in clone () from /lib64/libc.so.6
(gdb) p conn
$4 = (Connection *) 0x7f4470e68670
(gdb) p *conn
$5 = {c_sb = 0x88b1610, c_sd = 84, c_ldapversion = 3,
c_dn = 0xa760d90
"fqdn=sti-high-3.testrelm.com,cn=computers,cn=accounts,dc=testrelm,dc=com",
c_isroot = 0, c_isreplication_session = 0,
c_authtype = 0x995b480 "SASL GSSAPI", c_external_dn = 0x0,
c_external_authtype = 0x30014cf896 "none", cin_addr = 0xae24a80,
cin_destaddr = 0xa607410, c_domain = 0x0, c_ops = 0xae9e640,
c_gettingber = 0, c_currentber = 0x0, c_starttime = 1349254852,
c_connid = 140400, c_opsinitiated = 59, c_opscompleted = 57,
c_threadnumber = 2, c_refcnt = 3, c_mutex = 0x824f3e0,
c_pdumutex = 0x89d5060, c_idlesince = 1349254862, c_private = 0x890d590,
c_flags = 34, c_needpw = 0, c_client_cert = 0x0, c_prfd = 0x85bac50,
c_ci = 84, c_fdi = -1, c_next = 0x7f4470e682c8, c_prev = 0x7f4470e69ed0,
c_bi_backend = 0x0, c_extension = 0xb145440, c_sasl_conn = 0xb5e9990,
c_local_ssf = 0, c_sasl_ssf = 56, c_ssl_ssf = 0, c_unix_local = 0,
c_local_valid = 0, c_local_uid = 0, c_local_gid = 0, c_pagedresults = {
prl_maxlen = 64, prl_count = 6, prl_list = 0x94f5c30},
c_push_io_layer_cb = 0, c_pop_io_layer_cb = 0, c_io_layer_cb_data = 0x0}

Thread 1 (Thread 0x7f44818d87c0 (LWP 21797)):
#0 0x0000003ffec0e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003ffec093be in _L_lock_995 () from /lib64/libpthread.so.0
#2 0x0000003ffec09326 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000003001c23f79 in PR_Lock () from /lib64/libnspr4.so
#4 0x0000000000417ef6 in setup_pr_read_pds (ports=0x7fff13810910)
at ldap/servers/slapd/daemon.c:1675
#5 slapd_daemon (ports=0x7fff13810910) at ldap/servers/slapd/daemon.c:1144
#6 0x000000000041f22f in main (argc=7, argv=0x7fff13810ca8)
at ldap/servers/slapd/main.c:1253
(gdb) p c
$1 = (Connection *) 0x7f4470e68670
(gdb) p *c
$2 = {c_sb = 0x88b1610, c_sd = 84, c_ldapversion = 3,
c_dn = 0xa760d90
"fqdn=sti-high-3.testrelm.com,cn=computers,cn=accounts,dc=testrelm,dc=com",
c_isroot = 0, c_isreplication_session = 0,
c_authtype = 0x995b480 "SASL GSSAPI", c_external_dn = 0x0,
c_external_authtype = 0x30014cf896 "none", cin_addr = 0xae24a80,
cin_destaddr = 0xa607410, c_domain = 0x0, c_ops = 0xae9e640,
c_gettingber = 0, c_currentber = 0x0, c_starttime = 1349254852,
c_connid = 140400, c_opsinitiated = 59, c_opscompleted = 57,
c_threadnumber = 2, c_refcnt = 3, c_mutex = 0x824f3e0,
c_pdumutex = 0x89d5060, c_idlesince = 1349254862, c_private = 0x890d590,
c_flags = 34, c_needpw = 0, c_client_cert = 0x0, c_prfd = 0x85bac50,
c_ci = 84, c_fdi = -1, c_next = 0x7f4470e682c8, c_prev = 0x7f4470e69ed0,
c_bi_backend = 0x0, c_extension = 0xb145440, c_sasl_conn = 0xb5e9990,
c_local_ssf = 0, c_sasl_ssf = 56, c_ssl_ssf = 0, c_unix_local = 0,
c_local_valid = 0, c_local_uid = 0, c_local_gid = 0, c_pagedresults = {
prl_maxlen = 64, prl_count = 6, prl_list = 0x94f5c30},
c_push_io_layer_cb = 0, c_pop_io_layer_cb = 0, c_io_layer_cb_data = 0x0}
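
For readers unfamiliar with this failure mode, the self-deadlock described above can be reproduced in miniature. The following is only a sketch under stated assumptions, not the server code and not the committed fix: demo_conn, demo_conn_init, demo_free_one_msgid, demo_abandon_buggy and demo_abandon_fixed are hypothetical names, and plain pthreads is used instead of NSPR's PR_Lock. The mutex is created as PTHREAD_MUTEX_ERRORCHECK so that relocking from the owning thread returns EDEADLK instead of hanging the way thread 23 does. The "locked" flag is modelled on the locked= argument visible in the pagedresults_set_search_result frame of thread 19; whether the real change used that approach is not stated in this ticket.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the server's Connection; not the real struct. */
typedef struct {
    pthread_mutex_t c_mutex;
    int freed_msgid; /* last msgid released, -1 if none */
} demo_conn;

/* Create c_mutex as an error-checking mutex: a second lock attempt by the
 * owning thread returns EDEADLK instead of blocking forever. */
int demo_conn_init(demo_conn *conn)
{
    pthread_mutexattr_t attr;
    int rc;

    assert(conn != NULL);
    rc = pthread_mutexattr_init(&attr);
    if (rc != 0) {
        return rc;
    }
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0) {
        rc = pthread_mutex_init(&conn->c_mutex, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    conn->freed_msgid = -1;
    return rc;
}

/* Helper that takes c_mutex itself unless the caller says it already
 * holds it (locked != 0). Returns 0 on success or a pthread error code. */
int demo_free_one_msgid(demo_conn *conn, int msgid, int locked)
{
    assert(conn != NULL);
    if (!locked) {
        int rc = pthread_mutex_lock(&conn->c_mutex);
        if (rc != 0) {
            return rc; /* EDEADLK when this thread already owns the mutex */
        }
    }
    conn->freed_msgid = msgid;
    if (!locked) {
        pthread_mutex_unlock(&conn->c_mutex);
    }
    return 0;
}

/* Buggy shape: the caller holds c_mutex, then the helper locks it again. */
int demo_abandon_buggy(demo_conn *conn, int msgid)
{
    int rc;

    pthread_mutex_lock(&conn->c_mutex);
    rc = demo_free_one_msgid(conn, msgid, 0);
    pthread_mutex_unlock(&conn->c_mutex);
    return rc;
}

/* One possible repair: tell the helper that the lock is already held. */
int demo_abandon_fixed(demo_conn *conn, int msgid)
{
    int rc;

    pthread_mutex_lock(&conn->c_mutex);
    rc = demo_free_one_msgid(conn, msgid, 1);
    pthread_mutex_unlock(&conn->c_mutex);
    return rc;
}
```

After demo_conn_init, calling demo_abandon_buggy reports EDEADLK and leaves freed_msgid untouched, while demo_abandon_fixed returns 0 and records the msgid. With a default (non-error-checking) mutex the buggy variant would simply block, and every other thread needing the same connection mutex would queue up behind it, which matches the three waiting threads in the backtraces.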

Reviewed by Rich (Thank you!!)

Pushed to master.

$ git push
Counting objects: 19, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (10/10), 1.78 KiB, done.
Total 10 (delta 8), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
cd48bbd..c19bb9d master -> master

Cherry-picked and pushed to external 389-ds-base-1.2.11.

$ git cherry-pick -x -e c19bb9d

$ git push origin 389-ds-base-1.2.11:389-ds-base-1.2.11
Counting objects: 19, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (10/10), 1.83 KiB, done.
Total 10 (delta 8), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
009fd8c..4d82507 389-ds-base-1.2.11 -> 389-ds-base-1.2.11

Metadata Update from @nhosoi:
- Issue assigned to rmeggins
- Issue set to the milestone: 1.2.11.16

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/485

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

3 years ago
