Ticket #47629 (closed defect: fixed)

Opened 4 months ago

Last modified 3 months ago

random crash in send_ldap_search_entry_ext()

Reported by: pspacek Owned by:
Priority: blocker Milestone: 1.3.2.11
Component: Directory Server Version:
Keywords: Cc: pspacek, lkrispen@…
Blocked By: Blocking:
Review: ack Ticket origin: IPA
Red Hat Bugzilla:

Description

Hello,

my 389-ds-base-1.3.2.8-1.fc20.x86_64 is randomly crashing somewhere in send_ldap_search_entry_ext() function. I have captured coredump and stack traces, I will attach them to this ticket.

You can analyze the coredump in Brno lab, vm-123.

I have a feeling that it is related to connection handling/connection termination, but I'm not able to find solid reproducer. The crash in the same function occurs randomly but I have seen that several times.

Attachments

core.21937.xz (2.1 MB) - added by pspacek 4 months ago.
backtrace.log (155.6 KB) - added by pspacek 4 months ago.
gcore.log (366 bytes) - added by pspacek 4 months ago.
errors printed by GDB after running 'gcore' command
rpm-qa.log (36.6 KB) - added by pspacek 4 months ago.
output from "rpm -qa"
valgrind.log (179.8 KB) - added by pspacek 4 months ago.
slapd.vg.32575 (17.7 KB) - added by rmeggins 4 months ago.
rhel 6.5 389-ds-base 1.3.2 valgrind with -E sync=rp
slapd.vg.21622 (8.8 KB) - added by rmeggins 4 months ago.
rhel 6.5 389-ds-base 1.3.2 valgrind with -E sync=ro
0001-fix-various-problems-that-cause-a-lot-of-valgrind-no.patch (8.4 KB) - added by rmeggins 4 months ago.
0001-fix-various-problems-that-cause-a-lot-of-valgrind-no.patch
0001-Ticket-47629-random-crashes-related-to-sync-repl.patch (5.6 KB) - added by lkrispen 4 months ago.

Change History

Changed 4 months ago by pspacek

Changed 4 months ago by pspacek

Changed 4 months ago by pspacek

errors printed by GDB after running 'gcore' command

Changed 4 months ago by pspacek

output from "rpm -qa"

comment:1 Changed 4 months ago by rmeggins

  • Milestone changed from 0.0 NEEDS_TRIAGE to 1.3.2.9

Looks like the search is related to syncrepl.

comment:2 Changed 4 months ago by lkrispen

@richm: looking at backtrace.log I do see only one search:
rawbase = 0x7f6224000e40 "cn=dns-100x100, dc=ipa,dc=test"
fstr = 0x7f6224001420 "(|(objectClass=idnsConfigObject)(objectClass=idnsZone)(objectClass=idnsForwardZone)(objectClass=idnsRecord))

which is client srch request, so I don't see the relation to syncrepl.

@pspacek: what makes you think it is related to connection management ? syncrepl in persistent mode keeps the connection and operation, maybe there is a side effect of freeing some parts of the conn.
Could you collect a few more core dumps from the next crashes ?

comment:3 Changed 4 months ago by rmeggins

Thread 42 is in sync_send_results() referencing connection conn = 0x7f624c313410. This is the same connection that is doing the search - Thread 32 - #7 0x00007f6260db024b in flush_ber (pb=pb@entry=0x7f622f7f5ae0, conn=conn@entry=0x7f624c313410

pspacek mentioned that he only sees the crash when using syncrepl.

comment:4 Changed 4 months ago by lkrispen

yes, sync_send_results is the thread for the peristent phase, but it still waiting on the condition variable to get activated once the search has sent the initial content, so this should be ok.

I agree that the crash is most likely related to syncrepl, but don't see how.

pspacek, do you see any pattern when it happens, do you use cookies ?

comment:5 Changed 4 months ago by lkrispen

  • Cc lkrispen@… added

Changed 4 months ago by pspacek

comment:6 Changed 4 months ago by pspacek

I found that send_ldap_search_entry() operates on uninitialized memory sometimes. See attachment:valgrind.log from line 1030.

I have no idea if it is related or not.

The attachment:valgrind.log file also contains memory leaks, some of them look serious. At least the leak on line 2951 seems to affect any any.

I used following commands (running in parallel) for this test:

while true; do sleep $(echo "scale=5; 0.05 + 0.$RANDOM" | bc); pkill -9 ldapsearch; done
while true; do ldapsearch -H "ldap://vm-123:389" -w password -D 'cn=Directory manager' -b "cn=dns,dc=ipa,dc=test" -E sync=rp; done

This code runs ldapsearch with synrepl in refresh & persist mode in a loop and kills the client at arbitrary times.

comment:7 Changed 4 months ago by nkinder

  • Milestone changed from 1.3.2.9 to 1.3.2.10

Changed 4 months ago by rmeggins

rhel 6.5 389-ds-base 1.3.2 valgrind with -E sync=rp

Changed 4 months ago by rmeggins

rhel 6.5 389-ds-base 1.3.2 valgrind with -E sync=ro

Changed 4 months ago by rmeggins

0001-fix-various-problems-that-cause-a-lot-of-valgrind-no.patch

comment:8 Changed 4 months ago by rmeggins

I ran a lot of tests with various combinations of search parameters. I even ran mozldap ldapsearch with persistent search. All of them are mostly valgrind clean except for ldapsearch -E sync=rp https://fedorahosted.org/389/attachment/ticket/47629/slapd.vg.32575 and -E sync=ro https://fedorahosted.org/389/attachment/ticket/47629/slapd.vg.21622

Here is the valgrind related patch I have been using:
https://fedorahosted.org/389/attachment/ticket/47629/0001-fix-various-problems-that-cause-a-lot-of-valgrind-no.patch

It looks like the code in sync_send_results() has mostly been copied from ps_send_results(), with the exception of the following: ps_send_results has some extra pblock cleanup, starting at psearch.c:401; ps_send_results calls connection_remove_operation_ext() which is required in the new no-malloc style op stack; there is this special code in connection_threadmain():

		/* If this op isn't a persistent search, remove it */
		if ( pb->pb_op->o_flags & OP_FLAG_PS ) {
			    PR_Lock( conn->c_mutex );
			    connection_release_nolock (conn); /* psearch acquires ref to conn - release this one now */
			    PR_Unlock( conn->c_mutex );
			    /* ps_add makes a shallow copy of the pb - so we
			     * can't free it or init it here - just memset it to 0
			     * ps_send_results will call connection_remove_operation_ext to free it
			     */
			    memset(pb, 0, sizeof(*pb));

If the intention is that sync_send_results() "owns" the operation pblock and is responsible for its lifecycle management, then somewhere in the syncrepl code the pb->pb_op->o_flags needs to set the OP_FLAG_PS flag so that connection_threadmain knows not to touch this pblock.

comment:9 Changed 4 months ago by rmeggins

Or, we can add some other flag (OP_FLAG_PBLOCK_COPIED?)

comment:10 Changed 4 months ago by lkrispen

The OP_FLAG_PS flag is set, it is set in the pre_search plugin:
sync.h:#define OP_FLAG_SYNC_PERSIST 0x01
sync_refresh.c: slapi_operation_set_flag(operation, OP_FLAG_SYNC_PERSIST);

Regarding the valgrind report, I don't see where uninitialized memory is used, the cookie is set/get by the slapi_set/get_object_extension

All the crashes are in the refresh phase when the existing entries are sent and the pre_entry plugin adds a control, so the sync_send_results is not yet active, just waiting to get notified

comment:11 Changed 4 months ago by lkrispen

second thought, sync_send_results will also check if op is abandoned and does some cleanup, will check.

comment:12 Changed 4 months ago by lkrispen

I understand the crash I'm seeing in the test scenario. The intention of the refresh and persist implementation is to

  • send all requested initial entries (all or depending on a cookie)
  • send all modified entries using a seperate thread running sync_send_results

and sending modified entries should only start when th einitial refresh is completed.
This works with cookies, but without cookie the sync_send_result start to early and interleaves with sending the regular entries. The problem is that the persitent thhread uses the same operation as the original search thread (no conflict if it starts after refresh is complete), and both set and get: SLAPI_SEARCH_RESULT_ENTRY and free it when done, so in some cases on thread tries to use an entry which was just freed.
A fix would be to correctly delay the start of the sync_send_results

The other "crash" doesn't seem to be a crash. If running the test under gdb, gdb halts when the connetion is terminated while writing lin libc_send and logs: Program received signal SIGPIPE, but with continue the process continues and the error is handled in flush_ber calling do_disconnect_server.

comment:13 Changed 4 months ago by lkrispen

with the attached patch the regular crashes in the test env provided by pspacek are no longer reproducable. But the client machine did runout of memory after some time, so long duration test was not yet possible

next I will look into the problems reported by valgrind

comment:14 Changed 3 months ago by nkinder

  • screened changed from 0 to 1

comment:15 Changed 3 months ago by nhosoi

  • Milestone changed from 1.3.2.10 to 1.3.2.11

comment:16 Changed 3 months ago by lkrispen

  • Review set to review?

fix retested in new environment, could not reproduce teh crash.
The mem leak will be handled in ticket 47672

comment:17 Changed 3 months ago by rmeggins

  • Review changed from review? to ack

comment:18 Changed 3 months ago by tbordaz

Hi Ludwig,

I have a question not related to that fix. In sync_persist_terminate, it may call sync_remove_request to remove a request from sync_request_list.
In sync_remove_request I have not seen any free of the request, only removal from the list. How the request is retrieved to be freed ?

regards
thierry

comment:19 Changed 3 months ago by lkrispen

@thiery: in sync_persist_terminate active is set to false and complete to true, so the thread terminates and frees the request it is serving, in sync_send_results()

comment:20 Changed 3 months ago by lkrispen

  • Resolution set to fixed
  • Status changed from new to closed

$ git merge ticket47629
Updating 36c48b8..659b777
Fast-forward

ldap/servers/plugins/sync/sync.h | 7 ++++---
ldap/servers/plugins/sync/sync_persist.c | 3 ++-
ldap/servers/plugins/sync/sync_refresh.c | 18 ++++++++++++------
3 files changed, 18 insertions(+), 10 deletions(-)

$ git push origin master
Counting objects: 17, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (9/9), done.
Writing objects: 100% (9/9), 1.30 KiB, done.
Total 9 (delta 6), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git

36c48b8..659b777 master -> master

$ git push origin 389-ds-base-1.3.2
Counting objects: 17, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (9/9), done.
Writing objects: 100% (9/9), 1.30 KiB, done.
Total 9 (delta 6), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git

270ad9a..d06de4e 389-ds-base-1.3.2 -> 389-ds-base-1.3.

Note: See TracTickets for help on using tickets.