#48843 Server crashes under load with SEGV in thread #1, when connection id=0, fd=0
Closed: wontfix. Opened 7 years ago by hheinze.

Background:

We had been running a Red Hat-supplied 389-ds-base implementation (1.3.4.0). This implementation caused TCP latency spikes of 1000 ms and multiples thereof under load (up to 100 searches/s, 10-30 concurrent connections), leading to TCP retransmissions, aborted connections and generally poor performance.

A later version (1.3.4.5) from the 389 Directory Server project did not show any improvement.

An even later version, which we are currently using in production (1.3.5.9), has fixed the latency issue. We suspect that a patch in 1.3.4.7 has brought the improvement (Ticket #48341: 0001-Ticket-48341-deadlock-on-connection-mutex.patch). With version 1.3.5.9, however, we are experiencing frequent crashes of loaded servers (every couple of hours). We have not been able to reproduce the issue with artificial load patterns in our development environment.

The issue:

Currently we are trying to upgrade to 1.3.5.1 (or by now: 1.3.5.3) but are waiting for a fix for shadow attributes.
As there had been no significant changes to the connection handling since 1.3.4.9, I would like to flag the issue even before upgrading to 1.3.5.x (with debuginfo enabled for core dumps).

What we are seeing so far is:

  • crashes are segfaults
  • thread #1 is crashing
  • the crashing function is connection_table_move_connection_out_of_active_list (in ldap/servers/slapd/conntable.c)
  • connection id is 0, fd is 0
  • an error log line "connection - conn=0 fd=0 Attempt to release connection that is not acquired" is always the last logged line before the crash.

This pattern is always the same.

I will try to supply an up-to-date stack trace or a core dump once we have migrated to the current 1.3.5.x version.


Correction: references to version 1.3.5.9 should read 1.3.4.9.

Note -- tried to understand the code, not sure whether the file descriptor of connection 0 should ever become 0 -- currently trying a workaround that just returns immediately from connection_table_move_connection_out_of_active_list when the mentioned conditions apply.

Helmut

Have you enabled nunc-stans?

{{{
dn: cn=config
nsslapd-enable-nunc-stans: on
}}}
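For reference, one way to check the current value over LDAP (the host and bind DN here are assumptions, adjust to your environment):

{{{
ldapsearch -x -H ldap://localhost -D "cn=Directory Manager" -W \
    -b "cn=config" -s base "(objectclass=*)" nsslapd-enable-nunc-stans
}}}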

We fixed something very similar to this in: https://fedorahosted.org/389/ticket/48311

As for the crash, can you please obtain a stack trace using the method below:

http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-crashes
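In case it is useful, a one-shot invocation in the same style as the buffer-dump command further down in this ticket can also pull a full backtrace out of a core file (instance name, PID and output path are placeholders):

{{{
gdb -ex 'set pagination off' -ex 'thread apply all bt full' -ex 'quit' /usr/sbin/ns-slapd /var/log/dirsrv/slapd-INSTANCENAME/core.PID > stacktrace.txt 2>&1
}}}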

Thanks

Hi,

nunc-stans is on (we put a lot of hope into it since 1.3.4.5) -- stack traces forthcoming, may need a few days as we have to prepare our production environment

Thanks
Helmut

Hi,

OK, this is what happened:

We have rolled out 1.3.5.1 with debug info. Within one hour we had 4 nodes out of 9 crashing, all with the same pattern (according to logs typical load 150 searches/s, 10 connections/s).

We then rolled back to 1.3.4.9, with no debug info except on a single node. That node crashed as well, with the same pattern. We finally backed out entirely to 1.3.4.9 compiled without any debugging additions, back to what we had been running for a few weeks and which shows the SEGV crashes described earlier.

The interesting thing is that enabling debugging seems to shift the pattern a little; the servers then seem to crash on an ABRT signal.

Find attached a stack trace from a crashed 1.3.5.1 node.

Cheers,
Helmut

{{{
static int
ns_handle_closure_nomutex(Connection *c)
{
    int rc = 0;
    PR_ASSERT(c->c_refcnt > 0); /* one for the conn active list, plus possible other threads */
    PR_ASSERT(CONN_NEEDS_CLOSING(c));
}}}
This means the refcnt was 0, which means some sort of race condition.
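To illustrate the suspected failure mode, here is a minimal standalone sketch (illustrative only; fake_conn and release_list_ref are made-up names, not 389-ds-base code): if two code paths both believe they own the active-list reference and each release it, the refcount reaches 0 before the closure handler runs, and an assertion like the PR_ASSERT above fires.

{{{
/* Standalone sketch of a double release of the active-list reference.
 * Compile with: gcc -pthread sketch.c */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

struct fake_conn {
    int refcnt;
};

/* Buggy release path: nothing prevents it from being called twice for the
 * same logical reference, and the decrement is not protected by a mutex. */
static void *release_list_ref(void *arg)
{
    struct fake_conn *c = arg;
    c->refcnt--;
    return NULL;
}

int main(void)
{
    /* One reference for the connection active list, one for the closure job. */
    struct fake_conn c = { .refcnt = 2 };
    pthread_t t1, t2;

    /* Two threads both release the "active list" reference. */
    pthread_create(&t1, NULL, release_list_ref, &c);
    pthread_create(&t2, NULL, release_list_ref, &c);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("refcnt = %d\n", c.refcnt);
    /* The closure job still expects to hold a reference; this mirrors the
     * PR_ASSERT above and fails after the double release. */
    assert(c.refcnt > 0);
    return 0;
}
}}}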

Hi,

Just a little update. We have put in a silly workaround patch directly into conntable.c:

{{{
--- 389-ds-base-1.3.5.1/ldap/servers/slapd//conntable.c 2016-03-24 08:13:11.000000000 +1100
+++ deakin-patches/ldap/servers/slapd//conntable.c 2016-05-23 09:54:33.452823891 +1000
@@ -261,6 +261,18 @@
                 c_sd, c->c_refcnt);
         return 1; /* failed */
     }
+    /*
+     * Temporary Deakin workaround for crashes:
+     * See:
+     * https://wiki.deakin.edu.au/display/sysinfrastructure/Sys+-+LDAP+-+Known+Issues#Sys-LDAP-KnownIssues-Version1.3.4.9Crashingwithsegfaults
+     * https://fedorahosted.org/389/ticket/48843
+     */
+    if(c->c_connid == 0 && c->c_sd == 0) {
+        slapi_log_error(SLAPI_LOG_FATAL, "connection",
+            "DEAKIN WORKAROUND: conn_id=%d sd=%d not removing connection from connection table in conntable.c\n",
+            c->c_connid, c->c_sd);
+        return 1; /* failed */
+    }
 
     /* We need to lock here because we're modifying the linked list */
     PR_Lock(ct->table_mutex);
}}}

This patch seems to avert crashes, at least short term. It indeed looks as if a connection is occasionally removed twice, the second time with (some of) its values already set to 0 in connection.c:connection_cleanup(). I am not suggesting this as a solution; we are just trialling it as a desperate workaround for now.
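As a purely illustrative alternative (not the conntable.c data structures), a double removal can also be detected more directly by having the list node remember whether it is currently linked, rather than relying on other fields having been zeroed by cleanup; head/tail bookkeeping is omitted here:

{{{
/* Illustrative sketch only -- struct node and list_remove() are made-up
 * names, not 389-ds-base code. */
struct node {
    struct node *prev;
    struct node *next;
    int on_list;    /* set when the node is inserted into the list */
};

int
list_remove(struct node *n)
{
    if (!n->on_list) {
        /* Already removed once -- refuse, like the workaround's early return. */
        return 1; /* failed */
    }
    if (n->prev) {
        n->prev->next = n->next;
    }
    if (n->next) {
        n->next->prev = n->prev;
    }
    n->prev = NULL;
    n->next = NULL;
    n->on_list = 0;
    return 0; /* ok */
}
}}}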

Not sure whether this helps understanding the issue.

Cheers,
Helmut

changing component to nunc-stans as I believe this issue is only seen when nsslapd-enable-nunc-stans is enabled.

Replying to [comment:7 rmeggins]:

changing component to nunc-stans as I believe this issue is only seen when nsslapd-enable-nunc-stans is enabled.

Hi,

I finally had the opportunity to test with nunc-stans disabled for now 3 days. No crashes so far, seems to confirm the suspicion that nunc-stans plays a role. A little bit too early to make a call but I thought I let you know.

Cheers,
Helmut

hheinze,

Yes, this is definitely an issue with nunc-stans that we would really like to fix asap.

You said it looks like the code tried to close the connection twice. I'm curious what the closure code of the first "close" was. I'm wondering if the connection had an odd closure code that might be triggering the crash. Since this is a crash, and there is access log buffering, it probably did not make it into the log on disk, but it was in memory (in the core file). Would you be willing to reproduce this again, and gather new data?

Action plan:

[1] Reproduce crash
[2] Open core in gdb, and run:

{{{
(gdb) thread apply all bt full
}}}

Provide this output

[3] Then find the crashing thread and switch to it:

{{{
(gdb) thread <crashing thread number>
(gdb) up
(gdb) up
(gdb) up   ---> you should be in ns_handle_closure_nomutex() now
(gdb) p *c
}}}

This will display the connection structure so we can find the connection ID, which will allow us to find the connection in the access log/buffer. Then provide the output of the "c" structure.
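If the full structure print is too noisy, the two fields that the workaround patch above keys on can also be printed individually from the same frame:

{{{
(gdb) p c->c_connid
(gdb) p c->c_sd
}}}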

[4] Dump the access log buffer

http://www.port389.org/docs/389ds/FAQ/faq.html#printing-access-log-buffer

{{{
(gdb) set print elements 0
(gdb) set logging file access.buf
(gdb) set logging on
(gdb) p loginfo.log_access_buffer.top
(gdb) set logging off
(gdb) set print elements 300
}}}

Then from the command line run:

{{{

gdb -ex 'set print elements 0' -ex 'set confirm off' -ex 'set pagination off' -ex 'print loginfo.log_access_buffer.top' -ex 'quit' /usr/sbin/ns-slapd /var/log/dirsrv/slapd-INSTANCENAME/core.PID > access.buf 2>&1

}}}

Cleanup this output:

{{{

sed -i -e 's/\\n/\n/g' access.buf

}}}

Provide access.buf

[5] Finally provide the access log from disk (/var/log/dirsrv/slapd-INSTANCE/access)

Thanks!
Mark

Our QE team has not been able to reproduce this. hheinze, are you willing to try and reproduce this again and gather more info? See my last update. If not, please let me know.

Thanks,
Mark

Hi,

Apologies for the delay -- unfortunately the crashes only occur under high, diverse load, about once a day in our production environment (up to 60 connections/s, 250 searches/s during peak times).

We have not been able to emulate the crash conditions in our test environment, even when driving load up to 600 searches/second from up to 15 different servers, some in different network segments, with test load based on replaying production logs.

Unfortunately we cannot run a production server under gdb or with extensive logging, as LDAP constitutes a critical service. Even debug-level logging is not possible due to its impact on performance (it interferes with our load-balancing strategy, which assumes roughly equal capacity on each node).

A possible option may be injecting some debugging code that logs non-zero closure codes.
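Something along these lines, patterned after the slapi_log_error() call in the workaround patch above, might be enough (log_nonzero_closure_code() is a hypothetical helper, not an existing 389-ds-base function; where it would be hooked in is also an assumption):

{{{
/* Hypothetical debugging helper: call it wherever a connection is closed,
 * passing the closure/disconnect code, and only log the unusual cases. */
static void
log_nonzero_closure_code(Connection *c, int closure_code)
{
    if (closure_code != 0) {
        slapi_log_error(SLAPI_LOG_FATAL, "connection",
            "DEBUG closure: conn_id=%d sd=%d closure_code=%d\n",
            c->c_connid, c->c_sd, closure_code);
    }
}
}}}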

Sorry for not being able to help with this request -- I'll keep thinking of alternative approaches, albeit only on the fringes of my other work.

Cheers,
Helmut

Replying to [comment:17 hheinze]:

Hi,

Apologies for the delay -- unfortunately the crashes only occur under high, diverse load, about once a day in our production environment (up to 60 connections/s, 250 searches/s during peak times).

We have not been able to emulate the crash conditions in our test environment, even when driving load up to 600 searches/second from up to 15 different servers, some in different network segments, with test load based on replaying production logs.

Unfortunately we cannot run a production server under gdb or with extensive logging, as LDAP constitutes a critical service. Even debug-level logging is not possible due to its impact on performance (it interferes with our load-balancing strategy, which assumes roughly equal capacity on each node).

A possible option may be injecting some debugging code that logs non-zero closure codes.

Sorry for not being able to help with this request -- I'll keep thinking of alternative approaches, albeit only on the fringes of my other work.

I completely understand, do you happen to have any access logs from the time when the crashes were occurring?

Also, are you running with your code fix, or did you just turn off nunc-stans? I'm curious if using nunc-stans, with your "fix", was giving you better performance?

Thanks,
Mark

Hi Mark,

we are running 8 nodes with nunc-stans (+ the workaround fix) and 1 node without nunc-stans (but still with the fix in place, of course). I see no significant difference in performance; it may be decreased by 2-3%, but this is within the range we get between different nodes anyway and may be due to some network routing issues.

I'll chase up the access logs during a couple of the averted crashes to gauge whether there's some common pattern.

Helmut

Replying to [comment:19 hheinze]:

Hi Mark,

we are running 8 nodes with nunc-stans (+ the workaround fix) and 1 node without nunc-stans (but still with the fix in place, of course). I see no significant difference in performance; it may be decreased by 2-3%, but this is within the range we get between different nodes anyway and may be due to some network routing issues.

I'll chase up the access logs during a couple of the averted crashes to gauge whether there's some common pattern.

Actually, I don't really need the access logs; what I need is for you to run logconv.pl on your access logs from the time when it was crashing:

{{{
./logconv.pl -cip -s 1000 <access logs>
}}}

Thanks,
Mark

Helmut,

I've attached a patch that should retain the original connection id, and log it when the problem occurs. If you can build/test this patch, then we should be able to get some useful information, and be able to track down the client IP address, what the original connection did, and how it was closed (the first time). Just grep for "conn=### " from the access log and we should see everything it did.
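For example (conn=12345 is a placeholder; substitute the id reported by the patch's log message, and adjust the instance name):

{{{
grep "conn=12345 " /var/log/dirsrv/slapd-INSTANCE/access*
}}}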

This information would be very helpful for this investigation.

Thanks,
Mark

Replying to [comment:21 mreynolds]:

Helmut,

I've attached a patch that should retain the original connection id, and log it when the problem occurs. If you can build/test this patch, then we should be able to get some useful information, and be able to track down the client IP address, what the original connection did, and how it was closed (the first time). Just grep for "conn=### " from the access log and we should see everything it did.

This information would be very helpful for this investigation.

Thanks,
Mark

Hi Mark,

I reviewed the patch and deployed a new set of RPMs in our development environment for preliminary testing. I will push it into UAT early next week and then into a selected production node by mid-week. So by the end of next week we should be wiser.

Helmut

Replying to [comment:22 hheinze]:

Hi Mark,

I reviewed the patch and deployed a new set of RPMs in our development environment for preliminary testing. I will push it into UAT early next week and then into a selected production node by mid-week. So by the end of next week we should be wiser.

Helmut

This is great, thank you very much!

Question: how often do you see the original error message that you added? Just curious how often you are hitting the issue.

Thanks,
Mark

Replying to [comment:23 mreynolds]:

Replying to [comment:22 hheinze]:

Hi Mark,

I reviewed the patch and deployed a new set of RPMs in our development environment for preliminary testing. I will push it into UAT early next week and then into a selected production node by mid-week. So by the end of next week we should be wiser.

Helmut

This is great, thank you very much!

Question: how often do you see the original error message that you added? Just curious how often you are hitting the issue.

Thanks,
Mark

Crashes happen between once a week and a couple of times per day (this seems to be very dependent on the load pattern and possibly on some network timing, which might explain why some nodes crash more frequently than others).

H

Just checking in, it's been 4 weeks since our last update.

Checking in again, any news?

Checking in one last time, is there any news on the test patch that was provided?

Nunc-stans has recently had some major rewrites that affect the area of the code where this crash occurred. Since there is no update, and no way to reproduce the crash, this ticket will be closed.

Per triage meeting:
Much more likely, fixed by NS 0.2.0. I think close with 1.3.6 (available on F26 and newer)

Metadata Update from @nhosoi:
- Issue assigned to mreynolds
- Issue set to the milestone: FUTURE

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1903

If you want to receive further updates on the issue, please navigate to the github issue
and click on the subscribe button.

Thank you for your understanding. We apologize for any inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Invalid)

3 years ago
