#47542 Both master and slave LDAP 389-console were down, can't find why
Closed: wontfix. Opened 10 years ago by van12.

Hello,

Red Hat Enterprise Linux Server release 5.9 (Tikanga)

{{{
[root@centaur log]# rpm -qa|grep 389-
389-admin-1.1.29-1.el5
389-ds-console-1.2.6-1.el5
389-console-1.1.7-3.el5
389-ds-base-libs-1.2.10.14-2.el5
389-admin-console-1.1.8-1.el5
389-admin-console-doc-1.1.8-1.el5
389-adminutil-1.1.15-1.el5
389-ds-base-1.2.10.14-2.el5
389-dsgw-1.1.10-1.el5
389-ds-console-doc-1.2.6-1.el5
389-ds-1.2.1-1.el5
}}}

We lost both LDAP servers one morning, the master (centaur) and the slave (zpm). We can't see much in the logs, but we believe the master malfunctioned first and then affected the slave when it tried to replicate to the slave over port 389.

During this period, "nmap centaur/zpm -p 389" showed the port as closed.

{{{
Oct 1 10:50:01 centaur nscd: nss_ldap: could not search LDAP server - Server is unavailable
Oct 1 10:54:09 zpm crond[20435]: nss_ldap: could not search LDAP server - Server is unavailable
}}}

Restarting the LDAP services on both servers fixed the problem.

Memory and CPU usage at that time were normal. Is there a bug in 389-ds-1.2.1-1.el5?


The directory server appears to be crashing - read http://port389.org/wiki/FAQ#Debugging_Crashes

Since you're on EL5, you won't have debuginfo-install, so you'll have to figure out how to track down the debuginfo packages.

Thanks for the link.

The server didn't crash; nmap shows it still listening on port 636, but not on port 389. There is no core dump, and the ldap process is still there. It was definitely in a "stale" state.

The question is why. Is it because 389-console can't handle as many simultaneous requests as "RH Directory Server", or is there a bug?

In my experience, if I let 389-console run for more than 1-2 weeks, it takes a long time (several minutes) to shut down the process, and sometimes I need to kill the slapd process.
At the moment we restart the ldap server every 2 days (because I was afraid it degrades over time), but it still malfunctioned the other morning.

I have checked/increased the settings. What can it be?

{{{
[root@centaur ~]# ldapsearch -LLLx -h centaur -p 389 -D 'cn=directory manager' -W -b "cn=monitor" "(cn=*)" | egrep connections
Enter LDAP Password:
currentconnections: 779
totalconnections: 93981
connections: 187183
connectionseq: 93981
}}}

From /etc/security/limits.conf:
{{{
* soft nofile 16384
* hard nofile 16384
fds soft nproc 16384
fds hard nproc 16384
}}}

{{{
nsslapd-conntablesize: 8192
}}}

In /etc/profile:
{{{
ulimit -S -c 0 -n 16384 > /dev/null 2>&1
}}}

Replying to [comment:2 van12]:

Thanks for the link.

The server didn't crash; nmap shows it still listening on port 636, but not on port 389. There is no core dump,

The link explains why there may not be a core dump. You have to explicitly enable core dumps.
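
A minimal sketch of what that involves on EL5 (assuming the stock init script sources /etc/sysconfig/dirsrv; the core_pattern path is illustrative):

{{{
# Allow dirsrv to write unlimited-size core files (EL5 sysV init;
# assumption: the init script sources /etc/sysconfig/dirsrv).
echo 'ulimit -c unlimited' >> /etc/sysconfig/dirsrv

# Illustrative: send cores to a known place, tagged with the pid.
echo '/var/tmp/core.%p' > /proc/sys/kernel/core_pattern

service dirsrv restart
}}}

Note that the "ulimit -S -c 0" you have in /etc/profile would otherwise keep cores disabled for anything started from a login shell.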

From your link:
{{{
[01/Oct/2013:12:25:39 +0200] - Detected Disorderly Shutdown last time Directory Server was running, recovering database.
}}}

Disorderly shutdown means either you did a kill -9, your machine had a power failure, or dirsrv crashed.

the ldap process is still there. It was definitely in a "stale" state.

So, let's assume it is the first one - you did a kill -9 because regular service dirsrv stop didn't work.

The question is why. Is it because 389-console can't handle as many simultaneous requests as "RH Directory Server", or is there a bug?

I'm not sure what you mean by "389-console can't handle huge simultaneous requests" - are you saying you have several thousand 389-console apps running, all hitting the same directory server?

Let's assume the directory server is hung for some reason. The next step is to get a stack trace as described in http://port389.org/wiki/FAQ#Debugging_Hangs
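
For reference, the kind of trace that page asks for can be taken from the live process without killing it; a minimal sketch, assuming gdb and matching debuginfo packages are installed:

{{{
# Dump a full backtrace of every thread of the running ns-slapd.
gdb -ex 'set confirm off' -ex 'set pagination off' \
    -ex 'thread apply all bt full' -ex 'quit' \
    /usr/sbin/ns-slapd $(pidof ns-slapd) \
    > /var/tmp/stacktrace.$(date +%s).txt 2>&1
}}}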

In my experience, if I let 389-console run for more than 1-2 weeks, it takes a long time (several minutes) to shut down the process,

Several minutes to shut down the 389-console process? What java are you using?

and sometimes I need to kill the slapd process.

kill -9?

At the moment we restart the ldap server every 2 days (because I was afraid it degrades over time), but it still malfunctioned the other morning.

I have checked/increased the settings. What can it be?

I don't know. http://port389.org/wiki/FAQ#Debugging_Hangs

{{{
[root@centaur ~]# ldapsearch -LLLx -h centaur -p 389 -D 'cn=directory manager' -W -b "cn=monitor" "(cn=*)" | egrep connections
Enter LDAP Password:
currentconnections: 779
totalconnections: 93981
connections: 187183
connectionseq: 93981
}}}

This looks ok.


My point is that the ldap process becomes weaker and weaker (just like a human body built up with toxins) and in the end it malfunctions. This behaviour can only be seen if you have at least 100 accounts (with high activity) and 400 servers with all kinds of OS (and apps). It still responds to your requests, but after a while it will break down (get really sick) like the other morning.

That is why we restart the ldap server every 2 days.

Is there a limit on 389-console compared to the "RH Directory Server"?

Can I get in contact with the developer so he can look at this?

I'm not sure what you mean by "389-console can't handle huge simultaneous requests" - are you saying you have several thousand 389-console apps running, all hitting the same directory server?
Yes, there are many users/servers using the ldap server.
NB: commands like "w, ps, top" use the authentication server to resolve your id as well. During the incident, all ldap clients (logged in through iLO) hung when I issued "ps, su -, w, top".

kill -9?
Yes.

Several minutes to shut down the 389-console process? What java are you using?
Just from PuTTY (ssh), no fancy stuff:
service dirsrv stop

Replying to [comment:4 van12]:

My point is that the ldap process becomes weaker and weaker (just like a human body built up with toxins) and in the end it malfunctions. This behaviour can only be seen if you have at least 100 accounts (with high activity) and 400 servers with all kinds of OS (and apps). It still responds to your requests, but after a while it will break down (get really sick) like the other morning.

That is why we restart the ldap server every 2 days.

Is there a limit on 389-console compared to the "RH Directory Server"?

I don't know what you mean by that.

Can I get in contact with the developer so he can look at this?

You are currently in contact with a developer. I have asked you for a stack trace to debug this problem.


Replying to [comment:5 rmeggins]:


Is there a limit on 389-console compared to the "RH Directory Server"?

I don't know what you mean by that.
I mean: the ldap server becomes slower and slower with time, and in the end, even though it is running, you do not get any response from it. No ldap client can do authentication.

The software "389-console", when compare with Redhat licensed software "RH directory server", do you know what is the different? Can "RH DS" handle more connections and more stable than "389-console".

Can I get in contact with the developer so he can look at this?

You are currently in contact with a developer. I have asked you for a stack trace to debug this problem.

Great. I wish I could give you all the traces you want, but all evidence is gone after a reboot, and there was NO core dump in /var/log/dirsrv/slapd-INSTANCENAME as described in the link (the slapd process was still there during the incident; it did not die or crash).


Are there any limits (number of connections, etc.) on "389-console"? What could make slapd become slow/unstable over time? Kernel settings, cache, code/stack garbage?

It is a production environment; unfortunately, I can't replicate this behaviour for you (by not restarting slapd every 2 days as I do now).

Replying to [comment:6 van12]:


Is there a limit on 389-console compared to the "RH Directory Server"?

I don't know what you mean by that.
I mean: the ldap server becomes slower and slower with time, and in the end, even though it is running, you do not get any response from it. No ldap client can do authentication.

The software "389-console", when compare with Redhat licensed software "RH directory server", do you know what is the different? Can "RH DS" handle more connections and more stable than "389-console".

No. They are practically the same. You might try Java tuning.

Can I get in contact with the developer so he can look at this?

You are currently in contact with a developer. I have asked you for a stack trace to debug this problem.

Great. I wish I could give you all the traces you want, but all evidence is gone after a reboot, and there was NO core dump in /var/log/dirsrv/slapd-INSTANCENAME as described in the link (the slapd process was still there during the incident; it did not die or crash).

If kill -9 is causing the Disorderly Shutdown message, then there is no crash.


Are there any limits (number of connections, etc.) on "389-console"?

No? Are you thinking perhaps that a single 389-console opens more and more connections to the directory server over time? I don't think so, but you could use something like netstat to see how many connections are open to the directory server.
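
For example, a rough sketch that counts established connections to the standard port per client (adjust the port if yours differs):

{{{
# Group ESTABLISHED connections to port 389 by client IP.
netstat -ant | awk '$4 ~ /:389$/ && $6 == "ESTABLISHED" { split($5, a, ":"); print a[1] }' \
    | sort | uniq -c | sort -rn | head
}}}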

What could make slapd become slow/unstable over time? Kernel settings, cache, code/stack garbage?

All of the above and more.

It is a production environment; unfortunately, I can't replicate this behaviour for you (by not restarting slapd every 2 days as I do now).

Ok. At the first sign of directory server becoming sluggish, unresponsive, etc. please get a stack trace and attach to this ticket.

It happened again tonight; sorry, no stack trace since there is no core. It is getting worse: the symptom now appears just 2-7 minutes after we restart the ldap process, and it happens on both the master and the slave server. Right now I have done at least 100 restarts.

strace on the slapd process (started with the command line below)
{{{
/usr/sbin/ns-slapd -d 4 -D /etc/dirsrv/slapd-NNIT -i /var/run/dirsrv/slapd-NNIT.pid -w /var/run/dirsrv/slapd-NNIT.startpid
}}}
shows it hanging on a futex call:

{{{
[root@zpm dirsrv]# strace -p 10058
Process 10058 attached - interrupt to quit
futex(0x2b705c1736e0, FUTEX_WAIT_PRIVATE, 2, NULL
}}}

Please help,

http://stackoverflow.com/questions/3905883/python-hangs-in-futex-calls

My colleague showed me "kill -6" to create the core.

Here are 2 files, one from the core and one from the hung process.

stacktrace_core.1380960521.txt stacktrace_hang.1380960793.txt

Replying to [comment:8 van12]:

It happened again tonight; sorry, no stack trace since there is no core.

You don't need a core file to get a stack trace. You just need to use the gdb command as described in Debugging_Hangs.

The stack trace output doesn't contain much information. Please make sure you have the debuginfo packages redhat-ds-base-debuginfo, mozldap-debuginfo, db4-debuginfo, nss-debuginfo, nspr-debuginfo, and glibc-debuginfo, and make sure the version of each debuginfo package matches the version of the corresponding optimized package. If you are unsure, do this

{{{
rpm -q redhat-ds-base redhat-ds-base-debuginfo mozldap mozldap-debuginfo db4 db4-debuginfo nss nss-debuginfo nspr nspr-debuginfo glibc glibc-debuginfo
}}}

and put the output of that in this ticket.

Are you using StartTLS operations? Run logconv.pl against your access logs if you are unsure.
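
A plain invocation over the access logs is enough, for example (the instance path is an assumption):

{{{
logconv.pl /var/log/dirsrv/slapd-NNIT/access*
}}}

The extended-operation counters in its summary are where StartTLS traffic should show up.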


Many thanks, Rich, for your support. I succeeded in downloading the debuginfo rpms from http://rpm.pbone.net
The stack trace from the hung process is attached; I hope the debug info is there.

By the way, the network department told us there were/are MANY SYN flood attacks from 2 of the company's ldap clients. We shut them down (disabled the client agent), but the slapd daemon still goes sluggish after 10-30 minutes.

I took a tcpdump, and Wireshark shows many "TCP retransmission" and SYN packets. I can't see whether they are coming from a specific host; the source IP shows different IPs.
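
For reference, a capture like that can be taken with something along these lines (interface name and output path are assumptions):

{{{
# Capture full packets on the LDAP port for offline Wireshark analysis.
tcpdump -i eth0 -s 0 -w /var/tmp/ldap389.pcap port 389
}}}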

Sorry, the pcap shows some of our public IPs; if you are interested, please send your email address to the email bound to this fedorahosted.org account. I hereby attach 3 screenshots from Wireshark.

The LDAP server can accept 8192 connections, and there are only 282 connections (SYN_RECV, ESTABLISHED, ...).

nsslapd-maxdescriptors shows 8192 (dse.ldif)

{{{
[root@centaur tmp]# id tnng (my account)
id: tnng: No such user
[root@centaur tmp]# getent passwd|tail -2 (show only users from /etc/passwd)
ldapsync:x:10890:100:ldap user for sync:/home/ldapsync:/bin/bash
vmashcsn:x:2202:2202:M (Shell SN):/home/vmashcsn:/bin/bash
[root@centaur tmp]#
}}}

{{{
[root@centaur tmp]# netstat -a|grep -i ldap|wc
282 1692 25098
[root@centaur tmp]# netstat -a|wc
514 3190 43728
}}}

{{{
tcp 0 0 centaur:ldap sirius.net:27381 SYN_RECV
tcp 0 0 centaur:ldap uhana01:40202 SYN_RECV
tcp 0 1 centaur:43416 centaur:ldap SYN_SENT
tcp 1 0 centaur:11987 centaur:ldap CLOSE_WAIT
tcp 0 0 *:ldap *:* LISTEN
tcp 0 0 *:ldaps *:* LISTEN
tcp 66 0 centaur:ldaps kasodb006:63677 ESTABLISHED
tcp 597 0 centaur:ldap mail1:46611 CLOSE_WAIT
tcp 39 0 centaur:ldap porait01:48923 CLOSE_WAIT
tcp 32 0 centaur:ldap IT13439.research:55668 CLOSE_WAIT
}}}
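
The same listing can be tallied per TCP state with a one-liner like this (a sketch; -n keeps the port numeric so the pattern matches):

{{{
# Lots of SYN_RECV here would point at a SYN flood.
netstat -ant | awk '$4 ~ /:389$/ { print $6 }' | sort | uniq -c | sort -rn
}}}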

Thanks,
Tuan

Are you using StartTLS operations? Run logconv.pl against your access logs if you are unsure.

Yes, we use StartTLS for Linux and HP-UX. The AIX client doesn't support TLS, so I use ldaps there. logconv.pl shows all zeros, except this line:
{{{
Entry Operations: 255047
-rw------- 1 fds fds 25968492 Oct 6 13:01 access
}}}

Replying to [comment:12 van12]:

Are you using StartTLS operations? Run logconv.pl against your access logs if you are unsure.

Yes, we use StartTLS for Linux and HP-UX. The AIX client doesn't support TLS, so I use ldaps there. logconv.pl shows all zeros, except this line:
{{{
Entry Operations: 255047
-rw------- 1 fds fds 25968492 Oct 6 13:01 access
}}}

Thanks. Looking at the stack trace, and the fact that you are using StartTLS ops, this seems like https://fedorahosted.org/389/ticket/47375, which was fixed in 1.2.11.22. We are not planning to release a new version of 1.2.10.x for EL5. Instead, we are releasing 1.2.11.24 in EL5. There is a new build in epel-testing for EL5. Please try that build to see if it fixes your problem.

The task force found out the problem was with our three new ldap clients in Brazil. The combination of many hops, delay, VPN, and the don't-fragment (DF) bit was the reason the ldap server got a SYN attack (the clients didn't get the SYN-ACK back), and then 389-ds closed port 389 (but not port 636). There are thousands of duplicate SYNs on the firewall toward the ldap server.

Is there a way the ldap server can handle this kind of situation better? (Not break down, but just reject those SYNs?)

Can 389-ds set the DF bit depending on the environment? (If the OS has DF=1, does 389-ds obey this rule or does it do something else?)

Replying to [comment:14 van12]:

The task force found out the problem was with our three new ldap clients in Brazil. The combination of many hops, delay, VPN, and the don't-fragment (DF) bit was the reason the ldap server got a SYN attack (the clients didn't get the SYN-ACK back), and then 389-ds closed port 389 (but not port 636). There are thousands of duplicate SYNs on the firewall toward the ldap server.

Is there a way the ldap server can handle this kind of situation better? (Not break down, but just reject those SYNs?)

Not that I know of. Note that the stack trace you provided shows that the directory server is deadlocked - this deadlock has been fixed in 1.2.11.24 - I strongly encourage you to try that version.
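
One thing that can help at the OS level, independent of the directory server, is TCP syncookies; a sketch for EL5 (the backlog value is illustrative):

{{{
# Enable SYN cookies and enlarge the half-open connection queue.
sysctl -w net.ipv4.tcp_syncookies=1
sysctl -w net.ipv4.tcp_max_syn_backlog=4096
# Add the same keys to /etc/sysctl.conf to persist across reboots.
}}}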

Can 389-ds set the DF bit depending on the environment? (If the OS has DF=1, does 389-ds obey this rule or does it do something else?)

Can you provide a link to more documentation about the DF bit and how the OS TCP stack/API can access this information?

I would like to close this ticket. The original ticket was the crashing/deadlocking issue. If that problem has been fixed by 1.2.11.24, please confirm so I can close this ticket.

Then, if you want to open another ticket for improving 389's handling of SYN attacks/DF bit, please do so.

Thanks, we will try the new release.

I will open a new ticket about the SYN attacks and DF bit.

Again, thanks for your great support, Rich.

BR
Tuan

Metadata Update from @van12:
- Issue set to the milestone: 1.2.11.24

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/879

If you want to receive further updates on the issue, please navigate to the github issue and click on the subscribe button.

Thank you for understanding. We apologize for any inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

3 years ago


Metadata
Attachments: 3 (attached 10 years ago)