If the AD sssd_be gets a legitimate fork() failure, such as ENOMEM / 'Cannot allocate memory', it enters a state where it leaks file descriptors, soon runs out, and then can't make any new connections. Restarting sssd fixes the problem.
In some cases the ENOMEM is legitimate, e.g. Java has used every scrap of RAM on the box. In other cases my monitoring does not pinpoint the cause of the fork failure.
This is sssd-1.12.4-47.el6_7.8.x86_64, EL 6.7, but running on an EL 6.6 server (I fast-tracked sssd to see if it made any difference). I haven't tried any 1.13 builds.
I will attach a log file of sssd_be at debug level 7 (hopefully that's enough). The log contains two instances of the problem starting at:
After these points, if you look at the sdap_ldap_connect_callback_add log lines, you will see the FD number increasing.
Unfortunately I don't have any lsof output to match the log, but from memory the FD leaks were for some sort of pipe (they did not appear to be TCP connections to LDAP).
What other debugging information can I gather for you to diagnose this problem?
Attachment: sssd_be.log.gz
I had to sanitise that log file quite heavily, let me know if it is excluding something you need to see.
I have another instance this morning of a server that was unable to fork any child processes for a few hours:
(Fri May 6 03:08:09 2016) [sssd[be[example.com]]] [sdap_fork_child] (0x0020): fork failed [12][Cannot allocate memory].
(Fri May 6 03:08:09 2016) [sssd[be[example.com]]] [sdap_kinit_done] (0x0020): child failed (12 [Cannot allocate memory])
(Fri May 6 03:08:09 2016) [sssd[be[example.com]]] [sdap_cli_kinit_done] (0x0400): Cannot get a TGT: ret [12](Cannot allocate memory)
The server recovered after a few hours and SSSD is working, but the sssd_be process is still holding open a large number of pipes, about 200 more FDs than it usually would have:
sssd_be 12159 root 198w FIFO 0,8 0t0 349711344 pipe
sssd_be 12159 root 199r FIFO 0,8 0t0 349711345 pipe
sssd_be 12159 root 200w FIFO 0,8 0t0 349711345 pipe
sssd_be 12159 root 201r FIFO 0,8 0t0 349711347 pipe
sssd_be 12159 root 202w FIFO 0,8 0t0 349711347 pipe
sssd_be 12159 root 203r FIFO 0,8 0t0 349711348 pipe
sssd_be 12159 root 204w FIFO 0,8 0t0 349711348 pipe
sssd_be 12159 root 205r FIFO 0,8 0t0 350446786 pipe
sssd_be 12159 root 206u IPv4 351864269 0t0 TCP 10.1.1.1:35448->dc:389 (ESTABLISHED)
sssd_be 12159 root 207w FIFO 0,8 0t0 350446786 pipe
sssd_be 12159 root 208r FIFO 0,8 0t0 350446787 pipe
sssd_be 12159 root 209w FIFO 0,8 0t0 350446787 pipe
sssd_be 12159 root 210w FIFO 0,8 0t0 350446789 pipe
sssd_be 12159 root 211r FIFO 0,8 0t0 350446790 pipe
sssd_be 12159 root 212w FIFO 0,8 0t0 350446790 pipe
sssd_be 12159 root 213u IPv4 351865156 0t0 TCP 10.1.1.1:42603->dc:3268 (ESTABLISHED)
sssd_be 12159 root 214w FIFO 0,8 0t0 350446794 pipe
sssd_be 12159 root 215r FIFO 0,8 0t0 350446795 pipe
sssd_be 12159 root 216w FIFO 0,8 0t0 350446795 pipe
sssd_be 12159 root 217u IPv4 351884545 0t0 TCP 10.1.1.1:35578->dc:389 (ESTABLISHED)
Could they be the stdin/stdout file handles of the ldap_child process that failed to fork?
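To illustrate that suspicion, here is a minimal sketch (my own hypothetical code, not SSSD's actual source) of the usual pattern: a pair of pipes is created for the child's stdin/stdout before fork(), and if the error path returns without closing them, every failed fork leaks four descriptors:

    /* Hypothetical sketch of the suspected leak pattern; not SSSD source. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int spawn_helper_child(void)
    {
        int stdin_pipe[2], stdout_pipe[2];
        pid_t pid;

        if (pipe(stdin_pipe) == -1 || pipe(stdout_pipe) == -1) {
            return errno;
        }

        pid = fork();
        if (pid == -1) {
            int ret = errno;
            /* BUG: returning here leaks all four pipe descriptors. */
            fprintf(stderr, "fork failed [%d][%s]\n", ret, strerror(ret));
            return ret;
        }

        /* ... child exec / parent bookkeeping elided ... */
        return 0;
    }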
Managed to catch an strace of a failing process, this is the chunk of the trace before the clone() failure:
write(11, "(Fri May 6 08:23:13 2016) [sssd"..., 155) = 155 umask(077) = 0177 open("/var/lib/sss/pubconf/.krb5info_dummy_KbctFU", O_RDWR|O_CREAT|O_EXCL, 0600) = 247 umask(0177) = 077 write(247, "10.1.1.1:88", 16) = 16 fchmod(247, 0644) = 0 close(247) = 0 rename("/var/lib/sss/pubconf/.krb5info_dummy_KbctFU", "/var/lib/sss/pubconf/kdcinfo.EXAMPLE.COM") = 0 gettimeofday({1462522993, 76737}, NULL) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0 write(11, "(Fri May 6 08:23:13 2016) [sssd"..., 124) = 124 gettimeofday({1462522993, 76994}, NULL) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0 write(11, "(Fri May 6 08:23:13 2016) [sssd"..., 104) = 104 geteuid() = 0 getegid() = 0 pipe([247, 248]) = 0 pipe([249, 250]) = 0 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fc77a3589d0) = -1 ENOMEM (Cannot allocate memory)
There's no close() call on FDs 247 to 250 in the trace, and those FDs are still held open now.
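Assuming that is the pattern, I'd expect the fix to close both pipes on the fork() error path before returning. A hypothetical corrected version of the earlier sketch (the actual SSSD patch may do this differently):

    /* Hypothetical fix sketch: close every pipe fd when fork() fails, so a
     * transient ENOMEM cannot leak descriptors. Not the actual SSSD patch. */
    #include <errno.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int spawn_helper_child_fixed(void)
    {
        int stdin_pipe[2], stdout_pipe[2];
        pid_t pid;
        int ret;

        if (pipe(stdin_pipe) == -1) {
            return errno;
        }
        if (pipe(stdout_pipe) == -1) {
            ret = errno;
            close(stdin_pipe[0]);
            close(stdin_pipe[1]);
            return ret;
        }

        pid = fork();
        if (pid == -1) {
            ret = errno;
            close(stdin_pipe[0]);
            close(stdin_pipe[1]);
            close(stdout_pipe[0]);
            close(stdout_pipe[1]);
            return ret;
        }

        /* ... child exec / parent bookkeeping elided ... */
        return 0;
    }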
I've found a few machines that are good candidates for reproducing the memory pressure. I've installed the latest CentOS 6.8 preview RPMs from here:
https://copr-be.cloud.fedoraproject.org/results/jhrozek/SSSD-6.8-preview/epel-6-x86_64/00174698-sssd/
After a short time an strace of the running sssd_be process catches this fork failure:
write(11, "(Fri May 6 11:22:48 2016) [sssd[be[example.com]]] [be_ptask_execute] (0x0400): Task [AD machine account password renewal]: executing task, timeout 60 seconds\n", 157) = 157 pipe([22, 26]) = 0 pipe([27, 28]) = 0 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f63c529e9d0) = -1 ENOMEM (Cannot allocate memory) gettimeofday({1462533768, 167433}, NULL) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0 write(11, "(Fri May 6 11:22:48 2016) [sssd[be[example.com]]] [ad_machine_account_password_renewal_send] (0x0020): fork failed [12][Cannot allocate memory].\n", 144) = 144
and if I look at the open FDs, 22, 26, 27 and 28 are hanging open:
...
sssd_be 24306 root 20r REG 253,0 10406312 267935 /var/lib/sss/mc/initgroups (deleted)
sssd_be 24306 root 21u unix 0xffff880415a2c700 0t0 101198958 /var/lib/sss/pipes/private/sbus-dp_example.com.24306
sssd_be 24306 root 22r FIFO 0,8 0t0 101253974 pipe
sssd_be 24306 root 24u IPv4 101264903 0t0 TCP 10.1.1.1:45317->dc03:389 (ESTABLISHED)
sssd_be 24306 root 25u IPv4 101267713 0t0 TCP 10.1.1.1:53785->dc03:3268 (ESTABLISHED)
sssd_be 24306 root 26w FIFO 0,8 0t0 101253974 pipe
sssd_be 24306 root 27r FIFO 0,8 0t0 101253975 pipe
sssd_be 24306 root 28w FIFO 0,8 0t0 101253975 pipe
sssd_be 24306 root 29u IPv4 101267715 0t0 TCP 10.1.1.1:45389->dc03:389 (ESTABLISHED)
It fits the same pattern as before, so it may not be fixed in 1.13.2 either. I will continue to collect information over the weekend.
I'm sorry about the lack of response in this ticket. This seems like a legitimate bug we should fix. We've been busy working on meeting deadlines for the 1.14 release, though.
I'll leave the ticket in the NEEDS_TRIAGE milestone for one more week and then find an assignee to work on it.
Thank you for the report, I see the bug now. If you let me know the version you are running now, I can build you a test package.
Fields changed
owner: somebody => jhrozek
status: new => assigned
Patch on list
patch: 0 => 1
Linked to Bugzilla bug: https://bugzilla.redhat.com/show_bug.cgi?id=1340176 (Red Hat Enterprise Linux 6)
rhbz: => [https://bugzilla.redhat.com/show_bug.cgi?id=1340176 1340176]
milestone: NEEDS_TRIAGE => SSSD 1.13.5
resolution: => fixed
status: assigned => closed
Thanks for the patch. Will 1.13.5 make its way into EL 6.8, or has it missed the cut-off? Any chance I could get a bundle of RPMs to test with, like in https://copr-be.cloud.fedoraproject.org/results/jhrozek/SSSD-6.8-preview/epel-6-x86_64/?
Not 1.13.5, but I nominated this patch for the next 6.8 update. I can either build you a test package based on the existing 6.8 update (I need to do it anyway for RHEL) or you can wait until the fix is released in RHEL itself. I would appreciate testing if you have some time, though.
My email reply got lost...
Yes, quite happy to test a patched build on top of a 6.8 RPM before it gets to 6.8 Update.
Replying to [comment:14 lukebigum]:
> My email reply got lost... Yes, quite happy to test a patched build on top of a 6.8 RPM before it gets to 6.8 Update.
Sorry about the delay, here are the test builds: https://jhrozek.fedorapeople.org/sssd-test-builds/sssd-6.8-fd-leak/
Thanks, I've updated 150 of the 450 servers that are affected, which should be a good cross-section of hosts to get feedback from. Over the weekend the open file count had started to creep up, so in a few more days I hope I'll be able to tell whether 1.13.3-22.el6_8.3 makes a difference.
Some open fds are to be expected (there might be a connection open per domain to each domain's DC), but the count should stabilize and not keep growing. Please open a new bug if you still see leaks.
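If you want to track the count over time, here is a small stand-alone helper (just an illustration, not part of SSSD; it assumes a Linux /proc filesystem) that prints the number of open descriptors for a given pid:

    /* Illustrative helper, not part of SSSD: count open fds of a process
     * by listing /proc/<pid>/fd. Assumes a Linux /proc filesystem. */
    #include <dirent.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        char path[64];
        DIR *dir;
        struct dirent *ent;
        int count = 0;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }

        snprintf(path, sizeof(path), "/proc/%s/fd", argv[1]);
        dir = opendir(path);
        if (dir == NULL) {
            perror(path);
            return 1;
        }

        while ((ent = readdir(dir)) != NULL) {
            if (ent->d_name[0] != '.') {   /* skip "." and ".." */
                count++;
            }
        }
        closedir(dir);

        printf("%d\n", count);
        return 0;
    }

Run it against the sssd_be PID every few minutes; the number should level off rather than climb after each ENOMEM burst.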
Metadata Update from @lukebigum:
- Issue assigned to jhrozek
- Issue set to the milestone: SSSD 1.13.5
SSSD is moving from Pagure to GitHub. This means that new issues and pull requests will be accepted only in SSSD's GitHub repository.
This issue has been cloned to GitHub and is available here:
- https://github.com/SSSD/sssd/issues/4047
If you want to receive further updates on the issue, please navigate to the GitHub issue and click on the Subscribe button.
Thank you for understanding. We apologize for any inconvenience.